• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Adaptive product normalization: Using online learning for record linkage in comparison shopping (2005)

by M Bilenko, S Basu, M Sahami
Venue:In ICDM
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 16
Next 10 →

Towards Domain-Independent Information Extraction from Web Tables

by Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak - WORLD WIDE WEB CONFERENCE , 2007
"... Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of tags. A multitude of different HTML implementations of web tables make these approaches difficult to scale. In this paper, we approach the ..."
Abstract - Cited by 34 (3 self) - Add to MetaCart
Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A multitude of different HTML implementations of web tables make these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The thereby obtained topological and style information allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current “Visual Web.”

Adaptive blocking: Learning to scale up record linkage

by Mikhail Bilenko - In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006 , 2006
"... Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dat ..."
Abstract - Cited by 17 (1 self) - Add to MetaCart
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an indexbased similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods. 1

Learning Similarity Metrics for Event Identification in Social Media

by Hila Becker, Mor Naaman, Luis Gravano
"... Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of ..."
Abstract - Cited by 16 (7 self) - Add to MetaCart
Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different type and scale. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich “context ” associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media to events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, realworld datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.

Object identification with constraints

by Steffen Rendle, Lars Schmidt-thieme - In ICDM , 2006
"... Object identification aims at identifying different representations of the same object based on noisy attributes such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for solving this task, ..."
Abstract - Cited by 12 (6 self) - Add to MetaCart
Object identification aims at identifying different representations of the same object based on noisy attributes such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for solving this task, almost all of them based on similarity functions of a pair of objects. Although today the similarity functions are learned from a set of labeled training data, the structural information given by the labeled data is not used. By formulating a generic model for object identification we show how almost any proposed identification model can easily be extended for satisfying structural constraints. Therefore we propose a model that uses structural information given as pairwise constraints to guide collective decisions about object identification in addition to a learned similarity measure. We show with empirical experiments on public and on real-life data that combining both structural information and attribute-based similarity enormously increases the overall performance for object identification tasks. 1

Frameworks for entity matching: A comparison

by Hanna Köpcke, Erhard Rahm , 2009
"... ..."
Abstract - Cited by 12 (6 self) - Add to MetaCart
Abstract not found

Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model

by Wolfgang Gatterbauer, Paul Bohunsky , 2006
"... Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, the task of table extraction from web pages is difficult, because of HTML's design purpose to con ..."
Abstract - Cited by 7 (2 self) - Add to MetaCart
Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, the task of table extraction from web pages is difficult, because of HTML's design purpose to convey visual instead of semantic information. In this paper, we propose a robust technique for table extraction from arbitrary web pages. This technique relies upon the positional information of visualized DOM element nodes in a browser and, hereby, separates the intricacies of code implementation from the actual intended visual appearance. The novel aspect of the proposed web table extraction technique is the effective use of spatial reasoning on the CSS2 visual box model, which shows a high level of robustness even without any form of learning (F-measure ~ 90%). We describe the ideas behind our approach, the tabular pattern recognition algorithm operating on a double topographical grid structure and allowing for effective and robust extraction, and general observations on web tables that should be borne in mind by any automatic web table extraction mechanism.

Learnable Similarity Functions and Their Applications to Clustering and Record Linkage

by Mikhail Bilenko , 2004
"... rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initia ..."
Abstract - Cited by 6 (0 self) - Add to MetaCart
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remains a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus- Copyright c # 2004, American Association for Artificial Intelli

L.: Scaling record linkage to non-uniform distributed class sizes

by Steffen Rendle, Lars Schmidt-thieme - Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining , 2008
"... Abstract. Record linkage is a central task when information from different sources is integrated. Record linkage models use so-called blockers for reducing the search space by discarding obviously different record pairs. In practice, important problems have Zipf distributed class sizes with some lar ..."
Abstract - Cited by 4 (4 self) - Add to MetaCart
Abstract. Record linkage is a central task when information from different sources is integrated. Record linkage models use so-called blockers for reducing the search space by discarding obviously different record pairs. In practice, important problems have Zipf distributed class sizes with some large classes where blocking is not applicable any more. Therefore we propose two novel meta algorithms for scaling arbitrary record linkage models to such data sets. The first one parallelizes problems by creating overlapping subproblems and the second one reduces the search space for large classes effectively. Our evaluation shows that both scaling techniques are effective and are able to scale state-of-the-art models to challenging datasets. 1

Towards parameter-free blocking for scalable record linkage

by Tr-cs-- Stephen, M Blackburn, Robin Garner, Chris Hoffmann, Asjad M Khan, Kathryn S Mckinley, Rotem Bentzur, Daniel Feinberg, Daniel Frampton, Samuel Z Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, Thomas Vandrunen, Daniel Von Dincklage, Peter Christen, Peter Christen , 2007
"... or send email to: Technical-DOT-Reports-AT-cs-DOT-anu.edu.au A list of technical reports, including some abstracts and copies of some full reports may be found at: ..."
Abstract - Cited by 3 (2 self) - Add to MetaCart
or send email to: Technical-DOT-Reports-AT-cs-DOT-anu.edu.au A list of technical reports, including some abstracts and copies of some full reports may be found at:

Query Result Clustering for Object-level Search ∗

by Jongwuk Lee, Seung-won Hwang, Zaiqing Nie, Ji-rong Wen
"... Query result clustering has recently attracted a lot of attention to provide users with a succinct overview of relevant results. However, little work has been done on organizing the query results for object-level search. Object-level search result clustering is challenging because we need to support ..."
Abstract - Cited by 2 (0 self) - Add to MetaCart
Query result clustering has recently attracted a lot of attention to provide users with a succinct overview of relevant results. However, little work has been done on organizing the query results for object-level search. Object-level search result clustering is challenging because we need to support diverse similarity notions over object-specific features (such as the price and weight of a product) of heterogeneous domains. To address this challenge, we propose a hybrid subspace clustering algorithm called Hydra. Algorithm Hydra captures the user perception of diverse similarity notions from millions of Web pages and disambiguates different senses using featurebased subspace locality measures. Our proposed solution, by combining wisdom of crowds and wisdom of data, achieves robustness and efficiency over existing approaches. We extensively evaluate our proposed framework and demonstrate how to enrich user experiences in object-level search using a real-world product search scenarios.
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University