Semi-Markov conditional random fields for information extraction
In Advances in Neural Information Processing Systems 17, 2004
Cited by 254 (10 self)
We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs.
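The segment-level dynamic program behind semi-CRF decoding can be sketched in a few lines. This is an illustrative decoder only: the hand-written `segment_score` stands in for the learned feature weights of a real semi-CRF, and the label set and penalty constants are our own toy choices.

```python
def segment_score(tokens, start, end, label):
    """Toy segment-level feature function (a stand-in for learned
    semi-CRF weights): entity segments are rewarded per capitalized
    token, with a per-segment cost that favors merging adjacent
    capitalized tokens into one segment."""
    seg = tokens[start:end]
    caps = sum(1 for t in seg if t[0].isupper())
    if label == "ENT":
        return 2.0 * caps - len(seg) - 0.5
    return 0.1 * len(seg)  # mild preference for background ("O")

def semi_markov_viterbi(tokens, labels=("O", "ENT"), max_len=3):
    """Exact decoding over segmentations: best[j] is the best score of
    any labeled segmentation of tokens[:j]; segments may span up to
    max_len tokens, so the extra cost over token-level Viterbi is only
    a constant factor."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for d in range(1, min(max_len, j) + 1):   # candidate segment lengths
            for y in labels:
                s = best[j - d] + segment_score(tokens, j - d, j, y)
                if s > best[j]:
                    best[j], back[j] = s, (j - d, y)
    # Recover the segmentation by following back-pointers.
    segs, j = [], n
    while j > 0:
        i, y = back[j]
        segs.append((i, j, y))
        j = i
    return list(reversed(segs))
```

On `"John Smith lives in New York".split()` this returns `[(0, 2, 'ENT'), (2, 3, 'O'), (3, 4, 'O'), (4, 6, 'ENT')]`: both multi-word names come out as whole labeled segments, which is exactly what token-at-a-time labeling cannot express directly.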
Interactive Deduplication using Active Learning, 2002
Cited by 242 (5 self)
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.
We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
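The interactive loop described above can be sketched as uncertainty sampling over candidate record pairs. Everything here is our simplification, not the paper's system: the "classifier" is a single threshold on one token-Jaccard similarity score, whereas the real system learns a richer function over many similarity features.

```python
def similarity(a, b):
    """Toy record similarity: Jaccard overlap of word tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def active_dedup(pairs, oracle, rounds=3):
    """Interactively pick the most ambiguous unlabeled pair each round
    (uncertainty sampling), ask the human oracle, and refit the
    decision boundary. Returns the learned threshold."""
    threshold = 0.5                       # initial decision boundary
    labeled = []                          # (score, is_duplicate) so far
    pool = [(similarity(a, b), (a, b)) for a, b in pairs]
    for _ in range(rounds):
        # The pair closest to the boundary teaches the most.
        pool.sort(key=lambda sp: abs(sp[0] - threshold))
        score, pair = pool.pop(0)
        labeled.append((score, oracle(*pair)))
        # Refit: midpoint between the closest positive and negative.
        pos = [s for s, y in labeled if y]
        neg = [s for s, y in labeled if not y]
        if pos and neg:
            threshold = (min(pos) + max(neg)) / 2.0
    return threshold
```

The point of the loop is that near-boundary pairs (the subtle, "challenging" ones the abstract mentions) are exactly the ones routed to the human, so far fewer labels are needed than with random sampling.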
Exploiting dictionaries in named entity extraction: Combining Semi-Markov extraction processes and data integration methods
In Proceedings of the ACM SIGKDD Conference, 2004
Cited by 98 (6 self)
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries—more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process which relaxes the usual Markov assumptions. This process is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling NER and high-performance record linkage methods, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance, relative to previously published methods for using external dictionaries in NER.
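The entity-level dictionary feature this couples into the extractor can be sketched as a max-similarity lookup over a whole candidate segment. The character-bigram Jaccard measure below is our illustrative stand-in for the record-linkage similarity functions the paper actually evaluates.

```python
def bigrams(s):
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dictionary_similarity(segment_tokens, dictionary):
    """Score a whole candidate segment against an external dictionary:
    the maximum bigram-Jaccard similarity to any entry. This is the
    kind of entity-level feature that only a segment-classifying
    (semi-Markov) extractor can use directly."""
    cand = " ".join(segment_tokens)
    cb = bigrams(cand)
    best = 0.0
    for entry in dictionary:
        eb = bigrams(entry)
        if cb or eb:  # guard against two empty bigram sets
            best = max(best, len(cb & eb) / len(cb | eb))
    return best
```

Note the mismatch the abstract describes: scoring the segment `["New", "York"]` against the dictionary gives a perfect match, while scoring the single word `"York"` does not, so a word-at-a-time classifier cannot exploit the measure fully.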
Tuning Schema Matching Software Using Synthetic Scenarios
In Proc. VLDB’05, 2005
Cited by 82 (1 self)
Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right components to be executed and correctly adjust their numerous "knobs" (e.g., thresholds, formula coefficients). Tuning is skill- and time-intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe ...
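The tuning task can be sketched as a search over knob settings scored on synthetic scenarios with known correct matches. This toy version is entirely ours: a single name-overlap matcher with one threshold knob, and hard-coded scenarios instead of the scenarios the paper synthesizes from the user's own schemas.

```python
def accuracy(threshold, scenarios):
    """Fraction of synthetic column pairs classified correctly by a
    character-overlap matcher with the given threshold knob."""
    def sim(a, b):
        sa, sb = set(a.lower()), set(b.lower())
        return len(sa & sb) / len(sa | sb)
    correct = 0
    for a, b, should_match in scenarios:
        correct += (sim(a, b) >= threshold) == should_match
    return correct / len(scenarios)

def tune(scenarios, knobs=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the knob setting that scores best on the synthetic
    scenarios; a real tuner searches a much larger joint knob space."""
    return max(knobs, key=lambda t: accuracy(t, scenarios))
```

The design point is that the synthetic scenarios supply the ground truth that manual tuning would otherwise require the user to provide by hand.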
Constraint-based entity matching
In AAAI, 2005
Cited by 44 (1 self)
Entity matching is the problem of deciding if two given mentions in the data, such as Helen Hunt and H. M. Hunt, refer to the same real-world entity. Numerous solutions have been developed, but they have not considered in depth the problem of exploiting integrity constraints that frequently exist in the domains. Examples of such constraints include "a mention with age two cannot match a mention with salary 200K" and "if two paper citations match, then their authors are likely to match in the same order." In this paper we describe a probabilistic solution to entity matching that exploits such constraints to improve matching accuracy. At the heart of the solution is a generative model that takes into account the constraints during the generation process, and provides well-defined interpretations of the constraints. We describe a novel combination of EM and relaxation labeling algorithms that efficiently learns the model, thereby matching mentions in an unsupervised way, without the need for annotated training data. Experiments on several real-world domains show that our solution can exploit constraints to significantly improve matching accuracy, by 3-12% F-1, and that the solution scales up to large data sets.
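The constraint-exploiting idea can be sketched in the relaxation-labeling style the abstract mentions: start from baseline pairwise match scores and iteratively propagate belief along constraint links (here a single toy rule: a matching citation pair raises confidence in its implied author pair). The update rule and constants are ours, not the paper's generative model.

```python
def relax(scores, related, bonus=0.2, iters=5):
    """Relaxation-style score propagation (illustrative).
    scores:  {pair_id: initial match score in [0, 1]}
    related: {pair_id: implied_pair_id}, e.g. a citation pair mapped
             to the author pair it implies."""
    s = dict(scores)
    for _ in range(iters):
        for pair, implied in related.items():
            if implied in s:
                # A confident match (score > 0.5) nudges the implied
                # pair up; a confident non-match nudges it down.
                boost = bonus * (s[pair] - 0.5)
                s[implied] = min(1.0, max(0.0, s[implied] + boost))
    return s
```

With a citation pair scored 0.9 and its author pair starting at an uninformative 0.5, five rounds of propagation pull the author pair up to about 0.9, mirroring how the constraint transfers evidence between related match decisions.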
Efficient batch top-k search for dictionary-based entity recognition
In ICDE-06, 2006
Cited by 33 (1 self)
We consider the problem of speeding up Entity Recognition systems that exploit existing large databases of structured entities to improve extraction accuracy. These systems require the computation of the maximum similarity scores of several overlapping segments of the input text with the entity database. We formulate a Batch-Top-K problem with the goal of sharing computations across overlapping segments. Our proposed algorithm performs a factor of three faster than independent Top-K queries and only a factor of two slower than an unachievable lower bound on total cost. We then propose a novel modification of the popular Viterbi algorithm for recognizing entities so as to work with easily computable bounds on match scores, thereby reducing the total inference time by a factor of eight compared to state-of-the-art methods.
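The batch setting being optimized can be sketched as follows: every segment of up to a few adjacent tokens must be scored against the dictionary, and a token-level inverted index lets overlapping segments share candidate generation, so only entries sharing a token with the segment are scored at all. This is a much simpler sharing scheme than the paper's Batch-Top-K algorithm and score-bounded Viterbi; all names here are illustrative.

```python
def build_index(dictionary):
    """Inverted index: token -> dictionary entries containing it."""
    index = {}
    for entry in dictionary:
        for tok in entry.lower().split():
            index.setdefault(tok, set()).add(entry)
    return index

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def batch_segment_scores(tokens, dictionary, max_len=3):
    """Max dictionary similarity for every segment up to max_len
    tokens. Segments starting at the same position share their
    candidate set incrementally: extending a segment by one token only
    adds that token's postings."""
    index = build_index(dictionary)
    scores = {}
    for i in range(len(tokens)):
        cands = set()
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            cands |= index.get(tokens[j - 1].lower(), set())
            seg = " ".join(tokens[i:j])
            scores[(i, j)] = max((jaccard(seg, e) for e in cands), default=0.0)
    return scores
```

Segments with no token in the index get score 0 without touching the dictionary at all, which is the kind of pruning that makes batch evaluation over all overlapping segments tractable.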
Source-aware entity matching: A compositional approach
2006
Cited by 16 (3 self)
Entity matching (a.k.a. record linkage) plays a crucial role in integrating multiple data sources, and numerous matching solutions have been developed. However, the solutions have largely exploited only information available in the mentions and employed a single matching technique. We show how to exploit information about data sources to significantly improve matching accuracy. In particular, we observe that different sources often vary substantially in their level of semantic ambiguity, thus requiring different matching techniques. In addition, it is often beneficial to group and match mentions in related sources first, before considering other sources. These observations lead to a large space of matching strategies, analogous to the space of query evaluation plans considered by a relational optimizer. We propose viewing entity matching as a composition of basic steps into a “match execution plan”. We analyze formal properties of the plan space, and show how to find a good match plan. To do so, we employ ideas from social network analysis to infer the ambiguity and relatedness of data sources. We conducted extensive experiments on several real-world data sets on the Web and in the domain of personal information management (PIM). The results show that our solution significantly outperforms current best matching methods.
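A "match execution plan" in the sense above can be sketched as a nested composition of merge steps, each pairing two inputs with a matcher suited to their ambiguity. The plan representation and matchers below are our illustration, not the paper's plan language.

```python
def exact_matcher(m1, m2):
    """Strict matcher, suitable for low-ambiguity sources."""
    return m1.lower() == m2.lower()

def initials_matcher(m1, m2):
    """Looser matcher for ambiguous sources: same surname and same
    first initial (so 'Helen Hunt' matches 'H. M. Hunt')."""
    a, b = m1.lower().split(), m2.lower().split()
    return a[-1] == b[-1] and a[0][0] == b[0][0]

def execute(plan):
    """Run a match plan. A plan is either a list of mention strings
    (a leaf source) or ("merge", left_plan, right_plan, matcher).
    Returns clusters of co-referent mentions."""
    if isinstance(plan, list):
        return [[m] for m in plan]          # each mention starts alone
    _, left, right, matcher = plan
    clusters = execute(left)
    for rc in execute(right):
        for c in clusters:
            if any(matcher(a, b) for a in c for b in rc):
                c.extend(rc)                # link clusters across inputs
                break
        else:
            clusters.append(rc)             # no match: keep separate
    return clusters
```

A nested plan such as `("merge", ("merge", source_a, source_b, exact_matcher), source_c, initials_matcher)` expresses the strategy the abstract describes: merge the related, less ambiguous sources first with a strict matcher, then bring in the ambiguous source with a looser one.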
Learning to Predict from Textual Data
Cited by 4 (0 self)
Given a current news event, we tackle the problem of generating plausible predictions of future events it might cause. We present a new methodology for modeling and predicting such future news events using machine learning and data mining techniques. Our Pundit algorithm generalizes examples of causality pairs to infer a causality predictor. To obtain precisely labeled causality examples, we mine 150 years of news articles and apply semantic natural language modeling techniques to headlines containing certain predefined causality patterns. For generalization, the model uses a vast number of world knowledge ontologies. Empirical evaluation on real news articles shows that our Pundit algorithm performs as well as non-expert humans.
Probabilistic Graphical Models and their Role in Databases
Cited by 4 (0 self)
Probabilistic graphical models provide a framework for compact representation and efficient reasoning about the joint probability distribution of several interdependent variables. This is a classical topic with roots in statistical physics. In recent years, spurred by several applications in unstructured data integration, sensor networks, image processing, bio-informatics, and code design, the topic has received renewed interest in the machine learning, data mining, and database communities. Techniques from graphical models have also been applied to many topics directly of interest to the database community including information extraction, sensor data analysis, imprecise data representation and querying, selectivity estimation for query optimization, and data privacy. As database research continues to expand beyond the confines of traditional enterprise domains, we expect ...