Vgram: Improving performance of approximate queries on string collections using variable-length grams
- In VLDB’07
Cited by 31 (8 self)
Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a novel technique, called VGRAM, to improve the performance of these algorithms. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms.
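The fixed-length gram signatures that the abstract takes as its starting point can be illustrated with a short sketch. This is not VGRAM itself: the function names and the choice q = 2 are assumptions made for illustration, and the bound used is the standard count filter for fixed-length grams, which the abstract does not state.

```python
from collections import Counter


def qgrams(s, q=2):
    """Return all overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def shared_grams(s, t, q=2):
    """Count grams common to both strings, respecting multiplicity."""
    a, b = Counter(qgrams(s, q)), Counter(qgrams(t, q))
    return sum((a & b).values())


def count_filter_bound(s, t, k, q=2):
    """Minimum number of shared grams if edit distance(s, t) <= k.

    Each edit operation can destroy at most q grams, so two strings
    within distance k must share at least this many grams.
    """
    return max(len(s), len(t)) - q + 1 - k * q


def may_match(s, t, k, q=2):
    """Cheap necessary condition; survivors still need verification."""
    return shared_grams(s, t, q) >= count_filter_bound(s, t, k, q)
```

A pair that fails `may_match` cannot be within edit distance k, so the expensive edit-distance computation is only needed for pairs that pass.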
Locating Complex Named Entities in Web Text
- In Proc. of IJCAI, 2007
Cited by 22 (3 self)
Named Entity Recognition (NER) is the task of locating and classifying names in text. In previous work, NER was limited to a small number of predefined entity classes (e.g., people, locations, and organizations). However, NER on the Web is a far more challenging problem. Complex names (e.g., film or book titles) can be very difficult to pick out precisely from text. Further, the Web contains a wide variety of entity classes, which are not known in advance. Thus, hand-tagging examples of each entity class is impractical. This paper investigates a novel approach to the first step in Web NER: locating complex named entities in Web text. Our key observation is that named entities can be viewed as a species of multiword units, which can be detected by accumulating n-gram statistics over the Web corpus. We show that this statistical method’s F1 score is 50% higher than that of supervised techniques including Conditional Random Fields (CRFs) and Conditional Markov Models (CMMs) when applied to complex names. The method also outperforms CMMs and CRFs by 117% on entity classes absent from the training data. Finally, our method outperforms a semi-supervised CRF by 73%.
Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem?
- 2001
Cited by 20 (0 self)
We seek a knowledge-free method for inducing multiword units from text corpora for use as machine-readable dictionary headwords. We provide two major evaluations of nine existing collocation-finders and illustrate the continuing need for improvement. We use Latent Semantic Analysis to make modest gains in performance, but we show the significant challenges encountered in trying this approach.
Using small random samples for the manual evaluation of statistical association measures
- Computer Speech and Language, 2004
Cited by 11 (1 self)
In this paper, we describe the empirical evaluation of statistical association measures for the extraction of lexical collocations from text corpora. We argue that the results of an evaluation experiment cannot easily be generalized to a different setting. Consequently, such experiments have to be carried out under conditions that are as similar as possible to the intended use of the measures. Finally, we show how an evaluation strategy based on random samples can reduce the amount of manual annotation work significantly, making it possible to perform many more evaluation experiments under specific conditions.
Stamatatos E.: N-gram Feature Selection for Authorship Identification
- In: 12th International Conference on Artificial Intelligence: Methodology, Systems, Applications, 2006
Cited by 8 (3 self)
Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to representing text for stylistic purposes since they are able to capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved using simple text pre-processing.
A Nonparametric Method for Extraction of Candidate Phrasal Terms
- Proceedings of ACL’2005, 2005
Cited by 4 (0 self)
This paper introduces a new method for identifying candidate phrasal terms (also known as multiword units) which applies a nonparametric, rank-based heuristic measure. Evaluation of this measure, the mutual rank ratio metric, shows that it produces better results than standard statistical measures when applied to this task.
Using Morphological, Syntactical and Statistical Information for Automatic Term Acquisition
- 2002
Cited by 3 (1 self)
Terminologies are useful in all areas that use specialized languages.
ATA -- Automatic Term Acquisition
- In Proceedings of the Workshop on Extraction of Knowledge from Databases, 2001
Cited by 1 (1 self)
Terminological acquisition is an important issue when learning about Natural Language Processing (NLP) due to the constant terminological renewal caused by technological changes. Terms play a key role in several NLP activities such as machine translation, automatic indexing, text understanding, and information retrieval. This is especially true at this time when corpora in electronic format keep growing in number and variety. In this work we start by using morphological and syntactic information to locate candidate noun phrases, and then we use statistical information to improve result accuracy.
One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- In 4th International Conference, EsTAL 2004, J.L. Vicedo et al. (Eds.), LNAI 3230, 2004
Cited by 1 (1 self)
Word fragments or n-grams have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] or, even, genetic classification of languages [5]. All these techniques share some common aspects: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques, although it introduces two major differences: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, although not so computationally expensive. This new approach can be simultaneously used to perform document categorization and clustering, information retrieval, and text summarization. In this paper we will describe the foundations of such a technique and its application to both a particular categorization problem (i.e., language identification) and information retrieval tasks.
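The classical n-gram setup that this abstract contrasts blindLight with (n-grams as coordinates, relative frequencies as weights, cosine comparison) can be sketched as follows. This is a minimal illustration of that baseline only, not of blindLight; the function names and the default n = 3 are assumptions.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Map a text to {character n-gram: relative frequency}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse n-gram vectors."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_p * norm_q)
```

Documents and queries are profiled the same way, so the same function serves document-to-document and query-to-document comparison.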

