Results 1 - 10
of
22
Matching Techniques for Large Music Databases
, 1999
"... With the growth in digital representations of music, and of music stored in these representations, it is increasingly attractive to search collections of music. One mode of search is by similarity, but, for music, similarity search presents several difficulties: in particular, deciding what part of ..."
Abstract
-
Cited by 66 (4 self)
- Add to MetaCart
With the growth in digital representations of music, and of music stored in these representations, it is increasingly attractive to search collections of music. One mode of search is by similarity, but, for music, similarity search presents several difficulties: in particular, deciding what part of the music is likely to be perceived as the theme by a listener, and deciding whether two pieces of music with different sequences of notes represent the same theme. In this paper we propose a three-stage framework for matching pieces of music. We use the framework to compare a range of techniques for determining whether two pieces of music are similar, by experimentally testing their ability to retrieve different transcriptions of the same piece of music from a large collection of MIDI files. These experiments show that different comparison techniques differ widely in their effectiveness; and
Manipulation of Music For Melody Matching
, 1998
"... Large volumes of music are available online, represented in performance formats such as MIDI and, increasingly, in abstract notation such as SMDL. Many types of user would find it valuable to search collections of music via queries representing music fragments, but such searching requires a reliable ..."
Abstract
-
Cited by 53 (2 self)
- Add to MetaCart
Large volumes of music are available online, represented in performance formats such as MIDI and, increasingly, in abstract notation such as SMDL. Many types of user would find it valuable to search collections of music via queries representing music fragments, but such searching requires a reliable technique for identifying whether a provided fragment occurs within a piece of music. The problem of matching fragments to music is made difficult by the psychology of music perception, because literal matching may have little relation to perceived melodic similarity, and by the interactions between the multiple parts of typical pieces of music. In this paper we analyse the properties of music, music perception, and music database users, and use the analysis to propose alternative techniques for extracting monophonic melodies from polyphonic music; we believe that such melodies can subsequently be used for matching of queries to data. We report on experiments with music listeners, which rank our proposed techniques for extracting melodies.
Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering
, 2002
"... Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationall ..."
Abstract
-
Cited by 40 (6 self)
- Add to MetaCart
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in signi cant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
Phonetic String Matching: Lessons from Information Retrieval
, 1996
"... Phonetic matching is used in applications such as name retrieval, where the spelling of a name is used to identify other strings that are likely to be of similar pronunciation. In this paper we explain the parallels between information retrieval and phonetic matching, and describe our new phonetic m ..."
Abstract
-
Cited by 38 (2 self)
- Add to MetaCart
Phonetic matching is used in applications such as name retrieval, where the spelling of a name is used to identify other strings that are likely to be of similar pronunciation. In this paper we explain the parallels between information retrieval and phonetic matching, and describe our new phonetic matching techniques. Our experimental comparison with existing techniques such as Soundex and edit distances, which is based on recall and precision, demonstrates that the new techniques are superior. In addition, reasoning from the similarity of phonetic matching and information retrieval, we have applied combination of evidence to phonetic matching. Our experiments with combining demonstrate that it leads to substantial improvements in effectiveness.
Probabilistic Retrieval of OCR Degraded Text Using N-Grams
- European Conference on Digital Libraries
, 1997
"... . The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries ..."
Abstract
-
Cited by 32 (0 self)
- Add to MetaCart
. The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A second method of using n-grams to identify appropriate matching and near matching terms for query expansion which also performed better than using standard queries is also described. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described. 1 Introduction A major problem with retrieval of OCR text from image data is the inevitabl...
Searchable Words on the Web
- International Journal of Digital Libraries
, 2001
"... In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space req ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
Alias Detection in Link Data Sets
- In Proceedings of the International Conference on Intelligence Analysis
, 2004
"... The problem of detecting aliases - multiple text string identifiers corresponding to the same entity - is increasingly important in the domains of biology, intelligence, marketing, and geoinformatics. This report investigates the extent to which probabilistic methods can help. ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
The problem of detecting aliases - multiple text string identifiers corresponding to the same entity - is increasingly important in the domains of biology, intelligence, marketing, and geoinformatics. This report investigates the extent to which probabilistic methods can help.
Identification of Confusable Drug Names: A New Approach and Evaluation Methodology
- In Proceedings of COLING 2004
, 2004
"... This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves application of two new methods--- one based on orthographic similarity ("lookalike ") and the other based on phonetic similarity ("sound-alike"). We presen ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves application of two new methods--- one based on orthographic similarity ("lookalike ") and the other based on phonetic similarity ("sound-alike"). We present a new recall-based evaluation methodology for determining the effectiveness of different similarity measures on drug names. We show that the new orthographic measure (BI-SIM) outperforms other commonly used measures of similarity on a set containing both look-alike and sound-alike pairs, and that the feature-based phonetic approach (ALINE) outperforms orthographic approaches on a test set containing solely sound-alike confusion pairs. However, an approach that combines several different measures achieves the best results on both test sets.
Efficient Approximate Entity Extraction with Edit Distance Constraints
"... Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extrac ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging as existing approaches based on q-gram filtering have poor performance due to the existence of many short entities in the dictionary. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
On the Development of Name Search Techniques for Arabic
, 2003
"... The need for effective identity matching systems has led to extensive research in the area of name search. For the most part, such work has been limited to English and other Latin-based languages. Consequently, algorithms such as Soundex and n-gram matching are of limited utility for languages such ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
The need for effective identity matching systems has led to extensive research in the area of name search. For the most part, such work has been limited to English and other Latin-based languages. Consequently, algorithms such as Soundex and n-gram matching are of limited utility for languages such as Arabic, which has vastly different morphologic features that rely heavily on phonetic information. The dearth of work in this field is partly caused by the lack of standardized test data. Consequently, we have built a collection of 7,939 Arabic names, along with 50 training queries and 111 test queries. We use this collection to evaluate a variety of algorithms, including a derivative of Soundex tailored to Arabic (ASOUNDEX), measuring effectiveness by using standard information retrieval measures. Our results show an improvement of 70 % over existing approaches.

