Building Minority Language Corpora by Learning to Generate Web Search Queries (2000)

by Rayid Ghani, Rosie Jones, Dunja Mladenic
Knowledge and Information Systems

Abstract:

The Web is an obvious source of valuable information, but collecting, organizing, and using these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries to collect documents matching a minority concept. As the minority concept, we use text documents written in a minority natural language on the Web. Individual documents are automatically labeled as relevant or non-relevant using a language filter, and this feedback is used to learn which query lengths and which inclusion/exclusion term-selection methods help find previously unseen documents in the target language. Our system learns to select good query terms using a variety of term-scoring methods. We find that odds-ratio scores calculated over the documents acquired so far are one of the most consistently accurate query-generation methods. We also parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented, showing that our approach generalizes well across several languages regardless of the initial conditions.
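
To make the query-generation idea in the abstract more concrete, here is a minimal Python sketch of odds-ratio term scoring and Gamma-distributed query length. It is an illustration under stated assumptions, not the CorpusBuilder implementation: the function names, the whitespace tokenization, the additive smoothing constant, the length cap, and the "+term"/"-term" query syntax are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Log odds-ratio of each term occurring in relevant vs. non-relevant docs.

    Probabilities are smoothed document frequencies (an assumption of this
    sketch), so every value stays strictly between 0 and 1.
    """
    rel_df = Counter(t for d in relevant_docs for t in set(d.split()))
    non_df = Counter(t for d in nonrelevant_docs for t in set(d.split()))
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for term in set(rel_df) | set(non_df):
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_non = (non_df[term] + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores


def sample_query_length(shape, scale, rng, max_len=10):
    """Draw a query length from a Gamma distribution, clamped to [1, max_len]."""
    return min(max_len, max(1, round(rng.gammavariate(shape, scale))))


def build_query(scores, n_include, n_exclude):
    """Include the highest-scoring terms and exclude the lowest-scoring ones."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    include = ranked[:n_include]
    exclude = [t for t in reversed(ranked) if t not in include][:n_exclude]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


# Toy documents: two in the target language, two in another language.
relevant = ["dober dan svet", "dober vecer"]
nonrelevant = ["good day world", "good evening"]

scores = odds_ratio_scores(relevant, nonrelevant)
query = build_query(scores, n_include=1, n_exclude=1)
length = sample_query_length(shape=2.0, scale=1.5, rng=random.Random(0))
```

In a loop like the one the abstract describes, the documents returned for each query would be labeled by the language filter and added to the relevant or non-relevant set, and the scores would be recomputed before the next query is generated.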
