Building Minority Language Corpora by Learning to Generate Web Search Queries (2000) [17 citations — 5 self]
http://www.cs.cmu.edu/~TextLearning/corpusbuilder/
http://www.cs.cmu.edu/afs/cs/project/theo-4/text-l
Abstract:
The Web is an obvious source of valuable information, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents matching a minority concept. As the minority concept, we use text documents on the Web written in a minority natural language. Individual documents are automatically labeled as relevant or non-relevant using a language filter, and this feedback is used to learn which query lengths and which inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term-scoring methods. We find that using odds-ratio scores calculated over the documents acquired so far is one of the most consistently accurate query-generation methods. We also parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. We also present experiments applying the same approach to multiple languages, showing that it generalizes well across several languages regardless of the initial conditions.
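To make the query-generation idea concrete, the following is a minimal sketch of odds-ratio term scoring combined with a Gamma-distributed query length. It is an illustration under stated assumptions, not the CorpusBuilder implementation: the function names, the smoothing constant, the Gamma shape and scale values, and the `+term`/`-term` query syntax are all hypothetical choices made for this example, and the abstract does not specify them.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Log odds-ratio of each term occurring in relevant vs. non-relevant docs.

    Each document is a list of tokens. The additive smoothing (an assumption
    of this sketch) keeps both probabilities strictly between 0 and 1.
    """
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    # Document frequencies: count each term once per document.
    df_rel = Counter(t for doc in relevant_docs for t in set(doc))
    df_non = Counter(t for doc in nonrelevant_docs for t in set(doc))
    scores = {}
    for term in set(df_rel) | set(df_non):
        p = (df_rel[term] + smoothing) / (n_rel + 2 * smoothing)
        q = (df_non[term] + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log(p * (1 - q) / ((1 - p) * q))
    return scores


def sample_query_length(rng, shape=2.0, scale=1.5, max_len=10):
    """Draw a query length from a Gamma distribution, clipped to [1, max_len].

    The shape/scale values are placeholders; in a learning system they would
    be adjusted according to how well past queries performed.
    """
    return max(1, min(max_len, round(rng.gammavariate(shape, scale))))


def build_query(scores, n_include, n_exclude):
    """Include the highest-scoring terms and exclude the lowest-scoring ones."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    include = ranked[:n_include]
    exclude = [t for t in reversed(ranked) if t not in include][:n_exclude]
    # The +/- prefix syntax is an assumed search-engine convention.
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


# Toy documents already labeled by a (hypothetical) language filter.
relevant = [["in", "je", "na"], ["je", "za"]]
nonrelevant = [["the", "in", "of"], ["the", "and"]]

scores = odds_ratio_scores(relevant, nonrelevant)
query = build_query(scores, n_include=1, n_exclude=1)
length = sample_query_length(random.Random(0))
```

In this toy run, a term that appears only in target-language documents receives a positive score, a term that appears only in other documents receives a negative score, and a term shared equally between the two sets scores zero, so it is neither included nor excluded first.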

