QProber: A system for automatic classification of hidden-web databases
- ACM TOIS, 2003
Cited by 53 (11 self)
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers.” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
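The idea of classifying a database from probe match counts alone can be illustrated with a minimal sketch. It is not QProber's actual algorithm: the function name `classify_by_probe_counts`, the `min_share` threshold, and the toy probes and counts are all assumptions made for illustration. The only property taken from the abstract is that the classifier sees the number of matches per probe and never retrieves a document.

```python
from typing import Callable, Dict, List


def classify_by_probe_counts(
    probes_by_category: Dict[str, List[str]],
    count_matches: Callable[[str], int],
    min_share: float = 0.4,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Return categories whose probes account for at least min_share of all matches."""
    # Sum the reported match counts per category; no documents are fetched.
    totals = {
        category: sum(count_matches(probe) for probe in probes)
        for category, probes in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(
        category
        for category, total in totals.items()
        if total / grand_total >= min_share
    )


# Toy stand-in for a search interface that only reports hit counts.
fake_counts = {"cancer": 900, "diabetes": 600, "soccer": 20, "baseball": 30}
probes = {"Health": ["cancer", "diabetes"], "Sports": ["soccer", "baseball"]}
labels = classify_by_probe_counts(probes, lambda q: fake_counts.get(q, 0))
print(labels)
```

In this toy run the Health probes account for 1500 of 1550 matches, so only that category passes the threshold. A real system would derive the probes from a trained document classifier rather than from a hand-written dictionary.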
BootCaT: Bootstrapping Corpora and Terms from the Web
- In Proceedings of LREC 2004, 2004
Cited by 45 (9 self)
This paper introduces the BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web. The procedure requires only a small set of seed terms as input. The seeds are used to build a corpus via automated Google queries, and more terms are extracted from this corpus. In turn, these new terms are used as seeds to build a larger corpus via automated queries, and so forth. The corpus and the unigram terms are then used to extract multi-word terms. We conducted an evaluation of the tools by applying them to the construction of English and Italian corpora and term lists from the domain of psychiatry. The results illustrate the potential usefulness of the tools.
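The seed-query-extract loop described in this abstract can be sketched as follows. This is an illustrative reconstruction in Python, not the BootCaT Perl code: the `search` callable stands in for the automated search-engine queries, pairing seeds into queries and ranking new terms by raw frequency are assumptions, and the multi-word term extraction step is omitted.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, List, Tuple


def bootstrap(
    seeds: List[str],
    search: Callable[[str], List[str]],  # hypothetical: query string -> document texts
    rounds: int = 2,
    new_terms_per_round: int = 1,
) -> Tuple[List[str], List[str]]:
    """Grow a corpus and a term list by alternating querying and term extraction."""
    corpus: List[str] = []
    seen = set()
    terms = list(seeds)
    for _ in range(rounds):
        # Query with every pair of current terms and collect unseen documents.
        for pair in combinations(terms, 2):
            for doc in search(" ".join(pair)):
                if doc not in seen:
                    seen.add(doc)
                    corpus.append(doc)
        # Assumption: rank candidate terms by raw frequency in the corpus so far.
        counts = Counter(word for doc in corpus for word in doc.lower().split())
        known = set(terms)
        candidates = [word for word, _ in counts.most_common() if word not in known]
        terms.extend(candidates[:new_terms_per_round])
    return corpus, terms


# Toy document collection standing in for the web.
DOCS = [
    "anxiety depression therapy",
    "anxiety therapy medication",
    "depression therapy counselling",
    "football match report",
]


def toy_search(query: str) -> List[str]:
    """Return documents that contain every word of the query."""
    words = query.split()
    return [doc for doc in DOCS if all(word in doc.split() for word in words)]


corpus, terms = bootstrap(["anxiety", "depression"], toy_search)
print(len(corpus), terms)
```

The first round finds one document and adds "therapy" as a new seed; the second round uses that seed to reach two more documents, which is the corpus growth the procedure relies on. The off-topic document is never retrieved because no query matches it.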
Querying Text Databases for Efficient Information Extraction
- In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003
Cited by 37 (9 self)
A wealth of information is hidden within unstructured text. This information is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database, or use filters to select promising documents for extraction. The exhaustive scanning approach is not practical or even feasible for large databases, and the current filtering techniques require human involvement to maintain and to adapt to new databases and domains. In this paper, we develop an automatic query-based technique to retrieve documents useful for the extraction of user-defined relations from large text databases, which can be adapted to new domains, databases, or target relations with minimal human effort. We report a thorough experimental evaluation over a large newspaper archive that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents.
Retrieving Japanese specialized terms and corpora from the World Wide Web
- Proceedings of KONVENS 2004, 2004
Cited by 3 (1 self)
The BootCaT toolkit (Baroni and Bernardini, 2004) is a suite of Perl programs implementing a procedure to bootstrap specialized corpora and terms from the web using minimal knowledge sources. In this paper, we report ongoing work in which we apply the BootCaT procedure to a Japanese corpus and term extraction task in the hotel terminology domain. The results of our experiments are very encouraging, indicating that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which we tested the procedure originally.
Linguistic Resource Creation for Research and Technology Development: A Recent Experiment
- ACM Transactions on Asian Language Information Processing (TALIP), 2003
Cited by 2 (1 self)
Advances in statistical machine learning encourage language-independent approaches to linguistic technology development. Experiments in “porting” technologies to handle new natural languages have revealed a great potential for multilingual computing, but also a frustrating lack of linguistic resources for most languages. Recent efforts to address the lack of available resources have focused either on intensive resource development for a small number of languages or development of technologies for rapid porting. The Linguistic Data Consortium recently participated in an experiment falling primarily under the first approach, the surprise language exercise. This article describes linguistic resource creation within this context, including the overall methodology for surveying and collecting language resources, as well as details of the resources developed during the exercise. The article concludes with discussion of a new approach to solving the problem of limited linguistic resources, one that has recently proven effective in identifying core linguistic resources for less commonly studied languages.
Experience building a large corpus for Chinese lexicon construction
- In [6
, 2006
Cited by 2 (0 self)
The World Wide Web (WWW) provides a large and constantly growing renewable source of natural language data in many of the world’s languages. Computational linguists and lexicographers
Frontiers in linguistic annotation for lower-density languages
- In Proceedings of COLING/ACL2006 Workshop on Frontiers in Linguistically Annotated Corpora, 2006
Cited by 2 (0 self)
The languages that are most commonly subject to linguistic annotation on a large scale tend to be those with the largest populations or with recent histories of linguistic scholarship. In this paper we discuss the problems associated with lower-density languages in the context of the development of linguistically annotated resources. We frame our work with three key questions regarding the definition of lower-density languages, increasing available resources, and reducing data requirements. A number of steps forward are identified for increasing the number of lower-density language corpora with linguistic annotations.
Measuring Web-Corpus Randomness: A Progress Report
Cited by 2 (1 self)
The Web allows fast and inexpensive construction of general purpose corpora, i.e., corpora that are not meant to represent a specific sublanguage, but a language as a whole, and thus should be unbiased with respect to domains and genres. In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness (with respect to a number of non-random partitions) of a Web corpus. The method is based on the comparison of the word frequency distributions of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We first show that the measure of randomness we devised gives the expected results when tested on random samples from the whole British National Corpus and from biased subsets of BNC documents. We then apply the method to the task of building a corpus via queries to the Google search engine. We obtain very encouraging results, indicating that our approach can be used, reliably, to distinguish between biased and unbiased document sets. More specifically, the results indicate that medium frequency query terms might lead to more random results (and thus to a less biased corpus) than either high frequency terms or terms selected from the whole frequency spectrum.
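Comparing a target corpus's word frequency distribution against distributions from deliberately biased corpora can be sketched as below. The abstract does not name the distance measure, so the use of Jensen-Shannon divergence here is an assumption, as are the helper names and the toy texts; the sketch only shows the shape of the comparison, not the paper's randomness measure.

```python
import math
from collections import Counter
from typing import Dict


def word_distribution(text: str) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated, lowercased token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    mixture = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(dist: Dict[str, float]) -> float:
        # Every word with dist[w] > 0 also has mixture[w] > 0, so the ratio is defined.
        return sum(prob * math.log2(prob / mixture[w]) for w, prob in dist.items() if prob > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Toy target corpus and two deliberately biased comparison corpora (hypothetical data).
target = word_distribution("the cat sat on the mat")
biased = {
    "sports": word_distribution("goal match team goal"),
    "pets": word_distribution("the cat and the dog"),
}
distances = {name: js_divergence(target, dist) for name, dist in biased.items()}
print(distances)
```

With these toy texts the target shares vocabulary with the "pets" corpus and none with the "sports" corpus, so it sits closer to the former. How such distances are aggregated into a single randomness score is left open here, since the abstract does not specify it.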

