Results 1 - 10 of 1,796
Explorations into Unsupervised Corpus Quality Assessment, 2008
"... Corpora, large bodies of text, are of great importance to the field of Natural Language Processing. They are for instance used to train systems on specific tasks through machine learning. When a system is trained on a corpus of low quality, it will not provide reliable results. We search for a metri ..."
Abstract
Corpora, large bodies of text, are of great importance to the field of Natural Language Processing. They are for instance used to train systems on specific tasks through machine learning. When a system is trained on a corpus of low quality, it will not provide reliable results. We search for a
The Influence of Corpus Quality on Statistical Measurements on Language Resources
"... The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements perfor ..."
Abstract - Cited by 1 (1 self)
The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements
RCV1: A new benchmark collection for text categorization research - Journal of Machine Learning Research, 2004
"... Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data ..."
Abstract - Cited by 663 (11 self)
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which
CORPUS QUALITY IMPROVEMENTS FOR STATISTICAL MACHINE TRANSLATION
"... Abstract- In this paper, we tended to explore what data quality is important for parallel corpuses. This work is impelled by our attempts to grasp the factors which may have an effect on the quality of corpus for statistical machine translations nowadays. I. ..."
Abstract
Abstract- In this paper, we tended to explore what data quality is important for parallel corpuses. This work is impelled by our attempts to grasp the factors which may have an effect on the quality of corpus for statistical machine translations nowadays. I.
The Impact of Corpus Quality and Type on Topic based Text Segmentation Evaluation, 2008
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
multidisciplinary HAL is intended for the deposit and dissemination of research-level scientific documents, whether published or not, from French or foreign teaching and research institutions and from public or private laboratories. The impact of corpus quality and type on topic based text
THE EFFECT OF PARALLEL CORPUS QUALITY VS SIZE IN ENGLISH-TO-TURKISH SMT
"... A parallel corpus plays an important role in statistical machine translation (SMT) systems. In this study, our aim is to figure out the effects of parallel corpus size and quality in the SMT. We develop a machine learning based classifier to classify parallel sentence pairs as high-quality or poor-q ..."
Abstract
A parallel corpus plays an important role in statistical machine translation (SMT) systems. In this study, our aim is to figure out the effects of parallel corpus size and quality in the SMT. We develop a machine learning based classifier to classify parallel sentence pairs as high-quality or poor-quality
Learning dictionaries for information extraction by multi-level bootstrapping
- in AAAI’99/IAAI’99 – Proceedings of the 16th National Conference on Artificial Intelligence & 11th Innovative Applications of Artificial Intelligence Conference
"... Information extraction systems usually require two dictionaries: a semantic lexicon and a dictionary of extraction patterns for the domain. We present a multilevel bootstrapping algorithm that generates both the semantic lexicon and extraction patterns simultaneously. As input, our technique require ..."
Abstract - Cited by 378 (21 self)
pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories.
Modeling Word Meaning: Distributional Semantics and the Corpus Quality-Quantity Trade-Off
"... Dictionaries constructed using distributional models of lexical semantics have a wide range of applications in NLP and in the modeling of linguistic cognition. However when constructing such a model, we are faced with range of corpora to choose from. Often there is a choice between small carefully c ..."
Abstract
of the resulting semantic representations using a set of behavioral and neural-activity benchmarks that depend on word similarity. We find that the quality of the input text has a very strong effect on the performance of the output model, and that a corpus of high quality at a small size can outperform a corpus
Author manuscript, published in "CLA'08: Computational Linguistic Association, (2008)" The impact of corpus quality and type on topic based text segmentation evaluation, 2008
"... Abstract—In this paper, we try to fathom the real impact of corpus quality on methods performances and their evaluations. The considered task is topic-based text segmentation, and two highly different unsupervised algorithms are compared: C99, a word-based system, augmented with LSA, and Transeg, a ..."
Abstract
Abstract—In this paper, we try to fathom the real impact of corpus quality on methods performances and their evaluations. The considered task is topic-based text segmentation, and two highly different unsupervised algorithms are compared: C99, a word-based system, augmented with LSA, and Transeg
Reading Tea Leaves: How Humans Interpret Topic Models
"... Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summariz ..."
Abstract - Cited by 238 (26 self)
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models