Results 1 - 10 of 1,796

Explorations into Unsupervised Corpus Quality Assessment

by Matje Van De Camp, 2008
"... Corpora, large bodies of text, are of great importance to the field of Natural Language Processing. They are for instance used to train systems on specific tasks through machine learning. When a system is trained on a corpus of low quality, it will not provide reliable results. We search for a metri ..."

The Influence of Corpus Quality on Statistical Measurements on Language Resources

by Thomas Eckart, Uwe Quasthoff, Dirk Goldhahn
"... The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements perfor ..."
Cited by 1 (1 self)

RCV1: A new benchmark collection for text categorization research

by David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li - Journal of Machine Learning Research, 2004
"... Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data ..."
Cited by 663 (11 self)

CORPUS QUALITY IMPROVEMENTS FOR STATISTICAL MACHINE TRANSLATION

by Prof Shikha Maheshwari, Prof Himanshu Sharma (Jaipur, India)
"... In this paper, we tended to explore what data quality is important for parallel corpuses. This work is impelled by our attempts to grasp the factors which may have an effect on the quality of corpus for statistical machine translations nowadays. ..."

The Impact of Corpus Quality and Type on Topic based Text Segmentation Evaluation

by Violaine Prince, 2008

THE EFFECT OF PARALLEL CORPUS QUALITY VS SIZE IN ENGLISH-TO-TURKISH SMT

by Eray Yıldız, Ahmed Cüneyd Tantuğ, Banu Diri (Informatics Faculty)
"... A parallel corpus plays an important role in statistical machine translation (SMT) systems. In this study, our aim is to figure out the effects of parallel corpus size and quality in the SMT. We develop a machine learning based classifier to classify parallel sentence pairs as high-quality or poor-quality ..."

Learning dictionaries for information extraction by multi-level bootstrapping

by Ellen Riloff, Rosie Jones - in AAAI’99/IAAI’99 – Proceedings of the 16th National Conference on Artificial Intelligence & 11th Innovative Applications of Artificial Intelligence Conference
"... Information extraction systems usually require two dictionaries: a semantic lexicon and a dictionary of extraction patterns for the domain. We present a multilevel bootstrapping algorithm that generates both the semantic lexicon and extraction patterns simultaneously. As input, our technique require ..."
Cited by 378 (21 self)
"... pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories."

Modeling Word Meaning: Distributional Semantics and the Corpus Quality-Quantity Trade-Off

by Seshadri Sridharan, Brian Murphy
"... Dictionaries constructed using distributional models of lexical semantics have a wide range of applications in NLP and in the modeling of linguistic cognition. However when constructing such a model, we are faced with range of corpora to choose from. Often there is a choice between small carefully c ..."
"... of the resulting semantic representations using a set of behavioral and neural-activity benchmarks that depend on word similarity. We find that the quality of the input text has a very strong effect on the performance of the output model, and that a corpus of high quality at a small size can outperform a corpus ..."

The impact of corpus quality and type on topic based text segmentation evaluation

by Alexandre Labadié, Violaine Prince - CLA'08: Computational Linguistic Association, 2008
"... In this paper, we try to fathom the real impact of corpus quality on methods performances and their evaluations. The considered task is topic-based text segmentation, and two highly different unsupervised algorithms are compared: C99, a word-based system, augmented with LSA, and Transeg, a ..."

Reading Tea Leaves: How Humans Interpret Topic Models

by Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei
"... Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summariz ..."
Cited by 238 (26 self)

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University