Results 1 - 10 of 7,658
Europarl: A Parallel Corpus for Statistical Machine Translation
"... We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translat ..."
Cited by 519 (1 self)
Semantic similarity based on corpus statistics and lexical taxonomy
- Proc of 10th International Conference on Research in Computational Linguistics, ROCLING’97, 1997
"... This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantifie ..."
Cited by 873 (0 self)
quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content
Data Corpus: 1 CD By
"... Web pages often embed scripts for a variety of purposes, including advertising and dynamic interaction. Understanding embedded scripts and their purposes can often help to interpret or provide crucial information about the web page. I have developed a functionality-based categorization of JavaScript ..."
classification performance. I perform experiments on the standard WT10G web page corpus, and show that my techniques eliminate over 50% of errors over a standard text classification baseline.
Building a Large Annotated Corpus of English: The Penn Treebank
- COMPUTATIONAL LINGUISTICS, 1993
"... There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information abou ..."
Cited by 2740 (10 self)
-1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1.
A Detailed Description of the AVOZES Data Corpus
- 2004
"... The AVOZES data corpus has recently been made publicly available for other interested researchers. ..."
Cited by 1 (0 self)
Overview of the Face Recognition Grand Challenge
- In IEEE CVPR, 2005
"... Over the last couple of years, face recognition researchers have been developing new techniques. These developments are being fueled by advances in computer vision techniques, computer design, sensor design, and interest in fielding face recognition systems. Such advances hold the promise of reducin ..."
Cited by 461 (32 self)
of reducing the error rate in face recognition systems by an order of magnitude over Face Recognition Vendor Test (FRVT) 2002 results. The Face Recognition Grand Challenge (FRGC) is designed to achieve this performance goal by presenting to researchers a six-experiment challenge problem along with data corpus
An Empirical Study of Smoothing Techniques for Language Modeling
- 1998
"... We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Br ..."
Cited by 1224 (21 self)
RCV1: A new benchmark collection for text categorization research
- JOURNAL OF MACHINE LEARNING RESEARCH, 2004
"... Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data ..."
Cited by 663 (11 self)
Probabilistic Latent Semantic Indexing
- 1999
"... Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized ..."
Cited by 1225 (10 self)
A Program for Aligning Sentences in Bilingual Corpora
- 1993
"... This paper will describe a method and a program (align) for aligning sentences based on a simple statistical model of character lengths. The program uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend ..."
Cited by 529 (5 self)
the maximum likelihood alignment of sentences. It is remarkable that such a simple approach works as well as it does. An evaluation was performed based on a trilingual corpus of economic reports issued by the Union Bank of Switzerland (UBS) in English, French, and German. The method correctly aligned all