Statistical identification of language (1994)

by T Dunning

Results 1 - 10 of 32

Topic Segmentation: Algorithms and Applications

by Jeffrey C. Reynar, 1998
Cited by 50 (1 self)

Multi-candidate reduction: Sentence compression as a tool for document summarization tasks

by David Zajic, Bonnie J. Dorr, Jimmy Lin, Richard Schwartz - Information Processing and Management Special Issue on Summarization, 2007
Cited by 15 (7 self)
This article examines the application of two single-document sentence compression techniques to the problem of multi-document summarization: a “parse-and-trim” approach and a statistical noisy-channel approach. We introduce the Multi-Candidate Reduction (MCR) framework for multi-document summarization, in which many compressed candidates are generated for each source sentence. These candidates are then selected for inclusion in the final summary based on a combination of static and dynamic features. Evaluations demonstrate that sentence compression is a valuable component of a larger multi-document summarization framework.

Reconsidering language identification for written language resources

by Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, Andrew Mackinlay - Proceedings of LREC2006, 2006
Cited by 11 (0 self)
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.

Multi-Language Text Indexing for Internet Retrieval

by Martin Wechsler, Páraic Sheridan, Peter Schäuble - In Proceedings of the 5th RIAO Conference, Computer-Assisted Information Searching on the Internet, 1997
Cited by 9 (4 self)
We address here the issues associated with indexing multilingual collections of information, as are found for example on the internet. We examine in particular the task of language identification and the use of stemming algorithms for several European languages. We also present the lessons we have learned from our experience in using the SPIDER information retrieval system as a search engine over the intranet of the ETH Zurich, a multilingual intranet which contains documents in English, French, German and Italian.
Keywords: multilingual retrieval, stemming, language identification
1 Introduction
The past number of years has seen an ever-increasing interest among the Information Retrieval community in research into systems that provide effective retrieval of documents and texts in languages other than English. This is evidenced, for example, by the interest in retrieval systems for languages such as Spanish and Chinese at the annual Text REtrieval Conferences (TREC) over the past...

Automatic Headline Generation for Newspaper Stories

by David Zajic, Bonnie Dorr, Richard Schwartz - In the Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC), 2002
Cited by 9 (1 self)
In this paper we propose a novel application of Hidden Markov Models to automatic generation of informative headlines for English texts. We propose four decoding parameters to make the headlines appear more like Headlinese, the language of informative newspaper headlines. We also allow for morphological variation in words between headline and story English. Informal and formal evaluations indicate that our approach produces informative headlines, mimicking a Headlinese style generated by humans.

Automatically Building a Corpus for a Minority Language from the Web

by Rosie Jones, Rayid Ghani - Proceedings of the Student Research Workshop at the 38th Annual Meeting of the Association for Computational Linguistics, 2000
Cited by 8 (0 self)
We present an approach to language-specific query-based sampling which, given a single document in a target language, can find many more examples of documents in that language, by automatically constructing queries to access such documents on the world wide web. We propose a number of methods for building search queries to quickly obtain documents in the target language. They perform accurately and efficiently for building a corpus of documents in Tagalog starting from a single seed document, when these documents are only 2.5% of the documents in a collection. We found that sampling with a query consisting of a word selected according to its probability from the minority language corpus constructed so far was very successful. This method built a corpus of documents with word frequencies similar to those in the corpus based on all Tagalog documents in our collection, and required a relatively small number of search queries. It also quickly acquired a good c...
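
The query-selection step described in this abstract (pick one word with probability proportional to its frequency in the corpus built so far) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_query_word`, the whitespace tokenization, and the sample sentences are all assumptions, and the web search step is omitted entirely.

```python
import random
from collections import Counter


def sample_query_word(corpus_docs, rng=random):
    """Pick one query word with probability proportional to its corpus frequency.

    corpus_docs is a list of strings (the documents collected so far).
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    counts = Counter(word for doc in corpus_docs for word in doc.lower().split())
    if not counts:
        raise ValueError("corpus is empty; a seed document is required")
    words = list(counts)
    weights = [counts[w] for w in words]
    # random.choices draws with replacement according to the given weights.
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical seed corpus of two short documents.
seed_docs = ["magandang umaga sa inyong lahat", "salamat sa inyong tulong"]
query = sample_query_word(seed_docs, random.Random(0))
```

Frequent words are drawn more often, so the queries tend to mirror the word distribution of the corpus collected so far.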

Identification of transliterated foreign words in Hebrew script

by Yoav Goldberg, Michael Elhadad - In Proc. CICLing, volume LNCS 4919, 2008
Cited by 7 (0 self)
We present a loosely-supervised method for context-free identification of transliterated foreign names and borrowed words in Hebrew text. The method is purely statistical and does not require the use of any lexicons or linguistic analysis tool for the source languages (Hebrew, in our case). It also does not require any manually annotated data for training – we learn from noisy data acquired by over-generation. We report precision/recall results of 80/82 for a corpus of 4044 unique words, containing 368 foreign words.

Script and language identification in degraded and distorted document images

by Shijian Lu, Chew Lim Tan, Weihua Huang - In 21st National Conference on Artificial Intelligence, 2006
Cited by 4 (4 self)
This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Unlike the reported methods that transform word images through a character shape coding process, our method directly captures word shapes with the local extremum points and the horizontal intersection numbers, which are both tolerant of noise, character segmentation errors, and slight skew distortions. For each language studied, a word shape template and a word frequency template are first constructed based on the proposed word shape coding scheme. Identification is then accomplished based on the Bray-Curtis or Hamming distance between the word shape code of query images and the constructed word shape and frequency templates. Experiments show the average identification rate upon eight Latin-based languages reaches over 99%.
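
For readers unfamiliar with the two distances named in this abstract, the sketch below shows the standard definitions and a nearest-template decision rule. It is only an illustration under stated assumptions: the frequency vectors, the language labels, and the helper `nearest_template` are hypothetical, and the paper's word shape coding itself is not reproduced here.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum|u_i + v_i|."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return num / den if den else 0.0


def hamming(a, b):
    """Hamming distance: number of positions at which two codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_template(query, templates):
    """Return the label whose template vector is closest to the query."""
    return min(templates, key=lambda label: bray_curtis(query, templates[label]))


# Hypothetical frequency templates for two languages (illustrative values only).
templates = {"en": [5, 3, 2], "de": [1, 4, 5]}
best = nearest_template([4, 3, 3], templates)
```

Bray-Curtis suits frequency vectors because it normalizes by the total mass of both vectors, while Hamming distance applies to equal-length symbolic codes.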

Applying Compression to Natural Language Processing

by W. J. Teahan, John G. Cleary - SPAE : The Corpus of Spoken Professional American-English. I have, 1997
Cited by 3 (1 self)
A number of powerful modelling techniques have been developed in recent years to compress natural language text. The best of these are adaptive models operating on the character and word level which are able to perform almost as well as humans at predicting text. We show how to apply character based methods to five areas where language modelling is critical, providing novel solutions to each of these problems.

Comparing Natural Language Identification Methods based on Markov Processes

by Peter Vojtek, Mária Bieliková - In: Slovko, International Seminar on Computer Treatment of Slavic and East European Languages, 2007
Cited by 3 (0 self)
We explore and experiment with categorization-based methods for natural language identification. Two approaches to language identification based on Markov processes are compared; both methods treat the incoming text at the character level. We performed a series of experiments to verify the high precision of the selected methods in the language identification task and to compare them against each other. Experimental evaluation was based on the large-scale Multilingual Reuters Corpus with various European and Slavic languages. Our results showed that both methods are comparable in the task of natural language identification, achieving recall as high as 99.75%.
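
As a rough illustration of character-level Markov language identification in general (not either of the two methods the paper compares), the sketch below trains a first-order character model per language with add-one smoothing and picks the language with the highest log-likelihood. The class name `CharMarkovModel`, the smoothing choice, and the tiny training strings are all assumptions made for this example.

```python
import math
from collections import Counter


class CharMarkovModel:
    """First-order (bigram) character model with add-one smoothing."""

    def __init__(self, text):
        self.bigrams = Counter(zip(text, text[1:]))
        self.contexts = Counter(text[:-1])
        # One extra slot reserves probability mass for unseen characters.
        self.alphabet_size = len(set(text)) + 1

    def log_prob(self, text):
        """Sum of smoothed log P(current char | previous char) over the text."""
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.alphabet_size
            total += math.log(num / den)
        return total


def identify(text, models):
    """Return the language whose model assigns the text the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Tiny hypothetical training samples; real systems train on large corpora.
models = {
    "en": CharMarkovModel("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": CharMarkovModel("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
guess = identify("the dog and the fox", models)
```

Higher-order models (longer character contexts) generally separate languages better but need more training text.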