Reconsidering language identification for written language resources
- Proceedings of LREC2006, 2006
Cited by 11 (0 self)
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.
Multi-Language Text Indexing for Internet Retrieval
- In Proceedings of the 5th RIAO Conference, Computer-Assisted Information Searching on the Internet, 1997
Cited by 9 (4 self)
We address here the issues associated with indexing multilingual collections of information, as is found for example on the internet. We examine in particular the task of language identification and the use of stemming algorithms for several European languages. We also present the lessons we have learned from our experience in using the SPIDER information retrieval system as a search engine over the intranet of the ETH Zurich, a multilingual intranet which contains documents in English, French, German and Italian.
Keywords: multilingual retrieval, stemming, language identification
1 Introduction
The past number of years has seen an ever-increasing interest among the Information Retrieval community in research into systems that provide effective retrieval of documents and texts in languages other than English. This is evidenced, for example, by the interest in retrieval systems for languages such as Spanish and Chinese at the annual Text REtrieval Conferences (TREC) over the past...
Evaluation of a language identification system for mono- and multilingual text documents
- In SAC ’06: Proc. of the 2006 ACM symposium on Applied computing
Cited by 3 (1 self)
Language identification is a classification task between a pre-defined model and a text in an unknown language. This paper presents the implementation of a tool for language identification for mono- and multi-lingual documents. The tool includes four algorithms for language identification. An evaluation for eight languages including Ukrainian and Russian and various text lengths is presented. It could be shown that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The tool can also identify language changes within one multi-lingual document.
Keywords: language identification, n-gram indexing, language model, evaluation
Study of Some Distance Measures for Language and Encoding Identification
- In Proceedings of ACL 2006 Workshop on Linguistic Distance, Sydney, 2006
Cited by 3 (2 self)
To determine how close two language models (e.g., n-gram models) are, we can use several distance measures. If we can represent the models as distributions, then the similarity is basically the similarity of distributions. A number of measures are based on an information-theoretic approach. In this paper we present some experiments on using such similarity measures for an old Natural Language Processing (NLP) problem. One of the measures considered is perhaps a novel one, which we have called mutual cross entropy. Other measures are either well known or based on well known measures, but the results obtained with them vis-à-vis one another might help in gaining an insight into how similarity measures work in practice. The first step in processing a text is to identify the language and encoding of its contents. This is a practical problem since for many languages, there are no universally followed text encoding standards. The method we have used in this paper for language and encoding identification uses pruned character n-grams, alone as well as augmented with word n-grams. This method seems to give results comparable to other methods.
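As a rough illustration of comparing two character n-gram models as distributions, the Python sketch below computes cross entropy and a simple symmetric variant. The function names, the trigram default, and the `floor` probability for unseen n-grams are assumptions made for this example. The symmetric sum is not the paper's "mutual cross entropy", whose definition is not given in the abstract, and no pruning is applied.

```python
import math
from collections import Counter


def char_ngram_distribution(text, n=3):
    """Relative frequencies of overlapping character n-grams in `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:  # text shorter than n characters
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cross_entropy(p, q, floor=1e-6):
    """H(p, q) = -sum_x p(x) * log2 q(x).

    N-grams missing from q are given the small probability `floor`
    (an assumed, very crude form of smoothing).
    """
    return -sum(prob * math.log2(q.get(gram, floor)) for gram, prob in p.items())


def symmetric_cross_entropy(p, q, floor=1e-6):
    """Sum of the cross entropies in both directions (illustrative only)."""
    return cross_entropy(p, q, floor) + cross_entropy(q, p, floor)
```

A lower value means the two n-gram distributions are closer, so a text could be assigned to whichever reference model yields the smallest distance.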
Language Identification of Search Engine Queries
- 2009
Cited by 2 (0 self)
We consider the language identification problem for search engine queries. First, we propose a method to automatically generate a data set, which uses clickthrough logs of the Yahoo! Search Engine to derive the language of a query indirectly from the language of the documents clicked by the users. Next, we use this data set to train two decision tree classifiers; one that only uses linguistic features and is aimed for textual language identification, and one that additionally uses a non-linguistic feature, and is geared towards the identification of the language intended by the users of the search engine. Our results show that our method produces a highly reliable data set very efficiently, and our decision tree classifier outperforms some of the best methods that have been proposed for the task of written language identification on the domain of search engine queries.
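One simple way to derive a query's language from clicked documents, as the abstract above describes in outline, is a majority vote over the languages of those documents. The Python sketch below is only an assumption about how such labelling could work: the function name, the `min_share` agreement threshold, and the `min_clicks` cutoff are hypothetical and are not taken from the paper.

```python
from collections import Counter


def label_query_language(clicked_doc_languages, min_share=0.8, min_clicks=3):
    """Return the dominant language among clicked documents, or None if unclear.

    `clicked_doc_languages` is a list of language codes, one per clicked
    document. Queries with too few clicks or too little agreement are left
    unlabelled (both thresholds are illustrative assumptions).
    """
    if len(clicked_doc_languages) < min_clicks:
        return None
    language, count = Counter(clicked_doc_languages).most_common(1)[0]
    if count / len(clicked_doc_languages) >= min_share:
        return language
    return None
```

Leaving ambiguous queries unlabelled trades data set size for label reliability; the thresholds would need tuning against real click logs.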
One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- In 4th International Conference, EsTAL 2004, J.L. Vicedo et al. (Eds.), LNAI 3230, 2004
Cited by 1 (1 self)
Word fragments or n-grams have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] or, even, genetic classification of languages [5]. All these techniques share some common aspects such as: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques although it introduces two major differences: (1) relative frequencies are no longer used as vector weights but replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques although not so computationally expensive. This new approach can be simultaneously used to perform document categorization and clustering, information retrieval, and text summarization. In this paper we will describe the foundations of such a technique and its application to both a particular categorization problem (i.e., language identification) and information retrieval tasks.
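The classical scheme summarized in the abstract above (n-grams as coordinates, relative frequencies as weights, cosine for comparison) can be sketched in Python as follows. This illustrates only that baseline, not blindLight itself, whose significance weights and alignment-inspired metric are not specified here. The function names, the trigram default, and the toy profiles in the usage note are assumptions.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Map text to a sparse vector: character n-gram -> relative frequency."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_language(text, profiles, n=3):
    """Return the key of the profile whose vector is most similar to the text."""
    vector = ngram_vector(text, n)
    return max(profiles, key=lambda lang: cosine_similarity(vector, profiles[lang]))
```

For example, `profiles` could map language codes to `ngram_vector(...)` outputs built from a sample text per language, and `identify_language("the dog and the fox", profiles)` would then pick the closest one.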
Accès Multilingue Aux Systèmes D'information
- 2001
Cited by 1 (0 self)
Introduction. The global information society has radically transformed the way knowledge is acquired, disseminated and exchanged, bringing about a revolution in the world of libraries. Users of networked, internationally distributed collections need to be able to find, retrieve and understand relevant information, whatever its language and form of storage. Many users have a partial knowledge of foreign languages, but their ability may prove insufficient to formulate correctly the search queries suited to their information need. These users will be considerably helped if they can enter their query in their native language, since they are able to examine and extract information from relevant documents even when these are not translated. Monolingual users, for their part, will be able to use translation aids to understand search results in a...
Multilingual Access for Information Systems
- 2001
With the rapid growth of the global information society, the concept of library has evolved to embrace all kinds of information collections, on all kinds of storage media, and using many different access methods. The users of today's information networks and digital libraries, no longer restricted by geographic or spatial boundaries, want to be able to find, retrieve and understand relevant information wherever and in whatever language it may have been stored. For this reason, much attention has been given over the past few years to the study and development of tools and technologies for multilingual information access (MLIA). The tutorial will provide participants with an overview of the main issues of interest in this sector. Topics covered will include: character encoding, specific requirements of particular languages and scripts, localization and presentation issues, techniques for cross-language retrieval, the importance of resources.
CORPUS LINGUISTICS AND THE DESIGN OF A RESPONSE MESSAGE
- 2001
Most research related to SETI, the Search for Extra-Terrestrial Intelligence, is focussed on techniques for detection of possible incoming signals from extraterrestrial intelligent sources, and algorithms for analysis of these signals to identify intelligent language-like characteristics. However, another issue for research and debate is the nature of our response, should a signal arrive and be detected. The design of potentially the most significant communicative act in history should not be decided solely by astrophysicists; the Corpus Linguistics

