Results 1 - 8 of 8
Wikipedia-based semantic interpretation for natural language processing
- J. Artif. Int. Res
Cited by 13 (3 self)
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
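The idea of representing a text as a weighted vector of concepts and comparing texts in that space can be illustrated with a minimal sketch. It is not the authors' implementation: the three-entry `CONCEPTS` dictionary is a hypothetical stand-in for Wikipedia articles, and the plain TF-IDF weighting, the cosine measure, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles (concept title -> article text).
CONCEPTS = {
    "Computer": "computer software hardware program keyboard mouse",
    "Cat": "cat feline pet mouse whiskers",
    "Finance": "bank money loan interest market",
}

def build_inverted_index(concepts):
    """Map each word to {concept: TF-IDF weight of the word in that concept}."""
    n = len(concepts)
    term_counts = {c: Counter(text.split()) for c, text in concepts.items()}
    # Document frequency: in how many concept articles each word occurs.
    doc_freq = Counter(w for counts in term_counts.values() for w in counts)
    index = {}
    for concept, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(n / doc_freq[word])
            index.setdefault(word, {})[concept] = tf * idf
    return index

def interpret(text, index):
    """Represent a text as a sparse vector of concept weights."""
    vec = Counter()
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)

def relatedness(a, b, index):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = interpret(a, index), interpret(b, index)
    dot = sum(w * vb.get(c, 0.0) for c, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

index = build_inverted_index(CONCEPTS)
print(relatedness("software program", "hardware keyboard", index))
print(relatedness("software", "loan", index))
```

Because every dimension of the vector is a named concept, the non-zero entries returned by `interpret` can be read directly, which is the sense in which such a representation is easy to explain to users.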
Explicit Versus Latent Concept Models for Cross-Language Information Retrieval
Cited by 4 (0 self)
The field of information retrieval and text manipulation (classification, clustering) still strives for models that allow semantic information to be folded in to improve performance over standard bag-of-words models. Many approaches aim at concept-based retrieval but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, through latent topics derived from the data itself, as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question that has not been answered so far is whether models based on explicitly given concepts (as in the ESA model, for instance) perform inherently better than retrieval models based on “latent” concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in a cross-language setting, which inherently requires concept-based retrieval that bridges different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.
V.: A late fusion approach to cross-lingual document re-ranking
- In: Proceedings of the 19th ACM international conference on Information and knowledge management
, 2010
Cited by 1 (1 self)
The field of information retrieval still strives to develop models that allow semantic information to be integrated into the ranking process to improve performance over standard bag-of-words models. Cross-lingual information retrieval is an example of where such a model is required, as content or concepts often need to be matched across languages. To overcome this problem, a conceptual model has been adopted in ranking an entire corpus which normally exploits latent/implicit features of the text. One of the drawbacks of this model is that the computational cost is significant and often intractable in modern test collections. Therefore, approaches utilizing concept-based models for re-ranking initial retrieval results have attracted a considerable amount of study, in particular the latent concept model. However, fitting such a model to a smaller collection is less meaningful than fitting it to the whole corpus. This paper proposes a late fusion method which incorporates scores generated by using external knowledge to enhance the space produced by the latent concept method. This method is further demonstrated to be suitable for multilingual re-ranking purposes. To illustrate the effectiveness of the proposed method, experiments were conducted over test collections across three languages. The results demonstrate that the method can comfortably achieve improvements in retrieval performance over several re-ranking methods.
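As a rough illustration of what late fusion for re-ranking can look like, the sketch below combines two per-document score lists after each has been computed separately. The abstract does not give the paper's fusion formula, so the min-max normalization, the linear interpolation weight `alpha`, and the function names are all assumptions rather than the authors' method.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so none of them carries ranking information.
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def late_fusion(latent_scores, external_scores, alpha=0.5):
    """Re-rank documents by a weighted sum of two normalized score lists.

    alpha weights the latent-concept scores; (1 - alpha) weights the scores
    obtained from external knowledge. Both are hypothetical inputs here.
    """
    a = min_max(latent_scores)
    b = min_max(external_scores)
    docs = set(a) | set(b)
    fused = {d: alpha * a.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))

latent = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
external = {"d1": 1.0, "d2": 8.0, "d3": 3.0}
print(late_fusion(latent, external, alpha=0.5))
```

Normalizing before combining matters because the two scorers may produce values on very different scales; without it the larger-scaled list would dominate regardless of `alpha`.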
Using Explicit Semantic Analysis for Cross-Lingual Link Discovery
Cited by 1 (0 self)
This paper explores how to automatically generate cross-language links between resources in large document collections. The paper presents new methods for Cross-Lingual Link Discovery (CLLD) based on Explicit Semantic Analysis (ESA). The methods are applicable to any multilingual document collection. In this report, we present a comparative study of these methods on the Wikipedia corpus and provide new insights into the evaluation of link discovery systems. In particular, we measure the agreement of human annotators in linking articles in different language versions of Wikipedia, and compare it to the results achieved by the presented methods.
Multilingual Expert Search using Linked Open Data as Interlingual Representation
Most Information Retrieval models treat documents as Bag-of-Words and are thereby bound to the language of the documents. In this paper, we present an approach using Linked Open Data resources, i.e. URIs, as interlingual document representations. Documents and queries are summarized by the resources they contain. We show the applicability of our approach for multilingual retrieval with a case study on expert search.
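A toy sketch can show how summarizing documents and queries by the resources they contain makes matching independent of language. The `LABEL_TO_URI` dictionary, the `example.org` URIs, the single-word label lookup, and the overlap score are hypothetical placeholders, not the annotation or ranking method used in the paper.

```python
from collections import Counter

# Hypothetical label dictionary: English and German words mapped to the same
# placeholder URIs. A real system would use an entity-annotation service.
ONTOLOGY = "http://example.org/resource/Ontology"
SEMANTICS = "http://example.org/resource/Semantics"
RETRIEVAL = "http://example.org/resource/Information_retrieval"

LABEL_TO_URI = {
    "ontology": ONTOLOGY, "ontologie": ONTOLOGY,
    "semantics": SEMANTICS, "semantik": SEMANTICS,
    "retrieval": RETRIEVAL, "suche": RETRIEVAL,
}

def to_resources(text):
    """Summarize a text as a bag of the URIs its words refer to."""
    words = text.lower().split()
    return Counter(LABEL_TO_URI[w] for w in words if w in LABEL_TO_URI)

def score(query, document):
    """Count the URIs shared by query and document (multiset overlap)."""
    shared = to_resources(query) & to_resources(document)
    return sum(shared.values())

# An English query matches a German document through shared URIs.
print(score("ontology semantics", "Ontologie und Semantik im Web"))
```

Words without an entry in the dictionary are simply ignored, so the quality of such a representation depends on how well the labels of each language are covered.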
Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
The field of information retrieval and text manipulation (classification, clustering) still strives for models that allow semantic information to be folded in to improve performance over standard bag-of-words models. Many approaches aim at concept-based retrieval but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, through latent topics derived from the data itself, as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question that has not been answered so far is whether models based on explicitly given concepts (as in the ESA model, for instance) perform inherently better than retrieval models based on “latent” concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in a cross-language setting, which inherently requires concept-based retrieval that bridges different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.
Supporter
, 2011
Reasoning capabilities were introduced into question-answering (QA) systems in the late 70s. A second generation of QA systems, aimed at being cooperative, emerged in the late 80s to early 90s. In these systems, quite advanced reasoning models were developed on closed domains to go beyond the production of direct responses to a query, in particular when the query has no response or when it
Sponsored by: Endorsed by:
Multilingualism has become an issue of major interest for the Semantic Web community, in light of the substantial growth in the number of internet users who create and update knowledge all over the world in languages other than English. This process has been accelerated by initiatives such as the Linked Data initiative, which encourages not only governments and public institutes to make their data available to the public, but also private organizations in domains as far apart as medicine, cartography or music. These actors publish their data sources in the languages they are available in, and so, to make this information available to an international community, multilingual knowledge representation, access and translation are a pressing need. This second edition of the MSW workshop focused on the representation of multilingual information in the Semantic Web and Linked Data, specifically addressing issues in the cross‐lingual discovery of mappings between multilingual Linked Data vocabularies and data sets, and the cross‐lingual querying of knowledge repositories. The workshop brought together researchers from several distinct communities, including natural language processing, computational linguistics, human‐computer

