Large-scale named entity disambiguation based on Wikipedia data. (2007)

by Silviu Cucerzan
Venue: In EMNLP ’07

Results 1 - 10 of 238

DBpedia -- A Crystallization Point for the Web of Data

by Christian Bizer , Jens Lehmann , Georgi Kobilarov , Sören Auer , Christian Becker , Richard Cyganiak , Sebastian Hellmann , 2009
"... The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier ..."
Abstract - Cited by 374 (36 self) - Add to MetaCart
The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier that can be dereferenced over the Web into a rich RDF description of the entity, including human-readable definitions in 30 languages, relationships to other resources, classifications in four concept hierarchies, various facts as well as data-level links to other Web data sources describing the entity. Over the last year, an increasing number of data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of data. Currently, the Web of interlinked data sources around DBpedia provides approximately 4.7 billion pieces of information and covers domains such as geographic information, people, companies, films, music, genes, drugs, books, and scientific publications. This article describes the extraction of the DBpedia knowledge base, the current status of interlinking DBpedia with other data sources on the Web, and gives an overview of applications that facilitate the Web of Data around DBpedia.
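
To make the "dereferenced over the Web" part concrete, the minimal sketch below fetches the RDF description of a single DBpedia resource via standard Linked Data content negotiation. The example resource URI and the Accept header are assumptions based on common DBpedia usage, not details taken from the article.

```python
# Minimal sketch: dereference one DBpedia resource URI into an RDF (Turtle)
# description. Assumes the public dbpedia.org service and standard Linked
# Data content negotiation; not a procedure described in the article itself.
import urllib.request

resource = "http://dbpedia.org/resource/Berlin"  # illustrative example entity
request = urllib.request.Request(resource, headers={"Accept": "text/turtle"})

with urllib.request.urlopen(request) as response:
    turtle = response.read().decode("utf-8")

# Show the first few lines of the returned description.
print("\n".join(turtle.splitlines()[:20]))
```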

DBpedia Spotlight: Shedding light on the web of documents

by Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer - In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), 2011
"... Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a syste ..."
Abstract - Cited by 174 (5 self) - Add to MetaCart
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.
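
As a rough illustration of how such configurable annotation is typically invoked, the sketch below calls a Spotlight-style REST endpoint with a confidence threshold. The endpoint URL, parameter names (text, confidence), and response fields are assumptions based on the project's publicly documented web service, not details from this paper.

```python
# Hedged sketch of annotating a snippet with a DBpedia Spotlight-style service.
# Endpoint, parameters, and response fields are assumptions based on the
# publicly documented web service, not a specification given in the paper.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed public endpoint

params = urllib.parse.urlencode({
    "text": "Berlin is the capital of Germany.",
    "confidence": 0.5,  # disambiguation-confidence threshold
})
request = urllib.request.Request(
    f"{ENDPOINT}?{params}", headers={"Accept": "application/json"}
)
with urllib.request.urlopen(request) as response:
    annotations = json.load(response)

for resource in annotations.get("Resources", []):
    print(resource.get("@surfaceForm"), "->", resource.get("@URI"))
```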

Citation Context

...r approaches for precision, leaving little flexibility for users with use cases where recall is important, or they have not evaluated the applicability of their approaches with more general use cases [10, 6, 7, 19]. SemTag [10] was the first Web-scale named entity disambiguation system. They used metadata associated with each entity in an entity catalog derived from TAP [13] as context for disambiguation. SemTa...

An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links

by David Milne, Ian H. Witten - In Proceedings of AAAI 2008, 2008
"... This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its ..."
Abstract - Cited by 167 (8 self) - Add to MetaCart
This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter.
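
The link-based measure is usually presented as a Normalized Google Distance-style formula over the sets of articles linking to the two terms of interest. The sketch below implements that commonly cited formulation; treat it as an illustrative variant rather than the paper's verbatim definition.

```python
# Sketch of a Wikipedia link-based relatedness measure in the spirit of
# Milne & Witten: a Normalized Google Distance-style formula over sets of
# in-linking articles. An illustrative variant, not the paper's exact text.
import math

def link_relatedness(inlinks_a: set, inlinks_b: set, num_articles: int) -> float:
    """Relatedness of two articles given the ids of articles linking to each."""
    overlap = len(inlinks_a & inlinks_b)
    if overlap == 0:
        return 0.0
    distance = (
        math.log(max(len(inlinks_a), len(inlinks_b))) - math.log(overlap)
    ) / (math.log(num_articles) - math.log(min(len(inlinks_a), len(inlinks_b))))
    return max(0.0, 1.0 - distance)

# Toy example with made-up in-link sets over a 1M-article snapshot.
print(link_relatedness({1, 2, 3, 4}, {3, 4, 5}, num_articles=1_000_000))
```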

Citation Context

....70 Rubenstein and Goodenough 0.52 0.82 0.64 WordSimilarity-353 0.49 0.75 0.69 Weighted average 0.49 0.76 0.68 Table 4: Performance of semantic relatedness measures for three standard datasets. 2006; Cucerzan, 2007); key phrases (Mihalcea and Csomai, 2007); categories (Gabrilovich and Markovitch, 2006); or entries in existing ontologies (Medelyan and Legg, 2008) and thesauri (Ruiz-Casado et al., 2005). In these...

YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia

by Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Gerhard Weikum , 2010
"... We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy o ..."
Abstract - Cited by 158 (20 self) - Add to MetaCart
We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95 % of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple
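
To picture the SPOTL representation mentioned above, the snippet below models a fact as a subject-predicate-object triple extended with optional time and location fields. The field names and the sample fact are illustrative assumptions, not data or schema taken from YAGO2.

```python
# Illustrative model of a SPOTL fact: the classic SPO triple extended with
# Time and Location, as YAGO2 is described to do. Field names and the sample
# values are assumptions for illustration, not actual YAGO2 data.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SpotlFact:
    subject: str
    predicate: str
    obj: str
    time: Optional[Tuple[str, str]] = None  # (begin, end) as ISO dates
    location: Optional[str] = None          # geographic anchor of the fact

fact = SpotlFact(
    subject="Albert_Einstein",
    predicate="wasBornIn",
    obj="Ulm",
    time=("1879-03-14", "1879-03-14"),
    location="Ulm",
)
print(fact)
```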

Collective Annotation of Wikipedia Entities in Web Text

by Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, Soumen Chakrabarti
"... To take the first step beyond keyword-based search toward entity-based search, suitable token spans (“spots”) on documents must be identified as references to real-world entities from an entity catalog. Several systems have been proposed to link spots on Web pages to entities in Wikipedia. They are ..."
Abstract - Cited by 105 (9 self) - Add to MetaCart
To take the first step beyond keyword-based search toward entity-based search, suitable token spans (“spots”) on documents must be identified as references to real-world entities from an entity catalog. Several systems have been proposed to link spots on Web pages to entities in Wikipedia. They are largely based on local compatibility between the text around the spot and textual metadata associated with the entity. Two recent systems exploit inter-label dependencies, but in limited ways. We propose a general collective disambiguation approach. Our premise is that coherent documents refer to entities from one or a few related topics or domains. We give formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. Optimizing the overall entity assignment is NP-hard. We investigate practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters. In experiments involving over a hundred manually annotated Web pages and tens of thousands of spots, our approaches significantly outperform recently proposed algorithms.
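
The trade-off described above, between local spot-to-entity compatibility and global coherence among the chosen entities, can be pictured as maximizing a weighted sum of the two terms. The greedy hill-climbing sketch below is one simple way to attack such an objective; the scores, entity names, and weighting are made up for illustration, and this is not the authors' exact formulation.

```python
# Toy sketch of collective disambiguation: maximize a weighted sum of local
# spot-to-entity compatibility and pairwise coherence between chosen entities
# via greedy hill-climbing. Illustrates the trade-off only; not the paper's
# exact objective or algorithm.
from itertools import combinations

def objective(assignment, local_score, coherence, alpha=0.5):
    local = sum(local_score[spot][ent] for spot, ent in assignment.items())
    global_part = sum(coherence.get(frozenset((e1, e2)), 0.0)
                      for e1, e2 in combinations(assignment.values(), 2))
    return alpha * local + (1 - alpha) * global_part

def hill_climb(candidates, local_score, coherence, alpha=0.5, rounds=10):
    # Start from the locally best entity per spot, then repeatedly try
    # single-spot changes that improve the overall objective.
    assignment = {s: max(ents, key=lambda e: local_score[s][e])
                  for s, ents in candidates.items()}
    for _ in range(rounds):
        improved = False
        for spot, ents in candidates.items():
            best = max(ents, key=lambda e: objective(
                {**assignment, spot: e}, local_score, coherence, alpha))
            if best != assignment[spot]:
                assignment[spot], improved = best, True
        if not improved:
            break
    return assignment

# Made-up example: coherence pulls both spots toward the "animal" reading.
candidates = {"jaguar": ["Jaguar_Cars", "Jaguar_(animal)"],
              "speed":  ["Speed_(vehicle)", "Speed_(biology)"]}
local_score = {"jaguar": {"Jaguar_Cars": 0.6, "Jaguar_(animal)": 0.5},
               "speed":  {"Speed_(vehicle)": 0.4, "Speed_(biology)": 0.45}}
coherence = {frozenset(("Jaguar_Cars", "Speed_(vehicle)")): 0.9,
             frozenset(("Jaguar_(animal)", "Speed_(biology)")): 0.8}
print(hill_climb(candidates, local_score, coherence))
```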

Citation Context

... and Wikify!, sacrifices recall for high precision. For the spots picked by M&W for labeling, even random disambiguation achieves an F1 score of 0.53. Cucerzan’s algorithm. To our knowledge, Cucerzan [4] was the first to recognize general interdependence between entity labels in the context of Wikipedia annotations. He represents each entity γ as a high-dimensional feature vector g(γ), and expressed ...

TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). Available at http://arxiv.org/abs/1006.3498

by Paolo Ferragina, Ugo Scaiella
"... We designed and implemented Tagme, a system that is able to efficiently and judiciously augment a plain-text with pertinent hyperlinks to Wikipedia pages. The specialty of Tagme with respect to known systems [5, 8] is that it may annotate texts which are short and poorly composed, such as snippets o ..."
Abstract - Cited by 82 (6 self) - Add to MetaCart
We designed and implemented Tagme, a system that is able to efficiently and judiciously augment a plain text with pertinent hyperlinks to Wikipedia pages. The specialty of Tagme with respect to known systems [5, 8] is that it may annotate texts which are short and poorly composed, such as snippets of search-engine results, tweets, news, etc. This annotation is extremely informative, so any task that is currently addressed using the bag-of-words paradigm could benefit from using this annotation to draw upon (the millions of) Wikipedia pages and their inter-relations.

Citation Context

... techniques to draw on a vast network of concepts and their inter-relations. To our knowledge the first work that addressed the problem of linking spots to Wikipedia pages was Wikify [6], followed by [2]. Recently Milne and Witten [8] proposed an approach that yielded considerable improvements by hinging on three main ingredients: (i) the identification in the input text of a set C of so-called conte...

Mining meaning from Wikipedia

by Olena Medelyan, David Milne, Catherine Legg, Ian H. Witten , 2009
"... Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts an ..."
Abstract - Cited by 76 (2 self) - Add to MetaCart
Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.

Unsupervised query segmentation using generative language models and Wikipedia

by Bin Tan - In WWW ’08
"... In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query’s underlying concepts that compose its original segmented form. The model’s parameters are estimated using an expectation-maximization (E ..."
Abstract - Cited by 65 (1 self) - Add to MetaCart
In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query’s underlying concepts that compose its original segmented form. The model’s parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia. Experiments show that our approach dramatically improves performance over the traditional approach that is based on mutual information, and produces comparable results with a supervised method. In particular, the basic generative language model contributes a 7.4 % improvement over the mutual information based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total of 46 % improvement (from 0.530 to
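
One way to picture the generative view of segmentation is to score each candidate split by the product of segment ("concept") probabilities (a sum of log-probabilities) and pick the best split with dynamic programming. The sketch below does that with a made-up probability table; it is only loosely inspired by the paper's model and includes neither EM estimation nor the Wikipedia evidence.

```python
# Toy sketch of concept-based query segmentation: score a segmentation by the
# sum of segment log-probabilities and find the best split with dynamic
# programming. The probability table is made up; no EM, no Wikipedia evidence.
import math

def best_segmentation(tokens, segment_logprob, max_len=4):
    n = len(tokens)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = " ".join(tokens[start:end])
            score = best[start][0] + segment_logprob(segment)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [segment])
    return best[n][1]

# Made-up "concept" log-probabilities; unseen segments get a back-off score.
known = {"new york": -2.0, "times square": -2.5, "new": -4.0,
         "york": -4.5, "times": -4.0, "square": -4.2}
print(best_segmentation("new york times square".split(),
                        lambda seg: known.get(seg, -8.0)))
```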

Citation Context

...g some seed labeled data for unsupervised learning as used in [1], only that Wikipedia is readily available. Wikipedia has been used in many applications in NLP, including named entity disambiguation [7, 9], question answering [10], text categorization [12], and coreference resolution [20]. To our knowledge, this is the first time that it is used in query segmentation. 3. A GENERATIVE MODEL FOR QUERY SEG...

Adding semantics to microblog posts

by Edgar Meij, Wouter Weerkamp, Maarten de Rijke - In WSDM ’12, ACM, 2012
"... Microblogs have become an important source of information for the purpose of marketing, intelligence, and reputation management. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial ..."
Abstract - Cited by 64 (14 self) - Add to MetaCart
Microblogs have become an important source of information for the purpose of marketing, intelligence, and reputation management. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. We propose a solution to the problem of determining what a microblog post is about through semantic linking: we add semantics to posts by automatically identifying concepts that are semantically related to it and generating links to the corresponding Wikipedia articles. The identified concepts can subsequently be used for, e.g., social media mining, thereby reducing the need for manual inspection and selection. Using a purpose-built test collection of tweets, we show that recently proposed approaches for semantic linking do not perform well, mainly due to the idiosyncratic nature of microblog posts. We propose a novel method based on machine learning with a set of innovative features and show that it is able to achieve significant improvements over all other methods, especially in terms of precision.

Citation Context

...n seen as a way of providing semantics to digital items. The idea has been used for different media types (such as text [27, 28] and multimedia [34]) and for different text genres (such as news pages [9], queries [23], archives [6], and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexical matching between (parts of) the text and the concep...

Understanding User’s Query Intent with Wikipedia

by Jian Hu, Gang Wang, Fred Lochovsky, Jian-tao Sun, Zheng Chen - In WWW 2009 (Track: Search / Session: Query Categorization), 2009
"... Understanding the intent behind a user’s query can help search engine to automatically route the query to some corresponding vertical search engines to obtain particularly relevant contents, thus, greatly improving user satisfaction. There are three major challenges to the query intent classificatio ..."
Abstract - Cited by 57 (3 self) - Add to MetaCart
Understanding the intent behind a user’s query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant content, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user’s intent mainly utilize machine learning techniques. However, it is difficult and often requires much human effort to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia