DBpedia -- A Crystallization Point for the Web of Data
, 2009
Cited by 70 (11 self)
The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier that can be dereferenced over the Web into a rich RDF description of the entity, including human-readable definitions in 30 languages, relationships to other resources, classifications in four concept hierarchies, various facts as well as data-level links to other Web data sources describing the entity. Over the last year, an increasing number of data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of data. Currently, the Web of interlinked data sources around DBpedia provides approximately 4.7 billion pieces of information and covers domains such as geographic information, people, companies, films, music, genes, drugs, books, and scientific publications. This article describes the extraction of the DBpedia knowledge base, the current status of interlinking DBpedia with other data sources on the Web, and gives an overview of applications that facilitate the Web of Data around DBpedia.
Design challenges and misconceptions in named entity recognition
- PROCEEDINGS OF THE THIRTEENTH CONFERENCE ON COMPUTATIONAL NATURAL LANGUAGE LEARNING (CONLL)
, 2009
Cited by 23 (3 self)
We analyze some of the fundamental design challenges and misconceptions that underlie the development of an efficient and robust NER system. In particular, we address issues such as the representation of text chunks, the inference approach needed to combine local NER decisions, the sources of prior knowledge and how to use them within an NER system. In the process of comparing several solutions to these challenges we reach some surprising conclusions, as well as develop an NER system that achieves 90.8 F1 score on the CoNLL-2003 NER shared task, the best reported result for this dataset.
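The F1 score quoted above is the harmonic mean of precision and recall over predicted entities. As a minimal sketch of how such a score can be computed at the entity level (the span tuples and the function name `precision_recall_f1` are illustrative assumptions, not the authors' evaluation code or the official CoNLL scorer):

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores; each entity is a (start, end, type) tuple.

    A prediction counts as correct only if span and type both match exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (token start, token end, entity type).
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this toy case two of three predictions match exactly, so precision, recall, and F1 are all 2/3.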
Mining Wiki Resources for Multilingual Named Entity Recognition
- In ACL’08
, 2008
Cited by 9 (0 self)
In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's Identifinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores as high as 84.7% on independent, human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.
Exploiting locality of Wikipedia links in entity ranking
- In ECIR
, 2008
Cited by 7 (4 self)
Information retrieval from web and XML document collections is ever more focused on returning entities instead of web pages or XML elements. There are many research fields involving named entities; one such field is known as entity ranking, where one goal is to rank entities in response to a query supported with a short list of entity examples. In this paper, we describe our approach to ranking entities from the Wikipedia XML document collection. Our approach utilises the known categories and the link structure of Wikipedia, and more importantly, exploits link co-occurrences to improve the effectiveness of entity ranking. Using the broad context of a full Wikipedia page as a baseline, we evaluate two different algorithms for identifying narrow contexts around the entity examples: one that uses predefined types of elements such as paragraphs, lists and tables; and another that dynamically identifies the contexts by utilising the underlying XML document structure. Our experiments demonstrate that the locality of Wikipedia links can be exploited to significantly improve the effectiveness of entity ranking.
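One way to picture link co-occurrence in narrow contexts is to score each candidate entity by how often it is linked in the same context (for example, the same paragraph or list) as one of the query's example entities. The sketch below is a simplified illustration under that assumption. The function name `rank_by_cooccurrence`, the scoring rule, and the toy data are hypothetical and are not the ranking formula from the paper.

```python
from collections import Counter


def rank_by_cooccurrence(contexts, examples):
    """Rank candidates by co-occurrence with example entities.

    contexts: iterable of link lists, one list per narrow context.
    examples: entity names supplied with the query.
    """
    examples = set(examples)
    scores = Counter()
    for links in contexts:
        links = set(links)
        shared = len(links & examples)
        if shared == 0:
            continue  # context mentions no example entity
        for candidate in links - examples:
            scores[candidate] += shared
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical contexts, each holding the link targets found in one element.
contexts = [
    ["France", "Germany", "Italy"],
    ["France", "Paris"],
    ["Germany", "Italy", "Spain"],
    ["Paris", "Louvre"],
]
ranking = rank_by_cooccurrence(contexts, ["France", "Germany"])
```

Here "Italy" ranks first because it shares two contexts with the examples, one of which contains both of them.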
Tell me more, not just "more of the same"
- In IUI ’10: Proceeding of the 14th international conference on Intelligent user interfaces, 81–90
, 2010
Cited by 3 (2 self)
The Web makes it possible for news readers to learn more about virtually any story that interests them. Media outlets and search engines typically augment their information with links to similar stories. It is up to the user to determine what new information is added by them, if any. In this paper we present Tell Me More, a system that performs this task automatically: given a seed news story, it mines the web for similar stories reported by different sources and selects snippets of text from those stories which offer new information beyond the seed story. New content may be classified as supplying: additional quotes, additional actors, additional figures and additional information depending on the criteria used to select it. In this paper we describe how the system identifies new and informative content with respect to a news story. We also show that providing an explicit categorization of new information is more useful than a binary classification (new/not-new). Lastly, we show encouraging results from a preliminary evaluation of the system that validates our approach and encourages further study.
Extracting named entities and synonyms from wikipedia
- In Proceedings of AINA’2010
, 2010
Cited by 2 (2 self)
In many search domains, both contents and searches are frequently tied to named entities such as a person, a company or similar. An example of such a domain is a news archive. One challenge from an information retrieval point of view is that a single entity can have more than one way of referring to it. In this paper we describe how to use Wikipedia contents to automatically generate a dictionary of named entities and synonyms that are all referring to the same entity. This dictionary can subsequently be used to improve search quality, for example using query expansion. Through an experimental evaluation we show that with our approach, we can find named entities and their synonyms with a high degree of accuracy.
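Query expansion with such a dictionary can be sketched as a lookup from any known name to the full set of names for the same entity. The example below assumes the dictionary has already been built. The hand-written entry and the helper names `build_lookup` and `expand_query` are illustrative, not output or code from the described system.

```python
def build_lookup(entity_synonyms):
    """Map every known name (lower-cased) to all names of its entity."""
    lookup = {}
    for canonical, synonyms in entity_synonyms.items():
        group = {canonical, *synonyms}
        for name in group:
            lookup[name.lower()] = sorted(group)
    return lookup


def expand_query(query, lookup):
    """Return all names for the queried entity, or the query itself."""
    return lookup.get(query.strip().lower(), [query])


# Hypothetical dictionary entry: canonical name -> synonyms.
entity_synonyms = {"International Business Machines": ["IBM", "Big Blue"]}
lookup = build_lookup(entity_synonyms)
expanded = expand_query("ibm", lookup)
```

A search backend could then OR the expanded names together, so a query for "ibm" also matches documents that only use the full company name.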
A Named Entity Labeler for German: exploiting Wikipedia and distributional clusters
Cited by 2 (2 self)
Named Entity Recognition is a relatively well-understood NLP task, with many publicly available training resources and software for English. Other languages tend to be underserved in this area. For German, CoNLL-2003 provides training data, but there are no publicly available, ready-to-use tools. We fill this gap and develop a German NER system with state-of-the-art performance. In addition to CoNLL 2003 labeled training data, we use two additional resources: (i) 32 million words of unlabeled text and (ii) infobox labels in German Wikipedia articles. We extract informative features of word-types from those resources and train a supervised model on the labeled training data. This approach allows us to deal better with word-types unseen in the training data and achieve state-of-the-art performance on German with little engineering effort.
ThemExplorer: Finding and Browsing geo-referenced Images
- In Proc. of Content Based Multimedia Indexing Workshop
, 2008
Cited by 2 (2 self)
Among the useful information that makes browsing or finding pictures on the Web easier, geographic data benefit from the growing number of geo-referenced image collections and recent map-based interfaces (Google Map and Earth, Yahoo! Map, etc.). Most large scale systems for visualizing geographic entities are weakly structured (except for commercial entities), with inhomogeneous coverage; they also make little or no use of image processing techniques in search and retrieval. In this paper, we tackle these problems by introducing an enriched and adapted version of a geographical database and a content-based facility in a new map-based visualization tool, called ThemExplorer. We present the system and evaluate different dimensions, proving its usefulness for browsing geo-referenced images. In section 1 we set up the global argument; in section 2, we discuss related work; section 3 includes an architectural overview of ThemExplorer; in section 4, we present the contribution of our geographical database using heterogeneous sources on the Web; in section 5, we detail the CBIR techniques associated with ThemExplorer. Before concluding and presenting future work, we describe a series of evaluations in section 6.
Named Entity Recognition: Adapting to Microblogging
Cited by 1 (0 self)
In this project, we seek to create a Named Entity Recognizer (NER) tuned for use on Twitter posts. We will be identifying Named Entities and classifying them as People, Locations, or Organizations. We hope to identify language features and methods that effectively transfer the techniques and knowledge from Named Entity Recognition research on formal sources, such as news articles, to less structured microblogging texts. In the process, we will identify differences between microblogging text and formal prose which are relevant to NER.
1.1 Summary
There has been much research in Named Entity Recognition on news articles. However, many applications of NER, and Natural Language Processing in general, involve analyzing data that is less structured, such as blog posts, instant messages, and movie reviews. In our project, we will attempt to create a classifier that performs NER on microblog postings. In doing so, we hope to explore approaches to transferring learning from a domain with more data available to one with less.
Wikipedia as the Premiere Source for Targeted Hypernym Discovery
Targeted Hypernym Discovery (THD) applies lexico-syntactic (Hearst) patterns on a suitable corpus with the intent to extract one hypernym at a time. Using Wikipedia as the corpus in THD has recently yielded promising results in a number of tasks. We investigate the reasons that make Wikipedia articles such an easy target for lexico-syntactic patterns, and suggest that it is primarily the adherence of its contributors to Wikipedia’s Manual of Style. We propose the hypothesis that extractable patterns are more likely to appear in articles covering popular topics, since these receive more attention including the adherence to the rules from the manual. However, two preliminary experiments carried out with 131 and 100 Wikipedia articles do not support this hypothesis.
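To illustrate what a lexico-syntactic pattern looks like in practice, the sketch below applies a single "X is a Y" pattern to the opening sentence of an article and guesses the head noun of Y as the hypernym. It is a deliberately naive regular-expression approximation. The pattern, the stop-word list, and the function name `discover_hypernym` are assumptions made for illustration and do not reproduce the THD implementation.

```python
import re

# One illustrative pattern: "<subject> is/was a|an|the <noun phrase>".
IS_A = re.compile(
    r"^(?P<subject>.+?)\s+(?:is|was)\s+(?:a|an|the)\s+(?P<phrase>[^.,;]+)",
    re.IGNORECASE,
)

# Words that usually end the core noun phrase (a rough heuristic).
STOP = re.compile(
    r"\b(?:of|in|on|at|that|which|who|from|for|with)\b", re.IGNORECASE
)


def discover_hypernym(first_sentence):
    """Return a single hypernym candidate, or None if the pattern fails."""
    match = IS_A.search(first_sentence)
    if match is None:
        return None
    # Keep the text before the first stop word, then take its last word
    # as a crude stand-in for the head noun.
    phrase = STOP.split(match.group("phrase"), maxsplit=1)[0].strip()
    words = phrase.split()
    return words[-1].lower() if words else None


hypernym = discover_hypernym(
    "The Vltava is the longest river in the Czech Republic."
)
```

Opening sentences that follow a regular "X is a Y" form are the case this kind of pattern handles well; sentences without a copula simply yield no candidate.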

