Named entity recognition: Adapting to microblogging. Senior Thesis (2009)

by Brian Locke, James Martin

Results 1 - 10 of 12

Named entity recognition in tweets: An experimental study.

by Alan Ritter, Sam Clark, Mausam, Oren Etzioni - In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011
Cited by 143 (11 self)
People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp

Citation Context

...considering mentions as “documents”. This is likely due to the fact that there isn’t enough context to effectively learn topics when the “documents” are very short (typically fewer than 10 words). End to End System: Finally we present the end to end performance on segmentation and classification (T-NER) in Table 12. We observe that T-NER again outperforms co-training. Moreover, comparing against the Stanford Named Entity Recognizer on the 3 MUC types, T-NER doubles F1 score. 4 Related Work There has been relatively little previous work on building NLP tools for Twitter or similar text styles. Locke and Martin (2009) train a classifier to recognize named entities based on annotated Twitter data, handling the types PERSON, LOCATION, and ORGANIZATION. Developed in parallel to our work, Liu et al. (2011) investigate NER on the same 3 types, in addition to PRODUCTs and present a semi-supervised approach using k-nearest neighbor. Also developed in parallel, Gimpell et al. (2011) build a POS tagger for tweets using 20 coarse-grained tags. Benson et. al. (2011) present a system which extracts artists and venues associated with musical performances. Recent work (Han and Baldwin, 2011; Gouws et al., 2011) has ...

Benchmarking the extraction and disambiguation of named entities on the semantic web

by Marieke Van Erp - In Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014
Cited by 7 (2 self)
Named entity recognition and disambiguation are of primary importance for extracting information and for populating knowledge bases. Detecting and classifying named entities has traditionally been taken on by the natural language processing community, whilst linking of entities to external resources, such as those in DBpedia, has been tackled by the Semantic Web community. As these tasks are treated in different communities, there is as yet no oversight on the performance of these tasks combined. We present an approach that combines the state-of-the-art from named entity recognition in the natural language processing domain and named entity linking from the Semantic Web community. We report on experiments and results to gain more insights into the strengths and limitations of current approaches on these tasks. Our approach relies on the numerous web extractors supported by the NERD framework, which we combine with a machine learning algorithm to optimize recognition and linking of named entities. We test our approach on four standard data sets that are composed of two diverse text types, namely newswire and microposts.

Citation Context

...due to their brief and fleeting nature, microposts provide a challenging playground for text analysis tools that are oftentimes tuned to longer and more stable texts. A first attempt was detailed in (Locke, 2009), in which the authors train a classifier with an annotated Twitter corpus to detect named entities of types Person, Location and Organization. The classifier performance was in average among the fou...

Nerit: Named Entity Recognition for Informal Text

by David Etter, Francis Ferraro, Ryan Cotterell, Olivia Buzek, Benjamin Van Durme - 2012
Cited by 4 (2 self)
recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsor.

A generic openworld named entity disambiguation approach for tweets

by Mena B. Habib, Maurice Van Keulen - In Proceedings of the 5th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2013, 2013
Cited by 4 (4 self)
Social media is a rich source of information. To make use of this information, it is sometimes required to extract and disambiguate named entities. In this paper, we focus on named entity disambiguation (NED) in Twitter messages. NED in tweets is challenging in two ways. First, the limited length of a tweet makes it hard to have enough context, while many disambiguation techniques depend on it. Second, many named entities in tweets do not exist in a knowledge base (KB). We share ideas from information retrieval (IR) and NED to propose solutions for both challenges. For the first problem we make use of the gregarious nature of tweets to get enough context needed for disambiguation. For the second problem we look for an alternative home page if no Wikipedia page represents the entity. Given a mention, we obtain a list of Wikipedia candidates from the YAGO KB in addition to top-ranked pages from the Google search engine. We use a Support Vector Machine (SVM) to rank the candidate pages to find the best representative entities. Experiments conducted on two data sets show better disambiguation results compared with the baselines and a competitor.

Citation Context

...te entity page. We gave higher priority to Wikipedia pages. If Wikipedia has no page for the entity we link it to a home page or profile page. The first dataset (Brian Collection) is the one used in (Locke and Martin, 2009). The dataset is composed of four subsets of tweets; one public timeline subset and three subsets of targeted tweets revolving around economic recession, Australian Bushfires and gas explosion in...

A hybrid framework for scalable Opinion Mining in Social Media: detecting polarities and attitude targets

by Carlos Rodríguez-Penagos, Jens Grivolla, Joan Codina Fibá
Cited by 1 (0 self)
joan.codina

Citation Context

... improve coverage in their role-labeling. Recent approaches have included adaptation of NER techniques to noisy and irregular text, either by using learning algorithms or by doing text normalization (Locke & Martin, 2009; Ritter, Clark & Etzioni, 2011). 4 Exploring the semantic space of Telecom-related online postings We collected close to 200,000 postings from various SM sources in a 4 month timeframe, including fai...

Simple and Knowledge-intensive Generative Model for Named Entity Recognition

by Chun-Kai Wang, Bo-June (Paul) Hsu, Ming-Wei Chang, Emre Kıcıman
Almost all of the existing work on Named Entity Recognition (NER) consists of the following pipeline stages: part-of-speech tagging, segmentation, and named entity type classification. The requirement of hand-labeled training data on these stages makes it very expensive to extend to different domains and entity classes. Even with a large amount of hand-labeled data, existing techniques for NER on informal text, such as social media, perform poorly due to a lack of reliable capitalization, irregular sentence structure and a wide range of vocabulary. In this paper, we address the lack of hand-labeled training data by taking advantage of weak supervision signals. We present our approach in two parts. First, we propose a novel generative model that combines the ideas from Hidden Markov Model (HMM) and n-gram language models into what we call an N-gram Language Markov Model (NLMM). Second, we utilize large-scale weak supervision signals from sources such as Wikipedia titles and the corresponding click counts to estimate parameters in NLMM. Our model is simple and can be implemented without the use of Expectation Maximization or other expensive iterative training techniques. Even with this simple model, our approach to NER on informal text outperforms existing systems trained on formal English and matches state-of-the-art NER systems trained on hand-labeled Twitter messages. Because our model does not require hand-labeled data, we can adapt our system to other domains and named entity classes very easily. We demonstrate the flexibility of our approach by successfully applying it to the different domain of extracting food dishes from restaurant reviews with very little extra work.

Citation Context

...onary lookup, Lookup (Wiki + MI-100). We find that this gazetteer is actually very dirty with a precision as low as 0.10 further reinforcing that our NLMM is working as intended. In this section, we show that NLMM can easily adapt to learn a new entity class, FOOD, and adapt to another domain, restaurant reviews, with no hand-labeled training data. This level of adaptability in an NER system is a novel contribution that even state-of-the-art systems such as Stanford NER and Ritter et al.’s NER system cannot match. 7. FUTURE WORK Echoed by many others working on NER and even more so for tweets [37,38,47], due to the wide range of entity types, entity mentions, and lexical variations, having a high-quality large dataset is extremely important as observed by Ratinov and Roth [46] and others [14,30,48]. Aspects of Wikipedia we have not yet taken full advantage of include disambiguation pages, link text as alternate surface forms [15], category hierarchies, lists and tables. [Footnote 15: Menu items crawled from the web with at least 10 occurrences.] In addition to Wikipedia, Freebase [47], Wordnet [39, 48], and other domain-specific data sources such as IMDB for movies and CrunchBase for startups are al...

Exploiting Language Models to Classify Events from Twitter

by Duc-Thuan Vo, Vo Thuan Hai, Cheol-Young Ock
Classifying events is challenging in Twitter because tweet texts have a large amount of temporal data with a lot of noise and various kinds of topics. In this paper, we propose a method to classify events from Twitter. We first find the distinguishing terms between tweets in events and measure their similarities with learning language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), which have been widely studied based on large text corpora within computational linguistic relations. The relationship of term words in tweets will be discovered by checking them under each model. We then propose a method to compute the similarity between tweets based on tweets' features including common term words and relationships among their distinguishing term words. It will be explicit and convenient for applying to k-nearest neighbor techniques for classification. We carefully applied experiments on the Edinburgh Twitter Corpus to show that our method achieves competitive results for classifying events.

Citation Context

...ntroduction Twitter (https://twitter.com/) is a social networking application that allows people to microblog about a broad range of topics. Users of Twitter post short text, called “tweets” (about 140 characters), on a variety of topics as news events and pop culture, to mundane daily events and spam. Recently, Twitter has grown over 200 million active users producing over 200 million tweets per day. Twitter is a popular microblogging and social networking service that presents many opportunities for researches in natural language processing (NLP) and machine learning [1–6]. Locke and Martin [5] and Liu et al. [4] train a classifier to recognized entities based on annotated Twitter data for Named Entity Recognition (NER). Some research has explored Part of Speech (PoS) tagging [3], geographical variation in language found on Twitter [2], modeling informal conversations [1], and also applying NLP techniques to help crisis workers with the flood of information following natural disasters [6]. Benson et al. [7] applied distant supervision to train a relation extractor to recognize artists and venues mentioned within tweets of users who list their location. Classifying events in Twitter ...

Universidade Federal do Amazonas

by Diego Marinho De Oliveira, Alberto H. F. Laender, Adriano Veloso, Altigran S. Da Silva
Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messages are terse, poorly worded and posted in many different languages. Also, Twitter follows a streaming paradigm, imposing that entities must be recognized in real-time. In view of these challenges and the inappropriateness of existing tools, we propose a novel approach for Named Entity Recognition on Twitter data called FS-NER (Filter-Stream Named Entity Recognition). FS-NER is characterized by the use of filters that process unlabeled Twitter messages, being much more practical than existing supervised CRF-based approaches. Such filters can be combined either in sequence or in parallel in a flexible way. Moreover, because these filters are not language dependent, FS-NER can be applied to different languages without requiring a laborious adaptation. Through a systematic evaluation using three Twitter collections and considering seven types of entity, we show that FS-NER performs 3% better than a CRF-based baseline, besides being orders of magnitude faster and much more practical.

Citation Context

...tain a considerable amount of labeled tweets, learning transfer is a relevant issue. However, using formal sources to train an entity recognizer and then applying it to Twitter data, Locke and Martin [12] have concluded that due to the Twitter nature it is difficult to transfer learning from one domain to another. In another study [5], Finin et al. describe how to efficiently use the Amazon Mechanical...

Linguistic Engineering Group Polish Academy of Sciences

by Jakub Piskorski, Maud Ehrmann
This paper reports on some experiments aiming at tuning a rule-based NER system designed for detecting names in Polish online news to the processing of targeted Twitter streams. In particular, one explores whether the performance of the baseline NER system can be improved through the incremental application of knowledge-poor methods for name matching and guessing. We study various settings and combinations of the methods and present evaluation results on five corpora gathered from Twitter, centred around major events and known individuals.

Citation Context

...hods for Polish are reported in (Waszczuk et al., 2010) and (Marcińczuk and Janicki, 2012). While NER from formal texts has been well studied, relatively little work on NER for Twitter was reported. (Locke and Martin, 2009) presented a SVM-based classifier for classifying persons, locations and organizations in Twitter. (Ritter et al., 2011) described an approach to segmentation and classification of a wider range of n...

Slovak Republic

by Marek Ciglan, Michal Laclavík
In this paper we evaluate eight well-known Information Extraction (IE) tools on a task of Named Entity Recognition (NER) in microposts. We have chosen six NLP tools and two Wikipedia concept extractors for the evaluation. Our intent was to see how these tools would perform on relatively short texts of microposts. The evaluation dataset has been adopted from the MSM 2013 IE Challenge. This dataset contained manually annotated microposts with classification restricted to four entity types: PER, LOC, ORG and MISC.

Citation Context

...een done in [1], where the authors evaluated news-trained Stanford NER on tweets. They demonstrated that news-trained NER classifiers rely heavily on capitalization, which is unreliable in tweets. In [2], the authors compared the performance of their proprietary NER classifier on a CoNLL dataset and a handmade Twitter dataset (1684 postings). The authors observed that although the NER classifier perf...
