Rijke. Adding semantics to microblog posts (2012)
Venue: In WSDM ’12. ACM
Citations: 64 - 14 self
Citations
3471 | The elements of statistical learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context ... $\sum_{i=1}^{n} (T(q_i) - y_i)^2$. So, GBRT depends on three parameters: the learning rate α, the depth of the tree, d, and the number of iterations, k. We set α = 0.02 and k = 1000, and, following Hastie et al. =-=[13]-=-, we set d = 4. Finally, Mohan et al. [29] show that, since RF is resistant to overfitting and also often outperforms GBRT, the RF predictions can be used as starting point for GBRT. By doing so, GBRT... |
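The context above names three GBRT parameters: learning rate α, tree depth d, and number of iterations k. The following is a minimal pure-Python sketch of gradient boosting with squared loss, meant only to show how these three parameters interact. It is not the authors' implementation: the function names (`fit_tree`, `fit_gbrt`, `predict`) and the toy data are hypothetical, it boosts on the full training set rather than on stochastic samples, and it starts from zero rather than from RF predictions.

```python
def sse(r, idx):
    """Sum of squared errors of residuals r[idx] around their mean."""
    m = sum(r[i] for i in idx) / len(idx)
    return sum((r[i] - m) ** 2 for i in idx)

def fit_tree(X, r, depth):
    """Fit a regression tree with at most `depth` levels to residuals r."""
    mean = sum(r) / len(r)
    if depth == 0 or len(set(r)) == 1:
        return mean  # leaf: predict the mean residual
    best = None
    for f in range(len(X[0])):
        # Candidate thresholds: every distinct value except the smallest,
        # so both sides of the split are always non-empty.
        for t in sorted(set(x[f] for x in X))[1:]:
            left = [i for i, x in enumerate(X) if x[f] < t]
            right = [i for i, x in enumerate(X) if x[f] >= t]
            err = sse(r, left) + sse(r, right)
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    if best is None:
        return mean  # no feature can separate the samples
    _, f, t, left, right = best
    return (f, t,
            fit_tree([X[i] for i in left], [r[i] for i in left], depth - 1),
            fit_tree([X[i] for i in right], [r[i] for i in right], depth - 1))

def predict_tree(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree

def fit_gbrt(X, y, alpha=0.02, d=4, k=1000):
    """Each of the k iterations fits a depth-d tree to the current residuals
    (the negative gradient of the squared loss) and adds alpha times its output."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(k):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_tree(X, residuals, d)
        trees.append(tree)
        pred = [p + alpha * predict_tree(tree, x) for p, x in zip(pred, X)]
    return trees

def predict(trees, x, alpha=0.02):
    """Ensemble prediction; alpha must match the value used for fitting."""
    return alpha * sum(predict_tree(t, x) for t in trees)

# Toy data (hypothetical): one feature, binary relevance labels.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
trees = fit_gbrt(X, y)  # alpha = 0.02, d = 4, k = 1000 as in the context
```

With a small α, each tree corrects only a fraction of the remaining error, which is why a large k is paired with it.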
1857 | Introduction to Information Retrieval
- Manning, Raghavan, et al.
- 2008
Citation Context ...the relative number of concepts in which q occurs, which is defined as IDF (q) = log (|C|/df (q)), where |C| indicates the total number of concepts and df (q) the number of concepts in which q occurs =-=[21]-=-. The subscript f denotes the field of the Wikipedia articles used, see above. WIG(q) indicates the weighted information gain [39], which can be considered a predictor of the retrieval performance of ... |
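The IDF definition quoted above, IDF(q) = log(|C| / df(q)), can be sketched in a few lines. This is an illustrative sketch only: the function name `idf`, the substring-based matching, and the toy concept texts are assumptions, not details taken from the paper.

```python
import math

def idf(ngram, concepts):
    """IDF(q) = log(|C| / df(q)), where df(q) counts the concepts containing q."""
    df = sum(1 for text in concepts if ngram in text)
    if df == 0:
        return None  # undefined when the n-gram occurs in no concept
    return math.log(len(concepts) / df)

# Toy concept texts (hypothetical stand-ins for one Wikipedia field).
concepts = ["barack obama president", "obama family", "white house"]
obama_idf = idf("obama", concepts)   # occurs in 2 of 3 concepts
house_idf = idf("house", concepts)   # occurs in 1 of 3 concepts
```

Rarer n-grams receive a higher value, so `house_idf` exceeds `obama_idf` here.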
999 | Greedy function approximation: A gradient boosting machine
- Friedman
Citation Context ...g this setting on the linking effectiveness. In recent years, gradient boosted regression trees (GBRTs) have been established as the de facto state-of-the-art learning paradigm for web search ranking =-=[7, 12, 29]-=-. It is a point-wise learning to rank algorithm that predicts the relevance score of a result to a query by minimizing a loss function (e.g., the squared loss) using stochastic gradient descent. It is... |
989 | What is Twitter, a social network or a news media?
- Kwak, Lee, et al.
- 2010
Citation Context ...2012 ACM 978-1-4503-0747-5/12/02 ...$10.00. valuable sources for many kinds of analyses, including online reputation management, news and trend detection, and targeted marketing and customer services =-=[4, 18, 32, 35]-=-. Searching and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual... |
437 | Combination of multiple searches.
- Shaw, Fox
- 1995
Citation Context ... (4) As to the concept fields we use either the full Wikipedia article, its title, or its incoming anchor texts. To combine the rankings produced by each constituent n-gram of a tweet, we use combMNZ =-=[33]-=-. CombMNZ is a result list merging method and a variant of CombSUM—which sums a document’s scores from all lists where it was retrieved. CombMNZ multiplies the CombSUM score by the number of lists tha... |
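The merging step described above can be sketched as follows, using the standard definition of CombMNZ: the CombSUM score of a document multiplied by the number of lists in which it was retrieved. The function name `comb_mnz`, the dictionary-based input format, and the example scores are assumptions made for illustration.

```python
from collections import defaultdict

def comb_mnz(ranked_lists):
    """Merge several {concept: score} result lists with CombMNZ."""
    sums = defaultdict(float)  # CombSUM: sum of scores over all lists
    hits = defaultdict(int)    # number of lists that retrieved the concept
    for scores in ranked_lists:
        for concept, score in scores.items():
            sums[concept] += score
            hits[concept] += 1
    fused = {c: sums[c] * hits[c] for c in sums}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# One result list per constituent n-gram of a tweet (hypothetical scores).
lists = [
    {"Barack_Obama": 0.5, "Obama_(city)": 0.25},
    {"Barack_Obama": 0.25},
]
merged = comb_mnz(lists)
```

Concepts retrieved for several n-grams of the same tweet are rewarded twice: once through the summed score and once through the multiplier.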
322 | Learning to link with wikipedia
- Milne, Witten
- 2008
Citation Context ...icles. With over 3.5 million articles, Wikipedia has become a rich source of knowledge and a common target for linking; automatic linking approaches using Wikipedia have met with considerable success =-=[14, 25, 27, 28]-=-. Most, if not all, of the linking methods assume that the input text is relatively clean and grammatically correct and that it provides sufficient context for the purposes of identifying concepts. Mi... |
297 | From Tweets to polls: Linking text sentiment to public opinion time series.
- O’Connor, Balasubramanyan, et al.
- 2010
Citation Context ...guage usage, and the length of individual posts [17, 22]. In many microblog search scenarios the goal is to find out what people are saying about concepts such as products, brands, persons, et cetera =-=[31]-=-. Here, it is important to be able to accurately retrieve tweets that are on topic, including all possible naming and other lexical variants. So, it is common to manually construct lengthy keyword que... |
265 | Wikify!: linking documents to encyclopedic knowledge.
- Mihalcea, Csomai
- 2007
Citation Context ...icles. With over 3.5 million articles, Wikipedia has become a rich source of knowledge and a common target for linking; automatic linking approaches using Wikipedia have met with considerable success =-=[14, 25, 27, 28]-=-. Most, if not all, of the linking methods assume that the input text is relatively clean and grammatically correct and that it provides sufficient context for the purposes of identifying concepts. Mi... |
238 | Large-scale named entity disambiguation based on Wikipedia data.
- Cucerzan
- 2007
Citation Context ...n seen as a way of providing semantics to digital items. The idea has been used for different media types (such as text [27, 28] and multimedia [34]) and for different text genres (such as news pages =-=[9]-=-, queries [23], archives [6], and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexical matching between (parts of the text) and the concep... |
235 | A survey of named entity recognition and classification.
- Nadeau, Sekine
- 2007
Citation Context ...ing amount of attention in recent years. Starting from the domain of named entity recognition (NER), current approaches establish links not just to entity types, but to the actual entities themselves =-=[15, 20, 30]-=-. Instead of merely identifying types, we also aim to disambiguate the found concepts and link them to Wikipedia articles. With over 3.5 million articles, Wikipedia has become a rich source of knowled... |
172 | SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation.
- Dill, Eiron, et al.
- 2003
Citation Context ... [23], archives [6], and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexical matching between (parts of the text) and the concept titles =-=[10, 26]-=-, an approach related to keyword-based interfaces to databases [38]. However, merely matching an input text with concept titles suffers from many drawbacks, including ambiguity (where different concep... |
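The simple lexical-matching approach mentioned above, and the ambiguity it suffers from, can be illustrated with a short sketch. The function names, the (label, concept) input format, and the example labels are hypothetical; this is a generic illustration of n-gram matching, not the method of any cited system.

```python
def ngrams(tokens, max_n=3):
    """All word n-grams of the token list, for n = 1..max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def match_titles(text, labeled_concepts, max_n=3):
    """Map each n-gram of `text` to every concept whose label equals it."""
    index = {}
    for label, concept in labeled_concepts:
        index.setdefault(label.lower(), []).append(concept)
    tokens = text.lower().split()
    return {g: index[g] for g in ngrams(tokens, max_n) if g in index}

# Hypothetical (label, concept) pairs; two concepts share the label "obama".
labeled_concepts = [
    ("obama", "Barack_Obama"),
    ("obama", "Obama,_Fukui"),
    ("white house", "White_House"),
]
result = match_titles("Obama visits the White House", labeled_concepts)
```

The n-gram "obama" maps to two candidate concepts, which is exactly the ambiguity that pure lexical matching cannot resolve on its own.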
87 | Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter.
- boyd, Golder, et al.
- 2010
Citation Context ... valuable sources for many kinds of analyses, including online reputation management, news and trend detection, and targeted marketing and customer services =-=[4, 18, 32, 35]-=-. Searching and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual... |
82 | Tagme: On-the-fly annotation of short text fragments (by Wikipedia entities).
- Ferragina, Scaiella
- 2010
Citation Context ...h as those from Wikipedia entries and news articles. For comparative purposes, we include their method as one of the baselines and show that it does not perform well on tweets. Ferragina and Scaiella =-=[11]-=- propose an approach similar to [28], but with an explicit focus on short texts. They incorporate a voting scheme as well as pruning of n-grams unrelated to the input text. Since this method is geared... |
77 | Adding semantics to detectors for video retrieval.
- Snoek, Huurnink, et al.
- 2007
Citation Context ...2.1 Linking Text Links to a knowledge structure are often seen as a way of providing semantics to digital items. The idea has been used for different media types (such as text [27, 28] and multimedia =-=[34]-=-) and for different text genres (such as news pages [9], queries [23], archives [6], and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexi... |
73 | Recognizing named entities in tweets.
- Liu, Zhang, et al.
- 2011
Citation Context ...ing amount of attention in recent years. Starting from the domain of named entity recognition (NER), current approaches establish links not just to entity types, but to the actual entities themselves =-=[15, 20, 30]-=-. Instead of merely identifying types, we also aim to disambiguate the found concepts and link them to Wikipedia articles. With over 3.5 million articles, Wikipedia has become a rich source of knowled... |
72 | Query performance prediction in web search environments.
- Zhou, Croft
- 2007
Citation Context ...umber of concepts and df (q) the number of concepts in which q occurs [21]. The subscript f denotes the field of the Wikipedia articles used, see above. WIG(q) indicates the weighted information gain =-=[39]-=-, which can be considered a predictor of the retrieval performance of a query. It uses the set of all candidate concepts retrieved for this n-gram, Cq , and determines the relative probability of q oc... |
67 | Inverse document frequency (IDF): a measure of deviations from Poisson.
- Church, Gale
- 1995
Citation Context ... occurrence of q in separate concept fields, the position of the first occurrence of the n-gram, the distance between the first and last occurrence, and various IR-based measures [21]. Of these, RIDF =-=[8]-=- is the difference between expected and observed IDF for a concept, which is defined as $RIDF(c, q) = \log\left(\frac{|C|}{df(q)}\right) + \log\left(1 - \exp\left(-\frac{n(q, C)}{|C|}\right)\right)$. We also consider whether the title of the Wik... |
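The RIDF definition quoted above adds the observed IDF, log(|C| / df(q)), to the log of the Poisson probability that a concept contains q at least once, 1 − exp(−n(q, C) / |C|). A small sketch follows. The function name and argument names are hypothetical, and n(q, C) is assumed here to be the total number of occurrences of q in the collection C.

```python
import math

def ridf(num_concepts, df, n_occ):
    """Residual IDF: log(|C| / df(q)) + log(1 - exp(-n(q, C) / |C|))."""
    observed_idf = math.log(num_concepts / df)
    # Under a Poisson model, a concept contains q at least once with
    # probability 1 - exp(-n(q, C) / |C|); its log is minus the expected IDF.
    poisson_term = math.log(1.0 - math.exp(-n_occ / num_concepts))
    return observed_idf + poisson_term

# Hypothetical counts: 1000 concepts, q appears in 10 of them.
spread = ridf(1000, 10, 10)   # one occurrence per containing concept
bursty = ridf(1000, 10, 50)   # occurrences concentrated in few concepts
```

An n-gram whose occurrences are concentrated in a few concepts (`bursty`) scores higher than one spread out as a Poisson model predicts (`spread`, close to zero).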
40 | Conversational tagging in twitter.
- Huang, Thornton, et al.
- 2010
Citation Context ...ion we lift in our work, enabling us to add semantics to tweets without hashtags. Previous work has shown that hashtag usage is quite low and differs a lot per country and language [36]. Huang et al. =-=[16]-=- analyze the semantics of hashtags in more detail and reveal that hashtagging in Twitter is more commonly used to join public discussions than to organize content for future retrieval. In order to ver... |
38 | Incorporating query expansion and quality indicators in searching microblog posts.
- Massoudi, Tsagkias, et al.
- 2011
Citation Context ...hing and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual posts =-=[17, 22]-=-. In many microblog search scenarios the goal is to find out what people are saying about concepts such as products, brands, persons, et cetera [31]. Here, it is important to be able to accurately ret... |
33 | Event discovery in social media feeds.
- Benson, Haghighi, et al.
- 2011
Citation Context ... meaning” to text in general [24] or to text contained in tweets. Liu et al. [20] focus on NER on tweets and use a semi-supervised learning framework to identify four types of entities. Benson et al. =-=[3]-=- try to match tweets to “records.” These records are, for example, artist-venue pairs and can be obtained from sources like music guides. They train a model that extracts artists and venues from tweet... |
30 | Web-Search Ranking with Initialized Gradient Boosted Regression Trees
- Mohan, Chen, et al.
- 2011
Citation Context ...g this setting on the linking effectiveness. In recent years, gradient boosted regression trees (GBRTs) have been established as the de facto state-of-the-art learning paradigm for web search ranking =-=[7, 12, 29]-=-. It is a point-wise learning to rank algorithm that predicts the relevance score of a result to a query by minimizing a loss function (e.g., the squared loss) using stochastic gradient descent. It is... |
23 | Learning to rank using an ensemble of lambda-gradient models. In Yahoo! Learning to Rank Challenge.
- Burges, Svore, et al.
- 2011
Citation Context ...g this setting on the linking effectiveness. In recent years, gradient boosted regression trees (GBRTs) have been established as the de facto state-of-the-art learning paradigm for web search ranking =-=[7, 12, 29]-=-. It is a point-wise learning to rank algorithm that predicts the relevance score of a result to a query by minimizing a loss function (e.g., the squared loss) using stochastic gradient descent. It is... |
22 | Topical semantics of twitter links.
- Welch, Schonfeld, et al.
- 2011
Citation Context ...xchanges. As Twitter has grown, novel language use and standards such as mentions (to reference another user), hashtags (to refer to a topic), and retweets (similar to an e-mail forward) have emerged =-=[37]-=-. Various authors have attempted to “give meaning” to text in general [24] or to text contained in tweets. Liu et al. [20] focus on NER on tweets and use a semi-supervised learning framework to identi... |
22 | Keyword search in relational databases: A survey.
- Yu, Qin, et al.
- 2010
Citation Context ...ly taken approach for linking text to concepts is to perform lexical matching between (parts of the text) and the concept titles [10, 26], an approach related to keyword-based interfaces to databases =-=[38]-=-. However, merely matching an input text with concept titles suffers from many drawbacks, including ambiguity (where different concepts with the same label can be confused) and a possible lack of spec... |
20 | Learning semantic query suggestions.
- Meij, Bron, et al.
- 2009
Citation Context ...ay of providing semantics to digital items. The idea has been used for different media types (such as text [27, 28] and multimedia [34]) and for different text genres (such as news pages [9], queries =-=[23]-=-, archives [6], and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexical matching between (parts of the text) and the concept titles [10, ... |
19 | Mapping queries to the Linking Open Data cloud: A case study using DBpedia.
- Meij, Bron, et al.
- 2011
Citation Context ...icles. With over 3.5 million articles, Wikipedia has become a rich source of knowledge and a common target for linking; automatic linking approaches using Wikipedia have met with considerable success =-=[14, 25, 27, 28]-=-. Most, if not all, of the linking methods assume that the input text is relatively clean and grammatically correct and that it provides sufficient context for the purposes of identifying concepts. Mi... |
17 | Making Sense of Twitter
- Laniado, Mika
- 2010
Citation Context ...clude the former as a feature in our framework and evaluate the latter as a baseline. Kwak et al. [18] show that hashtags are good indicators to detect events and trending topics and Laniado and Mika =-=[19]-=- explore the use of hashtags in Twitter and the relation to (Freebase) concepts. Using manual annotations, they find that about half of the hashtags can be mapped to Freebase concepts, most of them be... |
17 | Linked Open Social Signals.
- Mendes, Passant, et al.
- 2010
Citation Context ... [23], archives [6], and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexical matching between (parts of the text) and the concept titles =-=[10, 26]-=-, an approach related to keyword-based interfaces to databases [38]. However, merely matching an input text with concept titles suffers from many drawbacks, including ambiguity (where different concep... |
17 | Linking Online News and Social Media
- Tsagkias, Rijke, et al.
- 2011
Citation Context ... valuable sources for many kinds of analyses, including online reputation management, news and trend detection, and targeted marketing and customer services =-=[4, 18, 32, 35]-=-. Searching and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual... |
16 | Generating links to background knowledge: a case study using narrative radiology reports
- He, Rijke, et al.
- 2011
Citation Context ...icles. With over 3.5 million articles, Wikipedia has become a rich source of knowledge and a common target for linking; automatic linking approaches using Wikipedia have met with considerable success =-=[14, 25, 27, 28]-=-. Most, if not all, of the linking methods assume that the input text is relatively clean and grammatically correct and that it provides sufficient context for the purposes of identifying concepts. Mi... |
15 | WePS-3 evaluation campaign: Overview of the online reputation management task.
- Amigo, Artiles, et al.
- 2010
Citation Context ...rieve tweets that are on topic, including all possible naming and other lexical variants. So, it is common to manually construct lengthy keyword queries that (hopefully) capture all possible variants =-=[2]-=-. We propose an alternative approach, namely to determine what a microblog post is about by automatically identifying concepts in them. We take a concept to be any item that has a unique and unambiguo... |
13 | Linking archives using document enrichment and term selection
- Bron, Huurnink, et al.
- 2011
Citation Context ... semantics to digital items. The idea has been used for different media types (such as text [27, 28] and multimedia [34]) and for different text genres (such as news pages [9], queries [23], archives =-=[6]-=-, and radiology reports [14]). A simple and frequently taken approach for linking text to concepts is to perform lexical matching between (parts of the text) and the concept titles [10, 26], an approa... |
9 | Conceptual language models for domain-specific retrieval.
- Meij, Trieschnigg, et al.
- 2010
Citation Context ...entions (to reference another user), hashtags (to refer to a topic), and retweets (similar to an e-mail forward) have emerged [37]. Various authors have attempted to “give meaning” to text in general =-=[24]-=- or to text contained in tweets. Liu et al. [20] focus on NER on tweets and use a semi-supervised learning framework to identify four types of entities. Benson et al. [3] try to match tweets to “recor... |
6 | Twitter study-
- Analytics
- 2009
Citation Context ... valuable sources for many kinds of analyses, including online reputation management, news and trend detection, and targeted marketing and customer services =-=[4, 18, 32, 35]-=-. Searching and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual... |
5 | Statistics of online user-generated short documents
- Inches, Carman, et al.
- 2010
Citation Context ...hing and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual posts =-=[17, 22]-=-. In many microblog search scenarios the goal is to find out what people are saying about concepts such as products, brands, persons, et cetera [31]. Here, it is important to be able to accurately ret... |
5 | How people use Twitter in different languages
- Weerkamp, Carter, et al.
- 2011
Citation Context ... tweets, an assumption we lift in our work, enabling us to add semantics to tweets without hashtags. Previous work has shown that hashtag usage is quite low and differs a lot per country and language =-=[36]-=-. Huang et al. [16] analyze the semantics of hashtags in more detail and reveal that hashtagging in Twitter is more commonly used to join public discussions than to organize content for future retriev... |