Credibility improves topical blog post retrieval
In HLT-NAACL, 2008
Cited by 8 (5 self)
Topical blog post retrieval is the task of ranking blog posts with respect to their relevance for a given topic. To improve topical blog post retrieval we incorporate textual credibility indicators in the retrieval process. We consider two groups of indicators: post level (determined using information about individual blog posts only) and blog level (determined using information from the underlying blogs). We describe how to estimate these indicators and how to integrate them into a retrieval approach based on language models. Experiments on the TREC Blog track test set show that both groups of credibility indicators significantly improve retrieval effectiveness; the best performance is achieved when combining them.
Extracting the Discussion Structure in Comments on News-Articles
Cited by 5 (0 self)
Several on-line daily newspapers offer readers the opportunity to directly comment on articles. In the Netherlands this feature is used quite often and the quality (grammatically and content-wise) is surprisingly high. We develop techniques to collect, store, enrich and analyze these comments. After giving a high-level overview of the Dutch ‘commentosphere’ we zoom in on extracting the discussion structure found in flat comment threads; people not only comment on the news article, they also heavily comment on other comments, resembling discussion fora. We show how techniques from information retrieval, natural language processing and machine learning can be used to extract the ‘reacts-on’ relation between comments with high precision and recall.
Exploiting Surface Features for the Prediction of Podcast Preference
Cited by 3 (2 self)
Podcasts display an unevenness characteristic of domains dominated by user generated content, resulting in potentially radical variation of the user preference they enjoy. We report on work that uses easily extractable surface features of podcasts in order to achieve solid performance on two podcast preference prediction tasks: classification of preferred vs. non-preferred podcasts and ranking podcasts by level of preference. We identify features with good discriminative potential by carrying out manual data analysis, resulting in a refinement of the indicators of an existent podcast preference framework. Our preference prediction is useful for topic-independent ranking of podcasts, and can be used to support download suggestion or collection browsing.
Web (2.0) Mining: Analyzing Social Media
Cited by 3 (0 self)
Social media systems such as blogs, photo and link sharing sites, wikis and on-line forums are estimated to produce up to one third of new Web content. One thing that sets these “Web 2.0” sites apart from traditional Web pages and resources is that they are intertwined with other forms of networked data. Their standard hyperlinks are enriched by social networks, comments, trackbacks, advertisements, tags, RDF data and metadata. We describe recent work on building systems that analyse these emerging social media systems to recognize spam blogs, find opinions on topics, identify communities of interest, derive trust relationships, and detect influential bloggers.
Using Contextual Information to Improve Search in Email Archives
Cited by 3 (1 self)
In this paper we address the task of finding topically relevant email messages in public discussion lists. We make two important observations. First, email messages are not isolated, but are part of a larger online environment. This context, existing on different levels, can be incorporated into the retrieval model. We explore the use of thread, mailing list, and community content levels, by expanding our original query with terms from these sources. We find that query models based on contextual information improve retrieval effectiveness. Second, email is a relatively informal genre, and therefore offers scope for incorporating techniques previously shown useful in searching user-generated content. Indeed, our experiments show that using query-independent features (email length, thread size, and text quality), implemented as priors, results in further improvements.
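The idea of query-independent features "implemented as priors" can be illustrated with a minimal sketch. This is not the authors' implementation: the function name score_with_prior, the Dirichlet smoothing, and the value of mu are assumptions made here for illustration, and the abstract does not say how a prior is estimated from email length, thread size, or text quality. The sketch only shows where such a prior enters a query-likelihood score, namely as log P(d) added to the sum of log P(t|d) over the query terms.

```python
import math
from collections import Counter


def score_with_prior(query_terms, doc_terms, collection_tf, collection_len,
                     log_prior, mu=2000.0):
    """Return log P(d) + sum over query terms of log P(t | d).

    P(t | d) is Dirichlet-smoothed against the collection model
    (an assumed choice; the abstract does not name a smoothing method).
    log_prior is the log of a query-independent document prior P(d).
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    total = log_prior
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        p_term = (tf[term] + mu * p_coll) / (doc_len + mu)
        if p_term == 0.0:
            # Term unseen in both the document and the collection.
            return float("-inf")
        total += math.log(p_term)
    return total
```

With identical text, two messages differ in score only by the difference of their log priors, which is how a query-independent feature can reorder otherwise tied results.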
Language Modeling Approaches to Blog Post and Feed Finding
Cited by 2 (1 self)
We describe our participation in the TREC 2007 Blog track. In the opinion task we looked at the differences in performance between Indri and our mixture model, and at the influence of external expansion and document priors on opinion finding; results show that an out-of-the-box Indri implementation outperforms our mixture model, and that external expansion on a news corpus is very beneficial. Opinion finding can be improved using either lexicons or the number of comments as document priors. Our approach to the feed distillation task is based on aggregating post-level scores to obtain a feed-level ranking. We integrated time-based and persistence aspects into the retrieval model. After correcting bugs in our post-score aggregation module we found that time-based retrieval improves results only marginally, while persistence-based ranking results in substantial improvements under the right circumstances.
PodCred: A Framework for Analyzing Podcast Preference
Cited by 2 (2 self)
The PodCred framework is a framework for assessing the credibility and quality of podcasts published on the internet. It consists of a series of indicators designed to support prediction of listener preference of one podcast over another, given that both carry comparable informational content. The indicators are grouped into four categories pertaining to the Podcast Content, the Podcaster, the Podcast Context or the Technical Execution of the podcast. We adopt the term “cred” as a designation encompassing both credibility (comprising trustworthiness and expertise) and qualitative acceptability to listeners. Our podcast analysis framework is inspired by work on credibility in blogs, another medium dominated by user generated content. The PodCred framework is derived from a review of the literature on credibility for other media, a survey of prescriptive standards for podcasting, and a detailed data analysis of award winning podcasts. The paper concludes with a discussion of future work in which the framework will be applied.
Named Entity Normalization in User Generated Content
Cited by 1 (0 self)
Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task is one that has recently attracted considerable attention. We consider the task in the challenging context of user generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems. A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity and ambiguous references. To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language-independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements.
Blogger, Stick to your Story: Modeling Topical Noise in Blogs with Coherence Measures
Topical noise in blogs arises when bloggers digress from the central topical thrust of their blogs. We introduce a method to explicitly incorporate a model of topical noise into a language modeling approach to the task of blog distillation. Topical noise is integrated into the model using a coherence score, which reflects the tightness of the topical structure of a blog. Tests performed on the TRECBlog06 corpus show that a naive integration of the coherence score as blog prior fails to achieve performance improvements. Instead, we develop a set of more sophisticated models in which the coherence score is weighted by a function of the blog retrieval score. The proposed models help improve effectiveness of our language modeling approach to the blog distillation task.

