Results 1 - 6 of 6
Sentiment analysis and subjectivity
- Handbook of Natural Language Processing, Second Edition. Taylor and Francis Group, Boca, 2010
Abstract - Cited by 17 (6 self)
Textual information in the world can be broadly categorized into two main types: facts and opinions. Facts are objective expressions about entities, events and their properties. Opinions are usually subjective expressions that describe people’s sentiments, appraisals or feelings toward entities, events and their properties. The concept of opinion is very broad. In this chapter, we only focus on opinion expressions that convey people’s positive or negative sentiments. Much of the existing research on textual information processing has been focused on mining and retrieval of factual information, e.g., information retrieval, Web search, text classification, text clustering and many other text mining and natural language processing tasks. Little work had been done on the processing of opinions until recently. Yet opinions are so important that whenever we need to make a decision we want to hear others’ opinions. This is true not only for individuals but also for organizations. One of the main reasons for the lack of study on opinions is the fact that there was little opinionated text available before the World Wide Web. Before the Web, when an individual needed to make a decision, he/she typically asked for opinions from friends and families. When an organization wanted to find the opinions or sentiments of the general public about its products and services, it conducted opinion polls, surveys, and focus groups. However, with the Web, especially with the explosive growth of the user-generated
Topic-bridged PLSA for Cross-Domain Text Classification
Abstract - Cited by 14 (2 self)
In many Web applications, such as blog classification and newsgroup classification, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time-consuming, while there may be plenty of labeled data in a related but different domain. Traditional text classification approaches do not cope well with learning across different domains. In this paper, we propose a novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model. We call this new model Topic-bridged PLSA, or TPLSA. By exploiting the common topics between two domains, we transfer knowledge across different domains through a topic-bridge to help the text classification in the target domain. A unique advantage of our method is its ability to maximally mine knowledge that can be transferred between domains, resulting in superior performance when compared to other state-of-the-art text classification approaches. Experimental evaluation on different kinds of datasets shows that our proposed algorithm can improve the performance of cross-domain text classification significantly.
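For readers unfamiliar with the base model named in this abstract, the following is a minimal sketch of plain PLSA fitted with EM. It is not TPLSA: the abstract does not spell out how the topic bridge ties the two domains together, so no attempt is made to reproduce it. The function names, the toy count matrix, and the iteration count are all assumptions made for illustration.

```python
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they sum to 0)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit plain PLSA by EM.

    counts[d][w] is the number of times word w occurs in document d.
    Returns (p_w_given_z, p_z_given_d) as nested lists.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialization of P(w|z) and P(z|d).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    weight = counts[d][w] * post[z]
                    new_w_z[z][w] += weight
                    new_z_d[d][z] += weight
        # M-step: renormalize expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


# Toy document-word counts (hypothetical): two documents per word group.
COUNTS = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
P_W_Z, P_Z_D = plsa(COUNTS, n_topics=2)
```

Each row of `P_W_Z` is a word distribution for one latent topic, and each row of `P_Z_D` is a topic mixture for one document. A cross-domain variant would need some way to share topics between labeled and unlabeled collections, which this sketch leaves out.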
Can Chinese web pages be classified with English data source?
- In Proceeding of the 17th international conference on World Wide Web, 2008
Abstract - Cited by 8 (2 self)
As the World Wide Web in China grows rapidly, mining knowledge in Chinese Web pages becomes more and more important. Mining Web information usually relies on machine learning techniques, which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages is increasing quickly, labeled Chinese data are still scarce. However, there are relatively sufficient English labeled Web pages. These labeled data, though in different linguistic representations, share a substantial amount of semantic information with Chinese ones, and can be utilized to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to address this cross-language classification problem. Our algorithm first translates all the Chinese Web pages to English. Then, all the Web pages, including Chinese and English ones, are encoded through an information bottleneck which allows only limited information to pass. Therefore, in order to retain as much useful information as possible, the common part between Chinese and English Web pages tends to be encoded to the same code (i.e. class label), which makes the cross-language classification accurate. We evaluated our approach using the Web pages collected from Open Directory Project (ODP). The experimental results show that our method significantly improves on several existing supervised and semi-supervised classifiers.
For a few dollars less: Identifying review pages sans human labels
- In Proc. NAACL, 2009
Abstract - Cited by 2 (1 self)
We address the problem of large-scale automatic detection of online reviews without using any human labels. We propose an efficient method that combines two basic ideas: building a classifier from a large number of noisy examples and using the structure of the website to enhance the performance of this classifier. Experiments suggest that our method is competitive against supervised learning methods that mandate expensive human effort.
Collective Indexing of Emotions in Images. A Study in Emotional Information Retrieval
Abstract
Some documents provoke emotions in people viewing them. Will it be possible to describe emotions consistently and use this information in retrieval systems? We tested collective (statistically aggregated) emotion indexing using images as examples. Considering psychological results, basic emotions are anger, disgust, fear, happiness, and sadness. This study follows an approach developed by Lee and Neal (2007) for music emotion retrieval and applies scroll bars for tagging basic emotions and their intensities. A sample comprising 763 persons tagged emotions caused by images (retrieved from www.Flickr.com) applying scroll bars and (linguistic) tags. Using SPSS, we performed descriptive statistics and correlation analysis. For more than half of the images, the test persons have clear emotion favorites. There are prototypical images for given emotions. The document-specific consistency of tagging using a scroll bar is, for some images, very high. Most of the (most commonly used) linguistic tags are on the basic level (in the sense of Rosch’s basic level theory). The distributions of the linguistic tags in our examples follow an inverse power-law. Hence, it seems possible to apply collective image emotion tagging to image information systems and to present a new search option for basic emotions. This article is one of the first steps in the research area of emotional information retrieval (EmIR).
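The inverse power-law claim about tag distributions can be checked informally with a log-log regression of frequency against rank. The sketch below is not the authors' procedure (the abstract mentions only descriptive statistics and correlation analysis in SPSS); the function name and the synthetic frequencies are assumptions made for illustration, and no data from the study is used.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies are ranked from most to least common; zero counts are
    dropped. A slope near -a suggests frequency ~ rank ** (-a).
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic tag frequencies that follow an exact power law with exponent 1.5.
SYNTHETIC = [1000.0 / rank ** 1.5 for rank in range(1, 21)]
SLOPE = loglog_slope(SYNTHETIC)
```

On real tag counts the fit would be noisy, and a straight line on a log-log plot is only weak evidence of a power law, so the slope is best read as a rough descriptive summary.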
A Class-Feature-Centroid Classifier for Text Categorization
- WWW 2009, Madrid. Track: Data Mining / Session: Learning
Abstract
Automated text categorization is an important technique for many web applications, such as document indexing, document filtering, and cataloging web resources. Many different approaches have been proposed for the automated text categorization problem. Among them, centroid-based approaches have the advantages of short training and testing times due to their computational efficiency. As a result, centroid-based classifiers have been widely used in many web applications. However, the accuracy of centroid-based classifiers is inferior to SVM, mainly because centroids found during construction are far from perfect locations. We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class distributions: inter-class term index and inner-class term index. CFC proposes a novel combination of these indices and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments on the Reuters-21578 corpus and 20-newsgroup email collection show that CFC consistently outperforms the state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. Particularly, CFC is more effective and robust than SVM when data is sparse.
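As background for the centroid-based family this abstract builds on, here is a minimal sketch of a conventional centroid classifier: one mean term-frequency vector per class, with standard (normalized) cosine similarity for prediction. It is not CFC itself, since the abstract does not give the formulas for the inter-class and inner-class term indices or the denormalized cosine measure. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector from whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Standard cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_docs):
    """Average the term-frequency vectors of each class into a centroid."""
    sums = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(tf_vector(text))
        doc_counts[label] += 1
    return {
        label: {term: w / doc_counts[label] for term, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training set (hypothetical), two documents per class.
TOY_DOCS = [
    ("stocks fell as markets closed lower", "finance"),
    ("bank raises interest rates again", "finance"),
    ("team wins the final match", "sports"),
    ("striker scores twice in the match", "sports"),
]
CENTROIDS = train_centroids(TOY_DOCS)
```

Training is a single pass over the labeled documents and prediction is one similarity computation per class, which is the efficiency advantage the abstract refers to. Calling `classify("interest rates and markets", CENTROIDS)` picks the class whose centroid shares the most weighted terms with the query.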

