Results 11 - 20 of 456
Improving Text Classification by Shrinkage in a Hierarchy of Classes, 1998
"... When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. ..."
Abstract
-
Cited by 289 (6 self)
- Add to MetaCart
When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples.
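The snippet stops before the method itself. As a rough illustration of the shrinkage idea the title names, a leaf class's word-probability estimate can be smoothed toward the estimates at its ancestors in the hierarchy. A minimal sketch follows; the function name and the assumption of fixed mixture weights are mine (fitting the weights is part of the method, not shown here):

```python
import numpy as np

def shrunken_word_probs(counts_along_path, lambdas):
    """Shrinkage sketch: smooth a leaf class's word-probability
    estimate by interpolating with its ancestors' estimates.

    counts_along_path: word-count vectors for the class and each of
    its ancestors, leaf first, root last. lambdas: nonnegative
    mixture weights summing to 1, taken as given here. All names
    are illustrative, not the paper's.
    """
    mles = [c / c.sum() for c in counts_along_path]
    return sum(lam * p for lam, p in zip(lambdas, mles))
```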
Trust management for the semantic web - In ISWC, 2003
"... Abstract. Though research on the Semantic Web has progressed at a steady pace, its promise has yet to be realized. One major difficulty is that, by its very nature, the Semantic Web is a large, uncensored system to which anyone may contribute. This raises the question of how much credence to give ea ..."
Abstract
-
Cited by 271 (3 self)
- Add to MetaCart
(Show Context)
Though research on the Semantic Web has progressed at a steady pace, its promise has yet to be realized. One major difficulty is that, by its very nature, the Semantic Web is a large, uncensored system to which anyone may contribute. This raises the question of how much credence to give each source. We cannot expect each user to know the trustworthiness of each source, nor would we want to assign top-down or global credibility values due to the subjective nature of trust. We tackle this problem by employing a web of trust, in which each user provides personal trust values for a small number of other users. We compose these trusts to compute the trust a user should place in any other user in the network. A user is not assigned a single trust rank. Instead, different users may have different trust values for the same user. We define properties for combination functions which merge such trusts, and define a class of functions for which merging may be done locally while maintaining these properties. We give examples of specific functions and apply them to data from Epinions and our BibServ bibliography server. Experiments confirm that the methods are robust to noise, and do not put unreasonable expectations on users. We hope that these methods will help move the Semantic Web closer to fulfilling its promise.
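As a hedged illustration of composing pairwise trust values into personalized ratings, here is one combination function consistent with the abstract's description: multiply trust along each acquaintance chain and keep the best chain. The function name, the depth cutoff, and the multiply/max choice are assumptions of mine, not necessarily members of the paper's class of functions:

```python
def personal_trust(trust, source, target, depth=3):
    """One illustrative combination function: multiply trust values
    along each acquaintance chain and keep the best chain. Trust
    values lie in [0, 1]; the depth cutoff bounds the search (and
    stops recursion on cycles)."""
    if source == target:
        return 1.0
    if depth == 0:
        return 0.0
    best = 0.0
    for (u, v), t in trust.items():
        if u == source:
            best = max(best, t * personal_trust(trust, v, target, depth - 1))
    return best

# Different sources get different ratings for the same target:
web = {("alice", "bob"): 0.9, ("bob", "carol"): 0.8, ("dave", "carol"): 0.3}
print(personal_trust(web, "alice", "carol"))  # 0.72
print(personal_trust(web, "dave", "carol"))   # 0.3
```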
Analyzing the Effectiveness and Applicability of Co-training, 2000
"... Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks. The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets. We demonstrate that when learning f ..."
Abstract
-
Cited by 263 (7 self)
- Add to MetaCart
(Show Context)
Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks. The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets. We demonstrate that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not. When a natural split does not exist, co-training algorithms that manufacture a feature split may outperform algorithms not using a split. These results help explain why co-training algorithms are both discriminative in nature and robust to the assumptions of their embedded classifiers.
Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval---Information Filtering
Keywords: co-training, expectation-maximization, learning with labeled and unlabeled...
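For readers who want the mechanics, here is a minimal co-training loop: two copies of a base classifier, one per feature view, each self-labeling its most confident unlabeled examples every round. The interfaces, the confidence heuristic, and labeling added examples by the view-a classifier are simplifications of mine, not the paper's exact protocol:

```python
import numpy as np
from sklearn.base import clone

def co_train(base_clf, Xa, Xb, y, Ua, Ub, rounds=10, k=5):
    """Co-training sketch: one classifier per feature view.

    (Xa, Xb, y): labeled examples split into two views; (Ua, Ub):
    the unlabeled pool, split the same way. Each round, each view's
    classifier picks the k unlabeled examples it is most confident
    about; those join the labeled pool for both views.
    """
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    Ua, Ub = list(Ua), list(Ub)
    clf_a = clone(base_clf).fit(Xa, y)
    clf_b = clone(base_clf).fit(Xb, y)
    for _ in range(rounds):
        if not Ua:
            break
        chosen = set()
        for clf, U in ((clf_a, Ua), (clf_b, Ub)):
            conf = clf.predict_proba(U).max(axis=1)
            chosen.update(np.argsort(conf)[-k:].tolist())
        for i in sorted(chosen, reverse=True):
            ua, ub = Ua.pop(i), Ub.pop(i)
            Xa.append(ua)
            Xb.append(ub)
            y.append(clf_a.predict([ua])[0])  # simplification: view-a labels
        clf_a = clone(base_clf).fit(Xa, y)
        clf_b = clone(base_clf).fit(Xb, y)
    return clf_a, clf_b
```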
Learning to Construct Knowledge Bases from the World Wide Web, 2000
"... The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would ena ..."
Abstract
-
Cited by 242 (5 self)
- Add to MetaCart
(Show Context)
The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs. The first is an ontology that defines the classes (e.g., company, person, employee, product) and relations (e.g., employed_by, produced_by) of interest when creating the knowledge base. The second is a set of training data consisting of labeled regions of hypertext that represent instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This article describes our general a...
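The two-input setup is easy to picture as a data structure. Here is a sketch of the ontology input, using the abstract's own example classes and relations; the field names and types are my guesses, not the paper's representation:

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    """First input to the extraction system: the classes and
    relations the knowledge base should contain. The structure is
    an illustrative guess."""
    classes: set = field(default_factory=set)
    relations: dict = field(default_factory=dict)  # name -> (domain, range)

kb_spec = Ontology(
    classes={"company", "person", "employee", "product"},
    relations={
        "employed_by": ("employee", "company"),
        "produced_by": ("product", "company"),
    },
)
```

The second input, labeled hypertext regions, would then pair spans of Web pages with these class and relation names as training data.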
Learning to Classify Text from Labeled and Unlabeled Documents, 1998
"... . This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensi ..."
Abstract
-
Cited by 188 (20 self)
- Add to MetaCart
(Show Context)
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text, based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled...
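The loop the abstract describes maps directly to code. Below is a sketch with a multinomial naive Bayes model; encoding the probabilistic labels as per-example sample weights is my implementation choice, not necessarily the paper's:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, iters=10):
    """EM + naive Bayes over labeled and unlabeled documents,
    following the loop in the abstract. X_* are dense word-count
    matrices (an assumption for this sketch)."""
    nb = MultinomialNB().fit(X_lab, y_lab)
    classes = nb.classes_
    for _ in range(iters):
        # E-step: probabilistically label the unlabeled documents.
        P = nb.predict_proba(X_unlab)  # shape (n_unlab, n_classes)
        # M-step: retrain on all documents; each unlabeled document
        # appears once per class, weighted by its class probability.
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate(
            [y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate(
            [np.ones(len(y_lab))] + [P[:, i] for i in range(len(classes))])
        nb = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return nb
```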
Understanding inverse document frequency: On theoretical arguments for IDF - Journal of Documentation, 2004
"... The term weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon’s Information Theory) seeking to establish some theoretical ba ..."
Abstract
-
Cited by 168 (2 self)
- Add to MetaCart
(Show Context)
The term weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon’s Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
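For reference, the weighting function under discussion in its classical, unsmoothed form; real systems differ in smoothing and log base, so treat this as one common variant:

```python
import math

def idf(term, docs):
    """Inverse document frequency: log of collection size over the
    number of documents containing the term. docs: list of token
    lists. Unsmoothed; returns 0 for a term that never occurs."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """Raw term frequency times IDF, the TF*IDF form the abstract
    refers to."""
    return doc.count(term) * idf(term, docs)
```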
A Simple Relational Classifier - Proceedings of the Second Workshop on Multi-Relational Data Mining (MRDM-2003) at KDD-2003, 2003
"... We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We show that it performs surprisingly well by comparing it to more complex models such as Probabilist ..."
Abstract
-
Cited by 111 (12 self)
- Add to MetaCart
(Show Context)
We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts based only on the class labels of related neighbors, using no learning and no inherent attributes. We show that it performs surprisingly well by comparing it to more complex models such as Probabilistic Relational Models and Relational Probability Trees on three data sets from published work.
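The classifier is simple enough to state in a few lines. Here is a majority-vote reading of it; the paper's version works with weighted links and class probabilities, so this is a sketch rather than the exact model:

```python
from collections import Counter

def rn_classify(node, neighbors, labels):
    """Relational Neighbor sketch: predict the most common class
    among a node's labeled neighbors. No learning, no attributes.

    neighbors: dict node -> iterable of linked nodes;
    labels: dict of known class labels (unlabeled nodes absent).
    """
    votes = Counter(labels[n] for n in neighbors[node] if n in labels)
    return votes.most_common(1)[0][0] if votes else None
```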
Active + Semi-Supervised Learning = Robust Multi-View Learning - Proceedings of ICML-02, 19th International Conference on Machine Learning, 2002
"... In a multi-view problem, the features of the domain can be partitioned into disjoint subsets (views) that are sufficient to learn the target concept. ..."
Abstract
-
Cited by 110 (7 self)
- Add to MetaCart
In a multi-view problem, the features of the domain can be partitioned into disjoint subsets (views) that are sufficient to learn the target concept.
Determining the semantic orientation of terms through gloss classification - In Proc. CIKM-05, 2005
"... Sentiment classification is a recent subdiscipline of text classification which is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users ’ opinions about products or about political candidates as expressed ..."
Abstract
-
Cited by 104 (4 self)
- Add to MetaCart
(Show Context)
Sentiment classification is a recent subdiscipline of text classification which is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users’ opinions about products or about political candidates as expressed in online forums, to customer relationship management. Functional to the extraction of opinions from text is the determination of the orientation of “subjective” terms contained in text, i.e. the determination of whether a term that carries opinionated content has a positive or a negative connotation. In this paper we present a new method for determining the orientation of subjective terms. The method is based on the quantitative analysis of the glosses of such terms, i.e. the definitions that these terms are given in on-line dictionaries...
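The snippet cuts off mid-method, but the core idea admits a short sketch: train a text classifier on the glosses of terms whose orientation is already known, then classify a new term by the text of its gloss. The seed-set framing and all interfaces below are my assumptions, not a reproduction of the paper's method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def gloss_orientation_model(glosses, pos_seeds, neg_seeds):
    """Gloss-classification sketch: learn term orientation from the
    dictionary definitions (glosses) of seed terms with known
    polarity. glosses: dict term -> definition text."""
    texts = [glosses[t] for t in pos_seeds + neg_seeds]
    labels = ["positive"] * len(pos_seeds) + ["negative"] * len(neg_seeds)
    return make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# Orient a new term by classifying its gloss:
# model = gloss_orientation_model(glosses, ["good", "superb"], ["bad", "awful"])
# model.predict([glosses["stunning"]])[0]
```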