Results 1 - 10
of
12
A Study of Approaches to Hypertext Categorization
- Journal of Intelligent Information Systems
, 2002
"... . Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that informatio ..."
Abstract
-
Cited by 78 (3 self)
- Add to MetaCart
. Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our experimental results suggest that a naive use of linked pages, such as treating the words in the linked neighborhood of a page as local to that page, can be more harmful than helpful when the linked neighborhood is highly "noisy". This is especially true if the classifier is not sufficiently robust in discriminating informative words from noisy ones. It is also evident in our results that extracting meta data (when available) from related web sites can be extremely useful for improving classification accuracy. Finally, the relative performance of the classifiers being tested provides insights into their strengths and limitations for solving classification problems involving diverse and often noisy web pages. Keywords: hypertext classification, machine learning, web mining 1.
Feature generation for text categorization using world knowledge
- In IJCAI’05
, 2005
"... We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies a ..."
Abstract
-
Cited by 62 (13 self)
- Add to MetaCart
We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field. 1
Language-Independent Set Expansion of Named Entities using the Web
"... Set expansion refers to expanding a given partial set of objects into a more complete set. A well-known example system that does set expansion using the web is Google Sets. In this paper, we propose a novel method for expanding sets of named entities. The approach can be applied to semi-structured d ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
Set expansion refers to expanding a given partial set of objects into a more complete set. A well-known example system that does set expansion using the web is Google Sets. In this paper, we propose a novel method for expanding sets of named entities. The approach can be applied to semi-structured documents written in any markup language and in any human language. We present experimental results on 36 benchmark sets in three languages, showing that our system is superior to Google Sets in terms of mean average precision. 1.
Wikipedia-based semantic interpretation for natural language processing
- J. Artif. Int. Res
"... Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such a ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users. 1.
Improving A Page Classifier with Anchor Extraction And Link Analysis
- In Advances in Neural Information Processing Systems 15
, 2002
"... Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier's performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure with ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier's performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential "main-category anchor wrappers"; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.
Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
"... Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.
Classifying biological articles using web resources
- In Proc. of the 2004 ACM Symposium on Applied Computing
, 2004
"... Text classification systems on biomedical literature aim to select relevant articles to a specific issue from large corpora. Most systems with an acceptable accuracy are based on domain knowledge, which is very expensive and does not provide a general solution. This paper presents a novel approach f ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
Text classification systems on biomedical literature aim to select relevant articles to a specific issue from large corpora. Most systems with an acceptable accuracy are based on domain knowledge, which is very expensive and does not provide a general solution. This paper presents a novel approach for text classification on biomedical literature, involving the use of information extracted from related web resources. We validated this approach by implementing the proposed method and testing it on the KDD2002 Cup challenge: bio-text task. Results show that our approach can effectively improve efficiency on text classification systems for biomedical literature.
An Immune-Based Approach to Document Classification
, 2002
"... The human immune system as a biological complex adaptive system has provided inspiration for a range of innovative problem solving techniques in areas such as computer security, knowledge management and information retrieval. In this paper the construction and performance of a novel immune-based lea ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
The human immune system as a biological complex adaptive system has provided inspiration for a range of innovative problem solving techniques in areas such as computer security, knowledge management and information retrieval. In this paper the construction and performance of a novel immune-based learning algorithm is explored whose distributed, dynamic and adaptive nature offers many potential advantages over more traditional models. Through a process of cooperative coevolution a classifier is generated which consists of a set of detectors whose local dynamics enable the system as a whole to group positive and negative examples of a concept. The immune-based learning algorithm is first validated on a standard dataset. Then, combined with an HTML feature extractor, it is tested on a web-based document classification task and found to outperform traditional classification paradigms. Further applications in document based searching, content filtering, recommendation systems and user profile generation are also directly relevant to the work presented.
Semi-supervised Text Classification Using Partitioned EM
- 11 th Int. Conference on Database Systems for Advanced Applications (DASFAA
, 2004
"... Abstract. Text classification using a small labeled set and a large unlabeled data is seen as a promising technique to reduce the labor-intensive and time consuming effort of labeling training data in order to build accurate classifiers since unlabeled data is easy to get from the Web. In [16] it ha ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Abstract. Text classification using a small labeled set and a large unlabeled data is seen as a promising technique to reduce the labor-intensive and time consuming effort of labeling training data in order to build accurate classifiers since unlabeled data is easy to get from the Web. In [16] it has been demonstrated that an unlabeled set improves classification accuracy significantly with only a small labeled training set. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and there is a one-to-one correspondence between the mixture components and the classes. This may not be the case in many applications. In many real-life applications, a class may cover documents from many different topics, which violates the oneto-one correspondence assumption. In such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering based partitioning technique to solve the problem. This method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes or partitions are likely to satisfy the oneto-one correspondence condition. Extensive experiments demonstrate that this method is able to achieve a dramatic gain in classification performance.
Using Data and Text Mining Techniques for Yeast Gene Regulation Prediction: A Case Study
- SIGKDD Explorations
, 2002
"... We focus on the problem of predicting yeast gene regulation experiments. In order to construct a good solution, we study combinations of different methods that are not yet to be found in any single data mining application. ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
We focus on the problem of predicting yeast gene regulation experiments. In order to construct a good solution, we study combinations of different methods that are not yet to be found in any single data mining application.

