A comparison of document clustering techniques
In KDD Workshop on Text Mining, 2000
Cited by 306 (18 self)
This paper presents the results of an experimental study of some common document clustering techniques: agglomerative hierarchical clustering and K-means. (We used both a “standard” K-means algorithm and a “bisecting” K-means algorithm.) Our results indicate that the bisecting K-means technique is better than the standard K-means approach and (somewhat surprisingly) as good as or better than the hierarchical approaches that we tested.
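As a rough illustration of the bisecting idea named in this abstract, the sketch below repeatedly splits one cluster in two with a basic 2-means step until k clusters exist. The abstract does not give the algorithm's details, so several choices here are assumptions made only for the example: the largest cluster is always the one split, the two starting centers are the first point and the point farthest from it, and squared Euclidean distance on toy 2-D points stands in for a document similarity measure. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(points, iters=20):
    """Split points into two groups with a basic K-means loop (K = 2).

    Assumed initialization: the first point and the point farthest from it.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    centers = [c0, c1]
    groups = ([], [])
    for _ in range(iters):
        new_groups = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_groups[nearer].append(p)
        converged = new_groups == groups
        groups = new_groups
        if converged or not groups[0] or not groups[1]:
            break
        centers = [mean(groups[0]), mean(groups[1])]
    return groups

def bisecting_kmeans(points, k):
    """Start with one cluster and bisect until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumed selection rule: always bisect the largest cluster.
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break  # nothing left that can be split
        left, right = two_means(largest)
        if not left or not right:
            # Degenerate case (e.g. identical points): fall back to halving.
            half = len(largest) // 2
            left, right = largest[:half], largest[half:]
        clusters += [left, right]
    return clusters

toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50), (50, 51)]
toy_clusters = bisecting_kmeans(toy_points, 3)
```

On the toy points, the first bisection separates the two points near (50, 50) from the rest, and the second separates the group near the origin from the group near (10, 10).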
Web Document Clustering: A Feasibility Demonstration
1998
Cited by 279 (3 self)
Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
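The idea of clustering by shared phrases can be illustrated with a much simplified sketch. It is not Suffix Tree Clustering itself: instead of a suffix tree it indexes word n-grams in a dictionary, so it is neither incremental nor linear time, and it stops at the candidate groups without any later merging or scoring. The function name, the `max_len` and `min_docs` parameters, and the toy snippets are all assumptions made for the example.

```python
from collections import defaultdict

def shared_phrase_groups(snippets, max_len=3, min_docs=2):
    """Map each phrase (word n-gram) to the ids of the snippets containing it.

    Only phrases that occur in at least min_docs snippets are kept; each
    surviving phrase defines one candidate group of snippets.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}

toy_snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
groups = shared_phrase_groups(toy_snippets)
```

Here `groups[("cat", "ate")]` holds snippets 0 and 2, and `groups[("ate", "cheese")]` holds snippets 0 and 1; a phrase such as "mouse too" that appears in only one snippet is dropped.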
Analysis of navigation behaviour in web sites integrating multiple information systems
The VLDB Journal, 2000
Cited by 69 (22 self)
The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available in the web come from databases or other data collections and are presented to the users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigation to pages of results combining cross-linked data from many sources. For the analysis of visitor navigation behaviour in such web sites, we propose the web usage miner (WUM), which discovers navigation patterns subject to advanced statistical and structural constraints. Since our objective is the discovery of interesting navigation patterns, we do not focus on accesses to individual pages. Instead, we construct conceptual hierarchies that reflect the query capabilities used in the production of those pages. Our experiments with a real web site that integrates data from multiple databases, the German SchulWeb, demonstrate the appropriateness of WUM in discovering navigation patterns and show how those discoveries can help in assessing and improving the quality of the site.
Key words: Web usage mining – Data mining – Web query interfaces – Web databases – Query capabilities – Conceptual hierarchies
Information retrieval on the Web
ACM Computing Surveys, 2000
Cited by 58 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Hierarchical Document Clustering Using Frequent Itemsets
In Proc. SIAM International Conference on Data Mining 2003 (SDM 2003), 2003
Cited by 55 (1 self)
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains only a small fraction of the words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
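To make the notion of a frequent itemset over documents concrete, here is a minimal sketch: a word set is frequent when it occurs in at least a given fraction of the documents, and the documents containing it form a candidate cluster. This is a brute-force enumeration for toy input, not the paper's method; the function names, the `min_support` and `max_size` parameters, and the sample documents are assumptions, and the construction of the topic tree is not attempted.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(docs, min_support, max_size=2):
    """Count word sets of up to max_size words and keep the frequent ones.

    min_support is a fraction of the number of documents. Every word
    combination is enumerated, so this only suits very small inputs.
    """
    doc_sets = [set(d.lower().split()) for d in docs]
    counts = Counter()
    for words in doc_sets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(doc_sets)
    return {items: n for items, n in counts.items() if n >= threshold}

def cover(docs, itemset):
    """Ids of the documents that contain every word of the itemset."""
    return [i for i, d in enumerate(docs) if itemset <= set(d.lower().split())]

toy_docs = ["apple banana", "apple banana cherry", "apple date", "banana cherry"]
frequent = frequent_itemsets(toy_docs, min_support=0.5)
cherry_cluster = cover(toy_docs, frozenset({"banana", "cherry"}))
```

With a support threshold of half the documents, the word "date" is discarded while the pair {banana, cherry} survives and covers documents 1 and 3. Only the frequent words remain as features, which is the dimensionality reduction the abstract refers to.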
Learning to Attach Semantic Metadata to Web Services
In Proc. Int. Semantic Web Conf., 2003
Cited by 51 (10 self)
Emerging Web standards promise a network of heterogeneous yet interoperable Web Services. Web Services would greatly simplify the development of many kinds of data integration and knowledge management applications. Unfortunately, this vision requires that services describe themselves with large amounts of semantic metadata "glue". We explore a variety of machine learning techniques to semiautomatically create such metadata.
Data mining for measuring and improving the success of web sites
Data Mining and Knowledge Discovery, 2001
Cited by 39 (2 self)
For many companies, competitiveness in e-commerce requires a successful presence on the web. Web sites are used to establish the company’s image, to promote and sell goods and to provide customer support. The success of a web site affects and reflects directly the success of the company in the electronic market. In this study, we propose a methodology to improve the “success” of web sites, based on the exploitation of navigation pattern discovery. In particular, we present a theory, in which success is modelled on the basis of the navigation behaviour of the site’s users. We then exploit WUM, a navigation pattern discovery miner, to study how the success of a site is reflected in the users’ behaviour. With WUM we measure the success of a site’s components and obtain concrete indications of how the site should be improved. We report on our first experiments with an online catalog, the success of which we have studied. Our mining analysis has shown very promising results, on the basis of which the site is currently undergoing concrete improvements.
Clustering of Web Users Based on Access Patterns
In Proceedings of the 1999 KDD Workshop on Web Mining, 1999
Cited by 37 (0 self)
The clustering of Web users based on their access patterns is studied. Access patterns of the Web users are extracted from Web servers' log files, and then organized into sessions which represent episodes of interaction between Web users and the Web server. Using attribute-oriented induction, the sessions are then generalized according to the page hierarchy which organizes pages according to their generalities. The generalized sessions are finally clustered using a hierarchical clustering method. Our experiments on a large real data set show that the method is efficient and practical for Web mining applications.
1 Introduction. With the rapid development of the World Wide Web (WWW), or the Web, many organizations now put their information on the Web and provide Web-based services such as on-line shopping, user feedback, technical support, etc. Web mining, the knowledge discovery in the Web, has become an important research area [2]. Research in Web Mining can be broadly classified...
Efficient phrase-based document indexing for Web document clustering
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 31 (1 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Polynomial Time Approximation Schemes for Geometric k-Clustering
J. of the ACM, 2001
Cited by 28 (5 self)
The Johnson-Lindenstrauss lemma states that n points in a high dimensional Hilbert space can be embedded with small distortion of the distances into an O(log n) dimensional space by applying a random linear transformation. We show that similar (though weaker) properties hold for certain random linear transformations over the Hamming cube. We use these transformations to solve NP-hard clustering problems in the cube as well as in geometric settings. More specifically, we address the following clustering problem. Given n points in a larger set (for example, R^d) endowed with a distance function (for example, L² distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high dimensional geometric settings, even for k = 2. We give polynomial time approximation schemes for this problem in several settings, including the binary cube {0, 1}^d with Hamming distance, and R^d either with L¹ distance, or with L² distance, or with the square of L² distance. In all these settings, the best previous results were constant factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed, but the dimension is part of the input.
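The clustering objective stated in this abstract can be written down directly: given k centers, each point is charged its distance to the nearest center, and the goal is to minimize the total. The sketch below evaluates that cost and, for a tiny instance on the binary cube with Hamming distance, finds the optimum by exhaustive search. The exhaustive search is exponential in the dimension and only serves to make the objective concrete; it is not the approximation scheme of the paper. The function names and the sample points are assumptions.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of coordinates in which two binary tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def clustering_cost(points, centers, dist):
    """Sum over all points of the distance to the nearest center.

    Assigning each point to its nearest center gives the cheapest
    partition for a fixed set of centers.
    """
    return sum(min(dist(p, c) for c in centers) for p in points)

def brute_force_centers(points, k, d):
    """Try every set of k centers in {0,1}^d and return a cheapest one.

    The cube has 2**d vertices, so this is usable for toy sizes only.
    """
    cube = list(product((0, 1), repeat=d))
    return min(combinations(cube, k),
               key=lambda cs: clustering_cost(points, cs, hamming))

toy_cube_points = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
best_centers = brute_force_centers(toy_cube_points, k=2, d=3)
best_cost = clustering_cost(toy_cube_points, best_centers, hamming)
```

For the four sample points and k = 2 the optimal cost is 2: two centers can coincide with at most two of the four distinct points, and each remaining point is at Hamming distance at least 1 from any center.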

