TopCat: Data Mining for Topic Identification in a Text Corpus
- In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases
, 2002
Cited by 27 (4 self)
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying "topics" in collections of news articles.
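The frequent-itemset step described in this abstract can be illustrated with a minimal sketch. It assumes each article has already been reduced to a set of entity names; the toy corpus, the `frequent_itemsets` function, and the `min_support` and `max_size` parameters are hypothetical and not taken from the paper. It counts itemsets by brute-force enumeration rather than a tuned miner, and it leaves out the hypergraph partitioning step entirely.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return {itemset: article count} for itemsets found in >= min_support articles.

    Brute-force enumeration, fine for a toy corpus only (assumption: a real
    system would use a dedicated frequent-itemset miner).
    """
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical articles, each represented as a set of named entities.
articles = [
    {"NATO", "Kosovo", "Belgrade"},
    {"NATO", "Kosovo"},
    {"NATO", "Kosovo", "Clinton"},
    {"Clinton", "Senate"},
]
topics = frequent_itemsets(articles, min_support=2)
```

Here `{"NATO", "Kosovo"}` survives the support threshold while `{"Clinton", "Senate"}` does not, which is the kind of recurring entity group a later clustering step would consume.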
An efficient and scalable algorithm for clustering xml documents by structure
- IEEE Transactions on Knowledge and Data Engineering
, 2004
Cited by 27 (0 self)
With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection. Index Terms: Data mining, clustering, XML, semistructured data, query processing.
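A structure-graph distance of the kind this abstract mentions might look like the sketch below. The abstract does not give the formula, so two things here are assumptions: that an s-graph is the set of parent-child element edges in a document, and that distance is one minus the shared-edge count divided by the larger edge count. The function names `s_graph` and `s_graph_distance` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Return the set of (parent_tag, child_tag) edges in one XML document."""
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def s_graph_distance(edges_a, edges_b):
    """Assumed metric: 1 - |shared edges| / max(|edges_a|, |edges_b|)."""
    if not edges_a and not edges_b:
        return 0.0
    shared = len(edges_a & edges_b)
    return 1.0 - shared / max(len(edges_a), len(edges_b))


# Hypothetical documents with partly overlapping structure.
doc_a = s_graph("<book><title/><author><name/></author></book>")
doc_b = s_graph("<book><title/><publisher/></book>")
distance_ab = s_graph_distance(doc_a, doc_b)
```

Under these assumptions a set of documents can be represented by the union of its members' edge sets, so the same function compares clusters as well as single documents, and it only needs set operations instead of a tree-edit computation.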
Explaining Text Clustering Results Using Semantic Structures
- In Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003
, 2003
Cited by 26 (8 self)
Common text clustering techniques offer rather poor capabilities for explaining to their users why a particular result has been achieved. They have the disadvantage that they do not relate semantically nearby terms and that they cannot explain how resulting clusters are related to each other. In this paper, we discuss a way of integrating a large thesaurus and the computation of lattices of resulting clusters into common text clustering in order to overcome these two problems. As its major result, our approach achieves an explanation using an appropriate level of granularity at the concept level as well as an appropriate size and complexity of the explaining lattice of resulting clusters.
Organizing Structured Web Sources by Query Schemas: A Clustering Approach
- In CIKM Conference
, 2004
Cited by 17 (2 self)
In recent years, the Web has been rapidly "deepened" with the prevalence of databases online. On this deep Web, many sources are structured by providing structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm.
Detecting and Browsing Events in Unstructured Text
, 2002
Cited by 15 (1 self)
Previews and overviews of large, heterogeneous information resources help users comprehend the scope of collections and focus on particular subsets of interest. For narrative documents, questions of "what happened? where? and when?" are natural points of entry. Building on our earlier work at the Perseus Project with detecting terms, place names, and dates, we have exploited co-occurrences of dates and place names to detect and describe likely events in document collections. We compare statistical measures for determining the relative significance of various events. We have built interfaces that help users preview likely regions of interest for a given range of space and time by plotting the distribution and relevance of various collocations. Users can also control the amount of collocation information in each view. Once particular collocations are selected, the system can identify key phrases associated with each possible event to organize browsing of the documents themselves.
Detecting Events with Date and Place Information in Unstructured Text
, 2002
Cited by 15 (1 self)
Digital libraries of historical documents provide a wealth of information about past events, often in unstructured form. Once dates and place names are identified and disambiguated, using methods that can differ by genre, we examine collocations to detect events. Collocations can be ranked by several measures, which vary in effectiveness according to type of events, but the log-likelihood measure (-2 log λ) offers a reasonable balance between frequently and infrequently mentioned events and between larger and smaller spatial and temporal ranges. Significant date-place collocations can be displayed on timelines and maps as an interface to digital libraries. More detailed displays can highlight key names and phrases associated with a given event.
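The log-likelihood measure named in this abstract is commonly computed from a 2x2 contingency table of co-occurrence counts. The sketch below uses the standard G² form, 2 · Σ O · ln(O / E). The cell layout is an assumption for illustration: `k11` counts passages mentioning both the date and the place, `k12` the date without the place, `k21` the place without the date, and `k22` neither. The function name is hypothetical, and the abstract does not say how the authors counted passages.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """Return -2 log lambda (G^2) for a 2x2 table of co-occurrence counts."""
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Hypothetical counts: a date and place that always appear together,
# versus a pair that co-occurs exactly as often as chance predicts.
strong = log_likelihood_ratio(10, 0, 0, 10)
independent = log_likelihood_ratio(5, 5, 5, 5)
```

A score near zero means the date and place co-occur about as often as independence would predict; larger scores mark collocations worth surfacing as candidate events.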
Meta clustering
- In Proceedings IEEE International Conference on Data Mining
, 2006
Cited by 14 (1 self)
Clustering is ill-defined. Unlike supervised learning where labels lead to crisp performance criteria such as accuracy and squared error, clustering quality depends on how the clusters will be used. Devising clustering criteria that capture what users need is difficult. Most clustering algorithms search for optimal clusterings based on a pre-specified clustering criterion. Our approach differs. We search for many alternate clusterings of the data, and then allow users to select the clustering(s) that best fit their needs. Meta clustering first finds a variety of clusterings and then clusters this diverse set of clusterings so that users must only examine a small number of qualitatively different clusterings. We present methods for automatically generating a diverse set of alternate clusterings, as well as methods for grouping clusterings into meta clusters. We evaluate meta clustering on four test problems and two case studies. Surprisingly, clusterings that would be of most interest to users often are not very compact clusterings.
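Grouping clusterings into meta clusters requires some notion of distance between two clusterings of the same points. The abstract does not say which measure is used, so the sketch below substitutes a common stand-in: the fraction of point pairs on which two clusterings disagree (one minus the Rand index). The function name and the label lists are hypothetical.

```python
from itertools import combinations


def clustering_distance(labels_a, labels_b):
    """Fraction of point pairs grouped together in one clustering but not the other."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


# Hypothetical clusterings of four points.
same_up_to_renaming = clustering_distance([0, 0, 1, 1], [1, 1, 0, 0])
different = clustering_distance([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure compares pair memberships and not label values, two clusterings that differ only in cluster names are at distance zero. A matrix of such distances could then be fed to any standard clustering routine to form the meta clusters.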
Document Clustering for Electronic Meetings: An Experimental Comparison of Two Techniques
- of 16QAM Digital PLL Based Demodultors", Proc. Globecom-94
, 1994
Cited by 13 (3 self)
In this article, we report our implementation and comparison of two text clustering techniques. One is based on Ward's clustering and the other on Kohonen's Self-organizing Maps. We have evaluated how closely clusters produced by a computer resemble those created by human experts. We have also measured the time that it takes for an expert to "clean up" the automatically produced clusters. The technique based on Ward's clustering was found to be more precise. Both techniques have worked equally well in detecting associations between text documents. We used text messages obtained from group brainstorming meetings.
Integrating Information Seeking and Structuring: Exploring the Role of Spatial Hypertext in a Digital Library
- in Proceedings of HT ‘04
, 2004
Cited by 10 (2 self)
This paper presents Garnet, a novel spatial hypertext interface to a digital library. Garnet supports both information structuring – via spatial hypertext – and traditional information seeking – via a digital library. A user study of Garnet is reported, together with an analysis of how the organizing work done by users in a spatial hypertext workspace could support later information seeking. The use of Garnet during the study is related to both digital library and spatial hypertext research. Spatial hypertexts support the detection of implicit document groups in a user’s workspace. The study also investigates the degree of similarity found in the full text of documents within such document groups.
Categories and Subject Descriptors: H.5.4 [Hypertext/Hypermedia]: User issues

