Efficient Mining of Emerging Patterns: Discovering Trends and Differences
1999
Cited by 192 (28 self)
We introduce a new kind of pattern, called emerging patterns (EPs), for knowledge discovery from databases. EPs are defined as itemsets whose supports increase significantly from one dataset to another. EPs can capture emerging trends in timestamped databases, or useful contrasts between data classes. EPs have been proven useful: we have used them to build very powerful classifiers, which are more accurate than C4.5 and CBA, for many datasets. We believe that EPs with low to medium support, such as 1%-20%, can give useful new insights and guidance to experts, even in "well understood" applications. The efficient mining of EPs is a challenging problem, since (i) the Apriori property no longer holds for EPs, and (ii) there are usually too many candidates for high dimensional databases or for small support thresholds such as 0.5%. Naive algorithms are too costly. To solve this problem, (a) we promote the description of large collections of itemsets using their concise borders (the pa...
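The definition in this abstract (itemsets whose support increases significantly from one dataset to another) can be illustrated with a minimal sketch. The abstract does not say how "significantly" is measured; the ratio of supports and the threshold `rho` used below are assumptions for illustration, and the function names are hypothetical. This is a naive check for a single itemset, not the border-based mining the abstract goes on to describe.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def growth_rate(itemset: FrozenSet[str],
                d1: List[FrozenSet[str]],
                d2: List[FrozenSet[str]]) -> float:
    """Ratio of the support in d2 to the support in d1 (assumed measure)."""
    s1 = support(itemset, d1)
    s2 = support(itemset, d2)
    if s1 == 0.0:
        # Absent from d1: no growth if also absent from d2, unbounded otherwise.
        return 0.0 if s2 == 0.0 else float("inf")
    return s2 / s1


def is_emerging(itemset: FrozenSet[str],
                d1: List[FrozenSet[str]],
                d2: List[FrozenSet[str]],
                rho: float) -> bool:
    """True if the support grows from d1 to d2 by at least the factor rho."""
    return growth_rate(itemset, d1, d2) >= rho
```

Checking every candidate itemset this way is the naive approach the abstract calls too costly; the sketch only makes the definition concrete.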
Mining the Biomedical Literature in the Genomic Era: An Overview
Journal of Computational Biology, 2003
Cited by 72 (2 self)
The past decade has seen a tremendous growth in the amount of experimental and computational biomedical data, specifically in the areas of Genomics and Proteomics. This growth is accompanied by an accelerated increase in the number of biomedical publications discussing the findings. In the last few years there has been a lot of interest within the scientific community in literature-mining tools to help sort through this abundance of literature, and find the nuggets of information most relevant and useful for specific analysis tasks. This paper
Mining Surprising Patterns Using Temporal Description Length
1998
Cited by 48 (0 self)
We propose a new notion of surprising temporal patterns in market basket data, and algorithms to find such patterns. This is distinct from finding frequent patterns as addressed in the common mining literature. We argue that once the analyst is already familiar with prevalent patterns in the data, the greatest incremental benefit is likely to be from changes in the relationship between item frequencies over time. A simple measure of surprise is the extent of departure from a model, estimated using standard multivariate time series analysis. Unfortunately, such estimation involves models, smoothing windows and parameters whose optimal choices can vary dramatically from one application to another. In contrast, we propose a precise characterization of surprise based on the number of bits in which a basket sequence can be encoded under a carefully chosen coding scheme. In this scheme it is inexpensive to encode sequences of itemsets that have steady, hence likely to be well-known, correla...
Text mining at the term level
Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), 1998
Cited by 40 (0 self)
Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Previous work in text mining focused on the word or the tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the document. Terms and additional higher-level entities are then organized in a hierarchical taxonomy. In this paper we describe the Term Extraction module of the Document Explorer system, and provide an experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995-1996.
Text Mining: Generating Hypotheses from MEDLINE
Journal of the American Society for Information Science and Technology
Cited by 34 (2 self)
Hypothesis generation, a crucial initial step for making scientific discoveries, relies on prior knowledge, experience and intuition. Chance connections made between seemingly distinct subareas sometimes turn out to be fruitful. The goal in text mining is to assist in this process by automatically discovering a small set of interesting hypotheses from a suitable text collection.
TopCat: Data Mining for Topic Identification in a Text Corpus
In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases, 2002
Cited by 27 (4 self)
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying "topics" in collections of news articles.
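The first step this abstract describes, generating frequent itemsets from articles represented as sets of entities, can be sketched as follows. This is a brute-force count over small itemsets, chosen for brevity and not taken from the paper; the function name, the `min_count` and `max_size` parameters, and any entity names passed in are hypothetical. The later hypergraph partitioning step is not shown.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def frequent_itemsets(articles: Iterable[Set[str]],
                      min_count: int,
                      max_size: int = 2) -> Dict[Tuple[str, ...], int]:
    """Count entity sets of up to `max_size` items and keep the frequent ones.

    Each article is a set of entity names. An itemset is kept if it occurs
    in at least `min_count` articles. Itemsets are returned as sorted tuples.
    """
    counts: Counter = Counter()
    for entities in articles:
        items = sorted(set(entities))  # canonical order, so each set has one key
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}
```

Enumerating all combinations grows quickly with article size, so a real system would prune candidates (for example with an Apriori-style search); the sketch only shows what "frequent itemsets over articles" means.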
Text Mining with Information Extraction
AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002
Cited by 21 (0 self)
The popularity of the Web and the large number of documents available in electronic form have motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems. We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other
Mining Concept Profiles with the Vector Model or Where on Earth are Diseases being Studied?
In: Proceedings of Text Mining Workshop. Third SIAM International Conference on Data Mining, 2003
Cited by 11 (3 self)
In this research we study the value of concept exploration, a function offered in our text mining prototype. This function, implemented using the vector space model, allows one to build a profile for a given concept. This profile is derived from the text collection being mined. The function may be used to build profiles for concepts that are as complex or as simple as the user desires. In this paper, we apply this function towards studying trends in disease research. Profiles are built with diseases as concepts and by mining the MEDLINE database. Disease research trends are compared with disease prevalence trends. The study indicates that text mining may offer a useful option for current efforts at estimating global epidemiological data. More generally, this research demonstrates the application of text mining and concept exploration.
Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections
In Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, 1998
Cited by 11 (0 self)
Traditionally, texts have been analysed using various information retrieval related methods, such as full-text analysis, and natural language processing. However, only a few examples of data mining in text, particularly in full text, are available. In this paper we show that general data mining methods are applicable to text analysis tasks such as descriptive phrase extraction. Moreover, we present a general framework for text mining. The framework follows the general knowledge discovery process, thus containing steps from preprocessing to the utilization of the results. The data mining method that we apply is based on generalized episodes and episode rules. We give concrete examples of how to preprocess texts based on the intended use of the discovered results and we introduce a weighting scheme that helps in pruning out redundant or non-descriptive phrases. We also present results from real-life data experiments.

