Data Mining: An Overview from Database Perspective
- IEEE Transactions on Knowledge and Data Engineering
, 1996
Cited by 314 (23 self)
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and on-line services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase business opportunities. In response to this demand, this article provides a survey, from a database researcher's point of view, of recently developed data mining techniques. A classification of the available data mining techniques is provided, and a comparative study of such techniques is presented.
Scaling Clustering Algorithms to Large Databases
- Microsoft Research Report
, 1998
Cited by 197 (5 self)
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on synthetic and publicly available data sets.
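The abstract above names the ingredients of the framework (a single scan, a bounded buffer, and data that is either kept in memory or summarized) but not its details. The following is a much-simplified sketch of that idea, not the authors' algorithm: points that end up close to a centroid are folded into per-cluster sufficient statistics (count and coordinate sum) and dropped, while the rest stay in the buffer. The function name, `buffer_size`, `discard_radius`, and `iters` are all hypothetical, and the compression of retained points into sub-clusters is omitted.

```python
import math


def single_scan_kmeans(stream, init_centers, buffer_size=100,
                       discard_radius=1.0, iters=10):
    """Simplified single-scan K-Means (illustrative assumption, see lead-in)."""
    k = len(init_centers)
    dim = len(init_centers[0])
    centers = [tuple(c) for c in init_centers]
    counts = [0] * k                          # discarded points per cluster
    sums = [[0.0] * dim for _ in range(k)]    # coordinate sums of discarded points
    retained = []                             # points still held in memory

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centers[j]))

    def refine():
        # K-Means iterations over retained points plus the summaries.
        for _ in range(iters):
            n = list(counts)
            s = [row[:] for row in sums]
            for p in retained:
                j = nearest(p)
                n[j] += 1
                for d in range(dim):
                    s[j][d] += p[d]
            for j in range(k):
                if n[j]:
                    centers[j] = tuple(s[j][d] / n[j] for d in range(dim))

    def discard():
        # Fold points near their centroid into the summaries and drop them.
        keep = []
        for p in retained:
            j = nearest(p)
            if math.dist(p, centers[j]) <= discard_radius:
                counts[j] += 1
                for d in range(dim):
                    sums[j][d] += p[d]
            else:
                keep.append(p)
        retained[:] = keep

    for p in stream:                          # each record is read exactly once
        retained.append(tuple(p))
        if len(retained) >= buffer_size:
            refine()
            discard()
    refine()
    return centers
```

In this sketch the buffer can still grow when few points qualify for discarding; the compression step described in the abstract is what would bound it in the full scheme.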
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Cited by 35 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
GeoVISTA Studio: A Codeless Visual Programming Environment For Geoscientific Data Analysis and Visualization
- Computational Geoscience
, 2002
Cited by 32 (3 self)
The fundamental goal of the GeoVISTA Studio project is to improve geoscientific analysis by providing an environment that operationally integrates a wide range of analysis activities, including those both computationally and visually based. We argue here that improving the infrastructure used in analysis has far-reaching potential to better integrate human-based and computationally-based expertise, and so ultimately improve scientific outcomes. But to address these challenges, some difficult system design and software engineering problems must be overcome. This paper illustrates the design of a component-oriented system, GeoVISTA Studio, as a means to overcome such difficulties by using state-of-the-art component-based software engineering techniques. Advantages described include: ease of program construction (visual programming), an open (non-proprietary) architecture, simple component-based integration and advanced deployment methods. This versatility has the potential to change the nature of systems development for the geosciences, providing better mechanisms to coordinate complex functionality, and as a consequence, to improve analysis by closer integration of software tools and better engagement of the human expert. Two example applications are presented to illustrate the potential of the Studio environment for exploring and better understanding large, complex geographical datasets and for supporting complex visual and computational analysis.
Keywords: visual programming, exploratory data analysis (EDA), knowledge construction, Java, component-oriented programming (COP).
Data Mining: Machine Learning, Statistics, and Databases
- In Proceedings of the 8th International Conference on Scientific and Statistical Database Management
, 1996
Cited by 29 (2 self)
Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. We give an overview of the area and present some of the research issues, especially from the database angle.
1 Introduction
Knowledge discovery in databases (KDD), often called data mining, aims at the discovery of useful information from large collections of data. The discovered knowledge can be rules describing properties of the data, frequently occurring patterns, clusterings of the objects in the database, etc. Data mining emerged in the 1990s as a visible research and development area; both in industry and in science there seems to be a lack of methods for efficient analysis of large data sets. Current technology makes it fairly easy to collect data, but data analysis tends to be slow and expensive. There is a suspicion that there might be nuggets of useful information hiding in the masses of unanalyzed or underanalyzed data, and therefore semiautomatic methods fo...
Statistics and Data Mining: Intersecting Disciplines
- SIGKDD Explorations
, 1999
Cited by 17 (1 self)
is generally meant by data mining nowadays. Statistics and data mining have much in common, but they also have differences. The nature of the two disciplines is examined, with emphasis on their similarities and differences.
Data mining criteria for tree-based regression and classification
- In Proceedings KDD
, 2001
Cited by 16 (1 self)
This paper is concerned with the construction of regression and classification trees that are more adapted to data mining applications than conventional trees. To this end, we propose new splitting criteria for growing trees. Conventional splitting criteria attempt to perform well on both sides of a split by attempting a compromise in the quality of fit between the left and the right side. By contrast, we adopt a data mining point of view by proposing criteria that search for interesting subsets of the data, as opposed to modeling all of the data equally well. The new criteria do not split based on a compromise between the left and the right bucket; they effectively pick the more interesting bucket and ignore the other. As expected, the result is often a simpler characterization of interesting subsets of the data. Less expected is that the new criteria often yield whole trees that provide more interpretable data descriptions. Surprisingly, it is a “flaw” that works to their advantage: The new criteria have an increased tendency to accept splits near the boundaries of the predictor ranges. This so-called “end-cut problem” leads to the repeated peeling of small layers of data and results in very unbalanced but highly expressive and interpretable trees.
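The abstract does not state the criteria themselves, so the sketch below is only one plausible reading of "pick the more interesting bucket and ignore the other" for a regression response: score each candidate split by the mean response of its better bucket alone, and keep the split with the highest such score. The function name, the `min_bucket` parameter, and the choice of "high mean" as the measure of interest are assumptions made for illustration.

```python
from statistics import fmean


def best_one_sided_split(xs, ys, min_bucket=1):
    """Return (threshold, side, bucket_mean) for the split whose better
    bucket has the highest mean response, or None if no split exists.

    Hypothetical one-sided criterion: the other bucket is ignored.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_bucket, len(pairs) - min_bucket + 1):
        # Only split between distinct predictor values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        for side, bucket in (("left", left), ("right", right)):
            score = fmean(bucket)   # quality of this bucket only
            if best is None or score > best[2]:
                best = (threshold, side, score)
    return best
```

On data with a single extreme response at the edge of the predictor range, this criterion selects a split that isolates that edge, which is the end-cut behavior the abstract describes.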
Issues for On-Line Analytical Mining of Data Warehouses (Extended Abstract)
- SIGMOD'98 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98)
, 1998
Cited by 12 (1 self)
Jiawei Han, Sonny H.S. Chee and Jenny Y. Chiang. Intelligent Database Systems Research Laboratory, School of Computing Science, Simon Fraser University, British Columbia, Canada V5A 1S6. {han, schee, ychiang}@cs.sfu.ca. URL: http://db.cs.sfu.ca/
Abstract: Data warehouses and OLAP engines are expected to be widely available in the near future. The data in data warehouses has been cleansed, integrated, and preprocessed, and infrastructures have been built surrounding data warehouses for efficient data analysis. Therefore, data warehouses or OLAP databases are expected to be a major platform for data mining in the future. We discuss the issues related to efficient and effective data mining in large data warehouses and/or data marts, including the desired architectures for an integrated on-line analytical processing (OLAP) and on-line analytical mining (OLAM) system, the expected features of OLAM, and how to implement such a system effectively.
1 Introduction
With recent developments...
Data Mining At The Interface Of Computer Science And Statistics
, 2001
Cited by 6 (0 self)
This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention both in the research and commercial arenas in recent years, involving the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.
Keywords: Data mining, statistics, pattern recognition, transaction data, correlation.
Statistical Approaches to Predictive Modeling in Large Databases
, 1998
Cited by 5 (0 self)
Prediction, i.e., predicting the potential values or value distributions of certain attributes for objects in a database or data warehouse, is an attractive goal in data mining. Predicting, with high quality, future events not recorded in databases can help users make smart business decisions. With both scalability and prediction quality in mind, we propose a predictive modeling algorithm for interactive prediction in large databases and data warehouses. The algorithm consists of three steps: (1) data generalization, which converts data in relational databases or data warehouses into a multi-dimensional database to which efficient analysis techniques can be applied; (2) relevance analysis, which identifies the attributes that are highly relevant to the prediction, reducing the number of attributes used and thereby improving both the efficiency and the reliability of prediction; and (3) a statistical regression model, called generalized linear model, is constructed ...

