From data mining to knowledge discovery in databases
- AI Magazine
, 1996
Cited by 215 (0 self)
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field. Across a wide variety of fields, data are
Knowledge Discovery from Users Web-Page Navigation
- in Proceedings of workshop on research issues in Data engineering
, 1997
Cited by 131 (10 self)
We propose to detect users' navigation paths to the advantage of web-site owners. First, we explain the design and implementation of a profiler which captures the client's selected links and page order, accurate page viewing time, and cache references, using a Java-based remote agent. The information captured by the profiler is then utilized by a knowledge discovery technique to cluster users with similar interests. We introduce a novel path clustering method based on the similarity of the history of user navigation. This approach is capable of capturing the interests of the user which could persist through several subsequent hypertext link selections. Finally, we evaluate our path clustering technique via a simulation study on a sample WWW-site. We show that, depending on the level of inserted noise, we can recover the correct clusters with an average error margin of 10%-27%.
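The abstract does not say how path similarity is measured, so the following is only an illustrative sketch of grouping navigation paths by similarity. It swaps in a longest-common-subsequence ratio and a greedy threshold rule; both choices, as well as the names `path_similarity`, `cluster_paths`, and the `threshold` value, are assumptions and not the paper's method.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two page sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def path_similarity(a, b):
    """Similarity in [0, 1]: shared ordered pages relative to the longer path."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def cluster_paths(paths, threshold=0.6):
    """Greedy grouping (hypothetical rule): a path joins the first cluster
    whose representative (its first member) is similar enough."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            if path_similarity(path, cluster[0]) >= threshold:
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return clusters
```

Because the similarity is computed over whole page sequences rather than single clicks, two users who share a long ordered run of pages end up together even if they diverge at the end.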
Knowledge Discovery and Data Mining: Towards a Unifying Framework
, 1996
Cited by 108 (0 self)
This paper presents a first step towards a unifying framework for Knowledge Discovery in Databases. We describe links between data mining, knowledge discovery, and other related fields. We then define the KDD process and basic data mining algorithms, discuss application issues and conclude with an analysis of challenges facing practitioners in the field.
Keywords: Knowledge Discovery in Databases (KDD), data mining, overview article, large databases, automated analysis, issues and challenges in data mining.
To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, August 2-4, 1996, AAAI Press. http://wwwaig. jpl.nasa.gov/kdd96
Authors: Usama Fayyad, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, fayyad@microsoft.com; Gregory Piatetsky-Shapiro, GTE Laboratories, MS 44, Waltham, MA 02154, USA, gps@gte.com; Padhraic Smyth, Information and Computer S...
Applications of Data Mining to Electronic Commerce
- Data Mining and Knowledge Discovery
, 2001
Cited by 17 (4 self)
a data mining project. Unfortunately, the other 80% contains several substantial hurdles that without heroic effort may block the successful completion of the project. The following are five desiderata for success. Seldom are they all present in one data mining application.
1. Data with rich descriptions. For example, wide customer records with many potentially useful fields allow data mining algorithms to search beyond obvious correlations.
2. A large volume of data. The large model spaces corresponding to rich data demand many training instances to build reliable models.
3. Controlled and reliable data collection. Manual data entry and integration from legacy systems both are notoriously problematic; fully automated collection is considerably better.
4. The ability to evaluate results. Substantial, demonstrable return on investment can be very convincing.
5. Ease of integration with existing processes. Even if pilot studi
Page header in the source: Ron Kohavi and Foster Provost
A Survey of Methods for Scaling Up Inductive Learning Algorithms
, 1997
Cited by 15 (1 self)
Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By collecting, categorizing, and summarizing past work on scaling up inductive learning algorithms, this paper serves to establish a common ground for researchers addressing the challenge. We begin with a discussion of important, but often tacit, issues related to scaling up learning algorithms. We highlight similarities among methods by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent methods, drawing on specific examples from the published literature. Finally, we use the preceding analysis to suggest how one should proceed when dealing with a large problem, and where future research efforts should be focused. Primary contact: Foster Provost, NYNEX Science and Technology, 400 Westchester Avenue, White Plains, NY 10604 em...
Analysis and Design of Server Informative WWW-sites
, 1997
Cited by 8 (3 self)
The access patterns of the users on a web-site are traditionally investigated in order to improve the user access to the site's information. In this study, however, a systematic approach is introduced in order to analyze the users' navigation paths to the advantage of the web-site owner. As users navigate through a web-site, they are transparently filling out a questionnaire generated by the web-site owner. We first cluster the users who navigate similar paths by employing the Path Mining algorithm. Next, the correlation between a set of target questions and the structure of the WWW-site is quantified. This has been done by borrowing the concept of a channel from information theory. A channel can be considered as an information bridge between the users' path classes and the answers to a questionnaire. By adopting many concepts from information theory, we introduce a natural measure to compute the effectiveness of a WWW-site structure in answering the target questionnaire. Using this measure, we pr...
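The abstract does not name the measure it derives, so as a hedged illustration of the channel idea, the sketch below computes the standard mutual information between path classes and questionnaire answers from a table of co-occurrence counts. Treating mutual information as the effectiveness measure, and the name `mutual_information` with its `joint_counts` layout, are assumptions made for this example.

```python
from math import log2


def mutual_information(joint_counts):
    """I(C; A) in bits, where joint_counts[c][a] counts users in path class c
    who gave answer a (hypothetical input layout)."""
    total = sum(sum(row) for row in joint_counts)
    p_c = [sum(row) / total for row in joint_counts]          # marginal over classes
    p_a = [sum(col) / total for col in zip(*joint_counts)]    # marginal over answers
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, n in enumerate(row):
            if n:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
                p = n / total
                mi += p * log2(p / (p_c[i] * p_a[j]))
    return mi
```

Under this reading, a site whose path classes fully determine a yes/no answer carries one bit through the channel, while a site whose paths are unrelated to the answer carries none.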
Mathematical Programming Approaches To Machine Learning And Data Mining
, 1998
Cited by 5 (0 self)
Machine learning problems of supervised classification, unsupervised clustering and parsimonious approximation are formulated as mathematical programs. The feature selection problem arising in the supervised classification task is effectively addressed by calculating a separating plane by minimizing separation error and the number of problem features utilized. The support vector machine approach is formulated using various norms to measure the margin of separation. The clustering problem of assigning m points in n-dimensional real space to k clusters is formulated as minimizing a piecewise-linear concave function over a polyhedral set. This problem is also formulated in a novel fashion by minimizing the sum of squared distances of data points to nearest cluster planes characterizing the k clusters. The problem of obtaining a parsimonious solution to a linear system where the right-hand-side vector may be corrupted by noise is formulated as minimizing the system residual plus either the number of nonzero elements in the solution vector or the norm of the solution vector. The feature selection problem, the clustering problem and the parsimonious approximation problem can all be stated as the minimization of a concave function over a polyhedral region and are solved by a theoretically justifiable, fast and finite successive linearization algorithm. Numerical tests indicate the utility and efficiency of these formulations on real-world databases. In particular, the feature selection approach via concave minimization computes a separating-plane based classifier that improves upon the generalization ability of a separating plane computed without feature suppression. This approach produces classifiers utilizing fewer original problem features than the support vector machin...
Intelligent Data Analysis: Issues and Challenges
- The Knowledge Engineering Review
, 1996
Cited by 3 (2 self)
this paper, and to Fazel Famili for providing much of the materials regarding the panel activities in Section 3.
Fast Algorithms for Discovering the Maximum Frequent Set
, 1998
This paper addresses two issues in previous algorithms. The first issue is that the algorithms discussed so far require reading the database many times (as many times as the length of the longest frequent itemset). The second issue is that most of the records in the database are not useful in the later passes, since many of the records may not even contain the items in the candidates. In other words, a record that does not contain any item in any candidate can be removed without affecting the support counting process. The first issue is addressed by horizontally dividing the database into equal-sized partitions, each of which can fit in main memory. Each partition is processed independently to produce a local frequent set for that partition in the first pass. The process in each partition uses a bottom-up approach similar to Apriori but with a different data structure (to be discussed later). After all local frequent sets are discovered, their union, called the global candidate set, forms a superset of the actual frequent set. This relies on the fact that if an itemset is frequent, then it must be frequent in at least one of the partitions. Equivalently, if an itemset is not frequent in any partition, then it must be infrequent. During the second pass, the database is read again to produce the actual support for the global candidate set. Therefore, the entire process takes only two passes. The computation in each partition uses a regular bottom-up approach. It extends the length of the candidates by one in every loop until no more candidates can be generated. To prevent reading the database each time the length of the candidates is incremented, the database is transformed into a new data structure, called a TID-list. Each candidate stores a list of the transaction IDs that supp...
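The two-pass scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: transactions are plain lists of items, the support threshold `min_frac` is a fraction of the transactions, and the function names `local_frequent` and `partition_mine` are hypothetical. The sketch also skips Apriori's subset pruning, since intersecting TID-lists already gives exact local support.

```python
from itertools import combinations
from math import ceil


def local_frequent(partition, min_frac):
    """Bottom-up search inside one partition using TID-lists."""
    min_count = ceil(min_frac * len(partition))
    # TID-list of each single item: ids of local transactions containing it.
    tids = {}
    for tid, items in enumerate(partition):
        for item in items:
            tids.setdefault(frozenset([item]), set()).add(tid)
    level = {s: t for s, t in tids.items() if len(t) >= min_count}
    frequent = set(level)
    while level:
        nxt = {}
        # Extend candidate length by one: join pairs that differ in one item.
        for a, b in combinations(level, 2):
            cand = a | b
            if len(cand) == len(a) + 1 and cand not in nxt:
                # Support of the candidate = size of the intersected TID-lists.
                shared = level[a] & level[b]
                if len(shared) >= min_count:
                    nxt[cand] = shared
        frequent |= set(nxt)
        level = nxt
    return frequent


def partition_mine(transactions, min_frac, n_parts):
    """Two passes over a non-empty list of transactions."""
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Pass 1: the union of local frequent sets is the global candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac)
    # Pass 2: read the data once more to count the actual support.
    counts = dict.fromkeys(candidates, 0)
    for items in transactions:
        present = set(items)
        for cand in candidates:
            if cand <= present:
                counts[cand] += 1
    min_count = ceil(min_frac * len(transactions))
    return {c: n for c, n in counts.items() if n >= min_count}
```

Applying the same fractional threshold inside each partition is what makes the union a superset: an itemset that falls below the fraction in every partition cannot reach it over the whole database.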

