• Documents
  • Authors
  • Tables
  • Log in
  • Sign up
  • MetaCart
  • DMCA
  • Donate

CiteSeerX logo

Advanced Search Include Citations

Tools

Sorted by:
Try your query at:
Semantic Scholar Scholar Academic
Google Bing DBLP
Results 1 - 10 of 4,362
Next 10 →

Information-theoretic tools for mining database structure from large data sets

by Periklis Andritsos, Renée J. Miller, Panayiotis Tsaparas - In SIGMOD , 2004
"... Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is ofte ..."
Abstract - Cited by 25 (3 self) - Add to MetaCart
, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical

Mining Association Rules between Sets of Items in Large Databases

by Rakesh Agrawal, Tomasz Imielinski, Arun Swami - IN: PROCEEDINGS OF THE 1993 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, WASHINGTON DC (USA , 1993
"... We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel esti ..."
Abstract - Cited by 3331 (16 self) - Add to MetaCart
estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.

CURE: An Efficient Clustering Algorithm for Large Data sets

by Sudipto Guha, Rajeev Rastogi, Kyuseok Shim - Published in the Proceedings of the ACM SIGMOD Conference , 1998
"... Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering ..."
Abstract - Cited by 722 (5 self) - Add to MetaCart
of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE

Data Mining: Concepts and Techniques

by Jiawei Han, Micheline Kamber , 2000
"... Our capabilities of both generating and collecting data have been increasing rapidly in the last several decades. Contributing factors include the widespread use of bar codes for most commercial products, the computerization of many business, scientific and government transactions and managements, a ..."
Abstract - Cited by 3142 (23 self) - Add to MetaCart
warehouses, and other massive information repositories. Data mining is a multidisciplinary field, drawing work from areas including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge based systems, knowledge acquisition, information

Mining the Network Value of Customers

by Pedro Domingos, Matt Richardson - In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining , 2002
"... One of the major applications of data mining is in helping companies determine which potential customers to market to. If the expected pro t from a customer is greater than the cost of marketing to her, the marketing action for that customer is executed. So far, work in this area has considered only ..."
Abstract - Cited by 568 (11 self) - Add to MetaCart
as a set of independent entities, we view it as a social network and model it as a Markov random eld. We show the advantages of this approach using a social network mined from a collaborative ltering database. Marketing that exploits the network value of customers|also known as viral marketing

OPTICS: Ordering Points To Identify the Clustering Structure

by Mihael Ankerst, Markus M. Breunig, Hans-peter Kriegel, Jörg Sander , 1999
"... Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of ..."
Abstract - Cited by 527 (51 self) - Add to MetaCart
Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

by Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao - DATA MINING AND KNOWLEDGE DISCOVERY , 2004
"... Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still co ..."
Abstract - Cited by 1752 (64 self) - Add to MetaCart
-tree- based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our

Building a Large Annotated Corpus of English: The Penn Treebank

by Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz - COMPUTATIONAL LINGUISTICS , 1993
"... There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information abou ..."
Abstract - Cited by 2740 (10 self) - Add to MetaCart
about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable

Loopy belief propagation for approximate inference: An empirical study. In:

by Kevin P Murphy , Yair Weiss , Michael I Jordan - Proceedings of Uncertainty in AI, , 1999
"... Abstract Recently, researchers have demonstrated that "loopy belief propagation" -the use of Pearl's polytree algorithm in a Bayesian network with loops -can perform well in the context of error-correcting codes. The most dramatic instance of this is the near Shannon-limit performanc ..."
Abstract - Cited by 676 (15 self) - Add to MetaCart
. For each experimental run, we first gen erated random CPTs. We then sampled from the joint distribution defined by the network and clamped the observed nodes (all nodes in the bottom layer) to their sampled value. Given a structure and observations, we then ran three inference algorithms -junction tree

Efficient Algorithms for Mining Outliers from Large Data Sets

by Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim , 2000
"... In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its k th nearest neighbor. We rank each point on the basis of its distance to its k th nearest neighbor and declare the top n points in this ranking to be outliers. In addition ..."
Abstract - Cited by 322 (0 self) - Add to MetaCart
, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal
Next 10 →
Results 1 - 10 of 4,362
Powered by: Apache Solr
  • About CiteSeerX
  • Submit and Index Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University