TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
, 1999
Cited by 50 (0 self)
this paper, we also consider the approximate dependency inference task: given a relation r and a threshold ε, find all minimal non-trivial approximate dependencies
Discovering all Most Specific Sentences by Randomized Algorithms (Extended Abstract)
- In Intl. Conf. on Database Theory
, 1997
Cited by 47 (5 self)
Dimitrios Gunopulos (1), Heikki Mannila (2), and Sanjeev Saluja (3). (1) Max-Planck-Institut Informatik, Im Stadtwald, 66123 Saarbrücken, Germany. gunopulo@mpi-sb.mpg.de. (2) University of Helsinki, Dept. of Computer Science, FIN-00014 Helsinki, Finland. Heikki.Mannila@cs.helsinki.fi. Work supported by Alexander von Humboldt-Stiftung and the Academy of Finland. (3) Max-Planck-Institut Informatik, Im Stadtwald, 66123 Saarbrücken, Germany. saluja@mpi-sb.mpg.de
Abstract. Data mining can in many instances be viewed as the task of computing a representation of a theory of a model or a database. In this paper we present a randomized algorithm that can be used to compute the representation of a theory in terms of the most specific sentences of that theory. In addition to randomization, the algorithm uses a generalization of the concept of hypergraph transversal. We apply the general algorithm for discovering maximal frequent sets in 0/1 data, and for computing minimal keys in relations. We prese...
Efficient Discovery of Functional and Approximate Dependencies Using Partitions (Extended version)
- In ICDE
, 1997
Cited by 46 (1 self)
Discovery of functional dependencies from relations has been identified as an important database analysis technique. In this paper, we present a new approach for finding functional dependencies from large databases, based on partitioning the set of rows with respect to their attribute values. The use of partitions makes the discovery of approximate functional dependencies easy and efficient, and the erroneous or exceptional rows can be identified easily. Experiments show that the new algorithm is efficient in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.
Computing Reviews Categories and Subject Descriptors: H.3.1 Content Analysis and Indexing; F.2.2 Nonnumerical Algorithms and Problems; I.2.6 Learning.
General Terms: Algorithms, Experimentation.
Additional Key Words and Phrases: Knowledge Discovery, Data Mining, Func...
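The partition idea in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm. It only shows, under stated assumptions, how grouping rows by their left-hand-side values lets one test a dependency and count exceptional rows. The function names (partition, fd_error, holds), the list-of-dicts row format, and the error measure (fraction of rows that would have to be removed) are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def fd_error(rows, lhs, rhs):
    """Fraction of rows that must be removed for lhs -> rhs to hold.

    Assumed error measure (hypothetical choice): within each class of the
    lhs partition, keep the rows carrying the most common rhs value and
    count every other row as an exception.
    """
    if not rows:
        return 0.0
    kept = 0
    for cls in partition(rows, lhs):
        counts = defaultdict(int)
        for i in cls:
            counts[rows[i][rhs]] += 1
        kept += max(counts.values())
    return (len(rows) - kept) / len(rows)


def holds(rows, lhs, rhs, threshold=0.0):
    """True if lhs -> rhs holds exactly (threshold 0) or approximately."""
    return fd_error(rows, lhs, rhs) <= threshold


# Hypothetical sample relation: one row violates zip -> city.
rows = [
    {"zip": "10001", "city": "New York", "name": "a"},
    {"zip": "10001", "city": "New York", "name": "b"},
    {"zip": "94105", "city": "San Francisco", "name": "c"},
    {"zip": "94105", "city": "Oakland", "name": "d"},
]
print(fd_error(rows, ["zip"], "city"))
print(holds(rows, ["zip"], "city", threshold=0.25))
```

With a threshold of zero the check is an exact dependency test. A positive threshold tolerates a bounded share of exceptional rows, and those rows are the ones outside the majority value in each class.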
Discovering All Most Specific Sentences
- ACM Transactions on Database Systems
, 2003
Cited by 39 (2 self)
this article, we show how the problems of finding frequent sets in relations and of finding minimal keys in databases can be reduced to this formulation. Using this theory extraction formulation [Mannila 1995, 1996; Mannila and Toivonen 1997], one can formulate general results about the complexity of algorithms for these data mining tasks
WebTables: Exploring the Power of Tables on the Web
, 2008
Cited by 39 (4 self)
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power
Inferring Dependencies from Relations: A Conceptual Clustering Approach
- Computational Intelligence
, 1999
Cited by 3 (1 self)
In this paper we consider two related types of data dependencies that can hold in a relation: conjunctive implication rules between attribute-value pairs, and functional dependencies. We present a conceptual clustering approach that can be used, with some small modifications, for inferring a cover for both types of dependencies. The approach consists of two steps. First, a particular clustered representation of the relation, called a concept (or Galois) lattice, is built; then, a cover is extracted from the lattice built in the earlier step. The main emphasis of this paper is on the second step. We study the computational complexity of the proposed approach and present an experimental comparison with other methods that confirms its validity. The results of the experiments show that our algorithm for extracting implication rules from concept lattices clearly outperforms an earlier algorithm, and suggest that the overall lattice-based approach to inferring functional dependencies from relations can be seen as an alternative to traditional methods.
Efficiently detecting inclusion dependencies
- In Int. Conf. on Data Engineering (ICDE 07)
, 2007
Cited by 3 (0 self)
Data sources for data integration often come with spurious schema definitions such as undefined foreign key constraints. Such metadata are important for querying the database and for database integration. We present our algorithm SPIDER (Single Pass Inclusion DEpendency Recognition) for detecting inclusion dependencies, as these are the automatically testable part of a foreign key constraint. For IND detection all pairs of attributes must be tested. SPIDER solves this task very efficiently by testing all attribute pairs in parallel. It analyzes a 2 GB database in ∼20 min and a 21 GB database in ∼4 h.
1. Schema Discovery for Data Integration
In large integration projects one is often confronted with undocumented data sources. One important schema information
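To make the notion of an inclusion dependency concrete, the following sketch tests every ordered pair of attributes with plain set containment. It is a naive baseline, not SPIDER: the abstract states that SPIDER tests all attribute pairs in parallel in a single pass, which this sketch does not attempt. The function name unary_inds, the column-name-to-values input format, and the decision to ignore nulls (None) are assumptions made for illustration.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dep, ref) such that every non-null value of
    column dep also occurs in column ref.

    columns maps a column name to a list of its values (hypothetical
    input format). Note that an empty or all-null column is trivially
    included in every other column.
    """
    value_sets = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    return sorted(
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] <= value_sets[ref]
    )


# Hypothetical sample data: orders.customer_id is a foreign key candidate.
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.age": [30, 41, 30, 25],
}
print(unary_inds(columns))
```

A satisfied inclusion dependency is only a candidate foreign key. As the abstract notes, it is the automatically testable part of the constraint, so a match still needs review.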
Combining Inductive and Deductive Inference in Knowledge Management Tasks
- In 11th Intl. Workshop on Database and Expert Systems Applications. IEEE Computer Society
, 2000
Cited by 2 (2 self)
This paper indicates how different logic programming technologies can underpin an architecture for distributed knowledge management in which higher throughput in information supply is achieved by a (semi-)automated solution to the more challenging problem of knowledge creation. The paper first proposes working definitions of the notions of data, knowledge and information in purely logical terms, and then shows how existing technologies can be combined into an inference engine, referred to as a knowledge, information and data engine (KIDE), integrating inductive and deductive capabilities. The paper then briefly introduces the notion of virtual organizations and uses the set-up stage of virtual organizations to exemplify the value-adding potential of KIDEs in knowledge management contexts.
Combining Inductive and Deductive Engines for Knowledge Management
, 2000
Cited by 1 (1 self)
This paper indicates how different logic programming technologies can underpin an architecture for distributed knowledge management in which higher throughput in information supply is achieved by a (semi-)automated solution to the more challenging problem of knowledge creation. The paper first proposes working definitions of the notions of data, knowledge and information in purely logical terms, and then shows how existing technologies can be combined into an inference engine, referred to as a knowledge, information and data engine (KIDE), integrating inductive and deductive capabilities. The paper then describes an architecture for KIDEs, briefly introduces the notion of virtual organizations and uses the set-up stage of virtual organizations to exemplify the value-adding potential of KIDEs in knowledge management contexts.

