Levelwise Search and Borders of Theories in Knowledge Discovery
, 1997
Cited by 211 (13 self)
 
One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L, determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.
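As a rough illustration of the levelwise idea described in this abstract, the sketch below instantiates it for frequent sets: sentences are item sets, and the selection predicate is "occurs in at least `min_support` rows". The function name, the choice of frequent sets as the sentence class, and the omission of the empty set are all assumptions made for the example; this is not the paper's own formulation. Rejected candidates are collected separately, since in this setting they are the minimal uninteresting sets.

```python
from itertools import combinations


def levelwise_frequent_sets(rows, min_support):
    """Levelwise search over item sets (hypothetical helper).

    Returns (theory, negative_border): the frequent non-empty item sets,
    and the candidates that were evaluated and rejected.
    """
    items = sorted({item for row in rows for item in row})

    def is_interesting(candidate):
        # One "database access": count the rows containing the candidate.
        return sum(1 for row in rows if candidate <= row) >= min_support

    theory, negative_border = set(), set()
    level = [frozenset([item]) for item in items]
    while level:
        accepted = []
        for candidate in level:
            if is_interesting(candidate):
                accepted.append(candidate)
                theory.add(candidate)
            else:
                negative_border.add(candidate)
        # Next level: sets one item larger, all of whose subsets of the
        # current size were accepted.
        accepted_set = set(accepted)
        size = len(level[0]) + 1
        next_level = set()
        for a, b in combinations(accepted, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in accepted_set
                for sub in combinations(union, size - 1)
            ):
                next_level.add(union)
        level = sorted(next_level, key=sorted)
    return theory, negative_border
```

For example, calling it on the rows `{"a","b","c"}`, `{"a","b"}`, `{"a","c"}`, `{"b"}` with `min_support=2` evaluates the three singletons and the three pairs, and never evaluates the triple, because one of its pairs was rejected.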
An Incremental Algorithm for a Generalization of the Shortest-Path Problem
, 1992
Cited by 116 (1 self)
 
The grammar problem, a generalization of the single-source shortest-path problem introduced by Knuth, is to compute the minimum-cost derivation of a terminal string from each nonterminal of a given context-free grammar, with the cost of a derivation being suitably defined. This problem also subsumes the problem of finding optimal hyperpaths in directed hypergraphs (under varying optimization criteria) that has received attention recently. In this paper we present an incremental algorithm for a version of the grammar problem. As a special case of this algorithm we obtain an efficient incremental algorithm for the single-source shortest-path problem with positive edge lengths. The aspect of our work that distinguishes it from other work on the dynamic shortest-path problem is its ability to handle "multiple heterogeneous modifications": between updates, the input graph is allowed to be restructured by an arbitrary mixture of edge insertions, edge deletions, and edge-length changes.
Directed Hypergraphs And Applications
, 1992
Cited by 100 (5 self)
 
We deal with directed hypergraphs as a tool to model and solve some classes of problems arising in Operations Research and in Computer Science. Concepts such as connectivity, paths and cuts are defined. An extension of the main duality results to a special class of hypergraphs is presented. Algorithms to perform visits of hypergraphs and to find optimal paths are studied in detail. Some applications arising in propositional logic, And-Or graphs, relational databases and transportation analysis are presented. January 1990; revised October 1992. (*) This research has been supported in part by the "Comitato Nazionale Scienza e Tecnologia dell'Informazione", National Research Council of Italy, under Grant n.89.00208.12, and in part by research grants from the National Research Council of Canada. Affiliations: (1) Dipartimento di Informatica, Università di Pisa, Italy; (2) Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Canada. Introduction: Hypergraphs, a generaliz...
Data mining, hypergraph transversals, and machine learning
, 1997
Cited by 65 (5 self)
 
Several data mining problems can be formulated as problems of finding maximally specific sentences that are interesting in a database. We first show that this problem has a close relationship with the hypergraph transversal problem. We then analyze two algorithms that have been previously used in data mining, proving upper bounds on their complexity. The first algorithm is useful when the maximally specific interesting sentences are "small". We show that this algorithm can also be used to efficiently solve a special case of the hypergraph transversal problem, improving on previous results. The second algorithm utilizes a subroutine for hypergraph transversals, and is applicable in more general situations, with complexity close to a lower bound for the problem. We also relate these problems to the model of exact learning in computational learning theory, and use the correspondence to derive some corollaries.
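For readers unfamiliar with the hypergraph transversal problem mentioned in this abstract, here is a minimal sketch of one simple way to enumerate minimal transversals: process the edges one at a time, extend the transversals that miss the new edge, and discard non-minimal sets. This is a textbook-style incremental procedure chosen for clarity, not the algorithms analyzed in the paper; the function name and the representation of a hypergraph as a list of vertex sets are assumptions.

```python
def minimal_transversals(edges):
    """Enumerate the minimal transversals (hitting sets) of a hypergraph.

    `edges` is an iterable of sets of vertices (hypothetical representation).
    Simple incremental sketch; it can take exponential time in general.
    """
    transversals = [frozenset()]
    for edge in edges:
        extended = set()
        for t in transversals:
            if t & edge:
                # t already intersects the new edge, so it stays as is.
                extended.add(t)
            else:
                # Otherwise extend t by each vertex of the new edge in turn.
                for v in edge:
                    extended.add(t | {v})
        # Keep only the inclusion-minimal sets.
        transversals = [
            t for t in extended if not any(other < t for other in extended)
        ]
    return set(transversals)
```

With the two edges `{1, 2}` and `{2, 3}`, the minimal transversals are `{2}` and `{1, 3}`: the candidate `{1, 2}` is generated along the way and then discarded because it contains `{2}`.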
Discovering All Most Specific Sentences
 ACM Transactions on Database Systems
, 2003
Cited by 55 (3 self)
 
In this article, we show how the problems of finding frequent sets in relations and of finding minimal keys in databases can be reduced to this formulation. Using this theory extraction formulation [Mannila 1995, 1996; Mannila and Toivonen 1997], one can formulate general results about the complexity of algorithms for these data mining tasks.
Discovering all Most Specific Sentences by Randomized Algorithms (Extended Abstract)
 In Intl. Conf. on Database Theory
, 1997
Cited by 54 (5 self)
 
Dimitrios Gunopulos (1), Heikki Mannila (2), and Sanjeev Saluja (3). (1) Max-Planck-Institut Informatik, Im Stadtwald, 66123 Saarbrücken, Germany. gunopulo@mpisb.mpg.de (2) University of Helsinki, Dept. of Computer Science, FIN-00014 Helsinki, Finland. Heikki.Mannila@cs.helsinki.fi. Work supported by Alexander von Humboldt-Stiftung and the Academy of Finland. (3) Max-Planck-Institut Informatik, Im Stadtwald, 66123 Saarbrücken, Germany. saluja@mpisb.mpg.de Abstract. Data mining can in many instances be viewed as the task of computing a representation of a theory of a model or a database. In this paper we present a randomized algorithm that can be used to compute the representation of a theory in terms of the most specific sentences of that theory. In addition to randomization, the algorithm uses a generalization of the concept of hypergraph transversal. We apply the general algorithm for discovering maximal frequent sets in 0/1 data, and for computing minimal keys in relations. We prese...
Neighborhood-Based Models for Social Networks
 Sociological Methodology
, 2002
Cited by 54 (4 self)
 
We argue that social networks can be modeled as the outcome of processes that occur in overlapping local regions of the network, termed local social neighborhoods. Each neighborhood is conceived as a possible site of interaction and corresponds to a subset of possible network ties. In this paper, we discuss hypotheses about the form of these neighborhoods, and we present two new and theoretically plausible ways in which neighborhood-based models for networks can be constructed. In the first, we introduce the notion of a setting structure, a directly hypothesized (or observed) set of exogenous constraints on possible neighborhood forms. In the second, we propose higher-order neighborhoods that are generated, in part, by the outcome of interactive network processes themselves. Applications of both approaches to model construction are presented, and the developments are considered within a general conceptual framework of locale for social networks. We show how assumptions about neighborhoods can be cast within a hierarchy of increasingly complex models; these models represent a progressively greater capacity for network processes to “reach” across a network through long cycles or semipaths. We argue that this class of models holds new promise for the development of empirically plausible models for networks and network-based processes.
Parallel Algorithms for Discovery of Association Rules
 DATA MINING AND KNOWLEDGE DISCOVERY
, 1997
Cited by 53 (6 self)
 
Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial setup phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the setup phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and
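The vertical layout and intersection-based counting mentioned in this abstract can be illustrated with a small sequential sketch. It assumes a vertical layout means one set of transaction identifiers per item; the helper names `to_vertical` and `support` are hypothetical, and nothing here reproduces the paper's clustering or parallel scheme.

```python
def to_vertical(transactions):
    """Convert a list of item sets into a vertical layout:
    item -> set of ids of the transactions that contain it."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def support(itemset, tidlists):
    """Support of a non-empty itemset, computed by intersecting the
    per-item transaction-id sets instead of rescanning the transactions."""
    if not itemset:
        raise ValueError("itemset must be non-empty")
    id_sets = [tidlists.get(item, set()) for item in itemset]
    return len(set.intersection(*id_sets))
```

Once the vertical layout has been built in a single pass, the support of any candidate itemset follows from set intersections alone, which is the property the abstract contrasts with repeated database scans.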
On an algorithm for finding all interesting sentences (Extended Abstract)
 In Cybernetics and Systems, Volume II, The Thirteenth European Meeting on Cybernetics and Systems Research
, 1996
Cited by 33 (9 self)
 
Knowledge discovery in databases (KDD), also called data mining, has recently received wide attention from practitioners and researchers. One of the basic problems in KDD is the following: given a data set r, a class L of sentences defining subgroups or properties of r, and an interestingness predicate, find all sentences of L deemed interesting by the interestingness predicate. In this paper we analyze a simple and well-known levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. We also consider the verification problem of a KDD process: given r and a set of sentences T ⊆ L, determine whether T is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.
Path-based depth-first search for strong and biconnected components
 Information Processing Letters
, 2000
Cited by 30 (0 self)
 
Key words: Graph, depth-first search, strongly connected component, biconnected component, stack.
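Since the entry above gives only keywords, the following sketch may help show what a path-based depth-first search for strongly connected components can look like. It is a generic two-stack formulation written for this listing, under the assumption that the graph is a dict mapping each vertex to a list of successors; it is not taken from the paper, and it covers strong components only.

```python
def strongly_connected_components(graph):
    """Path-based DFS for strong components (illustrative sketch).

    `graph` maps each vertex to an iterable of successors (assumed format).
    Recursive, so very deep graphs may hit Python's recursion limit.
    """
    preorder = {}      # vertex -> DFS preorder number
    assigned = set()   # vertices already placed in a component
    open_stack = []    # visited vertices not yet assigned to a component
    path_stack = []    # candidate component roots on the current DFS path
    components = []

    def visit(v):
        preorder[v] = len(preorder)
        open_stack.append(v)
        path_stack.append(v)
        for w in graph.get(v, ()):
            if w not in preorder:
                visit(w)
            elif w not in assigned:
                # Edge back into the open region: every root candidate newer
                # than w lies on a cycle through w, so merge by popping.
                while preorder[path_stack[-1]] > preorder[w]:
                    path_stack.pop()
        if path_stack[-1] == v:
            # v is the root of a component: pop its members off open_stack.
            path_stack.pop()
            component = set()
            while True:
                w = open_stack.pop()
                assigned.add(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in preorder:
            visit(v)
    return components
```

The second stack takes the place of the per-vertex low-link values used by other strong-component algorithms: a vertex is a component root exactly when it is still on top of `path_stack` after all its successors have been explored.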