## Towards Automated Synthesis of Data Mining Programs (1999)


Venue: Proc. 5th Intl. Conf. Knowledge Discovery and Data Mining

Citations: 5 (4 self)

### BibTeX

@INPROCEEDINGS{Buntine99towardsautomated,

author = {Wray Buntine},

title = {Towards Automated Synthesis of Data Mining Programs},

booktitle = {Proc. 5th Intl. Conf. Knowledge Discovery and Data Mining},

year = {1999},

pages = {372--376},

publisher = {ACM Press}

}

### Abstract

Code synthesis is routinely used in industry to generate GUIs, form-filling applications, and database support code, and is even used with COBOL. In this paper we consider the question of whether code synthesis could also be applied to the data mining phase of knowledge discovery. We view this as a rapid prototyping method. Rapid prototyping of statistical data analysis algorithms would allow experienced analysts to experiment with different statistical models before choosing one, but without requiring prohibitively expensive programming efforts. It would also smooth the steep learning curve often faced by novice users of data mining tools and libraries. Finally, it would accelerate dissemination of essential research results and the development of applications. In this paper, we present a framework and the basic software for the automated synthesis of data analysis programs. We use a specification language that generalizes Bayesian networks, a popular notation used in many communities. Using decomposition methods and algorithm templates, our system transforms the network through several levels of representation and then finally into pseudocode, which can be translated into the implementation language of choice. Here, we explain the framework on a mixture of Gaussians model, a core data mining algorithm at the heart of many commercial clustering tools. We mention the effectiveness of our framework by generating pseudocode for some more sophisticated algorithms from recent literature.
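To make the abstract's running example concrete, the following is a minimal hand-written sketch of the kind of program such a system would target: fitting a one-dimensional mixture of Gaussians with the standard Expectation-Maximization (EM) algorithm. It is not the pseudocode produced by the paper's system; the function names `normal_pdf` and `em_gaussian_mixture`, the fixed iteration count, and the variance floor are all illustrative assumptions.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(data, means, stds, weights, iterations=50):
    """Textbook EM for a 1-D Gaussian mixture (hypothetical helper, not from the paper).

    means, stds, weights are initial guesses, one entry per component.
    Assumes every point has non-zero density under the initial guess.
    """
    n, k = len(data), len(means)
    means, stds, weights = list(means), list(stds), list(weights)
    for _ in range(iterations):
        # E-step: posterior probability that each point came from each component.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = math.sqrt(max(var, 1e-6))  # assumed floor to avoid collapse
    return means, stds, weights


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
    sample += [rng.gauss(5.0, 1.0) for _ in range(200)]
    print(em_gaussian_mixture(sample, [-1.0, 6.0], [1.0, 1.0], [0.5, 0.5]))
```

Writing even this small model by hand requires deriving the E-step and M-step updates; the abstract's argument is that such derivations could instead be generated from a Bayesian network specification.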

### Citations

799 | A view of the EM algorithm that justifies incremental, sparse, and other variants
- Neal, Hinton
- 1998
Citation Context ...nd throw away the information regarding the original Gaussian source. This kind of problem is traditionally handled using an algorithm known as Expectation-Maximization (EM); our presentation follows [12]. In the mixture of Gaussian problem, one common interpretation of the learning task is to seek to maximize $\sum_{i=1}^{N} \log \Pr(x_i \mid \vec{\rho}, \vec{\mu}, \vec{\sigma})$. The problem here is that the inner probabilities are th...

562 | Distributional Clustering of English Words
- Pereira, Tishby, et al.
- 1993
Citation Context ...urve clustering" model suggested by Smyth which attempts to fit multiple curves at once. Our system yielded correct pseudo-code in all cases. We also modelled the distributional clustering framework of [14] but without introducing their "temperature" parameter. This method is the basis of techniques for featurizing documents by generating clusters of related words, and versions of it are used in text mi...

461 | Graphical Models in Applied Multivariate Statistics
- Whittaker
- 1990
Citation Context ...typing. Second, Bayesian networks provide a ready, unifying specification language, as seen by their widespread use in communities such as applied Bayesian statistics and neural information processing [17, 8]; their role for the data mining community is to provide a flexible data modeling language [4]. Finally, program synthesis has been proven to be competitive in other domains. It offers: Rapid turn-around: ...

256 | Operations for learning with graphical models
- Buntine
- 1994
Citation Context ...sis programs from Bayesian network specifications using a library of efficient algorithm templates together with core special-purpose algorithms and general purpose solvers, related to a suggestion from [3]. We show that this approach can address nontrivial data analysis problems. Our approach is motivated by three observations. First, the success of BUGS [16] demonstrates the need for data analysis too...

147 | Independence properties of directed Markov fields
- Lauritzen, Dawid, et al.
- 1990
Citation Context ...k to determine probabilities over indexed vectors. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework which uses ancestral sets and set separation [9]. 2.3 Expressions for probabilities Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: Lemma 1 Let $U, V$ be sets of variables ...

70 | BUGS: A program to perform Bayesian inference using Gibbs sampling
- Thomas, Spiegelhalter, et al.
- 1992
Citation Context ...urpose solvers, related to a suggestion from [3]. We show that this approach can address nontrivial data analysis problems. Our approach is motivated by three observations. First, the success of BUGS [16] demonstrates the need for data analysis tools suitable for reliable rapid prototyping. Second, Bayesian networks provide a ready, unifying specification language, as seen by their widespread use in co...

53 | Factorial learning and the EM algorithm
- Ghahramani
- 1995
Citation Context ...ethod is the basis of techniques for featurizing documents by generating clusters of related words, and versions of it are used in text mining. We also encoded the factorial Gaussian mixture model of [6] which uses multiple hidden variables to capture different hidden causes. This is a more complex model that may be important for analysing image data. We modeled the E-step via a combinatorial exact com...

53 | Generating Bayesian networks from probability logic knowledge bases
- Haddawy
- 1994
Citation Context ...ric single components, $x_i$, and particular single components, $x_5$. Our system uses Prolog-terms; a theory of indexed Bayesian networks, where indices are represented as Prolog variables is developed in [7]. The conditions required on the depends-literals 2 are as follows: (1) each term in the second argument matches a term in the first argument of some other depends-literals, (2) any variable in the seco...

33 | A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants
- Neal, Hinton
- 1998
Citation Context ...nd throw away the information regarding the original Gaussian source. This kind of problem is traditionally handled using an algorithm known as Expectation-Maximization (EM); our presentation follows [12]. In the mixture of Gaussian problem, one common interpretation of the learning task is to seek to maximize $\sum_{i=1}^{N} \log \Pr(x_i \mid \vec{\rho}, \vec{\mu}, \vec{\sigma})$. The problem here is that the inner probabilities are themselves ...

29 | Graphical models for discovering knowledge
- Buntine
- 1995
Citation Context ...their widespread use in communities such as applied Bayesian statistics and neural information processing [17, 8]; their role for the data mining community is to provide a flexible data modeling language [4]. Finally, program synthesis has been proven to be competitive in other domains. It offers: Rapid turn-around: even for large tasks mature synthesis systems usually require less than a few minutes to p...

27 | Efficient inference in Bayes nets as a combinatorial optimization problem. Int'l Journal of Approximate Reasoning
- Li, D'Ambrosio
- 1994
Citation Context ...ng probability statement $\Pr(U \mid V) = \Pr(U \mid \mathrm{parents}(U)) = \prod_{u \in U} \Pr(u \mid \mathrm{parents}(u))$. How can probabilities not satisfying these conditions be converted to symbolic expressions? Symbolic probabilistic inference [10], for instance, extracts an efficient expression for a particular marginal probability, $p(U)$. We have developed another result that lets us extract probabilities on a large class of mixed discrete and re...

25 | Amphion: Automatic programming for scientific subroutine libraries
- Lowry, Philpot, et al.
- 1994
Citation Context ... synthesis has been proven to be competitive in other domains. It offers: Rapid turn-around: even for large tasks mature synthesis systems usually require less than a few minutes to produce code [11, 2]. Reliability: synthesized code is used in production systems to schedule military logistics [2] or to price stock options [15]. Efficiency: synthesized code can be an order of magnitude fas...

19 | Independence properties of directed Markov fields
- Lauritzen, Dawid, et al.
- 1990
Citation Context ...k to determine probabilities over indexed vectors. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework which uses ancestral sets and set separation [9]. 2.3 Expressions for probabilities Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: Lemma 1 Let $U, V$ be sets of variables ...

17 | Planware - Domain-Specific Synthesis of High-Performance Schedulers
- Blaine, Gilham, et al.
- 1998
Citation Context ... synthesis has been proven to be competitive in other domains. It offers: Rapid turn-around: even for large tasks mature synthesis systems usually require less than a few minutes to produce code [11, 2]. Reliability: synthesized code is used in production systems to schedule military logistics [2] or to price stock options [15]. Efficiency: synthesized code can be an order of magnitude fas...

5 | Planware: Domain-specific synthesis of high-performance schedulers
- Blaine, Gilham, et al.
- 1998
Citation Context ...rogram synthesis has been proven to be competitive in other domains. It offers: Rapid turn-around: even for large tasks mature synthesis systems usually require less than a few minutes to produce code [11, 2]. Reliability: synthesized code is used in production systems to schedule military logistics [2] or to price stock options [15]. Efficiency: synthesized code can be an order of magnitude faster than han...

4 | Automatic synthesis of financial modeling codes
- Randall, Kant, et al.
- 1996
Citation Context ... systems usually require less than a few minutes to produce code [11, 2]. Reliability: synthesized code is used in production systems to schedule military logistics [2] or to price stock options [15]. Efficiency: synthesized code can be an order of magnitude faster than hand-crafted special-purpose code [2]. However, to the best of our knowledge, program synthesis has not previously been app...

2 | Amphion: Automatic programming for scientific subroutine libraries
- Lowry, Philpot, et al.
- 1994
Citation Context ...rogram synthesis has been proven to be competitive in other domains. It offers: Rapid turn-around: even for large tasks mature synthesis systems usually require less than a few minutes to produce code [11, 2]. Reliability: synthesized code is used in production systems to schedule military logistics [2] or to price stock options [15]. Efficiency: synthesized code can be an order of magnitude faster than han...

1 | Compiler optimizations for high-performance computing
- Bacon, Graham, et al.
- 1994
Citation Context ... advantages for data analysis, other than rapid prototyping, are that generated code should be time and space efficient; to achieve this we would rely on high-performance optimizing compiler techniques [1] coupled to our pseudo-code, as discussed in Section 3.2. 2 Preliminaries 2.1 A simple problem As a simple running example to illustrate our concepts we will use mixture of Gaussians (cf. Fig. 1). It ...

1 | Analysing rock samples for the Mars lander
- Oliver, Roush, et al.
- 1998

1 | Automatic synthesis of financial modeling codes
- Randall, Kant, et al.
- 1996
Citation Context ...hesis systems usually require less than a few minutes to produce code [11, 2]. Reliability: synthesized code is used in production systems to schedule military logistics [2] or to price stock options [15]. Efficiency: synthesized code can be an order of magnitude faster than hand-crafted special-purpose code [2]. However, to the best of our knowledge, program synthesis has not previously been applied to...