## Conceptual Clustering with Numeric-and-Nominal Mixed Data - A New Similarity Based System (1998)

Venue: | in IEEE Transcript on KCE |

Citations: | 5 - 1 self |

### BibTeX

@INPROCEEDINGS{Li98conceptualclustering,

author = {Cen Li and Gautam Biswas},

title = {Conceptual Clustering with Numeric-and-Nominal Mixed Data - A New Similarity Based System},

booktitle = {in IEEE Transcript on KCE},

year = {1998}

}

### Abstract

This paper presents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure proposed by Goodall for biological taxonomy [13], which gives greater weight to uncommon feature-value matches in similarity computations and makes no assumptions about the underlying distributions of the feature values, is adopted to define the similarity between pairs of objects. An agglomerative algorithm is employed to construct a concept tree, and a simple distinctness heuristic is used to extract a partition of the data. The performance of SBAC has been studied on artificially generated data sets. Results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other schemes illustrate the superior performance of the algorithm.

1 Introduction

The widespread use of computers and information technology has made extensive data collection in businesses, manufacturing, an...

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...embership, direct optimization approaches for finding the best parameter values from mixture pdf is computationally intractable even for moderate size data sets. A variation of Dempster's EM algorithm [4] is used to approximate the solution where weighted instance assignments and weighted statistics, calculated from normalized class probability and data instances, are used to represent known assignmen...

3921 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ... [Figure 1: Discovery system architecture with unsupervised classification discovery engine] Traditional clustering methodologies [5, 15] assume features are numeric-valued, but as applications extend from scientific and engineering domains to the medical, business, and social domains, one has to deal with features, such as sex, color,...

2152 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ... [Figure 1: Discovery system architecture with unsupervised classification discovery engine] Traditional clustering methodologies [5, 15] assume features are numeric-valued, but as applications extend from scientific and engineering domains to the medical, business, and social domains, one has to deal with features, such as sex, color,...

643 | Knowledge acquisition via incremental conceptual clustering
- Fisher
- 1987
Citation Context ...rity between objects [5, 15]. On the other hand, conceptual clustering systems use conditional probability estimates as a means for defining the relation between groups or clusters. Systems like COBWEB [6] and its derivatives use the Category Utility (CU) measure [12], which has its roots in information theory, to partition a data set in a manner that maximizes the probability of correctly predicting a f...

549 | Statistical Methods for Research Workers
- Fisher
Citation Context ...ve properties of $\chi^2$ distribution, assuming the individual results are expressed as the square of a standard normal deviate. Such combination can be simply achieved using Fisher's $\chi^2$ transformation [9]: $\chi^2 = -2\ln(P_i)$. This transformation works well with data from continuous populations and in discrete populations where a feature has a large number of distinct observations associated with i...
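
As an illustration of the transformation quoted in this context, the sketch below combines several independent probabilities with Fisher's method, in which each $-2\ln(P_i)$ term contributes 2 degrees of freedom to a $\chi^2$ statistic. Summing the terms over all probabilities, the closed-form tail for an even number of degrees of freedom, and the function name `fisher_combine` are assumptions made for illustration; the excerpt shows only the single-term form, and this is not the paper's implementation.

```python
import math


def fisher_combine(p_values):
    """Combine independent probabilities with Fisher's chi-squared method.

    Returns (chi2, combined_p), where chi2 = -2 * sum(ln P_i) has
    2 * len(p_values) degrees of freedom under independence.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("each probability must lie in (0, 1]")

    chi2 = -2.0 * sum(math.log(p) for p in p_values)

    # Upper-tail probability of a chi-squared variable with 2n degrees of
    # freedom has the closed form exp(-x/2) * sum_{k<n} (x/2)^k / k!.
    n = len(p_values)
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return chi2, math.exp(-half) * total


if __name__ == "__main__":
    chi2, combined = fisher_combine([0.05, 0.05])
    print(f"chi2 = {chi2:.4f}, combined probability = {combined:.6f}")
```

With a single probability the combined value equals the input, which is a quick sanity check on the tail formula.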

359 | Knowledge discovery in databases: an overview
- Frawley, etal
- 1992
Citation Context ...y from databases is the extraction of potentially useful information by careful processing and analysis of this data in a computationally efficient and sometimes interactive manner [17]. Frawley et al. [10] define knowledge discovery to be "the non trivial extraction of implicit, previously unknown and potentially useful information in data." This suggests a generic architecture for a discovery system, ...

270 | Pattern Classification and Scene Analysis
- Duda, Hart, et al.
- 1973
Citation Context ...e are relevant features, but the color of the car body is not. Selection of appropriate features is an important task in clustering and classification applications. Traditional clustering methodologies [5, 15] assume features are numeric-valued, but as applications extend from scientific and engineering domains to the medical, business, and social domains, one has to deal with features, such as sex, color, ...

196 | Model of Incremental Concept Formation
- Gennari, Langley, et al.
- 1992
Citation Context ... a partition structure made up of K classes, is defined as the average CU over the K classes: $\frac{\sum_{k=1}^{K} CU_k}{K}$. COBWEB/3 combines the original COBWEB [6] algorithm with the methodology defined in CLASSIT [11] to handle numeric attributes in the CU measure. For numeric attributes, probabilities are expressed in terms of the probability density function (pdf) defined for the range of values that can be assoc...
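
The partition score described in this context, the average of CU over the K classes, can be sketched for nominal attributes as follows. The per-class form $CU_k = P(C_k)\sum_i\sum_j\left[P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2\right]$ is the commonly cited definition of category utility and is assumed here because the excerpt does not spell it out. The function name `partition_score` and the list-of-tuples data layout are hypothetical, and every cluster is assumed to be non-empty.

```python
from collections import Counter


def partition_score(clusters):
    """Average category utility over K clusters of nominal-valued objects.

    clusters: list of K non-empty clusters; each cluster is a list of
    equal-length tuples holding one nominal value per attribute.
    """
    data = [row for cluster in clusters for row in cluster]
    n_objects = len(data)
    n_attrs = len(data[0])

    def sum_squared_probs(rows):
        # Sum over attributes i and values j of P(A_i = V_ij)^2 within rows.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(row[i] for row in rows)
            total += sum((c / len(rows)) ** 2 for c in counts.values())
        return total

    baseline = sum_squared_probs(data)
    cu_per_class = [
        (len(cluster) / n_objects) * (sum_squared_probs(cluster) - baseline)
        for cluster in clusters
    ]
    return sum(cu_per_class) / len(clusters)


if __name__ == "__main__":
    clusters = [
        [("red", "round"), ("red", "round")],
        [("blue", "square"), ("blue", "square")],
    ]
    print(partition_score(clusters))
```

A partition that places all objects in one class scores zero under this definition, since the class distribution then equals the distribution of the whole data set.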

105 | Optimization and simplification of hierarchical clusterings
- Fisher
- 1995
Citation Context ... nominal-valued. Though effective methods do not exist for clustering data sets with mixed numeric and nominal data, symbolic and numeric clustering methods have by themselves approached their maturity [7, 15]. In knowledge discovery tasks, it becomes crucial that clustering and discovery systems deal with combinations of numeric and nominal-valued data. Section 2 discusses criterion functions used for num...

93 | Information, uncertainty, and the utility of categories
- Gluck, Corter
- 1985
Citation Context ...lustering systems use conditional probability estimates as a means for defining the relation between groups or clusters. Systems like COBWEB [6] and its derivatives use the Category Utility (CU) measure [12], which has its roots in information theory, to partition a data set in a manner that maximizes the probability of correctly predicting a feature-value given group $C_k$ versus the same probability give...

25 | Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1995
Citation Context ... define groupings. These measures are tailored for nominal attributes, though variations, such as COBWEB/3 [18] and ECOBWEB [20] use modifications of the CU measure to handle numeric attributes. AUTOCLASS [3] uses a fundamental finite mixture model, and derives groupings of objects that locally maximize the posterior probability of individual clusters given the feature distribution assumptions. The COBWEB/3...

24 | Conceptual clustering, categorization, and polymorphy. Machine Learning 3(4):343-372
- Hanson, Bauer
- 1989
Citation Context ...ata set in a manner that maximizes the probability of correctly predicting a feature-value given group $C_k$ versus the same probability given the distribution for the entire data set. Systems like WITT [14] use correlation measures to define groupings. These measures are tailored for nominal attributes, though variations, such as COBWEB/3 [18] and ECOBWEB [20] use modifications of the CU measure to handle...

23 | A New Similarity Index Based on Probability
- Goodall
- 1966
Citation Context ...ents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure, proposed by Goodall for biological taxonomy [13], that gives greater weight to uncommon feature-value matches in similarity computations and makes no assumptions of the underlying distributions of the feature-values, is adopted to define the simila...
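
To make the idea of "greater weight to uncommon feature-value matches" concrete, here is a minimal sketch for a single nominal feature. The exact form of the measure is not given in this record, so the formula below is an assumption: a pair that matches on value v is scored as one minus the estimated probability that a random pair matches on a value no more frequent than v, and a non-matching pair is scored 0. The function name `rare_match_similarity` is hypothetical.

```python
from collections import Counter


def rare_match_similarity(column, a, b):
    """Similarity of values a and b for one nominal feature.

    column: all observed values of the feature (at least two objects).
    Matches on rare values score close to 1; matches on common values
    score lower; mismatches score 0. Assumed form, for illustration only.
    """
    if a != b:
        return 0.0

    n = len(column)
    freq = Counter(column)

    # Estimated probability that two distinct objects both take value v.
    pair_prob = {v: f * (f - 1) / (n * (n - 1)) for v, f in freq.items()}

    # Accumulate over every value that is no more frequent than the matched one.
    at_most_as_common = sum(
        p for v, p in pair_prob.items() if freq[v] <= freq[a]
    )
    return 1.0 - at_most_as_common


if __name__ == "__main__":
    colors = ["grey"] * 8 + ["purple"] * 2
    print(rare_match_similarity(colors, "purple", "purple"))
    print(rare_match_similarity(colors, "grey", "grey"))
```

In this example a match on the rare value scores higher than a match on the common value, which is the qualitative behavior the abstract describes. No distribution is assumed for the feature values; only their observed frequencies are used.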

18 | The formation and use of abstract concepts in design
- Reich, Fenves
- 1991
Citation Context ...ion for the entire data set. Systems like WITT [14] use correlation measures to define groupings. These measures are tailored for nominal attributes, though variations, such as COBWEB/3 [18] and ECOBWEB [20] use modifications of the CU measure to handle numeric attributes. AUTOCLASS [3] uses a fundamental finite mixture model, and derives groupings of objects that locally maximize the posterior probabilit...

17 | Cobweb/3: A portable implementation
- McKusick, Thompson
- 1990
Citation Context ...en the distribution for the entire data set. Systems like WITT [14] use correlation measures to define groupings. These measures are tailored for nominal attributes, though variations, such as COBWEB/3 [18] and ECOBWEB [20] use modifications of the CU measure to handle numeric attributes. AUTOCLASS [3] uses a fundamental finite mixture model, and derives groupings of objects that locally maximize the post...

10 | ITERATE: A conceptual clustering method for knowledge discovery in databases
- Biswas, Jerry, et al.
- 1994
Citation Context ...t the core of the system is the discovery engine, which computes and evaluates groupings, patterns, and relationships using a relevant set of features selected in the context of a problem solving task [1]. Depending on the discovery engine employed in the system, the results can be further analyzed to derive models as rules, analytic equations, and concept definitions under the chosen context. Typical...

10 | Building and Improving Design Systems: A Machine Learning Approach
- Reich
- 1991
Citation Context ... [Figure 3: Three methods implemented in ECOBWEB for defining the interval value] 2.2 ECOBWEB: ECOBWEB is part of a larger system, Bridger [19, 20], that employs inductive learning techniques in designing cable-stayed bridges. At the core of the system is the clustering algorithm, ECOBWEB, an extension of the COBWEB system. ECOBWEB attempts to r...

9 | Iterative optimization and simplification of hierarchical clusterings
- Fisher
- 1996
Citation Context ...e nominal-valued. Though effective methods do not exist for clustering data sets with mixed numeric and nominal data, symbolic and numeric clustering methods have by themselves approached their maturity [7, 15]. In knowledge discovery tasks, it becomes crucial that clustering and discovery systems deal with combinations of numeric and nominal-valued data. Section 2 discusses criterion functions used for num...

9 | Information, Uncertainty, and the Utility of Categories
- Gluck, Corter
- 1985
Citation Context ...clustering systems use conditional probability estimates as a means for defining the relation between groups or clusters. Systems like COBWEB [6] and its derivatives use the Category Utility (CU) measure [12], which has its roots in information theory, to partition a data set in a manner that maximizes the probability of correctly predicting a feature-value given group $C_k$ versus the same probability given...

8 | AutoClass: A Bayesian classification system
- Cheeseman, Kelly, et al.
- 1988
Citation Context ...ere T denotes the abstract mathematical form of the pdf. In fact, on top of this two-step process, another level of search for the approximate optimal number of classes for the data has to be imposed [2]. AUTOCLASS starts out with a number of classes J greater than the predicted true number of classes for the data. If the resulting classes all have significant probability, then the number of classes ...

7 | The combination of probabilities arising from data in discrete distributions
- Lancaster
- 1949

6 | Methods of Conceptual Clustering and their Relation to Numerical Taxonomy
- Fisher, Langley
- 1986
Citation Context ..., shape, and type of diseases, that are nominal-valued. Schemes for processing nominal-valued data, called conceptual clustering, combine clustering with the concept formation and interpretation tasks [8]. Clustering methodologies incorporate three main steps [1]: preprocessing, clustering, and explanation. Data preprocessing involves the selection of relevant features for the analysis task and the c...

3 | AutoClass: A Bayesian Classification System
- Cheeseman
- 1988
Citation Context ...here T denotes the abstract mathematical form of the pdf. In fact, on top of this two-step process, another level of search for the approximate optimal number of classes for the data has to be imposed [2]. AUTOCLASS starts out with a number of classes J greater than the predicted true number of classes for the data. If the resulting classes all have significant probability, then the number of classes a...

2 | Bayesian Classification (AUTOCLASS): Theory and Results
- Cheeseman, Stutz
- 1995
Citation Context ...efine groupings. These measures are tailored for nominal attributes, though variations, such as COBWEB/3 [18] and ECOBWEB [20] use modifications of the CU measure to handle numeric attributes. AUTOCLASS [3] uses a fundamental finite mixture model, and derives groupings of objects that locally maximize the posterior probability of individual clusters given the feature distribution assumptions. The COBWEB...

1 | Methods of Conceptual Clustering and Their Relation to Numeric Taxonomy
- Fisher, Langley
- 1986
Citation Context ..., shape, and type of diseases, that are nominal-valued. Schemes for processing nominal-valued data, called conceptual clustering, combine clustering with the concept formation and interpretation tasks [8]. Clustering methodologies incorporate three main steps [1]: preprocessing, clustering, and explanation. Data preprocessing involves the selection of relevant features for the analysis task and the cha...

1 | Knowledge-based Scientific Discovery from Geological Databases
- Li, Biswas
- 1995
Citation Context ...r knowledge discovery from databases is the extraction of potentially useful information by careful processing and analysis of this data in a computationally efficient and sometimes interactive manner [17]. Frawley et al. [10] define knowledge discovery to be "the non trivial extraction of implicit, previously unknown and potentially useful information in data." This suggests a generic architecture for ...

1 | Knowledge-based Scientific Discovery from Geological Databases
- Li, Biswas
- 1995
Citation Context ... or knowledge discovery from databases is the extraction of potentially useful information by careful processing and analysis of this data in a computationally efficient and sometimes interactive manner [17]. Frawley et al. [10] define knowledge discovery to be "the non trivial extraction of implicit, previously unknown and potentially useful information in data." This suggests a generic architecture for a...