Results 1–10 of 25
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
, 1999
"... Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as Kmeans, PAM, CLARANS, DBSCAN, CURE, and ROCK are designed to find clusters that fit s ..."
Abstract

Cited by 252 (21 self)
Clustering in data mining is a discovery process that groups a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK, are designed to find clusters that fit some static models. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters. Furthermore, most of these algorithms break down when the data consists of clusters of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal intercon...
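The "dynamic model" idea in this abstract, that a merge depends on how close two clusters are relative to their own internal geometry rather than on a fixed threshold, can be illustrated with a toy sketch. This is not CHAMELEON's actual k-nearest-neighbour graph machinery; the function and data below are our own illustration using plain pairwise Euclidean distances:

```python
import numpy as np

def relative_closeness(a, b):
    """Toy merge criterion in the spirit of CHAMELEON: the mean distance
    between clusters a and b, divided by the mean of their internal
    pairwise distances. Smaller ratios favour merging."""
    def mean_dist(x, y):
        # All pairwise Euclidean distances between rows of x and rows of y.
        return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).mean()
    between = mean_dist(a, b)
    within = 0.5 * (mean_dist(a, a) + mean_dist(b, b))
    return between / within

rng = np.random.default_rng(0)
near = rng.normal(0.0, 0.1, (30, 2)), rng.normal(0.3, 0.1, (30, 2))
far = rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))
print(relative_closeness(*near), relative_closeness(*far))
```

A static absolute-distance threshold might treat both pairs alike; the ratio adapts to each pair's own internal scale, which is the behaviour the abstract argues for.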
Blocks+: A Non-Redundant Database of Protein Alignment Blocks Derived from Multiple Compilations
, 1999
"... Motivation: As databanks grow, sequence classification and prediction of function by searching protein family databases becomes increasingly valuable. The original Blocks Database, which contains ungapped multiple alignments for families documented in PROSITE, can be searched to classify new sequenc ..."
Abstract

Cited by 114 (3 self)
Motivation: As databanks grow, sequence classification and prediction of function by searching protein family databases become increasingly valuable. The original Blocks Database, which contains ungapped multiple alignments for families documented in PROSITE, can be searched to classify new sequences. However, PROSITE is incomplete, and families from other databases are now available to expand the coverage of the Blocks Database.
ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space
 PROTEINS
, 1999
"... We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an allvs.all compar ..."
Abstract

Cited by 108 (15 self)
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISS-PROT, gives a very conservative initial classification based on the highest-scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently, we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and to compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences that were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins, as well as many novel relations between protein families.
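The hierarchy-by-thresholds idea above, a strict initial grouping that is relaxed level by level, can be sketched with plain single-linkage transitivity run at decreasing score thresholds. Note that ProtoMap applies transitivity *restrictively*, which this sketch omits; the data and names below are our own toy example:

```python
class UnionFind:
    """Disjoint-set forest with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_hierarchy(n, scores, thresholds):
    """One partition per threshold, strictest first: items i and j end up
    together whenever a chain of pairs scoring >= t connects them."""
    levels = []
    for t in sorted(thresholds, reverse=True):
        uf = UnionFind(n)
        for (i, j), s in scores.items():
            if s >= t:
                uf.union(i, j)
        groups = {}
        for v in range(n):
            groups.setdefault(uf.find(v), []).append(v)
        levels.append(sorted(sorted(g) for g in groups.values()))
    return levels

# Pairwise similarity scores for five hypothetical sequences.
scores = {(0, 1): 0.9, (1, 2): 0.5, (3, 4): 0.8}
print(cluster_hierarchy(5, scores, [0.7, 0.4]))
# strict level: [[0, 1], [2], [3, 4]]; relaxed level: [[0, 1, 2], [3, 4]]
```

The strict level recovers tight subfamilies; the relaxed level merges subfamily 2 into the first group via the weaker pair, giving the coarser families of the hierarchy.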
Hypergraph-Based Clustering in High-Dimensional Data Sets: A Summary of Results
 IEEE Bulletin of the Technical Committee on Data Engineering
, 1998
"... Clustering of data in a large dimension space is of a great interest in many data mining applications. In this paper, we propose a method for clustering of data in a high dimensional space based on a hypergraph model. In this method, the relationship present in the original data in high dimensional ..."
Abstract

Cited by 52 (17 self)
Clustering of data in a high-dimensional space is of great interest in many data mining applications. In this paper, we propose a method for clustering of data in a high-dimensional space based on a hypergraph model. In this method, the relationships present in the original data in high-dimensional space are mapped into a hypergraph. A hyperedge represents a relationship (affinity) among subsets of data, and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on two different data sets: S&P 500 stock data for the period 1994–1996 and protein coding data. These experiments demonstrate that our approach is applicable and effective in high-dimensional data sets.
Global Self-Organization of All Known Protein Sequences Reveals Inherent Biological Signatures
, 1997
"... A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acids and a dynamicprogramming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space with small ..."
Abstract

Cited by 41 (3 self)
A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acids, and a dynamic-programming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space with small metric distortion. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most up-to-date classifications based on functional and structural protein data. Motifs and domains such as the Zinc Finger, EF hand, Homeobox, EGF-like, and others are automatically and correctly identified. A novel representation of protein families is introduced, from which the functional biological kinship of protein families can be deduced, as demonstrated for the transporters family. The self-organization method prese...
SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size
, 2002
"... Motivation: Searches for near exact sequence matches are performed frequently in largescale sequencing projects and in comparative genomics. The time and cost of performing these largescale sequencesimilarity searches is prohibitive using even the fastest of the extant algorithms. Faster algorith ..."
Abstract

Cited by 26 (1 self)
Motivation: Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches are prohibitive using even the fastest of the extant algorithms. Faster algorithms are desired.
Clustering in a High-Dimensional Space Using Hypergraph Models
"... Clustering of data in a large dimension space is of a great interest in many data mining applications. Most of the traditional algorithms such as Kmeans or AutoClass fail to produce meaningful clusters in such data sets even when they are used with well known dimensionality reduction techniques suc ..."
Abstract

Cited by 19 (3 self)
Clustering of data in a high-dimensional space is of great interest in many data mining applications. Most traditional algorithms, such as K-means or AutoClass, fail to produce meaningful clusters in such data sets even when they are used with well-known dimensionality reduction techniques such as Principal Component Analysis and Latent Semantic Indexing. In this paper, we propose a method for clustering of data in a high-dimensional space based on a hypergraph model. The hypergraph model maps the relationships present in the original data in high-dimensional space into a hypergraph. A hyperedge represents a relationship (affinity) among subsets of data, and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on three different data sets: S&P 500 stock data for the period 1994–1996, protein coding data, and Web document data. Wherever applicable, we compared our results with those of the AutoClass and K-means clustering algorithms on the original data as well as on the reduced-dimensionality data obtained via Principal Component Analysis or Latent Semantic Indexing. These experiments demonstrate that our approach is applicable and effective in a wide range...
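The cut objective described in both hypergraph abstracts is concrete enough to sketch: score a partition by the total weight of hyperedges whose vertices span more than one part, and minimize that weight. The brute-force minimizer below is our own toy illustration for a four-item example; a real system would use a multilevel partitioner such as hMETIS rather than exhaustive search:

```python
from itertools import combinations

def cut_weight(hyperedges, part):
    """Total weight of hyperedges whose vertices land in more than one part."""
    return sum(w for verts, w in hyperedges
               if len({part[v] for v in verts}) > 1)

def min_cut_bipartition(n, hyperedges):
    """Exhaustive minimum-weight bipartition; only feasible for tiny n."""
    best_w, best_part = float("inf"), None
    for size in range(1, n // 2 + 1):
        for left in combinations(range(n), size):
            part = [1 if v in left else 0 for v in range(n)]
            w = cut_weight(hyperedges, part)
            if w < best_w:
                best_w, best_part = w, part
    return best_w, best_part

# Items 0-2 co-occur strongly; item 3 is tied to item 2 only weakly.
edges = [((0, 1, 2), 3.0), ((0, 1), 2.0), ((2, 3), 1.0)]
print(min_cut_bipartition(4, edges))  # (1.0, [0, 0, 0, 1])
```

Separating item 3 cuts only the weak hyperedge, so the strongly related items 0-2 stay together, which is exactly the "highly related within each partition" behaviour the abstract describes.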
Methods for Global Organization of the Protein Sequence Space
, 1999
"... r Bioccelerator and to their software. It was a pleasure to work with Alex Kremer, Avi Kavas, Yoav Etsion and Daniel Avrahami, the great team of students with whom I created the ProtoMap web site. I thank Michael Levitt, Steven Brenner, Patrice Koehl, Boaz Shaanan, Yoram Gdalyahu and Ran ElYaniv fo ..."
Abstract

Cited by 9 (4 self)
r Bioccelerator and to their software. It was a pleasure to work with Alex Kremer, Avi Kavas, Yoav Etsion and Daniel Avrahami, the great team of students with whom I created the ProtoMap web site. I thank Michael Levitt, Steven Brenner, Patrice Koehl, Boaz Shaanan, Yoram Gdalyahu and Ran El-Yaniv for critically reading parts of this manuscript and for making many helpful comments, and Avner Magen for valuable suggestions. A special thanks to Nati Linial, my advisor, who with much patience read most of this thesis, commenting, correcting my English, and improving my writing style. To my family, especially my mother and father, for their tremendous love and encouragement, and for continually reminding me that they will be happy with whatever I choose to do (always leaving me the option of becoming a carpenter). And last, to my best friends, Yoram Gdalyahu and Rami Doron, for their enormous help and for their invaluable friendship during these intensive years.
Sequential Inductive Learning
 In Proceedings of the Thirteenth National Conference on Artificial Intelligence
, 1995
"... In this paper I advocate a new model for inductive learning. Called sequential induction, this model bridges classical fixedsample learning techniques (which are efficient but ad hoc), and worstcase approaches (which provide strong statistical guarantees but are too inefficient for practical use). ..."
Abstract

Cited by 8 (0 self)
In this paper I advocate a new model for inductive learning. Called sequential induction, this model bridges classical fixed-sample learning techniques (which are efficient but ad hoc) and worst-case approaches (which provide strong statistical guarantees but are too inefficient for practical use). According to the sequential inductive model, learning is a sequence of decisions that are informed by training data. By analyzing induction at the level of these decisions, and by utilizing the minimum data necessary to make each decision, sequential inductive techniques can provide the strong statistical guarantees of worst-case methods, but with substantially less data than those methods require. The sequential inductive model is also useful as a method for determining a sufficient sample size for inductive learning and, as such, is relevant to mega-induction, where the preponderance of data introduces problems of scale. The peepholing and decision-theoretic subsampling approaches of Catlet...
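The core move, consuming just enough data to make each individual decision safely, can be sketched with a sequential "race" between two options. This is our own toy illustration using a Hoeffding bound (the paper does not necessarily use this exact bound), stopping as soon as one option is provably better at confidence 1 − δ:

```python
import math
import random

def race(sample_a, sample_b, delta=0.05, max_n=200000):
    """Sequentially sample two options with payoffs in [0, 1]; stop as soon
    as a Hoeffding confidence interval separates their empirical means."""
    sum_a = sum_b = 0.0
    for n in range(1, max_n + 1):
        sum_a += sample_a()
        sum_b += sample_b()
        # Half-width of a (1 - delta) Hoeffding interval around each mean.
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        if abs(sum_a - sum_b) / n > 2.0 * eps:
            return ("a" if sum_a > sum_b else "b"), n
    return None, max_n  # could not separate the options within the budget

random.seed(7)
winner, used = race(lambda: float(random.random() < 0.8),
                    lambda: float(random.random() < 0.3))
print(winner, used)
```

With a gap of 0.5 between the true means, the race typically stops after a few dozen draws, far fewer than a fixed worst-case sample-size bound would demand: the statistical guarantee is paid per decision, with only the data that decision needs.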