Vertical Partitioning Algorithms for Database Design
ACM Transactions on Database Systems, 1984
Cited by 84 (8 self)
This paper addresses the vertical partitioning of a set of logical records or a relation into fragments. The rationale behind vertical partitioning is to produce fragments, groups of attribute columns, that “closely match” the requirements of transactions. Vertical partitioning is applied in three contexts: a database stored on devices of a single type, a database stored in different memory levels, and a distributed database. In a two-level memory hierarchy, most transactions should be processed using the fragments in primary memory. In distributed databases, fragment allocation should maximize the amount of local transaction processing. Fragments may be nonoverlapping or overlapping. A two-phase approach for the determination of fragments is proposed; in the first phase, the design is driven by empirical objective functions which do not require specific cost information. The second phase performs cost optimization by incorporating the knowledge of a specific application environment. The algorithms presented in this paper have been implemented, and examples of their actual use are shown.
Model-based overlapping clustering
In KDD, 2005
Cited by 29 (6 self)
While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of 20Newsgroups and EachMovie datasets.
An Objective Function for Vertically Partitioning Relations in Distributed Databases and its Analysis
1992
Cited by 17 (0 self)
The design of distributed databases is an optimization problem requiring solutions to several interrelated problems, including data fragmentation, allocation, and local optimization. Each problem can be solved with several different approaches, which makes distributed database design a very difficult task. Although there is a large body of work on the design of data fragmentation, most of it is either ad hoc or a formal solution for special cases (e.g., binary vertical partitioning). In this paper, we address the general vertical partitioning problem formally. We first provide a comparison of work in the area of data clustering and distributed databases to highlight the thrust of this work. We derive an objective function that generalizes and subsumes earlier work on vertical partitioning in databases. The objective function developed in this paper provides a basis for developing heuristic algorithms for vertical partitioning. The objective function also facilitates ...
Distributed Object Based Design: Vertical Fragmentation of Classes
1998
Cited by 16 (5 self)
Processing costs in distributed environments are most often dominated by the network communications required for interprocess communication. It is well known from distributed relational database design research that careful placement of data "near" the users or processors where it is used is mandatory, or system performance will suffer greatly. Data placement in relational database systems is comparatively simple because the data is flat, structured, and passive. Objects are characterized by an inheritance hierarchy (other hierarchies could also be considered, including class composition and execution), are unstructured (possibly dynamic data), and contain a behavioral component that defines how the "data" is accessed by encapsulating it within the object per se. Algorithms currently exist for fragmenting relations, but the fragmentation and allocation of objects is still a relatively untouched field of study. Similar to relations, objects can be fragmented both horizontally and ve...
The history of the cluster heat map
The American Statistician, 2009
Cited by 16 (0 self)
The cluster heat map is an ingenious display that simultaneously reveals row and column hierarchical cluster structure in a data matrix. It consists of a rectangular tiling with each tile shaded on a color scale to represent the value of the corresponding element of the data matrix. The rows (columns) of the tiling are ordered such that similar rows (columns) are near each other. On the vertical and horizontal margins of the tiling there are hierarchical cluster trees. This cluster heat map is a synthesis of several different graphic displays developed by statisticians over more than a century. We locate the earliest sources of this display in late 19th century publications. And we trace a diverse 20th century statistical literature that provided a foundation for this most widely used of all bioinformatics displays.
Manufacturing Cell Design: An Integer Programming Model Employing Genetic Algorithms
IIE Transactions, 1996
Cited by 15 (5 self)
The design of a cellular manufacturing system requires that a part population, at least minimally described by its use of process technology (part/machine incidence matrix), be partitioned into part families and that the associated plant equipment be partitioned into machine cells. At the highest level, the objective is to form a set of completely autonomous units such that intercell movement of parts is minimized. We present an integer program that is solved using a genetic algorithm (GA) to assist in the design of cellular manufacturing systems. The formulation uses a unique representation scheme for individuals (part/machine partitions) that reduces the size of the cell formation problem and increases the scale of problems that can be solved. This approach offers improved design flexibility by allowing a variety of evaluation functions to be employed and by incorporating design constraints during cell formation. The effectiveness of the GA approach is demonstrated on several problems from the literature.
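The cell formation objective described in this abstract can be illustrated with a minimal sketch. It assumes that intercell movement is approximated by counting every use of a machine that lies outside the cell assigned to the part's family; this proxy, the function name `intercell_moves`, and the sample data are assumptions made here for illustration. The sketch does not reproduce the paper's integer program or its genetic algorithm.

```python
def intercell_moves(incidence, part_family, machine_cell):
    """Count part/machine uses that cross a cell boundary.

    incidence[p][m] is 1 when part p is processed on machine m.
    part_family[p] and machine_cell[m] are cell labels (assumed encoding).
    """
    moves = 0
    for p, row in enumerate(incidence):
        for m, used in enumerate(row):
            # A use counts as intercell movement when the machine is not
            # in the cell that the part's family is assigned to.
            if used and part_family[p] != machine_cell[m]:
                moves += 1
    return moves


# Hypothetical 4-part, 4-machine incidence matrix.
incidence = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
machine_cell = [0, 0, 1, 1]

# Two candidate groupings of parts into families.
good = intercell_moves(incidence, [0, 0, 1, 1], machine_cell)
poor = intercell_moves(incidence, [0, 1, 0, 1], machine_cell)
print(good, poor)
```

A search procedure such as a genetic algorithm would evaluate many candidate partitions with a function of this kind and keep those with fewer boundary crossings.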
A Formal Approach to the Vertical Partitioning Problem in Distributed Database Design
In Technical Report. CIS Dept, Univ. of, 1993
Cited by 11 (2 self)
The design of distributed databases is an optimization problem requiring solutions to several interrelated problems: data fragmentation, allocation, and local optimization. Each problem can be solved with several different approaches, which makes distributed database design a very difficult task. Although there is a large body of work on the design of data fragmentation, most of it is either ad hoc or a formal solution for special cases (e.g., binary vertical partitioning). In this paper, we address the n-ary vertical partitioning problem and derive an objective function that generalizes and subsumes earlier work. The objective function derived in this paper is being used for developing heuristic algorithms that can be shown to satisfy the objective function. The objective function is also being used for comparing previously proposed algorithms for vertical partitioning. We first derive an objective function that is suited to distributed transaction proces...
Overcoming the Curse of Dimensionality in Clustering by means of the Wavelet Transform
The Computer Journal, 2000
Cited by 10 (3 self)
We use a redundant wavelet transform analysis to detect clusters in high-dimensional data spaces. We overcome Bellman's “curse of dimensionality” in such problems by (i) using some canonical ordering of observation and variable (document and term) dimensions in our data, (ii) applying a wavelet transform to such canonically ordered data, (iii) modeling the noise in wavelet space, (iv) defining significant component parts of the data as opposed to insignificant or noisy component parts, and (v) reading off the resultant clusters. The overall complexity of this innovative approach is linear in the data dimensionality. We describe a number of examples and test cases, including the clustering of high-dimensional hypertext data. 1 Introduction. Bellman's (1961) [1] “curse of dimensionality” refers to the exponential growth of hypervolume as a function of dimensionality. All problems become tougher as the dimensionality increases. Nowhere is this more evident than in problems related to ...
Rearrangement clustering: Pitfalls, remedies, and applications
Journal of Machine Learning Research, 2006
Cited by 6 (0 self)
Given a matrix of values in which the rows correspond to objects and the columns correspond to features of the objects, rearrangement clustering is the problem of rearranging the rows of the matrix such that the sum of the similarities between adjacent rows is maximized. Referred to by various names and reinvented several times, this clustering technique has been extensively used in many fields over the last three decades. In this paper, we point out two critical pitfalls that have been previously overlooked. The first pitfall is deleterious when rearrangement clustering is applied to objects that form natural clusters. The second concerns a similarity metric that is commonly used. We present an algorithm that overcomes these pitfalls. This algorithm is based on a variation of the Traveling Salesman Problem. It offers an extra benefit as it automatically determines cluster boundaries. Using this algorithm, we optimally solve four benchmark problems and a 2,467-gene expression data clustering problem. As expected, our new algorithm identifies better clusters than those found by previous approaches in all five cases. Overall, our results demonstrate the benefits of rectifying the pitfalls and exemplify the usefulness of this clustering technique. Our code is available at our websites.
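The objective stated at the start of this abstract (maximize the sum of similarities between adjacent rows) can be illustrated with a minimal sketch. It assumes a dot-product similarity and an exhaustive search over row orders, both chosen here only for illustration; the function names and the sample matrix are hypothetical. The paper's own algorithm, which is based on a variation of the Traveling Salesman Problem, is not reproduced.

```python
from itertools import permutations


def similarity(a, b):
    """Dot-product similarity between two rows (an assumed metric)."""
    return sum(x * y for x, y in zip(a, b))


def adjacent_similarity(matrix, order):
    """Sum of similarities between rows that are adjacent under `order`."""
    return sum(
        similarity(matrix[order[i]], matrix[order[i + 1]])
        for i in range(len(order) - 1)
    )


def best_order(matrix):
    """Exhaustively search all row orders; feasible only for tiny inputs."""
    return max(
        permutations(range(len(matrix))),
        key=lambda order: adjacent_similarity(matrix, order),
    )


# Hypothetical object-by-feature matrix with two interleaved groups of rows.
matrix = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

original_score = adjacent_similarity(matrix, list(range(len(matrix))))
order = best_order(matrix)
best_score = adjacent_similarity(matrix, order)
print(original_score, order, best_score)
```

Exhaustive search grows factorially with the number of rows, which is one reason a formulation as a tour-finding problem is attractive for realistic matrix sizes.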