Results 1–5 of 5
Rearrangement clustering: Pitfalls, remedies, and applications
Journal of Machine Learning Research, 2006
Abstract

Cited by 13 (1 self)
Given a matrix of values in which the rows correspond to objects and the columns correspond to features of the objects, rearrangement clustering is the problem of rearranging the rows of the matrix such that the sum of the similarities between adjacent rows is maximized. Referred to by various names and reinvented several times, this clustering technique has been extensively used in many fields over the last three decades. In this paper, we point out two critical pitfalls that have been previously overlooked. The first pitfall is deleterious when rearrangement clustering is applied to objects that form natural clusters. The second concerns a similarity metric that is commonly used. We present an algorithm that overcomes these pitfalls. This algorithm is based on a variation of the Traveling Salesman Problem. It offers an extra benefit as it automatically determines cluster boundaries. Using this algorithm, we optimally solve four benchmark problems and a 2,467-gene expression data clustering problem. As expected, our new algorithm identifies better clusters than those found by previous approaches in all five cases. Overall, our results demonstrate the benefits of rectifying the pitfalls and exemplify the usefulness of this clustering technique. Our code is available at our websites.
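The objective the abstract defines (order rows so that the total similarity between adjacent rows is maximized) can be illustrated with a greedy nearest-neighbor heuristic. This is only a sketch of the problem, not the paper's optimal TSP-based algorithm; the function name and the use of Euclidean distance as the (dis)similarity are assumptions for illustration.

```python
import math

def rearrange_rows(rows):
    # Greedy nearest-neighbor heuristic for rearrangement clustering:
    # start from row 0 and repeatedly place the unused row most similar
    # (smallest Euclidean distance) to the last row placed. A TSP solver
    # would optimize the same adjacency objective globally.
    order = [0]
    remaining = set(range(1, len(rows)))
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: math.dist(rows[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

On a matrix whose rows form two natural clusters, the returned order places rows of the same cluster next to each other, which is exactly the structure rearrangement clustering tries to expose.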
Columnar Storage in SQL Server 2012
Abstract

Cited by 1 (0 self)
SQL Server 2012 introduces a new index type called a column store index and new query operators that efficiently process batches of rows at a time. These two features together greatly improve the performance of typical data warehouse queries, in some cases by two orders of magnitude. This paper outlines the design of column store indexes and batch-mode processing and summarizes the key benefits this technology provides to customers. It also highlights some early customer experiences and feedback and briefly discusses future enhancements for column store indexes.
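The two ideas the abstract combines, column-wise storage and batch-at-a-time processing, can be sketched outside of any database engine. This is an illustrative toy, not SQL Server internals; all table data, names, and the batch size are hypothetical.

```python
# Column store sketch: one contiguous array per column, so a query that
# needs only "price" and "qty" never touches "id".
col_store = {
    "id":    [1, 2, 3],
    "price": [10.0, 20.0, 5.0],
    "qty":   [3, 1, 4],
}

def total_revenue_columnar(cols, batch_size=2):
    # Evaluates SELECT SUM(price * qty) by scanning just two columns
    # and processing values a batch at a time, which is how batch-mode
    # operators amortize per-row interpretation overhead.
    total = 0.0
    prices, qtys = cols["price"], cols["qty"]
    for start in range(0, len(prices), batch_size):
        batch_p = prices[start:start + batch_size]
        batch_q = qtys[start:start + batch_size]
        total += sum(p * q for p, q in zip(batch_p, batch_q))
    return total
```

A row store would instead iterate over whole records, paying to load every column of every row even though the query reads only two of them.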
Analysis Techniques for Use With The Extended SDM Model
1979
Abstract
Key words: software architectural design; problem design structuring; mathematical graph modelling; graph decomposition.
Complex design problems are characterized by a multitude of competing requirements. System designers frequently find the scope of the problem beyond their conceptual abilities, and attempt to cope with this difficulty by decomposing the original design problem into smaller, more manageable subproblems. Functional requirements form a key interface between the users of a system and its designers. In this research effort, a systematic approach has been proposed for the decomposition of the overall set of functional
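One basic graph-decomposition idea in the spirit of the abstract, offered purely as an assumption-laden sketch rather than the report's actual technique: model functional requirements as graph nodes, connect requirements that interact, and split the design problem along the connected components. The requirement names and edges below are invented for illustration.

```python
def connected_components(nodes, edges):
    # Build an undirected adjacency list over the requirement graph.
    adj = {n: [] for n in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, components = set(), []
    # Depth-first search from each unvisited node collects one
    # component, i.e. one independent design subproblem.
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        components.append(sorted(comp))
    return components
```

Each component can then be handed to a designer as a self-contained subproblem, since by construction no requirement in it interacts with a requirement outside it.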
Scalable Clustering of Categorical Data and Applications
, 2004
Abstract
Clustering is widely used to explore and understand large collections of data. In this thesis, we introduce LIMBO, a scalable hierarchical categorical clustering algorithm based on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO can produce clusterings of different sizes in a single execution. We also define a distance measure for categorical tuples and values of a specific attribute. Within this framework, we define a heuristic for discovering candidate values for the number of meaningful clusters. Next, we consider the problem of database design, which has been characterized as a process of arriving at a design that minimizes redundancy. Redundancy is measured with respect to a prescribed model for the data (a set of constraints). We consider the problem of doing database redesign when the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in a data instance, which may contain errors, missing values, and duplicate records. We propose a set of tools based on LIMBO for finding structural summaries that are useful in characterizing the information content of the data. We study the use of these summaries in ranking
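LIMBO's actual distance is information-theoretic, derived from the Information Bottleneck framework. As a stand-in illustration of what "a distance measure for categorical tuples" means, here is a much simpler mismatch fraction; the attribute names and values are hypothetical and this is not the thesis's metric.

```python
def categorical_distance(t1, t2):
    # Fraction of attributes on which two categorical tuples disagree.
    # 0.0 means identical tuples, 1.0 means they differ everywhere.
    assert t1.keys() == t2.keys(), "tuples must share the same schema"
    mismatches = sum(1 for k in t1 if t1[k] != t2[k])
    return mismatches / len(t1)
```

A hierarchical algorithm can agglomerate tuples under any such distance; LIMBO's contribution is doing so scalably while quantifying the information lost at each merge.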
File Allocation in Distributed Databases with Interaction between Files
Abstract
In this paper, we reexamine the file allocation problem. Because of changing technology, the assumptions we use here are different from those of previous researchers. Specifically, the interaction of files during processing of queries is explicitly incorporated into our model, and the cost of communication between two sites is dominated by the amount of data transfer and is independent of the receiving and the sending sites. We study the complexity of the file allocation problem using the new model. Unfortunately, the problem is NP-hard. We present an approach to three versions of the problem, thus demonstrating the flexibility of our approach. We further argue that our method provides a practical solution to the problem, because accurate solutions are obtained, the time complexity of our algorithm is much smaller than that of existing algorithms, and the algorithm is conceptually simple, easy to implement, and adaptive to users' changing access patterns.
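The cost assumption the abstract describes, that communication cost depends only on the amount of data transferred and not on which pair of sites communicates, can be made concrete with a toy evaluator. The query tuples, site names, and byte volumes below are hypothetical; this only illustrates the cost model, not the paper's allocation algorithm.

```python
def allocation_cost(queries, allocation):
    # Total bytes shipped over the network for a workload, under the
    # site-independent cost model: a query is free when the file it
    # needs is stored at the issuing site, and otherwise costs exactly
    # the volume it reads, regardless of which two sites are involved.
    # queries: list of (issuing_site, file_name, bytes_needed)
    # allocation: dict mapping file_name -> site holding the file
    return sum(
        volume
        for site, file_name, volume in queries
        if allocation[file_name] != site
    )
```

Comparing this total across candidate allocations is how a file allocation procedure would score placements; the hard (NP-hard) part is searching the space of allocations, not evaluating one.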