Results 1 - 9 of 9
Concept Decompositions for Large Sparse Text Data using Clustering
 Machine Learning
, 2000
Abstract

Cited by 303 (28 self)
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
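The clustering step described in this abstract can be illustrated with a minimal sketch. It assumes dense Python lists rather than sparse vectors, seeds the concept vectors with the first k documents, and runs a fixed number of iterations; none of these choices come from the abstract, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iterations=20):
    """Cluster document vectors by cosine similarity.

    Assumptions of this sketch: the first k documents seed the concept
    vectors, and the loop runs for a fixed number of iterations.
    """
    docs = [normalize(d) for d in docs]
    concepts = [list(d) for d in docs[:k]]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to the most similar concept vector.
        labels = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if not members:
                continue  # keep the previous concept vector for an empty cluster
            centroid = [sum(col) / len(members) for col in zip(*members)]
            # Concept vector: the cluster centroid normalized to unit norm.
            concepts[j] = normalize(centroid)
    return labels, concepts


# Toy term-count vectors (made-up data): two documents per topic.
docs = [
    [3.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
]
labels, concepts = spherical_kmeans(docs, k=2)
print(labels)
```

Because every document and every concept vector has unit norm, the inner product is the cosine similarity, so the assignment step needs no explicit distance computation.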
Efficient Clustering Of Very Large Document Collections
, 2001
Abstract

Cited by 92 (11 self)
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multithreaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
Constraint-Based Clustering in Large Databases
, 2000
Abstract

Cited by 46 (2 self)
Constrained clustering, finding clusters that satisfy user-specified constraints, is highly desirable in many applications. In this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. A scalable constraint-clustering algorithm, Coca, is developed in this study which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints. Our algorithm consists of two phases: pivot movement and deadlock resolution. For both phases, we show that finding the optimal solution is NP-hard. We then propose several heuristics and show how our algorithm can scale up for large data sets using the heuristic of micro-cluster sharing. By experiments, we show the effectiveness and efficiency of the heuristics.
Local and Global Methods in Data Mining: Basic Techniques and Open Problems
 In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga
, 2002
Abstract

Cited by 23 (2 self)
Data mining has in recent years emerged as an interesting area in the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data, and local methods, which try to find useful patterns occurring in the data. We discuss briefly some simple local and global techniques, review two attempts at combining the approaches, and list open problems with an algorithmic flavor.
A local facility location algorithm for sensor networks
 In DCOSS ’05
, 2005
Abstract

Cited by 15 (4 self)
In this paper we address a well-known facility location problem (FLP) in a sensor network environment. The problem deals with finding the optimal way to provide service to a (possibly) very large number of clients. We show that a variation of the problem can be solved using a local algorithm. Local algorithms are extremely useful in a sensor network scenario. This is because they allow the communication range of the sensor to be restricted to the minimum, they can operate in routerless networks, and they allow complex problems to be solved on the basis of very little information, gathered from nearby sensors. The local facility location algorithm we describe is entirely asynchronous, seamlessly supports failures and changes in the data during calculation, poses modest memory and computational requirements, and can provide an anytime solution which is guaranteed to converge to the exact same one that would be computed by a centralized algorithm given the entire data.
A Local Facility Location Algorithm for Large-Scale Distributed Systems
 Journal of Grid Computing
, 2007
Abstract

Cited by 11 (2 self)
In the facility location problem (FLP) we are given a set of facilities and a set of clients, each of which is to be served by one facility. The goal is to decide which subset of facilities to open, such that the clients will be served at a minimal cost. In this paper we investigate the FLP in a setting where the cost depends on data known only to peer nodes. This setting typifies modern distributed systems: peer-to-peer file sharing networks, grid systems, and wireless sensor networks. All of them need to perform network organization, data placement, collective power management, and other tasks of this kind. We propose a local and efficient algorithm that solves FLP in these settings. The algorithm presented here is extremely scalable, entirely decentralized, requires no routing capabilities, and is resilient to failures and changes in the data throughout its execution.
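To make the problem statement in this abstract concrete, the sketch below evaluates a tiny FLP instance by brute force on a single machine. It is not the local, decentralized algorithm the paper proposes. It assumes the common uncapacitated formulation with a per-facility opening cost and a per-client service cost, which the abstract does not spell out, and all names and numbers are hypothetical.

```python
from itertools import combinations


def flp_cost(open_facilities, opening_cost, service_cost):
    """Total cost of a choice: opening costs plus each client's cheapest open facility."""
    opening = sum(opening_cost[f] for f in open_facilities)
    service = sum(min(row[f] for f in open_facilities) for row in service_cost)
    return opening + service


def brute_force_flp(opening_cost, service_cost):
    """Try every non-empty subset of facilities (exponential; toy instances only)."""
    facilities = range(len(opening_cost))
    best = None
    for size in range(1, len(opening_cost) + 1):
        for subset in combinations(facilities, size):
            cost = flp_cost(subset, opening_cost, service_cost)
            if best is None or cost < best[0]:
                best = (cost, subset)
    return best


# Made-up instance: 3 facilities, 4 clients.
opening_cost = [4, 3, 6]
# service_cost[client][facility]
service_cost = [
    [1, 5, 9],
    [6, 2, 8],
    [7, 3, 1],
    [2, 6, 9],
]
best = brute_force_flp(opening_cost, service_cost)
print(best)  # (15, (0, 1)): open facilities 0 and 1 for a total cost of 15
```

The exhaustive search needs every cost in one place, which is exactly what is unavailable in the setting the abstract describes, where the cost depends on data known only to peer nodes.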
On Approximation Algorithms for Data Mining Applications
, 2002
Abstract

Cited by 1 (0 self)
We aim to present current trends in the theoretical computer science research on topics which have applications in data mining. We briefly describe data mining tasks in various application contexts. We give an overview of some of the questions and algorithmic issues that are of concern when mining huge amounts of data that do not fit in main memory.
Utility Driven Clustering
Abstract
Data mining has primarily focused on statistical properties of data alone and not necessarily on what could be done with the patterns. There has been some work on measuring the usefulness of patterns in decision making, but not on using such measures to drive the mining process. We introduce a framework to mine clusters that support decision making. We use an extrinsic measure that evaluates patterns based on their utility in decision making. We show empirical validation of our approach on several test domains.
Special Issue on
 Stefano Moretti, Fioravante Patrone
, 2013
"... ICT-based strategies for environmental conflicts ..."