Results

### Exact and Approximation Algorithms for Clustering (Extended Abstract)

, 1998

In this paper we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the k-center problem in $\mathbb{R}^d$, under L1 and L2 metrics. The algorithm extends to other metrics, and to the discrete k-center problem. We also describe a simple $(1+\varepsilon)$-approximation algorithm for the k-center problem, with running time $O(n \log k) + (k/\varepsilon)^{O(k^{1-1/d})}$. Finally, we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the L-capacitated k-center problem, provided that $L = \Omega(n/k^{1-1/d})$ or $L = O(1)$. We conclude with a simple approximation algorithm for the L-capacitated k-center problem.
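For readers unfamiliar with the k-center objective (choose k centers so that the largest distance from any point to its nearest center is as small as possible), the following is a minimal sketch of the classic greedy farthest-first traversal, which gives a 2-approximation under the Euclidean metric. It is not the algorithm from the paper above; the function name `farthest_first_centers` and the sample points are illustrative assumptions.

```python
import math


def farthest_first_centers(points, k):
    """Greedy farthest-first traversal (a standard 2-approximation for k-center).

    Illustrative only: this is NOT the exact or (1+eps)-approximation
    algorithm described in the abstract above.
    Returns (centers, radius), where radius is the largest distance from
    any point to its nearest chosen center.
    """
    if not points or k <= 0:
        return [], 0.0
    centers = [points[0]]  # arbitrary first center
    # dist[i] = distance from points[i] to its nearest chosen center so far
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # pick the point currently farthest from every chosen center
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers, max(dist)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    print(farthest_first_centers(pts, 2))
```

Each round costs O(n) distance evaluations, so the sketch runs in O(nk) time overall, which is why it is often used as a quick baseline before applying a more precise method.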

### Clustering in Trees: Optimizing . . .

- JOURNAL OF GRAPH ALGORITHMS AND APPLICATIONS, 2000

This paper considers partitioning the vertices of an n-vertex tree into p disjoint sets $C_1, C_2, \ldots, C_p$, called clusters, so that the number of vertices in a cluster and the number of subtrees in a cluster are minimized. For this NP-hard problem we present greedy heuristics which differ in (i) how subtrees are identified (using either a best-fit, good-fit, or first-fit selection criterion), (ii) whether clusters are filled one at a time or simultaneously, and (iii) how much cluster sizes can differ from the ideal size of c vertices per cluster, n = cp. The last criterion is controlled by a constant $\alpha$, $0 \le \alpha < 1$, such that cluster $C_i$ satisfies $(1 - \alpha/2)c \le |C_i| \le c(1+\alpha)$, $1 \le i \le p$. For algorithms resulting from combinations of these criteria we develop worst-case bounds on the number of subtrees in a cluster in terms of c, $\alpha$, and the maximum degree of a vertex. We present experimental results which give insight into how parameters c, $\alpha$, and the maximum degree of a vertex impact the...
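The size constraint in this abstract can be checked mechanically. The sketch below assumes the constraint reads $(1 - \alpha/2)c \le |C_i| \le c(1+\alpha)$ with $0 \le \alpha < 1$; the symbol name `alpha` and both helper functions are hypothetical and only illustrate the arithmetic, not the paper's heuristics.

```python
def size_bounds(c, alpha):
    """Return (lower, upper) allowed cluster sizes for ideal size c.

    Assumes the constraint (1 - alpha/2) * c <= |C_i| <= c * (1 + alpha),
    with 0 <= alpha < 1 (an assumed reading of the abstract).
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return (1 - alpha / 2) * c, (1 + alpha) * c


def clusters_within_bounds(cluster_sizes, c, alpha):
    """Check that every cluster size lies inside the allowed interval."""
    lower, upper = size_bounds(c, alpha)
    return all(lower <= size <= upper for size in cluster_sizes)


if __name__ == "__main__":
    # n = 30 vertices, p = 3 clusters, so the ideal size is c = 10.
    print(size_bounds(10, 0.5))
    print(clusters_within_bounds([8, 10, 12], 10, 0.5))
```

With $\alpha = 0$ the interval collapses to exactly c vertices per cluster, so larger values of $\alpha$ give the heuristics more freedom to keep subtrees intact.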

### STRING MATCHING AND INDEXING WITH SUFFIX DATA STRUCTURES

, 2007

I like to thank everyone who has been there for me in this quest for knowledge and a journey of self discovery. I am fortunately blessed with a caring family and am grateful to my parents and sisters for their support. I dedicate my thesis to the memory of my mother for her self-lessness and abundant love. To that special someone, my loving and supportive wife Lin Li, thank you for your kindness and believing in me. To my advisory committee members, Assoc Prof Tan Kian Lee and Assoc Prof Lee Mong Li, thank you for your patience and valuable advice. My sincere appreciation goes to my supervisors Assoc Prof Ken Sung Wing Kin and Prof Wong Lim Soon for their guidance and generosity in sharing their wisdom with me. Lastly, to all my friends and colleagues at the School of Computing, a big thanks to you. The past years with the school will be fondly remembered.

### SUPPORTING LINK ANALYSIS USING ADVANCED QUERYING METHODS ON SEMANTIC WEB DATABASES

There is an increasing demand for technologies that can help organizations unearth actionable knowledge from their data assets. This demand continues to drive the flurry of activity in data mining research, where the emphasis is on technologies that can identify patterns in data. However, in addition to the “patterns” view of data, other data and knowledge perspectives are required to support the broad range of complex analytical tasks found in contemporary applications. For example, in some applications in homeland security, bioinformatics, business, and other investigative domains, many tasks are focused on “connecting the dots”. For this genre of applications, support for identifying, revealing, and analyzing links or relationships between groups of entities (link analysis) is crucial. Currently, mainstream database systems do not provide support for such analyses, and current solutions rely on exporting data from the database into custom applications to be analyzed. This incurs additional overhead and precludes exploiting other mature technologies offered by today’s database systems. This thesis argues for database support for link analysis by providing an appropriate

### Search-Optimized Persistent Suffix Tree Storage for Biological Applications

The suffix tree is a well-known and popular indexing structure for various sequence processing problems arising in biological data management. However, unlike traditional indexing structures, suffix trees are orders of magnitude larger than the underlying data. Moreover, their construction and search algorithms are extremely inefficient when implemented directly on disk. Recently, we have shown that it is possible to significantly speed up the on-disk construction of suffix trees through a careful choice of buffering policy and physical representation of suffix tree nodes. In this paper, we explore the problem of performing efficient searches on disk-resident suffix trees. Specifically, we investigate the gains that can be achieved through customized node-to-page layout strategies. Through detailed experimentation on real-life genomic sequences, we demonstrate that a new layout, called Stellar, provides significantly improved search performance. The key feature of Stellar is that in addition to standard root-to-leaf lookup queries, it also supports complex sequence search algorithms that exploit the suffix link feature of suffix trees. These results are encouraging with regard to the ultimate objective of seamlessly integrating sequence processing in database engines.
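To make the notion of a suffix-based lookup query concrete, here is a minimal in-memory sketch that uses a suffix array (a simpler relative of the suffix tree) and binary search to find every occurrence of a pattern. It is a swapped-in technique for illustration: it does not implement suffix links, on-disk page layouts, or the Stellar layout, and the function names and sample sequence are assumptions.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically.

    Fine for short demo strings; practical indexes use faster builders.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of pattern in text via binary search."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at the given rank.
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first rank whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "GATTACATTA"
    sa = build_suffix_array(genome)
    print(find_occurrences(genome, sa, "TTA"))
```

Each query touches O(log n) suffixes, and every probe may land on a different region of the index, which hints at why the physical layout of nodes on disk pages matters for search performance.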

### Applications of Clustering Problems

, 1997

Introduction. Clustering algorithms have been employed in many fields of human knowledge that require finding a "natural association" among some specific data. The way the term "natural association" is defined depends on the field and the particular application that uses it, and can vary considerably. For example, in the life sciences, the data to be clustered consists of life forms, and the clusters themselves are species, or subspecies of a given species. The need for clustering algorithms is determined by the scientists' wish to study various characteristics of the life forms. [5] mentions a situation where mammals are clustered by similarities in the proportion of three chemicals in their milk. The "natural association" of the animals depends on the percentages of these chemicals in their milk. A number of other domains use clustering algorithms to extract information by grouping a large collection of objects, such as the medical sciences (clustering of symptoms yields