On kernel-target alignment
Advances in Neural Information Processing Systems 14, 2002
Cited by 243 (8 self)
Kernel-based methods are increasingly being used for data modeling because of their conceptual simplicity and outstanding performance on many tasks. However, the kernel function is often chosen using trial-and-error heuristics. In this paper we address the problem of measuring the degree of agreement between a kernel and a learning task. A quantitative measure of agreement is important from both a theoretical and practical point of view. We propose a quantity to capture this notion, which we call Alignment. We study its theoretical properties, and derive a series of simple algorithms for adapting a kernel to the labels and vice versa. This produces a series of novel methods for clustering and transduction, kernel combination and kernel selection. The algorithms are tested on two publicly available datasets and are shown to exhibit good performance.
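The abstract names the Alignment quantity without stating it. A minimal sketch follows, assuming the commonly cited empirical form A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F), with the target matrix y y^T built from labels in {-1, +1}. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product: sum over i, j of a[i][j] * b[i][j]."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(k1, k2):
    """Empirical alignment between two Gram matrices built on the same sample."""
    return frobenius_inner(k1, k2) / math.sqrt(
        frobenius_inner(k1, k1) * frobenius_inner(k2, k2)
    )


def target_alignment(k, y):
    """Alignment of Gram matrix k with the target matrix y y^T (labels in {-1, +1})."""
    target = [[yi * yj for yj in y] for yi in y]
    return alignment(k, target)


# Hypothetical toy sample: four 1-D points with a linear kernel k(a, b) = a * b.
x = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
linear_gram = [[xi * xj for xj in x] for xi in x]

print(target_alignment(linear_gram, y))
```

Under this definition a kernel whose Gram matrix is proportional to y y^T scores 1, so higher values suggest a better match between the kernel and the labels.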
Automated Text Summarization in SUMMARIST
1999
Cited by 138 (11 self)
SUMMARIST is an attempt to create a robust automated text summarization system, based on the equation: summarization = topic identification + interpretation + generation. Each of these stages contains several independent modules, many of them trained on large corpora of text. We describe the system's architecture and provide details of some of its modules.
Matrices, vector spaces, and information retrieval
SIAM Review, 1999
Cited by 112 (1 self)
Abstract. The evolution of digital libraries and the Internet has dramatically transformed the processing, storage, and retrieval of information. Efforts to digitize text, images, video, and audio now consume a substantial portion of both academic and industrial activity. Even when there is no shortage of textual materials on a particular topic, procedures for indexing or extracting the knowledge or conceptual information contained in them can be lacking. Recently developed information retrieval technologies are based on the concept of a vector space. Data are modeled as a matrix, and a user’s query of the database is represented as a vector. Relevant documents in the database are then identified via simple vector operations. Orthogonal factorizations of the matrix provide mechanisms for handling uncertainty in the database itself. The purpose of this paper is to show how such fundamental mathematical concepts from linear algebra can be used to manage and index large text collections. Key words. information retrieval, linear algebra, QR factorization, singular value decomposition, vector spaces
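The vector space idea described above (documents as columns of a matrix, a query as a vector, relevance via simple vector operations) can be sketched as follows. The term list, the counts, and the use of cosine similarity as the ranking measure are assumptions chosen for illustration. The orthogonal factorizations the abstract mentions (QR, SVD) are not shown.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary; each document is one column of the term-document matrix,
# stored here as a list of raw term counts in vocabulary order.
terms = ["matrix", "vector", "retrieval", "library"]
docs = {
    "d1": [2, 1, 0, 0],
    "d2": [0, 1, 2, 0],
    "d3": [0, 0, 1, 3],
}

# The query "vector retrieval" expressed in the same term space.
query = [0, 1, 1, 0]

# Rank documents by decreasing cosine similarity to the query.
ranking = sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)
print(ranking)
```

Cosine similarity normalizes for document length, so a long document does not outrank a short one merely by containing more words.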
Relationship-Based Clustering and Visualization for High-Dimensional Data Mining
INFORMS Journal on Computing, 2002
Cited by 41 (10 self)
In several real-life data-mining... This paper proposes a relationship-based approach that alleviates both problems, sidestepping the "curse-of-dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high-dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance, and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters.
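The reordering step described above can be illustrated with a small sketch. It assumes cluster labels are already available (the paper obtains them by graph partitioning, which is not reproduced here) and simply permutes the rows and columns of a similarity matrix so that members of the same cluster become contiguous. The function name and the toy matrix are hypothetical.

```python
def permute_by_cluster(sim, labels):
    """Reorder a square similarity matrix so that same-cluster points are adjacent.

    Returns the permutation (original indices in their new order) and the
    permuted matrix. The sort is stable, so points keep their relative order
    within each cluster.
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    permuted = [[sim[i][j] for j in order] for i in order]
    return order, permuted


# Hypothetical similarities: points 0 and 2 are alike, as are points 1 and 3.
sim = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
]
labels = [0, 1, 0, 1]  # assumed output of some clustering step

order, permuted = permute_by_cluster(sim, labels)
for row in permuted:
    print(row)
```

After the permutation, the high-similarity entries gather in square blocks along the diagonal, which is what makes the clusters visible when the matrix is drawn as an image.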
Web Document Clustering Using Hyperlink Structures
2001
Cited by 40 (5 self)
With the exponential growth of information on the World Wide Web, there is great demand for developing efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web and remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods exploring textual information, hyperlink structure and co-citation relations. In particular, we apply the normalized-cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections of the normalized-cut method to the K-means method. We then experiment with the normalized-cut method in the context of clustering query result sets for web search engines.
TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections
2005
Cited by 30 (2 self)
A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdm’s from text collections and for the incremental modification of these tdm’s by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
Partitioning Rectangular And Structurally Nonsymmetric Sparse Matrices For Parallel Processing
SIAM J. Sci. Comput., 1998
Cited by 17 (0 self)
A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices. Key words. matrix partitioning, iterative method, parallel computing, rectangular matrix, structurally nonsymmetric matrix, bipartite graph AMS subject classifications. 05C50, 65F10, 65F50, 65Y05 1. Introduction. Matrix-vector and matrix-transpose-vector products that repeatedly involve the same large, sparse, structurally nonsymmetric or rectangular matrix arise in many iterative algorithms. Examples include algorithms for solving linear systems, least squares problems, and linear programs. To e...
Partitioning rectangular and structurally unsymmetric sparse matrices for parallel processing
SIAM J. Sci. Comput.
Cited by 13 (0 self)
Abstract. A common operation in scientific computing is the multiplication of a sparse, rectangular, or structurally unsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices.
Clustering in Massive Data Sets
Handbook of massive data sets, 1999
Cited by 12 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithm. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.