Results 1-10 of 96
A comparison of document clustering techniques
In KDD Workshop on Text Mining, 2000
Cited by 420 (28 self)
Abstract: This paper presents the results of an experimental study of some common document clustering techniques: agglomerative hierarchical clustering and K-means. (We used both a “standard” K-means algorithm and a “bisecting” K-means algorithm.) Our results indicate that the bisecting K-means technique is better than the standard K-means approach and (somewhat surprisingly) as good or better than the hierarchical approaches that we tested.
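The bisecting K-means idea the abstract refers to can be sketched briefly: repeatedly pick the largest cluster and split it in two with a standard K-means run. A minimal illustration in Python (a generic sketch with deterministic farthest-point initialization, not the authors' implementation):

```python
import numpy as np

def kmeans(X, k=2, iters=50):
    # Lloyd's algorithm with deterministic farthest-point initialization.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[int(np.argmax(d))])
    centroids = np.array(centroids, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    # Split the currently largest cluster with 2-means until k clusters remain.
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        members = clusters.pop(i)
        split = kmeans(X[members], k=2)
        clusters.append(members[split == 0])
        clusters.append(members[split == 1])
    assignment = np.empty(len(X), dtype=int)
    for cid, members in enumerate(clusters):
        assignment[members] = cid
    return assignment
```

Variants differ in how the cluster to split is chosen (by size, as here, or by within-cluster similarity).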
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval
1998
Cited by 346 (1 self)
Abstract: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.
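The independence assumption in the title means that, given the class, word occurrences in a document are treated as independent. A minimal sketch of the multinomial variant with Laplace smoothing (a generic textbook formulation, not tied to the specific distributional variants the paper reviews):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (token_list, label) pairs. Count class frequencies,
    # per-class word frequencies, and the overall vocabulary.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, model):
    # Pick the class maximizing log P(class) + sum_w log P(w | class),
    # with add-one (Laplace) smoothing; unseen words are ignored.
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:
            if w in vocab:
                lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```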
Matrices, vector spaces, and information retrieval
SIAM Review, 1999
Cited by 108 (1 self)
Abstract: The evolution of digital libraries and the Internet has dramatically transformed the processing, storage, and retrieval of information. Efforts to digitize text, images, video, and audio now consume a substantial portion of both academic and industrial activity. Even when there is no shortage of textual materials on a particular topic, procedures for indexing or extracting the knowledge or conceptual information contained in them can be lacking. Recently developed information retrieval technologies are based on the concept of a vector space. Data are modeled as a matrix, and a user’s query of the database is represented as a vector. Relevant documents in the database are then identified via simple vector operations. Orthogonal factorizations of the matrix provide mechanisms for handling uncertainty in the database itself. The purpose of this paper is to show how such fundamental mathematical concepts from linear algebra can be used to manage and index large text collections.
Key words: information retrieval, linear algebra, QR factorization, singular value decomposition, vector spaces
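The vector-space scheme described above reduces ranking to simple vector operations: encode documents as columns of a term-document matrix, encode the query as a vector, and score each document by the cosine of the angle between the two. A small sketch (illustrative; raw term counts rather than any particular weighting):

```python
import numpy as np

def cosine_retrieval(A, q):
    # A: term-by-document matrix (terms x docs); q: query term vector.
    # Score each document column by cosine similarity with q, then rank.
    doc_norms = np.linalg.norm(A, axis=0)
    sims = (A.T @ q) / (doc_norms * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims
```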
Generalizing discriminant analysis using the generalized singular value decomposition
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004
Cited by 54 (14 self)
Abstract: Discriminant analysis has been used for decades to extract features that preserve class separability. It is commonly defined as an optimization problem involving covariance matrices that represent the scatter within and between clusters. The requirement that one of these matrices be nonsingular limits its application to data sets with certain relative dimensions. We examine a number of optimization criteria, and extend their applicability by using the generalized singular value decomposition to circumvent the nonsingularity requirement. The result is a generalization of discriminant analysis that can be applied even when the sample size is smaller than the dimension of the sample data. We use classification results from the reduced representation to compare the effectiveness of this approach with some alternatives, and conclude with a discussion of their relative merits.
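The optimization problem described above can be made concrete in the classical two-class case: build the within-class scatter matrix and solve for the discriminant direction. The sketch below shows the classical formulation whose nonsingularity requirement the paper removes (generic Fisher LDA, not the paper's GSVD method):

```python
import numpy as np

def lda_direction(X, y):
    # Two-class Fisher discriminant: maximize between-class scatter along w
    # relative to within-class scatter. The classical solution
    # w = Sw^{-1} (m1 - m0) requires the within-class scatter Sw to be
    # nonsingular, which fails when samples are scarce relative to dimension.
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)
```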
Local features for object class recognition
In Proceedings of the 10th IEEE International Conference on Computer Vision, 2005
Cited by 51 (6 self)
Abstract: In this paper we compare the performance of local detectors and descriptors in the context of object class recognition. Recently, many detectors/descriptors have been evaluated in the context of matching as well as invariance to viewpoint changes [20]. However, it is unclear if these results can be generalized to categorization problems, which require different properties of features. We evaluate five state-of-the-art scale-invariant region detectors and five descriptors. Local features are computed for 20 object classes and clustered using hierarchical agglomerative clustering. We measure the quality of appearance clusters and location distributions using entropy as well as precision. We also measure how the clusters generalize from the training set to novel test data. Our results indicate that extended SIFT descriptors [22] computed on Hessian-Laplace [20] regions perform best. The second-best score is obtained by salient regions [11]. The results also show that these two detectors provide complementary features. The new detectors/descriptors significantly improve the performance of a state-of-the-art recognition approach [16] in a pedestrian detection task.
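Hierarchical agglomerative clustering, used above to group local features, starts from singleton clusters and repeatedly merges the closest pair. A naive average-linkage sketch (illustrative and O(n^3); not the authors' implementation, which operates on high-dimensional descriptors):

```python
import numpy as np

def agglomerative(X, k):
    # Average-linkage agglomerative clustering: start with singletons and
    # merge the pair of clusters with the smallest mean pairwise distance,
    # until only k clusters remain.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```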
Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition
SIAM Journal on Matrix Analysis and Applications, 2003
Cited by 42 (19 self)
Abstract: In today’s vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower-dimensional representation must be a good approximation of the full document set. To that end, we adapt and extend the discriminant analysis projection used in pattern recognition. This projection preserves cluster structure by maximizing the scatter between clusters while minimizing the scatter within clusters. A common limitation of trace optimization in discriminant analysis is that one of the scatter matrices must be nonsingular, which restricts its application to document sets in which the number of terms does not exceed the number of documents. We show that by using the generalized singular value decomposition (GSVD), we can achieve the same goal regardless of the relative dimensions of the term-document matrix. In addition, applying the GSVD allows us to avoid the explicit formation of the scatter matrices in favor of working directly with the data matrix, thus improving the numerical properties of the approach. Finally, we present experimental results that confirm the effectiveness of our approach.
Determining Text Databases to Search in the Internet
1998
Cited by 39 (5 self)
Abstract: Text data in the Internet can be partitioned into many databases naturally. Efficient retrieval of desired data can be achieved if we can accurately predict the usefulness of each database, because with such information, we only need to retrieve potentially useful documents from useful databases. In this paper, we propose two new methods for estimating the usefulness of text databases. For a given query, the usefulness of a text database in this paper is defined to be the number of documents in the database that are sufficiently similar to the query. Such a usefulness measure enables naive users to make informed decisions about which databases to search. We also consider the collection fusion problem. Because local databases may employ similarity functions that are different from that used by the global database, the threshold used by a local database to determine whether a document is potentially useful may be different from that used by the global database. We provide techniques that ...
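The usefulness measure defined above (the number of database documents whose similarity to the query meets a threshold) is straightforward to compute when the document vectors are available; the estimation problem the paper addresses is approximating this count without them. A sketch of the exact count (function and variable names are illustrative):

```python
import numpy as np

def database_usefulness(doc_vectors, query, threshold):
    # Count the documents (rows of doc_vectors) whose cosine similarity
    # to the query is at least `threshold`.
    sims = doc_vectors @ query / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query) + 1e-12)
    return int((sims >= threshold).sum())
```

A broker would rank the available databases by this count and search only the top few.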
Lower Dimensional Representation of Text Data based on Centroids and Least Squares
BIT, 2003
Cited by 39 (14 self)
Abstract: ... improving computational efficiency in handling massive data. In this paper, we propose a mathematical framework for lower dimensional representation of text data in vector space based information retrieval using minimization and matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then we propose a new approach which is more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods are discussed over the LSI/SVD in terms of computational efficiency and data representation in the reduced dimensional space.
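The LSI/SVD dimension reduction that the framework recovers is a truncated singular value decomposition of the term-document matrix: keep the top k singular triplets and represent documents (and folded-in queries) in the resulting k-dimensional latent space. A sketch (generic LSI, not the paper's new centroid/least-squares method):

```python
import numpy as np

def lsi_reduce(A, k):
    # Rank-k truncated SVD of the term-document matrix A (terms x docs).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]
    docs_k = np.diag(sk) @ Vtk            # document coordinates in latent space
    def fold_in(q):
        # Project a term-space query vector into the same latent space.
        return Uk.T @ q
    return docs_k, fold_in
```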
Estimating the Usefulness of Search Engines
1999
Cited by 32 (14 self)
Abstract: In this paper, we present a statistical method to estimate the usefulness of a search engine for any given query. The estimates can be used by a metasearch engine to choose local search engines to invoke. For a given query, the usefulness of a search engine in this paper is defined to be a combination of the number of documents in the search engine that are sufficiently similar to the query and the average similarity of these documents. Experimental results indicate that the proposed estimation method is quite accurate.

1 Introduction

Many search engines have been created on the Internet to help ordinary users find desired data. Each search engine has a corresponding database that defines the set of documents that can be searched by the search engine. Usually, an index for all documents in the database is created and stored in the search engine to speed up query processing. The amount of data in the Internet is huge (it is believed that by the end of 1997, there were more than 300 mil...
An optimization criterion for generalized discriminant analysis on undersampled problems
IEEE Trans. Pattern Analysis and Machine Intelligence, 2004
Cited by 28 (8 self)
Abstract: An optimization criterion is presented for discriminant analysis. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) through the use of the pseudoinverse when the scatter matrices are singular. It is applicable regardless of the relative sizes of the data dimension and sample size, overcoming a limitation of classical LDA. The optimization problem can be solved analytically by applying the Generalized Singular Value Decomposition (GSVD) technique. The pseudoinverse has been suggested and used for undersampled problems in the past, where the data dimension exceeds the number of data points. The criterion proposed in this paper provides a theoretical justification for this procedure. An approximation algorithm for the GSVD-based approach is also presented. It reduces the computational complexity by finding subclusters of each cluster and uses their centroids to capture the structure of each cluster. This reduced problem yields much smaller matrices to which the GSVD can be applied efficiently. Experiments on text data, with up to 7,000 dimensions, show that the approximation algorithm produces results that are close to those produced by the exact algorithm.
Index Terms: classification, clustering, dimension reduction, generalized singular value decomposition, linear discriminant analysis, text mining.
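The pseudoinverse idea can be illustrated in the two-class case: replacing the inverse of the within-class scatter matrix with its Moore-Penrose pseudoinverse keeps the discriminant direction well defined even when there are fewer samples than dimensions. A sketch (generic formulation; the paper's criterion and its GSVD algorithm are more general):

```python
import numpy as np

def pinv_lda_direction(X, y):
    # Undersampled two-class discriminant: with n_samples < n_features the
    # within-class scatter Sw is singular, so Sw^{-1} does not exist; the
    # pseudoinverse pinv(Sw) keeps the criterion defined.
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.pinv(Sw) @ (m1 - m0)
    return w / np.linalg.norm(w)
```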