Results 1-10 of 28
Latent semantic indexing: A probabilistic analysis
, 1998
Cited by 269 (7 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
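The spectral step this abstract refers to can be illustrated on a toy corpus. The sketch below is not the paper's code: the four-term, five-document matrix, the helper names (`top_eigenpairs`, `lsi_embed`, `cosine`), and the use of power iteration with deflation in place of a library SVD are all assumptions made for illustration. Random projection, which the abstract proposes as a speed-up, is not shown.

```python
import math


def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpairs(sym, k, iters=1000):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        vec = [1.0 + 0.1 * i for i in range(n)]  # fixed, deterministic start vector
        for _step in range(iters):
            nxt = matvec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        lam = sum(a * b for a, b in zip(vec, matvec(work, vec)))
        pairs.append((lam, vec))
        # Deflate: remove the component just found so the next round finds the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * vec[i] * vec[j]
    return pairs


def lsi_embed(term_doc, k):
    """Rank-k document coordinates: sqrt(eigenvalue) * eigenvector entry of A^T A."""
    n_docs = len(term_doc[0])
    gram = [[sum(row[i] * row[j] for row in term_doc) for j in range(n_docs)]
            for i in range(n_docs)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(lam) * vec[j] for lam, vec in pairs] for j in range(n_docs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy corpus: rows are terms, columns are documents d0..d4.
TERM_DOC = [
    [1, 0, 1, 0, 0],  # car
    [0, 1, 1, 0, 0],  # auto
    [0, 0, 0, 1, 0],  # galaxy
    [0, 0, 0, 1, 1],  # star
]

if __name__ == "__main__":
    emb = lsi_embed(TERM_DOC, 2)
    print("d0 vs d1 (same topic):", cosine(emb[0], emb[1]))
    print("d0 vs d3 (different topic):", cosine(emb[0], emb[3]))
```

In this toy corpus d0 ("car") and d1 ("auto") share no terms, so their raw term vectors are orthogonal, yet both co-occur with d2 and therefore end up on the same direction in the 2-dimensional space, while the "galaxy/star" documents end up on the other axis.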
Approximation Algorithms for Projective Clustering
 Proceedings of the ACM SIGMOD International Conference on Management of data, Philadelphia
, 2000
Cited by 256 (21 self)
We consider the following two instances of the projective clustering problem: Given a set S of n points in R^d and an integer k > 0, cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp. the maximum diameter of a hypercylinder) is minimized. Let w be the smallest value so that S can be covered by k hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most w. In the plane, the two problems are equivalent. It is NP-hard to compute k planar strips of width even at most Cw, for any constant C > 0 [50]. This paper contains four main results related to projective clustering: (i) For d = 2, we present a randomized algorithm that computes O(k log k) strips of width at most 6w that cover S. Its expected running time is O(n k^2 log^4 n) if k^2 log k ≤ n; it also works for larger values of k, but then the expected running time is O(n^(2/3) k^(8/3) log^4 n). We also propose another algorithm that computes a c...
Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks
, 2003
Cited by 215 (7 self)
Content-based full-text search is a challenging problem in Peer-to-Peer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized non-flooding P2P information retrieval system. pSearch distributes document indices through the P2P network based on document semantics generated by Latent Semantic Indexing (LSI). The search cost (in terms of different nodes searched and data transmitted) for a given query is thereby reduced, since the indices of semantically related documents are likely to be co-located in the network. We also describe techniques that help distribute the indices more evenly across the nodes, and further reduce the number of nodes accessed using appropriate index distribution as well as using index samples and recently processed queries to guide the search. Experiments show that pSearch can achieve performance comparable to centralized information retrieval systems by searching only a small number of nodes. For a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.), pSearch searches only 19 nodes and transmits only 95.5 KB of data during the search, whereas the top 15 documents returned by pSearch and LSI have a 91.7% intersection.
Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases Into Hierarchical Topic Taxonomies
, 1998
Cited by 115 (9 self)
We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases, are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of online textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or...
Using taxonomy, discriminants, and signatures for navigating in text databases
 In Proceedings of the 23rd VLDB Conference
, 1997
Cited by 79 (5 self)
We explore how to organize a text database hierarchically to aid better searching and browsing. We propose to exploit the natural hierarchy of topics, or taxonomy, that many corpora, such as internet directories, digital libraries, and patent databases, enjoy. In our system, the user navigates through the query response not as a flat unstructured list, but embedded in the familiar taxonomy, and annotated with document signatures computed dynamically with respect to where the user is located at any time. We show how to update such databases with new documents with high speed and accuracy. We use techniques from statistical pattern recognition to efficiently separate the feature words or discriminants from the noise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of noise words in a document. Thus the classifier has a small model size and is very fast. However, owing to the use of context-sensitive features, the classifier is very accurate. We report on experiences with the Reuters newswire benchmark, the US Patent database, and web document samples from Yahoo!.
Spectral Analysis of Internet Topologies
, 2003
Cited by 78 (6 self)
We perform spectral analysis of the Internet topology at the AS level, by adapting the standard spectral filtering method of examining the eigenvectors corresponding to the largest eigenvalues of matrices related to the adjacency matrix of the topology. We observe that the method suggests clusters of ASes with natural semantic proximity, such as geography or business interests. We examine how these clustering properties vary in the core and in the edge of the network, as well as across geographic areas, over time, and between real and synthetic data. We observe that these clustering properties may be suggestive of traffic patterns and thus have direct impact on the link stress of the network. Finally, we use the weights of the eigenvector corresponding to the first eigenvalue to obtain an alternative hierarchical ranking of the ASes.
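The final step of this abstract, ranking nodes by their weight in the eigenvector of the largest eigenvalue, can be sketched as follows. This is only an illustration under stated assumptions: it uses the plain adjacency matrix of a hypothetical five-node graph (the abstract speaks of "matrices related to the adjacency matrix", which are not specified here), and the names `adjacency`, `principal_eigenvector`, `rank_nodes`, and `EDGES` are invented for the example.

```python
import math


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


def principal_eigenvector(adj, iters=500):
    """Eigenvector of the largest eigenvalue, found by power iteration.

    Each step multiplies by (A + I) rather than A. The shift leaves the
    eigenvectors unchanged but prevents oscillation on bipartite graphs.
    """
    n = len(adj)
    vec = [1.0] * n
    for _ in range(iters):
        nxt = [vec[i] + sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def rank_nodes(adj):
    """Node indices sorted by decreasing weight in the principal eigenvector."""
    scores = principal_eigenvector(adj)
    return sorted(range(len(adj)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy topology: node 0 is a hub, nodes 1 and 2 peer with each
# other, and node 4 reaches the rest of the graph only through node 3.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]

if __name__ == "__main__":
    print(rank_nodes(adjacency(5, EDGES)))
```

On this toy graph the hub (node 0) comes first and the stub node 4 comes last, which is the kind of hierarchy-like ordering the abstract alludes to; real AS-level data would of course be far larger and noisier.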
Information retrieval on the Web
 ACM Computing Surveys
, 2000
Cited by 74 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Exact and Approximation Algorithms for Clustering
, 1997
Cited by 59 (5 self)
In this paper we present an n^O(k^(1-1/d))-time algorithm for solving the k-center problem in R^d, under L1 and L2 metrics. The algorithm extends to other metrics, and can be used to solve the discrete k-center problem as well. We also describe a simple (1 + ε)-approximation algorithm for the k-center problem, with running time O(n log k) + (k/ε)^O(k^(1-1/d)). Finally, we present an n^O(k^(1-1/d))-time algorithm for solving the L-capacitated k-center problem, provided that L = Ω(n/k^(1-1/d)) or L = O(1). We conclude with a simple approximation algorithm for the L-capacitated k-center problem.
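To make the k-center objective concrete (choose k centers so that the largest distance from any point to its nearest center is as small as possible), here is a short sketch. It does not implement the algorithms of this paper; it uses the well-known greedy farthest-point heuristic, which gives a 2-approximation for the discrete problem in any metric space. The function name `greedy_k_center` and the sample points are assumptions for illustration.

```python
import math


def greedy_k_center(points, k):
    """Greedy farthest-point heuristic for discrete k-center (Euclidean distance).

    Starts from the first point, then repeatedly adds the point that is
    farthest from all centers chosen so far. Returns the centers and the
    resulting covering radius.
    """
    centers = [points[0]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: dist[i])
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers, max(dist)


if __name__ == "__main__":
    # Hypothetical input: two well-separated groups in the plane.
    pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(greedy_k_center(pts, 2))
```

The heuristic runs in O(nk) distance evaluations, so it is a cheap baseline against which exact or (1 + ε)-approximate methods can be compared.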
Testing of Clustering
 In Proc. 41st Annu. IEEE Sympos. Found. Comput. Sci
, 2000
Cited by 58 (14 self)
A set X of points in R^d is (k, b)-clusterable if X can be partitioned into k subsets (clusters) so that the diameter (alternatively, the radius) of each cluster is at most b. We present algorithms that, by sampling from a set X, distinguish between the case that X is (k, b)-clusterable and the case that X is ε-far from being (k, b')-clusterable for any given 0 < ε ≤ 1 and for b' ≥ b. By ε-far from being (k, b')-clusterable we mean that more than ε · |X| points should be removed from X so that it becomes (k, b')-clusterable. We give algorithms for a variety of cost measures that use a sample of size independent of |X|, and polynomial in k and 1/ε. Our algorithms can also be used to find approximately good clusterings. Namely, these are clusterings of all but an ε-fraction of the points in X that have optimal (or close to optimal) cost. The benefit of our algorithms is that they construct an implicit representation of such clusterings in time independ...
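The sample-then-check idea in this abstract can be illustrated in the simplest setting: points on a line with the diameter cost. The sketch below is an assumption-laden toy, not the paper's algorithm: the sample size is a free parameter rather than the bound derived in the paper, the sample is checked against b itself rather than a relaxed b', and the names `is_clusterable_1d` and `sample_test` are invented.

```python
import random


def is_clusterable_1d(values, k, b):
    """True if the values fit into at most k intervals of length at most b.

    In one dimension a greedy left-to-right sweep is optimal: open a new
    cluster only when the next value is more than b from the cluster's start.
    """
    if not values:
        return True
    ordered = sorted(values)
    clusters = 1
    start = ordered[0]
    for x in ordered[1:]:
        if x - start > b:
            clusters += 1
            start = x
    return clusters <= k


def sample_test(points, k, b, sample_size, seed=0):
    """Accept iff a uniform sample (without replacement) is (k, b)-clusterable.

    The running time depends on sample_size, not on len(points).
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return is_clusterable_1d(sample, k, b)


if __name__ == "__main__":
    clusterable = [i / 100 for i in range(100)] + [50 + i / 100 for i in range(100)]
    far = [10 * i for i in range(1000)]
    print(sample_test(clusterable, 2, 1.0, 50), sample_test(far, 2, 1.0, 50))
```

Because any subset of a clusterable set is itself clusterable, a test of this shape never rejects a clusterable input; its only possible error is accepting an input that is far from clusterable, which becomes less likely as the sample grows.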
pFilter: Global Information Filtering and Dissemination Using Structured Overlay Networks
 In FTDCS
, 2003
Cited by 17 (0 self)
Due to the overwhelming amount of information on the Internet, it is becoming increasingly difficult for people to find relevant information in a timely fashion. Information filtering and dissemination systems allow users to register persistent queries called user profiles. They detect new content, match it against the profiles, and continuously notify users when relevant information becomes available. Existing systems, however, either are not scalable