Latent semantic indexing: A probabilistic analysis
, 1998
Cited by 210 (8 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
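The random-projection idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a dense Gaussian projection matrix scaled by 1/sqrt(k), which is one common construction and not necessarily the one used in the paper. The function names, the dimensions, and the toy document vector are hypothetical.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Return a k x d matrix of N(0, 1/k) entries (assumed Gaussian construction)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vec):
    """Map a d-dimensional term vector to k dimensions."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]


def sq_norm(vec):
    """Squared Euclidean length of a vector."""
    return sum(x * x for x in vec)


if __name__ == "__main__":
    d, k = 200, 64                            # hypothetical vocabulary size and target dimension
    R = random_projection_matrix(d, k, seed=1)
    doc = [float(i % 3) for i in range(d)]    # toy term-count vector
    low = project(R, doc)
    # The squared length is preserved in expectation, so this ratio should be near 1.
    print(len(low), sq_norm(low) / sq_norm(doc))
```

Running the spectral step (for example, a truncated SVD) on the k-dimensional vectors instead of the original d-dimensional ones is where the speed-up would come from in this kind of scheme.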
Approximation Algorithms for Projective Clustering
- Proceedings of the ACM SIGMOD International Conference on Management of data, Philadelphia
, 2000
Cited by 196 (14 self)
We consider the following two instances of the projective clustering problem: Given a set $S$ of $n$ points in $\mathbb{R}^d$ and an integer $k > 0$, cover $S$ by $k$ hyper-strips (resp. hyper-cylinders) so that the maximum width of a hyper-strip (resp., the maximum diameter of a hyper-cylinder) is minimized. Let $w$ be the smallest value so that $S$ can be covered by $k$ hyper-strips (resp. hyper-cylinders), each of width (resp. diameter) at most $w$. In the plane, the two problems are equivalent. It is NP-Hard to compute $k$ planar strips of width even at most $Cw$, for any constant $C > 0$ [50]. This paper contains four main results related to projective clustering: (i) For $d = 2$, we present a randomized algorithm that computes $O(k \log k)$ strips of width at most $6w$ that cover $S$. Its expected running time is $O(n k^2 \log^4 n)$ if $k^2 \log k \le n$; it also works for larger values of $k$, but then the expected running time is $O(n^{2/3} k^{8/3} \log^4 n)$. We also propose another algorithm that computes a c...
Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks
, 2003
Cited by 184 (7 self)
Content-based full-text search is a challenging problem in Peer-to-Peer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized non-flooding P2P information retrieval system. pSearch distributes document indices through the P2P network based on document semantics generated by Latent Semantic Indexing (LSI). The search cost (in terms of different nodes searched and data transmitted) for a given query is thereby reduced, since the indices of semantically related documents are likely to be co-located in the network. We also describe techniques that help distribute the indices more evenly across the nodes, and further reduce the number of nodes accessed using appropriate index distribution as well as using index samples and recently processed queries to guide the search. Experiments show that pSearch can achieve performance comparable to centralized information retrieval systems by searching only a small number of nodes. For a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.), pSearch searches only 19 nodes and transmits only 95.5KB data during the search, whereas the top 15 documents returned by pSearch and LSI have a 91.7% intersection.
Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases Into Hierarchical Topic Taxonomies
, 1998
Cited by 87 (7 self)
We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or...
Using taxonomy, discriminants, and signatures for navigating in text databases
- In Proceedings of the 23rd VLDB Conference
, 1997
Cited by 67 (4 self)
We explore how to organize a text database hierarchically to aid better searching and browsing. We propose to exploit the natural hierarchy of topics, or taxonomy, that many corpora, such as internet directories, digital libraries, and patent databases enjoy. In our system, the user navigates through the query response not as a flat unstructured list, but embedded in the familiar taxonomy, and annotated with document signatures computed dynamically with respect to where the user is located at any time. We show how to update such databases with new documents with high speed and accuracy. We use techniques from statistical pattern recognition to efficiently separate the feature words or discriminants from the noise words at each node of the taxonomy. Using these, we build a multi-level classifier. At each node, this classifier can ignore the large number of noise words in a document. Thus the classifier has a small model size and is very fast. However, owing to the use of context-sensitive features, the classifier is very accurate. We report on experiences with the Reuters newswire benchmark, the US Patent database, and web document samples from Yahoo!.
Spectral Analysis of Internet Topologies
, 2003
Cited by 63 (7 self)
We perform spectral analysis of the Internet topology at the AS level, by adapting the standard spectral filtering method of examining the eigenvectors corresponding to the largest eigenvalues of matrices related to the adjacency matrix of the topology. We observe that the method suggests clusters of ASes with natural semantic proximity, such as geography or business interests. We examine how these clustering properties vary in the core and in the edge of the network, as well as across geographic areas, over time, and between real and synthetic data. We observe that these clustering properties may be suggestive of traffic patterns and thus have direct impact on the link stress of the network. Finally, we use the weights of the eigenvector corresponding to the first eigenvalue to obtain an alternative hierarchical ranking of the ASes.
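The final step of this abstract, ranking nodes by their weight in the eigenvector of the largest eigenvalue, can be sketched as follows. This is a minimal illustration on a plain symmetric adjacency matrix using power iteration. The paper works with matrices related to the adjacency matrix, so the choice of matrix, the function names, and the toy graph here are assumptions.

```python
def principal_eigenvector(adj, iters=200):
    """Approximate the eigenvector of the largest eigenvalue by power iteration.

    Iterates on A + I rather than A: the eigenvectors are identical, but the
    shift keeps the iteration from oscillating on bipartite graphs.
    """
    n = len(adj)
    v = [1.0] * n
    for _ in range(iters):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def rank_nodes(adj):
    """Return node indices ordered by decreasing eigenvector weight."""
    v = principal_eigenvector(adj)
    return sorted(range(len(adj)), key=lambda i: -v[i])


if __name__ == "__main__":
    # Hypothetical 4-node topology: node 0 links to 1, 2, 3; nodes 1 and 2 also link to each other.
    adj = [
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
    ]
    print(rank_nodes(adj))
```

In this toy graph the best-connected node comes first and the leaf attached only to it comes last, which is the kind of hierarchy such a ranking is meant to expose.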
Information retrieval on the Web
- ACM Computing Surveys
, 2000
Cited by 58 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Testing of Clustering
- In Proc. 41st Annu. IEEE Sympos. Found. Comput. Sci
, 2000
Cited by 51 (11 self)
A set $X$ of points in $\mathbb{R}^d$ is $(k, b)$-clusterable if $X$ can be partitioned into $k$ subsets (clusters) so that the diameter (alternatively, the radius) of each cluster is at most $b$. We present algorithms that, by sampling from a set $X$, distinguish between the case that $X$ is $(k, b)$-clusterable and the case that $X$ is $\epsilon$-far from being $(k, b')$-clusterable for any given $0 < \epsilon \le 1$ and for $b' \ge b$. By $\epsilon$-far from being $(k, b')$-clusterable we mean that more than $\epsilon \cdot |X|$ points should be removed from $X$ so that it becomes $(k, b')$-clusterable. We give algorithms for a variety of cost measures that use a sample of size independent of $|X|$, and polynomial in $k$ and $1/\epsilon$. Our algorithms can also be used to find approximately good clusterings. Namely, these are clusterings of all but an $\epsilon$-fraction of the points in $X$ that have optimal (or close to optimal) cost. The benefit of our algorithms is that they construct an implicit representation of such clusterings in time independ...
Exact and Approximation Algorithms for Clustering
, 1997
Cited by 48 (4 self)
In this paper we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the $k$-center problem in $\mathbb{R}^d$, under $L_\infty$ and $L_2$ metrics. The algorithm extends to other metrics, and can be used to solve the discrete $k$-center problem, as well. We also describe a simple $(1 + \epsilon)$-approximation algorithm for the $k$-center problem, with running time $O(n \log k) + (k/\epsilon)^{O(k^{1-1/d})}$. Finally, we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the $L$-capacitated $k$-center problem, provided that $L = \Omega(n/k^{1-1/d})$ or $L = O(1)$. We conclude with a simple approximation algorithm for the $L$-capacitated $k$-center problem. The work on this paper was partially supported by a National Science Foundation Grant CCR-93--01259, by an Army Research Office MURI grant DAAH04-96-1-0013, by a Sloan fellowship, by an NYI award and matching funds from Xerox Corporation, and by a grant from the U.S.-Israeli Binational Science Foundation. Department of Computer Science, Box ...
pFilter: Global Information Filtering and Dissemination Using Structured Overlay Networks
- In FTDCS
, 2003
Cited by 12 (0 self)
Due to the overwhelming amount of information on the Internet, it is becoming increasingly difficult for people to find relevant information in a timely fashion. Information filtering and dissemination systems allow users to register persistent queries called user profiles. They detect new contents, match them against the profiles, and continuously notify users when relevant information becomes available. Existing systems, however, either are not scalable

