Spectral Analysis of Internet Topologies
, 2003
Cited by 63 (7 self)
We perform spectral analysis of the Internet topology at the AS level, by adapting the standard spectral filtering method of examining the eigenvectors corresponding to the largest eigenvalues of matrices related to the adjacency matrix of the topology. We observe that the method suggests clusters of ASes with natural semantic proximity, such as geography or business interests. We examine how these clustering properties vary in the core and in the edge of the network, as well as across geographic areas, over time, and between real and synthetic data. We observe that these clustering properties may be suggestive of traffic patterns and thus have direct impact on the link stress of the network. Finally, we use the weights of the eigenvector corresponding to the first eigenvalue to obtain an alternative hierarchical ranking of the ASes.
CubeSVD: A Novel Approach to Personalized Web Search
- In Proc. of the 14th International World Wide Web Conference (WWW
, 2005
Cited by 47 (3 self)
As competition in the Web search market increases, there is a high demand for personalized Web search that conducts retrieval incorporating Web users' information needs. This paper focuses on utilizing clickthrough data to improve Web search. Since millions of searches are conducted every day, a search engine accumulates a large volume of clickthrough data, which records who submits queries and which pages he/she clicks on. The clickthrough data is highly sparse and contains different types of objects (user, query and Web page), and the relationships among these objects are also very complicated. By performing analysis on these data, we attempt to discover Web users' interests and the patterns by which users locate information. In this paper, a novel approach, CubeSVD, is proposed to improve Web search. The clickthrough data is represented by a 3-order tensor, on which we perform 3-mode analysis using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages. A tensor reconstructed based on the CubeSVD analysis reflects both the observed interactions among these objects and the implicit associations among them. Therefore, Web search activities can be carried out based on CubeSVD analysis. Experimental evaluations using a real-world data set collected from an MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods.
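The higher-order SVD step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the CubeSVD implementation: it assumes NumPy, a small random tensor standing in for (user, query, page) clickthrough counts, and arbitrarily chosen ranks. The function names `unfold`, `mode_product`, and `truncated_hosvd` are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `matrix` (J x I_mode) into `tensor` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return (core, factors, approximation) of a truncated higher-order SVD."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Left singular vectors of each unfolding give that mode's basis.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)      # project onto the basis
    approx = core
    for mode, u in enumerate(factors):
        approx = mode_product(approx, u, mode)    # map back to original size
    return core, factors, approx

# Toy data (assumption): 3 users x 4 queries x 5 pages.
rng = np.random.default_rng(0)
clicks = rng.random((3, 4, 5))
core, factors, approx = truncated_hosvd(clicks, ranks=(2, 3, 3))
```

Entries of `approx` that are large where `clicks` was small are the kind of implicit association the abstract refers to. How CubeSVD selects ranks and weights the raw counts is not stated here.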
On the Eigenvalue Power Law
, 2002
Cited by 43 (0 self)
We show that the largest eigenvalues of graphs whose highest degrees are Zipf-like distributed with slope α are distributed according to a power law with slope α/2. This follows as a direct and almost certain corollary of the degree power law. Our result has implications for the singular value decomposition method in information retrieval.
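One way to see why the slope halves, writing the degree slope as α: a star with d leaves has largest adjacency eigenvalue √d, so if the top degrees decay like i^(-α), the matching eigenvalues decay like i^(-α/2). The sketch below checks the √d relation on isolated stars using plain-Python power iteration. The isolated-star setup is a toy assumption, not the random-graph model of the paper, and the function names are illustrative.

```python
import math

def star_adjacency(d):
    """Adjacency lists of a star: node 0 is the hub, nodes 1..d are leaves."""
    adj = {0: list(range(1, d + 1))}
    for leaf in range(1, d + 1):
        adj[leaf] = [0]
    return adj

def largest_eigenvalue(adj, iters=500, shift=1.0):
    """Estimate the top adjacency eigenvalue by shifted power iteration.

    The shift (iterating with A + shift*I) avoids the oscillation that plain
    power iteration shows on bipartite graphs, whose spectrum is symmetric.
    """
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        y = {v: shift * x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[v] * sum(x[u] for u in adj[v]) for v in adj)

# Degrees 64 / i^2 at i = 1, 2, 4 (slope 2), chosen for illustration.
degrees = [64, 16, 4]
eigenvalues = [largest_eigenvalue(star_adjacency(d)) for d in degrees]
```

With these degrees the estimates should come out close to 8, 4 and 2, that is 8 / i: slope 1, half the degree slope.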
On Scaling Latent Semantic Indexing for Large Peer-To-Peer Systems
- Proc. 27th Annual International ACM SIGIR Conference
, 2004
Cited by 26 (0 self)
The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information retrieval with peer-to-peer technology for scalability, fault-tolerance, and low administration cost. One pioneering work along this direction is pSearch [32, 33]. pSearch places documents onto a peer-to-peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be co-located on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time.
Clustered SVD strategies in latent semantic indexing
- Information Processing and Management
, 2005
Cited by 10 (2 self)
The text retrieval method using Latent Semantic Indexing (LSI) technique with truncated Singular Value Decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term-document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.
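The partition-then-truncate idea can be sketched as below. This is a rough illustration assuming NumPy and cluster labels that are already given; the abstract does not say how the paper forms its clusters, and `truncated_svd` and `clustered_svd` are hypothetical names.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Keep the k leading singular triplets of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k], s[:k], vt[:k, :]

def clustered_svd(term_doc, labels, k):
    """Apply a rank-k truncated SVD to each document cluster separately.

    term_doc: terms x documents matrix; labels: one cluster id per document.
    Returns {cluster id: (document indices, U_k, s_k, Vt_k)}.
    """
    labels = np.asarray(labels)
    models = {}
    for label in np.unique(labels):
        docs = np.flatnonzero(labels == label)
        models[int(label)] = (docs,) + truncated_svd(term_doc[:, docs], k)
    return models

# Toy term-document matrix (assumption): two topical blocks of two documents.
term_doc = np.array([[2.0, 4.0, 0.0, 0.0],
                     [1.0, 2.0, 0.0, 0.0],
                     [0.0, 0.0, 3.0, 3.0],
                     [0.0, 0.0, 1.0, 1.0]])
models = clustered_svd(term_doc, labels=[0, 0, 1, 1], k=1)
docs0, u0, s0, vt0 = models[0]
block0 = (u0 * s0) @ vt0   # rank-1 reconstruction of cluster 0's columns
```

Each per-cluster factorization works on a smaller matrix than the full collection, which is where the storage and computing savings mentioned in the abstract would come from.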
Spectral analysis of random Graphs with skewed degree distributions
- Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science
, 2004
Cited by 7 (2 self)
We extend spectral methods to random graphs with skewed degree distributions through a degree based normalization closely connected to the normalized Laplacian. The normalization is based on intuition drawn from perturbation theory of random matrices, and has the effect of boosting the expectation of the random adjacency matrix without increasing the variances of its entries, leading to better perturbation bounds. The primary implication of this result lies in the realm of spectral analysis of random graphs with skewed degree distributions, such as the ubiquitous “power law graphs”. Recently Mihail and Papadimitriou [22] argued that for randomly generated graphs satisfying a power law degree distribution, spectral analysis of the adjacency matrix will simply produce the neighborhoods of the high degree nodes as its eigenvectors, and thus miss any embedded structure. We present a generalization of their model, incorporating latent structure, and prove that after applying our transformation, spectral analysis succeeds in recovering the latent structure with high probability.
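As a hedged illustration of a degree-based normalization: the normalized Laplacian is built from D^(-1/2) A D^(-1/2), where D is the diagonal degree matrix. The abstract only says the paper's transformation is closely connected to it, so the sketch below (assuming NumPy; `degree_normalize` is a hypothetical name) shows that standard form rather than the authors' exact normalization.

```python
import numpy as np

def degree_normalize(adjacency):
    """Return D^(-1/2) A D^(-1/2) for a symmetric adjacency matrix A."""
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    if np.any(degrees <= 0):
        raise ValueError("every node needs positive degree")
    scale = 1.0 / np.sqrt(degrees)
    # Entry (i, j) becomes A[i, j] / sqrt(d_i * d_j).
    return adjacency * scale[:, None] * scale[None, :]

# Toy graph (assumption): a star with hub 0 and three leaves.
adjacency = np.array([[0, 1, 1, 1],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0]], dtype=float)
normalized = degree_normalize(adjacency)
raw_top = np.linalg.eigvalsh(adjacency)[-1]     # grows with the hub degree
norm_top = np.linalg.eigvalsh(normalized)[-1]   # equals 1 after normalization
```

On the raw matrix the top eigenvalue tracks the hub's degree, which is the effect the abstract attributes to high-degree nodes. After normalization the top eigenvalue is 1 regardless of the hub size.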
A Probabilistic Model for Dimensionality Reduction in Information Retrieval and Filtering
- In Proc. of 1st SIAM Computational Information Retrieval Workshop
, 2001
Cited by 5 (0 self)
Dimension reduction methods, such as Latent Semantic Indexing (LSI), when applied to semantic spaces built upon text collections, improve information retrieval, information filtering and word sense disambiguation. A new dual probability model based on similarity concepts is introduced to explain the observed success. Semantic associations can be quantitatively characterized by their statistical significance, the likelihood. Semantic dimensions containing redundant and noisy information can be separated out and should be ignored because their contribution to the overall statistical significance is negative, giving rise to LSI: LSI is the optimal solution of the model. The peak in the likelihood curve indicates the existence of an intrinsic semantic dimension. The importance of LSI dimensions follows the Zipf distribution, indicating that LSI dimensions represent latent concepts. The document frequency of words follows the Zipf distribution, and the number of distinct words follows a log-normal distribution. Experiments on four standard document collections both confirm and illustrate the results and concepts presented here.
Identification of Critical Values in Latent Semantic Indexing
- Foundations of Data Mining and Knowledge Discovery
, 2005
Cited by 4 (1 self)
This paper reports the results of a study to determine the most critical elements of the T_k and S_k D_k matrices, which are input to LSI. We are interested in the impact, both in terms of retrieval quality and query run time performance, of the removal (zeroing) of a large portion of the entries in these matrices
Document Representation and Dimension Reduction for Text Clustering
Cited by 2 (1 self)
Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation methods for text are used, together with three Dimension Reduction Techniques (DRT), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model, and they include word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on Document Frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representation, ICA generally gives better results compared with LSI. Experiments also show that the word representation gives better clustering results compared to term and N-gram representation. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and, in most cases, a 4-gram representation gives better performance than 3-gram representation.
Embellishing Text Search Queries To Protect User Privacy
Cited by 2 (0 self)
Users of text search engines are increasingly wary that their activities may disclose confidential information about their business or personal profiles. It would be desirable for a search engine to perform document retrieval for users while protecting their intent. In this paper, we identify the privacy risks arising from semantically related search terms within a query, and from recurring high-specificity query terms in a search session. To counter the risks, we propose a solution for a similarity text retrieval system to offer anonymity and plausible deniability for the query terms, and hence the user intent, without degrading the system’s precision-recall performance. The solution comprises a mechanism that embellishes each user query with decoy terms that exhibit similar specificity spread as the genuine terms, but point to plausible alternative topics. We also provide an accompanying retrieval scheme that enables the search engine to compute the encrypted document relevance scores from only the genuine search terms, yet remain oblivious to their distinction from the decoys. Empirical evaluation results are presented to substantiate the effectiveness of our solution.

