Results 1 - 10
of
28
Query Suggestions Using Query-Flow Graphs
"... The query-flow graph [Boldi et al., CIKM 2008] is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, in the query-flow graph a directed edge from query qi to query qj means that the two queries are likely to be part of the same search mission. Any pat ..."
Abstract
-
Cited by 18 (8 self)
- Add to MetaCart
The query-flow graph [Boldi et al., CIKM 2008] is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, in the query-flow graph a directed edge from query qi to query qj means that the two queries are likely to be part of the same search mission. Any path over the query-flow graph may be seen as a possible search task, whose likelihood is given by the strength of the edges along the path. An edge (qi, qj) is also labelled with some information: e.g., the probability that user moves from qi to qj, or the type of the transition, for instance, the fact that qj is a specialization of qi. In this paper we propose, and experimentally study, query recommendations based on short random walks on the queryflow graph. Our experiments show that these methods can match in precision, and often improve, recommendations based on query-click graphs, without using users ’ clicks. Our experiments also show that it is important to consider transition-type labels on edges for having good quality recommendations. Finally, one feature that we had in mind while devising our methods was that of providing diverse sets of recommendations: the experimentation that we conducted provides encouraging results in this sense. 1.
Smoothing Clickthrough Data for Web Search Ranking
"... Incorporating features extracted from clickthrough data (called clickthrough features) has been demonstrated to significantly improve the performance of ranking models for Web search applications. Such benefits, however, are severely limited by the data sparseness problem, i.e., many queries and doc ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
Incorporating features extracted from clickthrough data (called clickthrough features) has been demonstrated to significantly improve the performance of ranking models for Web search applications. Such benefits, however, are severely limited by the data sparseness problem, i.e., many queries and documents have no or very few clicks. The ranker thus cannot rely strongly on clickthrough features for document ranking. This paper presents two smoothing methods to expand clickthrough data: query clustering via Random Walk on click graphs and a discounting method inspired by the Good-Turing estimator. Both methods are evaluated on real-world data in three Web search domains. Experimental results show that the ranking models trained on smoothed clickthrough features consistently outperform those trained on unsmoothed features. This study demonstrates both the importance and the benefits of dealing with the sparseness problem in clickthrough data.
From “dango” to “japanese cakes”: Query reformulation models and patterns. Submitted for publication
, 2008
"... Abstract—Understanding query reformulation patterns is a key step towards next generation web search engines: it can help improving users ’ web-search experience by predicting their intent, and thus helping them to locate information more effectively. As a step in this direction, we build an accurat ..."
Abstract
-
Cited by 10 (4 self)
- Add to MetaCart
Abstract—Understanding query reformulation patterns is a key step towards next generation web search engines: it can help improving users ’ web-search experience by predicting their intent, and thus helping them to locate information more effectively. As a step in this direction, we build an accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92 % accuracy. We apply the model to automatically label two large query logs, creating annotated queryflow graphs. We study the resulting reformulation patterns, finding results consistent with previous studies done on smaller manually annotated datasets, and discovering new interesting patterns, including connections between reformulation types and topical categories. Finally, applying our findings to a third query log that is publicly available for research purposes, we demonstrate that our reformulation classifier leads to improved recommendations in a query recommendation system. I.
Clustering query refinements by user intent
- In 19th International World Wide Web Conference, WWW
, 2010
"... We address the problem of clustering the refinements of a user search query. The clusters computed by our proposed algorithm can be used to improve the selection and placement of the query suggestions proposed by a search engine, and can also serve to summarize the different aspects of information r ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
We address the problem of clustering the refinements of a user search query. The clusters computed by our proposed algorithm can be used to improve the selection and placement of the query suggestions proposed by a search engine, and can also serve to summarize the different aspects of information relevant to the original user query. Our algorithm clusters refinements based on their likely underlying user intents by combining document click and session cooccurrence information. At its core, our algorithm operates by performing multiple random walks on a Markov graph that approximates user search behavior. A user study performed on top search engine queries shows that our clusters are rated better than corresponding clusters computed using approaches that use only document click or only sessions co-occurrence information. 1.
Learning Markov Logic Networks Using Structural Motifs
"... Markov logic networks (MLNs) use firstorder formulas to define features of Markov networks. Current MLN structure learners can only learn short clauses (4-5 literals) due to extreme computational costs, and thus are unable to represent complex regularities in data. To address this problem, we presen ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
Markov logic networks (MLNs) use firstorder formulas to define features of Markov networks. Current MLN structure learners can only learn short clauses (4-5 literals) due to extreme computational costs, and thus are unable to represent complex regularities in data. To address this problem, we present LSM, the first MLN structure learner capable of efficiently and accurately learning long clauses. LSM is based on the observation that relational data typically contains patterns that are variations of the same structural motifs. By constraining the search for clauses to occur within motifs, LSM can greatly speed up the search and thereby reduce the cost of finding long clauses. LSM uses random walks to identify densely connected objects in data, and groups them and their associated relations into a motif. Our experiments on three real-world datasets show that our approach is 2-5 orders of magnitude faster than the state-of-the-art ones, while achieving the same or better predictive performance. 1.
Entropy-biased Models for Query Representation On The Click Graph
, 2009
"... Query log analysis has received substantial attention in recent years, in which the click graph is an important technique for describing the relationship between queries and URLs. State-of-the-art approaches based on the raw click frequencies for modeling the click graph, however, are not noise-elim ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Query log analysis has received substantial attention in recent years, in which the click graph is an important technique for describing the relationship between queries and URLs. State-of-the-art approaches based on the raw click frequencies for modeling the click graph, however, are not noise-eliminated. Nor do they handle heterogeneous query-URL pairs well. In this paper, we investigate and develop a novel entropy-biased framework for modeling click graphs. The intuition behind this model is that various query-URL pairs should be treated differently, i.e., common clicks on less frequent but more specific URLs are of greater value than common clicks on frequent and general URLs. Based on this intuition, we utilize the entropy information of the URLs and introduce a new concept, namely the inverse query frequency (IQF), to weigh the importance (discriminative ability) of a click on a certain URL. The IQF weighting scheme is never explicitly explored or statistically examined for any bipartite graphs in the information retrieval literature. We not only formally define and quantify this scheme, but also incorporate it with the click frequency and user frequency information on the click graph for an effective query representation. To illustrate our methodology, we conduct experiments with the AOL query log data for query similarity analysis and query suggestion tasks. Experimental results demonstrate that considerable improvements in performance are obtained with our entropy-biased models.
Optimal Rare Query Suggestion With Implicit User Feedback
"... Query suggestion has been an effective approach to help users narrow down to the information they need. However, most of existing studies focused on only popular/head queries. Since rare queries possess much less information (e.g., clicks) than popular queries in the query logs, it is much more diff ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Query suggestion has been an effective approach to help users narrow down to the information they need. However, most of existing studies focused on only popular/head queries. Since rare queries possess much less information (e.g., clicks) than popular queries in the query logs, it is much more difficult to efficiently suggest relevant queries to a rare query. In this paper, we propose an optimal rare query suggestion framework by leveraging implicit feedbacks from users in the query logs. Our model resembles the principle of pseudo-relevance feedback which assumes that top-returned results by search engines are relevant. However, we argue that the clicked URLs and skipped URLs contain different levels of information and thus should be treated differently. Hence, our framework optimally combines both the click and skip information from users and uses a random walk model to optimize the query correlation. Our model specifically optimizes two parameters: (1) the restarting (jumping) rate of random walk, and (2) the combination ratio of click and skip information. Unlike the Rocchio algorithm, our learning process does not involve the content of the URLs but simply leverages the click and skip counts in the query-URL bipartite graphs. Consequently, our model is capable of scaling up to the need of commercial search engines. Experimental results on one-month query logs from a large commercial search engine with over 40 million rare queries demonstrate the superiority of our framework, with statistical significance, over the traditional random walk models and pseudo-relevance feedback models.
Improving Recommendation for Long-tail Queries via Templates
"... The ability to aggregate huge volumes of queries over a large population of users allows search engines to build precise models for a variety of query-assistance features such as query recommendation, correction, etc. Yet, no matter how much data is aggregated, the long-tail distribution implies tha ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
The ability to aggregate huge volumes of queries over a large population of users allows search engines to build precise models for a variety of query-assistance features such as query recommendation, correction, etc. Yet, no matter how much data is aggregated, the long-tail distribution implies that a large fraction of queries are rare. As a result, most query assistance services perform poorly or are not even triggered on long-tail queries. We propose a method to extend the reach of query assistance techniques (and in particular query recommendation) to long-tail queries by reasoning about rules between query templates rather than individual query transitions, as currently done in query-flow graph models. As a simple example, if we recognize that ‘Montezuma’ is a city in the rare query “Montezuma surf”andif therule‘<city> surf → <city> beach ’ has been observed, we are able to offer “Montezuma beach ” as a recommendation, even if the two queries were never observed in a same session. We conducted experiments to validate our hypothesis, first via traditional small-scale editorial assessments but more interestingly via a novel automated large scale evaluation methodology. Our experiments show that general coverage can be relatively increased by 24 % using templates without penalizing quality. Furthermore, for 36 % of the 95M queries in our query flow graph, which have no out edges and thus could not be served recommendations, we can now offer at least one recommendation in 98 % of the cases.
Jigs and Lures: Associating Web Queries with Structured Entities
"... We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the inherent data sparsity in such graphs, including interpolation using a query synonymy model. A large-scale empirical analysis of the smoothing techniques, over a 2-year click graph collected from a commercial search engine, shows significant reductions in modeling error. The association models are then applied to the task of recommending products to web queries, by annotating queries with products from a large catalog and then mining queryproduct associations through web search session analysis. Experimental analysis shows that our smoothing techniques improve coverage while keeping precision stable, and overall, that our top-performing model affects 9% of general web queries with 94 % precision. 1
CMU-ML-10-100 Fast Nearest-neighbor Search in Disk-resident Graphs
, 2010
"... Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most “similar ” k nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most “similar ” k nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting and commute times, and simrank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memoryresident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally random walks will rarely cross cluster boundaries and cause page-faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized pagerank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs like DBLP, Citeseer and Live-Journal ( ∼ 90 M edges) demonstrate that turning high degree nodes into sinks not only improves running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally our deterministic algorithm for computing nearest neighbors incurs far fewer page-faults (factor of 5) than actually simulating random walks

