## Microsoft

### BibTeX

@MISC{König_microsoft,

author = {Arnd Christian König and Kenneth Church and Martin Markov},

title = {Microsoft},

year = {}

}

### Abstract

Inverted files have been very successful for document retrieval, but sponsored search is different. Inverted files are designed to find documents that match the query (all the terms in the query need to be in the document, but not vice versa). For sponsored search, ads are associated with bids. When a user issues a search query, bids are typically matched to the query using broad-match semantics: all the terms in the bid need to be in the query (but not vice versa). This means that the roles of the query and the bid/document are reversed in sponsored search, in turn making standard retrieval techniques based on inverted indexes ill-suited for sponsored search. This paper proposes novel index structures and query processing algorithms for sponsored search. We evaluate these structures using a real corpus of 180 million advertisements.
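The role reversal described in the abstract can be illustrated with a minimal Python sketch. It treats queries, documents, and bids as plain sets of whitespace-separated terms; the function names `document_match` and `broad_match` are hypothetical and are not taken from the paper.

```python
def document_match(query_terms, document_terms):
    """Document retrieval: every query term must occur in the document."""
    return set(query_terms) <= set(document_terms)


def broad_match(bid_terms, query_terms):
    """Sponsored search (broad match): every bid term must occur in the query."""
    return set(bid_terms) <= set(query_terms)


query = "cheap used books online".split()
bid = "used books".split()

# The bid is a subset of the query, so it broad-matches.
print(broad_match(bid, query))      # True
# Treated as a document, the same phrase does not contain all query terms.
print(document_match(query, bid))   # False
```

The only difference between the two predicates is which side must be contained in the other, which is the reversal the abstract refers to.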

### Citations

798 | Managing gigabytes: compressing and indexing documents and images
- Witten, Moffat, et al.
- 1999
Citation Context ...y keywords that do not occur in the query; any advertisement that occurs less often cannot be a match. Note that we cannot use the well-known skipping optimization when processing the inverted indexes [25], since we are not merely computing intersections: e.g., if an advertisement phrase contains fewer keywords than a query, it does not have to be present in all inverted indexes traversed to be a match. Th...
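The excerpt above is truncated, so the following is only a sketch of one counting-based way to evaluate broad match over an inverted index, assuming that an advertisement matches when it appears in as many of the traversed posting lists as it has distinct keywords. The names `build_index` and `broad_match_by_counting` and the sample ads are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(ads):
    """Map each keyword to the ids of the ads whose bid phrase contains it."""
    index = defaultdict(list)
    for ad_id, phrase in ads.items():
        for word in set(phrase.split()):
            index[word].append(ad_id)
    return index


def broad_match_by_counting(query, ads, index):
    """Return ids of ads whose keywords all occur in the query."""
    counts = Counter()
    for word in set(query.split()):
        # Every posting list is scanned in full: an ad with fewer keywords
        # than the query need not appear in every list, so the lists cannot
        # simply be intersected.
        counts.update(index.get(word, ()))
    return sorted(
        ad_id
        for ad_id, seen in counts.items()
        if seen == len(set(ads[ad_id].split()))
    )


ads = {1: "used books", 2: "cheap flights", 3: "books"}
index = build_index(ads)
print(broad_match_by_counting("cheap used books", ads, index))  # [1, 3]
```

Ad 2 is seen once but has two keywords, so it is rejected; ads 1 and 3 are seen exactly as often as they have keywords.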

630 | A threshold of ln n for approximating set cover
- Feige
- 1998
Citation Context ...ine this mapping. Solving the general set cover problem is known to be NP-hard, and no polynomial-time algorithm can achieve an approximation factor better than $O(\ln |S_{base}|)$ for general set-cover [18]. However, it is possible to leverage an observation on the subsets in $S_{base}$ that may become part of the final mapping to come up with a fast approximation algorithm with tighter bounds. The key insig...

548 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...e cannot leverage a non-redundant indexing scheme; because of this, the indexing of such rules is often based on index structures used for longer bodies of text, such as suffix trees or suffix arrays [16]; both of these structures are less effective in the context of broad-match queries, as a suffix tree would increase the size of the index structure very significantly over the current solution, where...

375 | A greedy heuristic for the set-covering problem
- Chvatal
Citation Context ...imum number of advertisements that we can group in a single data node without violating the above constraint. Now, in cases where the set-size for a set-cover problem is limited by k, it is well known [19] that a simple greedy algorithm is an $H_k$-approximation algorithm for the weighted set cover, where $H_k = \sum_{i=1}^{k} \frac{1}{i}$ is the k-th harmonic number, in turn giving us a simple and fast algorithm with a much ...
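The greedy heuristic referred to above can be sketched as follows. This is the textbook weighted greedy rule (repeatedly pick the subset with the lowest weight per newly covered element), not the paper's specific mapping procedure; the function names, the toy universe, and the weights are assumptions made for illustration.

```python
def greedy_weighted_set_cover(universe, subsets, weights):
    """Greedy weighted set cover.

    subsets: dict mapping a subset name to a set of elements.
    weights: dict mapping a subset name to a positive cost.
    Returns the chosen subset names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, members in subsets.items():
            gain = len(members & uncovered)
            if gain == 0:
                continue
            ratio = weights[name] / gain  # cost per newly covered element
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))


subsets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {1, 2}}
weights = {"A": 3.0, "B": 1.0, "C": 1.0, "D": 1.5}
print(greedy_weighted_set_cover({1, 2, 3, 4, 5}, subsets, weights))  # ['B', 'D', 'C']
```

In this toy instance the largest subset has k = 3 elements, so the guarantee is a total weight within a factor `harmonic(3)` (about 1.83) of the optimum.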

235 | Filtering algorithms and implementation for very fast publish/subscribe
- Fabret
- 2001
Citation Context ...ompromised of the keywords it contains, where each advertisement is a subscription that is triggered by the set of keywords present in its bid phrase. The most closely related paper in this domain is [12]. This work is similar to our approach in that it models the underlying problem as a task of laying out the optimal in-memory indexing/processing structure, formulating this task as computing a struct...

172 | Compressed full-text indexes
- Navarro, Mäkinen
- 2007
Citation Context ...rays for lookup. Compression of the hash-lookup table: For this purpose, it is possible to leverage compressed binary sequences, which have been studied in the context of compressed full-text indexes [21]; these encode compressed binary strings B while allowing a number of operations in asymptotically optimal time – the operations we require are: • B[i] ⇔ Returns the value of the i-th bit of B. • rank...
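Because the excerpt breaks off after the first operation, the sketch below uses the conventional definitions of rank and select on a bit array, which may differ in detail from the paper's. It is an uncompressed illustration backed by prefix counts, not one of the succinct structures discussed in [21]; the class name `BitVector` and its method names are hypothetical.

```python
from bisect import bisect_left


class BitVector:
    """Uncompressed bit array with access, rank and select (illustration only)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] holds the number of 1-bits in bits[0:i].
        self.prefix = [0]
        for bit in self.bits:
            self.prefix.append(self.prefix[-1] + bit)

    def access(self, i):
        """Value of the i-th bit (0-based)."""
        return self.bits[i]

    def rank1(self, i):
        """Number of 1-bits among the first i bits."""
        return self.prefix[i]

    def select1(self, j):
        """0-based position of the j-th 1-bit, with j counted from 1."""
        if not 1 <= j <= self.prefix[-1]:
            raise IndexError("fewer than j one-bits in the vector")
        return bisect_left(self.prefix, j) - 1


b = BitVector([0, 1, 1, 0, 1])
print(b.access(1), b.rank1(3), b.select1(3))  # 1 2 4
```

Here `rank1` is a constant-time table lookup and `select1` a binary search over the prefix counts; the compressed representations cited in the excerpt support the same interface in far less space.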

91 | Efficient exact set-similarity joins
- Arasu, Ganti, et al.
- 2006
Citation Context ...bset-matching is typically not very meaningful for text documents. Similar comments hold for the database literature, as well, where there is a considerable body of work on set-similarity (e.g., [6], [7]) and set-containment joins (e.g., [8], [9]), that join independent relations on the basis of the overlap in their set-valued attributes. However, these techniques are generally targeted at much large...

91 | Efficient query evaluation using a two-level retrieval process
- Broder, Carmel, et al.
- 2005
Citation Context ...advertisements is described in [3]; there the retrieval of advertisements is divided into a semantic and a syntactic matching component. This solution uses a variant of the WAND algorithm [10], which is a document-at-a-time algorithm [11] that uses a branch-and-bound approach to derive efficient moves for cursors associated with posting lists. However, this approach relies on the final scor...

74 | Vector-space ranking with effective early termination
- Anh, Kretser, et al.
- 2001
Citation Context ...ueries. Therefore, pushing any of the secondary criteria used to rank/exclude advertisements into the index (similar to how partial scores/impacts are used for early termination in traditional IR (e.g., [1], [2])) is less likely to result in a noticeable performance improvement for ad retrieval. Moreover, the eventual ranking of advertisements may take into account a number of different factors (e.g., the...

66 | Combining Fuzzy Information: An Overview
- Fagin
Citation Context ...common in information retrieval that rely on the total “score” (and in turn the rank) of a document/advertisement being a monotonic function of the scores of the matching keywords (e.g., [1], [2], [3], [4]) cannot be applied. Finally, the fact that the indexed ad phrases are very short itself implies that even very large corpora of advertisements can be indexed in main memory. Consequently, we will des...

36 | Optimal lower bounds for rank and select indexes. Theoretical Computer Science 387
- Golynski
- 2007
Citation Context ...-array B of length n containing k 1-bits requires space of $nH_0(B) + o(k) + O(\log \log n)$ bits [21] (with $H_0(B)$ denoting the zero-order empirical entropy of B), which is close to the optimal bound (see [22]). The structures can be used to encode a compressed representation of H as follows. First, we use a compressed bit-array $B_{sig}$ of length $2^s$ to describe all wordhash() signatures for which we are sto...

19 | Adaptive algorithms for set containment joins
- Melnik, Garcia-Molina
Citation Context ...aningful for text documents. Similar comments hold for the database literature, as well, where there is a considerable body of work on set-similarity (e.g., [6], [7]) and set-containment joins (e.g., [8], [9]), that join independent relations on the basis of the overlap in their set-valued attributes. However, these techniques are generally targeted at much larger sets of words/values and – in case o...

17 | Broadword implementation of rank/select queries
- Vigna
Citation Context ...B off ) ) is about 9:1. While the data structures required to achieve the space bounds above can be very complex in practice, much simpler structures can yield significant compression as well (e.g., [23]). Maintaining the data structure under insertions/deletions: Keeping the data structure updated to ensure the correct broad-match processing in the presence of inserts/deletions is straightforward, a...

13 | Modern information retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ... retrieval of advertisements is divided into a semantic and a syntactic matching component. This solution uses a variant of the WAND algorithm [10], which is a document-at-a-time algorithm [11] that uses a branch-and-bound approach to derive efficient moves for cursors associated with posting lists. However, this approach relies on the final score of a match being a function that is monoton...

13 | An efficient phrase-to-phrase alignment model for arbitrarily long phrase and large corpora
- Zhang, Vogel
Citation Context ... sentence is translated by looking up each sub-phrase of an input sentence in the index, scoring the resulting rules and using a combination of the top-scoring results as the translation (e.g., [13], [14]). However, translation is different from sponsored search. Bids tend to be shorter than translation rules (using phrases of length up to 7 or more is common in settings where translation quality is e...

12 | A better-than-greedy approximation algorithm for the minimum set cover problem
- Hassin, Levin
- 2005
Citation Context ...′, where k′ is the maximum number of distinct words($A_i$) combinations in a data node. Finally, it can be shown that through the use of “withdrawal steps” this approximation factor can be reduced further [20]. VI. EXTENSIONS Compression: The proposed structure is very amenable to compression; we differentiate between the compression of the data nodes and the compression of the hash-lookup table. Data-Node...

11 | Index compression is good, especially for random access
- Büttcher, Clarke
Citation Context ...ial access of m bytes, once the random access to the start of the sequence has been performed. The precise nature of the cost function depends on the processor architecture and memory chips used (see [17] for examples) and – given that e.g., the latency induced by TLB misses can vary significantly – will only be an approximation of the real behavior. Still, as we will demonstrate in Section VII, using...

2 | Efficient Document Retrieval
- Strohman, Croft
Citation Context ...s. Therefore, pushing any of the secondary criteria used to rank/exclude advertisements into the index (similar to how partial scores/impacts are used for early termination in traditional IR (e.g., [1], [2])) is less likely to result in a noticeable performance improvement for ad retrieval. Moreover, the eventual ranking of advertisements may take into account a number of different factors (e.g., the obse...

1 | Scaling Phrase-based Statistical Machine-Translation to larger Corpora and longer
- Callison-Burch, Bannard, et al.
Citation Context ...ges. A sentence is translated by looking up each sub-phrase of an input sentence in the index, scoring the resulting rules and using a combination of the top-scoring results as the translation (e.g., [13], [14]). However, translation is different from sponsored search. Bids tend to be shorter than translation rules (using phrases of length up to 7 or more is common in settings where translation qualit...

1 | Greedy Algorithms for On-line Set-Covering and related
- Ausiello, Giannakos, et al.
Citation Context ... challenging to keep the mapping itself (close to) optimal in the presence of insertions/deletions, as online versions of the set-cover problem have much weaker guarantees on the approximation bounds [24]. Therefore, instead of re-computing the optimization for each insertion/deletion, this re-computation is only performed periodically (potentially on a separate machine), while – at the time of an ins...