Results 1–10 of 96
Random sampling from a search engine’s index
In Proceedings of the 15th International World Wide Web Conference (WWW), 2006
"... We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from the corpus of documents indexed by a search engine, using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. ..."
Abstract

Cited by 63 (6 self)
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from the corpus of documents indexed by a search engine, using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this paper we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis-Hastings algorithm, and the Maximum Degree method. The limited access to search engines forces our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine’s corpus. Our study of approximate Monte Carlo methods could be of independent interest. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
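The debiasing idea in this abstract can be illustrated with a toy importance-sampling sketch (a minimal illustration of the general technique, not the paper's algorithms; the corpus, sampler, and all names below are hypothetical): samples drawn with a length-proportional bias are reweighted by 1/length to recover a near-unbiased estimate.

```python
import random

# Toy corpus: document id -> length. Sampling is biased toward long
# documents, mimicking the length bias described in the abstract.
docs = {"a": 1, "b": 2, "c": 7}
total = sum(docs.values())

def biased_sample():
    """Draw a document with probability proportional to its length."""
    r = random.uniform(0, total)
    for d, length in docs.items():
        r -= length
        if r <= 0:
            return d
    return d  # guard against floating-point edge cases

def estimate_fraction(pred, n=100_000):
    """Importance-sampling estimate of the fraction of docs satisfying pred.

    Each biased sample carries weight 1/length, which cancels the
    length-proportional sampling bias.
    """
    num = den = 0.0
    for _ in range(n):
        d = biased_sample()
        w = 1.0 / docs[d]  # bias weight attached to the sample
        num += w * pred(d)
        den += w
    return num / den

random.seed(0)
# True fraction of docs with length >= 2 is 2/3; the naive biased
# sampler would report about 0.9 instead.
print(round(estimate_fraction(lambda d: docs[d] >= 2), 2))
```

The paper's point is that with only approximate bias weights the same scheme still yields near-uniform samples; the sketch above uses exact weights for clarity.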
Sampling Algorithms: Lower Bounds and Applications (Extended Abstract)
, 2001
"... ] Ziv BarYossef y Computer Science Division U. C. Berkeley Berkeley, CA 94720 zivi@cs.berkeley.edu Ravi Kumar IBM Almaden 650 Harry Road San Jose, CA 95120 ravi@almaden.ibm.com D. Sivakumar IBM Almaden 650 Harry Road San Jose, CA 95120 siva@almaden.ibm.com ABSTRACT We develop a fr ..."
Abstract

Cited by 52 (2 self)
Ziv Bar-Yossef, Computer Science Division, U.C. Berkeley, Berkeley, CA 94720, zivi@cs.berkeley.edu; Ravi Kumar, IBM Almaden, 650 Harry Road, San Jose, CA 95120, ravi@almaden.ibm.com; D. Sivakumar, IBM Almaden, 650 Harry Road, San Jose, CA 95120, siva@almaden.ibm.com. We develop a framework to study probabilistic sampling algorithms that approximate general functions of the form f : A^n → B, where A and B are arbitrary sets. Our goal is to obtain lower bounds on the query complexity of functions, namely the number of input variables x_i that any sampling algorithm needs to query to approximate f(x_1, ..., x_n). We define two quantitative properties of functions, the block sensitivity and the minimum Hellinger distance, that give us techniques to prove lower bounds on the query complexity. These techniques are quite general, easy to use, yet powerful enough to yield tight results. Our applications include the mean and higher statistical moments, the median and other selection functions, and the frequency moments, where we obtain lower bounds that are close to the corresponding upper bounds. We also point out some connections between sampling and streaming algorithms and lossy compression schemes.
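For intuition about query complexity, here is a minimal sketch of the upper-bound side for the mean (hypothetical names, not the paper's framework): a sampling algorithm that queries only a random subset of the input variables.

```python
import random

def sample_mean(x, queries):
    """Estimate the mean of x by querying only `queries` random positions.

    For values in [0, 1], standard Chernoff bounds show that
    O(log(1/delta) / eps^2) queries give an additive-eps estimate with
    probability 1 - delta; lower bounds of the kind proved in the paper
    show such query counts are essentially necessary.
    """
    idx = (random.randrange(len(x)) for _ in range(queries))
    return sum(x[i] for i in idx) / queries

random.seed(0)
x = [i % 2 for i in range(10_000)]  # true mean 0.5
print(round(sample_mean(x, 20_000), 2))
```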
Multihypothesis sequential probability ratio tests - Part II: Accurate asymptotic . . .
IEEE Trans. Inform. Theory, 2000
"... In a companion paper [13], we proved that two specific constructions of multihypothesis sequential tests, which we refer to as Multihypothesis Sequential Probability Ratio Tests (MSPRT’s), are asymptotically optimal as the decision risks (or error probabilities) go to zero. The MSPRT’s asymptotical ..."
Abstract

Cited by 42 (14 self)
In a companion paper [13], we proved that two specific constructions of multihypothesis sequential tests, which we refer to as Multihypothesis Sequential Probability Ratio Tests (MSPRT’s), are asymptotically optimal as the decision risks (or error probabilities) go to zero. The MSPRT’s asymptotically minimize not only the expected sample size but also any positive moment of the stopping time distribution, under very general statistical models for the observations. In this paper, based on nonlinear renewal theory we find accurate asymptotic approximations (up to a vanishing term) for the expected sample size that take into account the “overshoot” over the boundaries of decision statistics. The approximations are derived for the scenario where the hypotheses are simple, the observations are independent and identically distributed (i.i.d.) according to one of the underlying distributions, and the decision risks go to zero. Simulation results for practical examples show that these approximations are fairly accurate not only for large but also for moderate sample sizes. The asymptotic results given here complete the analysis initiated in [4], where first-order asymptotics were obtained for the expected sample size under a specific restriction on the Kullback–Leibler distances between the hypotheses.
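For readers new to sequential tests, the two-hypothesis special case that the MSPRT generalizes (Wald's SPRT) can be sketched as follows; the Bernoulli setting and all parameter values are illustrative only.

```python
import math
import random

def sprt_bernoulli(samples, p0=0.3, p1=0.7, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 on a Bernoulli stream.

    Returns (decision, n): decision is 1 (accept H1), 0 (accept H0), or
    None if the stream ends before a stopping boundary is crossed.
    """
    lower = math.log(beta / (1 - alpha))  # accept-H0 boundary
    upper = math.log((1 - beta) / alpha)  # accept-H1 boundary
    llr = 0.0                             # cumulative log-likelihood ratio
    n = 0
    for n, x in enumerate(samples, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return 1, n
        if llr <= lower:
            return 0, n
    return None, n

random.seed(7)
data = [random.random() < 0.9 for _ in range(2_000)]  # stream strongly favors H1
decision, n = sprt_bernoulli(data)
print(decision, n)  # typically decides for H1 after only a few samples
```

The expected-sample-size approximations in the paper above refine exactly this kind of stopping-time analysis by accounting for the overshoot of `llr` over the boundaries.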
A novel approach to detection of “denial-of-service” attacks via adaptive sequential and batch-sequential change-point detection methods
, 2001
"... ..."
Connecting Discrete and Continuous Path-Dependent Options
, 1999
"... . This paper develops methods for relating the prices of discrete and continuoustime versions of pathdependent options sensitive to extremal values of the underlying asset, including lookback, barrier, and hindsight options. The relationships take the form of correction terms that can be interpre ..."
Abstract

Cited by 27 (3 self)
This paper develops methods for relating the prices of discrete- and continuous-time versions of path-dependent options sensitive to extremal values of the underlying asset, including lookback, barrier, and hindsight options. The relationships take the form of correction terms that can be interpreted as shifting a barrier, a strike, or an extremal price. These correction terms enable us to use closed-form solutions for continuous option prices to approximate their discrete counterparts. We also develop discrete-time, discrete-state lattice methods for determining accurate prices of discrete and continuous path-dependent options. In several cases, the lattice methods use correction terms based on the connection between discrete- and continuous-time prices, which dramatically improve convergence to the accurate price. Key words: barrier options, lookback options, continuity corrections, trinomial trees. JEL classification: G13, C63, G12. Mathematics Subject Classification (1991): 90A09, 60J15, 65N06.
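The barrier-shifting correction described above can be sketched numerically. The commonly quoted correction constant is β₁ = −ζ(1/2)/√(2π) ≈ 0.5826; the function name and example figures below are illustrative, not taken from the paper.

```python
import math

# Commonly quoted continuity-correction constant, -zeta(1/2)/sqrt(2*pi).
BETA1 = 0.5826

def shifted_barrier(H, sigma, T, m, up=True):
    """Barrier level at which a *continuous* pricing formula approximates
    a discretely monitored barrier option (m monitoring dates over [0, T]).

    An up barrier is shifted outward by exp(+BETA1 * sigma * sqrt(T/m));
    a down barrier is shifted by the reciprocal factor.
    """
    shift = math.exp(BETA1 * sigma * math.sqrt(T / m))
    return H * shift if up else H / shift

# Example: upper barrier 110, 20% volatility, 1 year, daily monitoring.
print(round(shifted_barrier(110.0, 0.20, 1.0, 252), 4))
```

Plugging the shifted level into a closed-form continuous-barrier formula then approximates the discretely monitored price, which is exactly the use the abstract describes.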
RaWMS - Random Walk based Lightweight Membership Service for Wireless Ad Hoc Networks
, 2008
"... This paper presents RaWMS, a novel lightweight random membership service for ad hoc networks. The service provides each node with a partial uniformly chosen view of network nodes. Such a membership service is useful, e.g., in data dissemination algorithms, lookup and discovery services, peer samplin ..."
Abstract

Cited by 27 (7 self)
This paper presents RaWMS, a novel lightweight random membership service for ad hoc networks. The service provides each node with a partial uniformly chosen view of network nodes. Such a membership service is useful, e.g., in data dissemination algorithms, lookup and discovery services, peer sampling services, and complete membership construction. The design of RaWMS is based on a novel reverse random walk (RW) sampling technique. The paper includes a formal analysis of both the reverse RW sampling technique and RaWMS and verifies it through a detailed simulation study. In addition, RaWMS is compared both analytically and by simulations with a number of other known methods such as flooding and gossip-based techniques.
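The degree bias of a plain random walk, and the kind of correction that makes RW-based sampling near-uniform, can be sketched with a toy Metropolis-Hastings walk (an illustration of the general technique only, not the RaWMS protocol; the graph and names are made up):

```python
import random
from collections import Counter

def mh_uniform_walk(adj, start, steps):
    """Degree-corrected (Metropolis-Hastings) random walk on a graph.

    A plain random walk visits nodes proportionally to degree; accepting
    a proposed move u -> v with probability min(1, deg(u)/deg(v)) makes
    the stationary distribution uniform over the nodes.
    """
    u = start
    visits = Counter()
    for _ in range(steps):
        v = random.choice(adj[u])  # propose a uniformly random neighbor
        if random.random() < min(1.0, len(adj[u]) / len(adj[v])):
            u = v                  # accept; otherwise stay put
        visits[u] += 1
    return visits

# Small test graph: node 0 is a hub, so a plain walk would oversample it.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
random.seed(0)
visits = mh_uniform_walk(adj, 0, 60_000)
print({n: round(visits[n] / 60_000, 2) for n in sorted(adj)})
# Each node's visit frequency should be close to 1/6 ≈ 0.17.
```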
Conservative Statistical Post-Election Audits
The Annals of Applied Statistics, 2008
"... There are many sources of error in counting votes on election day: the apparent winner might not be the rightful winner. Hand tallies of the votes in a random sample of precincts can be used to test the hypothesis that a full manual recount would find a different outcome. This paper develops a conse ..."
Abstract

Cited by 24 (13 self)
There are many sources of error in counting votes on election day: the apparent winner might not be the rightful winner. Hand tallies of the votes in a random sample of precincts can be used to test the hypothesis that a full manual recount would find a different outcome. This paper develops a conservative sequential test based on the vote-counting errors found in a hand tally of a simple or stratified random sample of precincts. The procedure includes a natural escalation: If the hypothesis that the apparent outcome is incorrect is not rejected at stage s, more precincts are audited. Eventually, either the hypothesis is rejected—and the apparent outcome is confirmed—or all precincts have been audited and the true outcome is known. The test uses a priori bounds on the overstatement of the margin that could result from error in each precinct. Such bounds can be derived from the reported counts in each precinct and upper bounds on the number of votes cast in each precinct. The test allows errors in different precincts to be treated differently to reflect voting technology or precinct sizes. It is not optimal, but it is conservative: the chance of erroneously confirming the outcome of a contest if a full manual recount would show a different outcome is no larger than the nominal significance level. The approach also gives a conservative P-value for the hypothesis that a full manual recount would find a different outcome, given the errors found in a fixed-size sample. This is illustrated with two contests from November 2006: the U.S. Senate race in Minnesota and a school board race for the Sausalito Marin City School District in California, a small contest in which voters could vote for up to three candidates.
Heavy Traffic Limits for Queues with Many Deterministic Servers
"... Consider a sequence of stationary GI/D/N queues indexed by N with servers' utilization 1 #/ # N , # > 0. For such queues we show that the scaled waiting times NWN converge to the (finite) supremum of a Gaussian random walk with drift #. ..."
Abstract

Cited by 24 (3 self)
Consider a sequence of stationary GI/D/N queues indexed by N with servers' utilization 1 − β/√N, β > 0. For such queues we show that the scaled waiting times √N·W_N converge to the (finite) supremum of a Gaussian random walk with drift −β.
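The limit object, the supremum of a Gaussian random walk with negative drift, is easy to explore by simulation (a toy Monte Carlo sketch with illustrative parameters; truncating at a finite horizon is an approximation):

```python
import random

def gaussian_rw_sup_mean(beta, steps, trials, seed=0):
    """Monte Carlo estimate of E[sup_n S_n] for S_n a sum of N(-beta, 1) steps.

    For beta > 0 the supremum is almost surely finite; truncation at
    `steps` steps is an approximation (the tail contribution is tiny for
    the parameters used here).
    """
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        s = m = 0.0
        for _ in range(steps):
            s += rng.gauss(-beta, 1.0)
            m = max(m, s)
        acc += m
    return acc / trials

# With drift -1 and unit variance, the Brownian analogue has
# E[sup] = 1/2; the discrete walk's expected supremum sits below that.
print(round(gaussian_rw_sup_mean(1.0, 200, 2_000), 3))
```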
Application of change detection to dynamic contact sensing
The International Journal of Robotics Research, 1994
"... The forces of contact during manipulation convey substantial information about the state of the manipulation. ..."
Abstract

Cited by 21 (1 self)
The forces of contact during manipulation convey substantial information about the state of the manipulation.
Asymptotic performance of a multichart CUSUM test under false alarm probability constraint
Proc. 44th IEEE Conf. on Decision and Control and the European Control Conf. (CDC-ECC’05)
"... Abstract — Traditionally the false alarm rate in change point detection problems is measured by the mean time to false detection (or between false alarms). The large values of the mean time to false alarm, however, do not generally guarantee small values of the false alarm probability in a fixed tim ..."
Abstract

Cited by 21 (8 self)
Traditionally, the false alarm rate in change-point detection problems is measured by the mean time to false detection (or between false alarms). Large values of the mean time to false alarm, however, do not generally guarantee small values of the false alarm probability in a fixed time interval for any possible location of this interval. In this paper we consider a multichannel (multipopulation) change-point detection problem under a nontraditional false alarm probability constraint, which is desirable for a variety of applications. It is shown that in the multichart CUSUM test this constraint is easy to control. Furthermore, the proposed multichart CUSUM test is shown to be uniformly asymptotically optimal when the false alarm probability is small: it minimizes the average detection delay, or more generally, any positive moment of the stopping time distribution, for any point of change. Index Terms: change-point detection, sequential detection, multichart CUSUM test, asymptotic optimality, renewal theory, false alarm probability.
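A minimal sketch of a multichart CUSUM alarm rule (toy Gaussian mean-shift setting; the threshold and all parameters are illustrative, not the paper's calibrated constraint):

```python
import random

def cusum_path(xs, mu0=0.0, mu1=1.0, sigma=1.0):
    """CUSUM statistic path for a mean shift mu0 -> mu1 in Gaussian noise."""
    w, path = 0.0, []
    for x in xs:
        llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma ** 2
        w = max(0.0, w + llr)  # reflect the statistic at zero
        path.append(w)
    return path

def multichart_cusum(channels, threshold):
    """Alarm at the first time ANY channel's CUSUM crosses the threshold.

    The multichart stopping time is the minimum of the per-channel
    CUSUM stopping times.
    """
    paths = [cusum_path(xs) for xs in channels]
    for t in range(len(channels[0])):
        if any(p[t] >= threshold for p in paths):
            return t
    return None

rng = random.Random(2)
channels = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(3)]
# Channel 1 undergoes a mean shift of +2 at time 50.
channels[1] = channels[1][:50] + [x + 2.0 for x in channels[1][50:]]
t = multichart_cusum(channels, threshold=10.0)
print(t)  # an alarm is expected shortly after the change point at t = 50
```

The paper's contribution is choosing the threshold to control the false alarm *probability* over any fixed window (rather than the mean time to false alarm) and proving the resulting test asymptotically optimal.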