Holding Intruders Accountable on the Internet
, 1994
Cited by 95 (1 self)
This paper addresses the problem of tracing intruders who obscure their identity by logging through a chain of multiple machines. After discussing previous approaches to this problem, we introduce thumbprints which are short summaries of the content of a connection. These can be compared to determine whether two connections contain the same text and are therefore likely to be part of the same connection chain. We enumerate the properties a thumbprint needs to have to work in practice, and then define a class of local thumbprints which have the desired properties. A methodology from multivariate statistics called principal component analysis is used to infer the best choice of thumbprinting parameters from data. Currently our thumbprints require 24 bytes per minute per connection. We develop an algorithm to compare these thumbprints which allows for the possibility that data may leak from one time-interval to the next. We present experimental data showing that our scheme works on a loc...
A survey of information retrieval and filtering methods
, 1995
Cited by 82 (0 self)
We survey the major techniques for information retrieval. In the first part, we provide an overview of the traditional ones (full text scanning, inversion, signature files and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks).
Declustering Using Fractals
- In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems
, 1993
Cited by 80 (2 self)
We propose a method to achieve declustering for Cartesian product files on M units. The focus is on range queries, as opposed to partial match queries that older declustering methods have examined. Our method uses a distance-preserving mapping, namely, the Hilbert curve, to impose a linear ordering on the multidimensional points (buckets); then, it traverses the buckets according to this ordering, assigning buckets to disks in a round-robin fashion. Thanks to the good distance-preserving properties of the Hilbert curve, the end result is that each disk contains buckets that are far away in the linear ordering, and, most probably, far away in the k-d address space. This is exactly the goal of declustering. Experiments show that these intuitive arguments lead indeed to good performance: the proposed method performs at least as well as, or better than, older declustering schemes. Categories and Subject Descriptors: E.1 [Data Structures]; E.5 [Files]; H.2.2 [Data Base Management]: Physical Des...
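The declustering idea in this abstract (order the buckets along a Hilbert curve, then deal them out to disks round-robin) can be illustrated with a minimal sketch. It assumes a two-dimensional grid whose side is a power of two, whereas the abstract speaks of a general k-d address space; the function names `hilbert_index` and `decluster` are hypothetical and are not taken from the paper.

```python
def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve.

    Assumes a 2-D grid of side 2**order (an illustrative simplification).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def decluster(order, num_disks):
    """Assign every bucket of the grid to a disk, round-robin in Hilbert order."""
    n = 1 << order
    buckets = [(x, y) for x in range(n) for y in range(n)]
    buckets.sort(key=lambda b: hilbert_index(order, b[0], b[1]))
    return {bucket: i % num_disks for i, bucket in enumerate(buckets)}


if __name__ == "__main__":
    assignment = decluster(order=2, num_disks=4)
    for bucket in [(0, 0), (1, 0), (1, 1), (0, 1)]:
        print(bucket, "-> disk", assignment[bucket])
```

On the 4 x 4 grid with four disks, the 2 x 2 block in the corner is spread over four different disks, so a range query covering that block could read all four buckets in parallel.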
Performance of inverted indices in shared-nothing distributed text document information retrieval systems
- In Proceedings of the Second International Conference on Parallel and Distributed Information Systems
, 1993
Cited by 54 (6 self)
The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine which variables most strongly influence response time and throughput. This leads to a set of design trade-offs over a range of hardware configurations and new parallel query processing strategies.
Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval
, 1992
Cited by 37 (0 self)
We have previously described an extension of the vector retrieval method called "Latent Semantic Indexing" (LSI) (Deerwester, et al., 1990; Dumais, et al., 1988; Furnas, et al., 1988). The LSI approach partially overcomes the problem of variability in human word choice by automatically organizing objects into a "semantic" structure more appropriate for information retrieval. This is done by modeling the implicit higher-order structure in the association of terms with objects. Initial tests find this completely automatic method to be a promising way to improve users' access to many kinds of textual materials or to objects for which textual descriptions are available. This paper describes some enhancements to the basic LSI method, including differential term weighting and relevance feedback. Appropriate term weighting improves performance by an average of 40%, and feedback based on 3 relevant documents improves performance by an average of 67%.
Evaluating the Performance of Distributed Architectures for Information Retrieval using a Variety of Workloads
- ACM Transactions on Information Systems
, 1997
Cited by 33 (7 self)
Information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this paper, we describe a fully functional distributed IR system based on the Inquery unified IR system. To refine this prototype, we implement a flexible simulation model which we use to present a series of experiments using a variety of workloads that measure system performance. We vary numerous system parameters such as the number of users, document collections, terms per query, query term frequency, think time, answers returned, and workload. Based on our initial results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate. Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems-- distributed applications; C.4 [Performance of Systems]: Performance Attributes; H.3.4 [Information Storage and Retrieval]: Systems and Software; General Terms: Experimentation, Performance Additional Key Words and Phrases: Distributed information retrieval architectures This material is based on work supported by ...
Performance of inverted indices in distributed text document retrieval systems
- In Proc. of the 2nd Int. Conf. on Parallel and Distributed Information Systems (PDIS)
, 1993
Cited by 25 (4 self)
The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine which variables most strongly influence response time and throughput. This leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.
Implementations of Partial Document Ranking Using Inverted Files
- Information Processing and Management
, 1993
Cited by 21 (0 self)
Most commercial text retrieval systems employ inverted files to improve retrieval speed. This paper is concerned with implementations of document ranking based on inverted files. Three heuristic methods for implementing the tf × idf weighting strategy, where tf stands for term frequency and idf stands for inverse document frequency, are studied. The basic idea of the heuristic methods is to process the query terms in an order so that as many top documents as possible can be identified without processing all of the query terms. The first heuristic was proposed by Smeaton and van Rijsbergen (Smeaton & Rijsbergen, 1981), and it serves as the basis for comparison with the other two heuristic methods proposed in this paper. These three heuristics are evaluated and compared by experimental runs based on the number of disk accesses required for partial document ranking, in which the returned documents contain some, but not necessarily all, of the requested number of top documents. The re...
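The general idea described in this abstract, scoring documents from an inverted file and stopping before every query term has been processed, can be sketched as follows. The abstract does not spell out the three heuristics, so this is not any of them: it is a simple illustrative variant that processes terms in decreasing idf order and stops once the unprocessed terms can no longer change which documents make up the top k. The names `build_index` and `rank_top_k`, and the use of a natural-log idf, are assumptions made for the example.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build an inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def rank_top_k(index, n_docs, query, k):
    """Return (top-k doc ids, number of query terms actually processed)."""
    # Deduplicate query terms, keeping only those present in the index.
    terms = [t for t in dict.fromkeys(query.lower().split()) if t in index]
    idf = {t: math.log(n_docs / len(index[t])) for t in terms}
    terms.sort(key=lambda t: idf[t], reverse=True)  # rarest terms first
    # Largest score any single document can gain from each term.
    bound = {t: idf[t] * max(index[t].values()) for t in terms}

    scores = defaultdict(float)
    processed = 0
    for i, term in enumerate(terms):
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf * idf[term]
        processed += 1
        remaining = sum(bound[t] for t in terms[i + 1:])
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) >= k:
            runner_up = ranked[k] if len(ranked) > k else 0.0
            # Stop when no other document can overtake the current top k.
            if ranked[k - 1] - runner_up > remaining:
                break
    top = sorted(scores, key=lambda d: (-scores[d], d))[:k]
    return top, processed


if __name__ == "__main__":
    docs = {1: "apple banana apple", 2: "banana cherry",
            3: "cherry cherry date", 4: "banana"}
    index = build_index(docs)
    print(rank_top_k(index, len(docs), "apple banana", k=1))
```

In the small example, the rare term "apple" alone already separates document 1 from the rest, so the postings list for "banana" is never read. The stopping rule only guarantees the membership of the top k, not their final order.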
On the allocation of documents in multiprocessor information retrieval systems
- In Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1991
Cited by 16 (0 self)
Information retrieval is the selection of documents that are potentially relevant to a user’s information need. Given the vast volume of data stored in modern information retrieval systems, searching the document database requires vast computational resources. To meet these computational demands, various researchers have developed parallel information retrieval systems. As efficient exploitation of parallelism demands fast access to the documents, data organization and placement significantly affect the total processing time. We describe and evaluate a data placement strategy for distributed memory, distributed I/O multicomputers. Initially, a formal description of the Multiprocessor Document Allocation Problem (MDAP) and a proof that MDAP is NP-complete are presented. A document allocation
Integrating Feature Extraction and Memory Search
- Machine Learning
, 1993
Cited by 13 (1 self)
Reasoning from prior cases or abstractions requires that a system identify relevant similarities between the current situation and objects represented in memory. Often, relevance depends upon abstract, thematic, costly-to-infer properties of the situation. Because of the cost of inference, a case retrieval system needs to learn which descriptions are worth inferring, and how costly that inference will be. This paper outlines the properties that make an abstract thematic feature valuable to a case-based reasoner, and recasts the problem of case retrieval into a framework under which a system can explicitly and dynamically reason about the cost of acquiring features relative to their information value. 1 Retrieval, description, and learning For a case-based reasoner to make effective use of recalled prior experiences, it must be able to judge which of its cases are applicable to the current situation. This problem is not new nor is it unique to case-based reasoning: any system that re...

