Results 1 - 10 of 24
What's New on the Web? The Evolution of the Web from a Search Engine Perspective
, 2004
"... We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of co ..."
Abstract
-
Cited by 220 (15 self)
- Add to MetaCart
(Show Context)
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.
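To make the "degree of change" idea concrete, here is a minimal sketch of one search-centric way to compare two snapshots of a page, using word-shingle overlap. The shingle width and the Jaccard-style measure are illustrative assumptions, not the paper's exact metric.

```python
# Illustrative sketch: quantify content change between two page snapshots
# via word-shingle overlap (Jaccard). The shingle width k is an assumption,
# not the paper's exact measure of degree of change.

def shingles(text, k=4):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def degree_of_change(old_text, new_text, k=4):
    """1.0 = completely new content, 0.0 = unchanged (by shingle overlap)."""
    old, new = shingles(old_text, k), shingles(new_text, k)
    if not old and not new:
        return 0.0
    return 1.0 - len(old & new) / len(old | new)

last_week = "search engines must cope with the evolving web to stay fresh"
this_week = "search engines must recrawl the evolving web to stay fresh"
print(degree_of_change(last_week, this_week, k=3))
```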
Improving Web Search Efficiency via a Locality Based Static Pruning Method
- In Proceedings of the 14th International Conference on World Wide Web
, 2005
"... The unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines. This is true regarding not only search e#ectiveness but also time and space e#- ciency. In this paper we present an index pruning technique targeted f ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
The unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines. This is true regarding not only search effectiveness but also time and space efficiency. In this paper we present an index pruning technique targeted for search engines that addresses the latter issue without disregarding the former. To this effect, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. Experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. When compared to the indices' size with no compression at all, the compression rate is higher than 88%, i.e., less than one eighth of the original size. More importantly, our results indicate that, due to the reduction in storage overhead, query processing time can be reduced to nearly 65% of the original time, with no loss in average precision. The new method yields significant improvements when compared against the best known static pruning method for search engine indices. In addition, since our technique is orthogonal to the underlying search algorithms, it can be adopted by virtually any search engine.
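For intuition, here is a toy sketch of static index pruning in general: for each term, keep only the highest-impact postings and drop the rest. The per-posting score and keep fraction are placeholder assumptions; the paper's locality-based criterion is more elaborate.

```python
# Toy static pruning: for each term, keep only the top-scoring postings
# and drop the rest from the index. The scores and the fraction kept are
# illustrative stand-ins for the paper's locality-based criterion.

def prune_index(index, keep_fraction=0.4):
    """index: term -> list of (doc_id, score) postings."""
    pruned = {}
    for term, postings in index.items():
        keep = max(1, int(len(postings) * keep_fraction))
        best = sorted(postings, key=lambda p: p[1], reverse=True)[:keep]
        pruned[term] = sorted(best)          # restore doc-id order
    return pruned

index = {
    "web":    [(1, 3.2), (2, 0.4), (3, 1.9), (4, 0.1), (5, 2.7)],
    "search": [(1, 1.1), (3, 0.2), (5, 4.0)],
}
print(prune_index(index))
```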
Efficient information extraction over evolving text data
- in ICDE
, 2008
"... Most current information extraction (IE) approaches have considered only static text corpora, over which we typically have to apply IE only once. Many real-world text corpora however are dynamic. They evolve over time, and to keep extracted information up to date, we often must apply IE repeatedly, ..."
Abstract
-
Cited by 24 (6 self)
- Add to MetaCart
(Show Context)
Most current information extraction (IE) approaches have considered only static text corpora, over which we typically have to apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, and to keep extracted information up to date, we often must apply IE repeatedly, to consecutive corpus snapshots. We describe Cyclex, an approach that efficiently executes such repeated IE by recycling previous IE efforts. Specifically, given a current corpus snapshot U, Cyclex identifies text portions of U that also appear in the previous corpus snapshot V. Since Cyclex has already executed IE over V, it can recycle the IE results for these parts, combining them with the results of executing IE over the remaining parts of U to produce the complete IE results for U. Realizing Cyclex raises many challenges, including modeling information extractors, exploring the trade-off between runtime and completeness in identifying overlapping text, and making informed, cost-based decisions between redoing IE from scratch and recycling previous IE results. We describe initial solutions to these challenges, and experiments over two real-world data sets that demonstrate the utility of our approach.
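A minimal sketch of the recycling idea, under simplifying assumptions: regions of the new snapshot U that also occur in the previous snapshot V reuse V's cached extraction results, and the extractor runs only on the remaining text. Matching whole lines and the toy name extractor are illustrative stand-ins for Cyclex's actual overlap detection and extractor model.

```python
# Sketch of Cyclex-style recycling: reuse cached IE results for text that
# is unchanged between snapshots, run the extractor only on new text.
# Matching whole lines is a simplification of the paper's overlap detection.

import re

def extract_mentions(text):
    """Stand-in extractor: pull capitalized two-word names (illustrative)."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

def recycled_extract(u_snapshot, v_snapshot, v_cache):
    v_lines = set(v_snapshot.splitlines())
    results = []
    for line in u_snapshot.splitlines():
        if line in v_lines and line in v_cache:
            results.extend(v_cache[line])            # recycle previous IE result
        else:
            results.extend(extract_mentions(line))   # fresh extraction
    return results

v = "Alan Turing founded the field.\nThe lab moved to Boston."
cache = {line: extract_mentions(line) for line in v.splitlines()}
u = "Alan Turing founded the field.\nGrace Hopper joined later."
print(recycled_extract(u, v, cache))
```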
Efficient Search in Large Textual Collections with Redundancy
, 2007
"... Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though th ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages. In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on a search engine query log and a large collection consisting of multiple crawls.
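A toy sketch of the redundancy idea, under assumptions: split each version into fragments, index each distinct fragment once, and let versions reference fragments. The fixed-size word blocks used here stand in for whatever fragmenting scheme the framework actually uses.

```python
# Toy redundancy-aware index: versions share fragments, and each distinct
# fragment is indexed only once. Fixed-size word blocks stand in for the
# framework's actual fragmenting scheme.

from collections import defaultdict

def fragment(text, size=5):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class SharedFragmentIndex:
    def __init__(self):
        self.frag_ids = {}                    # fragment text -> id
        self.postings = defaultdict(set)      # term -> fragment ids
        self.doc_frags = {}                   # version -> fragment ids

    def add_version(self, name, text):
        ids = []
        for frag in fragment(text):
            if frag not in self.frag_ids:     # index distinct fragments once
                fid = self.frag_ids[frag] = len(self.frag_ids)
                for term in frag.split():
                    self.postings[term].add(fid)
            ids.append(self.frag_ids[frag])
        self.doc_frags[name] = ids

idx = SharedFragmentIndex()
idx.add_version("v1", "the cat sat on the mat and purred all day long")
idx.add_version("v2", "the cat sat on the mat and slept all day long")
print(len(idx.frag_ids))   # 4: the fragments shared by v1 and v2 are stored once
```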
A Node Indexing Scheme for Web Entity Retrieval
"... Abstract. Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrie ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
(Show Context)
Motivated in part by the support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an "entity retrieval system" designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such a system can effectively answer queries over 10 billion triples on a single commodity machine.
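To illustrate the flavor of a node indexing scheme: each posting records the entity and attribute a term occurs in, so full-text and semi-structural queries can be answered from one inverted index. The posting layout below is a guess for illustration, not the system's actual encoding.

```python
# Simplified node-indexing flavor: postings carry (entity, attribute, position),
# so both keyword and attribute-scoped queries run over one inverted index.
# The posting layout is illustrative, not the paper's actual encoding.

from collections import defaultdict

class NodeIndex:
    def __init__(self):
        self.postings = defaultdict(list)     # term -> [(entity, attr, pos)]

    def add_entity(self, entity, attributes):
        for attr, value in attributes.items():
            for pos, term in enumerate(value.lower().split()):
                self.postings[term].append((entity, attr, pos))

    def search(self, term, attr=None):
        """Full-text if attr is None, semi-structural otherwise."""
        hits = self.postings.get(term.lower(), [])
        return {e for e, a, _ in hits if attr is None or a == attr}

idx = NodeIndex()
idx.add_entity("ex:alice", {"name": "Alice Liddell", "homepage": "alice example org"})
idx.add_entity("ex:bob", {"name": "Bob Alice"})
print(idx.search("alice"))                 # keyword search across attributes
print(idx.search("liddell", attr="name"))  # attribute-scoped query
```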
Efficient inverted lists and query algorithms for Structured Value Ranking in update-intensive relational databases
- In ICDE ’05: Proceedings of the 21st International Conference on Data Engineering (ICDE’05
, 2005
"... guolin,jai¡ We propose a new ranking paradigm for relational databases called Structured Value Ranking (SVR). SVR uses structured data values to score (rank) the results of keyword search queries over text columns. Our main contribution is a new family of inverted list indices and associated query a ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
(Show Context)
We propose a new ranking paradigm for relational databases called Structured Value Ranking (SVR). SVR uses structured data values to score (rank) the results of keyword search queries over text columns. Our main contribution is a new family of inverted list indices and associated query algorithms that can support SVR efficiently in update-intensive databases, where the structured data values (and hence the scores of documents) change frequently. Our experimental results on real and synthetic data sets using BerkeleyDB show that we can support SVR efficiently in relational databases.
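A toy sketch of why decoupling scores from postings helps in update-intensive settings: if postings store only document ids and the ranking scores live in a separate structured-value table, a score change touches one table entry instead of rewriting inverted lists. This is an illustrative simplification, not the paper's actual index family.

```python
# Toy SVR-flavored index: postings store only doc ids; the ranking score
# comes from a separate structured-value table, so a score update touches
# one table entry instead of every inverted list. Illustrative only.

from collections import defaultdict

class SVRIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> doc ids (stable)
        self.scores = {}                   # doc id -> structured value (volatile)

    def add_doc(self, doc_id, text, score):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.scores[doc_id] = score

    def update_score(self, doc_id, score):
        self.scores[doc_id] = score        # no posting list is rewritten

    def search(self, term):
        hits = self.postings.get(term.lower(), set())
        return sorted(hits, key=lambda d: self.scores[d], reverse=True)

idx = SVRIndex()
idx.add_doc("p1", "cheap laptop for sale", score=120)
idx.add_doc("p2", "gaming laptop", score=80)
idx.update_score("p2", 500)                # structured value changed
print(idx.search("laptop"))                # ['p2', 'p1']
```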
Multi-User File System Search
, 2007
"... Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Information retrieval research usually deals with globally visible, static document collections. In contrast, practical applications such as file system search and enterprise search have to cope with highly dynamic text collections and have to take into account user-specific access permissions when generating the results to a search query. The goal of this thesis is to close the gap between information retrieval research and the requirements imposed by these real-life applications. The algorithms and data structures presented in this thesis can be used to implement a file system search engine that is able to react to changes in the file system by updating its index data in real time. File changes (insertions, deletions, or modifications) are reflected by the search results within a few seconds.
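A minimal sketch of the two requirements named above, using placeholder data structures: apply file changes to the index immediately, and filter results by per-user permissions at query time.

```python
# Minimal sketch of the two requirements the thesis names: in-place index
# maintenance on file changes, and per-user permission filtering at query
# time. All structures are illustrative placeholders.

from collections import defaultdict

class FSIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> paths
        self.readable_by = {}              # path -> set of users

    def upsert(self, path, text, users):
        self.delete(path)                  # modification = delete + insert
        for term in text.lower().split():
            self.postings[term].add(path)
        self.readable_by[path] = set(users)

    def delete(self, path):
        for paths in self.postings.values():
            paths.discard(path)
        self.readable_by.pop(path, None)

    def search(self, term, user):
        hits = self.postings.get(term.lower(), set())
        return sorted(p for p in hits if user in self.readable_by[p])

idx = FSIndex()
idx.upsert("/home/a/notes.txt", "quarterly budget draft", users={"alice"})
idx.upsert("/srv/shared/plan.txt", "budget plan", users={"alice", "bob"})
print(idx.search("budget", user="bob"))    # only the file bob may read
```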
Modelling and Operation Issues for Pattern Base Management Systems
- PhD thesis, Knowledge and Database Systems Laboratory, School of Electrical and Computer Engineering, NTUA
, 2007
"... ..."
Optimizing Positional Index Structures for Versioned Document Collections
, 2012
"... Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need t ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on the case of positional index structures, while most previous work has focused on the non-positional case. Building on earlier work in [32], we propose a framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing.
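A toy sketch in the spirit of the framework, under naive assumptions: positional data is stored once per distinct fragment, and versions merely reference fragments, so positions are not duplicated across versions. The line-based partitioning here is the naive baseline; choosing partitions optimally is precisely what the paper studies.

```python
# Toy two-level versioned index: positions are stored once per distinct
# fragment (here: a line), and versions reference fragments, so positional
# data is not duplicated across versions. Line-based partitioning is the
# naive baseline; the paper studies optimal substring partitioning.

from collections import defaultdict

frag_positions = {}                        # fragment text -> {term: [pos, ...]}
version_frags = defaultdict(list)          # version -> fragment list

def index_version(version, text):
    for line in text.splitlines():
        if line not in frag_positions:     # positional data stored once
            pos = defaultdict(list)
            for i, term in enumerate(line.lower().split()):
                pos[term].append(i)
            frag_positions[line] = pos
        version_frags[version].append(line)

def phrase_hits(version, w1, w2):
    """Fragments of a version containing w1 immediately followed by w2."""
    hits = []
    for frag in version_frags[version]:
        pos = frag_positions[frag]
        if any(p + 1 in pos.get(w2, []) for p in pos.get(w1, [])):
            hits.append(frag)
    return hits

index_version("r1", "versioned collections grow fast\nredundancy helps")
index_version("r2", "versioned collections grow fast\nredundancy helps a lot")
print(phrase_hits("r2", "collections", "grow"))
```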