Scalable Techniques for Document Identifier Assignment in Inverted Indexes
- WWW2010, 2010
Abstract - Cited by 6 (3 self)
Web search engines are based on a full-text data structure called an inverted index. The size of the inverted index structures is a major performance bottleneck during query processing, and a large amount of research has focused on fast and effective techniques for compressing this structure. Several authors have recently proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant improvements in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on Travelling Salesman or graph partitioning problems that achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Travelling Salesman computation on a reduced sparse graph obtained using Locality Sensitive Hashing, which achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
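To illustrate why document identifier assignment affects index size, the following is a minimal sketch and not the paper's algorithm. It assumes posting lists are stored as gaps between consecutive document IDs and that each gap is charged its Elias gamma code length; the toy collection, the `example` host names, and the function names are all hypothetical. It compares an interleaved crawl order against the simple sort-by-URL heuristic mentioned in the abstract.

```python
from collections import defaultdict

def gamma_bits(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    return 2 * (n.bit_length() - 1) + 1

def index_size_bits(docs, order):
    """Total size of all gap-encoded posting lists, in bits.

    docs:  dict mapping a document URL to its set of terms
    order: list of URLs; a document's ID is its 1-based position in this list
    """
    postings = defaultdict(list)
    for doc_id, url in enumerate(order, start=1):
        for term in docs[url]:
            postings[term].append(doc_id)  # IDs arrive in ascending order
    total = 0
    for ids in postings.values():
        prev = 0
        for doc_id in ids:
            total += gamma_bits(doc_id - prev)  # encode the gap, not the ID
            prev = doc_id
    return total

# Hypothetical collection: pages on the same host share vocabulary.
docs = {}
for i in range(1, 5):
    docs[f"a.example/p{i}"] = {"index", "compress"}
    docs[f"b.example/p{i}"] = {"crawl", "uri"}

# Crawl order interleaves the two hosts; sorting by URL clusters them.
crawl_order = [f"{host}.example/p{i}" for i in range(1, 5) for host in ("a", "b")]
url_order = sorted(docs)

print(index_size_bits(docs, crawl_order))  # 44
print(index_size_bits(docs, url_order))    # 24
```

Clustering similar documents under adjacent identifiers turns most gaps into 1, which a gap code stores in a single bit. Heuristics such as sorting by URL, and the more involved orderings the abstract describes, aim for this effect.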
Design and implementation of contextual information portals
- Proceedings of the 20th International Conference companion on World Wide Web, 2011
Abstract - Cited by 1 (1 self)
This paper presents a system for enabling offline web use to satisfy the information needs of disconnected communities. We describe the design, implementation, evaluation, and pilot deployment of an automated mechanism to construct
Fast and Scalable Pattern Mining for Media-Type Focused Crawling [experience paper]
Abstract
Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naïve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
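The idea of predicting a media type from the resource identifier alone can be sketched as follows. This is an illustrative assumption, not the classifier evaluated in the paper: it swaps in a small multinomial Naive Bayes model with add-one smoothing over URI tokens, and the training URIs, labels, and function names are hypothetical. It also shows how precision and recall for a single media type would be computed.

```python
import math
import re
from collections import Counter, defaultdict

def uri_tokens(uri):
    """Split a URI into lowercase alphanumeric tokens; no content is downloaded."""
    return [t for t in re.split(r"[^a-z0-9]+", uri.lower()) if t]

class UriNaiveBayes:
    """Multinomial Naive Bayes over URI tokens with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, uri, label):
        self.label_counts[label] += 1
        for tok in uri_tokens(uri):
            self.token_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict(self, uri):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for tok in uri_tokens(uri):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def precision_recall(predicted, actual, label):
    """Precision and recall of the predictions for one media type."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical training data.
model = UriNaiveBayes()
model.train("http://media.example.org/video/clip1.mp4", "video")
model.train("http://media.example.org/video/talk.avi", "video")
model.train("http://example.org/images/photo.jpg", "image")
model.train("http://example.org/gallery/logo.png", "image")

print(model.predict("http://example.org/video/intro.mp4"))
print(model.predict("http://example.org/gallery/cat.jpg"))
```

Because the model only tokenizes a string and sums a few log-probabilities, the per-URI cost is small, which is the property that makes identifier-based classification attractive for a crawler's frontier.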
Kairos: Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites
Abstract
We investigate the automatic harvesting of research paper metadata from recent scholarly events. Our system, Kairos, combines a focused crawler and an information extraction engine to convert a list of conference websites into an index filled with fields of metadata that correspond to individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. We use a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites and use Conditional Random Fields (CRF) to extract individual paper metadata from such websites. Experiments show a classification accuracy of over 95% for each of the two components.
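For two classes, a Maximum Entropy classifier is equivalent to logistic regression, so the URL classification step can be sketched as below. This is a minimal illustration under stated assumptions, not the Kairos implementation: the binary token features, the training URLs, the learning rate, and the function names are all hypothetical, and the CRF extraction step is not covered.

```python
import math
import re
from collections import defaultdict

def url_features(url):
    """Binary features: the set of lowercase alphanumeric tokens in the URL."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}

def predict_proba(weights, url):
    """Probability that the URL is a conference website under the model."""
    score = sum(weights[f] for f in url_features(url) if f in weights)
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=200, lr=0.5):
    """Fit a binary MaxEnt (logistic regression) model by stochastic gradient ascent."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for url, label in examples:
            error = label - predict_proba(weights, url)
            for f in url_features(url):
                weights[f] += lr * error  # gradient of the log-likelihood
    return weights

# Hypothetical labeled URLs: 1 = conference website, 0 = anything else.
examples = [
    ("http://www2010.example.org/conference/call-for-papers", 1),
    ("http://icde.example.edu/conference/program", 1),
    ("http://shop.example.com/products/shoes", 0),
    ("http://news.example.com/sports/scores", 0),
]
weights = train(examples)

print(predict_proba(weights, "http://sigir.example.org/conference/papers") > 0.5)
print(predict_proba(weights, "http://store.example.com/products/hats") > 0.5)
```

Tokens that appear only in conference URLs (such as `conference` here) receive positive weights, while tokens shared by both classes stay close to neutral. A real system would need far more training data and richer features than this toy set.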
Re-architecting Web and Mobile Information Access for Emerging Regions
- 2011
Abstract
I would like to start by expressing my deepest gratitude to my advisor, Lakshminarayanan Subramanian (or just “Lakshmi”). It was Lakshmi who set me on the path toward my eventual area of research. Lakshmi has always been generous with his time, and never short on ideas or enthusiasm. Without Lakshmi’s courage to pursue the research that inspires him, I would not have found my own passion: to build systems that benefit people, as many people as much as possible, by inventing ways to bring technology to people living outside of the privileged regions of the world. Contributors to this dissertation: This thesis is based on research that I performed over the past five years, with many colleagues contributing directly to the work in this dissertation. Many people helped me along the way, and I could not have done without their help. The RuralCafe user study would not have been possible without the help of Saleema Amershi and Aditya Dhananjay (Chapter 6.6). Our low bandwidth transport modeling and analysis (Chapter 3.1) was an effort largely attributable to Janardhan Iyengar and long discussions with Bryan Ford. Russell Power implemented the feature reduction algorithm for CIPs (Chapter 7.2.2) in his “spare time”. Our ELF deployments (Chapters 2.2 and 5.3) were only possible with help from David Hutchful.

