Focused crawling: a new approach to topic-specific Web resource discovery
, 1999
Cited by 411 (8 self)
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
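The idea of ranking the crawl boundary by likely relevance can be sketched as a best-first crawl. What follows is a minimal illustration under stated assumptions, not the paper's classifier or any other part of its system: `focused_crawl`, `TOY_WEB`, and the keyword-based `toy_relevance` are hypothetical names, and a real crawler would fetch pages over the network and score them with a classifier trained on the exemplary documents.

```python
import heapq

# Hypothetical in-memory link graph standing in for the Web.
TOY_WEB = {
    "cycling/index": ["cycling/a", "finance/x"],
    "cycling/a": ["cycling/b"],
    "cycling/b": [],
    "finance/x": ["finance/y"],
    "finance/y": [],
}


def toy_relevance(url):
    """Stand-in for a trained classifier: relevant if the URL mentions the topic."""
    return 1.0 if "cycling" in url else 0.1


def focused_crawl(seeds, fetch_links, relevance, budget):
    """Visit up to `budget` pages, expanding the most promising link first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # An unvisited link inherits the relevance of the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited


def toy_fetch_links(url):
    """Look up outlinks in the toy graph instead of downloading the page."""
    return TOY_WEB.get(url, [])
```

Calling `focused_crawl(["cycling/index"], toy_fetch_links, toy_relevance, budget=4)` spends the budget on pages reached through relevant pages first, so links found on the low-scoring `finance/x` page are deferred until nothing better remains on the frontier.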
The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web
, 1997
Cited by 45 (3 self)
The AT&T Internet Difference Engine (aide) is a system that finds and displays changes to pages on the World Wide Web. The system consists of several components, including a webcrawler that detects changes, an archive of past versions of pages, a tool called HtmlDiff to highlight changes between versions of a page, and a graphical interface to view the relationship between pages over time. This paper describes aide, with an emphasis on the evolution of the system and experiences with it. It also raises some sociological and legal issues.
Methods for sampling pages uniformly from the world wide web
- In AAAI Fall Symposium on Using Uncertainty Within Computation
, 2001
Cited by 32 (2 self)
We present two new algorithms for generating uniformly random samples of pages from the World Wide Web, building upon recent work by Henzinger et al. (Henzinger et al. 2000) and Bar-Yossef et al. (Bar-Yossef et al. 2000). Both algorithms are based on a weighted random-walk methodology. The first algorithm (DIRECTED-SAMPLE) operates on arbitrary directed graphs, and so is naturally applicable to the web. We show that, in the limit, this algorithm generates samples that are uniformly random. The second algorithm (UNDIRECTED-SAMPLE) operates on undirected graphs, thus requiring a mechanism for obtaining inbound links to web pages (e.g., access to a search engine). With this additional knowledge of inbound links, the algorithm can arrive at a uniform distribution faster than DIRECTED-SAMPLE, and we derive explicit bounds on the time to convergence. In addition, we evaluate the two algorithms on simulated web data, showing that both yield reliably uniform samples of pages. We also compare our results with those of previous algorithms, and discuss the theoretical relationships among the various proposed methods.
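As a rough illustration of how a random walk can be made to sample an undirected graph uniformly, the sketch below uses a standard max-degree walk, in which every node is padded with self-loops up to the maximum degree. This is a swapped-in textbook technique, not the UNDIRECTED-SAMPLE algorithm from the paper. `TOY_GRAPH` and all function names are hypothetical, and on the real web the neighbor lists would have to combine outlinks with a source of inbound links such as a search engine.

```python
import random

# Hypothetical undirected graph: every edge is listed in both directions.
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c"],
}


def max_degree(graph):
    """Largest number of neighbors of any node."""
    return max(len(neighbors) for neighbors in graph.values())


def sample_node(graph, start, steps, rng):
    """Run a max-degree random walk for `steps` steps and return the end node."""
    d_max = max_degree(graph)
    node = start
    for _ in range(steps):
        neighbors = graph[node]
        # Choose one of d_max slots; slots past the real neighbors are self-loops.
        slot = rng.randrange(d_max)
        if slot < len(neighbors):
            node = neighbors[slot]
    return node


def stationary_distribution(graph, iterations=2000):
    """Approximate the walk's limiting distribution by power iteration."""
    d_max = max_degree(graph)
    dist = {node: 0.0 for node in graph}
    dist[next(iter(graph))] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in dist.items():
            neighbors = graph[node]
            for neighbor in neighbors:
                nxt[neighbor] += mass / d_max
            # Remaining probability stays on the node via its self-loops.
            nxt[node] += mass * (d_max - len(neighbors)) / d_max
        dist = nxt
    return dist
```

Because every node ends up with the same padded degree, the transition probabilities are symmetric, which is why the limiting distribution on a connected graph is uniform even though node degrees differ.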
Realistic Books: A bizarre homage to an obsolete medium?
Cited by 14 (1 self)
For many readers, handling a physical book is an enjoyably exquisite part of the information-seeking process. Many physical characteristics of a book---its size, heft, the patina of use on its pages and so on---communicate ambient qualities of the document it represents. In contrast, the experience of accessing and exploring digital library documents is dull. The emphasis is utilitarian; technophile rather than bibliophile. We have extended the page-turning algorithm we reported at last year's JCDL into a scalable, systematic approach that allows users to view and interact with realistic visualizations of any text-based document in a Greenstone collection. Here, we further motivate the approach, illustrate the system in use, discuss the system architecture and present a user evaluation. Our work leads us to believe that far from being a whimsical gimmick, physical book models can usefully complement conventional document viewers and increase the perceived value of a digital library system.
Using Mobile Crawlers to Search the Web Efficiently
- International Journal of Computer and Information Science
, 2000
Cited by 9 (0 self)
Due to the enormous growth of the World Wide Web, search engines have become indispensable tools for Web navigation. In order to provide powerful search facilities, search engines maintain comprehensive indices for documents and their contents on the Web by continuously downloading Web pages for processing. In this paper, we demonstrate an alternative, more efficient approach to the "download-first process-later" strategy of existing search engines by using mobile crawlers. The major advantage of the mobile approach is that the analysis portion of the crawling process is done locally where the data resides rather than remotely inside the Web search engine. This can significantly reduce network load which, in turn, can improve the performance of the crawling process.
Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web
, 2002
cultural asset management and ethnohistory. Preserving the process and understanding the past
- Archivi & Computer
, 2001
Cited by 1 (0 self)
Studying the Life on the Internet
DIVISOR: DIstributed VIdeo Server fOr stReaming
- In Proceedings of the 5th IEEE/WSES International Conference on Circuits, Systems, Communications and Computers (CSCC
, 2001
Cited by 1 (1 self)
This paper presents the design and implementation of a networking system architecture targeted to support high-speed video transmission to multiple clients. We have designed, implemented, and evaluated a high-speed, distributed Video Server, which is divided into two components, the video encoding unit and the network protocol processing unit. The video encoding unit performs the video data encoding, while the network protocol processing unit deals with the network protocol processing. In order to provide a low-cost, scalable system, we have used commercial, off-the-shelf components. We have implemented our system using a small cluster of personal computers, connected via an optical fiber (raw ATM communication). Our initial experimental evaluation suggests that our Distributed Video Server for Streaming (DIVISOR) can efficiently provide predictable response to a large number of clients, guaranteeing Quality of Service and real-time delivery.
The WEB archives: A time-machine in your pocket!
, 1999
Cited by 1 (0 self)
Taking an interdisciplinary approach, the authors discuss both technical issues of creating archives of the World Wide Web (as suggested at www.archive.org), and the possible socio-political relevance of such archives in the future. As the Internet becomes the Ever- and Everywherenet, the Web archives may become a memory of mankind, a sort of time-machine to go back into the past. The authors present the hardware and software concepts, and an initial analysis, of a highly scalable and extendable approach to archive a fully queryable copy of the ever-changing Web. The purpose is not to compete with the efforts at www.archives.org, but to present research results that may be useful in any future archiving project of the Web. The authors' software approach is unique in that the search strategy of the Web crawler is based on capture-recapture techniques (from statistics), rather than the common brute-force method of scanning as many Web pages as possible. This includes estimates on th...
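For readers unfamiliar with capture-recapture, the simplest estimator in that family is the Lincoln-Petersen formula: if two independent samples of sizes n1 and n2 share m items, the population size is estimated as n1 * n2 / m. The sketch below applies it to two sets of crawled URLs. It is a generic textbook illustration, not the authors' crawler strategy, and the function name and sample data are hypothetical.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two independent samples of identifiers."""
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)  # items seen in both samples
    if recaptured == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(first) * len(second) / recaptured


# Hypothetical data: two crawls that each saw part of the same site.
first_crawl = [f"page{i}" for i in range(0, 100)]    # 100 pages
second_crawl = [f"page{i}" for i in range(50, 250)]  # 200 pages, 50 shared
estimated_pages = lincoln_petersen(first_crawl, second_crawl)
```

The estimate is only meaningful when both samples are drawn roughly at random from the same population, which is a strong assumption for link-following crawls.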
FILTERING THE FUTURE?: SOFTWARE FILTERS, PORN, PICS, AND THE INTERNET CONTENT CONUNDRUM
, 1999
Cited by 1 (0 self)
This thesis could not have been completed without the help of a great many people. First and foremost I would like to thank my advisor, Dr. W. Russell Neuman, for guiding me into the area of public policy and political economy. More than a professor, Dr. Neuman is a friend and mentor who has taught me that diplomacy matters as much as passion. Dr. Joseph Turow graciously agreed to be my second reader, and provided me with wonderful access to his research on families and the Internet. Dr. Hugh Donahue provided great moral support and help rating web sites. I would also like to thank Dr. Carolyn Marvin for helping to shape my thinking about the history of pornography and free speech. Several Annenberg students were also very helpful, particularly in testing web sites and helping with data analysis. Jennifer Stromer-Galley and John Bracken were real troopers in agreeing to content analyze 20 or so web pages, many with rather controversial content. Kirkland Ahern's statistical prowess was absolutely essential in analyzing my content analysis data. Few people are so willing to give such time for free and with such a bright spirit. Finally, I would like to dedicate this thesis to my loving mother and father. Throughout the arduous process of researching and writing, I drew inspiration from my father and his hard working, never

