Efficient Crawling Through URL Ordering
- COMPUTER NETWORKS AND ISDN SYSTEMS, 1998
- Cited by 253 (8 self)
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more “important” pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
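The idea of ordering a crawl by importance can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes backlink count (links seen so far from already crawled pages) as the importance metric, and the function name `crawl_order`, the in-memory `graph` dictionary, and the alphabetical tie-break are all hypothetical choices made for the example.

```python
from collections import defaultdict


def crawl_order(graph, seed, max_pages):
    """Return the order in which pages are visited.

    graph: dict mapping a URL to the list of URLs it links to
           (a stand-in for actually downloading and parsing a page).
    At each step the crawler picks the known, unvisited URL with the
    most backlinks seen so far; ties are broken alphabetically.
    """
    backlinks = defaultdict(int)   # URL -> backlinks observed so far
    frontier = {seed}              # URLs seen but not yet visited
    seen = {seed}
    visited = []

    while frontier and len(visited) < max_pages:
        # Assumed importance metric: observed backlink count.
        url = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(url)
        visited.append(url)

        for target in graph.get(url, []):
            backlinks[target] += 1
            if target not in seen:
                seen.add(target)
                frontier.add(target)

    return visited


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": []}
    print(crawl_order(toy_graph, "a", 10))
```

In the toy graph, page `d` is visited before `c` because it has collected two backlinks by the time the crawler chooses between them, whereas a plain breadth-first crawl would visit `c` first.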
Multi-Service Search and Comparison Using the MetaCrawler
- In Proceedings of the 4th International World Wide Web Conference, 1995
- Cited by 172 (8 self)
Standard Web search services, though useful, are far from ideal. There are over a dozen different search services currently in existence, each with a unique interface and a database covering a different portion of the Web. As a result, users are forced to repeatedly try and retry their queries across different services. Furthermore, the services return many responses that are irrelevant, outdated, or unavailable, forcing the user to manually sift through the responses searching for useful information. This paper presents the MetaCrawler, a fielded Web service that represents the next level up in the information "food chain." The MetaCrawler provides a single, central interface for Web document searching. Upon receiving a query, the MetaCrawler posts the query to multiple search services in parallel, collates the returned references, and loads those references to verify their existence and to ensure that they contain relevant information. The MetaCrawler is sufficiently lightweight to r...
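The query flow described above (post the query to several services in parallel, then collate the returned references) can be sketched roughly as below. This is an illustration under stated assumptions, not the MetaCrawler implementation: each search service is modeled as a plain Python callable, the name `metasearch` is hypothetical, ranking by the number of agreeing services is an assumed collation rule, and the verification step that loads each reference is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, services):
    """Send one query to every service in parallel and collate results.

    services: dict mapping a service name to a callable that takes the
              query string and returns a list of URLs (hypothetical
              stand-ins for real search back ends).
    Returns a list of (url, [service names]) pairs, with URLs returned
    by more services listed first.
    """
    # Fan out: one worker thread per service.
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in services.items()}
        results = {name: future.result() for name, future in futures.items()}

    # Collate: merge duplicate URLs and remember which services agreed.
    sources = {}
    for name, urls in results.items():
        for url in urls:
            sources.setdefault(url, []).append(name)

    # Assumed ranking: more agreeing services first, then by URL.
    return sorted(sources.items(), key=lambda item: (-len(item[1]), item[0]))
```

A real deployment would also need per-service timeouts and error handling, since one slow or failing back end would otherwise delay or abort the whole query.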
Searching the Web
- ACM TRANSACTIONS ON INTERNET TECHNOLOGY, 2001
- Cited by 108 (1 self)
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
Parallel crawlers
- In Proceedings of the 11th international conference on World Wide Web, 2002
- Cited by 71 (3 self)
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
An Image and Video Search Engine for the World-Wide Web
- In Proc. SPIE Storage and Retrieval for Image and Video Databases, 1997
- Cited by 50 (7 self)
We describe a visual information system prototype for searching for images and videos on the World-Wide Web. New visual information in the form of images, graphics, animations and videos is being published on the Web at an incredible rate. However, cataloging this visual data is beyond the capabilities of current text-based Web search engines. In this paper, we describe a complete system by which visual information on the Web is (1) collected by automated agents, (2) processed in both text and visual feature domains, (3) catalogued and (4) indexed for fast search and retrieval. We introduce an image and video search engine which utilizes both text-based navigation and content-based technology for searching visually through the catalogued images and videos. Finally, we provide an initial evaluation based upon the cataloging of over one half million images and videos collected from the Web. Keywords -- content-based visual query, image and video storage and retrieval, World-Wide Web. 1 I...
Searching for Images and Videos on the World-Wide Web
- 1996
- Cited by 39 (1 self)
We describe a prototype visual information system for searching for images and videos on the World-Wide Web. New visual information in the form of images, graphics, animations and videos is being published on the Web at an incredible rate. However, cataloging this visual data is beyond the capabilities of current text-based Web search engines. The key to cataloging it is the marriage of text-based processing and content-based visual analysis of the images and videos. In this paper, we describe a complete system by which visual information on the Web is (1) collected by automated agents, (2) processed in both text and visual feature domains, (3) catalogued and (4) indexed for fast search and retrieval. We introduce an image and video search engine which utilizes both text-based navigation and content-based technology for searching visually through the catalogued images and videos. Finally, we provide an initial evaluation based upon the cataloging of over one half million images and videos collected from the Web. Keywords -- content-based visual query, image and video storage and retrieval, World-Wide Web. John R. Smith and Shih-Fu Chang
Discovery of Web Robot Sessions based on their Navigational Patterns
- 2002
- Cited by 37 (0 self)
Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by the Web robots and distinguish them from other users. First of all, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on the Web data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web clients. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaging and previously unknown robots. In this paper, we propose an alternative approach that uses the navigational patterns in the click-stream data to determine if it is due to a robot. Experimental results on our Computer Science department Web server logs show that highly accurate classification models can be built using this approach. We also show that these models are able to discover many camouflaging and previously unidentified robots.
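To make the notion of navigational-pattern features concrete, the sketch below derives a few per-session statistics from simplified log records. Everything specific here is an assumption for illustration: the abstract does not list the features the authors used, so the four features, the grouping of requests into sessions by (IP address, user agent), and the name `session_features` are hypothetical. The resulting feature dictionaries could then be passed to any standard classifier.

```python
from collections import defaultdict

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def session_features(records):
    """Compute simple navigational features per session.

    records: iterable of (client_ip, user_agent, method, path) tuples,
             a simplified stand-in for parsed Web server log lines.
    Sessions are approximated by grouping on (client_ip, user_agent);
    this grouping rule is an assumption, not taken from the paper.
    """
    sessions = defaultdict(list)
    for ip, agent, method, path in records:
        sessions[(ip, agent)].append((method, path))

    features = {}
    for key, requests in sessions.items():
        total = len(requests)
        features[key] = {
            # Number of requests in the session.
            "requests": total,
            # Share of HEAD requests (assumed to be rare for browsers).
            "head_fraction": sum(m == "HEAD" for m, _ in requests) / total,
            # Share of image requests (assumed to be low for many robots).
            "image_fraction": sum(
                p.lower().endswith(IMAGE_SUFFIXES) for _, p in requests
            ) / total,
            # Whether the session ever fetched /robots.txt.
            "robots_txt": any(p == "/robots.txt" for _, p in requests),
        }
    return features
```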
Balancing Volume, Quality and Freshness in Web Crawling
- In Soft Computing Systems - Design, Management and Applications, 2002
- Cited by 18 (11 self)
We describe crawling software designed for high-performance, large-scale information discovery and gathering on the Web. This crawler allows the administrator to seek a balance between the volume of a Web collection and its freshness, and also provides flexibility for defining a quality metric to prioritize certain pages.
Modeling of Web Robot Navigational Patterns
- 2000
- Cited by 11 (0 self)
In recent years, it has become increasingly difficult to ignore the impact of Web robots on both commercial and institutional Web sites. Not only do Web robots consume valuable bandwidth and Web server resources, they are also making it more difficult to apply Web Mining techniques effectively on the Web logs. E-commerce Web sites are also concerned about unauthorized deployment of shopbots for the purpose of gathering business intelligence at their Web sites. Ethical robots can be easily detected because they tend to follow most of the guidelines proposed for robot designers. On the other hand, unethical robots are more difficult to identify since they tend to camouflage their entries in the Web logs. In this paper, we examine the problem of identifying navigational patterns of Web robots using conventional machine learning techniques. Our goal is to construct a predictive model that will distinguish the browsing behavior of legitimate Web users from access patterns due to Web robots. Our results show that highly accurate models can be obtained using a small set of access features deduced from the Web logs.
Challenges on distributed web retrieval
- In IEEE 23rd International Conference on Data Engineering, 2007
- Cited by 10 (1 self)
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need for fully distributed search engines. Such engines need to achieve the following goals: high quality answers, fast response time, high query throughput, and scalability. In this paper we survey and organize recent research results, outlining the main challenges of designing a distributed Web retrieval system.

