OverCite: A Cooperative Digital Research Library
, 2005
Cited by 24 (9 self)
CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources. OverCite is a design for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture harnesses donated resources at many sites to provide document search and retrieval service to researchers worldwide. A preliminary evaluation of an initial OverCite prototype shows that it can service more queries per second than a centralized system, and that it increases total storage capacity by a factor of n/4 in a system of n nodes. OverCite can exploit these additional resources by supporting new features such as document alerts, and by scaling to larger data sets.
IRLbot: Scaling to 6 Billion Pages and Beyond
Cited by 15 (0 self)
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
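To make the three ingredients named in the abstract concrete (URL uniqueness checks, BFS order, and fixed per-host rate-limiting), here is a minimal sketch of a naive in-memory crawl frontier. It is the kind of baseline the abstract says does not scale, not IRLbot's own technique. The class name `NaiveFrontier`, its methods, and the `per_host_delay` parameter are all hypothetical names chosen for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


class NaiveFrontier:
    """Illustrative baseline only: all state is held in RAM (an assumption)."""

    def __init__(self, per_host_delay: float):
        self.per_host_delay = per_host_delay
        self.seen = set()        # every URL ever enqueued (uniqueness check)
        self.queue = deque()     # FIFO queue gives approximately BFS order
        self.next_allowed = {}   # host -> earliest time of the next request

    def add(self, url: str) -> bool:
        """Enqueue url if it has never been seen; return True if it was new."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop_ready(self, now: float):
        """Return the first queued URL whose host may be contacted at time now."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlsplit(url).hostname
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.per_host_delay
                return url
            self.queue.append(url)  # host is cooling down; rotate to the back
        return None
```

The `seen` set grows with every discovered URL, and a queue dominated by a few large hosts forces `pop_ready` to rotate past many entries that are still rate-limited, which hints at why this simple design struggles at the scale described above.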
Agyaat: Providing Mutually Anonymous Services over Structured P2P Networks
, 2004
Cited by 3 (0 self)
In the modern era of ubiquitous computing, privacy is one of the most critical user concerns. To protect their privacy, users typically try to remain anonymous to the service provider. This is especially true for decentralized Peer-to-Peer (P2P) systems, where common users act both as clients and as service providers. Preserving privacy in such cases requires mutual anonymity, which shields the users at both ends. Most unstructured P2P systems, like Gnutella [15] and Kazaa [16], provide a certain level of anonymity through the use of a random overlay topology and a flooding-based routing protocol, but suffer from the lack of guaranteed lookup of data. In contrast, most structured P2P systems, like Chord [7], are Distributed Hash Table (DHT) based systems and provide guarantees that any stored data item can be found within a bounded number of hops. However, none of the existing DHT systems provide any mutual anonymity. In this paper, we present...
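The guaranteed lookup that the abstract attributes to DHT-based systems rests on a deterministic mapping from keys to nodes. The following is a minimal sketch of that mapping in the style of a consistent-hashing ring, where each key belongs to its clockwise successor node. The names `ring_position` and `HashRing` are hypothetical, the 32-bit ring size is an arbitrary assumption, and the sketch omits the routing tables that bound the number of hops in a real system, as well as any anonymity mechanism.

```python
import bisect
import hashlib


def ring_position(name: str, bits: int = 32) -> int:
    """Hash a node name or data key onto a ring of 2**bits positions."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class HashRing:
    """Illustrative key-to-node assignment; not a full DHT."""

    def __init__(self, node_names):
        # Sort nodes by their position so successors can be found by bisection.
        self._ring = sorted((ring_position(n), n) for n in node_names)
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the node responsible for key: its clockwise successor."""
        idx = bisect.bisect_left(self._positions, ring_position(key))
        # Wrap around to the first node when the key hashes past the last one.
        return self._ring[idx % len(self._ring)][1]
```

Because every participant computes the same positions, any node can determine which peer holds a given key, and removing a node that is not responsible for a key leaves that key's owner unchanged.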
EverLast: a distributed architecture for preserving the web
- In Proc. of ACM/IEEE JCDL Conf., 2009
Cited by 2 (0 self)
The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Unfortunately, much of the data on the Web is highly ephemeral in nature, with more than 50-80% of content estimated to be changing within a short time. Continuing the pioneering efforts of many national (digital) libraries, organizations such as the
A Hybrid Topology Architecture for P2P Systems
Cited by 1 (0 self)
A core area of P2P systems research is the topology of the overlay network. It has ranged from random unstructured networks like Gnutella [8] to Super-Peer [9] architectures to the recent trend of structured overlays based on Distributed Hash Tables (DHTs) [4, 12, 11]. While the unstructured networks have excessive lookup costs and unguaranteed lookups, the structured systems offer no anonymity and delegate control over data items to unrelated peers. In this paper, we present an in-the-middle hybrid architecture which uses a mix of both topologies to create a decentralized P2P infrastructure. The system provides scalable and guaranteed lookups in addition to mutual anonymity and also allows hosting content with the content-owner. We validate our architecture through a thorough analytical and empirical performance analysis of the system.
User Oriented Information Retrieval in a Collaborative and Context Aware Search Engine
The web is the largest knowledge system that people can access. Unfortunately, its size makes it very hard to find the right information. Search engines help, but they confront users with a large number of results, mostly useless, from which users must pick out, one by one, those that satisfy their needs. To focus on user requirements and involve the end user in their systems, new search engines should evolve from keyword-based indexing and classification to more sophisticated techniques that consider the meaning, the context, and the usage of information. There are three key aspects: semantics, geo-referencing, and collaboration. Semantic analysis increases the relevance of results. Geo-referencing of catalogued resources allows contextualisation based on user position. Collaboration distributes storage, processing, and trust over a world-wide network of nodes running on users' computers, getting rid of bottlenecks and central points of failure. In this paper, we describe the studies, the concepts, and the solutions developed in the DART project to introduce these three key features in a novel search engine architecture.
Towards an Effective Personalized Information Filter for P2P Based Focused Web Crawling
Information access is one of the hottest topics of the information society and has become even more important since the advent of the Web, but general Web search engines still cannot find correct and timely information for individuals. In this paper, we propose a Peer-to-Peer (P2P) based decentralized focused Web crawling system called PeerBridge to provide a user-centered, content-sensitive, and personalized information search service from the Web. PeerBridge is built on the foundation of our previous work on WebBridge, a focused crawling system that crawls the Web according to several specified topics. The most important function of PeerBridge is to identify interesting information. We therefore present in detail an efficient personalized information filter, which combines several component neural networks to accomplish the filtering task. Performance evaluation in the experiments showed that PeerBridge is effective at crawling relevant information for specific topics and that the information filter is efficient, with precision better than that of support vector machine, naïve Bayesian, and individual neural network filters.
Don’t Thrash: How
Many large storage systems use approximate-membership-query (AMQ) data structures to deal with the massive amounts of data that they process. An AMQ data structure is a dictionary that trades off space for a false positive rate on membership queries. It is designed to fit into small, fast storage, and it is used to avoid I/Os on slow storage. The Bloom filter is a well-known example of an AMQ data structure. Bloom filters, however, do not scale outside of main memory. This paper describes the Cascade Filter™, an AMQ data structure that scales beyond main memory, supporting over half a million insertions/deletions per second and over 500 lookups per second on a commodity flash-based SSD.
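As an illustration of the space versus false-positive trade-off described above, here is a minimal in-memory Bloom filter sketch. It shows the well-known example the abstract mentions, not the Cascade Filter itself. The class name `BloomFilter`, its parameters, and the choice of SHA-256 with double hashing are assumptions made for this example; the sizing uses the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.

```python
import hashlib
import math


class BloomFilter:
    """AMQ sketch: never a false negative, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Number of bits m and number of hash functions k from the standard formulas.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(
            self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A caller would check `key in bloom` before touching slow storage and skip the I/O when the answer is False. Lowering `false_positive_rate` enlarges the bit array, which is the space cost the abstract refers to.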

