Interpreting the Data: Parallel Analysis with Sawzall
- Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure
Cited by 128 (0 self)
Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design (including the separation into two phases, the form of the programming language, and the properties of the aggregators) exploits the parallelism inherent in having data and computation distributed across many machines.
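The two-phase design described in this abstract can be illustrated with a minimal Python sketch. It is not Sawzall code and not the authors' implementation: the record format (`host size`), the table names `count` and `bytes`, and the functions `filter_phase`, `aggregate_phase`, and `run` are all hypothetical, chosen only to show a per-record filter emitting values to summing aggregators.

```python
from collections import defaultdict


def filter_phase(records):
    """Filtering phase: look at one record at a time and emit (table, key, value)."""
    for record in records:
        fields = record.split()
        if len(fields) != 2:
            continue  # skip malformed records (assumed format: "host size")
        host, size = fields[0], int(fields[1])
        yield ("count", host, 1)
        yield ("bytes", host, size)


def aggregate_phase(emitted):
    """Aggregation phase: sum the emitted values per (table, key)."""
    tables = defaultdict(lambda: defaultdict(int))
    for table, key, value in emitted:
        tables[table][key] += value
    return {name: dict(rows) for name, rows in tables.items()}


def run(shards):
    # Each shard could be filtered on a different machine; because addition is
    # commutative and associative, the order in which emitted values arrive
    # at the aggregator does not change the result.
    emitted = (item for shard in shards for item in filter_phase(shard))
    return aggregate_phase(emitted)


# Two hypothetical shards of log records.
shards = [["a.example 100", "b.example 50"], ["a.example 25", "bad line here"]]
result = run(shards)
print(result)
```

In this sketch the filter never sees more than one record at a time, which is what lets the shards be processed independently before the emitted values are merged.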
Balancing Volume, Quality and Freshness in Web Crawling
- In Soft Computing Systems - Design, Management and Applications, 2002
Cited by 18 (11 self)
We describe crawling software designed for high-performance, large-scale information discovery and gathering on the Web. This crawler allows the administrator to seek a balance between the volume of a Web collection and its freshness, and also provides flexibility for defining a quality metric to prioritize certain pages.
Web Structure, Dynamics and Page Quality
- In Proc. String Processing and Information Retrieval, 2002
Cited by 16 (5 self)
This paper is aimed at the study of quantitative measures of the relation between Web structure, page recency, and quality of Web pages. Quality is studied using different link-based metrics, considering their relationship with the structure of the Web and the last modification time of a page. We show that, as expected, PageRank is biased against new pages. As a by-product, we propose a PageRank variant that takes page recency into account, and we obtain information on how recency is related to Web structure.
Crawling the infinite Web: five levels are enough
- In Proceedings of the third Workshop on Web Graphs (WAW), 2004
Cited by 16 (4 self)
A large number of publicly available Web pages are generated dynamically upon request and contain links to other dynamically generated pages. This usually produces Web sites that can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
Web Dynamics, Structure, and Page Quality
- In Web Dynamics, 2004
Cited by 5 (3 self)
The purpose of a Web search engine is to provide an infrastructure that supports relationships between publishers of content and readers. In this context, as the numbers involved are very big (550 million users [2] and more than 3 billion pages (a lower bound that comes from the coverage of popular search engines) in 35 million sites [4] in January 2003), it is critical to provide good measures of quality that allow the user to choose "good" pages. We think that this is the main element that explains Google's [3] success. However, the notion of what is a "good page" and how it is related to different Web characteristics is not well understood. Therefore, in this chapter we address the study of the relationships between the age of a page or a site, the quality of a page, and the structure of the Web. Age is defined as the time since the page was last updated (recency). For Web servers, we use the oldest page in the site as a lower bound on the age of the site. The spe
The Semantic Web as the apotheosis of annotation, but what are its semantics?
Cited by 4 (0 self)
The paper discusses what kind of entity the proposed Semantic Web (SW) is, and does so principally by reference to the relationship of natural language structure to knowledge representation (KR). It argues that there are three distinct views on the issue: first, that the SW is basically a renaming of the traditional AI knowledge representation task, with all its problems and challenges. Secondly, there is a view that the SW will be, at a minimum, the WorldWideWeb (WWW) with its constituent documents annotated so as to yield their content, or meaning structure, more directly. This view of the SW makes natural language processing central as the procedural bridge from texts to KR, usually via some form of automated Information Extraction. This view is discussed in some detail and it is argued that this can also be seen as a way of justifying the structures used as KR for the SW. There is a third view, possibly Berners-Lee's own, that the SW is about trusted databases as the foundation of a system of web processes and services, but it is argued that this ignores the whole history of the web as a textual system, and gives no better guarantee of agreed meanings for terms than the other two approaches. There is also a fourth view, much harder to define and discuss, which is that if the SW just keeps moving as an engineering development and is lucky (as the successful scale-up of the WWW seems to have been luckier, or better designed, than many cynics expected) then real problems will not arise
The demographics of web search
- In Proceedings of the 33rd annual international ACM SIGIR conference on Research and development in information retrieval, 2010
Cited by 4 (1 self)
– users
– queries
– search engines
Web
– today, there are more than 130 million Web servers
– Web is the largest data repository (estimated as 100 billion pages)
– well-connected graph with out- and in-link power law distributions
Users
– culturally and educationally diverse
– little patience (few queries posed & few answers seen)
Web Structure, Age and Page Quality
, 2002
Cited by 3 (0 self)
This paper is aimed at the study of quantitative measures of the relation between Web structure, age, and quality of Web pages. Quality is studied from different link-based metrics and their relationship with the structure of the Web and the last modification time of a page.
Assessing and Ranking Structural Correlations in Graphs
Cited by 1 (0 self)
Real-life graphs not only have nodes and edges, but also have events taking place, e.g., product sales in social networks and virus infection in communication networks. Among different events, some exhibit strong correlation with the network structure, while others do not. Such structural correlation will shed light on viral influence existing in the corresponding network. Unfortunately, the traditional association mining concept is not applicable in graphs since it only works on homogeneous datasets like transactions and baskets. We propose a novel measure for assessing such structural correlations in heterogeneous graph datasets with events. The measure applies hitting time to aggregate the proximity among nodes that have the same event. In order to calculate the correlation scores for many events in a large network, we develop a scalable framework, called gScore, using sampling and approximation. By comparing to the situation where events are randomly distributed in the same network, our method is able to discover events that are highly correlated with the graph structure. gScore is scalable and was successfully applied to the co-author DBLP network and social networks extracted from TaoBao.com, the largest online shopping network in China, with many interesting discoveries.
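The hitting time that this abstract builds on can be illustrated with a small Python sketch. It computes the expected number of steps for a simple random walk to first reach a set of target nodes, which is the standard definition of hitting time; it is not the gScore measure or its sampling framework, whose details the abstract does not give. The function name `hitting_times`, the adjacency-dict representation, and the example graph are assumptions made for illustration, and the sketch assumes that every non-target node has at least one neighbour and can reach a target.

```python
def hitting_times(adjacency, targets, sweeps=10_000, tol=1e-12):
    """Expected steps for a uniform random walk to first reach any target node.

    Solves h(u) = 0 for u in targets and
    h(u) = 1 + mean(h(v) for v in neighbours of u) otherwise,
    by repeated in-place sweeps until the values stop changing.
    """
    h = {node: 0.0 for node in adjacency}
    for _ in range(sweeps):
        delta = 0.0
        for node, neighbours in adjacency.items():
            if node in targets:
                continue  # hitting time of a target node is zero
            new = 1.0 + sum(h[n] for n in neighbours) / len(neighbours)
            delta = max(delta, abs(new - h[node]))
            h[node] = new
        if delta < tol:
            break
    return h


# Hypothetical example: a path graph 0 - 1 - 2 where node 2 carries the event.
path = {0: [1], 1: [0, 2], 2: [1]}
times = hitting_times(path, targets={2})
print(times)
```

Nodes that are close to the event-carrying nodes get small hitting times, which is the sense in which hitting time acts as a proximity measure.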
PtoP A Peer-to-Peer Search Engine
, 2001
Traditional search engines are very useful tools for searching for specific information on the World Wide Web (WWW). But they lack the ability to index, and hence search, the dynamic content of the web, which is growing at a much faster rate than the static content. The information stored in the searchable databases of deep websites, which is around hundreds of times more than the static content quantitatively and 3-4 times better qualitatively, can only be searched by direct query to the database. But sending "one at a time" direct queries to different deep websites is time-consuming and laborious. We have developed a peer-to-peer search engine that automates the process of sending queries to these deep websites using peer-to-peer technology and presents the search results from all the sites to the user.

