SpotSigs: robust and efficient near duplicate detection in large web collections
In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008
Cited by 11 (0 self)
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
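The first contribution above, pairing stopword antecedents with short chains of content terms, can be illustrated with a minimal Python sketch. The antecedent list, the stopword list, the chain length, and the use of Jaccard similarity for comparison are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import re

# Hypothetical choices for illustration; the abstract does not list the
# antecedents, stopwords, or chain length that SpotSigs actually uses.
ANTECEDENTS = {"a", "an", "the", "is"}
STOPWORDS = ANTECEDENTS | {"of", "to", "and", "in", "on", "for", "by", "at"}


def spot_signatures(text, chain_length=2):
    """Return the set of (antecedent, content-term chain) signatures in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        # Collect the next `chain_length` non-stopword terms after the antecedent.
        chain = []
        for following in tokens[i + 1:]:
            if following not in STOPWORDS:
                chain.append(following)
                if len(chain) == chain_length:
                    break
        # Keep only complete chains.
        if len(chain) == chain_length:
            signatures.add((token, tuple(chain)))
    return signatures


def jaccard(a, b):
    """Jaccard similarity of two signature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In this sketch, navigation text such as "Home News Sports Login" contains no antecedents and therefore contributes no signatures when it precedes an article, which is the filtering effect the abstract describes.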
Towards Breaking the Quality Curse. A Web-Querying Approach to Web People Search
Cited by 7 (5 self)
Searching for people on the Web is one of the most common query types submitted to web search engines today. However, when a person name is queried, the returned webpages often contain documents related to several distinct namesakes who have the queried name. The task of disambiguating and finding the webpages related to the specific person of interest is left to the user. Many Web People Search (WePS) approaches have been developed recently that attempt to automate this disambiguation process. Nevertheless, the disambiguation quality of these techniques leaves considerable room for improvement. This paper presents a new server-side WePS approach. It is based on collecting co-occurrence information from the Web and thus it uses the Web as an external data source. A skyline-based classification technique
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
In SPAA, 2009
Cited by 3 (0 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / (√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
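To see why the abstract calls Ax easy and Aᵀx difficult under CSR, the following serial Python sketch may help. It shows the standard CSR layout only, not the CSB format the paper proposes; the function and parameter names are illustrative.

```python
def csr_matvec(n_rows, row_ptr, col_idx, values, x):
    """y = A x: each y[i] depends only on row i, so rows are independent."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def csr_matvec_transpose(n_cols, row_ptr, col_idx, values, x):
    """y = A^T x: every row scatters into shared entries y[j]."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Different rows may update the same y[col_idx[k]], so running
            # rows in parallel would need atomics or some other coordination.
            y[col_idx[k]] += values[k] * x[i]
    return y
```

In the first function, the outer loop over rows could be split across processors with no shared writes. In the second, rows write to overlapping output entries, which is the obstacle a format like CSB is meant to avoid.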
The Mobile Web Is Structurally Different
Cited by 2 (0 self)
One of the premier applications on the global Internet is browsing the World Wide Web. The advent of advanced browser-enabled cell phones, high-speed wireless networks, and “unlimited-data” pricing plans is fueling the demand for Web access on mobile devices. Further, there is an increasing amount of content in the mobile Web, the set of web pages written in markup languages (CHTML, XHTML, and WML) designed specifically for consumption on mobile wireless devices. Understanding the structural properties of the WWW can be very helpful in a variety of applications, such as crawling the web more efficiently or improving the ranking of search results. So far, however, this line of investigation has been limited to the web consisting of HTML pages. In this study we examine the structural properties of the mobile web graph inferred from a crawl of mobile markup pages. We find that the mobile web graph differs in general from the fixed web in several important ways. Its connectivity is sparser than the fixed web and its node degree distributions fall off much more rapidly. We further analyze the web graph in terms of its bow-tie structure, which has been studied previously for the fixed web. The properties of the bow-tie structure for the mobile web are quite different from those of the fixed web, such as having a smaller central core strongly connected component (SCC) and more disconnectedness. We also find that the CHTML and XHTML/WML subgraphs of the mobile web graph differ significantly, indicating the influence of different usage and maturity of the mobile web in Japan compared to other countries. We also consider the domain-level graphs, where all nodes of a domain are collapsed into a single node and all interdomain edges are hidden, and find notable differences between the fixed and mobile graphs. To our knowledge this is the first study of the structural properties of the mobile web graph.
We briefly comment on the potential implications of the findings, focusing on crawling as an example application.
Measuring Similarity to Detect Qualified Links
2006
Cited by 1 (1 self)
The success of link-based ranking algorithms is achieved based on the assumption that links imply merit of the target pages. However, on the real web, there exist links for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. In order to provide high quality search results, it is important to detect them and reduce their influence. In this paper, a method is proposed to detect such links by considering multiple similarity measures over the source pages and target pages. With the help of a classifier, these noisy links are detected and dropped. After that, link analysis algorithms are performed on the reduced link graph. The usefulness of a number of features is also tested. Experiments across 53 query-specific datasets show that our approach is able to boost Bharat and Henzinger’s imp algorithm by around 9% in terms of precision. It also outperforms a previous approach focusing on link spam detection.
Incorporating Trust into Web Search
2007
Cited by 1 (0 self)
The Web today includes many pages intended to deceive search engines, in which content or links are created to attain an unwarranted result ranking. Since the links among web pages are used to calculate authority, ranking systems should take into consideration which pages contain content to be trusted and which do not. In this paper, we assume the existence of a mechanism, such as, but not limited to, Gyöngyi et al.’s TrustRank, to estimate the trustworthiness of a given page. However, unlike existing work that uses trust to identify or demote spam pages, we propose how to incorporate a given trust estimate into the process of calculating authority for a cautious surfer. We apply a total of forty-five queries over two large, real-world datasets to demonstrate that incorporating trust into an authority calculation using our cautious surfer can improve PageRank’s precision at 10 by 11-26% and average top-10 result quality by 53-81%.
A Model for Web Mining Applications -- Conceptual Model, Architecture, Implementation and Use Cases
2008
Cited by 1 (1 self)
Web mining is a computation-intensive task even after the mining tool itself has been developed. However, most mining software is developed ad hoc and usually is neither scalable nor reused for other mining tasks. This paper presents a Web mining model and implementation, referred to as WIM (Web Information Mining), where rapid prototyping is possible. The underlying conceptual model of WIM provides its users with a level of abstraction appropriate for prototyping and experimentation throughout the Web data mining task. Abstracting from the idiosyncrasies of raw Web data representations facilitates the inherently iterative mining process. This paper details this conceptual model, together with its associated algebra, the architecture of the WIM tool, and its implementation. It also demonstrates how the model has been applied in several real Web data mining tasks. In this experimentation, WIM has proved to significantly facilitate Web mining prototyping.
Separate and Inequal: Preserving Heterogeneity in Topical Authority Flows
Web pages, like people, are often known by others in a variety of contexts. When those contexts are sufficiently distinct, a page’s importance may be better represented by multiple domains of authority, rather than by one that indiscriminately mixes reputations. In this work we determine domains of authority by examining the contexts in which a page is cited. However, we find that it is not enough to determine separate domains of authority; our model additionally determines the local flow of authority based upon the relative similarity of the source and target authority domains. In this way, we differentiate both incoming and outgoing hyperlinks by topicality and importance rather than treating them indiscriminately. We find that this approach compares favorably to other topical ranking methods on two real-world datasets and produces an approximately 10% improvement in precision and quality of the top ten results over PageRank.
General Terms Algorithms, Performance
Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted combination of the contents of neighbors to generate a better virtual document for classification. In addition, we break pages into fields, finding that a weighted combination of text from the target and fields of neighboring pages is able to reduce classification error by more than a third. We demonstrate performance on a large dataset of pages from the Open Directory Project and validate the approach using pages from a crawl from the Stanford WebBase. Interestingly, we find no value in anchor text and unexpected value in page titles (and especially titles of parent pages) in the virtual document.
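The idea of building a virtual document from a weighted combination of target and neighbor fields can be sketched roughly as follows. The field names, the weights, and the whitespace tokenization are hypothetical; the abstract does not give the actual weighting scheme, so this is only one plausible reading.

```python
from collections import Counter


def term_counts(text):
    """Count lowercase whitespace-separated terms (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def add_weighted(combined, fields, weights):
    """Add each field's term counts to `combined`, scaled by that field's weight."""
    for field, text in fields.items():
        weight = weights.get(field, 0.0)
        if weight == 0.0:
            continue  # a zero-weight field contributes nothing
        for term, count in term_counts(text).items():
            combined[term] += weight * count


def virtual_document(target_fields, neighbor_fields, target_weights, neighbor_weights):
    """Combine weighted term counts from the target page and its neighbors."""
    combined = Counter()
    add_weighted(combined, target_fields, target_weights)
    for neighbor in neighbor_fields:
        add_weighted(combined, neighbor, neighbor_weights)
    return combined
```

The resulting weighted term vector could then be passed to any text classifier. Setting a neighbor field's weight to zero, for example for anchor text, simply drops that field from the virtual document.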
Mining Anchor Text Trends for Retrieval
Anchor text has been considered a useful resource to complement the representation of target pages and is broadly used in web search. However, previous research only uses anchor text of a single snapshot to improve web search. Historical trends of anchor text importance have not been well modeled in anchor text weighting strategies. In this paper, we propose a novel temporal anchor text weighting method to incorporate the trends of anchor text creation over time, which combines historical weights of anchor text by propagating the anchor text weights among snapshots over the time axis. We evaluate our method on a real-world web crawl from the Stanford WebBase. Our results demonstrate that the proposed method can produce a significant improvement in ranking quality.

