Results 1 - 3 of 3
On the Representation and Multiplication of Hypersparse Matrices
, 2008
Cited by 4 (4 self)
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy-to-parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
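The abstract does not describe the new data structure, so the following is only an illustrative sketch of the general idea rather than the authors' method. It assumes that "hypersparse" means a submatrix with far fewer nonzeros than rows or columns, and it stores only the columns that contain a nonzero, so memory and work depend on the number of nonzeros instead of the matrix dimension. The names `to_column_map` and `spgemm` are hypothetical.

```python
from collections import defaultdict


def to_column_map(triples):
    """Build a column-keyed store from (row, col, value) triples.

    Only columns holding at least one nonzero get an entry, so the
    structure has no per-column overhead for empty columns.
    """
    cols = defaultdict(dict)
    for i, j, v in triples:
        if v != 0:
            cols[j][i] = cols[j].get(i, 0) + v
    return dict(cols)


def spgemm(a_cols, b_cols):
    """Compute C = A * B column by column on column maps.

    Column j of C is the sum over k of B[k, j] * (column k of A).
    Columns of A that are absent from the map are skipped outright.
    """
    c_cols = {}
    for j, b_col in b_cols.items():
        acc = defaultdict(float)  # sparse accumulator for column j of C
        for k, b_kj in b_col.items():
            a_col = a_cols.get(k)
            if a_col is None:
                continue  # column k of A is empty: no contribution
            for i, a_ik in a_col.items():
                acc[i] += a_ik * b_kj
        if acc:
            c_cols[j] = dict(acc)
    return c_cols


# Two hypersparse operands: very few nonzeros in a large index space.
A = to_column_map([(0, 0, 2), (5, 3, 1)])
B = to_column_map([(0, 7, 3), (3, 7, 4), (9, 9, 1)])
C = spgemm(A, B)
```

In this sketch the loops never iterate over the full dimension of either matrix, which is the property a kernel for hypersparse submatrices needs; a real kernel would use compact arrays rather than Python dictionaries.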
A Model for Web Mining Applications -- Conceptual Model, Architecture, Implementation and Use Cases
, 2008
Cited by 1 (1 self)
Web mining is a computation-intensive task even after the mining tool itself has been developed. However, most mining software is developed ad hoc and is usually neither scalable nor reused for other mining tasks. This paper presents a Web mining model and implementation, referred to as WIM (Web Information Mining), in which rapid prototyping is possible. The underlying conceptual model of WIM provides its users with a level of abstraction appropriate for prototyping and experimentation throughout the Web data mining task. Abstracting from the idiosyncrasies of raw Web data representations facilitates the inherently iterative mining process. This paper details this conceptual model, together with its associated algebra, the architecture of the WIM tool, and its implementation. It also demonstrates how the model has been applied in several real Web data mining tasks. As a result of this experimentation, WIM has proved to significantly facilitate Web mining prototyping.
Efficient Social Website Crawling Using Cluster Graph
, November 2009
Online social communities have gained significant popularity in recent years and have become an area of active research. Compared with general websites or well-structured Web forums, user-centered social websites pose several unique challenges for crawling, a fundamental task for data collection and data mining of large-scale online social communities: (1) Social websites have more complex link structures and much higher indegree and outdegree, resulting in a large number of duplicate links; (2) Social websites contain large amounts of duplicate content usually listed under different URLs; (3) Social websites are interactive in nature, containing a large number of action or uninformative webpages such as login, tell-a-friend, or commenting; and (4) Social webpages differ dramatically in URL format, link structure, and page layout, due to their diverse semantics, functionalities, and user customization. Previous crawler designs targeting the general Web or well-structured Web forums are inadequate for social websites, wasting network bandwidth and storage space and causing extra overload in social network analysis and data mining tasks. This work tackles the problem of efficient social website crawling by proposing two key techniques: (1) URL-based webpage clustering that identifies frequent itemsets in URLs and groups webpages into semantic clusters; and (2) cluster graph pruning that removes edges and nodes representing duplicate links and duplicate or uninformative content. The offline-trained webpage cluster graph is then used at runtime to direct the crawling process. By using only URLs and page link structures, our cluster-graph-based approach can successfully address the challenges in crawling social websites. Extensive evaluations on three different social websites demonstrate that our approach can effectively and efficiently crawl large amounts of informative social content while dramatically reducing the number of duplicate links as well as the amount of duplicate or uninformative content.
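The two techniques named in the abstract can be pictured with a small sketch. It is not the paper's algorithm: it simplifies frequent-itemset mining to counting single frequent items, and the item encoding (positioned path segments and query keys), the `min_support` threshold, and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_items(url):
    """Turn a URL into a set of items: positioned path segments and query keys."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    items = {f"path{pos}={seg}" for pos, seg in enumerate(segments)}
    items |= {f"query:{key}" for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return items


def cluster_urls(urls, min_support=2):
    """Group URLs by the set of frequent items they contain.

    Simplification: only single items are counted, not full itemsets.
    """
    counts = Counter(item for url in urls for item in url_items(url))
    frequent = {item for item, n in counts.items() if n >= min_support}
    clusters = defaultdict(list)
    for url in urls:
        signature = frozenset(url_items(url) & frequent)
        clusters[signature].append(url)
    return dict(clusters)


def prune_edges(edges, uninformative):
    """Drop duplicate edges and edges touching clusters marked uninformative."""
    seen = set()
    kept = []
    for src, dst in edges:
        if src in uninformative or dst in uninformative or (src, dst) in seen:
            continue
        seen.add((src, dst))
        kept.append((src, dst))
    return kept


urls = [
    "http://example.com/user/alice/profile",
    "http://example.com/user/bob/profile",
    "http://example.com/login?next=/user/alice",
    "http://example.com/login?next=/user/bob",
]
clusters = cluster_urls(urls)
pruned = prune_edges(
    [("profile", "friends"), ("profile", "friends"), ("profile", "login")],
    uninformative={"login"},
)
```

Here the two profile pages fall into one cluster and the two login pages into another, because user-specific segments such as `alice` and `bob` are too rare to count as frequent. A crawler could then skip every URL that matches a cluster pruned as uninformative, using only the URL string.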

