Results 1 - 10
of
15
Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search
"... This paper describes Ivory, an attempt to build a distributed retrieval system around the open-source Hadoop implementation of MapReduce. We focus on three noteworthy aspects of our work: a retrieval architecture built directly on the Hadoop Distributed File System (HDFS), a scalable Map-Reduce algo ..."
Abstract
-
Cited by 12 (8 self)
- Add to MetaCart
This paper describes Ivory, an attempt to build a distributed retrieval system around the open-source Hadoop implementation of MapReduce. We focus on three noteworthy aspects of our work: a retrieval architecture built directly on the Hadoop Distributed File System (HDFS), a scalable Map-Reduce algorithm for inverted indexing, and webpage classification to enhance retrieval effectiveness. 1.
Exploring Social Tagging Graph for Web Object Classification
- In KDD’09
"... This paper studies web object classification problem with the novel exploration of social tags. Automatically classifying web objects into manageable semantic categories has long been a fundamental preprocess for indexing, browsing, searching, and mining these objects. The explosive growth of hetero ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
This paper studies web object classification problem with the novel exploration of social tags. Automatically classifying web objects into manageable semantic categories has long been a fundamental preprocess for indexing, browsing, searching, and mining these objects. The explosive growth of heterogeneous web objects, especially non-textual objects such as products, pictures, and videos, has made the problem of web classification increasingly challenging. Such objects often suffer from a lack of easy-extractable features with semantic information, interconnections between each
Design and implementation of contextual information portals
- Proceedings of the 20th International Conference companion on World Wide Web
, 2011
"... This paper presents a system for enabling offline web use to satisfy the information needs of disconnected communities. We describe the design, implementation, evaluation, and pilot deployment of an automated mechanism to construct ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
This paper presents a system for enabling offline web use to satisfy the information needs of disconnected communities. We describe the design, implementation, evaluation, and pilot deployment of an automated mechanism to construct
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
"... Support Vector Machines present an interesting and effective approach to solve automated classification tasks. Although it only handles binary and supervised problems by nature, it has been transformed into multiclass and semi-supervised approaches in several works. A previous study on supervised an ..."
Abstract
- Add to MetaCart
Support Vector Machines present an interesting and effective approach to solve automated classification tasks. Although it only handles binary and supervised problems by nature, it has been transformed into multiclass and semi-supervised approaches in several works. A previous study on supervised and semi-supervised SVM classification over binary taxonomies showed how the latter clearly outperforms the former, proving the suitability of unlabeled data for the learning phase in this kind of tasks. However, the suitability of unlabeled data for multiclass tasks using SVM has never been tested before. In this work, we present a study on whether unlabeled data could improve results for multiclass web page classification tasks using Support Vector Machines. As a conclusion, we encourage to rely only on labeled data, both for improving (or at least equaling) performance and for reducing the computational cost. 1
Ecole Polytechnique Fédérale de Lausanne &
"... Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) w ..."
Abstract
- Add to MetaCart
Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page’s content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
HK Polytechnic Univ.
"... This paper proposes an algorithm called Imprecise Spectrum Analysis (ISA) to carry out fast dimension reduction for document classification. ISA is designed based on the one-sided Jacobi method for Singular Value Decomposition (SVD). To speedup dimension reduction, it simplifies the orthogonalizatio ..."
Abstract
- Add to MetaCart
This paper proposes an algorithm called Imprecise Spectrum Analysis (ISA) to carry out fast dimension reduction for document classification. ISA is designed based on the one-sided Jacobi method for Singular Value Decomposition (SVD). To speedup dimension reduction, it simplifies the orthogonalization process of Jacobi computation and introduces a new mapping formula for transforming original documentterm vectors. To improve classification accuracy using ISA, a feature selection method is further developed to make inter-class feature vectors more orthogonal in building the initial weighted term-document matrix. Our experimental results show that ISA is extremely fast in handling large term-document matrices and delivers better or competitive classification accuracy compared to SVD-based LSI.
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A New Search Engine Integrating Hierarchical Browsing and Keyword Search
"... The original Yahoo! search engine consists of manually organized topic hierarchy of webpages for easy browsing. Modern search engines (such as Google and Bing), on the other hand, return a flat ..."
Abstract
- Add to MetaCart
The original Yahoo! search engine consists of manually organized topic hierarchy of webpages for easy browsing. Modern search engines (such as Google and Bing), on the other hand, return a flat
Learning Search Tasks in Queries and Web Pages via
"... As the Internet grows explosively, search engines play a more and more important role for users in effectively accessing online information. Recently, it has been recognized that a query is often triggered by a search task that the user wants to accomplish. Similarly, many web pages are specifically ..."
Abstract
- Add to MetaCart
As the Internet grows explosively, search engines play a more and more important role for users in effectively accessing online information. Recently, it has been recognized that a query is often triggered by a search task that the user wants to accomplish. Similarly, many web pages are specifically designed to help accomplish a certain task. Therefore, learning hidden tasks behind queries and web pages can help search engines return the most useful web pages to users by task matching. For instance, the search task that triggers query “thinkpad T410 broken ” is to maintain a computer, and it is desirable for a search engine to return the Lenovo troubleshooting page on the top of the list. However, existing search engine technologies mainly focus on topic detection or relevance ranking, which are not able to predict the task that triggers a query and the task a web page can accomplish.

