Thwarting the nigritude ultramarine: learning to identify link spam
- In Proceedings of the 16th European Conference on Machine Learning (ECML), 2005
Cited by 30 (0 self)
The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. To maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance, and they maintain blacklists of spamming pages, while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
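To make the link-spamming effect in this abstract concrete, here is a minimal sketch that runs a plain power-iteration PageRank on a tiny graph, with and without a link farm. It is not the paper's classifier or data: the graph, the page names ("a", "b", "target", "farm0", ...), the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page site graph.
honest_links = {"a": ["b", "target"], "b": ["a"], "target": ["a"]}

# Same graph plus ten artificial referring pages that only link to the target.
spammed_links = dict(honest_links)
for i in range(10):
    spammed_links[f"farm{i}"] = ["target"]

honest = pagerank(honest_links)
spammed = pagerank(spammed_links)

print("target rank without farm:", round(honest["target"], 4))
print("target rank with farm:   ", round(spammed["target"], 4))
```

In this toy setting the farm pages have no incoming links of their own, yet their combined outgoing links raise the target above an otherwise equivalent page ("b"). This is the kind of inflation that the abstract says detection methods try to counter.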
Libra: A library operating system for a jvm in a virtualized execution environment
- In VEE (Virtual Execution Environments), 2007
Cited by 7 (1 self)
If the operating system could be specialized for every application, many applications would run faster. For example, Java virtual machines (JVMs) provide their own threading model and memory protection, so general-purpose operating system implementations of these abstractions are redundant. However, traditional means of transforming existing systems into specialized systems are difficult to adopt because they require replacing the entire operating system. This paper describes Libra, an execution environment specialized for IBM’s J9 JVM. Libra does not replace the entire operating system. Instead, Libra and J9 form a single statically-linked image that runs in a hypervisor partition. Libra provides the services necessary to achieve good performance for the Java workloads of interest but relies on an instance of Linux in another hypervisor partition to provide a networking stack, a filesystem, and other services. The expense of remote calls is offset by the fact that Libra’s services can be customized for a particular workload; for example, on the Nutch search engine, we show that two simple customizations improve application throughput by a factor of 2.7.
Specialized Execution Environments
Cited by 4 (0 self)
Virtualization has become popular (again) as a means of consolidating multiple operating systems (OSes) onto a smaller set of hardware resources. The roles of OSes in such environments have changed. Whereas normally an OS provides balance between the demands of application and hardware support, in the world of virtualization it can be beneficial to split these roles. One OS may support a particular application set and use other OSes to interact with physical hardware. The hypervisor, or virtualization layer, provides communication facilities for the inter-OS communication needed to support such a deployment model. OSes can now be (1) dedicated to service specific applications, (2) detached from the underlying hardware, and (3) relieved from the need to provide the entire legacy support normally required of a generic OS. A benefit is that the
Comparing Distributed Indexing: To MapReduce or Not?
Cited by 3 (1 self)
Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.
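For readers unfamiliar with indexing as a MapReduce job, the sketch below simulates the map, shuffle, and reduce phases of a simple inverted-index build in a single Python process. It follows the general shape of the inverted-index example from the MapReduce literature. It is not the implementation proposed in this paper, and it does not use the Hadoop API. The function names and the two-document corpus are hypothetical.

```python
from collections import Counter, defaultdict


def map_document(doc_id, text):
    """Map: emit one (term, (doc_id, term_frequency)) pair per distinct term."""
    counts = Counter(text.lower().split())
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_term(term, postings):
    """Reduce: produce the posting list for one term, sorted by doc_id."""
    return term, sorted(postings)


def build_index(corpus):
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    )
    return dict(reduce_term(term, postings) for term, postings in shuffle(pairs).items())


# Hypothetical two-document corpus.
corpus = {1: "web search engine", 2: "search the web web"}
index = build_index(corpus)

for term in sorted(index):
    print(term, index[term])
```

Emitting one pair per distinct term per document, rather than one pair per token occurrence, reduces the volume of intermediate data that has to be shuffled. That volume is one plausible reason why different MapReduce indexing strategies differ in efficiency.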
SHOW AND TELL: A Seamlessly Integrated Tool For Searching with Image Content And Text
Cited by 1 (1 self)
image content feature querying and search is presented. The developed search tool tries to bridge the gap between commercial search engines, which are based on keyword search, and CBIR (Content Based Image Retrieval) systems, developed mostly in the academic field and designed to search based on image content. The tool is implemented by building on and extending the open source text-based search engine Nutch and its powerful Lucene-based crawling and indexing capabilities. Several user-friendly search options are provided to allow users to query the index using not only words, but also an example image or image feature descriptions. Although we evaluate the developed tool by running a set of controlled experiments on the COREL 5000 image database, the tool is able to crawl images from the World Wide Web at a larger scale.
IIIT Hyderabad at Million Query Track TREC 2009
This was our maiden attempt at the Million Query track, TREC 2009. We submitted three runs for the ad-hoc retrieval task in the Million Query track. We explored ad-hoc retrieval of web pages using Hadoop, a distributed infrastructure. To enhance recall, we expanded the queries using WordNet and also by combining the query with all possible subsets of tokens present in the query. To prevent query drift, we experimented with giving selective boosts to different steps of expansion, including giving higher boosts to sub-queries containing named entities than to those that did not. In fact, this run achieved the highest precision among our runs. Using simple statistics, we identified authoritative domains such as wikipedia.org, answers.com, etc., and attempted to boost hits from them while preventing them from overly biasing the results. An attempt at query classification was also made.
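The subset-based expansion with selective boosts can be pictured with a minimal sketch such as the one below. It is an illustration under stated assumptions, not the authors' system: the function name, the boost values (1.0 and 2.0), the example query, and the hand-supplied set of named entities are all hypothetical, and the sketch leaves out the WordNet expansion step.

```python
from itertools import combinations


def expand_query(tokens, named_entities, base_boost=1.0, entity_boost=2.0):
    """Return (sub_query, boost) pairs for every non-empty subset of tokens.

    Subsets keep the original token order. A sub-query that contains at
    least one named entity receives the higher (hypothetical) boost.
    """
    expansions = []
    for size in range(len(tokens), 0, -1):  # longest sub-queries first
        for subset in combinations(tokens, size):
            has_entity = any(token in named_entities for token in subset)
            boost = entity_boost if has_entity else base_boost
            expansions.append((" ".join(subset), boost))
    return expansions


# Hypothetical query; the named-entity set is supplied by hand here.
expansions = expand_query(["obama", "family", "tree"], named_entities={"obama"})

for sub_query, boost in expansions:
    print(f"{sub_query!r} boost={boost}")
```

The number of sub-queries grows as 2^n - 1 in the number of tokens, so a real system would likely cap the query length or prune subsets. The abstract does not say how the authors handled this.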
CONTINUOUS-TIME INFINITE DYNAMIC TOPIC MODELS
Topic models are probabilistic models for discovering topical themes in collections of documents. In real-world applications, these models provide us with the means of organizing what would otherwise be unstructured collections. They can help us cluster a huge collection into different topics or find a subset of the collection that resembles the topical theme found in an article at hand. The first wave of topic models was able to discover the prevailing topics in a big collection of documents spanning a period of time. It was later realized that these time-invariant models were not capable of modeling 1) the time-varying number of topics they discover and 2) the time-changing structure of these topics. A few models were developed to address these two deficiencies. The online hierarchical Dirichlet process models the documents with a time-varying number of topics. It varies the structure of the topics over time as well. However, it relies on document order, not timestamps, to evolve the model over time. The continuous-time dynamic topic model evolves topic structure in continuous time. However, it uses a fixed number of topics over time.

