Detecting phrase-level duplication on the world wide web
- In Proceedings of the 28th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval, 2005
Cited by 32 (1 self)
Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated “spam” web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such “slice and dice” generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.
Detecting Spam blogs: A machine learning approach
- Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006
Cited by 28 (7 self)
Weblogs or blogs are an important new way to publish information, engage in discussions, and form communities on the Internet. The Blogosphere has unfortunately been infected by several varieties of spam-like content. Blog search engines, for example, are inundated by posts from splogs – false blogs with machine-generated or hijacked content whose sole purpose is to host ads or raise the PageRank of target sites. We discuss how SVM models based on local and link-based features can be used to detect splogs. We present an evaluation of learned models and their utility to blog search engines, systems that employ techniques differing from those of conventional web search engines.
Spam double-funnel: connecting web spammers with advertisers
- In WWW, 2007
Cited by 28 (0 self)
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
Characterization of national Web domains
- ACM Transactions on Internet Technology, 2005
Cited by 22 (8 self)
During the last few years, several studies on the characterization of the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web, because at the same time these are diverse (as they are written by several authors) and yet rather similar (as they share a common geographical, historical and cultural context). This paper discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented, and a separation of concerns between contents, links, and technologies. Based on this, we present a side-by-side comparison of the results of 12 Web characterization studies comprising over 120 million pages from 24 countries. The comparison unveils similarities and differences between the collections, and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.
On the utility of incremental feature selection for the classification of textual data streams
- In 10th Panhellenic Conference on Informatics (PCI 2005), 2005
Cited by 6 (4 self)
In this paper we argue that incrementally updating the features that a text classification algorithm considers is very important for real-world textual data streams, because in most applications the distribution of data and the description of the classification concept change over time. We propose the coupling of an incremental feature ranking method and an incremental learning algorithm that can consider different subsets of the feature vector during prediction (what we call a feature-based classifier), in order to deal with the above problem. Experimental results with a longitudinal database of real spam and legitimate emails show that our approach can adapt to the changing nature of streaming data and works much better than classical incremental learning algorithms.
Local computation of pagerank contributions
- In WAW, 2007
Cited by 5 (2 self)
Motivated by the problem of detecting link-spam, we consider the following graph-theoretic primitive: Given a webgraph G, a vertex v in G, and a parameter δ ∈ (0, 1), compute the set of all vertices that contribute to v at least a δ fraction of v’s PageRank. We call this set the δ-contributing set of v. To this end, we define the contribution vector of v to be the vector whose entries measure the contributions of every vertex to the PageRank of v. A local algorithm is one that produces a solution by adaptively examining only a small portion of the input graph near a specified vertex. We give an efficient local algorithm that computes an ɛ-approximation of the contribution vector for a given vertex by adaptively examining O(1/ɛ) vertices. Using this algorithm, we give a local approximation algorithm for the primitive defined above. Specifically, we give an algorithm that returns a set containing the δ-contributing set of v and at most O(1/δ) vertices from the δ/2-contributing set of v, and which does so by examining at most O(1/δ) vertices. We also give a local algorithm for solving the following problem: If there exist k vertices that contribute a ρ-fraction to the PageRank of v, find a set of k vertices that contribute at least a (ρ − ɛ)-fraction to the PageRank of v. In this case, we prove that our algorithm examines at most O(k/ɛ) vertices.
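For intuition, the contribution vector and the δ-contributing set defined in this abstract can be computed by brute force on a small graph. The Python sketch below is not the local algorithm the abstract describes: it runs a global power iteration from every vertex. It assumes PageRank with a uniform teleport distribution, so that the contribution of u to v is the personalized PageRank of u evaluated at v, divided by the number of vertices. The teleport probability `alpha`, the function names, and the requirement that every vertex has at least one out-link are illustrative assumptions, not details taken from the abstract.

```python
def personalized_pagerank(out_links, source, alpha=0.15, iters=200):
    """Power iteration for the PageRank vector personalized to `source`.

    `out_links` maps each vertex to a non-empty list of out-neighbours
    (assumption: no dangling vertices).
    """
    nodes = list(out_links)
    p = {u: 0.0 for u in nodes}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        nxt[source] += alpha  # teleport back to the source
        for u in nodes:
            share = (1.0 - alpha) * p[u] / len(out_links[u])
            for w in out_links[u]:
                nxt[w] += share  # follow an out-link
        p = nxt
    return p


def contribution_vector(out_links, v, alpha=0.15):
    """Contribution of every vertex u to the PageRank of v (brute force)."""
    n = len(out_links)
    return {u: personalized_pagerank(out_links, u, alpha)[v] / n
            for u in out_links}


def delta_contributing_set(out_links, v, delta, alpha=0.15):
    """Vertices contributing at least a `delta` fraction of v's PageRank."""
    contrib = contribution_vector(out_links, v, alpha)
    pagerank_v = sum(contrib.values())  # PageRank of v is the sum of contributions
    return {u for u, c in contrib.items() if c >= delta * pagerank_v}


graph = {"a": ["c"], "b": ["c"], "c": ["a"]}  # hypothetical toy webgraph
print(delta_contributing_set(graph, "b", 0.5))
```

In the toy graph no vertex links to `b`, so `b` is the only contributor to its own PageRank. The brute-force version costs one full power iteration per vertex, which is the cost a local algorithm is designed to avoid.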
Detecting Splogs via Temporal Dynamics Using Self-Similarity Analysis
Cited by 2 (0 self)
This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs not only degrade the quality of search engine results, but also waste network resources. The splog detection problem is made difficult due to the lack of stable content descriptors. We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate the temporal changes of the post sequence. (b) We study the blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlation between attributes and posts, depending on the type of blogs (normal blogs and splogs). (c) We propose two types of novel temporal features to capture the splog temporal characteristics. In our splog detector, these novel features are combined
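The self-similarity matrix from idea (a) can be illustrated with a minimal Python sketch. It covers only the content attribute, and it assumes each post is represented as a normalized word-frequency histogram split on whitespace; that representation, the function names, and the sample posts are hypothetical and are not taken from the article.

```python
from collections import Counter


def normalized_histogram(tokens):
    """Word-frequency histogram whose weights sum to 1 (empty for no tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(weight, h2.get(term, 0.0)) for term, weight in h1.items())


def self_similarity_matrix(posts):
    """Entry [i][j] is the similarity between post i and post j, in post order."""
    hists = [normalized_histogram(post.split()) for post in posts]
    return [[histogram_intersection(a, b) for b in hists] for a in hists]


# Hypothetical post sequence: two near-identical spam posts and one unrelated post.
posts = ["buy cheap pills", "buy cheap pills", "hiking trip photos"]
for row in self_similarity_matrix(posts):
    print([round(value, 2) for value in row])
```

Repetitive post sequences produce large off-diagonal values in this matrix, which is the kind of regularity the abstract says the visual signature exposes.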
Efficient Algorithms for Large-Scale Local Triangle Counting
Cited by 1 (1 self)
In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. We consider the question both for undirected and directed graphs. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first contribution that addresses the problem of approximate local triangle counting with a focus on the efficiency issues arising in massive graphs and that also considers the directed case. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help detect the presence of spamming activity in large-scale Web graphs, as well as provide useful features for content quality assessment in social networks. For computing the local number of triangles (undirected and directed), we propose two approximation algorithms, which are based on the idea of min-wise independent permutations [Broder et al. 1998]. Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm
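The two quantities the abstract estimates, the local number of triangles and the local clustering coefficient, can be shown with an exact in-memory computation. The Python sketch below is a reference baseline for small undirected graphs and is not the semi-streaming approximation the article proposes. The adjacency representation (a dict of neighbour sets with no self-loops) and the function names are assumptions made for illustration.

```python
def local_triangle_counts(adj):
    """Exact number of triangles incident to each node of an undirected graph.

    `adj` maps each node to the set of its neighbours. Every triangle at v is
    found twice when summing shared neighbours over v's edges, hence the // 2.
    """
    counts = {}
    for v, neighbours in adj.items():
        shared = sum(len(neighbours & adj[u]) for u in neighbours)
        counts[v] = shared // 2
    return counts


def local_clustering_coefficient(adj):
    """Fraction of neighbour pairs of each node that are themselves connected."""
    triangles = local_triangle_counts(adj)
    coefficients = {}
    for v, neighbours in adj.items():
        degree = len(neighbours)
        if degree < 2:
            coefficients[v] = 0.0  # no neighbour pairs to close
        else:
            coefficients[v] = 2 * triangles[v] / (degree * (degree - 1))
    return coefficients


# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(local_triangle_counts(graph))
print(local_clustering_coefficient(graph))
```

The exact version holds every adjacency set in memory, which is what becomes impractical on massive graphs and motivates the approximate approach described in the abstract.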
Seven months with the devils: a long-term study of content polluters on Twitter
- In AAAI Int’l Conference on Weblogs and Social Media (ICWSM), 2011
Cited by 1 (0 self)
The rise in popularity of social networking sites such as Twitter and Facebook has been paralleled by the rise of unwanted, disruptive entities on these networks—including spammers, malware disseminators, and other content polluters. Inspired by sociologists working to ensure the success of commons and criminologists focused on deterring vandalism and preventing crime, we present the first long-term study of social honeypots for tempting, profiling, and filtering content polluters in social media. Concretely, we report on our experiences via a seven-month deployment of 60 honeypots on Twitter that resulted in the harvesting of 36,000 candidate content polluters. As part of our study, we (i) examine the harvested Twitter users, including an analysis of link payloads, user behavior over time, and followers/following network dynamics and (ii) evaluate a wide range of features to investigate the effectiveness of automatic content polluter identification.
Detecting Fake Content with Relative Entropy Scoring
How can natural texts be distinguished from artificially generated ones? Fake content is commonly encountered on the Internet, ranging from web scraping to random word salads. Most of this fake content is generated for spam purposes. In this paper, we present two methods to deal with this problem. The first one uses classical language models, while the second one is a novel approach using short range information between words.

