Results 1 - 10
of
408
K-means++: The advantages of careful seeding.
- In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07,
, 2007
"... Abstract The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, ran ..."
Abstract
-
Cited by 478 (8 self)
- Add to MetaCart
(Show Context)
Abstract The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
Subspace clustering for high dimensional data: a review
- ACM SIGKDD Explorations Newsletter
, 2004
"... Subspace clustering for high dimensional data: ..."
(Show Context)
Collaborative Filtering: A Machine Learning Perspective
, 2004
"... Collaborative filtering was initially proposed as a framework for filtering information based on the preferences of users, and has since been refined in many different ways. This thesis is a comprehensive study of rating-based, pure, non-sequential collaborative filtering. We analyze existing method ..."
Abstract
-
Cited by 76 (3 self)
- Add to MetaCart
(Show Context)
Collaborative filtering was initially proposed as a framework for filtering information based on the preferences of users, and has since been refined in many different ways. This thesis is a comprehensive study of rating-based, pure, non-sequential collaborative filtering. We analyze existing methods for the task of rating prediction from a machine learning perspective. We show that many existing methods proposed for this task are simple applications or modi cations of one or more standard machine learning methods for classifi cation, regression, clustering, dimensionality reduction, and density estimation. We introduce new prediction methods in all of these classes. We introduce a new experimental procedure for testing stronger forms of generalization than has been used previously. We implement a total of nine prediction methods, and conduct large scale prediction accuracy experiments. We show interesting new results on the relative performance of these methods.
Learning Similarity Metrics for Event Identification in Social Media
"... Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of ..."
Abstract
-
Cited by 74 (9 self)
- Add to MetaCart
(Show Context)
Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different type and scale. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich “context ” associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media to events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, realworld datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.
EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis
"... The domain name service (DNS) plays an important role in the operation of the Internet, providing a two-way mapping between domain names and their numerical identifiers. Given its fundamental role, it is not surprising that a wide variety of malicious activities involve the domain name service in on ..."
Abstract
-
Cited by 54 (3 self)
- Add to MetaCart
(Show Context)
The domain name service (DNS) plays an important role in the operation of the Internet, providing a two-way mapping between domain names and their numerical identifiers. Given its fundamental role, it is not surprising that a wide variety of malicious activities involve the domain name service in one way or another. For example, bots resolve DNS names to locate their command and control servers, and spam mails contain URLs that link to domains that resolve to scam servers. Thus, it seems beneficial to monitor the use of the DNS system for signs that indicate that a certain name is used as part of a malicious operation. In this paper, we introduce EXPOSURE, a system that employs large-scale, passive DNS analysis techniques to detect domains that are involved in malicious activity. We use 15 features that we extract from the DNS traffic that allow us to characterize different properties of DNS names and the ways that they are queried. Our experiments with a large, real-world data set consisting of 100 billion DNS requests, and a real-life deployment for two weeks in an ISP show that our approach is scalable and that we are able to automatically identify unknown malicious domains that are misused in a variety of malicious activity (such as for botnet command and control, spamming, and phishing). 1
Roleminer: mining roles using subset enumeration
- In CCS ’06: Proceedings of the 13th ACM conference on Computer and communications security
, 2006
"... Role engineering, the task of defining roles and associating permissions to them, is essential to realize the full benefits of the role-based access control paradigm. Essentially, there are two basic approaches to accomplish this: the topdown and the bottom-up. The top-down approach relies on a care ..."
Abstract
-
Cited by 52 (6 self)
- Add to MetaCart
(Show Context)
Role engineering, the task of defining roles and associating permissions to them, is essential to realize the full benefits of the role-based access control paradigm. Essentially, there are two basic approaches to accomplish this: the topdown and the bottom-up. The top-down approach relies on a careful analysis of the business processes to define job functions and then specify appropriate roles from them. While this approach can aid in defining roles more accurately, it is tedious and time consuming since it requires that the semantics of the business processes be well understood. Moreover, it ignores existing permissions within an organization and does not utilize them. On the other hand, the bottomup approach starts with existing permissions and attempts to derive roles from them, thus helping to automate role definition. In this paper, we present an unsupervised approach called RoleMiner that mines roles from existing userpermission assignments. Since a role is nothing but a set of permissions, when no semantics are available, the task of role mining is essentially that of clustering users that have same (or similar) permissions. However, unlike the traditional applications of data mining that ideally require identification of non-overlapping clusters, roles will have overlapping permission needs and thus permission sets that define roles should be allowed to overlap. It is this distinction from traditional clustering that makes the problem of role mining non-trivial. Our experiments with real and simulated data sets indicate that our role mining process is quite accurate and efficient.
Document Clustering using Particle Swarm Optimization
- IEEE Swarm Intelligence Symposium, The Westin
, 2005
"... Fast and high-quality document clustering algorithms play an important role in effectively navigating, summarizing, and organizing information. Recent studies have shown that partitional clustering algorithms are more suitable for clustering large datasets. However, the K-means algorithm, the most c ..."
Abstract
-
Cited by 44 (6 self)
- Add to MetaCart
(Show Context)
Fast and high-quality document clustering algorithms play an important role in effectively navigating, summarizing, and organizing information. Recent studies have shown that partitional clustering algorithms are more suitable for clustering large datasets. However, the K-means algorithm, the most commonly used partitional clustering algorithm, can only generate a local optimal solution. In this paper, we present a Particle Swarm Optimization (PSO) document clustering algorithm. Contrary to the localized searching of the K-means algorithm, the PSO clustering algorithm performs a globalized search in the entire solution space. In the experiments we conducted, we applied the PSO, K-means and hybrid PSO clustering algorithm on four different text document datasets. The number of documents in the datasets ranges from 204 to over 800, and the number of terms ranges from over 5000 to over 7000. The results illustrate that the hybrid PSO algorithm can generate more compact clustering results than the Kmeans algorithm. 1.
A General Model for Multiple View Unsupervised Learning
, 2008
"... Multiple view data, which have multiple representations from different feature spaces or graph spaces, arise in various data mining applications such as information retrieval, bioinformatics and social network analysis. Since different representations could have very different statistical properties ..."
Abstract
-
Cited by 41 (2 self)
- Add to MetaCart
Multiple view data, which have multiple representations from different feature spaces or graph spaces, arise in various data mining applications such as information retrieval, bioinformatics and social network analysis. Since different representations could have very different statistical properties, how to learn a consensus pattern from multiple representations is a challenging problem. In this paper, we propose a general model for multiple view unsupervised learning. The proposed model introduces the concept of mapping function to make the different patterns from different pattern spaces comparable and hence an optimal pattern can be learned from the multiple patterns of multiple representations. Under this model, we formulate two specific models for
Access pattern disclosure on searchable encryption: Ramification, attack and mitigation
- in Network and Distributed System Security Symposium (NDSS
, 2012
"... The advent of cloud computing has ushered in an era of mass data storage in remote servers. Remote data storage offers reduced data management overhead for data owners in a cost effective manner. Sensitive documents, however, need to be stored in encrypted format due to security con-cerns. But, encr ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
(Show Context)
The advent of cloud computing has ushered in an era of mass data storage in remote servers. Remote data storage offers reduced data management overhead for data owners in a cost effective manner. Sensitive documents, however, need to be stored in encrypted format due to security con-cerns. But, encrypted storage makes it difficult to search on the stored documents. Therefore, this poses a major barrier towards selective retrieval of encrypted documents from the remote servers. Various protocols have been proposed for keyword search over encrypted data to address this issue. Most of the available protocols leak data access patterns due to efficiency reasons. Although, oblivious RAM based protocols can be used to hide data access patterns, such protocols are computationally intensive and do not scale well for real world datasets. In this paper, we introduce a novel attack that exploits data access pattern leakage to disclose significant amount of sensitive information using a modicum of prior knowledge. Our empirical analysis with a real world dataset shows that the proposed attack is able to disclose sensitive information with a very high accuracy. Additionally, we propose a simple technique to mitigate the risk against the proposed attack at the expense of a slight increment in computational resources and communication cost. Furthermore, our proposed mitigation technique is generic enough to be used in conjunction with any search-able encryption scheme that reveals data access pattern. 1.
ReCoM: reinforcement clustering of multi-type interrelated data objects
- In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval
, 2003
"... Most existing clustering algorithms cluster highly related data objects such as Web pages and Web users separately. The interrelation among different types of data objects is either not considered, or represented by a static feature space and treated in the same ways as other attributes of the objec ..."
Abstract
-
Cited by 39 (11 self)
- Add to MetaCart
(Show Context)
Most existing clustering algorithms cluster highly related data objects such as Web pages and Web users separately. The interrelation among different types of data objects is either not considered, or represented by a static feature space and treated in the same ways as other attributes of the objects. In this paper, we propose a novel clustering approach for clustering multi-type interrelated data objects, ReCoM (Reinforcement Clustering of Multi-type Interrelated data objects). Under this approach, relationships among data objects are used to improve the cluster quality of interrelated data objects through an iterative reinforcement clustering process. At the same time, the link structure derived from relationships of the interrelated data objects is used to differentiate the importance of objects and the learned importance is also used in the clustering process to further improve the clustering results. Experimental results show that the proposed approach not only effectively overcomes the problem of data sparseness caused by the high dimensional relationship space but also significantly improves the clustering accuracy. 1.