Results 1–10 of 24
The Role Mining Problem: Finding a Minimal Descriptive Set of Roles
In Symposium on Access Control Models and Technologies (SACMAT), 2007
Abstract (Cited by 30, 2 self)
Devising a complete and correct set of roles has been recognized as one of the most important and challenging tasks in implementing role-based access control. A key problem related to this is the notion of goodness/interestingness: when is a role good/interesting? In this paper, we define the role mining problem (RMP) as the problem of discovering an optimal set of roles from existing user permissions. The main contribution of this paper is to formally define the RMP and analyze its theoretical bounds. In addition to the basic RMP, we introduce two variations, the δ-approx RMP and the Minimal Noise RMP, that have pragmatic implications. We reduce the known “set basis problem” to the RMP to show that the RMP is NP-complete. An important contribution of this paper is also to show the relation of the role mining problem to several problems already identified in the data mining and data analysis literature. By showing that the RMP is in essence reducible to these known problems, we can directly borrow existing implementation solutions and guide further research in this direction.
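The basic RMP above amounts to Boolean matrix decomposition. A minimal sketch (the matrices, role assignments, and the helper `boolean_product` are all invented for illustration) checks that a candidate user-role / role-permission pair exactly reproduces a user-permission matrix:

```python
def boolean_product(ua, pa):
    """Boolean matrix product: entry (i, j) is 1 iff some role r has
    ua[i][r] == 1 and pa[r][j] == 1 (an OR of ANDs)."""
    n_roles = len(pa)
    return [[int(any(ua[i][r] and pa[r][j] for r in range(n_roles)))
             for j in range(len(pa[0]))]
            for i in range(len(ua))]

# 3 users x 4 permissions, describable with 2 roles (toy data).
upa = [[1, 1, 0, 0],
       [1, 1, 1, 1],
       [0, 0, 1, 1]]
ua = [[1, 0],   # user 0 holds role 0
      [1, 1],   # user 1 holds both roles
      [0, 1]]   # user 2 holds role 1
pa = [[1, 1, 0, 0],   # role 0 grants permissions 0 and 1
      [0, 0, 1, 1]]   # role 1 grants permissions 2 and 3

assert boolean_product(ua, pa) == upa  # exact decomposition with 2 roles
```

The optimization versions ask for the fewest such roles; the δ-approx variant relaxes the equality to allow a bounded number of mismatched cells.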
Multi-Assignment Clustering for Boolean Data
Abstract (Cited by 14, 4 self)
Conventional clustering methods typically assume that each data item belongs to a single cluster. This assumption does not hold in general. To overcome this limitation, we propose a generative method for clustering vectorial data in which each object can be assigned to multiple clusters. Using a deterministic annealing scheme, our method decomposes the observed data into the contributions of individual clusters and infers their parameters. Experiments on synthetic Boolean data show that our method achieves higher accuracy in source parameter estimation and superior cluster stability compared to state-of-the-art approaches. We also apply our method to an important problem in computer security known as role mining. Experiments on real-world access control data show performance gains in generalization to new employees over other multi-assignment methods. In challenging situations with high noise levels, our approach maintains its good performance, while alternative state-of-the-art techniques lack robustness.
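A toy rendition of the multi-assignment idea (not the paper's actual generative model; the cluster sources and the helper `emit` below are invented): an object assigned to several clusters emits the element-wise OR of those clusters' Boolean source rows, which is what a single-assignment method cannot represent.

```python
def emit(sources, assignment):
    """Combine the Boolean source rows of all assigned clusters with OR."""
    dim = len(sources[0])
    return [int(any(sources[k][j] for k in assignment)) for j in range(dim)]

sources = [[1, 1, 0, 0, 0],    # cluster 0's source pattern
           [0, 0, 1, 1, 0],    # cluster 1's source pattern
           [0, 0, 0, 1, 1]]    # cluster 2's source pattern

assert emit(sources, [0]) == [1, 1, 0, 0, 0]
assert emit(sources, [0, 2]) == [1, 1, 0, 1, 1]   # overlapping membership
assert emit(sources, [1, 2]) == [0, 0, 1, 1, 1]
```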
Model order selection for Boolean matrix factorization
In KDD, 2011
Abstract (Cited by 6, 4 self)
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits: for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
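A crude two-part code illustrates how MDL can pick the model order for BMF. This is a deliberately naive encoding, not the paper's description-length function, and all names and data are invented: pay one bit per factor-matrix entry, plus log2(n·m) bits to point at each cell where the Boolean product disagrees with the data, and prefer the order with the smaller total.

```python
import math

def boolean_product(b, c, m):
    """Boolean product: entry (i, j) is 1 iff some component r has
    b[i][r] == 1 and c[r][j] == 1."""
    k = len(c)
    return [[int(any(b[i][r] and c[r][j] for r in range(k)))
             for j in range(m)]
            for i in range(len(b))]

def description_length(x, b, c):
    n, m = len(x), len(x[0])
    k = len(c)
    model_bits = k * (n + m)             # factor matrices, one bit per entry
    recon = boolean_product(b, c, m)
    errors = sum(x[i][j] != recon[i][j]
                 for i in range(n) for j in range(m))
    return model_bits + errors * math.log2(n * m)  # index each noisy cell

x = [[1, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 1]]
b2 = [[1, 0], [1, 1], [0, 1]]            # exact rank-2 factorization
c2 = [[1, 1, 0, 0], [0, 0, 1, 1]]
b1 = [[1], [1], [1]]                     # best-effort rank-1 factorization
c1 = [[1, 1, 1, 1]]

# The exact rank-2 model compresses better than the noisy rank-1 one.
assert description_length(x, b2, c2) < description_length(x, b1, c1)
```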
Binary Matrix Factorization with Applications
Abstract (Cited by 6, 0 self)
An interesting problem in Nonnegative Matrix Factorization (NMF) is to factorize a matrix X that belongs to some specific class, for example a binary matrix. In this paper, we extend standard NMF to Binary Matrix Factorization (BMF for short): given a binary matrix X, we want to factorize X into two binary matrices W and H (thus preserving the most important integer property of the matrix X) satisfying X ≈ WH. Two algorithms are studied and compared. These methods rely on a fundamental boundedness property of NMF, which we propose and prove. This new property also provides a natural normalization scheme that eliminates the bias of the factor matrices. Experiments on both synthetic and real-world datasets are conducted to show the competitiveness and effectiveness of BMF.
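As a hedged illustration of how this setting differs from the Boolean-arithmetic one (the data, factors, and helper below are invented): with binary W and H but an ordinary matrix product, overlapping factors can push an entry of WH above 1, and that overshoot counts as reconstruction error.

```python
def reconstruction_error(x, w, h):
    """Squared Frobenius error ||X - WH||^2 under ordinary arithmetic."""
    k = len(h)
    err = 0
    for i in range(len(x)):
        for j in range(len(x[0])):
            approx = sum(w[i][r] * h[r][j] for r in range(k))
            err += (x[i][j] - approx) ** 2
    return err

x = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
w = [[1, 0], [1, 1], [0, 1]]
h = [[1, 1, 0], [0, 1, 1]]

# At cell (1, 1) both factors fire, so (WH)[1][1] = 2 while X[1][1] = 1:
# even this natural factor pair leaves a squared error of 1.
assert reconstruction_error(x, w, h) == 1
```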
Complexity of Inference in Latent Dirichlet Allocation
Abstract (Cited by 5, 1 self)
We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.
What is the dimension of your binary data?
Abstract (Cited by 4, 0 self)
Many 0/1 datasets have a very large number of variables; however, they are sparse, and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets and show that the basic idea of fractal dimension can be adapted for binary data. However, as such, the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset D, its normalized fractal dimension counts the number of independent columns needed to achieve the unnormalized fractal dimension of D. The normalized fractal dimension measures the degree of dependency structure in the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against PCA.
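A toy correlation-dimension estimate hints at how a fractal-style dimension can be read off 0/1 data under the Hamming metric. This is a strong simplification of the paper's construction; the data, radii, and slope formula are illustrative only.

```python
import math

def pair_fraction(data, r):
    """Fraction of point pairs whose Hamming distance is at most r."""
    n = len(data)
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if sum(a != b for a, b in zip(data[i], data[j])) <= r)
    return close / (n * (n - 1) / 2)

def correlation_dimension(data, r1, r2):
    """Slope of log pair-fraction vs. log radius between two radii."""
    return ((math.log(pair_fraction(data, r2)) - math.log(pair_fraction(data, r1)))
            / (math.log(r2) - math.log(r1)))

# A "chain" of five points where consecutive rows differ in one bit:
# its intrinsic dimension should come out well below the 4 columns.
chain = [[0, 0, 0, 0],
         [1, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 1, 0],
         [1, 1, 1, 1]]

d = correlation_dimension(chain, 1, 2)
assert 0 < d < len(chain[0])
```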
A class of probabilistic models for role engineering
In CCS ’08, ACM, 2008
Abstract (Cited by 4, 3 self)
Role engineering is a security-critical task for systems using role-based access control (RBAC). Different role-mining approaches have been proposed that attempt to automatically infer appropriate roles from existing user-permission assignments. However, these approaches are mainly combinatorial and lack an underlying probabilistic model of the domain. We present the first probabilistic model for RBAC. Our model defines a general framework for expressing user-permission assignments and can be specialized to different domains by limiting its degrees of freedom with appropriate constraints. For one practically important instance of this framework, we show how roles can be inferred from data using a state-of-the-art machine-learning algorithm. Experiments on both randomly generated and real-world data provide evidence that our approach not only creates meaningful roles but also identifies erroneous user-permission assignments in the given data.
Mining Top-K Patterns from Binary Datasets in the Presence of Noise
Abstract (Cited by 2, 0 self)
The discovery of patterns in binary datasets has many applications, e.g., in electronic commerce, TCP/IP networking, and Web usage logging. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, the presence of noise, and the extraction of only the most important patterns. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
A hierarchical model for ordinal matrix factorization
Abstract (Cited by 2, 1 self)
Preprint; to appear in Statistics and Computing. The final publication is available at www.springerlink.com. This paper proposes a hierarchical probabilistic model for ordinal matrix factorization. Unlike previous approaches, we model the ordinal nature of the data and take a principled approach to incorporating priors for the hidden variables. Two algorithms are presented for inference, one based on Gibbs sampling and one based on variational Bayes. Importantly, these algorithms can be applied to the factorization of very large matrices with missing entries. The model is evaluated on a collaborative filtering task, where users have rated a collection of movies and the system is asked to predict their ratings for other movies. The Netflix dataset, which consists of around 100 million ratings, is used for evaluation. Using root mean-squared error (RMSE) as the evaluation metric, results show that the suggested model outperforms alternative factorization techniques. Results also show that Gibbs sampling outperforms variational Bayes on this task, despite the large number of ratings and model parameters. Matlab implementations of the proposed algorithms are available from cogsys.imm.dtu.dk/ordinalmatrixfactorization.
Boolean Tensor Factorizations
Abstract (Cited by 2, 1 self)
Tensors are multiway generalizations of matrices, and, like matrices, they can be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and tensors. With the widespread adoption of matrix factorization techniques in data mining, tensor factorizations have also started to gain attention. In this paper we study Boolean tensor factorizations. We assume that the data is binary multiway data, and we want to factorize it into binary factors using Boolean arithmetic (i.e., defining 1 + 1 = 1). Boolean tensor factorizations are therefore a natural generalization of Boolean matrix factorizations. We study the theory of Boolean tensor factorizations and show that at least some of the benefits Boolean matrix factorizations have over normal matrix factorizations carry over to tensor data. We also present algorithms for Boolean variants of CP and Tucker decompositions, the two most common types of tensor factorizations. Through experiments on synthetic and real-world data, we show that Boolean tensor factorizations are a viable alternative when the data is naturally binary.
Keywords: Tensor factorization; CP factorization; Tucker factorization; Boolean tensor factorization; Boolean matrix factorization
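A minimal Boolean CP reconstruction makes the "1 + 1 = 1" point concrete (the rank, factors, and the helper `boolean_cp` are invented for the sketch): entry (i, j, k) is the OR over components of the AND of three factor entries, so overlapping components never overshoot.

```python
def boolean_cp(a, b, c):
    """Boolean CP reconstruction: x[i][j][k] = OR_r (a[i][r] AND
    b[j][r] AND c[k][r]), i.e. component sums saturate at 1."""
    rank = len(a[0])
    return [[[int(any(a[i][r] and b[j][r] and c[k][r] for r in range(rank)))
              for k in range(len(c))]
             for j in range(len(b))]
            for i in range(len(a))]

a = [[1, 0], [1, 1]]          # 2 x rank-2 factor matrices (toy data)
b = [[1, 0], [0, 1]]
c = [[1, 1], [0, 1]]

x = boolean_cp(a, b, c)
assert x[0][0][0] == 1        # covered by component 0
assert x[1][1][1] == 1        # covered by component 1
assert x[0][1][0] == 0        # no component covers (0, 1, 0)
```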