Results 1 - 10 of 65
Detecting Near-Duplicates for Web Crawling
- WWW 2007, 2007
"... Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a nea ..."
Abstract
-
Cited by 92 (0 self)
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar’s fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
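As a rough illustration of the technique this abstract describes, the sketch below builds a Charikar-style (simhash) fingerprint from a bag of tokens and checks whether two f-bit fingerprints differ in at most k bit positions. The hash function, fingerprint width, and threshold are placeholder choices, not the paper's implementation.

    # Simhash-style fingerprinting sketch; hash choice and parameters are
    # illustrative, not taken from the paper.
    import hashlib

    F = 64  # fingerprint width in bits

    def simhash(tokens, f=F):
        """Build an f-bit fingerprint from a bag of tokens."""
        v = [0] * f
        for tok in tokens:
            h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
            for i in range(f):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(f) if v[i] > 0)

    def within_k_bits(fp_a, fp_b, k=3):
        """True if two fingerprints differ in at most k bit positions."""
        return bin(fp_a ^ fp_b).count("1") <= k

    a = simhash("near duplicate web documents are abundant".split())
    b = simhash("near duplicate web documents are abundant today".split())
    print(within_k_bits(a, b))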
CorePhrase: Keyphrase Extraction for Document Clustering
- In IAPR 4th International Conference on Machine Learning and Data Mining (MLDM’2005), 2005
"... Abstract. The ability to discover the topic of a large set of text documents using relevant keyphrases is usually regarded as a very tedious task if done by hand. Automatic keyphrase extraction from multi-document data sets or text clusters provides a very compact summary of the contents of the clus ..."
Abstract
-
Cited by 28 (1 self)
The ability to discover the topic of a large set of text documents using relevant keyphrases is usually regarded as a very tedious task if done by hand. Automatic keyphrase extraction from multi-document data sets or text clusters provides a very compact summary of the contents of the clusters, which often helps in locating information easily. We introduce an algorithm for topic discovery using keyphrase extraction from multi-document sets and clusters based on frequent and significant shared phrases between documents. The keyphrases extracted by the algorithm are highly accurate and fit the cluster topic. The algorithm is independent of the domain of the documents. Subjective as well as quantitative evaluations show that the algorithm outperforms keyword-based cluster-labeling algorithms and is capable of accurately discovering the topic, often ranking it in the top one or two extracted keyphrases.
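A minimal sketch of the underlying idea, frequent phrases shared across the documents of a cluster, is shown below. The scoring is a simplified stand-in for CorePhrase's actual ranking, and the corpus and function names are illustrative.

    # Rank n-gram phrases by how many cluster documents share them
    # (a simplified stand-in for the paper's phrase scoring).
    from collections import Counter

    def phrases(text, n=2):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def cluster_keyphrases(docs, n=2, top=5):
        df = Counter()                      # document frequency of each phrase
        for doc in docs:
            df.update(phrases(doc, n))
        shared = [(p, c) for p, c in df.items() if c >= 2]
        return sorted(shared, key=lambda pc: -pc[1])[:top]

    cluster = [
        "suffix trees for activity analysis",
        "activity analysis using suffix trees",
        "unsupervised activity analysis with suffix trees",
    ]
    print(cluster_keyphrases(cluster))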
Structure from statistics - unsupervised activity analysis using suffix trees
- In IEEE ICCV, 2007
"... Models of activity structure for unconstrained environments are generally not available a priori. Recent representational approaches to this end are limited by their computational complexity, and ability to capture activity structure only up to some fixed temporal scale. In this work, we propose Suf ..."
Abstract
-
Cited by 22 (2 self)
Models of activity structure for unconstrained environments are generally not available a priori. Recent representational approaches to this end are limited by their computational complexity and by their ability to capture activity structure only up to some fixed temporal scale. In this work, we propose Suffix Trees as an activity representation to efficiently extract the structure of activities by analyzing their constituent event-subsequences over multiple temporal scales. We empirically compare Suffix Trees with some of the previous approaches in terms of feature cardinality, discriminative prowess, noise sensitivity and activity-class discovery. Finally, exploiting properties of Suffix Trees, we present a novel perspective on anomalous subsequences of activities, and propose an algorithm to detect them in linear time. We present comparative results over experimental data collected from a kitchen environment to demonstrate the competence of our proposed framework.
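The sketch below conveys the representational idea in a naive way: describe each activity by counts of its contiguous event subsequences over a few temporal scales and compare the resulting vectors. A suffix tree, as used in the paper, indexes the same substrings far more compactly; the brute-force enumeration, cosine comparison, and event names here are only for illustration.

    # Naive subsequence-count features; a suffix tree would index these
    # substrings without enumerating them explicitly.
    from collections import Counter

    def subsequence_features(events, max_len=3):
        feats = Counter()
        for i in range(len(events)):
            for j in range(i + 1, min(i + max_len, len(events)) + 1):
                feats[tuple(events[i:j])] += 1
        return feats

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in set(a) & set(b))
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    act1 = ["open_fridge", "pour", "close_fridge", "stir"]
    act2 = ["open_fridge", "pour", "stir", "close_fridge"]
    print(cosine(subsequence_features(act1), subsequence_features(act2)))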
A Novel Sequence Representation for Unsupervised Analysis of Human Activities
"... ..."
(Show Context)
Process Mining, Discovery, and Integration using Distance Measures
- In ICWS ’06: Proceedings of the IEEE International Conference on Web Services (ICWS’06), 2006
"... ABSTRACT Business processes continue to play an important role in today’s service-oriented enterprise computing systems. Mining, discovering, and integrating process-oriented services has attracted growing attention in the recent year. In this paper we present a quantitative approach to modeling and ..."
Abstract
-
Cited by 14 (1 self)
Business processes continue to play an important role in today’s service-oriented enterprise computing systems. Mining, discovering, and integrating process-oriented services has attracted growing attention in recent years. In this paper we present a quantitative approach to modeling and capturing the similarity and dissimilarity between different workflow designs. Concretely, we introduce a graph-based distance measure and a framework for utilizing this distance measure to mine the process repository and discover workflow designs that are similar to a given design pattern, or to produce one integrated workflow design by merging two or more business workflows of similar designs. We derive the similarity measures by analyzing the workflow dependency graphs of the participating workflow processes. Such an analysis is conducted in two phases. We first convert each workflow dependency graph into a normalized process network matrix. Then we calculate the metric space distance between the normalized matrices. This distance measure can be used as a quantitative and qualitative tool in process mining, process merging, and process clustering, and ultimately it can reduce or minimize the costs involved in the design, analysis, and evolution of workflow systems.
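A hedged sketch of the two-phase computation described above: encode each workflow dependency graph as a normalized adjacency matrix over a shared task vocabulary, then take a metric-space (here Frobenius) distance between the matrices. The exact normalization and matrix layout in the paper may differ, and the task names are made up.

    # Graph-to-matrix distance sketch (normalization is illustrative).
    import numpy as np

    def process_matrix(edges, tasks):
        idx = {t: i for i, t in enumerate(tasks)}
        m = np.zeros((len(tasks), len(tasks)))
        for src, dst in edges:
            m[idx[src], idx[dst]] = 1.0
        norm = np.linalg.norm(m)
        return m / norm if norm else m

    def workflow_distance(edges_a, edges_b):
        tasks = sorted({t for e in edges_a + edges_b for t in e})
        return np.linalg.norm(process_matrix(edges_a, tasks) -
                              process_matrix(edges_b, tasks))

    wf1 = [("receive_order", "check_stock"), ("check_stock", "ship")]
    wf2 = [("receive_order", "check_stock"), ("check_stock", "bill"),
           ("bill", "ship")]
    print(workflow_distance(wf1, wf2))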
Development of Distance Measures for Process Mining, Discovery, and Integration
"... Business processes continue to play an important role in today’s service-oriented enterprise computing systems. Mining, discovering, and integrating process-oriented services has attracted growing attention in the recent years. In this paper we present a quantitative approach to modeling and capturi ..."
Abstract
-
Cited by 8 (1 self)
Business processes continue to play an important role in today’s service-oriented enterprise computing systems. Mining, discovering, and integrating process-oriented services has attracted growing attention in recent years. In this paper we present a quantitative approach to modeling and capturing the similarity and dissimilarity between different process designs. We derive the similarity measures by analyzing the process dependency graphs of the participating workflow processes. We first convert each process dependency graph into a normalized process matrix. Then we calculate the metric space distance between the normalized matrices. This distance measure can be used as a quantitative and qualitative tool in process mining, process merging, and process clustering, and ultimately it can reduce or minimize the costs involved in the design, analysis, and evolution of workflow systems. Keywords: Business Process, Similarity, Process Mining
Collaborative document clustering
- In SDM, 2006
"... Document clustering has been traditionally studied as a centralized process. There are scenarios when centralized clustering does not serve the required purpose; e.g. documents spanning multiple digital libraries need not be clustered in one location, but rather clustered at each location, then enri ..."
Abstract
-
Cited by 7 (0 self)
Document clustering has traditionally been studied as a centralized process. There are scenarios where centralized clustering does not serve the required purpose; for example, documents spanning multiple digital libraries need not be clustered in one location, but can rather be clustered at each location and then enriched by receiving more information from other locations. A distributed collaborative approach for document clustering is proposed in this paper. The main objective is to allow peers in a network to form independent opinions of local document grouping, followed by an exchange of cluster summaries in the form of keyphrase vectors. The nodes then expand and enrich their local solution by receiving recommended documents from their peers, based on the peers' judgement of the similarity of local documents to the exchanged cluster summaries. Results show improvement in the final clustering after merging peer recommendations. The approach allows independent nodes to achieve better local clustering by having access to distributed data without the cost of centralized clustering, while maintaining the initial local clustering structure and coherency.
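The exchange step lends itself to a simple sketch: each node summarizes its clusters as keyphrase weight vectors, and a peer recommends its own documents whose term vectors are close enough to a received summary. The similarity measure, threshold, and data below are placeholders rather than the paper's actual protocol.

    # Peer-side recommendation against a received cluster summary
    # (keyphrase vector); cosine similarity and threshold are illustrative.
    def cosine(a, b):
        dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) & set(b))
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def recommend(local_docs, cluster_summary, threshold=0.3):
        return [doc_id for doc_id, vec in local_docs.items()
                if cosine(vec, cluster_summary) >= threshold]

    local_docs = {
        "d1": {"suffix tree": 2.0, "activity": 3.0},
        "d2": {"workflow": 4.0, "distance": 2.0},
    }
    peer_summary = {"activity": 1.0, "suffix tree": 0.8}
    print(recommend(local_docs, peer_summary))   # ['d1']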
Process Mining by Measuring Process Block Similarity
"... Abstract. Mining, discovering, and integrating process-oriented services has attracted growing attention in the recent year. Workflow precedence graph and workflow block structures are two important factors for comparing and mining processes based on distance similarity measure. Some existing work h ..."
Abstract
-
Cited by 5 (0 self)
Mining, discovering, and integrating process-oriented services has attracted growing attention in recent years. Workflow precedence graphs and workflow block structures are two important factors for comparing and mining processes based on distance similarity measures. Some existing work has been done on comparing workflow designs based on their precedence graphs. However, standard distance metrics are lacking for comparing workflows that contain complex block structures such as parallel-OR and parallel-AND. In this paper we present a quantitative approach to modeling and capturing the similarity and dissimilarity between different workflow designs, focusing on the similarity and dissimilarity between their block structures. We derive the distance-based similarity measures by analyzing the workflow block structure of the participating workflow processes in four consecutive phases. We first convert each workflow dependency graph into a block tree by using our block detection algorithm. Second, we transform the block tree into a binary tree to provide a normalized reference structure for distance-based similarity analysis. Third, we construct a binary branch vector by encoding the binary tree. Finally, we calculate the distance metric between two binary branch vectors. Our initial experience shows that this distance measure can be used as a quantitative and qualitative tool for understanding and detecting block structure similarity and dissimilarity between two workflow designs. It can be effectively combined with a workflow-precedence-based similarity analysis tool in process mining, process merging, and process clustering, and ultimately it can reduce or minimize the costs involved in the design, analysis, and evolution of workflow systems.
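The last two phases invite a small sketch: encode a (binary) block tree by counts of its parent/left-child/right-child branches and compare two trees by an L1 distance over those counts. The tree encoding and the exact distance in the paper may differ; this only illustrates the vector-comparison step, with made-up block labels.

    # Binary-branch count vectors and an L1 distance between them.
    from collections import Counter

    def binary_branches(tree):
        """tree = (label, left_subtree_or_None, right_subtree_or_None)."""
        branches = Counter()
        def walk(node):
            if node is None:
                return "#"
            label, left, right = node
            branches[(label, walk(left), walk(right))] += 1
            return label
        walk(tree)
        return branches

    def branch_distance(tree_a, tree_b):
        va, vb = binary_branches(tree_a), binary_branches(tree_b)
        return sum(abs(va[k] - vb[k]) for k in set(va) | set(vb))

    block_a = ("AND", ("task1", None, None), ("task2", None, None))
    block_b = ("OR", ("task1", None, None), ("task2", None, None))
    print(branch_distance(block_a, block_b))   # prints 2: only the root branch differs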
A new content-based model for social network analysis
- In IEEE International Conference on Semantic Computing, 2008
"... ..."
(Show Context)
The impact of phrases in document clustering for Swedish
- In Proc. 15th Nordic Conf. on Comp. Ling. – NODALIDA ’05, 2005. URL http://www.nada.kth.se/~rosell/publications/papers/rosellvelupillai05.pdf
"... We have investigated the impact of using phrases in the vector space model for clustering documents in Swedish in different ways. The investigation is carried out on two text sets from different domains: one set of newspaper articles and one set of medical papers. The use of phrases do not improve r ..."
Abstract
-
Cited by 5 (3 self)
We have investigated the impact of using phrases in the vector space model for clustering documents in Swedish in different ways. The investigation is carried out on two text sets from different domains: one set of newspaper articles and one set of medical papers. The use of phrases does not improve results relative to the ordinary use of words. The results differ significantly between the text types. This indicates that one could benefit from different text representations for different domains, although a fundamentally different approach would probably be needed.
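A small experiment in the spirit of this comparison can be sketched with scikit-learn: cluster the same documents once with word features and once with word-plus-bigram ("phrase") features and compare the groupings. The toy corpus, cluster count, and vectorizer settings below are placeholders, not the paper's setup.

    # Words-only vs. words+bigram features for clustering (toy comparison).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "patienten fick penicillin mot infektionen",
        "ny medicin mot infektioner testas",
        "regeringen presenterade en ny budget",
        "riksdagen debatterade budgeten i dag",
    ]

    for ngrams in [(1, 1), (1, 2)]:   # unigrams only, then unigrams + bigrams
        X = TfidfVectorizer(ngram_range=ngrams).fit_transform(docs)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        print(ngrams, labels)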