## A Comparative Analysis of Process Instance Cluster Techniques

### BibTeX

@MISC{Thaler_acomparative,

author = {Tom Thaler and Simon Ternis and Peter Fettke and Peter Loos},

title = {A Comparative Analysis of Process Instance Cluster Techniques},

year = {}

}

### Abstract

The application of process mining and analysis techniques to the process logs of information systems often leads to highly complex results, e.g. in terms of a high number of elements in the mined model. Thus, clustering the corresponding log files is mandatory for an expedient analysis. Against that background, many cluster techniques have been developed in recent years, but it remains unclear how well they perform in particular application scenarios. Therefore, the paper at hand aims at analyzing and comparing the capabilities of existing cluster techniques with regard to different objectives. As a result, it is shown that some techniques are more suitable for handling particular scenarios than others and that there are also general challenges in their application, which should be addressed in future work.

Keywords: Process Clustering, Process Mining, Business Process Management

#### Introduction

The execution of business processes often causes unexpected dynamics, e.g. in terms of their behavior depending on a variety of parameters. At the same time, upcoming legal regulations as well as industry or company standards make it necessary to consistently check the process behavior against different demands. Moreover, eliciting and analyzing business processes that are not yet captured as models is of major importance for today's companies. Process-supporting business software, like ERP and workflow systems, generally produces process execution logs, which serve as a basis for such inquiries. However, analyzing these logs with e.g. process mining techniques is challenging, as they may contain very heterogeneous execution information on many different processes and process variants.
Thus, deriving process models based on the raw logs often leads to models of high complexity in terms of the number of contained elements.

Hence, the paper at hand aims at filling that research gap by analyzing the capabilities of existing techniques with regard to different objectives, whereby both theoretical analytical and practical empirical aspects are considered. Thus, the goal is not the evaluation of particular techniques in the context of sample experiments (which is partially carried out in the papers describing the algorithms and techniques) but their fundamental characterization, which is rooted in the established literature, as well as a comparative analysis of their capabilities in realistic contexts.

Within the theoretical analytical investigation, existing process instance cluster techniques are characterized by process mining specific and cluster theoretical aspects. In the practical empirical analysis, two areas of major interest are identified in the literature and serve as a basis for the design of two real-life application scenarios. The first scenario analyzes the capabilities of existing techniques to separate a process log with regard to different processes, while the second scenario aims at reducing the complexity of the mined models and, thus, at improving their understandability.

To obtain a framework for the theoretical analytical investigation, a morphological box is developed in Section 2. Afterwards, in Section 3, the relevant literature providing corresponding techniques is identified and characterized using that morphological box. In Section 4, a selection of the cluster techniques is then analyzed in the practical empirical part of the work. The limitations of the analysis are discussed in Section 5, while Section 6 provides some concluding recommendations on the usage of particular techniques and their further development.
#### Morphological Box Describing Process Instance Cluster Techniques

##### Preliminary Note

Developing a morphological box describing process instance cluster techniques requires the identification of relevant aspects within cluster theory in general and in the field of process mining in particular. The development of a cluster technique is generally motivated by a concrete objective. Since the objective essentially affects the design of a particular technique, it is of high interest in both the field of process mining and the comparative analysis at hand. In addition to that, the representation of traces is also important in that context, as it affects the choice of a particular cluster method. The distance measurement and the cluster approach with its specific characteristics are two important cluster-theoretical aspects, between which the basic literature (e.g. [14]) generally distinguishes. With regard to the practical empirical analysis, the availability of an implementation of a technique is obligatory, too. Thus, these aspects are considered in the following as well.

##### Description of Characteristics

Objective. As mentioned in Section 1, one objective is the differentiation of business processes. Another objective is to improve the understandability of the mined model(s). Here, a distinction is made between the reduction of complexity in terms of the number of elements in general and a decomposition of the resulting models in particular.

Representation of Traces. Generally, a trace can be represented in an abstract and in a concrete manner. The abstract representation provides a mathematically abstract view on a trace by using the vector space model. Different properties or characteristics describing a trace from a particular point of view are transformed to numerical values and serve as the elements of a vector.
The authors of [21] suggest some corresponding vector profiles including the control flow (activity profile and transition profile), organizational aspects (originator profile), specific case data (case attributes profile and event attributes profile) and performance aspects such as the size of a trace and the execution durations. The most common features are presented in

In contrast to that, the concrete representation provides a linguistically exact view on the traces based on node labels without any transformations. This view is often used for the description of examples in the traditional process mining literature (e.g. traces ABC, ACD, ACE). Corresponding distance measures work solely on the recorded activity sequences.

Distance Measure. A distance measure yields a numerical value that quantifies the distance between two objects. The value starts at 0 (the objects are equal) and, in most cases, has no general upper bound (the objects are completely different). However, there are also normalized distance measures, for which an upper bound exists. Generally, distances are symmetric: the distance between the objects i and j is equal to the distance between j and i. The distance measure used depends on the representation of a trace. In case of an abstract trace representation, the most common distance measures are the Euclidean, the Hamming and the Jaccard distance, but the correlation between vectors and the cosine distance are relevant in some approaches as well. In case of a concrete trace representation, different kinds of edit distances are used to measure the distance between two strings. Thus, the number of edit operations (insert, delete, move) needed for the transformation of one string into another is calculated, whereby it is possible to use fixed or dynamic costs.

Cluster Approach. In order to divide a multiset of process instances into different groups, the basic idea is to determine the distance between all elements of the multiset and to put elements with low distances between each other in one group.
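As an illustration of the two trace representations and the distance measures named above, the following minimal Python sketch builds an activity profile for the example traces ABC, ACD and ACE and compares them. All function names are hypothetical and not taken from any of the surveyed implementations. The activity profile here simply counts activity occurrences, and the edit distance is the standard Levenshtein distance (insert, delete, substitute) with fixed unit costs, which is an assumption standing in for the insert/delete/move variants mentioned in the text.

```python
import math
from collections import Counter


def activity_profile(trace, alphabet):
    """Abstract representation: occurrence count of each activity in the trace."""
    counts = Counter(trace)
    return [counts[a] for a in alphabet]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hamming(u, v):
    """Number of vector positions in which u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def jaccard_distance(u, v):
    """1 - |intersection| / |union| over the sets of activities that occur."""
    su = {i for i, a in enumerate(u) if a > 0}
    sv = {i for i, b in enumerate(v) if b > 0}
    union = su | sv
    if not union:
        return 0.0  # two empty traces are treated as equal
    return 1.0 - len(su & sv) / len(union)


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        # Convention chosen for this sketch: zero vectors are only close to each other.
        return 0.0 if list(u) == list(v) else 1.0
    return 1.0 - dot / norm


def levenshtein(s, t):
    """Concrete representation: edit distance on the activity sequences (unit costs)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # delete from s
                            curr[j - 1] + 1,     # insert into s
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


traces = ["ABC", "ACD", "ACE"]
alphabet = sorted(set("".join(traces)))
profiles = {t: activity_profile(t, alphabet) for t in traces}

for a, b in [("ABC", "ACD"), ("ACD", "ACE")]:
    u, v = profiles[a], profiles[b]
    print(a, b, euclidean(u, v), hamming(u, v),
          jaccard_distance(u, v), cosine_distance(u, v), levenshtein(a, b))
```

Note that the vector-based measures ignore the order of activities once a trace is reduced to an activity profile, whereas the edit distance works on the recorded sequence itself.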
The resulting groups should have a high inner density (low distance between all elements), while the distance between the produced groups should be high. A variety of corresponding cluster approaches exists, which are generally divided into three categories: hierarchical, partitioning and density-based approaches.

An important property of a cluster algorithm is the handling of the number of clusters which should be produced. Three different characteristics can be distinguished, namely (1) the number of resulting clusters must be provided, (2) the algorithm automatically determines the number of clusters or (3) the maximal number of clusters must be provided as an upper bound. Another important property is the type of the cluster membership. Hard cluster algorithms allocate each input object to exactly one cluster, while fuzzy algorithms allow an allocation to multiple clusters at the same time. With regard to particular cluster objectives, it might be meaningful to integrate an external validation directly into the cluster approach. E.g. in the context of process mining, the cluster approach of

In addition to the three mentioned cluster categories, other approaches, especially neural networks, are used in the context of clustering process instance data as well.

Implementation. Generally, the implementation of a cluster approach is of high interest, as it is the only means that allows the application of particular techniques in the context of an evaluation or a real-world scenario. Basically, one may distinguish between whether an approach is implemented or not. Depending on the evaluation objectives and on the general parameters, it may also be important to know the attributes of an implementation, e.g. being publicly available, open or closed source, or distributed in a free or a commercial manner.
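The cluster approach characteristics described above can be made concrete with a small hierarchical (agglomerative, single-linkage) sketch in Python. It is only an illustration under stated assumptions, not the procedure of any surveyed technique: the function names, the sample traces and the use of a unit-cost Levenshtein distance are all hypothetical choices. The sketch produces hard clusters and covers two of the three ways of handling the number of clusters: a fixed number supplied by the user (`num_clusters`) and a number that follows automatically from a distance threshold (`max_distance`).

```python
def levenshtein(s, t):
    """Unit-cost edit distance between two activity sequences."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def single_linkage(items, dist, num_clusters=None, max_distance=None):
    """Agglomerative clustering with hard membership.

    Merging stops when `num_clusters` clusters remain (if given) or when the
    closest pair of clusters is farther apart than `max_distance` (if given).
    With neither argument, everything ends up in one cluster.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        if num_clusters is not None and len(clusters) <= num_clusters:
            break
        best = None  # (distance, index_i, index_j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if max_distance is not None and d > max_distance:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical log: variants of two different processes.
log = ["ABC", "ABD", "ABCD", "XYZ", "XYW"]

by_count = single_linkage(log, levenshtein, num_clusters=2)
by_threshold = single_linkage(log, levenshtein, max_distance=1)
print(by_count)
print(by_threshold)
```

For this sample log both stopping rules separate the A-variants from the X-variants, because traces within a group are one edit apart while traces from different groups are at least three edits apart. On real logs the two rules will generally disagree, which is one reason the handling of the cluster count is treated as a distinguishing characteristic.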
Based on the description of the characteristics above, the morphological box presented in

#### Selection of Process Instance Cluster Techniques

In order to identify the relevant literature, the databases Springer, ACM, IEEE Xplore, Ebsco, ISI Web of Science, ScienceDirect, Scopus and Google Scholar were searched for the terms: "trace clustering" AND "process", "sequence clustering" AND "process", "clustering" AND "process instances", "clustering" AND "process mining", "clustering" AND "BPM", "clustering" AND "log data", "clustering" AND "log files". No further restrictions, such as a time limit, were applied. Moreover, a backward search was conducted on known journal articles and conference proceedings. The identified 427 articles were selected according to whether or not they consider the clustering of business process instance data. In case of more than one article developing a clustering approach (e.g. because of improvements or further developments), generally the newest article was taken into account.

Overall, 20 approaches were identified, which are now characterized using the developed morphological box. 70% of the approaches name improving the understandability of the resulting models in general as the prime objective, and 55% focus on the reduction of the complexity of the resulting model(s) in particular. The identification of different processes and process variants is the prime objective of 45% of the articles. Furthermore, 75% of all approaches use an abstract trace representation, whereby the most frequently used distance measure is the Euclidean distance. Considering the cluster category, 60% of the approaches use a partitioning algorithm, while hierarchical clustering is applied in 40% (multiple assignments are possible, as some approaches provide the possibility to switch between different cluster algorithms). Furthermore, 45% of the approaches do not require an initial setting of the number of resulting clusters, whereas 40% do.
Only 5% allow the setting of a maximal number of resulting clusters. Moreover, 2 of the 20 approaches work with fuzzy clusters. The characterization of the particular process instance cluster techniques is presented in

Within the analyzed papers, 10 different implementations were explicitly named: ProM 5 - DWS Mining & Analysis