## Active Learning via Sequential Design with Applications to Detection of Money Laundering

### BibTeX

```bibtex
@MISC{Deng_activelearning,
  author = {Xinwei Deng and V. Roshan Joseph and Agus Sudjianto and C. F. Jeff Wu},
  title  = {Active Learning via Sequential Design with Applications to Detection of Money Laundering},
  year   = {}
}
```

### Abstract

Money laundering is the process of concealing the true origin of funds derived from illegal activities. Because it often involves criminal activity, financial institutions have a responsibility to detect suspected laundering and report it to the appropriate government agencies in a timely manner. Detecting money laundering is not easy, however, because of the huge number of transactions that take place each day. The usual approach adopted by financial institutions is to extract summary statistics from the transaction history and then conduct a thorough, time-consuming investigation of suspicious accounts. In this article, we propose an active learning via sequential design method for prioritization, to improve the process of money laundering detection. The method uses a combination of stochastic approximation and D-optimal designs to judiciously select the accounts for investigation. The sequential nature of the method helps determine the optimal prioritization criterion with minimal time and effort. A case study with real banking data demonstrates the performance of the proposed method, and a simulation study shows its efficiency, accuracy, and robustness to model assumptions.

### Citations

8972 | The Nature of Statistical Learning Theory - Vapnik - 1995 |

2028 | Online learning with kernels - Kivinen, Smola, et al. - 2004 |
Citation Context: ...near model as in Section 5, i.e., by taking z = ∑_{i=1}^p w_i x_i^{α_i}, where α_i ≥ 0 for all i = 1, . . . , p. Another strategy for incorporating the nonlinearity is to consider the so-called “kernel trick” (Schölkopf and Smola, 2002) on the synthetic variable z for the logit model in (4.1), i.e., z can be expressed as an inner product in the reproducing kernel Hilbert space (Wahba, 1990). Generalizing the active learning criterion... |

1272 | Spline Models for Observational Data - Wahba - 1990 |
Citation Context: ... the so-called “kernel trick” (Schölkopf and Smola, 2002) on the synthetic variable z for the logit model in (4.1), i.e., z can be expressed as an inner product in the reproducing kernel Hilbert space (Wahba, 1990). Generalizing the active learning criterion (4.5) for the nonlinear threshold surface is an interesting topic for future research. The proposed active learning via sequential design is flexible in e... |

528 | Active learning with statistical models - Cohn, Ghahramani, et al. - 1996 |
Citation Context: ...ld hyperplane. Thus, the goal is to determine an optimal threshold hyperplane for prioritization with a minimum number of investigated accounts. This calls for the use of active learning techniques (Mackay, 1992; Cohn et al., 1996; Fukumizu, 2000) from machine learning. Here, the learner actively selects data points to be added into the training set. To minimize the number of investigated accounts and use them to const... |

507 | Support vector machine active learning with applications to text classification - Tong, Koller - 2002 |
Citation Context: ... accounts for investigation. This calls for the use of active learning in machine learning. Recently, active learning methods using support vector machines (SVM) were developed by several researchers (Tong and Koller, 2001; Schohn and Cohn, 2000; Campbell et al., 2000), which can be applied to the present problem. For binary response, active learning with SVM is mainly for two-class classification. The decision boundar... |

473 | A sequential algorithm for training text classifiers - Lewis, Gale - 1994 |
Citation Context: ...e approach to account for the multivariate nature of the data. The methodology is explained in the next section. 4 Methodology 4.1 Active Learning via Sequential Design In pool-based active learning (Lewis and Gale, 1994), there is a pool of unlabelled data. The learner has access to the pool and can request the true label for a certain number of data points in the pool. The main issue is in finding a way to choose the next... |
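
The pool-based setting described in this snippet can be sketched in a few lines. This is a minimal illustration only: the gradient-ascent logistic fit, the uncertainty-sampling query rule, and the noiseless toy oracle are all assumptions for the sketch, not the paper's actual Bayesian sequential-design procedure.

```python
import numpy as np

def fit_logit(X, y, iters=200, lr=0.2):
    """Fit a logistic model by gradient ascent (illustrative stand-in for a proper fit)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

def query_next(pool, unlabelled, w):
    """Uncertainty sampling: request the label of the point with P(Y=1|x) closest to 0.5."""
    p = 1.0 / (1.0 + np.exp(-pool[unlabelled] @ w))
    return unlabelled[int(np.argmin(np.abs(p - 0.5)))]

rng = np.random.default_rng(0)
# Toy pool of 200 "accounts": intercept plus one summary statistic; true threshold at x = 0.
pool = np.column_stack([np.ones(200), rng.uniform(-3.0, 3.0, 200)])
label = lambda rows: (rows[:, 1] > 0).astype(float)  # noiseless oracle, for simplicity

labelled = list(range(5))                      # small seed sample
unlabelled = [i for i in range(200) if i not in labelled]
for _ in range(20):                            # sequentially query 20 more labels
    w = fit_logit(pool[labelled], label(pool[labelled]))
    j = query_next(pool, unlabelled, w)
    labelled.append(j)
    unlabelled.remove(j)
```

The loop is the essence of pool-based active learning: refit on the labelled set, then spend the next label on the point the current model is least sure about.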

354 | Theory of optimal experiments - Fedorov - 1972 |
Citation Context: ...mal design approach to sequential designs, first a parametric model for the unknown function is postulated. Then, the x points are chosen sequentially based on some optimality criteria (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993). For example, Neyer (1994) proposed a sequential D-optimality-based design. Here x_{n+1} is chosen so that the determinant of the estimated Fisher information is maximized. It is well... |
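
The sequential D-optimality step this snippet describes, choosing x_{n+1} to maximize the determinant of the estimated Fisher information, can be sketched for a logistic model. The toy parameter estimate, labelled points, and candidate grid below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fisher_info(X, w):
    """Estimated Fisher information of a logistic model: sum_i p_i(1-p_i) x_i x_i^T."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (X * (p * (1.0 - p))[:, None]).T @ X

def d_optimal_next(X_labelled, w_hat, candidates):
    """Pick the candidate x_{n+1} maximizing det of the updated information matrix."""
    base = fisher_info(X_labelled, w_hat)
    best, best_det = None, -np.inf
    for x in candidates:
        p = 1.0 / (1.0 + np.exp(-x @ w_hat))
        det = np.linalg.det(base + p * (1.0 - p) * np.outer(x, x))
        if det > best_det:
            best, best_det = x, det
    return best

# Toy setup: intercept + one covariate, current parameter estimate w_hat.
w_hat = np.array([0.0, 1.0])
X_lab = np.array([[1.0, -1.0], [1.0, 1.0]])
cands = np.column_stack([np.ones(81), np.linspace(-4.0, 4.0, 81)])
x_next = d_optimal_next(X_lab, w_hat, cands)
```

Note the model dependence the snippet warns about: the information weights p(1-p) come entirely from the assumed logistic form and the current estimate w_hat.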

324 | Information-based objective functions for active data selection - MacKay - 1992 |

204 | Less is more: Active learning with support vector machines - Schohn, Cohn - 2000 |
Citation Context: ...lane, we need to judiciously select the accounts for investigation. Recently, active learning methods using support vector machines (SVM) were developed by several researchers (Tong and Koller, 2001; Schohn and Cohn, 2000; Campbell et al., 2000), which can be applied to the present problem. For binary response, active learning with SVM is mainly for two-class classification. The decision boundary in SVM implements the... |

166 | Optimal Design of Experiments - Pukelsheim - 1993 |
Citation Context: ...oach to sequential designs, first a parametric model for the unknown function is postulated. Then, the x points are chosen sequentially based on some optimality criteria (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993). For example, Neyer (1994) proposed a sequential D-optimality-based design. Here x_{n+1} is chosen so that the determinant of the estimated Fisher information is maximized. It is well known that a D-opt... |

123 | Query learning with large margin classifiers - Campbell, Cristianini, et al. - 2000 |
Citation Context: ...ously select the accounts for investigation. Recently, active learning methods using support vector machines (SVM) were developed by several researchers (Tong and Koller, 2001; Schohn and Cohn, 2000; Campbell et al., 2000), which can be applied to the present problem. For binary response, active learning with SVM is mainly for two-class classification. The decision boundary in SVM implements the Bayes rule P(Y|x) = ... |

84 | On the Existence of Maximum Likelihood Estimates in Logistic Regression Models - Albert, Anderson - 1984 |
Citation Context: ... θ. Suppose the labelled data are (x_1, Y_1), (x_2, Y_2), . . ., (x_n, Y_n). It is known that the existence and uniqueness of the MLE can be achieved only when successes and failures overlap (Silvapulle, 1981; Albert and Anderson, 1984; Santner and Duffy, 1986). However, even when we are able to compute the MLE, it may suffer from low accuracy due to the small sample size, especially for nonlinear models. Use of a Bayesian approa... |
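
The overlap condition this snippet cites is easy to illustrate in one dimension: if every success lies strictly to one side of every failure (complete separation), the logistic likelihood increases without bound and no finite MLE exists. A minimal check, ignoring the subtler quasi-complete-separation case with boundary ties:

```python
import numpy as np

def overlaps_1d(x, y):
    """True if successes (y=1) and failures (y=0) overlap on the line, so the
    1-D logit MLE exists; False under complete separation, where the MLE diverges."""
    x1, x0 = x[y == 1], x[y == 0]
    if len(x1) == 0 or len(x0) == 0:
        return False
    return not (x1.min() > x0.max() or x1.max() < x0.min())

x = np.array([-2.0, -1.0, 1.0, 2.0])

# Completely separated: every success lies above every failure -> no finite MLE.
y = np.array([0, 0, 1, 1])
print(overlaps_1d(x, y))   # False

# Interleaved successes and failures -> the MLE exists.
y2 = np.array([0, 1, 0, 1])
print(overlaps_1d(x, y2))  # True
```

This is why small active-learning samples are fragile for maximum likelihood, and why the snippet's Bayesian alternative is attractive: a prior keeps the estimate finite even before overlap occurs.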

69 | Minimax and maximin distance designs - Johnson, Moore, et al. - 1990 |

64 | Optimal Design - Silvey - 1980 |
Citation Context: ...x_{n+1} is chosen so that the determinant of the estimated Fisher information is maximized. It is well known that a D-optimal criterion minimizes the volume of the confidence ellipsoid of the parameters (Silvey, 1980). The root is solved from the final estimate of the function F(x). The performance of the optimal design approach is model dependent. It performs best when the assumed model is the true model, but t... |

51 | Optimum experimental designs - Kiefer - 1959 |
Citation Context: ...e. In the optimal design approach to sequential designs, first a parametric model for the unknown function is postulated. Then, the x points are chosen sequentially based on some optimality criteria (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993). For example, Neyer (1994) proposed a sequential D-optimality-based design. Here x_{n+1} is chosen so that the determinant of the estimated Fisher information is maximi... |

47 | Active learning in multilayer perceptrons - Fukumizu - 1996 |
Citation Context: ..., the goal is to determine an optimal threshold hyperplane for prioritization with a minimum number of investigated accounts. This calls for the use of active learning techniques (Mackay, 1992; Cohn et al., 1996; Fukumizu, 2000) from machine learning. Here, the learner actively selects data points to be added into the training set. To minimize the number of investigated accounts and use them to construct the thresho... |

15 | Efficient Sequential Designs with Binary Data - Wu - 1985 |
Citation Context: ...s Bayesian approach allows us to implement a fully sequential procedure, i.e., the proposed active learning method can begin from n = 1. This would not have been possible with a frequentist approach (Wu, 1985), for which some initial sample is necessary before the active learning method can be called. 5 Case Study We applied the proposed method to some real transaction data from a financial institution. T... |

13 | On the existence of maximum likelihood estimates for the binomial response models - Silvapulle - 1981 |
Citation Context: ... for the parameter θ. Suppose the labelled data are (x_1, Y_1), (x_2, Y_2), . . ., (x_n, Y_n). It is known that the existence and uniqueness of the MLE can be achieved only when successes and failures overlap (Silvapulle, 1981; Albert and Anderson, 1984; Santner and Duffy, 1986). However, even when we are able to compute the MLE, it may suffer from low accuracy due to the small sample size, especially for nonlinear model... |

12 | Kinematics Theory - Wu - 1995 |

6 | Support Vector Machines for Classification - Lin, Lee, et al. - 2002 |

4 | Efficient Robbins-Monro Procedure for Binary Data - Joseph - 2004 |
Citation Context: ...see that the performance of the methods is better when α = .5. It is a well-known fact in the literature that the estimation of extreme quantiles is much more difficult than for α = .5 (see, e.g., Joseph 2004). It is also clear from the figures that the proposed method is quite robust to model assumptions. In the proposed active learning approach in (4.5), one selects k_0 candidate points which are closest... |
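
The quantile estimation this snippet refers to can be sketched with the plain Robbins-Monro recursion for binary responses (the classical scheme that efficient variants such as Joseph (2004) build on): x_{k+1} = x_k − (c/k)(y_k − α) converges to the α-quantile of the response curve P(Y=1|x). The step constant, starting point, and toy logistic response below are illustrative assumptions.

```python
import numpy as np

def robbins_monro_quantile(respond, x0, alpha, n, c):
    """Plain Robbins-Monro recursion x_{k+1} = x_k - (c/k)(y_k - alpha) for a
    binary response; it converges to the alpha-quantile of P(Y=1 | x)."""
    x = x0
    for k in range(1, n + 1):
        y = respond(x)          # observe a single 0/1 response at the current x
        x -= (c / k) * (y - alpha)
    return x

rng = np.random.default_rng(1)
# Toy response curve: logistic with median 0, so the alpha = .5 quantile is x = 0.
respond = lambda x: float(rng.random() < 1.0 / (1.0 + np.exp(-x)))
x_med = robbins_monro_quantile(respond, x0=3.0, alpha=0.5, n=2000, c=4.0)
```

For extreme α the correction term (y − α) is badly asymmetric, which is one way to see the snippet's point that extreme quantiles are much harder to estimate than the median.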

3 | Adaptive Designs for Stochastic RootFinding - Joseph, Tian, et al. - 2007 |

3 |
D-Optimality-Based Sensitivity
- Neyer
- 1994
(Show Context)
Citation Context ...hoose two x ′ s in each stratum according to the corresponding z value. The choice of the constant ±1.6 is based on the asymptotic optimality of the estimators under logistic distribution (see, e.g., =-=Neyer 1994-=-). The hyper-parameters were chosen as before. 100 simulations were generated for comparison. Logit Uniform Dist_PM 1 2 3 4 5 Dist_PM 1 2 3 4 5 10 15 20 25 30 35 n 10 15 20 25 30 35 n Normal Cauchy Di... |