## J3.9 MULTIPLE IMPUTATION THROUGH MACHINE LEARNING ALGORITHMS


### BibTeX

@MISC{Richman_j3.9multiple,
    author = {Michael B. Richman and Theodore B. Trafalis and Indra Adrianto},
    title = {J3.9 MULTIPLE IMPUTATION THROUGH MACHINE LEARNING ALGORITHMS},
    year = {}
}

### Abstract

A problem common to meteorological and climatological datasets is how to address missing data. Most multivariate analysis techniques require that all variables be represented for each observation; hence, some action is required when data are missing. Where individual observations are not thought to be important, it is common to delete every observation missing one or more pieces of data (complete case deletion). As the amount of missing data increases, tacit deletion can bias the first two statistical moments of the remaining data as population estimators and introduce inaccuracies in subsequent analyses. What is desired is a principled method that uses the information available in the remaining data to predict the missing values. Such techniques include substituting nearby data, interpolation, and linear regression using nearby sites as predictors. One class of techniques that uses the available information iteratively is known as multiple imputation. In this work, machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), are tested against standard imputation methods (e.g., multiple regression), simple regression, mean substitution, and casewise deletion. All methods are used to predict the known values of climatological data that have been altered to produce missing data. These data sets are on the order of 400 variables (data station sites) and a large number of observations. Both precipitation and air temperature data are used to provide a range of the inherent spatial coherence seen by analysts. The mean squared error (MSE) of the prediction and the mean absolute error (MAE) of the variance are presented to assess the efficacy of each technique. Results indicate that the non-iterative methods, such as casewise deletion and mean substitution, lead to the largest errors, whereas iterative imputation has considerably lower errors. Within the iterative techniques, SVMs are the most promising in reducing error.
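
The evaluation design described in the abstract (blank out known values, impute them, then score the estimates against the truth) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure: the two-station data are synthetic, ordinary least-squares regression stands in for the SVR and ANN learners, and every function name (`make_data`, `knock_out`, `mean_substitution`, `iterative_regression`, `mse`) is hypothetical.

```python
import random
from statistics import fmean


def make_data(n=200, seed=0):
    """Synthetic two-station series with a strong linear link (assumed data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.gauss(10.0, 3.0)
        b = 0.8 * a + rng.gauss(0.0, 1.0)
        rows.append([a, b])
    return rows


def knock_out(rows, frac=0.2, seed=1):
    """Copy the rows and blank a random fraction of the second column."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(rows)), int(frac * len(rows))))
    holed = [r[:] for r in rows]
    for i in idx:
        holed[i][1] = None
    return holed, idx


def mean_substitution(holed):
    """Fill every gap with the mean of the observed values (non-iterative)."""
    m = fmean([r[1] for r in holed if r[1] is not None])
    return [[r[0], m if r[1] is None else r[1]] for r in holed]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def iterative_regression(holed, n_iter=10):
    """Start from a mean fill, then repeatedly refit and re-predict the gaps."""
    filled = mean_substitution(holed)
    missing = [i for i, r in enumerate(holed) if r[1] is None]
    for _ in range(n_iter):
        slope, intercept = fit_line([r[0] for r in filled],
                                    [r[1] for r in filled])
        for i in missing:
            filled[i][1] = slope * filled[i][0] + intercept
    return filled


def mse(truth, filled, idx):
    """Mean squared error over the positions that were blanked."""
    return fmean([(truth[i][1] - filled[i][1]) ** 2 for i in idx])


if __name__ == "__main__":
    truth = make_data()
    holed, idx = knock_out(truth)
    print("mean substitution MSE:", mse(truth, mean_substitution(holed), idx))
    print("iterative regression MSE:", mse(truth, iterative_regression(holed), idx))
```

Because the second series is built to depend on the first, the iterative fill should recover the blanked values more closely than the mean fill, which mirrors the qualitative ordering reported in the abstract. Note that this loop produces a single iterated estimate; a full multiple-imputation scheme would also draw several plausible fills per gap.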

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...rmulation (Fig. 1). The linear ε-insensitive loss function is defined by:

$$L_\varepsilon(x, y, f) = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases} \qquad (1)$$

The SVR formulation can be represented as follows (Vapnik, 1998):

$$\min \; \phi(w, \xi, \xi') = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i')$$

subject to

$$(w \cdot x_i + b) - y_i \le \varepsilon + \xi_i, \quad y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i', \quad \xi_i, \xi_i' \ge 0, \quad i = 1, \dots, l,$$

where $w$ is the weight vector, $b$ is a bi... |
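
As a small worked illustration of the ε-insensitive loss and the SVR objective quoted in this context, the sketch below evaluates both for a fixed linear model. It is an assumption-laden example rather than the paper's code: the function names are hypothetical, and the slack terms are taken at their smallest feasible values, so that ξ_i + ξ_i′ equals the ε-insensitive loss of point i.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the epsilon tube, linear in the excess residual outside it."""
    residual = abs(y - f_x)
    return 0.0 if residual <= eps else residual - eps


def svr_primal_objective(w, b, X, y, eps, C):
    """Evaluate 0.5 * ||w||^2 + C * sum of slacks for a fixed linear model.

    Assumes each pair of slack variables takes its smallest feasible value,
    which equals the epsilon-insensitive loss of that training point.
    """
    reg = 0.5 * sum(wj * wj for wj in w)
    slack = 0.0
    for xi, yi in zip(X, y):
        f_xi = sum(wj * xj for wj, xj in zip(w, xi)) + b
        slack += eps_insensitive_loss(yi, f_xi, eps)
    return reg + C * slack


if __name__ == "__main__":
    # First point lies inside the tube; second is 3.0 away with eps = 0.5.
    print(svr_primal_objective([1.0, 2.0], 0.0,
                               [[1.0, 0.0], [0.0, 1.0]], [1.0, 5.0],
                               eps=0.5, C=2.0))
```

Residuals smaller than ε cost nothing, which is why only points on or outside the tube contribute to the slack term.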

3629 | Neural Networks: A Comprehensive Foundation (2nd ed.)
- Haykin
- 1999
Citation Context ...rward ANNs (Fig. 3). The network consists of a set of information-processing units called neurons that constitute an input layer, one or more hidden layers, and an output layer of computational nodes (Haykin, 1999). The formulation of feedforward ANNs is well explained by Haykin (1999). [Figure 3. A feedforward neural network with one hidden layer and one output layer.] ... |

3436 | LIBSVM: A Library for Support Vector Machines, 2001. Software available at www.csie.ntu.edu.tw/~cjlin/libsvm - Chang, Lin |

1291 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
Citation Context ...tion, simple linear regression, and stepwise multiple regression, are employed for comparison. The SVM algorithm was initially developed by Vapnik and has become a favored method in machine learning (Boser et al., 1992). The version of SVMs for regression, called support vector regression (SVR), is used in this study. Trafalis et al. (2003) applied SVR to the prediction of rainfall from WSR-88D radar and showed that SVR... |

54 | Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values - Schneider |

31 | Testing the fidelity of methods used in proxy-based reconstructions of past climate - Mann, Rutherford, et al. - 2005 |

17 | Discussion on multiple imputation
- Rubin
- 2003
Citation Context ...rence seen by analysts as the former is known to have a small correlation scale whereas the latter has a much larger spatial scale). Many research studies have investigated multiple imputation (e.g., Rubin, 1988; Wayman, 2003). Rubin (1988) showed remarkable improvements when using multiple imputation rather than single imputation. Wayman (2003) discussed some missing data issues and explained the basic ... |

5 | Climatic pattern analysis of three- and seven-day summer rainfall in the central United States: Some methodological considerations and a regionalization
- Richman, Lamb
- 1985
Citation Context ...n Sections 3 and 4. The results are summarized in Section 5 and conclusions are presented in Section 6. 2. DATA SETS Two data sets are used in this study, both based on the Lamb/Richman climate datasets (Richman and Lamb, 1985). The first is the monthly precipitation data set, in which values are reported in inches (to the hundredth of an inch). This data set consists of 528 monthly observations (1949 – ... |

2 | Effects of missing data on estimates of monthly mean general circulation statistics
- Kidson, Trenberth
- 1988
Citation Context .... INTRODUCTION How to address missing data in meteorological and climatological datasets is an issue most researchers face. The decisions made can have a profound impact on subsequent analyses (e.g., Kidson and Trenberth, 1988 and Duffy et al. 2001 summarize the importance of *Corresponding author address: Michael B. Richman, University of Oklahoma, School of Meteorology, 120 David L. Boren Blvd, Norman, OK 73072; Email: m... |

2 | Prediction of Rainfall from WSR-88D Radar Using Kernel-based Methods - Trafalis, Santosa, et al. - 2003 |

1 | Effect of missing data on estimates of near-surface temperature change since 1900 - Duffy, Doutriaux, et al. - 2001 |

1 | The impact of four missing data techniques on validity estimates in human resource management - Roth, Campion, et al. - 1996 |

1 | Survey Research Methods Section of the American Statistical Association, 79Rutherford - Mann, Osborn, et al. |