
Missing-Data-Imputation-Methods-Performance-Comparison

The data imputation methods MissForest, GAIN, MICE, MICE-NN and MIWAE are tested on two UCI datasets (Sensorless Drive Diagnosis Data Set, Page Blocks Classification Data Set). MICE-NN is a modified version of MICE in which the linear regression models are replaced by fully connected neural networks. The tests take the complete dataset (without missing values), introduce either MAR or MCAR missingness at the desired missing rate, and then apply the imputation methods to fill in the missing values. Since the true values are known, the real MSE can be computed. To test other datasets, save the dataset as a 2-dimensional numpy array in the folder data, with the features stored as "name_x" and the labels as "name_y", and set dataset = "name" when calling the imputation method (see the sketch below).
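The exact on-disk format expected by the loader is not spelled out here, so the following is only a minimal sketch: it assumes the loader reads `.npy` files named `<name>_x` and `<name>_y` from the `data` folder (check `load_data` in `utils.py` for the actual convention). The dataset name `wine` and the use of `sklearn.datasets.load_wine` are purely illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine  # illustrative stand-in for a real dataset

data = load_wine()
x = data.data.astype(np.float64)  # 2-dim numpy array of features
y = data.target                   # corresponding labels

# Assumed naming convention: "<name>_x" for features, "<name>_y" for labels
np.save("data/wine_x.npy", x)
np.save("data/wine_y.npy", y)

# Afterwards the imputation methods can be called with dataset = "wine".
```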

MCAR missing values are introduced by dropping each value in the data independently with probability "p_miss". MAR missing values are introduced by summing over one third of each observation and dropping each value in the rest of that observation independently with a probability proportional to the computed sum; the variable "para" controls this (for details see load_data in utils.py). A sketch of both mechanisms is given below.
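As a rough illustration of the two mechanisms described above (the authoritative implementation is `load_data` in `utils.py`), the sketch below drops values at random; the normalisation of the MAR drop probability via `para` is an assumption made here for illustration only.

```python
import numpy as np

def mcar_mask(x, p_miss):
    """MCAR: drop every entry independently with probability p_miss (True = missing)."""
    return np.random.rand(*x.shape) < p_miss

def mar_mask(x, para):
    """MAR: sum over the first third of each row and drop entries in the remaining
    two thirds with a probability proportional to that sum (scaling is assumed)."""
    n, d = x.shape
    k = d // 3
    s = x[:, :k].sum(axis=1, keepdims=True)            # sum over one third of the row
    p = np.clip(para * s / np.abs(s).max(), 0.0, 1.0)  # assumed normalisation to [0, 1]
    mask = np.zeros((n, d), dtype=bool)
    mask[:, k:] = np.random.rand(n, d - k) < p         # first third stays observed
    return mask

# Example: 20% MCAR missingness, then the "real" MSE on the dropped entries
x = np.random.rand(100, 9)
mask = mcar_mask(x, p_miss=0.2)
x_imputed = np.where(mask, x.mean(axis=0), x)          # mean imputation as a placeholder
mse = ((x_imputed[mask] - x[mask]) ** 2).mean()
```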

Requirements

The code requires Python 3.6 or later. Required packages are:

  • fancyimpute >= 0.5.3
  • matplotlib >= 2.2.2
  • missingpy >= 0.2.0
  • numpy >= 1.16.2
  • pathlib >= 2.3.3
  • pickle
  • Pillow >= 5.4.1
  • pylab
  • scipy >= 1.2.1
  • scikit-learn
  • tensorflow >= 1.14
  • tensorflow_probability >=0.7.0
  • torch >= 1.0.1
  • torchvision >= 0.2.2
  • tqdm >= 4.31.1