MICE: Multiple Imputation by Chained Equations

1. Basic Overview

  1. Decide on the number of iterations (k) and create as many copies of the raw dataset.
  2. In each column, replace the missing values with an approximate value like the ‘mean’, based on the non-missing values in that column. This is a temporary replacement. At the end of this step, there should be no missing values.
  3. For the specific column you want to impute, e.g., column A alone, change the imputed values back to missing.
  4. Now, build a regression model to predict A using B and C as predictors. Only the rows where A is not missing are used to fit this model: A is the response, while B and C are the predictors. Use the fitted model to predict the missing values in A.
  5. Repeat steps 2-4 for columns B and C as well (a minimal sketch of this loop is shown below).
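
To make the loop concrete, here is a minimal sketch of the procedure above. It assumes a numeric-only pandas DataFrame and uses plain linear regression for each column, as in the traditional setup; the function name mice_impute and the fixed number of sweeps are illustrative choices, not part of any particular library.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def mice_impute(df: pd.DataFrame, n_iter: int = 5) -> pd.DataFrame:
    """Simplified MICE-style imputation on a numeric-only DataFrame."""
    data = df.copy()
    missing_mask = df.isna()            # remember which cells were originally missing
    data = data.fillna(data.mean())     # step 2: temporary fill with column means

    cols_with_missing = df.columns[missing_mask.any()]
    for _ in range(n_iter):             # repeat the chained sweeps
        for col in cols_with_missing:
            # step 3: set this column's originally-missing cells back to NaN
            data.loc[missing_mask[col], col] = np.nan
            predictors = data.columns.drop(col)

            # step 4: regress the column on the others, using observed rows only
            observed = data[col].notna()
            model = LinearRegression()
            model.fit(data.loc[observed, predictors], data.loc[observed, col])

            # fill the missing cells with the model's predictions
            data.loc[missing_mask[col], col] = model.predict(
                data.loc[missing_mask[col], predictors]
            )
    return data
```

Each sweep re-imputes every incomplete column in turn, so by the final iteration the fills for A, B, and C are based on each other's latest imputations rather than on the initial means.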

Check out this video that visualizes the process of MICE.

2. Disclaimers:

  1. In a real run, we do not entirely discard the imputed data from the previous iteration; some analysis and pooling are performed on it. However, for now, it is enough to know that this is a chained process.

  2. The most traditional approach is to use linear regression for the predictions. However, the exact methodology doesn’t matter as long as the predictions are done properly. There are more advanced libraries, such as MiceForest, which uses LightGBM for the sequential predictions (see the sketch after this list).

    [Image: MICE.jpeg]

  3. According to this paper, MICE showed the best performance in general, even exceeding that of deep-learning imputation models.
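
As one illustration of swapping the per-column model, scikit-learn's experimental IterativeImputer follows the same chained-equations idea and accepts any regressor as its estimator. The sketch below, with BayesianRidge and a random-forest regressor as example choices, is a hedged illustration of that idea, not the MiceForest API itself.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the API)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

# toy matrix with a few missing entries
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [4.0, 8.0, 12.0]])

# classic setup: a linear (Bayesian ridge) regression per column
linear_imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_linear = linear_imputer.fit_transform(X)

# same chained procedure, but with a tree-based model per column
forest_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_forest = forest_imputer.fit_transform(X)
```

The chained loop is identical in both cases; only the regressor used in step 4 changes, which is exactly the point of disclaimer 2 above.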