Based on this paper: Deep learning versus conventional methods for missing data imputation: A review and comparative study

1. Three types of missing values:

  1. MCAR (Missing Completely at Random)

    Mathematical Definition: Data is MCAR if the probability of a value being missing is the same for all observations, and it's independent of both observed and unobserved data.

    $$ P(data_{missing}) = P(data_{missing} |data_{observed},data_{unobserved}) $$

    Real-life Example: Imagine a survey where the respondent's age is missing due to a glitch in the system where certain responses weren't saved properly. If this glitch happened entirely at random and isn't related to the respondent's age or any other variable, then the missing age data is MCAR.

  2. MAR (Missing at Random)

    Mathematical Definition: Data is MAR if the probability of a value being missing depends only on the observed data; once the observed data is accounted for, it does not depend on the unobserved data.

    $$ P(data_{missing}|data_{observed}) = P(data_{missing} |data_{observed},data_{unobserved}) $$

    Real-life Example: Let's say you're conducting a survey, and younger participants are less likely to answer a question about yearly income. If the propensity for the income data to be missing is related to age (which we have observed) but not related to the actual yearly income value (which we haven't observed when it's missing), then the missing income data is MAR.

  3. MNAR (Missing not at Random)

    Mathematical Definition: Data is MNAR if the missingness is related to the value of the unobserved data, even after accounting for the observed data.

    $$ P(data_{missing}|data_{observed}) \neq P(data_{missing} |data_{observed},data_{unobserved}) $$

    Real-life Example: In the same survey, if people with higher incomes are less likely to disclose their yearly income, the missingness in income is related to the actual value of the income that's missing. In this case, the missing income data is MNAR.
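The three mechanisms above can be made concrete with a small simulation. This is only a sketch under assumed numbers: the `age` and `income` distributions, the missingness probabilities (0.3, 0.6, 0.1), and the function name `simulate_missingness` are all hypothetical and do not come from the paper. It follows the survey examples: MCAR drops income values with a constant probability, MAR drops them more often for younger respondents (age is observed), and MNAR drops them more often when the income itself is high.

```python
import numpy as np


def simulate_missingness(n=10_000, seed=0):
    """Generate toy survey data and one missingness mask per mechanism.

    All distributions and probabilities here are illustrative assumptions.
    A mask value of True means "income is missing for this respondent".
    """
    rng = np.random.default_rng(seed)

    # Fully observed variable: age. Partially observed variable: income.
    age = rng.uniform(20, 70, size=n)
    income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 10_000, size=n)

    # MCAR: every value has the same chance of being missing.
    mcar = rng.random(n) < 0.3

    # MAR: missingness depends only on the observed variable (age).
    p_mar = np.where(age < 35, 0.6, 0.1)
    mar = rng.random(n) < p_mar

    # MNAR: missingness depends on the value that goes missing (income).
    p_mnar = np.where(income > np.median(income), 0.6, 0.1)
    mnar = rng.random(n) < p_mnar

    return age, income, {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


age, income, masks = simulate_missingness()
for name, mask in masks.items():
    observed_mean = income[~mask].mean()
    print(f"{name}: missing rate={mask.mean():.2f}, "
          f"observed mean income={observed_mean:,.0f} "
          f"(true mean={income.mean():,.0f})")
```

In this toy setup the observed mean stays close to the true mean under MCAR, while under MAR and MNAR it is biased. The difference is that the MAR bias could in principle be corrected using `age`, whereas the MNAR bias depends on values that were never observed.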

To summarize:

I would say my project's case is closest to MAR: values tend to be missing because of a lack of popularity, a smaller market cap, delisting, etc. The actual missing values themselves don't seem to affect the propensity to be missing.

2. Data imputation methodologies

2.1. Mean/median/mode imputation

Typically, the mode is used for binary variables and the mean or median for continuous variables.
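A minimal sketch of this kind of imputation with pandas is shown below. The helper `impute_simple`, the column names `income` and `is_listed`, and the toy values are hypothetical and chosen only for illustration; the sketch assumes each column has at least one observed value.

```python
import numpy as np
import pandas as pd


def impute_simple(df, strategies):
    """Return a copy of `df` with NaNs filled column by column.

    `strategies` maps a column name to "mean", "median", or "mode".
    The fill value is computed from the observed (non-NaN) entries only.
    """
    out = df.copy()
    for col, how in strategies.items():
        if how == "mean":
            fill = out[col].mean()
        elif how == "median":
            fill = out[col].median()
        elif how == "mode":
            # mode() can return several values on ties; take the first.
            fill = out[col].mode().iloc[0]
        else:
            raise ValueError(f"unknown strategy: {how!r}")
        out[col] = out[col].fillna(fill)
    return out


# Toy data: one continuous column and one binary column, each with a gap.
df = pd.DataFrame({
    "income": [30.0, 50.0, np.nan, 70.0],
    "is_listed": [1.0, 1.0, 0.0, np.nan],
})

filled = impute_simple(df, {"income": "median", "is_listed": "mode"})
print(filled)
```

The function works on a copy, so the original frame keeps its missing entries and can be reused to compare other imputation methods.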

Pros:

Cons: