1. Introduction & Intuition

There are numerous methods that allow us to measure feature importance.

However, each of these methods can show different results. Which result should we trust, then? Is there a way to check which method is the most accurate?

One approach is to run a sanity check on synthetic data that we create ourselves. Unlike real data collected from the real world, synthetic data lets us control the exact relationship between the independent variables and the dependent variable.

2. Detailed process

The Jupyter notebook is shown here.

2.1. Define the function to create synthetic data

We first define the function that creates the synthetic data. In this example, we assume a linear relationship between the independent variables X_i and the target y.
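Concretely, the data-generating process implemented by the function below can be summarized as follows (this is just a restatement of the code, with Z_i denoting the raw standard-normal draws):

y = 1 if Σ_i (i · Z_i) > t, else y = 0

where t is the chosen percentile of the weighted sum, and the stored feature is X_i = i · (Z_i + ε_i), with ε_i drawn from a normal distribution with mean 0 and variance error_var.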

The function to create the synthetic data is shown below:

import numpy as np
import pandas as pd

def generate_dataframe(X_num, row_num, error_var, threshold):
    
    # Create a dataframe with specified rows and X_num columns
    df = pd.DataFrame(np.random.randn(row_num, X_num), columns=[f'X{i+1}' for i in range(X_num)])
    
    # Multiply each X by its order and sum them up for each row
    sum_x = df.mul([i+1 for i in range(X_num)], axis=1).sum(axis=1)

    # Determine the threshold value based on the percentile of the sum
    threshold_value = np.percentile(sum_x, threshold)

    # Assign 1 to rows where the sum is above the threshold, 0 otherwise
    df['y'] = (sum_x > threshold_value).astype(int)

    # Add the random error term to each X
    for i in range(X_num):
        df.iloc[:, i] += np.random.normal(0, np.sqrt(error_var), row_num)

    # Multiply each column by its order
    for i in range(X_num):
        df.iloc[:, i] *= (i + 1)

    return df
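
As a quick illustration, here is a minimal sketch of how the function might be called; the argument values below are arbitrary examples chosen for this sketch, not values prescribed by the notebook:

# Illustrative call: 5 features, 1,000 rows, error variance of 1.0,
# and a label cutoff at the 70th percentile of the weighted sum
df = generate_dataframe(X_num=5, row_num=1000, error_var=1.0, threshold=70)

print(df.shape)        # (1000, 6): five X columns plus the y column
print(df['y'].mean())  # roughly 0.30, since y = 1 above the 70th percentile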

The inputs of the function are:

  1. X_num: Number of independent variables.
  2. row_num: Number of rows.
  3. error_var: The variance of the normally distributed error term, which has a mean of 0.