There are numerous methods that allow us to measure feature importance.
However, each of them can show different results. Which result should we trust, and is there a way to check which method is the most accurate?
One approach is to run a sanity check on synthetic data that we create ourselves. Unlike real data collected from the real world, synthetic data lets us control the exact relationship between the independent variables and the dependent variable.
The Jupyter notebook is available here.
We first define a function that creates the synthetic data. In this example, we assume a linear relationship between the independent variables X_i and y.
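To make this relationship explicit, the function below first draws each X_i from a standard normal distribution and assigns the label as

y = 1 if (1*X_1 + 2*X_2 + ... + p*X_p) > q_t, else 0

where p is the number of features and q_t is the t-th percentile of the weighted sum. Noise is added to each X_i only after y is assigned, so feature X_i enters with a known true weight of i, and the correct importance ordering is known by construction.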
The function to create the synthetic data is shown below:
import numpy as np
import pandas as pd

def generate_dataframe(X_num, row_num, error_var, threshold):
    # Create a dataframe with row_num rows and X_num standard normal columns
    df = pd.DataFrame(np.random.randn(row_num, X_num),
                      columns=[f'X{i+1}' for i in range(X_num)])
    # Multiply each X by its order (X1*1, X2*2, ...) and sum across columns
    sum_x = df.mul([i + 1 for i in range(X_num)], axis=1).sum(axis=1)
    # Determine the cutoff value at the given percentile of the weighted sum
    threshold_value = np.percentile(sum_x, threshold)
    # Assign y = 1 to rows where the weighted sum exceeds the cutoff, 0 otherwise
    df['y'] = (sum_x > threshold_value).astype(int)
    # Add a random error term to each X
    for i in range(X_num):
        df.iloc[:, i] += np.random.normal(0, np.sqrt(error_var), row_num)
    # Multiply each column by its order so the features have different scales
    for i in range(X_num):
        df.iloc[:, i] *= (i + 1)
    return df
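As a quick check that the generator behaves as intended, here is a minimal usage sketch; the parameter values below (5 features, 1,000 rows, error variance 0.5, 50th-percentile threshold) are arbitrary choices for illustration:

# Generate 1,000 rows with 5 features; y = 1 for rows whose
# weighted sum of X's falls above the 50th percentile
df = generate_dataframe(X_num=5, row_num=1000, error_var=0.5, threshold=50)

print(df.head())
print(df['y'].mean())  # should be close to 0.5 when threshold=50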
First, the inputs to the function are: