
# Calculating a Feature’s Importance with Gini Importance

## Using Random Forest regression to identify important features

Many times, in the course of an analysis, we find ourselves asking questions like:

“What boosts our sneaker revenue more: YouTube ads, Facebook ads, or Google ads?”

with a small complication: we didn’t measure where the revenue came from, and we didn’t run any experiments to see what our incremental revenue is for each channel.

In this case, establishing direct causality is hard, or impossible. However, we still need ways of inferring what is more important, and we’d like to back that up with data.

Although this isn’t a new technique, I’d like to review how feature importances can be used as a proxy for causality.

# The problem

I have order book data from a single day of trading the S&P E-Mini. The system captures order book data in real time as new limit orders come into the market, and stores a snapshot with every new tick.

I don’t necessarily know what effect a trader placing 100 limit buys at the current price + $1.00 has, or whether it has any effect on the current price at all. There’s no way for me to isolate the effect or run any experiment, so I’m left trying to *infer causality from observation*.

There are a couple of ways to go about this:

Option A: I could run a correlation between the first order differences of each level of the order book and the price.

or

Option B: I could fit a regression, then calculate the feature importances, which would tell me which features best predict changes in price. From there, I can use the direction of change in each order book level to infer what influences changes in price.

Neither of these is perfect. Spurious correlations can occur, and the regression is not likely to be significant. However, these are our best options and can help guide us to the next likely step.

# The data

The data are tick data from the trading session on 10/26/2020. Each tick represents a change in the close, bid, or ask price of the security. The order book is snapshotted and returned with each tick. The order book may fluctuate “off-tick”, but it is only recorded when a tick is generated, which allows simpler time-based analysis.

Here is an example record:

We have a time field, our pricing fields, and the “md” fields, which represent the demand to sell (“ask”) or buy (“bid”) at various price deltas from the current ask/bid price.

Now, we generate first order differences for the variables in question.

```python
diffs = es[["close", "ask", "bid",
            'md_0_ask', 'md_0_bid', 'md_1_ask', 'md_1_bid',
            'md_2_ask', 'md_2_bid', 'md_3_ask', 'md_3_bid',
            'md_4_ask', 'md_4_bid', 'md_5_ask', 'md_5_bid',
            'md_6_ask', 'md_6_bid', 'md_7_ask', 'md_7_bid',
            'md_8_ask', 'md_8_bid', 'md_9_ask', 'md_9_bid']].diff(periods=1, axis=0)
```

Now, on to the math.

# Calculating feature importance with Gini importance

The sklearn `RandomForestRegressor` uses a method called Gini importance.

The Gini importance is defined as:
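(The original equation here was an image; below is a sketch of the standard mean-decrease-in-impurity definition that sklearn implements.) For a feature $j$ and a forest of $T$ trees:

$$\mathrm{Imp}(j) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{n \,\in\, t \\ n \text{ splits on } j}} \frac{N_n}{N} \, \Delta i(n)$$

where $N_n / N$ is the fraction of samples reaching node $n$, and $\Delta i(n)$ is the impurity decrease achieved by the split at that node (variance reduction for regression, Gini impurity decrease for classification). sklearn then normalizes the importances to sum to 1 across features.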

Let’s use the variable `md_0_ask` as an example.

Across all 1,000 of our trees, `md_0_ask` will be chosen as a split variable at various nodes. Its importance is the variance reduced at every node where `md_0_ask` is used, weighted by the fraction of samples reaching that node and averaged over the trees. Note that for classification problems, the Gini importance is calculated using Gini impurity instead of variance reduction.
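To make the variance-reduction idea concrete, here is a toy computation of the quantity a tree evaluates for one candidate split. The data below is synthetic (not the order book) and the threshold is arbitrary; this is just the arithmetic, not the full tree-building procedure:

```python
import numpy as np

# Toy illustration: variance reduction from a single candidate split on one feature.
rng = np.random.default_rng(0)
x = rng.normal(size=200)             # stand-in for md_0_ask diffs
y = 2.0 * x + rng.normal(size=200)   # stand-in for close diffs

parent_var = np.var(y)
threshold = 0.0                      # an arbitrary candidate split point
left, right = y[x <= threshold], y[x > threshold]

# Weighted variance of the two children after the split
child_var = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)
reduction = parent_var - child_var
print("variance reduced by this split:", reduction)
```

Because `x` is informative about `y`, the split separates high-`y` from low-`y` samples and the reduction is positive; splitting on a pure-noise feature would leave it near zero.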

Now that we have an understanding of the math, let’s calculate our importances.

# Running a regression and getting importance

Let’s run a regression. Again, we’re less concerned with our accuracy and more concerned with understanding the importance of the features. We could use other methods to get better regression performance.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pprint

Y = diffs[["close"]]
X = diffs[['md_0_ask', 'md_0_bid', 'md_1_ask', 'md_1_bid',
           'md_2_ask', 'md_2_bid', 'md_3_ask', 'md_3_bid',
           'md_4_ask', 'md_4_bid', 'md_5_ask', 'md_5_bid',
           'md_6_ask', 'md_6_bid', 'md_7_ask', 'md_7_bid',
           'md_8_ask', 'md_8_bid', 'md_9_ask', 'md_9_bid']]

# I'm training a regressor just to determine the "weights" of the input variables
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
clf = RandomForestRegressor(n_estimators=1000)

# fit the model
clf.fit(X_train, Y_train.values.ravel())

# note that I don't expect a good result here, as I'm only building the model
# to determine importance
Y_hat = clf.predict(X_test)
mse = mean_squared_error(Y_test, Y_hat)
r2 = r2_score(Y_test, Y_hat)
print("mse: {}, r2: {}".format(mse, r2))

importances = {}
for i, v in enumerate(clf.feature_importances_):
    importances[X.columns[i]] = v
sorted_importances = sorted(importances.items(), key=lambda k: k[1], reverse=True)
pprint.pprint(sorted_importances)
```
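Since the snippet above depends on the `diffs` frame built from the tick data, here is a self-contained sketch on synthetic data (the column names and coefficients below are invented for illustration). One column actually drives the target, so it should come out at the top of the ranking:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the order-book diffs: md_0_ask drives the target,
# the remaining columns are pure noise.
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({f"md_{i}_ask": rng.normal(size=n) for i in range(5)})
y = 3.0 * X["md_0_ask"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestRegressor(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rank features by Gini (mean decrease in impurity) importance
importances = sorted(zip(X.columns, clf.feature_importances_),
                     key=lambda kv: kv[1], reverse=True)
print(importances)
```

The informative column absorbs nearly all of the impurity reduction, so its importance dwarfs the noise columns', mirroring what we hope to see with the real order-book levels.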

This gives us our output: a sorted list of importances.

What did we glean from this information? As the price deviates from the actual bid/ask prices, the change in the number of orders on the book decreases (for the most part).

Although there aren’t huge insights to be gained from this example, we can use it for further analysis, e.g. looking into the difference between md_3 and md_1, md_2, which violates the generality I proposed.

I hope you found this insightful and useful.

If you enjoyed this, please check out some of my other articles that you might find useful.

Thanks for reading!