Be Wary of Automated Feature Selection — Chi Square Test of Independence Example
When data scientists use the chi-square test for feature selection, they often merely go by the ritualistic "If your p-value is low, the null hypothesis must go." The automated functions they use behave no differently.
By Venkat Raman, Co-Founder at Aryma Labs
Of late, there is a greater impetus to 'Automate Everything' in Data Science. One such area is 'Feature Selection'. I am of the firm opinion that feature selection is best done by a data scientist (with domain knowledge) and not through automation.
However, some don’t agree.
I had been thinking about writing on this topic for a long time.
Thanks to a recent LinkedIn post¹, I got enough fodder to show how and why 'Automated Feature Selection' can be dangerous.
The case in point is the Chi-Square Test of Independence. Many data scientists use it blindly for feature selection without considering how, and in which context, it is supposed to be used.
When data scientists use the chi-square test for feature selection, they merely go by the ritualistic "If your p-value is low, the null hypothesis must go."
The automated function they use behaves no differently. We shall see how shortly.
Coming back to the chi-square test, the typical hypothesis setup is:
H0: There is no association between the variables.
H1: There is an association between the variables.
So, when data scientists use sklearn's automated chi-square selection, it simply selects the "best" features based on "select the variables where the p-value is less than alpha (often 0.05)".
As illustrated in the image below, sklearn's chi2 function does not perform an 'Effect Size' test. It only provides chi-square statistics and p-values.
Image: Author (Chi square feature selection)
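To make this concrete, here is a minimal sketch (using synthetic, hypothetical data) of what sklearn's chi2 scorer actually returns: chi-square statistics and p-values, and nothing about the strength of the association.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
# Hypothetical data: three non-negative, count-like features and a binary target.
X = rng.integers(0, 5, size=(500, 3))
y = rng.integers(0, 2, size=500)

chi2_stats, p_values = chi2(X, y)   # returns only statistics and p-values
print("chi2 statistics:", chi2_stats)
print("p-values       :", p_values)

# SelectKBest ranks features by the chi2 score alone -- no effect size is computed.
selector = SelectKBest(chi2, k=2).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```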
Here starts the trouble.
What many data scientists (without a stats background) don't realize is that the p-value is not indicative of the 'strength of association'. One must understand the difference between 'Statistical Significance' and 'Practical Significance'.
Statistical significance does not always imply practical significance.
Also, p-values can easily be hacked². All you need to do is increase your sample size! At large sample sizes even small effects can become statistically significant, while at small sample sizes even large effects may not be. The sketch below illustrates this.
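A minimal sketch, assuming a hypothetical 2x2 contingency table with a deliberately weak association: scaling the same table up makes the p-value collapse even though the underlying effect never changes.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table encoding a very weak association (52% vs 48%).
base = np.array([[52, 48],
                 [48, 52]])

for scale in (1, 10, 100):
    table = base * scale
    chi2_stat, p_value, dof, _ = chi2_contingency(table)
    print(f"n = {table.sum():>6}   chi2 = {chi2_stat:7.2f}   p = {p_value:.2e}")

# The proportions (and hence the effect) are identical in every table,
# yet the p-value shrinks from "not significant" to "highly significant"
# purely because n grew.
```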
To gauge practical significance, statisticians use ‘Effect Size’.
Automated feature selection functions (e.g., sklearn's chi2) do not perform 'Effect Size' tests such as Cramér's V. A sketch of how to compute it yourself follows below.
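Here is a minimal sketch of a (hypothetical) helper that computes Cramér's V from a contingency table, applied to the large-n table from the previous sketch: the p-value screamed "significant", but the effect size says the association is negligible.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> float:
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2_stat, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2_stat / (n * (min(r, c) - 1))))

# The scaled-up table from the previous sketch: p was tiny, but...
table = np.array([[5200, 4800],
                  [4800, 5200]])
print(f"Cramér's V = {cramers_v(table):.3f}")   # ~0.04 -- a negligible association
```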
So what does all this mean?
1) If your dataset (training + test) is huge, it becomes that much easier to show some association between a variable and the target variable.
2) In the absence of domain knowledge, you might pick variables thinking they have some association with the target variable, when in reality that association might be very weak.
Automated feature selection exposes you to certain risks that you may not even be aware of!
The Takeaways:
Please remember, not everything in Data Science can be automated. When it comes to feature selection, domain knowledge is your biggest safety net. And if you do use automated feature selection such as the chi-square test, please run an 'Effect Size' test before choosing variables.
References:
1. https://www.linkedin.com/posts/awbath-aljaberi-b4b937183_effect-size-ugcPost-6826552045118607360-iQAd
2. https://youtu.be/ncqcFNHmMoc
Your comments and opinions are welcome.
Bio: Venkat Raman is Co-Founder at Aryma Labs.
Original. Reposted with permission.
Related:
- Why Saying “We Accept the Null Hypothesis” is Wrong: An Intuitive Explanation
- The Lost Art of Decile Analysis
- Abstraction and Data Science: Not a great combination