
Compare Sklearn Feature Importances vs SHAP Explainer for the Best ML Features

The Best Way to Choose Features for Your Machine Learning Model - Compare Feature Importances with Shapley Values


In this Python Machine Learning Tips lesson we attempt to answer the question: what are the best features to use in my machine learning model to get the best score? That's a pretty tricky question to answer.


We will use a house price prediction dataset and see if we can choose only a few features while maintaining or improving the score achieved when using all 156. That is a lot of features, and a common situation in house price prediction workflows, but is the model actually using all of them? Probably not, and probably not even most of them. So how do we choose which features to keep?


Using so many features also risks overfitting, because it gives our model many opportunities to fit to randomness. In terms of test score, a handful of features can often do better by preventing overfitting and simplifying the model.
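To make the feature explosion concrete, here is a minimal preprocessing sketch. The file name train.csv and the SalePrice target column are assumptions based on the usual Kaggle house price format; your exact feature count will depend on your own preprocessing choices.

```python
import pandas as pd

# Load the Kaggle house price training data (file name assumed).
df = pd.read_csv("train.csv")

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

# One-hot encoding the categorical columns is what inflates the
# feature count into the hundreds.
X = pd.get_dummies(X, drop_first=True)
X = X.fillna(X.median())

print(X.shape)  # roughly (1460, 150+), depending on preprocessing choices
```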


In this workflow, we compare sklearn's feature importances with the shap library's kernel explainer. These two approaches answer the question of best features in very different ways. Sklearn asks which features are helping the accuracy of the model, while Shapley values show us the impact of each feature on the final output, not how the feature helps with accuracy.


These two methods are really after two different things: what helps accuracy (sklearn) versus what drives the output (Shapley values). In this free Python ML lesson we compare these two methods in an effort to determine the best way to select the best features.
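Below is a rough sketch of how the two rankings could be produced, continuing from the preprocessing sketch above. The model choice (a random forest) and the sample sizes are illustrative assumptions, not the exact setup used in the video; KernelExplainer is model-agnostic but slow, so it is run against small samples here.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Split the preprocessed data from the sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# sklearn: impurity-based importances, i.e. how much each feature helped
# reduce error while the trees were being built.
sk_importances = pd.Series(model.feature_importances_, index=X_train.columns)

# shap: KernelExplainer estimates each feature's contribution to the
# model's output for individual predictions. It is slow, so we use a
# small background sample and explain a limited number of rows.
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_train.iloc[:100])

# Rank features by mean absolute SHAP value across the explained rows.
shap_importances = pd.Series(np.abs(shap_values).mean(axis=0),
                             index=X_train.columns)
```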


Follow along for free in the Machine Learning Tips video by DataSimple.education.





Comparison of Kernel Explainer and Feature Importances to Select the Best Features for an ML Model

In this free Python Machine Learning Model Explainability lesson we compared sklearn's feature importances with shap's kernel explainer. Both techniques give us an idea of which features matter for our predictions.


The house price dataset comes from this Kaggle competition, which is why, once preprocessing was finished, we had a very large number of features: 156, to be exact. As you can imagine, that many features risks confusing the model and hurting prediction, and can also lead to overfitting by giving the model so many opportunities to do so.


To make this a real contest, we selected only 10 features using sklearn's feature importances and 10 features using SHAP's kernel explainer, then compared the results. On this dataset, SHAP's kernel explainer chose the better features in terms of the highest test score, but the difference between the two was marginal, meaning either could be a valid option for reducing the number of features.
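A sketch of that contest might look like the following, reusing the rankings and the train/test split from the snippet above; the scoring metric (R²) is an assumption on our part.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Keep only the top 10 features from each ranking and compare test
# scores on freshly fit models.
top10_sklearn = sk_importances.nlargest(10).index
top10_shap = shap_importances.nlargest(10).index

for name, cols in [("sklearn top 10", top10_sklearn),
                   ("shap top 10", top10_shap)]:
    m = RandomForestRegressor(n_estimators=100, random_state=42)
    m.fit(X_train[cols], y_train)
    print(name, round(r2_score(y_test, m.predict(X_test[cols])), 4))
```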


We also compared how each technique ranked and weighted the features. Interestingly, sklearn's feature importances relied heavily on one feature, while SHAP distributed the impact on the final output across a more diverse, balanced set of features.
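One quick way to see this concentration is to normalize each ranking to shares of total importance and look at the top feature's slice, as in this small sketch (names carried over from the snippets above):

```python
# How concentrated is each ranking? Normalize to shares of the total
# and inspect the single largest contributor.
for name, imp in [("sklearn", sk_importances), ("shap", shap_importances)]:
    share = (imp / imp.sum()).sort_values(ascending=False)
    print(f"{name}: top feature '{share.index[0]}' "
          f"holds {share.iloc[0]:.1%} of the total")
```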


Thank you for joining us for this free-to-use Machine Learning Model Explainability lesson.
