In this Python project, we will use Pandas and Seaborn to perform our exploratory data analysis (EDA) in an effort to understand how our features impact our target. A big aspect of this is paying close attention to the distributions of the features used in our supervised machine learning project.
After we've explored the data, extracted valuable insights for our business partners, and gathered ideas on how to build our model, we will use Sklearn to preprocess the data for our ML model.
Part 1
Part 2
Part 3
Follow Data Science Teacher Brandyn
A good practice in machine learning is to try one of every major model type on your ML problem. All models try to predict the same thing, but with different math. No matter how well you understand the math, humans just aren't capable of thinking through all the interrelationships among features and how they relate to the target. An easier solution is to try them all. Sklearn makes it rather easy to try RandomForest, Bagging, AdaBoost, and GradientBoosting.
As we go through our EDA section, we will collect insights specifically about the distributions of our features, because that will be very important for correctly preprocessing our features for our machine learning model.
In our bivariate analysis, we will plot the correlation matrix with Pandas .corr() and Seaborn's heatmap() to give us an easy understanding of the linear correlations among our features.
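A minimal sketch of the correlation heatmap follows. The toy DataFrame, its column names, and the styling arguments (annot, cmap, vmin, vmax) are illustrative choices, not the project's actual settings.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: feature_b is built to correlate with feature_a.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": 2 * x + rng.normal(scale=0.5, size=200),
    "feature_c": rng.normal(size=200),
})
df["target"] = 3 * df["feature_a"] + rng.normal(size=200)

# .corr() returns pairwise Pearson (linear) correlations by default.
corr = df.corr()

# annot=True prints each coefficient in its cell; fixing vmin/vmax to [-1, 1]
# keeps the color scale comparable between plots.
# In a notebook the figure renders on its own; in a script, call plt.show().
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
```

Keep in mind that .corr() only captures linear relationships, so a value near zero does not rule out a nonlinear dependence.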
Exponential distributions are difficult for ML models, and it is often better to take a log transform of an exponentially distributed feature to bring it closer to a normal distribution. This is an imperfect technique, but it will most likely make the average a better representation of the data and make for better predictions.
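Here is a small sketch of the log transform on a heavily right-skewed column. The data is synthetic (log-normal, chosen because it becomes normal after a log), and the column name income is a placeholder. np.log1p is used instead of np.log as a precaution so that zeros would not produce -inf.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature; the real project's columns will differ.
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1000)})

# log1p(x) = log(1 + x); it is defined at x = 0, unlike log(x).
df["income_log"] = np.log1p(df["income"])

skew_before = df["income"].skew()
skew_after = df["income_log"].skew()
print(f"skew before: {skew_before:.2f}, skew after: {skew_after:.2f}")

# np.expm1 undoes log1p, which is needed if the transformed column is the target
# and predictions have to be reported in the original units.
restored = np.expm1(df["income_log"])
```

After the transform the skew should sit much closer to zero, which is what makes the mean a more representative summary of the column.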
We use Pandas get_dummies() to one-hot encode our categories and bucketized continuous features.
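The encoding step might look something like the sketch below. The columns, bin edges, and bucket labels are invented for illustration, and pd.cut is one assumed way of bucketizing a continuous feature before it is passed to get_dummies().

```python
import pandas as pd

# Hypothetical data: one categorical column and one continuous column.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "age": [23, 37, 52, 68],
})

# Bucketize the continuous feature into labeled ranges: (0, 30], (30, 50], (50, 100].
df["age_bucket"] = pd.cut(
    df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"]
)

# One-hot encode the category and the buckets; other columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["color", "age_bucket"], dtype=int)
print(encoded)
```

Each listed column is replaced by one 0/1 indicator column per category, named with the original column as a prefix (for example color_red).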
With Sklearn, it's handy to create a little function that fits each model and gets its scores.
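One possible shape for such a helper is sketched below. It assumes a regression target (the Classifier versions of the same estimators work the same way), uses synthetic data from make_regression, and the function name fit_and_score is a placeholder rather than the project's own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit the model and return its (train, test) scores.

    For regressors, .score() returns R^2; for classifiers, accuracy.
    """
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in for the preprocessed project data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Bagging": BaggingRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = fit_and_score(model, X_train, X_test, y_train, y_test)
    train_r2, test_r2 = scores[name]
    print(f"{name:>16}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Reporting the train and test scores side by side makes overfitting easy to spot: a large gap between the two suggests the model has memorized the training data.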