In this supervised learning Python ML guided project we will predict whether or not someone was approved for a credit card. We will use only LogisticRegression() from Sklearn for this supervised classification problem. Although this project would also work with RandomForestClassifier() or GradientBoostingClassifier(), our focus is error analysis: understanding precision versus recall and why we might prioritize one over the other. Knowing which errors matter to us lets us improve either the recall or the precision score, whichever is more important for our business use case.
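The modeling step can be sketched as follows. Since the actual credit card approval dataset isn't shown here, this uses a synthetic stand-in from make_classification (an assumption, not the project's real data):

```python
# Minimal sketch of the modeling step. The synthetic dataset below is a
# stand-in for the real credit card approval data used in the project.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```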
Send data science teacher Brandyn a message if you have any questions.
As part of our error analysis in this supervised classification Python project we will use Yellowbrick's DiscriminationThreshold() to understand how changing our decision threshold affects our final precision and recall scores. This is usually a give-and-take relationship.
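Yellowbrick's DiscriminationThreshold visualizer plots this trade-off for you; the underlying idea can be sketched with Sklearn alone (synthetic data below is an assumption for illustration):

```python
# Sweep the decision threshold manually and watch precision/recall trade off.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

results = {}
for threshold in [0.3, 0.5, 0.7]:
    preds = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, preds, zero_division=0),
        recall_score(y_test, preds),
    )
    print(threshold, results[threshold])
```

Lowering the threshold flags more rows as positive, which can only raise recall, usually at the cost of precision.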
We will use a for loop in Python to plot the distributions of many features at once.
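A minimal sketch of that loop, assuming a few hypothetical numeric columns in place of the project's real features:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for the project's features.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(35, 10, 500),
    "income": rng.lognormal(10, 0.5, 500),
    "debt": rng.exponential(2000, 500),
})

# One histogram per column, all in a single figure.
fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, col in zip(axes, df.columns):
    ax.hist(df[col], bins=30)
    ax.set_title(col)
fig.tight_layout()
```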
We will also use a for loop to print out the Pandas value_counts() of each categorical feature.
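A sketch of that loop, with made-up categorical columns standing in for the real dataset:

```python
import pandas as pd

# Hypothetical categorical columns standing in for the project's features.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "employed": ["Yes", "Yes", "No", "Yes", "No"],
    "approved": [1, 0, 1, 1, 0],
})

# Select only the object (string) columns and print each one's counts.
categorical_cols = df.select_dtypes(include="object").columns
for col in categorical_cols:
    print(f"--- {col} ---")
    print(df[col].value_counts())
```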
After identifying outliers we will use .clip() in Pandas to truncate our outliers and get our data ready for our machine learning model.
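A minimal sketch of clipping; the 1st/99th percentile bounds here are an assumption, so choose limits that match your own outlier analysis:

```python
import pandas as pd

# Hypothetical skewed feature with one extreme outlier.
s = pd.Series([100, 120, 115, 130, 5000])

# Clip at chosen quantiles (these bounds are an illustrative assumption).
lower, upper = s.quantile(0.01), s.quantile(0.99)
clipped = s.clip(lower=lower, upper=upper)
print(clipped.max())  # the 5000 outlier is pulled down to the upper bound
```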
We will build a user-defined function in Python that plots the confusion matrices of the train and test data sets.
We will use logical indexing in Pandas to isolate the rows with incorrect predictions and analyze how they differ from the dataset as a whole. The hope is this will give us clues on features to engineer for better predictions.
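A sketch of that logical indexing step, again on a synthetic stand-in dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=2)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Logical indexing: keep only the rows the model got wrong.
test_df = X_test.copy()
test_df["actual"] = y_test
test_df["predicted"] = model.predict(X_test)
wrong = test_df[test_df["actual"] != test_df["predicted"]]

# Compare the errors' feature means against the whole test set's.
print(wrong.mean())
print(test_df.mean())
```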
Use ClassificationReport from Yellowbrick to look at precision and recall from the perspective of each class in your predictions. By default, precision and recall are reported in terms of the positive class.
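The numbers behind that visualizer can be produced with Sklearn's classification_report, shown here on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-class precision, recall, and F1; Yellowbrick's ClassificationReport
# renders the same numbers as a color-coded heatmap.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
print(classification_report(y_test, y_pred))
```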
The ROCAUC plot in Yellowbrick can be valuable in understanding our Logistic Regression predictions.
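The quantities that plot is built from, the false positive rate, true positive rate, and area under the curve, can be computed directly with Sklearn (synthetic data is an assumption here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC is computed from predicted probabilities, not hard class labels.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(f"AUC: {auc:.3f}")
```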
Next we will use Yellowbrick's PrecisionRecallCurve to better understand the relationship between precision and recall.
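The curve itself comes from sweeping the threshold over predicted probabilities, which Sklearn exposes directly (again sketched on synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each (precision, recall) pair corresponds to one candidate threshold.
proba = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)
```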
With our error analysis done, we can now engineer features from those insights to improve our model's predictions.