In this guided Python data analysis project, we explore dog breeds using a Kaggle data set. It is a beginner project in which we use our Python data analysis skills to understand the eye color, fur color, and height of common dog breeds.
We start with a little preprocessing to enable our analyses. This is needed because some features are semi-structured: a single cell holds a list, and the length of that list varies from row to row. The character traits feature, for example, lists a different number of traits for each breed, which is not amenable to data analysis as it stands. To solve this, we use Pandas' explode function, which gives each list element its own row and turns the feature into structured data ready for analysis.
After completing a univariate analysis of each feature, we move on to bivariate analysis, where we examine how one column affects another.
We look at how fur color, character traits, and common health issues affect a dog's height and life span. For this we use Seaborn's histplot with the hue argument, which draws each category in its own color within the histogram.
Follow Data Science Teacher Brandyn
A common problem is a feature in which each cell holds a list of categories, and the lists differ in length. In this data set the lists are stored as long strings, so we first use Pandas' str.split function to turn each string into an actual list. Once the feature holds real lists, it is ready for Pandas' explode function.
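The split-then-explode step can be sketched as follows. The column names, the values, and the ", " separator are assumptions made for this example; the real data set may use a different delimiter.

```python
import pandas as pd

# Hypothetical toy data; the column names and the ", " separator are assumptions.
df = pd.DataFrame({
    "breed": ["Akita", "Beagle"],
    "fur_color": ["Red, White, Brindle", "Tricolor, Lemon"],
})

# Step 1: turn each long string into a real Python list.
df["fur_color"] = df["fur_color"].str.split(", ")

# Step 2: give every list element its own row; the breed value is repeated.
exploded = df.explode("fur_color").reset_index(drop=True)
print(exploded)
```

After this step, each row holds exactly one breed and one color, so the usual counting and grouping tools work on the column.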
Once a feature has been turned into structured data, we can complete our data analysis. Here we look at the most common fur colors among dog breeds, using Pandas' plot to create a bar graph.
Here we take the output of Pandas' value_counts function and apply logical indexing so that we plot only the values with a count greater than one, which keeps the plot user-friendly.
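A minimal sketch of the counting, filtering, and bar plot steps is shown here. The color values are invented for illustration, and the Series stands in for an already exploded fur color column.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import pandas as pd

# Hypothetical values standing in for an exploded fur color column.
colors = pd.Series(
    ["Black", "White", "Black", "Brown", "White", "Black", "Cream"],
    name="fur_color",
)

# Count how often each color appears, most frequent first.
counts = colors.value_counts()

# Logical (boolean) indexing: keep only colors that appear more than once.
common = counts[counts > 1]

# Draw the remaining counts as a bar graph.
ax = common.plot(kind="bar", title="Fur colors appearing more than once (toy data)")
```

The expression counts > 1 produces a Series of True and False values, and indexing with it keeps only the entries marked True.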
When we first called Pandas' info function, which gives a count of the non-null values and the data type of each column in our DataFrame, we noticed that the height feature had the object data type.
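For reference, a small sketch of what this check looks like. The DataFrame below is a made-up stand-in, and the exact label shown for text columns can differ between Pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the dog breeds DataFrame; names and values are assumptions.
df = pd.DataFrame({
    "breed": ["Akita", "Beagle", "Pug"],
    "height": ["26-28", "13-16", None],
})

# Prints the non-null count and the data type of every column.
df.info()

# A range stored as text is not numeric, so it cannot be plotted as a distribution yet.
height_is_numeric = pd.api.types.is_numeric_dtype(df["height"])
print("height is numeric:", height_is_numeric)
```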
Upon inspecting this column, we see that each value is a range of heights stored as text, so we need to clean this feature before we can analyze it.
To extract the values we need from the string in this feature, we create two user-defined functions: one for the minimum value and one for the maximum.
After creating each function, we use Pandas' apply function to run it on the column and save the output to a new column.
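One possible way to write the two functions is sketched below. It assumes the height strings look like "24-28" and uses a regular expression to pull out the numbers; the actual format in the data set, the function names, and the small shared helper are all assumptions for this example.

```python
import re
import pandas as pd

def _numbers(text):
    """Return every number found in the text as a float (hypothetical helper)."""
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(text))]

def min_height(text):
    """Smallest number in the range string, or NaN if none is found."""
    nums = _numbers(text)
    return min(nums) if nums else float("nan")

def max_height(text):
    """Largest number in the range string, or NaN if none is found."""
    nums = _numbers(text)
    return max(nums) if nums else float("nan")

# Hypothetical height values; the "low-high" format is an assumption.
df = pd.DataFrame({"height": ["24-28", "13-15", "10"]})

# apply runs the function on each value; the results are saved as new columns.
df["min_height"] = df["height"].apply(min_height)
df["max_height"] = df["height"].apply(max_height)
print(df)
```

Returning NaN for values without digits keeps missing heights from raising an error during apply.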
After extracting the min and max values from the string, we use Pandas' plot with kind="hist" to plot the distribution of the continuous variable.
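A short sketch of the histogram call, using invented numbers in a hypothetical max_height column:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import pandas as pd

# Hypothetical numeric heights, as they might look after extraction.
df = pd.DataFrame({"max_height": [28.0, 15.0, 10.0, 24.0, 22.0, 27.0]})

# kind="hist" bins the values and draws one bar per bin.
ax = df["max_height"].plot(kind="hist", bins=5, title="Max height (toy data)")
ax.set_xlabel("Max height")
```

The bins argument is optional; it is set here only so that the toy data produces a predictable number of bars.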
Lastly, because we converted each feature from a semi-structured to a structured format, by the end of the project we are able to understand how fur color and common health problems affect the height and longevity of common dog breeds.