Hugging Face's Transformers is an open-source library that provides pre-trained models for natural language processing (NLP) tasks. In this project, we will use the TFDistilBertForSequenceClassification model for sequence classification, along with the DistilBertConfig and DistilBertTokenizer classes from the same library. The goal of this project is to demonstrate how these tools can be used together to perform sequence classification on text data. We will start by loading the necessary libraries, then move on to exploring the data and implementing the sequence classification task.
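As a starting point, a minimal sketch of those imports might look like the following. The `distilbert-base-uncased` checkpoint and `num_labels=2` are assumptions for illustration; your own checkpoint and label count may differ.

```python
# Minimal sketch: loading the DistilBERT tools discussed above.
# Assumes the `transformers` library is installed (pip install transformers).
from transformers import (
    DistilBertConfig,
    DistilBertTokenizer,
    TFDistilBertForSequenceClassification,
)

# Tokenizer and config from a pre-trained checkpoint; num_labels=2 is an
# assumption for a binary classification task.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
config = DistilBertConfig.from_pretrained("distilbert-base-uncased", num_labels=2)

# TensorFlow model with a sequence-classification head on top.
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", config=config
)
```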
In this project, we'll see that traditional NLP preprocessing, such as removing punctuation and stopwords or stemming words to their root form, tends not to add value to our predictions. Although this classical preprocessing isn't as valuable with transformers, there is one special type of preprocessing we do need for the BERT family of models: adding a [CLS] token at the beginning of each sequence and a [SEP] token at the end. The exact special tokens differ between transformers, but BERT was trained with these tags, so we will get the best predictions if we format our inputs the same way here.
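Conveniently, the tokenizer can insert these tags for us. Here is a small sketch showing this; the example sentence is made up, and the `distilbert-base-uncased` checkpoint is again an assumption:

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

text = "Transformers make classical preprocessing largely unnecessary."

# add_special_tokens=True is the default, so [CLS] and [SEP] are added
# around the sequence automatically.
encoded = tokenizer(text)

# Convert the ids back to tokens to see the special tags in place:
# the list begins with '[CLS]' and ends with '[SEP]'.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

Because the tokenizer handles this for us, we won't need to prepend or append these tags by hand when we build the training inputs later on.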