Difference Between Tokenizer and TokenizerFast in Hugging Face's transformers
To Tokenizer or TokenizerFast? What's the difference between these two tokenizers in Hugging Face's transformers library, and when would it make sense to use one versus the other? We'll use a 650,000-row Yelp Review dataset to compare the speed of each tokenizer and see if Fast really means faster. In this lesson, we will use ElectraTokenizer and ElectraTokenizerFast, although the same approach would work with any of the models in Hugging Face that have a fast version available. We'll also show you how to decode the input ids back to words after tokenization.
ElectraTokenizer uses the original, Python-based tokenization implementation. It may not be as fast as the alternatives but is more straightforward and easier to understand.
ElectraTokenizerFast is backed by Hugging Face's tokenizers library, which is written in Rust and provides faster tokenization. This makes it more efficient for large-scale language processing tasks.
ElectraTokenizerFast is designed for better performance and is generally faster than ElectraTokenizer. If you're working with large datasets or require faster tokenization, you might prefer using ElectraTokenizerFast.
ElectraTokenizer might have fewer dependencies as it relies on the original tokenization code.
ElectraTokenizerFast relies on the tokenizers library, which has additional dependencies but offers improved performance.
When choosing between the two, consider your specific use case. If performance is a critical factor, especially for large-scale applications, you might opt for ElectraTokenizerFast. However, if simplicity and minimal dependencies are more important to you, ElectraTokenizer might be a suitable choice.
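To check the speed difference on your own data, a timing harness along the following lines may help. This is a minimal sketch, not the code from the lesson: the helper names time_tokenizer and compare_electra_tokenizers, the batch size, and the checkpoint name google/electra-small-discriminator are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Sequence


def time_tokenizer(tokenize: Callable[[list], object],
                   texts: Sequence[str],
                   batch_size: int = 1000) -> float:
    """Return the seconds spent running `tokenize` over `texts` in batches."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        tokenize(list(texts[i:i + batch_size]))
    return time.perf_counter() - start


def compare_electra_tokenizers(texts: Sequence[str]) -> None:
    """Time the slow and fast Electra tokenizers on the same texts."""
    # Imported here so time_tokenizer can be reused without transformers installed.
    from transformers import ElectraTokenizer, ElectraTokenizerFast

    # Assumption: illustrative checkpoint, not necessarily the one used in the lesson.
    checkpoint = "google/electra-small-discriminator"
    slow = ElectraTokenizer.from_pretrained(checkpoint)
    fast = ElectraTokenizerFast.from_pretrained(checkpoint)

    slow_s = time_tokenizer(lambda batch: slow(batch, truncation=True), texts)
    fast_s = time_tokenizer(lambda batch: fast(batch, truncation=True), texts)
    print(f"slow: {slow_s:.2f}s  fast: {fast_s:.2f}s  speedup: {slow_s / fast_s:.1f}x")
```

Calling compare_electra_tokenizers with a list of review strings prints both timings and the ratio. The exact speedup will depend on the hardware, the batch size, and the length of the texts, so treat any single number as indicative only.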
As we can see from the video, the Hugging Face fast tokenizer was extremely fast in comparison to the normal tokenizer. This is because it's built with Rust rather than Python. Using them is exactly the same; the fast tokenizer is just considerably faster, by almost 20 times, which makes it an important choice if you need to tokenize in a production environment.
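Because both tokenizers share the same interface, decoding the input ids back to words looks the same for either one. The sketch below assumes a hypothetical helper called round_trip and, again, an illustrative checkpoint name; convert_ids_to_tokens, decode, and the is_fast attribute are part of the transformers tokenizer API.

```python
from typing import Any, Dict, List


def round_trip(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Encode `text`, then map the ids back to tokens and to a string."""
    input_ids: List[int] = tokenizer(text)["input_ids"]
    return {
        "input_ids": input_ids,
        # One token string per id, including special tokens such as [CLS] and [SEP].
        "tokens": tokenizer.convert_ids_to_tokens(input_ids),
        # A single string with special tokens dropped.
        "text": tokenizer.decode(input_ids, skip_special_tokens=True),
    }


def demo() -> None:
    """Run the same round trip through the slow and the fast tokenizer."""
    from transformers import ElectraTokenizer, ElectraTokenizerFast

    # Assumption: illustrative checkpoint, not necessarily the one used in the lesson.
    checkpoint = "google/electra-small-discriminator"
    for cls in (ElectraTokenizer, ElectraTokenizerFast):
        tok = cls.from_pretrained(checkpoint)
        result = round_trip(tok, "Fast really means faster.")
        print(cls.__name__, "is_fast =", tok.is_fast)
        print(result["tokens"])
        print(result["text"])
```

Calling demo() prints the token list and the decoded string for each class, which makes it easy to confirm that switching to the fast version does not change how the rest of your code reads.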