
Difference Between Tokenizer and TokenizerFast with Hugging Face's transformers


To Tokenizer or TokenizerFast? What's the difference between these two tokenizers in the Hugging Face transformers library, and when would it make sense to use one versus the other?  We'll use a 650,000-row Yelp Review dataset to compare the speed of each tokenizer and see if Fast really means faster.  In this lesson we will use ElectraTokenizer and ElectraTokenizerFast, although the same approach works with any model in Hugging Face that has a fast tokenizer available.  We'll also show you how to decode the input ids back to words after tokenization.
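Before digging into the details, here is a minimal sketch of both tokenizers in use, including decoding the input ids back to words. The checkpoint name google/electra-small-discriminator and the sample sentence are just placeholders for illustration; any Electra checkpoint with a fast variant would work the same way.

```python
from transformers import ElectraTokenizer, ElectraTokenizerFast

checkpoint = "google/electra-small-discriminator"  # assumed checkpoint for the demo
slow_tokenizer = ElectraTokenizer.from_pretrained(checkpoint)
fast_tokenizer = ElectraTokenizerFast.from_pretrained(checkpoint)

text = "The food was great but the service was slow."

# Both tokenizers are called the same way and return input_ids plus attention_mask.
slow_encoding = slow_tokenizer(text, truncation=True, max_length=128)
fast_encoding = fast_tokenizer(text, truncation=True, max_length=128)

print(slow_encoding["input_ids"])
print(fast_encoding["input_ids"])

# Decode the input ids back to words (special tokens like [CLS]/[SEP] are skipped).
print(slow_tokenizer.decode(slow_encoding["input_ids"], skip_special_tokens=True))
print(fast_tokenizer.decode(fast_encoding["input_ids"], skip_special_tokens=True))
```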


The main difference between Hugging Face's ElectraTokenizer and ElectraTokenizerFast lies in their underlying implementations and performance characteristics.


Implementation:

ElectraTokenizer uses the original, Python-based tokenization implementation. It may not be as fast as the alternatives but is more straightforward and easier to understand.


ElectraTokenizerFast is implemented with Hugging Face's tokenizers library, which is implemented in Rust and provides faster tokenization. This makes it more efficient for large-scale language processing tasks.
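You don't have to pick the class by hand; as a rough sketch, AutoTokenizer can load either implementation for you via the use_fast flag (the checkpoint name below is again an assumption):

```python
from transformers import AutoTokenizer

checkpoint = "google/electra-small-discriminator"  # assumed checkpoint

# use_fast=True (the default) returns the Rust-backed tokenizer when one exists;
# use_fast=False forces the original Python implementation.
slow_tok = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
fast_tok = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)

print(type(slow_tok).__name__)  # ElectraTokenizer
print(type(fast_tok).__name__)  # ElectraTokenizerFast

# Every tokenizer exposes is_fast so you can check which one you got.
print(slow_tok.is_fast, fast_tok.is_fast)  # False True
```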


Performance:

ElectraTokenizerFast is designed for better performance and is generally faster than ElectraTokenizer. If you're working with large datasets or require faster tokenization, you might prefer using ElectraTokenizerFast.
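Here is a rough timing sketch along the lines of the comparison in the video. The 20,000-row slice of the Yelp Review training split and the checkpoint name are assumptions to keep the demo quick; exact numbers will depend on your hardware.

```python
import time
from datasets import load_dataset
from transformers import ElectraTokenizer, ElectraTokenizerFast

checkpoint = "google/electra-small-discriminator"  # assumed checkpoint
slow_tokenizer = ElectraTokenizer.from_pretrained(checkpoint)
fast_tokenizer = ElectraTokenizerFast.from_pretrained(checkpoint)

# Take a slice of the 650,000-row training split so the demo finishes quickly.
texts = load_dataset("yelp_review_full", split="train[:20000]")["text"]

def time_tokenizer(tokenizer, texts):
    # Tokenize the whole batch of texts and return the elapsed wall-clock time.
    start = time.perf_counter()
    tokenizer(texts, truncation=True, max_length=256)
    return time.perf_counter() - start

print(f"slow tokenizer: {time_tokenizer(slow_tokenizer, texts):.1f}s")
print(f"fast tokenizer: {time_tokenizer(fast_tokenizer, texts):.1f}s")
```

The gap is most visible when you pass a whole list of texts at once, since the fast tokenizer processes the batch in Rust.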


Dependencies:

ElectraTokenizer might have fewer dependencies as it relies on the original tokenization code.


ElectraTokenizerFast relies on the tokenizers library, which has additional dependencies but offers improved performance.


When choosing between the two, consider your specific use case. If performance is a critical factor, especially for large-scale applications, you might opt for ElectraTokenizerFast. However, if simplicity and minimal dependencies are more important to you, ElectraTokenizer might be a suitable choice.

Summary

As we can see from the video, the Hugging Face fast tokenizer was extremely fast in comparison to the normal tokenizer.  This is because it's built with Rust rather than Python.  Using them is exactly the same; the fast tokenizer is simply considerably faster, by almost 20 times, which makes it an important choice if you need to tokenize in a production environment.

Looking at the distribution of sequence lengths is very important in an NLP problem so we can understand where to truncate our sequences with the Hugging Face tokenizer.
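As a sketch of that idea, you can tokenize a sample of the data without truncation and inspect the length percentiles to pick a sensible max_length; the dataset slice and the 95th-percentile cut-off below are just illustrative assumptions.

```python
import numpy as np
from datasets import load_dataset
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
texts = load_dataset("yelp_review_full", split="train[:20000]")["text"]

# Tokenize without truncation so we see the true token counts per review.
lengths = [len(ids) for ids in tokenizer(texts)["input_ids"]]

print("mean length:", np.mean(lengths))
print("95th percentile:", np.percentile(lengths, 95))
print("max length:", np.max(lengths))
```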
Comparing the speed of the tokenizer and the fast tokenizer in the transformers library shows that the fast tokenizer is incredibly fast, by almost 20 times.