Fast State-of-the-art tokenizers, optimized for both research and production. Tokenizers provides an implementation of today’s most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in Transformers. Train new vocabularies and tokenize, using today’s most used tokenizers. Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU. Easy to use, but also extremely versatile. Designed for both research and production. Full alignment tracking. Even with destructive normalization, it’s always possible to get the part of the original sentence that corresponds to any token. Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.

Features

  • Train new vocabularies and tokenize, using today’s most used tokenizers
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU
  • Easy to use, but also extremely versatile
  • Designed for both research and production
  • Full alignment tracking
  • Truncation, Padding, add the special tokens your model needs

Project Samples

Project Activity

See All Activity >

License

Apache License V2.0

Follow Tokenizers

Tokenizers Web Site

You Might Also Like
MongoDB Atlas runs apps anywhere Icon
MongoDB Atlas runs apps anywhere

Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
Start Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of Tokenizers!

Additional Project Details

Programming Language

Rust

Related Categories

Rust Artificial Intelligence Software, Rust Machine Learning Software

Registered

2023-03-23