ToMe (Token Merging) is a PyTorch-based optimization framework that significantly accelerates Vision Transformer (ViT) architectures without retraining. Developed by researchers at Meta AI (Facebook), ToMe merges similar tokens within transformer layers, cutting redundant computation while preserving model accuracy. Unlike token pruning, which discards (typically background) tokens outright, ToMe merges tokens based on feature similarity, so it can compress both foreground and background information. ToMe integrates seamlessly into existing transformer models such as DeiT, MAE, SWAG, and timm ViTs, offering 2–3× speedups at inference and substantial efficiency gains during training. The method can be applied dynamically at inference time or incorporated into training for improved accuracy.
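To make the merging idea concrete, the sketch below merges the `r` most similar token pairs between two alternating token sets and averages them. It is a simplified illustration of the concept only, not the repository's bipartite soft matching implementation: `merge_similar_tokens` is a hypothetical name, and the real method measures similarity on attention keys and uses a size-weighted average rather than the plain mean shown here.

```python
# Conceptual sketch of similarity-based token merging (illustrative only).
import torch


def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs in x of shape (batch, tokens, dim)."""
    # Split tokens into two alternating sets (a bipartite partition).
    a, b = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity between every token in `a` and every token in `b`.
    scores = torch.einsum(
        "bid,bjd->bij",
        torch.nn.functional.normalize(a, dim=-1),
        torch.nn.functional.normalize(b, dim=-1),
    )

    # For each token in `a`, find its best match in `b`, then merge away
    # the r highest-scoring tokens and keep the rest.
    best_vals, best_idx = scores.max(dim=-1)                  # (batch, |a|)
    merge_order = best_vals.argsort(dim=-1, descending=True)
    src_idx = merge_order[:, :r]                              # tokens in `a` to merge away
    keep_idx = merge_order[:, r:]                             # tokens in `a` to keep
    dst_idx = best_idx.gather(-1, src_idx)                    # their targets in `b`

    d = x.shape[-1]
    kept_a = a.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    src = a.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, d))

    # Fold each merged token into its destination in `b` by averaging
    # (ToMe tracks token "size" for a weighted average; plain mean here).
    b = b.scatter_reduce(
        1, dst_idx.unsqueeze(-1).expand(-1, -1, d), src,
        reduce="mean", include_self=True,
    )

    return torch.cat([kept_a, b], dim=1)


tokens = torch.randn(2, 196, 768)            # e.g. ViT-B/16 patch tokens
merged = merge_similar_tokens(tokens, r=16)
print(merged.shape)                          # torch.Size([2, 180, 768])
```

Applying such a step in every block shrinks the token count progressively through the network, which is where the compute savings come from.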
## Features
- Supports both ImageNet evaluation and research extensions
- Open-source PyTorch patching tools for quick integration into existing models (see the usage sketch after this list)
- Offers pretrained checkpoints for DeiT, ViT-B/L/H, and MAE models
- Can be applied without retraining or integrated during training for better results
- Compatible with timm, SWAG, and MAE ViT implementations
- Provides 2–3× inference speedup with minimal accuracy loss
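Below is a minimal usage sketch of the patching workflow for a timm ViT: load a stock pretrained model, patch it in place, and set the per-layer reduction `r`. The `tome.patch.timm` entry point and the `model.r` attribute are assumed here from the project's documented interface (check the repository for the exact API), and the timing loop is only a rough illustration of the claimed speedup, not a rigorous benchmark.

```python
import time

import timm
import torch
import tome  # installed from the facebookresearch/ToMe repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load an off-the-shelf ViT from timm; no ToMe-specific training is required.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval().to(device)

# Patch the model in place so its blocks merge tokens, then choose how many
# token pairs to merge per layer (larger r -> faster, slightly less accurate).
tome.patch.timm(model)
model.r = 16

# Rough throughput check on random data (illustrative only).
x = torch.randn(32, 3, 224, 224, device=device)
with torch.no_grad():
    model(x)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{10 * x.shape[0] / elapsed:.1f} img/s")
```

Running the same timing loop before and after the `tome.patch.timm` call gives a quick sense of the inference speedup on your own hardware; accuracy can then be checked with a standard ImageNet evaluation.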