
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2022)

Journal

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS

Volume 10, Pages 291-306

Publisher

MIT PRESS

DOI: 10.1162/tacl_a_00461

Automated Summary

Most widely used language models are based on token sequences, but token-free models operating directly on raw text have many advantages. This study shows that a standard Transformer architecture can be used with minimal modifications to process byte sequences, and byte-level models are more robust and perform better on spelling and pronunciation tasks.

Abstract

Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: They can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
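To make the idea of operating directly on bytes concrete, the following is a minimal sketch of how text in any language can be turned into a sequence of integer IDs using nothing but UTF-8 encoding. It is an illustration only, not the released ByT5 preprocessing code: the helper names `to_byte_ids` and `from_byte_ids` are hypothetical, and the sketch ignores any special-token IDs that a real model vocabulary would reserve.

```python
# Illustrative sketch only: hypothetical helpers, not the ByT5 codebase.
# Each UTF-8 byte (0-255) is used directly as an input ID, so no
# language-specific vocabulary or tokenizer training is needed.

def to_byte_ids(text: str) -> list[int]:
    """Return one integer ID per UTF-8 byte of `text`."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert `to_byte_ids`; malformed byte runs become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for sample in ["hello world", "héllo wörld", "こんにちは"]:
        ids = to_byte_ids(sample)
        # Byte sequences are longer than character or word sequences,
        # which is the cost the abstract refers to.
        print(f"{sample!r}: {len(sample)} chars -> {len(ids)} byte IDs")
        assert from_byte_ids(ids) == sample
```

The same two functions handle English, accented Latin text, and Japanese without modification, which illustrates the "any language out of the box" property; the price is a longer input sequence, most visibly for scripts whose characters take several bytes each.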
