2D Tokenization for Large Language Models
This article covers how text is processed before being passed to large language models, the problems with that approach, and an alternative solution.
May 2, 2023 · 18 min read
The Problem With (1D) Tokenization
Before text is passed to a Large Language Model (LLM), it is broken down into a sequence of words and sub-words. Each token in this sequence is then replaced with an integer ID, and that sequence of integers is what the model actually receives.
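To make this concrete, here is a minimal sketch of the text-to-integers pipeline. The vocabulary and the greedy longest-match splitting rule below are illustrative assumptions for this example, not a real tokenizer such as BPE or WordPiece.

```python
def tokenize(text, vocab):
    """Greedily split each whitespace-separated word into the longest
    sub-words found in `vocab` (a toy stand-in for real subword tokenizers)."""
    tokens = []
    for word in text.split():
        while word:
            # Find the longest prefix of `word` present in the vocabulary.
            for end in range(len(word), 0, -1):
                if word[:end] in vocab:
                    tokens.append(word[:end])
                    word = word[end:]
                    break
            else:
                # No prefix matched: emit an unknown token, skip one character.
                tokens.append("<unk>")
                word = word[1:]
    return tokens

# Toy vocabulary mapping sub-words to integer IDs.
vocab = {"token": 0, "iz": 1, "ation": 2, "is": 3, "fun": 4, "<unk>": 5}

tokens = tokenize("tokenization is fun", vocab)
ids = [vocab[t] for t in tokens]
print(tokens)  # ['token', 'iz', 'ation', 'is', 'fun']
print(ids)     # [0, 1, 2, 3, 4]
```

Real tokenizers learn their vocabulary from data and use more sophisticated merge rules, but the end product is the same: a 1D sequence of integers.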