2D Tokenization for Large Language Models

This article covers how text is processed before being passed to large language models, the problems with that approach, and an alternative solution.

David Gilbertson
Published in Better Programming
18 min read · May 2, 2023


The Problem With (1D) Tokenization

When text is passed to a Large Language Model (LLM), it is first broken into a sequence of tokens: words and sub-words. Each token is then replaced with an integer, and that sequence of integers is what the model actually receives.
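To make that concrete, here is a minimal sketch using Hugging Face's GPT-2 tokenizer (the library choice is my assumption; the article itself does not name a tokenizer, and any sub-word tokenizer behaves the same way):

```python
# A minimal sketch of standard (1D) tokenization. The GPT-2 tokenizer
# is an arbitrary choice; any sub-word tokenizer works the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization breaks text into sub-words."

# Step 1: break the text into a sequence of word and sub-word tokens
tokens = tokenizer.tokenize(text)
print(tokens)  # a flat list of sub-word strings, e.g. ['Token', 'ization', ...]

# Step 2: replace each token with its integer ID; this 1D sequence
# of integers is what actually gets passed to the model
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)     # a flat list of integers, one per token
```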
