2D Tokenization for Large Language Models

This article covers how text is processed before being passed to large language models, the problems with that approach, and an alternative solution.

David Gilbertson
Published in Better Programming
18 min read · May 2, 2023


The Problem With (1D) Tokenization

When text is passed to a Large Language Model (LLM), it is first broken into a sequence of tokens: words and sub-words. Each token is then replaced with an integer, and that sequence of integers is what the model actually receives.
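To make that concrete, here is a minimal sketch using Hugging Face's GPT-2 tokenizer (the library choice is my assumption; the article itself does not name a tokenizer, and any sub-word tokenizer behaves the same way):

```python
# A minimal sketch of standard (1D) tokenization. The GPT-2 tokenizer
# is an arbitrary choice; any sub-word tokenizer works the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization breaks text into sub-words."

# Step 1: break the text into a sequence of word and sub-word tokens
tokens = tokenizer.tokenize(text)
print(tokens)  # a flat list of sub-word strings, e.g. ['Token', 'ization', ...]

# Step 2: replace each token with its integer ID; this 1D sequence
# of integers is what actually gets passed to the model
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)     # a flat list of integers, one per token
```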
