Prompt Tokens Limit Explained

Did you know that in the macOS Terminal, you can list all the words of the built-in English dictionary using cat /usr/share/dict/words?

## List words
cat /usr/share/dict/words

## Copy words into a .txt file
cat /usr/share/dict/words > ~/Desktop/words.txt

## Count # of words
cat ~/Desktop/words.txt | wc -l
## 235976

Now, let's say we want to convert this list into an array of words, adding a few extra pieces of information to each entry, like the word's frequency and its meaning.

[
   { word: '', meaning: '', frequency: ''},
   { word: '', meaning: '', frequency: ''},
   { word: '', meaning: '', frequency: ''},
   // ...
]

This seemed like the perfect task for ChatGPT, especially since we can provide structure and format via prompts to get just the kind of response we want. But here’s the catch: ChatGPT isn’t really designed for handling huge data operations. It would probably suggest breaking that big dataset into smaller, more manageable pieces—way smaller pieces. But with 235,976 words to process, even that approach feels pretty impractical.

So, writing a little Python program turned out to be the best way forward. But along the way, I learned a ton about token limits, and here’s what I discovered.
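
For the curious, here's roughly the shape that little program took. This is a sketch rather than the exact script: it assumes the words.txt file from earlier, and the describe_words() helper is a hypothetical placeholder for the part that would actually prompt the model.

# Sketch: split the dictionary into small batches so each prompt stays under the token limit
from pathlib import Path

WORDS_FILE = Path.home() / "Desktop" / "words.txt"   # file created with cat earlier
BATCH_SIZE = 50                                       # small enough to leave room for the response

def describe_words(batch):
    # Placeholder: a real implementation would prompt the model here and parse its reply
    return [{"word": w, "meaning": "", "frequency": ""} for w in batch]

def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

words = WORDS_FILE.read_text().split()
results = []
for batch in batches(words, BATCH_SIZE):
    results.extend(describe_words(batch))   # ~4,720 batches for 235,976 words at 50 per batch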

Tokens come from tokenization, the process of breaking text down into individual words, subwords, or symbols, each of which is referred to as a "token".

For example, the sentence:

"The quick brown fox jumps over the lazy dog."

Would be tokenized into:

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
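
As a rough illustration (real GPT tokenizers actually work on subwords, more on that below), a naive word-level tokenizer in Python is little more than a regular-expression split:

import re

sentence = "The quick brown fox jumps over the lazy dog."
tokens = re.findall(r"[A-Za-z]+", sentence)   # naive word-level split, punctuation dropped
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']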

The token limit for any given AI model, such as ChatGPT, is determined by several key factors. These factors balance the technical capabilities of the model, the infrastructure requirements, and the practical needs of users.

Token Limits and ChatGPT

The token limit for ChatGPT depends on the underlying model. As of writing this article, the standard GPT-3.5-turbo model behind ChatGPT allows 4,096 tokens per interaction, while GPT-4 variants offer larger context windows (8,192 or 32,768 tokens). Whatever the limit, it includes both the input (the prompt or conversation history) and the output (the model's response).

Here's how it breaks down:

  • Input Tokens: These are the tokens used in the prompt or the conversation history you provide.
  • Output Tokens: These are the tokens in the response generated by the model.

For example, with a 4,096-token limit, if your input prompt uses 3,000 tokens, the model has up to 1,096 tokens left to generate a response, since the total cannot exceed 4,096 tokens.
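
In code, that budget is simple arithmetic. A minimal sketch, assuming the 4,096-token window described above:

CONTEXT_WINDOW = 4096   # total tokens shared by the prompt and the response

def remaining_output_tokens(prompt_tokens, context_window=CONTEXT_WINDOW):
    """How many tokens are left for the model's response."""
    return max(0, context_window - prompt_tokens)

print(remaining_output_tokens(3000))   # 1096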

Tokens generally correspond to chunks of words. For instance:

  • A short word like "cat" might be a single token.
  • A longer word like "artificial" might be split into multiple tokens ("arti" and "ficial").
  • Common words or phrases might be a single token.
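
You don't have to guess at these splits: OpenAI's tiktoken library exposes the same tokenizer the models use. A small sketch (the exact splits depend on which encoding you load):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # encoding used by recent OpenAI chat models
for word in ["cat", "artificial", "unbelievably"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]       # turn each token id back into text
    print(word, "->", len(ids), "token(s):", pieces)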

This token limit keeps the model's processing within manageable computational bounds while still allowing it to deliver coherent responses.

Here's what ChatGPT told me about how the token limit for a model is carefully set, based on a combination of:

  • Technical constraints, such as model architecture and computational resources;
  • Practical considerations, such as user needs and application scenarios; and
  • Operational factors, such as cost management and infrastructure capabilities.

The goal for a model developer is to strike a balance: a token limit that performs well across a range of tasks while keeping the model efficient, scalable, and cost-effective. Here's a breakdown of the most significant deciding factors behind that balance.

1. Model Architecture

  • Transformer Architecture: Models like GPT-3 and GPT-4 are based on the transformer architecture, which processes input data in parallel across multiple layers. Each token processed requires memory and computational resources, and the transformer architecture has inherent limitations on how much data can be processed at once. The token limit is set to ensure that the model can operate efficiently within these architectural constraints.
  • Attention Mechanism: Transformers use an attention mechanism that scales with the square of the number of tokens. As the number of tokens increases, the computational cost and memory usage grow significantly. The token limit is therefore set to balance the computational load and performance.
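
That quadratic growth is easy to see with a back-of-the-envelope calculation: the attention score matrix has one entry per pair of tokens, so doubling the context quadruples the work. A rough sketch, assuming a single n-by-n matrix of 4-byte floats per layer and head:

# Rough memory for one n x n attention score matrix (float32)
for n in (1_024, 4_096, 16_384):
    entries = n * n
    megabytes = entries * 4 / (1024 ** 2)
    print(f"{n:>6} tokens -> {entries:>12,} entries -> {megabytes:,.0f} MB")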

2. Computational Resources

  • Memory Usage: Each token processed by the model consumes memory (RAM), and there’s a practical limit to how much memory can be allocated for any single model instance, especially in a distributed cloud environment.
  • Inference Time: The more tokens the model needs to process, the longer it takes to generate a response. To ensure that the model remains responsive, the token limit is set to avoid excessive delays in processing.
  • Scalability: In cloud-based environments, models need to scale across many users simultaneously. Setting a token limit helps manage the computational load across servers, ensuring that the service can scale to meet demand without significant degradation in performance.

3. Training Data Constraints

  • Contextual Understanding: While larger token limits allow for more context, they can also introduce noise or make it harder for the model to focus on relevant parts of the input. The token limit is often a balance between providing enough context for the model to generate coherent and relevant outputs, without overwhelming it with too much information.
  • Training Efficiency: During the training phase, models are exposed to a vast amount of data. Longer sequences of tokens require more computational resources to train. The token limit is partially informed by what is feasible during the training process in terms of both time and cost.

4. User Needs and Application Scenarios

  • Practical Use Cases: The token limit is set to accommodate typical use cases, such as conversations, document summarization, or code generation. The limit ensures that the model can handle these tasks effectively without exceeding what’s necessary for most practical applications.
  • User Experience: Token limits are also determined with the user experience in mind. The goal is to provide enough capacity for meaningful interactions without overcomplicating the usage of the model or leading to partial outputs due to token overflow.

5. Infrastructure and Cost Management

  • Server Load: In a commercial setting, AI models run on clusters of GPUs or specialized hardware. The token limit helps manage server load and ensures that the infrastructure can serve many users concurrently without degrading performance.
  • Cost Considerations: Higher token limits mean more computational resources per query, which directly translates to higher operational costs. The token limit is often a balance between providing sufficient capacity for users while keeping operational costs within a sustainable range.

6. Security and Misuse Prevention

  • Abuse Mitigation: A lower token limit can help prevent certain types of abuse, such as generating extremely large outputs or attempting to exploit the model by feeding it excessive or harmful content. By capping the number of tokens, providers can reduce the risk of such activities.

End