Selecting the right model is no small feat. Legal approvals, alignment with the use case, model size constraints, and open-source availability are all critical considerations. Yet, one increasingly important — and often overlooked — factor in model selection today is supported context length. For many real-world use cases, bigger is better.

While working with a customer, I was asked a pointed question:

“We have a use case where our input data requires processing up to 16k tokens in a single request, but the model we’re currently using supports only 8k tokens. What are our options to bridge this gap?”

I started evaluating possible solutions. Swapping out the model wasn't viable, and retraining the existing model with extended positional embeddings wasn't straightforward.

That’s when a thought occurred: Could we extend the model’s context length at inference time — without retraining or compromising accuracy?

A quick search led me to a promising answer: vLLM, the inference engine included with the Red Hat AI stack. It supports RoPE scaling out of the box, allowing us to extend a model's context length easily and effectively with negligible impact on accuracy.

This blog captures the process we followed and the steps we took to enable this capability for our customer.

The model in question is Sarvam-1, a slightly older yet highly capable model that is especially well suited for Indian languages. It has a small footprint of 2.5B parameters, is built on the LLaMA architecture, and uses BF16 tensor precision. As per its config.json, the model has a max_position_embeddings value of 8192, meaning it was trained to process input sequences up to 8192 tokens in length.
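
As a quick sanity check, you can read that value straight from the model repository. This is a minimal sketch; it assumes config.json is reachable at the usual Hugging Face raw-file path and that your HF_TOKEN has access to the repo:

# Fetch the model's config.json and check the trained context length
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/sarvamai/sarvam-1/raw/main/config.json \
  | grep max_position_embeddings

# Expected output:
#   "max_position_embeddings": 8192,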

Here is how the model was deployed using the vLLM inference engine on Red Hat Enterprise Linux 9 with 4 A10 GPUs, running on AWS in the India region.

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4GB \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env VLLM_NO_USAGE_STATS=1 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
  --model sarvamai/sarvam-1 \
  --tensor-parallel-size=4

After successfully loading the model, I was able to interact with it using an API endpoint that is compatible with the OpenAI API specification. This means I could send requests and receive responses just like I would with OpenAI’s own API, making it easy to integrate the model into existing applications or workflows without changing the client code.
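
For example, the server can be queried like any OpenAI-style endpoint. The snippet below is a sketch; the prompt and max_tokens values are placeholders rather than the exact requests we sent:

# List the served models via the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/models

# Send a short completion request, exactly as you would to the OpenAI API
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sarvamai/sarvam-1",
        "prompt": "The capital of India is",
        "max_tokens": 64
      }'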

Everything worked fine until I set max_tokens to 8500. Raising the output token budget was simply a quick way to test how the model handles a longer context; in real use cases it is usually the input prompt that is large rather than the completion, but the model's limit applies to the combined total either way. A request along the lines of the one below reproduces the issue.
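
This is a sketch with a placeholder prompt; in our test the prompt came to about 45 tokens:

# Request more completion tokens than the trained context window allows
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sarvamai/sarvam-1",
        "prompt": "Write a detailed essay on the history of the Indian railways.",
        "max_tokens": 8500
      }'
# prompt tokens + 8500 requested completion tokens exceed the 8192-token window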

And this is what I observed on the server side. The request never reaches the model: vLLM rejects it during prompt preprocessing, and the log spells out the token accounting behind the resulting 400 response:

ERROR 06-13 22:56:59 [serving_completion.py:116] Error in preprocessing prompt inputs
ERROR 06-13 22:56:59 [serving_completion.py:116] Traceback (most recent call last):
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 108, in create_completion
ERROR 06-13 22:56:59 [serving_completion.py:116]     request_prompts, engine_prompts = await self._preprocess_completion(
ERROR 06-13 22:56:59 [serving_completion.py:116]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 350, in _preprocess_completion
ERROR 06-13 22:56:59 [serving_completion.py:116]     request_prompts = await self._tokenize_prompt_input_or_inputs_async(
ERROR 06-13 22:56:59 [serving_completion.py:116]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 06-13 22:56:59 [serving_completion.py:116]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-13 22:56:59 [serving_completion.py:116]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 326, in _tokenize_prompt_input_or_inputs
ERROR 06-13 22:56:59 [serving_completion.py:116]     return [
ERROR 06-13 22:56:59 [serving_completion.py:116]            ^
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 327, in <listcomp>
ERROR 06-13 22:56:59 [serving_completion.py:116]     self._normalize_prompt_text_to_input(
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 184, in _normalize_prompt_text_to_input
ERROR 06-13 22:56:59 [serving_completion.py:116]     return self._validate_input(request, input_ids, input_text)
ERROR 06-13 22:56:59 [serving_completion.py:116]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 22:56:59 [serving_completion.py:116]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 247, in _validate_input
ERROR 06-13 22:56:59 [serving_completion.py:116]     raise ValueError(
ERROR 06-13 22:56:59 [serving_completion.py:116] ValueError: This model's maximum context length is 8192 tokens. However, you requested 8545 tokens (45 in the messages, 8500 in the completion). Please reduce the length of the messages or completion.
INFO:     182.69.85.130:50174 - "POST /v1/completions HTTP/1.1" 400 Bad Request

To increase the model's context length at the inference level, I stopped the model server and restarted it with the following additional parameters, which extend the context length of Sarvam-1 using the dynamic RoPE scaling method (rope_scaling):

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4GB \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env=VLLM_NO_USAGE_STATS=1 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
  --model sarvamai/sarvam-1 \
  --tensor-parallel-size=4 \
  --rope-scaling '{"rope_type":"dynamic","factor":2.0}' \
  --rope-theta 1000000.0

--rope-scaling

This flag enables RoPE (Rotary Position Embedding) scaling, which allows a model to extend its context length beyond what it was originally trained with. RoPE scaling interpolates or extrapolates the positional embeddings so you can feed in longer sequences than the default (e.g., going from 8k to 16k tokens).

For example:

--rope-scaling '{"rope_type":"dynamic","factor":2.0}'

"rope_type": "dynamic" – the type of scaling (dynamic is generally preferred over plain linear scaling).
"factor": 2.0 – doubles the context length (8192 → 16384 tokens).

Types of RoPE Scaling:

linear: linearly stretches (interpolates) the position encoding over the longer window.
dynamic: adjusts the scaling dynamically based on the actual sequence length, which generally extrapolates better than plain linear scaling.
yarn: YaRN, a technique from the research paper of the same name for enhancing length extrapolation, aimed at maintaining performance on lengthy texts (an illustrative invocation is shown below).
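
For illustration, a YaRN-style invocation would look roughly like the following. Treat it as an unverified sketch: the exact JSON keys accepted can vary between vLLM versions, and we did not use YaRN for this model.

--rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":8192}'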

--rope-theta: this changes the base (theta) of the RoPE frequencies. A larger base slows the rotation frequencies, which can improve quality when scaling to longer context lengths.

With these settings, the model's context length was successfully doubled. It could be extended further, but 16K met the requirements of the use case, so we stopped there. Here is the log line reported by the vLLM server confirming the new limit:

INFO 06-13 23:36:13 [kv_cache_utils.py:637] Maximum concurrency for 16,384 tokens per request: 85.04x

And now I can send inference requests to the model with max_tokens set to 15000!
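
For example (a sketch with a placeholder prompt; the prompt plus the 15,000 requested completion tokens must still fit within the new 16,384-token window):

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sarvamai/sarvam-1",
        "prompt": "Summarize the following report: ...",
        "max_tokens": 15000
      }'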

vLLM's out-of-the-box support for RoPE scaling enabled us to extend the model's context length without retraining. We validated the extended configuration with multiple lm_eval runs, which showed only negligible accuracy loss on the customer's summarization use case.
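
For reference, this is roughly the shape of an lm_eval run against the served endpoint. The task selection and model_args below are illustrative assumptions, not a record of our exact invocation:

# Point lm-evaluation-harness at the OpenAI-compatible /v1/completions endpoint
# (illustrative sketch; tasks and arguments are assumptions, not our exact command)
lm_eval \
  --model local-completions \
  --model_args model=sarvamai/sarvam-1,base_url=http://localhost:8000/v1/completions,tokenized_requests=False \
  --tasks hellaswag \
  --batch_size 8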
