Exploring the Llama Stack architecture, and creating a playground to try out a unified API to rule them all for AI application development.
Prerequisites:
1. A system, pod, or container with a GPU and the GPU driver installed (you can verify the driver with nvidia-smi).
Procedure: Server Side
1. Clone the repository
git clone https://github.com/pmukhedk/llamastack.git
2. Install the necessary Python packages
cd llamastack/getting-started
pip3 install -r requirements.yaml
Note: Ensure the Python version is 3.11 or above.
3. Serve a model for inference using vLLM
vllm serve ibm-granite/granite-3.3-2b-instruct
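This starts an OpenAI-compatible API server; by default vLLM listens on port 8000. If you need a different port, pass it explicitly (standard vLLM flag; whatever endpoint you choose must match what llamastack-run.yaml points at):
vllm serve ibm-granite/granite-3.3-2b-instruct --port 8000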
4. Start the Llama Stack server
llama stack run llamastack-run.yaml
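For reference, the inference section of a Llama Stack run config such as llamastack-run.yaml typically resembles the sketch below. This is illustrative, not the exact file from the repository; the url field must point at the vLLM endpoint started in the previous step:
providers:
  inference:
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://localhost:8000/v1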
Procedure: Client Side
1. Install the Llama Stack client CLI and configure the endpoint
pip3 install llama-stack-client
llama-stack-client configure --endpoint http://localhost:8321
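The same endpoint is also reachable from Python through the llama-stack-client SDK; a minimal sketch:
from llama_stack_client import LlamaStackClient

# Point the client at the Llama Stack server started earlier
client = LlamaStackClient(base_url="http://localhost:8321")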
2. List the providers and models
# llama-stack-client providers list
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/providers "HTTP/1.1 200 OK"
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API           ┃ Provider ID    ┃ Provider Type          ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ inference     │ vllm           │ remote::vllm           │
│ inference     │ vllm-mass      │ remote::vllm           │
│ safety        │ llama-guard    │ inline::llama-guard    │
│ agents        │ meta-reference │ inline::meta-reference │
│ vector_io     │ faiss          │ inline::faiss          │
│ datasetio     │ localfs        │ inline::localfs        │
│ scoring       │ basic          │ inline::basic          │
│ eval          │ meta-reference │ inline::meta-reference │
│ post_training │ torchtune      │ inline::torchtune      │
│ tool_runtime  │ tavily-search  │ remote::tavily-search  │
│ telemetry     │ meta-reference │ inline::meta-reference │
│ files         │ localfs        │ inline::localfs        │
└───────────────┴────────────────┴────────────────────────┘
# llama-stack-client models list
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
Available Models
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type   ┃ identifier                     ┃ provider_resource_id                   ┃ metadata  ┃ provider_id ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ llm          │ Llama-4-Scout-17B-16E-W4A16    │ Llama-4-Scout-17B-16E-W4A16            │           │ vllm-mass   │
├──────────────┼────────────────────────────────┼────────────────────────────────────────┼───────────┼─────────────┤
│ llm          │ granite-3.3-2b-instruct        │ ibm-granite/granite-3.3-2b-instruct    │           │ vllm        │
└──────────────┴────────────────────────────────┴────────────────────────────────────────┴───────────┴─────────────┘
Total models: 2
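The same listing is available programmatically; a minimal sketch using the client created above:
# Each entry carries the identifier used in requests and the backing provider
for model in client.models.list():
    print(model.identifier, model.provider_id)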
3. Interact with the models
# llama-stack-client inference chat-completion --message "What is linear algebra? Explain it in 30 words" --model-id Llama-4-Scout-17B-16E-W4A16
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"
ChatCompletionResponse(
completion_message=CompletionMessage(
content='Linear Algebra is a branch of mathematics that deals with vectors, vector spaces, linear
transformations, and systems of linear equations, focusing on solving and representing linear equations.',
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]
),
logprobs=None,
metrics=[
Metric(metric='prompt_tokens', value=23.0, unit=None),
Metric(metric='completion_tokens', value=42.0, unit=None),
Metric(metric='total_tokens', value=65.0, unit=None)
]
)
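The same call can be made from Python using the client from earlier (a minimal sketch; the message and model ID mirror the CLI example above):
response = client.inference.chat_completion(
    model_id="Llama-4-Scout-17B-16E-W4A16",
    messages=[{"role": "user", "content": "What is linear algebra? Explain it in 30 words"}],
)
# The assistant's reply text
print(response.completion_message.content)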