Exploring the Llama Stack architecture, and creating a playground to try out a unified API to rule them all for AI application development.
Prerequisites:
1. A system, pod, or container with a GPU and the GPU driver installed (you can verify the driver with nvidia-smi).
Procedure: Server Side
1. Clone the repository
git clone https://github.com/pmukhedk/llamastack.git
2. Install the necessary Python packages
cd llamastack/getting-started
pip3 install -r requirements.yaml
Note: Ensure the Python version is 3.11 or above.
3. Serve a model for inference using vLLM
vllm serve ibm-granite/granite-3.3-2b-instruct
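This starts an OpenAI-compatible API server; by default vLLM listens on port 8000. If you need a different port, pass it explicitly (standard vLLM flag; whatever endpoint you choose must match what llamastack-run.yaml points at):
vllm serve ibm-granite/granite-3.3-2b-instruct --port 8000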
4. Start the Llama Stack server
llama stack run llamastack-run.yaml
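For reference, the inference section of a Llama Stack run config such as llamastack-run.yaml typically resembles the sketch below. This is illustrative, not the exact file from the repository; the url field must point at the vLLM endpoint started in the previous step:
providers:
  inference:
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://localhost:8000/v1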
Procedure: Client Side
1. Install the Llama Stack client CLI and configure the endpoint
pip3 install llama-stack-client
llama-stack-client configure --endpoint http://localhost:8321
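The same endpoint is also reachable from Python through the llama-stack-client SDK; a minimal sketch:
from llama_stack_client import LlamaStackClient

# Point the client at the Llama Stack server started earlier
client = LlamaStackClient(base_url="http://localhost:8321")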
2. List the providers and models
# llama-stack-client providers list
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/providers "HTTP/1.1 200 OK"
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API           ┃ Provider ID    ┃ Provider Type          ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ inference     │ vllm           │ remote::vllm           │
│ inference     │ vllm-mass      │ remote::vllm           │
│ safety        │ llama-guard    │ inline::llama-guard    │
│ agents        │ meta-reference │ inline::meta-reference │
│ vector_io     │ faiss          │ inline::faiss          │
│ datasetio     │ localfs        │ inline::localfs        │
│ scoring       │ basic          │ inline::basic          │
│ eval          │ meta-reference │ inline::meta-reference │
│ post_training │ torchtune      │ inline::torchtune      │
│ tool_runtime  │ tavily-search  │ remote::tavily-search  │
│ telemetry     │ meta-reference │ inline::meta-reference │
│ files         │ localfs        │ inline::localfs        │
└───────────────┴────────────────┴────────────────────────┘
# llama-stack-client models list
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
Available Models
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type   ┃ identifier                     ┃ provider_resource_id                   ┃ metadata  ┃ provider_id ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ llm          │ Llama-4-Scout-17B-16E-W4A16    │ Llama-4-Scout-17B-16E-W4A16            │           │ vllm-mass   │
├──────────────┼────────────────────────────────┼────────────────────────────────────────┼───────────┼─────────────┤
│ llm          │ granite-3.3-2b-instruct        │ ibm-granite/granite-3.3-2b-instruct    │           │ vllm        │
└──────────────┴────────────────────────────────┴────────────────────────────────────────┴───────────┴─────────────┘
Total models: 2
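The same listing is available programmatically; a minimal sketch using the client created above:
# Each entry carries the identifier used in requests and the backing provider
for model in client.models.list():
    print(model.identifier, model.provider_id)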
3. Interact with the models
# llama-stack-client inference chat-completion --message "What is linear algebra? Explain it in 30 words" --model-id Llama-4-Scout-17B-16E-W4A16
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"
ChatCompletionResponse(
completion_message=CompletionMessage(
content='Linear Algebra is a branch of mathematics that deals with vectors, vector spaces, linear
transformations, and systems of linear equations, focusing on solving and representing linear equations.',
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]
),
logprobs=None,
metrics=[
Metric(metric='prompt_tokens', value=23.0, unit=None),
Metric(metric='completion_tokens', value=42.0, unit=None),
Metric(metric='total_tokens', value=65.0, unit=None)
]
)
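The same call can be made from Python using the client from earlier (a minimal sketch; the message and model ID mirror the CLI example above):
response = client.inference.chat_completion(
    model_id="Llama-4-Scout-17B-16E-W4A16",
    messages=[{"role": "user", "content": "What is linear algebra? Explain it in 30 words"}],
)
# The assistant's reply text
print(response.completion_message.content)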