Saturday, 18 April 2026

What is Ollama Serve (REST API)

Running LLMs locally is becoming increasingly common, and tools like Ollama make it remarkably simple.
But one feature that really unlocks its power is
ollama serve
which turns your local machine into a REST API server for AI models.

When we run:
ollama serve

It starts a local web server. This server allows other applications to talk to your AI models using HTTP requests.

Without serve → You manually run prompts in the terminal
With serve → Your apps can call the model like an API

Default API Endpoint: Once the server starts,
http://localhost:11434 becomes the base URL.

Example API Call
Here’s a request (with "stream": false, so the server returns one complete JSON object instead of a stream of partial chunks):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain cloud computing",
  "stream": false
}'
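The same call can be made from application code. Here is a minimal Python sketch using only the standard library; it assumes the server is running at the default port (the function names build_payload and generate are illustrative, not part of Ollama):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    # Build the JSON body; "stream": False asks for one complete JSON reply.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")

def generate(model, prompt):
    # Send the request to the local Ollama server and parse its JSON reply.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running server):
# result = generate("llama3", "Explain cloud computing")
# print(result["response"])
```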

Sample API Output (With Metrics):
{
  "model": "llama3",
  "created_at": "2026-04-18T12:10:00Z",
  "response": "Cloud computing is the delivery of computing services over the internet...",
  "done": true,

  "total_duration": 2450000000,
  "load_duration": 800000000,
  "prompt_eval_count": 12,
  "prompt_eval_duration": 200000000,
  "eval_count": 65,
  "eval_duration": 1450000000
}
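All the *_duration fields are reported in nanoseconds. As a quick sanity check before walking through each field, they can be converted to seconds with a few lines of Python (using the sample JSON above verbatim):

```python
import json

# The sample response from above, as returned with "stream": false.
raw = '''{
  "model": "llama3",
  "created_at": "2026-04-18T12:10:00Z",
  "response": "Cloud computing is the delivery of computing services over the internet...",
  "done": true,
  "total_duration": 2450000000,
  "load_duration": 800000000,
  "prompt_eval_count": 12,
  "prompt_eval_duration": 200000000,
  "eval_count": 65,
  "eval_duration": 1450000000
}'''

NS_PER_S = 1_000_000_000  # duration fields are in nanoseconds

data = json.loads(raw)
durations = {
    key: data[key] / NS_PER_S
    for key in ("total_duration", "load_duration", "prompt_eval_duration", "eval_duration")
}
print(durations)  # total_duration -> 2.45 s, load_duration -> 0.8 s, ...
```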


Now let’s understand this response.
Basic Response Fields
model  ===>  llama3
created_at  ===>  2026-04-18T12:10:00Z
response  ===>  "Cloud computing is the delivery of computing services over the internet..."
done  ===>  true. The response is complete; no more data is coming.

Performance Metrics:
1. total_duration  ===>  2450000000 ns → ~2.45 seconds
The total time taken, from the request being received to the final response being sent.

2. load_duration  ===>  800000000 ns → ~0.8 seconds
Time taken to load the model into memory.
This usually happens on the first request, when the model is not already loaded.

3. prompt_eval_count  ===>  12 tokens
Number of tokens in your input prompt.

4. prompt_eval_duration  ===>  200000000 ns → ~0.2 seconds
Time the model spent reading (evaluating) your prompt.

5. eval_count  ===>  65 tokens
Number of tokens generated in the response. This directly affects response size, cost (in cloud scenarios), and latency.

6. eval_duration  ===>  1450000000 ns → ~1.45 seconds
Time spent generating the response. This is the actual thinking + answering time.
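Combining eval_count and eval_duration gives the most useful derived number: generation throughput in tokens per second. A small sketch using the sample values above:

```python
# Generation throughput from the sample metrics above:
# tokens generated divided by generation time (ns converted to s).
eval_count = 65             # tokens generated
eval_duration = 1450000000  # ns spent generating

tokens_per_second = eval_count / (eval_duration / 1_000_000_000)
print(round(tokens_per_second, 1))  # ~44.8 tokens/s
```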