Ollama large language models.
OllamaLLM()

reasoning: bool | None
Controls the reasoning/thinking mode for supported models.
- True: Enables reasoning mode. The model's reasoning process is captured and
  returned separately in the additional_kwargs of the response message, under
  reasoning_content. The main response content does not include the reasoning tags.
- False: Disables reasoning mode. The model does not perform any reasoning, and
  the response does not include any reasoning content.
- None (default): The model uses its default reasoning behavior. If the model
  performs reasoning, the <think> and </think> tags are present directly within
  the main response content.
async_client_kwargs: dict | None
Additional kwargs to merge with client_kwargs before passing to the httpx
AsyncClient. These are kwargs unique to the async client; for shared args use
client_kwargs. For a full list of the params, see the httpx documentation.
sync_client_kwargs: dict | None
Additional kwargs to merge with client_kwargs before passing to the httpx
Client. These are kwargs unique to the sync client; for shared args use
client_kwargs. For a full list of the params, see the httpx documentation.
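For example, reasoning can be enabled at construction time. A minimal sketch,
assuming a reasoning-capable model such as deepseek-r1:8b has already been
pulled locally (ollama pull deepseek-r1:8b):

from langchain_ollama import OllamaLLM

# deepseek-r1:8b is an assumed example of a reasoning-capable model
model = OllamaLLM(model="deepseek-r1:8b", reasoning=True)

# With reasoning=True, the returned text omits the <think>...</think> tags
print(model.invoke("What is 17 * 23?"))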
Setup:
Install langchain-ollama and install/run the Ollama server locally:
pip install -U langchain-ollama
# Visit https://ollama.com/download to download and install Ollama
# (Linux users): start the server with `ollama serve`
Download a model to use:
ollama pull llama3.1
Key init args — generation params:
model: str
Name of the Ollama model to use (e.g. 'llama4').
temperature: float | None
Sampling temperature. Higher values make output more creative.
num_predict: int | None
Maximum number of tokens to predict.
top_k: int | None
Limits the next token selection to the K most probable tokens.
top_p: float | None
Nucleus sampling parameter. Higher values lead to more diverse text.
mirostat: int | None
Enable Mirostat sampling for controlling perplexity.
seed: int | None
Random number seed for generation reproducibility.
Key init args — client params:
base_url: Base URL where the Ollama server is hosted.
keep_alive: How long the model stays loaded into memory.
format: Specify the format of the output.
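A hedged illustration of the client params (the base_url value is the
conventional local Ollama address; the keep_alive and format values are
assumptions for a typical setup):

from langchain_ollama import OllamaLLM

model = OllamaLLM(
    model="llama3.1",
    base_url="http://localhost:11434",  # assumed local Ollama server address
    keep_alive="5m",                    # keep the model loaded for 5 minutes after each call
    format="json",                      # request JSON-formatted output
)

print(model.invoke("Return a JSON object with keys 'city' and 'country' for Paris."))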
See full list of supported init args and their descriptions in the params section.
Instantiate:
from langchain_ollama import OllamaLLM
model = OllamaLLM(
model="llama3.1",
temperature=0.7,
num_predict=256,
# base_url="http://localhost:11434",
# other params...
)
Invoke:
input_text = "The meaning of life is "
response = model.invoke(input_text)
print(response)
"a philosophical question that has been contemplated by humans for
centuries..."
Stream:
for chunk in model.stream(input_text):
print(chunk, end="")
a philosophical question that has been contemplated by humans for
centuries...
Async:
response = await model.ainvoke(input_text)
# stream:
# async for chunk in model.astream(input_text):
# print(chunk, end="")
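A minimal self-contained sketch of the async path, runnable as a script (it
assumes the same locally pulled llama3.1 model as the examples above):

import asyncio

from langchain_ollama import OllamaLLM


async def main() -> None:
    model = OllamaLLM(model="llama3.1")

    # Single async call
    print(await model.ainvoke("The meaning of life is "))

    # Async token-by-token streaming
    async for chunk in model.astream("The meaning of life is "):
        print(chunk, end="")


asyncio.run(main())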