This tutorial guides you through setting up Ollama, a powerful platform for serving large language models, on a GPU Pod using Runpod. Ollama makes it easy to run, create, and customize models. However, not everyone has access to the compute power needed to run these models. With Runpod, you can spin up and manage GPUs in the cloud. Runpod offers templates with preinstalled libraries, which makes it quick to run Ollama. In this tutorial, you'll set up a Pod on a GPU, install and serve Ollama, run a model, and interact with it from the command line and over HTTP.
Prerequisites
The tutorial assumes you have a Runpod account with credits. No other prior knowledge is needed to complete this tutorial.
Step 1: Start a PyTorch Template on Runpod
You will create a new Pod with the PyTorch template. In this step, you will set overrides to configure Ollama.
- Log in to your Runpod account and choose + GPU Pod.
- Choose a GPU Pod like A40.
- From the available templates, select the latest PyTorch template.
- Select Customize Deployment.
  - Add the port `11434` to the list of exposed ports. This port is used by Ollama for HTTP API requests.
  - Add the following environment variable to your Pod so that Ollama binds to the HTTP port on all network interfaces, not only localhost:
    - Key: `OLLAMA_HOST`
    - Value: `0.0.0.0`
- Select Set Overrides, Continue, then Deploy.
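Once the Pod is running, you can confirm from a terminal inside it (such as the web terminal described in the next step) that the override took effect. This is a minimal sketch assuming a Bash-compatible shell in the Pod. The fallback `export` only matters if the variable is missing, for example because the override was skipped, and it applies only to the current shell session.

```shell
# Run inside the Pod after it starts.
# Prints the value set by the override in Step 1; it should be 0.0.0.0.
echo "OLLAMA_HOST is set to: ${OLLAMA_HOST:-<not set>}"

# Fallback if the override is missing: set it for this shell session only,
# so an `ollama serve` started from this shell listens on all interfaces.
export OLLAMA_HOST="${OLLAMA_HOST:-0.0.0.0}"
```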
Step 2: Install Ollama
Now that your Pod is running, you can log in to the web terminal. The web terminal is a powerful way to interact with your Pod.
- Select Connect and choose Start Web Terminal.
- Make note of the Username and Password, then select Connect to Web Terminal.
- Enter your username and password.
- To ensure Ollama can automatically detect and utilize your GPU, run the following commands.
- Run the following command to install Ollama and send it to the background. The `ollama serve` part of the command starts the Ollama server, making it ready to serve AI models.
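The two steps above can be sketched as the commands below. This assumes the PyTorch template's Ubuntu-based image with `apt` available and a root shell, as in the Runpod web terminal. The URL is Ollama's official Linux install script. `ollama.log` is just an arbitrary log file name chosen for this sketch.

```shell
# GPU detection: the Ollama install script can use lshw to find the NVIDIA GPU.
apt update && apt install -y lshw

# Download and run the official Ollama install script, then start the server.
# The surrounding ( ... ) & runs both in the background so the terminal stays usable;
# server output is written to ollama.log.
(curl -fsSL https://ollama.com/install.sh | sh && ollama serve > ollama.log 2>&1) &
```

Once the installation finishes, running `curl http://localhost:11434` in the same terminal should print `Ollama is running`. If it does not, check `ollama.log`.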
Now that your Ollama server is running on your Pod, add a model.
Step 3: Run an AI Model with Ollama
To run an AI model using Ollama, pass the model name to the `ollama run` command, as in `ollama run [model name]`.
Replace `[model name]` with the name of the AI model you wish to deploy. For a complete list of models, see the Ollama Library.
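For example, the following starts an interactive session with a model. The name `llama3.2` is only an illustrative choice, assumed to be listed in the Ollama Library. Substitute any model you prefer.

```shell
# Illustrative model name; replace it with any model from the Ollama Library.
MODEL="llama3.2"

# Pulls the model on first use, then opens an interactive prompt.
# Type /bye at the prompt to end the session.
ollama run "$MODEL"

# Afterwards, list the models now stored locally on the Pod.
ollama list
```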
This command pulls the model and runs it, making it accessible for inference. You can begin interacting with the model directly from your web terminal.
Optionally, you can set up an HTTP API request to interact with Ollama. This is covered in the next step.
Step 4: Interact with Ollama via HTTP API
With Ollama set up and running, you can now interact with it using HTTP API requests. In Step 1, you set `OLLAMA_HOST` to `0.0.0.0`, which configures Ollama to listen on all network interfaces. This means you can use your Pod as a server to receive requests.
Get a list of models
To list the local models available in Ollama, you can use the following GET request:
Replace `[your-pod-id]` with your actual Pod ID.
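Here is a sketch of that GET request. It assumes Runpod's HTTP proxy URL pattern for exposed ports, `https://<pod-id>-<port>.proxy.runpod.net`. `POD_ID` is a placeholder that you must replace with your own Pod ID.

```shell
# Placeholder: replace with your actual Pod ID.
POD_ID="your-pod-id"

# Exposed HTTP port 11434 is reachable through Runpod's proxy at this address.
OLLAMA_URL="https://${POD_ID}-11434.proxy.runpod.net"

# GET /api/tags lists the models available locally in Ollama.
curl -s --max-time 30 "${OLLAMA_URL}/api/tags"
```

The response is a JSON object with a `models` array. The array is empty until you have pulled at least one model.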
Because port 11434 is exposed, you can make requests to your Pod using the curl command.
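Beyond listing models, you can request a completion through the Ollama API's `POST /api/generate` endpoint. This sketch uses the same proxy URL pattern and a placeholder Pod ID. It also assumes that a model named `llama3.2` has already been pulled on the Pod, so substitute the model you ran in Step 3. Setting `"stream": false` makes the reply arrive as a single JSON object instead of a stream of partial responses.

```shell
# Placeholder: replace with your actual Pod ID.
POD_ID="your-pod-id"
OLLAMA_URL="https://${POD_ID}-11434.proxy.runpod.net"

# Assumes this model is already pulled on the Pod; substitute your own.
MODEL="llama3.2"

# POST /api/generate; with "stream": false the generated text
# is returned in the "response" field of one JSON object.
curl -s --max-time 300 "${OLLAMA_URL}/api/generate" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${MODEL}\", \"prompt\": \"Why is the sky blue?\", \"stream\": false}"
```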
For more information on constructing HTTP requests and other operations you can perform with the Ollama API, consult the Ollama API documentation.
Additional considerations
This tutorial provides a foundational understanding of setting up and using Ollama on a GPU Pod with Runpod.
- Port Configuration and documentation: For further details on exposing ports and the link structure, refer to the Runpod documentation.
- Connect VSCode to Runpod: For information on connecting VSCode to Runpod, refer to the How to Connect VSCode To Runpod guide.