How to Run Llama 3 Locally with ZeroBoxx

A practical guide to deploying Meta's Llama 3.1 on ZeroBoxx hardware. From unboxing to your first inference in a few hours, with no cloud connections required.

One of the most common questions we get after a ZeroBoxx demo is: “How quickly can I actually get Llama 3 running?”

The answer is: within a few hours of delivery, assuming you can connect the unit to your local network and have someone comfortable running basic terminal commands.

This guide walks through the complete process from receiving your ZeroBoxx hardware to running your first inference against Llama 3.1.

What Ships With ZeroBoxx

Every ZeroBoxx unit ships pre-configured with:

  • Ubuntu 24.04 LTS installed and updated
  • NVIDIA CUDA and cuDNN drivers configured
  • Ollama installed and running as a system service
  • NVIDIA Container Toolkit for containerized workloads
  • OpenWebUI available for browser-based model interaction

You do not need to set up drivers, configure CUDA, or install inference software. The system is ready to pull and run models immediately.

Step 1: Connect to Your Network

ZeroBoxx ships with a static IP pre-configured for common network environments, along with instructions to update it for your specific subnet. Once connected to your LAN:

  1. Connect the ethernet cable (ZeroBoxx Standard: RJ45 to your switch; ZeroBoxx Pro: QSFP112 to your 400GbE switch or adapter)
  2. Power on the unit
  3. Wait approximately 90 seconds for all services to start
  4. SSH into the unit from any machine on the same network:
ssh zeroboxx@<ip-address>

The default credentials are included in the quick-start card shipped with the hardware. Change these immediately after first login.
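
Before moving on, it is worth a quick sanity check that the GPU and the inference service came up cleanly. A minimal check, assuming the stock setup (ollama is the service name used by Ollama's standard Linux install):

# Confirm the GPU(s) are visible to the driver
nvidia-smi

# Confirm the Ollama service is active
systemctl status ollama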

Step 2: Pull the Llama 3.1 Model

Ollama is already running as a background service. Pull your preferred Llama 3.1 variant:

# 8B parameter model - fast, good for most tasks
ollama pull llama3.1:8b

# 70B parameter model - higher quality, requires more VRAM
ollama pull llama3.1:70b

# 405B parameter model - maximum quality (ZeroBoxx Pro recommended)
ollama pull llama3.1:405b

Model download time depends on your internet connection speed. The 8B model is approximately 4.7 GB, the 70B model approximately 40 GB, and the 405B model approximately 230 GB.

Downloads come from the Ollama model registry and are the only step that needs outbound internet access. Once downloaded, models are cached locally, and no internet connection is required after the initial pull.
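
To confirm what is already cached on the unit, list the local models and their sizes:

# Show all models stored locally, with size and last-modified time
ollama list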

Step 3: Run Your First Inference

Once the model is pulled, test it immediately with a direct Ollama command:

ollama run llama3.1:8b "Summarize the key risks of cloud AI for a regulated industry."

You will see the response stream to your terminal in real time, running entirely on local hardware.
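
The same model is also reachable over HTTP on Ollama's native API (port 11434), which is handy for a quick scripted check from another machine on the LAN. A minimal example, assuming the default port:

curl http://<zeroboxx-ip>:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the key risks of cloud AI for a regulated industry.",
  "stream": false
}'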

Step 4: Use the OpenAI-Compatible API

ZeroBoxx runs Ollama’s OpenAI-compatible API endpoint by default. This means any application that works with the OpenAI Python SDK or REST API can point to ZeroBoxx with a one-line change:

from openai import OpenAI

client = OpenAI(
    base_url="http://<zeroboxx-ip>:11434/v1",
    api_key="not-required",  # Ollama does not require authentication by default
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "What are the benefits of on-premise AI infrastructure?"}
    ]
)

print(response.choices[0].message.content)

If you are migrating an existing application from OpenAI, change base_url to your ZeroBoxx IP and model to your local model name. In most cases, that is the only change required.
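
Streaming works the same way as against the hosted OpenAI API. A short sketch reusing the client defined above:

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain quantization in two sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)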

Step 5: Access the Web Interface

ZeroBoxx ships with OpenWebUI configured on port 3000. Point any browser on your LAN to:

http://<zeroboxx-ip>:3000

OpenWebUI provides a ChatGPT-style interface for interacting with any model loaded on the system. It supports multi-turn conversations, document uploads, and model switching from a dropdown.

This is useful for non-technical team members who need access to the model without writing code.

Running Multiple Models

ZeroBoxx can load and serve multiple models simultaneously, switching between them based on the model parameter in your API requests. Load your models once:

ollama pull mistral
ollama pull gemma2:27b
ollama pull qwen2.5:72b

Then route different workflows to different models by changing only the model parameter in your API calls. No configuration changes are needed to add new models to the rotation.
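
As an illustration, here is a hypothetical routing helper that reuses the client from Step 4; the model names and prompts are placeholders:

# Placeholder model choices - swap in whatever you have pulled locally
FAST_MODEL = "llama3.1:8b"
QUALITY_MODEL = "qwen2.5:72b"

def ask(model: str, prompt: str) -> str:
    # Only the model parameter changes between workflows
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask(FAST_MODEL, "Summarize this support ticket: ..."))
print(ask(QUALITY_MODEL, "Draft a detailed review of this contract clause: ..."))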

Fine-Tuning on Your Own Data

If you want to create a domain-specific model trained on your internal documents, ZeroBoxx includes the tools you need:

  • Unsloth for memory-efficient QLoRA fine-tuning
  • NVIDIA NeMo for enterprise-grade training workflows
  • Hugging Face Transformers for full access to the open-source model ecosystem

A full fine-tuning tutorial is beyond the scope of this post, but the short version: starting from a base Llama 3.1 checkpoint and your own document corpus, you can produce a fine-tuned model in a few hours of training on ZeroBoxx Pro’s 252 GB of HBM3e memory.
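
To give a flavour of what that looks like, here is a rough QLoRA sketch using Unsloth. The base checkpoint, dataset file, and hyperparameters are illustrative assumptions, and exact arguments vary between Unsloth and TRL versions:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized Llama 3.1 base model for QLoRA fine-tuning
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",  # assumed base checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Assumed corpus: a JSONL file with one {"text": ...} record per document
dataset = load_dataset("json", data_files="internal_docs.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="llama31-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()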

What You Should Do Next

Once your first inference is running, the typical next steps are:

  1. Connect your applications: Update your existing OpenAI API integrations to point at ZeroBoxx using the compatibility layer
  2. Set up authentication: Ollama does not require API keys by default, so place an authentication layer (such as a reverse proxy) in front of the API before production use
  3. Provision access: Set up OpenWebUI accounts for team members who need browser access
  4. Load your models: Pull the specific models that match your use cases

The ZeroBoxx team provides onboarding support with every unit, including a walkthrough of your specific integration scenarios.

Book a demo to see a live deployment and ask questions about your specific use case.
