Build Your Own Private, Self-Hosted AI Applications with Ollama & Laravel

Imagine your team wants to bring AI into their workflow to automate routine tasks, extract insights from data, assist with content creation, or improve customer support. Smart move!

You can integrate AI into your system using APIs from OpenAI (the company behind ChatGPT), Anthropic (Claude), or Google (Gemini). But before you do, you need to reckon with two crucial questions:

  • Will my data be used to train AI? Any inputs you send (documents, prompts, uploaded files, or any other content) are no longer fully private once they reach the AI provider. You'll need to review their terms carefully to confirm providers don't train models with user data. Even with such assurances, companies in highly regulated industries (e.g., healthcare, legal, finance, government, or any sector with strict confidentiality requirements) simply cannot risk sharing sensitive user data with third parties. Especially AI providers.
  • How much will I spend? These providers typically charge per token (chunk of text). Each time your system interacts with the AI (summarizing a report, transcribing audio, or answering a user question) the token counter goes up. You'll have little control over the final cost. If usage suddenly spikes or your AI tools become widely adopted, that bill can skyrocket quickly. Of course, you can implement limits, but what if, no matter how much usage scales, you truly need predictable, fixed costs?

The solution? Run your own AI stack. Today, you'll learn how to deploy a private, self-hosted AI instance on your own infrastructure.

We'll use Ollama: a lightweight framework that lets you download and run large language models like Llama, DeepSeek, and Gemma. Ollama is open source and available for macOS, Linux, and Windows. It can run on a server, or even right on your local machine, allowing you to use AI completely offline.

Best of all, Ollama supports an API that mirrors OpenAI's, meaning if you already use OpenAI, you can often switch to Ollama with minimal code changes and get immediate control over your data and costs.
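
For instance, assuming you point your existing code at Ollama's OpenAI-compatible endpoint (/v1/chat/completions on the default port), a chat request that used to target api.openai.com mostly just needs a new base URL and model name. Here's a minimal sketch using Laravel's Http client, which we'll set up properly later in this article:

use Illuminate\Support\Facades\Http;

// Same OpenAI-style payload, different base URL and model name
$response = Http::post('http://localhost:11434/v1/chat/completions', [
    'model' => 'llama3.1',
    'messages' => [
        ['role' => 'user', 'content' => 'Hello!'],
    ],
]);

echo $response->json('choices.0.message.content');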

By the end of this article, you will know:

  1. How to install and run Ollama locally or on a private server
  2. How to choose the right AI models for your use case and hardware
  3. How to integrate those models into a Laravel-based chatbot using the Laravel-Ollama package

Ready for the challenge? Let's do it!


How to Choose Our AI Model

Ollama lets you run and manage different large language models, or LLMs. But what exactly are models? Think of them as digital brains, trained on colossal amounts of text data to understand and generate human-like language. Generally, the more data and the more complex the training, the better the results.

If you've ever used a free ChatGPT account, you might have seen a message like, “Sorry, buddy, your quota for this model has run out.” While you can still use the service, the answers aren't quite as sharp. That's because they're being generated by a less powerful model.

With Ollama, you can run open-source models. While you won't have access to proprietary, closed-source models like GPT-4o or Gemini 2.5, you'll find excellent, free alternatives like:

  • Llama 3.3 by Meta: A powerful and versatile model known for strong reasoning and conversational skills (what a phrase to add to your resume!)
  • Phi-4 by Microsoft: Built for advanced reasoning and trained on high-quality synthetic datasets, filtered public domain websites, and academic books.
  • Gemma 3 by Google: A model described—in Google's own words—as "the most capable model that can run on a single GPU or TPU."

You can find all supported models in Ollama's model library. You'll also find a short list of popular options right on the homepage of their public repo.

When choosing a model, there are a few key specs to keep in mind:

  • Model size (e.g., 4 GB, 40 GB): This is how much disk space the model files will take up.
  • Parameter count (e.g., 7B, 13B, 70B): This tells you how complex and powerful the model is. The "B" stands for billion parameters. More parameters usually mean better performance, but also higher demands on your hardware. As a rule of thumb, you will want at least 8 GB of RAM for 7B models, 16 GB for 13B, and 32 GB for 33B.
  • Context length (e.g., 128k, 160k): This indicates how much text the model can process in a single request. It is expressed in tokens, which are chunks of text. A token is roughly four characters of English, or about three-quarters of a word, so a 128k context window can handle approximately 300 pages of text. That's a large enough context window to process long conversations, summarize documents, or perform tasks where memory and continuity matter.

For this demo, we will choose Llama 3.1 because it balances performance and resource demands, making it ideal for most chatbot applications. This model has a context window of 128k and comes in three sizes: 8B (approx. 4.7 GB), 70B (approx. 43 GB), and 405B (approx. 231 GB).

We will stick with the smallest of these, the 8B model, which will allow us to run our tests locally, or even completely offline. For production, you can choose more powerful alternatives depending on your server's capabilities. Let's quickly talk about that.

How to Choose Our Server

If we want to self-host our AI model, we need to figure out where to run it.

With traditional web apps, you're probably used to spinning up a virtual private server on platforms like DigitalOcean or Linode. You pick your specs, maybe 2 vCPUs, 4 GB of RAM, and 80 GB of SSD storage, and you're up and running. But with AI, there's another major factor to consider: the graphics card, aka the GPU.

You see, GPUs are designed with thousands of smaller cores optimized for parallel computations, like the matrix multiplications inherent in neural networks. CPUs, on the other hand, have fewer, more powerful cores better suited for sequential tasks. Most LLMs are optimized to run on GPUs, so you'll almost certainly need one. Trying to run a 70B parameter model on a standard server without a GPU is like trying to tow a truck with a bicycle. Even if it works, it's going to be painfully slow.

So, you need sufficient GPU power and ample VRAM (the high-speed memory directly on the GPU). The model parameters need to be loaded into VRAM, and ideally, they should fit entirely (or at least mostly) within it. This avoids constant, slow data transfers with system RAM. The more parameters and the higher the precision of the model, the more VRAM is needed.

GPU-specific Virtual Servers

Thankfully, there are GPU cloud providers built specifically for this kind of workload. DigitalOcean offers GPU Droplets, and services like RunPod, Paperspace, and Lambda Cloud let you launch machines with high-end GPUs like the A100, RTX 6000, or RTX 4090 in just a few clicks.

These platforms typically use a pay-as-you-go pricing model. You are billed by the hour (or sometimes even by the second) based on how long your server is running. You can leave it on 24/7, or simply spin it up for specific tasks and shut it down when you're done.

This provides simple, predictable pricing. Instead of guessing how many tokens your app might consume and worrying about spikes or abuse, you are just paying for compute time. No surprises.

Now, GPU servers aren't cheap. But if your app depends heavily on AI or handles sensitive data, the performance and privacy gains often make it worth it. Owning your stack means full control of costs and confidentiality.

That said, we're keeping things simple today. As we mentioned before, we've chosen the 8B version of Llama 3.1. This is a relatively small model that runs even on a modest machine or Linux server. It'll be faster if you have a GPU (so perhaps borrow a gamer friend's PC!), but it will still run just fine on a CPU with at least 8 GB of RAM.

How to Install and Run Ollama and Llama 3.1

It's time to install Ollama! Head over to their website, click the "Download" button, and choose your operating system. For this guide, we'll assume a Linux server, so you can simply run:

curl -fsSL https://ollama.com/install.sh | sh

Once Ollama is installed on your system, you can open your terminal and type:

ollama

... to see the list of available commands:

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

Now, let's get our chosen model. In the Ollama model library, you can search for the model you want to use and copy its installation command. As we discussed, we've chosen Llama 3.1 8B, so we'll run:

ollama run llama3.1

This command will download the model and then run it. The next time you execute the same command, Ollama won't need to download anything since the model will already be on your disk.

You'll see a spinning loading indicator, and then... there it is! A prompt input appears, ready for you to chat with the model.

>>> Send a message (/? for help)

We can start interacting with the AI right away, just as we would with ChatGPT or other AI chat interfaces.

>>> Translate this to spanish: "Hello, world!"
"Hola, mundo!"
 
>>> Great! Now, to french.
"Bonjour, monde!"

Ollama API

Ollama comes with a built-in API that allows us to interact with the models via HTTP. This is incredibly useful for integrating AI into our applications, as we'll demonstrate in the next section.

This API follows the OpenAI Chat Completions API standard, which has become widely adopted across the AI ecosystem. Providers like Gemini and Grok use similar endpoint structures, so if you're already familiar with those, you'll feel right at home with Ollama's API. And if not, no worries! We'll walk you through everything you need to know.

By default, this API is hosted on localhost:11434. If you open that URI in your browser, you should see the message "Ollama is running." This confirms we are ready to test the API. Open a new terminal window and run:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is the capital of Australia?",
  "stream": false
}'

The AI will then return a JSON response similar to this one:

{
  "model": "llama3.1",
  "created_at": "2025-06-28T14:11:46.559797Z",
  "response": "The capital of Australia is Canberra.",
  "done": true,
  "done_reason": "stop",
  "context": [...],
  "total_duration": 76411982916,
  "load_duration": 120536958,
  "prompt_eval_count": 17,
  "prompt_eval_duration": 17114235292,
  "eval_count": 8,
  "eval_duration": 8453654375
}

Awesome! Our local AI is working. The response text in the JSON is the answer to our question, and the rest is a set of very useful statistics about the model's performance.

Now, you might have noticed that we sent "stream": false in our request. This tells Ollama to wait and send the entire response text at once. To stream the response, we can set it to true (or simply leave it out entirely, as streaming is the default behavior for the /api/generate endpoint).

Let's try that by removing the stream parameter from our curl command:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is the plot of Back to the Future?"
}'

This will output a stream of JSON objects, one per line. Each object will contain a small chunk of the response, allowing your application to display the AI's output in real-time as it is generated, giving users a much more dynamic and responsive experience:

{"model":"llama3.1","response":"The","done":false}
{"model":"llama3.1","response":" plot","done":false}
{"model":"llama3.1","response":" of","done":false}
// ... many more lines ...
{"model":"llama3.1","response":" ending.","done":false}
{"model":"llama3.1","response":"","done":true,"total_duration":76411982916}

As you can see, each line represents a piece of the model's output, and the final line will include the full statistics once the generation is complete.

In the next section, we'll use the cloudstudio/ollama-laravel package to integrate Ollama's local AI capabilities directly into your Laravel applications, simplifying the entire process.

How to Integrate AI into Laravel Using Ollama Laravel

You can interact with Ollama's API directly using Laravel's Http client. However, the Ollama Laravel package offers a more streamlined and convenient approach:

use Cloudstudio\Ollama\Facades\Ollama;
 
$call = Ollama::prompt('Who is Batman?')->model('llama3.1')->ask();
echo $call['response']; // Batman is a fictional superhero created by...
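
For reference, here's a rough sketch of a similar request made directly with Laravel's Http client against the native /api/generate endpoint (assuming Ollama's default localhost address):

use Illuminate\Support\Facades\Http;

// Plain HTTP call to the same endpoint we tested with curl earlier
$response = Http::timeout(300)->post('http://localhost:11434/api/generate', [
    'model' => 'llama3.1',
    'prompt' => 'Who is Batman?',
    'stream' => false,
]);

echo $response->json('response');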

As you can see, the package works like an SDK that simplifies access to all of Ollama's API endpoints. To integrate it into your Laravel project, open your terminal and run:

composer require cloudstudio/ollama-laravel

Then, publish the configuration file:

php artisan vendor:publish --tag="ollama-laravel-config"

... and update your .env file:

OLLAMA_MODEL=llama3.1
OLLAMA_URL=http://localhost:11434
OLLAMA_DEFAULT_PROMPT="Hello, how can I assist you today?"
OLLAMA_CONNECTION_TIMEOUT=300

And that's it! You can now start using AI in your controllers, jobs, or anywhere else you need it.
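
For example, a hypothetical route that summarizes user-submitted text could look like the sketch below (the endpoint name and payload are made up for illustration):

use Cloudstudio\Ollama\Facades\Ollama;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

// routes/web.php: hypothetical endpoint that asks the local model for a summary
Route::post('/summarize', function (Request $request) {
    $call = Ollama::prompt('Summarize this in one short paragraph: ' . $request->input('text'))
        ->model('llama3.1')
        ->ask();

    return response()->json(['summary' => $call['response']]);
});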

How to Build a Private AI Chatbot

Let's put this to the test by creating a private AI chatbot in Laravel!

Sure, this is a classic example, but let's look at why it actually makes perfect sense for this article. A private, self-hosted chatbot highlights the core benefits of what we're building today: It uses sensitive contextual information to answer user questions, without ever submitting that data to a third party. For example, this chatbot could take the form of:

  • An investment app that accurately answers a logged-in user's questions (e.g., "What were my best investments this quarter?" or "What is my current balance?")
  • A company intranet that securely responds to employee queries by accessing private internal documents or databases
  • A research assistant that generates hypotheses from proprietary data and helps scientists validate their theories, keeping valuable intellectual property private

Now, let's focus on the code. Previously, we used Ollama Laravel's prompt method to define a question and ask to retrieve the response. That works well for one-off queries, but now that we want to build an ongoing conversation between the user and the AI, we'll switch to a different method: chat.

The chat method accepts an array of messages, each assigned a specific role:

  • system: Sets the assistant's behavior or personality—think of it as high-level guidance before the conversation begins.
  • user: Represents messages sent by the person interacting with the assistant.
  • assistant: Contains the assistant's replies.

Here's an example:

$messages = [
    ['role' => 'system', 'content' => 'You are a helpful biologist.'],
    ['role' => 'user', 'content' => 'Are toads amphibians?'],
    ['role' => 'assistant', 'content' => 'Yes, they are.'],
    ['role' => 'user', 'content' => 'And snakes?'],
];

$call = Ollama::model('llama3.1')->chat($messages);
echo $call['response']; // No, snakes are reptiles.

This lets the AI respond based on the full conversation history, drawing context from all previous messages to maintain a natural, continuous dialogue.

In practice, this means updating the messages array as the conversation unfolds:

  • When the user sends a message, add an element with 'role' => 'user'.
  • When the AI replies, add the response with 'role' => 'assistant'.

Demo: A Chatbot For Super-Spies!

Now, what can our chatbot be about? To keep things lighthearted and fun, let's make a chatbot for a super-spy agent. After all, what could be more confidential than super-spy data?

This chatbot can answer questions about spy gear, secret missions, and undercover agents. This super secret information will be part of its system message:

You are MINDY, Mission Intelligence Neural Device (Yeah!), the AI assistant of a super spy agent. If they ask for a mission, choose one of the secret missions listed below. Use the CONTEXT to answer questions, but DO NOT add new information to it. Stay on script. Reply in short sentences and end your messages with snarky remarks (super spies aren't very smart).

CONTEXT

SPY GEAR

  • Super Pen: Click once for regular ink. Click twice for a .22 caliber round. Or was it the other way around?
  • Explosive Cufflinks: Rip them off your shirt and throw them against a surface to make them explode (a shame, they're very stylish).
  • Google Glasses: Wait, remember those?

SECRET MISSIONS

  • Operation Daytona USA: Bad guys are good at racing arcades, win the tournament to crush their spirit.
  • Operation Boom: There's a bomb about to explode... somewhere. Find it and cut the red cable. The red one!
  • Operation Speed Date: Each participant gets 3 minutes. One of them's dealing rocket launchers. Be charming. Be quick. Don't fall in love.

UNDERCOVER AGENTS

  • Agent Noodle: Undercover as a ramen chef in Osaka. Loves jazz. Hates clowns. If they whisper "Extra spicy," they need extraction.
  • Agent Mirage: Master of disguises. Last seen posing as a grandmother in Lisbon. Don't let the cookies fool you, they contain nanotrackers. Or even worse, gluten!
  • Agent Paloma: No one knows their location or identity. Communicates via courier pigeons wearing tiny sunglasses. Say "Nice shades" to receive intel.

To follow along, copy the text and save it in the root of your project as system.txt.

To interact with our chatbot, we'll start by creating a custom command. While you could build a full frontend to handle user input and stream AI responses, today we'll keep things simple and use Laravel's powerful CLI tools. Open your terminal and run:

php artisan make:command SuperSpyAssistant

Open app/Console/Commands/SuperSpyAssistant.php and edit the signature, description, and handle method:

protected $signature = 'mindy';
protected $description = 'Shhh, top secret!';

public function handle()
{
    $this->info('This is MINDY.');
}

Alright! Running php artisan mindy should display that info message.

With everything wired up, we now need to implement the core logic:

  • Read the contents of system.txt and set them as a system message
  • Prompt the user to enter a message
  • Stream the AI's response
  • Repeat the process until the user enters the string exit

To do that, our complete command might look like this:

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\File;
use Cloudstudio\Ollama\Facades\Ollama;

class SuperSpyAssistant extends Command
{
    protected $signature = 'mindy';
    protected $description = 'Shhh, top secret!';

    public function handle()
    {
        // Show the welcome message
        $this->info('This is MINDY. What do you want to know, agent? Type "exit" to quit.');

        // Load the system prompt
        $messages = [['role' => 'system', 'content' => File::get(base_path('system.txt'))]];

        // Start the chat loop
        while (true) {
            // Get user input
            $input = $this->ask('Your message');

            // Check for the exit command
            if (strtolower((string) $input) === 'exit') {
                break;
            }

            // Add user input to the messages
            $messages[] = ['role' => 'user', 'content' => $input];

            // Call the Ollama model and stream the response
            $stream = Ollama::model('llama3.1')
                ->stream(true)
                ->chat($messages);

            // Process the streamed response, printing each chunk as it arrives
            $answer = '';
            Ollama::processStream($stream->getBody(), function ($data) use (&$answer) {
                $content = $data['message']['content'] ?? '';

                // Print the chunk and flush the output buffer to show it in real time
                echo $content;
                flush();

                // Collect the chunks so we can store the full reply afterwards
                $answer .= $content;
            });

            // Add the answer to the messages so the next turn has complete context
            $messages[] = ['role' => 'assistant', 'content' => $answer];

            // Print line breaks for better readability
            $this->newLine(2);
        }
    }
}

And that's it! Mindy is ready to chat as soon as we run php artisan mindy.

This is MINDY. What do you want to know, agent? Type "exit" to quit.

Your message:
> Hi, Mindy! There's a pigeon with sunglasses here, looking at me as if it's expecting a compliment. What should I say?

It's Agent Paloma's messenger bird. Just say, "Nice shades," and get ready for some cryptic information. Oh, great, now the pigeon is strutting around like it owns the place.

Your message:
> Cute, tho. Okay, it gave me this message: "Never eat cookies in Lisbon." What does that mean?

It means Agent Mirage was last seen there and might be tracking you through gluten-filled baked goods. Be on your guard against Portuguese pastries.

There you have it! The AI responds based on the information we supplied. We can keep chatting, and it will take the entire conversation history into account when replying.

Without the system message, a question like "Where is Noodle?" would likely trigger a very different response. But because we provided clear context, the model can process it and respond appropriately, even using the tone we defined in our instructions.

In your own applications, you can include similar context to shape the AI's output. Just keep in mind that each model has a limited context window, so it's best to include only the most relevant information. For example, instead of feeding it an entire book, you could provide only the chapter or passage that relates to the user's question. Several techniques can help with this, and one of the most effective is RAG: Retrieval-Augmented Generation. But we'll explore how that works in an upcoming article.
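
As a naive illustration of that idea (nowhere near full RAG), you could filter a document's paragraphs by keywords from the user's question before building the system message. The helper below is hypothetical and only meant to show the general shape of the approach:

// Hypothetical helper: keep only the paragraphs that mention words from the user's question
function relevantContext(string $document, string $question, int $maxParagraphs = 5): string
{
    // Ignore short filler words when matching
    $keywords = array_filter(explode(' ', strtolower($question)), fn ($word) => strlen($word) > 3);
    $paragraphs = preg_split('/\n{2,}/', $document);

    $matches = array_filter($paragraphs, function ($paragraph) use ($keywords) {
        $text = strtolower($paragraph);
        foreach ($keywords as $word) {
            if (str_contains($text, $word)) {
                return true;
            }
        }
        return false;
    });

    // Cap the amount of context so it fits comfortably in the model's window
    return implode("\n\n", array_slice($matches, 0, $maxParagraphs));
}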

In Closing

Today, we explored the power of self-hosted AI models. While the convenience of third-party services is undeniable, this article has hopefully shown that you don't have to be a Machine Learning Engineer to set up your own AI and integrate it into your applications. It's much more straightforward than most expect. Want to quickly switch models? Want to go a step further and train your own? With self-hosted AI, you have the freedom to do so.

Furthermore, a private, self-hosted AI stack is the enterprise-grade foundation for any business serious about maintaining data sovereignty, controlling costs, and future-proofing their infrastructure.

Did you find this insightful? Would you like to see more articles like this? Let us know!

Until next time.
