1.
Classification of AI Models
Models are classified based on their accessibility and level of control:
Proprietary Models (Closed Source)
Owned and controlled by specific companies (e.g., OpenAI, Google, Anthropic).
● Black Box Nature: The source code, training data, and internal weights are hidden.
Users cannot inspect the "why" behind a decision.
● Access Method: Usually via API (Application Programming Interface) or paid
subscriptions.
● Deployment: Cloud-based. You send your data to their servers, and they send back the
answer.
Open Source Models (Open Weight)
Core details are shared with the public (e.g., Meta’s Llama 3, Mistral, DeepSeek).
● Transparency: Architecture and weights are public, allowing for full inspection and local
hosting.
● Customization: Can be "fine-tuned" on private data to create specialized experts.
● Access: Downloadable for free from platforms like Hugging Face.
2. Ollama: The Local AI Orchestrator
Ollama is a tool that allows you to run these Open Source LLMs directly on your own hardware.
It simplifies the complex process of setting up and managing AI models locally.
Benefit Description
Privacy Sensitive data (legal files, medical records) never leaves your local
machine.
Zero Cost No "pay-per-token" fees. You only pay for your hardware and
electricity.
Offline Works without an internet connection once the model is
Access downloaded.
Simple CLI Manage models like a "Play Store" using commands like run, pull,
and rm.
No Lock-in You aren't tied to a specific cloud vendor's pricing or terms.
System Requirements
• Operating System: Windows, macOS, or Linux.
• Hardware: Laptop/PC with at least 8 GB of RAM (more RAM improves smoothness).
• Storage: Sufficient disk space is required, as models range from 3 GB to 15 GB.
• Connectivity: Internet is required only for the initial download of the tool and models.
• Skills: Basic knowledge of command line/terminal functions.
• GPU: Optional, but speeds up performance if available.
3. The Modelfile: The Blueprint of Your Model
A Modelfile is a configuration file that allows you to customize an LLM's personality, settings,
and behavior. It works similarly to a "Dockerfile"—it doesn't store the massive brain (the
weights), but it provides the instructions on how to use it.
Core Components of a Modelfile
The Modelfile uses specific directives to build a custom model:
● FROM (Required): Defines the base model you are building upon.
○ Theory: You take a generic model (like llama3.2) and use it as the foundation.
● SYSTEM: Sets the "Identity" or "System Prompt."
○ Theory: This defines the permanent rules the model must follow (e.g., "You are a
professional accountant who only speaks in bullet points").
● PARAMETER: Adjusts the technical "knobs" of the model.
○Theory: You can control Temperature (creativity vs. logic), Num_Ctx (how much
history it remembers), and Stop Sequences (where the model should stop
talking).
● TEMPLATE: Defines the interaction format.
○ Theory: It structures how the user's prompt and the system's response are
"wrapped" so the LLM understands who is speaking.
● MESSAGE: Pre-loads conversation history.
○ Theory: You can give the model "examples" of how to behave by providing a few
sample questions and answers within the blueprint.
The Creation Process
1. Write: You create a text file named Modelfile.
2. Build: You run a command (ollama create) which "packages" your instructions with
the base model.
3. Deploy: You now have a new model identity (e.g., legal-assistant) that appears
in your ollama list.
4. Tool Calling: Giving LLMs "Hands"
Tool Calling (or Function Calling) is the process where an LLM realizes it cannot answer a
question on its own and requests to use an external tool.
The Theoretical Workflow
Instead of just "chatting," tool calling follows a 4-step loop:
1. Declaration (The Menu):
○ You provide the LLM with a list of "Tools" (functions) described in JSON
Schema.
○ Theory: You aren't giving the model the code; you are giving it a "User Manual"
for the tool (e.g., "I have a tool called get_weather that needs a city name").
2. Recognition (The Decision):
○ The user asks: "What is the temperature in Paris?"
○ The LLM realizes it doesn't know live data. It looks at its "Menu" and decides: "I
need to call get_weather(city='Paris')."
○ Crucial Point: The LLM does not run the code. It simply outputs a "request" in
JSON format.
3. Execution (The Action):
○ Your local system (or LangChain) sees the LLM's request.
○ It runs the actual code (e.g., hits a weather API or checks a database).
○ It collects the result (e.g., "22°C").
4. Integration (The Final Answer):
○ The result ("22°C") is sent back to the LLM.
○ The LLM reads that result and finally answers the user: "The temperature in Paris
is currently 22°C."
Why use Tool Calling locally?
● Real-time Data: LLMs are frozen in time; tools let them see today's news or stock
prices.
● Accuracy: LLMs are bad at math; a tool can send a calculation to a Python script for a
100% correct answer.
● Action-Oriented: Tools allow the AI to actually do things, like sending an email or
saving a file to your desktop.
5. Ollama and LangChain: The Orchestration Layer
While Ollama acts as the engine (running the model), LangChain acts as the brain or the "glue"
that connects that engine to your data, tools, and workflows.
The Role of LangChain
In a standard setup, LangChain provides a standardized interface. This means you can write
your application logic once and switch between a local Ollama model and a cloud model (like
OpenAI) by changing just one line of configuration.
Key Integration Concepts
● Prompt Templating: LangChain manages complex prompts. Instead of sending raw
text to Ollama, LangChain structures it into "System," "AI," and "Human" messages to
ensure the local model follows instructions strictly.
● Chains: You can link multiple Ollama calls together. For example:
1. Chain 1: Use a small Ollama model (like Phi-3) to summarize a document.
2. Chain 2: Use a larger Ollama model (like Llama 3) to answer a specific question
based on that summary.
● RAG (Retrieval-Augmented Generation): This is the most popular use case.
LangChain "retrieves" relevant facts from your private files (PDFs, Excel) and feeds
them to the local Ollama model as "context" so it can answer questions about your
private data without that data ever leaving your machine.
● Memory: LangChain handles the conversation history. It stores previous turns of a chat
and resends them to Ollama so the local model "remembers" what you said earlier in the
conversation.
6. Ollama Cloud (Released in late 2025/2026)
Ollama Cloud is a hybrid expansion of the local Ollama tool. It allows users to run massive
models that a standard laptop cannot handle while maintaining the same simple user
experience.
How it Works
● Remote Inference: Instead of ollama run llama3, you can run ollama run
llama3-cloud. The command stays the same, but the heavy lifting happens on
Ollama’s high-performance servers.
● Hybrid "Bursting": You can develop and test locally on a 7B (7 billion parameter)
model. When you need "God-mode" reasoning for a complex task, you "burst" that
specific query to Ollama Cloud to run a 400B+ parameter model.
● Zero-Configuration Sync: Your local Modelfiles can be pushed to Ollama Cloud. This
ensures that the "personality" and "instructions" you built locally behave exactly the
same way in the cloud.
Core Benefits of the Cloud Tier
● Hardware Independence: You can run state-of-the-art models from a cheap
Chromebook or an old tablet.
● Battery Life: Local inference is heavy on power; using the cloud tier saves your laptop’s
battery during long sessions.
● Privacy-First Cloud: Unlike standard proprietary APIs, Ollama Cloud is designed with
"Stateless Inference"—meaning they process the request but do not store your data for
training.