Topic

Private, Local & Self-hosted AI

Local models, private deployment patterns, self-hosted inference, and hybrid architectures.

10 stories (4 articles · 6 videos)

Start here

A few good first pieces before you browse the full feed.

10 min read

Article

Local AI on your Mac: Ollama, LM Studio, and what 7B models can really do

Running AI locally has matured. With Ollama or LM Studio and a modern Mac, you can run capable models offline, free, and private. What works, what doesn't, and the use cases that actually benefit.

Evaluate the implementation pattern, failure modes, and guardrails before building.

Intermediate

10 min read

Article

Private AI deployment patterns: local, VPC, self-hosted, and hybrid

Private AI is not one architecture. A practical comparison of local models, enterprise SaaS, VPC deployments, self-hosted inference, and hybrid patterns for SMEs that care about privacy and control.

Choose a private AI deployment pattern based on data sensitivity, capability needs, cost, latency, and operational capacity.

Advanced

11 min read

Article

Self-hosted vs hosted inference: vLLM, TGI, and the break-even math

At what scale does self-hosting beat API calls? The actual math, the operational realities, and the patterns that distinguish teams who should self-host from teams who should keep paying for managed inference.

Use the article as decision context for adoption, risk, governance, or investment choices.

Advanced
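To make the "break-even math" above concrete, here is a minimal sketch of the comparison. It is an illustration only, not the article's own model: the function names and every number (GPU price, operations cost, API price, throughput) are hypothetical assumptions. The 50% utilisation figure echoes the "lucky to hit 50% utilisation" line quoted elsewhere on this page.

```python
HOURS_PER_MONTH = 730  # average hours in a month (assumption)


def monthly_self_host_cost(gpu_cost_per_hour, ops_cost_per_month):
    """Fixed monthly cost of an always-on GPU plus operations overhead."""
    return gpu_cost_per_hour * HOURS_PER_MONTH + ops_cost_per_month


def break_even_tokens(monthly_cost, api_price_per_million_tokens):
    """Monthly token volume at which API spend equals the fixed self-hosting cost."""
    return monthly_cost / api_price_per_million_tokens * 1_000_000


def monthly_capacity_tokens(peak_tokens_per_second, utilisation):
    """Tokens one deployment can serve per month at a given average utilisation."""
    return peak_tokens_per_second * utilisation * 3600 * HOURS_PER_MONTH


# Hypothetical inputs, chosen only to show the shape of the calculation.
cost = monthly_self_host_cost(gpu_cost_per_hour=2.0, ops_cost_per_month=1000.0)
needed = break_even_tokens(cost, api_price_per_million_tokens=1.0)
capacity = monthly_capacity_tokens(peak_tokens_per_second=2000, utilisation=0.5)

print(f"Fixed monthly cost:   {cost:,.0f}")
print(f"Break-even volume:    {needed:,.0f} tokens/month")
print(f"Capacity at 50% util: {capacity:,.0f} tokens/month")
print("Self-hosting can pay off" if capacity >= needed else "Stay on the API")
```

With these made-up inputs the break-even volume sits just under what the GPU can serve at 50% utilisation, so the decision hinges on whether real traffic keeps the hardware that busy.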

VMware Private AI Foundation Capabilities and Features Update from Broadcom

Tech Field Day. Shows private AI as layered infrastructure: controlled compute, isolated environments, Kubernetes, inference containers, model governance, self-service provisioning, GPU sharing and monitoring. That maps directly to the article's warning that privacy depends on deployment boundaries, logs, access and operations, not on the word "local."

Advanced

13 min read

Article

Fine-tuning in 2026: when LoRA beats RAG, and how to do it without a cluster

LoRA fine-tuning has become accessible — you can run real fine-tunes on a laptop or rent a GPU for an hour. The patterns that work, the cases where fine-tuning beats RAG, and a practical end-to-end workflow from data prep to deployment.

Evaluate the implementation pattern, failure modes, and guardrails before building.

Advanced

32 minutes

Video

Fast LLM Serving with vLLM and PagedAttention

Anyscale. Walks through why naive LLM serving wastes 60–80% of KV-cache memory, how PagedAttention borrows OS-style paging to fix that, and why continuous batching produces the up-to-24× throughput numbers the article uses in its math. After this, the article's "you'll be lucky to hit 50% utilisation" line stops feeling abstract.

Advanced

59 minutes

Video

Developing an LLM: Building, Training, Finetuning

Sebastian Raschka. A slower walkthrough of where fine-tuning sits in the broader LLM training pipeline: instruction tuning, classification fine-tuning, parameter-efficient methods, and the trade-offs the article calls out before recommending LoRA. Good calibration before you start, especially if your team is debating whether fine-tuning is even the right step.

Advanced

157 minutes

Video

Fine Tuning LLM Models – Generative AI Course

freeCodeCamp.org. Long, theory-then-code course covering quantisation, LoRA, QLoRA, and full PEFT on Llama 2 and Gemma — on hardware most developers actually have. It is the closest thing to a "shadow somebody who has done this" experience on YouTube and lines up with the article's "you don't need a cluster" claim with concrete VRAM budgets.

Advanced

6 minutes

Video

LM Studio Tutorial: Run Large Language Models (LLM) on Your Laptop

Kevin Stratvert. Same workflow as Ollama but in a GUI: download LM Studio, pull a Llama or Gemma model, chat, drop a PDF in and ask questions about it. Good for readers who'd rather not live in the terminal — also useful for getting a feel for how a 1B–3B model actually performs against a heavier one.

Intermediate

14 minutes

Video

Learn Ollama in 15 Minutes - Run LLM Models Locally for FREE

Tech With Tim. A tight, no-nonsense Ollama walkthrough — install, pull a model, chat, then poke at the local HTTP API from Python and create a custom model with a Modelfile. Covers exactly the workflow the article describes for daily use on a Mac, including how to think about model size vs. your machine's RAM.

Intermediate
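For readers who want to see what "poking at the local HTTP API from Python" can look like, here is a minimal sketch using only the standard library. It is not taken from the video. It assumes Ollama is running on its default port (11434) and that the model you name has already been pulled. The helper names are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    return body["response"]
```

With the server running, a call such as generate("llama3.2", "Summarise PagedAttention in one sentence.") returns the model's reply as a string. The model name here is only an example; substitute whichever model you have pulled.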
