
Local LLM Providers

nxusKit supports two categories of local LLM providers:

  • In-Process Providers — Load and run models directly in your application process. No external server required. Requires Cargo feature flags.
  • HTTP-Based Providers — Connect to a locally running inference server (Ollama, LM Studio). No feature flags needed.

In-process providers load GGUF model files directly into your application and run inference without any external server. This gives you the lowest possible latency, full control over model lifecycle, and zero network overhead.

nxusKit provides two in-process inference backends:

| Backend | Feature Flag | Engine | Status |
|---|---|---|---|
| llama.cpp | provider-local-llama | llama-cpp-2 (safe Rust bindings to llama.cpp) | Production-ready |
| mistral.rs | provider-local-mistralrs | mistral.rs (pure-Rust inference on Candle) | Experimental |

Both backends load the same GGUF model format. If both features are enabled, you can select a backend explicitly or let nxusKit auto-select the first available one.
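
For example, enabling both backends in a project's Cargo.toml might look like the following sketch. The crate name nxuskit-engine mirrors the usage example later on this page, and the version is a placeholder:

```toml
[dependencies]
# Placeholder version; the feature names are the flags from the table above.
nxuskit-engine = { version = "0.1", features = [
    "provider-local-llama",     # llama.cpp backend (production-ready)
    "provider-local-mistralrs", # mistral.rs backend (experimental)
] }
```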

GGUF (GPT-Generated Unified Format) is the only supported format. GGUF files are self-contained — they bundle weights, tokenizer, and metadata in a single file. Both backends read the same .gguf files.

You can obtain GGUF models from:

  • Hugging Face — Search for “GGUF” in model filters
  • Ollama — Models pulled by Ollama are stored as GGUF blobs (nxusKit can discover these)
  • TheBloke on Hugging Face — Prolific quantizer of popular models

Any model published in GGUF format that is compatible with llama.cpp should work. The following model families are known to work:

| Model Family | Parameters | Example GGUF | Notes |
|---|---|---|---|
| Llama 3.2 | 1B, 3B | Llama-3.2-1B-Instruct-Q4_K_M.gguf | Meta’s latest small models. Great for CPU. |
| Llama 3.1 | 8B, 70B, 405B | Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf | Excellent general-purpose. 8B runs well on 16 GB RAM. |
| Llama 3 | 8B, 70B | Meta-Llama-3-8B-Instruct-Q4_K_M.gguf | Predecessor to 3.1, widely available. |
| Llama 2 | 7B, 13B, 70B | llama-2-7b-chat.Q4_K_M.gguf | Older but battle-tested. |
| Mistral | 7B | mistral-7b-instruct-v0.3.Q4_K_M.gguf | Strong performance relative to size. |
| Mixtral | 8x7B, 8x22B | mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | Mixture-of-experts. Needs more RAM. |
| Phi-3 / Phi-3.5 | 3.8B, 14B | Phi-3.5-mini-instruct-Q4_K_M.gguf | Microsoft. Strong reasoning for size. |
| Gemma 2 | 2B, 9B, 27B | gemma-2-9b-it-Q4_K_M.gguf | Google. Good coding and instruction following. |
| Qwen 2.5 | 0.5B–72B | Qwen2.5-7B-Instruct-Q4_K_M.gguf | Alibaba. Multilingual, strong at math/code. |
| DeepSeek-R1 | 1.5B–70B (distilled) | DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf | Reasoning-focused distilled models. |
| CodeLlama | 7B, 13B, 34B | codellama-7b-instruct.Q4_K_M.gguf | Code-specialized Llama variant. |
| TinyLlama | 1.1B | tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf | Tiny. Useful for testing and CI. |
| StableLM | 1.6B, 3B | stablelm-2-1_6b-chat.Q4_K_M.gguf | Stability AI. Small and fast. |
| Yi | 6B, 9B, 34B | Yi-1.5-9B-Chat-Q4_K_M.gguf | 01.AI. Strong multilingual. |
| Command R | 35B, 104B | c4ai-command-r-v01-Q4_K_M.gguf | Cohere. RAG-optimized. |
| Falcon | 7B, 40B, 180B | falcon-7b-instruct-Q4_K_M.gguf | TII. |

Tip: For testing and development, start with TinyLlama 1.1B (fast, tiny, runs anywhere) or Llama 3.2 1B (modern, instruction-tuned, still small enough for CPU).

GGUF models come in different quantization levels that trade quality for size/speed. nxusKit auto-detects the quantization from the filename:

| Quantization | Bits/Weight | Relative Quality | Relative Size | Recommended For |
|---|---|---|---|---|
| F32 | 32 | Best (reference) | Largest | Accuracy testing only |
| F16 | 16 | Excellent | ~50% of F32 | GPU with sufficient VRAM |
| Q8_0 | 8.5 | Near-lossless | ~50% of F16 | Quality-sensitive tasks |
| Q6_K | 6.6 | Very good | ~38% of F16 | Good quality/size balance |
| Q5_K_M | 5.7 | Good | ~33% of F16 | Balanced |
| Q5_K_S | 5.5 | Good | ~32% of F16 | Slightly smaller than K_M |
| Q4_K_M | 4.8 | Acceptable | ~27% of F16 | Most popular. Best quality/size tradeoff. |
| Q4_K_S | 4.5 | Acceptable | ~26% of F16 | Slightly smaller than K_M |
| Q3_K_M | 3.9 | Degraded | ~22% of F16 | Memory-constrained environments |
| Q2_K | 3.4 | Poor | ~18% of F16 | Extreme memory constraints only |

Recommendation: Use Q4_K_M for most deployments. It provides the best balance of quality, speed, and memory usage.
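
As a rough illustration of filename-based detection (a minimal sketch, not nxusKit's actual implementation), matching the quantization tag in a GGUF filename could look like this:

```rust
/// Sketch only: scan a GGUF filename for a known quantization tag.
/// Illustrative; nxusKit's real detection logic is not shown on this page.
fn detect_quantization(filename: &str) -> Option<&'static str> {
    const TAGS: &[&str] = &[
        "F32", "F16", "Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S",
        "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K",
    ];
    let upper = filename.to_uppercase();
    TAGS.iter().find(|tag| upper.contains(*tag)).copied()
}

fn main() {
    let quant = detect_quantization("Llama-3.2-1B-Instruct-Q4_K_M.gguf");
    assert_eq!(quant, Some("Q4_K_M"));
}
```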

Approximate RAM needed to load a model (varies by quantization):

| Model Size | Q4_K_M | Q8_0 | F16 |
|---|---|---|---|
| 1B params | ~0.8 GB | ~1.2 GB | ~2 GB |
| 3B params | ~2 GB | ~3.5 GB | ~6 GB |
| 7-8B params | ~4.5 GB | ~8 GB | ~15 GB |
| 13B params | ~8 GB | ~14 GB | ~26 GB |
| 34B params | ~20 GB | ~36 GB | ~68 GB |
| 70B params | ~40 GB | ~72 GB | ~140 GB |

These are approximate. Actual usage depends on context window size, batch size, and KV cache.
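
These figures roughly track the estimate params × bits-per-weight ÷ 8. A back-of-the-envelope helper (a sketch using the bits/weight column from the quantization table above; real usage adds KV cache and runtime overhead on top):

```rust
/// Rough weight-only memory estimate in GB: params (billions) * bits / 8.
/// Ignores KV cache, activations, and runtime overhead.
fn approx_weights_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

fn main() {
    // 8B parameters at Q4_K_M (~4.8 bits/weight) => ~4.8 GB,
    // in the same range as the table's ~4.5 GB figure for 7-8B models.
    println!("{:.1} GB", approx_weights_gb(8.0, 4.8));
}
```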

GPU offloading is automatic when available:

| Platform | Acceleration | How to Enable |
|---|---|---|
| macOS (Apple Silicon) | Metal | Automatic — llama.cpp detects Metal at build time |
| Linux (NVIDIA) | CUDA | Install CUDA toolkit; llama-cpp-2 detects at build time |
| Linux (AMD) | ROCm/Vulkan | Vulkan support via llama.cpp build flags |
| Windows (NVIDIA) | CUDA | Install CUDA toolkit |
| All platforms | CPU (AVX2/NEON) | Always available as fallback |

Control GPU offloading with n_gpu_layers:

  • -1 — offload all layers to GPU (maximum acceleration)
  • 0 — CPU only (default)
  • N — offload first N layers (partial offloading for limited VRAM)
```json
{
  "provider_type": "local",
  "model": "/path/to/model.gguf",
  "options": {
    "backend": "llama-cpp",
    "n_gpu_layers": -1,
    "context_size": 4096,
    "batch_size": 512,
    "threads": 8
  }
}
```

Configuration options:

| Option | Type | Default | Description |
|---|---|---|---|
| backend | string | auto-detect | "llama-cpp" or "mistralrs". Auto-selects the first available if omitted. |
| n_gpu_layers | integer | 0 | GPU layer offloading. -1 = all, 0 = CPU only, N = first N layers. |
| context_size | integer | model default | Context window size in tokens. |
| batch_size | integer | backend default | Prompt processing batch size. Higher = faster prompt processing, more memory. |
| threads | integer | auto-detect | CPU thread count. Leave unset for optimal auto-detection. |

Basic chat usage with the builder API:

```rust
use nxuskit_engine::{ChatRequest, LocalRuntimeProvider, Message};

// Build an in-process provider backed by a local GGUF file.
let provider = LocalRuntimeProvider::builder()
    .model_path("/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")
    .n_gpu_layers(-1) // Offload all layers to the GPU
    .context_size(4096)
    .build()?;

// The request references the model by its GGUF file name.
let request = ChatRequest::new("Llama-3.2-1B-Instruct-Q4_K_M.gguf")
    .with_message(Message::user("Explain quantum computing in one paragraph."))
    .with_temperature(0.7)
    .with_max_tokens(256);

let response = provider.chat(&request).await?;
println!("{}", response.content);
```

The provider can discover available models from multiple sources:

```rust
let models = provider.list_models().await?;
for model in &models {
    println!("{}: {} ({})",
        model.id,
        model.name,
        model.metadata.get("quantization").unwrap_or(&"unknown".into()));
}
```

Discovery sources (in priority order):

  1. Explicit path — The model path in your configuration
  2. Search paths — Directories you configure (scans for .gguf files)
  3. Ollama store (opt-in) — Discovers models already pulled by Ollama
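
A hypothetical builder sketch combining the three sources; the search_path and discover_ollama_models method names are assumptions for illustration, not confirmed nxusKit API:

```rust
// Hypothetical sketch: `search_path` and `discover_ollama_models` are
// assumed method names, not confirmed nxusKit API.
let provider = LocalRuntimeProvider::builder()
    .model_path("/models/primary.gguf")  // 1. explicit path
    .search_path("/models")              // 2. directory scanned for .gguf files
    .discover_ollama_models(true)        // 3. opt in to the Ollama store
    .build()?;
```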

Ollama store locations (auto-detected):

  • macOS: ~/.ollama/models
  • Linux: /usr/share/ollama/.ollama/models or ~/.ollama/models
  • Windows: %USERPROFILE%\.ollama\models

Environment variables:

  • NXUSKIT_MODELS — Custom model search directory
  • OLLAMA_MODELS — Override Ollama store location

Models stay loaded in memory after first use. You can manage the cache explicitly:

```rust
// Pre-load a model (async, happens in the background)
provider.preload_model("/models/llama-3.2-1b.Q4_K_M.gguf").await?;

// Check what's loaded
for info in provider.cached_models() {
    println!("{}: {} bytes", info.path, info.memory_bytes.unwrap_or(0));
}

// Free memory
provider.unload_model("/models/llama-3.2-1b.Q4_K_M.gguf");
```

Capabilities of the in-process providers:

| Capability | Supported |
|---|---|
| System messages | Yes |
| Streaming | Yes (token-by-token) |
| Vision/images | No |
| JSON mode | No (llama.cpp) / Yes (mistral.rs) |
| Seed (deterministic) | Yes |
| Stop sequences | Yes (up to 4) |
| Temperature | Yes |
| Top-p | Yes |
| Max tokens | Yes |
| Presence penalty | No |
| Frequency penalty | No |
| Tool calling | No |
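
Since streaming is supported token-by-token, consuming it might look like the sketch below; the chat_stream method and the chunk's delta field are assumed names, as the exact streaming API is not documented on this page:

```rust
use futures::StreamExt;

// Assumed API: `chat_stream` and `chunk.delta` are illustrative names.
// The capability table above only confirms token-by-token streaming.
let mut stream = provider.chat_stream(&request).await?;
while let Some(chunk) = stream.next().await {
    print!("{}", chunk?.delta);
}
```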
How the two backends compare:

| Feature | llama.cpp (provider-local-llama) | mistral.rs (provider-local-mistralrs) |
|---|---|---|
| Maturity | Production-ready | Experimental |
| Language | C++ with Rust bindings | Pure Rust (Candle) |
| Build time | Fast (~30s) | Slow (~3-5 min, pulls Candle) |
| GPU support | Metal, CUDA, Vulkan | Metal, CUDA (via Candle) |
| JSON mode | No | Yes (ISQ support) |
| Chat templates | Manual | Auto-detected from GGUF metadata |
| PagedAttention | No | Yes (CUDA, Apple Silicon) |
| In-situ quantization | No | Yes (load unquantized, quantize at runtime) |
| Binary size impact | Small (~2 MB) | Large (~20 MB, Candle framework) |
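
When both feature flags are enabled, you can pin a backend explicitly with the backend option documented in the configuration table above (here selecting mistral.rs):

```json
{
  "provider_type": "local",
  "model": "/path/to/model.gguf",
  "options": {
    "backend": "mistralrs"
  }
}
```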

HTTP-based local providers connect to a separately running inference server. No Cargo feature flags are needed — they use the standard HTTP transport.

Run models locally via Ollama.

```json
{
  "provider_type": "ollama",
  "base_url": "http://localhost:11434",
  "timeout_ms": 120000
}
```

Environment variable: OLLAMA_HOST (optional, defaults to http://localhost:11434)

Supported models: Any model pulled via ollama pull, e.g., llama3.1, codellama, mistral, phi3

Capabilities: System messages, streaming

Note: No API key required. The provider connects to a locally running Ollama server. The default timeout is 120 seconds (longer than cloud providers) to accommodate model loading.

Run models locally via LM Studio.

```json
{
  "provider_type": "lmstudio",
  "base_url": "http://localhost:1234/v1",
  "timeout_ms": 120000
}
```

Environment variable: LMSTUDIO_HOST (optional, defaults to http://localhost:1234/v1)

Capabilities: System messages, streaming

Note: No API key required. Start the LM Studio local server before using this provider.


Choosing between the local options:

| Consideration | In-Process (local) | Ollama | LM Studio |
|---|---|---|---|
| Setup complexity | Download a GGUF file | Install Ollama + pull model | Install LM Studio + download model |
| External server | None required | Must be running | Must be running |
| Latency | Lowest (no HTTP overhead) | Low (localhost HTTP) | Low (localhost HTTP) |
| Model lifecycle control | Full (preload/unload API) | Managed by Ollama | Managed by LM Studio |
| Memory management | Direct (cache API) | Managed by Ollama | Managed by LM Studio |
| Feature flags needed | Yes (provider-local-llama) | No | No |
| Build dependencies | C++ compiler (llama.cpp) | None | None |
| Best for | Embedded/library use, max control | Quick experimentation | GUI-based development |