Model recommendations by tier
Each tier pairs a default model with a specific infrastructure footprint. Intelligent IT manages the deployment; your team picks the tier that matches your compliance and data-residency requirements.
| Tier | Default model | Infrastructure | Runtime | Throughput | VRAM / memory |
|---|---|---|---|---|---|
| Connect | Gateway only — routes to managed cloud API (Claude, GPT-4o, Gemini) | No on-prem hardware required | Managed API via Intelligent IT gateway | Provider SLA | — |
| Private | Qwen3.6-27B Q4_K_M | Cloud Run + NVIDIA L4 24 GB GPU (dedicated pod) | vLLM 0.5+ | ~30 tok/s | ~16 GB (fits single L4) |
| Sovereign | Qwen3.6-27B Q5, Kimi K2.6, or DeepSeek V3 | Customer hardware: 1× RTX 6000 Ada 48 GB, 4× Mac mini M4 Pro 64 GB, or 8× H100 cluster | vLLM / llama.cpp (MLX on Apple) | 40–120 tok/s (hardware-dependent) | 24–160 GB (configuration-dependent) |
Local-machine sizing
These configurations are for clients who require air-gapped or fully local inference with no cloud dependency. They run under Ollama or llama.cpp and are managed by Intelligent IT in the same way as the cloud tiers.
| Hardware | Recommended model | Quant | Runtime | Throughput | Notes |
|---|---|---|---|---|---|
| 16 GB MBP or Windows laptop (16 GB RAM) | Phi-4-mini (3.8B, MIT) or Qwen3-8B | Q4 | Ollama | 25–40 tok/s | Suitable for single-user RAG and summarization tasks |
| 32 GB MBP (M4 Pro) | Mistral Small 4 24B | Q4 | Ollama / MLX | ~55 tok/s | ~13 GB model; handles legal drafts and code review comfortably |
| 48–64 GB MBP (M4 Max / Ultra) or PC with RTX 4090 (24 GB VRAM) | Qwen3.6-27B | Q4 | Ollama / llama.cpp | 50–60 tok/s (4090 GPU); 40–50 tok/s (M4 Max unified) | Full private-tier model performance on a single workstation |
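To illustrate how one of these local configs is exercised once provisioned, here is a minimal sketch using the Ollama Python client (pip install ollama). The model tag is a placeholder, not a tag Intelligent IT necessarily ships; substitute whatever tag your deployment pulls for your hardware.

```python
import ollama

# Placeholder tag for the 16 GB laptop config; use your provisioned model tag.
MODEL_TAG = "qwen3:8b-q4_K_M"

# Fetch the quantized weights once; no-op if they are already local.
ollama.pull(MODEL_TAG)

# Single-user summarization request, fully local -- no cloud dependency.
response = ollama.chat(
    model=MODEL_TAG,
    messages=[{"role": "user", "content": "Summarize this engagement letter in three bullet points."}],
)
print(response["message"]["content"])
```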
Quantization levels explained
Quantization trades a small amount of output quality for significantly lower memory requirements. Intelligent IT selects the appropriate level based on your hardware and compliance workload.
| Level | Weights memory (approx.) | Quality trade-off | When to use |
|---|---|---|---|
| Q4_K_M | ~0.6 GB per billion parameters | Balanced — minimal perceptible degradation on business tasks | Default for the Private tier and most local-machine configs. Best memory efficiency. |
| Q5_K_M | ~0.75 GB per billion parameters | Quality-leaning — closer to full-precision output on complex reasoning | Sovereign tier, RTX 4090 (24 GB), 48 GB+ unified-memory machines, legal/compliance drafting. |
| Q8_0 | ~1.1 GB per billion parameters | Maximum quality — near-identical to full float16 | High-VRAM servers only (H100 / A100 cluster). Not practical on consumer hardware. |
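Reading the table's figures as approximate GB of quantized weights per billion parameters, a quick back-of-envelope check looks like the sketch below. These are weights-only estimates; KV cache and runtime overhead are extra, so leave headroom on the target device.

```python
# Approximate GB of quantized weights per billion parameters, per the table above.
QUANT_GB_PER_BILLION_PARAMS = {"Q4_K_M": 0.6, "Q5_K_M": 0.75, "Q8_0": 1.1}

def estimated_weights_gb(params_billions: float, quant: str) -> float:
    """Rough weights-only footprint; KV cache and runtime overhead are extra."""
    return params_billions * QUANT_GB_PER_BILLION_PARAMS[quant]

# Private-tier default: a 27B model at Q4_K_M lands around 16 GB,
# which fits a single 24 GB NVIDIA L4 with headroom for KV cache.
print(f"{estimated_weights_gb(27, 'Q4_K_M'):.1f} GB")  # -> 16.2 GB
```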
Why Qwen 3.6 is the Private-tier default
Model selection is reviewed quarterly. As of May 2026, Qwen 3.6 holds the best combination of license, benchmark score, context length, and hardware fit for regulated SMB deployments.
- Apache 2.0 license. Fully permissive for commercial use. Safe for MSP resale, sublicensing, and client deployment without royalty or attribution requirements. No usage-cap clauses.
- Top open-weight benchmarks. SWE-Bench 77.2 (code), IFEval 92.6 (instruction following), MATH 87.1 (reasoning). Matches or exceeds GPT-4o-mini across legal, financial, and healthcare task categories in Intelligent IT's internal eval suite.
- 256 K context window. Handles full contract reviews, large FINRA disclosure packets, and multi-document RAG queries without chunking artifacts or context truncation.
- Single-GPU fit at Q4. The 27B model quantized to Q4_K_M (~16 GB) runs entirely on one NVIDIA L4 24 GB GPU, keeping Private-tier infrastructure to a single dedicated pod and preserving tenant isolation without multi-GPU coordination overhead.
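For a rough picture of the single-pod layout, the sketch below brings up an offline vLLM engine on one GPU. The model identifier and quantization argument are placeholders rather than the exact artifact Intelligent IT ships, and the context-length and memory-utilization settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Offline engine on one GPU. The repo id and quantization argument are placeholders;
# use the quantized artifact your deployment provisions. max_model_len is trimmed
# to leave VRAM for KV cache on a 24 GB card.
llm = LLM(
    model="Qwen/placeholder-27B-Instruct",  # hypothetical repo id
    quantization="awq",                     # format depends on the deployed artifact
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Draft a two-paragraph summary of our data-retention policy."], params
)
print(outputs[0].outputs[0].text)
```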
License notes for buyers
Before deploying any open-weight model in a regulated environment, verify the artifact license on Hugging Face against the upstream release. License terms for open models can differ between the original release and derivative artifacts.
| Model | License | MSP-resale status | Notes |
|---|---|---|---|
| Qwen 3.6 | Apache 2.0 | Clean — permissive | No usage caps, no attribution requirement in output. Safe for all AiTLLM tiers. |
| DeepSeek V3 / R1 | MIT (relicensed Mar 2025) | Clean — permissive | MIT license applies to weights and derivative works. Verify the Hugging Face artifact matches the upstream MIT release before use. |
| Kimi K2.6 | MIT | Clean — permissive | MoE architecture (1T total / 32B active). Verify HF artifact license on each pull. |
| GLM-5.1 | Apache 2.0 | Clean — permissive | Strong bilingual (ZH/EN) performance. Confirm HF artifact matches upstream release. |
| Llama 4 | Llama 4 Community License | Monitor as we scale | Permissive for commercial use up to 700 M monthly active users. Still ships in AiTLLM routing; watch as aggregate tenant volume approaches that threshold. |
| Phi-4-mini | MSRLA (first release) | Verify HF artifact | Microsoft Research License Agreement on initial weights. Verify that the Hugging Face artifact you pull carries the MSRLA and confirm its commercial-use terms for your deployment. |
| Mistral Small 4 24B | Apache 2.0 | Clean — permissive | Released Apr 2025. Good balance of size and instruction-following for local-machine configs. |
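One way to run the artifact check described above is to read the license tag straight from the Hugging Face Hub before pulling weights. The sketch below uses the huggingface_hub client; the repo id is only an example, and a missing or mismatched tag means the artifact needs manual review.

```python
from huggingface_hub import model_info

def hf_license(repo_id: str) -> str | None:
    """Return the license tag published on the Hugging Face repo, if any."""
    info = model_info(repo_id)
    # Most model repos expose their license as a "license:<id>" tag.
    for tag in info.tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None

print(hf_license("Qwen/Qwen2.5-7B-Instruct"))  # e.g. "apache-2.0"
```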
Ready to size your deployment?
We walk through your infrastructure, compliance requirements, and user count to confirm which tier and hardware configuration fits. 30 minutes. No slide deck.
Book a discovery call
Sizing data as of 2026-05-07. Benchmark figures from public model card releases and Intelligent IT's internal evaluation suite. Hardware throughput figures measured at 2 K token output length; actual performance varies by context length, batch size, and system load. License terms are subject to change by model vendors; verify the HF artifact before deployment.
© Intelligent Group (DBA Intelligent IT) · intelligentit.io