Model recommendations by tier
Each tier pairs a default model with a specific infrastructure footprint. Intelligent IT manages the deployment; your team picks the tier that matches your compliance and data-residency requirements.
| Tier | Default model | Infrastructure | Runtime | Throughput | VRAM / memory |
|---|---|---|---|---|---|
| Connect | Gateway only — routes to managed cloud API (Claude, GPT-4o, Gemini) | No on-prem hardware required | Managed API via Intelligent IT gateway | Provider SLA | — |
| Private | Qwen3.6-27B Q4_K_M | Cloud Run + NVIDIA L4 24 GB GPU (dedicated pod) | vLLM 0.5+ | ~30 tok/s | ~16 GB (fits single L4) |
| Sovereign | Qwen3.6-27B Q5, Kimi K2.6, or DeepSeek V3 | Customer hardware: 1× RTX 6000 Ada 48 GB, 4× Mac mini M4 Pro 64 GB, or 8× H100 cluster | vLLM / llama.cpp (MLX on Apple) | 40–120 tok/s (hardware-dependent) | 24–160 GB (configuration-dependent) |
Local-machine sizing
These configurations are for clients who require air-gapped or fully local inference with no cloud dependency. They run under Ollama or llama.cpp and are managed by Intelligent IT in the same way as the cloud tiers.
| Hardware | Recommended model | Quant | Runtime | Throughput | Notes |
|---|---|---|---|---|---|
| 16 GB MBP or Windows laptop (16 GB RAM) | Phi-4-mini (3.8B, MIT) or Qwen3-8B | Q4 | Ollama | 25–40 tok/s | Suitable for single-user RAG and summarization tasks |
| 32 GB MBP (M4 Pro) | Mistral Small 4 24B | Q4 | Ollama / MLX | ~55 tok/s | ~13 GB model; handles legal drafts and code review comfortably |
| 48–64 GB MBP (M4 Max / Ultra) or PC with RTX 4090 (24 GB VRAM) | Qwen3.6-27B | Q4 | Ollama / llama.cpp | 50–60 tok/s (4090 GPU); 40–50 tok/s (M4 Max unified) | Full private-tier model performance on a single workstation |
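To illustrate how one of these local configs is exercised once provisioned, here is a minimal sketch using the Ollama Python client (pip install ollama). The model tag is a placeholder, not a tag Intelligent IT necessarily ships; substitute whatever tag your deployment pulls for your hardware.

```python
import ollama

# Placeholder tag for the 16 GB laptop config; use your provisioned model tag.
MODEL_TAG = "qwen3:8b-q4_K_M"

# Fetch the quantized weights once; no-op if they are already local.
ollama.pull(MODEL_TAG)

# Single-user summarization request, fully local -- no cloud dependency.
response = ollama.chat(
    model=MODEL_TAG,
    messages=[{"role": "user", "content": "Summarize this engagement letter in three bullet points."}],
)
print(response["message"]["content"])
```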
Quantization levels explained
Quantization trades a small amount of output quality for significantly lower memory requirements. Intelligent IT selects the appropriate level based on your hardware and compliance workload.
| Level | Weights memory (approx.) | Quality trade-off | When to use |
|---|---|---|---|
| Q4_K_M | ~0.6 GB per billion parameters | Balanced — minimal perceptible degradation on business tasks | Default for the Private tier and most local-machine configs. Best memory efficiency. |
| Q5_K_M | ~0.75 GB per billion parameters | Quality-leaning — closer to full-precision output on complex reasoning | Sovereign tier, RTX 4090 (24 GB), 48 GB+ unified-memory machines, legal/compliance drafting. |
| Q8_0 | ~1.1 GB per billion parameters | Maximum quality — near-identical to full float16 | High-VRAM servers only (H100 / A100 cluster). Not practical on consumer hardware. |
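Reading the table's figures as approximate GB of quantized weights per billion parameters, a quick back-of-envelope check looks like the sketch below. These are weights-only estimates; KV cache and runtime overhead are extra, so leave headroom on the target device.

```python
# Approximate GB of quantized weights per billion parameters, per the table above.
QUANT_GB_PER_BILLION_PARAMS = {"Q4_K_M": 0.6, "Q5_K_M": 0.75, "Q8_0": 1.1}

def estimated_weights_gb(params_billions: float, quant: str) -> float:
    """Rough weights-only footprint; KV cache and runtime overhead are extra."""
    return params_billions * QUANT_GB_PER_BILLION_PARAMS[quant]

# Private-tier default: a 27B model at Q4_K_M lands around 16 GB,
# which fits a single 24 GB NVIDIA L4 with headroom for KV cache.
print(f"{estimated_weights_gb(27, 'Q4_K_M'):.1f} GB")  # -> 16.2 GB
```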
Why Qwen 3.6 is the Private-tier default
Model selection is reviewed quarterly. As of May 2026, Qwen 3.6 holds the best combination of license, benchmark score, context length, and hardware fit for regulated SMB deployments.
- Apache 2.0 license. Fully permissive for commercial use. Safe for MSP resale, sublicensing, and client deployment without royalty or attribution requirements. No usage-cap clauses.
- Top open-weight benchmarks. SWE-Bench 77.2 (code), IFEval 92.6 (instruction following), MATH 87.1 (reasoning). Matches or exceeds GPT-4o-mini across legal, financial, and healthcare task categories in Intelligent IT's internal eval suite.
- 256 K context window. Handles full contract reviews, large FINRA disclosure packets, and multi-document RAG queries without chunking artifacts or context truncation.
- Single-GPU fit at Q4. The 27B model quantized to Q4_K_M (~16 GB) runs entirely on one NVIDIA L4 24 GB GPU, keeping Private-tier infrastructure to a single dedicated pod and preserving tenant isolation without multi-GPU coordination overhead.
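For a rough picture of the single-pod layout, the sketch below brings up an offline vLLM engine on one GPU. The model identifier and quantization argument are placeholders rather than the exact artifact Intelligent IT ships, and the context-length and memory-utilization settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Offline engine on one GPU. The repo id and quantization argument are placeholders;
# use the quantized artifact your deployment provisions. max_model_len is trimmed
# to leave VRAM for KV cache on a 24 GB card.
llm = LLM(
    model="Qwen/placeholder-27B-Instruct",  # hypothetical repo id
    quantization="awq",                     # format depends on the deployed artifact
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Draft a two-paragraph summary of our data-retention policy."], params
)
print(outputs[0].outputs[0].text)
```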
License notes for buyers
Before deploying any open-weight model in a regulated environment, verify the artifact license on Hugging Face against the upstream release. License terms for open models can differ between the original release and derivative artifacts.
| Model | License | MSP-resale status | Notes |
|---|---|---|---|
| Qwen 3.6 | Apache 2.0 | Clean — permissive | No usage caps, no attribution requirement in output. Safe for all AiTLLM tiers. |
| DeepSeek V3 / R1 | MIT (relicensed Mar 2025) | Clean — permissive | MIT license applies to weights and derivative works. Verify the Hugging Face artifact matches the upstream MIT release before use. |
| Kimi K2.6 | MIT | Clean — permissive | MoE architecture (1T total / 32B active). Verify HF artifact license on each pull. |
| GLM-5.1 | Apache 2.0 | Clean — permissive | Strong bilingual (ZH/EN) performance. Confirm HF artifact matches upstream release. |
| Llama 4 | Llama 4 Community License | Monitor as we scale | Permissive for commercial use up to 700 M monthly active users. Still ships in AiTLLM routing; watch as aggregate tenant volume approaches that threshold. |
| Phi-4-mini | MSRLA (first release) | Verify HF artifact | Microsoft Research License Agreement on initial weights. Verify that the Hugging Face artifact you pull carries the MSRLA and confirm its commercial-use terms for your deployment. |
| Mistral Small 4 24B | Apache 2.0 | Clean — permissive | Released Apr 2025. Good balance of size and instruction-following for local-machine configs. |
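One way to run the artifact check described above is to read the license tag straight from the Hugging Face Hub before pulling weights. The sketch below uses the huggingface_hub client; the repo id is only an example, and a missing or mismatched tag means the artifact needs manual review.

```python
from huggingface_hub import model_info

def hf_license(repo_id: str) -> str | None:
    """Return the license tag published on the Hugging Face repo, if any."""
    info = model_info(repo_id)
    # Most model repos expose their license as a "license:<id>" tag.
    for tag in info.tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None

print(hf_license("Qwen/Qwen2.5-7B-Instruct"))  # e.g. "apache-2.0"
```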
Ready to size your deployment?
We walk through your infrastructure, compliance requirements, and user count to confirm which tier and hardware configuration fits. 30 minutes. No slide deck.
Book a discovery call
Sizing data as of 2026-05-07. Benchmark figures from public model card releases and Intelligent IT's internal evaluation suite. Hardware throughput figures measured at 2 K token output length; actual performance varies by context length, batch size, and system load. License terms are subject to change by model vendors; verify the HF artifact before deployment.
© Intelligent Group (DBA Intelligent IT) · intelligentit.io