
AiTLLM Hardware & Model Sizing

Technical reference for buyers evaluating which tier fits their infrastructure, compliance posture, and performance requirements.

Section 1

Model recommendations by tier

Each tier maps to a specific infrastructure footprint and a default model. Intelligent IT manages the deployment; your team picks the tier that matches your compliance and data-residency requirements. A minimal client request example follows the table.

| Tier | Default model | Infrastructure | Runtime | Throughput | VRAM / memory |
|------|---------------|----------------|---------|------------|---------------|
| Connect | Gateway only — routes to managed cloud API (Claude, GPT-4o, Gemini) | No on-prem hardware required | Managed API via Intelligent IT gateway | Provider SLA | N/A |
| Private | Qwen3.6-27B Q4_K_M | Cloud Run + NVIDIA L4 24 GB GPU (dedicated pod) | vLLM 0.5+ | ~30 tok/s | ~16 GB (fits a single L4) |
| Sovereign | Qwen3.6-27B Q5, Kimi K2.6, or DeepSeek V3 | Customer hardware: 1× RTX 6000 Ada 48 GB, 4× Mac mini M4 Pro 64 GB, or 8× H100 cluster | vLLM / llama.cpp (MLX on Apple) | 40–120 tok/s (hardware-dependent) | 24–160 GB (configuration-dependent) |
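
The Private tier's vLLM runtime can be reached through vLLM's OpenAI-compatible server. The sketch below is illustrative only and assumes that interface is enabled on the pod; the endpoint URL, API key, and model identifier are placeholders, not the actual values Intelligent IT provisions.

```python
# Minimal sketch of querying a Private-tier vLLM pod through its
# OpenAI-compatible API. Endpoint, key, and model id are placeholders;
# substitute the values provisioned for your tenant.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-private-pod.example.com/v1",  # placeholder endpoint
    api_key="YOUR_TENANT_KEY",                           # placeholder credential
)

response = client.chat.completions.create(
    model="qwen3.6-27b-q4_k_m",  # placeholder id for the Private-tier default model
    messages=[{"role": "user", "content": "Summarize the attached disclosure packet."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```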
Section 2

Local-machine sizing

For clients who require air-gapped or fully local inference with no cloud dependency. These configurations run under Ollama or llama.cpp and are managed by Intelligent IT in the same way as the cloud tiers; a minimal local-inference example follows the table.

| Hardware | Recommended model | Quant | Runtime | Throughput | Notes |
|----------|-------------------|-------|---------|------------|-------|
| 16 GB MBP or Windows laptop (16 GB RAM) | Phi-4-mini (3.8B, MIT) or Qwen3-8B | Q4 | Ollama | 25–40 tok/s | Suitable for single-user RAG and summarization tasks |
| 32 GB MBP (M4 Pro) | Mistral Small 4 24B | Q4 | Ollama / MLX | ~55 tok/s | ~13 GB model; handles legal drafts and code review comfortably |
| 48–64 GB MBP (M4 Max / Ultra) or PC with RTX 4090 (24 GB VRAM) | Qwen3.6-27B | Q4 | Ollama / llama.cpp | 50–60 tok/s (4090 GPU); 40–50 tok/s (M4 Max unified memory) | Full Private-tier model performance on a single workstation |
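
Every configuration in this table can be driven through Ollama's local HTTP API (default port 11434), so application code does not change when a workstation is upgraded. A minimal sketch; the model tag is a placeholder and assumes the model has already been pulled on the machine.

```python
# Minimal local-inference sketch against Ollama's HTTP API on localhost:11434.
# The model tag is a placeholder; use whichever model from the table above
# is installed on the target machine.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi4-mini",   # placeholder tag, e.g. the 16 GB laptop config
        "prompt": "Summarize this engagement letter in three bullet points.",
        "stream": False,        # return a single JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```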
Section 3

Quantization levels explained

Quantization trades a small amount of output quality for significantly lower memory requirements. Intelligent IT selects the appropriate level based on your hardware and compliance workload.

| Level | Memory footprint | Quality trade-off | When to use |
|-------|------------------|-------------------|-------------|
| Q4_K_M | ~0.6 GB per billion parameters | Balanced — minimal perceptible degradation on business tasks | Default for the Private tier and most local-machine configs. Best memory efficiency. |
| Q5_K_M | ~0.75 GB per billion parameters | Quality-leaning — closer to full-precision output on complex reasoning | Sovereign tier, RTX 4090 (24 GB), 48 GB+ unified-memory machines, legal/compliance drafting. |
| Q8_0 | ~1.1 GB per billion parameters | Maximum quality — near-identical to full float16 | High-VRAM servers only (H100 / A100 clusters). Not practical on consumer hardware. |
Memory rule of thumb: required VRAM or RAM = model parameter count (in billions) × 0.6 for Q4 / 0.75 for Q5 / 1.1 for Q8. Example: Qwen3.6-27B at Q4 = 27 × 0.6 ≈ 16 GB. Add 2–4 GB for the KV cache at typical context lengths.
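
For a quick feasibility check before a sizing call, the rule of thumb can be applied directly. A minimal sketch using the multipliers above; the results are estimates only, and actual usage varies with context length and batch size.

```python
# Quick sizing estimate based on the rule of thumb above:
# required GB ≈ params (B) × quant multiplier + KV-cache headroom.
QUANT_GB_PER_B = {"q4_k_m": 0.6, "q5_k_m": 0.75, "q8_0": 1.1}

def estimate_memory_gb(params_b: float, quant: str, kv_cache_gb: float = 4.0) -> float:
    """Estimated VRAM/RAM needed to serve a model at the given quantization."""
    return params_b * QUANT_GB_PER_B[quant.lower()] + kv_cache_gb

# Example from the text: Qwen3.6-27B at Q4_K_M is ~16 GB of weights plus KV cache.
print(round(estimate_memory_gb(27, "Q4_K_M"), 1))  # ~20.2 GB -> fits an L4 24 GB
print(round(estimate_memory_gb(27, "Q5_K_M"), 1))  # ~24.2 GB -> needs larger-memory hardware
```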
Section 4

Why Qwen 3.6 is the Private-tier default

Model selection is reviewed quarterly. As of May 2026, Qwen 3.6 holds the best combination of license, benchmark score, context length, and hardware fit for regulated SMB deployments.

  • Apache 2.0 license. Fully permissive for commercial use. Safe for MSP resale, sublicensing, and client deployment without royalty or attribution requirements. No usage-cap clauses.
  • Top open-weight benchmarks. SWE-Bench 77.2 (code), IFEval 92.6 (instruction following), MATH 87.1 (reasoning). Matches or exceeds GPT-4o-mini across legal, financial, and healthcare task categories in Intelligent IT's internal eval suite.
  • 256 K context window. Handles full contract reviews, large FINRA disclosure packets, and multi-document RAG queries without chunking artifacts or context truncation.
  • Single-GPU fit at Q4. The 27B model quantized to Q4_K_M (~16 GB) runs entirely on one NVIDIA L4 24 GB GPU, keeping Private-tier infrastructure to a single dedicated pod and preserving tenant isolation without multi-GPU coordination overhead.
Section 5

License notes for buyers

Before deploying any open-weight model in a regulated environment, verify the artifact license on Hugging Face against the upstream release. License terms for open models can differ between the original release and derivative artifacts.

| Model | License | MSP-resale status | Notes |
|-------|---------|-------------------|-------|
| Qwen 3.6 | Apache 2.0 | Clean — permissive | No usage caps, no attribution requirement in output. Safe for all AiTLLM tiers. |
| DeepSeek V3 / R1 | MIT (relicensed Mar 2025) | Clean — permissive | MIT license applies to weights and derivative works. Verify the Hugging Face artifact matches the upstream MIT release before use. |
| Kimi K2.6 | MIT | Clean — permissive | MoE architecture (1T total / 32B active parameters). Verify the HF artifact license on each pull. |
| GLM-5.1 | Apache 2.0 | Clean — permissive | Strong bilingual (ZH/EN) performance. Confirm the HF artifact matches the upstream release. |
| Llama 4 | Llama 4 Community License | Monitor as we scale | Permissive for commercial use up to 700 M monthly active users. Still ships in AiTLLM routing; revisit if aggregate tenant volume approaches that threshold. |
| Phi-4-mini | MSRLA (initial release) | Verify HF artifact | Initial weights shipped under the Microsoft Research License Agreement; later Hugging Face artifacts may carry a different (MIT) license. Verify which license the artifact you pull carries and confirm its commercial-use terms for your deployment. |
| Mistral Small 4 24B | Apache 2.0 | Clean — permissive | Released Apr 2025. Good balance of size and instruction-following for local-machine configs. |
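
The "verify the HF artifact" step can be scripted as part of a deployment pipeline by reading the license tag on the exact repository being pulled. A minimal sketch using huggingface_hub; the repository id and allow-list below are placeholders, not an Intelligent IT policy.

```python
# Minimal sketch: read the license tag carried by the exact Hugging Face repo
# you intend to pull, and fail loudly if it is not on the approved list.
# The repo id and allow-list are placeholders; substitute your own.
from huggingface_hub import HfApi

ALLOWED = {"apache-2.0", "mit"}

def artifact_license(repo_id: str) -> str:
    info = HfApi().model_info(repo_id)
    # The license is published as a "license:<id>" tag on the model repo.
    licenses = [t.split(":", 1)[1] for t in info.tags if t.startswith("license:")]
    return licenses[0] if licenses else "unknown"

license_id = artifact_license("Qwen/Qwen3.6-27B")  # placeholder repo id
if license_id not in ALLOWED:
    raise SystemExit(f"License '{license_id}' is not approved; review before deployment.")
print(f"OK: {license_id}")
```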

Ready to size your deployment?

We walk through your infrastructure, compliance requirements, and user count to confirm which tier and hardware configuration fits. 30 minutes. No slide deck.

Book a discovery call

Sizing data as of 2026-05-07. Benchmark figures from public model card releases and Intelligent IT internal evaluation suite. Hardware throughput figures measured at 2 K token output length; actual performance varies by context length, batch size, and system load. License terms subject to change by model vendors — verify HF artifact before deployment. © Intelligent Group (DBA Intelligent IT) · intelligentit.io