self-hosted/ai
§01·model · /models

gpt-oss 120B

llm · active · Apache-2.0

gpt-oss-120b is OpenAI's larger open-weight reasoning model (released 2025-08-05), the datacenter-scale sibling of gpt-oss-20b. It is a Mixture-of-Experts transformer with 117B total parameters and 5.1B active per token (36 layers, 128 experts, top-4 routing). The expert weights use native MXFP4 quantization (~4.25 bits/param), while attention, embeddings and the router are kept at higher precision. Context length is 128K tokens, using GQA and alternating full/sliding-window attention with learned attention sinks.

The model is text-only and reasons via chain of thought. It supports native tool use (function calling, browsing, Python) through the harmony response format, and its reasoning effort (low, medium or high) is set in the system prompt. It is licensed under Apache-2.0, which permits commercial use, and is supported by llama.cpp, Ollama (gpt-oss:120b, 65 GB), vLLM, SGLang and transformers.

The deployment footprint is ~63.7 GB of MXFP4 weights, designed to fit a single datacenter GPU with at least 80 GB of memory (such as NVIDIA H100 or AMD MI300X). This exceeds every consumer single-GPU tier, and even 64 GB of Apple unified memory once KV cache and OS overhead are counted. The only consumer option is CPU-MoE offload: a 24-32 GB GPU paired with ~64 GB of system RAM, which the 5.1B-active design makes runnable but RAM-bound. For a consumer-fit member of this family, see gpt-oss-20b (~13.8 GB, fits a 16 GB card).
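The footprint reasoning above can be checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names, the `low_bit_fraction` parameter and the KV-cache and overhead budgets are assumptions made for this example, not measured values. Only the 117B parameter count, the ~4.25 bits/param figure and the ~63.7 GB weight size come from the description above.

```python
def weight_footprint_gb(total_params, low_bit_fraction=1.0,
                        low_bits=4.25, high_bits=16.0):
    """Estimate weight size in GB (1 GB = 1e9 bytes).

    low_bit_fraction is the share of parameters stored at low_bits
    (the MXFP4 expert weights); the remainder is assumed to be stored
    at high_bits. Both the split and the 16-bit figure are assumptions.
    """
    low_bytes = total_params * low_bit_fraction * low_bits / 8
    high_bytes = total_params * (1.0 - low_bit_fraction) * high_bits / 8
    return (low_bytes + high_bytes) / 1e9


def fits(memory_gb, weights_gb, kv_cache_gb=0.0, overhead_gb=0.0):
    """Return True if weights plus runtime budgets fit in memory_gb."""
    return weights_gb + kv_cache_gb + overhead_gb <= memory_gb


if __name__ == "__main__":
    # Lower bound: every parameter at ~4.25 bits. The real checkpoint is
    # larger because attention, embeddings and router use higher precision.
    print(f"lower bound: {weight_footprint_gb(117e9):.1f} GB")

    weights = 63.7  # GB, figure quoted in the description above
    # Hypothetical runtime budgets, chosen only to illustrate the check.
    print("80 GB GPU:", fits(80, weights, kv_cache_gb=4, overhead_gb=2))
    print("64 GB unified memory:", fits(64, weights, kv_cache_gb=4, overhead_gb=4))
```

With every parameter counted at ~4.25 bits the estimate comes out just above 62 GB, slightly under the quoted ~63.7 GB, which is consistent with a small share of the weights being held at higher precision. The 64 GB case shows why the weights alone nearly fit but any realistic KV-cache or OS budget pushes the total over the limit.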

§02·GPUs that run this model
0 total
No community benchmarks or recipes for this model yet.