What You'll Build
A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on an RTX 4080 SUPER 16GB, capable of concept-prompted (short text phrase, image exemplar, or both) image segmentation and video object tracking. SAM 3 unifies a DETR-style text-conditioned detector with a SAM 2-style memory tracker, both sharing a single Perception Encoder backbone — and at ~4 GB peak inference VRAM the 16 GB card is wildly over-provisioned, leaving roughly 12 GB free for other workloads (concurrent models, long-video tracking sessions, larger batches).
Hardware data: RTX 4080 SUPER (16GB VRAM) · ~4 GB peak inference VRAM observed by third-party testing on a different card (see Results) · See benchmark data
ℹ️ Gated weights. SAM 3 is released under the custom SAM 3 License, and the model card states the weights are gated — you must agree to share your contact information before download. Log in to HuggingFace and accept the license on the model page first, then authenticate locally (
huggingface-cli login) sofrom_pretrainedcan fetch the checkpoint.
Note: As of this writing the backend has no measured benchmarks for this exact pair. The VRAM figure below comes from independent third-party testing on a non-RTX-4080-SUPER card; expect comparable or better behaviour on the 4080 SUPER, but treat it as a working estimate until empirical data lands at
/check/.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4GB VRAM CUDA GPU | RTX 4080 SUPER (16GB) — pair not yet benchmarked, see /check/ |
| RAM | 16GB | — |
| Storage | ~3.4 GB for SAM 3 weights (Roboflow) | ~5 GB recommended with cache |
| Software | Python 3.12, recent PyTorch + CUDA, a transformers build that includes the SAM 3 classes (official repo) | — |
Installation
Install steps below come from the canonical Meta sources only — the official
facebookresearch/sam3README and the HuggingFace model card. No independent third-party walkthrough is required because the install path is the upstream-supported one; report deviations via submission form.
1. Set up a Python environment
Per the official facebookresearch/sam3 README:
conda create -n sam3 python=3.12
conda activate sam3
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
The RTX 4080 SUPER is Ada Lovelace (AD103, sm_89), which has full kernel coverage in the default PyTorch CUDA wheels — unlike Blackwell (sm_120) GPUs, no special wheel selection is required. The cu128 index matches the version Meta tests against upstream; the older cu126 / cu121 stable wheels also work for the Transformers inference path on Ada.
2. Accept the license and authenticate
The weights are gated. Open huggingface.co/facebook/sam3 while logged in, accept the SAM 3 license, then authenticate the CLI so downloads succeed:
pip install -U "huggingface_hub[cli]"
huggingface-cli login
3a. Option A — install via HuggingFace Transformers (recommended for quick use)
This is the lowest-friction path; the model card ships Sam3Model / Sam3Processor classes:
pip install -U transformers accelerate pillow requests
If you hit
ImportErroronSam3Model, yourtransformersrelease predates the SAM 3 classes. Install from source instead — see Troubleshooting.
3b. Option B — install from the official repository
If you need the reference implementation, training, or finetuning utilities:
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
4. Download model weights
With the Transformers path, weights download automatically the first time you call from_pretrained("facebook/sam3") (~3.4 GB to your HuggingFace cache, per Roboflow), provided you accepted the license in step 2.
Running
Image segmentation with a text prompt
from transformers import Sam3Model, Sam3Processor
from PIL import Image
import torch
import requests
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
processor = Sam3Processor.from_pretrained("facebook/sam3")
img_url = "https://example.com/image.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs, threshold=0.5, mask_threshold=0.5,
target_sizes=inputs.get("original_sizes").tolist()
)[0]
print(results["masks"].shape)
Source: facebook/sam3 model card. Swap text="ear" for any concept phrase; the detector is open-vocabulary.
Video segmentation
from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch
model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda")
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
video_url = "https://example.com/video.mp4"
video_frames, _ = load_video(video_url)
inference_session = processor.init_video_session(
video=video_frames,
inference_device="cuda",
dtype=torch.bfloat16,
)
for frame_idx in range(len(video_frames)):
sam3_output = processor.add_text_prompt(
inference_session=inference_session,
text="person",
)
dtype=torch.bfloat16 is the path of least resistance; with 16 GB the 4080 SUPER also has the headroom to drop to torch.float32 if you want full precision. Source: facebook/sam3 model card.
Results
- VRAM usage: ~4 GB peak during single-image inference, measured by third-party hands-on testing on an NVIDIA RTX 6000 (sonusahani.com, Nov 2025) — not on the RTX 4080 SUPER, so treat it as a cross-card estimate. Roboflow's overview corroborates that SAM 3 fits comfortably on 16 GB GPUs and uses less VRAM per inference than SAM 2. On the 16 GB RTX 4080 SUPER that leaves roughly 12 GB free — comfortably enough for concurrent models, long-form video sessions, or larger batch sizes than the 8 GB tier can handle.
- Model size: 848M parameters (facebook/sam3 model card), ~3.4 GB on disk (Roboflow).
- Speed: No RTX 4080 SUPER-named (or any consumer-GPU-named) throughput measurement was found at authoring time, and the backend has no benchmark for this pair. Rather than restate a figure from a different card, speed is omitted here — once a community run lands it will appear at /check/sam-3/rtx-4080-super. If you measure it, please contribute.
- Quality notes: SAM 3 adds open-vocabulary concept prompts (text phrases and image exemplars) on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based; the tracker is a SAM 2-style memory transformer reusing the shared Perception Encoder backbone.
For the full benchmark data, see /check/sam-3/rtx-4080-super.
Troubleshooting
Sam3Model or Sam3Processor not found in transformers
SAM 3 classes were added to transformers after the model's November 2025 release. First try upgrading: pip install -U transformers. If your pinned environment can't resolve a release that includes them, install from source:
pip install git+https://github.com/huggingface/transformers
As a guaranteed-working alternative, use Option B above (install from facebookresearch/sam3), whose reference implementation does not depend on the transformers release cadence.
401 / gated-repo error on download
from_pretrained("facebook/sam3") returns an access error if you haven't accepted the license. Visit huggingface.co/facebook/sam3 while logged in, accept the SAM 3 license, run huggingface-cli login with a token that has read access, then retry.
Long video sessions and concurrent models
Video sessions hold per-frame state, so peak usage during a long video can exceed the ~4 GB single-image envelope. The 16 GB card is forgiving here — you have headroom for longer clips than the 8 GB tier — but the same hygiene applies: lower the resolution or drop frame count if you push into multi-minute sessions, and free old sessions explicitly with del inference_session; torch.cuda.empty_cache() before opening a fresh one. If you intend to run SAM 3 alongside another model on the same card, the ~12 GB free after SAM 3 loads is enough room for most small-to-mid models — verify by watching nvidia-smi after both are warm.
For other issues, file a report via the submission form.