What You'll Build
A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on an RTX 5060, capable of text-prompted image segmentation and video object tracking. SAM 3 unifies the SAM 2 tracker with a DETR-style text-conditioned detector — and at ~4 GB peak inference VRAM it leaves comfortable headroom on the 8 GB card.
Hardware data: RTX 5060 (8GB VRAM) · ~4 GB peak inference VRAM observed by third-party testing · See benchmark data
Note: As of this writing the backend has no measured benchmarks for this pair. The VRAM figure below comes from independent third-party testing on a non-5060 card; expect comparable behaviour, but treat it as a working estimate until empirical data lands at /check/.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4GB VRAM CUDA GPU | RTX 5060 (8GB) — pair not yet benchmarked, see /check/ |
| RAM | 16GB | — |
| Storage | ~3.4 GB for SAM 3 weights (Roboflow) | ~5 GB recommended with cache |
| Software | Python 3.12, PyTorch 2.10, CUDA 12.8 (official pin — see Installation; mandatory on this Blackwell card) (official repo) | — |
The RTX 5060 is a Blackwell-architecture card (GB206-250 die, sm_120 / compute capability 12.0, ~3840 CUDA cores, 8 GB GDDR7 on a 128-bit bus ≈ 448 GB/s, ~145 W, PCIe Gen5 x8) (Wikipedia: GeForce RTX 50 series, ASUS Dual RTX 5060 O8G). Because sm_120 is newer than the CUDA toolkits bundled with older PyTorch wheels, the CUDA 12.8 (cu128) build is mandatory here — see Installation.
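The ≈ 448 GB/s figure follows from the bus width and the per-pin data rate. As a quick sanity check, this sketch back-derives the per-pin rate from the two numbers quoted above; the function name is illustrative and not part of any library.

```python
def per_pin_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Per-pin data rate (Gbit/s) implied by total bandwidth and bus width."""
    # bandwidth [GB/s] * 8 [bits per byte] / number of data pins
    return bandwidth_gb_s * 8 / bus_width_bits


# 448 GB/s over a 128-bit bus
print(per_pin_gbps(448, 128))  # 28.0
```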
Installation
Install steps below come from the canonical Meta sources only: the official facebookresearch/sam3 README and the HuggingFace model card. No independent third-party walkthrough is required because the install path is the upstream-supported one; report deviations via the submission form.
SAM 3 weights are gated. Request access on the model card (sign in to HuggingFace, accept the SAM License terms), then authenticate locally with huggingface-cli login before the first download. The weights are distributed under the SAM License, not Apache-2.0; review the terms before redistribution or commercial use.
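To confirm that access was granted before committing to the full download, one option is to fetch a single small file from the gated repo and look at the error type. The sketch below assumes the repo contains a config.json (typical for Transformers checkpoints, but not confirmed here) and that huggingface_hub raises GatedRepoError for missing access; the helper names and hint texts are illustrative.

```python
# Hints keyed by exception class name, so this module imports without huggingface_hub.
HINTS = {
    "GatedRepoError": "Request access on the model card, then run huggingface-cli login.",
    "RepositoryNotFoundError": "Check the repo id and that you are logged in.",
    "LocalEntryNotFoundError": "No network and nothing cached; check connectivity.",
}


def access_hint(exc: Exception) -> str:
    """Map a download exception to a human-readable hint."""
    return HINTS.get(type(exc).__name__, f"Unexpected error: {exc}")


def check_access(repo_id: str = "facebook/sam3", filename: str = "config.json") -> str:
    """Try to fetch one small file; return 'ok' or a hint (filename is an assumption)."""
    from huggingface_hub import hf_hub_download  # imported lazily

    try:
        hf_hub_download(repo_id, filename)
    except Exception as exc:
        return access_hint(exc)
    return "ok"
```

Calling check_access() after huggingface-cli login should return "ok" once the access request has been approved.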
1. Set up a Python environment
Per the official facebookresearch/sam3 README:
conda create -n sam3 python=3.12
conda activate sam3
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128
The cu128 index is mandatory on a Blackwell-class card like the RTX 5060: the official SAM 3 repo pins torch==2.10.0 built against CUDA 12.8, and sm_120 needs that toolkit. Wheels from the older cu126/cu121 indexes lack sm_120 kernels and fail at kernel launch.
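After installing, it can help to confirm that the wheel actually carries kernels for the card. This is a minimal sketch: it compares the device's compute capability against the architectures the PyTorch build was compiled for. The helper names are illustrative, and the check is a simplification because it ignores PTX forward compatibility (compute_XX entries).

```python
def capability_to_arch(major: int, minor: int) -> str:
    """(12, 0) -> 'sm_120', matching the naming used by torch.cuda.get_arch_list()."""
    return f"sm_{major}{minor}"


def build_supports(capability: tuple, arch_list: list) -> bool:
    """True if the build lists native kernels for this compute capability."""
    return capability_to_arch(*capability) in arch_list


def check_current_gpu() -> bool:
    """Run the check against the first visible CUDA device."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not torch.cuda.is_available():
        return False
    capability = torch.cuda.get_device_capability(0)
    return build_supports(capability, torch.cuda.get_arch_list())
```

On an RTX 5060 with the cu128 wheel, check_current_gpu() would be expected to return True; a False result suggests the wrong wheel index was used.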
2a. Option A — install via HuggingFace Transformers (recommended for quick use)
This is the lowest-friction path; the model card documents the Sam3Model / Sam3Processor classes that ship with transformers:
pip install transformers accelerate pillow
2b. Option B — install from the official repository
If you need the reference implementation, training, or finetuning utilities:
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
3. Download model weights
With the Transformers path, weights download automatically the first time you call from_pretrained("facebook/sam3") (~3.4 GB to your HuggingFace cache). Because the repo is gated, make sure you have requested access and run huggingface-cli login first (see the note above).
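Since the first from_pretrained call pulls several gigabytes, a quick free-space check on the cache drive avoids a half-finished download. The sketch below uses only the standard library; the 5 GB default mirrors the recommendation in the Requirements table, and the helper names are illustrative. Point it at the drive that holds your HuggingFace cache (by default under ~/.cache/huggingface unless HF_HOME is set).

```python
import shutil
from pathlib import Path


def free_gb(path) -> float:
    """Free space in decimal gigabytes on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path, required_gb: float = 5.0) -> bool:
    """True if at least `required_gb` is free where the cache will live."""
    return free_gb(path) >= required_gb


# The home directory usually sits on the same filesystem as the default cache.
print(has_room(Path.home()))
```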
Running
Image segmentation with a text prompt
from transformers import Sam3Processor, Sam3Model
from PIL import Image
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
processor = Sam3Processor.from_pretrained("facebook/sam3")
image = Image.open("your_image.jpg")
inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
    outputs, threshold=0.5, mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist()
)[0]
Source: facebook/sam3 model card.
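To inspect what came back, a small summary of the detections is often enough. This sketch assumes that results exposes "scores" and "boxes" entries with boxes as [x0, y0, x1, y1] pixel coordinates; that layout is an assumption to verify against the model card. The helper works on plain Python lists so it does not depend on tensor types.

```python
def summarize(scores, boxes, min_score=0.5):
    """Keep detections at or above min_score, highest score first.

    scores: list of floats; boxes: list of [x0, y0, x1, y1] (assumed layout).
    """
    kept = [(s, b) for s, b in zip(scores, boxes) if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"score": round(s, 3), "area": (b[2] - b[0]) * (b[3] - b[1])}
        for s, b in kept
    ]
```

With the snippet above this might be called as summarize(results["scores"].tolist(), results["boxes"].tolist()), again assuming those keys exist.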
Video segmentation
from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch
model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda")
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
video_frames, _ = load_video("your_video.mp4")
inference_session = processor.init_video_session(
    video=video_frames,
    inference_device="cuda",
    dtype=torch.bfloat16,
)
inference_session = processor.add_text_prompt(
    inference_session=inference_session,
    text="person",
)
dtype=torch.bfloat16 is the path of least resistance for fitting comfortably in 8 GB. Source: facebook/sam3 model card.
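The benefit of bfloat16 is easy to estimate: each element takes 2 bytes instead of 4. The sketch below computes a rough lower bound for raw frame storage only, assuming the session keeps frames as 3-channel tensors in the chosen dtype (an assumption about the internals); model weights and per-frame tracking state come on top.

```python
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2}


def frames_gb(num_frames, height, width, dtype="bfloat16", channels=3):
    """Approximate decimal GB needed to hold the frames as dense tensors."""
    elements = num_frames * height * width * channels
    return elements * BYTES_PER_ELEMENT[dtype] / 1e9


# 100 frames at 1000x1000: bfloat16 halves the footprint of float32.
print(frames_gb(100, 1000, 1000, "float32"), frames_gb(100, 1000, 1000, "bfloat16"))
```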
Results
- VRAM usage: ~4 GB peak during single-image inference, measured by third-party hands-on testing on an NVIDIA RTX 6000 (sonusahani.com, Nov 2025). Roboflow's overview corroborates that SAM 3 "fits comfortably on 16 GB GPUs" and "uses less VRAM per inference than SAM 2." A 4 GB peak leaves substantial headroom on the 8 GB RTX 5060.
- Model size: 848M parameters (official facebookresearch/sam3 README), 3.4 GB on disk (Roboflow).
- Quality notes: SAM 3 adds open-vocabulary text prompts on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based, conditioned on text, geometry, and image exemplars.
Once empirical RTX 5060 numbers are seeded, they will appear at /check/sam-3/rtx-5060.
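Until those numbers exist, you can take your own measurement. This is a minimal sketch built on PyTorch's peak-memory counters; run_inference is a hypothetical zero-argument callable that wraps your model call. The counter tracks tensors allocated by PyTorch, so it will read lower than nvidia-smi, which also includes the CUDA context and cached blocks.

```python
def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 1024**3


def headroom_gib(total_gib: float, peak_gib: float) -> float:
    """VRAM left over after the observed peak."""
    return total_gib - peak_gib


def measure_peak_vram(run_inference) -> float:
    """Return peak allocated VRAM (GiB) while run_inference() executes."""
    import torch  # imported lazily so the helpers above work without PyTorch

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    run_inference()
    torch.cuda.synchronize()  # wait for queued kernels before reading the counter
    return to_gib(torch.cuda.max_memory_allocated())
```

For example, measure_peak_vram(lambda: model(**inputs)) inside a torch.no_grad() block gives a figure to compare against the ~4 GB third-party estimate.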
Troubleshooting
"CUDA out of memory" on first video session
Video sessions hold per-frame state, so peak usage during a long video can exceed the ~4 GB single-image envelope. Lower the resolution, reduce the frame count, or open the session with dtype=torch.bfloat16 (as in the video example above). Before starting a fresh session, free the old one explicitly with del inference_session; torch.cuda.empty_cache(). The 8 GB RTX 5060 is comfortable for the image path but tighter on long-form video, so start with short clips.
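One way to reduce the frame count is to keep an evenly spaced subset of frames before opening the session. The helper below is an illustrative sketch, not part of transformers; it works on any indexable sequence and returns a list, and whether init_video_session accepts a plain list of frames is an assumption to check. Dropping frames also increases the motion between consecutive frames, which may affect tracking quality.

```python
def subsample_frames(frames, max_frames: int):
    """Return at most max_frames evenly spaced frames, preserving order."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    total = len(frames)
    if total <= max_frames:
        return list(frames)
    step = total / max_frames  # fractional stride across the clip
    return [frames[int(i * step)] for i in range(max_frames)]
```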
Sam3Model or Sam3Processor not found in transformers
SAM 3 classes were added to transformers after the model's November 2025 release. Upgrade with pip install -U transformers. If a pinned environment can't upgrade, use Option B (install from facebookresearch/sam3).
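A quick way to tell whether the installed transformers is new enough is to probe for the class by name instead of waiting for an ImportError deep in a script. The helper below is a generic sketch using only importlib; the function name is illustrative.

```python
import importlib


def module_has(module_name: str, attr: str) -> bool:
    """True if the module imports and exposes the named attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    try:
        getattr(module, attr)
    except AttributeError:
        return False
    # Other exceptions propagate on purpose: they point to a broken install.
    return True


print(module_has("transformers", "Sam3Model"))
```

If this prints False, upgrade transformers or fall back to Option B.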
Wheel/CUDA mismatch with PyTorch 2.10
The official repo pins torch==2.10.0 with CUDA 12.8 wheels. On a Blackwell-class card like the RTX 5060, the cu128 index (--index-url https://download.pytorch.org/whl/cu128) is mandatory. The cu126/cu121 builds lack the sm_120 kernels and will fail at kernel launch.
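The installed wheel's build tag is visible in its version string, for example 2.10.0+cu128. As a small sketch, the helpers below split that string and compare it with the pin described above; the function names are illustrative.

```python
def parse_torch_version(version: str):
    """Split '2.10.0+cu128' into ('2.10.0', 'cu128'); the tag is '' if absent."""
    base, _, local = str(version).partition("+")
    return base, local


def is_expected_build(version: str, expected_base="2.10.0", expected_tag="cu128") -> bool:
    """True if the version string matches the pinned release and CUDA tag."""
    base, local = parse_torch_version(version)
    return base == expected_base and local == expected_tag
```

In a live environment, is_expected_build(torch.__version__) returning False usually means pip resolved the wheel from the default index instead of cu128.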
For other issues, file a report via the submission form.