SAM 3 on RTX 5080: Promptable Image and Video Segmentation

What You'll Build

A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on an RTX 5080, capable of text-prompted image segmentation and video object tracking. SAM 3 pairs a DETR-style text-conditioned detector with the SAM 2 tracker, both sharing a single vision encoder — so one model handles open-vocabulary detection, image segmentation, and video tracking from the same prompt.

Hardware data: RTX 5080 (16GB VRAM) · ~4 GB documented inference VRAM floor (measured on a larger card — see Results) · See benchmark data

Note: As of this writing the backend has no measured benchmarks for this pair — /check/sam-3/rtx-5080 returns verdict: unknown. The VRAM figure below comes from independent third-party testing on a different, larger-VRAM card; treat it as a working floor until empirical RTX 5080 data lands. If you measure it, contribute via the submission form.

ℹ️ Not an image generator. SAM 3 is a segmentation model — it finds and masks objects you name with a text or visual prompt; it does not synthesize new images or video. It lives in our specialized vertical because promptable segmentation doesn't map cleanly onto image/video generation.

Requirements

Component	Minimum	Tested
GPU	4GB VRAM CUDA GPU	RTX 5080 (16GB) — pair not yet benchmarked, see /check/
RAM	16GB	—
Storage	~3.4 GB for the SAM 3 checkpoint (Roboflow; `model.safetensors` is 3.44 GB per the HF Files tree)	~5 GB recommended with cache
Software	Python 3.12+, PyTorch 2.7+, CUDA 12.6+ (minimums per the official repo; the install command pins `torch==2.10.0` on cu128, see Installation)	—

Installation

Install steps below come from the canonical Meta sources only — the official facebookresearch/sam3 README and the gated Hugging Face model card. The install path is the upstream-supported one, so no third-party walkthrough is required; report deviations via the submission form.

1. Request access to the gated checkpoints

The facebook/sam3 weights are gated — you must request and be granted access before they will download. Visit huggingface.co/facebook/sam3, accept the access terms, then authenticate locally once approved:

hf auth login

Per the official repo, authentication is required to download the checkpoints after your access request is accepted.

2. Set up a Python environment

Per the official facebookresearch/sam3 README:

conda create -n sam3 python=3.12
conda activate sam3
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128

The cu128 index matters on a Blackwell-class card like the RTX 5080 (sm_120): the PyTorch wheel you install must ship sm_120 kernels, which wheels built against older CUDA versions may lack, and the cu128 wheel is the one the upstream repo pins.
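One way to confirm that the installed wheel actually targets the card is to compare the GPU's compute capability with the architectures the wheel was compiled for. The sketch below relies on torch.cuda.get_device_capability() and torch.cuda.get_arch_list(); the helper name wheel_supports_gpu is hypothetical, and the exact-match check is deliberately conservative (it ignores PTX forward compatibility).

```python
def wheel_supports_gpu(capability, arch_list):
    """Return True if arch_list contains a native kernel target for capability.

    capability: (major, minor) tuple, e.g. (12, 0) for sm_120.
    arch_list:  strings such as "sm_90" or "sm_120".
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with CUDA is not available in this environment.")
    else:
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
        print("device capability:", capability)
        print("compiled architectures:", arch_list)
        print("native kernels for this GPU:",
              wheel_supports_gpu(capability, arch_list))
```

If the last line prints False on an RTX 5080, the environment most likely picked up a wheel from a different index than cu128.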

3. Install SAM 3 from the official repository

git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

To run the example notebooks as well, install the extra the repo documents: pip install -e ".[notebooks]".

Running

The canonical inference API ships with the official repo. The snippets below are taken verbatim from the facebookresearch/sam3 README's Basic Usage section.

Image segmentation with a text prompt

import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("<YOUR_IMAGE_PATH.jpg>")
inference_state = processor.set_image(image)

# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="<YOUR_TEXT_PROMPT>")

# Get the masks, bounding boxes, and scores
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
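The returned scores can be used to drop low-confidence detections before further processing. This is a minimal sketch, assuming masks, boxes, and scores are index-aligned sequences (lists or tensors) with one entry per detected instance. That layout, the helper name keep_confident, and the 0.5 threshold are assumptions for illustration, not part of the README.

```python
def keep_confident(masks, boxes, scores, threshold=0.5):
    """Keep only detections whose score is at least `threshold`.

    Assumes the three inputs are index-aligned, one entry per instance.
    float() is used so the same code handles Python floats and
    single-element tensors.
    """
    kept = [i for i, score in enumerate(scores) if float(score) >= threshold]
    return (
        [masks[i] for i in kept],
        [boxes[i] for i in kept],
        [float(scores[i]) for i in kept],
    )


# Stand-in data; in practice these would come from the `output` dict.
demo_masks = ["mask0", "mask1", "mask2"]
demo_boxes = [[0, 0, 10, 10], [5, 5, 20, 20], [30, 30, 40, 40]]
demo_scores = [0.9, 0.2, 0.6]

masks_kept, boxes_kept, scores_kept = keep_confident(
    demo_masks, demo_boxes, demo_scores
)
print(len(masks_kept), "detections kept")
```

The right threshold depends on the prompt and the image, so treat 0.5 as a starting point rather than a recommendation.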

Video segmentation with a text prompt

from sam3.model_builder import build_sam3_video_predictor

video_predictor = build_sam3_video_predictor()
video_path = "<YOUR_VIDEO_PATH>"  # a JPEG folder or an MP4 video file

# Start a session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=response["session_id"],
        frame_index=0,  # Arbitrary frame index
        text="<YOUR_TEXT_PROMPT>",
    )
)
output = response["outputs"]

The first call downloads the gated checkpoint (~3.4 GB) into your Hugging Face cache. A Hugging Face Transformers integration (Sam3Model / Sam3Processor under transformers) is also documented on the model card for users who prefer that loader; the repo-native API above is shown here because it is verifiable from the public GitHub README without gated access.

Spending the headroom on a 16 GB card

At a ~4 GB inference floor, SAM 3 leaves roughly 12 GB unused on the RTX 5080's 16 GB. The natural use of that headroom is colocation: run SAM 3's segmentation alongside a second model on the same card — for example, batch image inputs through SAM 3 while a small detection or captioning model runs concurrently, or keep a 7B-class LLM at Q4 resident for prompt generation. The official repo also ships an SA-Co agent example (sam3_agent.ipynb) that drives SAM 3 from an MLLM — a workflow the spare VRAM makes practical on a single 5080.
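The headroom estimate above is simple arithmetic, sketched here as a rough budget. It counts weights only: a 7B model at an idealized 4 bits per weight. Real Q4 formats store extra scale data, and KV cache, activations, and the CUDA context are not included, so the remaining figure is an optimistic upper bound rather than a measurement.

```python
GB = 1e9  # decimal gigabytes, matching how the figures above are quoted


def weights_gb(n_params, bits_per_weight):
    """Approximate size of model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / GB


card_gb = 16.0        # RTX 5080 VRAM
sam3_floor_gb = 4.0   # documented SAM 3 inference floor (not measured on a 5080)
headroom_gb = card_gb - sam3_floor_gb

llm_weights_gb = weights_gb(7e9, 4)          # 7B-class LLM at 4 bits per weight
remaining_gb = headroom_gb - llm_weights_gb  # left for KV cache, activations, etc.

print(f"headroom after SAM 3: {headroom_gb:.1f} GB")
print(f"7B @ 4-bit weights:   {llm_weights_gb:.1f} GB")
print(f"remaining (weights-only estimate): {remaining_gb:.1f} GB")
```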

Results

VRAM usage: ~4 GB peak during single-image inference. This figure was measured by hands-on third-party testing on an NVIDIA RTX 6000 (48 GB), not on the RTX 5080 — "VRAM usage hovered just under 4 GB during inference" (sonusahani.com, Nov 20 2025). Roboflow's overview independently corroborates the envelope, stating SAM 3 "uses less VRAM per inference than SAM 2, and fits comfortably on 16 GB GPUs." Treat 4 GB as a documented floor that the RTX 5080 comfortably clears, not as a 5080-measured peak.
Speed: Not quoted. No source benchmarks SAM 3 on an RTX 5080 (or a comparable consumer card) with a named throughput number, so quoting a figure would be fabrication. Once a community benchmark lands it will appear at /check/sam-3/rtx-5080 — please contribute yours.
Model size: 848M parameters (official facebookresearch/sam3 README); the model.safetensors checkpoint is 3.44 GB on disk (HF Files tree), consistent with Roboflow's ≈3.4 GB figure.
Quality notes: SAM 3 adds open-vocabulary text prompts on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based, conditioned on text, geometry, and image exemplars; the tracker inherits the SAM 2 encoder-decoder architecture. As of 03/27/2026 the repo also ships the SAM 3.1 "Object Multiplex" update for faster joint multi-object tracking (official repo).
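The parameter count and checkpoint size quoted under Model size can be cross-checked against each other. The short calculation below suggests the weights are stored at roughly 4 bytes per parameter, which would be consistent with 32-bit storage. That is an inference from the two numbers (assuming the 3.44 GB figure is in decimal gigabytes), not something the sources state.

```python
n_params = 848e6       # parameters, per the official README
checkpoint_gb = 3.44   # model.safetensors size, per the HF Files tree

# Implied storage cost per parameter, assuming decimal GB.
bytes_per_param = checkpoint_gb * 1e9 / n_params

# Expected weight sizes at two common precisions, for comparison.
fp32_gb = n_params * 4 / 1e9
fp16_gb = n_params * 2 / 1e9

print(f"implied bytes per parameter: {bytes_per_param:.2f}")
print(f"848M params at 32-bit: {fp32_gb:.2f} GB")
print(f"848M params at 16-bit: {fp16_gb:.2f} GB")
```

The small gap between the 32-bit estimate and the file size could come from tensors not included in the headline parameter count or from file metadata; the sources do not say which.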

For the full benchmark data, see /check/sam-3/rtx-5080.

Troubleshooting

Checkpoint download fails with a 401 / "access restricted"

The facebook/sam3 repo is gated. You must request access on the model page, wait for approval, and run hf auth login with a token that has read access. Until access is granted, build_sam3_image_model() (and any from_pretrained call) will fail to fetch the weights.
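To tell a missing access grant apart from a wrong repo name or an unauthenticated session before loading the model, the Hub client can be asked directly. A hedged sketch, assuming a recent huggingface_hub release that provides auth_check; the function name check_sam3_access and the returned messages are illustrative only.

```python
def check_sam3_access(repo_id="facebook/sam3"):
    """Return a short status string describing access to a gated repo.

    Imports are local so the function can be defined even where
    huggingface_hub is not installed.
    """
    from huggingface_hub import auth_check
    from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

    try:
        auth_check(repo_id)  # uses the locally stored token, if any
    except GatedRepoError:
        # Must be caught first: it is a subclass of RepositoryNotFoundError.
        return "gated: access not granted for the current token (or not logged in)"
    except RepositoryNotFoundError:
        return "not found: check the repo id and that you are logged in"
    return "ok: the current token can access the repo"


if __name__ == "__main__":
    print(check_sam3_access())
```

A "gated" result after approval usually means the local token belongs to a different account than the one that requested access.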

"CUDA out of memory" on a video session

Video sessions hold per-frame state, so peak memory grows with frame count and resolution and can rise well above the ~4 GB single-image floor. Lower the resolution, reduce the frame count, or start a fresh session after freeing the previous one explicitly: del the objects that reference it, then call torch.cuda.empty_cache(). On a 16 GB RTX 5080 this is rarely a problem for short clips, but memory use on long high-resolution videos can still climb past what the card offers.
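To see how close a given clip gets to the limit, PyTorch's own allocator statistics can be recorded around a run. The sketch below uses torch.cuda.reset_peak_memory_stats(), torch.cuda.max_memory_allocated(), and torch.cuda.empty_cache(); the helper names peak_vram and release_cuda_memory are hypothetical, and the code falls back to doing nothing when PyTorch or CUDA is unavailable.

```python
import gc
from contextlib import contextmanager

try:
    import torch
except ImportError:  # allow the helpers to be imported without PyTorch
    torch = None


def _cuda_ready():
    return torch is not None and torch.cuda.is_available()


@contextmanager
def peak_vram(label):
    """Record peak PyTorch-allocated VRAM (GiB) for the enclosed block."""
    stats = {"label": label, "peak_gib": None}
    if _cuda_ready():
        torch.cuda.reset_peak_memory_stats()
    try:
        yield stats
    finally:
        if _cuda_ready():
            stats["peak_gib"] = torch.cuda.max_memory_allocated() / 2**30


def release_cuda_memory():
    """Collect garbage, then return cached blocks to the driver.

    Call this after `del`-ing every object that references the old session;
    empty_cache() cannot free memory that is still referenced.
    """
    gc.collect()
    if _cuda_ready():
        torch.cuda.empty_cache()


with peak_vram("demo") as stats:
    pass  # run the image or video inference calls here

print(stats)
release_cuda_memory()
```

Note that max_memory_allocated() counts tensors managed by PyTorch's caching allocator, so the number will usually be lower than what nvidia-smi reports for the whole process.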

Wheel / CUDA mismatch with PyTorch 2.10 on Blackwell

The official repo pins torch==2.10.0 with the cu128 wheel. On a Blackwell-class card (RTX 5080, sm_120), install from the cu128 index (--index-url https://download.pytorch.org/whl/cu128) rather than the default cu126/cu121 wheels, which may lack sm_120 kernels and fail at kernel launch. The repo's optional FlashAttention-3 extra (flash-attn-3) is also fetched from the cu128 index for the same reason.

For other issues, file a report via the submission form.