SAM 3 on RTX 4070: Promptable Image and Video Segmentation

What You'll Build

A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on an RTX 4070, capable of text-prompted image segmentation and video object tracking. SAM 3 pairs a DETR-style text-conditioned detector with the SAM 2 tracker, both sharing a single vision encoder — so one model handles open-vocabulary detection, image segmentation, and video tracking from the same prompt.

Hardware data: RTX 4070 (12GB GDDR6X, 192-bit) · ~4 GB documented inference VRAM floor (measured on a larger card — see Results) · See benchmark data

Note: As of this writing the backend has no measured benchmarks for this pair — /check/sam-3/rtx-4070 returns verdict: unknown. The VRAM figure below is a documented floor measured by independent third-party testing on a different, larger-VRAM card (an RTX 6000) — not an RTX 4070 measurement. Treat it as a working floor until empirical RTX 4070 data lands. If you measure it, contribute via the submission form.

ℹ️ Not an image generator. SAM 3 is a segmentation model — it finds and masks objects you name with a text or visual prompt; it does not synthesize new images or video. It lives in our specialized vertical because promptable segmentation doesn't map cleanly onto image/video generation.

Requirements

Component	Minimum	Tested
GPU	4GB VRAM CUDA GPU	RTX 4070 (12GB GDDR6X, Ada Lovelace AD104 sm_89) — pair not yet benchmarked, see /check/
RAM	16GB	—
Storage	~3.4 GB for the SAM 3 checkpoint (Roboflow; `model.safetensors` is 3.44 GB per the HF Files tree)	~5 GB recommended with cache
Software	Python 3.12+, PyTorch 2.7+, CUDA 12.6+ (official minimum; the install command pins `torch==2.10.0` on cu128 — see Installation) (official repo)	—

Installation

Install steps below come from the canonical Meta sources only — the official facebookresearch/sam3 README and the gated Hugging Face model card. The install path is the upstream-supported one, so no third-party walkthrough is required; report deviations via the submission form.

1. Request access to the gated checkpoints

The facebook/sam3 weights are gated (gated: manual on Hugging Face) — you must request and be granted access before they will download. Visit huggingface.co/facebook/sam3, accept the access terms, then authenticate locally once approved:

pip install -U "huggingface_hub[cli]"
hf auth login

Per the official repo, authentication is required to download the checkpoints after your access request is accepted. SAM 3 is released under the custom SAM 3 License (it is not Apache-2.0); the gate is a one-time access request, separate from the license terms — review the license before deploying.

2. Set up a Python environment

Per the official facebookresearch/sam3 README:

conda create -n sam3 python=3.12
conda deactivate
conda activate sam3
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128

The RTX 4070 is Ada Lovelace (AD104, sm_89), which has full kernel coverage in the standard PyTorch CUDA wheels — unlike Blackwell (sm_120) GPUs, no special wheel selection is required for sm_89. The cu128 index here is simply the version the upstream repo pins for the torch==2.10.0 build (CUDA 12.8 includes Ada sm_89 kernels); the official minimum is CUDA 12.6+, so the older cu126 stable wheel also works for the inference path on Ada.
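
A quick sanity check of the environment can catch a mismatched wheel before the SAM 3 install. The sketch below is not part of the upstream repo: `sm_tag`, `python_ok`, and `describe_gpu` are hypothetical helper names, and the only PyTorch calls used are `torch.cuda.is_available()`, `torch.cuda.get_device_name()`, and `torch.cuda.get_device_capability()`. On an RTX 4070 the reported tag should be sm_89.

```python
import sys


def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as the sm_XY tag used in this guide."""
    return f"sm_{major}{minor}"


def python_ok(version=None, minimum=(3, 12)) -> bool:
    """True when the interpreter meets the Python 3.12+ requirement."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def describe_gpu() -> str:
    """Report the first CUDA device.

    torch is imported lazily so the helpers above stay usable
    before PyTorch is installed.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return f"torch {torch.__version__}: {name} ({sm_tag(major, minor)})"


if __name__ == "__main__":
    print("Python 3.12+:", python_ok())
    print(describe_gpu())
```

If the tag is anything other than sm_89 on this card, the most likely cause is that a different GPU is selected as device 0.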

3. Install SAM 3 from the official repository

git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

To run the example notebooks as well, the repo also documents pip install -e ".[notebooks]".

4. (Optional) Faster inference with FlashAttention

The repo documents an optional FlashAttention extra for faster inference. On Ada Lovelace (sm_89), FlashAttention prebuilt kernels are available — there is no sm_120 kernel gap to work around (that limitation is Blackwell-only), so you can install it directly:

pip install einops ninja && pip install flash-attn-3 --no-deps --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/ronghanghu/cc_torch.git

This step is optional — SAM 3 runs on the default attention path without it.

Running

The canonical inference API ships with the official repo. The snippets below are taken verbatim from the facebookresearch/sam3 README's Basic Usage section.

Image segmentation with a text prompt

import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("<YOUR_IMAGE_PATH.jpg>")
inference_state = processor.set_image(image)

# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="<YOUR_TEXT_PROMPT>")

# Get the masks, bounding boxes, and scores
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
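
The README snippet stops at the raw outputs. As a sketch of one possible next step, the helper below keeps only detections above a confidence threshold. It assumes `scores` is a one-dimensional sequence with one entry per mask (the README does not spell out shapes or dtypes), and both the name `confident_indices` and the 0.5 default are illustrative choices, not upstream names or recommended values.

```python
def confident_indices(scores, threshold=0.5):
    """Return indices of detections whose score is at least `threshold`.

    Works on plain lists and on anything whose elements float() accepts,
    such as the elements of a 1-D tensor.
    """
    return [i for i, s in enumerate(scores) if float(s) >= threshold]


if __name__ == "__main__":
    # Stand-in values; in practice pass output["scores"] from the snippet above.
    demo_scores = [0.91, 0.12, 0.58]
    print(confident_indices(demo_scores))  # [0, 2]
```

If the first dimension of `masks` and `boxes` indexes detections (an assumption worth verifying on your own output), the returned list can be used to select the matching rows.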

Video segmentation with a text prompt

from sam3.model_builder import build_sam3_video_predictor

video_predictor = build_sam3_video_predictor()
video_path = "<YOUR_VIDEO_PATH>"  # a JPEG folder or an MP4 video file

# Start a session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=response["session_id"],
        frame_index=0,  # Arbitrary frame index
        text="<YOUR_TEXT_PROMPT>",
    )
)
output = response["outputs"]

The first call downloads the gated checkpoint (~3.4 GB) into your Hugging Face cache. A Hugging Face Transformers integration (Sam3Model / Sam3Processor under transformers) is also documented on the model card for users who prefer that loader; the repo-native API above is shown here because it is verifiable from the public GitHub README without gated access.
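
To confirm the download landed and see how much disk it uses, the cache can be inspected with the standard library. This sketch assumes the default Hugging Face hub layout (`~/.cache/huggingface/hub`, overridable through the `HF_HUB_CACHE` or `HF_HOME` environment variables) and the usual `models--<org>--<name>` folder naming. `hub_cache_root`, `repo_cache_dir`, and `dir_size_gb` are hypothetical helper names.

```python
import os
from pathlib import Path


def hub_cache_root() -> Path:
    """Resolve the hub cache directory from the environment, else the default."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def repo_cache_dir(repo_id: str) -> Path:
    """Map 'facebook/sam3' to <cache>/models--facebook--sam3."""
    return hub_cache_root() / ("models--" + repo_id.replace("/", "--"))


def dir_size_gb(path: Path) -> float:
    """Sum regular-file sizes under `path` in decimal GB.

    Symlinks are skipped so blobs referenced from snapshot folders
    are not counted twice.
    """
    total = sum(
        p.stat().st_size
        for p in path.rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total / 1e9


if __name__ == "__main__":
    target = repo_cache_dir("facebook/sam3")
    if target.exists():
        print(f"{target}: {dir_size_gb(target):.2f} GB")
    else:
        print(f"{target} not found; the checkpoint has not been downloaded yet")
```

A result near 3.4 GB would match the checkpoint size quoted in Requirements; a much smaller number suggests an interrupted download.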

Spending the headroom on a 12 GB card

At a ~4 GB inference floor, SAM 3 leaves roughly 7–8 GB unused on the RTX 4070's 12 GB. That is less headroom than on the larger 16 GB Ada cards (12 GB is about 3× the floor, against 4× on a 16 GB card), but still enough for colocation: run SAM 3's segmentation alongside a second small model on the same card. For example, batch image inputs through SAM 3 while a compact detection or captioning model runs concurrently, or keep a small (3B-class) LLM at Q4 resident for prompt generation. The official repo also ships an SA-Co agent example (sam3_agent.ipynb) that drives SAM 3 from an MLLM; this is feasible on a single 4070 if the companion model is kept small.
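
The headroom arithmetic can be made explicit. The sketch below uses only figures stated in this recipe: the ~4 GB documented floor, the 12 GB nominal capacity, and the roughly 10.5–11.3 GB usable range for a desktop card with a display attached (from Troubleshooting). `headroom_gb` is an illustrative helper name, and the result is a planning estimate, not a measurement.

```python
def headroom_gb(capacity_gb: float, floor_gb: float = 4.0) -> float:
    """VRAM left over after reserving the documented SAM 3 inference floor."""
    return capacity_gb - floor_gb


if __name__ == "__main__":
    for label, capacity in [
        ("nominal 12 GB", 12.0),
        ("usable, low estimate", 10.5),
        ("usable, high estimate", 11.3),
    ]:
        print(f"{label}: {headroom_gb(capacity):.1f} GB free")
```

The usable-VRAM cases come out at about 6.5 to 7.3 GB, which is the budget a companion model would have to fit in once the display's share is accounted for.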

Results

VRAM usage: ~4 GB peak during single-image inference. This figure was measured by hands-on third-party testing on an NVIDIA RTX 6000 (48 GB), not on the RTX 4070 — "VRAM usage hovered just under 4 GB during inference" (sonusahani.com, Nov 20 2025). Roboflow's overview independently corroborates the envelope, stating SAM 3 "uses less VRAM per inference than SAM 2, and fits comfortably on 16 GB GPUs". Treat 4 GB as a documented floor that the RTX 4070's 12 GB comfortably clears — not as a 4070-measured peak.
Speed: Not quoted. No source benchmarks SAM 3 on an RTX 4070 (or a comparable consumer card) with a named throughput number. The only throughput figure found is ~30 ms per image on an H200 datacenter GPU (Roboflow); that card sits in a far higher class than the 4070, so the number would not transfer. Quoting it as a 4070 figure would be fabrication, so speed is omitted. Once a community benchmark lands it will appear at /check/sam-3/rtx-4070; please contribute yours.
Model size: 848M parameters (official facebookresearch/sam3 README); the model.safetensors checkpoint is 3.44 GB on disk (HF Files tree), consistent with Roboflow's ≈3.4 GB figure.
Quality notes: SAM 3 adds open-vocabulary text prompts on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based, conditioned on text, geometry, and image exemplars; the tracker inherits the SAM 2 encoder-decoder architecture. The repo also ships the SAM 3.1 "Object Multiplex" update, a shared-memory approach for faster joint multi-object tracking (official repo).
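
As a consistency check on the two size figures above, dividing the checkpoint size by the parameter count gives the average storage per parameter. The sketch assumes the 3.44 GB figure is in decimal gigabytes and that the file holds little besides weights; neither is confirmed by the sources cited here. `bytes_per_param` is an illustrative helper name.

```python
def bytes_per_param(checkpoint_gb: float, params_millions: float) -> float:
    """Average bytes stored per parameter, treating 1 GB as 10**9 bytes."""
    return (checkpoint_gb * 1e9) / (params_millions * 1e6)


if __name__ == "__main__":
    ratio = bytes_per_param(3.44, 848)
    print(f"{ratio:.2f} bytes per parameter")  # 4.06 bytes per parameter
```

A value close to 4 is what 32-bit floating-point weights would produce, so the two figures are mutually consistent with an fp32 checkpoint. That is an inference from the arithmetic, not a statement from the model card.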

For the full benchmark data, see /check/sam-3/rtx-4070.
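
Because no RTX 4070 measurement exists yet, here is one way a reader might collect a number worth submitting. The sketch wraps any callable with PyTorch's peak-memory counters (`torch.cuda.reset_peak_memory_stats`, `torch.cuda.synchronize`, `torch.cuda.max_memory_allocated`). `measure_peak_vram` and `bytes_to_gib` are hypothetical helpers. The counter covers memory allocated by PyTorch tensors in this process, including weights already loaded, and will typically read lower than the total shown by nvidia-smi.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def measure_peak_vram(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, peak GiB allocated by PyTorch).

    torch is imported lazily; a CUDA device is required at call time.
    """
    import torch

    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()  # wait for queued GPU work before reading counters
    return result, bytes_to_gib(torch.cuda.max_memory_allocated())


# Usage, assuming `processor` and `image` from the image example in Running:
#   state, peak = measure_peak_vram(processor.set_image, image)
#   print(f"peak allocated: {peak:.2f} GiB")
```

When reporting a figure, note the image resolution, the prompt, and whether the number came from this counter or from nvidia-smi, since the two are not directly comparable.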

Troubleshooting

Checkpoint download fails with a 401 / "access restricted"

The facebook/sam3 repo is gated (gated: manual). You must request access on the model page, wait for approval, and run hf auth login with a token that has read access. Until access is granted, build_sam3_image_model() (and any from_pretrained call) will fail to fetch the weights with a 401. The gated repo's README.md is also served as an access-restricted stub when unauthenticated — use the public GitHub README for the usage snippets, as this recipe does.

"CUDA out of memory" on a video session

Video sessions hold per-frame state, so peak memory grows with frame count and resolution and can rise well above the ~4 GB single-image floor. Lower the resolution, reduce the frame count, or start a fresh session after explicitly freeing the previous one (del the old objects, then call torch.cuda.empty_cache()). On a 12 GB RTX 4070 short clips fit comfortably, but long high-resolution videos can climb toward, and past, the card's usable VRAM (a desktop card with a display attached exposes roughly 10.5–11.3 GB usable), so keep an eye on nvidia-smi for long sequences.
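
Reducing the frame count can be done before the session starts when the input is a JPEG folder. The sketch below copies every `stride`-th frame into a new folder using only the standard library. The helper name `subsample_frames`, the zero-padded output names, and the assumption that the predictor accepts any folder of sequentially numbered JPEGs are mine, not taken from the SAM 3 README; check the repo's video examples for the exact naming it expects.

```python
import shutil
from pathlib import Path


def _frame_key(path: Path):
    """Sort numeric stems numerically (2 before 10), others alphabetically."""
    stem = path.stem
    return (0, int(stem), "") if stem.isdecimal() else (1, 0, stem)


def subsample_frames(src_dir, dst_dir, stride: int = 2) -> int:
    """Copy every `stride`-th JPEG from src_dir to dst_dir, renumbered from 0.

    Returns the number of frames written.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    frames = sorted(
        (p for p in src.iterdir() if p.suffix.lower() in {".jpg", ".jpeg"}),
        key=_frame_key,
    )
    kept = frames[::stride]
    for new_index, frame in enumerate(kept):
        shutil.copyfile(frame, dst / f"{new_index:05d}.jpg")
    return len(kept)
```

Pointing `resource_path` at the new folder then starts a session over fewer frames. Larger strides increase the motion between consecutive frames, which may affect tracking, so it is worth trying a small stride first.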

`Sam3Model` / `Sam3Processor` not found, or you prefer the Transformers loader

The repo-native API (build_sam3_image_model / Sam3Processor from the sam3 package) is the verifiable path used in this recipe. If you instead want the Hugging Face Transformers integration (Sam3Model / Sam3Processor imported from transformers), note those classes were added to transformers after SAM 3's November 2025 release — upgrade with pip install -U transformers, or install from source (pip install git+https://github.com/huggingface/transformers) if your pinned release predates them. The repo-native install in step 3 above does not depend on the transformers release cadence.
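
A quick way to tell whether the installed transformers release already includes these classes is to probe for the attribute before writing any loader code. The sketch below is generic: `has_attr_in_module` is a hypothetical helper that returns False when the module is missing or does not expose the name.

```python
import importlib


def has_attr_in_module(module_name: str, attr_name: str) -> bool:
    """True if `module_name` imports and exposes `attr_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    for name in ("Sam3Model", "Sam3Processor"):
        found = has_attr_in_module("transformers", name)
        print(f"transformers.{name}: {'available' if found else 'missing'}")
```

If either class is reported missing, the upgrade or source install described above is the next step.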

For other issues, file a report via the submission form.