What You'll Build
A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on an RTX 4070, capable of text-prompted image segmentation and video object tracking. SAM 3 pairs a DETR-style text-conditioned detector with the SAM 2 tracker, both sharing a single vision encoder — so one model handles open-vocabulary detection, image segmentation, and video tracking from the same prompt.
Hardware data: RTX 4070 (12GB GDDR6X, 192-bit) · ~4 GB documented inference VRAM floor (measured on a larger card — see Results) · See benchmark data
Note: As of this writing the backend has no measured benchmarks for this pair: /check/sam-3/rtx-4070 returns verdict: unknown. The VRAM figure below is a documented floor measured by independent third-party testing on a different, larger-VRAM card (an RTX 6000), not an RTX 4070 measurement. Treat it as a working floor until empirical RTX 4070 data lands. If you measure it, contribute via the submission form.
ℹ️ Not an image generator. SAM 3 is a segmentation model: it finds and masks objects you name with a text or visual prompt; it does not synthesize new images or video. It lives in our specialized vertical because promptable segmentation doesn't map cleanly onto image/video generation.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4GB VRAM CUDA GPU | RTX 4070 (12GB GDDR6X, Ada Lovelace AD104 sm_89) — pair not yet benchmarked, see /check/ |
| RAM | 16GB | — |
| Storage | ~3.4 GB for the SAM 3 checkpoint (Roboflow; model.safetensors is 3.44 GB per the HF Files tree) | ~5 GB recommended with cache |
| Software | Python 3.12+, PyTorch 2.7+, CUDA 12.6+ (minimums per the official repo; the install command pins torch==2.10.0 on cu128, see Installation) | — |
Installation
Install steps below come from the canonical Meta sources only: the official facebookresearch/sam3 README and the gated Hugging Face model card. The install path is the upstream-supported one, so no third-party walkthrough is required; report deviations via the submission form.
1. Request access to the gated checkpoints
The facebook/sam3 weights are gated (gated: manual on Hugging Face) — you must request and be granted access before they will download. Visit huggingface.co/facebook/sam3, accept the access terms, then authenticate locally once approved:
pip install -U "huggingface_hub[cli]"
hf auth login
Per the official repo, authentication is required to download the checkpoints after your access request is accepted. SAM 3 is released under the custom SAM 3 License (it is not Apache-2.0); the gate is a one-time access request, separate from the license terms — review the license before deploying.
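Before moving on, it can help to confirm that the token stored by hf auth login can actually read the gated repo. The sketch below assumes a recent huggingface_hub release that provides HfApi.auth_check; the helper name check_gated_access and its api parameter are illustrative and not part of any official tooling.

```python
def check_gated_access(repo_id="facebook/sam3", api=None):
    """Return (ok, detail) describing whether the current token can read repo_id.

    `api` is a hypothetical injection point for testing. By default a real
    huggingface_hub.HfApi client is created, which needs network access.
    """
    if api is None:
        # Imported lazily so the helper can be defined without the package.
        from huggingface_hub import HfApi
        api = HfApi()
    try:
        # auth_check raises when access is missing or still pending approval.
        api.auth_check(repo_id)
    except Exception as err:  # broad on purpose: this is only a diagnostic
        return False, f"{type(err).__name__}: {err}"
    return True, f"token can read {repo_id}"
```

Calling check_gated_access() with no arguments performs the real check against the Hub; a False result with a gated-repo error usually means the access request has not been approved yet.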
2. Set up a Python environment
Per the official facebookresearch/sam3 README:
conda create -n sam3 python=3.12
conda deactivate
conda activate sam3
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128
The RTX 4070 is Ada Lovelace (AD104, sm_89), which has full kernel coverage in the standard PyTorch CUDA wheels — unlike Blackwell (sm_120) GPUs, no special wheel selection is required for sm_89. The cu128 index here is simply the version the upstream repo pins for the torch==2.10.0 build (CUDA 12.8 includes Ada sm_89 kernels); the official minimum is CUDA 12.6+, so the older cu126 stable wheel also works for the inference path on Ada.
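A quick way to check that the installed wheel sees the card is to print the device name and compute capability; an RTX 4070 should report capability (8, 9). This is a minimal sketch: the helper names arch_label and describe_gpu and the label strings are made up for illustration, while the torch.cuda calls are standard PyTorch.

```python
# Compute capabilities mentioned in this guide; the labels are illustrative.
KNOWN_ARCHS = {
    (8, 9): "Ada Lovelace (sm_89)",
    (12, 0): "Blackwell (sm_120)",
}

def arch_label(capability):
    """Map a (major, minor) compute capability to a readable label."""
    major, minor = capability
    return KNOWN_ARCHS.get((major, minor), f"sm_{major}{minor}")

def describe_gpu(device_index=0):
    """Summarize the visible GPU and the PyTorch/CUDA build in one line."""
    import torch  # imported lazily so arch_label works without PyTorch

    if not torch.cuda.is_available():
        return "CUDA not available: check the driver and the installed wheel"
    name = torch.cuda.get_device_name(device_index)
    capability = torch.cuda.get_device_capability(device_index)
    return (
        f"{name} | {arch_label(capability)} | "
        f"torch {torch.__version__} | CUDA {torch.version.cuda}"
    )
```

Run print(describe_gpu()) inside the sam3 environment; if it reports that CUDA is not available, fix that before installing SAM 3.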
3. Install SAM 3 from the official repository
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
To run the example notebooks as well, the repo also documents pip install -e ".[notebooks]".
4. (Optional) Faster inference with FlashAttention
The repo documents an optional FlashAttention extra for faster inference. On Ada Lovelace (sm_89), FlashAttention prebuilt kernels are available — there is no sm_120 kernel gap to work around (that limitation is Blackwell-only), so you can install it directly:
pip install einops ninja && pip install flash-attn-3 --no-deps --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/ronghanghu/cc_torch.git
This step is optional — SAM 3 runs on the default attention path without it.
Running
The canonical inference API ships with the official repo. The snippets below are taken verbatim from the facebookresearch/sam3 README's Basic Usage section.
Image segmentation with a text prompt
import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)
# Load an image
image = Image.open("<YOUR_IMAGE_PATH.jpg>")
inference_state = processor.set_image(image)
# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="<YOUR_TEXT_PROMPT>")
# Get the masks, bounding boxes, and scores
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
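A common next step is to drop low-confidence detections. The sketch below assumes that masks, boxes, and scores are index-aligned with one entry per detected instance, which the README snippet suggests but does not spell out. The helper filter_by_score and the 0.5 threshold are illustrative choices, not part of the sam3 API.

```python
def filter_by_score(masks, boxes, scores, threshold=0.5):
    """Keep only the instances whose score is at least `threshold`.

    Assumes the three inputs are index-aligned (one entry per instance).
    Works on lists, NumPy arrays, or torch tensors, because it only relies
    on iteration, integer indexing, and float() conversion of each score.
    """
    keep = [i for i, score in enumerate(scores) if float(score) >= threshold]
    return (
        [masks[i] for i in keep],
        [boxes[i] for i in keep],
        [float(scores[i]) for i in keep],
    )
```

For example, masks, boxes, scores = filter_by_score(masks, boxes, scores, threshold=0.6) keeps only the more confident instances; tune the threshold to your prompt and images.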
Video segmentation with a text prompt
from sam3.model_builder import build_sam3_video_predictor
video_predictor = build_sam3_video_predictor()
video_path = "<YOUR_VIDEO_PATH>" # a JPEG folder or an MP4 video file
# Start a session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=response["session_id"],
        frame_index=0,  # Arbitrary frame index
        text="<YOUR_TEXT_PROMPT>",
    )
)
output = response["outputs"]
The first call downloads the gated checkpoint (~3.4 GB) into your Hugging Face cache. A Hugging Face Transformers integration (Sam3Model / Sam3Processor under transformers) is also documented on the model card for users who prefer that loader; the repo-native API above is shown here because it is verifiable from the public GitHub README without gated access.
Spending the headroom on a 12 GB card
At a ~4 GB inference floor, SAM 3 leaves roughly 7–8 GB unused on the RTX 4070's 12 GB; the card holds about three times the documented floor. That is less headroom than a 16 GB Ada card offers, but still enough for colocation: run SAM 3's segmentation alongside a second small model on the same card. For example, batch image inputs through SAM 3 while a compact detection or captioning model runs concurrently, or keep a small (3B-class) LLM at Q4 resident for prompt generation. The official repo also ships an SA-Co agent example (sam3_agent.ipynb) that drives SAM 3 from an MLLM, which is feasible on a single 4070 only if the companion model is kept small.
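The headroom arithmetic can be written down explicitly. The sketch below uses the ~4 GB floor from Results and the 10.5–11.3 GB usable range quoted in Troubleshooting; the 1 GB safety margin and the 0.5 bytes-per-parameter estimate for Q4 weights are rough assumptions (real quantization formats and KV caches add overhead), and all function names are illustrative.

```python
def headroom_gb(usable_gb, floor_gb=4.0):
    """VRAM left after SAM 3's documented ~4 GB single-image floor."""
    return usable_gb - floor_gb

def q4_weights_gb(params_billion):
    """Rough weight size at 4 bits (0.5 bytes) per parameter, overhead excluded."""
    return params_billion * 0.5

def fits(companion_gb, usable_gb, floor_gb=4.0, safety_gb=1.0):
    """True if a companion model fits next to SAM 3 with an assumed safety margin."""
    return companion_gb + safety_gb <= headroom_gb(usable_gb, floor_gb)

# Nominal 12 GB, then the usable range with a display attached.
for usable in (12.0, 11.3, 10.5):
    print(f"{usable:>4} GB usable -> {headroom_gb(usable):.1f} GB headroom")

# A 3B-class model at Q4 is about 1.5 GB of weights under this estimate.
print("3B @ Q4 fits on 10.5 GB usable:", fits(q4_weights_gb(3), 10.5))
```

Under these assumptions the real headroom is closer to 6.5–7.3 GB once the desktop takes its share, which is why the companion model needs to stay small.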
Results
- VRAM usage: ~4 GB peak during single-image inference. This figure was measured by hands-on third-party testing on an NVIDIA RTX 6000 (48 GB), not on the RTX 4070 — "VRAM usage hovered just under 4 GB during inference" (sonusahani.com, Nov 20 2025). Roboflow's overview independently corroborates the envelope, stating SAM 3 "uses less VRAM per inference than SAM 2, and fits comfortably on 16 GB GPUs". Treat 4 GB as a documented floor that the RTX 4070's 12 GB comfortably clears — not as a 4070-measured peak.
- Speed: Not quoted. No source benchmarks SAM 3 on an RTX 4070 (or a comparable consumer card) with a named throughput number. The only throughput figure found is ~30 ms per image on an H200 datacenter GPU (Roboflow), a hardware class far above the 4070, so the number would not transfer. Quoting it as a 4070 figure would be fabrication, so speed is omitted. Once a community benchmark lands it will appear at /check/sam-3/rtx-4070; please contribute yours.
- Model size: 848M parameters (official facebookresearch/sam3 README); the model.safetensors checkpoint is 3.44 GB on disk (HF Files tree), consistent with Roboflow's ≈3.4 GB figure.
- Quality notes: SAM 3 adds open-vocabulary text prompts on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based, conditioned on text, geometry, and image exemplars; the tracker inherits the SAM 2 encoder-decoder architecture. The repo also ships the SAM 3.1 "Object Multiplex" update, a shared-memory approach for faster joint multi-object tracking (official repo).
For the full benchmark data, see /check/sam-3/rtx-4070.
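If you want to produce the missing RTX 4070 numbers yourself, one possible approach is to wrap an inference call in a small probe that records peak allocated VRAM and wall time. This is a sketch, not an official benchmarking method: gpu_probe and bytes_to_gib are hypothetical helpers, while the torch.cuda functions they call are standard PyTorch.

```python
import time
from contextlib import contextmanager

def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3

@contextmanager
def gpu_probe(result):
    """Record peak allocated VRAM (GiB) and wall time (s) into `result`."""
    import torch  # imported lazily so the helpers load without PyTorch

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()  # finish pending work before timing starts
    start = time.perf_counter()
    try:
        yield result
    finally:
        torch.cuda.synchronize()  # wait for the timed kernels to complete
        result["seconds"] = time.perf_counter() - start
        result["peak_gib"] = bytes_to_gib(torch.cuda.max_memory_allocated())

# Usage, reusing the names from the image example in Running:
#   stats = {}
#   with gpu_probe(stats):
#       output = processor.set_text_prompt(state=inference_state, prompt="cat")
#   print(stats)
```

Note that max_memory_allocated counts only tensors allocated through PyTorch, so it reads lower than nvidia-smi, which also includes the CUDA context and the allocator's cache. Run a warm-up call first and report both figures if you submit results.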
Troubleshooting
Checkpoint download fails with a 401 / "access restricted"
The facebook/sam3 repo is gated (gated: manual). You must request access on the model page, wait for approval, and run hf auth login with a token that has read access. Until access is granted, build_sam3_image_model() (and any from_pretrained call) will fail to fetch the weights with a 401. The gated repo's README.md is also served as an access-restricted stub when unauthenticated — use the public GitHub README for the usage snippets, as this recipe does.
"CUDA out of memory" on a video session
Video sessions hold per-frame state, so peak memory grows with frame count and resolution and can rise well above the ~4 GB single-image floor. Lower the resolution, reduce the frame count, or start a fresh session after freeing the previous one explicitly (del the objects that hold it, then call torch.cuda.empty_cache()). A 12 GB RTX 4070 handles short clips comfortably, but long high-resolution videos can climb toward, and past, the card's usable VRAM (a desktop card with a display attached exposes roughly 10.5–11.3 GB), so watch nvidia-smi during long sequences.
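One way to reduce the frame count is to process a long clip in shorter pieces and release cached memory between them. The sketch below only computes the frame ranges and frees memory; how you materialize each range (for instance, as its own JPEG folder passed as resource_path) is left to you, and object identities would not carry across separately started sessions. Both helper names are illustrative.

```python
import gc

def chunk_frame_ranges(total_frames, max_frames):
    """Split [0, total_frames) into consecutive (start, end) ranges, end exclusive."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    return [
        (start, min(start + max_frames, total_frames))
        for start in range(0, total_frames, max_frames)
    ]

def release_cuda_memory():
    """Collect unreachable Python objects, then empty PyTorch's CUDA cache."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return  # nothing to release without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

For a 1,000-frame clip, chunk_frame_ranges(1000, 300) yields four ranges; call release_cuda_memory() after dropping your references to the previous session's outputs, since empty_cache cannot free tensors that are still referenced.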
Sam3Model / Sam3Processor not found, or you prefer the Transformers loader
The repo-native API (build_sam3_image_model / Sam3Processor from the sam3 package) is the verifiable path used in this recipe. If you instead want the Hugging Face Transformers integration (Sam3Model / Sam3Processor imported from transformers), note those classes were added to transformers after SAM 3's November 2025 release — upgrade with pip install -U transformers, or install from source (pip install git+https://github.com/huggingface/transformers) if your pinned release predates them. The repo-native install in step 3 above does not depend on the transformers release cadence.
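To see whether your installed transformers release already includes these classes, a small check like the following can help. It is a sketch: transformers_has_sam3 is a hypothetical helper, and the class names are the ones the model card documents.

```python
import importlib

def transformers_has_sam3(class_names=("Sam3Model", "Sam3Processor")):
    """Return True if the installed transformers exposes the given classes."""
    try:
        transformers = importlib.import_module("transformers")
    except Exception:
        return False  # transformers is missing or failed to import
    try:
        return all(hasattr(transformers, name) for name in class_names)
    except Exception:
        return False  # a lazy import inside transformers failed
```

If transformers_has_sam3() returns False, upgrade or install from source as described above before trying the Transformers loader.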
For other issues, file a report via the submission form.