What You'll Build
A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) running on AMD's ROCm stack, capable of concept-prompted (short text phrase, image exemplar, or both) image segmentation and video object tracking. SAM 3 unifies a DETR-style text-conditioned detector with a SAM 2-style memory tracker, both sharing a single Perception Encoder backbone. At ~4 GB peak inference VRAM, the 24 GB card is heavily over-provisioned, leaving roughly 20 GB free for other workloads (concurrent models, long-video tracking sessions, larger batches).
Hardware data: RX 7900 XTX (24GB VRAM) · BF16 · Transformers on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel here, no FlashAttention build, and no FP8/FP4 path. RDNA3's WMMA units accept FP16, BF16, INT8, INT4 only (no FP8/FP4 hardware), so the right precision for SAM 3 is the native BF16 the model card already uses, and at 24 GB you have no reason to quantize. The attention path is PyTorch SDPA (the transformers default on ROCm), not FlashAttention-2 and not xformers. If an NVIDIA-oriented guide tells you to pick a cu12x wheel or pip install flash-attn for this card, it's written for the wrong vendor.
ℹ️ Gated weights: gated ≠ restrictively licensed. SAM 3 is released under Meta's custom SAM License (titled "SAM License", not Apache-2.0), and the model card gates the download: it states "You need to agree to share your contact information to access this model." This is an access-request step, not a paywall: log in to HuggingFace, open the model page, agree to share your contact info / accept the conditions, then authenticate locally (huggingface-cli login) so from_pretrained can fetch the checkpoint.
ℹ️ Verdict: PASS (inherited) — watch on first run. SAM 3 is transformers-native, and the transformers SDPA attention path is the default, well-supported route on ROCm — so this should port cleanly. However, the available AMD evidence for SAM 3 + transformers is on a Ryzen AI Max+ 395 APU, not a discrete RDNA3 gfx1100 card; there is no published discrete-7900-XTX confirmation yet. Expect it to work, but treat it as unverified-on-this-card until you (or a community submission) confirm it. Report results via the submission form.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB) — pair not yet benchmarked, see /check/ |
| RAM | 16 GB | — |
| Storage | ~4 GB for SAM 3 weights (model card, 0.9B params) | ~5 GB recommended with cache |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | Python 3.12, PyTorch built for ROCm, a transformers build that includes the SAM 3 classes (official repo) | — |
The weights are released under Meta's SAM License (a custom Meta license, not Apache-2.0) and are gated on HuggingFace — you must agree to share your contact information before download. See the gated-access step below.
Installation
Install steps below come from the canonical Meta sources (the official facebookresearch/sam3 README and the HuggingFace model card), with the PyTorch install re-derived for ROCm (the official command targets CUDA). Report deviations via the submission form.
1. Set up a Python environment with PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel index — not a CUDA cu12x index. Create the environment per the official facebookresearch/sam3 README (Python 3.12), but swap its CUDA torch line for the ROCm wheel:
conda create -n sam3 python=3.12
conda activate sam3
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the stable ROCm PyTorch wheel is pinned at rocm7.2 (per the ComfyUI README "AMD GPUs (Linux)" and the PyTorch "Get Started" selector), but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line on the live PyTorch selector before running. AMD also publishes its own Radeon-recommended wheels at repo.radeon.com if you prefer the vendor build.
Confirm the install is the ROCm build, not CUDA: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True. Under HIP, ROCm masquerades as the cuda device namespace, so the device = "cuda" lines below run on your AMD GPU unchanged.
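The check above can be folded into one small script. This is a minimal sketch, not an official tool: the helper names rocm_tag and classify_build are made up for illustration, and it assumes torch.version.hip is set on ROCm builds and empty otherwise.

```python
import importlib.util
import re


def rocm_tag(version: str):
    """Return the ROCm tag from a version string like 'X.Y.Z+rocm7.2', else None."""
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", version)
    return match.group(1) if match else None


def classify_build(hip_version, cuda_version) -> str:
    """Classify a PyTorch build from torch.version.hip and torch.version.cuda."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


if importlib.util.find_spec("torch") is not None:
    import torch

    hip = getattr(torch.version, "hip", None)
    cuda = getattr(torch.version, "cuda", None)
    print("torch version:", torch.__version__)
    print("wheel tag    :", rocm_tag(torch.__version__))
    print("build kind   :", classify_build(hip, cuda))
    print("GPU visible  :", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device 0     :", torch.cuda.get_device_name(0))
else:
    print("torch is not installed in this environment")
```

If the build kind is anything other than rocm, reinstall from the ROCm wheel index before going further.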
2. Accept the SAM License and authenticate
The weights are gated. Open huggingface.co/facebook/sam3 while logged in, agree to share your contact information to accept the conditions (this is Meta's SAM License), then authenticate the CLI so downloads succeed:
pip install -U "huggingface_hub[cli]"
huggingface-cli login
3a. Option A — install via HuggingFace Transformers (recommended for quick use)
This is the lowest-friction path: transformers provides the Sam3Model / Sam3Processor (and Sam3VideoModel / Sam3VideoProcessor) classes used on the model card, and the transformers SDPA attention path is the default, well-supported route on ROCm:
pip install -U transformers accelerate pillow requests
If you hit ImportError on Sam3Model, your transformers release predates the SAM 3 classes. Install from source instead (see Troubleshooting).
3b. Option B — install from the official repository
If you need the reference implementation, training, or finetuning utilities:
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
4. Download model weights
With the Transformers path, weights download automatically the first time you call from_pretrained("facebook/sam3") (~4 GB for the 0.9B-param checkpoint to your HuggingFace cache), provided you accepted the SAM License in step 2.
Running
Image segmentation with a text prompt
from transformers import Sam3Model, Sam3Processor
from PIL import Image
import torch
import requests

device = "cuda" if torch.cuda.is_available() else "cpu"  # "cuda" is the HIP/ROCm device on AMD
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
processor = Sam3Processor.from_pretrained("facebook/sam3")

# Placeholder URL: point this at a real image
img_url = "https://example.com/image.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# The text argument is the concept prompt
inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to instance masks at the original image size
results = processor.post_process_instance_segmentation(
    outputs, threshold=0.5, mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist()
)[0]
print(results["masks"].shape)
Source: facebook/sam3 model card. Swap text="ear" for any concept phrase; the detector is open-vocabulary. Attention runs through PyTorch SDPA on ROCm — no FlashAttention or xformers step is needed (or wanted) on this card.
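Once masks come back, a common next step is keeping only the confident detections. The sketch below is illustrative: select_confident is a hypothetical helper, the example scores are made up, and it assumes the post-processed results also carry a per-instance "scores" entry alongside "masks" (check the keys of your own results dict before relying on this).

```python
from typing import Sequence


def select_confident(scores: Sequence[float], min_score: float = 0.7) -> list[int]:
    """Return indices of detections scoring at least min_score, best first."""
    keep = [i for i, score in enumerate(scores) if score >= min_score]
    return sorted(keep, key=lambda i: scores[i], reverse=True)


# Made-up scores standing in for results["scores"].tolist()
example_scores = [0.91, 0.42, 0.77]
print(select_confident(example_scores))  # [0, 2]
```

With real outputs, something like idx = select_confident(results["scores"].tolist()) followed by results["masks"][idx] would keep the matching masks, assuming the "scores" key is present.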
Video segmentation
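from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch

# Load the model in BF16 so it matches the session dtype below
model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda", dtype=torch.bfloat16)  # HIP/ROCm device on AMD
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")

video_url = "https://example.com/video.mp4"  # placeholder URL
video_frames, _ = load_video(video_url)

inference_session = processor.init_video_session(
    video=video_frames,
    inference_device="cuda",
    dtype=torch.bfloat16,
)

# Register the text prompt once for the whole session, not once per frame
inference_session = processor.add_text_prompt(
    inference_session=inference_session,
    text="person",
)

# Propagate the prompt through the video. Method names follow the model card's
# video example; verify them against your installed transformers version.
outputs_per_frame = {}
for model_outputs in model.propagate_in_video_iterator(inference_session=inference_session):
    outputs_per_frame[model_outputs.frame_idx] = processor.postprocess_outputs(
        inference_session, model_outputs
    )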
dtype=torch.bfloat16 is the path of least resistance and matches the model card's own examples. BF16 is a native RDNA3 WMMA input format, so it runs without upcasting. With 24 GB the 7900 XTX also has ample headroom to switch to torch.float32 if you want full precision. Source: facebook/sam3 model card.
Results
- VRAM usage: ~4 GB peak during single-image inference, in line with SAM 3's 0.9B parameter count (model card) and the broader observation that SAM 3 fits comfortably on small GPUs and uses less VRAM per inference than SAM 2 (Roboflow overview). That figure comes from NVIDIA-side testing, not a 7900 XTX, so treat it as a cross-card estimate; the model architecture is identical across vendors and the 0.9B checkpoint in BF16 leaves the 24 GB card with roughly 20 GB free — comfortably enough for concurrent models, long-form video sessions, or larger batch sizes.
- Model size: 0.9B parameters (facebook/sam3 model card), ~4 GB on disk.
- Speed: No RX-7900-XTX-named (or any AMD-GPU-named) throughput measurement was found at authoring time, and the backend has no benchmark for this pair. Rather than transfer a figure from a different card or vendor, speed is omitted here — once a community run lands it will appear at /check/sam-3/rx-7900-xtx. If you measure it on a 7900 XTX, please contribute.
- Quality notes: SAM 3 adds open-vocabulary concept prompts (text phrases and image exemplars) on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based; the tracker is a SAM 2-style memory transformer reusing the shared Perception Encoder backbone.
For the full benchmark data, see /check/sam-3/rx-7900-xtx.
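As a sanity check on the figures above, the weights-only footprint can be estimated from the parameter count. This is back-of-the-envelope arithmetic under stated assumptions: exactly 0.9B parameters, every parameter stored at the listed width, and no allowance for activations, buffers, or allocator overhead, which is why measured peaks sit above these numbers.

```python
# Bytes per parameter for the precisions discussed in this guide
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_footprint_gb(num_params: float, dtype: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params = 0.9e9  # parameter count quoted from the model card
for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{weight_footprint_gb(params, dtype):.1f} GB of weights")

card_gb, peak_gb = 24, 4  # card size and the cross-card peak estimate above
print(f"headroom: ~{card_gb - peak_gb} GB")
```

This prints roughly 1.8 GB for BF16 and 3.6 GB for FP32. Both sit inside the ~4 GB peak envelope, so even full precision leaves most of the card free.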
Troubleshooting
"Torch not compiled with CUDA enabled" / model lands on CPU
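This means a non-ROCm build of PyTorch got installed instead of the ROCm build. The "not compiled with CUDA" message itself comes from a CPU-only wheel; a CUDA wheel typically complains about a missing NVIDIA driver instead. Either way, uninstall all three packages and reinstall against the ROCm wheel index:
pip uninstall -y torch torchvision torchaudio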
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Then confirm torch.cuda.is_available() returns True and torch.__version__ carries a +rocm7.2-style suffix. Under HIP, ROCm presents as the cuda device namespace, so the device = "cuda" code above runs on the AMD GPU unchanged — you do not edit it to say "hip".
Sam3Model or Sam3Processor not found in transformers
SAM 3 classes were added to transformers after the model's November 2025 release. First try upgrading: pip install -U transformers. If your pinned environment can't resolve a release that includes them, install from source:
pip install git+https://github.com/huggingface/transformers
As a guaranteed-working alternative, use Option B above (install from facebookresearch/sam3), whose reference implementation does not depend on the transformers release cadence.
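Before reinstalling anything, it can help to confirm which SAM 3 classes the installed package actually exposes. A minimal sketch follows; missing_sam3_classes is a hypothetical helper, not part of transformers, and it treats a class whose import fails for any reason as missing.

```python
import importlib
import importlib.util

SAM3_CLASSES = ("Sam3Model", "Sam3Processor", "Sam3VideoModel", "Sam3VideoProcessor")


def _exposes(module, name: str) -> bool:
    """True if the attribute resolves; lazy imports that fail count as missing."""
    try:
        getattr(module, name)
        return True
    except Exception:
        return False


def missing_sam3_classes(module_name: str = "transformers") -> list[str]:
    """Return the SAM 3 class names that the given package does not expose."""
    if importlib.util.find_spec(module_name) is None:
        return list(SAM3_CLASSES)
    module = importlib.import_module(module_name)
    return [name for name in SAM3_CLASSES if not _exposes(module, name)]


missing = missing_sam3_classes()
print("all SAM 3 classes present" if not missing else f"missing: {missing}")
```

An empty list means the Transformers path should import cleanly. Anything else points to an older release or a broken install.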
401 / gated-repo error on download
from_pretrained("facebook/sam3") returns an access error if you haven't accepted the SAM License. Visit huggingface.co/facebook/sam3 while logged in, agree to share your contact information to accept the conditions, run huggingface-cli login with a token that has read access, then retry.
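To separate "license not accepted" from "not logged in" without loading the whole model, one option is to request a single small file and inspect the exception. This is a sketch under assumptions: check_access and the NEXT_STEP table are hypothetical names, config.json is assumed to exist in the repo, and the exception classes are taken from huggingface_hub.errors as found in recent huggingface_hub releases.

```python
import importlib.util

# Hypothetical status codes used only in this sketch
NEXT_STEP = {
    "ok": "Access works; from_pretrained should be able to download.",
    "gated": "Accept the conditions on the model page while logged in, then retry.",
    "not-found-or-unauthenticated": "Run huggingface-cli login with a read token, then retry.",
    "hub-missing": "Install huggingface_hub first.",
}


def check_access(repo_id: str = "facebook/sam3", filename: str = "config.json") -> str:
    """Try to fetch one small file and map the outcome to a status code."""
    if importlib.util.find_spec("huggingface_hub") is None:
        return "hub-missing"
    from huggingface_hub import hf_hub_download
    from huggingface_hub.errors import GatedRepoError, RepositoryNotFoundError

    try:
        hf_hub_download(repo_id=repo_id, filename=filename)
    except GatedRepoError:
        # Caught first because it subclasses RepositoryNotFoundError
        return "gated"
    except RepositoryNotFoundError:
        return "not-found-or-unauthenticated"
    return "ok"
```

Calling print(NEXT_STEP[check_access()]) on a networked machine gives a one-line hint. Network failures are deliberately left to raise so they are not mistaken for a licensing problem.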
Don't reach for FlashAttention, xformers, or an FP8 checkpoint
NVIDIA-oriented guides often suggest pip install flash-attn, pip install xformers, or an FP8 weight variant to save memory. On RDNA3 these are the wrong path: there are no consumer-card FlashAttention prebuilt wheels for gfx1100, the ROCm xformers fork is limited, and RDNA3 has no FP8 hardware (an FP8 checkpoint just upcasts to BF16/FP16 with no memory win). The transformers default — PyTorch SDPA in BF16 — is the correct and supported path here, and at 24 GB you have no memory pressure to optimize away.
Long video sessions and concurrent models
Video sessions hold per-frame state, so peak usage during a long video can exceed the ~4 GB single-image envelope. The 24 GB card is very forgiving here, but the same hygiene applies: lower the resolution or drop frame count if you push into multi-minute sessions, and free old sessions explicitly with del inference_session; torch.cuda.empty_cache() before opening a fresh one (torch.cuda.empty_cache() is the correct call on ROCm too — it maps to the HIP allocator). If you intend to run SAM 3 alongside another model on the same card, the ~20 GB free after SAM 3 loads is enough room for most models — verify by watching rocm-smi after both are warm.
For other issues, file a report via the submission form.