
SAM 3 on RX 7800 XT: Promptable Image and Video Segmentation on ROCm

specialized · beginner · 4GB+ VRAM · Jun 19, 2026

This beginner recipe sets up SAM 3 on the RX 7800 XT, needing about 4 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.12 (per official repo recommendation)
  • A HuggingFace account — the weights are gated and require agreeing to share your contact info under the SAM License

What You'll Build

A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) running on AMD's ROCm stack, capable of concept-prompted image segmentation and video object tracking. A concept prompt is a short text phrase, an image exemplar, or both. SAM 3 unifies a DETR-style text-conditioned detector with a SAM 2-style memory tracker, both sharing a single Perception Encoder backbone. At ~4 GB peak inference VRAM the 16 GB card is comfortably over-provisioned, leaving roughly 12 GB free for other workloads (concurrent models, long-video tracking sessions, larger batches).

Hardware data: RX 7800 XT (16GB VRAM) · BF16 · Transformers on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel here, no FlashAttention build, and no FP8/FP4 path. RDNA3's WMMA units accept FP16, BF16, INT8, INT4 only (no FP8/FP4 hardware), so the right precision for SAM 3 is the native BF16 the model card already uses — and at ~4 GB the model fits the 16 GB card trivially without quantization. The attention path is PyTorch SDPA (the transformers default on ROCm), not FlashAttention-2 and not xformers. If an NVIDIA-oriented guide tells you to pick a cu12x wheel or pip install flash-attn for this card, it's written for the wrong vendor.

ℹ️ Gated weights — gated ≠ restrictively licensed. SAM 3 is released under Meta's custom SAM License (titled "SAM License", not Apache-2.0), and the model card gates the download: it states "You need to agree to share your contact information to access this model". This is an access-request step, not a paywall — log in to HuggingFace, open the model page, agree to share your contact info / accept the conditions, then authenticate locally (huggingface-cli login) so from_pretrained can fetch the checkpoint.

ℹ️ Verdict: PASS (inherited) — watch on first run. SAM 3 is transformers-native, and the transformers SDPA attention path is the default, well-supported route on ROCm — so this should port cleanly. However, the available AMD evidence for SAM 3 + transformers is on a Ryzen AI Max+ 395 APU, not a discrete RDNA3 gfx1101 card; there is no published discrete-7800-XT confirmation yet. Expect it to work, but treat it as unverified-on-this-card until you (or a community submission) confirm it. Report results via the submission form.

Requirements

Component | Minimum | Tested
GPU | 4 GB VRAM (ROCm-supported AMD card) | RX 7800 XT (16 GB); pair not yet benchmarked, see /check/
RAM | 16 GB |
Storage | ~4 GB for SAM 3 weights (model card, 0.9B params) | ~5 GB recommended with cache
Driver | AMD ROCm 7.2.x on Linux |
Software | Python 3.12, PyTorch built for ROCm, a transformers build that includes the SAM 3 classes (official repo) |

The weights are released under Meta's SAM License (a custom Meta license, not Apache-2.0) and are gated on HuggingFace — you must agree to share your contact information before download. See the gated-access step below.
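Before downloading, it can help to confirm that the disk holding the Hugging Face cache has the ~5 GB recommended above. The sketch below uses only the standard library. The cache location is an assumption: it reads the HF_HOME environment variable and otherwise falls back to ~/.cache/huggingface, and the helper names (hf_cache_dir, has_room) are hypothetical.

```python
import os
import shutil
from pathlib import Path


def hf_cache_dir() -> Path:
    """Return the assumed Hugging Face cache root (HF_HOME if set, else ~/.cache/huggingface)."""
    return Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))


def has_room(path: Path, needed_gb: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_gb` GB free."""
    probe = path
    # The cache directory may not exist yet, so walk up to the nearest existing parent.
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    cache = hf_cache_dir()
    print(f"{cache}: room for ~5 GB? {has_room(cache, 5.0)}")
```

If the check fails, pointing HF_HOME at a larger disk before the first from_pretrained call is usually simpler than moving the cache afterwards.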

Installation

Install steps below come from the canonical Meta sources — the official facebookresearch/sam3 README and the HuggingFace model card — with the PyTorch install re-derived for ROCm (the official command targets CUDA). Report deviations via the submission form.

1. Set up a Python environment with PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel index — not a CUDA cu12x index. Create the environment per the official facebookresearch/sam3 README (Python 3.12), but swap its CUDA torch line for the ROCm wheel:

conda create -n sam3 python=3.12
conda activate sam3
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the stable ROCm PyTorch wheel is pinned at rocm7.2 (per the ComfyUI README "AMD GPUs (Linux)" and the PyTorch "Get Started" selector) — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line on the live PyTorch selector before running. AMD also publishes its own Radeon-recommended wheels at repo.radeon.com if you prefer the vendor build.

Confirm the install is the ROCm build, not CUDA: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True. Under HIP, ROCm presents itself as the cuda device namespace, so the device = "cuda" lines below run on your AMD GPU unchanged.
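The same check can be scripted so it reports which kind of build is installed rather than leaving you to read the version suffix. This is a minimal sketch: classify_build and report are hypothetical helper names, and the classification relies on torch.version.hip being set only on ROCm builds and torch.version.cuda only on CUDA builds.

```python
from __future__ import annotations


def classify_build(version: str, hip: str | None, cuda: str | None) -> str:
    """Classify a PyTorch build from torch.__version__, torch.version.hip and torch.version.cuda."""
    if hip or "+rocm" in version:
        return "rocm"
    if cuda:
        return "cuda"
    return "cpu"


def report() -> str:
    """Describe the installed torch build, or say that torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    kind = classify_build(torch.__version__, torch.version.hip, torch.version.cuda)
    line = f"torch {torch.__version__}: {kind} build, GPU visible: {torch.cuda.is_available()}"
    if torch.cuda.is_available():
        line += f", device 0: {torch.cuda.get_device_name(0)}"
    return line


if __name__ == "__main__":
    print(report())
```

Anything other than a rocm build with a visible GPU means the wheel index in step 1 was not picked up.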

2. Accept the SAM License and authenticate

The weights are gated. Open huggingface.co/facebook/sam3 while logged in, agree to share your contact information to accept the conditions (this is Meta's SAM License), then authenticate the CLI so downloads succeed (on recent huggingface_hub releases the CLI is named hf, and the equivalent command is hf auth login):

pip install -U "huggingface_hub[cli]"
huggingface-cli login

3a. Option A — install via HuggingFace Transformers (recommended for quick use)

This is the lowest-friction path; the model card ships Sam3Model / Sam3Processor (and Sam3VideoModel / Sam3VideoProcessor) classes, and the transformers SDPA attention path is the default, well-supported route on ROCm:

pip install -U transformers accelerate pillow requests

If you hit ImportError on Sam3Model, your transformers release predates the SAM 3 classes. Install from source instead — see Troubleshooting.
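A quick way to find out whether the installed transformers release exposes the SAM 3 classes, before writing any inference code, is to probe for them by name. The helper below (has_attr, a hypothetical name) is a generic sketch built on importlib; the class names are the four the model card lists.

```python
import importlib


def has_attr(module_name: str, attr: str) -> bool:
    """True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    for name in ("Sam3Model", "Sam3Processor", "Sam3VideoModel", "Sam3VideoProcessor"):
        status = "found" if has_attr("transformers", name) else "missing"
        print(f"transformers.{name}: {status}")
```

If any of the four prints missing, upgrade transformers or install it from source as described in Troubleshooting.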

3b. Option B — install from the official repository

If you need the reference implementation, training, or finetuning utilities:

git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

4. Download model weights

With the Transformers path, weights download automatically the first time you call from_pretrained("facebook/sam3") (~4 GB for the 0.9B-param checkpoint to your HuggingFace cache), provided you accepted the SAM License in step 2.
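As a back-of-envelope check on the ~4 GB download, the parameter count can be multiplied by the bytes stored per parameter. The sketch below assumes the checkpoint stores weights in FP32 (4 bytes each); that storage format is an assumption, not something the model card figure quoted here confirms. checkpoint_gb is a hypothetical helper.

```python
def checkpoint_gb(params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


PARAMS = 0.9e9  # 0.9B parameters, per the model card

if __name__ == "__main__":
    print(f"FP32 (4 bytes/param): {checkpoint_gb(PARAMS, 4):.1f} GB")
    print(f"BF16 (2 bytes/param): {checkpoint_gb(PARAMS, 2):.1f} GB")
```

Under that assumption the FP32 estimate of about 3.6 GB is consistent with the ~4 GB figure, and the same weights held in BF16 would occupy roughly half that.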

Running

Image segmentation with a text prompt

from transformers import Sam3Model, Sam3Processor
from PIL import Image
import torch
import requests

device = "cuda" if torch.cuda.is_available() else "cpu"  # "cuda" is the HIP/ROCm device on AMD

model = Sam3Model.from_pretrained("facebook/sam3").to(device)
processor = Sam3Processor.from_pretrained("facebook/sam3")

img_url = "https://example.com/image.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

inputs = processor(images=image, text="ear", return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs, threshold=0.5, mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist()
)[0]

print(results["masks"].shape)

Source: facebook/sam3 model card. Swap text="ear" for any concept phrase; the detector is open-vocabulary. Attention runs through PyTorch SDPA on ROCm — no FlashAttention or xformers step is needed (or wanted) on this card.
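Printing the mask tensor shape confirms the run worked, but a ranked list of detections is usually more useful. The sketch below assumes the post-processed results also carry "scores" and "boxes" entries, with boxes in absolute xyxy pixel coordinates; check that against your transformers version. summarize is a hypothetical helper that works on plain Python lists, so convert tensors with .tolist() first.

```python
def summarize(scores, boxes, min_score=0.5):
    """Return detections at or above min_score, best first, with box area in pixels."""
    rows = []
    for score, (x0, y0, x1, y1) in zip(scores, boxes):
        if score < min_score:
            continue
        area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        rows.append({"score": score, "box": (x0, y0, x1, y1), "area_px": area})
    return sorted(rows, key=lambda row: row["score"], reverse=True)


if __name__ == "__main__":
    # Made-up values standing in for results["scores"].tolist() and results["boxes"].tolist()
    demo_scores = [0.62, 0.91, 0.30]
    demo_boxes = [[10, 20, 50, 60], [0, 0, 100, 40], [5, 5, 6, 6]]
    for row in summarize(demo_scores, demo_boxes):
        print(f"score={row['score']:.2f} area={row['area_px']:.0f}px box={row['box']}")
```

With real output the call would look like summarize(results["scores"].tolist(), results["boxes"].tolist()), assuming those keys are present.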

Video segmentation

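from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch

device = "cuda"  # HIP/ROCm device on AMD

# Load the weights in BF16 so they match the session dtype below
model = Sam3VideoModel.from_pretrained("facebook/sam3").to(device, dtype=torch.bfloat16)
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")

video_url = "https://example.com/video.mp4"
video_frames, _ = load_video(video_url)

inference_session = processor.init_video_session(
    video=video_frames,
    inference_device=device,
    dtype=torch.bfloat16,
)

# Register the text prompt once; it applies to the whole session
inference_session = processor.add_text_prompt(
    inference_session=inference_session,
    text="person",
)

# Propagate through the video and collect post-processed outputs per frame
# (follows the model card's pattern; verify the method names against your transformers version)
outputs_per_frame = {}
for model_outputs in model.propagate_in_video_iterator(inference_session=inference_session):
    outputs_per_frame[model_outputs.frame_idx] = processor.postprocess_outputs(
        inference_session, model_outputs
    )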

dtype=torch.bfloat16 is the path of least resistance and matches the model card's own examples. BF16 is a native RDNA3 WMMA input format, so it runs without upcasting. With 16 GB, the 7800 XT also has ample headroom to switch to torch.float32 if you want full precision. Source: facebook/sam3 model card.

Results

  • VRAM usage: ~4 GB peak during single-image inference, in line with SAM 3's 0.9B parameter count (model card) and the broader observation that SAM 3 fits comfortably on small GPUs and uses less VRAM per inference than SAM 2 (Roboflow overview). That figure comes from NVIDIA-side testing, not a 7800 XT, so treat it as a cross-card estimate; the model architecture is identical across vendors and the 0.9B checkpoint in BF16 leaves the 16 GB card with roughly 12 GB free — comfortably enough for concurrent models, long-form video sessions, or larger batch sizes.
  • Model size: 0.9B parameters (facebook/sam3 model card), ~4 GB on disk for the transformers model.safetensors checkpoint.
  • Speed: No RX-7800-XT-named (or any AMD-GPU-named) throughput measurement was found at authoring time, and the backend has no benchmark for this pair. Rather than transfer a figure from a different card or vendor, speed is omitted here — once a community run lands it will appear at /check/sam-3/rx-7800-xt. If you measure it on a 7800 XT, please contribute.
  • Quality notes: SAM 3 adds open-vocabulary concept prompts (text phrases and image exemplars) on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based; the tracker is a SAM 2-style memory transformer reusing the shared Perception Encoder backbone.
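Because the ~4 GB figure is a cross-card estimate, measuring the peak on your own card is worthwhile. The sketch below wraps any callable and reads PyTorch's peak-allocation counter, which works the same way under ROCm because the cuda namespace maps to HIP. peak_vram_gb and to_gb are hypothetical helper names, and the function returns None for the peak when torch or a GPU is unavailable.

```python
def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


def peak_vram_gb(fn):
    """Run fn() and return (result, peak allocated GB), or (result, None) without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return fn(), None
    if not torch.cuda.is_available():
        return fn(), None
    torch.cuda.reset_peak_memory_stats()  # start the peak counter from the current allocation
    result = fn()
    torch.cuda.synchronize()  # make sure queued GPU work has finished before reading
    return result, to_gb(torch.cuda.max_memory_allocated())


if __name__ == "__main__":
    result, peak = peak_vram_gb(lambda: sum(range(1000)))
    print(f"result={result}, peak={'n/a' if peak is None else f'{peak:.2f} GB'}")
```

In practice you would pass something like lambda: model(**inputs). The counter covers tensors allocated through PyTorch only, so rocm-smi will normally show a somewhat higher number.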

For the full benchmark data, see /check/sam-3/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled" / model lands on CPU

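This means a non-ROCm build of PyTorch (typically a CPU-only or CUDA wheel from the default index) got installed instead of the ROCm build. Uninstall all three packages and reinstall against the ROCm wheel index:

pip uninstall -y torch torchvision torchaudio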
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Then confirm torch.cuda.is_available() returns True and torch.__version__ carries a +rocm7.2-style suffix. Under HIP, ROCm presents as the cuda device namespace, so the device = "cuda" code above runs on the AMD GPU unchanged — you do not edit it to say "hip".

Sam3Model or Sam3Processor not found in transformers

SAM 3 classes were added to transformers after the model's November 2025 release. First try upgrading: pip install -U transformers. If your pinned environment can't resolve a release that includes them, install from source:

pip install git+https://github.com/huggingface/transformers

As a guaranteed-working alternative, use Option B above (install from facebookresearch/sam3), whose reference implementation does not depend on the transformers release cadence.

401 / gated-repo error on download

from_pretrained("facebook/sam3") returns an access error if you haven't accepted the SAM License. Visit huggingface.co/facebook/sam3 while logged in, agree to share your contact information to accept the conditions, run huggingface-cli login with a token that has read access, then retry.
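To tell a missing token apart from an unaccepted license, you can try fetching one small file and inspect the error. This is a sketch under assumptions: it assumes the repo contains a config.json, that huggingface_hub exposes GatedRepoError under huggingface_hub.errors in your installed version, and that the Hub answers 401 when no valid token is sent and 403 when the token is valid but access has not been granted. explain_status and check_access are hypothetical helpers.

```python
from __future__ import annotations


def explain_status(code: int | None) -> str:
    """Map an HTTP status from the Hub to the most likely fix for a gated repo."""
    if code == 401:
        return "No valid token: log in again with a token that has read access."
    if code == 403:
        return "Token accepted, access not granted: accept the conditions on the model page."
    return "Unexpected status: check the repo id and your network connection."


def check_access(repo_id: str = "facebook/sam3", filename: str = "config.json") -> str:
    """Try to fetch one small file and report what, if anything, blocks the download."""
    try:
        from huggingface_hub import hf_hub_download
        from huggingface_hub.errors import GatedRepoError
    except ImportError:
        return "huggingface_hub is not installed in this environment."
    try:
        hf_hub_download(repo_id=repo_id, filename=filename)
    except GatedRepoError as err:
        status = getattr(getattr(err, "response", None), "status_code", None)
        return explain_status(status)
    except Exception as err:  # network failures, a missing file, and so on
        return f"Download failed for another reason: {err!r}"
    return "Access OK."
```

Calling print(check_access()) from the sam3 environment gives a one-line diagnosis without pulling the full checkpoint.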

Don't reach for FlashAttention, xformers, or an FP8 checkpoint

NVIDIA-oriented guides often suggest pip install flash-attn, pip install xformers, or an FP8 weight variant to save memory. On RDNA3 these are the wrong path: there are no consumer-card FlashAttention prebuilt wheels for gfx1101, the ROCm xformers fork is limited, and RDNA3 has no FP8 hardware (an FP8 checkpoint just upcasts to BF16/FP16 with no memory win). The transformers default — PyTorch SDPA in BF16 — is the correct and supported path here, and at ~4 GB on a 16 GB card you have plenty of memory headroom rather than pressure to optimize away.

Long video sessions and concurrent models

Video sessions hold per-frame state, so peak usage during a long video can exceed the ~4 GB single-image envelope. The 16 GB card is forgiving here, but the same hygiene applies: lower the resolution or drop frame count if you push into multi-minute sessions, and free old sessions explicitly with del inference_session; torch.cuda.empty_cache() before opening a fresh one (torch.cuda.empty_cache() is the correct call on ROCm too — it maps to the HIP allocator). If you intend to run SAM 3 alongside another model on the same card, the ~12 GB free after SAM 3 loads is enough room for most mid-sized models — verify by watching rocm-smi after both are warm.
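Dropping frame count can be as simple as keeping evenly spaced frames up to a cap before opening the session. The helper below (subsample, a hypothetical name) is a minimal sketch of that idea in pure Python; the cap of 300 in the usage note is an arbitrary illustration, not a measured limit for this card.

```python
def subsample(n_frames: int, max_frames: int) -> list[int]:
    """Evenly spaced frame indices starting at 0, never more than max_frames of them."""
    if n_frames <= 0 or max_frames <= 0:
        return []
    if n_frames <= max_frames:
        return list(range(n_frames))
    stride = -(-n_frames // max_frames)  # ceiling division keeps the count within the cap
    return list(range(0, n_frames, stride))


if __name__ == "__main__":
    print(subsample(10, 4))
    print(len(subsample(1800, 300)))
```

Applied to the video example, this would look like video_frames = [video_frames[i] for i in subsample(len(video_frames), 300)], assuming init_video_session accepts a list of frames. Skipping frames trades temporal resolution for memory, so fast-moving objects may track less reliably.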

For other issues, file a report via the submission form.

common questions
How much VRAM does SAM 3 need?

About 4 GB — the minimum this recipe targets.

Which GPUs is SAM 3 tested on?

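This recipe targets the RX 7800 XT (16 GB). The pair has not been benchmarked on that card yet; see the verdict note above.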

How hard is this setup?

Beginner — follow the steps above.