SAM 3 on RTX 4060 Ti 16GB: Promptable Image and Video Segmentation

What You'll Build

A local inference setup for Meta's Segment Anything Model 3 (SAM 3) on an RTX 4060 Ti 16GB, capable of text-prompted image segmentation and video object tracking. SAM 3 unifies the SAM 2 tracker with a DETR-style text-conditioned detector. At ~4 GB peak inference VRAM, the 16 GB card has ample headroom: roughly 12 GB stays free for other workloads (concurrent models, long-video tracking sessions, larger batch sizes).

Hardware data: RTX 4060 Ti 16GB (16GB VRAM) · ~4 GB peak inference VRAM observed by third-party testing · See benchmark data

Note: As of this writing the backend has no measured benchmarks for this pair. The ~4 GB VRAM figure comes from independent third-party testing on a different card (an NVIDIA RTX 6000, see Results); expect comparable behaviour on the 4060 Ti, but treat it as a working estimate until empirical data lands at /check/.

Requirements

Component	Minimum	Tested
GPU	4GB VRAM CUDA GPU	RTX 4060 Ti 16GB — pair not yet benchmarked, see /check/
RAM	16GB	—
Storage	~3.4 GB for SAM 3 weights (Roboflow)	~5 GB recommended with cache
Software	Python 3.12, PyTorch 2.10, CUDA 12.8 (the official repo's pin; see Installation). PyTorch 2.7+ likely works for the Transformers path but is not upstream-tested.	—

Installation

The install steps below come from the canonical Meta sources only: the official facebookresearch/sam3 README and the HuggingFace model card. No independent third-party walkthrough is required because the install path is the upstream-supported one; report deviations via the submission form.

1. Set up a Python environment

Per the official facebookresearch/sam3 README:

conda create -n sam3 python=3.12
conda activate sam3
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128

The cu128 index is the build Meta tests against upstream, and it works without extra setup on the RTX 4060 Ti's Ada Lovelace (sm_89) architecture. The older cu126 / cu121 wheels also work for the Transformers path on Ada; cu128 is only required if you're following the official repo's pin verbatim.
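
To confirm the environment before moving on, a quick check like the one below can help. It is a minimal sketch, not part of the upstream instructions: the helper name describe_env is hypothetical, and it only reads standard PyTorch attributes (torch.__version__, torch.version.cuda, torch.cuda.get_device_capability). It returns early if PyTorch is not installed.

```python
import importlib.util
import sys


def describe_env():
    """Return a small dict describing the Python / PyTorch / CUDA setup."""
    info = {"python": "%d.%d" % sys.version_info[:2], "torch": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment
    import torch

    info["torch"] = torch.__version__        # e.g. "2.10.0+cu128"
    info["cuda_build"] = torch.version.cuda  # CUDA version the wheel was built for
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        # Ada Lovelace cards such as the RTX 4060 Ti report (8, 9), i.e. sm_89.
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}: {value}")
```

If cuda_available comes back False on a machine with the card installed, the usual suspects are a CPU-only wheel or a driver problem rather than anything specific to SAM 3.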

2a. Option A — install via HuggingFace Transformers (recommended for quick use)

This is the lowest-friction path; the model card ships Sam3Model / Sam3Processor classes:

pip install transformers accelerate pillow

2b. Option B — install from the official repository

If you need the reference implementation, training, or finetuning utilities:

git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

3. Download model weights

With the Transformers path, weights download automatically the first time you call from_pretrained("facebook/sam3") (~3.4 GB to your HuggingFace cache).
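
Before the first download it can be worth checking free disk space, and afterwards how much the cached model actually occupies. The sketch below assumes the standard Hugging Face cache layout (HF_HUB_CACHE or HF_HOME overriding ~/.cache/huggingface, with one models--org--name folder per repository); the helper names are hypothetical.

```python
import os
import shutil
from pathlib import Path


def hf_hub_cache():
    """Resolve the Hugging Face hub cache directory from the usual env vars."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def dir_size_gb(path):
    """Total size of regular files under path in GB, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Snapshot entries are symlinks into blobs/; count each blob once.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e9


def free_gb(path):
    """Free space on the filesystem holding path (or its nearest existing parent)."""
    probe = Path(path)
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


if __name__ == "__main__":
    cache = hf_hub_cache()
    model_dir = cache / "models--facebook--sam3"
    print(f"cache dir : {cache}")
    print(f"free space: {free_gb(cache):.1f} GB")
    print(f"SAM 3 size: {dir_size_gb(model_dir):.2f} GB")
```

A size of 0.00 GB simply means the weights have not been downloaded yet.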

Running

Image segmentation with a text prompt

from transformers import Sam3Processor, Sam3Model
from PIL import Image
import torch

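# Fall back to CPU when no CUDA device is visible (much slower).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
processor = Sam3Processor.from_pretrained("facebook/sam3")

# The text prompt names the concept to segment in the image.
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, text="ear", return_tensors="pt").to(device)

# Inference only, so skip gradient tracking to save memory.
with torch.no_grad():
    outputs = model(**inputs)

# Filter detections by score and resize masks to the original image size.
results = processor.post_process_instance_segmentation(
    outputs, threshold=0.5, mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist()
)[0]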

Source: facebook/sam3 model card.
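
To inspect what came back, something like the following may be useful. It is a sketch under an assumption: that results is a dict with "scores" and "boxes" entries of equal length, with boxes in absolute xyxy pixel coordinates, as the model card describes. The helper summarize_detections is hypothetical and works on plain lists as well as tensors.

```python
def summarize_detections(results, min_score=0.5):
    """Return (score, [x0, y0, x1, y1]) pairs above min_score, best first.

    Assumes results["scores"] and results["boxes"] are equal-length
    sequences (lists or tensors); each box holds four numbers.
    """
    pairs = []
    for score, box in zip(results["scores"], results["boxes"]):
        score = float(score)
        if score >= min_score:
            pairs.append((score, [round(float(v), 1) for v in box]))
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Stand-in data; in practice pass the dict from the post-processing step.
    demo = {
        "scores": [0.91, 0.32, 0.74],
        "boxes": [[10, 12, 80, 95], [0, 0, 5, 5], [120, 40, 190, 110]],
    }
    for score, box in summarize_detections(demo):
        print(f"score={score:.2f} box={box}")
```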

Video segmentation

from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import torch

# Load the model in the same dtype the inference session uses below.
model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda", dtype=torch.bfloat16)
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")

video_frames, _ = load_video("your_video.mp4")

inference_session = processor.init_video_session(
    video=video_frames,
    inference_device="cuda",
    dtype=torch.bfloat16,
)

inference_session = processor.add_text_prompt(
    inference_session=inference_session,
    text="person",
)

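dtype=torch.bfloat16 is the path of least resistance; with 16 GB, the 4060 Ti also has the headroom to switch to torch.float32 if you want full precision. This snippet only opens the session and registers the prompt; see the model card for the propagation step that produces per-frame masks. Source: facebook/sam3 model card.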
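
For longer clips, one simple way to keep a session small is to thin the frames before opening it. The sketch below is an illustration, not something the model card prescribes: frame_stride and thin_frames are hypothetical helpers, and they rely only on the frame container supporting len() and slicing. Skipping frames lowers the effective frame rate, which may hurt tracking of fast motion.

```python
def frame_stride(n_frames, max_frames):
    """Smallest stride that keeps at most max_frames of n_frames."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    return max(1, -(-n_frames // max_frames))  # ceiling division


def thin_frames(frames, max_frames):
    """Keep every stride-th frame so the result has at most max_frames items."""
    return frames[:: frame_stride(len(frames), max_frames)]


if __name__ == "__main__":
    frames = list(range(1000))        # stand-in for decoded video frames
    kept = thin_frames(frames, 300)
    print(len(kept), kept[:3])        # 250 frames: indices 0, 4, 8, ...
```

The thinned sequence would then be passed as the video argument when the session is created.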

Results

VRAM usage: ~4 GB peak during single-image inference, measured by third-party hands-on testing on an NVIDIA RTX 6000 (sonusahani.com, Nov 2025). Roboflow's overview corroborates that SAM 3 "fits comfortably on 16 GB GPUs" and "uses less VRAM per inference than SAM 2." On the 16 GB RTX 4060 Ti that leaves roughly 12 GB free — comfortably enough for concurrent models, long-form video sessions, or larger batch sizes than the 8 GB tier can handle.
Model size: 848M parameters (official facebookresearch/sam3 README), 3.4 GB on disk (Roboflow).
Quality notes: SAM 3 adds open-vocabulary text prompts on top of SAM 2's box/point/mask prompts and supports video tracking via the same model. The detector is DETR-based, conditioned on text, geometry, and image exemplars.
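
As a sanity check on these figures, the parameter count and the on-disk size agree if the checkpoint stores 4-byte float32 weights. That storage format is an assumption, not something the sources state. The arithmetic below covers weights only; activations and video session state come on top.

```python
PARAMS = 848_000_000  # parameter count from the official README
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_gb(params, dtype):
    """Approximate weight footprint in GB (10^9 bytes) for a given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weights_gb(PARAMS, dtype):.2f} GB")
```

The float32 figure comes out at about 3.39 GB, in line with the 3.4 GB on disk, and bfloat16 halves it to about 1.70 GB.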

Once empirical RTX 4060 Ti 16GB numbers are seeded, they will appear at /check/sam-3/rtx-4060-ti-16gb.

Troubleshooting

Long video sessions and concurrent models

Video sessions hold per-frame state, so peak usage during a long video can exceed the ~4 GB single-image envelope. The 16 GB card is forgiving here, with headroom for longer clips than the 8 GB tier, but the same hygiene applies: lower the resolution or reduce the frame count if you push into multi-minute sessions, and free old sessions explicitly with del inference_session; torch.cuda.empty_cache() before opening a fresh one.

If you intend to run SAM 3 alongside another model on the same card, the ~12 GB left free after SAM 3 loads is enough room for most small-to-mid models. Verify by watching nvidia-smi after both are warm.
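
Instead of eyeballing nvidia-smi, the same numbers can be read from a script. This is a minimal sketch: it shells out to nvidia-smi with its --query-gpu CSV options and returns an empty list when the tool is missing or fails. The helper names are hypothetical.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB lines from nvidia-smi's noheader/nounits CSV."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Return one dict per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_memory_csv(out)


if __name__ == "__main__":
    for index, row in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {row['used_mib']} MiB used, {row['free_mib']} MiB free")
```

Run it once with only SAM 3 warm and again with the second model loaded; the difference in used_mib is the second model's real cost on this card.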

`Sam3Model` or `Sam3Processor` not found in `transformers`

SAM 3 classes were added to transformers after the model's November 2025 release. Upgrade with pip install -U transformers. If a pinned environment can't upgrade, use Option B (install from facebookresearch/sam3).
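
To tell whether the installed transformers build already exposes the SAM 3 classes, a check along these lines may help. It is a sketch: sam3_support is a hypothetical helper that looks for the two class names used in this guide and treats any import failure as "not available".

```python
import importlib
import importlib.metadata


def sam3_support():
    """Return (installed_version, has_sam3_classes) for transformers."""
    try:
        version = importlib.metadata.version("transformers")
    except importlib.metadata.PackageNotFoundError:
        return None, False  # transformers is not installed at all
    try:
        module = importlib.import_module("transformers")
        available = all(
            hasattr(module, name) for name in ("Sam3Model", "Sam3Processor")
        )
    except Exception:
        # A broken or partial install ends up here.
        available = False
    return version, available


if __name__ == "__main__":
    version, available = sam3_support()
    print(f"transformers {version}: SAM 3 classes available = {available}")
```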

Wheel/CUDA mismatch with PyTorch 2.10

The official repo pins torch==2.10.0 with CUDA 12.8 wheels. The RTX 4060 Ti 16GB is Ada Lovelace (sm_89), which has full kernel coverage on the default PyTorch wheels — cu128 is the upstream pin, but cu126 / cu121 also work for the Transformers path. Stick to cu128 only if you're following the facebookresearch/sam3 install verbatim, since that's the only combination upstream actually tests.
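
To see whether an installed build matches the upstream pin, the version string can be split into its release and CUDA tag. The sketch below assumes the "+cuXXX" local-version suffix that wheels from the PyTorch index carry (for example "2.10.0+cu128"); builds from other sources may omit the suffix, in which case torch.version.cuda is the value to check. The helper names are hypothetical.

```python
PINNED = ("2.10.0", "cu128")  # torch release and CUDA tag pinned by the official repo


def split_torch_version(version):
    """Split '2.10.0+cu128' into ('2.10.0', 'cu128'); the tag is None if absent."""
    release, _, local = version.partition("+")
    return release, (local or None)


def matches_pin(version):
    """True when the version string matches the upstream pin exactly."""
    return split_torch_version(version) == PINNED


if __name__ == "__main__":
    for candidate in ("2.10.0+cu128", "2.7.1+cu126", "2.10.0"):
        print(candidate, "->", split_torch_version(candidate), matches_pin(candidate))
```

In a live environment, pass torch.__version__ to matches_pin instead of the sample strings.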

For other issues, file a report via the submission form.