SAM 2 + GPT-4o — Cascading Foundation Models via Visual Prompting — Part 1 (2024)

In Part 1 of this article we introduce Segment Anything Model 2 (SAM 2). Then, we walk you through how you can set it up and run inference on your own video clips.

🔥 Learn more about visual prompting and RAG:

  • CVPR 2024: Foundation Models + Visual Prompting Are About to Disrupt Computer Vision
  • RAG for Vision: Building Multimodal Computer Vision Systems

Table of Contents

  1. What is Segment Anything Model 2 (SAM 2)?
  2. What is special about SAM 2?
  3. How can I run SAM 2?
  4. What’s next

1. What is Segment Anything Model 2 (SAM 2)?

TL;DR:

SAM 2 can segment objects in any image or video without retraining.

Segment Anything Model 2 (SAM 2) [1] by Meta is an advanced version of the original Segment Anything Model [2] designed for object segmentation in both images and videos (see Figure 1).

Figure 1. A pedestrian (blue mask) and a car (yellow mask) are segmented and tracked using SAM 2

Released under an open-source Apache 2.0 license, SAM 2 represents a significant leap forward in computer vision, allowing for real-time, promptable segmentation of objects.

SAM 2 is notable for its accuracy in image segmentation and its superior performance in video segmentation, requiring significantly less interaction time than previous models: later in this article, we show how SAM 2 needed only three points to segment an object across an entire video!

Alongside SAM 2, Meta has also introduced the SA-V dataset, which features over 51,000 videos and more than 600,000 masklets (spatio-temporal masks). This scale supports applications in diverse fields such as medical imaging, satellite imagery, marine science, and content creation.

1.1 SAM 2 features summary

The main characteristics of SAM 2 are summarized in Figure 2.

Figure 2. Summary of SAM 2's main features

2. What is special about SAM 2?

What’s novel about SAM 2 is that it addresses the complexities of video data, such as object motion, deformation, occlusion, and lighting changes, which are not present in static images.

This makes SAM 2 a crucial tool for applications in mixed reality, robotics, autonomous vehicles, and video editing.

Figure 3. SAM 2 in action: the ball is removed from the original video (top left), and a new video with no ball is created (bottom right) (Source)

SAM 2’s key innovations are:

  1. Unified Model for Images and Videos: SAM 2 treats images as single-frame videos, allowing it to handle both types of input seamlessly. This unification is achieved by leveraging memory to recall previously processed information in videos, enabling accurate segmentation across frames.
  2. Promptable Visual Segmentation Task: SAM 2 generalizes the image segmentation task to the video domain by taking input prompts (points, boxes, or masks) in any frame of a video to define a spatio-temporal mask (masklet). It can make immediate predictions and propagate them temporally, refining the segmentation iteratively with additional prompts (see the sketch after this list).
  3. Advanced Dataset (SA-V): SAM 2 is trained on the SA-V dataset, which is significantly larger than existing video segmentation datasets. This extensive dataset enables SAM 2 to achieve state-of-the-art performance in video segmentation.
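
To make the first two points concrete: the same promptable interface works on a single image, which SAM 2 treats as a one-frame video. Below is a minimal sketch using the image predictor from the SAM 2 repository; the checkpoint path, image file, and click location are placeholders, so treat it as an illustration rather than the exact API of your installed version.

# A minimal sketch: prompting SAM 2 on a single image (a one-frame "video")
import numpy as np
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # placeholder path
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("frame.jpg").convert("RGB"))  # placeholder image
predictor.set_image(image)

# A single positive click (label 1) is already a valid prompt
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 300]], dtype=np.float32),  # placeholder click
    point_labels=np.array([1], dtype=np.int32),
)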

3. How can I run SAM 2?

You can either check the SAM 2 repository or set up the model on your own machine using this Jupyter Notebook. In this section, we describe the latter approach.

3.1 Pre-requisites

  • A machine with a GPU
  • A library to extract frames from a video (e.g., ffmpeg)
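
Before the setup, a quick sanity check of both prerequisites can save a failed run later. This is our own snippet, not part of SAM 2, and it assumes PyTorch is already installed:

import shutil
import torch

# The video predictor below assumes a CUDA GPU
assert torch.cuda.is_available(), "No CUDA GPU detected"
# ffmpeg is used to extract (and later re-assemble) video frames
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"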

3.2 Setup

import os

HOME = os.getcwd()

# Clone the repository
!git clone https://github.com/facebookresearch/segment-anything-2.git
%cd {HOME}/segment-anything-2

# Install the Python libraries for "segment-anything-2"
!pip install -e . -q
!pip install -e ".[demo]" -q

3.3 Download SAM 2 checkpoints

We’ll download only the largest model, but smaller options are available too (see below).

!wget -q https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt -P {HOME}/checkpoints
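
For reference, the same release also ships tiny, small, and base-plus checkpoints; the file names below are the ones published in the July 2024 (072824) release, so double-check the repository README if they have moved:

# Smaller checkpoints from the same release (download only what you need)
!wget -q https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt -P {HOME}/checkpoints
!wget -q https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_small.pt -P {HOME}/checkpoints
!wget -q https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_base_plus.pt -P {HOME}/checkpoints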

3.4 Create a predictor

from sam2.build_sam import build_sam2_video_predictor

sam2_checkpoint = f"{HOME}/checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
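
In the repo version we used, build_sam2_video_predictor loads the model on CUDA by default; if you want to be explicit about the device, or pair a smaller checkpoint with its config (e.g. sam2_hiera_t.yaml with sam2_hiera_tiny.pt), the arguments can be passed directly. Treat the signature as an assumption and check build_sam.py in your checkout:

# Equivalent call with the device spelled out
predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device="cuda")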

3.5 Extract the frames from your video and explore the data

# Imports needed for the plotting below (not in the notebook excerpt above)
import matplotlib.pyplot as plt
from PIL import Image

# Extract the frames
video_path = f"{HOME}/segment-anything-2/SAM2_gymnastics.mp4"
output_path = f"{HOME}/segment-anything-2/outputs/gymnastics"
os.makedirs(output_path, exist_ok=True)  # ffmpeg won't create the output directory
!ffmpeg -i {video_path} -q:v 2 -start_number 0 {output_path}/'%05d.jpg'

video_dir = f"{HOME}/segment-anything-2/outputs/gymnastics"

# Scan all the JPEG frame names in this directory
frame_names = [
    p
    for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

# Take a look at the first video frame
frame_idx = 0
plt.figure(figsize=(12, 8))
plt.title(f"frame {frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[frame_idx])))

Figure 4. The first frame of the gymnastics video

3.6 Define the objects to segment using coordinates

We define a helper function that converts a list of (x, y) coordinates into point prompts for the predictor:

import numpy as np

def refine_mask_with_coordinates(coordinates, ann_frame_idx, ann_obj_id, show_result=True):
    """
    Refine a mask by adding new points using a SAM predictor.

    Args:
        coordinates (list): List of [x, y] coordinates, e.g., [[210, 350], [250, 220]]
        ann_frame_idx (int): The index of the frame being processed
        ann_obj_id (int): A unique identifier for the object being segmented
        show_result (bool): Whether to display the result (default: True)
    """
    # Convert the list of coordinates to a numpy array
    points = np.array(coordinates, dtype=np.float32)

    # Create labels array (assuming all points are positive clicks)
    labels = np.ones(len(coordinates), dtype=np.int32)

    # Add new points to the predictor
    _, out_obj_ids, out_mask_logits = predictor.add_new_points(
        inference_state=inference_state,
        frame_idx=ann_frame_idx,
        obj_id=ann_obj_id,
        points=points,
        labels=labels,
    )

    if show_result:
        # Display the results (show_points and show_mask are the
        # visualization helpers from the SAM 2 example notebook)
        plt.figure(figsize=(12, 8))
        plt.title(f"Frame {ann_frame_idx}")
        plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
        show_points(points, labels, plt.gca())
        show_mask((out_mask_logits[0] > 0.0).cpu().numpy(), plt.gca(), obj_id=out_obj_ids[0])
        plt.show()

We establish the state and provide the coordinates of the objects we aim to segment:

inference_state = predictor.init_state(video_path=video_dir)
refine_mask_with_coordinates([[950, 700], [950, 600], [950, 500]], 0, 1)

Figure 5. The three point prompts and the resulting mask on frame 0

As shown in Figure 5, three points were enough for the model to assign a mask to the whole body of the individual.
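
Our helper labels every click as positive (label 1). If a mask ever bleeds into the background, SAM 2 also accepts negative clicks (label 0) that carve regions out of the mask. Here is a sketch with placeholder coordinates, reusing the same inference state and object id:

# Two positive clicks plus one negative click (label 0) to exclude a region
points = np.array([[950, 700], [950, 500], [880, 400]], dtype=np.float32)
labels = np.array([1, 1, 0], dtype=np.int32)
_, out_obj_ids, out_mask_logits = predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,  # refine on the same annotated frame
    obj_id=1,     # same object id, so the prompts accumulate
    points=points,
    labels=labels,
)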

Now we run the process on all the frames (Figure 6):

# Run propagation throughout the video and collect the results in a dict
video_segments = {}  # video_segments contains the per-frame segmentation results
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
        for i, out_obj_id in enumerate(out_obj_ids)
    }

# Render the segmentation results every few frames
vis_frame_stride = 30
plt.close("all")
for out_frame_idx in range(0, len(frame_names), vis_frame_stride):
    plt.figure(figsize=(6, 4))
    plt.title(f"frame {out_frame_idx}")
    plt.imshow(Image.open(os.path.join(video_dir, frame_names[out_frame_idx])))
    for out_obj_id, out_mask in video_segments[out_frame_idx].items():
        show_mask(out_mask, plt.gca(), obj_id=out_obj_id)

Figure 6. Propagated masks rendered every 30 frames

Finally, we combine the frames to generate a video using ffmpeg. The end result is shown in Figure 7.
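
The stitching code isn't shown in the notebook excerpt above; here is a minimal sketch of the idea, saving each overlaid frame and then assembling them with ffmpeg. The output paths and the 30 fps frame rate are placeholders (match your source video):

# Save every frame with its mask overlay, then stitch the overlays into a video
overlay_dir = f"{HOME}/segment-anything-2/outputs/gymnastics_overlay"
os.makedirs(overlay_dir, exist_ok=True)
for out_frame_idx in range(len(frame_names)):
    plt.figure(figsize=(6, 4))
    plt.axis("off")
    plt.imshow(Image.open(os.path.join(video_dir, frame_names[out_frame_idx])))
    for out_obj_id, out_mask in video_segments[out_frame_idx].items():
        show_mask(out_mask, plt.gca(), obj_id=out_obj_id)
    plt.savefig(f"{overlay_dir}/{out_frame_idx:05d}.jpg", bbox_inches="tight", pad_inches=0)
    plt.close()

!ffmpeg -framerate 30 -i {overlay_dir}/%05d.jpg -c:v libx264 -pix_fmt yuv420p {HOME}/gymnastics_masked.mp4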

Figure 7. Top: original video, Bottom: video after running SAM 2 on it

4. What’s next

SAM 2’s ability to segment objects accurately and quickly in both images and videos can revolutionize how computer vision systems are created.

In Part 2, we'll explore how we can use GPT-4o to provide visual prompts to SAM 2 in what we call a cascade of foundation models: chaining models together to create the vision systems of the future.

🔥 Learn more about the cutting edge of multimodality and foundation models in our CVPR 2024 series:

  • (RAG, Multimodal, Embeddings, and more).
  • Top Highlights You Must Know — Embodied AI, GenAI, Foundation Models, and Video Understanding.

References

[1] N. Ravi et al., "SAM 2: Segment Anything in Images and Videos," Meta AI, 2024.

[2] A. Kirillov et al., "Segment Anything," ICCV, 2023.

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan

👉 If you would like to know more about Tenyks, try our sandbox.
