cpaua

Marlin-2B Open Source: 2B Vision-Language Model for Video Search


The source code for Marlin-2B has been released

This is a compact vision-language model for extracting structured information from video.

Marlin was fine-tuned to answer the two questions developers most often ask when working with video: what is happening, and exactly when.

The model shows strong results for its size class: with only 2B parameters, it competes with Gemini-2.5-flash.

Marlin was trained to work in two modes:

1. marlin.caption() returns structured JSON with the scene and events, with timecodes accurate to the second.

This can be used to generate subtitles for Reels videos, index a video library, or provide an agent with context about what happened and when in a video stream.

2. marlin.find() returns timecodes (start, end) for any natural-language query about the video.

It is fast enough to run directly in an agent loop and can be used to search for video segments with sub-second precision.
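To make the two modes more concrete, here is a minimal Python sketch of what downstream code might do with their output. It does not call the model. The JSON shape (an "events" list with "start", "end" and "description" fields, times in seconds) is an assumption for illustration and may not match the real schema, and the helper names events_to_srt and pad_span are hypothetical. The first helper turns caption-style events into SRT subtitles; the second widens a (start, end) span, such as one returned by a find-style query, before cutting a clip.

```python
import json


def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def events_to_srt(caption_json):
    """Build SRT text from caption-style JSON (assumed schema, see above)."""
    events = json.loads(caption_json)["events"]
    cues = []
    # Sort by start time so cue numbers follow playback order.
    for index, event in enumerate(sorted(events, key=lambda e: e["start"]), start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(event['start'])} --> {srt_timestamp(event['end'])}\n"
            f"{event['description']}\n"
        )
    # Each cue ends with a newline, so joining with "\n" leaves a blank line between cues.
    return "\n".join(cues)


def pad_span(span, padding, duration):
    """Widen a (start, end) span by `padding` seconds, clamped to [0, duration]."""
    start, end = span
    return (max(0.0, start - padding), min(duration, end + padding))


# Hypothetical payload in the assumed caption schema (not real model output).
sample_caption = json.dumps({
    "scene": "kitchen",
    "events": [
        {"start": 4, "end": 9, "description": "Water is poured into a pot"},
        {"start": 0, "end": 4, "description": "A person enters the kitchen"},
    ],
})

print(events_to_srt(sample_caption))
print(pad_span((5.0, 8.0), padding=1.5, duration=9.0))
```

Keeping the formatting and padding logic separate from the model call means the same helpers work whether the timecodes come from a local run or from the hosted demo.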

model: https://huggingface.co/NemoStation/Marlin-2B
demo: https://vlm.nemostation.com/

