Marlin-2B Open Source: 2B Vision-Language Model for Video Search
The source code for Marlin-2B has been released.
This is a compact vision-language model for extracting structured information from video.
Marlin was fine-tuned for the two questions developers most often ask when working with video: what is happening, and exactly when.
The model shows strong results for its size class: with only 2B parameters, it competes with Gemini-2.5-flash.
Marlin was trained in two modes:
1. marlin.caption() returns structured JSON with the scene and events, with timecodes accurate to the second.
This can be used to generate subtitles for Reels videos, index a video library, or provide an agent with context about what happened and when in a video stream.
2. marlin.find() returns timecodes (start, end) for any natural-language query about the video.
It is fast enough to run directly in an agent loop and can be used to search for video segments with sub-second precision.
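As a rough illustration of how the output of the two modes might be consumed downstream, here is a minimal Python sketch. It does not call Marlin itself. The JSON shape (a "scene" string plus an "events" list with "start", "end", and "description" fields) is an assumption made for this example, not the model's documented schema, and the helper names to_srt_timestamp, events_to_srt, and clip_command are hypothetical. The first part turns caption-style events into SRT subtitles; the second builds an ffmpeg argument list for a (start, end) pair like the one marlin.find() is described as returning.

```python
import json

# Hypothetical shape of a marlin.caption() result; the real schema may differ.
SAMPLE_CAPTION_JSON = """
{
  "scene": "A kitchen with a person cooking",
  "events": [
    {"start": 0, "end": 4, "description": "A person chops onions."},
    {"start": 4, "end": 9, "description": "The onions go into a pan."}
  ]
}
"""


def to_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def events_to_srt(caption: dict) -> str:
    """Render the assumed "events" list as SRT subtitle blocks."""
    events = sorted(caption["events"], key=lambda e: e["start"])
    blocks = []
    for index, event in enumerate(events, start=1):
        start = to_srt_timestamp(event["start"])
        end = to_srt_timestamp(event["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{event['description']}\n")
    # SRT separates subtitle blocks with a blank line.
    return "\n".join(blocks)


def clip_command(video_path: str, start: float, end: float, out_path: str) -> list:
    """Build an ffmpeg argument list that cuts [start, end] out of a video.

    Stream copy ("-c copy") avoids re-encoding, so the actual cut points
    snap to nearby keyframes; drop it if frame-exact cuts are needed.
    """
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",
        "-to", f"{end:.3f}",
        "-i", video_path,
        "-c", "copy",
        out_path,
    ]


if __name__ == "__main__":
    caption = json.loads(SAMPLE_CAPTION_JSON)
    print(events_to_srt(caption))
    # A (start, end) pair in seconds, as a find()-style query might return.
    print(" ".join(clip_command("input.mp4", 12.5, 20.0, "clip.mp4")))
```

The argument list from clip_command can be passed to subprocess.run to cut the segment, and the same timestamp helper works for building a searchable index of events across a video library.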
model:
demo: https://vlm.nemostation.com/