NVIDIA Open-Sources LocateAnything-3B for Faster Visual Localization
NVIDIA has open-sourced the visual localization model LocateAnything-3B.
The model can locate objects even in very dense scenes. For example, in an image with dozens of Minions standing close together, it correctly marks each one with its own bounding box.
The main difference from most existing models is how bounding boxes are generated. Typically, the coordinates (x1, y1, x2, y2) are predicted sequentially, digit by digit. This slows down inference, and an error in an early digit can carry over into the coordinates that follow, especially when there are many objects.
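To make the cost of digit-by-digit prediction concrete, here is a minimal Python sketch. It is not the format used by any particular model: the fixed width of three digits per coordinate, the absence of separator tokens, and the helper names box_to_digit_tokens and digit_tokens_to_box are all assumptions made for illustration.

```python
def box_to_digit_tokens(box, digits=3):
    """Serialize an (x1, y1, x2, y2) box into a flat list of digit tokens.

    Assumption: integer pixel coordinates in 0..999, zero-padded to a
    fixed width, with no separator tokens between coordinates.
    """
    tokens = []
    for coord in box:
        tokens.extend(str(coord).zfill(digits))
    return tokens


def digit_tokens_to_box(tokens, digits=3):
    """Inverse of box_to_digit_tokens for a single box."""
    coords = [
        int("".join(tokens[i:i + digits]))
        for i in range(0, len(tokens), digits)
    ]
    return tuple(coords)


boxes = [(12, 40, 118, 230), (130, 42, 240, 228)]

# A sequential decoder emits one token per step, so the step count
# equals the total number of digit tokens.
sequence = [tok for box in boxes for tok in box_to_digit_tokens(box)]
print(len(sequence))  # 24

# A single wrong leading digit moves a coordinate by 100 pixels.
corrupted = box_to_digit_tokens(boxes[0])
corrupted[0] = "1"
print(digit_tokens_to_box(corrupted))  # (112, 40, 118, 230)
```

Under these assumptions, two boxes already take 24 sequential steps, and every later token is generated on top of whatever was emitted before it, including any wrong digit.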
LocateAnything-3B uses parallel decoding: the model predicts complete boxes directly rather than constructing them step by step. This makes detection more stable, especially in scenes with many objects.
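The difference in decoding steps can be estimated with a rough back-of-the-envelope sketch in Python. It does not describe LocateAnything-3B's actual decoder: the announcement does not say how many boxes are produced per step, so the sketch assumes one step per complete box, and three digits per coordinate for the sequential baseline. The function names are hypothetical.

```python
def digit_by_digit_steps(num_boxes, coords_per_box=4, digits_per_coord=3):
    """Decoding steps when every digit of every coordinate is its own step.

    Assumption: fixed-width coordinates and no separator tokens.
    """
    return num_boxes * coords_per_box * digits_per_coord


def whole_box_steps(num_boxes):
    """Decoding steps when each step emits one complete box.

    Assumption: one box per step; a real model may emit several
    boxes at once or use a different scheme entirely.
    """
    return num_boxes


for n in (1, 10, 50):
    print(n, digit_by_digit_steps(n), whole_box_steps(n))
# 1 12 1
# 10 120 10
# 50 600 50
```

Under these assumptions the gap grows linearly with the number of objects, which is consistent with the claim that the benefit is largest in crowded scenes.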
The training data went beyond classic object recognition datasets to include data for UI recognition, OCR, and document structure analysis. As a result, the model can locate real-world objects as well as user interface elements and text regions.
The model has 3 billion parameters and is released as open source.