Abstract: Inspired by the success of vision–language models (VLMs) in zero-shot classification, recent works attempt to extend this approach to object detection by leveraging the localization ...