Abstract: Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation. Still, their understanding of the 3D world needs to be improved, limiting progress ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results