Submitted by akhaliq · 124 · MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training · 31 authors · 12
Submitted by akhaliq · 72 · Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking · 6 authors · 7
Submitted by akhaliq · 54 · Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset · 3 authors · 4
Submitted by akhaliq · 25 · GiT: Towards Generalist Vision Transformer through Universal Language Interface · 8 authors · 11
Submitted by akhaliq · 24 · StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control · 4 authors · 3
Submitted by akhaliq · 20 · BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences · 9 authors · 2
Submitted by akhaliq · 16 · Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering · 7 authors · 1
Submitted by akhaliq · 14 · Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring · 6 authors · 3
Submitted by akhaliq · 13 · Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding · 10 authors · 1
Submitted by akhaliq · 8 · VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding · 10 authors · 1
Submitted by akhaliq · 7 · LocalMamba: Visual State Space Model with Windowed Selective Scan · 6 authors · 1