GaggiX

GaggiX's activity

upvoted a paper 8 days ago

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Paper • 2410.06885 • Published 13 days ago • 33

upvoted 2 papers 12 days ago

Pixtral 12B

Paper • 2410.07073 • Published 13 days ago • 57

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Paper • 2410.01912 • Published 20 days ago • 13

upvoted a paper 14 days ago

FAN: Fourier Analysis Networks

Paper • 2410.02675 • Published 19 days ago • 24

upvoted a paper 16 days ago

Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

Paper • 2410.02416 • Published 19 days ago • 25

upvoted a paper 21 days ago

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Paper • 2409.17066 • Published 27 days ago • 26

upvoted a paper 26 days ago

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published 27 days ago • 96

upvoted 2 papers 27 days ago

MaskBit: Embedding-free Image Generation via Bit Tokens

Paper • 2409.16211 • Published 28 days ago • 16

SLIMER-IT: Zero-Shot NER on Italian Language

Paper • 2409.15933 • Published 28 days ago • 4

upvoted a paper 30 days ago

MuCodec: Ultra Low-Bitrate Music Codec

Paper • 2409.13216 • Published Sep 20 • 22

upvoted 3 papers about 1 month ago

Kolmogorov-Arnold Transformer

Paper • 2409.10594 • Published Sep 16 • 38

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Paper • 2409.11355 • Published Sep 17 • 27

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Paper • 2409.12191 • Published Sep 18 • 72

upvoted a collection about 1 month ago

Moshi v0.1 Release

Collection • 13 items • Updated Sep 18 • 211

MLX, Candle & PyTorch model checkpoints released as part of the Moshi release from Kyutai. Run inference via: https://github.com/kyutai-labs/moshi

upvoted a paper about 1 month ago

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Paper • 2409.04410 • Published Sep 6 • 23

upvoted a paper 2 months ago

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

Paper • 2408.12570 • Published Aug 22 • 29

upvoted 4 papers 3 months ago

upvoted 2 papers 4 months ago

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Paper • 2406.18009 • Published Jun 26 • 18

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25 • 85

upvoted an article 4 months ago

Tokenization Is A Dead Weight (Tokun Part 1)

Article • Jun 27 • 16

upvoted 4 papers 4 months ago

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Paper • 2406.09406 • Published Jun 13 • 13

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Paper • 2406.07686 • Published Jun 11 • 14

What If We Recaption Billions of Web Images with LLaMA-3?

Paper • 2406.08478 • Published Jun 12 • 39

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Paper • 2406.06525 • Published Jun 10 • 65

upvoted 4 papers 5 months ago

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Paper • 2406.02430 • Published Jun 4 • 29

BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

Paper • 2406.04333 • Published Jun 6 • 36

Diffusion for World Modeling: Visual Details Matter in Atari

Paper • 2405.12399 • Published May 20 • 26

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16 • 125

upvoted 6 papers 6 months ago

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Paper • 2405.00332 • Published May 1 • 30

Dynamic Typography: Bringing Words to Life

Paper • 2404.11614 • Published Apr 17 • 43

Long-form music generation with latent diffusion

Paper • 2404.10301 • Published Apr 16 • 24

Probing the 3D Awareness of Visual Foundation Models

Paper • 2404.08636 • Published Apr 12 • 11

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Paper • 2404.07987 • Published Apr 11 • 47

Adapting LLaMA Decoder to Vision Transformer

Paper • 2404.06773 • Published Apr 10 • 17

upvoted 12 papers 7 months ago

On the Scalability of Diffusion-based Text-to-Image Generation

Paper • 2404.02883 • Published Apr 3 • 17

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Paper • 2404.02258 • Published Apr 2 • 104

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Paper • 2404.02905 • Published Apr 3 • 64

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Paper • 2404.01197 • Published Apr 1 • 30

InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds

Paper • 2403.20309 • Published Mar 29 • 17

ViTAR: Vision Transformer with Any Resolution

Paper • 2403.18361 • Published Mar 27 • 51

FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Paper • 2403.17008 • Published Mar 25 • 18

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Paper • 2403.14148 • Published Mar 21 • 17

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Paper • 2403.12963 • Published Mar 19 • 7

LightIt: Illumination Modeling and Control for Diffusion Models

Paper • 2403.10615 • Published Mar 15 • 16

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Paper • 2403.12008 • Published Mar 18 • 19

Stealing Part of a Production Language Model

Paper • 2403.06634 • Published Mar 11 • 90

upvoted 11 papers 8 months ago

Pix2Gif: Motion-Guided Diffusion for GIF Generation

Paper • 2403.04634 • Published Mar 7 • 14

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Paper • 2403.04132 • Published Mar 7 • 38

StableDrag: Stable Dragging for Point-based Image Editing

Paper • 2403.04437 • Published Mar 7 • 25

Yi: Open Foundation Models by 01.AI

Paper • 2403.04652 • Published Mar 7 • 62

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Paper • 2403.03100 • Published Mar 5 • 34

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Paper • 2403.03206 • Published Mar 5 • 56

StarCoder 2 and The Stack v2: The Next Generation

Paper • 2402.19173 • Published Feb 29 • 134

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Paper • 2402.17764 • Published Feb 27 • 601

FiT: Flexible Vision Transformer for Diffusion Model

Paper • 2402.12376 • Published Feb 19 • 48

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Paper • 2402.08093 • Published Feb 12 • 54

World Model on Million-Length Video And Language With RingAttention

Paper • 2402.08268 • Published Feb 13 • 36