Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 62
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 26
Releasing Outlines-core 0.1.0: structured generation in Rust and Python about 19 hours ago • 7
ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models By ahmed-masry • 5 days ago • 5
OCR Processing and Text in Image Analysis with DeepSeek Janus-1.3B By PandorAI1995 • about 2 hours ago • 1
OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B By PandorAI1995 • 4 days ago • 9
🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦⬛ By anakin87 • 1 day ago • 14
Aria: First Open Multimodal Native MoE Model By RhymesAI • about 14 hours ago • 4
How to build a custom text classifier without days of human labeling By sdiazlor • 5 days ago • 48
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published 14 days ago • 104
wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR?? By catherinearnett • 25 days ago • 33
Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling Paper • 2409.14683 • Published 30 days ago • 8
🌟 Easy Fine-Tuning with Hugging Face SQL Console, Notebook Creator, and SFT By asoria • 28 days ago • 12
Molmo Collection Artifacts for open multimodal language models. • 5 items • Updated 27 days ago • 257
Llama 3.2 Collection This collection hosts the transformers and original repos of the Llama 3.2 and Llama Guard 3 • 11 items • Updated 27 days ago • 386
NIM Serverless Inference API Collection Models in this collection are available for inference via a serverless API powered by NVIDIA NIM. • 8 items • Updated 8 days ago • 21
ColPali: Efficient Document Retrieval with Vision Language Models 👀 By manu • Jul 5 • 139
ColPali: Efficient Document Retrieval with Vision Language Models Paper • 2407.01449 • Published Jun 27 • 41
jina-embeddings-v3 Collection Multilingual multi-task general text embedding model • 6 items • Updated Sep 19 • 14
MagpieLM Collection Aligning LMs with Fully Open Recipe (data+training configs+logs) • 9 items • Updated 30 days ago • 14
Llama 3.1 Collection This collection hosts the transformers and original repos of the Llama 3.1, Llama Guard 3 and Prompt Guard models • 11 items • Updated 27 days ago • 597
Qwen2.5 Collection Qwen2.5 language models, including pretrained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 45 items • Updated Sep 18 • 268
Dataset Creation Tools and Utilities Collection Spaces and utilities for creating datasets and getting them on the Hub • 3 items • Updated Sep 20 • 7
Synthetic Dataset Creation Spaces Collection Spaces focused on generating synthetic datasets • 5 items • Updated Sep 20 • 5
jina-embeddings-v3: Multilingual Embeddings With Task LoRA Paper • 2409.10173 • Published Sep 16 • 23
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation Paper • 2409.02098 • Published Sep 3 • 1
CRAFT: Corpus Retrieval and Augmentation for Fine-Tuning Collection CRAFTed datasets and LoRA adapter checkpoints. All datasets are synthetically generated. Paper: https://arxiv.org/abs/2409.02098 • 11 items • Updated Sep 4 • 2
Medieval NER Collection This is a collection of Medieval NER datasets and models. • 7 items • Updated Jul 4 • 2
TrOCR Medieval HTR Collection This is a collection of models trained to recognize medieval scripts. • 10 items • Updated Jul 8 • 4
Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text Paper • 2409.02078 • Published Sep 3 • 8
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 97
Llama 2 Family Collection This collection hosts the transformers and original repos of the Llama 2 and Llama Guard releases • 13 items • Updated 27 days ago • 70
MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images Paper • 2408.07081 • Published Aug 7 • 1
Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM Paper • 2408.07246 • Published Aug 14 • 19
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning Paper • 2408.07089 • Published Aug 9 • 12
ARK Annif Models Collection Contains 5 Annif models for the languages German, Latin, English, French and multilingual. • 5 items • Updated Aug 14 • 2
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Paper • 2408.02900 • Published Aug 6 • 25
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation Paper • 2408.02545 • Published Aug 5 • 33
BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba Paper • 2408.02600 • Published Aug 5 • 8
LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models Paper • 2408.01460 • Published Jul 27 • 1
The case for specialized pre-training: ultra-fast foundation models for dedicated tasks By Pclanglais • Aug 4 • 26
ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation Paper • 2407.19835 • Published Jul 29 • 20
🍃 MINT-1T Collection Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 50
Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing By Pclanglais • Jul 19 • 17
Alpaca Style Datasets Collection Datasets which follow the Alpaca Style format based on having 'instruction', 'input', and 'output' columns • 3395 items • Updated about 18 hours ago • 2
Experimenting with Automatic PII Detection on the Hub using Presidio Jul 10 • 24