Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 26
Article Democratization of AI, Open Source, and AI Auditing: Thoughts from the DisinfoCon Panel in Berlin By frimelle • 15 days ago • 5
Moshi v0.1 Release Collection MLX, Candle & PyTorch model checkpoints released as part of the Moshi release from Kyutai. Run inference via: https://github.com/kyutai-labs/moshi • 13 items • Updated Sep 18 • 211
Article Getty Images Brings High-Quality, Commercially Safe Dataset to Hugging Face By andreagagliano • Sep 6 • 16
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22 • 115
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs Paper • 2408.13467 • Published Aug 24 • 23
Article 🔥 Argilla 2.0: the data-centric tool for AI makers 🤗 By dvilasuero • Jul 30 • 34
🍃 MINT-1T Collection Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 50
Article Querying Datasets with the Datasets Explorer Chrome Extension By cfahlgren1 • Jul 19 • 6
Article Structured Harm Reporting in AI: New Research Paper at AIES and DEFCON event! By evijit • Jul 18 • 3
Corpus audio: Spanish & variants Collection Audio corpora that specify the origin of the speakers • 8 items • Updated Jul 17 • 1
🪐 SmolLM Collection A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated Aug 18 • 176
Article Experimenting with Automatic PII Detection on the Hub using Presidio Jul 10 • 24
Article EU Training Data Transparency: A Proposal for a Sufficiently Detailed Summary 📑📚🖼️🇪🇺 By yjernite • Jul 3 • 8
Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • Jul 9 • 35
Article 📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️ By yjernite • Dec 5, 2023 • 1
Article Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality Jun 24 • 31
Article Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation By davanstrien • Jun 20 • 12
Article Open-source embeddings and LLMs outperform Gemini and OpenAI for Web Navigation while being faster and cheaper By dhuynh95 • Jun 21 • 6
Article BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks Jun 18 • 36
Article Unveiling CIVICS: A New Dataset for Examining Cultural Values in Language Models By giadap • Jun 19 • 8
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback Paper • 2406.09279 • Published Jun 13 • 1
Tulu V2.5 Suite Collection A suite of models trained using DPO and PPO across a wide variety (up to 14) of preference datasets. See https://arxiv.org/abs/2406.09279 for more! • 44 items • Updated 8 days ago • 14
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22 • 8
Article Reports on the Hub: A First Look at Self-governance in Open Source AI Development By frimelle • Jun 12 • 7
Article How to build an interactive HF Space to visualize an Image Dataset By MarkusStoll • Dec 18, 2023 • 3
Hugging Face community’s Wikimedia datasets Collection Wikimedia datasets created by the Hugging Face community, not Wikimedia. Sorted by Wikimedia project. • 17 items • Updated Jun 7 • 9
StarChat2 15B Collection Model, datasets, and demo for StarChat2 15B. For code to train the models, see: https://github.com/huggingface/alignment-handbook • 10 items • Updated Apr 12 • 13
Article How to directly access 150k+ Hugging Face Datasets with DuckDB and query using GPT-4o By chilijung • May 31 • 10
Article Wikipedia's Treasure Trove: Advancing Machine Learning with Diverse Data By frimelle • Jun 3 • 13
📚 FineWeb-Edu Collection FineWeb-Edu datasets, classifier and ablation model • 5 items • Updated Jun 12 • 8
CommonCanvas Collection Collection of models trained on the CommonCatalog datasets • 8 items • Updated May 16 • 9
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 14
Wikimedia Datasets Collection Wikimedia datasets, across languages and modalities, from different Wikimedia projects, on the hub. Not all tested. • 19 items • Updated May 16 • 9
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 29
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 21
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies Paper • 2404.08197 • Published Apr 12 • 27
Article Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️ By yjernite • Mar 27 • 2
Creación de corpus en comunidad Collection A collection of collaborative efforts to create high-quality Spanish corpora. Any Spanish speaker can contribute :) • 7 items • Updated Jul 17 • 6