Yacine Jernite

yjernite

https://yjernite.github.io/

AI & ML interests

Technical, community, and regulatory tools of AI governance @HuggingFace

Articles

EU Training Data Transparency: A Proposal for a Sufficiently Detailed Summary 📑📚🖼️🇪🇺

Jul 3

• 8

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

Jun 24

• 31

📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️

Dec 5, 2023

• 1

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 26

AI Policy @🤗: Open ML Considerations in the EU AI Act

Jul 24, 2023

• 2

AI Policy @🤗: Response to the U.S. NTIA's Request for Comment on AI Accountability

Jun 20, 2023

Hugging Face Selected for the French Data Protection Agency Enhanced Support Program

May 15, 2023

Ethics and Society Newsletter #3: Ethical Openness at Hugging Face

Mar 30, 2023

Ethics and Society Newsletter #2: Let's talk about bias!

Dec 15, 2022

Putting ethical principles at the core of research lifecycle

May 19, 2022

Introducing the Data Measurements Tool: an Interactive Tool for Looking at Datasets

Nov 29, 2021

Organizations

yjernite's activity

upvoted an article 13 days ago

Article

Democratization of AI, Open Source, and AI Auditing: Thoughts from the DisinfoCon Panel in Berlin

15 days ago

• 5

upvoted a collection about 1 month ago

Moshi v0.1 Release

Collection

MLX, Candle & PyTorch model checkpoints released as part of the Moshi release from Kyutai. Run inference via: https://github.com/kyutai-labs/moshi • 13 items • Updated Sep 18 • 211

upvoted an article about 2 months ago

Article

Getty Images Brings High-Quality, Commercially Safe Dataset to Hugging Face

Sep 6

• 16

upvoted a collection about 2 months ago

Qwen2-VL

Collection

Vision-language model series based on Qwen2 • 15 items • Updated Sep 18 • 141

upvoted a paper about 2 months ago

The Future of Open Human Feedback

Paper • 2408.16961 • Published Aug 15 • 19

upvoted an article about 2 months ago

Article

The Environmental Impacts of AI -- Primer

Sep 3

• 27

upvoted 2 papers about 2 months ago

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22 • 115

LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

Paper • 2408.13467 • Published Aug 24 • 23

upvoted a collection about 2 months ago

Multi-Vector Retrievers

Collection

2 items • Updated Aug 20 • 3

upvoted an article about 2 months ago

Article

The 5 Most Under-Rated Tools on Hugging Face

Aug 22

• 84

upvoted an article 3 months ago

Article

🔥 Argilla 2.0: the data-centric tool for AI makers 🤗

Jul 30

• 34

upvoted a collection 3 months ago

🍃 MINT-1T

Collection

Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 50

upvoted 4 articles 3 months ago

Article

Querying Datasets with the Datasets Explorer Chrome Extension

Jul 19

• 6

Article

How NuminaMath Won the 1st AIMO Progress Prize

Jul 11

• 95

Article

Docmatix - a huge dataset for Document Visual Question Answering

Jul 18

• 66

Article

Structured Harm Reporting in AI: New Research Paper at AIES and DEFCON event!

Jul 18

• 3

upvoted 2 collections 3 months ago

DCLM

Collection

DCLM Models + Datasets • 7 items • Updated Jul 22 • 40

Corpus audio: Spanish & variants

Collection

Audio corpora that specify the origin of the speakers • 8 items • Updated Jul 17 • 1

upvoted an article 3 months ago

Article

Announcing BigCodeBench-Hard, and More

Jul 24

• 10

upvoted a collection 3 months ago

🪐 SmolLM

Collection

A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated Aug 18 • 176

upvoted an article 3 months ago

Article

SmolLM - blazingly fast and remarkably powerful

Jul 16

• 248

upvoted a collection 3 months ago

H2O Danube3

Collection

6 items • Updated 5 days ago • 52

upvoted an article 3 months ago

Article

Experimenting with Automatic PII Detection on the Hub using Presidio

Jul 10

• 24

upvoted 11 articles 4 months ago

Article

Announcing New Dataset Search Features

Jul 8

• 22

Article

EU Training Data Transparency: A Proposal for a Sufficiently Detailed Summary 📑📚🖼️🇪🇺

Jul 3

• 8

Article

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡

Jul 9

• 35

Article

AI Policy @🤗: Open ML Considerations in the EU AI Act

Jul 24, 2023

• 2

Article

📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️

Dec 5, 2023

• 1

Article

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

Jun 24

• 31

Article

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Jun 20

• 12

Article

Data Is Better Together: A Look Back and Forward

Jun 20

• 18

Article

Open-source embeddings and LLMs outperform Gemini and OpenAI for Web Navigation while being faster and cheaper

Jun 21

• 6

Article

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Jun 18

• 36

Article

Unveiling CIVICS: A New Dataset for Examining Cultural Values in Language Models

Jun 19

• 8

upvoted a paper 4 months ago

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Paper • 2406.09279 • Published Jun 13 • 1

upvoted a collection 4 months ago

Tulu V2.5 Suite

Collection

A suite of models trained using DPO and PPO across a wide variety (up to 14) of preference datasets. See https://arxiv.org/abs/2406.09279 for more! • 44 items • Updated 8 days ago • 14

upvoted an article 4 months ago

Article

Making sense of this mess

Jun 7

• 14

upvoted a paper 4 months ago

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Paper • 2402.14740 • Published Feb 22 • 8

upvoted 2 articles 4 months ago

Article

Reports on the Hub: A First Look at Self-governance in Open Source AI Development

Jun 12

• 7

Article

How to build an interactive HF Space to visualize an Image Dataset

Dec 18, 2023

• 3

upvoted 2 collections 4 months ago

Hugging Face community’s Wikimedia datasets

Collection

Wikimedia datasets created by the Hugging Face community, not Wikimedia. Sorted by Wikimedia project. • 17 items • Updated Jun 7 • 9

StarChat2 15B

Collection

Model, datasets, and demo for StarChat2 15B. For code to train the models, see: https://github.com/huggingface/alignment-handbook • 10 items • Updated Apr 12 • 13

upvoted an article 4 months ago

Article

How to directly access 150k+ Hugging Face Datasets with DuckDB and query using GPT-4o

May 31

• 10

upvoted an article 5 months ago

Article

Wikipedia's Treasure Trove: Advancing Machine Learning with Diverse Data

Jun 3

• 13

upvoted 2 collections 5 months ago

🍷 FineWeb datasets

Collection

5 items • Updated Jun 26 • 19

📚 FineWeb-Edu

Collection

FineWeb-Edu datasets, classifier and ablation model • 5 items • Updated Jun 12 • 8

upvoted 3 articles 5 months ago

Article

Space secrets security update

May 31

• 50

Article

AI has a problem with objectifying women

May 24

• 55

Article

Let's talk about LLM evaluation

May 23

• 123

upvoted 3 collections 5 months ago

CommonCanvas

Collection

Collection of models trained on the CommonCatalogue datasets • 8 items • Updated May 16 • 9

CommonCatalog

Collection

Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 14

Wikimedia Datasets

Collection

Wikimedia datasets, across languages and modalities, from different Wikimedia projects, on the hub. Not all tested. • 19 items • Updated May 16 • 9

upvoted an article 5 months ago

Article

Energy Scores for AI Models

May 9

• 25

upvoted 2 articles 6 months ago

Article

⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together

Apr 29

• 29

Article

Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

Apr 18

• 21

upvoted a paper 6 months ago

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Paper • 2404.08197 • Published Apr 12 • 27

upvoted an article 6 months ago

Article

Vision Language Models Explained

Apr 11

• 193

upvoted 2 articles 7 months ago

Article

Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️

Mar 27

• 2

Article

Public Policy at Hugging Face

Apr 8

• 19

upvoted a collection 7 months ago

Creación de corpus en comunidad

Collection

Colección de esfuerzos colaborativos para crear corpus en español de calidad. Toda persona hispanohablante puede contribuir :) • 7 items • Updated Jul 17 • 6
