Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Articles

Scaling AI-based Data Processing with Hugging Face + Dask

14 days ago

• 22

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Jun 20

• 12

Data Is Better Together: A Look Back and Forward

Jun 20

• 18

Synthetic dataset generation techniques: generating custom sentence similarity data

May 23

• 15

Synthetic dataset generation techniques: Self-Instruct

May 15

• 11

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

May 7

• 7

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 62

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 26

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

• 7

Jupyter X Hugging Face

Mar 23, 2023

• 2

Image search with 🤗 datasets

Mar 16, 2022

• 5

Organizations

davanstrien's activity

upvoted 2 articles about 1 hour ago

Article

Releasing Outlines-core 0.1.0: structured generation in Rust and Python

about 19 hours ago

• 7

Article

ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

5 days ago

• 5

upvoted an article about 2 hours ago

Article

OCR Processing and Text in Image Analysis with DeepSeek Janus-1.3B

about 2 hours ago

• 1

upvoted 3 articles about 6 hours ago

Article

OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B

4 days ago

• 9

Article

🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦‍⬛

1 day ago

• 14

Article

Aria: First Open Multimodal Native MoE Model

about 14 hours ago

• 4

upvoted an article about 10 hours ago

Article

Allegro: Advanced Video Generation Model

about 17 hours ago

• 50

upvoted an article 5 days ago

Article

How to build a custom text classifier without days of human labeling

5 days ago

• 48

upvoted a paper 12 days ago

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published 14 days ago • 104

upvoted 3 articles 13 days ago

Article

Improving Parquet Dedupe on Hugging Face Hub

18 days ago

• 27

Article

Faster Assisted Generation with Dynamic Speculation

15 days ago

• 26

Article

Scaling AI-based Data Processing with Hugging Face + Dask

14 days ago

• 22

upvoted an article 18 days ago

Article

VLM Art Analysis

18 days ago

• 8

upvoted a collection 20 days ago

HTRflow v.0.1.2 models

Collection

3 items • Updated 20 days ago • 3

upvoted an article 20 days ago

Article

HTRflow - A tool for HTR and OCR

21 days ago

• 12

upvoted a collection 22 days ago

From Around the Hub

Collection

2 items • Updated 27 days ago • 1

upvoted an article 25 days ago

Article

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??

25 days ago

• 33

upvoted a collection 25 days ago

Gemma 2 Release

Collection

15 items • Updated Sep 9 • 187

upvoted a paper 26 days ago

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Paper • 2409.14683 • Published 30 days ago • 8

upvoted an article 26 days ago

Article

🌟 Easy Fine-Tuning with Hugging Face SQL Console, Notebook Creator, and SFT

28 days ago

• 12

upvoted 2 collections 27 days ago

Molmo

Collection

Artifacts for open multimodal language models. • 5 items • Updated 27 days ago • 257

Llama 3.2

Collection

This collection hosts the transformers and original repos of the Llama 3.2 and Llama Guard 3 • 11 items • Updated 27 days ago • 386

upvoted an article 28 days ago

Article

Data Is Better Together: A Look Back and Forward

Jun 20

• 18

upvoted a collection 29 days ago

NIM Serverless Inference API

Collection

Models in this collection are available for inference via a serverless API powered by NVIDIA NIM. • 8 items • Updated 8 days ago • 21

upvoted an article about 1 month ago

Article

ColPali: Efficient Document Retrieval with Vision Language Models 👀

Jul 5

• 139

upvoted a paper about 1 month ago

ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27 • 41

upvoted 6 collections about 1 month ago

jina-embeddings-v3

Collection

Multilingual multi-task general text embedding model • 6 items • Updated Sep 19 • 14

MagpieLM

Collection

Aligning LMs with Fully Open Recipe (data+training configs+logs) • 9 items • Updated 30 days ago • 14

Llama 3.1

Collection

This collection hosts the transformers and original repos of the Llama 3.1, Llama Guard 3 and Prompt Guard models • 11 items • Updated 27 days ago • 597

Qwen2.5

Collection

Qwen2.5 language models, including pretrained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 45 items • Updated Sep 18 • 268

Dataset Creation Tools and Utilities

Collection

Spaces and utilities for creating datasets and getting them on the Hub • 3 items • Updated Sep 20 • 7

Synthetic Dataset Creation Spaces

Collection

Spaces focused on generating synthetic datasets • 5 items • Updated Sep 20 • 5

upvoted 2 papers about 1 month ago

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Paper • 2409.10173 • Published Sep 16 • 23

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Paper • 2409.02098 • Published Sep 3 • 1

upvoted 4 collections about 1 month ago

CRAFT: Corpus Retrieval and Augmentation for Fine-Tuning

Collection

CRAFTed datasets and LoRA adapter checkpoints. All datasets are synthetically generated. Paper: https://arxiv.org/abs/2409.02098 • 11 items • Updated Sep 4 • 2

Medieval NER

Collection

This is a collection of Medieval NER datasets and models. • 7 items • Updated Jul 4 • 2

TrOCR Medieval HTR

Collection

This is a collection of models trained to recognize medieval scripts. • 10 items • Updated Jul 8 • 4

Hub Card Data

Collection

2 items • Updated Sep 10 • 2

upvoted 2 papers about 2 months ago

Hermes 3 Technical Report

Paper • 2408.11857 • Published Aug 15 • 36

Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

Paper • 2409.02078 • Published Sep 3 • 8

upvoted a collection about 2 months ago

Qwen2-VL

Collection

Vision-language model series based on Qwen2 • 15 items • Updated Sep 18 • 141

upvoted a paper 2 months ago

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16 • 97

upvoted a collection 2 months ago

Llama 2 Family

Collection

This collection hosts the transformers and original repos of the Llama 2 and Llama Guard releases • 13 items • Updated 27 days ago • 70

upvoted 3 papers 2 months ago

MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images

Paper • 2408.07081 • Published Aug 7 • 1

Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Paper • 2408.07246 • Published Aug 14 • 19

InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

Paper • 2408.07089 • Published Aug 9 • 12

upvoted a collection 2 months ago

ARK Annif Models

Collection

Contains 5 Annif models for the languages German, Latin, English, French and multilingual. • 5 items • Updated Aug 14 • 2

upvoted an article 2 months ago

Article

⭐ PySpark and 🤗 Hugging Face Parquet Files

Aug 13

• 5

upvoted 4 papers 3 months ago

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Paper • 2408.02900 • Published Aug 6 • 25

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Paper • 2408.02545 • Published Aug 5 • 33

BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba

Paper • 2408.02600 • Published Aug 5 • 8

LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models

Paper • 2408.01460 • Published Jul 27 • 1

upvoted an article 3 months ago

Article

The case for specialized pre-training: ultra-fast foundation models for dedicated tasks

Aug 4

• 26

upvoted a paper 3 months ago

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Paper • 2407.19835 • Published Jul 29 • 20

upvoted a collection 3 months ago

🍃 MINT-1T

Collection

Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 50

upvoted 2 articles 3 months ago

Article

Bringing Open-Source Models to Spreadsheets 🚀

Jul 19

• 3

Article

Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing

Jul 19

• 17

upvoted 2 collections 3 months ago

DCLM

Collection

DCLM Models + Datasets • 7 items • Updated Jul 22 • 40

Alpaca Style Datasets

Collection

Datasets which follow the Alpaca Style format based on having 'instruction', 'input', and 'output' columns • 3395 items • Updated about 18 hours ago • 2

upvoted an article 3 months ago

Article

Experimenting with Automatic PII Detection on the Hub using Presidio

Jul 10

• 24

Articles

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!
