Model: tencent/DepthCrafter
Demo: tencent/DepthCrafter
Paper: DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos (2409.02095)
You don't need to input anything other than the video itself; no need for optical flow or camera poses!
If you don't want to wait for the next transformers release, install transformers from my PR https://github.com/huggingface/transformers/pull/32938 and initialize SigLIP from there.
DepthPro is a zero-shot depth estimation model by Apple: it's fast, sharp and accurate.
Demo: akhaliq/depth-pro
Model: apple/DepthPro
Paper page: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073)
The model consists of two encoders: a patch encoder and an image encoder. The outputs of both are merged and decoded into a depth map, and the focal length is estimated as well.
The model outperforms previous state-of-the-art models on average across various benchmarks.
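If you want to try it locally, here's a minimal sketch using Apple's depth_pro package from the model repository. The create_model_and_transforms / load_rgb / infer calls follow their README; treat the exact names and the example file path as assumptions.

```python
# Minimal sketch: monocular metric depth with DepthPro.
# Assumes the `depth_pro` package is installed from the apple/ml-depth-pro repo
# and the checkpoint has been downloaded as described in the model card.
import depth_pro

model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an image together with its EXIF focal length, if available.
image, _, f_px = depth_pro.load_rgb("example.jpg")  # hypothetical input image
image = transform(image)

prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]                     # metric depth in meters
focallength_px = prediction["focallength_px"]   # estimated focal length in pixels
print(depth.shape, focallength_px)
```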
model: facebook/cotracker3
demo: facebook/cotracker
HuggingChat is now multimodal with meta-llama/Llama-3.2-11B-Vision-Instruct!
This also comes with multimodal assistants. I have migrated my Marcus Aurelius advice assistant to Llama-Vision, and Marcus can see now!
Chat with Marcus: https://huggingface.co/chat/assistant/65bfed22022ba290531112f8
Start chatting with Llama 3.2 11B Vision Instruct https://huggingface.co/chat/models/meta-llama/Llama-3.2-11B-Vision-Instruct
Meta shipped multiple models and demos for their papers at @ECCV
Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more; all models have open weights and demos
All models come with demos and even TorchScript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art model for consistent 3D generation from images
Model: facebook/vfusion3d
Demo: facebook/VFusion3D
- CoTracker is the state-of-the-art point (pixel) tracking model
Demo: facebook/cotracker
Model: facebook/cotracker
Get started at ECCV/ECCV2024-papers
nvidia/NVLM-D-72B
Paper page: NVLM: Open Frontier-Class Multimodal LLMs (2409.11402)
The paper contains many ablation studies on various ways to use the LLM backbone:
- Flamingo-like cross-attention (NVLM-X)
- LLaVA-like concatenation of image and text embeddings fed to a decoder-only model (NVLM-D)
- A hybrid architecture (NVLM-H)
Looking at the evaluations, NVLM-D and NVLM-H are the best or second best compared to other models.
The released model is NVLM-D, based on Qwen2-Instruct and aligned with InternViT-6B using a huge mixture of different datasets.
You can easily use this model by loading it through transformers' AutoModel.
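A minimal sketch of what that looks like; the loading arguments and the chat() helper follow the model card's remote code, so treat the exact signature and device placement as assumptions (the card uses a custom multi-GPU device map for the 72B weights).

```python
# Minimal sketch: loading NVLM-D-72B through AutoModel with remote code.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "nvidia/NVLM-D-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",  # the model card uses a custom device map instead
).eval()

# Text-only turn via the remote-code chat helper (signature taken from the model card;
# for images, the card shows how to pass preprocessed pixel values instead of None).
generation_config = dict(max_new_tokens=128, do_sample=False)
response, history = model.chat(
    tokenizer, None, "Hello, who are you?", generation_config,
    history=None, return_history=True,
)
print(response)
```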
Keypoint detection just landed with many docs and goodies
https://huggingface.co/models?pipeline_tag=keypoint-detection
In Hugging Face transformers we now have SuperPoint, a foundation model for keypoint detection. Check out the demo here: merve/SuperPoint
Shipped the transformers task guide on keypoint detection: https://huggingface.co/docs/transformers/tasks/keypoint_detection
Also shipped the task page (the easiest way to get started!): https://huggingface.co/tasks/keypoint-detection
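Here's a minimal inference sketch for SuperPoint in transformers; the checkpoint name and output fields follow the docs, but treat the example image URL as an assumption.

```python
# Minimal sketch: keypoint detection with SuperPoint in transformers.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, SuperPointForKeypointDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only the valid keypoints (the mask marks padded entries).
mask = outputs.mask[0].bool()
keypoints = outputs.keypoints[0][mask]  # (num_keypoints, 2)
scores = outputs.scores[0][mask]
print(keypoints.shape, scores.shape)
```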
- vidore/colpali for retrieval: it doesn't need indexing with image-text pairs, just images!
- Qwen/Qwen2-VL-2B-Instruct for generation: directly feed images as-is to a vision language model, with no conversion to text!
I used the ColPali implementation from the new Byaldi library by @bclavie
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
Why? Documents consist of multiple modalities: layout, tables, text, charts and images. Document processing pipelines often chain multiple models, which makes them immensely brittle and slow.
How? ColPali is a ColBERT-like document retrieval model built on PaliGemma. It operates over image patches directly, so indexing takes far less time and is more accurate. You can use it for retrieval alone, or for retrieval-augmented generation: find the closest document, skip any parsing, and pass the page directly to a VLM like Qwen2-VL (as image input) together with your text query.
This is much faster, you don't lose any information, and it's much easier to maintain too!
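Here's a minimal sketch of that flow with Byaldi; the index/search calls follow Byaldi's README, while the document folder, query and result fields should be treated as assumptions for illustration.

```python
# Minimal sketch: ColPali retrieval with Byaldi, then generation with Qwen2-VL.
# Assumes a folder of document pages / PDFs at "docs/" (hypothetical path).
from byaldi import RAGMultiModalModel

retriever = RAGMultiModalModel.from_pretrained("vidore/colpali")
retriever.index(
    input_path="docs/",
    index_name="my_docs",
    store_collection_with_index=True,  # keep page images so they can be passed to the VLM
    overwrite=True,
)

results = retriever.search("What was the revenue in Q2?", k=1)
best_page = results[0]  # doc id, page number, score and (optionally) the base64 page image

# The retrieved page image can then be fed as-is to Qwen/Qwen2-VL-2B-Instruct together
# with the text query; see the linked notebook for the generation part.
print(best_page.doc_id, best_page.page_num, best_page.score)
```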
Multimodal RAG: merve/multimodal-rag-66d97602e781122aae0a5139
Document AI (made earlier, for folks who want structured input/output and can fine-tune a model): merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e
A super impressive vision language model that comes in 7B, 13B, and a 13B variant fine-tuned on chat
Model repositories: merve/nveagle-66d0705108582d73bb235c26
Try it: NVEagle/Eagle-X5-13B-Chat (works very well!)
This model essentially explores having different experts (MoE) for the image encoder part of the vision language model.
How?
The authors concatenate the vision encoders' output tokens together and apply "pre-alignment": essentially, fine-tuning the experts while the text decoder is frozen.
Then they freeze both the experts and the decoder and train just the projection layer, and finally they unfreeze everything for supervised fine-tuning.
In the paper, they explore different fusion strategies and vision encoders beyond the basic CLIP encoder, and find that simply concatenating visual tokens works well.
The rest of the architecture is quite similar to LLaVA (see the architecture below).
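To make the "concatenate and project" idea concrete, here is a toy sketch of fusing tokens from multiple vision experts into the LLM embedding space. All shapes, dims and module names are illustrative, not the authors' code.

```python
# Toy sketch: channel-wise concatenation of multiple vision experts (illustrative only).
import torch
import torch.nn as nn

class MultiExpertProjector(nn.Module):
    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One projection over the concatenated expert features.
        self.proj = nn.Sequential(
            nn.Linear(sum(expert_dims), llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, expert_features):
        # expert_features: list of (batch, num_patches, dim_i) tensors,
        # assumed spatially aligned (same num_patches).
        fused = torch.cat(expert_features, dim=-1)
        return self.proj(fused)  # (batch, num_patches, llm_dim)

# Example: a 1024-dim CLIP-like expert + a 768-dim expert -> 4096-dim LLM space.
projector = MultiExpertProjector([1024, 768], 4096)
feats = [torch.randn(1, 576, 1024), torch.randn(1, 576, 768)]
visual_tokens = projector(feats)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```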
Below is an example of top-k accuracy plotted against inferred samples per second.
timm/leaderboard
Great work, thank you! (and also thanks for the informative model card)
Learn how to efficiently fine-tune the latest IDEFICS3-Llama on visual question answering in this notebook
Fine-tuning notebook: https://github.com/merveenoyan/smol-vision/blob/main/Idefics_FT.ipynb
Resulting model: merve/idefics3llama-vqav2
Model: HuggingFaceM4/Idefics3-8B-Llama3
Demo: HuggingFaceM4/idefics3
It's a multimodal model based on Llama 3.1 that accepts an arbitrary number of interleaved images with text, with a huge context window (10k tokens!)
Supported by Hugging Face transformers
Marrying the cutting-edge zero-shot object detector OWLv2 with the mask generator SAM2 (small checkpoint):
Zero-shot segmentation with insane precision
I also uploaded all models with usage snippets and made a collection of SAM2 models and demos: merve/sam2-66ac9deac6fca3bc5482fe30
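A minimal sketch of the pipeline: OWLv2 through transformers for zero-shot boxes, then SAM2 through Meta's sam2 package prompted with those boxes. The SAM2 predictor API follows their repo README; the checkpoint names, input image and text query are assumptions.

```python
# Minimal sketch: zero-shot detection with OWLv2, then mask generation with SAM2.
import numpy as np
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection
from sam2.sam2_image_predictor import SAM2ImagePredictor  # from the facebookresearch/sam2 repo

image = Image.open("cats.jpg").convert("RGB")  # hypothetical input image
texts = [["a cat"]]

# 1. Zero-shot boxes with OWLv2.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
boxes = detections["boxes"].cpu().numpy()

# 2. Prompt SAM2 (small checkpoint) with the detected boxes (batched box prompts).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-small")
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=boxes, multimask_output=False)
print(masks.shape)
```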
Here are some of the latest recipes contributed:
- "Information Extraction with Haystack and NuExtract": Use Haystack and transformers to build structured data extraction pipelines using LLMs by @anakin87 https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract
- "Build RAG with Hugging Face and Milvus": Learn how to use Milvus with sentence transformers to build RAG pipelines https://huggingface.co/learn/cookbook/rag_with_hf_and_milvus
- "Code Search with Vector Embeddings and Qdrant": Search a codebase by building a retrieval pipeline using Qdrant and sentence transformers https://huggingface.co/learn/cookbook/code_search
- "Data analyst agent: get your data's insights in the blink of an eye": a great recipe by our own @m-ric showing how to build an agent that can do data analysis! https://huggingface.co/learn/cookbook/agent_data_analyst
I think it's not about the Space; it's the model output, and the Space can't do anything about that. Maybe try another VLM that was fine-tuned for this type of task, e.g. google/paligemma-3b-mix-224?
What makes this model different?
Demo: llava-hf/video-llava
Model: LanguageBind/Video-LLaVA-7B-hf
Compared to other models that take image and video input and either project them separately or downsample the video and project selected frames, Video-LLaVA converts images and videos to a unified representation and projects them using a shared projection layer.
It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP; these encoders map the modalities to a unified representation before passing them to the projection layer.
I feel like one of the coolest features of this model is joint image-video understanding, which many models have only recently introduced.
It's a relatively old model, but it was ahead of its time and works very well! This means you can, for example, pass the model an image of a cat and a video of a cat, and ask whether the cat in the image appears in the video.
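A minimal sketch of that joint prompting through transformers; the prompt format follows the model card, while the video decoding helper, file names and question are assumptions for illustration.

```python
# Minimal sketch: joint image + video prompting with Video-LLaVA in transformers.
import av
import numpy as np
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

def sample_frames(path, num_frames=8):
    # Uniformly sample frames from a video file (assumes the container reports a frame count).
    container = av.open(path)
    stream = container.streams.video[0]
    indices = set(np.linspace(0, stream.frames - 1, num_frames).astype(int).tolist())
    frames = [f.to_ndarray(format="rgb24") for i, f in enumerate(container.decode(stream)) if i in indices]
    return np.stack(frames)

image = Image.open("cat.jpg")            # hypothetical inputs
video = sample_frames("cat_video.mp4")

prompt = "USER: <image> <video> Is the cat in the image also in the video? ASSISTANT:"
inputs = processor(text=prompt, images=image, videos=video, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```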
It is a vision language model; these models use a text decoder (here built on Llama 2, since it's another model from Meta) as one component. VLMs differ a lot from LLMs; the post above explains the difference.
Can you send your inputs for reproducibility? @prasiyer
A vision language model that comes in 7B and 34B sizes
But what makes this model so special?
Demo: merve/chameleon-7b
Models: facebook/chameleon-668da9663f80d483b4c61f58
keep reading:
Chameleon is a unique model: it attempts to scale early fusion
But what is early fusion?
Most modern vision language models use a vision encoder plus a projection layer that projects image embeddings into the text decoder's (LLM's) embedding space, so the images can be used to prompt it.
Early fusion, on the other hand, fuses all features together (image patches and text) by using an image tokenizer; all tokens are projected into a shared space, which enables seamless generation.
The authors also introduced architectural improvements (QK norm and revised placement of layer norms) for scalable and stable training, and they were able to increase the token count (5x the tokens compared to Llama 3, which is a must with early fusion IMO).
This model is any-to-any thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled in the release to prevent malicious use.
One can also do text-only prompting; the authors note the model catches up with larger LLMs (like Mixtral 8x7B or the larger Llama-2 70B), and on image-pair prompting it catches up with larger VLMs like IDEFICS-80B (see the paper for the benchmarks: Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818))
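Chameleon is supported in transformers; a minimal sketch of image-understanding prompting (image generation is disabled in the released checkpoints). The prompt format follows the docs; the example image URL is an assumption.

```python
# Minimal sketch: interleaved image + text prompting with Chameleon in transformers.
import requests
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```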
Thanks for reading!
Document retrieval is usually done through OCR + layout detection, but you lose a lot of information along the way. Stop doing that!
ColPali uses a vision language model instead, which is better at document understanding.
ColPali: vidore/colpali (mit license!)
Blog post: https://huggingface.co/blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard
ColPali marries the idea of modern vision language models with retrieval.
The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call this BiSigLip). Then they feed the patch embeddings to PaliGemma to create BiPali.
BiPali natively maps image patch embeddings into an LLM, which enables ColBERT-like late-interaction computations between text tokens and image patches (hence the name ColPali!).
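The late-interaction score is easy to write down; here's a toy sketch with random tensors standing in for the query-token and page-patch embeddings (dimensions are illustrative).

```python
# Toy sketch: ColBERT-style late interaction (MaxSim) between query tokens and page patches.
import torch

def late_interaction_score(query_emb, page_emb):
    # query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim), both L2-normalized.
    sim = query_emb @ page_emb.T            # (num_query_tokens, num_patches)
    return sim.max(dim=-1).values.sum()     # best patch per query token, summed over tokens

query_emb = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)   # e.g. 12 query tokens
pages = [torch.nn.functional.normalize(torch.randn(1030, 128), dim=-1) for _ in range(3)]  # 3 pages

scores = [late_interaction_score(query_emb, p) for p in pages]
best_page = int(torch.tensor(scores).argmax())
print(scores, best_page)
```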
The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. Not only that, it is way faster than traditional PDF parsers too!
Models: https://huggingface.co/PekingU
Demo: merve/RT-DETR-tracking-coco
Paper: DETRs Beat YOLOs on Real-time Object Detection (2304.08069)
Notebook: https://github.com/merveenoyan/example_notebooks/blob/main/RT_DETR_Notebook.ipynb
YOLO models are known to be super fast for real-time computer vision, but their speed is held back by their reliance on NMS post-processing.
Transformer-based detectors, on the other hand, are not as computationally efficient.
Isn't there something in between? Enter RT-DETR!
The authors combined a CNN backbone and a multi-stage hybrid encoder (mixing convolutions and attention) with a transformer decoder. In the paper, they also claim you can adjust speed by changing the number of decoder layers without retraining from scratch.
The authors find that the model beats the previous state of the art in both speed and accuracy.
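A minimal sketch of running RT-DETR through transformers (checkpoint name from the PekingU org; the example image URL and threshold are arbitrary).

```python
# Minimal sketch: real-time object detection with RT-DETR in transformers.
import torch
import requests
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```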
Learn about more machine learning tasks at https://huggingface.co/tasks
Today we release a notebook and a walkthrough blog on fine-tuning Florence-2 on the DocVQA dataset, with @andito @SkalskiP
Blog: https://huggingface.co/blog
Notebook: https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing
Florence-2 is a great vision-language model thanks to its massive dataset and small size!
The model requires conditioning through task prefixes, and it's not as generalist: it needs fine-tuning for a new task such as DocVQA.
We fine-tuned the model on an A100 (a smaller GPU also works with a smaller batch size) and saw that the model picks up new tasks.
See below how it looks before and after fine-tuning.
Play with the demo here: andito/Florence-2-DocVQA
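A minimal sketch of conditioning Florence-2 with a task prefix. It uses the base fine-tuned checkpoint and the built-in <OCR> prefix as a stand-in; after fine-tuning on DocVQA you would use your own prefix plus the question. The document image path is hypothetical, and the generation/post-processing calls follow the model card's remote code.

```python
# Minimal sketch: prompting Florence-2 with a task prefix.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-base-ft"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.png").convert("RGB")  # hypothetical document image
prompt = "<OCR>"  # after DocVQA fine-tuning, you'd use e.g. "<DocVQA>" + your question

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
    )
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(text, task=prompt, image_size=image.size))
```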
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input, and outputs image and text.
Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)
This model consists of a transformer encoder and decoder, and the key to its multimodality lies in the input and output data:
input and output tokens are decoded to produce bounding boxes, generated image pixels, captions and more!
The model also learned to generate Canny maps, SAM edges and other outputs for steerable text-to-image generation.
The authors only added image-to-all capabilities to the demo, but you can try to use this model for text-to-image generation as well.
Demo: gokaygokay/Florence-2
Collection: microsoft/florence-6669f44df0d87d9c3bfb76de
This model can handle tasks that vary from OCR to semantic segmentation.
The difference from previous models is that the authors compiled a dataset of 126M images with 5.4B annotations, pseudo-labelled with their own data engine using smaller specialized models and APIs.
The model has a similar architecture to previous models: an image encoder and a multimodality encoder with a text decoder. The authors compiled the multitask dataset with prompts for each task.
You can also fine-tune this model on any task of your choice. The authors also reported results on downstream tasks, both with the vision encoder frozen and unfrozen.
They have released fine-tuned models too; you can find them in the collection above.
PixelProse is a captioning dataset of 16M image-caption pairs, with less toxicity and more detail
tomg-group-umd/pixelprose
Existing captioning datasets consist of web scrapes whose alt text is either irrelevant or not descriptive. The authors of this paper took those datasets, filtered them for CSAM, and passed the images to Gemini Pro Vision with a prompt. They also removed PII and detoxified the resulting dataset.
It's Depth Anything, but scaled up with both a larger teacher model and a gigantic dataset!
Here's a small TL;DR of the paper, which has a lot of findings and experiments.
I have also created a collection with the models, the dataset, the demo and the CoreML-converted model: merve/depth-anything-v2-release-6671902e798cd404513ffbf5
The authors analyzed Marigold, a diffusion-based model, against Depth Anything and found out what's going on with using synthetic vs real images for monocular depth estimation:
- Real data has a lot of label noise and inaccurate depth maps (caused by depth sensors missing transparent objects, etc.), and many details are overlooked
- Synthetic data has more precise and detailed depth labels that are truly ground truth, but there's a distribution shift between real and synthetic images, and scene coverage is restricted
The authors train different image encoders only on synthetic images and find that unless the encoder is very large, the model can't generalize well (but large models generalize inherently anyway)
Even those still fail on real images that have a wide distribution in labels (e.g. diverse instances of objects)
The Depth Anything V2 framework is to:
- Train a teacher model based on DINOv2-G on 595K synthetic images
- Label 62M real images using the teacher model
- Train a student model on the real images labelled by the teacher
Result: 10x faster and more accurate than Marigold!
The authors also construct a new benchmark called DA-2K that is less noisy, highly detailed and more diverse!
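A minimal sketch of running one of the released student models through the transformers depth-estimation pipeline; the checkpoint id and input image are assumptions, so check the collection for the exact names.

```python
# Minimal sketch: depth estimation with a Depth Anything V2 checkpoint via the pipeline API.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint id
)

image = Image.open("room.jpg")  # hypothetical input
result = depth_estimator(image)
result["depth"].save("room_depth.png")  # PIL image of the predicted depth map
```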
Have you claimed your papers and linked your models/datasets/demos?
This will increase the visibility and impact of your paper.
To index your papers, go here
CVPR2024/CVPR2024-papers
Find your paper, click on the paper page link, index the paper, then click on your name (the workflow is shown below).
If you'd like to add links to your paper, go here: CVPR2024/update-CVPR2024-papers
Log in, find your paper's ID, retrieve the paper, fill in the info and submit!
A repository with notebooks on shrinking, optimizing, speeding-up, customizing large vision models! https://github.com/merveenoyan/smol-vision
thank you for all you do for good open-source <3
I asked it to describe my favorite Howl's Moving Castle scene and here's how it went
Joking aside, it seems to outperform the previous VLMs. However, the license isn't open source.
model repo: THUDM/glm-4v-9b
a community member has built a demo: vilarin/VL-Chatbox
LLaVA 1.6 is outperforming proprietary VLMs, making it a very robust choice for production!
It is now hosted as a leaderboard: MM-UPD/MM-UPD_Leaderboard
@hakunamatata1997 why not use a document LM instead, if you were going to combine OCR and a VLM? The latter will surely perform worse, because you miss out on a lot of the layout, charts, etc. anyway. Maybe try this, it's very good: https://huggingface.co/spaces/mPLUG/DocOwl
Hello @anothercoder2, interesting. Can you see the files through the CLI though? Is this your local setup? I think you need to find the correct path inside /downloads and pass that to load_from_disk. Because many datasets are cached in the same folder, it needs the exact path (often a folder under ~/.cache/huggingface/datasets/downloads with a unique ID assigned).
A new paper (by @HuanjinYao et al.) built a dense connector that does it better! HuanjinYao/DenseConnector-v1.5-8B
HuanjinYao/denseconnector-66500e173fc8c9f05dc98dea
VLMs consist of an image encoder block, a projection layer that projects image embeddings into the text embedding space, and a text decoder, connected sequentially.
This paper explores using intermediate states of the image encoder rather than a single output.
The authors explore three different ways of instantiating the dense connector: sparse token integration, sparse channel integration and dense channel integration (see the paper for how they do it: Dense Connector for MLLMs (2405.13800)).
They integrate all three into LLaVA 1.5 and find that each of the new models is superior to the original LLaVA 1.5. I tried the model and it seems to work very well. As part of the release, the authors released various checkpoints based on different decoders (Vicuna 7/13B and Llama 3-8B) that you can find in the collection.
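A toy sketch of the intuition behind channel integration: take hidden states from several encoder layers, concatenate them along the feature dimension, and project into the LLM space. The layer indices, dims and module names are illustrative, not the authors' code.

```python
# Toy sketch: fusing intermediate vision-encoder states before projecting to the LLM (illustrative only).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llm_dim = 4096
selected_layers = [8, 16, 23]  # arbitrary choice of intermediate layers

projector = nn.Sequential(
    nn.Linear(encoder.config.hidden_size * len(selected_layers), llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

# hidden_states: one (batch, num_patches + 1, hidden) tensor per layer; drop the CLS token.
feats = [out.hidden_states[i][:, 1:, :] for i in selected_layers]
fused = torch.cat(feats, dim=-1)
visual_tokens = projector(fused)
print(visual_tokens.shape)  # (1, 256, 4096)
```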
You can use Colab's instances to do QLoRA fine-tuning, and then for the Space we will give ZeroGPU :)
You can pick any dataset of your choice!
Example code: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing (you can use a lower GPU with QLoRA)
Datasets:
https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=trending
https://huggingface.co/datasets?task_categories=task_categories:image-to-text&sort=trending
@HakunaMatata1997 hello!
Off the top of my head I can't think of an OCR-specific model; I was mostly using easyocr. OCR is a problem that is pretty much solved, so most of the AI work around documents is focused on understanding them (because it's more than image -> text; it involves text, charts, tables, the whole layout and more).
If you really want OCR, there are models like https://huggingface.co/facebook/nougat-base, which does PDF to markdown, for instance.
I can also recommend some models for document understanding in general (which work on text + charts + images + layout), either zero-shot or as a backbone to fine-tune.
For instance, if you want to collaborate with an external organization, you don't want to use your write token, since they would be able to access everything you can access. Instead, you can scope the token's access to repositories under that org only, like below.
merve/paligemma-doc
@Cuiunbo ah yes, right. These types of models are "OCR-free", meaning the model understands and responds to the image directly rather than running a separate OCR step. Those datasets are also OCR-free, I think. The good thing about the OCR-free approach is that features like layout, charts, tables etc. are also understood. Maybe try prompts that ask for pure OCR? High resolution also works well on handwriting, etc.
Here's the notebook to do so: https://colab.research.google.com/drive/16-Tq-iAMHNlSjDWgz43kYDMJERjU_KHW?usp=sharing
@Cuiunbo I think in the model card you can see the OCR (document understanding in general) fine-tuned model with its associated benchmark on the test dataset.
CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part.
You can try it yourself here: shi-labs/CuMo-7b-zero
The authors first pre-train the MLP while freezing the image encoder and text decoder, then warm up the whole network by unfreezing it and fine-tuning, which they state stabilizes the visual instruction tuning when bringing in the experts.
The mixture-of-experts MLP blocks are simply copies of the single MLP that was trained during pre-training and fine-tuned in the pre-finetuning stage.
It works very well (I also tested it myself): it outperforms the previous SotA models of its size, LLaVA-NeXT and IDEFICS2-8B, on several benchmarks!
@MoonRide if you check the model card you can see the scores. Mix models are trained on a mix of academic benchmark datasets (COCO captions, VQAv2, OCR-VQA etc.), where you just say e.g. "caption" and it captions. These datasets often have shorter descriptions rather than long prompts; however, they're grounded, so they do well on those benchmarks' test sets and can be used in many industry use cases (document AI etc., since the model hardly hallucinates). For your prompt, I just input "caption" and it came up with a very grounded caption, for instance.
The main point of the PaliGemma release is to provide fine-tunable models, not heavy models with wide zero-shot capabilities (where you input super long instructions or chat-like prompts). So if you want, you can fine-tune a "pt" model on any benchmark of your choice and it should perform well.
@MoonRide it's not about benchmarks, but the training dataset of the mix checkpoint is different from your use case. I responded on your issue with more details.
- Comes in 3B size; pretrained, mix and fine-tuned models at 224, 448 and 896 resolution
- Combination of the Gemma 2B LLM and a SigLIP image encoder
- Supported in transformers (minimal usage sketch below)
PaliGemma can do:
- Image segmentation and detection!
- Detailed document understanding and reasoning
- Visual question answering, captioning and any other VLM task!
Read our blog: huggingface.co/blog/paligemma
Try the demo: huggingface.co/spaces/google/paligemma
Check out the Spaces and the models, all in the collection: google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models: google/paligemma-ft-models-6643b03efb769dad650d2dda
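A minimal sketch of using the mix checkpoint in transformers with a short task-style prompt, as the mix models expect; the example image URL is an assumption.

```python
# Minimal sketch: captioning with a PaliGemma mix checkpoint in transformers.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "caption en"  # mix checkpoints expect short task-style prompts, not chat instructions

inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0][input_len:], skip_special_tokens=True))
```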
BLINK: evaluates tasks that humans can solve within a blink: BLINK-Benchmark/BLINK
SEED-2-Plus: multiple-choice questions on charts, maps and webpages: AILab-CVC/SEED-Bench-2-plus
Try them yourself here: merve/compare_VLMs
Hiya, are you planning to open-source the models?