Model: tencent/DepthCrafter
Demo: tencent/DepthCrafter
Paper: DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos (2409.02095)
You don't need to input anything other than the video itself; no need for optical flow or camera poses!
If you don't want to wait for the next transformers release, install transformers from my PR https://github.com/huggingface/transformers/pull/32938 and initialize SigLIP from there.
DepthPro is a zero-shot depth estimation model by Apple: it's fast, sharp and accurate.
Demo: akhaliq/depth-pro
Model: apple/DepthPro
Paper page: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073)
The model consists of two encoders: a patch encoder and an image encoder. The outputs of both are merged and decoded into a depth map, and the focal length is estimated as well.
The model outperforms previous state-of-the-art models on average across various benchmarks.
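If you want to try it locally, here's a minimal sketch using Apple's depth_pro package from the model repository. The create_model_and_transforms / load_rgb / infer calls follow their README; treat the exact names and the example file path as assumptions.

```python
# Minimal sketch: monocular metric depth with DepthPro.
# Assumes the `depth_pro` package is installed from the apple/ml-depth-pro repo
# and the checkpoint has been downloaded as described in the model card.
import depth_pro

model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an image together with its EXIF focal length, if available.
image, _, f_px = depth_pro.load_rgb("example.jpg")  # hypothetical input image
image = transform(image)

prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]                     # metric depth in meters
focallength_px = prediction["focallength_px"]   # estimated focal length in pixels
print(depth.shape, focallength_px)
```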
model: facebook/cotracker3
demo: facebook/cotracker
HuggingChat is now multimodal with meta-llama/Llama-3.2-11B-Vision-Instruct!
This also comes with multimodal assistants. I have migrated my Marcus Aurelius advice assistant to Llama-Vision, and Marcus can see now!
Chat with Marcus: https://huggingface.co/chat/assistant/65bfed22022ba290531112f8
Start chatting with Llama 3.2 11B Vision Instruct https://huggingface.co/chat/models/meta-llama/Llama-3.2-11B-Vision-Instruct
Meta shipped multiple models and demos for their papers at @ECCV
Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more; all models have open weights and demos
All models come with demos and even TorchScript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art model for consistent 3D generation from images
Model: facebook/vfusion3d
Demo: facebook/VFusion3D
- CoTracker is the state-of-the-art point (pixel) tracking model
Demo: facebook/cotracker
Model: facebook/cotracker
Get started at ECCV/ECCV2024-papers
nvidia/NVLM-D-72B
Paper page: NVLM: Open Frontier-Class Multimodal LLMs (2409.11402)
The paper contains many ablation studies on various ways to use the LLM backbone:
- Flamingo-like cross-attention (NVLM-X)
- LLaVA-like concatenation of image and text embeddings fed to a decoder-only model (NVLM-D)
- A hybrid architecture (NVLM-H)
Looking at the evaluations, NVLM-D and NVLM-H are the best or second best compared to other models.
The released model is NVLM-D, based on Qwen2-Instruct and aligned with InternViT-6B using a huge mixture of different datasets.
You can easily use this model by loading it through transformers' AutoModel.
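A minimal sketch of what that looks like; the loading arguments and the chat() helper follow the model card's remote code, so treat the exact signature and device placement as assumptions (the card uses a custom multi-GPU device map for the 72B weights).

```python
# Minimal sketch: loading NVLM-D-72B through AutoModel with remote code.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "nvidia/NVLM-D-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",  # the model card uses a custom device map instead
).eval()

# Text-only turn via the remote-code chat helper (signature taken from the model card;
# for images, the card shows how to pass preprocessed pixel values instead of None).
generation_config = dict(max_new_tokens=128, do_sample=False)
response, history = model.chat(
    tokenizer, None, "Hello, who are you?", generation_config,
    history=None, return_history=True,
)
print(response)
```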
Keypoint detection just landed with many docs and goodies
https://huggingface.co/models?pipeline_tag=keypoint-detection
In Hugging Face transformers we now have SuperPoint, a foundation model for keypoint detection. Check out the demo here: merve/SuperPoint
Shipped the transformers task guide on keypoint detection: https://huggingface.co/docs/transformers/tasks/keypoint_detection
Also shipped the task page (the easiest way to get started!): https://huggingface.co/tasks/keypoint-detection
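Here's a minimal inference sketch for SuperPoint in transformers; the checkpoint name and output fields follow the docs, but treat the example image URL as an assumption.

```python
# Minimal sketch: keypoint detection with SuperPoint in transformers.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, SuperPointForKeypointDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only the valid keypoints (the mask marks padded entries).
mask = outputs.mask[0].bool()
keypoints = outputs.keypoints[0][mask]  # (num_keypoints, 2)
scores = outputs.scores[0][mask]
print(keypoints.shape, scores.shape)
```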
- vidore/colpali for retrieval: it doesn't need indexing with image-text pairs, just images!
- Qwen/Qwen2-VL-2B-Instruct for generation: directly feed images as-is to a vision language model, with no conversion to text!
I used the ColPali implementation from the new Byaldi library by @bclavie
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
Why? Documents consist of multiple modalities: layout, tables, text, charts and images. Document processing pipelines often chain multiple models, which makes them immensely brittle and slow.
How? ColPali is a ColBERT-like document retrieval model built on PaliGemma. It operates over image patches directly, so indexing takes far less time and is more accurate. You can use it for retrieval alone, or for retrieval-augmented generation: find the closest document, skip any parsing, and pass the page directly to a VLM like Qwen2-VL (as image input) together with your text query.
This is much faster, you don't lose any information, and it's much easier to maintain too!
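Here's a minimal sketch of that flow with Byaldi; the index/search calls follow Byaldi's README, while the document folder, query and result fields should be treated as assumptions for illustration.

```python
# Minimal sketch: ColPali retrieval with Byaldi, then generation with Qwen2-VL.
# Assumes a folder of document pages / PDFs at "docs/" (hypothetical path).
from byaldi import RAGMultiModalModel

retriever = RAGMultiModalModel.from_pretrained("vidore/colpali")
retriever.index(
    input_path="docs/",
    index_name="my_docs",
    store_collection_with_index=True,  # keep page images so they can be passed to the VLM
    overwrite=True,
)

results = retriever.search("What was the revenue in Q2?", k=1)
best_page = results[0]  # doc id, page number, score and (optionally) the base64 page image

# The retrieved page image can then be fed as-is to Qwen/Qwen2-VL-2B-Instruct together
# with the text query; see the linked notebook for the generation part.
print(best_page.doc_id, best_page.page_num, best_page.score)
```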
Multimodal RAG: merve/multimodal-rag-66d97602e781122aae0a5139
Document AI (made earlier, for folks who want structured input/output and can fine-tune a model): merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e
A super impressive vision language model that comes in 7B, 13B, and a 13B variant fine-tuned on chat
Model repositories: merve/nveagle-66d0705108582d73bb235c26
Try it: NVEagle/Eagle-X5-13B-Chat (works very well!)
This model essentially explores having different experts (MoE) for the image encoder part of the vision language model.
How?
The authors concatenate the vision encoders' output tokens together and apply "pre-alignment": essentially, fine-tuning the experts while the text decoder is frozen.
Then they freeze both the experts and the decoder and train just the projection layer, and finally they unfreeze everything for supervised fine-tuning.
In the paper, they explore different fusion strategies and vision encoders beyond the basic CLIP encoder, and find that simply concatenating visual tokens works well.
The rest of the architecture is quite similar to LLaVA (see the architecture below).
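To make the "concatenate and project" idea concrete, here is a toy sketch of fusing tokens from multiple vision experts into the LLM embedding space. All shapes, dims and module names are illustrative, not the authors' code.

```python
# Toy sketch: channel-wise concatenation of multiple vision experts (illustrative only).
import torch
import torch.nn as nn

class MultiExpertProjector(nn.Module):
    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One projection over the concatenated expert features.
        self.proj = nn.Sequential(
            nn.Linear(sum(expert_dims), llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, expert_features):
        # expert_features: list of (batch, num_patches, dim_i) tensors,
        # assumed spatially aligned (same num_patches).
        fused = torch.cat(expert_features, dim=-1)
        return self.proj(fused)  # (batch, num_patches, llm_dim)

# Example: a 1024-dim CLIP-like expert + a 768-dim expert -> 4096-dim LLM space.
projector = MultiExpertProjector([1024, 768], 4096)
feats = [torch.randn(1, 576, 1024), torch.randn(1, 576, 768)]
visual_tokens = projector(feats)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```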
Below is an example of top-k accuracy plotted against inferred samples per second.
timm/leaderboard
Great work, thank you! (and also thanks for the informative model card)
Learn how to efficiently fine-tune the latest IDEFICS3-Llama on visual question answering in this notebook
Fine-tuning notebook: https://github.com/merveenoyan/smol-vision/blob/main/Idefics_FT.ipynb
Resulting model: merve/idefics3llama-vqav2
Model: HuggingFaceM4/Idefics3-8B-Llama3
Demo: HuggingFaceM4/idefics3
It's a multimodal model based on Llama 3.1 that accepts an arbitrary number of interleaved images with text, with a huge context window (10k tokens!)
Supported by Hugging Face transformers
Marrying the cutting-edge zero-shot object detector OWLv2 with the mask generator SAM2 (small checkpoint):
Zero-shot segmentation with insane precision
I also uploaded all models with usage snippets and made a collection of SAM2 models and demos: merve/sam2-66ac9deac6fca3bc5482fe30
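A minimal sketch of the pipeline: OWLv2 through transformers for zero-shot boxes, then SAM2 through Meta's sam2 package prompted with those boxes. The SAM2 predictor API follows their repo README; the checkpoint names, input image and text query are assumptions.

```python
# Minimal sketch: zero-shot detection with OWLv2, then mask generation with SAM2.
import numpy as np
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection
from sam2.sam2_image_predictor import SAM2ImagePredictor  # from the facebookresearch/sam2 repo

image = Image.open("cats.jpg").convert("RGB")  # hypothetical input image
texts = [["a cat"]]

# 1. Zero-shot boxes with OWLv2.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
boxes = detections["boxes"].cpu().numpy()

# 2. Prompt SAM2 (small checkpoint) with the detected boxes (batched box prompts).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-small")
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=boxes, multimask_output=False)
print(masks.shape)
```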
Here are some of the latest recipes contributed:
- "Information Extraction with Haystack and NuExtract": Use Haystack and transformers to build structured data extraction pipelines using LLMs by @anakin87 https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract
- "Build RAG with Hugging Face and Milvus": Learn how to use Milvus with sentence transformers to build RAG pipelines https://huggingface.co/learn/cookbook/rag_with_hf_and_milvus
- "Code Search with Vector Embeddings and Qdrant": Search a codebase by building a retrieval pipeline using Qdrant and sentence transformers https://huggingface.co/learn/cookbook/code_search
- "Data analyst agent: get your data's insights in the blink of an eye": a great recipe by our own @m-ric showing how to build an agent that can do data analysis! https://huggingface.co/learn/cookbook/agent_data_analyst
I think it's not about the Space; it's the model output, and the Space can't do anything about that. Maybe try another VLM that was fine-tuned for this type of task, e.g. google/paligemma-3b-mix-224?
What makes this model different?
Demo: llava-hf/video-llava
Model: LanguageBind/Video-LLaVA-7B-hf
Compared to other models that take image and video input and either project them separately or downsample the video and project selected frames, Video-LLaVA converts images and videos to a unified representation and projects them using a shared projection layer.
It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP; these encoders map the modalities to a unified representation before passing them to the projection layer.
I feel like one of the coolest features of this model is joint image-video understanding, which many models have only recently introduced.
It's a relatively old model, but it was ahead of its time and works very well! This means you can, for example, pass the model an image of a cat and a video of a cat, and ask whether the cat in the image appears in the video.
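A minimal sketch of that joint prompting through transformers; the prompt format follows the model card, while the video decoding helper, file names and question are assumptions for illustration.

```python
# Minimal sketch: joint image + video prompting with Video-LLaVA in transformers.
import av
import numpy as np
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

def sample_frames(path, num_frames=8):
    # Uniformly sample frames from a video file (assumes the container reports a frame count).
    container = av.open(path)
    stream = container.streams.video[0]
    indices = set(np.linspace(0, stream.frames - 1, num_frames).astype(int).tolist())
    frames = [f.to_ndarray(format="rgb24") for i, f in enumerate(container.decode(stream)) if i in indices]
    return np.stack(frames)

image = Image.open("cat.jpg")            # hypothetical inputs
video = sample_frames("cat_video.mp4")

prompt = "USER: <image> <video> Is the cat in the image also in the video? ASSISTANT:"
inputs = processor(text=prompt, images=image, videos=video, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```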
It is a vision language model; these models use a text decoder (here built on Llama 2, since it's another model from Meta) as one component. VLMs differ a lot from LLMs; the post above explains the difference.
Can you send your inputs for reproducibility? @prasiyer
A vision language model that comes in 7B and 34B sizes
But what makes this model so special?
Demo: merve/chameleon-7b
Models: facebook/chameleon-668da9663f80d483b4c61f58
keep reading:
Chameleon is a unique model: it attempts to scale early fusion
But what is early fusion?
Most modern vision language models use a vision encoder plus a projection layer that projects image embeddings into the text decoder's (LLM's) embedding space, so the images can be used to prompt it.
Early fusion, on the other hand, fuses all features together (image patches and text) by using an image tokenizer; all tokens are projected into a shared space, which enables seamless generation.
The authors also introduced architectural improvements (QK norm and revised placement of layer norms) for scalable and stable training, and they were able to increase the token count (5x the tokens compared to Llama 3, which is a must with early fusion IMO).
This model is any-to-any thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled in the release to prevent malicious use.
One can also do text-only prompting; the authors note the model catches up with larger LLMs (like Mixtral 8x7B or the larger Llama-2 70B), and on image-pair prompting it catches up with larger VLMs like IDEFICS-80B (see the paper for the benchmarks: Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818))
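Chameleon is supported in transformers; a minimal sketch of image-understanding prompting (image generation is disabled in the released checkpoints). The prompt format follows the docs; the example image URL is an assumption.

```python
# Minimal sketch: interleaved image + text prompting with Chameleon in transformers.
import requests
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```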
Thanks for reading!
Document retrieval is usually done through OCR + layout detection, but you lose a lot of information along the way. Stop doing that!
ColPali uses a vision language model instead, which is better at document understanding.
ColPali: vidore/colpali (mit license!)
Blog post: https://huggingface.co/blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard
ColPali marries the idea of modern vision language models with retrieval.
The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call this BiSigLip). Then they feed the patch embeddings to PaliGemma to create BiPali.
BiPali natively maps image patch embeddings into an LLM, which enables ColBERT-like late-interaction computations between text tokens and image patches (hence the name ColPali!).
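The late-interaction score is easy to write down; here's a toy sketch with random tensors standing in for the query-token and page-patch embeddings (dimensions are illustrative).

```python
# Toy sketch: ColBERT-style late interaction (MaxSim) between query tokens and page patches.
import torch

def late_interaction_score(query_emb, page_emb):
    # query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim), both L2-normalized.
    sim = query_emb @ page_emb.T            # (num_query_tokens, num_patches)
    return sim.max(dim=-1).values.sum()     # best patch per query token, summed over tokens

query_emb = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)   # e.g. 12 query tokens
pages = [torch.nn.functional.normalize(torch.randn(1030, 128), dim=-1) for _ in range(3)]  # 3 pages

scores = [late_interaction_score(query_emb, p) for p in pages]
best_page = int(torch.tensor(scores).argmax())
print(scores, best_page)
```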
The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. Not only that, it is way faster than traditional PDF parsers too!
Models: https://huggingface.co/PekingU
Demo: merve/RT-DETR-tracking-coco
Paper: DETRs Beat YOLOs on Real-time Object Detection (2304.08069)
Notebook: https://github.com/merveenoyan/example_notebooks/blob/main/RT_DETR_Notebook.ipynb
YOLO models are known to be super fast for real-time computer vision, but their speed is held back by their reliance on NMS post-processing.
Transformer-based detectors, on the other hand, are not as computationally efficient.
Isn't there something in between? Enter RT-DETR!
The authors combined a CNN backbone and a multi-stage hybrid encoder (mixing convolutions and attention) with a transformer decoder. In the paper, they also claim you can adjust speed by changing the number of decoder layers without retraining from scratch.
The authors find that the model beats the previous state of the art in both speed and accuracy.
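A minimal sketch of running RT-DETR through transformers (checkpoint name from the PekingU org; the example image URL and threshold are arbitrary).

```python
# Minimal sketch: real-time object detection with RT-DETR in transformers.
import torch
import requests
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```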
Learn about more machine learning tasks at https://huggingface.co/tasks
Today we release a notebook and a walkthrough blog on fine-tuning Florence-2 on the DocVQA dataset, with @andito @SkalskiP
Blog: https://huggingface.co/blog
Notebook: https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing
Florence-2 is a great vision-language model thanks to its massive dataset and small size!
The model requires conditioning through task prefixes, and it's not as generalist: it needs fine-tuning for a new task such as DocVQA.
We fine-tuned the model on an A100 (a smaller GPU also works with a smaller batch size) and saw that the model picks up new tasks.
See below how it looks before and after fine-tuning.
Play with the demo here: andito/Florence-2-DocVQA
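A minimal sketch of conditioning Florence-2 with a task prefix. It uses the base fine-tuned checkpoint and the built-in <OCR> prefix as a stand-in; after fine-tuning on DocVQA you would use your own prefix plus the question. The document image path is hypothetical, and the generation/post-processing calls follow the model card's remote code.

```python
# Minimal sketch: prompting Florence-2 with a task prefix.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-base-ft"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.png").convert("RGB")  # hypothetical document image
prompt = "<OCR>"  # after DocVQA fine-tuning, you'd use e.g. "<DocVQA>" + your question

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
    )
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(text, task=prompt, image_size=image.size))
```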
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input, and outputs image and text.
Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)
This model consists of a transformer encoder and decoder, and the key to its multimodality lies in the input and output data:
input and output tokens are decoded to produce bounding boxes, generated image pixels, captions and more!
The model also learned to generate Canny maps, SAM edges and other outputs for steerable text-to-image generation.
The authors only added image-to-all capabilities to the demo, but you can try to use this model for text-to-image generation as well.
Demo: gokaygokay/Florence-2
Collection: microsoft/florence-6669f44df0d87d9c3bfb76de
This model can handle tasks that vary from OCR to semantic segmentation.
The difference from previous models is that the authors compiled a dataset of 126M images with 5.4B annotations, pseudo-labelled with their own data engine using smaller specialized models and APIs.
The model has a similar architecture to previous models: an image encoder and a multimodality encoder with a text decoder. The authors compiled the multitask dataset with prompts for each task.
You can also fine-tune this model on any task of your choice. The authors also reported results on downstream tasks, both with the vision encoder frozen and unfrozen.
They have released fine-tuned models too; you can find them in the collection above.
PixelProse is a captioning dataset of 16M image-caption pairs, with less toxicity and more detail
tomg-group-umd/pixelprose
Existing captioning datasets consist of web scrapes whose alt text is either irrelevant or not descriptive. The authors of this paper took those datasets, filtered them for CSAM, and passed the images to Gemini Pro Vision with a prompt. They also removed PII and detoxified the resulting dataset.
It's Depth Anything, but scaled up with both a larger teacher model and a gigantic dataset!
Here's a small TL;DR of the paper, which has a lot of findings and experiments.
I have also created a collection with the models, the dataset, the demo and the CoreML-converted model: merve/depth-anything-v2-release-6671902e798cd404513ffbf5
The authors analyzed Marigold, a diffusion-based model, against Depth Anything and found out what's going on with using synthetic vs real images for monocular depth estimation:
- Real data has a lot of label noise and inaccurate depth maps (caused by depth sensors missing transparent objects, etc.), and many details are overlooked
- Synthetic data has more precise and detailed depth labels that are truly ground truth, but there's a distribution shift between real and synthetic images, and scene coverage is restricted
The authors train different image encoders only on synthetic images and find that unless the encoder is very large, the model can't generalize well (but large models generalize inherently anyway)
Even those still fail on real images that have a wide distribution in labels (e.g. diverse instances of objects)
The Depth Anything V2 framework is to:
- Train a teacher model based on DINOv2-G on 595K synthetic images
- Label 62M real images using the teacher model
- Train a student model on the real images labelled by the teacher
Result: 10x faster and more accurate than Marigold!
The authors also construct a new benchmark called DA-2K that is less noisy, highly detailed and more diverse!
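A minimal sketch of running one of the released student models through the transformers depth-estimation pipeline; the checkpoint id and input image are assumptions, so check the collection for the exact names.

```python
# Minimal sketch: depth estimation with a Depth Anything V2 checkpoint via the pipeline API.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint id
)

image = Image.open("room.jpg")  # hypothetical input
result = depth_estimator(image)
result["depth"].save("room_depth.png")  # PIL image of the predicted depth map
```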
Have you claimed your papers and linked your models/datasets/demos?
This will increase the visibility and impact of your paper.
To index your papers, go here
CVPR2024/CVPR2024-papers
Find your paper, click on the paper page link, index the paper, then click on your name (the workflow is shown below).
If you'd like to add links to your paper, go here: CVPR2024/update-CVPR2024-papers
Log in, find your paper's ID, retrieve the paper, fill in the info and submit!
A repository with notebooks on shrinking, optimizing, speeding-up, customizing large vision models! https://github.com/merveenoyan/smol-vision
thank you for all you do for good open-source <3
I asked it to describe my favorite Howl's Moving Castle scene and here's how it went
Joking aside, it seems to outperform the previous VLMs. However, the license isn't open source.
model repo: THUDM/glm-4v-9b
a community member has built a demo: vilarin/VL-Chatbox
LLaVA 1.6 is outperforming proprietary VLMs, making it a very robust choice for production!
It is now hosted as a leaderboard: MM-UPD/MM-UPD_Leaderboard
@hakunamatata1997 why not use a document LM instead, if you were going to combine OCR and a VLM? The latter will surely perform worse, because you miss out on a lot of the layout, charts, etc. anyway. Maybe try this, it's very good: https://huggingface.co/spaces/mPLUG/DocOwl
Hello @anothercoder2, interesting. Can you see the files through the CLI though? Is this your local setup? I think you need to find the correct path inside /downloads and pass that to load_from_disk. Because many datasets are cached in the same folder, it needs the exact path (often a folder under ~/.cache/huggingface/datasets/downloads with a unique ID assigned).
A new paper (by @HuanjinYao et al.) built a dense connector that does it better! HuanjinYao/DenseConnector-v1.5-8B
HuanjinYao/denseconnector-66500e173fc8c9f05dc98dea
VLMs consist of an image encoder block, a projection layer that projects image embeddings into the text embedding space, and a text decoder, connected sequentially.
This paper explores using intermediate states of the image encoder rather than a single output.
The authors explore three different ways of instantiating the dense connector: sparse token integration, sparse channel integration and dense channel integration (see the paper for how they do it: Dense Connector for MLLMs (2405.13800)).
They integrate all three into LLaVA 1.5 and find that each of the new models is superior to the original LLaVA 1.5. I tried the model and it seems to work very well. As part of the release, the authors released various checkpoints based on different decoders (Vicuna 7/13B and Llama 3-8B) that you can find in the collection.
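A toy sketch of the intuition behind channel integration: take hidden states from several encoder layers, concatenate them along the feature dimension, and project into the LLM space. The layer indices, dims and module names are illustrative, not the authors' code.

```python
# Toy sketch: fusing intermediate vision-encoder states before projecting to the LLM (illustrative only).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llm_dim = 4096
selected_layers = [8, 16, 23]  # arbitrary choice of intermediate layers

projector = nn.Sequential(
    nn.Linear(encoder.config.hidden_size * len(selected_layers), llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

# hidden_states: one (batch, num_patches + 1, hidden) tensor per layer; drop the CLS token.
feats = [out.hidden_states[i][:, 1:, :] for i in selected_layers]
fused = torch.cat(feats, dim=-1)
visual_tokens = projector(fused)
print(visual_tokens.shape)  # (1, 256, 4096)
```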
You can use Colab's instances to do QLoRA fine-tuning, and then for the Space we will give ZeroGPU :)
You can pick any dataset of your choice!
Example code: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing (you can use a lower GPU with QLoRA)
Datasets:
https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=trending
https://huggingface.co/datasets?task_categories=task_categories:image-to-text&sort=trending
@HakunaMatata1997 hello!
Off the top of my head I can't think of an OCR-specific model; I was mostly using easyocr. OCR is a problem that is pretty much solved, so most of the AI work around documents is focused on understanding them (because it's more than image -> text; it involves text, charts, tables, the whole layout and more).
If you really want OCR, there are models like https://huggingface.co/facebook/nougat-base, which does PDF to markdown, for instance.
I can also recommend some models for document understanding in general (which work on text + charts + images + layout), either zero-shot or as a backbone to fine-tune.
For instance, if you want to collaborate with an external organization, you don't want to use your write token, since they would be able to access everything you can access. Instead, you can scope the token's access to repositories under that org only, like below.
merve/paligemma-doc
@Cuiunbo ah yes, right. These types of models are "OCR-free", meaning the model understands and responds to the image directly rather than running a separate OCR step. Those datasets are also OCR-free, I think. The good thing about the OCR-free approach is that features like layout, charts, tables etc. are also understood. Maybe try prompts that ask for pure OCR? High resolution also works well on handwriting, etc.
Here's the notebook to do so: https://colab.research.google.com/drive/16-Tq-iAMHNlSjDWgz43kYDMJERjU_KHW?usp=sharing
@Cuiunbo I think in the model card you can see the OCR (document understanding in general) fine-tuned model with its associated benchmark on the test dataset.
CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part.
You can try it yourself here: shi-labs/CuMo-7b-zero
The authors first pre-train the MLP while freezing the image encoder and text decoder, then warm up the whole network by unfreezing it and fine-tuning, which they state stabilizes the visual instruction tuning when bringing in the experts.
The mixture-of-experts MLP blocks are simply copies of the single MLP that was trained during pre-training and fine-tuned in the pre-finetuning stage.
It works very well (I also tested it myself): it outperforms the previous SotA models of its size, LLaVA-NeXT and IDEFICS2-8B, on several benchmarks!
@MoonRide if you check the model card you can see the scores. Mix models are trained on a mix of academic benchmark datasets (COCO captions, VQAv2, OCR-VQA etc.), where you just say e.g. "caption" and it captions. These datasets often have shorter descriptions rather than long prompts; however, they're grounded, so they do well on those benchmarks' test sets and can be used in many industry use cases (document AI etc., since the model hardly hallucinates). For your prompt, I just input "caption" and it came up with a very grounded caption, for instance.
The main point of the PaliGemma release is to provide fine-tunable models, not heavy models with wide zero-shot capabilities (where you input super long instructions or chat-like prompts). So if you want, you can fine-tune a "pt" model on any benchmark of your choice and it should perform well.
@MoonRide it's not about benchmarks, but the training dataset of the mix checkpoint is different from your use case. I responded on your issue with more details.
- Comes in 3B size; pretrained, mix and fine-tuned models at 224, 448 and 896 resolution
- Combination of the Gemma 2B LLM and a SigLIP image encoder
- Supported in transformers (minimal usage sketch below)
PaliGemma can do:
- Image segmentation and detection!
- Detailed document understanding and reasoning
- Visual question answering, captioning and any other VLM task!
Read our blog: huggingface.co/blog/paligemma
Try the demo: huggingface.co/spaces/google/paligemma
Check out the Spaces and the models, all in the collection: google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models: google/paligemma-ft-models-6643b03efb769dad650d2dda
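A minimal sketch of using the mix checkpoint in transformers with a short task-style prompt, as the mix models expect; the example image URL is an assumption.

```python
# Minimal sketch: captioning with a PaliGemma mix checkpoint in transformers.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "caption en"  # mix checkpoints expect short task-style prompts, not chat instructions

inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0][input_len:], skip_special_tokens=True))
```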
BLINK: evaluates tasks that humans can solve within a blink: BLINK-Benchmark/BLINK
SEED-2-Plus: multiple-choice questions on charts, maps and webpages: AILab-CVC/SEED-Bench-2-plus
Try them yourself here: merve/compare_VLMs
Hiya, are you planning to open-source the models?