Daniel Bourke
Machine learning tip:
SigLIP is an absolute goated vision model 🐐
I use it almost every day for zero-shot text-to-image matching.
For example: given a large folder of images, finding which ones contain "food" using just the search term "a photo of food".
How?
SigLIP stands for "Sigmoid Loss for Language-Image Pre-training".
In a nutshell, it's a model that learns a joint representation of vision and language, so images and text can be compared directly in the same embedding space.
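Here's a rough sketch of that "find the food photos" use case with the Hugging Face transformers library and the checkpoint linked at the bottom of this post (the image folder and the 0.5 cutoff are placeholders you'd tune for your own data):

```python
# Zero-shot "which images contain food?" with SigLIP (sketch, not production code).
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-so400m-patch14-384"  # checkpoint linked below
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image_paths = sorted(Path("my_images").glob("*.jpg"))  # placeholder folder
texts = ["a photo of food"]

matches = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    # SigLIP was trained with padded text, so padding="max_length" is recommended.
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # SigLIP scores each (image, text) pair independently with a sigmoid,
    # so there's no softmax over prompts like in CLIP.
    prob = torch.sigmoid(outputs.logits_per_image)[0, 0].item()
    if prob > 0.5:  # illustrative cutoff, tune for your data
        matches.append((path.name, prob))

for name, prob in sorted(matches, key=lambda x: -x[1]):
    print(f"{prob:.3f}  {name}")
```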
The best part is that it's open-source!
That's why it's also the vision backbone of many of the open-source VLMs coming out (Idefics3, PaliGemma, HPT-Edge).
For more on SigLIP and how it was trained, I'd highly recommend watching this talk from Lucas Beyer, one of the co-creators of the model (where the slide is from).
(Of course, read the paper too, but I find talks often contain little bits of information you may not pick up on in papers.)
One of my favourite quotes from the talk was "the more precise your text, the higher the score".
This pays big dividends when you're sorting through a large dataset or doing precise image labelling.
As seen with the goat emojis demo, the more specific the text, the higher the score.
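One way to see this for yourself: score the same image against prompts of increasing specificity and compare the sigmoid probabilities. A quick sketch (the image path and prompts are just examples, and how much the score moves depends on your image):

```python
# Same image, increasingly specific prompts -> compare the sigmoid scores (sketch).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("ramen.jpg").convert("RGB")  # placeholder image
prompts = [
    "a photo",
    "a photo of food",
    "a photo of a bowl of ramen with a soft-boiled egg",
]

inputs = processor(text=prompts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits_per_image)[0]

for prompt, prob in zip(prompts, probs.tolist()):
    print(f"{prob:.3f}  {prompt}")
```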
Custom computer vision project workflow example*:
1. Start with a large image dataset (e.g. COYO-700M/DataComp-1B, or larger if you have the resources)
2. Embed images with SigLIP -> index with FAISS (see the sketch after this list)
3. Define text ontology for precise extraction
4. Filter samples with zero-shot image/text matching
5. Use samples for downstream specific vision task improvement (e.g. to fine-tune a smaller vision model)
6. ????
7. Profit
8. Bonus: Use filtered samples with SigLIP + image caption model + image generation model to generate even more task-specific samples
*This should work quite well for any vision task you could reasonably expect to exist on the internet (if your task requires very specific custom vision data, it may not work as well).
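And here's a rough sketch of steps 2-4, assuming transformers + faiss are installed. The dataset path, ontology labels, top-k and similarity cutoff are all placeholders; note these are raw cosine similarities from the FAISS index (not the calibrated sigmoid scores above), so pick the cutoff empirically.

```python
# Steps 2-4: embed images with SigLIP, index with FAISS, filter with a text ontology (sketch).
from pathlib import Path

import faiss
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# Step 2: embed images (one by one here for clarity, batch it for a real dataset).
image_paths = sorted(Path("my_dataset").glob("*.jpg"))  # placeholder dataset
embeds = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    pixel = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeds.append(model.get_image_features(**pixel))
image_embeds = torch.nn.functional.normalize(torch.cat(embeds), dim=-1).numpy()

index = faiss.IndexFlatIP(image_embeds.shape[1])  # inner product on unit vectors = cosine similarity
index.add(image_embeds)

# Step 3: define a text ontology for what you want to pull out (placeholder labels).
ontology = ["a photo of a pizza", "a photo of a burger", "a photo of a salad"]
text_inputs = processor(text=ontology, padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1).numpy()

# Step 4: zero-shot filter -> top-k nearest images per label, kept only above a cutoff.
scores, ids = index.search(text_embeds, 100)  # top-100 per label, adjust to taste
keep = {}
for row, label in enumerate(ontology):
    keep[label] = [image_paths[i] for s, i in zip(scores[row], ids[row]) if s > 0.2]  # illustrative cutoff
```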
--
Link to talk: https://youtu.be/Nk9YnMHB6hU?si=j-lQ2...
Link to SigLIP model on Hugging Face: huggingface.co/google/siglip-so400m-patch14-384
Note: "goated" = slang for "Greatest Of All Time"