Daniel Bourke
Machine learning tip:
SigLIP is an absolute goated vision model 🐐
I use it almost every day for zero-shot text-to-image matching.
For example: given a large folder of images, finding which ones contain "food" using just the search term "a photo of food".
How?
SigLIP stands for "Sigmoid Loss for Language-Image Pre-training".
In a nutshell, it's a model that learns a joint representation of vision and language, so images and text can be compared directly in the same embedding space.
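Here's a rough sketch of that "find the food photos" use case with the Hugging Face transformers library and the checkpoint linked at the bottom of this post (the image folder and the 0.5 cutoff are placeholders you'd tune for your own data):

```python
# Zero-shot "which images contain food?" with SigLIP (sketch, not production code).
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-so400m-patch14-384"  # checkpoint linked below
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image_paths = sorted(Path("my_images").glob("*.jpg"))  # placeholder folder
texts = ["a photo of food"]

matches = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    # SigLIP was trained with padded text, so padding="max_length" is recommended.
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # SigLIP scores each (image, text) pair independently with a sigmoid,
    # so there's no softmax over prompts like in CLIP.
    prob = torch.sigmoid(outputs.logits_per_image)[0, 0].item()
    if prob > 0.5:  # illustrative cutoff, tune for your data
        matches.append((path.name, prob))

for name, prob in sorted(matches, key=lambda x: -x[1]):
    print(f"{prob:.3f}  {name}")
```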
The best part is that it's open-source!
That's why it's also the vision backbone of many of the open-source VLMs coming out (Idefics3, PaliGemma, HPT-Edge).
For more on SigLIP and how it was trained, I'd highly recommend watching this talk from Lucas Beyer, one of the co-creators of the model (where the slide is from).
(Of course, read the paper too, but I find talks often contain little bits of information you may not pick up on in papers.)
One of my favourite quotes from the talk was "the more precise your text, the higher the score".
This pays big dividends when you're sorting through a large dataset or doing precise image labelling.
As seen with the goat emojis demo, the more specific the text, the higher the score.
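One way to see this for yourself: score the same image against prompts of increasing specificity and compare the sigmoid probabilities. A quick sketch (the image path and prompts are just examples, and how much the score moves depends on your image):

```python
# Same image, increasingly specific prompts -> compare the sigmoid scores (sketch).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("ramen.jpg").convert("RGB")  # placeholder image
prompts = [
    "a photo",
    "a photo of food",
    "a photo of a bowl of ramen with a soft-boiled egg",
]

inputs = processor(text=prompts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits_per_image)[0]

for prompt, prob in zip(prompts, probs.tolist()):
    print(f"{prob:.3f}  {prompt}")
```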
Custom computer vision project workflow example*:
1. Start with a large image dataset (e.g. COYO-700M/DataComp-1B, or larger if you have the resources)
2. Embed images with SigLIP -> index with FAISS (see the sketch after this list)
3. Define text ontology for precise extraction
4. Filter samples with zero-shot image/text matching
5. Use samples for downstream specific vision task improvement (e.g. to fine-tune a smaller vision model)
6. ????
7. Profit
8. Bonus: Use filtered samples with SigLIP + image caption model + image generation model to generate even more task-specific samples
*This should work quite well for any vision task you could reasonably expect to exist on the internet (if your task requires very specific custom vision data, it may not work as well).
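And here's a rough sketch of steps 2-4, assuming transformers + faiss are installed. The dataset path, ontology labels, top-k and similarity cutoff are all placeholders; note these are raw cosine similarities from the FAISS index (not the calibrated sigmoid scores above), so pick the cutoff empirically.

```python
# Steps 2-4: embed images with SigLIP, index with FAISS, filter with a text ontology (sketch).
from pathlib import Path

import faiss
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# Step 2: embed images (one by one here for clarity, batch it for a real dataset).
image_paths = sorted(Path("my_dataset").glob("*.jpg"))  # placeholder dataset
embeds = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    pixel = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeds.append(model.get_image_features(**pixel))
image_embeds = torch.nn.functional.normalize(torch.cat(embeds), dim=-1).numpy()

index = faiss.IndexFlatIP(image_embeds.shape[1])  # inner product on unit vectors = cosine similarity
index.add(image_embeds)

# Step 3: define a text ontology for what you want to pull out (placeholder labels).
ontology = ["a photo of a pizza", "a photo of a burger", "a photo of a salad"]
text_inputs = processor(text=ontology, padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1).numpy()

# Step 4: zero-shot filter -> top-k nearest images per label, kept only above a cutoff.
scores, ids = index.search(text_embeds, 100)  # top-100 per label, adjust to taste
keep = {}
for row, label in enumerate(ontology):
    keep[label] = [image_paths[i] for s, i in zip(scores[row], ids[row]) if s > 0.2]  # illustrative cutoff
```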
--
Link to talk: https://youtu.be/Nk9YnMHB6hU?si=j-lQ2...
Link to SigLIP model on Hugging Face: huggingface.co/google/siglip-so400m-patch14-384
Note: "goated" = slang for "Greatest Of All Time"