TIPSv2 (CVPR'26) and TIPS (ICLR'25)
-
Updated
Jun 1, 2026 - Jupyter Notebook
TIPSv2 (CVPR'26) and TIPS (ICLR'25)
OpenVision (ICCV 2025), OpenVision 2 (CVPR 2026), and OpenVision 3
timm, evolved
A Gradio-based demonstration for the AllenAI Molmo2-8B multimodal model, enabling image QA, multi-image pointing, video QA, and temporal tracking. Users upload images or videos, provide natural language prompts.
Coral-Health is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify coral reef images into two health conditions using the SiglipForImageClassification architecture.
This demonstrates the process of adapting a large scale pretrained model, MetaCLIP 2, for fine tuning a specific downstream task: image classification.
Flood-Image-Detection is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-512 for binary image classification. It is trained to detect whether an image contains a flooded scene or non-flooded environment. The model uses the SiglipForImageClassification architecture.
Multilabel-GeoSceneNet is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for multi-label image classification. It is designed to recognize and label multiple geographic or environmental elements in a single image using the SiglipForImageClassification architecture.
Multilabel-Portrait-SigLIP2 is a vision-language model fine-tuned from google/siglip2-base-patch16-224 using the SiglipForImageClassification architecture. It classifies portrait-style images into one of the following visual portrait categories:
PussyCat-vs-Doggie-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images as either a cat or a dog using the SiglipForImageClassification architecture.
shoe-type-detection is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-512 for multi-class image classification. It is trained to detect different types of shoes such as Ballet Flats, Boat Shoes, Brogues, Clogs, and Sneakers. The model uses the SiglipForImageClassification architecture.
Fashion-Product-Usage is a vision-language model fine-tuned from google/siglip2-base-patch16-224 using the SiglipForImageClassification architecture. It classifies fashion product images based on their intended usage context.
🌍 Fine-tune MetaCLIP-2 for multilingual image classification on various tasks, enhancing performance with innovative training and data curation methods.
Leverage SigLIP 2's capabilities using LitServe.
Research project for exploring Kazakh Vision Language Models.
Add a description, image, and links to the vision-encoder topic page so that developers can more easily learn about it.
To associate your repository with the vision-encoder topic, visit your repo's landing page and select "manage topics."