NanoCLIP: A Vision-Language Contrastive Learning Model


NanoCLIP is a from-scratch implementation of the CLIP architecture, designed as a lightweight, deployable vision-language model.

The system learns a shared embedding space for images and text, enabling zero-shot classification and semantic retrieval via vector similarity.
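At inference time, retrieval in the shared space reduces to comparing L2-normalized embeddings with a dot product (cosine similarity). A minimal sketch with toy embeddings standing in for the encoder outputs (the values and the helper name are illustrative, not part of NanoCLIP):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Pick the caption whose embedding is most similar to the image.

    After L2 normalization the dot product equals cosine similarity.
    Shapes: image_emb (d,), text_embs (n, d).
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb          # (n,) cosine similarities
    return int(np.argmax(sims)), sims

# Toy 4-dimensional embeddings standing in for real encoder outputs.
image = np.array([0.9, 0.1, 0.0, 0.1])
captions = np.array([
    [0.1, 0.9, 0.1, 0.0],   # e.g. "a cat on a sofa"
    [0.8, 0.2, 0.1, 0.1],   # e.g. "a dog in a park" (closest to the image)
    [0.0, 0.1, 0.9, 0.2],   # e.g. "a plate of food"
])
best, scores = zero_shot_classify(image, captions)
```

The same similarity scores drive both zero-shot classification (argmax over candidate captions) and retrieval (top-k over a corpus).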

Model Architecture

NanoCLIP Architecture

The model follows a dual-encoder transformer design: a Vision Transformer encodes images and a Text Transformer encodes captions, with each branch projecting into the shared embedding space.
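The dual-encoder layout can be sketched in PyTorch as two independent transformer branches with separate projection heads. All sizes and layer counts below are illustrative placeholders, not the actual NanoCLIP configuration:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Minimal dual-encoder sketch (sizes are illustrative)."""

    def __init__(self, vocab_size=8192, embed_dim=256, proj_dim=128):
        super().__init__()
        # Vision branch: a stand-in for the Vision Transformer; flattened
        # 16x16 RGB patches are linearly projected, then encoded.
        self.patch_proj = nn.Linear(16 * 16 * 3, embed_dim)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Text branch: token embedding followed by a transformer encoder.
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Separate heads project both modalities into the shared space.
        self.img_proj = nn.Linear(embed_dim, proj_dim)
        self.txt_proj = nn.Linear(embed_dim, proj_dim)

    def forward(self, patches, token_ids):
        img = self.vision_encoder(self.patch_proj(patches)).mean(dim=1)
        txt = self.text_encoder(self.token_emb(token_ids)).mean(dim=1)
        # L2-normalize so similarities are cosine similarities.
        img = nn.functional.normalize(self.img_proj(img), dim=-1)
        txt = nn.functional.normalize(self.txt_proj(txt), dim=-1)
        return img, txt

model = DualEncoder()
patches = torch.randn(2, 49, 16 * 16 * 3)   # 2 images, 49 patches each
tokens = torch.randint(0, 8192, (2, 12))    # 2 captions, 12 tokens each
img_emb, txt_emb = model(patches, tokens)
```

The key design point is that the two branches share no weights; they only meet in the projected embedding space, which is what makes pre-computing and indexing text or image embeddings for retrieval cheap.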

Training Dataset

The model was trained on the Flickr30k dataset, which contains roughly 31,000 images with five human-written captions per image.

Text Processing

A custom Byte Pair Encoding (BPE) tokenizer was trained on the caption data using the HuggingFace Tokenizers library.
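Training a BPE tokenizer with the Tokenizers library takes only a few lines. The captions and special-token names below are stand-ins; the real tokenizer was trained on the full Flickr30k caption corpus with NanoCLIP's own vocabulary settings:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# A handful of stand-in captions; the real corpus is all Flickr30k captions.
captions = [
    "A dog runs across a grassy field .",
    "Two children play near a fountain .",
    "A man rides a bicycle down the street .",
]

# BPE model with an unknown-token fallback, split on whitespace first.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Vocabulary size and special tokens here are illustrative choices.
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train_from_iterator(captions, trainer)

encoding = tokenizer.encode("A dog runs down the street .")
```

`encoding.ids` is the integer sequence fed to the text encoder's embedding layer.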

Model Optimization

ONNX Benchmark Results

Post-training optimization was performed with ONNX Runtime dynamic quantization to shrink the model and improve deployment efficiency.
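Dynamic quantization in ONNX Runtime converts the weights of an exported model to 8-bit integers while computing activations in floating point at runtime. A sketch of the call, with hypothetical file names standing in for the actual exported encoder paths:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# File names are placeholders; substitute the real exported ONNX paths.
quantize_dynamic(
    model_input="nanoclip_encoder.onnx",        # fp32 export
    model_output="nanoclip_encoder.int8.onnx",  # quantized output
    weight_type=QuantType.QInt8,                # 8-bit integer weights
)
```

No calibration data is needed for dynamic quantization, which makes it a convenient first optimization step before considering static quantization.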

Performance Results

Training Dynamics

Training History

The model was trained for 30 epochs; the loss converged rapidly in the early epochs, after which validation performance remained stable.
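CLIP-style dual encoders are typically trained with a symmetric contrastive (InfoNCE) objective: matching image-text pairs sit on the diagonal of a batch similarity matrix, and cross-entropy is applied along both axes. A NumPy sketch of that loss (illustrative, not the exact training code; the temperature value is an assumption):

```python
import numpy as np

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of L2-normalized
    image/text embedding pairs; pair i's match is on the diagonal."""
    logits = (img_embs @ txt_embs.T) / temperature   # (n, n) similarities
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise log-softmax, then pick the diagonal (correct pair).
        l = l - l.max(axis=1, keepdims=True)         # numeric stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average of image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

perfect = clip_loss(embs, embs)                      # aligned pairs: low loss
shuffled = clip_loss(embs, np.roll(embs, 1, axis=0)) # mismatched: higher loss
```

Minimizing this loss pulls matching pairs together and pushes non-matching pairs apart, which is what produces the shared embedding space used for retrieval.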

Zero-Shot Inference Examples

The trained model supports zero-shot classification and semantic image-text matching.

Query Example 1
Query Example 2
Query Example 3

Query Images

Input Image 1
Input Image 2
Input Image 3

Technology Stack

PyTorch, Transformers, HuggingFace Tokenizers, ONNX Runtime, CUDA, MPS