SINC-V1-SIGLIP2-KEE-SPEED-Small: Multimodal Product Classifier
Shopify Image Niche Classifier (V1-small): a high-performance multimodal product classification model that combines product titles (text) with product images to classify products into 14 categories.
Model Performance
- Test Accuracy: 92.46%
- Test Macro F1-Score: 0.73
- Test Loss: 0.2450
- Test Set Size: 59,789 samples
Per-Class Performance
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Fashion | 0.97 | 0.99 | 0.98 | 35,880 |
| Jewelry | 0.95 | 0.97 | 0.96 | 2,474 |
| Baby & Kids | 0.91 | 0.88 | 0.90 | 2,511 |
| Consumer Electronics | 0.91 | 0.83 | 0.87 | 1,599 |
| Lights | 0.90 | 0.92 | 0.91 | 3,504 |
| Others | 0.90 | 0.68 | 0.77 | 1,930 |
| Beauty & Personal Care | 0.85 | 0.85 | 0.85 | 1,354 |
| Home & Interior | 0.81 | 0.89 | 0.85 | 6,006 |
| Sports & Fitness Equipment | 0.80 | 0.61 | 0.69 | 727 |
| Outdoor, Garden & Adventure Gear | 0.79 | 0.72 | 0.76 | 1,962 |
| Health & Supplements | 0.74 | 0.56 | 0.64 | 1,240 |
| Hobbies & Collectibles | 0.64 | 0.57 | 0.61 | 435 |
| Office Supplies | 0.51 | 0.45 | 0.48 | 153 |
| Food & Beverages | 0.00 | 0.00 | 0.00 | 14 |
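The per-class numbers above follow the standard precision/recall/F1/support breakdown. As a minimal sketch of how such a report can be reproduced with scikit-learn, assuming you have integer class predictions for the test set (`y_true`, `y_pred`, and `labels` here are placeholders, not artifacts shipped with this repository):

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

def evaluate(y_true, y_pred, labels):
    # Overall accuracy and macro-averaged F1, plus the per-class table shown above
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(f"Macro F1: {f1_score(y_true, y_pred, average='macro'):.2f}")
    print(classification_report(y_true, y_pred, target_names=labels, digits=2))
```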
Model Architecture
- Text Encoder: microsoft/deberta-v3-small
- Image Encoder: google/siglip2-base-patch16-256
- Fusion: Concatenation + MLP (512 hidden units); see the sketch after this list
- Output: 14-class classifier
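The actual head is defined in `model_arch.py` in the repository. For orientation only, a minimal sketch of a concatenation-plus-MLP fusion head matching the description above; the pooling strategy, dropout rate, and exact layer layout are assumptions, not the repository's code:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalClassifier(nn.Module):
    """Concatenation fusion: text features + image features -> 512-unit MLP -> class logits."""

    def __init__(self, text_model_name, clip_model, num_labels,
                 text_finetune=True, clip_finetune=False, hidden_dim=512):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        self.clip_model = clip_model

        # Freeze or unfreeze each encoder as requested
        for p in self.text_encoder.parameters():
            p.requires_grad = text_finetune
        for p in self.clip_model.parameters():
            p.requires_grad = clip_finetune

        text_dim = self.text_encoder.config.hidden_size               # 768 for deberta-v3-small
        image_dim = self.clip_model.config.vision_config.hidden_size  # 768 for siglip2-base
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),  # assumed dropout rate
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # Mean-pool token embeddings (DeBERTa-v3 has no pooler output)
        token_states = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        text_feat = (token_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

        # Pooled image embedding from the SigLIP vision tower
        image_feat = self.clip_model.get_image_features(pixel_values=pixel_values)

        logits = self.classifier(torch.cat([text_feat, image_feat], dim=-1))
        return {"logits": logits}
```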
Usage
```python
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoProcessor, SiglipModel
import torch
from PIL import Image
import json
from model_arch import MultimodalClassifier

# Download the checkpoint and config from the Hugging Face Hub
repo_id = "manavbangotra/sinc-v1-siglip2-kee-speed-small"
model_path = hf_hub_download(repo_id=repo_id, filename="best_multimodal.pt")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")

# Load configuration
with open(config_path) as f:
    config = json.load(f)
labels = config["labels"]
num_labels = config["num_labels"]

# Load model components
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-256")
clip_model = SiglipModel.from_pretrained("google/siglip2-base-patch16-256")

model = MultimodalClassifier("microsoft/deberta-v3-small", clip_model, num_labels,
                             text_finetune=True, clip_finetune=False)
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()  # optionally move to GPU with model.to("cuda") if available

# Example inputs
text = "Elegant summer dress with floral pattern"
image = Image.open("product_image.jpg").convert("RGB")
text_inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
image_inputs = processor(images=image, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        pixel_values=image_inputs["pixel_values"],
    )

predictions = torch.softmax(outputs["logits"], dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {labels[predicted_class]} ({confidence:.2%})")
```
Categories
The model classifies products into these 14 categories:
- Baby & Kids
- Beauty & Personal Care
- Consumer Electronics
- Fashion
- Food & Beverages
- Health & Supplements
- Hobbies & Collectibles
- Home & Interior
- Jewelry
- Lights
- Office Supplies
- Others
- Outdoor, Garden & Adventure Gear
- Sports & Fitness Equipment
Training Details
- Training Data: E-commerce product dataset with titles and images
- Training Strategy: Fine-tuned text encoder, frozen image encoder
- Optimizer: AdamW with linear warmup scheduler (sketched below)
- Batch Size: 8
- Learning Rate: 2e-5
- Epochs: 3
- Max Text Length: 64 tokens
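A hedged sketch of the optimizer/scheduler setup these hyperparameters describe, assuming `transformers`' `get_linear_schedule_with_warmup` and a hypothetical 10% warmup ratio (the actual warmup length is not stated in this card):

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, num_train_samples,
                                  batch_size=8, epochs=3, lr=2e-5, warmup_ratio=0.1):
    # Only trainable parameters (text encoder + fusion head); the image encoder stays frozen
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(params, lr=lr)

    total_steps = (num_train_samples // batch_size) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),  # warmup_ratio is an assumption
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```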
Limitations
- Performance varies across categories; small classes such as Office Supplies and Food & Beverages score far below the majority classes
- Requires an image input for every prediction; text-only classification is not supported
- Trained on a specific Shopify product domain and may not generalize to other e-commerce catalogs
Citation
```bibtex
@misc{sinc-v1-siglip2-kee-speed-small,
  title={SINC-V1-SIGLIP2-KEE-SPEED-small: High-Performance Multimodal Product Classifier},
  author={Manav Bangotra},
  year={2025},
  url={https://huggingface.co/manavbangotra/sinc-v1-siglip2-kee-speed-small}
}
```
License
MIT License