Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct specifically optimized for medical image analysis tasks in the FLARE 2025 2D Medical Multimodal Dataset challenge.

Model Description

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
  • Target Domain: Medical imaging across 8 modalities (Clinical, Dermatology, Endoscopy, Mammography, Microscopy, Retinography, Ultrasound, Xray)
  • Tasks: Medical image captioning, visual question answering, report generation
  • Training Data: 19 FLARE 2025 datasets with comprehensive medical annotations

Training Details

Training Data

The model was fine-tuned on 19 diverse medical imaging datasets from FLARE 2025; details can be found at: https://huggingface.co/datasets/FLARE-MedFM/FLARE-Task5-MLLM-2D

Training Configuration

# LoRA Configuration
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
target_modules: ['k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'q_proj']
task_type: CAUSAL_LM

# Training Statistics
total_steps: 1000
learning_rate: N/A
final_eval_loss: 5.4849
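
As a rough sketch, the LoRA hyperparameters above correspond to the following peft configuration (values copied from the block above; the variable name is illustrative and this is not the exact training script):

from peft import LoraConfig

# LoRA configuration mirroring the hyperparameters listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)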

Training Procedure

  • Optimization: 4-bit quantization with BitsAndBytesConfig (see the configuration sketch after this list)
  • LoRA Configuration:
    • r=64, alpha=16, dropout=0.1
    • Target modules: All linear layers
  • Memory Optimization: Gradient checkpointing, flash attention
  • Batch Size: Dynamic based on image resolution
  • Learning Rate: 1e-4 with cosine scheduling
  • Training Steps: 1000 steps with evaluation every 500 steps
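
For illustration, the quantization and optimizer settings listed above could be expressed roughly as follows. This is a sketch rather than the actual training script: the output directory, batch size, and 4-bit quantization type are placeholders or assumptions, not values stated in this card.

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit quantization for the frozen base model (QLoRA-style);
# "nf4" is the usual QLoRA choice, not a value stated in this card.
# Flash attention is enabled separately at model load time via
# attn_implementation="flash_attention_2".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Optimizer and schedule settings matching the bullets above
training_args = TrainingArguments(
    output_dir="./qwen2.5vl-flare2025",   # placeholder
    per_device_train_batch_size=1,        # placeholder; batch size was dynamic
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    max_steps=1000,
    eval_strategy="steps",                # evaluation_strategy in older transformers releases
    eval_steps=500,
    gradient_checkpointing=True,
    fp16=True,
)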

Model Performance

This model has been evaluated across multiple medical imaging tasks with the following capabilities:

  • Image Captioning: Generates detailed medical reports from imaging studies
  • Visual Question Answering: Answers clinical questions about medical images
  • Diagnosis Support: Identifies pathological findings and abnormalities
  • Multi-modal Understanding: Integrates visual and textual medical information

Evaluation Metrics

The model is evaluated using task-specific metrics following FLARE 2025 specifications:

Classification Tasks:

  • Balanced Accuracy (PRIMARY): Handles class imbalance in medical diagnosis
  • Accuracy: Standard classification accuracy
  • F1 Score: Weighted F1 for multi-class scenarios
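
For reference, the classification metrics above can be computed with scikit-learn as in the following sketch (label arrays are placeholders):

from sklearn.metrics import balanced_accuracy_score, accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1]   # placeholder ground-truth class labels
y_pred = [0, 1, 2, 1, 1]   # placeholder model predictions

balanced_acc = balanced_accuracy_score(y_true, y_pred)   # primary metric
acc = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")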

Multi-label Classification:

  • F1 Score (PRIMARY): Sample-wise F1 across multiple labels
  • Precision: Label prediction precision
  • Recall: Label coverage recall
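
Sample-wise F1 for multi-label outputs can likewise be computed with scikit-learn over binary indicator matrices (a sketch with placeholder labels):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows are samples, columns are labels (placeholder values)
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

sample_f1 = f1_score(y_true, y_pred, average="samples")   # primary metric
precision = precision_score(y_true, y_pred, average="samples", zero_division=0)
recall = recall_score(y_true, y_pred, average="samples", zero_division=0)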

Detection Tasks:

  • F1 Score @ IoU > 0.5 (PRIMARY): Standard computer vision detection metric
  • Precision: Detection precision at IoU threshold
  • Recall: Detection recall at IoU threshold
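
A minimal sketch of how F1 at an IoU threshold can be computed with greedy one-to-one box matching; the official FLARE evaluation code may differ in box format and tie-breaking:

def iou(a, b):
    # Boxes given as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_f1(preds, gts, thr=0.5):
    # Greedily match each ground-truth box to at most one unused prediction
    matched, used = 0, set()
    for g in gts:
        for i, p in enumerate(preds):
            if i not in used and iou(p, g) > thr:
                matched += 1
                used.add(i)
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0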

Instance Detection (Identity-Aware):

  • F1 Score @ IoU > 0.3 (PRIMARY): Medical imaging standard for chromosome detection
  • F1 Score @ IoU > 0.5: Computer vision standard
  • Average F1: COCO-style average across IoU thresholds (0.3-0.7)
  • Per-chromosome metrics: Detailed breakdown by chromosome identity
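
Reusing the detection_f1 sketch above, the COCO-style average is simply the mean F1 over IoU thresholds from 0.3 to 0.7:

import numpy as np

def average_f1(preds, gts):
    # Mean F1 across IoU thresholds 0.3, 0.4, ..., 0.7 (reuses detection_f1 from the previous sketch)
    thresholds = np.arange(0.3, 0.75, 0.1)
    return float(np.mean([detection_f1(preds, gts, thr=t) for t in thresholds]))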

Counting Tasks:

  • Mean Absolute Error (PRIMARY): Cell counting accuracy
  • Root Mean Squared Error: Additional counting precision metric

Regression Tasks:

  • Mean Absolute Error (PRIMARY): Continuous value prediction accuracy
  • Root Mean Squared Error: Regression precision metric
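
Both the counting and regression metrics reduce to MAE and RMSE over predicted and reference values, e.g.:

import numpy as np

y_true = np.array([12.0, 30.0, 7.0])   # placeholder reference counts/values
y_pred = np.array([10.0, 33.0, 7.0])   # placeholder predictions

mae = np.mean(np.abs(y_pred - y_true))            # primary metric
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))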

Report Generation:

  • GREEN Score (PRIMARY): Comprehensive medical report evaluation with 7 components:
    • Entity matching with severity assessment (30%)
    • Location accuracy with laterality (20%)
    • Negation and uncertainty handling (15%)
    • Temporal accuracy (10%)
    • Size/measurement accuracy (10%)
    • Clinical significance weighting (10%)
    • Report structure completeness (5%)
  • BLEU Score: Text generation quality
  • Clinical Efficacy: Medical relevance scoring
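
The component weights above sum to 100%; if each component is scored in [0, 1], the composite GREEN score is their weighted sum, as in this sketch (the official scoring implementation may differ):

GREEN_WEIGHTS = {
    "entity_matching": 0.30,
    "location_accuracy": 0.20,
    "negation_uncertainty": 0.15,
    "temporal_accuracy": 0.10,
    "size_measurement": 0.10,
    "clinical_significance": 0.10,
    "report_structure": 0.05,
}

def green_score(component_scores):
    # component_scores: dict mapping each component name above to a score in [0, 1]
    return sum(GREEN_WEIGHTS[name] * component_scores[name] for name in GREEN_WEIGHTS)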

Usage

Installation

pip install transformers torch peft accelerate bitsandbytes pillow

Basic Usage

import torch
from transformers import AutoTokenizer, AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image

# Load the fine-tuned model
base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
adapter_model_name = "leoyinn/qwen2.5vl-flare2025"

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
processor = AutoProcessor.from_pretrained(base_model_name)

# Load the 4-bit quantized base model
# (quantization_config replaces the deprecated load_in_4bit argument)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
base_model = AutoModelForVision2Seq.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config
)

# Load the fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_name)

# Prepare input
image = Image.open("medical_image.jpg")
prompt = "Describe the medical findings in this image."

# Format the prompt with the chat template so the image placeholder tokens are inserted
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process and generate
inputs = processor(
    images=[image],
    text=[text],
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
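
Optionally, the adapter can be merged into a full-precision copy of the base model for inference without the PEFT wrapper; merging requires loading the base without 4-bit quantization. A sketch:

import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

# Load the base model in half precision (no 4-bit quantization) so the LoRA weights can be merged
base_fp16 = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base_fp16, "leoyinn/qwen2.5vl-flare2025").merge_and_unload()
merged.save_pretrained("./qwen2.5vl-flare2025-merged")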

Limitations and Ethical Considerations

Limitations

  • Model outputs may contain inaccuracies and should be verified by medical professionals
  • Performance may vary across different medical imaging modalities and populations
  • Training data may contain biases present in medical literature and datasets
  • Model has not been validated in clinical settings

Intended Use

  • Medical education and training
  • Research in medical AI and computer vision
  • Development of clinical decision support tools (with proper validation)
  • Academic research in multimodal medical AI

Out-of-Scope Use

  • Direct clinical diagnosis without physician oversight
  • Treatment recommendations without medical professional validation
  • Use in emergency medical situations
  • Deployment in production clinical systems without extensive validation

Citation

If you use this model in your research, please cite:

@misc{qwen25vl-flare2025,
  title={Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis},
  author={Shuolin Yin},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/leoyinn/qwen2.5vl-flare2025}
}

@misc{qwen25vl-base,
  title={Qwen2.5-VL Technical Report},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct}
}

Model Details

  • Model Type: Vision-Language Model (VLM)
  • Architecture: Qwen2.5-VL with LoRA adapters
  • Parameters: ~7B base parameters + LoRA adapters
  • Precision: 4-bit quantized base model + full precision adapters
  • Framework: PyTorch, Transformers, PEFT

Contact

For questions or issues, please open an issue in the model repository or contact the authors.
