Model Card

Model Description

MedSwin-7B-SFT is a specialized 7B parameter language model for medical question-answering, instruction-following, and clinical reasoning. It was created by applying a rigorous, multi-stage Supervised Fine-Tuning (SFT) pipeline to the medalpaca/medalpaca-7b base model. The core innovation lies in its data augmentation methodology, which is designed to maximize data diversity and robustness while strictly preserving clinical fidelity and accuracy.

Intended Use

This model is intended for research purposes in the following domains:

  • AI-assisted medicine and clinical decision support research.
  • Biomedical natural language processing (NLP).
  • Exploration of robust instruction-tuning and knowledge distillation in specialized domains.
  • Generating high-quality, clinically-grounded synthetic data for further model training.

Training Data

The model was fine-tuned on a curated and augmented collection of medical QA datasets:

  • PubMedQA: Original and processed (map, u, l) variants for factoid and research-oriented questions.
  • HealthCareMagic & iCliniq: Real-world patient-doctor interactions from online portals.

Data Curation & Augmentation Pipeline

The training data underwent a sophisticated, reliability-first augmentation pipeline to enhance diversity and robustness while ensuring semantic and clinical integrity.

The pipeline comprises six stages (A-F); the purpose, methodology, and quality controls of each are summarized below.

Stage A. LLM-Based Paraphrasing
Purpose: Introduce syntactic diversity while locking medical terminology.
  • Multi-model approach (LLaMA-8B, Gemini) for redundancy.
  • Difficulty levels: Easy (lexical) vs. Hard (structural).
  • Terminology Lock: constrained decoding preserves drug names, dosages, and ICD/LOINC codes.
  • QC metrics: semantic similarity and Jaccard(term-locks) = 1.0.

Stage B. Back-Translation
Purpose: Induce discourse-level variation.
  • Pivot languages: stochastic selection from {es, de, vi, fr}.
  • Quality control: length-ratio bounds in [0.8, 1.2] and semantic similarity, with automatic retry on failure.

Stage C. Style Standardization
Purpose: Enforce a neutral, professional clinical tone.
  • Register enforcement: discourages colloquialisms.
  • Probabilistic rewrites: e.g., "will" → "is likely to".
  • Artifact removal: strips forum-specific sign-offs, greetings, and disclaimers.

Stage D. Multi-Variant Generation
Purpose: Cover diverse reasoning and presentation styles.
  • Answer variants: concise, detailed, clinician-focused, patient-friendly.
  • Question variants: clarifying, follow-up, symptom-centric, treatment-centric.
  • Cross-combinations: up to 9 variants per seed, capped by budget and uniqueness.

Stage E. Clinical Scenario Creation
Purpose: Build contextual robustness for multi-hop reasoning.
  • Simulated contexts: ED triage, routine visit, chronic care, caregiver perspective.
  • Effect: trains the model to consider contraindications, monitoring, and context-specific factors.

Stage F. Quality Assurance
Purpose: Ensure data purity and consistency.
  • F1. Data cleaning: PHI removal (email/phone/URL/IP), MD5-based deduplication, and filler removal.
  • F2. Validation: LLM consistency checks, max-length caps, and an English-only filter.
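
For illustration, the quality gates described above can be approximated with a short sketch. This is an assumed reconstruction rather than the released pipeline code: the sentence-transformers encoder, the 0.85 similarity threshold, and the helper names (passes_paraphrase_qc, passes_backtranslation_qc, dedupe) are placeholders for the actual implementation.

# Minimal, illustrative QC sketch (not the released pipeline code).
# Assumptions: sentence-transformers for semantic similarity, a 0.85 threshold,
# and a pre-extracted set of locked terms (drug names, dosages, ICD/LOINC codes).
import hashlib
import re

from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_similarity(a: str, b: str) -> float:
    emb = _encoder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def term_jaccard(source: str, candidate: str, locked_terms: set[str]) -> float:
    # Jaccard over the locked terminology found in each text; 1.0 means every
    # locked term in the source also appears in the rewrite, and none were added.
    src = {t for t in locked_terms if t.lower() in source.lower()}
    cand = {t for t in locked_terms if t.lower() in candidate.lower()}
    if not src and not cand:
        return 1.0
    return len(src & cand) / len(src | cand)

def passes_paraphrase_qc(source: str, candidate: str, locked_terms: set[str]) -> bool:
    # Stage A gate: semantic similarity above threshold and Jaccard(term-locks) == 1.0.
    return (semantic_similarity(source, candidate) >= 0.85
            and term_jaccard(source, candidate, locked_terms) == 1.0)

def passes_backtranslation_qc(source: str, candidate: str) -> bool:
    # Stage B gate: length-ratio bounds in [0.8, 1.2] plus semantic similarity.
    ratio = len(candidate.split()) / max(len(source.split()), 1)
    return 0.8 <= ratio <= 1.2 and semantic_similarity(source, candidate) >= 0.85

def dedupe(examples: list[str]) -> list[str]:
    # Stage F1: MD5-based deduplication on whitespace-normalized, lowercased text.
    seen, kept = set(), []
    for text in examples:
        digest = hashlib.md5(re.sub(r"\s+", " ", text.strip().lower()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept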

Output Format

All training data was formatted into a standardized SFT structure to facilitate clear instruction-following:

### Instruction:
{Task descriptor and/or user question with context}

### Input:
{Additional user question or context, if any}

### Output:
{The model's target response}

Each data point includes metadata tags for its augmentation source (para, bt, style, scenario), original source IDs, and quality control scores.
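
As an illustration, a record can be rendered into this template with a small helper. The function and metadata key names below (format_sft_example, aug, source_id, qc) are hypothetical and may not match the exact schema of the released training files.

# Illustrative formatter for the SFT template above; field and metadata names
# are assumptions, not the exact schema of the released data.
def format_sft_example(instruction: str, output: str, input_text: str = "",
                       aug_source: str = "para", source_id: str = "", qc_score: float = 1.0) -> dict:
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Output:\n"
    return {
        "text": prompt + output,
        "meta": {"aug": aug_source, "source_id": source_id, "qc": qc_score},
    }

example = format_sft_example(
    instruction="Summarize the key finding of the abstract for a clinician.",
    input_text="PubMedQA abstract text ...",
    output="The study found ...",
    aug_source="bt",
)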

Usage

You can load and use the model with the Hugging Face transformers library.

import torch
import transformers

model_id = "MedSwin/MedSwin-7B-SFT"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,  # the released weights are stored in BF16
    device_map="auto",  # place the model on a GPU automatically if one is available
)

# Format your input according to the training template
instruction = "Based on the provided context, what is the most likely diagnosis?"
context = "A 45-year-old male presents with acute, crushing substernal chest pain radiating to the left arm, associated with diaphoresis and nausea for the past hour."
formatted_prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Output:\n"

# Generate a response
sequences = pipeline(
    formatted_prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    eos_token_id=pipeline.tokenizer.eos_token_id,
)
print(sequences[0]['generated_text'])
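
Note that generated_text contains the prompt followed by the completion. One simple way to keep only the model's answer, given the single "### Output:" marker in the prompt above:

# Keep only the text generated after the "### Output:" marker.
response = sequences[0]["generated_text"].split("### Output:")[-1].strip()
print(response)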

Bias, Risks, and Limitations

The model inherits and may amplify biases present in its base model and training data. Key risks and limitations include:

  • Demographic Biases: Biases related to race, gender, age, or socioeconomic status based on patterns in the source data.
  • Clinical Biases: Potential over-representation of certain conditions, treatments, or clinical perspectives.
  • Factual Accuracy: While stringent controls were applied, the model is not a knowledge base and can generate incorrect or outdated medical information.
  • Safe Deployment: Use a Human-in-the-Loop (HITL) system for any real-world application. Outputs must be verified by a qualified professional.

Technical Specifications

  • Model Architecture: Based on LLaMA, fine-tuned via Supervised Fine-Tuning (SFT).
  • Model Size: 7 billion parameters.
  • Weight Format: Safetensors, BF16 tensors.
  • Input Format: Instruction-Input-Output structure.

The benchmark dataset, benchmark logs, and all evaluation metrics can be reviewed via the Benchmark Document Preview.
