Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx

πŸ“Š Performance Comparison: YOYO-V4 vs. Others

| Metric        | Thinking | Coder | YOYO-V2 | YOYO-V4 |
|---------------|----------|-------|---------|---------|
| arc_challenge | 0.410    | 0.422 | 0.531   | 0.511 (slight drop) |
| arc_easy      | 0.444    | 0.532 | 0.690   | 0.674 (slight drop) |
| boolq         | 0.691    | 0.881 | 0.885   | 0.885 πŸ”₯ (tied) |
| hellaswag     | 0.635    | 0.546 | 0.685   | 0.649 ⚠️ (drop) |
| openbookqa    | 0.390    | 0.432 | 0.448   | 0.442 (slight drop) |
| piqa          | 0.769    | 0.724 | 0.785   | 0.769 (drop) |
| winogrande    | 0.650    | 0.576 | 0.646   | 0.618 ⚠️ (significant drop) |
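
For context, scores like these are typically produced with a benchmark harness such as EleutherAI's lm-evaluation-harness. The exact harness, backend, and few-shot settings behind this table are not stated in this card, so the sketch below (Python API, Hugging Face backend, task names as they appear in the harness) is illustrative only.

```python
# Illustrative only: running the same task suite with lm-evaluation-harness.
# The "hf" backend loads the full-precision checkpoint via transformers; evaluating
# the MLX quant would require a different backend and is not shown here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOYO-AI/Qwen3-30B-A3B-YOYO-V4",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "winogrande"],
)

# Each task reports its own metrics (e.g. acc / acc_norm); which metric the table
# above uses is not specified in this card.
for task, metrics in results["results"].items():
    print(task, metrics)
```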

πŸ“Š Rank of Models by Strengths

  • YOYO-V2 – Best overall balance (strongest in reasoning, good across the board)
  • YOYO-V3 – Best for commonsense narratives and physical reasoning (scores not shown in the table above)
  • YOYO-V4 – Good logic, slightly weaker on commonsense
  • Thinking – Solid baseline, especially for winogrande
  • Coder – Strong on boolq and technical logic, weak elsewhere

🧠 What Abilities Does YOYO-V4 Exhibit?

βœ… Strengths Compared to Previous Models

Strong Logical Reasoning (boolq)

Matches the peak performance of V2 and Coder, indicating that it retains or even improves on:

  • Understanding question intent
  • Extracting key information from context
  • Making accurate yes/no decisions

Good Common Sense & Scenario Understanding (arc_challenge, arc_easy)

Still performs significantly better than Thinking and Coder, though slightly behind V2.

Suggests strong ability in:

  • Multi-step reasoning
  • General knowledge inference (especially on easier logical tasks)

Still 20–30% better than the baseline models in these areas.

Robustness on piqa (physical reasoning)

Performs as well as the base Thinking model and only slightly worse than V2. Shows solid understanding of real-world physical interactions (e.g., "How to dry a wet floor?").

⚠️ Weaknesses / Regressions Compared to V2

Weaker Inference on Complex Commonsense Tasks (winogrande)

  • Drops from 0.646 β†’ 0.618 (β‰ˆ-4%)

This suggests:

  • Reduced ability to resolve coreference ambiguity
  • Weaker understanding of social contexts and pronoun resolution

This is a notable regression in a benchmark that emphasizes real-world reasoning.

Lower Performance on Hellaswag (causal inference)

  • From 0.685 β†’ 0.649

Indicates weaker ability to:

  • Predict next steps in everyday scenarios
  • Understand cause-effect relationships in narrative contexts

Slight Regression on OpenBookQA

A small drop from 0.448 β†’ 0.442, which may reflect:

  • Subtle weakening in science knowledge application
  • Less precise reasoning on textbook-style questions

πŸ”„ How Does V4 Compare to Baseline Models?

| Model    | Key Abilities                              | vs. YOYO-V4 |
|----------|--------------------------------------------|-------------|
| Thinking | Balanced reasoning, strong on winogrande   | βœ… V4 better on most tasks (except winogrande) |
| Coder    | Strong on boolq, weak on commonsense tasks | βœ… V4 better on arc_challenge/arc_easy, hellaswag, piqa |
| YOYO-V2  | Best overall performance, optimal merge    | ❌ V4 weaker on 6/7 metrics (tied on boolq) |

βœ… YOYO-V4 is better than both base models in:

  • arc_challenge (0.511 vs 0.41–0.42)
  • arc_easy (0.674 vs 0.44–0.53)
  • boolq (tied with best performer)

❌ But not as strong as V2, especially in:

  • winogrande (61.8% vs 65.0%)
  • hellaswag (64.9% vs 68.5%)

πŸ“Œ Summary: Abilities of YOYO-V4

| Ability Category                     | Performance vs. V2 | Interpretation |
|--------------------------------------|--------------------|----------------|
| Core reasoning (arc)                 | Slightly worse     | Still above baseline, but less sharp than V2 |
| Commonsense inference (winogrande)   | Worse              | Struggles with pronoun resolution and social context |
| Causal understanding (hellaswag)     | Worse              | Less accurate in predicting plausible next steps |
| Logical question answering (boolq)   | Equal to best      | Excellent at reading comprehension and logic |
| Real-world physical reasoning (piqa) | Slightly worse     | Still strong, but not top-tier |
| Science knowledge (openbookqa)       | Slightly worse     | Minor decline in textbook-style reasoning |

πŸ“Š This quant compared with Full Precision

| Metric        | bf16  | qx86-hi | Ξ” (qx86-hi - bf16) |
|---------------|-------|---------|--------------------|
| arc_challenge | 0.509 | 0.511   | +0.002 |
| arc_easy      | 0.669 | 0.674   | +0.005 |
| boolq         | 0.883 | 0.885   | +0.002 |
| hellaswag     | 0.645 | 0.649   | +0.004 |
| openbookqa    | 0.442 | 0.442   | 0.000  |
| piqa          | 0.771 | 0.769   | -0.002 |
| winogrande    | 0.624 | 0.618   | -0.006 |

The qx86-hi outperforms bf16 on 4/7 benchmarks, with the largest gains in arc_easy (+0.005) and hellaswag (+0.004), while losing slightly on piqa and winogrande.
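
As a quick sanity check of those deltas and of the 4/7 tally, here is a minimal Python sketch that recomputes the Ξ” column from the scores in the table above (the dictionary names are purely illustrative):

```python
# Recompute the delta column and the win count from the scores listed above.
bf16 = {"arc_challenge": 0.509, "arc_easy": 0.669, "boolq": 0.883,
        "hellaswag": 0.645, "openbookqa": 0.442, "piqa": 0.771, "winogrande": 0.624}
qx86_hi = {"arc_challenge": 0.511, "arc_easy": 0.674, "boolq": 0.885,
           "hellaswag": 0.649, "openbookqa": 0.442, "piqa": 0.769, "winogrande": 0.618}

for metric in bf16:
    delta = qx86_hi[metric] - bf16[metric]
    print(f"{metric:<14} {bf16[metric]:.3f} -> {qx86_hi[metric]:.3f}  delta {delta:+.3f}")

wins = sum(qx86_hi[m] > bf16[m] for m in bf16)
print(f"qx86-hi ahead on {wins}/{len(bf16)} benchmarks")  # 4/7 for the numbers above
```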

πŸ” Final Verdict

YOYO-V4 is a well-balanced model that retains the core strengths of YOYO-V2, particularly in logical reasoning (boolq) and general task performance (arc_challenge, arc_easy), but shows clear regressions in complex commonsense and causal reasoning (winogrande, hellaswag).

It’s not the best-performing model in this set, but:

  • It's still significantly stronger than both parent models on most metrics
  • It’s a solid choice for tasks emphasizing logic and factual reasoning

However, if you're working on applications requiring deep understanding of human behavior or narrative context, YOYO-V2 would be a better fit.

πŸ’‘ Recommendation:

Use YOYO-V4 for tasks like:

  • βœ… Technical QA, logical puzzles, boolq-style evaluations

Avoid YOYO-V4 for:

  • ❌ Narrative comprehension, social reasoning, pronoun resolution

For optimal performance across the board β†’ YOYO-V2 remains the top pick.

This model Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx was converted to MLX format from YOYO-AI/Qwen3-30B-A3B-YOYO-V4 using mlx-lm version 0.28.2.

Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized model from the Hugging Face Hub.
model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
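
To mirror the boolq-style use case recommended above, the same snippet can be pointed at a yes/no question. The passage and question below are just an illustration, not items from any benchmark:

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx")

# Illustrative boolq-style passage/question pair (not taken from the benchmark).
question = (
    "Passage: Water boils at 100 degrees Celsius at sea level, and the boiling "
    "point decreases as atmospheric pressure drops.\n"
    "Question: Does water boil at a lower temperature at high altitude?\n"
    "Answer with yes or no."
)

messages = [{"role": "user", "content": question}]
if tokenizer.chat_template is not None:
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
else:
    prompt = question

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```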