Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx
📊 Performance Comparison: YOYO-V4 vs. Others

| Metric | Thinking | Coder | YOYO-V2 | YOYO-V4 |
|---|---|---|---|---|
| arc_challenge | 0.410 | 0.422 | 0.531 | 0.511 ↓ (slight drop) |
| arc_easy | 0.444 | 0.532 | 0.690 | 0.674 ↓ (slight drop) |
| boolq | 0.691 | 0.881 | 0.885 | 0.885 🔥 (tied) |
| hellaswag | 0.635 | 0.546 | 0.685 | 0.649 ⚠️ (drop) |
| openbookqa | 0.390 | 0.432 | 0.448 | 0.442 (slight drop) |
| piqa | 0.769 | 0.724 | 0.785 | 0.769 (drop) |
| winogrande | 0.650 | 0.576 | 0.646 | 0.618 ⚠️ (significant drop) |
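To sanity-check the ranking that follows, here is a minimal Python sketch that tabulates the scores above and reports each metric's best model. The scores are hard-coded from the table; the dictionary layout is purely illustrative and not part of any evaluation harness.

```python
# Benchmark scores copied from the comparison table above.
scores = {
    "arc_challenge": {"Thinking": 0.410, "Coder": 0.422, "YOYO-V2": 0.531, "YOYO-V4": 0.511},
    "arc_easy":      {"Thinking": 0.444, "Coder": 0.532, "YOYO-V2": 0.690, "YOYO-V4": 0.674},
    "boolq":         {"Thinking": 0.691, "Coder": 0.881, "YOYO-V2": 0.885, "YOYO-V4": 0.885},
    "hellaswag":     {"Thinking": 0.635, "Coder": 0.546, "YOYO-V2": 0.685, "YOYO-V4": 0.649},
    "openbookqa":    {"Thinking": 0.390, "Coder": 0.432, "YOYO-V2": 0.448, "YOYO-V4": 0.442},
    "piqa":          {"Thinking": 0.769, "Coder": 0.724, "YOYO-V2": 0.785, "YOYO-V4": 0.769},
    "winogrande":    {"Thinking": 0.650, "Coder": 0.576, "YOYO-V2": 0.646, "YOYO-V4": 0.618},
}

# YOYO-V2 leads most metrics, YOYO-V4 ties it on boolq,
# and Thinking takes winogrande.
for metric, by_model in scores.items():
    best = max(by_model.values())
    winners = [m for m, s in by_model.items() if s == best]
    print(f"{metric:13s} best: {' / '.join(winners)} ({best:.3f})")
```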
🏆 Rank of Models by Strengths
- YOYO-V2 → Best overall balance (strongest in reasoning, good across the board)
- YOYO-V3 → Best for commonsense narratives and physical reasoning
- YOYO-V4 → Good logic, slightly weaker on commonsense
- Thinking → Solid baseline, especially for winogrande
- Coder → Strong on boolq and technical logic, weak elsewhere
🧠 What Abilities Does YOYO-V4 Exhibit?
✅ Strengths Compared to Previous Models
Strong Logical Reasoning (boolq)
Matches V2's top score (0.885) and slightly exceeds Coder (0.881), indicating that it retains or even improves on:
- Understanding question intent
- Extracting key information from context
- Making accurate yes/no decisions
Good Common Sense & Scenario Understanding (arc_challenge, arc_easy)
Still performs significantly better than Thinking and Coder, though slightly behind V2. Suggests strong ability in:
- Multi-step reasoning
- General knowledge inference (especially on easier logical tasks)

Still 20–30% better than the baseline models in these areas.

Robustness on piqa (physical reasoning)
Performs as well as the base Thinking model and only slightly worse than V2. Shows solid understanding of real-world physical interactions (e.g., "How to dry a wet floor?").
⚠️ Weaknesses / Regressions Compared to V2

Weaker Inference on Complex Commonsense Tasks (winogrande)
- Drops from 0.646 → 0.618 (≈ −4%)

This suggests:
- Reduced ability to resolve coreference ambiguity (e.g., deciding what "it" refers to in "The trophy doesn't fit in the suitcase because it is too big")
- Weaker understanding of social contexts and pronoun resolution

This is a notable regression in a benchmark that emphasizes real-world reasoning.
Lower Performance on hellaswag (causal inference)
- Drops from 0.685 → 0.649

Indicates weaker ability to:
- Predict next steps in everyday scenarios
- Understand cause-effect relationships in narrative contexts
Slight Regression on OpenBookQA
- A small drop from 0.448 → 0.442

This may reflect:
- Subtle weakening in science knowledge application
- Less precise reasoning on textbook-style questions
📈 How Does V4 Compare to Baseline Models?

| Model | Key Abilities | vs. YOYO-V4 |
|---|---|---|
| Thinking | Balanced reasoning, strong on winogrande | V4 better on most tasks; worse only on winogrande |
| Coder | Strong on boolq, weak on commonsense tasks | V4 better on arc_challenge, arc_easy, hellaswag, piqa |
| YOYO-V2 | Best overall performance, optimal merge | V4 weaker on 6/7 metrics (tied on boolq) |
✅ YOYO-V4 is better than both base models in:
- arc_challenge (0.511 vs 0.41–0.42)
- arc_easy (0.674 vs 0.44–0.53)
- boolq (tied with the best performer)

❌ But not as strong as V2, especially in:
- winogrande (61.8% vs 65.0%)
- hellaswag (64.9% vs 68.5%)
📋 Summary: Abilities of YOYO-V4

| Ability Category | Performance vs. V2 | Interpretation |
|---|---|---|
| Core reasoning (arc) | Slightly worse | Still above baseline, but less sharp than V2 |
| Commonsense inference (winogrande) | Worse | Struggles with pronoun resolution and social context |
| Causal understanding (hellaswag) | Worse | Less accurate in predicting plausible next steps |
| Logical question answering (boolq) | Equal to best | Excellent at reading comprehension and logic |
| Real-world physical reasoning (piqa) | Slightly worse | Still strong, but not top-tier |
| Science knowledge (OpenBookQA) | Slightly worse | Minor decline in textbook-style reasoning |
📊 This Quant Compared with Full Precision

| Metric | bf16 | qx86-hi | Δ (qx86-hi − bf16) |
|---|---|---|---|
| arc_challenge | 0.509 | 0.511 | +0.002 |
| arc_easy | 0.669 | 0.674 | +0.005 |
| boolq | 0.883 | 0.885 | +0.002 |
| hellaswag | 0.645 | 0.649 | +0.004 |
| openbookqa | 0.442 | 0.442 | 0.000 |
| piqa | 0.771 | 0.769 | −0.002 |
| winogrande | 0.624 | 0.618 | −0.006 |

The qx86-hi quant outperforms bf16 on 4/7 benchmarks, with the largest gains on arc_easy (+0.005) and hellaswag (+0.004), while losing slightly on piqa and winogrande.
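The per-metric deltas and the 4/7 count are easy to re-derive; a quick illustrative check (values hard-coded from the table above, not read from any harness output):

```python
# bf16 vs qx86-hi scores copied from the table above.
bf16 = {"arc_challenge": 0.509, "arc_easy": 0.669, "boolq": 0.883,
        "hellaswag": 0.645, "openbookqa": 0.442, "piqa": 0.771,
        "winogrande": 0.624}
qx86_hi = {"arc_challenge": 0.511, "arc_easy": 0.674, "boolq": 0.885,
           "hellaswag": 0.649, "openbookqa": 0.442, "piqa": 0.769,
           "winogrande": 0.618}

# Count benchmarks where the quant beats full precision.
wins = 0
for metric in bf16:
    delta = qx86_hi[metric] - bf16[metric]
    wins += delta > 0
    print(f"{metric:13s} Δ = {delta:+.3f}")
print(f"qx86-hi ahead on {wins}/{len(bf16)} benchmarks")  # -> 4/7
```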
🏁 Final Verdict

YOYO-V4 is a well-balanced model that retains the core strengths of YOYO-V2, particularly in logical reasoning (boolq) and general task performance (arc_challenge, arc_easy), but shows clear regressions in complex commonsense and causal reasoning (winogrande, hellaswag).

It's not the best-performing model in this set, but:
- It is still significantly stronger than both parent models on most metrics
- It is a solid choice for tasks emphasizing logic and factual reasoning

However, if you're working on applications requiring deep understanding of human behavior or narrative context, YOYO-V2 would be a better fit.
💡 Recommendation:

Use YOYO-V4 for:
- ✅ Technical QA, logical puzzles, boolq-style evaluations

Avoid YOYO-V4 for:
- ❌ Narrative comprehension, social reasoning, pronoun resolution

For optimal performance across the board, YOYO-V2 remains the top pick.
This model Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx was converted to MLX format from YOYO-AI/Qwen3-30B-A3B-YOYO-V4 using mlx-lm version 0.28.2.

Use with mlx

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized model and its tokenizer from the Hugging Face Hub
model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx")

prompt = "hello"

# Apply the model's chat template when one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
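For the boolq-style tasks this card recommends, you may want to bound output length. A minimal sketch, assuming the `max_tokens` keyword accepted by `generate()` in recent mlx-lm releases; the prompt is just an example:

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx")

# A boolq-style yes/no question, the kind of task this card recommends.
messages = [{"role": "user", "content": "Answer yes or no: is 17 a prime number?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Cap the generation length with max_tokens.
response = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
```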