Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx

πŸ“Š Performance Comparison: YOYO-V4 vs. Others

| Metric        | Thinking | Coder | YOYO-V2 | YOYO-V4 |
|---------------|----------|-------|---------|---------|
| arc_challenge | 0.410    | 0.422 | 0.531   | 0.511 (slight drop) |
| arc_easy      | 0.444    | 0.532 | 0.690   | 0.674 (slight drop) |
| boolq         | 0.691    | 0.881 | 0.885   | 0.885 πŸ”₯ (tied) |
| hellaswag     | 0.635    | 0.546 | 0.685   | 0.649 ⚠️ (drop) |
| openbookqa    | 0.390    | 0.432 | 0.448   | 0.442 (slight drop) |
| piqa          | 0.769    | 0.724 | 0.785   | 0.769 (drop) |
| winogrande    | 0.650    | 0.576 | 0.646   | 0.618 ⚠️ (significant drop) |
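
For context, scores like these are typically produced with a benchmark harness such as EleutherAI's lm-evaluation-harness. The exact harness, backend, and few-shot settings behind this table are not stated in this card, so the sketch below (Python API, Hugging Face backend, task names as they appear in the harness) is illustrative only.

```python
# Illustrative only: running the same task suite with lm-evaluation-harness.
# The "hf" backend loads the full-precision checkpoint via transformers; evaluating
# the MLX quant would require a different backend and is not shown here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOYO-AI/Qwen3-30B-A3B-YOYO-V4",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "winogrande"],
)

# Each task reports its own metrics (e.g. acc / acc_norm); which metric the table
# above uses is not specified in this card.
for task, metrics in results["results"].items():
    print(task, metrics)
```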

πŸ“Š Rank of Models by Strengths

  • YOYO-V2 – Best overall balance (strongest in reasoning, good across the board)
  • YOYO-V3 – Best for commonsense narratives and physical reasoning (scores not shown in the table above)
  • YOYO-V4 – Good logic, slightly weaker on commonsense
  • Thinking – Solid baseline, especially for winogrande
  • Coder – Strong on boolq and technical logic, weak elsewhere

🧠 What Abilities Does YOYO-V4 Exhibit?

βœ… Strengths Compared to Previous Models

Strong Logical Reasoning (boolq)

Matches the peak performance of V2 and Coder, indicating that it retains or even improves on:

  • Understanding question intent
  • Extracting key information from context
  • Making accurate yes/no decisions

Good Common Sense & Scenario Understanding (arc_challenge, arc_easy)

Still performs significantly better than Thinking and Coder, though slightly behind V2.

Suggests strong ability in:

  • Multi-step reasoning
  • General knowledge inference (especially on easier logical tasks)

Still 20–30% better than the baseline models in these areas.

Robustness on piqa (physical reasoning)

Performs as well as the base Thinking model and only slightly worse than V2. Shows solid understanding of real-world physical interactions (e.g., "How to dry a wet floor?").

⚠️ Weaknesses / Regressions Compared to V2

Weaker Inference on Complex Commonsense Tasks (winogrande)

  • Drops from 0.646 β†’ 0.618 (β‰ˆ-4%)

This suggests:

  • Reduced ability to resolve coreference ambiguity
  • Weaker understanding of social contexts and pronoun resolution

This is a notable regression in a benchmark that emphasizes real-world reasoning.

Lower Performance on Hellaswag (causal inference)

  • From 0.685 β†’ 0.649

Indicates weaker ability to:

  • Predict next steps in everyday scenarios
  • Understand cause-effect relationships in narrative contexts

Slight Regression on OpenBookQA

A small drop from 0.448 β†’ 0.442, which may reflect:

  • Subtle weakening in science knowledge application
  • Less precise reasoning on textbook-style questions

πŸ”„ How Does V4 Compare to Baseline Models?

| Model    | Key Abilities                              | vs. YOYO-V4 |
|----------|--------------------------------------------|-------------|
| Thinking | Balanced reasoning, strong on winogrande   | βœ… V4 better on most tasks (except winogrande) |
| Coder    | Strong on boolq, weak on commonsense tasks | βœ… V4 better on arc_challenge/arc_easy, hellaswag, piqa |
| YOYO-V2  | Best overall performance, optimal merge    | ❌ V4 weaker on 6/7 metrics (tied on boolq) |

βœ… YOYO-V4 is better than both base models in:

  • arc_challenge (0.511 vs 0.41–0.42)
  • arc_easy (0.674 vs 0.44–0.53)
  • boolq (tied with best performer)

❌ But not as strong as V2, especially in:

  • winogrande (61.8% vs 65.0%)
  • hellaswag (64.9% vs 68.5%)

πŸ“Œ Summary: Abilities of YOYO-V4

| Ability Category                     | Performance vs. V2 | Interpretation |
|--------------------------------------|--------------------|----------------|
| Core reasoning (arc)                 | Slightly worse     | Still above baseline, but less sharp than V2 |
| Commonsense inference (winogrande)   | Worse              | Struggles with pronoun resolution and social context |
| Causal understanding (hellaswag)     | Worse              | Less accurate in predicting plausible next steps |
| Logical question answering (boolq)   | Equal to best      | Excellent at reading comprehension and logic |
| Real-world physical reasoning (piqa) | Slightly worse     | Still strong, but not top-tier |
| Science knowledge (openbookqa)       | Slightly worse     | Minor decline in textbook-style reasoning |

πŸ“Š This quant compared with Full Precision

| Metric        | bf16  | qx86-hi | Ξ” (qx86-hi - bf16) |
|---------------|-------|---------|--------------------|
| arc_challenge | 0.509 | 0.511   | +0.002 |
| arc_easy      | 0.669 | 0.674   | +0.005 |
| boolq         | 0.883 | 0.885   | +0.002 |
| hellaswag     | 0.645 | 0.649   | +0.004 |
| openbookqa    | 0.442 | 0.442   | 0.000  |
| piqa          | 0.771 | 0.769   | -0.002 |
| winogrande    | 0.624 | 0.618   | -0.006 |

The qx86-hi outperforms bf16 on 4/7 benchmarks, with the largest gains in arc_easy (+0.005) and hellaswag (+0.004), while losing slightly on piqa and winogrande.
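
As a quick sanity check of those deltas and of the 4/7 tally, here is a minimal Python sketch that recomputes the Ξ” column from the scores in the table above (the dictionary names are purely illustrative):

```python
# Recompute the delta column and the win count from the scores listed above.
bf16 = {"arc_challenge": 0.509, "arc_easy": 0.669, "boolq": 0.883,
        "hellaswag": 0.645, "openbookqa": 0.442, "piqa": 0.771, "winogrande": 0.624}
qx86_hi = {"arc_challenge": 0.511, "arc_easy": 0.674, "boolq": 0.885,
           "hellaswag": 0.649, "openbookqa": 0.442, "piqa": 0.769, "winogrande": 0.618}

for metric in bf16:
    delta = qx86_hi[metric] - bf16[metric]
    print(f"{metric:<14} {bf16[metric]:.3f} -> {qx86_hi[metric]:.3f}  delta {delta:+.3f}")

wins = sum(qx86_hi[m] > bf16[m] for m in bf16)
print(f"qx86-hi ahead on {wins}/{len(bf16)} benchmarks")  # 4/7 for the numbers above
```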

πŸ” Final Verdict

YOYO-V4 is a well-balanced model that retains the core strengths of YOYO-V2, particularly in logical reasoning (boolq) and general task performance (arc_challenge, arc_easy), but shows clear regressions in complex commonsense and causal reasoning (winogrande, hellaswag).

It’s not the best-performing model in this set, but:

  • It's still significantly stronger than both parent models on most metrics
  • It’s a solid choice for tasks emphasizing logic and factual reasoning

However, if you're working on applications requiring deep understanding of human behavior or narrative context, YOYO-V2 would be a better fit.

πŸ’‘ Recommendation:

Use YOYO-V4 for tasks like:

  • βœ… Technical QA, logical puzzles, boolq-style evaluations

Avoid YOYO-V4 for:

  • ❌ Narrative comprehension, social reasoning, pronoun resolution

For optimal performance across the board β†’ YOYO-V2 remains the top pick.

This model Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx was converted to MLX format from YOYO-AI/Qwen3-30B-A3B-YOYO-V4 using mlx-lm version 0.28.2.

Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized model from the Hugging Face Hub.
model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
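
To mirror the boolq-style use case recommended above, the same snippet can be pointed at a yes/no question. The passage and question below are just an illustration, not items from any benchmark:

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V4-qx86-hi-mlx")

# Illustrative boolq-style passage/question pair (not taken from the benchmark).
question = (
    "Passage: Water boils at 100 degrees Celsius at sea level, and the boiling "
    "point decreases as atmospheric pressure drops.\n"
    "Question: Does water boil at a lower temperature at high altitude?\n"
    "Answer with yes or no."
)

messages = [{"role": "user", "content": question}]
if tokenizer.chat_template is not None:
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
else:
    prompt = question

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```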