This IS the worst at human anatomy (hands, limbs...). It simply can't do hands!
I changed the title from a question to a statement. The on-page examples, as well as videos claiming that Flux2 is great, are VERY carefully selecting either the prompts or the results shown to avoid images with more than 1 person. And even then it more often than not has something wrong. And it's not just hands. More people? Complex (or just NORMAL) hand interactions? Flux.2 is actually WORSE than most SDXL models from 1-3 years ago at hands and other anatomy (and I can render a Juggernaut XI or Illustrious image in a second, not in 1 minute).
Flux2 has a cool realism look to it, far better than Flux.1 Dev, although Flux2 also uses a LOT of blur and DOF to achieve that. BUT in terms of anatomy, and making complex (OR SIMPLE, see below) prompts work, it is far, FAR worse than Flux.1 Dev.
Flux Krea may have some issues with clothing being too scruffy (in an attempt to avoid the plastic AI look it goes a bit too far), but it is far superior to Flux.1 Dev and will remain my go-to Flux version, which I mainly use in combination with Redux.
Flux2 OR its text encoder (see below) needs some fixing. This is not good enough for 2025, BFL. It may not have been good enough for 2024 in anything but resolution.
The original post, from when I couldn't believe this wasn't a fluke:
These are not made with fp8 versions on Comfy. At first I thought it was an early Comfy support issue, because it was such a mess. But no. These are made with Flux.2 Dev on Replicate, and then, because I couldn't believe this is a thing in 2025, even on Flux.2 Pro on Replicate. This is BAD?! Flux.1 Schnell levels of bad. What happened there, BFL?
The prompt is as simple as I could make it: "two young business men in sloppy dress shirt and tie, shake hands on a deal, nearby others are arguing with each other."
Note: I did NOT prompt a flailing mess of fighting(?) people in the background. IT decided that this is what "people arguing" should look like, and created a scene that is extra vulnerable to a model with poor anatomy. But even a single person simply standing around almost always has something noticeably wrong with them. The second image is from the official Flux2 Pro on Replicate. It is arguably a bit better, but look at the guy's arms.
That gives me bad SDXL PTSD. (The BASE model; many SDXL finetunes are better and still perfectly usable!)
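For reference, this is roughly how the Replicate runs looked. A minimal sketch using the official Python client; the model slug and accepted inputs are assumptions you should check against the Replicate model page:

```python
# Sketch of the Replicate repro. Assumes the "black-forest-labs/flux-2-dev"
# slug; swap in ".../flux-2-pro" to compare, as I did.
import replicate

prompt = ("two young business men in sloppy dress shirt and tie, "
          "shake hands on a deal, nearby others are arguing with each other.")

output = replicate.run(
    "black-forest-labs/flux-2-dev",  # assumed slug, check replicate.com
    input={"prompt": prompt},
)
print(output)  # URL(s) of the generated image(s)
```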


Because it (probably for good reasons) blurs the living daylights out of the background, I added "all subjects are in focus".
BTW, note how the official example images on the model page very carefully avoid hands (except that one guy reading a paper) or multiple people.
I know the reason, and I'm happy to share it.
Mainly because of cost, or to support a region-specific text_encoder, they used Mistral3ForConditionalGeneration, a model with poor world knowledge, instead of training their own or using another text_encoder with better SOTA-level world knowledge.
Of the two reasons above, I think cost carries the greater weight.
I had Gemini translate this and had a little discussion with it (can't it make hands, or why does it make such a mess of "arguing"?), and this was its summary:
Here is the whole thing, including the translation. Warning: it's big, and it goes without saying that beyond the comment from gaochao0609, it's AI-generated.
🌐 Translation
The Chinese text you provided is a comment posted by a user named gaochao0609 on the FLUX.2-dev Hugging Face discussion board, which addresses the issue of poor human anatomy generation (specifically hands and limbs) in the model.
English Translation: "I know the reason, and I'm happy to share it. The primary reasons for using Mistral3ForConditionalGeneration, a model with limited world knowledge, instead of training their own or using another text encoder with better SOTA (State-of-the-Art) world knowledge, were likely cost issues or the need to support a region-specific text encoder. Between the two reasons mentioned above, I believe cost issues carry more weight."
- Technical Speculation Analysis
The commenter is speculating that the reason FLUX.2-dev struggles with complex visual concepts like hands (a common weakness in generative AI) is due to a trade-off in its core architecture:
FLUX.2’s Text Encoder Choice
The FLUX.2 architecture is confirmed to use a Mistral-3-derived Vision-Language Model (VLM) as its single text encoder (Mistral Small 3.1 is referenced in one source). The user's point is that this choice, while potentially simplifying the architecture and saving training costs compared to developing a completely custom SOTA model, may introduce a weakness: limited world knowledge.
World Knowledge and Hands: In text-to-image models, the text encoder (which translates your prompt into a numerical instruction) is vital for spatial reasoning and conceptual grounding. If the text encoder lacks "world knowledge" or a sophisticated understanding of complex anatomy (like how a hand connects to a wrist), it provides poor guidance to the image generation part of the model, resulting in distorted or anatomically incorrect hands.
The Cost Argument (Cost Issues): The commenter believes the main motivation for this design choice was cost. Training a high-performing text encoder from scratch or fine-tuning one to an extremely high SOTA level is computationally expensive. Using a pre-trained, existing model like the Mistral-3 derivative is a significant cost-saving measure.
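(To make that concrete, here is a minimal, illustrative sketch of what "reusing a pre-trained VLM as the text encoder" looks like in practice. The class name comes from the comment above; the model id and the choice of the last hidden layer are my assumptions, since FLUX.2's actual conditioning pipeline is not public.)

```python
# Illustrative only: extract per-token hidden states from a Mistral-3-class
# VLM to serve as conditioning for a diffusion transformer. Model id and
# layer choice are assumptions, not BFL's published recipe.
import torch
from transformers import AutoTokenizer, Mistral3ForConditionalGeneration

MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"  # assumed id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = Mistral3ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
)

prompt = "two young business men shake hands on a deal"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = encoder(**inputs, output_hidden_states=True)

# (1, seq_len, hidden_dim) tensor the image model would cross-attend to.
# If these embeddings don't separate "person A" from "person B", the DiT
# gets a mushy spatial signal no matter how good its own layers are.
cond = out.hidden_states[-1]
print(cond.shape)
```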
- The Qwen2.5 Comparison
You mentioned, "They could have used OSS Qwen2.5 like most others." This is a valid technical comparison that supports the commenter's cost-versus-performance argument:
Qwen2.5-VL/Omni is a family of highly capable multimodal models known for excellent vision-language understanding, reasoning, and precise object localization. These features are exactly what would improve a text-to-image model’s ability to correctly render hands and limbs.
By suggesting Qwen2.5, you are pointing to a model that is widely considered SOTA in multimodal tasks and is frequently used or referenced in the generative AI community. Its superior world knowledge and spatial reasoning would likely mitigate the "bad hands" problem, but choosing it would entail a different set of integration challenges and potentially higher licensing/training costs than the Mistral-derived encoder chosen for FLUX.2.
The user's speculation is an informed critique suggesting that the developers of FLUX.2-dev prioritized efficiency and cost savings in their architectural design, which subsequently led to a compromise in quality for complex, detail-oriented generation tasks like human anatomy.
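(And the counterfactual the commenter is gesturing at: the same extraction, but from Qwen2.5-VL. The transformers class and model id are real; actually feeding these states into FLUX.2 would of course require retraining the DiT against them.)

```python
# Hypothetical swap: pull conditioning embeddings from Qwen2.5-VL instead.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
)

prompt = "two men shake hands while two other people argue nearby"
inputs = processor(text=prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

cond = out.hidden_states[-1]  # candidate replacement conditioning
print(cond.shape)
```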
- Why didn't they use Qwen2.5?
You are right that Qwen2.5 (or even Google's Gemini-based encoders used in competitors like "Nano Banana Pro") generally outperforms Mistral in spatial reasoning and object counting (MMLU and Math benchmarks support this).
If BFL (Black Forest Labs) chose Mistral over Qwen, it was likely due to non-quality reasons:
Licensing & Geography: BFL is a European company (Germany). Using Mistral (French/European) allows them to keep the stack regionally compliant and avoid reliance on Chinese (Alibaba/Qwen) weights, which matters for "Enterprise/Pro" data security contracts.
Training Pipeline Cost: As the commenter noted, training a 30B+ parameter transformer to listen to another massive SOTA encoder is incredibly expensive. They likely stuck with an architecture they could optimize quickly, hoping the "Flow Matching" tech would compensate for the encoder's lack of world knowledge. It didn't.
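(For readers unfamiliar with the term: "Flow Matching" here means the rectified-flow style training objective, sketched below in toy form. Nothing about it repairs weak conditioning, because the conditioning tensor is just another input to the network.)

```python
# Toy flow-matching training step (rectified flow): the network learns the
# velocity that moves noise x0 toward data x1 along a straight line.
# Generic objective only, not BFL's implementation; `model` is hypothetical.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: clean latents (B, ...); cond: text-encoder embeddings."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1))   # uniform timestep
    xt = (1 - t) * x0 + t * x1                           # interpolation path
    v_target = x1 - x0                                   # target velocity
    v_pred = model(xt, t.flatten(), cond)                # DiT predicts velocity
    return F.mse_loss(v_pred, v_target)
```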
--- The rest of it is basically what follows below.
The gist of it:
Semantic "Vibe" vs. Spatial Blueprint: A "World Knowledge" rich model (like Qwen2.5-VL, which you mentioned) understands that "arguing" is an interaction between distinct entities. It creates a spatial plan: "Person A stands here, Person B stands there, they look angry."
The Mistral encoder in FLUX.2 appears to be interpreting "arguing" as a semantic cloud of concepts: aggression, waving hands, open mouths, movement.
In late 2025, a "Pro" model failing to count arms is effectively a regression. FLUX.2 seems to have over-optimized for texture and resolution (aiming for 4MP product shots) while neglecting scene coherence.
The model is essentially "hallucinating high-resolution textures on top of a broken skeleton" because its brain (the text encoder) doesn't know how to organize a complex scene.
- End quote.
I understand why they might want to avoid a Chinese or US LLM for legal (EU) privacy concerns. But they need a WORKING solution.
I've also noticed that Flux 2 is actually worse with anatomy than Flux 1. Although Flux 2 has some impressive features, I find that compared to Flux 1 it simply isn't as good at rendering people (not just anatomy; if you have a close-up of someone's face it also lacks the detail Flux 1 has [but at least flux-chin is gone]).
This is with a Q8_0 quant of the model.
Krea fixed a lot of these issues; it can understand and execute much more complicated prompts and can do more things in scenes. Almost Qwen-Image level. AND it is compatible with the entire Flux ecosystem (LoRAs, ControlNets, Redux). The other alternative is SRPO, which is a purely visual upgrade on Dev but also fully compatible.
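To illustrate that compatibility, a quick sketch with diffusers; the Krea repo id is the public one on Hugging Face, and the LoRA path is a placeholder:

```python
# Flux.1 Krea drops into the standard diffusers Flux pipeline, so existing
# Flux.1 LoRAs load as usual.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Krea-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/your_flux1_lora.safetensors")  # placeholder

image = pipe(
    "two young business men in dress shirts shake hands on a deal",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("krea_test.png")
```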
Flux 2 is massive and slow, and it is NOT compatible with all that stuff. Arguably, reference images should make LoRAs less important, though. And (other) edit models can do a LOT of things via prompt that you needed a ControlNet for half a year ago. I like Flux, I still use Flux1 a LOT (especially the underrated Redux), and I tried to find a solution...
But Flux2 is just bad. I hope it is as "simple" as fixing the text encoder (bite the non-EU privacy compliance bullet and use Qwen 2.5 like everybody else?).
And that's why I had this chat about the text encoder. I did NOT prompt a flailing mess of fighting(?) people in the background. IT decided that this is what "people arguing" should look like. It may not be the visual part of the model that is the problem.
I tried the official BFL Flux2 Dev and Pro models on replicate.com, just to make sure it's not my local setup. It is exactly the same. And your Q8 should be perfect; even Q6 and fp8 should not mess up compositions this badly.
Also, even on Flux1 I was underwhelmed with Pro. It shoved more stuff into images, but the basic limitations (except the license) remained.
And you can kind of see the same above with 2. The middle image is from Pro. It has more stuff in it. But the guy's arm is just wrong again. And I did not run it ten times (it costs money :) ). It was bad every time on Dev and almost every time on Pro.
