Testing Twelve AI Image Models with Marcel Proust's Madeleine (Or: How We Chose Hunyuan for Hexagram 24)

November 18, 2025 · 18 min read

Upgrading 8-Bit Oracle's Hexagram 24: Return with Proust's famous involuntary memory scene, then systematically testing 12 diverse AI image models through fal.ai to find which one actually captures that precise moment when past and present collapse together.

Here's the thing about upgrading a hexagram commentary: you want an image that captures the essence of the concept, not just a literal representation. Hexagram 24 is Return (復, Fù)—the idea of cyclical patterns, things coming back around, the winter solstice when yang energy begins its return after maximum darkness. It's about renewal, natural rhythms, the inevitable return of what was temporarily absent.

And there's no better literary embodiment of "return" than Marcel Proust's madeleine moment from In Search of Lost Time—that famous scene where tasting a tea-soaked madeleine cake triggers an involuntary flood of childhood memories. The past doesn't just come back; it crashes into the present with sensory immediacy, collapsing time itself.¹

So we needed an image: Proust, eating the madeleine, experiencing that moment of involuntary memory. But rendered in our signature tech-noir aesthetic (phosphor green CRT glow, 1970s film grain, deep blacks). And since we're shipping a production feature for 8-Bit Oracle, we couldn't just pick a model arbitrarily. We had to test.

This is the story of how we systematically evaluated 12 diverse AI image generation models with the same prompt, analyzed their architectural differences and output quality, and ultimately chose Tencent's Hunyuan v3—not because it was the most popular or the fastest, but because it nailed the exact expression we needed: that instant where memory returns unbidden.

The Prompt

We used two prompts: a detailed user prompt describing the scene, plus a system prompt enforcing the tech-noir aesthetic.

User Prompt:

Portrait of Marcel Proust, 1900s French writer with dark mustache and formal suit, eating a madeleine cake. Above his head floats a glowing phosphor green thought bubble containing childhood memory fragments: sunlit garden, church steeple, boy silhouette. Deep black background with film grain. Tech-noir aesthetic with 1970s Kodak film quality, high contrast, dramatic lighting. His expression captures the moment of involuntary memory returning.

System Prompt (Tech-Noir Transformation):

1970s FILM SCREENGRAB - TECH-NOIR AESTHETIC: Transform into a tech-noir film still with phosphor green (#2EBD2E) for main subjects/CRT glow, amber/orange (#FFA500) for warm lighting, deep black (#000000) for shadows. Authentic 1970s-1980s film aesthetic with heavy grain, CRT scanlines, phosphor bloom. Organic scanned film negative quality.

The test was simple: send this exact prompt to 12 different models via fal.ai, log everything to our ai_provider_requests database, save outputs with ISO timestamps to prevent file clobbering, and compare results.
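For reference, that loop looks roughly like this. A minimal sketch, assuming the @fal-ai/client package; the endpoint IDs are illustrative, and the real harness (scripts/testMultipleModels.ts) adds the database logging and file saving described below:

import { fal } from "@fal-ai/client";

// Illustrative subset of the 12 endpoints (IDs follow fal.ai's naming scheme)
const MODELS = [
  "fal-ai/flux/dev",
  "fal-ai/hunyuan-image/v3/text-to-image",
  // ...the other ten
];

const PROMPT = "Portrait of Marcel Proust, 1900s French writer with dark mustache...";

for (const model of MODELS) {
  const started = Date.now();
  const result = await fal.subscribe(model, { input: { prompt: PROMPT } });
  const elapsedMs = Date.now() - started;

  // ISO timestamp in the filename prevents clobbering across repeated runs
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  const filename = `${model.replaceAll("/", "_")}_${stamp}.png`;
  console.log(`${model}: ${elapsedMs}ms -> ${filename} (request ${result.requestId})`);
}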

Important Note: fal.ai Ignores System Prompts

After testing, we discovered that fal.ai does not pass system prompts to the underlying models—they're completely ignored. We verified this by examining the raw request/response:

Request sent to fal.ai:

{
  "prompt": "Close-up portrait photograph of Marcel Proust...",
  "aspectRatio": "3:4",
  "systemPrompt": "1970s FILM SCREENGRAB - TECH-NOIR AESTHETIC:\n\nTransform into a tech-noir film still:\n\nCOLOR PALETTE...",
  "hasReferenceImage": false
}

Response received from fal.ai:

{
  "id": "596ff36d-f1f3-4941-959f-143264262e0c",
  "text": "Close-up portrait photograph of Marcel Proust...",
  "model": "fal-ai/hunyuan-image/v3/text-to-image",
  "usage": { "input_tokens": 272, "total_tokens": 272, "output_tokens": 0 },
  "imageUrl": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA4AA...",
  "provider": "fal.ai",
  "finishReason": "stop",
  "system_fingerprint": "160398587"
}

Notice the systemPrompt field from the request is nowhere in the response—only the main prompt appears as text. The system prompt was completely dropped. This means all models received identical input (just the user prompt), making this a fair apples-to-apples comparison. The tech-noir aesthetic elements that appeared in outputs came entirely from the detailed user prompt, not from a separate system-level transformation instruction.
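You can reproduce this check from the logged records alone. A tiny self-contained sketch over the request/response pair above (values abbreviated exactly as in the logs):

const request = {
  prompt: "Close-up portrait photograph of Marcel Proust...",
  systemPrompt: "1970s FILM SCREENGRAB - TECH-NOIR AESTHETIC: ...",
};

const response = {
  text: "Close-up portrait photograph of Marcel Proust...",
  usage: { input_tokens: 272, total_tokens: 272, output_tokens: 0 },
};

// The model only ever saw the user prompt: the echoed text matches it exactly,
// and input_tokens (272) accounts for the user prompt alone—no systemPrompt tokens.
console.assert(response.text === request.prompt);
console.assert(!response.text.includes("TECH-NOIR"));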

The Contestants

We tested across diverse architectures to see how different approaches handle complex narrative scenes:

  1. FLUX.1 [dev] - 12B rectified flow transformer (open weights, non-commercial)
  2. Recraft V3 - Proprietary SOTA-level (photorealism + text)
  3. Stable Diffusion 3.5 Large - MMDiT text-to-image (≈8.1B params)
  4. Imagen 4 - Google proprietary diffusion (Vertex AI)
  5. Ideogram V2 - Frontier text-to-image (strong typography/design)
  6. HiDream I1 Full - 17B open-source Chinese image foundation model
  7. Hunyuan Image 3.0 - Tencent open-source native multimodal AR MoE ⭐
  8. DeepSeek Janus-Pro - Autoregressive multimodal (vision + generation)
  9. CogView4 - Zhipu AI/Tsinghua open-source (6B, Chinese text capable)
  10. ByteDance Seedream V3 - Bilingual high-res text-to-image
  11. Bria FIBO - JSON-native, licensed-data, commercial-safe
  12. Wan 2.5 Preview - Alibaba text/image-to-video (plus T2I variant)

Testing duration: ~5.6 minutes total (333 seconds). Success rate: 12/12 (100%). Average generation time: 27.8s. Now let's see what they actually produced.

The Results (A Gallery of Interpretations)

FLUX.1 [dev] - Black Forest Labs

Architecture: 12B rectified flow transformer (open weights, non-commercial) Generation Time: 6.0s (fastest) | File Size: 142 KB | Cost: $0.025

Rectified flow transformer architecture delivering the fastest generation time while maintaining clean, artistic composition. The phosphor green thought bubble contains castle/church architecture with excellent contrast against the dark background. Very stylized—almost illustration-like rather than photorealistic. The memory bubble is clear and well-defined, but Proust's expression is more contemplative than surprised, missing that specific "involuntary memory cascade" moment. The speed-to-quality ratio here is impressive—12B parameters generating coherent narrative scenes in 6 seconds.

FLUX.1 dev


Recraft V3

Architecture: Proprietary SOTA-level (photorealism + text rendering specialist) Generation Time: 12.9s | File Size: 318 KB | Tokens: 112 (simplified from 272) | Cost: $0.04

Had to use a simplified 447-character prompt because Recraft enforces a 1000-character limit and our detailed prompt exceeded that—reducing from 272 tokens down to 112 tokens (59% reduction). Despite this constraint, Recraft's SOTA-level architecture (which tops the Artificial Analysis benchmark for photorealism and text rendering) produced highly photorealistic results with exceptional detail. The memory bubble rendered as a vintage illustration/engraving style—which is actually brilliant, suggesting old photographs or children's book illustrations. Unique dual madeleine composition (one whole, one being eaten). The fact that Recraft achieved this quality with half the narrative detail speaks to architectural sophistication, but we still lost specificity in the simplification.

Recraft V3


Stable Diffusion 3.5 Large - Stability AI

Architecture: MMDiT text-to-image (≈8.1B params) Generation Time: 10.9s | File Size: 267 KB | Tokens: 272 (CLIP saw only 77) | Cost: $0.065

Warning received: CLIP token truncation—our 272-token prompt exceeded both of SD 3.5's text-encoder limits: CLIP truncates at 77 tokens (only 28% of instructions encoded) and T5 at 256. This is the Multimodal Diffusion Transformer (MMDiT) architecture hitting hard limits. Despite seeing less than a third of the detailed prompt through CLIP, SD 3.5 produced a good balance of detail and coherence in just 10.9s—competitive speed with Recraft but working with even less information. The thought bubble is there, the tech-noir aesthetic is mostly preserved, but there's a sense of "making do" with truncated instructions. This is architectural constraint showing through in output quality.

SD 3.5 Large


Imagen 4 - Google

Architecture: Proprietary diffusion (Vertex AI) Generation Time: 24.7s | File Size: 1470 KB | Cost: $0.04

Google DeepMind's proprietary diffusion architecture taking its time to render highly realistic detail. The thought bubble shows garden scenes with a boy silhouette and church steeple bathed in amber/golden light. More dimensional and naturalistic than most—feels like a film still from an actual movie. The warmth of the memory fragments contrasts beautifully with the cooler tech-noir palette. The 24.7s generation time reflects the architectural focus on photorealism and compositional complexity. But (and this matters) the expression is serene rather than startled, missing that precise instant of recognition.

Imagen 4


Ideogram V2

Architecture: Frontier text-to-image (strong typography/graphic design) Generation Time: 32.6s | File Size: 1689 KB | Cost: $0.08

Ideogram 2.0's frontier text-to-image architecture with typography expertise doesn't directly apply to portraiture, but the clean execution and strong contrast are evident—that same precision that renders text cleanly translates to well-defined edges and compositional clarity. Solid technical rendering at 32.6s. The phosphor green pops nicely. But again, we're missing that emotional specificity—the face reads as "person eating cake and thinking about things" rather than "person experiencing involuntary memory cascade."

Ideogram V2


HiDream I1 Full - HiDream

Architecture: 17B open-source Chinese image foundation model Generation Time: 20.1s | File Size: 189 KB | Cost: $0.05

Open-source Chinese image foundation model with 17B parameters delivering impressive compression efficiency—189 KB for a complex narrative scene suggests sophisticated encoding. The 20.1s generation time is mid-range despite the parameter count. The tech-noir aesthetic is present but muted. Proust looks appropriately contemplative, but the memory bubble is less defined than we'd hoped. Good technical execution, but not quite capturing the narrative moment.

HiDream I1


Hunyuan Image 3.0 - Tencent ⭐ [WINNER]

Architecture: Open-source native multimodal autoregressive MoE (80B total params, ~13B active) Generation Time: 22.8s | File Size: 1.6 MB | Cost: $0.10

Selected for Production Use

This is the one. Tencent's open-source native multimodal autoregressive Mixture-of-Experts architecture taking 22.8s—not the fastest, but the most precise. Very photorealistic with natural window lighting from the left creating depth and dimension. The thought bubble contains flowers (hawthorns—a direct Proust reference if you know the book), boy silhouette, and church steeple. But what sealed the decision was the expression: eyes widening slightly, contemplative gaze upward, mouth slightly parted—that precise instant where past and present collapse together. Not "remembering" (active, deliberate), but "being remembered to" (passive, involuntary). That's the Return hexagram in facial form. The extra seconds Hunyuan took compared to faster models were spent on this exact kind of emotional nuance.

Hunyuan v3


DeepSeek Janus-Pro

Architecture: Autoregressive multimodal (vision + generation, 1B/7B variants) Generation Time: 60.4s (second slowest) | File Size: 191 KB | Cost: $0.0155

Completely different interpretation—and revealing architectural differences through generation time. At 60.4 seconds, Janus-Pro is the second-slowest model we tested (only CogView4 took longer), yet remarkably cheap at just $0.0155 total cost (14 compute seconds billed, not the full 60.4s wall-clock time). This reflects the fundamental difference between DeepSeek's autoregressive framework, which unifies multimodal understanding and generation, and diffusion models. Instead of a detailed thought bubble, Janus-Pro rendered a minimalist green halo behind Proust's head—almost religious iconography rather than tech-noir memory visualization. More painterly/stylized portrait than photorealistic. Interesting aesthetic choice, but not what we were going for. This is autoregressive architecture thinking differently—both in how it processes the prompt (sequentially rather than in parallel) and in what it produces. The minute-long generation time is the architecture showing itself.

Janus-Pro


CogView4 - Zhipu AI / Tsinghua

Architecture: Open-source text-to-image (6B, Chinese text capable, Apache-2.0) Generation Time: 64.8s (slowest) | File Size: 196 KB | Cost: $0.10

Zhipu AI's text-to-image model—the first open-source model capable of generating Chinese characters in images. The absolute slowest at 64.8 seconds—even longer than Janus-Pro. The extended generation time likely reflects heavy computation for the stylized aesthetic CogView4 specializes in. Compact file size (196 KB) suggests aggressive compression post-generation. The tech-noir elements are there—phosphor green, grain, contrast—but the overall composition feels more illustrative than cinematic. Good technical output, but missing narrative weight. The architectural trade-off here is clear: stylization depth costs generation time.

CogView4


ByteDance Seedream V3 (Best Thought Bubble Award)

Architecture: Bilingual (EN/ZH) high-res text-to-image Generation Time: 9.4s (second fastest) | File Size: 216 KB | Cost: $0.03

ByteDance's Seedream V3—a bilingual (English/Chinese) high-res text-to-image model—demonstrating excellent speed/quality ratio. At 9.4 seconds, this is the second-fastest generation behind only FLUX, yet with competitive photorealistic output. The architectural efficiency is impressive: compact 216 KB file size, sub-10-second generation, solid detail. The execution is technically proficient across the board, but (and this is the theme across many models) the expression doesn't quite hit that involuntary memory moment. Emotionally neutral. When you're trying to visualize a specific psychological state, speed and technical proficiency aren't enough—but Seedream's architecture shows you don't have to sacrifice much speed to get quality.

Seedream V3


Bria FIBO

Architecture: JSON-native, licensed-data, commercial-safe controllable text-to-image Generation Time: 30.1s | File Size: 1.7 MB | Cost: $0.04

Commercial-safe generation with licensing compliance at 30.1 seconds. Bria FIBO is open-source but trained exclusively on licensed data, avoiding the copyright gray areas other models swim in—pitched specifically for safe commercial use with JSON-native fine-grained controllability. The architectural focus on legal compliance doesn't seem to impact generation time significantly (mid-range at 30s). The output is clean, professional, appropriate. But "appropriate" is rarely the goal for tech-noir aesthetic work. We need specific, not safe.

Bria Fibo


Wan 2.5 Preview

Architecture: Alibaba text/image-to-video (with companion text-to-image variant) Generation Time: 34.3s | File Size: 1.7 MB | Cost: $0.05

Alibaba's Wan 2.5—primarily positioned as a text/image-to-video model with synchronized audio and cinematic output, but also available as a text-to-image variant. Running at 34.3 seconds—mid-range despite being labeled "next-gen." Large file size (1.7 MB) suggests the architecture prioritizes detail over compression. The output is technically impressive—detailed, coherent, aesthetically consistent. But again, we're optimizing for emotional precision, not technical perfection. The expression is good. It's just not the expression. "Next-gen" in architecture doesn't necessarily mean faster or emotionally smarter.

Wan 2.5


Speed Patterns: Architecture Reveals Itself Through Time

Generation time isn't just a performance metric—it's architectural fingerprinting. The spread from 6.0s to 64.8s reveals fundamental differences in how these models process complex narrative prompts:

Fastest (< 10s): Optimized Architectures

  • FLUX.1 dev (6.0s) - 12B rectified flow transformer
  • Seedream V3 (9.4s) - ByteDance bilingual high-res model

Fast (10-15s): Constrained But Quick

  • SD 3.5 Large (10.9s) - MMDiT with token truncation
  • Recraft V3 (12.9s) - SOTA with simplified prompt (112 tokens vs 272)

Medium (20-35s): Quality Over Speed

  • HiDream I1 (20.1s - 17B open-source), Hunyuan 3.0 (22.8s - 80B MoE), Imagen 4 (24.7s - Google proprietary)
  • Bria FIBO (30.1s - licensed-data), Ideogram V2 (32.6s), Wan 2.5 (34.3s)

Slow (> 60s): Architectural Difference, Not Optimization

  • Janus-Pro (60.4s) - Autoregressive (sequential token generation)
  • CogView4 (64.8s) - Heavy stylization computation

The pattern is clear: diffusion models cluster in the 6-35s range, while autoregressive architecture (Janus-Pro) and specialized stylization (CogView4) take 60s+. This isn't about optimization—it's about fundamentally different computational approaches.

Key insight: When Janus-Pro takes 60.4s vs FLUX's 6.0s (10x longer), you're not seeing "slower," you're seeing "different." Autoregressive models generate images token-by-token, sequentially. Diffusion models denoise in parallel. The architecture is the speed.

Cost Patterns: Pricing Models Reveal Business Strategy

Generation cost varies dramatically across models. Here's what each image actually cost us:

Actual Cost Per Image (Ranked Cheapest to Most Expensive):

  1. Janus-Pro - $0.0155 (cheapest overall)
  2. FLUX.1 dev - $0.025
  3. Seedream V3 - $0.03
  4. Recraft V3 - $0.04
  5. Imagen 4 - $0.04
  6. Bria FIBO - $0.04
  7. HiDream I1 - $0.05
  8. Wan 2.5 - $0.05
  9. SD 3.5 Large - $0.065
  10. Ideogram V2 - $0.08
  11. Hunyuan 3.0 - $0.10 (tied for most expensive)
  12. CogView4 - $0.10 (tied for most expensive)

Pricing Models:

  • Per-image pricing (Seedream, Recraft, Imagen, Bria, Wan, Ideogram): Fixed cost regardless of resolution
  • Per-megapixel pricing (FLUX, HiDream, SD 3.5, Hunyuan, CogView): Cost scales with image size
  • Per-compute-second (Janus): Cost based on computation time ($0.0011/second, 14 seconds billed)
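To make the three schemes comparable, you can normalize everything to cost-per-image. A small sketch using the rates reported in this post; the per-megapixel handling is illustrative, since fal.ai publishes per-model rates:

type Pricing =
  | { kind: "per-image"; usd: number }
  | { kind: "per-megapixel"; usdPerMp: number }
  | { kind: "per-compute-second"; usdPerSec: number };

function imageCost(p: Pricing, usage: { megapixels?: number; computeSeconds?: number }): number {
  switch (p.kind) {
    case "per-image":
      return p.usd; // fixed, regardless of resolution
    case "per-megapixel":
      return p.usdPerMp * (usage.megapixels ?? 1); // scales with output size
    case "per-compute-second":
      return p.usdPerSec * (usage.computeSeconds ?? 0); // pay for compute, not wall-clock
  }
}

// Janus-Pro: 14 billed compute seconds at $0.0011/s, despite 60.4s wall-clock time
imageCost({ kind: "per-compute-second", usdPerSec: 0.0011 }, { computeSeconds: 14 }); // ≈ $0.0154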

Cost vs Quality:

Janus-Pro at $0.0155 is remarkably cheap—the absolute lowest cost in the entire test, despite 60.4s generation time. The per-compute-second pricing model ($0.0011/s, 14 seconds billed) means you're only paying for actual computation, not wall-clock time. The minimalist aesthetic it produced wasn't right for our use case, but that's an architectural interpretation, not a quality failure. For experimental workflows, research, or stylized portrait work, Janus-Pro's economics are unbeatable.

The second-cheapest (FLUX at $0.025) was fast and stylized but missed the emotional moment.

Seedream V3 stands out for sheer economics. At $0.03 (third cheapest), 9.4s generation (second fastest), and technically excellent output, it offers the best speed/cost/quality ratio in the entire test. For most use cases—marketing, rapid iteration, high-volume generation—Seedream would be the obvious choice. It only lost here because we needed a very specific emotional expression that it didn't quite capture. If you're optimizing for cost-efficiency rather than narrative precision, Seedream is stellar.

The most expensive models (Hunyuan and CogView4, both $0.10) delivered vastly different results despite identical pricing. Hunyuan cost 6.5× more than Janus and 3.3× more than Seedream—but delivered the exact expression we needed.

Key insight: Cost reveals nothing about emotional precision. The 6.5× price difference between Janus ($0.0155) and Hunyuan ($0.10) bought us narrative correctness—the difference between "person with green halo" and "person experiencing involuntary memory." For a production asset generated once and cached forever, paying $0.09 more was trivial. But for most production workflows, Seedream's $0.03 + 9.4s + excellent quality combination is hard to beat.

Practical Recommendations:

  • For batch workflows producing long-lasting assets (landing pages, marketing materials, product launches) where cost isn't a constraint and precision matters: Choose Hunyuan. Pay the premium for exactly the right result.
  • For workflows where latency and cost are critical (A/B testing, rapid iteration, high-volume generation, prototyping): Choose Seedream V3. Best speed/cost/quality ratio. Fast enough for real-time workflows, cheap enough for volume, good enough for most use cases.
  • For experimental/research work with unique aesthetic requirements: Choose Janus-Pro. Cheapest overall, interesting autoregressive interpretation style.

Architectural Patterns Observed

Testing 12 models revealed clear patterns in how different architectures handle complex narrative prompts:

Flow Transformers (FLUX): Clean, artistic, excellent composition. Fastest at 6.0s. More stylized than photorealistic. The speed/quality ratio is exceptional.

MMDiT (SD 3.5): Fast at 10.9s despite prompt truncation challenges. CLIP's 77-token limit is a real constraint for detailed prompts—only saw 28% of instructions but still produced coherent output.

Proprietary Diffusion (Imagen): Highly photorealistic with natural lighting. Mid-range speed (24.7s). Strong compositional complexity.

Open-Source Autoregressive MoE (Hunyuan 3.0): Native multimodal AR with 80B total params (~13B active). Mid-range speed (22.8s). Best at capturing subtle facial expressions—the extra seconds translate to emotional nuance.

Autoregressive (Janus-Pro): Completely different minimalist interpretation—shows how architectural choices fundamentally shape output style. 10x slower (60.4s) due to sequential token generation vs parallel diffusion.

Typography-focused (Ideogram, Recraft): Excellent detail and clarity. Recraft fast despite 1000-character limit requiring 59% prompt reduction (272→112 tokens). Ideogram slower (32.6s), likely precision-focused.

Chinese Models (HiDream, Hunyuan, Seedream, CogView4): Dramatic diversity in both architecture and performance. Seedream (ByteDance, bilingual) is 2nd fastest overall (9.4s). CogView4 (Zhipu AI, first to handle Chinese text generation) is slowest (64.8s). HiDream (17B open-source) and Hunyuan (80B MoE, open-source) in mid-range. Not a monolithic category—architectural approaches vary wildly. Generally efficient file sizes (189-216 KB for fast models).

File Size Clusters:

  • Compact (142-318 KB): FLUX dev, SD 3.5, HiDream, Janus-Pro, CogView4, Seedream, Recraft V3
  • Large (1.5-1.7 MB): Imagen 4, Ideogram V2, Hunyuan 3.0, Bria FIBO, Wan 2.5

Prompt Handling:

  • Recraft V3: 1000-character hard limit
  • SD 3.5: CLIP 77-token limit causes truncation (T5 handles 256 tokens)
  • Most others: Handle full detailed prompts without issue
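Since these limits are hard architectural constraints, the practical move is a pre-flight check before dispatch. A rough sketch with illustrative model keys; the character and token limits come from the tests above, and the token estimate is a crude word-count heuristic rather than CLIP's actual BPE tokenizer:

// Known constraints observed in this test (keys are illustrative)
const LIMITS: Record<string, { maxChars?: number; clipTokens?: number }> = {
  "recraft-v3": { maxChars: 1000 },
  "stable-diffusion-3.5-large": { clipTokens: 77 },
};

function preflight(model: string, prompt: string): string[] {
  const warnings: string[] = [];
  const limit = LIMITS[model];
  if (!limit) return warnings;
  if (limit.maxChars && prompt.length > limit.maxChars) {
    warnings.push(`${prompt.length} chars exceeds ${limit.maxChars}; simplify before sending`);
  }
  // Crude heuristic: ~0.75 words per token; use a real tokenizer for accuracy
  const approxTokens = Math.ceil(prompt.split(/\s+/).length / 0.75);
  if (limit.clipTokens && approxTokens > limit.clipTokens) {
    warnings.push(`~${approxTokens} tokens; CLIP encodes only the first ${limit.clipTokens}`);
  }
  return warnings;
}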

Why Hunyuan Won

The decision came down to three factors that mattered more than speed or file size:

1. Facial Expression Precision

Every model produced technically competent images. Most captured "Proust eating cake." Only Hunyuan captured "Proust experiencing involuntary memory at this exact instant." The eyes widening slightly, the upward gaze, the slight parting of lips—that's not "thinking about the past," that's "the past returning unbidden." For Hexagram 24: Return, that distinction is everything.

2. Memory Bubble Detail

The thought bubble contains hawthorns (flowers), boy silhouette, church steeple—all specific references from Proust's actual text. Other models gave us generic gardens or abstract architectural elements. Hunyuan gave us Combray—the specific childhood place Proust returns to in the novel. That level of narrative accuracy suggests the model is actually parsing semantic meaning, not just matching visual tokens.

3. Tech-Noir Aesthetic Preservation

Some models nailed the phosphor green but lost the film grain. Others got the grain but made it too modern. Hunyuan balanced all elements: CRT glow, 1970s Kodak film quality, high contrast, dramatic lighting, deep blacks. The window light from the left adds depth without breaking the aesthetic. It feels like a still from a 1977 sci-fi film that doesn't exist but should.

Speed and file size didn't matter. This image will load once per user session and get cached. 22.8 seconds generation time is irrelevant for a production asset we generate once. What mattered was correctness—not technical correctness, but narrative correctness.

Implementation Details

All images generated via fal.ai Model APIs with consistent logging:

  • Every request logged to ai_provider_requests Supabase table (model, prompt, generation time, file size, errors)
  • Timestamp naming (ISO format) prevents file clobbering when testing multiple models
  • Test script: scripts/testMultipleModels.ts
  • Success rate: 100% (12/12 models completed successfully)
  • Total generation time: 333 seconds (~5.6 minutes)
  • Average generation time: 27.8 seconds
  • Median generation time: 23.8 seconds
  • Fastest: FLUX.1 dev (6.0s)
  • Slowest: CogView4 (64.8s)
  • Range: 10.8x speed difference between fastest and slowest
  • Cost Range: $0.0155-$0.10 per image
    • Cheapest: Janus-Pro ($0.0155)
    • Second cheapest: FLUX.1 dev ($0.025)
    • Third cheapest: Seedream V3 ($0.03)
    • Most expensive: Hunyuan 3.0 & CogView4 ($0.10 each)

This systematic approach meant we could compare apples-to-apples: same prompt, same system prompt, same infrastructure, different models. The only variable was the model architecture itself.
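The per-request logging itself is one row per generation attempt. A minimal sketch of the shape we store, assuming the supabase-js client; the column names are illustrative and mirror the fields described above:

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// One row per generation attempt in ai_provider_requests
async function logRequest(entry: {
  model: string;
  prompt: string;
  generation_time_ms: number;
  file_size_bytes: number | null;
  error: string | null;
}) {
  const { error } = await supabase.from("ai_provider_requests").insert(entry);
  if (error) console.error("ai_provider_requests insert failed:", error.message);
}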

What We Learned

Prompt engineering hits architectural limits. SD 3.5's CLIP truncation and Recraft's 1000-character limit aren't bugs—they're architectural constraints. You can't prompt-engineer around a hard token limit. Model choice matters.

File size ≠ quality. HiDream produced 189 KB outputs. Hunyuan produced 1.6 MB outputs. For our use case, Hunyuan's larger file was worth it. Optimize for what matters (emotional accuracy), not what's easy to measure (compression ratio).

Cost ≠ quality either. The absolute cheapest model (Janus-Pro, $0.0155 total) produced minimalist iconography that wasn't right. Seedream V3 ($0.03) was technically excellent. The most expensive models (Hunyuan, CogView4 at $0.10 each) delivered vastly different results despite identical pricing. Hunyuan cost 6.5× more than Janus and 3.3× more than Seedream—but produced the exact expression we needed. Pricing structure (per-image vs per-megapixel vs per-compute-second) reveals target audience more than capability. Pay for precision when it matters; optimize for cost when it doesn't.

Autoregressive models are different. Janus-Pro's minimalist halo interpretation wasn't wrong—it was a fundamentally different architectural approach to the same problem. Diffusion models and autoregressive models "think" differently about image generation. This shows up in output style (minimalist vs detailed) and generation time (60.4s vs 6-25s for diffusion). Architecture is destiny.

Speed reveals architecture. The 10.8x difference between FLUX (6.0s) and CogView4 (64.8s) isn't about optimization—it's about fundamentally different computation approaches. Autoregressive (sequential) vs diffusion (parallel) vs stylization-heavy models each have characteristic speed signatures. Generation time is architectural fingerprinting.

Chinese models are underrated and diverse. HiDream, Hunyuan, CogView4, Seedream—these aren't a monolithic category. Seedream is 2nd fastest overall (9.4s). CogView4 is slowest (64.8s). Hunyuan (open-source 80B MoE) won outright for emotional precision. The AI image generation landscape is more diverse than "DALL-E vs Midjourney vs Stable Diffusion."

Testing takes time but prevents regret. ~5.6 minutes of generation time, maybe an hour of analysis. But now we know Hunyuan is the right choice, not just a guess. When you're shipping to production, "probably good enough" isn't good enough.

Conclusion

We upgraded Hexagram 24: Return in 8-Bit Oracle with Marcel Proust's madeleine moment—that famous scene of involuntary memory from In Search of Lost Time. The concept fit perfectly: Return is about cyclical patterns, things coming back around, the inevitable return of what was temporarily absent. Proust's past crashing into his present via a tea-soaked cake is Return in literary form.

But we needed the image to match the concept. So we tested 12 diverse AI image generation models with the same detailed prompt: Proust eating the madeleine, tech-noir aesthetic, phosphor green thought bubble with childhood memories, that precise instant where past and present collapse together.

After systematic testing via fal.ai (100% success rate, all outputs logged to database), we chose Tencent's open-source Hunyuan Image 3.0—not because it was fastest (it wasn't), not because it produced the smallest file (it didn't), not because it was cheapest (at $0.10 it cost 6.5× more than Janus-Pro's $0.0155 and 3.3× more than Seedream V3's $0.03)—but because it captured the exact expression we needed: eyes widening, contemplative upward gaze, that moment of involuntary memory returning.

Speed doesn't matter when you're generating a production asset once. File size doesn't matter when the image loads once and gets cached. Cost doesn't matter when the difference is pennies and the asset ships to production. What matters is correctness—not technical correctness, but narrative correctness. Getting the face right. Getting the emotion right. Getting the Return right.

The image is now live in 8-Bit Oracle's Hexagram 24 commentary. When users consult the oracle about cycles, returns, and things coming back around, they'll see Proust—experiencing his own return, his own involuntary memory, his own collapse of time.

Rendered in phosphor green on deep black, with 1970s film grain, because tech-noir is how we do divination in 2025.


Technical Specifications:

  • Service: fal.ai Model APIs
  • Models Tested: 12 diverse architectures
    • Open-source: FLUX.1 dev, SD 3.5 Large, HiDream I1, Hunyuan 3.0, Janus-Pro, CogView4, Bria FIBO
    • Proprietary: Recraft V3, Imagen 4, Ideogram V2, Seedream V3, Wan 2.5
  • Success Rate: 100% (12/12)
  • Total Generation Time: 333 seconds (~5.6 minutes)
  • Average Generation Time: 27.8s
  • Median Generation Time: 23.8s
  • Fastest Model: FLUX.1 dev - 12B rectified flow transformer (6.0s)
  • Slowest Model: CogView4 - 6B open-source (64.8s)
  • Speed Range: 10.8x difference
  • Selected Model: Hunyuan Image 3.0 (Tencent open-source native multimodal AR MoE, 80B total / ~13B active)
  • Winner Generation Time: 22.8s
  • Winner File Size: 1.6 MB
  • Database Logging: All requests logged to ai_provider_requests Supabase table
  • Test Script: scripts/testMultipleModels.ts
  • Production Deployment: 8-Bit Oracle Hexagram 24: Return
  • Multilingual Support: en, zh (Cantonese), zh-CN (Mandarin), th

Footnotes

  1. The full scene: Proust dips a madeleine into lime-blossom tea, tastes it, and suddenly—without conscious effort, without warning—his entire childhood in Combray floods back with vivid sensory detail. Not just memories of the past, but the past itself, fully present, as real as the moment he's currently in. This is what he calls "involuntary memory"—memory that returns on its own terms, triggered by sensation rather than conscious recall. The entire seven-volume In Search of Lost Time unfolds from this single moment of return. Which makes it probably the most consequential pastry in literary history.