HappyHorse 1.0 — Native Audio-Video Generation Model Has Arrived
Alibaba's ATH-AI team has officially released HappyHorse 1.0, a unified multimodal video generation model. In a single generation pass, it simultaneously produces 1080p video and native audio, with lip-sync support in 7 languages. On the Artificial Analysis leaderboard, HappyHorse 1.0 ranks #1 in both text-to-video (1333 Elo) and image-to-video (1392 Elo).
What Makes HappyHorse 1.0 Different
The first video model to natively co-generate visuals and audio.
Unified Multimodal Transformer
HappyHorse 1.0 uses a single-stream Transformer architecture that places text, image, video, and audio tokens in the same representation space, bypassing the traditional 'video → audio → lip-sync' pipeline.
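The core idea of a single-stream design is that every modality becomes tokens in one shared sequence, so one attention stack fuses them directly. The following toy sketch illustrates that idea under assumptions of ours (the embedding width, token counts, and modality-type embeddings are illustrative, not HappyHorse internals):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding width for the sketch

# Per-modality token embeddings (lengths are arbitrary here).
text  = rng.normal(size=(5, D))
image = rng.normal(size=(4, D))
video = rng.normal(size=(6, D))
audio = rng.normal(size=(3, D))

# Modality-type embeddings let a single attention stack tell streams apart.
type_emb = {m: rng.normal(size=(D,)) for m in ("text", "image", "video", "audio")}

def build_sequence(parts):
    """Concatenate all modalities into one token sequence for a single stream."""
    chunks = [tokens + type_emb[name] for name, tokens in parts]
    return np.concatenate(chunks, axis=0)

seq = build_sequence([("text", text), ("image", image),
                      ("video", video), ("audio", audio)])
print(seq.shape)  # (18, 8): every token can attend to every other in one pass
```

Because audio and video tokens sit in the same sequence, synchronization is learned jointly rather than bolted on by a separate lip-sync stage.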
Native Audio Generation
Dialogue, ambient sound, and sound effects are generated alongside the visuals — no extra TTS or post-production needed. Audio and video are perfectly synced out of the box.
7-Language Lip Sync
Supports English, Mandarin, Cantonese, Japanese, Korean, German, and French with sub-pixel lip alignment, purpose-built for global dialogue content.
#1 on Artificial Analysis
As of April 2026, HappyHorse 1.0 tops the Artificial Analysis global blind evaluation arena in both text-to-video (1333 Elo) and image-to-video (1392 Elo).
8-Step DMD-2 Sampling
Distilled to ~8 denoising steps with no classifier-free guidance needed — a single H100 GPU generates a 1080p video in approximately 38 seconds.
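The speed claim comes from two savings: only ~8 denoising steps, and one network call per step because no classifier-free guidance (which doubles calls with a conditional and an unconditional pass) is needed. A minimal sketch of such a few-step sampling loop, with a stand-in denoiser in place of the real distilled network:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, t):
    # Stand-in for the distilled student network; the real model would
    # predict the clean sample from the noisy input at noise level t.
    return x * (1.0 - t)

def sample_distilled(shape, steps=8):
    """Few-step sampling: one network call per step, no guidance pass."""
    x = rng.normal(size=shape)            # start from pure noise
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = denoiser(x, t_cur)       # single forward call per step
        # Re-noise the clean estimate down to the next noise level.
        x = x0_hat + t_next * rng.normal(size=shape)
    return x

out = sample_distilled((4, 4), steps=8)
print(out.shape)  # sample produced after exactly 8 network calls
```

With 8 steps instead of the 30 to 50 typical of undistilled diffusion samplers, and no doubled guidance pass, per-video latency drops by roughly an order of magnitude.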
Native 1080p Output
Outputs broadcast-grade 1080p video without upscaling, with excellent physical consistency and temporal coherence across multiple shots.
Optimized for Vertical & Dialogue
Specifically optimized for vertical video and dialogue-heavy content — perfect for TikTok, Reels, Shorts, and social media advertising.
Sandwich Architecture
Uses a 'sandwich' structure built around a shared self-attention core for efficient multimodal fusion, with approximately 15 billion parameters optimized for H100 clusters.
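One common reading of a 'sandwich' design is modality-specific input and output projections wrapped around a shared core. The sketch below follows that assumption; the dimensions, the toy attention, and the two-modality split are ours, not HappyHorse's actual layers:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # shared core width for the sketch

def linear(d_in, d_out):
    W = rng.normal(size=(d_in, d_out)) * 0.1
    return lambda x: x @ W

# "Sandwich": per-modality projections around one shared core.
enc = {"video": linear(16, D), "audio": linear(6, D)}
dec = {"video": linear(D, 16), "audio": linear(D, 6)}

def shared_core(x):
    # Toy stand-in for the shared self-attention stack: tokens mix globally.
    attn = np.exp(x @ x.T)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

video_in = rng.normal(size=(6, 16))
audio_in = rng.normal(size=(3, 6))

h = np.concatenate([enc["video"](video_in), enc["audio"](audio_in)])
h = shared_core(h)                       # video and audio tokens fuse here
video_out = dec["video"](h[:6])
audio_out = dec["audio"](h[6:])
print(video_out.shape, audio_out.shape)
```

The payoff of this layout is that the bulk of the parameters live in the shared core, so every modality benefits from the same capacity while only the thin outer layers are modality-specific.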
Typical Use Cases
Production scenarios that benefit from native audio-visual co-generation.
Multilingual Ad Creative
Generate ad creatives in one pass with lip-synced versions in 7 languages — no separate dubbing pipeline needed, dramatically shortening global marketing launch cycles.
Vertical Short-Form Video at Scale
Native support for vertical video and dialogue-first generation makes it ideal for TikTok, Douyin, Reels, and Shorts — authentic, conversational content at scale.
Spokesperson & Presenter Videos
Generate spokesperson, product explanation, and training videos where speech, lip movements, and gestures stay perfectly synchronized — no more awkward lip-sync mismatches.
Storyboarding & Pre-visualization
Generate multi-shot previews with temporary dialogue and ambient audio for film, animation, and game CG early-stage storyboarding and concept validation.
Localized Online Education
Output multilingual lip-synced versions of the same course at a fraction of the cost of re-shooting or manual dubbing, accelerating course internationalization.
Global Brand Content
Brands expanding overseas can rapidly produce localized video creative assets in each region without building separate production teams.
Pricing
HappyHorse 1.0 supports 720P and 1080P resolutions in standard and edit modes. Pay-as-you-go per-second pricing with no hidden fees. See our pricing page for full details.
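Per-second billing means cost scales linearly with clip length and varies by resolution and mode. The rates below are purely hypothetical placeholders for illustration; the actual prices are only on the pricing page:

```python
# Hypothetical per-second rates (USD) for illustration only; real rates
# are listed on the pricing page and may differ.
RATES = {
    ("720p", "standard"): 0.05,
    ("720p", "edit"): 0.07,
    ("1080p", "standard"): 0.10,
    ("1080p", "edit"): 0.14,
}

def estimate_cost(seconds, resolution="1080p", mode="standard"):
    """Pay-as-you-go estimate: price grows linearly with clip duration."""
    return round(seconds * RATES[(resolution, mode)], 2)

print(estimate_cost(10, "1080p", "standard"))  # 1.0
```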
Try HappyHorse 1.0 on corevideo
Experience native audio-video generation with a single click. Sign up and start creating with free credits.