HappyHorse 1.0 — Native Audio-Video Generation Model Has Arrived
Alibaba's ATH-AI team has officially released HappyHorse 1.0, a unified multimodal video generation model. In a single generation pass, it simultaneously produces 1080p video and native audio, with lip-sync support in 7 languages. On the Artificial Analysis leaderboard, HappyHorse 1.0 ranks #1 in both text-to-video (1333 Elo) and image-to-video (1392 Elo).
What Makes HappyHorse 1.0 Different
The first video model to natively co-generate visuals and audio.
Unified Multimodal Transformer
HappyHorse 1.0 uses a single-stream Transformer architecture that places text, image, video, and audio tokens in the same representation space, bypassing the traditional 'video → audio → lip-sync' pipeline.
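The core idea of a single-stream design is that every modality becomes tokens in one shared sequence, so one attention stack fuses them directly. The following toy sketch illustrates that idea under assumptions of ours (the embedding width, token counts, and modality-type embeddings are illustrative, not HappyHorse internals):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding width for the sketch

# Per-modality token embeddings (lengths are arbitrary here).
text  = rng.normal(size=(5, D))
image = rng.normal(size=(4, D))
video = rng.normal(size=(6, D))
audio = rng.normal(size=(3, D))

# Modality-type embeddings let a single attention stack tell streams apart.
type_emb = {m: rng.normal(size=(D,)) for m in ("text", "image", "video", "audio")}

def build_sequence(parts):
    """Concatenate all modalities into one token sequence for a single stream."""
    chunks = [tokens + type_emb[name] for name, tokens in parts]
    return np.concatenate(chunks, axis=0)

seq = build_sequence([("text", text), ("image", image),
                      ("video", video), ("audio", audio)])
print(seq.shape)  # (18, 8): every token can attend to every other in one pass
```

Because audio and video tokens sit in the same sequence, synchronization is learned jointly rather than bolted on by a separate lip-sync stage.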
Native Audio Generation
Dialogue, ambient sound, and sound effects are generated alongside the visuals — no extra TTS or post-production needed. Audio and video are perfectly synced out of the box.
7-Language Lip Sync
Supports English, Mandarin, Cantonese, Japanese, Korean, German, and French with sub-pixel lip alignment, purpose-built for global dialogue content.
#1 on Artificial Analysis
As of April 2026, HappyHorse 1.0 tops the Artificial Analysis global blind evaluation arena in both text-to-video (1333 Elo) and image-to-video (1392 Elo).
8-Step DMD-2 Sampling
Distilled to ~8 denoising steps with no classifier-free guidance needed — a single H100 GPU generates a 1080p video in approximately 38 seconds.
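The speed claim comes from two savings: only ~8 denoising steps, and one network call per step because no classifier-free guidance (which doubles calls with a conditional and an unconditional pass) is needed. A minimal sketch of such a few-step sampling loop, with a stand-in denoiser in place of the real distilled network:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, t):
    # Stand-in for the distilled student network; the real model would
    # predict the clean sample from the noisy input at noise level t.
    return x * (1.0 - t)

def sample_distilled(shape, steps=8):
    """Few-step sampling: one network call per step, no guidance pass."""
    x = rng.normal(size=shape)            # start from pure noise
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = denoiser(x, t_cur)       # single forward call per step
        # Re-noise the clean estimate down to the next noise level.
        x = x0_hat + t_next * rng.normal(size=shape)
    return x

out = sample_distilled((4, 4), steps=8)
print(out.shape)  # sample produced after exactly 8 network calls
```

With 8 steps instead of the 30 to 50 typical of undistilled diffusion samplers, and no doubled guidance pass, per-video latency drops by roughly an order of magnitude.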
Native 1080p Output
Outputs broadcast-grade 1080p video without upscaling, with excellent physical consistency and temporal coherence across multiple shots.
Optimized for Vertical & Dialogue
Specifically optimized for vertical video and dialogue-heavy content — perfect for TikTok, Reels, Shorts, and social media advertising.
Sandwich Architecture
Uses a 'sandwich' structure built around a shared self-attention core for efficient multimodal fusion, with approximately 15 billion parameters optimized for H100 clusters.
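One common reading of a 'sandwich' design is modality-specific input and output projections wrapped around a shared core. The sketch below follows that assumption; the dimensions, the toy attention, and the two-modality split are ours, not HappyHorse's actual layers:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # shared core width for the sketch

def linear(d_in, d_out):
    W = rng.normal(size=(d_in, d_out)) * 0.1
    return lambda x: x @ W

# "Sandwich": per-modality projections around one shared core.
enc = {"video": linear(16, D), "audio": linear(6, D)}
dec = {"video": linear(D, 16), "audio": linear(D, 6)}

def shared_core(x):
    # Toy stand-in for the shared self-attention stack: tokens mix globally.
    attn = np.exp(x @ x.T)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

video_in = rng.normal(size=(6, 16))
audio_in = rng.normal(size=(3, 6))

h = np.concatenate([enc["video"](video_in), enc["audio"](audio_in)])
h = shared_core(h)                       # video and audio tokens fuse here
video_out = dec["video"](h[:6])
audio_out = dec["audio"](h[6:])
print(video_out.shape, audio_out.shape)
```

The payoff of this layout is that the bulk of the parameters live in the shared core, so every modality benefits from the same capacity while only the thin outer layers are modality-specific.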
Typical Use Cases
Production scenarios that benefit from native audio-visual co-generation.
Multilingual Ad Creative
Generate ad creatives in one pass with lip-synced versions in 7 languages — no separate dubbing pipeline needed, dramatically shortening global marketing launch cycles.
Vertical Short-Form Video at Scale
Native support for vertical video and dialogue-first generation makes it ideal for TikTok, Douyin, Reels, and Shorts — authentic, conversational content at scale.
Spokesperson & Presenter Videos
Generate spokesperson, product explanation, and training videos where speech, lip movements, and gestures stay perfectly synchronized — no more awkward lip-sync mismatches.
Storyboarding & Pre-visualization
Generate multi-shot previews with temporary dialogue and ambient audio for film, animation, and game CG early-stage storyboarding and concept validation.
Localized Online Education
Output multilingual lip-synced versions of the same course at a fraction of the cost of re-shooting or manual dubbing, accelerating course internationalization.
Global Brand Content
Brands expanding overseas can rapidly produce localized video creative assets in each region without building separate production teams.
Pricing
HappyHorse 1.0 supports 720P and 1080P resolutions in standard and edit modes. Pay-as-you-go per-second pricing with no hidden fees. See our pricing page for full details.
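Per-second billing means cost scales linearly with clip length and varies by resolution and mode. The rates below are purely hypothetical placeholders for illustration; the actual prices are only on the pricing page:

```python
# Hypothetical per-second rates (USD) for illustration only; real rates
# are listed on the pricing page and may differ.
RATES = {
    ("720p", "standard"): 0.05,
    ("720p", "edit"): 0.07,
    ("1080p", "standard"): 0.10,
    ("1080p", "edit"): 0.14,
}

def estimate_cost(seconds, resolution="1080p", mode="standard"):
    """Pay-as-you-go estimate: price grows linearly with clip duration."""
    return round(seconds * RATES[(resolution, mode)], 2)

print(estimate_cost(10, "1080p", "standard"))  # 1.0
```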
Try HappyHorse 1.0 on corevideo
Experience native audio-video generation with a single click. Sign up and start creating with free credits.