
Alibaba released a set of artificial intelligence (AI) video generation models on Wednesday. Known as Wan 2.1, these are open-source models that can be used for academic and commercial purposes. The Chinese e-commerce giant has released the models in several parameter-based variants. They were developed by the company's Wan team, which claims Wan 2.1 can generate highly realistic videos, and were originally unveiled in January. The models are currently hosted on the AI and machine learning (ML) hub Hugging Face.
Alibaba introduces Wan 2.1 video generation models
The new Alibaba video AI models are hosted on the Wan team's Hugging Face page. The model page also details the Wan 2.1 suite of models. There are four in total: T2V-1.3B, T2V-14B, I2V-14B-720P, and I2V-14B-480P, where T2V is short for text-to-video and I2V for image-to-video.
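For developers, the smaller text-to-video variant can be driven from Python. The sketch below assumes the Hugging Face diffusers integration (WanPipeline) and the diffusers-format repository id Wan-AI/Wan2.1-T2V-1.3B-Diffusers; exact names and defaults should be checked against the Wan team's page.

```python
# Minimal text-to-video sketch, assuming the diffusers WanPipeline
# integration and the "Wan-AI/Wan2.1-T2V-1.3B-Diffusers" repo id.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# The VAE is commonly kept in float32 for decode quality, while the
# transformer runs in bfloat16 to keep VRAM low.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae",
                                       torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae,
                                   torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Prompts may be written in English or Chinese.
frames = pipe(
    prompt="A cat walks across fresh snow, cinematic lighting",
    height=480, width=832,
    num_frames=81,        # roughly five seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wan_t2v.mp4", fps=16)
```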
The researchers claim that the smallest variant, Wan 2.1 T2V-1.3B, can run on consumer-grade GPUs, requiring only 8.19GB of VRAM. According to the post, the model can generate a five-second video at 480p resolution on an NVIDIA RTX 4090 in about four minutes.
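To stay within that kind of VRAM budget on smaller cards, diffusers' generic offloading can be layered onto the sketch above; this is a standard library feature, not a Wan-specific option.

```python
# Generic diffusers memory saver: keep submodules on the CPU and move each
# to the GPU only while it runs, trading generation speed for peak VRAM.
pipe.enable_model_cpu_offload()
```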
While the Wan 2.1 suite is aimed at video generation, the models can also perform other tasks such as image generation, video-to-audio generation, and video editing. However, the currently open-sourced models are not capable of these advanced tasks. For video generation, they accept text prompts in Chinese and English, as well as image inputs.
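Image input is handled by the separate I2V checkpoints. A hedged sketch, assuming diffusers' WanImageToVideoPipeline and the Wan-AI/Wan2.1-I2V-14B-480P-Diffusers repo id; photo.jpg is a hypothetical local file.

```python
# Image-to-video sketch, assuming the WanImageToVideoPipeline integration.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("photo.jpg")  # hypothetical input still
frames = pipe(
    image=image,
    prompt="The subject turns toward the camera and smiles",
    height=480, width=832, num_frames=81,
).frames[0]
export_to_video(frames, "wan_i2v.mp4", fps=16)
```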
The Wan 2.1 models were designed using a diffusion transformer architecture, the researchers revealed. However, the company innovated on the base architecture with novel variational autoencoders (VAEs), new training strategies, and more.
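To make "diffusion transformer" concrete: video latents are flattened into spatio-temporal tokens, and each transformer block is modulated by the diffusion timestep. The PyTorch block below is a generic DiT-style sketch for illustration, not Wan 2.1's actual layers.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One generic DiT-style block: adaLN modulation from the timestep
    embedding, then self-attention and an MLP over the video tokens."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # The timestep embedding predicts per-block scale/shift/gate terms.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x, t_emb):
        # x: (batch, tokens, dim); t_emb: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

block = DiTBlock(dim=256, heads=8)
tokens = torch.randn(2, 128, 256)   # 128 spatio-temporal tokens per clip
t_emb = torch.randn(2, 256)         # diffusion timestep embedding
out = block(tokens, t_emb)          # shape preserved: (2, 128, 256)
```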
Most notably, the models use a new 3D causal VAE architecture dubbed Wan-VAE. It improves spatio-temporal compression and reduces memory usage. The autoencoder can encode and decode 1080p videos of unlimited length without losing historical temporal information, which helps keep generated videos consistent.
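The "causal" part is what enables that behaviour: if temporal convolutions only look backwards, a long video can be processed chunk by chunk while carrying a small cache of past frames forward. The layer below is an illustrative sketch of causal temporal padding, not Wan-VAE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution whose temporal padding covers only the past, so each
    output frame depends solely on current and earlier input frames."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1  # pad the past side only, never the future
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x):
        # x: (batch, channels, time, height, width); F.pad order is
        # (w_left, w_right, h_left, h_right, t_left, t_right).
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

x = torch.randn(1, 3, 17, 64, 64)   # 17 frames of 64x64 RGB
y = CausalConv3d(3, 8)(x)
print(y.shape)                      # torch.Size([1, 8, 17, 64, 64])
```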
Based on internal testing, the company claims that Wan 2.1 outperforms OpenAI's Sora model on consistency, scene generation quality, single-object accuracy, and spatial positioning.
The models are available under the Apache 2.0 license. While the license allows unrestricted usage for academic and research purposes, commercial usage comes with multiple restrictions.