THUDM/CogVideoX-5b-I2V is producing gibberish output #517

nitinmukesh · 2024-11-18T19:05:42Z

System Info

diffusers 0.32.0.dev0
torch 2.5.1+cu121
torchvision 0.20.1+cu121
python 3.11

Information

The official example scripts
My own modified scripts and tasks

Reproduction

python inference/cli_demo.py --prompt "A young girl with sun-kissed hair and sparkling blue eyes stands in a lush, sunlit garden, her face radiant with a genuine smile that lights up her entire being. She wears a soft, floral dress that complements the vibrant blooms around her. As she tilts her head slightly, the sunlight catches the gentle curve of her smile, highlighting her joyful expression. Her hands are gently clasped in front of her, adding to the serene and happy atmosphere. The background is a tapestry of colorful flowers and greenery, enhancing the warmth and beauty of her smile." --model_path THUDM/CogVideoX-5b-I2V --generate_type "i2v" --num_frames 48 --image_or_video_path image.png --width 720 --height 480

output-i2v.mp4

Expected behavior

I tried the same image here and the output was fine:

https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space


a-r-r-o-w · 2024-11-18T19:34:40Z

num_frames must be 81 or 161. I see it mentioned in the docs here: https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox.

It looks like the table rendering there is broken, so I will fix that in a follow-up PR.

nitinmukesh · 2024-11-19T07:47:47Z

@a-r-r-o-w

I am using --model_path THUDM/CogVideoX-5b-I2V, which I think supports fewer than 49 frames.
THUDM/CogVideoX1.5-5b-I2V supports 81 frames.

Is my understanding wrong?

a-r-r-o-w · 2024-11-19T09:10:25Z

Oh, really sorry, I misread the model id. 49 frames is indeed correct here, but I see that you have 48 frames specified in the command. The I2V model is very sensitive to this, and that's probably why you're seeing these tile patterns.
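
For anyone who wants to catch this before launching a long run, here is a minimal sketch of a frame-count check. It assumes valid counts for CogVideoX-5b-I2V have the form 8N + 1 with N up to 6, which is my reading of the model card and not something stated in this thread; the thread only confirms that 49 works and 48 does not. The helper names `is_supported_num_frames` and `nearest_supported_num_frames` are hypothetical and not part of diffusers or the CogVideo repository.

```python
def is_supported_num_frames(num_frames: int, stride: int = 8, max_multiplier: int = 6) -> bool:
    """Return True if num_frames == stride * N + 1 for some 1 <= N <= max_multiplier.

    The defaults (stride=8, max_multiplier=6) are an assumption for
    CogVideoX-5b-I2V; other checkpoints may need different values.
    """
    if num_frames < 1:
        return False
    n, remainder = divmod(num_frames - 1, stride)
    return remainder == 0 and 1 <= n <= max_multiplier


def nearest_supported_num_frames(num_frames: int, stride: int = 8, max_multiplier: int = 6) -> int:
    """Return the supported frame count closest to the requested one."""
    candidates = [stride * n + 1 for n in range(1, max_multiplier + 1)]
    return min(candidates, key=lambda candidate: abs(candidate - num_frames))


if __name__ == "__main__":
    requested = 48  # the value used in the failing command above
    if not is_supported_num_frames(requested):
        print(f"{requested} frames is not supported; try {nearest_supported_num_frames(requested)}")
```

Under these assumptions, the check rejects the 48 from the original command and suggests 49 instead.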

nitinmukesh · 2024-11-19T09:21:16Z

So what do you suggest? Could you please guide me on how to run inference with the v1 I2V model?

a-r-r-o-w · 2024-11-19T10:03:19Z

Try running inference with height=480, width=720 and num_frames=49. I don't get any artifacts running the pure Diffusers example. I will take a look at the file in question soon as well to see if there are any bugs
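
For reference, a rough sketch of what a pure Diffusers run with those settings could look like. This is not the code from inference/cli_demo.py; the wrapper function `generate_i2v`, the file paths, the step count, the guidance scale, and the memory-saving calls are illustrative assumptions, and only height=480, width=720 and num_frames=49 come from the suggestion above.

```python
def generate_i2v(
    image_path: str,
    prompt: str,
    output_path: str = "output-i2v.mp4",
    height: int = 480,
    width: int = 720,
    num_frames: int = 49,
) -> str:
    """Generate a video from one image with CogVideoX-5b-I2V and save it as MP4."""
    # Heavy imports are kept inside the function so that importing this
    # module does not require a GPU or trigger a model download.
    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
    )
    # Optional memory savers (assumed to be useful on consumer GPUs).
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()

    image = load_image(image_path)
    result = pipe(
        image=image,
        prompt=prompt,
        height=height,
        width=width,
        num_frames=num_frames,   # 49, not 48
        num_inference_steps=50,  # assumed value, tune as needed
        guidance_scale=6.0,      # assumed value, tune as needed
    )
    export_to_video(result.frames[0], output_path, fps=8)
    return output_path


if __name__ == "__main__":
    generate_i2v("image.png", "A young girl smiling in a sunlit garden.")
```

If this produces a clean video while the CLI script with the same settings does not, that would point to a bug in the script and not in the model.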

nitinmukesh mentioned this issue Nov 19, 2024

Work plan and enhancement / 工作计划和用户诉求 #194

Open
