OmniTalker: Video to Video (Alibaba)
Voice Cloner and Text-to-Speech:
"Spark-TTS"
An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
---------------------------------------------------------------------------------------
Image + Audio = Video with Hands and Face movement
"EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation"
- Tested system environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 11.7
- Tested GPUs: A100 (80 GB) / RTX 4090D (24 GB) / V100 (16 GB)
- Tested Python versions: 3.8 / 3.10 / 3.11
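A quick way to sanity-check a machine against the tested matrix above before installing. This is a hypothetical helper, not part of the EchoMimicV2 repo; the CUDA version must be supplied manually (e.g. read from `nvcc --version`):

```python
# Hypothetical pre-install check (illustrative, not from the EchoMimicV2 repo):
# compare the local Python and CUDA versions against the tested matrix above.
import sys

TESTED_PYTHON = {(3, 8), (3, 10), (3, 11)}  # tested Python versions
MIN_CUDA = (11, 7)                          # repo states CUDA >= 11.7

def check_python(version):
    """True if (major, minor) matches a tested Python version."""
    return tuple(version[:2]) in TESTED_PYTHON

def check_cuda(cuda_version):
    """True if a (major, minor) CUDA version meets the minimum."""
    return tuple(cuda_version) >= MIN_CUDA

if __name__ == "__main__":
    print(check_python(sys.version_info), check_cuda((12, 1)))
```

Note that 3.9 is absent from the tested list, so `check_python((3, 9))` deliberately returns `False` even though the repo may still work there.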
---------------------------------------------------------------------------------------
Image + Audio = Video with only Face movement
"Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation"
-----------------------------------------------------------------------------------------------
Best Model: "Hunyuan Video"
-------------------------------------------------------------------------------------------------
GPU Poor version by DeepBeepMeep
- Greatly reduces RAM and VRAM requirements
- 5 profiles to run the model at a decent speed on a low-end consumer config (32 GB of RAM and 12 GB of VRAM) and at a very good speed on a high-end consumer config (48 GB of RAM and 24 GB of VRAM)
- Supports multiple pretrained LoRAs with 32 GB of RAM or less
- Easily switch between Hunyuan and Fast Hunyuan models and between quantized and non-quantized models
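The profile idea above can be sketched as a simple lookup from available memory to a profile. Everything here is illustrative: the function name, the profile numbering, and the thresholds are assumptions based only on the two configs mentioned (32 GB RAM / 12 GB VRAM and 48 GB RAM / 24 GB VRAM), not the actual DeepBeepMeep API:

```python
# Hypothetical sketch (names and numbering are assumptions, not the real API):
# map available RAM/VRAM to one of the 5 memory profiles, mirroring the
# low-end and high-end consumer configs mentioned above.

def pick_profile(ram_gb: int, vram_gb: int) -> int:
    """Return a profile index (1 = fastest, 5 = smallest memory footprint)."""
    if ram_gb >= 48 and vram_gb >= 24:
        return 1  # high-end consumer config: very good speed
    if ram_gb >= 32 and vram_gb >= 12:
        return 4  # low-end consumer config: decent speed
    return 5      # below the stated configs: maximum offloading

print(pick_profile(48, 24))  # 1
print(pick_profile(32, 12))  # 4
```

The design point is that profile selection trades speed for offloading: the less VRAM available, the more weights stay in system RAM and stream to the GPU on demand.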
--------------------------------------------------------------------------------------------------------------------
Intel OpenVINO Avatar:
Requires proprietary Intel hardware (Gaudi accelerator)
--------------------------------------------------------------------------------------------------------------------
Reference Video:
Comments