[WIP]Add Wan2.2 Animate Pipeline (Continuation of #12442 by tolgacangoz)#12526
yiyixuxu merged 98 commits into huggingface:main
Conversation
- Introduced `WanAnimateTransformer3DModel` and `WanAnimatePipeline`.
- Updated `get_transformer_config` to handle the new model type.
- Modified `convert_transformer` to instantiate the correct transformer based on model type.
- Adjusted main execution logic to accommodate the new Animate model type.
…prove error handling for undefined parameters
…work for character animation and replacement
- Added the Wan 2.2 Animate 14B model to the documentation.
- Introduced the Wan-Animate framework, detailing its capabilities for character animation and replacement.
- Included example usage for the `WanAnimatePipeline` with preprocessing steps and guidance on input requirements.
- Introduced `WanAnimateGGUFSingleFileTests` to validate functionality.
- Added dummy input generation for testing model behavior.
- Introduced `EncoderApp`, `Encoder`, `Direction`, `Synthesis`, and `Generator` classes for enhanced motion and appearance encoding.
- Added `FaceEncoder`, `FaceBlock`, and `FaceAdapter` classes to integrate facial motion processing.
- Updated `WanTimeTextImageMotionEmbedding` to utilize the new `Generator` for motion embedding.
- Enhanced `WanAnimateTransformer3DModel` with an additional face adapter and pose patch embedding for improved model functionality.
- Introduced a `pad_video` method to handle padding of video frames to a target length.
- Updated video processing logic to utilize the new padding method for `pose_video` and `face_video`, and conditionally for `background_video` and `mask_video`.
- Ensured compatibility with existing preprocessing steps for video inputs.
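The commit above describes a `pad_video` helper. A minimal sketch of what such a method might do, assuming last-frame repetition as the padding strategy (the name and signature follow the commit message; the body is an illustration, not the pipeline's actual code):

```python
import torch

def pad_video(video: torch.Tensor, target_length: int) -> torch.Tensor:
    """Pad a (frames, C, H, W) video to `target_length` frames by repeating
    the last frame; truncate if it is already longer. Hypothetical sketch."""
    num_frames = video.shape[0]
    if num_frames >= target_length:
        return video[:target_length]
    # Repeat the final frame to fill the remaining slots
    last = video[-1:].expand(target_length - num_frames, -1, -1, -1)
    return torch.cat([video, last], dim=0)

# Example: pad a 3-frame video to 5 frames
video = torch.randn(3, 3, 8, 8)
padded = pad_video(video, 5)
print(padded.shape)  # torch.Size([5, 3, 8, 8])
```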
…roved video processing
- Added optional parameters `conditioning_pixel_values`, `refer_pixel_values`, `refer_t_pixel_values`, `bg_pixel_values`, and `mask_pixel_values` to the `prepare_latents` method.
- Updated the logic in the denoising loop to accommodate the new parameters, enhancing the flexibility and functionality of the pipeline.
…eneration
- Updated the calculation of `num_latent_frames` and adjusted the shape of latent tensors to accommodate changes in frame processing.
- Enhanced the `get_i2v_mask` method for better mask generation, ensuring compatibility with new tensor shapes.
- Improved handling of pixel values and device management for better performance and clarity in the video processing pipeline.
…and mask generation
- Consolidated the handling of `pose_latents_no_ref` to improve clarity and efficiency in latent tensor calculations.
- Updated the `get_i2v_mask` method to accept a batch size and adjusted tensor shapes accordingly for better compatibility.
- Enhanced the logic for mask pixel values in the replacement mode, ensuring consistent processing across different scenarios.
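For readers unfamiliar with I2V-style masks, here is a rough sketch of what a `get_i2v_mask`-like helper typically computes: a binary mask marking conditioning frames, folded down to the latent temporal resolution. Everything here is an assumption for illustration (including the compression factor of 4, which matches Wan's VAE), not the pipeline's actual implementation:

```python
import torch

def get_i2v_mask(batch_size, num_frames, latent_h, latent_w, num_cond_frames=1):
    """Hypothetical sketch: 1 marks conditioning frames, 0 marks frames to
    generate; groups of 4 frames are folded into a channel dimension to
    match a temporally-compressed latent."""
    mask = torch.zeros(batch_size, num_frames, latent_h, latent_w)
    mask[:, :num_cond_frames] = 1.0
    # Repeat the first frame so the total frame count is divisible by 4
    first = mask[:, :1].repeat(1, 3, 1, 1)
    mask = torch.cat([first, mask], dim=1)  # (B, num_frames + 3, H, W)
    # Fold groups of 4 frames into a channel-like dimension
    mask = mask.view(batch_size, -1, 4, latent_h, latent_w).transpose(1, 2)
    return mask  # (B, 4, (num_frames + 3) // 4, H, W)

m = get_i2v_mask(1, 21, 8, 8)
print(m.shape)  # torch.Size([1, 4, 6, 8, 8])
```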
…nced processing
- Introduced custom QR decomposition and fused leaky ReLU functions for improved tensor operations.
- Implemented upsampling and downsampling functions with native support for better performance.
- Added new classes `FusedLeakyReLU`, `Blur`, `ScaledLeakyReLU`, `EqualConv2d`, `EqualLinear`, and `RMSNorm` for advanced neural network layers.
- Refactored the `EncoderApp`, `Generator`, and `FaceBlock` classes to integrate the new functionality and improve modularity.
- Updated the attention mechanism to utilize `dispatch_attention_fn` for enhanced flexibility in processing.
…annotations
- Removed over-abstracted helper functions such as `custom_qr`, `fused_leaky_relu`, and `make_kernel` to streamline the codebase.
- Updated class constructors and method signatures to include type hints for better clarity and type checking.
- Refactored the `FusedLeakyReLU`, `Blur`, `EqualConv2d`, and `EqualLinear` classes to enhance readability and maintainability.
- Simplified the `Generator` and `Encoder` classes by removing redundant parameters and improving initialization logic.
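The `EqualLinear`/`EqualConv2d` layers referenced in these commits follow the StyleGAN-style "equalized learning rate" idea: weights are stored at unit-variance initialization and rescaled by a per-layer constant at runtime. A minimal self-contained sketch of the linear variant (parameter names such as `lr_mul` are assumptions, not the PR's actual code):

```python
import math
import torch
from torch import nn
import torch.nn.functional as F

class EqualLinear(nn.Module):
    """Linear layer with equalized learning rate: the stored weight stays
    near N(0, 1) and is multiplied by 1/sqrt(in_features) in forward."""

    def __init__(self, in_features: int, out_features: int,
                 bias: bool = True, lr_mul: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features).div_(lr_mul))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.scale = (1 / math.sqrt(in_features)) * lr_mul
        self.lr_mul = lr_mul

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bias = self.bias * self.lr_mul if self.bias is not None else None
        # Apply the runtime scale instead of baking it into the weight
        return F.linear(x, self.weight * self.scale, bias)

layer = EqualLinear(16, 8)
out = layer(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 8])
```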
Here are some results. Animation: wan_animate_video_20_step.mp4. Replacement: wan_animate_video_replace_20_step.mp4
src/diffusers/image_processor.py
Outdated
      VAE scale factor. If `do_resize` is `True`, the image is automatically resized to multiples of this factor.
-   resample (`str`, *optional*, defaults to `lanczos`):
      Resampling filter to use when resizing the image.
+   resample (`str`, *optional*, defaults to `"lanczos"`):
Can we add a new `WanVaeImageProcessor(VaeImageProcessor)` and put it in the wan folder, under a utils.py file, I think?
(We are starting to see more and more custom preprocess methods; almost every model has one, and they don't really get reused across models. Moving forward, I think we should just do this for all new models.)
cc @DN6 here too, let me know what you think
I think the changes that make `_resize_and_fill` and `_resize_and_crop` respect `self.config.resample` should be added to the base `VaeImageProcessor` class; this could also be spun off into its own PR. I agree with moving the other (Wan Animate-specific) logic into its own class.
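To make the suggestion concrete, here is a standalone sketch of what a Wan-specific processor could look like. The real class would subclass diffusers' `VaeImageProcessor`; this version is self-contained for illustration, and the configurable `resample` behavior (rather than a hard-coded filter) is the point being discussed, not the PR's actual code:

```python
import torch
import torch.nn.functional as F

class WanVaeImageProcessor:
    """Illustrative sketch: target dimensions snap to multiples of the VAE
    scale factor, and the interpolation mode comes from configuration
    instead of being hard-coded in the resize path."""

    def __init__(self, vae_scale_factor: int = 8, resample: str = "bilinear"):
        self.vae_scale_factor = vae_scale_factor
        self.resample = resample

    def resize(self, image: torch.Tensor, width: int, height: int) -> torch.Tensor:
        # image: (B, C, H, W); round target dims down to the nearest
        # multiple of the VAE scale factor before resizing
        width -= width % self.vae_scale_factor
        height -= height % self.vae_scale_factor
        return F.interpolate(image, size=(height, width), mode=self.resample)

proc = WanVaeImageProcessor()
out = proc.resize(torch.rand(1, 3, 100, 100), 130, 94)
print(out.shape)  # torch.Size([1, 3, 88, 128])
```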
def __repr__(self):
    return (
        f"{self.__class__.__name__}(in_features={self.weight.shape[1]}, out_features={self.weight.shape[0]},"
        f" bias={self.bias is not None})"
    )
Suggested change:
def __repr__(self):
    return (
        f"{self.__class__.__name__}(in_features={self.weight.shape[1]}, out_features={self.weight.shape[0]},"
        f" bias={self.bias is not None})"
    )
hidden_states = hidden_states.flatten(2).transpose(1, 2)

# 3. Condition embeddings (time, text, image)
# timestep shape: batch_size, or batch_size, seq_len (wan 2.2 ti2v)
I think we can remove one of these conditions for Animate, no?
Yeah, Wan Animate is based on Wan 2.1, so the Wan2.2 TI2V logic isn't necessary here, and I have removed it.
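The distinction being simplified here is the timestep shape: Wan 2.2 TI2V passes per-token timesteps of shape `(batch, seq_len)`, while Wan 2.1-based models such as Animate pass a single timestep per sample, `(batch,)`. A hypothetical helper illustrating the two branches (the sinusoidal embedding is a stand-in, not the transformer's actual embedding code):

```python
import torch

def embed_timestep(timestep: torch.Tensor, embed_dim: int = 8) -> torch.Tensor:
    """Sketch of the condition-embedding branch: a 2D timestep means
    per-token conditioning (Wan 2.2 TI2V); 1D is per-sample, the only
    case a Wan 2.1-based model like Animate needs."""
    freqs = torch.arange(1, embed_dim + 1)
    if timestep.ndim == 2:
        # (batch, seq_len): embed each token's timestep separately
        b, s = timestep.shape
        emb = torch.sin(timestep.reshape(-1, 1) * freqs)
        return emb.reshape(b, s, embed_dim)
    # (batch,): one embedding per sample
    return torch.sin(timestep[:, None] * freqs)

print(embed_timestep(torch.tensor([0.5, 0.9])).shape)  # torch.Size([2, 8])
print(embed_timestep(torch.rand(2, 10)).shape)         # torch.Size([2, 10, 8])
```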
self.gradient_checkpointing = False

def motion_batch_encode(
Can we move this to `forward`? All the layers (`motion_encoder` here) should be visible in `forward`.
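The rationale: hooks attached around `forward` (e.g. for CPU offloading or layerwise casting) only see submodules that are invoked on the `forward` path. A toy illustration of the pattern the reviewer is asking for, with all names invented for the example:

```python
import torch
from torch import nn

class MotionModel(nn.Module):
    """Illustrative only: the motion_encoder call is made directly inside
    forward rather than hidden in a separate helper method, so any hooks
    wrapping forward observe every layer that runs."""

    def __init__(self):
        super().__init__()
        self.motion_encoder = nn.Linear(4, 4)

    def forward(self, face_pixels: torch.Tensor) -> torch.Tensor:
        # Inlined here instead of delegating to a `motion_batch_encode`
        # helper: flatten the batch/frame dims, encode, restore the shape
        batch, frames, dim = face_pixels.shape
        motion = self.motion_encoder(face_pixels.reshape(-1, dim))
        return motion.reshape(batch, frames, -1)

model = MotionModel()
print(model(torch.randn(2, 5, 4)).shape)  # torch.Size([2, 5, 4])
```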
hidden_states_original_dtype = hidden_states.dtype
hidden_states = self.norm_out(hidden_states.float())
# Move the shift and scale tensors to the same device as hidden_states.
Ohh, let's try to fix it here.
I think all we need to do is pack shift and scale into the same layer and add that layer to the `_no_split_modules` attribute.
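A toy sketch of that suggestion: keeping the shift/scale table inside the same module as the output norm means `device_map`-style sharding (which respects `_no_split_modules`) can never place them on different devices. Class and attribute contents here are illustrative, not diffusers' actual API:

```python
import torch
from torch import nn

class AdaLayerNormOut(nn.Module):
    """Hypothetical: norm and its shift/scale table live in one module,
    so listing this class in _no_split_modules keeps them co-located."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.scale_shift_table = nn.Parameter(torch.zeros(2, dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The table is guaranteed to be on the same device as the norm
        shift, scale = self.scale_shift_table.to(hidden_states.device)
        return self.norm(hidden_states) * (1 + scale) + shift

class TinyTransformer(nn.Module):
    # Modules named here would not be split across devices by device_map
    _no_split_modules = ["AdaLayerNormOut"]

    def __init__(self, dim: int = 8):
        super().__init__()
        self.norm_out = AdaLayerNormOut(dim)

    def forward(self, x):
        return self.norm_out(x)

out = TinyTransformer()(torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 4, 8])
```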
>>> face_video = load_video("path/to/face_video.mp4")

>>> # Calculate optimal dimensions based on VAE constraints
>>> max_area = 480 * 832
If we make a `VaeImageProcessor` subclass for Wan, this can be added there too.
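The `max_area` calculation being referred to typically picks the largest aspect-ratio-preserving size under a pixel budget, rounded to a multiple of the VAE scale factor times the patch size. A sketch of that logic (function name and `mod_value` default are assumptions for illustration):

```python
import math

def compute_default_size(image_height: int, image_width: int,
                         max_area: int = 480 * 832, mod_value: int = 16) -> tuple:
    """Largest (height, width) under `max_area` pixels that preserves the
    input aspect ratio and is divisible by `mod_value`. Illustrative sketch."""
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width

h, w = compute_default_size(720, 1280)
print(h, w)  # 464 832
```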
…otion encoder's upcast_to_fp32 arg
…well as vae_scale_factor when calculating default height and width
What does this PR do?
This PR is a continuation of #12442 by @tolgacangoz. It adds a pipeline for the Wan2.2-Animate-14B model (project page, paper, code, weights), a SOTA character animation and replacement video model.
Fixes #12441 (the original requesting issue).
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu
@sayakpaul
@tolgacangoz