Building a Netflix VOID Video Object Removal Pipeline with CogVideoX

How to Build a Netflix VOID Video Object Removal and Inpainting Pipeline with CogVideoX, Custom Prompting, and End-to-End Sample Inference

Netflix’s VOID model enables high-fidelity video object removal and inpainting by leveraging the CogVideoX-Fun-V1.5-5b-InP base model. This advanced pipeline specifically targets the removal of complex objects while maintaining temporal consistency across frames.

Why This Matters

While standard image inpainting is mature, video object removal faces significant challenges in temporal coherence and memory management. The VOID model addresses these by requiring upwards of 40GB VRAM, ideally on A100 hardware, to process sequences without flickering or artifacts. Moving beyond simple masks, this pipeline integrates custom background prompting via LLMs to guide the diffusion process toward more physically plausible reconstructions, addressing the limitations of baseline inpainting models in complex dynamic scenes.

Key Insights

VOID Pass 1 requires the CogVideoX-Fun-V1.5-5b-InP base model and specific safetensors checkpoints from the Netflix repository (2026).
Hardware constraints are significant; official documentation recommends 40GB+ VRAM, noting that T4 or L4 GPUs may fail during execution (Netflix, 2026).
The pipeline utilizes a temporal window size of 85 frames and a spatial resolution of 384x672 for high-quality inference (Netflix/VOID, 2026).
Integration of OpenAI’s GPT-4o-mini allows for the generation of cleaner background prompts, improving the semantic quality of the inpainted scene (MarkTechPost, 2026).
Checkpoint adaptation is necessary when state_dict channels for VAE masks do not match the transformer’s expected latent dimensions (16 latent channels).

Working Examples

Adapting the VOID checkpoint channels to align with the VAE mask dimensions in the transformer model.

print('Loading VOID checkpoint from {TRANSFORMER_CKPT} ...')\nstate_dict = load_file(TRANSFORMER_CKPT)\nparam_name = 'patch_embed.proj.weight'\nif state_dict[param_name].size(1) != transformer.state_dict()[param_name].size(1):\n    latent_ch, feat_scale = 16, 8\n    feat_dim = latent_ch * feat_scale\n    new_weight = transformer.state_dict()[param_name].clone()\n    new_weight[:, :feat_dim] = state_dict[param_name][:, :feat_dim]\n    new_weight[:, -feat_dim:] = state_dict[param_name][:, -feat_dim:]\n    state_dict[param_name] = new_weight

Practical Applications

Automated Video Editing: Removing distracting objects like kettlebells or glassware from commercial footage using VOID Pass 1. Pitfall: Using insufficient VRAM (under 40GB) leading to OOM errors or frame distortion.
Scene Reconstruction: Regenerating background environments for film production where physical props need to be removed post-capture. Pitfall: Neglecting negative prompts like ‘distortion’ or ‘strange trajectory,’ which can lead to visual artifacts in the output.

References:

https://www.marktechpost.com/2026/04/05/how-to-build-a-netflix-void-video-object-removal-and-inpainting-pipeline-with-cogvideox-custom-prompting-and-end-to-end-sample-inference/

On This Page

How to Build a Netflix VOID Video Object Removal and Inpainting Pipeline with CogVideoX, Custom Prompting, and End-to-End Sample Inference

Why This Matters

Key Insights

Working Examples

Practical Applications

Continue reading

Related Content

Netflix AI Open-Sources VOID: Physics-Aware Video Object Removal

Spatial Supersensing as the Core Capability for Multimodal AI Systems

Vision Banana: Google DeepMind’s Instruction-Tuned Model Outperforms SAM 3 and Depth Anything V3