This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/terrariyum on 2025-07-15 01:33:01+00:00.


Intro

This post covers how to use Wan 2.1 Vace to composite any combination of images into one scene, optionally using masked inpainting. This works for t2v, i2v, v2v, flf2v, or even tivflf2v. Vace is very flexible! I can’t find another post that explains all this. Hopefully I can save you from the need to watch 40 minutes of YouTube videos.

ComfyUI workflows

This guide is only about using masking with Vace, and assumes you already have a basic Vace workflow. I’ve included diagrams here instead of workflows. That makes it easier for you to add masking to your existing workflows.

There are many example Vace workflows on Comfy, Kijai’s GitHub, Civitai, and this subreddit. Important: this guide assumes a workflow using Kijai’s WanVideoWrapper nodes, not the native nodes.

How to mask

Masking first frame, last frame, and reference image inputs

  • These all use “pseudo-masked images”, not actual masks.
  • A pseudo-masked image is one where the masked areas of the image are replaced with white pixels instead of having a separate image + mask channel.
  • In short: the model output will replace the white pixels in the first/last frame images and ignore the white pixels in the reference image.
  • All masking is optional!

Masking the first and/or last frame images

  • Make a mask in the mask editor.
  • Pipe the load image node’s mask output to a mask to image node.
  • Pipe the mask to image node’s image output and the load image node’s image output to an image blend node. Set the blend mode to “screen”, and the factor to 1.0 (opaque).
  • This draws white pixels over top of the original image, matching the mask.
  • Pipe the image blend node’s image output to the WanVideo Vace Start to End Frame node’s start (frame) or end (frame) inputs.
  • This is telling the model to replace the white pixels but keep the rest of the image.

https://preview.redd.it/0cwzp9rvvxcf1.png?width=1988&format=png&auto=webp&s=857c0b0c1db4714029ff21dda8f3ca6aac979373
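
If it helps to see the math, here’s a rough sketch of what the mask to image + screen blend steps do to the pixels. This is plain Python on tiny nested lists, not the actual ComfyUI node code, and the function names (screen, pseudo_mask) are made up for illustration. It assumes channel values in the range 0 to 1 and a blend factor of 1.0.

```python
def screen(a, b):
    """Screen blend of two channel values in [0, 1]."""
    return 1.0 - (1.0 - a) * (1.0 - b)


def pseudo_mask(image, mask):
    """Return a copy of `image` with white painted wherever `mask` is 1.0.

    image: rows of (r, g, b) tuples, each channel in [0, 1]
    mask:  rows of floats, 1.0 = area the model should replace
    (Hypothetical helper, only meant to illustrate the blend.)
    """
    return [
        [tuple(screen(c, m) for c in pixel) for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Tiny 2x2 example: mask only the top-left pixel.
image = [
    [(0.25, 0.5, 0.75), (0.25, 0.5, 0.75)],
    [(0.25, 0.5, 0.75), (0.25, 0.5, 0.75)],
]
mask = [
    [1.0, 0.0],
    [0.0, 0.0],
]
result = pseudo_mask(image, mask)
```

Screening with white (1.0) always gives white, and screening with black (0.0) leaves the pixel alone, which is why the “screen” mode paints the mask shape onto the image in white.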

Masking the reference image

  • Make a mask in the mask editor.
  • Pipe the mask to an invert mask node (or invert it in the mask editor), pipe that to a mask to image node, and pipe that plus the reference image to an image blend node. Pipe the result to the WanVideo Vace Encode node’s ref images input.
  • The inverting is purely for ease of use. E.g. you draw a mask over a face, then invert so that everything but the face becomes white pixels.
  • This is telling the model to ignore the white pixels in the reference image.

https://preview.redd.it/xby8vyyxvxcf1.png?width=1988&format=png&auto=webp&s=84dc97fc63e49a2c618967e2f26da3bf58a87c52
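
Here’s a rough sketch of the invert-then-blend idea for the reference image, again in plain Python rather than real node code. The helper names (invert_mask, isolate_subject) are hypothetical, and it assumes the same screen blend at full strength as for the first/last frames.

```python
def invert_mask(mask):
    """Flip a mask: 1.0 becomes 0.0 and vice versa."""
    return [[1.0 - m for m in row] for row in mask]


def isolate_subject(image, subject_mask):
    """Keep pixels where `subject_mask` is 1.0 and turn everything else white.

    image:        rows of (r, g, b) tuples, each channel in [0, 1]
    subject_mask: rows of floats, 1.0 = the part you drew over (e.g. a face)
    (Hypothetical helper, only meant to illustrate the steps.)
    """
    inverted = invert_mask(subject_mask)
    return [
        # Screen blend each channel with the inverted mask value.
        [tuple(1.0 - (1.0 - c) * (1.0 - m) for c in pixel)
         for pixel, m in zip(img_row, inv_row)]
        for img_row, inv_row in zip(image, inverted)
    ]


# 1x3 reference image: only the middle pixel is the subject.
reference = [[(0.5, 0.25, 0.0), (0.5, 0.25, 0.0), (0.5, 0.25, 0.0)]]
subject_mask = [[0.0, 1.0, 0.0]]
isolated = isolate_subject(reference, subject_mask)
```

The subject you masked survives untouched, and the rest of the reference becomes white, which is what the model is told to ignore.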

Masking the video input

  • The video input can have an optional actual mask (not pseudo-mask). If you use a mask, the model will replace only pixels in the masked parts of the video. If you don’t, then all of the video’s pixels will be replaced.
  • But the original (un-preprocessed) video pixels won’t drive motion. To drive motion, the video needs to be preprocessed, e.g. converting it to a depth map video.
  • So if you want to keep parts of the original video, you’ll need to composite the preprocessed video over top of the masked area of the original video.

https://preview.redd.it/h4cmguszvxcf1.png?width=2128&format=png&auto=webp&s=3b44ab3744634feb02bb94ebe09664743041f4f7
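
The compositing step can be sketched like this. It’s a simplified per-pixel version in plain Python, assuming the original video, the preprocessed video, and the mask all have the same size and frame count. The function names are made up for illustration and aren’t part of any node pack.

```python
def composite_frame(original, preprocessed, mask):
    """Put preprocessed pixels over the original wherever the mask is 1.0.

    Per channel: out = mask * preprocessed + (1 - mask) * original
    original, preprocessed: rows of (r, g, b) tuples in [0, 1]
    mask: rows of floats, 1.0 = area the model should replace
    """
    return [
        [tuple(m * p + (1.0 - m) * o for o, p in zip(o_px, p_px))
         for o_px, p_px, m in zip(o_row, p_row, m_row)]
        for o_row, p_row, m_row in zip(original, preprocessed, mask)
    ]


def composite_video(original_frames, preprocessed_frames, mask_frames):
    """Apply composite_frame to every frame of the video."""
    return [
        composite_frame(o, p, m)
        for o, p, m in zip(original_frames, preprocessed_frames, mask_frames)
    ]


# One 1x2 frame: the left pixel is masked, the right pixel is kept.
original_frames = [[[(0.2, 0.2, 0.2), (0.2, 0.2, 0.2)]]]
preprocessed_frames = [[[(0.8, 0.8, 0.8), (0.8, 0.8, 0.8)]]]
mask_frames = [[[1.0, 0.0]]]
control_frames = composite_video(original_frames, preprocessed_frames, mask_frames)
```

The composited frames go in as the video input and the same mask goes in as the actual mask, so the masked area carries the motion guidance while the rest stays as original pixels.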

The effect of masks

  • For the video, masking works just like still-image inpainting with masks: the unmasked parts of the video will be unaltered.
  • For the first and last frames, the pseudo-mask (white pixels) helps the model understand what part of these frames to replace with the reference image. But even without it, the model can introduce elements of the reference images in the middle frames.
  • For the reference image, the pseudo-mask (white pixels) helps the model understand the separate objects from the reference that you want to use. But even without it, the model can often figure things out.

Example 1: Add object from reference to first frame

  • Inputs
    • Prompt: “He puts on sunglasses.”
    • First frame: a man who’s not wearing sunglasses (no masking)
    • Reference: a pair of sunglasses on a white background (pseudo-masked)
    • Video: either none, or something appropriate for the prompt. E.g. a depth map of someone putting on sunglasses or simply a moving red box on white background where the box moves from off-screen to the location of the face.
  • Output
    • The man from the first frame image will put on the sunglasses from the reference image.

https://preview.redd.it/04tmzcx2wxcf1.png?width=900&format=png&auto=webp&s=ebaf88d96454a45ef0da9fe5ef981c44090c0c6d
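
If you want to try the moving red box idea, here’s a minimal sketch of how such a control video could be generated as raw frames. It’s plain Python with a tiny made-up resolution and frame count. The function name and all of the numbers are assumptions for illustration, and you’d still need to save the frames in whatever format your workflow loads.

```python
WHITE = (1.0, 1.0, 1.0)
RED = (1.0, 0.0, 0.0)


def red_box_frames(width, height, box, target_x, target_y, num_frames):
    """Frames of a red square sliding in from off-screen left to a target spot.

    (target_x, target_y) is the top-left corner of the box in the last frame,
    e.g. roughly where the face is. Each frame is rows of (r, g, b) tuples.
    """
    frames = []
    start_x = -box  # start fully off-screen on the left
    for i in range(num_frames):
        # t goes from 0.0 (first frame) to 1.0 (last frame)
        t = i / (num_frames - 1) if num_frames > 1 else 1.0
        x0 = round(start_x + t * (target_x - start_x))
        frame = [
            [RED if x0 <= x < x0 + box and target_y <= y < target_y + box else WHITE
             for x in range(width)]
            for y in range(height)
        ]
        frames.append(frame)
    return frames


frames = red_box_frames(width=16, height=8, box=2, target_x=10, target_y=3, num_frames=5)
```

The first frame is completely white because the box hasn’t entered yet, and the last frame has the box sitting at the target location.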

Example 2: Use reference to maintain consistency

  • Inputs
    • Prompt: “He walks right until he reaches the other side of the column, walking behind the column.”
    • Last frame: a man standing to the right of a large column (no masking)
    • Reference: the same man, facing the camera (no masking)
    • Video: either none, or something appropriate for the prompt
  • Output
    • The man starts on the left and moves right, and his face is temporarily obscured by the column. The face is consistent before and after being obscured, and matches the reference image. Without the reference, his face might change before and after the column.

https://preview.redd.it/nb8p77hhwxcf1.png?width=1264&format=png&auto=webp&s=ac9aeb83e8257167cfc898c7f7d26b8f568f5a74

Example 3: Use reference to composite multiple characters to a background

  • Inputs
    • Prompt: “The man pets the dog in the field.”
    • First frame: an empty field (no masking)
    • Reference: a man and a dog on a white background (pseudo-masked)
    • Video: either none, or something appropriate for the prompt
  • Output
    • The man from the reference pets the dog from the reference, except in the first frame, which will always exactly match the input first frame.
    • The man and dog need to have the correct relative size in the reference image. If they’re the same size, you’ll get a giant dog.
    • You don’t need to mask the reference image. It just works better if you do.

https://preview.redd.it/ggbogejjwxcf1.png?width=1264&format=png&auto=webp&s=108bebc806df35e49abd1147096431f2df0b3800

Example 4: Combine reference and prompt to restyle video

  • Inputs
    • Prompt: “The robot dances on a city street.”
    • First frame: none
    • Reference: a robot on a white background (pseudo-masked)
    • Video: depth map of a person dancing
  • Output
    • The robot from the reference dances on the city street, following the motion of the video. Wan is free to create the street.
    • The result will be nearly the same if you use the robot image as the first frame instead of the reference. But using it as the reference gives the model more freedom. Remember, the output first frame will always exactly match the input first frame unless the first frame is missing or solid gray.

https://preview.redd.it/2v6mha2mwxcf1.png?width=1264&format=png&auto=webp&s=8ee0d0c9df4e257e99cfd2b356fa29fb44666dd6

Example 5: Use reference to face swap

  • Inputs
    • Prompt: “The man smiles.”
    • First frame: none
    • Reference: desired face on a white background (pseudo-masked)
    • Video: Man in a cafe smiles, and on all frames:
      • There’s an actual mask channel masking the unwanted face
      • Face-pose preprocessing pixels have been composited over (replacing) the unwanted face pixels
  • Output
    • The face is swapped to match the reference, while all of the other video pixels are retained.
    • More effective face-swapping tools exist than Vace!
    • But with Vace you can swap anything. You could swap everything except the faces.

https://preview.redd.it/rm95fd9owxcf1.png?width=1508&format=png&auto=webp&s=cbbf13734986525125a6ab5c68449916a20238fa

How to use the encoder strength setting

  • The WanVideo Vace Encode node has a strength setting.
  • If you set it to 0, then all of the inputs (first, last, reference, and video) will be ignored, and you’ll get pure text to video based on the prompts.
  • Especially when using a driving video, you typically want a value lower than 1 (e.g. 0.9) to give the model a little freedom, just like any controlnet. Experiment!
  • Though you might wish to be able to give low strength to the driving video but high strength to the reference, that’s not possible. But what you can do instead is use a less detailed preprocessor with high strength. E.g. use pose instead of depth map. Or simply use a video of a moving red box.