See What the Machine Sees: Building a Video Matting & ControlNet Pipeline in ComfyUI
Video Matting and other vegetables in comfyUI
Guessing is expensive. Not in dollars... in time, in frustration, in that slow erosion of creative momentum. What if you could literally watch what the AI sees when it looks at your video? Five outputs. One workflow. Zero diffusion steps. And suddenly the black box cracks open.
The Problem Nobody Talks About
Most folks building video-to-video workflows in ComfyUI skip straight to generation. Plug in a ControlNet, cross fingers, hit render. When the output looks wrong... they guess. Tweak a weight. Swap a preprocessor. Guess again.
That cycle burns hours.
Giling Around built something better. A single ComfyUI workflow that extracts a human subject from video using Robust Video Matting, then generates multiple ControlNet visualization outputs... depth maps, OpenPose skeletons, and LineArt renders... all as standalone videos you can actually watch.
No diffusion required. Just preprocessing. Just seeing.
Why This Matters Beyond the Technical
Here's where it gets interesting for anyone waging their own creative war.
When you render a depth map as a video, you're watching the AI's understanding of spatial relationships play out frame by frame. When you see the OpenPose skeleton dance across your screen, you're watching the machine parse human movement into geometric relationships. That LineArt output? That's the AI stripping your footage down to essential edges and contours.
Each one is a lens. A different way of perceiving the same truth.
And once you've seen through those lenses... you stop guessing. You start knowing. The next time you wire up a ControlNet for a video-to-video project, you carry that understanding with you. Less trial and error. More intentional creation.
Time × Focus = Attention. This workflow is a focus multiplier.
The Build: Five Outputs, One Pipeline
The architecture is cleaner than you'd expect.
Step 1: Load and Downscale. Pull your source video into ComfyUI and immediately reduce the resolution. This standardizes frame sizes across every downstream node and speeds up the entire pipeline. Giling uses the Klinter size selector node for aspect ratio management... a small tool that solves a real problem. No shame in admitting you can't remember aspect ratios. Honest tools beat pretending.
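The downscale math from Step 1 is small enough to sketch by hand. This is a minimal illustration, not the size selector node's actual logic... the function name, the 768-pixel cap, and the snap-to-multiple-of-8 rule are all assumptions chosen for the example.

```python
def downscale_size(width, height, max_side=768, multiple=8):
    """Fit (width, height) inside max_side on the longer edge.

    Hypothetical helper: uses integer math to keep the aspect ratio, then
    snaps each side down to a multiple of `multiple`. Never upscales.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longer = max(width, height)
    if longer > max_side:
        # Scale both sides by max_side / longer without float rounding.
        width = width * max_side // longer
        height = height * max_side // longer
    # Snap down to the nearest multiple, but never below one multiple.
    new_w = max(multiple, width // multiple * multiple)
    new_h = max(multiple, height // multiple * multiple)
    return new_w, new_h


if __name__ == "__main__":
    print(downscale_size(1920, 1080))  # landscape 16:9 source
    print(downscale_size(1080, 1920))  # portrait 9:16 source
```

Because every downstream branch receives the same resized frames, the five outputs end up with identical dimensions.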
Step 2: Robust Video Matting. This is the core. The Robust Video Matting node separates your human subject from the background... no manual rotoscoping, no green screen shoot required. Install it from ComfyUI Manager. The image output gives you the isolated subject against whatever background color you choose. The mask output gives you a clean alpha.
Two outputs already. Green screen video. Mask video.
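What do those two outputs have to do with each other? A rough sketch, assuming the node hands back a foreground and an alpha matte with values between 0 and 1... the helper names and the list-of-tuples frame layout are made up for illustration, and the real node works on image tensors.

```python
def composite_pixel(fg, alpha, bg):
    """Blend one RGB pixel over a background: out = fg * a + bg * (1 - a)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return tuple(round(f * alpha + b * (1.0 - alpha)) for f, b in zip(fg, bg))


def composite_frame(frame, matte, bg=(0, 255, 0)):
    """Composite a frame over a solid colour.

    `frame` is rows of RGB tuples, `matte` is rows of alpha values in [0, 1].
    Both layouts are assumptions made for this sketch.
    """
    return [
        [composite_pixel(px, a, bg) for px, a in zip(row, matte_row)]
        for row, matte_row in zip(frame, matte)
    ]


def matte_to_gray(matte):
    """Turn the matte into 0-255 grayscale rows, i.e. one mask video frame."""
    return [[round(a * 255) for a in row] for row in matte]
```

The green screen video is the composite. The mask video is the matte itself, rendered as grayscale.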
Step 3: ControlNet Preprocessors. Now the pipeline branches. Feed that same processed video into three preprocessor nodes:
- Depth Anything for depth estimation
- OpenPose Preprocessor for skeletal pose mapping
- LineArt Preprocessor for edge and contour extraction
Each node generates frames that get combined into their own video output.
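The branching in Step 3 is a plain fan-out: one frame sequence, several independent consumers. Here is a sketch of that shape only. The callables below are placeholders that label each frame... they stand in for the real preprocessor nodes and do no image processing at all.

```python
def run_branches(frames, preprocessors):
    """Feed the same frames to every preprocessor.

    Returns {branch_name: [processed frames]}. Each branch keeps the input
    order and length, so the outputs stay frame-aligned.
    """
    return {
        name: [fn(frame) for frame in frames]
        for name, fn in preprocessors.items()
    }


# Stand-in callables (hypothetical): they only tag the frame so the data
# flow is visible. The real nodes run neural networks; these do not.
PLACEHOLDER_BRANCHES = {
    "depth": lambda frame: f"depth({frame})",
    "openpose": lambda frame: f"pose({frame})",
    "lineart": lambda frame: f"lineart({frame})",
}

if __name__ == "__main__":
    outputs = run_branches(["f0", "f1", "f2"], PLACEHOLDER_BRANCHES)
    for name, branch_frames in outputs.items():
        print(name, branch_frames)
```

No branch depends on another, which is why you can add or drop a preprocessor without touching the rest of the graph.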
Step 4: Combine and Render. Multiple Video Combine nodes execute in the same queued run, each set to the original video's frame rate so the outputs stay in sync. The pose estimation node is the slowest step in the chain... worth knowing when you're planning your render time around a coffee break or a meeting.
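"Synced" here comes down to two numbers per output: frame count and frame rate. A small sanity check, assuming you have noted both for each rendered video... the function names and the dictionary layout are invented for this sketch.

```python
def duration_seconds(frame_count, fps):
    """Playback length of a clip: frames divided by frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def out_of_sync(outputs, source_frames, source_fps):
    """Return the sorted names of outputs that do not match the source.

    `outputs` maps an output name to a (frame_count, fps) pair. An output
    is in sync only if both values equal the source video's values.
    """
    return sorted(
        name
        for name, (frames, fps) in outputs.items()
        if frames != source_frames or fps != source_fps
    )
```

If one Video Combine node is left at a different frame rate, that output drifts against the others even though it holds the same frames.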
Final count: five synchronized video outputs from a single workflow.
What You Actually Get
| Output | What It Shows | Creative Use |
|--------|--------------|-------------|
| Green Screen | Isolated subject, solid background | Compositing, layering, VFX |
| Mask | Clean black and white alpha | Precise subject isolation |
| Depth Map | Spatial depth estimation | Parallax effects, 3D workflows |
| OpenPose Skeleton | Joint and limb positioning | Character animation, motion transfer |
| LineArt | Edge and contour rendering | Stylized generation, artistic overlays |
Each one is usable as a standalone creative asset. Each one is usable as input for further AI video generation. And each one teaches you something about how ControlNet models perceive and process visual information.
The Quiet Power Here
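This pipeline runs without a single diffusion step. No diffusion checkpoint to load. No sampling. No denoising. The matting and preprocessor models still load, but each frame gets a single inference pass rather than an iterative sampling loop. That means it's fast... even on modest hardware. It means the barrier to entry drops significantly.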
But the real gift isn't speed.
It's understanding.
When you watch your footage translated into a depth map... you see which spatial relationships the AI prioritizes and which it flattens. When you watch the OpenPose skeleton... you see where pose estimation gets confident and where it wobbles. That knowledge transfers to every future project.
You stop fighting tools you don't understand. You start collaborating with tools you've seen from the inside.
Practical Notes
- Installation: Robust Video Matting, Depth Anything, and the preprocessor nodes all install through ComfyUI Manager. Straightforward.
- Organization: Alt+drag duplicates nodes. Use reroutes to keep your graph readable. Giling calls it a "nodal allergy"... I call it wisdom. Clean workspaces produce clearer thinking.
- Render Planning: Budget extra time for the OpenPose estimation pass. Everything else moves quickly.
- Frame Count: Test with low frame counts first. Validate the pipeline before committing to full renders.
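The frame count tip is easy to picture as a slice. A sketch under assumptions: the function and its parameter names are hypothetical, and your video loader may expose different controls for capping or skipping frames.

```python
def select_test_frames(frames, cap=16, skip_first=0, every_nth=1):
    """Pick a small, evenly strided subset of frames for a trial run.

    cap        -- maximum number of frames to keep (0 means no cap)
    skip_first -- how many leading frames to drop
    every_nth  -- keep one frame out of every n
    """
    if cap < 0 or skip_first < 0 or every_nth < 1:
        raise ValueError("cap and skip_first must be >= 0, every_nth >= 1")
    picked = frames[skip_first::every_nth]
    return picked[:cap] if cap else picked


if __name__ == "__main__":
    clip = list(range(100))  # stand-in for 100 decoded frames
    print(select_test_frames(clip, cap=5, skip_first=10, every_nth=2))
```

A short, strided sample still touches the start, middle, and end of a clip's motion, which is usually enough to catch a broken branch before a full render.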
There's a principle I keep coming back to... What is it about? Answer that before everything else. Before you render. Before you tweak. Before you guess.
This workflow answers that question for ControlNet preprocessing. It shows you what each model sees, how it interprets your footage, and what it prioritizes. Five lenses on the same truth.
So build it. Run your footage through. Watch the outputs.
Because the next time you wire up a video-to-video pipeline, you won't be guessing. You'll be working with understanding. And that... that changes everything. 🛠️
---
Source: https://www.youtube.com/watch?v=xKeH-ZjDlj8
From TIG's Notebook
Thoughts that surfaced while watching this.
**What is it about?** Answer this before everything else. At the beginning of every day, every project, every meeting, clarify what it is about. Defining this before action will save you time and energy and enhance your focus. (TIG's Notebook, Core Principles)
Don't be afraid of take two. (TIG's Notebook, On Failure & Perseverance)
If you are able to emotionally heal and not allow it to turn into bitterness, then it becomes a superpower. *Chaplain TIG* (TIG's Notebook, On Self & Identity)