See What the Machine Sees: Building a Video Matting & ControlNet Pipeline in ComfyUI
Video Matting and other vegetables in ComfyUI
Guessing is expensive. Not in dollars... in time, in frustration, in that slow erosion of creative momentum. What if you could literally watch what the AI sees when it looks at your video? Five outputs. One workflow. Zero diffusion steps. And suddenly the black box cracks open.
The Problem Nobody Talks About
Most folks building video-to-video workflows in ComfyUI skip straight to generation. Plug in a ControlNet, cross fingers, hit render. When the output looks wrong... they guess. Tweak a weight. Swap a preprocessor. Guess again.
That cycle burns hours.
Giling Around built something better. A single ComfyUI workflow that extracts a human subject from video using Robust Video Matting, then generates multiple ControlNet visualization outputs... depth maps, OpenPose skeletons, and LineArt renders... all as standalone videos you can actually watch.
No diffusion required. Just preprocessing. Just seeing.
Why This Matters Beyond the Technical
Here's where it gets interesting for anyone waging their own creative war.
When you render a depth map as a video, you're watching the AI's understanding of spatial relationships play out frame by frame. When you see the OpenPose skeleton dance across your screen, you're watching the machine parse human movement into geometric relationships. That LineArt output? That's the AI stripping your footage down to essential edges and contours.
Each one is a lens. A different way of perceiving the same truth.
And once you've seen through those lenses... you stop guessing. You start knowing. The next time you wire up a ControlNet for a video-to-video project, you carry that understanding with you. Less trial and error. More intentional creation.
Time × Focus = Attention. This workflow is a focus multiplier.
The Build: Five Outputs, One Pipeline
The architecture is cleaner than you'd expect.
Step 1: Load and Downscale. Pull your source video into ComfyUI and immediately reduce the resolution. This standardizes frame sizes across every downstream node and speeds up the entire pipeline. Giling uses the Klinter size selector node for aspect ratio management... a small tool that solves a real problem. No shame in admitting you can't remember aspect ratios. Honest tools beat pretending.
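If you want a feel for what that stage is actually doing, here's a minimal sketch of the same idea outside ComfyUI, using OpenCV. The file name and target width are placeholders, and the proportional resize stands in for the aspect-ratio handling the Klinter node does for you:

```python
import cv2

def load_and_downscale(path, target_width=512):
    """Read every frame of a video and resize it, keeping the aspect ratio."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)  # keep the source frame rate for the combine step
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        new_h = int(h * target_width / w)  # preserve aspect ratio
        frames.append(cv2.resize(frame, (target_width, new_h)))
    cap.release()
    return frames, fps

frames, fps = load_and_downscale("source.mp4")  # placeholder path
```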
Step 2: Robust Video Matting. This is the core. The Robust Video Matting node separates your human subject from the background... no manual rotoscoping, no green screen shoot required. Install it from ComfyUI Manager. The image output gives you the isolated subject against whatever background color you choose. The mask output gives you a clean alpha.
Two outputs already. Green screen video. Mask video.
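The ComfyUI node wraps the open-source Robust Video Matting model. If you'd like to see it from outside the graph, the upstream project exposes a torch.hub entry point; here's a sketch along the lines of its documented usage, with placeholder file names (per the project's docs, the composition output is the subject over a solid green background):

```python
import torch

# Load the RVM model and its video converter helper via torch.hub
# (PeterL1n/RobustVideoMatting; argument names follow that project's README).
model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3")
convert_video = torch.hub.load("PeterL1n/RobustVideoMatting", "converter")

convert_video(
    model,
    input_source="source.mp4",             # placeholder input video
    output_type="video",
    output_composition="greenscreen.mp4",  # subject composited over a solid background
    output_alpha="mask.mp4",               # clean alpha matte
    seq_chunk=12,                          # frames processed per batch
)
```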
Step 3: ControlNet Preprocessors. Now the pipeline branches. Feed that same processed video into three preprocessor nodes:
- Depth Anything for depth estimation
- OpenPose Preprocessor for skeletal pose mapping
- LineArt Preprocessor for edge and contour extraction
Each node generates frames that get combined into their own video output.
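Each preprocessor is, at its core, a per-frame image-to-image transform. Here's a rough standalone sketch using the controlnet_aux detectors and a Hugging Face depth-estimation pipeline as stand-ins for the ComfyUI nodes; the checkpoint IDs and the placeholder matted_frames list are assumptions, not the workflow's exact settings:

```python
from PIL import Image
from controlnet_aux import OpenposeDetector, LineartDetector
from transformers import pipeline

# Detectors behind the preprocessor nodes. The checkpoint IDs below are the
# common community ones -- treat them as assumptions, not the original settings.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
lineart = LineartDetector.from_pretrained("lllyasviel/Annotators")
depth = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

# Placeholder: the per-frame PIL images coming out of the matting stage.
matted_frames = [Image.open("matte_0001.png").convert("RGB")]

depth_frames = [depth(f)["depth"] for f in matted_frames]    # spatial depth map
pose_frames = [openpose(f) for f in matted_frames]           # OpenPose skeleton render
lineart_frames = [lineart(f) for f in matted_frames]         # edge / contour render
```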
Step 4: Combine and Render. Multiple Video Combine nodes run simultaneously, each synced to the original video's frame rate. The pose estimation node is the slowest step in the chain... worth knowing when you're planning your render time around a coffee break or a meeting.
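The combine stage amounts to "write each frame stream to its own file at the source frame rate." A minimal sketch with OpenCV, assuming the PIL frame lists from the preprocessor sketch above and the fps captured at load time:

```python
import cv2
import numpy as np

def write_video(frames, path, fps):
    """Write a list of same-sized PIL frames to an mp4, synced to the source fps."""
    arrays = [np.asarray(f.convert("RGB")) for f in frames]
    h, w = arrays[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for rgb in arrays:
        writer.write(cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))  # OpenCV expects BGR order
    writer.release()

# One writer per output stream, all at the frame rate captured when the video was loaded.
write_video(depth_frames, "depth.mp4", fps)
write_video(pose_frames, "pose.mp4", fps)
write_video(lineart_frames, "lineart.mp4", fps)
```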
Final count: five synchronized video outputs from a single workflow.
What You Actually Get
| Output | What It Shows | Creative Use |
|--------|---------------|--------------|
| Green Screen | Isolated subject, solid background | Compositing, layering, VFX |
| Mask | Clean black-and-white alpha | Precise subject isolation |
| Depth Map | Spatial depth estimation | Parallax effects, 3D workflows |
| OpenPose Skeleton | Joint and limb positioning | Character animation, motion transfer |
| LineArt | Edge and contour rendering | Stylized generation, artistic overlays |
Each one is usable as a standalone creative asset. Each one is usable as input for further AI video generation. And each one teaches you something about how ControlNet models perceive and process visual information.
The Quiet Power Here
This pipeline runs without a single diffusion step. No diffusion model to load. No sampling. No denoising. That means it's fast... even on modest hardware. It means the barrier to entry drops significantly.
But the real gift isn't speed.
It's understanding.
When you watch your footage translated into a depth map... you see which spatial relationships the AI prioritizes and which it flattens. When you watch the OpenPose skeleton... you see where pose estimation gets confident and where it wobbles. That knowledge transfers to every future project.
You stop fighting tools you don't understand. You start collaborating with tools you've seen from the inside.
Practical Notes
- Installation: Robust Video Matting, Depth Anything, and the preprocessor nodes all install through ComfyUI Manager. Straightforward.
- Organization: Alt+drag duplicates nodes. Use reroutes to keep your graph readable. Giling calls it a "nodal allergy"... I call it wisdom. Clean workspaces produce clearer thinking.
- Render Planning: Budget extra time for the OpenPose estimation pass. Everything else moves quickly.
- Frame Count: Test with low frame counts first. Validate the pipeline before committing to full renders (a small sketch of that habit follows below).
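That last note is worth turning into a concrete habit. A tiny sketch of the idea, reusing the hypothetical frames list from the loading sketch above:

```python
# Validate the whole chain on a short slice before committing to a full render.
TEST_FRAMES = 48                    # roughly two seconds at 24 fps
test_frames = frames[:TEST_FRAMES]  # run matting + preprocessors on this slice first,
                                    # eyeball the outputs, then rerun on the full list
```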
There's a principle I keep coming back to... What is it about? Answer that before everything else. Before you render. Before you tweak. Before you guess.
This workflow answers that question for ControlNet preprocessing. It shows you what each model sees, how it interprets your footage, and what it prioritizes. Five lenses on the same truth.
So build it. Run your footage through. Watch the outputs.
Because the next time you wire up a video-to-video pipeline, you won't be guessing. You'll be working with understanding. And that... that changes everything. 🛠️
---

Source: https://www.youtube.com/watch?v=xKeH-ZjDlj8
From TIG's Notebook
Thoughts that surfaced while watching this.
The two most important days in your life are the day you are born and the day you find out why. — *Mark Twain* — TIG's Notebook — On Purpose & Legacy
We don't build trust by offering help. We build trust by asking for help. — *Simon Sinek* — TIG's Notebook — On Connection & Understanding
Ever tried, ever failed, no matter. Try again, fail again, fail better! — *Samuel Beckett* — TIG's Notebook — On Failure & Perseverance
Echoes
Wisdom from across the constellation that resonates with this article.
The asymmetry between the quality of Perplexity's execution and the fragility of its structural position, that is the thing I can't stop thinking about.
Three simultaneous shifts in consumer hardware, agent attention spans, and AI memory systems are converging in 2026 to make AI agents viable for mainstream adoption.
It's more amazing to me that they did this whole thing in less than 5 megs.