I rewrote this article because the previous version was still too soft for what I actually built. It was not wrong, but it described the clip more as an outcome than as a technical production system. That was the part I wanted to capture properly for AutomatedWeb. The interesting result was not just that a video came out at the end. The interesting result was that several narrow MCP (Model Context Protocol) servers started to behave like a reusable path for images, video, audio, editing, and publishing.
Current cut: This is the latest version. The end card now holds long enough for the brand to register, the soundtrack starts at frame one, and the original street ambience remains audible under the Suno layer.
What the technical task actually was
The real task was never “make something cool with a model.” I wanted a durable production path for AutomatedWeb. If I build the next clip, I do not want to restart from browser tabs, prompt fragments, manual downloads, and improvised shell commands. I want the same chain every time:
- generate image anchors
- render shots
- trim and stitch them
- generate and fetch music
- build a final audio mix
- publish the asset and embed it
That is why I treated the project as a set of narrow MCP layers instead of a single oversized prompt.
The MCP split
The final stack ended up looking like this:
- Nano Banana MCP for keyframes and image anchors
- Video Pipeline MCP for Seedance, storyboard runs, stitching, and local ffmpeg steps
- Suno MCP for music generation, polling, downloads, and later WAV conversion
- YouTube MCP for the eventual publish step
That split mattered. One large MCP with everything in it would have been faster to sketch, but much worse to maintain. With the stack separated, I could iterate on one layer without destabilising the rest. If the visuals broke, I stayed in the video pipeline. If the soundtrack did not carry the piece, I worked in the Suno layer. If publishing changes later, that belongs in the YouTube step.
The concrete video pipeline
The video side ran through a local MCP that combined several capabilities:
- Seedance "text-to-video"
- Seedance "image-to-video"
- Seedance "reference-to-video"
- queue status polling and result fetching
- local trimming through ffmpeg
- seamless stitching with video and audio fades
- storyboard manifests, run files, and refresh logic
The key idea was not to force a whole brand film out of one giant generation. That usually looks acceptable in a playground and collapses in a real edit. Instead I wanted short, controllable units. Long open-ended generations drift too easily: architecture softens, camera energy gets messy, and anything recognisable starts to look synthetic very quickly.
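The "short, controllable units" idea depends on a trimming step that is easy to repeat. As a rough sketch only, and not the pipeline's actual code, a helper could build the ffmpeg argument list for cutting one shot. The function name, file names, and codec choices (`libx264`, `aac`) are assumptions for illustration.

```python
def build_trim_command(src, dst, start_s, duration_s):
    """Build an ffmpeg argument list that cuts one short unit out of a shot.

    Hypothetical helper: names and codec choices are illustrative assumptions.
    """
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start must be >= 0 and duration must be > 0")
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-ss", str(start_s),       # seek on the input, before -i
        "-i", src,
        "-t", str(duration_s),     # keep only this many seconds
        "-c:v", "libx264",         # re-encode so the cut is frame-accurate
        "-c:a", "aac",
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(build_trim_command("shot_02.mp4", "shot_02_trim.mp4", 0.5, 4.0)))
```

Returning a list instead of a shell string keeps the command safe to pass to `subprocess.run` without quoting problems.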
Why I kept the first shot
The opening shot already worked very early. It had the right balance for AutomatedWeb: recognisable reality, enough movement, and enough restraint not to feel like a generic model demo. That mattered because I did not want the film to advertise the model. I wanted it to support the brand.
So the right move was not to replace everything. It was to keep the strong opening and rebuild the weak second half around it. That sounds simple, but in generative workflows that decision is often the actual turning point. Once I stopped trying to regenerate the entire clip, the quality curve improved quickly.
What failed in the early versions
There were five recurring failure modes:
1. Static image phases
Some early versions looked strong as stills but dead as moving shots. That usually happened when the process leaned too heavily on anchor frames and not enough on convincing in-shot motion. Once movement stopped feeling spatial, the result flipped from “cinematic” to “animated still.”
2. Landmark drift
The most obvious failure was that in one version the Frauenkirche appeared to move. That was unacceptable. The moment a real landmark bends or drifts, the whole illusion weakens. I had to re-render the later Munich shots with tighter visual anchoring and much stronger constraints on architectural stability.
3. Unnatural people
I also tried a version with a distant human figure, and it never looked natural. The problem was not only Seedance: the still-image references were not truly consistent either, so the person never felt fully real from shot to shot. The correct response for this film was to remove the visible protagonist completely.
4. Weak logo endings
Several versions technically included the brand, but they did not really end on it. The logo was too brief, too soft, or placed on top of moving footage in a way that never really landed. The branding existed, but it did not hold.
5. Fragile export finishing
The most annoying technical problem was that some ffmpeg passes produced files that looked valid at first but were never finalised correctly. The clearest symptom was an invalid MP4 with no "moov atom", the index that the MP4 muxer normally writes at the end of the file once an export completes. That forced me to simplify the final export path rather than keep pushing larger one-shot renders.
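A broken export of this kind can be caught before it moves further down the chain. The following is a minimal sketch, not something the pipeline is known to contain: it walks the top-level boxes of an MP4 file and reports whether a `moov` box is present. The function names are hypothetical, and the check only looks at the top-level structure, not at whether the file actually plays.

```python
import struct
from pathlib import Path


def top_level_boxes(path):
    """Return the four-character types of the top-level boxes in an MP4 file."""
    types = []
    file_size = Path(path).stat().st_size
    with open(path, "rb") as fh:
        offset = 0
        while offset + 8 <= file_size:
            fh.seek(offset)
            size, box_type = struct.unpack(">I4s", fh.read(8))
            header_len = 8
            if size == 1:
                # A size of 1 means a 64-bit size follows the type field.
                ext = fh.read(8)
                if len(ext) < 8:
                    break
                size = struct.unpack(">Q", ext)[0]
                header_len = 16
            elif size == 0:
                # A size of 0 means the box runs to the end of the file.
                size = file_size - offset
            if size < header_len:
                break  # corrupt header, stop scanning
            types.append(box_type.decode("latin-1"))
            offset += size
    return types


def has_moov(path):
    """True if the file contains a top-level 'moov' box."""
    return "moov" in top_level_boxes(path)
```

Running `has_moov` on every export before stitching or publishing turns a silent failure into an explicit one.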
What changed in the Seedance strategy
The biggest shift was moving away from “make the whole thing in one generation” and toward controlled shot logic:
- keep shots that already work
- only re-render the broken sections
- anchor skyline and city identity much harder
- solve the ending in editing rather than forcing the model to do the entire brand move
That is also where the local MCP mattered more than the model UI. I could poll, fetch, compare, cut, and recombine outputs directly. Once the work is structured around run files and shot identities, iteration becomes much less destructive.
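The polling part of that loop can be sketched generically. This is an illustrative helper under stated assumptions, not the MCP's real interface: `fetch_status` stands in for whatever call returns the task state, and the status strings `"completed"` and `"failed"` are placeholders rather than documented Seedance or Suno values.

```python
import time


def poll_until_done(fetch_status, interval_s=5.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a terminal state or time runs out.

    fetch_status is a hypothetical callable returning a dict with a "status" key.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_status()
        if state["status"] in ("completed", "failed"):
            return state
        if clock() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated task that finishes on the third poll.
    fake = iter([{"status": "queued"}, {"status": "running"},
                 {"status": "completed", "result_url": "https://example.invalid/out.mp4"}])
    print(poll_until_done(lambda: next(fake), interval_s=0))
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting.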
Storyboards and run files
One of the most useful parts of the pipeline was that I was not managing loose assets anymore. The workflow used storyboard manifests and run files. That means a shot is not just “that MP4 in Downloads.” It has explicit state:
- references
- prompt
- endpoint
- request id
- status
- result URL
- local output path
That is a much better operating model for generative production. It lets me rerun one shot without losing the rest of the chain. It also makes the later editing decisions traceable.
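To make the idea of explicit shot state concrete, here is a small sketch of what a run file could look like in code. The field names follow the list above, but the class name, the JSON layout, the `"pending"` and `"completed"` status values, and the `shots_to_rerender` helper are all assumptions for illustration, not the actual format the pipeline uses.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ShotRun:
    """State of one shot in a storyboard run (hypothetical schema)."""
    shot_id: str
    prompt: str
    endpoint: str
    references: list = field(default_factory=list)
    request_id: Optional[str] = None
    status: str = "pending"
    result_url: Optional[str] = None
    output_path: Optional[str] = None


def save_runs(runs, path):
    """Write all shot states to one JSON run file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([asdict(r) for r in runs], fh, indent=2)


def load_runs(path):
    """Read shot states back from a JSON run file."""
    with open(path, encoding="utf-8") as fh:
        return [ShotRun(**item) for item in json.load(fh)]


def shots_to_rerender(runs, rejected_ids):
    """Shots that never completed, plus shots rejected in review."""
    return [r.shot_id for r in runs
            if r.status != "completed" or r.shot_id in rejected_ids]
```

With state stored like this, rerunning one shot means updating one record; the other shots and their local output paths stay untouched.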
Why the Suno MCP mattered
I wanted the sound layer to follow the same rule as the visuals: no ad-hoc manual steps. So I built a dedicated Suno MCP. It covers the practical steps I actually need:
- checking credits
- starting a generation
- polling task state
- downloading the output
- optionally starting WAV conversion
But the important part was not just getting a track back from Suno. The important part was how that track entered the edit. The first music bed felt like a replacement for the native clip audio. That was the wrong direction. The street ambience belongs to the world of the film. So the current version keeps the original audio and layers the Suno track above it as a separate music bed.
That is the real improvement. The soundtrack now starts at frame one and shapes the pacing immediately, but the city is still audible under it. That is why the final piece feels like a world and not just a silent video with a stock track on top.
The final mix
The current finishing logic is technically simple and therefore reliable:
- keep the original clip audio
- start the Suno track at frame one
- lift the music enough to carry the edit without erasing the ambience
- give the end card its own hold time
- handle the ending through audio and image timing together
That is a small set of decisions, but it is exactly the kind of thing that separates a “model demo” from a usable brand film. If picture, ambience, and score do not sit together properly, the whole piece feels accidental.
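The mix described above maps onto a short ffmpeg filter graph. The following is a sketch under assumptions, not the exact command used for the film: the function name, file names, and gain values are illustrative. It keeps the clip's own audio, starts the music at time zero by applying no delay, and ends the mix when the clip's audio ends.

```python
def build_mix_command(video, music, output, music_gain=0.8, ambience_gain=1.0):
    """Build an ffmpeg argument list that layers a music bed over clip audio.

    Hypothetical helper: gain defaults are placeholders, not the film's values.
    """
    filter_graph = (
        f"[0:a]volume={ambience_gain}[amb];"   # original street ambience
        f"[1:a]volume={music_gain}[mus];"      # music bed, no delay applied
        "[amb][mus]amix=inputs=2:duration=first[mix]"  # stop with the clip audio
    )
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", music,
        "-filter_complex", filter_graph,
        "-map", "0:v",        # picture from the clip
        "-map", "[mix]",      # combined audio
        "-c:v", "copy",       # leave the video stream untouched
        "-c:a", "aac",
        output,
    ]


if __name__ == "__main__":
    print(build_mix_command("film.mp4", "score.mp3", "film_mixed.mp4"))
```

Because `amix` rescales its inputs by default, the gain values are starting points to tune by ear rather than absolute levels. Copying the video stream keeps this pass fast and avoids a second generation of picture compression.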
Why I moved the brand into a true end card
I tried several versions where the logo simply overlaid the moving last shot. That was not enough. The brand was present, but it did not stand. The correct solution was to separate the end state and give it its own card.
That solved three problems at once:
- "AutomatedWeb" stays visible long enough to register.
- The video can keep full motion before the end without competing with the brand.
- The audio ending becomes easier to tune independently.
It also made the export path more robust because I no longer had to force a single large finishing pass to do everything at once.
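A separate end card can be rendered as its own short clip and then joined to the film. This is a hedged sketch rather than the actual export step: the function name, the three-second hold, the frame size, and the frame rate are assumptions, and the silent audio track is added only so that the card has an audio stream like the clips it would be joined to.

```python
def build_end_card_command(logo_png, output, hold_s=3.0, fps=30,
                           width=1920, height=1080):
    """Build an ffmpeg argument list that turns a still logo into a held end card.

    Hypothetical helper: hold time, size, and frame rate are placeholders.
    """
    return [
        "ffmpeg", "-y",
        # Loop the still image for the hold duration.
        "-loop", "1", "-framerate", str(fps), "-t", str(hold_s), "-i", logo_png,
        # Generate silence of the same length as a placeholder audio stream.
        "-f", "lavfi", "-t", str(hold_s),
        "-i", "anullsrc=channel_layout=stereo:sample_rate=48000",
        # Match the film's frame size and use a widely supported pixel format.
        "-vf", f"scale={width}:{height},format=yuv420p",
        "-c:v", "libx264",
        "-c:a", "aac",
        output,
    ]


if __name__ == "__main__":
    print(build_end_card_command("automatedweb_logo.png", "end_card.mp4"))
```

Rendering the card in its own small pass means the hold time can be changed without touching the main film export.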
Why this matters for AutomatedWeb
For me the interesting part is not just the finished film. The interesting part is that AutomatedWeb now has a reusable media pipeline:
- image input through Nano Banana
- shot production through Seedance
- post-production through the video MCP and ffmpeg
- soundtrack generation through Suno
- publishing through EmDash and later YouTube
That fits the brand much better than a one-off clip ever could. AutomatedWeb should not just talk about automation. It should demonstrate how real work gets turned into repeatable systems. This project is a useful example because that is exactly what happened here.
What I would change next time
No pipeline is truly finished. The clearest improvements for the next round are:
- lock the shot list earlier instead of replacing weak second-half shots later
- bring the Suno layer into the process earlier rather than only near the final export
- decide much earlier whether the brand lands as an in-scene element or a dedicated end card
Even so, this is the first version where picture, city identity, rhythm, sound, and brand actually work together. That is why the article now opens with a real frame from the final film instead of an abstract concept image. This is no longer just a “look what I generated” post. It is a documented production path built from Nano Banana, Seedance, Suno, ffmpeg, and several MCPs that I can reuse the next time I need to make something real.