Writing prompts the model reads as cinematic.
Schema, five-part anatomy, time-coded shot blocks, the ten patterns I kept reaching for, why prompts get flagged, and what changes when audio is treated as a control surface.
Every field of the API, mapped to what it actually does.
Eleven inputs. At most three are required for a great clip. The rest are levers.
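The combination rules in this section (first/last-frame mode versus reference mode, audio references needing a visual reference, and the tag ranges up to Image9, Video3, Audio3) can be checked locally before a request goes out. A minimal sketch: `check_inputs` is a hypothetical helper, and `reference_videos` is an assumed field name by analogy with `reference_images` and `reference_audios`; check it against your host's schema.

```python
from typing import Any

# Upper bounds taken from the tag ranges quoted in this guide
# ([Image1]..[Image9], [Video1]..[Video3], [Audio1]..[Audio3]).
# "reference_videos" is an assumed field name, not confirmed by the guide.
MAX_REFS = {"reference_images": 9, "reference_videos": 3, "reference_audios": 3}


def check_inputs(inputs: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means no rule was violated."""
    problems: list[str] = []
    ref_images = inputs.get("reference_images") or []
    ref_videos = inputs.get("reference_videos") or []
    ref_audios = inputs.get("reference_audios") or []

    # Rule 1: first/last-frame mode and reference mode are mutually exclusive.
    frame_mode = bool(inputs.get("image") or inputs.get("last_frame_image"))
    if frame_mode and ref_images:
        problems.append("image / last_frame_image cannot be combined with reference_images")

    # Rule 2: audio references need at least one image or video reference.
    if ref_audios and not (ref_images or ref_videos):
        problems.append("reference_audios requires at least one reference image or video")

    # Rule 3: stay inside the tag ranges.
    for field, limit in MAX_REFS.items():
        count = len(inputs.get(field) or [])
        if count > limit:
            problems.append(f"{field} has {count} items; the limit is {limit}")
    return problems
```

An empty list means the request passes these three checks only; it says nothing about whether the host will accept the prompt itself.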
- image / last_frame_image cannot be combined with reference_images. Pick one mode.
- reference_audios requires at least one reference image or video. Audio alone is invalid.
- {4, 5, 6, 8, 10, 12, 15} or -1.
- "double quotes" with generate_audio=true. The model voices it natively; no Veo-style says, needed.
- [Image1]…[Image9], [Video1]…[Video3], [Audio1]…[Audio3]. Brackets, not @.
Five parts, in order. The model treats earlier tokens as higher priority.
Every Seedance guide we surveyed (Atlabs, Imagine.art, fal.ai, Lovart, Higgsfield, Replicate's house blog) converged independently on the same five-slot structure. It is not a style; it is the grammar the model rewards.
The hero clip above ("Rain Phone Booth") is the same five slots expanded with time-coded shot blocks. Read it back from the top: every line maps to one of the five.
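One way to keep the five slots in order (Subject, Action, Camera, Style, Constraints) is to fill them as named fields and let a small helper do the assembly. A sketch only: `FivePartPrompt` is a hypothetical name, and joining the slots as short sentences is our assumption, not something the model requires.

```python
from dataclasses import dataclass


@dataclass
class FivePartPrompt:
    """The five slots, stored in the order the guide recommends."""
    subject: str
    action: str
    camera: str
    style: str
    constraints: str

    def render(self) -> str:
        """Join non-empty slots in fixed order, one short sentence per slot."""
        cleaned = []
        for part in (self.subject, self.action, self.camera, self.style, self.constraints):
            text = part.strip().rstrip(".")
            if text:
                # Capitalize the first character so each slot reads as a sentence.
                cleaned.append(text[0].upper() + text[1:])
        return ". ".join(cleaned) + "." if cleaned else ""


prompt = FivePartPrompt(
    subject="a veteran race driver in a cockpit at night",
    action="both gloved hands tighten on the wheel at ten and two",
    camera="interior medium close-up",
    style="rain lashing the windshield, dashboard LEDs in the visor",
    constraints="no limb distortion, no text",
)
print(prompt.render())
```

Because the order lives in the helper and not in the caller, an empty slot drops out without shuffling the rest.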
The single highest-leverage instruction in Seedance prompting.
The community converged on this within twelve weeks of release: [00:00-00:05] behaves like a hard editorial cut, not a hint. It re-anchors the model against drift across the 15-second window. A 15-second clip should usually be three shots, not seven, unless it is a beat-sync montage.
Compare a monolithic prose prompt to a three-block edit. Same scene, completely different output discipline.
Monolithic (the model invents coverage)
A race car driver at night on a wet track. He grips the wheel,
breathes steady, then accelerates through a green light into
heavy spray as engines roar. Cinematic, dramatic, fast.
Time-coded (the model edits to your cuts)
[00:00-00:05] Interior cockpit medium close-up of a veteran race
driver at night, rain lashing the windshield, dashboard LEDs in
the visor. Both gloved hands tighten on the wheel at ten and two.
[00:05-00:10] Cut to rival cockpit, younger driver, jaw tense,
engine vibration shaking the frame. He whispers, "Hold the line."
[00:10-00:15] Low exterior tracking shot as the green light hits
and both cars accelerate in real time. Massive water spray hits
the lens, stadium lights stretch into motion blur, no limb
distortion, no text.
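The time-code prefixes are easy to get wrong by hand once shot lengths change. A small sketch that derives them from per-shot durations: `render_shots` is a hypothetical helper, and the default 15-second ceiling reflects the window described in this guide, not a verified API limit.

```python
def timecode(seconds: int) -> str:
    """Format whole seconds as MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def render_shots(shots: list[tuple[int, str]], max_total: int = 15) -> str:
    """Turn (duration_seconds, description) pairs into time-coded shot blocks."""
    lines = []
    start = 0
    for duration, description in shots:
        if duration <= 0:
            raise ValueError("each shot needs a positive duration")
        end = start + duration
        if end > max_total:
            raise ValueError(f"shots run to {end}s, past the {max_total}s window")
        lines.append(f"[{timecode(start)}-{timecode(end)}] {description.strip()}")
        start = end  # the next shot begins where this one ends
    return "\n".join(lines)


print(render_shots([
    (5, "Interior cockpit medium close-up of a veteran race driver at night."),
    (5, 'Cut to rival cockpit, younger driver, jaw tense. He whispers, "Hold the line."'),
    (5, "Low exterior tracking shot as the green light hits."),
]))
```

Each block starts exactly where the previous one ends, so the output has no gaps or overlaps between cuts.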
Cross-referenced from 14 guides, ~70 verified prompts, and the failure-mode literature.
Five-part order
Subject → Action → Camera → Style → Constraints. Earlier tokens carry more weight.
Time-coded blocks
[00:00-00:05] = hard cut. The single highest-leverage instruction.
Wide → MS → CU escalation
Three-shot grammar fits 15s cleanly: geography, then pressure, then emotion.
Camera vocabulary is load-bearing
Name lens (35mm, anamorphic, macro) and movement (dolly, push-in, orbit, handheld). Generic "cinematic" produces static motion.
Audio direction is mandatory
With generate_audio=true, if you omit audio direction the model invents a generic score that flattens the realism. Layer ambient + foreground + dialogue.
Emotion before line
"She whispers, almost smiling," before "I should have called sooner." Naked dialogue produces flat reads.
Negative as positive
"Hands resting in lap, fingers relaxed" beats "no bad hands." But end-of-prompt direct negatives (no subtitles, no watermark) still help.
Cite refs with purpose
Not bare [Image1]. Say "[Image1] defines face and hair only; [Image5] defines wardrobe; [Image8] defines lighting."
Word budget: 120–280
Below 120, generic. Above 280, internal contradictions. Single-shot 50–90; multi-shot 200–280.
Aspect ratio is composition
9:16 favors close-ups, 16:9 favors cross-cuts and geography, 21:9 favors scale. State the framing rule inside the prompt body too.
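The word-budget pattern above is the easiest of the ten to check mechanically. A rough sketch: `word_budget` is a hypothetical helper, and whitespace splitting is only an approximation of however the model actually tokenizes a prompt.

```python
# Ranges quoted in the word-budget pattern: (low, high) inclusive, in words.
BUDGETS = {
    "general": (120, 280),
    "single_shot": (50, 90),
    "multi_shot": (200, 280),
}


def word_budget(prompt: str, kind: str = "general") -> tuple[int, str]:
    """Return (word_count, verdict) where verdict is 'under', 'within' or 'over'."""
    low, high = BUDGETS[kind]
    count = len(prompt.split())  # whitespace split; time-code tags count as words
    if count < low:
        return count, "under"
    if count > high:
        return count, "over"
    return count, "within"
```

Treat the verdict as a prompt to reread, not a gate: a tight 100-word prompt can still outperform a padded 150-word one.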
The filter is a language model reading your whole prompt as a scene — not a keyword scanner.
Most flagged prompts aren't flagged because of a specific word. They're flagged because the prompt didn't give the filter enough context to read the scene as cinematic. Once you understand how it reads, the fixes become obvious.
Two layers run independently. The image-evaluation layer runs first and rejects on the upload itself: real, identifiable faces of public figures and named copyrighted characters are image-stage hard blocks, and no prompt rewrite gets past them. The text filter runs second and reads the entire prompt as one scene, judging intent and context. Words that look sensitive in isolation pass when wrapped in clear cinematic framing; harmless words get flagged when the prompt is too sparse to interpret.
Frame the whole scene, not just the action
The most common shape of a flagged prompt is one action with no surrounding context. Build outward from the action and answer four questions in the prompt:
- Where is this happening?
- What does it look like visually?
- What is the camera doing?
- What is the overall atmosphere?
Flagged · sparse, contextless
a soldier shoots someone in the street
Passes · same action, full scene context
Wide shot, war-torn Eastern European street in the 1940s. A
soldier in a grey uniform fires toward an off-screen position
during an active firefight. Smoke rises from collapsed buildings
in the background, overcast flat light, 35mm grain, documentary-
style handheld framing, debris scattered across the foreground.
Visual facts, not narrative
Strip emotional motivation, backstory, and relationship context from the prompt. The filter cares only about what the camera would see if this scene existed. A screenplay has scene description and subtext; Seedance only needs the description. Before keeping any sentence, ask: "if this were a real shoot, would this line appear on the shot list?" If not, cut it.
Production language is a context signal
Two or three production-language terms in a prompt establish the register and meaningfully shift how the filter weighs the rest. Pull from each category:
Framing
wide · medium · close-up · ECU · OTS · POV · two-shot · low / high / Dutch angle · bird's-eye
Camera
dolly in / out · tracking · pan · tilt · crane · locked off · push · circling · handheld
Format
35mm grain · anamorphic · 2.39:1 · 1.85:1 · vintage glass · soft halation · shallow DOF · rack focus
Setup
overcast diffused · volumetric rays · practical · side backlight · motivated shadow · golden hour · rim
Tone
muted desaturated · high contrast · bleach bypass · cold blue · warm amber · crushed blacks · flat grade
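To see whether a draft already carries two or three production-language terms, a quick scan against the categories above is enough. A heuristic sketch only: `find_production_terms` is a hypothetical helper, the term lists are a hand-picked subset of the categories above, and whole-phrase matching says nothing about how the filter itself reads a prompt.

```python
import re

# A subset of the terms listed above, grouped by the same categories.
PRODUCTION_TERMS = {
    "framing": ["wide", "medium", "close-up", "ECU", "OTS", "POV", "two-shot"],
    "camera": ["tracking", "pan", "tilt", "crane", "locked off", "handheld"],
    "format": ["35mm grain", "anamorphic", "shallow DOF", "rack focus"],
    "setup": ["volumetric rays", "golden hour", "motivated shadow"],
    "tone": ["bleach bypass", "crushed blacks", "flat grade"],
}


def find_production_terms(prompt: str) -> dict[str, list[str]]:
    """Return the terms found in the prompt, keyed by category."""
    found: dict[str, list[str]] = {}
    for category, terms in PRODUCTION_TERMS.items():
        hits = [
            term for term in terms
            # Lookarounds keep "pan" from matching inside "panel" or "Japan".
            if re.search(rf"(?<!\w){re.escape(term)}(?!\w)", prompt, re.IGNORECASE)
        ]
        if hits:
            found[category] = hits
    return found


draft = "Wide shot, war-torn street, 35mm grain, documentary-style handheld framing."
print(find_production_terms(draft))
```

A sparse prompt such as the flagged example earlier returns an empty dict, which is the signal to add framing, camera, or format language before resubmitting.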
Multi-shot: native JSON
For multi-shot sequences, Seedance accepts a structured JSON prompt. A visual_world block sets the cinematic register once; each shot then only describes what the camera sees at that moment. This forces visual-fact discipline automatically.
{
  "visual_world": {
    "light": "soft overcast, diffused shadows, no hard edges",
    "color": "muted naturals, cold whites, desaturated tones",
    "film": "35mm grain, anamorphic lens, soft halation on highlights",
    "atmosphere": "quiet, isolated, expansive"
  },
  "sequence": {
    "duration": "10 seconds",
    "pacing": "slow build to rapid cuts, ends in stillness",
    "shots": {
      "shot_1": {
        "duration": "3 seconds",
        "camera": "locked off wide shot, low angle",
        "action": "Lone rider on horseback crests a snowfield ridge",
        "transition": "SMASH CUT"
      },
      "shot_2": {
        "duration": "4 seconds",
        "camera": "tracking shot from behind, handheld feel",
        "action": "Horse and rider gallop through deep snow, cloak whipping in wind",
        "transition": "SMASH CUT"
      },
      "shot_3": {
        "duration": "3 seconds",
        "camera": "static wide, fully locked off",
        "action": "Empty snowfield, a wolf stands motionless on a distant ridge"
      }
    }
  }
}
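When the JSON is assembled in code, it is cheap to confirm that the per-shot durations add up to the sequence duration before serializing (3 + 4 + 3 = 10 in the example above). A minimal sketch: the key names come from that example, `check_sequence` is a hypothetical helper, and whether the model itself enforces the sum is not something we have verified.

```python
import json


def parse_seconds(text: str) -> int:
    """Read the leading integer from strings like '3 seconds'."""
    return int(text.split()[0])


def check_sequence(prompt: dict) -> int:
    """Return the total duration, or raise if the shots do not add up to it."""
    sequence = prompt["sequence"]
    total = parse_seconds(sequence["duration"])
    shot_total = sum(parse_seconds(shot["duration"]) for shot in sequence["shots"].values())
    if shot_total != total:
        raise ValueError(f"shots sum to {shot_total}s but the sequence says {total}s")
    return total


structured = {
    "visual_world": {"light": "soft overcast, diffused shadows, no hard edges"},
    "sequence": {
        "duration": "10 seconds",
        "shots": {
            "shot_1": {"duration": "3 seconds", "camera": "locked off wide shot, low angle"},
            "shot_2": {"duration": "4 seconds", "camera": "tracking shot from behind"},
            "shot_3": {"duration": "3 seconds", "camera": "static wide, fully locked off"},
        },
    },
}

check_sequence(structured)
prompt_text = json.dumps(structured, indent=2)  # the string to paste or send as the prompt
```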
Reference uploads: cite with purpose, every time
An uploaded image is not a self-explanatory artifact. The model assumes nothing about its role unless the prompt assigns one. Two equivalent syntaxes work depending on host:
- Replicate / fal canonical: [Image1], [Video1], [Audio1]
- Morphic / Dreamina UI: @Image 1, @Video 1, @Audio 1 (the picker injects the tag for you)
Either way, list every role assignment at the top of the prompt, before any scene description:
[Image1] as the first frame.
Reference all camera movements from [Video1].
Character appearance based on [Image2].
Use [Audio1] as the background score.
[scene description follows…]
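The "roles first, scene second" layout can be enforced with a few lines that also catch a tag cited in the scene without a role. A sketch for the bracket syntax only: `build_prompt` is a hypothetical helper, and treating an unassigned tag as an error is our choice, not a documented model behavior.

```python
import re

# Matches bracket-style tags such as [Image1], [Video2], [Audio3].
TAG = re.compile(r"\[(?:Image|Video|Audio)\d+\]")


def build_prompt(roles: list[str], scene: str) -> str:
    """Put role-assignment sentences first, then the scene description."""
    assigned = set(TAG.findall(" ".join(roles)))
    cited = set(TAG.findall(scene))
    unassigned = cited - assigned
    if unassigned:
        raise ValueError(f"tags cited without a role: {sorted(unassigned)}")
    return "\n".join([*roles, scene.strip()])


print(build_prompt(
    roles=[
        "[Image1] as the first frame.",
        "Reference all camera movements from [Video1].",
        "Character appearance based on [Image2].",
    ],
    scene="The figure from [Image2] crosses a wet street at night.",
))
```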
Character-image best practice: let the image do the work
Don't re-describe a referenced character in text — that creates a second, competing identity layer the model has to reconcile. The image handles appearance; the prompt handles what happens and what the camera sees. Two important consequences:
- Refer to characters by role, not by age. Words like child, kid, young, boy, girl raise the sensitivity threshold across the entire prompt regardless of what the uploaded image actually shows. Use "a small figure in a dark coat" instead.
- If the image-stage block is firing, work the image: face away from camera; go wide enough that the subject reads as silhouette; swap photo for illustration (illustrated images pass more reliably); use the image for wardrobe/setting/palette rather than face/identity.
Negative prompts that actually work
Negative prompts don't unstick filter flags — but they reliably reduce visual artifacts. Keep them short and tied to actual failures you're seeing:
negative: no jitter, no warping, no flickering, no identity drift
negative: no text morphing, no garbled logos, no color shift
negative: no motion blur on face, no floating limbs, no background collapse
Long lists backfire. Two or three targeted terms outperform exhaustive enumerations.
The Chinese-language pass-rate experiment
Community pattern reported by some Morphic users: write the scene description in Chinese, keep dialogue and on-screen text in English. The reasoning is that Seedance was developed with strong Chinese-language training, so Chinese prompts may hit slightly different filter thresholds. Not guaranteed; low effort to try if a well-constructed prompt keeps getting flagged.
Most of this section synthesizes Morphic's "Why your Seedance 2.0 prompts keep getting flagged" guide, validated against our empirical findings on Replicate and fal. Their full piece is the cleanest single reference on the filter's behavior.
Treat sound as a control surface, not decoration.
Seedance generates audio jointly with picture. That means you can describe what to hear at every layer — ambient bed, foreground SFX, dialogue, score policy — and the model treats each layer as a directive. The strongest audio prompts name the sound source, place every instrument in space, and explicitly reject music when you don't want it.
Dialogue
Wrap lines in "double quotes". Best results are 4–10 words for a 5–6 second clip; up to two short sentences in 8–10s. Always frame the line with delivery, for example: she whispers with restrained embarrassment, "I waited after class again."
SFX & ambient
Visible sources should produce the foreground sounds. Cast iron skillet ⇒ "fat renders, bubbles aggressively, butter foams, occasional pan clink." Storm chaser dashcam ⇒ "loud windshield rain, rhythmic wipers, low wind buffeting, distant radio chatter saying "rotation is tightening"." Place every instrument in jazz scenes: piano forward, brushed snare behind, walking bass off-camera.
Mix policy
End every audio prompt with a hierarchy line. "Dialogue forward and clear; room tone soft underneath; no music" beats visual detail alone.
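The layers described in this section (ambient bed, foreground SFX, delivery plus quoted line, closing mix policy) can be kept in a fixed order with a small container. A sketch under assumptions: `AudioDirection` and `dialogue_fits` are hypothetical names, the "Ambient:" and "Foreground:" labels are our formatting choice, and the 4–10 word check applies the 5–6 second guideline above.

```python
from dataclasses import dataclass


@dataclass
class AudioDirection:
    ambient: str
    foreground: str
    delivery: str    # how the line is spoken, e.g. "She whispers with restrained embarrassment"
    line: str        # the spoken words, without surrounding quotes
    mix_policy: str  # the hierarchy line that closes the audio prompt

    def render(self) -> str:
        """Emit the layers in order, ending on the mix policy."""
        spoken = self.line.strip().strip('"')
        return (
            f"Ambient: {self.ambient.strip()}. "
            f"Foreground: {self.foreground.strip()}. "
            f'{self.delivery.strip()}, "{spoken}" '
            f"{self.mix_policy.strip()}"
        )


def dialogue_fits(line: str, low: int = 4, high: int = 10) -> bool:
    """True when the line stays within the word range suggested for a 5-6 second clip."""
    return low <= len(line.split()) <= high


audio = AudioDirection(
    ambient="empty classroom room tone, distant hallway footsteps",
    foreground="a chair scrapes, a bag zipper closes",
    delivery="She whispers with restrained embarrassment",
    line="I waited after class again.",
    mix_policy="Dialogue forward and clear; room tone soft underneath; no music.",
)
print(audio.render())
```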