Prompting · AI Video Field Guide

02 The schema, played

Every field of the API, mapped to what it actually does.

Eleven inputs. A great clip needs three at most. The rest are levers.

prompt · resolution · aspect_ratio · duration · generate_audio · seed · image (URI) · last_frame_image · reference_images · reference_videos · reference_audios

Mutually exclusive

image / last_frame_image cannot be combined with reference_images. Pick one mode.

Audio gate

reference_audios requires at least one reference image or video. Audio alone is invalid.

Duration safe-set

Replicate accepts 4–15 in the schema, but the underlying API silently rejects 7, 9, 11, 13, 14 on some endpoints. Stick to {4, 5, 6, 8, 10, 12, 15} or -1.

Dialogue trigger

Wrap dialogue in "double quotes" with generate_audio=true. The model voices it natively; no Veo-style "says," cue is needed.

Citation syntax

Cite uploaded refs in the prompt body as [Image1]…[Image9], [Video1]…[Video3], [Audio1]…[Audio3]. On Replicate and fal, use brackets, not @.

Reference budgets

Up to 9 images, 3 videos (≤15s total), 3 audios (≤15s total). Reference jobs should be assigned in the prompt — face, wardrobe, lighting, motion, voice — not left ambiguous.
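The rules above can be checked before a job is submitted. The following is a minimal sketch, not part of any official client: the function name `check_inputs` and the error strings are hypothetical, the field names are the ones listed in the schema above, and the audio gate is assumed to count only reference_images and reference_videos. The 15-second totals for video and audio references are not checked, because that would require probing the media files.

```python
from typing import Any

# Durations the guide lists as safe, including the -1 value.
SAFE_DURATIONS = {4, 5, 6, 8, 10, 12, 15, -1}
# Count limits per reference field.
MAX_REFS = {"reference_images": 9, "reference_videos": 3, "reference_audios": 3}


def check_inputs(inputs: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the inputs pass."""
    problems = []
    images = inputs.get("reference_images") or []
    videos = inputs.get("reference_videos") or []
    audios = inputs.get("reference_audios") or []

    # Mutually exclusive: first/last-frame mode vs. reference-image mode.
    if (inputs.get("image") or inputs.get("last_frame_image")) and images:
        problems.append("image/last_frame_image cannot be combined with reference_images")

    # Audio gate: audio references need at least one image or video reference.
    if audios and not (images or videos):
        problems.append("reference_audios requires a reference image or video")

    # Duration safe-set.
    duration = inputs.get("duration")
    if duration is not None and duration not in SAFE_DURATIONS:
        problems.append(f"duration {duration} is outside the safe set")

    # Reference budgets (counts only).
    for field, limit in MAX_REFS.items():
        count = len(inputs.get(field) or [])
        if count > limit:
            problems.append(f"{field}: {count} exceeds the limit of {limit}")

    return problems
```

Returning a list rather than raising on the first violation lets a caller show every problem at once, which is useful when a prompt template fills several fields together.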

03 Anatomy of a prompt

Five parts, in order. The model treats earlier tokens as higher priority.

Every Seedance guide we surveyed (Atlabs, Imagine.art, fal.ai, Lovart, Higgsfield, Replicate's house blog) converged independently on the same five-slot structure. It is not a style; it is the grammar the model rewards. The example below labels each slot.

Subject · A tired man in a khaki trench coat inside a red phone booth
Action · holds the receiver without speaking, then whispers, "I kept the old number"
Camera · medium shot through rain-streaked glass, Christopher Doyle handheld feel, slow push-in
Style · Cinestill 800T halation, neon smears red and green, step-printed motion blur
Constraints · no subtitles, no on-screen text, dialogue forward, rain ticking soft underneath, no music

01 · Subject

A noun phrase with two or three concrete traits. "a tired man in a khaki trench coat inside a red phone booth", not "a man in a city".

02 · Action

One visible verb. "holds the receiver without speaking". Plot belongs in time-coded blocks, not in this slot.

03 · Camera

Shot size, angle, lens, movement. "medium shot through rain-streaked glass, Christopher Doyle handheld feel, slow push-in".

04 · Style

Lighting + palette + film stock or director. One dominant anchor beats three competing ones. "Cinestill 800T halation, neon smears red and green".

05 · Constraints

Negatives as positive instructions. "single continuous take · no subtitles · no on-screen text · dialogue forward, rain soft underneath".

The hero clip above ("Rain Phone Booth") is the same five slots expanded with time-coded shot blocks. Read it back from the top: every line maps to one of the five.
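If prompts are assembled programmatically, the five-slot order can be enforced by construction. This is a small sketch under stated assumptions: the `ShotPrompt` class and its `render` method are hypothetical names, and the choice to join Subject and Action into one sentence is an illustration, not something the model requires.

```python
from dataclasses import dataclass


def _sentence(text: str) -> str:
    """Trim, capitalize the first letter, and end with exactly one period."""
    text = text.strip().rstrip(".")
    return text[:1].upper() + text[1:] + "."


@dataclass
class ShotPrompt:
    subject: str
    action: str
    camera: str
    style: str
    constraints: str

    def render(self) -> str:
        # Refuse to render if any slot is empty.
        missing = [name for name, text in vars(self).items() if not text.strip()]
        if missing:
            raise ValueError(f"empty slots: {', '.join(missing)}")
        # Fixed order: Subject + Action, then Camera, Style, Constraints.
        head = f"{self.subject.strip()} {self.action.strip()}"
        parts = [head, self.camera, self.style, self.constraints]
        return " ".join(_sentence(p) for p in parts)
```

Because the order lives in `render`, callers cannot accidentally lead with style or constraints, which is the failure the five-slot rule is meant to prevent.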

04 Time-coded shot blocks

The single highest-leverage instruction in Seedance prompting.

The community converged on this within twelve weeks of release: [00:00-00:05] behaves like a hard editorial cut, not a hint. It re-anchors the model against drift across the 15-second window. A 15-second clip should usually be three shots, not seven, unless it is a beat-sync montage.

Compare a monolithic prose prompt to a three-block edit. Same scene, completely different output discipline.

Monolithic (the model invents coverage)

A race car driver at night on a wet track. He grips the wheel,
breathes steady, then accelerates through a green light into
heavy spray as engines roar. Cinematic, dramatic, fast.

Time-coded (the model edits to your cuts)

[00:00-00:05] Interior cockpit medium close-up of a veteran race
driver at night, rain lashing the windshield, dashboard LEDs in
the visor. Both gloved hands tighten on the wheel at ten and two.

[00:05-00:10] Cut to rival cockpit, younger driver, jaw tense,
engine vibration shaking the frame. He whispers, "Hold the line."

[00:10-00:15] Low exterior tracking shot as the green light hits
and both cars accelerate in real time. Massive water spray hits
the lens, stadium lights stretch into motion blur, no limb
distortion, no text.
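The time-code prefixes in the example above are easy to get wrong by hand, especially when shot lengths change. A minimal sketch of generating them from a shot list follows; `time_coded_prompt` and `timecode` are hypothetical helper names, and the check that the shots fill the clip exactly is an assumption about good practice rather than a documented requirement.

```python
def timecode(seconds: int) -> str:
    """Format a second count as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def time_coded_prompt(shots: list[tuple[int, str]], total: int = 15) -> str:
    """Build [MM:SS-MM:SS] blocks from (duration_seconds, description) pairs."""
    used = sum(duration for duration, _ in shots)
    if used != total:
        raise ValueError(f"shots cover {used}s but the clip is {total}s")
    blocks, start = [], 0
    for duration, text in shots:
        end = start + duration
        blocks.append(f"[{timecode(start)}-{timecode(end)}] {text.strip()}")
        start = end  # the next shot begins where this one ends
    # A blank line between blocks mirrors the layout of the example above.
    return "\n\n".join(blocks)
```

For example, `time_coded_prompt([(5, "Interior cockpit..."), (5, "Cut to rival cockpit..."), (5, "Low exterior tracking shot...")])` yields the three blocks with boundaries at 00:05 and 00:10.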

05 Ten patterns the community converged on

Cross-referenced from 14 guides, ~70 verified prompts, and the failure-mode literature.

01 · Five-part order

Subject → Action → Camera → Style → Constraints. Earlier tokens carry more weight.

02 · Time-coded blocks

[00:00-00:05] = hard cut. The single highest-leverage instruction.

03 · Wide → MS → CU escalation

Three-shot grammar fits 15s cleanly: geography, then pressure, then emotion.

04 · Camera vocabulary is load-bearing

Name lens (35mm, anamorphic, macro) and movement (dolly, push-in, orbit, handheld). Generic "cinematic" produces static motion.

05 · Audio direction is mandatory

With generate_audio=true, omitting audio direction lets the model invent a generic score that flattens the realism. Layer ambient + foreground + dialogue.

06 · Emotion before line

"She whispers, almost smiling," before "I should have called sooner." Naked dialogue produces flat reads.

07 · Negative as positive

"Hands resting in lap, fingers relaxed" beats "no bad hands." But end-of-prompt direct negatives (no subtitles, no watermark) still help.

08 · Cite refs with purpose

Not bare [Image1]. Say "[Image1] defines face and hair only; [Image5] defines wardrobe; [Image8] defines lighting."

09 · Word budget: 120–280

Below 120, multi-shot prompts go generic. Above 280, internal contradictions creep in. Single-shot prompts run shorter, 50–90; multi-shot 200–280.

10 · Aspect ratio is composition

9:16 favors close-ups, 16:9 favors cross-cuts and geography, 21:9 favors scale. State the framing rule inside the prompt body too.
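The word budget in pattern 09 is the easiest of the ten to check mechanically. The sketch below is illustrative only: `word_budget` is a hypothetical helper, it uses the 50–90 and 200–280 ranges quoted above, and it counts whitespace-separated tokens, so a time code such as [00:00-00:05] counts as one word.

```python
def word_budget(prompt: str, multi_shot: bool) -> tuple[int, str]:
    """Return (word_count, verdict) against the single- or multi-shot range."""
    low, high = (200, 280) if multi_shot else (50, 90)
    count = len(prompt.split())  # whitespace tokens, not linguistic words
    if count < low:
        return count, "short"
    if count > high:
        return count, "long"
    return count, "ok"
```

A "short" or "long" verdict is a prompt to reread the text, not a hard failure; the ranges are community guidance rather than API limits.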

05b Why prompts get flagged (and how to write ones that don't)

The filter is a language model reading your whole prompt as a scene — not a keyword scanner.

Most flagged prompts aren't flagged because of a specific word. They're flagged because the prompt didn't give the filter enough context to read the scene as cinematic. Once you understand how it reads, the fixes become obvious.

Two layers run independently. The image-evaluation layer runs first and rejects on the upload itself: real, identifiable faces of public figures and named copyrighted characters are image-stage hard blocks, and no prompt rewrite gets past them. The text filter runs second and reads the entire prompt as one scene, judging intent and context. Words that look sensitive in isolation pass when wrapped in clear cinematic framing; harmless words get flagged when the prompt is too sparse to interpret.

Frame the whole scene, not just the action

The most common shape of a flagged prompt is one action with no surrounding context. Build outward from the action and answer four questions in the prompt:

Where is this happening?
What does it look like visually?
What is the camera doing?
What is the overall atmosphere?

Flagged · sparse, contextless

a soldier shoots someone in the street

Passes · same action, full scene context

Wide shot, war-torn Eastern European street in the 1940s. A
soldier in a grey uniform fires toward an off-screen position
during an active firefight. Smoke rises from collapsed buildings
in the background, overcast flat light, 35mm grain, documentary-
style handheld framing, debris scattered across the foreground.

Visual facts, not narrative

Strip emotional motivation, backstory, and relationship context from the prompt. The filter cares only about what the camera would see if this scene existed. A screenplay has scene description and subtext; Seedance only needs the description. Before keeping any sentence, ask: "if this were a real shoot, would this line appear on the shot list?" If not, cut it.

Production language is a context signal

Two or three production-language terms in a prompt establish the register and meaningfully shift how the filter weighs the rest. Pull from each category:

Shot (framing): wide · medium · close-up · ECU · OTS · POV · two-shot · low / high / Dutch angle · bird's-eye

Move (camera): dolly in / out · tracking · pan · tilt · crane · locked off · push · circling · handheld

Lens (format): 35mm grain · anamorphic · 2.39:1 · 1.85:1 · vintage glass · soft halation · shallow DOF · rack focus

Light (setup): overcast diffused · volumetric rays · practical · side backlight · motivated shadow · golden hour · rim

Color (tone): muted desaturated · high contrast · bleach bypass · cold blue · warm amber · crushed blacks · flat grade

Multi-shot: native JSON

For multi-shot sequences, Seedance accepts a structured JSON prompt. A visual_world block sets the cinematic register once; each shot then only describes what the camera sees at that moment. This forces visual-fact discipline automatically.

{
  "visual_world": {
    "light": "soft overcast, diffused shadows, no hard edges",
    "color": "muted naturals, cold whites, desaturated tones",
    "film": "35mm grain, anamorphic lens, soft halation on highlights",
    "atmosphere": "quiet, isolated, expansive"
  },
  "sequence": {
    "duration": "10 seconds",
    "pacing": "slow build to rapid cuts, ends in stillness",
    "shots": {
      "shot_1": {
        "duration": "3 seconds",
        "camera": "locked off wide shot, low angle",
        "action": "Lone rider on horseback crests a snowfield ridge",
        "transition": "SMASH CUT"
      },
      "shot_2": {
        "duration": "4 seconds",
        "camera": "tracking shot from behind, handheld feel",
        "action": "Horse and rider gallop through deep snow, cloak whipping in wind",
        "transition": "SMASH CUT"
      },
      "shot_3": {
        "duration": "3 seconds",
        "camera": "static wide, fully locked off",
        "action": "Empty snowfield, a wolf stands motionless on a distant ridge"
      }
    }
  }
}
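When the JSON above is written by hand, the per-shot durations and the sequence duration can drift apart. A minimal sketch of generating the structure so the total is always derived from the shots is shown here. The function name `build_sequence` and the `seconds` input key are hypothetical; the output keys mirror the example above, and whether a given host forwards this JSON to the model unchanged is not something this sketch verifies.

```python
import json


def build_sequence(visual_world: dict, pacing: str, shots: list[dict]) -> str:
    """Serialize a multi-shot prompt; total duration is the sum of the shots."""
    total = sum(shot["seconds"] for shot in shots)
    shot_map = {}
    for i, shot in enumerate(shots, start=1):
        entry = {
            "duration": f"{shot['seconds']} seconds",
            "camera": shot["camera"],
            "action": shot["action"],
        }
        # Transitions are optional; the last shot usually has none.
        if "transition" in shot:
            entry["transition"] = shot["transition"]
        shot_map[f"shot_{i}"] = entry
    prompt = {
        "visual_world": visual_world,
        "sequence": {
            "duration": f"{total} seconds",
            "pacing": pacing,
            "shots": shot_map,
        },
    }
    return json.dumps(prompt, indent=2)
```

Passing the three shots from the example with `seconds` of 3, 4, and 3 reproduces the "10 seconds" sequence duration without anyone having to add it up.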

Reference uploads: cite with purpose, every time

An uploaded image is not a self-explanatory artifact. The model assumes nothing about its role unless the prompt assigns one. Two equivalent syntaxes work depending on host:

Replicate / fal canonical: [Image1], [Video1], [Audio1]
Morphic / Dreamina UI: @Image 1, @Video 1, @Audio 1 (the picker injects the tag for you)

Either way, list every role assignment at the top of the prompt, before any scene description:

[Image1] as the first frame.
Reference all camera movements from [Video1].
Character appearance based on [Image2].
Use [Audio1] as the background score.

[scene description follows…]
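A quick consistency check can catch a prompt that cites a reference nobody uploaded, or an upload that never receives a role. The sketch below handles only the bracket syntax ([Image1], [Video1], [Audio1]); `citation_gaps` is a hypothetical helper, and it assumes references are numbered from 1 in upload order.

```python
import re

# Matches bracket citations such as [Image1], [Video2], [Audio3].
CITATION = re.compile(r"\[(Image|Video|Audio)(\d+)\]")


def citation_gaps(prompt: str, n_images: int = 0, n_videos: int = 0,
                  n_audios: int = 0) -> tuple[list, list]:
    """Return (missing, unused) as sorted lists of (kind, index) pairs."""
    counts = {"Image": n_images, "Video": n_videos, "Audio": n_audios}
    cited = {(kind, int(num)) for kind, num in CITATION.findall(prompt)}
    uploaded = {(kind, i) for kind, n in counts.items() for i in range(1, n + 1)}
    missing = sorted(cited - uploaded)  # cited in the prompt, never uploaded
    unused = sorted(uploaded - cited)   # uploaded, never assigned a role
    return missing, unused
```

An entry in `unused` is the case the section warns about: an upload whose role the model is left to guess.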

Character-image best practice: let the image do the work

Don't re-describe a referenced character in text — that creates a second, competing identity layer the model has to reconcile. The image handles appearance; the prompt handles what happens and what the camera sees. Two important consequences:

Refer to characters by role, not by age. Words like child, kid, young, boy, girl raise the sensitivity threshold across the entire prompt regardless of what the uploaded image actually shows. Use "a small figure in a dark coat" instead.
If the image-stage block is firing, work the image: face away from camera; go wide enough that the subject reads as silhouette; swap photo for illustration (illustrated images pass more reliably); use the image for wardrobe/setting/palette rather than face/identity.

Negative prompts that actually work

Negative prompts don't unstick filter flags — but they reliably reduce visual artifacts. Keep them short and tied to actual failures you're seeing:

negative: no jitter, no warping, no flickering, no identity drift
negative: no text morphing, no garbled logos, no color shift
negative: no motion blur on face, no floating limbs, no background collapse

Long lists backfire. Two or three targeted terms outperform exhaustive enumerations.

The Chinese-language pass-rate experiment

Community pattern reported by some Morphic users: write the scene description in Chinese, keep dialogue and on-screen text in English. The reasoning is that Seedance was developed with strong Chinese-language training, so Chinese prompts may hit slightly different filter thresholds. Not guaranteed; low effort to try if a well-constructed prompt keeps getting flagged.

Most of this section synthesizes Morphic's "Why your Seedance 2.0 prompts keep getting flagged" guide, validated against our empirical findings on Replicate and fal. Their full piece is the cleanest single reference on the filter's behavior.

06 Native audio

Treat sound as a control surface, not decoration.

Seedance generates audio jointly with picture. That means you can describe what to hear at every layer — ambient bed, foreground SFX, dialogue, score policy — and the model treats each layer as a directive. The strongest audio prompts name the sound source, place every instrument in space, and explicitly reject music when you don't want it.

Dialogue

Wrap lines in "double quotes". Best results are 4–10 words for a 5–6 second clip; up to two short sentences in 8–10s. Always frame the line with delivery: "she whispers with restrained embarrassment, "I waited after class again.""

SFX & ambient

Visible sources should produce the foreground sounds. Cast iron skillet ⇒ "fat renders, bubbles aggressively, butter foams, occasional pan clink." Storm chaser dashcam ⇒ "loud windshield rain, rhythmic wipers, low wind buffeting, distant radio chatter saying "rotation is tightening"." Place every instrument in jazz scenes: piano forward, brushed snare behind, walking bass off-camera.

Mix policy

End every audio prompt with a hierarchy line. "Dialogue forward and clear; room tone soft underneath; no music" beats visual detail alone.
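The three pieces above (dialogue with delivery, sourced SFX and ambient, and a closing mix line) can be assembled in a fixed order. This is a sketch under assumptions: `audio_prompt` is a hypothetical helper, and the "Ambient:" and "Foreground:" labels are an illustrative convention, not keywords the model is known to parse.

```python
def audio_prompt(ambient: str, foreground: str, mix: str,
                 line: str = "", delivery: str = "") -> str:
    """Layer ambient, foreground, and dialogue, ending on the mix hierarchy line."""
    if not mix.strip():
        raise ValueError("end the audio prompt with a mix hierarchy line")
    parts = [f"Ambient: {ambient.strip()}.", f"Foreground: {foreground.strip()}."]
    if line:
        if not delivery.strip():
            raise ValueError("frame the line with a delivery before quoting it")
        quoted = line.strip()
        if quoted[-1] not in ".!?":
            quoted += "."
        # Dialogue goes in double quotes, preceded by the delivery.
        parts.append(f'{delivery.strip()}, "{quoted}"')
    # The hierarchy line always comes last.
    parts.append(mix.strip().rstrip(".") + ".")
    return " ".join(parts)
```

Requiring `mix` and `delivery` as inputs turns two of the section's recommendations into things the caller cannot forget.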
