Analysis and Corpus

OpenMontage builds and queries a local, CLIP-indexed corpus of video and image assets for reference-driven productions. The corpus draws from free and open archives such as Archive.org, NASA, and Wikimedia Commons, plus optional stock sources when API keys are present. This enables documentary-style workflows that rely on real footage instead of generated video.

Corpus Layout

Each production that uses analysis maintains its own corpus under the project workspace:

projects/<kebab-name>/corpus/
├── clips/                  # downloaded video and image files
├── thumbnails/
│   └── <clip_id>/
│       └── frame_00.jpg    # evenly spaced frames for preview
├── embeddings.npy          # (N, 512) L2-normalised visual vectors
├── tag_embeddings.npy      # (N, 512) L2-normalised text/tag vectors
└── index.jsonl             # one metadata record per clip

The index.jsonl file and the two NumPy arrays stay aligned row by row: record i in index.jsonl describes row i of embeddings.npy and of tag_embeddings.npy. The entire projects/ tree is gitignored.
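Because every consumer of the corpus depends on that row alignment, it can be worth checking at load time. The following is a minimal sketch, not the project's own loader: the function name load_corpus is hypothetical, and it assumes only the file names shown in the layout above.

```python
import json
from pathlib import Path

import numpy as np


def load_corpus(corpus_dir):
    """Load index records plus both embedding matrices and check row alignment.

    Hypothetical helper: only the file names come from the corpus layout.
    """
    corpus_dir = Path(corpus_dir)

    # One JSON object per non-empty line.
    with open(corpus_dir / "index.jsonl", encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]

    visual = np.load(corpus_dir / "embeddings.npy")
    tags = np.load(corpus_dir / "tag_embeddings.npy")

    # Row i of each matrix must describe record i of the index.
    if not (len(records) == visual.shape[0] == tags.shape[0]):
        raise ValueError(
            f"corpus misaligned: {len(records)} records, "
            f"{visual.shape[0]} visual rows, {tags.shape[0]} tag rows"
        )
    return records, visual, tags
```

Calling load_corpus("projects/<kebab-name>/corpus") would return the three aligned objects, or raise ValueError if a clip was appended to one file but not the others.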

Core Analysis Tools

Four tools perform the heavy lifting. All run locally and require no paid API keys for basic operation:

  • video_analyzer accepts a URL or local file and returns a video_analysis_brief. Depths are transcript_only, standard, and deep. It chains metadata fetch, transcript extraction, scene detection, keyframe sampling, motion classification, and style profiling.
  • scene_detect wraps PySceneDetect (content and threshold methods) plus FFmpeg to emit start/end timestamps for each scene.
  • frame_sampler extracts JPEG frames by count, explicit timestamps, or scene boundaries.
  • transcriber produces word-level segments using WhisperX. Speaker diarization is available when an HF_TOKEN is configured.

Supporting tools (video_downloader, transcript_fetcher, audio_probe) are invoked automatically by the analyzer.
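To illustrate two of the frame_sampler modes listed above, here is a small sketch of how sample timestamps could be derived by count or from scene boundaries. The function names and the choice of midpoints are assumptions made for illustration; the actual tool may pick frames differently.

```python
def timestamps_by_count(duration, count):
    """Return `count` evenly spaced timestamps in seconds.

    Assumption: each timestamp sits at the centre of one equal slice of the
    video, which avoids sampling the very first and very last frames.
    """
    if count <= 0 or duration <= 0:
        return []
    step = duration / count
    return [round(step * (i + 0.5), 3) for i in range(count)]


def timestamps_by_scenes(scenes):
    """Return the midpoint of each (start, end) scene, skipping empty scenes."""
    return [round((start + end) / 2, 3) for start, end in scenes if end > start]
```

For a 10-second clip, timestamps_by_count(10.0, 4) gives one timestamp per 2.5-second slice, and timestamps_by_scenes accepts the start/end pairs that a scene detector emits.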

Reference-Driven Workflows

Pipelines that declare reference_input.supported: true (such as cinematic) activate analysis when a reference video is supplied at the start of a session. The agent:

  1. Runs the analysis tools listed in the pipeline manifest.
  2. Produces a video_analysis_brief containing source metadata, content analysis, structure (scenes and pacing_profile), style profile, narration transcript, keyframes, and replication guidance.
  3. Embeds the reference and user query through CLIP.
  4. Retrieves matching clips from the corpus using fused visual + tag similarity.
  5. Diversifies the selected set and builds edit_decisions that respect the reference's motion and timing characteristics.
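Step 4 can be sketched with plain NumPy. This is one plausible way to fuse the two channels, a weighted sum of cosine similarities, and not necessarily the formula OpenMontage uses. The names fused_scores, top_k, and visual_weight are hypothetical, and query_vec is assumed to come from a CLIP encoder that is not shown here.

```python
import numpy as np


def fused_scores(query_vec, visual, tags, visual_weight=0.7):
    """Blend visual and tag similarity for one query vector.

    Assumptions: `visual` and `tags` are (N, D) matrices with L2-normalised
    rows, so a dot product equals cosine similarity; the 0.7 default weight
    is illustrative only.
    """
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)  # normalise the query as well
    visual_sim = np.asarray(visual, dtype=float) @ q
    tag_sim = np.asarray(tags, dtype=float) @ q
    return visual_weight * visual_sim + (1.0 - visual_weight) * tag_sim


def top_k(scores, k):
    """Return the row indices of the k highest scores, best first."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:k].tolist()
```

The returned indices are row numbers, so they can be used directly to look up the matching records in index.jsonl.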

Sub-stages (for example, a short sample preview) become active once the video_analysis_brief exists.

To get real-footage behavior, ask for it explicitly, for example: "use real footage only." The system then avoids paid video generation and falls back to stock or archive clips.

Retrieval and Diversity

The corpus supports:

  • Ranking by text query (visual and tag channels blended).
  • Nearest-neighbour search from a seed clip.
  • Maximal Marginal Relevance diversification to avoid repetitive cuts.
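The diversification step can be illustrated with the standard greedy form of Maximal Marginal Relevance. This is a minimal sketch under stated assumptions: mmr_select and lambda_ are hypothetical names, the embeddings are assumed to be L2-normalised rows, and the relevance scores are assumed to come from an earlier ranking pass.

```python
import numpy as np


def mmr_select(relevance, embeddings, k, lambda_=0.7):
    """Greedily pick up to k rows, trading relevance against redundancy.

    lambda_=1.0 reduces to plain relevance ranking; lower values penalise
    candidates that resemble clips already selected.
    """
    relevance = np.asarray(relevance, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    candidates = list(range(len(relevance)))
    selected = []

    while candidates and len(selected) < k:
        if selected:
            # Highest cosine similarity to any already-selected clip.
            sims = embeddings[candidates] @ embeddings[selected].T
            redundancy = sims.max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))

        mmr = lambda_ * relevance[candidates] - (1.0 - lambda_) * redundancy
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)

    return selected
```

With two near-identical clips at the top of the ranking, a lambda_ below 1.0 lets a less similar clip take the second slot, which is what prevents repetitive cuts.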

Motion score, duration, and shot-type filters can be applied during retrieval. Results feed directly into the edit stage.
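A metadata filter of this kind could be applied to the index records before similarity scoring. The sketch below is illustrative only: the field names motion_score, duration, and shot_type are assumptions about the index.jsonl schema, and filter_rows is a hypothetical helper.

```python
def filter_rows(records, min_motion=None, max_duration=None, shot_types=None):
    """Return the row indices of records that pass every active filter.

    Field names (motion_score, duration, shot_type) are assumed, not taken
    from the real schema. A filter left as None is skipped; a record that
    lacks a filtered field is excluded.
    """
    kept = []
    for row, rec in enumerate(records):
        if min_motion is not None:
            if rec.get("motion_score") is None or rec["motion_score"] < min_motion:
                continue
        if max_duration is not None:
            if rec.get("duration") is None or rec["duration"] > max_duration:
                continue
        if shot_types is not None and rec.get("shot_type") not in shot_types:
            continue
        kept.append(row)
    return kept
```

Returning row indices rather than records keeps the result usable as an index into both embedding matrices, so only the surviving rows need to be scored.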

Verification Before Use

Run preflight to confirm the local analysis stack and any configured stock providers are ready:

make preflight

Or inspect the live registry:

python -c "
from tools.tool_registry import registry
import json
registry.discover()
print(json.dumps(registry.provider_menu_summary(), indent=2))
"

See guides/project-workspace for the full directory contract and guides/running-pipelines for how to start a reference-aware session. For the list of pipelines that support reference input, see reference/available-pipelines.