Index JSON Sidecar

index-json is the machine-facing companion to the Markdown pack. The Markdown output stays optimized for human reading, patch/apply, and round-trip reconstruction; the sidecar exists so tools can answer common retrieval questions without scraping markdown.

Generation

Generate the sidecar explicitly:

codecrate pack . -o context.md --index-json

Or let the profile enable it:

codecrate pack . -o context.md --profile agent
codecrate pack . -o context.md --profile lean-agent
codecrate pack . -o context.md --profile hybrid
codecrate pack . -o context.md --profile portable-agent

--profile agent resolves to the normalized v3 sidecar by default, while --profile lean-agent keeps normalized v3 but trims analysis-heavy payloads and pretty-print whitespace by default, while --profile hybrid keeps the full v1-compatible sidecar. --profile portable-agent pairs a reconstructable full pack with a normalized sidecar, a generated standalone unpacker, and dual locator families.

Or choose a specific sidecar mode:

codecrate pack . -o context.md --index-json-mode compact
codecrate pack . -o context.md --index-json-mode minimal
codecrate pack . -o context.md --index-json-mode normalized
codecrate pack . -o context.md --index-json-mode minimal --locator-space dual

--index-json alone defaults to the full compatibility surface. Use --index-json-mode compact, --index-json-mode minimal, or --index-json-mode normalized when you want a machine-first sidecar explicitly.

Use normalized for the smallest recommended sidecar in agent workflows. Use minimal for the smallest v2-compatible sidecar.

By default, the sidecar is written next to the markdown output as <output>.index.json.

Contract and compatibility

The sidecar is versioned independently:

codecrate.index-json.v1: full sidecar surface
codecrate.index-json.v2: compact or minimal sidecar surface
codecrate.index-json.v3: normalized sidecar surface

The top-level mode field distinguishes full, compact, and minimal, and normalized output.

Compatibility rules:

v1 remains the full-fidelity compatibility surface
v2 is a machine-first retrieval surface that removes redundant display and reverse-index duplication by default
v3 is the most compact analysis-oriented surface and interns repeated paths, qualnames, and strings into shared tables
machine-facing lookups should prefer explicit IDs and lookup maps over markdown scraping

The pack and sidecar are generated from the same export model, so repository, file, symbol, and split-part metadata describe the markdown that was actually written.

Locator targets are configurable independently from sidecar mode:

--locator-space markdown keeps locators pointed at the markdown pack
--locator-space reconstructed points locators at the reconstructed file tree
--locator-space dual emits both families
--locator-space auto resolves to reconstructed when --emit-standalone-unpacker is enabled and otherwise to markdown

Top-level shape

The payload has this high-level structure:

{
  "format": "codecrate.index-json.v3",
  "mode": "normalized",
  "pack": { ... },
  "repositories": [ ... ]
}

pack: Global metadata about the emitted artifact set.
repositories: Per-repository entries for both single-repo and multi-repo output.

Pack metadata

Useful pack fields include:

index_json_mode: The resolved sidecar mode used for the emitted payload.
format: The markdown pack protocol version.
is_split: Whether markdown was emitted as a single pack or as .index.md plus .partN.md files.
output_files: Relative paths to all emitted markdown files.
display_id_format_version / canonical_id_format_version: Explicit ID schemes for display IDs and machine IDs.
capabilities: Boolean feature flags such as manifest availability and whether unsplit line ranges are available.
authority: Declares which artifact is authoritative for full layout, stub layout, and patch flows.

Repository metadata

Each entry in repositories[] describes one packed repository.

Useful fields include:

label / slug

Human-facing and path-safe repository identifiers.

layout / effective_layout

Requested and resolved layout behavior.

nav_mode

The actual navigation density reflected in the rendered markdown.

locator_mode

How direct locators should be interpreted:

anchors+line-ranges for unsplit markdown
anchors for split output

locator_space / secondary_locator_space

The primary machine-facing locator family, plus an optional secondary family when both markdown and reconstructed locators are emitted.

reconstructed_root

Present for combined multi-repo packs when reconstructed locators are enabled. Reconstructed paths are relative to the unpack output root, so combined packs use <slug>/... paths.

markdown_path

Present for unsplit output; null for split output.

has_manifest / has_machine_header

Trust and round-trip signals for machine consumers.

parts

Metadata for the emitted markdown files belonging to the repository.

index_json_features

Declares optional v2 retrieval families such as lookup-map emission and compact symbol index lines.

files

File-level retrieval, integrity, language, and location metadata.

symbols

Symbol-level occurrence and canonical-body metadata.

lookup

Reverse indexes for direct access by path or ID.

graph / test_links / guide

Optional analysis metadata: import edges, heuristic test coupling, and a repository guide.

package_summaries / entrypoint_paths / centrality_rank / likely_edit_targets

Optional package and hot-path summaries for quicker subsystem orientation.

reference_graph: Optional conservative symbol-call metadata for impact analysis and review.

Mode summary

full (v1)

Preserves the current compatibility surface, including display IDs, richer file and symbol metadata, and the larger reverse lookup set.

compact (v2)

Keeps machine-first retrieval with direct file and symbol navigation while dropping display-oriented duplication and heavyweight membership metadata. When index_json_features.lookup is true, the lookup maps are:

file_by_path
part_by_file
file_by_symbol
symbol_by_local_id

minimal (v2)

Starts from compact mode and trims additional convenience duplication. It is the smallest v2-compatible sidecar surface rather than the smallest overall sidecar. When index_json_features.lookup is true, the lookup maps are:

file_by_path
symbol_by_local_id

normalized (v3)

Interns repeated strings into repositories[].tables and replaces path/module/qualname references with integer indexes. It keeps the same machine-facing essentials as the richer modes while omitting markdown href duplication and v2 lookup maps.

Analysis metadata

When analysis metadata is enabled, the sidecar also exposes:

repositories[].classes[] with first-class class entries
repositories[].files[].imports
repositories[].files[].exports
repositories[].files[].module_docstring_lines
repositories[].files[].role_hint
repositories[].files[].inclusion_reason for focused packs
repositories[].files[].references_out / references_in
repositories[].symbols[].owner_class
repositories[].symbols[].decorators
repositories[].symbols[].references_out / references_in
repositories[].graph.import_edges
repositories[].test_links
repositories[].guide
repositories[].package_summaries / entrypoint_paths
repositories[].reference_graph.call_like_edges

Use --no-analysis-metadata when you want a smaller sidecar and do not need those architecture-oriented hints.

--profile lean-agent applies that smaller-sidecar posture by default and also minifies the JSON payload unless you opt back into --index-json-pretty.

Part metadata

repositories[].parts[] records the markdown files that contain repository content.

Useful fields include:

part_id: Stable repository-scoped identifier such as repo:pack or repo:part3.
path / kind: Relative output path and whether the part is the unsplit pack, split index, or a split content part.
char_count / line_count / token_estimate: Lightweight sizing information for retrieval and UI decisions.
sha256_content: Integrity hash of the emitted markdown file content.
contains: Precomputed membership lists for file paths, canonical IDs, display canonical IDs, and section types contained in the part.

File metadata

repositories[].files[] is the main entrypoint for locating source files in the emitted markdown.

Useful fields include:

path / module: Repository-relative file path and Python module name when applicable.
part_path / markdown_path: Output file holding the file body and the unsplit pack path when present.
hrefs / anchors: Direct markdown targets for the file index entry and source body.
locators: Locator metadata for the file entry. In v1 payloads this still includes the legacy availability booleans plus markdown, split_part, and/or reconstructed locator objects. In v2 payloads it carries the locator objects directly.
inclusion_reason: Present for focused packs. Records why the file was selected and which paths pulled it into the pack.
references_out / references_in / unresolved_references_count: Optional conservative file-reference summaries derived from Python symbol analysis.
markdown_lines: Unsplit line range for the file section when line ranges are available.
language / fence_language / language_family: Rendering and retrieval-oriented language metadata.
sha256_original / sha256_stubbed / sha256_effective: Integrity hashes for original file content, stubbed content, and the actual packed body.
sizes: Character, byte, and token estimates for original and effective file bodies.
summary.summary_text: Deterministic short prose describing the file’s role and primary symbols.
symbol_ids / display_symbol_ids / symbol_canonical_ids: Direct symbol membership for the file.

In normalized v3 payloads, file entries instead use indexed fields such as p (path), part (part path), lang (language), mod (module), and optional analysis fields like imp (imports), exp (exports), doc ([start_line, end_line]), role, and sum.st (summary text).

Symbol metadata

repositories[].symbols[] provides both occurrence-level and canonical-body metadata.

Useful fields include:

display_id / display_local_id: Short markdown-facing IDs.
canonical_id / local_id: Machine-facing SHA-256 based IDs.
ids: Nested alias object containing both display and machine IDs.
path / qualname / kind / def_line: Source identity and location.
file_part / file_href / file_anchor: Direct location of the file body containing the symbol occurrence.
canonical_part / canonical_href / canonical_anchor: Canonical function-library location for stub layout.
index_markdown_lines / file_markdown_lines / canonical_markdown_lines: Unsplit markdown line ranges when available.
occurrence_count_for_canonical_id: Number of source occurrences sharing the same canonical body.
locators: Symbol locator metadata. markdown can include file, symbol-index, and canonical markdown ranges; split_part points at the split artifact being read; reconstructed points at the reconstructed file span and body span.
purpose_text: Deterministic short prose summarizing the symbol’s role, ownership, and signature hints.
references_out / references_in / unresolved_references_count: Optional conservative symbol-reference summaries.

In normalized v3 payloads, symbol entries use compact indexed fields such as i (local machine ID), c (canonical machine ID when needed), p (path

index), q (qualname index), k (kind index), l1 / l2 (line range), plus optional o (owner class ID), d (decorator indexes), and pt (purpose-text index).

Lookup maps

Use repositories[].lookup when you need constant-shape access instead of scanning arrays.

Useful maps include:

file_by_path: Path to a compact file summary with part and href metadata.
part_by_file: Path to the emitted markdown file containing that file body.
symbols_by_file / display_symbols_by_file: File-to-symbol membership by machine or display IDs.
file_by_symbol / file_by_display_symbol: Symbol-to-file reverse indexes.
symbol_by_local_id / symbol_by_display_local_id: Direct symbol entry lookup by occurrence ID.
symbols_by_canonical_id / symbols_by_display_id: Grouped symbol entries for canonical-body lookups.

In v2 payloads, check repositories[].index_json_features first. If lookup is false, consumers should scan files[] and symbols[] directly instead of assuming lookup is present. If symbol_index_lines is false, compact payloads intentionally omit index_markdown_lines even for unsplit packs.

Normalized v3 payloads intentionally do not include the v2 lookup maps. Use the intern tables plus files[], classes[], symbols[], and the optional analysis sections directly.

Locator semantics

Locator fields are intended to be truthful with respect to the emitted markdown.

In unsplit output:

anchor hrefs are available
line ranges are also available
compact navigation still preserves machine-targetable anchors
locator_space = markdown points locators.markdown into context.md

In split output:

hrefs still point to the actual .index.md or .partN.md file
unsplit line ranges are omitted
locators.split_part provides stable line ranges inside the split artifact
consumers should follow part_path and hrefs instead of assuming a single markdown file

When --emit-standalone-unpacker is enabled:

locator_space = auto resolves to reconstructed locators
file and symbol locators.reconstructed point at the unpacked file tree
combined multi-repo packs prefix reconstructed paths with the repository slug

If a locator field is present, it should resolve against the written output.

Validation helper

codecrate.validate_index_json.validate_index_payload() validates internal sidecar consistency.

It checks:

output file existence when a base directory is provided
href targets and anchor existence for v1/v2 payloads
part/file/symbol cross references
line-range validity
lookup map consistency for v1/v2 payloads
normalized-table index validity for v3 payloads

Example:

import json
from pathlib import Path

from codecrate.validate_index_json import validate_index_payload

payload = json.loads(Path("context.index.json").read_text(encoding="utf-8"))
errors = validate_index_payload(payload, base_dir=Path("."))
if errors:
    raise SystemExit("\n".join(errors))

Query recipes

The schema reference is only half of the consumer story. Common recipes:

find the file for a symbol:

python examples/find_symbol_file.py context.index.json codecrate.cli:main

list entrypoints plus their reachable file counts:

python examples/list_entrypoints.py context.index.json

locate related tests for a changed file:

python examples/find_related_tests.py context.index.json codecrate/pack_pipeline.py

prefer reconstructed locators when they exist:

python examples/prefer_reconstructed_locators.py context.index.json codecrate/cli.py

read normalized tables correctly:

python examples/read_normalized_tables.py context.index.json

Consumer guidance

For most tooling:

start with repositories[].lookup when you already know a path or ID
use repositories[].files[] to locate the rendered file body
use repositories[].symbols[] when symbol identity or canonical bodies matter
use repositories[].parts[] to drive split-output retrieval UIs

Prefer machine IDs for stable automation and display IDs only when you need to match existing markdown anchors or present short identifiers to users.