Index JSON Sidecar
==================

``index-json`` is the machine-facing companion to the Markdown pack. The
Markdown output stays optimized for human reading, patch/apply, and round-trip
reconstruction; the sidecar exists so tools can answer common retrieval
questions without scraping markdown.


Generation
----------

Generate the sidecar explicitly:

.. code-block:: console

   codecrate pack . -o context.md --index-json

Or let the profile enable it:

.. code-block:: console

   codecrate pack . -o context.md --profile agent
   codecrate pack . -o context.md --profile lean-agent
   codecrate pack . -o context.md --profile hybrid
   codecrate pack . -o context.md --profile portable-agent

``--profile agent`` resolves to the normalized v3 sidecar by default, while
``--profile lean-agent`` keeps normalized v3 but trims analysis-heavy payloads
and pretty-print whitespace by default, while ``--profile hybrid`` keeps the
full v1-compatible sidecar. ``--profile portable-agent`` pairs a reconstructable
``full`` pack with a normalized sidecar, a generated standalone unpacker, and
dual locator families.

Or choose a specific sidecar mode:

.. code-block:: console

   codecrate pack . -o context.md --index-json-mode compact
   codecrate pack . -o context.md --index-json-mode minimal
   codecrate pack . -o context.md --index-json-mode normalized
   codecrate pack . -o context.md --index-json-mode minimal --locator-space dual

``--index-json`` alone defaults to the full compatibility surface. Use
``--index-json-mode compact``, ``--index-json-mode minimal``, or
``--index-json-mode normalized`` when you want a machine-first sidecar
explicitly.

Use ``normalized`` for the smallest recommended sidecar in agent workflows.
Use ``minimal`` for the smallest v2-compatible sidecar.

By default, the sidecar is written next to the markdown output as
``<output>.index.json``.


Contract and compatibility
--------------------------

The sidecar is versioned independently:

* ``codecrate.index-json.v1``: full sidecar surface
* ``codecrate.index-json.v2``: compact or minimal sidecar surface
* ``codecrate.index-json.v3``: normalized sidecar surface

The top-level ``mode`` field distinguishes ``full``, ``compact``, and
``minimal``, and ``normalized`` output.

Compatibility rules:

* v1 remains the full-fidelity compatibility surface
* v2 is a machine-first retrieval surface that removes redundant display and
  reverse-index duplication by default
* v3 is the most compact analysis-oriented surface and interns repeated paths,
  qualnames, and strings into shared tables
* machine-facing lookups should prefer explicit IDs and lookup maps over
  markdown scraping

The pack and sidecar are generated from the same export model, so repository,
file, symbol, and split-part metadata describe the markdown that was actually
written.

Locator targets are configurable independently from sidecar mode:

* ``--locator-space markdown`` keeps locators pointed at the markdown pack
* ``--locator-space reconstructed`` points locators at the reconstructed file tree
* ``--locator-space dual`` emits both families
* ``--locator-space auto`` resolves to ``reconstructed`` when
  ``--emit-standalone-unpacker`` is enabled and otherwise to ``markdown``


Top-level shape
---------------

The payload has this high-level structure:

.. code-block:: json

    {
      "format": "codecrate.index-json.v3",
      "mode": "normalized",
      "pack": { ... },
      "repositories": [ ... ]
    }

``pack``
   Global metadata about the emitted artifact set.

``repositories``
   Per-repository entries for both single-repo and multi-repo output.


Pack metadata
-------------

Useful ``pack`` fields include:

``index_json_mode``
   The resolved sidecar mode used for the emitted payload.

``format``
   The markdown pack protocol version.

``is_split``
   Whether markdown was emitted as a single pack or as ``.index.md`` plus
   ``.partN.md`` files.

``output_files``
   Relative paths to all emitted markdown files.

``display_id_format_version`` / ``canonical_id_format_version``
   Explicit ID schemes for display IDs and machine IDs.

``capabilities``
   Boolean feature flags such as manifest availability and whether unsplit line
   ranges are available.

``authority``
   Declares which artifact is authoritative for full layout, stub layout, and
   patch flows.


Repository metadata
-------------------

Each entry in ``repositories[]`` describes one packed repository.

Useful fields include:

``label`` / ``slug``
   Human-facing and path-safe repository identifiers.

``layout`` / ``effective_layout``
   Requested and resolved layout behavior.

``nav_mode``
   The actual navigation density reflected in the rendered markdown.

``locator_mode``
   How direct locators should be interpreted:

   * ``anchors+line-ranges`` for unsplit markdown
   * ``anchors`` for split output

``locator_space`` / ``secondary_locator_space``
   The primary machine-facing locator family, plus an optional secondary family
   when both markdown and reconstructed locators are emitted.

``reconstructed_root``
   Present for combined multi-repo packs when reconstructed locators are
   enabled. Reconstructed paths are relative to the unpack output root, so
   combined packs use ``<slug>/...`` paths.

``markdown_path``
   Present for unsplit output; ``null`` for split output.

``has_manifest`` / ``has_machine_header``
   Trust and round-trip signals for machine consumers.

``parts``
   Metadata for the emitted markdown files belonging to the repository.

``index_json_features``
   Declares optional v2 retrieval families such as lookup-map emission and
   compact symbol index lines.

``files``
    File-level retrieval, integrity, language, and location metadata.

``symbols``
   Symbol-level occurrence and canonical-body metadata.

``lookup``
     Reverse indexes for direct access by path or ID.

``graph`` / ``test_links`` / ``guide``
     Optional analysis metadata: import edges, heuristic test coupling, and a
     repository guide.

``package_summaries`` / ``entrypoint_paths`` / ``centrality_rank`` /
``likely_edit_targets``
    Optional package and hot-path summaries for quicker subsystem orientation.

``reference_graph``
    Optional conservative symbol-call metadata for impact analysis and review.


Mode summary
------------

``full`` (v1)
   Preserves the current compatibility surface, including display IDs, richer
   file and symbol metadata, and the larger reverse lookup set.

``compact`` (v2)
   Keeps machine-first retrieval with direct file and symbol navigation while
   dropping display-oriented duplication and heavyweight membership metadata.
   When ``index_json_features.lookup`` is true, the lookup maps are:

   * ``file_by_path``
   * ``part_by_file``
   * ``file_by_symbol``
   * ``symbol_by_local_id``

``minimal`` (v2)
   Starts from compact mode and trims additional convenience duplication.
   It is the smallest v2-compatible sidecar surface rather than the smallest
   overall sidecar. When ``index_json_features.lookup`` is true, the lookup
   maps are:

   * ``file_by_path``
   * ``symbol_by_local_id``

``normalized`` (v3)
   Interns repeated strings into ``repositories[].tables`` and replaces
   path/module/qualname references with integer indexes. It keeps the same
   machine-facing essentials as the richer modes while omitting markdown href
   duplication and v2 lookup maps.


Analysis metadata
-----------------

When analysis metadata is enabled, the sidecar also exposes:

* ``repositories[].classes[]`` with first-class class entries
* ``repositories[].files[].imports``
* ``repositories[].files[].exports``
* ``repositories[].files[].module_docstring_lines``
* ``repositories[].files[].role_hint``
* ``repositories[].files[].inclusion_reason`` for focused packs
* ``repositories[].files[].references_out`` / ``references_in``
* ``repositories[].symbols[].owner_class``
* ``repositories[].symbols[].decorators``
* ``repositories[].symbols[].references_out`` / ``references_in``
* ``repositories[].graph.import_edges``
* ``repositories[].test_links``
* ``repositories[].guide``
* ``repositories[].package_summaries`` / ``entrypoint_paths``
* ``repositories[].reference_graph.call_like_edges``

Use ``--no-analysis-metadata`` when you want a smaller sidecar and do not need
those architecture-oriented hints.

``--profile lean-agent`` applies that smaller-sidecar posture by default and
also minifies the JSON payload unless you opt back into ``--index-json-pretty``.


Part metadata
-------------

``repositories[].parts[]`` records the markdown files that contain repository
content.

Useful fields include:

``part_id``
   Stable repository-scoped identifier such as ``repo:pack`` or ``repo:part3``.

``path`` / ``kind``
   Relative output path and whether the part is the unsplit pack, split index,
   or a split content part.

``char_count`` / ``line_count`` / ``token_estimate``
   Lightweight sizing information for retrieval and UI decisions.

``sha256_content``
   Integrity hash of the emitted markdown file content.

``contains``
   Precomputed membership lists for file paths, canonical IDs, display
   canonical IDs, and section types contained in the part.


File metadata
-------------

``repositories[].files[]`` is the main entrypoint for locating source files in
the emitted markdown.

Useful fields include:

``path`` / ``module``
   Repository-relative file path and Python module name when applicable.

``part_path`` / ``markdown_path``
   Output file holding the file body and the unsplit pack path when present.

``hrefs`` / ``anchors``
   Direct markdown targets for the file index entry and source body.

``locators``
    Locator metadata for the file entry. In v1 payloads this still includes the
    legacy availability booleans plus ``markdown``, ``split_part``, and/or ``reconstructed``
    locator objects. In v2 payloads it carries the locator objects directly.

``inclusion_reason``
    Present for focused packs. Records why the file was selected and which paths
    pulled it into the pack.

``references_out`` / ``references_in`` / ``unresolved_references_count``
    Optional conservative file-reference summaries derived from Python symbol
    analysis.

``markdown_lines``
   Unsplit line range for the file section when line ranges are available.

``language`` / ``fence_language`` / ``language_family``
   Rendering and retrieval-oriented language metadata.

``sha256_original`` / ``sha256_stubbed`` / ``sha256_effective``
   Integrity hashes for original file content, stubbed content, and the actual
   packed body.

``sizes``
   Character, byte, and token estimates for original and effective file bodies.

``summary.summary_text``
   Deterministic short prose describing the file's role and primary symbols.

``symbol_ids`` / ``display_symbol_ids`` / ``symbol_canonical_ids``
   Direct symbol membership for the file.

In normalized v3 payloads, file entries instead use indexed fields such as
``p`` (path), ``part`` (part path), ``lang`` (language), ``mod`` (module), and
optional analysis fields like ``imp`` (imports), ``exp`` (exports), ``doc``
(``[start_line, end_line]``), ``role``, and ``sum.st`` (summary text).


Symbol metadata
---------------

``repositories[].symbols[]`` provides both occurrence-level and canonical-body
metadata.

Useful fields include:

``display_id`` / ``display_local_id``
   Short markdown-facing IDs.

``canonical_id`` / ``local_id``
   Machine-facing SHA-256 based IDs.

``ids``
   Nested alias object containing both display and machine IDs.

``path`` / ``qualname`` / ``kind`` / ``def_line``
   Source identity and location.

``file_part`` / ``file_href`` / ``file_anchor``
   Direct location of the file body containing the symbol occurrence.

``canonical_part`` / ``canonical_href`` / ``canonical_anchor``
   Canonical function-library location for stub layout.

``index_markdown_lines`` / ``file_markdown_lines`` / ``canonical_markdown_lines``
   Unsplit markdown line ranges when available.

``occurrence_count_for_canonical_id``
   Number of source occurrences sharing the same canonical body.

``locators``
    Symbol locator metadata. ``markdown`` can include file, symbol-index, and
    canonical markdown ranges; ``split_part`` points at the split artifact being
    read; ``reconstructed`` points at the reconstructed file span and body span.

``purpose_text``
    Deterministic short prose summarizing the symbol's role, ownership, and
    signature hints.

``references_out`` / ``references_in`` / ``unresolved_references_count``
    Optional conservative symbol-reference summaries.

In normalized v3 payloads, symbol entries use compact indexed fields such as
``i`` (local machine ID), ``c`` (canonical machine ID when needed), ``p`` (path
 index), ``q`` (qualname index), ``k`` (kind index), ``l1`` / ``l2`` (line
 range), plus optional ``o`` (owner class ID), ``d`` (decorator indexes), and
 ``pt`` (purpose-text index).


Lookup maps
-----------

Use ``repositories[].lookup`` when you need constant-shape access instead of
scanning arrays.

Useful maps include:

``file_by_path``
   Path to a compact file summary with part and href metadata.

``part_by_file``
   Path to the emitted markdown file containing that file body.

``symbols_by_file`` / ``display_symbols_by_file``
   File-to-symbol membership by machine or display IDs.

``file_by_symbol`` / ``file_by_display_symbol``
   Symbol-to-file reverse indexes.

``symbol_by_local_id`` / ``symbol_by_display_local_id``
   Direct symbol entry lookup by occurrence ID.

``symbols_by_canonical_id`` / ``symbols_by_display_id``
   Grouped symbol entries for canonical-body lookups.

In v2 payloads, check ``repositories[].index_json_features`` first. If
``lookup`` is false, consumers should scan ``files[]`` and ``symbols[]``
directly instead of assuming ``lookup`` is present. If
``symbol_index_lines`` is false, compact payloads intentionally omit
``index_markdown_lines`` even for unsplit packs.

Normalized v3 payloads intentionally do not include the v2 lookup maps. Use the
intern tables plus ``files[]``, ``classes[]``, ``symbols[]``, and the optional
analysis sections directly.


Locator semantics
-----------------

Locator fields are intended to be truthful with respect to the emitted markdown.

In unsplit output:

* anchor hrefs are available
* line ranges are also available
* compact navigation still preserves machine-targetable anchors
* ``locator_space = markdown`` points ``locators.markdown`` into ``context.md``

In split output:

* hrefs still point to the actual ``.index.md`` or ``.partN.md`` file
* unsplit line ranges are omitted
* ``locators.split_part`` provides stable line ranges inside the split artifact
* consumers should follow ``part_path`` and hrefs instead of assuming a single
  markdown file

When ``--emit-standalone-unpacker`` is enabled:

* ``locator_space = auto`` resolves to reconstructed locators
* file and symbol ``locators.reconstructed`` point at the unpacked file tree
* combined multi-repo packs prefix reconstructed paths with the repository slug

If a locator field is present, it should resolve against the written output.


Validation helper
-----------------

``codecrate.validate_index_json.validate_index_payload()`` validates internal
sidecar consistency.

It checks:

* output file existence when a base directory is provided
* href targets and anchor existence for v1/v2 payloads
* part/file/symbol cross references
* line-range validity
* lookup map consistency for v1/v2 payloads
* normalized-table index validity for v3 payloads

Example:

.. code-block:: python

   import json
   from pathlib import Path

   from codecrate.validate_index_json import validate_index_payload

   payload = json.loads(Path("context.index.json").read_text(encoding="utf-8"))
   errors = validate_index_payload(payload, base_dir=Path("."))
   if errors:
       raise SystemExit("\n".join(errors))


Query recipes
-------------

The schema reference is only half of the consumer story. Common recipes:

* find the file for a symbol:

  .. code-block:: console

     python examples/find_symbol_file.py context.index.json codecrate.cli:main

* list entrypoints plus their reachable file counts:

  .. code-block:: console

     python examples/list_entrypoints.py context.index.json

* locate related tests for a changed file:

  .. code-block:: console

     python examples/find_related_tests.py context.index.json codecrate/pack_pipeline.py

* prefer reconstructed locators when they exist:

  .. code-block:: console

     python examples/prefer_reconstructed_locators.py context.index.json codecrate/cli.py

* read normalized tables correctly:

  .. code-block:: console

     python examples/read_normalized_tables.py context.index.json


Consumer guidance
-----------------

For most tooling:

1. start with ``repositories[].lookup`` when you already know a path or ID
2. use ``repositories[].files[]`` to locate the rendered file body
3. use ``repositories[].symbols[]`` when symbol identity or canonical bodies
   matter
4. use ``repositories[].parts[]`` to drive split-output retrieval UIs

Prefer machine IDs for stable automation and display IDs only when you need to
match existing markdown anchors or present short identifiers to users.