Pack Format
===========

Codecrate outputs a single Markdown file. When ``--split-max-chars`` is used,
it can also emit ``.index.md`` and ``.partN.md`` files intended for LLM
consumption. A pack contains enough information to:

* browse code quickly (directory tree + symbol index)
* reconstruct original files, either directly (full layout) or from stubs plus
  canonical sources (stub layout)


High-level structure
--------------------

A typical pack includes:

* **How to Use This Pack**: reading guidance for LLMs
* **Directory Tree**: a simple text tree of files
* **Symbol Index**: per-file symbol list with line ranges
* **Function Library** (stub layout only): canonical function bodies keyed by ID
* **Files**: full file content (full layout) or stubbed files (stub layout)

The Manifest is required for machine operations (unpack/patch/validate-pack). For token
efficiency, split ``.partN.md`` files omit it, and you can disable it entirely with
``--no-manifest`` (LLM-only packs).

Manifest metadata also records explicit ID/marker schemes for forward compatibility:

* ``id_format_version`` (currently ``sha1-8-upper:v1``)
* ``marker_format_version`` (currently ``v1``)
* per-definition ``has_marker`` hints in stub layouts (for validation accuracy)
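
As an illustration of what a scheme name like ``sha1-8-upper:v1`` suggests, the
sketch below derives an eight-character uppercase ID from a SHA-1 digest. The
helper name ``short_id`` and the choice of hashing UTF-8 text are assumptions
made for this example; the exact input that Codecrate hashes is not specified
here.

.. code-block:: python

   import hashlib


   def short_id(text: str, length: int = 8) -> str:
       """Return the first `length` hex digits of SHA-1(text), uppercased.

       Hypothetical helper: what Codecrate actually feeds into the hash
       is an assumption of this sketch.
       """
       digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
       return digest[:length].upper()

The result is deterministic for a given input, which is the property a
versioned ID scheme needs in order to stay comparable across packs.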

Codecrate can also emit JSON sidecars:

* ``codecrate.manifest-json.v1``: manifest-focused tooling export
* ``codecrate.index-json.v1``: retrieval-oriented file/symbol/part index for agents and tools

See :doc:`index_json` for the detailed sidecar contract.

Profiles can change output defaults without changing the underlying pack format:

* ``human`` keeps current markdown-first behavior
* ``agent`` implies compact navigation and normalized v3 index JSON output
* ``lean-agent`` implies the leanest normalized v3 agent sidecar defaults
* ``hybrid`` keeps current markdown behavior and also emits index JSON output
* ``portable`` implies manifest-enabled ``full`` layout for standalone unpack
* ``portable-agent`` keeps reconstructable ``full`` layout plus normalized retrieval defaults

The index sidecar includes deterministic per-repository metadata for:

* emitted markdown part files
* file-to-part lookup
* symbol-to-file and symbol-to-canonical-body lookup
* direct href-style links for file and symbol navigation
* unsplit markdown line ranges for file sections, symbol index entries, and canonical bodies
* explicit reverse lookup indexes for files and symbols
* part character and token estimates
* part oversize status and effective split policy
* safety findings
* per-file language detection and symbol extraction backend/status reporting

Split part membership is captured directly during split generation rather than
recovered later by reparsing emitted markdown.

For non-Python files, the index sidecar reports:

* ``language_detected``
* ``symbol_backend_requested``
* ``symbol_backend_used``
* ``symbol_extraction_status``

This makes it explicit whether symbol extraction was unavailable, disabled,
unsupported for the file type, or completed successfully.

The index sidecar also separates human-facing and machine-facing identifiers:

* ``display_id`` / ``display_local_id`` keep the current short pack IDs used by markdown anchors
* ``canonical_id`` / ``local_id`` use stronger SHA-256 based machine IDs for tooling
* ``display_id_format_version`` and ``canonical_id_format_version`` record both schemes explicitly

Per-file entries also include lightweight review metadata such as byte, character,
and token estimates for both original and effective packed content.

The Machine Header includes:

* ``format``
* ``repo_label`` / ``repo_slug``
* ``manifest_sha256``


Protocol constants
------------------

* pack format: ``codecrate.v4``
* patch metadata format: ``codecrate.patch.v1``
* manifest-json format: ``codecrate.manifest-json.v1``
* index-json format: ``codecrate.index-json.v1``
* machine header fence: ``codecrate-machine-header``
* manifest fence: ``codecrate-manifest``
* patch metadata fence: ``codecrate-patch-meta``
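
Tools that consume a pack can locate these blocks by fence name. The following
is a minimal sketch, assuming the fences are ordinary Markdown code fences
(three backticks) whose info string is the fence name listed above. The
function name ``extract_fence`` is hypothetical, and the sketch returns the raw
body text without assuming anything about its format.

.. code-block:: python

   import re
   from typing import Optional

   FENCE = "`" * 3  # Markdown code fence delimiter (assumed fence style)


   def extract_fence(markdown: str, info: str) -> Optional[str]:
       """Return the body of the first fenced block whose info string is `info`."""
       pattern = re.compile(
           "^" + FENCE + re.escape(info) + r"[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
           re.DOTALL | re.MULTILINE,
       )
       match = pattern.search(markdown)
       return match.group(1) if match else None

For example, ``extract_fence(text, "codecrate-manifest")`` would return
``None`` for a pack written with ``--no-manifest``.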

Layouts
-------

``full``
   The pack includes full file contents under **Files**. The manifest is minimal and
   does not include function metadata.

``stubs``
   The pack includes stubbed file contents under **Files** and a **Function Library**
   with canonical function bodies.

``auto``
   Chooses ``stubs`` only when deduplication actually collapses something; otherwise
   chooses ``full`` for best token efficiency.
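
The ``auto`` rule can be sketched as a comparison between the number of
definition occurrences and the number of distinct canonical bodies. This is an
illustrative reading of the rule above, not Codecrate's implementation; the
function name and the shape of the input mapping are assumptions.

.. code-block:: python

   def choose_layout(occurrence_to_body_id: dict[str, str]) -> str:
       """Pick "stubs" only if at least two occurrences share one body ID.

       `occurrence_to_body_id` is a hypothetical mapping from each
       definition occurrence to the ID of its canonical body.
       """
       occurrences = len(occurrence_to_body_id)
       distinct_bodies = len(set(occurrence_to_body_id.values()))
       # Deduplication "collapses something" when bodies are shared.
       return "stubs" if distinct_bodies < occurrences else "full"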


Portable unpack contract
------------------------

The initial standalone unpack flow targets a conservative subset of the pack
format:

* unsplit markdown is the authoritative machine-readable reconstruction source
* ``full`` portable unpack requires the Manifest plus file bodies under
  ``## Files``
* ``stubs`` portable unpack additionally requires the Function Library plus
  manifest ``defs`` metadata to resolve markers back into canonical bodies
* split ``.index.md`` / ``.partN.md`` outputs are not the standalone machine
  source

``codecrate pack --profile portable --emit-standalone-unpacker`` writes a
standard-library-only ``<output>.unpack.py`` beside the main markdown output.
Generated portable-agent markdown also includes a non-authoritative
``codecrate-agent-workflow`` JSON fence. It gives coding agents a deterministic
first-action hint, including the recommended ``python3 -S`` reconstruction
command, sidecar filenames, fallback interpreters, and a reminder to avoid
manual markdown scraping unless unpacking fails with a Codecrate error.


IDs and deduplication
---------------------

In stub layout, Codecrate distinguishes:

``local_id``
   Unique per definition occurrence (stable by file path + qualname + def line).

``id``
   Canonical body ID. When dedupe is enabled and identical bodies are detected,
   multiple ``local_id`` values may share the same canonical ``id``.
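
The relationship between the two IDs can be pictured as a grouping step. The
sketch below groups hypothetical ``local_id`` values by exact body text, so
each group corresponds to one canonical body. Treating "identical" as exact
text equality is an assumption of this example, and the real ID derivation is
not shown.

.. code-block:: python

   from collections import defaultdict


   def group_by_body(bodies: dict[str, str]) -> dict[str, list[str]]:
       """Map each distinct body text to the sorted local_ids that share it."""
       groups: dict[str, list[str]] = defaultdict(list)
       for local_id, body in bodies.items():
           groups[body].append(local_id)
       return {body: sorted(ids) for body, ids in groups.items()}

A group with more than one ``local_id`` is a case where several occurrences
would share a single canonical ``id``.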


Stub markers
------------

Stubbed file bodies contain markers like:

.. code-block:: text

   ...  # ↪ FUNC:v1:XXXXXXXX

The marker references the function definition occurrence. During unpack, Codecrate
locates the marker, finds the ``def`` line above it (including decorators), and
replaces that region with the canonical function body from the Function Library.
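
The replacement step described above can be sketched as follows. This is a
simplified illustration, not Codecrate's unpacker. It assumes that marker IDs
are eight uppercase hex digits, that each canonical body includes its own
decorators and ``def`` line, that decorators occupy single lines, and that
``bodies_by_marker_id`` (a hypothetical name) has already been resolved from
the manifest and the Function Library.

.. code-block:: python

   import re

   MARKER_RE = re.compile(r"#\s*↪\s*FUNC:v1:([0-9A-F]{8})\s*$")
   DEF_RE = re.compile(r"^\s*(async\s+def|def)\s")


   def expand_stubs(stub_text: str, bodies_by_marker_id: dict[str, str]) -> str:
       """Replace each stubbed definition with its canonical body."""
       out: list[str] = []
       for line in stub_text.splitlines():
           out.append(line)
           match = MARKER_RE.search(line)
           if match is None:
               continue
           # Walk upward from the marker to the nearest def line.
           start = len(out) - 1
           while start >= 0 and not DEF_RE.match(out[start]):
               start -= 1
           if start < 0:
               continue  # no def found; leave the marker line untouched
           # Extend the region upward over single-line decorators.
           while start > 0 and out[start - 1].lstrip().startswith("@"):
               start -= 1
           # An unknown marker ID raises KeyError in this sketch.
           out[start:] = bodies_by_marker_id[match.group(1)].splitlines()
       return "\n".join(out) + "\n"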


Patch metadata
--------------

Generated patch markdown includes a ``codecrate-patch-meta`` fence with:

* patch format id (``codecrate.patch.v1``)
* baseline manifest checksum
* baseline per-file original checksums

``apply`` uses this metadata to verify that baseline files still match before
applying hunks.
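
A baseline check of this kind can be sketched as below. The example assumes
SHA-256 hex digests for the per-file checksums and takes file contents as an
in-memory mapping; both choices, and the function name, are assumptions for
illustration rather than a description of ``apply`` itself.

.. code-block:: python

   import hashlib


   def find_baseline_mismatches(
       current: dict[str, bytes], expected: dict[str, str]
   ) -> list[str]:
       """Return paths whose current content no longer matches the baseline.

       `current` maps relative paths to file bytes; `expected` maps
       relative paths to hex digests (SHA-256 is assumed here).
       """
       mismatches = []
       for rel_path, want in sorted(expected.items()):
           data = current.get(rel_path)
           if data is None or hashlib.sha256(data).hexdigest() != want:
               mismatches.append(rel_path)
       return mismatches

An empty result means every baseline file still matches, so applying hunks is
safe with respect to this check.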


Determinism
-----------

Pack ordering is deterministic by normalized relative path and stable ID order.
Split outputs preserve deterministic section/file/function ordering and avoid
splitting inside fenced code blocks.

When a single logical block exceeds ``--split-max-chars``, Codecrate keeps it
intact in an oversize part by default. Use ``--split-strict`` to fail instead,
or ``--split-allow-cut-files`` to explicitly cut oversized file blocks across
multiple parts.
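
The default and strict behaviors can be illustrated with a small greedy
grouping sketch. It is not Codecrate's splitter: the function name, the greedy
strategy, and the use of ``ValueError`` are assumptions, and cutting blocks
(``--split-allow-cut-files``) is deliberately left out.

.. code-block:: python

   def split_blocks(
       blocks: list[str], max_chars: int, strict: bool = False
   ) -> list[list[str]]:
       """Group blocks into parts without ever cutting a block."""
       parts: list[list[str]] = []
       current: list[str] = []
       size = 0
       for block in blocks:
           if len(block) > max_chars:
               if strict:
                   raise ValueError("block exceeds max_chars")
               if current:
                   parts.append(current)
                   current, size = [], 0
               parts.append([block])  # kept intact in its own oversize part
               continue
           if current and size + len(block) > max_chars:
               parts.append(current)
               current, size = [], 0
           current.append(block)
           size += len(block)
       if current:
           parts.append(current)
       return parts

Because the input order is preserved and no block is cut, the same input always
yields the same parts.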

When binary files are detected during packing, they are skipped and reported as
``Skipped as binary: N file(s)`` in the pack header and Safety Report (when enabled).


Line ranges
-----------

The Symbol Index can include markdown line ranges ``(Lx-y)`` that refer to line numbers
inside the packed Markdown file itself.

When a pack is split into ``.partN.md`` files, these markdown line ranges are omitted in
the split parts because they are not stable across files. Use the per-part links
instead (for example ``context.part3.md#src-...`` / ``#func-...``).
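
For an unsplit pack, a range of the form ``(Lx-y)`` can be resolved against the
Markdown text directly. The sketch below assumes the range is 1-based and
inclusive, which is not stated above; the function name is hypothetical.

.. code-block:: python

   import re
   from typing import List, Optional

   RANGE_RE = re.compile(r"\(L(\d+)-(\d+)\)")


   def lines_for_entry(markdown: str, entry: str) -> Optional[List[str]]:
       """Return the pack lines referenced by an entry, or None if it has no range."""
       match = RANGE_RE.search(entry)
       if match is None:
           return None
       start, end = int(match.group(1)), int(match.group(2))
       # Assumed convention: 1-based, inclusive on both ends.
       return markdown.splitlines()[start - 1 : end]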