Pack Format

Codecrate outputs a single Markdown file. When --split-max-chars is used, it can also emit .index.md and .partN.md files. The output is intended for LLM consumption and contains enough information to:

  • browse code quickly (directory tree + symbol index)

  • reconstruct original files, either directly (full layout) or via stubs + canonical sources (stub layout)

High-level structure

A typical pack includes:

  • How to Use This Pack: reading guidance for LLMs

  • Directory Tree: a simple text tree of files

  • Symbol Index: per-file symbol list with line ranges

  • Function Library (stub layout only): canonical function bodies keyed by ID

  • Files: full file content (full layout) or stubbed files (stub layout)

The Manifest is required for machine operations (unpack/patch/validate-pack). For token efficiency, split .partN.md files omit it, and you can disable it entirely with --no-manifest (LLM-only packs).

Manifest metadata also records explicit ID/marker schemes for forward compatibility:

  • id_format_version (currently sha1-8-upper:v1)

  • marker_format_version (currently v1)

  • per-definition has_marker hints in stub layouts (for validation accuracy)

Codecrate can also emit JSON sidecars:

  • codecrate.manifest-json.v1: manifest-focused tooling export

  • codecrate.index-json.v1: retrieval-oriented file/symbol/part index for agents and tools

See Index JSON Sidecar for the detailed sidecar contract.

Profiles can change output defaults without changing the underlying pack format:

  • human keeps current markdown-first behavior

  • agent implies compact navigation and normalized v3 index JSON output

  • lean-agent implies the leanest normalized v3 agent sidecar defaults

  • hybrid keeps current markdown behavior and also emits index JSON output

  • portable implies manifest-enabled full layout for standalone unpack

  • portable-agent keeps reconstructable full layout plus normalized retrieval defaults

The index sidecar includes deterministic per-repository metadata for:

  • emitted markdown part files

  • file-to-part lookup

  • symbol-to-file and symbol-to-canonical-body lookup

  • direct href-style links for file and symbol navigation

  • unsplit markdown line ranges for file sections, symbol index entries, and canonical bodies

  • explicit reverse lookup indexes for files and symbols

  • part character and token estimates

  • part oversize status and effective split policy

  • safety findings

  • per-file language detection and symbol extraction backend/status reporting

Split part membership is captured directly during split generation rather than recovered later by reparsing emitted markdown.

For non-Python files, the index sidecar reports:

  • language_detected

  • symbol_backend_requested

  • symbol_backend_used

  • symbol_extraction_status

This makes it explicit whether symbol extraction was unavailable, disabled, unsupported for the file type, or completed successfully.

The index sidecar also separates human-facing and machine-facing identifiers:

  • display_id / display_local_id keep the current short pack IDs used by markdown anchors

  • canonical_id / local_id use stronger SHA-256 based machine IDs for tooling

  • display_id_format_version and canonical_id_format_version record both schemes explicitly

Per-file entries also include lightweight review metadata such as byte, character, and token estimates for both original and effective packed content.

The Machine Header includes:

  • format

  • repo_label / repo_slug

  • manifest_sha256

Protocol constants

  • pack format: codecrate.v4

  • patch metadata format: codecrate.patch.v1

  • manifest-json format: codecrate.manifest-json.v1

  • index-json format: codecrate.index-json.v1

  • machine header fence: codecrate-machine-header

  • manifest fence: codecrate-manifest

  • patch metadata fence: codecrate-patch-meta
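As a rough illustration of how a tool might locate one of these named fences, the sketch below pulls the body of a fence by its name. It assumes the fences are ordinary triple-backtick Markdown fences with the name as the info string and that the body is JSON; neither detail is stated above, and extract_fence is a hypothetical helper, not part of Codecrate.

```python
import json
import re
from typing import Optional

# Built at runtime so this example does not contain a literal fence line.
FENCE = "`" * 3


def extract_fence(markdown: str, fence_name: str) -> Optional[str]:
    """Return the body of the first fence whose info string equals fence_name."""
    pattern = re.compile(
        r"^" + FENCE + re.escape(fence_name) + r"[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(markdown)
    return match.group(1) if match else None


# Hypothetical miniature pack; the real header carries more fields.
sample = "\n".join(
    [
        "# Pack",
        FENCE + "codecrate-machine-header",
        '{"format": "codecrate.v4"}',
        FENCE,
    ]
)

header = json.loads(extract_fence(sample, "codecrate-machine-header"))
print(header["format"])
```

Matching the full info string (rather than a prefix) keeps codecrate-manifest from being confused with any longer fence name that starts the same way.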

Layouts

full

The pack includes full file contents under Files. The manifest is minimal and does not include function metadata.

stubs

The pack includes stubbed file contents under Files and a Function Library with canonical function bodies.

auto

Chooses stubs only when deduplication actually collapses at least one duplicate body; otherwise chooses full, which is the more token-efficient layout in that case.
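The auto decision can be pictured with a small sketch. It assumes the packer already has a mapping from each definition occurrence to its canonical body ID (see IDs and deduplication); choose_layout and the example IDs are hypothetical and only illustrate the rule, not Codecrate's actual implementation.

```python
def choose_layout(canonical_id_by_occurrence: dict[str, str]) -> str:
    """Return 'stubs' only if at least two occurrences share one canonical body."""
    occurrences = len(canonical_id_by_occurrence)
    distinct_bodies = len(set(canonical_id_by_occurrence.values()))
    return "stubs" if distinct_bodies < occurrences else "full"


# Made-up IDs: two occurrences collapse onto the same body in the first case.
with_duplicates = {"AAAA0001": "BBBB0001", "AAAA0002": "BBBB0001"}
without_duplicates = {"AAAA0001": "BBBB0001", "AAAA0002": "BBBB0002"}

print(choose_layout(with_duplicates))
print(choose_layout(without_duplicates))
```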

Portable unpack contract

The initial standalone unpack flow targets a conservative subset of the pack format:

  • unsplit markdown is the authoritative machine-readable reconstruction source

  • full portable unpack requires the Manifest plus file bodies under ## Files

  • stubs portable unpack additionally requires the Function Library plus manifest defs metadata to resolve markers back into canonical bodies

  • split .index.md / .partN.md outputs are not the standalone machine source

codecrate pack --profile portable --emit-standalone-unpacker writes a standard-library-only <output>.unpack.py beside the main markdown output. Generated portable-agent markdown also includes a non-authoritative codecrate-agent-workflow JSON fence. It gives coding agents a deterministic first-action hint, including the recommended python3 -S reconstruction command, sidecar filenames, fallback interpreters, and a reminder to avoid manual markdown scraping unless unpacking fails with a Codecrate error.

IDs and deduplication

In stub layout, Codecrate distinguishes:

local_id

Unique per definition occurrence (stable by file path + qualname + def line).

id

Canonical body ID. When dedupe is enabled and identical bodies are detected, multiple local_id values may share the same canonical id.
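The relationship between the two IDs can be sketched as follows. The scheme name sha1-8-upper suggests the first 8 hex characters of a SHA-1 digest in upper case, but the exact strings that Codecrate hashes are not specified here, so the inputs below (and the helper names) are assumptions for illustration only.

```python
import hashlib


def short_id(text: str) -> str:
    """First 8 hex characters of SHA-1, upper-cased (assumed reading of sha1-8-upper)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8].upper()


def local_id(path: str, qualname: str, def_line: int) -> str:
    # Assumed input: the occurrence's location, so each occurrence gets its own ID.
    return short_id(f"{path}:{qualname}:{def_line}")


def canonical_id(body: str) -> str:
    # Assumed input: the function body text, so identical bodies share an ID.
    return short_id(body)


body = "def helper():\n    return 1\n"
first = (local_id("pkg/a.py", "helper", 10), canonical_id(body))
second = (local_id("pkg/b.py", "helper", 42), canonical_id(body))

# Two occurrences, two local IDs, one shared canonical ID.
print(first, second)
```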

Stub markers

Stubbed file bodies contain markers like:

...  # ↪ FUNC:v1:XXXXXXXX

The marker references the function definition occurrence. During unpack, Codecrate locates the marker, finds the def line above it (including decorators), and replaces that region with the canonical function body from the Function Library.
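A minimal sketch of the locate step, assuming the ID part of a marker is 8 upper-case hex characters (inferred from the sha1-8-upper scheme) and that decorators are single lines starting with @. The helper names and the example ID are hypothetical; multi-line decorators and nested definitions are not handled.

```python
import re

MARKER_RE = re.compile(r"#\s*↪\s*FUNC:(?P<version>v\d+):(?P<id>[0-9A-F]{8})\s*$")


def find_markers(stubbed_source: str) -> list[tuple[int, str, str]]:
    """Return (zero-based line index, marker version, id) for every stub marker."""
    hits = []
    for index, line in enumerate(stubbed_source.splitlines()):
        match = MARKER_RE.search(line)
        if match:
            hits.append((index, match.group("version"), match.group("id")))
    return hits


def region_start(lines: list[str], marker_index: int) -> int:
    """Walk up from the marker to the def line, then include decorators above it."""
    i = marker_index
    while i >= 0 and not lines[i].lstrip().startswith(("def ", "async def ")):
        i -= 1
    if i < 0:
        raise ValueError("no def line found above marker")
    while i > 0 and lines[i - 1].lstrip().startswith("@"):
        i -= 1
    return i


stub = "import functools\n@functools.cache\ndef add(a, b):\n    ...  # ↪ FUNC:v1:1A2B3C4D\n"
markers = find_markers(stub)
start = region_start(stub.splitlines(), markers[0][0])
```

The region from start through the marker line is what would be swapped for the canonical body looked up in the Function Library.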

Patch metadata

Generated patch markdown includes a codecrate-patch-meta fence with:

  • patch format id (codecrate.patch.v1)

  • baseline manifest checksum

  • baseline per-file original checksums

apply uses this metadata to verify that baseline files still match before applying hunks.
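The baseline check can be sketched as below. It assumes the per-file checksums are SHA-256 hex digests of the raw file bytes and that the metadata exposes them as a path-to-digest mapping; both are assumptions, and stale_files is a hypothetical helper rather than Codecrate's own code.

```python
import hashlib
from pathlib import Path


def stale_files(root: Path, baseline: dict[str, str]) -> list[str]:
    """Return relative paths whose current content no longer matches the baseline."""
    stale = []
    for rel_path, expected in sorted(baseline.items()):
        target = root / rel_path
        if not target.is_file():
            # A missing file cannot match its recorded checksum.
            stale.append(rel_path)
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            stale.append(rel_path)
    return stale
```

A caller would refuse to apply any hunks when the returned list is non-empty, since the patch was generated against different file content.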

Determinism

Pack ordering is deterministic by normalized relative path and stable id order. Split outputs preserve deterministic section/file/function ordering and avoid splitting inside fenced code blocks.

When a single logical block exceeds --split-max-chars, Codecrate keeps it intact in an oversize part by default. Use --split-strict to fail instead, or --split-allow-cut-files to explicitly cut oversized file blocks across multiple parts.
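The default and strict behaviours can be illustrated with a simplified greedy splitter. This is only a sketch: it treats each logical block as an opaque string, does not model --split-allow-cut-files, and split_blocks is a hypothetical name, not Codecrate's API.

```python
def split_blocks(blocks: list[str], max_chars: int, strict: bool = False) -> list[list[str]]:
    """Group blocks into parts of at most max_chars, never cutting inside a block."""
    parts: list[list[str]] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        if len(block) > max_chars:
            if strict:
                raise ValueError(f"block of {len(block)} chars exceeds {max_chars}")
            # Default behaviour: keep the block intact in its own oversize part.
            if current:
                parts.append(current)
                current, size = [], 0
            parts.append([block])
            continue
        if current and size + len(block) > max_chars:
            parts.append(current)
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        parts.append(current)
    return parts


parts = split_blocks(["aaa", "bb", "cccccccc", "d"], max_chars=5)
```

Because blocks are consumed in their given order and never reordered, the same input always yields the same parts.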

When binary files are detected during packing, they are skipped and reported as Skipped as binary: N file(s) in the pack header and Safety Report (when enabled).

Line ranges

The Symbol Index can include markdown line ranges (Lx-y) that refer to line numbers inside the packed Markdown file itself.

When a pack is split into .partN.md files, these markdown line ranges are omitted in the split parts because they are not stable across files. Use the per-part links instead (for example context.part3.md#src-... / #func-...).
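For an unsplit pack, a consumer could resolve such a range with a few lines of code. The sketch assumes the range is written like L12-34 and is 1-based and inclusive; the text above only gives the Lx-y shape, so treat both details and the helper name as assumptions.

```python
import re

RANGE_RE = re.compile(r"\bL(\d+)-(\d+)\b")


def slice_by_range(markdown: str, range_text: str) -> str:
    """Return the lines of markdown covered by a range such as 'L12-34'."""
    match = RANGE_RE.search(range_text)
    if match is None:
        raise ValueError(f"not a line range: {range_text!r}")
    start, end = int(match.group(1)), int(match.group(2))
    lines = markdown.splitlines()
    # Convert the 1-based inclusive range to a Python slice.
    return "\n".join(lines[start - 1 : end])


pack_text = "line one\nline two\nline three\nline four"
excerpt = slice_by_range(pack_text, "(L2-3)")
```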