Welcome to Lexis

The first programming language built entirely for AI. Code is DAGs, not text. Content-addressed via BLAKE3. Security-first with capability manifests enforced at parse time and runtime.

93 Opcodes • 32 Stdlib Subgraphs • 1823 Tests Passing • 30 Development Phases

What is Lexis?

Lexis is an AI-native programming language where programs are expressed as directed acyclic graphs (DAGs) in JSON format. Unlike traditional text-based languages that require parsing ambiguous syntax, Lexis programs are structured data — JSON objects with explicit nodes, edges, and operation codes.

Every node in a Lexis program is content-addressed using BLAKE3 cryptographic hashing. This means that two independently-written programs that perform the same computation produce the same hash — enabling automatic deduplication, tamper detection, and caching without any coordination.

Lexis enforces security at two layers: a static verifier checks capability declarations before execution, and a runtime sandbox enforces per-node permissions during execution. Programs must explicitly declare what they intend to do (print output, read files, access the network), and any undeclared operation is blocked.

Who is Lexis For?

  • AI Models — LLMs generate structured JSON programs that pass through a validation pipeline with classified error diagnostics and self-correction suggestions.
  • Multi-Agent Systems — Multiple AI agents can independently generate code fragments, compose them via content-hash deduplication, and execute with per-agent trust enforcement.
  • Regulated Computation — Content-addressing provides provenance and audit trails. Capability manifests ensure programs only access declared resources.
  • Tool Orchestration — MCP server integration lets any MCP-compatible AI agent generate, validate, and execute Lexis programs through standard tool calls.

Quick Example

Here is a minimal Lexis program that computes (5 + 3) = 8 and prints the result:

{ "spec": "lexis-0.1.0", "capabilities": ["PURE_COMPUTE", "IO_STDOUT"], "nodes": [ {"id": "a", "op": "const", "value": 5}, {"id": "b", "op": "const", "value": 3}, {"id": "sum", "op": "add", "inputs": ["a", "b"]}, {"id": "out", "op": "print", "inputs": ["sum"]} ] }

Every program flows through the pipeline: Parse → Validate → Verify Security → Schedule → Execute → Content-Address.

Design Philosophy

The core ideas that make Lexis fundamentally different from every other programming language.

Code is Data (DAGs, not text)

Programs are JSON-encoded directed acyclic graphs. There is no parser ambiguity, no syntax errors in the traditional sense — only structural validation. AI models work with data structures, not text manipulation.

Content-Addressed Identity

Every node gets a BLAKE3 hash based on its operation, value, and inputs. Same computation = same hash, regardless of who wrote it or when. This enables automatic deduplication, caching, and tamper detection.

Errors Are Values

Division by zero doesn't crash — it produces an ErrorValue that flows through the graph. Downstream nodes automatically propagate it. TRY_OR catches errors. IS_ERROR inspects them. Error chains carry causation for debugging.
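
A minimal sketch of the pattern, assuming TRY_OR takes the guarded value as its first input and the fallback as its second:

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "n", "op": "const", "value": 10},
    {"id": "zero", "op": "const", "value": 0},
    {"id": "bad", "op": "div", "inputs": ["n", "zero"]},
    {"id": "fallback", "op": "const", "value": -1},
    {"id": "safe", "op": "try_or", "inputs": ["bad", "fallback"]},
    {"id": "out", "op": "print", "inputs": ["safe"]}
  ]
}
```

Here `bad` evaluates to an ErrorValue rather than raising; `safe` resolves to -1, and the program prints the fallback instead of crashing.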

Functions Are Subgraphs

A function is a self-contained DAG with numbered input/output ports. No parameter names to agree on. Two AI models can independently generate the same function and get the same content hash — powerful for multi-AI ecosystems.
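
The exact subgraph wire format isn't pinned down in this overview, so treat the sketch below as illustrative: a `double` function as a self-contained DAG with port 0 in and port 0 out, invoked from the main graph. The name-based `subgraph_call` reference is an assumption; at the wire level, calls resolve to content hashes.

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "subgraphs": {
    "double": {
      "nodes": [
        {"id": "in0", "op": "input_port", "value": 0},
        {"id": "two", "op": "const", "value": 2},
        {"id": "prod", "op": "mul", "inputs": ["in0", "two"]},
        {"id": "out0", "op": "output_port", "inputs": ["prod"], "value": 0}
      ]
    }
  },
  "nodes": [
    {"id": "x", "op": "const", "value": 21},
    {"id": "y", "op": "subgraph_call", "inputs": ["x"], "value": "double"},
    {"id": "out", "op": "print", "inputs": ["y"]}
  ]
}
```

double(21) = 42, matching the Phase 1a milestone.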

Security by Default

Programs declare capabilities they need. Two-layer enforcement: static verifier at load time, runtime sandbox at execution. Three-way intersection for agents: trust ceiling ∩ agent declared ∩ program manifest. Closed by default — nothing is allowed unless explicitly declared.

Composition Over Coordination

No Operational Transform needed. Same hash = same thing. Merge is set union of nodes and subgraphs. AI-native: agents produce fragments, the system composes them. No conflict resolution required.

Provenance Separate from Content

Two agents writing CONST(5) get the same content hash. Authorship is tracked in Provenance metadata, not in the hash. This enables deduplication while maintaining audit trails.

Caching is Trivial

Content-addressing makes cache invalidation a non-problem. Same inputs + same function = same result. The hash IS the cache key. No staleness possible. No TTL needed. Three-layer caching: runtime memo, persistent disk, LLM-aware catalog.

Key Differentiators

| Feature | Traditional Languages | Lexis |
|---|---|---|
| Code Format | Text files with syntax rules | JSON DAGs — structured data, no parsing ambiguity |
| Error Handling | try/catch, exceptions, Result types | Errors are values that flow through edges |
| Identity | File paths, module names | BLAKE3 content hash — semantics IS identity |
| Security | OS-level permissions | Capability manifests enforced at parse + runtime |
| Multi-Author | Git merge conflicts | Content-addressed composition — same hash = same thing |
| AI Generation | Generate text, hope it parses | Generate JSON, structurally validated with classified error diagnostics |
| Caching | Manual invalidation, TTL | Content hash IS cache key — automatic, correct, zero-config |
| Conditionals | if/else control flow | SELECT data-flow node — no control edges needed |

Why DAGs?

Every Lexis program is a directed acyclic graph where nodes are operations and edges are data dependencies. This representation has several deep advantages:

  • No ordering ambiguity — The scheduler determines execution order from the graph structure via topological sort. The programmer declares what depends on what, not what runs when.
  • Automatic parallelism — Independent branches of the DAG can be executed in parallel with zero programmer effort. The parallel_evaluate() function uses parallelism levels to maximize throughput.
  • No variable mutation — Data flows along edges. There are no mutable variables, no side effects (except I/O opcodes), and no race conditions.
  • Visual debugging — DAGs have natural visual representations. The built-in visualization system renders programs as interactive node graphs with execution tracing.
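
For example, the chained-arithmetic program from Phase 0, (10 + 20) * (5 - 2) = 90, has two independent branches: `sum` and `diff` share no edges, so parallel_evaluate() can run them concurrently with no annotation from the author.

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "a", "op": "const", "value": 10},
    {"id": "b", "op": "const", "value": 20},
    {"id": "c", "op": "const", "value": 5},
    {"id": "d", "op": "const", "value": 2},
    {"id": "sum", "op": "add", "inputs": ["a", "b"]},
    {"id": "diff", "op": "sub", "inputs": ["c", "d"]},
    {"id": "prod", "op": "mul", "inputs": ["sum", "diff"]},
    {"id": "out", "op": "print", "inputs": ["prod"]}
  ]
}
```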

System Architecture

How the pieces fit together — from JSON input to executed output.

Core Pipeline

Every Lexis program passes through this security-enforced pipeline:

```
JSON Program
     |
     v
[1. Parse]    ---- JSON → LexisProgram data structure
     |
     v
[2. Validate] ---- DAG check: no cycles, refs resolve, correct arity
     |
     v
[3. Verify]   ---- Capability manifest checked against node requirements
     |
     v
[4. Schedule] ---- Topological sort determines execution order
     |
     v
[5. Execute]  ---- Runtime sandbox enforces per-node capabilities
     |
     v
[6. Hash]     ---- BLAKE3 content hash computed for every node
```

Package Structure

| Package | Purpose | Key Files |
|---|---|---|
| lexis/graph/ | Schema, serialization, validation | schema.py, serialization.py, validation.py, binary.py |
| lexis/hash/ | BLAKE3 hashing, content store | hasher.py, store.py |
| lexis/security/ | Capabilities, verifier, sandbox | capabilities.py, verifier.py, sandbox.py |
| lexis/interpreter/ | Evaluator, builtins, scheduler | evaluator.py, builtins.py, scheduler.py |
| lexis/networking/ | HTTP client, URL validation, SSRF prevention | security.py, client.py |
| lexis/cache/ | 3-layer caching system | purity.py, memo.py, disk.py, catalog.py |
| lexis/agent/ | Identity, trust, scoping, composition, audit | identity.py, trust.py, scope.py, composer.py, audit.py |
| lexis/protocol/ | Wire protocol messages, codec, sessions | messages.py, codec.py, session.py |
| lexis/transport/ | Agent communication transport layer | base.py, local.py, router.py, discovery.py |
| lexis/stdlib/ | 32 pre-built subgraphs | registry.py, guards.py, transforms.py, reducers.py |
| lexis/mcp/ | MCP server (7 tools, 3 resources, 1 prompt) | server.py, helpers.py |
| lexis/gui/ | Native GUI (tkinter backend) | types.py, backend.py, tk_backend.py, runtime.py |
| lexis/viz/ | DAG visualization & execution tracing | hooks.py, tracer.py, html_generator.py, live_server.py |
| lexis/cli/ | Command-line interface | main.py |

Core Data Types

  • GraphNode — A single node in the DAG: id, op (OpCode), inputs (list of node ID refs), value (literal or config)
  • LexisProgram — Complete program: spec version, capabilities (frozenset), nodes (tuple), subgraphs (dict), allowed_domains (tuple)
  • LexisSubgraph — Reusable function: nodes with INPUT_PORT/OUTPUT_PORT for numbered parameters
  • ErrorValue — Error-as-value: message, source_node, optional cause (for error chaining)
  • OpCode — Enum of all 93 operations
  • Capability — Enum of permissions: PURE_COMPUTE, IO_STDOUT, IO_STDIN, FS_READ, FS_WRITE, NETWORK_OUT, GUI_RENDER, META_EVAL

Opcode Reference

All 93 opcodes organized by tier, from core language to meta-programming.

Tier 1: Core Language (43 opcodes)

Everything needed for calculators, string processors, conditional logic, and basic data manipulation.

| Category | Opcodes |
|---|---|
| Literals | const, print, param |
| Arithmetic | add, sub, mul, div, mod |
| Math | floor, ceil, round, power, sqrt, random |
| Comparison | gt, lt, eq, gte, lte |
| Logic | not, and, or |
| Strings | concat, length, to_str, to_num, format, upper, lower, trim, replace, starts_with, ends_with, contains |
| Collections | sequence, index, range |
| Control Flow | select, try_or, is_error, try_catch, comment, assert |
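
As a sketch of Tier 1 conditionals, here is the SELECT(10 > 5, "yes", "no") example from Phase 1a as a complete program (node IDs are illustrative):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "x", "op": "const", "value": 10},
    {"id": "y", "op": "const", "value": 5},
    {"id": "cond", "op": "gt", "inputs": ["x", "y"]},
    {"id": "yes", "op": "const", "value": "yes"},
    {"id": "no", "op": "const", "value": "no"},
    {"id": "pick", "op": "select", "inputs": ["cond", "yes", "no"]},
    {"id": "out", "op": "print", "inputs": ["pick"]}
  ]
}
```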

Tier 2: Functions & Collections (33 opcodes)

Subgraphs (functions), higher-order operations, and advanced data manipulation.

| Category | Opcodes |
|---|---|
| Strings | split, slice |
| Dict | dict, get, keys, values, set, delete_key, items, has_key |
| Sequences | sort, reverse, merge, zip, flatten, unique, take, drop, enumerate |
| Type System | type_of |
| Utility | delay |
| Subgraphs | subgraph_def, subgraph_call, input_port, output_port |
| Higher-Order | map, filter, reduce, iterate, retry, compose, apply, match |

Tier 3: I/O & Networking (6 opcodes)

| Opcode | Inputs | Capability | Description |
|---|---|---|---|
| read_file | 1 (path) | FS_READ | Read file contents as string |
| write_file | 2 (path, content) | FS_WRITE | Write string to file, return filename |
| read_stdin | 0 | IO_STDIN | Read one line from standard input |
| http_get | 1 (URL) | NETWORK_OUT | GET request, return body (auto-JSON parse) |
| http_post | 2 (URL, body) | NETWORK_OUT | POST request, return body |
| http_request | 3 (URL, method, body) | NETWORK_OUT | Full HTTP request, return {status, headers, body} |
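
A hedged sketch of a Tier 3 fetch. The domain is a placeholder; per the security model described later, the request only runs because NETWORK_OUT is declared and the host appears in allowed_domains:

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "NETWORK_OUT", "IO_STDOUT"],
  "allowed_domains": ["api.example.com"],
  "nodes": [
    {"id": "url", "op": "const", "value": "https://api.example.com/status"},
    {"id": "resp", "op": "http_get", "inputs": ["url"]},
    {"id": "out", "op": "print", "inputs": ["resp"]}
  ]
}
```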

Tier 4: GUI (6 opcodes)

| Opcode | Inputs | Pure? | Description |
|---|---|---|---|
| gui_window | variadic | Yes | Create window descriptor with children |
| gui_widget | variadic | Yes | Declare widget (type discriminator in value) |
| gui_canvas | 1 | Yes | Declare canvas with draw commands |
| gui_draw | variadic | Yes | Create draw command (shape in value) |
| gui_state | 0-1 | Yes | Declare reactive state variable |
| gui_render | 1 | No | Render window + run event loop (GUI_RENDER capability) |

Tier 5: Meta-Programming (5 opcodes)

| Opcode | Inputs | Capability | Description |
|---|---|---|---|
| emit_node | 3 | PURE_COMPUTE | Construct a node descriptor dict at runtime |
| build_subgraph | 3 | PURE_COMPUTE | Assemble nodes into a validated, hashed subgraph |
| quote | 1 | PURE_COMPUTE | Serialize an existing subgraph to dict |
| reflect | 0 | PURE_COMPUTE | Introspect program structure |
| eval | 1 | META_EVAL | Parse, validate, and execute a dict as a Lexis program |

All 93 Opcodes

const, add, sub, mul, div, mod, gt, lt, eq, gte, lte, not, and, or, concat, length, to_str, to_num, sequence, index, slice, split, dict, get, keys, values, set, delete_key, items, type_of, has_key, select, try_or, is_error, subgraph_def, subgraph_call, input_port, output_port, reduce, iterate, print, map, filter, compose, apply, match, param, read_file, write_file, read_stdin, http_get, http_post, http_request, gui_window, gui_widget, gui_canvas, gui_draw, gui_state, gui_render, emit_node, build_subgraph, quote, reflect, eval, format, try_catch, retry, delay, range, sort, reverse, merge, upper, lower, trim, replace, starts_with, ends_with, contains, floor, ceil, round, power, sqrt, random, zip, flatten, unique, take, drop, enumerate, comment, assert

Standard Library

32 pre-built, content-addressed subgraphs referenced via std:name. Resolved at load time — the evaluator never sees stdlib references.

Guards (8 subgraphs)

Type-checking predicates for use with MATCH and FILTER.

| Name | Signature | Description |
|---|---|---|
| std:always_true | ANY → BOOL | Always returns True (catch-all guard) |
| std:always_false | ANY → BOOL | Always returns False |
| std:is_num | ANY → BOOL | True if input is a number |
| std:is_str | ANY → BOOL | True if input is a string |
| std:is_bool | ANY → BOOL | True if input is a boolean |
| std:is_dict | ANY → BOOL | True if input is a dictionary |
| std:is_seq | ANY → BOOL | True if input is a sequence |
| std:is_error | ANY → BOOL | True if input is an ErrorValue |

Transforms (17 subgraphs)

Single-input transformation functions for MAP operations.

| Name | Signature | Description |
|---|---|---|
| std:double | NUM → NUM | Multiply by 2 |
| std:negate | NUM → NUM | Multiply by -1 |
| std:identity | ANY → ANY | Return input unchanged |
| std:to_str | ANY → STR | Convert to string |
| std:to_num | STR → NUM | Convert to number |
| std:get_length | ANY → NUM | Get length |
| std:get_first | SEQ → ANY | First element |
| std:get_last | SEQ → ANY | Last element |
| std:is_positive | NUM → BOOL | True if x > 0 |
| std:is_negative | NUM → BOOL | True if x < 0 |
| std:is_zero | NUM → BOOL | True if x == 0 |
| std:is_empty | ANY → BOOL | True if length == 0 |
| std:abs | NUM → NUM | Absolute value |
| std:square | NUM → NUM | x * x |
| std:increment | NUM → NUM | x + 1 |
| std:decrement | NUM → NUM | x - 1 |
| std:not_op | BOOL → BOOL | Logical NOT |

Reducers (7 subgraphs)

Two-input fold functions for REDUCE operations.

| Name | Signature | Description |
|---|---|---|
| std:add | (NUM, NUM) → NUM | Addition |
| std:mul | (NUM, NUM) → NUM | Multiplication |
| std:sub | (NUM, NUM) → NUM | Subtraction |
| std:min | (NUM, NUM) → NUM | Smaller of two |
| std:max | (NUM, NUM) → NUM | Larger of two |
| std:concat | (STR, STR) → STR | String concatenation |
| std:and_op | (BOOL, BOOL) → BOOL | Logical AND |

Usage

Reference stdlib subgraphs via "std:name" in the value field of MAP, FILTER, REDUCE, or SUBGRAPH_DEF nodes:

{"id": "doubled", "op": "map", "inputs": ["my_list"], "value": "std:double"}

At load time, load_program() resolves "std:double" to the real BLAKE3 content hash and injects the subgraph. The evaluator never sees "std:" — it only sees hashes.
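
Putting it together, here is a sketch that doubles a sequence and sums the result using only stdlib references (assuming REDUCE takes the sequence and initial accumulator as inputs, with the reducer named in value):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "a", "op": "const", "value": 1},
    {"id": "b", "op": "const", "value": 2},
    {"id": "c", "op": "const", "value": 3},
    {"id": "nums", "op": "sequence", "inputs": ["a", "b", "c"]},
    {"id": "doubled", "op": "map", "inputs": ["nums"], "value": "std:double"},
    {"id": "zero", "op": "const", "value": 0},
    {"id": "total", "op": "reduce", "inputs": ["doubled", "zero"], "value": "std:add"},
    {"id": "out", "op": "print", "inputs": ["total"]}
  ]
}
```

This maps [1, 2, 3] to [2, 4, 6] and prints 12.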

Security Model

Closed by default. Two-layer enforcement. Three-way intersection for multi-agent.

Capabilities

| Capability | Trust Level | Operations |
|---|---|---|
| PURE_COMPUTE | 0 | All arithmetic, logic, string, collection ops |
| IO_STDOUT | 1 | print |
| IO_STDIN | 2 | read_stdin |
| FS_READ | 2 | read_file |
| FS_WRITE | 3 | write_file |
| NETWORK_OUT | 3 | http_get, http_post, http_request |
| GUI_RENDER | 3 | gui_render |
| META_EVAL | 3 | eval |

Enforcement Layers

  • Layer 1: Static Verifier — At load time, checks every node's required capabilities against the program's declared manifest. Catches violations before any code runs.
  • Layer 2: Runtime Sandbox — During execution, enforces capabilities per-node. Even if the verifier is bypassed, the sandbox blocks unauthorized operations.

Multi-Agent Security

When multiple agents collaborate, the effective capability set uses a three-way intersection:

```
effective = trust_ceiling ∩ agent_declared ∩ program_manifest
```

  • Trust ceiling — Maximum capabilities based on agent trust level (0-3)
  • Agent declared — Capabilities the agent claims to need
  • Program manifest — Capabilities declared in the program JSON

All three must agree for an operation to be allowed. This prevents trust escalation, undeclared use, and manifest bypass. For example, an agent at trust level 1 (ceiling: PURE_COMPUTE, IO_STDOUT) running a program whose manifest also declares FS_READ gets an effective set of just PURE_COMPUTE and IO_STDOUT; the FS_READ request is dropped because it exceeds the trust ceiling.

Network Security

  • Domain allowlist — Programs must declare allowed_domains. Empty list = all requests blocked.
  • SSRF prevention — DNS pre-resolution + private IP blocking (127.x, 10.x, 172.16.x, 192.168.x, ::1)
  • HTTPS only by default — HTTP requires explicit allow_http: true in node config
  • No redirects by default — When enabled, every hop re-validated against allowlist + SSRF checks
  • Size limits — 10MB response max, 30s timeout default

CLI Usage

Command-line interface for running, validating, visualizing, and benchmarking Lexis programs.

Commands

| Command | Description |
|---|---|
| lexis run <file> | Execute a Lexis program |
| lexis run <file> --param name=value | Execute with runtime parameters |
| lexis run <file> --debug | Show all node values during execution |
| lexis run <file> --cache | Enable persistent disk caching |
| lexis run <file> --viz | Execute with trace visualization in browser |
| lexis validate <file> | Validate without executing |
| lexis hash <file> | Compute BLAKE3 content hashes |
| lexis viz <file> | Static DAG visualization in browser |
| lexis stdlib | List all stdlib subgraphs |
| lexis cache {list,stats,clear,catalog} | Manage persistent cache |
| lexis mcp | Start MCP server (stdio) |
| lexis mcp --http | Start MCP server (HTTP) |
| lexis bench --model <name> | Run LLM benchmarks |
| lexis bench session start <name> | Create a benchmark session |
| lexis bench session report <id> | Generate comparison report |

Examples

```bash
# Run a program
uv run python -m lexis run examples/hello.json

# Run with parameters
uv run python -m lexis run examples/parameterized.json --param name=Tyler

# Debug mode (shows node values on stderr)
uv run python -m lexis run examples/hello.json --debug

# Pipe stdin
echo "Hello" | uv run python -m lexis run examples/echo_stdin.json

# Visualize a DAG
uv run python -m lexis viz examples/map_double.json

# Run benchmarks
uv run python -m lexis bench --model qwen2.5-coder-7b --url http://localhost:1234/v1
```

MCP Server

Model Context Protocol integration — any MCP-compatible AI agent can generate, validate, and execute Lexis programs through standard tool calls.

Tools (7)

| Tool | Input | Description |
|---|---|---|
| lexis_run | program_json, params?, stdin_lines? | Full pipeline execution → output + hashes |
| lexis_validate | program_json | Parse + validate + verify → structured result |
| lexis_generate | program_json, expected_output? | LLM-tolerant validation with auto-fix + suggestions |
| lexis_check | program_json, params?, stdin_lines? | Recommended: validate AND execute in one call |
| lexis_hash | program_json | BLAKE3 content hash for all nodes |
| lexis_stdlib_list | (none) | List all 32 stdlib subgraphs |
| lexis_stdlib_get | name | Get full subgraph JSON by name |

Resources

  • lexis://spec — Full spec document
  • lexis://examples/{name} — Example programs
  • lexis://opcodes — All opcodes with arities and descriptions

The lexis_check Tool

The recommended tool for AI models. Validates AND executes in one call with LLM-tolerance fixes applied automatically. Returns per-stage results (parse_ok, structure_ok, security_ok, execution_ok) plus fixes_applied that tells the model exactly what was auto-corrected. Prevents models from claiming success without actual execution.

Configuration

Claude Code:

```bash
claude mcp add lexis -- uv --directory "/path/to/Project Lexis" run python -m lexis mcp
```

Claude Desktop (settings JSON):

```json
{
  "mcpServers": {
    "lexis": {
      "command": "uv",
      "args": ["--directory", "/path/to/Project Lexis", "run", "python", "-m", "lexis", "mcp"]
    }
  }
}
```

Phase 0: Foundation

Project Lexis Is Alive

47 tests • Proof of concept

Summary

The very first milestone. Established the core pipeline that every Lexis program passes through: Parse → Validate → Verify Security → Schedule → Execute with Sandbox → Content-Address. Proved the concept works with basic arithmetic.

Core Pipeline

Every program goes through this security-enforced pipeline:

  1. Parse — JSON graph format → internal data structures
  2. Validate — Confirm DAG (no cycles), all references resolve, correct arity
  3. Verify Security — Every node's capability requirements checked against the manifest
  4. Schedule — Topological sort determines execution order
  5. Execute with Sandbox — Runtime capability checks before every side-effecting operation
  6. Content-Address — BLAKE3 hashes computed for every node (identity = semantics)

What's Proven

  • 5 + 3 = 8 executes through the full pipeline
  • (10 + 20) * (5 - 2) = 90 — chained operations work
  • Security enforcement catches violations before execution
  • Content-addressing works: same computation = same hash regardless of node ID
  • Cycle detection, dangling reference detection, arity checking all work
  • 47 tests, all green, 0.05 seconds

Phase 1a: Core Language

47 → 96 tests • 28 opcodes

96 tests • Strings, conditionals, errors, functions, iteration, sequences

Summary

Phase 1a is where Lexis becomes a real language. Added strings, conditionals, error-as-value system, functions (subgraphs), iteration (REDUCE), and sequences. Grew from 47 tests (Phase 0) to 96 tests.

New Capabilities

  • Strings: Concatenation, length, type conversion — "Hello" + ", " + "Lexis!" works
  • Conditionals: SELECT(10 > 5, "yes", "no") → "yes". Pure data flow, no control-flow edges. Composable with NOT, AND, OR.
  • Errors as Values: Division by zero doesn't crash — it produces an ErrorValue that flows through the graph. Downstream nodes automatically propagate it. TRY_OR catches errors and returns fallbacks. IS_ERROR lets you inspect. Error chains carry causation for debugging.
  • Functions (Subgraphs): A function is a self-contained DAG with numbered input/output ports. No parameter names. Call it by its content hash. double(21) = 42 works. Chaining works: double(double(5)) = 20. Error propagation flows through subgraph boundaries.
  • Iteration (REDUCE): Fold a subgraph over a sequence. REDUCE([1,2,3,4,5], 0, add) = 15. Works with any 2-input-1-output subgraph.
  • Sequences: Variadic SEQUENCE node collects values into ordered lists. Foundation for REDUCE and future collection operations.

Key Design Decisions

  • Errors are values — ErrorValue flows through the graph, propagated automatically by _propagating decorator on builtins. No exceptions escape to the user.
  • Functions are subgraphs — LexisSubgraph with numbered ports. Two different AI models can independently generate the same function and get the same content hash.
  • SELECT for conditionals — 3-input data-flow node. No control-flow edges in the graph — everything is data flow.
  • REDUCE for iteration — Folds a subgraph over a sequence. No loops, no mutation — just reduction.

Design Philosophy

The error-as-value system is genuinely elegant. In languages designed for humans, error handling is bolted on (try/catch, Result types, Option types). In Lexis, errors are just values that flow through edges — the same way data does. An AI debugging a Lexis program doesn't need to parse stack traces; it follows the error value through the graph to find where it originated. The causal chain is built into the ErrorValue itself.

The subgraph system is clean. Functions are just graphs with ports. No naming conventions to agree on, no parameter ordering debates beyond port indices. Two different AI models can independently generate the same function and get the same content hash. That's powerful for a multi-AI ecosystem.

Phase 1b: Benchmarks & Validation

AI Generation Quality: 16/16 (100%)

20 tests • Hypothesis validation

Summary

Hypothesis validation phase. Tested two claims: (1) Can AI generate valid Lexis programs from a spec alone? (2) How does Lexis token efficiency compare to Python? Results: 100% generation success; raw JSON costs ~7x more tokens than Python, though a semantic-only binary encoding narrows the gap to ~1.65x.

AI Generation Quality: 16/16 (100%)

| Suite | Parse | Validate | Security | Execute | Correct |
|---|---|---|---|---|---|
| Hand-written baseline | 8/8 | 8/8 | 8/8 | 8/8 | 8/8 |
| AI-generated (spec only) | 8/8 | 8/8 | 8/8 | 8/8 | 8/8 |

The AI-generated programs were written using only the 54-line spec document — no access to existing examples, no trial-and-error. Every program passed all 5 pipeline stages on the first attempt.

Token Efficiency

| Metric | Lexis/Python Ratio | Meaning |
|---|---|---|
| M1 (Raw JSON) | 6.74x | Lexis is ~7x MORE tokens as raw JSON |
| M4 (Semantic-only) | 1.65x | With a binary format, gap shrinks to ~1.6x |
| T4 Error handling | 0.92x (M4) | Lexis wins — TRY_OR is more compact than try/except |
| T3 Conditional | 1.00x (M4) | Tie — SELECT is as compact as if/else |

Phase 2a: Agent Collaboration

Identity, Trust, Scoping, Composition, Audit

~41 tests • Multi-agent security foundation

Summary

Introduced multi-agent identity, trust levels, capability scoping, graph fragment composition, and audit trails. This phase laid the security foundation for agents collaborating on shared programs.

What Was Built

  • Agent Identity — BLAKE3 cryptographic identities, node signing/verification, provenance tracking (separate from content hashing — dedup preserved)
  • Trust Levels & Scoped Security — 4-tier trust system (0-3), three-way capability intersection (trust ceiling ∩ agent declared ∩ program manifest), per-agent sandbox enforcement
  • Graph Composition — Safe multi-agent fragment merging with content-hash dedup, open input resolution, provenance preservation (no capability laundering)
  • Audit Trail — Append-only, BLAKE3-hashed log tracking every node execution by agent

Key Design Decisions

  • Provenance separate from content hash — Two agents writing CONST(5) get the same content hash. Authorship is tracked in Provenance metadata, not in the hash.
  • Trust levels 0-3 — Simple integer mapping to capability ceiling. Deliberately coarse-grained. Finer permissions come from the 3-way intersection.
  • Three-way capability intersection — All three must agree for an operation to be allowed. Prevents: trust escalation, undeclared use, manifest bypass.
  • No capability laundering — When fragments are composed, nodes keep their original agent's provenance.

Phase 2b: Binary Format & Wire Protocol

MessagePack ~5x smaller than JSON

~22 tests • MessagePack, wire protocol, fragment store

Summary

Added MessagePack binary serialization for compact program encoding (~5x smaller than JSON), a wire protocol for agent communication with signed messages and tamper detection, and content-addressed fragment storage.

What Was Built

  • MessagePack Serialization — Opcodes as integers (~5x smaller than JSON), full roundtrip fidelity for programs, subgraphs, agents, provenance
  • Wire Protocol — 11 message types for fragment exchange, composition proposals, hash verification. Signed messages with tamper detection
  • Fragment Store — Thread-safe content-addressed storage for subgraphs and fragments with dedup reporting

Phase 2c: MAP, FILTER, Parallel

184 tests • 31 opcodes

~25 tests • Higher-order ops + parallel execution

Summary

Added MAP, FILTER, COMPOSE opcodes and parallel execution. The language now supports multi-agent collaboration end-to-end.

What Was Built

  • MAP — Apply subgraph to each sequence element (short-circuits on error)
  • FILTER — Keep elements where predicate subgraph returns truthy
  • COMPOSE — Create new subgraph B(A(x)) via node prefixing and port rewiring — content-addressable
  • Parallel Execution — parallel_evaluate() using ThreadPoolExecutor with parallelism_levels() grouping, deterministic PRINT ordering, thread-safe stores

Key Design Decisions

  • MAP/FILTER as opcodes — Not library functions. Making them opcodes communicates parallelism intent to the scheduler.
  • Parallel execution — parallelism_levels() + ThreadPoolExecutor. Deterministic PRINT ordering preserved via output buffering per level.

Phase 3a: Hardening

282 tests total • Pure quality & robustness

~48 tests • CLI, adversarial security, protocol, error paths

Summary

First hardening pass. Added CLI, scheduler edge case tests, adversarial security tests, protocol edge case tests, and error path coverage. No new features — pure quality and robustness.

What Was Built

  • CLI entry point for python -m lexis. Commands: run, validate, hash
  • Adversarial security tests (12 tests) — Trust escalation, provenance forgery, capability laundering, sandbox bypass attempts
  • Protocol edge cases (8 tests) — Boundary conditions in message encoding/decoding
  • Error path hardening (8 tests) — REDUCE/MAP/FILTER edge cases, subgraph errors

Phase 3b: Data Access Opcodes

38 opcodes • 10 examples

~50 tests • INDEX, SLICE, SPLIT, DICT, GET, KEYS, VALUES

Summary

Added 7 opcodes for working with sequences and dictionaries. The language can now parse structured strings, index/slice into sequences and strings, and work with key-value dictionaries — unlocking real data processing programs.

Opcodes Added

  • INDEX — Access element by position. Polymorphic: works on both sequences and strings.
  • SLICE — Extract sub-range. Polymorphic: works on both sequences and strings.
  • SPLIT — Split string by delimiter. Empty delimiter = character-level split.
  • DICT — Create dictionary from alternating key-value pairs.
  • GET — Retrieve value from dictionary by key.
  • KEYS — Get all keys from a dictionary as a sequence.
  • VALUES — Get all values from a dictionary as a sequence.
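
A sketch of these opcodes parsing a structured string (assuming SPLIT takes the string and delimiter as its two inputs):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "csv", "op": "const", "value": "a,b,c"},
    {"id": "comma", "op": "const", "value": ","},
    {"id": "parts", "op": "split", "inputs": ["csv", "comma"]},
    {"id": "one", "op": "const", "value": 1},
    {"id": "second", "op": "index", "inputs": ["parts", "one"]},
    {"id": "out", "op": "print", "inputs": ["second"]}
  ]
}
```

This splits "a,b,c" into a sequence and prints b.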

Phase 4: Gap-Closing

307 tests • 40 opcodes

~25 tests • TO_NUM, APPLY, higher-order in subgraphs

Summary

Closed three functional gaps identified by audit. The foundation is genuinely complete after this phase — no known inconsistencies, no known blockers for the existing feature set.

What Was Fixed

  • TO_NUM — String-to-number conversion. Parses strings to int/float, bools to 0/1, numbers pass through.
  • APPLY — Dynamic subgraph dispatch. The hash comes from an input edge, making the target dynamic. This completes COMPOSE: COMPOSE returns a runtime hash → APPLY invokes it.
  • All ops work inside subgraphs — MAP/FILTER/REDUCE/COMPOSE/APPLY inside a subgraph now work correctly.

Phase 5: Type Guards & Pattern Matching

346 tests • 43 opcodes

~37 tests • TYPE_OF, HAS_KEY, MATCH

Summary

Added type introspection (TYPE_OF, HAS_KEY) and the MATCH opcode for pattern matching with guards and handlers. Completed the control-flow story: SELECT for binary conditions, MATCH for multi-way dispatch.

Key Design Decisions

  • TYPE_OF does NOT propagate errors — Returns "ERROR" string for ErrorValue inputs. This is introspection, not computation.
  • HAS_KEY does NOT propagate errors — Returns False for non-dict/ErrorValue inputs. Safe guard — never errors itself.
  • MATCH is lazy — Only evaluates the matching handler. Guards checked in order, first truthy wins.
  • Guards and handlers are subgraphs — Referenced via SUBGRAPH_DEF hash. Consistent with APPLY pattern.

Why This Matters for Weaker Models

MATCH shifts cognitive load to runtime. Weaker models can declare patterns without implementing dispatch logic. A heterogeneous list [10, "20", 30, "40"] can be type-dispatched through MATCH with just two guard-handler pairs — previously requiring nested SELECT chains.

Phase 6: Standard Library

423 tests • 18 subgraphs • ~75% token savings

~75 tests • Guards, transforms, reducers

Summary

Created a standard library of 18 pre-built, content-addressed subgraphs that programs can reference by name instead of defining inline. The stdlib is resolved at load time — the evaluator is completely untouched.

Token Impact

| Program | Before (inline) | After (stdlib) | Savings |
|---|---|---|---|
| stdlib_showcase.json | ~80 lines | 10 lines | ~75% |
| stdlib_reduce.json | ~25 lines | 7 lines | ~72% |
| Type-dispatch MATCH | ~2,300 tokens | ~1,200 tokens | ~48% |

An AI model that used to emit 5-node subgraph definitions (~400 tokens each) now emits a 15-character string like "std:always_true".

Key Design Decisions

  • Stdlib is compile-time expansion — "std:name" resolved to actual content hash at load time. Evaluator unchanged.
  • Content-addressed dedup — Stdlib subgraphs get real BLAKE3 hashes. Two programs using std:double share the same hash.
  • "std:" prefix convention — Only values starting with "std:" trigger resolution. No collision with user subgraph names.

Phase 7: LLM Generation Benchmarks

478 tests • 15 tasks across 3 tiers

55 tests • Benchmark harness + CLI command

Summary

Built a benchmark harness to test whether AI models can generate valid Lexis programs from the spec + natural language task descriptions. 15 tasks across 3 tiers. CLI command lexis bench. Works with any OpenAI-compatible API.

Key Design Decisions

  • OpenAI-compatible API — Works with LM Studio, Ollama, or any compatible endpoint.
  • Robust JSON extraction — Handles code fences, prose wrapping, raw JSON, multiple objects.
  • Full pipeline validation — Parse → validate → verify → execute → check output.
  • No new dependencies — Uses urllib.request (stdlib) for API calls.

Phase 7b: Spec Hardening

492 tests • 3 rounds of refinement

Tested 5 models • 3 spec refinement rounds

Summary

Ran the benchmark suite against 5 local models on an NVIDIA 4090 GPU. Performed targeted spec hardening based on error analysis across 3 rounds.

Results

| Model | Size | Score | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 14B | 14B | 11/15 (73%) | 5/5 | 3/5 | 3/5 |
| GLM 4.6v Flash | ~9B | 9/15 (60%) | 5/5 | 2/5 | 2/5 |
| Qwen 2.5 Coder 7B | 7B | 9/15 (60%) | 3/5 | 3/5 | 3/5 |
| GLM 4.7 Flash | 30B | 7/15 (47%) | 2/5 | 3/5 | 2/5 |
| Qwen 2.5 Coder 32B | 32B | 5/15 (33%)* | 1/5 | 2/5 | 2/5 |

*32B model hampered by EXTRACT_FAIL (quantization/VRAM pressure issues, not a capability problem).

Key Insight

Models learn flat graphs and stdlib quickly. The capability wall is multi-level nesting: main graph → subgraph definition → port wiring → higher-order call. This is a working memory limitation in small models, not a spec clarity issue. The spec refinement hit diminishing returns after 3 rounds.

Phase 8: Params, File I/O & Stdin

582 tests • 47 opcodes • 19 examples

Phase 8b: PARAM (25 tests) • Phase 8a: File I/O (42 tests) • Phase 8c: Stdin (22 tests)

Phase 8b: PARAM Opcode

Added runtime parameter injection. Parameters are always strings, injected via CLI --param name=value. Pure operation — no capability required. Missing param → ErrorValue, so use TRY_OR for defaults.
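
A sketch of the default-value pattern (assuming PARAM carries the parameter name in its value field):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "who", "op": "param", "value": "name"},
    {"id": "world", "op": "const", "value": "world"},
    {"id": "name", "op": "try_or", "inputs": ["who", "world"]},
    {"id": "prefix", "op": "const", "value": "Hello, "},
    {"id": "msg", "op": "concat", "inputs": ["prefix", "name"]},
    {"id": "out", "op": "print", "inputs": ["msg"]}
  ]
}
```

Run with --param name=Tyler to print Hello, Tyler; omit the flag and the missing param becomes an ErrorValue, so TRY_OR falls back to Hello, world.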

Phase 8a: File I/O (READ_FILE, WRITE_FILE)

Added file system operations with a verifier bug fix: both verifier and sandbox now read from the same OP_REQUIRED_CAPABILITIES dict — single source of truth. FS_READ at trust level 2, FS_WRITE at trust level 3. Write-then-read chaining works naturally: WRITE_FILE returns the filename.

Phase 8c: Standard Input (READ_STDIN)

Added READ_STDIN with the stdin_reader callable pattern for testability. EOF → ErrorValue, TRY_OR provides graceful fallback. No CLI changes needed — just pipe: echo "hi" | lexis run prog.json.

Phase 8 Complete: Lexis can now read/write files, accept parameters, and read from stdin. It has graduated from a calculator to a real data processor.

Phase 9: AI-Native Networking

HTTP_GET, HTTP_POST, HTTP_REQUEST • Security-first design

3 HTTP opcodes • Zero new dependencies

Summary

Added 3 HTTP opcodes with a security-first design: closed-by-default domain allowlist, SSRF prevention, HTTPS-only by default, no redirects by default, response size limits, and timeout enforcement.

Key Design Decisions

  • 3 AI-native HTTP opcodes — Three tiers of complexity. GET = 1 input (URL). POST = 2 inputs (URL, body). REQUEST = 3 inputs (URL, method, body). LLMs pick the simplest tier.
  • Auto-JSON parse/serialize — Responses with JSON Content-Type auto-parsed. POST auto-serializes dict bodies.
  • Closed-by-default domain allowlist — Must explicitly declare every domain. Empty = all blocked.
  • SSRF prevention — DNS pre-resolution + private IP blocking before any connection.
  • Zero new dependencies — Uses Python stdlib: urllib, ipaddress, socket, json, ssl.

Phase 10: Persistence & Caching

3-layer caching system • Cache invalidation is a non-problem

L1: Runtime memo • L2: Disk cache • L3: LLM catalog

Summary

Added a 3-layer caching system: L1 runtime memo (always on, within-run), L2 persistent disk cache (opt-in, cross-run), L3 LLM-aware catalog (for token savings in prompts). Content-addressing makes cache invalidation a non-problem.

Key Design Decisions

  • Purity as foundation — Only pure subgraphs (no I/O ops transitively) are cached.
  • Cache invalidation is a non-problem — Content-addressing guarantees: same inputs + same function hash = same result. The hash IS the cache key. No staleness possible.
  • Bool vs int distinction — True and 1 hash differently. Semantically different values must have different hashes.
  • Layer 1 always on — Zero cost when no subgraph calls occur. Thread-safe via Lock.
  • Layer 2 opt-in — DiskCache enabled via --cache CLI flag. Sharded storage with LRU eviction.

Phase 11: Multi-Agent Runtime

156 + 51 tests • Transport, event loop, discovery, negotiation

4 implementation steps + hardening pass (51 adversarial/stress tests)

Summary

Transformed Lexis from a single-interpreter language into a multi-agent collaborative system. Added transport layer, event loop, enhanced provenance with Hybrid Logical Clocks, content-aware routing, agent discovery, and capability negotiation.

Architecture

  • Transport Layer — Abstract interface for agent communication. LocalTransport with thread-safe queues. ContentRouter with IPFS-style Want/Have protocol.
  • Agent Event Loop — State machine with 6 states (IDLE, OFFERING, REQUESTING, COMPOSING, EXECUTING, STOPPED). 19 message handlers.
  • Enhanced Provenance — Hybrid Logical Clocks combining physical time + logical counter + agent_id. ProvenanceChain for lineage tracking.
  • Discovery & Negotiation — TTL-based agent announcements. Capability negotiation with state machine (PROPOSED→ACCEPTED|REJECTED|COUNTERED|EXPIRED).

Hardening (Phase 11b)

51 new tests: 18 integration, 21 adversarial, 12 stress. Tested trust escalation, provenance forgery, transport attacks, capability laundering, concurrent messaging, HLC causality. No production bugs found — all Phase 11 code held up.

Phase 12: Benchmark Expansion

22 tasks • 4 tiers • 996 tests

21 new tests • 7 new Tier 4 tasks • Spec 43 → 50 opcodes

Summary

Updated the LLM benchmark suite to cover all features added in Phases 8-11. Updated the spec from 43 to 50 opcodes. Added 7 new tasks in Tier 4 (I/O & Capabilities).

Key Finding: 7B Beats 14B

The Qwen 2.5 Coder 7B model scored 91% vs 82% for the 14B — consistently across two runs each. For DAG-structured JSON output, the smaller model's more constrained generation appears to be an advantage. Less "creativity" means fewer wrong answers.

Phase 13: MCP Server

1056 tests • 6 tools, 3 resources, 1 prompt

60 new tests • FastMCP + stdio/HTTP

Summary

Built a Model Context Protocol (MCP) server so any MCP-compatible AI agent — Claude Code, Cursor, Claude Desktop, Windsurf — can generate, validate, execute, and compose Lexis programs through standard tool calls. This is the fastest path to adoption: turns Lexis from "a language you learn" into "a tool you call."

Design Decisions

  • lexis_generate reuses validate_generated_program — The benchmark validation function already handles everything: auto-infer capabilities, auto-flatten inline objects, staged error classification.
  • Structured JSON returns — LLMs need machine-parseable responses to self-correct in agentic loops.
  • Suggestions field — Maps each error class to specific fix instructions. The self-correction loop that makes agentic workflows work.

Phase 14: ITERATE Opcode

Bounded iteration with guaranteed termination

17 tests • 1 new opcode

Summary

Added bounded iteration. MAP/FILTER/REDUCE handle collection processing, but there was no way to express "repeat until done" — retry patterns, numeric convergence, iterative refinement. ITERATE fills this gap with guaranteed termination via max_steps.

Design

  • Inputs: 2 — (initial_value, max_steps)
  • Step subgraph: 1 input → 1 output ([new_value, should_stop])
  • Follows REDUCE pattern — LLMs that can generate REDUCE programs can immediately generate ITERATE programs.
  • max_steps as computed value — Allows dynamic iteration limits based on input data.

Phase 15: Enhanced Error Reporting

1111 tests • Actionable error diagnostics

38 new tests • Error classification + "Did you mean?" + --debug

Summary

Made Lexis errors actionable enough that local AI models (7B-14B) can self-correct instead of getting stuck in retry loops.

What Changed

  • classify_error() — Pattern-matches exceptions to error classes (PARSE_FAIL, OPCODE_FAIL, ARITY_FAIL, REF_FAIL, CYCLE_FAIL, SECURITY_FAIL, RUNTIME_FAIL)
  • "Did you mean?" — Uses difflib.get_close_matches() for op typos. "conts""Did you mean: const, concat?"
  • --debug flag — Prints all node values in topological order to stderr
  • CLI error suggestions — Every error handler now attaches actionable fix suggestions

Phase 16: Tiered Spec

Reorganized spec document into 3 tiers

No code changes • Spec reorganization for small models

Motivation

Local LLMs (7B-14B) kept reaching for advanced features (subgraphs, match) when simple tasks only needed basics. The spec presented all 50 opcodes flat — the model couldn't distinguish simple from complex.

Solution

  • Tier 1: Core Language (25 opcodes) — Self-contained: a model reading only Tier 1 can build calculators, string processors, conditional logic
  • Tier 2: Functions & Collections (19 opcodes) — Subgraphs, higher-order ops, collections
  • Tier 3: I/O & Networking (6 opcodes) — File, stdin, HTTP
  • Tier directive at top: "Start with Tier 1. Only use Tier 2 if Tier 1 cannot solve the task."

Phase 17: lexis_check Tool

1122 tests • Validate + execute in one call

11 new tests • 7th MCP tool

Motivation

During local model testing, we discovered that models call lexis_validate and lexis_run separately — and some skip lexis_run entirely, declaring "Task Completed" without verifying output. A model did this with a broken calculator that would have failed at runtime.

Solution

lexis_check is a single MCP tool that validates, runs, and verifies a program in one call. Models can't claim success without actual execution. Returns per-stage results (parse_ok, structure_ok, security_ok, execution_ok) plus fixes_applied that tells models what was auto-corrected.

Phase 18: Native GUI

6 opcodes • 10 widget types • Tkinter backend

87 new tests • Declarative scene-graph

Summary

Added 6 GUI opcodes enabling Lexis programs to create native windowed applications with interactive widgets, event handling, and canvas drawing. Backend is tkinter (zero extra dependencies).

Design Philosophy

  • Declarative scene-graph — The DAG describes UI as data structures, not imperative API calls
  • 6 opcodes, not 20+ — GUI_WIDGET uses a type discriminator; GUI_DRAW uses a shape discriminator. Adding new widget/shape types requires zero opcode changes
  • 5 pure + 1 impure — Only GUI_RENDER is side-effecting. The other 5 build descriptor dicts — pure, cacheable, content-addressable
  • Callback subgraphs — Event handlers are subgraphs invoked with state dict → return new state dict

Widget Types

label, button, text_input, checkbox, dropdown, slider, vbox, hbox, grid, frame

Example Programs

gui_hello.json (static window), gui_counter.json (buttons + state), gui_canvas_drawing.json (shapes), gui_calculator.json (digit buttons, +, =, C)

Phase 19: DAG Visualization

3 modes: static, trace, live • Cytoscape.js

68 new tests • Browser-based visualization

Summary

Added browser-based DAG visualization system with 3 modes: static (structure view), trace (step-through playback), and live (real-time GUI program tracing). Uses Cytoscape.js + dagre layout (CDN, zero build step).

Features

  • 10 opcode color categories (Literal, Arithmetic, Comparison, Logic, String, Collection, Control Flow, Subgraph, I/O, GUI)
  • Collapsible subgraph compound nodes
  • Click-to-inspect sidebar (node ID, op, value, result, hash)
  • Trace playback controls (step, play/pause, speed slider, reset)
  • Live mode with pulsing indicator and real-time node highlighting
  • Dark theme inspired by VS Code

Phase 20: Meta-Programming

5 opcodes • Self-bootstrapping foundation

47 new tests • EMIT_NODE, BUILD_SUBGRAPH, QUOTE, REFLECT, EVAL

Summary

Added 5 meta-programming opcodes that enable Lexis programs to construct, inspect, and execute graph fragments at runtime. This is the foundation for self-hosting: AI agents using Lexis programs to generate, validate, and compose other Lexis programs.

EVAL Security Design

  • Capability ceiling: inner caps = (declared ∩ parent caps) − {META_EVAL}
  • No privilege escalation: inner code cannot use capabilities the parent lacks
  • No recursive eval: META_EVAL stripped from ceiling prevents eval-of-eval chains
  • Recursion depth limit: MAX_EVAL_DEPTH = 3

Phase 21: Production Patterns

FORMAT, TRY_CATCH, RETRY • 64 opcodes

37 new tests • 3 opcodes closing practical gaps

Summary

Added 3 production-pattern opcodes that close the gap between "Lexis can do it in theory" and "Lexis handles it cleanly in practice."

Opcodes

  • FORMAT — String interpolation: FORMAT("Hello {}, count: {}", name, num). Eliminates 60-70% of nodes in message-building patterns.
  • TRY_CATCH — Unwrap value/error into inspectable dict. Always returns a dict — never propagates. Enables error inspection that was previously impossible.
  • RETRY — Bounded retry of subgraph up to N times until success. Makes API orchestration practical (HTTP 429/503 recovery).
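
A sketch of the FORMAT pattern (assuming the template string arrives as the first input):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "tmpl", "op": "const", "value": "Hello {}, count: {}"},
    {"id": "name", "op": "const", "value": "Lexis"},
    {"id": "count", "op": "const", "value": 3},
    {"id": "msg", "op": "format", "inputs": ["tmpl", "name", "count"]},
    {"id": "out", "op": "print", "inputs": ["msg"]}
  ]
}
```

This prints Hello Lexis, count: 3 with a single FORMAT node instead of a chain of CONCAT and TO_STR nodes.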

Impact

Together, these three opcodes make Lexis production-ready for AI agent tool chains, regulated computation pipelines, and API orchestration.

Phase 22: Utility Opcodes

DELAY, RANGE, SORT, REVERSE, MERGE • 69 opcodes

50 new tests • 5 bread-and-butter operations

Summary

Added 5 utility opcodes that fill the most common practical gaps: generating number sequences, sorting/reversing data, combining dicts, and pausing for retry backoff.

Design Highlights

  • DELAY — Pass-through semantics. Returns input value, enables chaining. Max 60 seconds.
  • RANGE — Variable arity (2-3). Auto-detects direction. 10,000 element limit.
  • SORT — Type-homogeneous only. Mixed types return ErrorValue.
  • MERGE — Dict union. Second dict wins on conflicts.
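
A sketch combining two of these (assuming RANGE takes start and stop as inputs and auto-detects the descending direction):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "start", "op": "const", "value": 5},
    {"id": "stop", "op": "const", "value": 0},
    {"id": "countdown", "op": "range", "inputs": ["start", "stop"]},
    {"id": "ascending", "op": "sort", "inputs": ["countdown"]},
    {"id": "out", "op": "print", "inputs": ["ascending"]}
  ]
}
```

RANGE emits the descending sequence; SORT returns it in ascending order.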

Implementation Pattern

All 5 opcodes are simple builtins — no evaluator.py changes needed. They're dispatched via BUILTIN_OPS[node.op](*input_values) automatically. The cleanest possible pattern for new opcodes.

Phase 23: Expanded Stdlib

18 → 32 subgraphs • +14 new

50 new tests • 9 transforms + 5 reducers

Summary

Expanded the stdlib from 18 to 32 subgraphs. All 14 new subgraphs are composed from existing opcodes — validating the composability of the core opcode set.

New Transforms (9)

abs, square, increment, decrement, get_last, not_op, is_negative, is_zero, is_empty

New Reducers (5)

sub, min, max, concat, and_op

Key Decision

No new opcodes needed. All 14 subgraphs are built from existing opcodes (ADD, SUB, MUL, GT, LT, EQ, SELECT, NOT, LENGTH, INDEX, CONCAT, AND). No registry changes — auto-discovery from the subgraph dictionaries.

Phase 24: String Operations

UPPER, LOWER, TRIM, REPLACE, STARTS_WITH, ENDS_WITH, CONTAINS • 76 opcodes

58 new tests • 7 native string opcodes

Summary

Added 7 native string manipulation opcodes. These fill the most critical gap AI models face when generating text-processing programs.

Design Decisions

  • All pure, no capabilities — String operations have no side effects.
  • Auto-coercion via str() — Matches existing CONCAT/SPLIT pattern. Pragmatic for AI models.
  • REPLACE replaces ALL occurrences — What users expect.
  • Placed in Tier 1 — Fundamental string operations, unlike SPLIT which produces a collection.

Phase 25: Math Operations

FLOOR, CEIL, ROUND, POWER, SQRT, RANDOM • 82 opcodes

60 new tests • 6 math opcodes

Summary

Added 6 math opcodes filling the gap between basic arithmetic and what models need for calculators, converters, scientific computation, and games.

Design Decisions

  • Skipped native ABS — stdlib already has std:abs (Phase 23).
  • RANDOM is impure — In IO_OPS (not cached) but requires no capability. Generating random numbers isn't dangerous.
  • ROUND variable arity (1-2) — round(3.5) → 4, round(3.456, 2) → 3.46.
  • Strict type checking — Math ops reject bools and non-numbers with ErrorValue.

Phase 26: Sequence Operations

ZIP, FLATTEN, UNIQUE, TAKE, DROP, ENUMERATE • 88 opcodes

57 new tests • 6 sequence manipulation opcodes

Summary

Added 6 sequence manipulation opcodes filling the gap between basic sequence creation and higher-order ops.

Design Decisions

  • ZIP truncates to shorter — Follows Python semantics. No padding, no error.
  • FLATTEN is one level only — Safe, predictable, covers 95% of use cases.
  • UNIQUE handles unhashable types — Uses repr() fallback. Preserves first-occurrence order.
  • TAKE/DROP clamp to bounds — No error on oversized count.
  • ENUMERATE starts at 0 — Always. Keeping it simple.
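
A sketch of ENUMERATE (assuming it emits [index, element] pairs as 2-element lists, consistent with the ITEMS convention in Phase 28):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "a", "op": "const", "value": "x"},
    {"id": "b", "op": "const", "value": "y"},
    {"id": "letters", "op": "sequence", "inputs": ["a", "b"]},
    {"id": "tagged", "op": "enumerate", "inputs": ["letters"]},
    {"id": "out", "op": "print", "inputs": ["tagged"]}
  ]
}
```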

Phase 27: Benchmark Refresh

22 → 30 tasks • 1652 tests

16 new tests • 8 new tasks covering Phases 21-26

Summary

Added 8 new benchmark tasks (t23-t30) covering opcodes from Phases 21-26: string ops, math ops, format, and sequence ops. All 30 baselines pass the full validation pipeline.

New Tasks

| ID | Name | Opcodes Tested |
|---|---|---|
| t23 | string_normalize | trim, lower, eq, select |
| t24 | text_transform | upper, replace, contains |
| t25 | pythagorean | power, sqrt, add |
| t26 | rounding | floor, ceil, round |
| t27 | format_string | format |
| t28 | sequence_pipeline | unique, sort, take |
| t29 | zip_enumerate | zip, enumerate, map |
| t30 | flatten_reduce | flatten, reduce |

Phase 28: Dict Operations

SET, DELETE_KEY, ITEMS • 91 opcodes

38 new tests • Completes dict CRUD API

Summary

Added 3 dict mutation opcodes. All operations are pure (immutable) — they return NEW dicts, consistent with Lexis's functional design. This completes the dict API: create (dict), read (get), update (set), delete (delete_key).

Design Decisions

  • All pure (PURE_OPS) — Immutable operations that return new dicts.
  • DELETE_KEY is a no-op on missing keys — Returns dict unchanged rather than erroring.
  • ITEMS returns [[key, value], ...] — Uses 2-element lists consistent with Lexis collections. Enables dict↔sequence pipelines with MAP/ZIP.
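
A sketch of the full CRUD cycle (assuming SET takes dict, key, value as its three inputs):

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "k1", "op": "const", "value": "lang"},
    {"id": "v1", "op": "const", "value": "lexis"},
    {"id": "d", "op": "dict", "inputs": ["k1", "v1"]},
    {"id": "k2", "op": "const", "value": "tier"},
    {"id": "v2", "op": "const", "value": 1},
    {"id": "d2", "op": "set", "inputs": ["d", "k2", "v2"]},
    {"id": "d3", "op": "delete_key", "inputs": ["d2", "k2"]},
    {"id": "val", "op": "get", "inputs": ["d3", "k1"]},
    {"id": "out", "op": "print", "inputs": ["val"]}
  ]
}
```

Each step returns a new dict; d, d2, and d3 are distinct immutable values, and the program prints lexis.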

Phase 29: Developer Tooling

VS Code extension, enhanced debugging, session logger

3 sub-phases • VS Code, debugging, benchmark sessions

Phase 29a: VS Code Syntax Highlighting

Created a VS Code extension with TextMate grammar for Lexis JSON programs. Highlights all 93 opcodes, string values, numbers, booleans, node IDs, capabilities, stdlib references, and structural JSON keys. Packaged as a VSIX file for installation.

Phase 29b: Enhanced Debugging

Added structural warnings, program summary, execution snapshot, opcode hints, call stack tracking, output diff, and contextual suggestions. Designed to help both AI models and human developers understand what went wrong and how to fix it.

Phase 29c: Session Logger

Built benchmark session infrastructure for cross-model comparison:

  • sessions.py — Session CRUD, import existing results
  • analysis.py — Pipeline inference, opcode extraction, failure patterns, cross-model comparison
  • reports.py — Markdown report generation (scoreboard, strengths, hardest tasks, failure patterns)
  • session_cli.py — CLI: start, list, show, report, import

Workflow: bench session start "Name" → bench -m model --session ID → bench session report ID

Phase 30: COMMENT & ASSERT

Developer tools • 93 opcodes • 1823 tests

18 new tests • 2 new opcodes + session logger fixes

New Opcodes

  • COMMENT (1 input) — No-op pass-through node. Returns input unchanged. value field holds a label string for documentation. Acts as inline documentation in the data flow graph. Does NOT propagate errors — passes them through silently.
  • ASSERT (2 inputs) — Runtime assertion. Input 1: condition (truthy/falsy). Input 2: value to pass through. If condition is truthy, returns value unchanged. If falsy, returns ErrorValue with assertion failure message. Uses @_propagating: error inputs propagate before assertion check. Recoverable with TRY_OR.
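
A sketch using both, with arities as described above:

```json
{
  "spec": "lexis-0.1.0",
  "capabilities": ["PURE_COMPUTE", "IO_STDOUT"],
  "nodes": [
    {"id": "x", "op": "const", "value": 8},
    {"id": "zero", "op": "const", "value": 0},
    {"id": "positive", "op": "gt", "inputs": ["x", "zero"]},
    {"id": "checked", "op": "assert", "inputs": ["positive", "x"]},
    {"id": "note", "op": "comment", "inputs": ["checked"], "value": "x validated as positive"},
    {"id": "out", "op": "print", "inputs": ["note"]}
  ]
}
```

If the condition were falsy, checked would become an ErrorValue, recoverable downstream with TRY_OR.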

Design Decisions

  • COMMENT skips @_propagating — Same as TYPE_OF, HAS_KEY. Annotation should never alter semantics.
  • ASSERT uses @_propagating — If inputs are already errors, they should propagate rather than masking as "assertion passed."
  • Both are PURE_OPS — No I/O capabilities needed.

Session Logger Fixes

  • Pipeline Breakdown: compare_models() now calls analyze_run() per run and includes pipeline_rates. Reports show real Parse/Validate/Security/Execute/Correct percentages instead of dashes.
  • Model Deduplication: When multiple runs exist for the same model, keeps the best-scoring one.

Test Results

18 new tests for COMMENT + ASSERT. 2 new tests for pipeline_rates + deduplication. Updated 3 opcode count assertions (91 → 93). Total: 1823 tests passing.