Welcome to Lexis
The first programming language built entirely for AI. Code is DAGs, not text. Content-addressed via BLAKE3. Security-first with capability manifests enforced at parse time and runtime.
What is Lexis?
Lexis is an AI-native programming language where programs are expressed as directed acyclic graphs (DAGs) in JSON format. Unlike traditional text-based languages that require parsing ambiguous syntax, Lexis programs are structured data — JSON objects with explicit nodes, edges, and operation codes.
Every node in a Lexis program is content-addressed using BLAKE3 cryptographic hashing. This means that two independently written programs that perform the same computation produce the same hash — enabling automatic deduplication, tamper detection, and caching without any coordination.
Lexis enforces security at two layers: a static verifier checks capability declarations before execution, and a runtime sandbox enforces per-node permissions during execution. Programs must explicitly declare what they intend to do (print output, read files, access the network), and any undeclared operation is blocked.
Who is Lexis For?
- AI Models — LLMs generate structured JSON programs that pass through a validation pipeline with classified error diagnostics and self-correction suggestions.
- Multi-Agent Systems — Multiple AI agents can independently generate code fragments, compose them via content-hash deduplication, and execute with per-agent trust enforcement.
- Regulated Computation — Content-addressing provides provenance and audit trails. Capability manifests ensure programs only access declared resources.
- Tool Orchestration — MCP server integration lets any MCP-compatible AI agent generate, validate, and execute Lexis programs through standard tool calls.
Quick Example
Here is a minimal Lexis program that computes (5 + 3) = 8 and prints the result:
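A sketch of the shape such a program takes, built as a Python dict for illustration. The node fields (id, op, inputs, value) follow the Core Data Types section below, but the exact top-level schema shown here is an assumption, not the authoritative format:

```python
import json

# Hypothetical Lexis program: (5 + 3) printed to stdout.
# Field names are assumed from the Core Data Types section; this is a
# sketch of the JSON shape, not the official schema.
program = {
    "capabilities": ["IO_STDOUT"],  # PRINT needs stdout permission
    "nodes": [
        {"id": "a",   "op": "const", "value": 5},
        {"id": "b",   "op": "const", "value": 3},
        {"id": "sum", "op": "add",   "inputs": ["a", "b"]},
        {"id": "out", "op": "print", "inputs": ["sum"]},
    ],
}
print(json.dumps(program, indent=2))
```

There is nothing to parse in the traditional sense: the program is already structured data, so validation is purely structural (references resolve, arities match, no cycles).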
Every program flows through the pipeline: Parse → Validate → Verify Security → Schedule → Execute → Content-Address.
Design Philosophy
The core ideas that make Lexis fundamentally different from every other programming language.
Code is Data (DAGs, not text)
Programs are JSON-encoded directed acyclic graphs. There is no parser ambiguity, no syntax errors in the traditional sense — only structural validation. AI models work with data structures, not text manipulation.
Content-Addressed Identity
Every node gets a BLAKE3 hash based on its operation, value, and inputs. Same computation = same hash, regardless of who wrote it or when. This enables automatic deduplication, caching, and tamper detection.
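The per-node scheme can be sketched as follows. The stdlib's blake2b stands in for BLAKE3 (which Python's standard library lacks), and the canonical encoding is an assumption; the property being demonstrated is the one described above:

```python
import hashlib
import json

def node_hash(op, value, input_hashes):
    # Canonical encoding of (op, value, input hashes). blake2b stands in
    # for BLAKE3 here; the encoding details are illustrative.
    payload = json.dumps([op, value, input_hashes], sort_keys=True)
    return hashlib.blake2b(payload.encode(), digest_size=16).hexdigest()

# Two agents independently writing CONST(5) get the same identity...
h5a = node_hash("const", 5, [])
h5b = node_hash("const", 5, [])

# ...and a node's hash covers its whole upstream computation, because
# inputs are referenced by hash, not by node ID.
h_add = node_hash("add", None, [h5a, node_hash("const", 3, [])])
```

Because the hash is derived from semantics (operation, value, input hashes) rather than from node IDs or authorship, dedup and caching need no coordination between writers.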
Errors Are Values
Division by zero doesn't crash — it produces an ErrorValue that flows through the graph. Downstream nodes automatically propagate it. TRY_OR catches errors. IS_ERROR inspects them. Error chains carry causation for debugging.
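A minimal sketch of the idea, assuming a propagation wrapper similar in spirit to the `_propagating` decorator mentioned in the Phase 1a notes below (all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ErrorValue:
    message: str
    source_node: str
    cause: Optional["ErrorValue"] = None  # causal chain for debugging

def propagating(fn):
    # If any input is already an ErrorValue, pass it through unchanged
    # instead of computing — a sketch of automatic propagation.
    def wrapper(node_id, *args):
        for a in args:
            if isinstance(a, ErrorValue):
                return a
        return fn(node_id, *args)
    return wrapper

@propagating
def div(node_id, a, b):
    if b == 0:
        return ErrorValue("division by zero", node_id)
    return a / b

err = div("n1", 1, 0)           # an ErrorValue, not an exception
downstream = div("n2", err, 2)  # propagates unchanged through the graph
# TRY_OR-style fallback:
fallback = downstream if not isinstance(downstream, ErrorValue) else -1
```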
Functions Are Subgraphs
A function is a self-contained DAG with numbered input/output ports. No parameter names to agree on. Two AI models can independently generate the same function and get the same content hash — powerful for multi-AI ecosystems.
Security by Default
Programs declare capabilities they need. Two-layer enforcement: static verifier at load time, runtime sandbox at execution. Three-way intersection for agents: trust ceiling ∩ agent declared ∩ program manifest. Closed by default — nothing is allowed unless explicitly declared.
Composition Over Coordination
No Operational Transform needed. Same hash = same thing. Merge is set union of nodes and subgraphs. AI-native: agents produce fragments, the system composes them. No conflict resolution required.
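The merge rule fits in a few lines; the fragment representation (a dict keyed by content hash) is an assumption, but the behavior is the one described above:

```python
# Merging two agents' fragments is a set union keyed by content hash.
# Identical nodes have identical keys, so duplicates collapse and no
# conflict resolution is ever needed.
def merge(frag_a: dict, frag_b: dict) -> dict:
    merged = dict(frag_a)   # {content_hash: node}
    merged.update(frag_b)   # same hash = same thing, so this is safe
    return merged

a = {"h1": {"op": "const", "value": 5}, "h2": {"op": "add"}}
b = {"h1": {"op": "const", "value": 5}, "h3": {"op": "print"}}
assert set(merge(a, b)) == {"h1", "h2", "h3"}
```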
Provenance Separate from Content
Two agents writing CONST(5) get the same content hash. Authorship is tracked in Provenance metadata, not in the hash. This enables deduplication while maintaining audit trails.
Caching is Trivial
Content-addressing makes cache invalidation a non-problem. Same inputs + same function = same result. The hash IS the cache key. No staleness possible. No TTL needed. Three-layer caching: runtime memo, persistent disk, LLM-aware catalog.
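A minimal memo-layer sketch; the exact cache-key encoding is an assumption, but the property is the one stated above — the key derives purely from the function hash and its inputs:

```python
import hashlib
import json

_memo = {}

def cached_call(fn_hash, args, compute):
    # The content hash of (function, inputs) IS the cache key.
    key = hashlib.blake2b(
        json.dumps([fn_hash, args], sort_keys=True).encode()).hexdigest()
    if key not in _memo:          # never stale: same key = same result
        _memo[key] = compute(*args)
    return _memo[key]

assert cached_call("std:double", [21], lambda x: x * 2) == 42
# Second call is served from cache — compute is never invoked again.
assert cached_call("std:double", [21], lambda x: 1 / 0) == 42
```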
Key Differentiators
| Feature | Traditional Languages | Lexis |
|---|---|---|
| Code Format | Text files with syntax rules | JSON DAGs — structured data, no parsing ambiguity |
| Error Handling | try/catch, exceptions, Result types | Errors are values that flow through edges |
| Identity | File paths, module names | BLAKE3 content hash — semantics IS identity |
| Security | OS-level permissions | Capability manifests enforced at parse + runtime |
| Multi-Author | Git merge conflicts | Content-addressed composition — same hash = same thing |
| AI Generation | Generate text, hope it parses | Generate JSON, structurally validated with classified error diagnostics |
| Caching | Manual invalidation, TTL | Content hash IS cache key — automatic, correct, zero-config |
| Conditionals | if/else control flow | SELECT data-flow node — no control edges needed |
Why DAGs?
Every Lexis program is a directed acyclic graph where nodes are operations and edges are data dependencies. This representation has several deep advantages:
- No ordering ambiguity — The scheduler determines execution order from the graph structure via topological sort. The programmer declares what depends on what, not what runs when.
- Automatic parallelism — Independent branches of the DAG can be executed in parallel with zero programmer effort. The `parallel_evaluate()` function uses parallelism levels to maximize throughput.
- No variable mutation — Data flows along edges. There are no mutable variables, no side effects (except I/O opcodes), and no race conditions.
- Visual debugging — DAGs have natural visual representations. The built-in visualization system renders programs as interactive node graphs with execution tracing.
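The scheduling idea can be sketched with the standard-library topological sorter; grouping each `get_ready()` batch yields the parallelism levels:

```python
from graphlib import TopologicalSorter

# Dependency graph for the quick example: sum needs a and b,
# out needs sum. Each value set holds a node's predecessors.
deps = {"a": set(), "b": set(), "sum": {"a", "b"}, "out": {"sum"}}

ts = TopologicalSorter(deps)
ts.prepare()
levels = []
while ts.is_active():
    ready = list(ts.get_ready())   # everything here can run in parallel
    levels.append(sorted(ready))
    ts.done(*ready)

assert levels == [["a", "b"], ["sum"], ["out"]]
```

Cycle detection comes for free: `prepare()` raises on a cyclic graph, which is exactly the "confirm DAG" step of the validation stage.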
System Architecture
How the pieces fit together — from JSON input to executed output.
Core Pipeline
Every Lexis program passes through this security-enforced pipeline:
Package Structure
| Package | Purpose | Key Files |
|---|---|---|
| lexis/graph/ | Schema, serialization, validation | schema.py, serialization.py, validation.py, binary.py |
| lexis/hash/ | BLAKE3 hashing, content store | hasher.py, store.py |
| lexis/security/ | Capabilities, verifier, sandbox | capabilities.py, verifier.py, sandbox.py |
| lexis/interpreter/ | Evaluator, builtins, scheduler | evaluator.py, builtins.py, scheduler.py |
| lexis/networking/ | HTTP client, URL validation, SSRF prevention | security.py, client.py |
| lexis/cache/ | 3-layer caching system | purity.py, memo.py, disk.py, catalog.py |
| lexis/agent/ | Identity, trust, scoping, composition, audit | identity.py, trust.py, scope.py, composer.py, audit.py |
| lexis/protocol/ | Wire protocol messages, codec, sessions | messages.py, codec.py, session.py |
| lexis/transport/ | Agent communication transport layer | base.py, local.py, router.py, discovery.py |
| lexis/stdlib/ | 32 pre-built subgraphs | registry.py, guards.py, transforms.py, reducers.py |
| lexis/mcp/ | MCP server (7 tools, 3 resources, 1 prompt) | server.py, helpers.py |
| lexis/gui/ | Native GUI (tkinter backend) | types.py, backend.py, tk_backend.py, runtime.py |
| lexis/viz/ | DAG visualization & execution tracing | hooks.py, tracer.py, html_generator.py, live_server.py |
| lexis/cli/ | Command-line interface | main.py |
Core Data Types
- GraphNode — A single node in the DAG: id, op (OpCode), inputs (list of node ID refs), value (literal or config)
- LexisProgram — Complete program: spec version, capabilities (frozenset), nodes (tuple), subgraphs (dict), allowed_domains (tuple)
- LexisSubgraph — Reusable function: nodes with INPUT_PORT/OUTPUT_PORT for numbered parameters
- ErrorValue — Error-as-value: message, source_node, optional cause (for error chaining)
- OpCode — Enum of all 93 operations
- Capability — Enum of permissions: PURE_COMPUTE, IO_STDOUT, IO_STDIN, FS_READ, FS_WRITE, NETWORK_OUT, GUI_RENDER, META_EVAL
Opcode Reference
All 93 opcodes organized by tier, from core language to meta-programming.
Tier 1: Core Language (43 opcodes)
Everything needed for calculators, string processors, conditional logic, and basic data manipulation.
| Category | Opcodes |
|---|---|
| Literals | const, print, param |
| Arithmetic | add, sub, mul, div, mod |
| Math | floor, ceil, round, power, sqrt, random |
| Comparison | gt, lt, eq, gte, lte |
| Logic | not, and, or |
| Strings | concat, length, to_str, to_num, format, upper, lower, trim, replace, starts_with, ends_with, contains |
| Collections | sequence, index, range |
| Control Flow | select, try_or, is_error, try_catch, comment, assert |
Tier 2: Functions & Collections (33 opcodes)
Subgraphs (functions), higher-order operations, and advanced data manipulation.
| Category | Opcodes |
|---|---|
| Strings | split, slice |
| Dict | dict, get, keys, values, set, delete_key, items, has_key |
| Sequences | sort, reverse, merge, zip, flatten, unique, take, drop, enumerate |
| Type System | type_of |
| Utility | delay |
| Subgraphs | subgraph_def, subgraph_call, input_port, output_port |
| Higher-Order | map, filter, reduce, iterate, retry, compose, apply, match |
Tier 3: I/O & Networking (6 opcodes)
| Opcode | Inputs | Capability | Description |
|---|---|---|---|
| read_file | 1 (path) | FS_READ | Read file contents as string |
| write_file | 2 (path, content) | FS_WRITE | Write string to file, return filename |
| read_stdin | 0 | IO_STDIN | Read one line from standard input |
| http_get | 1 (URL) | NETWORK_OUT | GET request, return body (auto-JSON parse) |
| http_post | 2 (URL, body) | NETWORK_OUT | POST request, return body |
| http_request | 3 (URL, method, body) | NETWORK_OUT | Full HTTP request, return {status, headers, body} |
Tier 4: GUI (6 opcodes)
| Opcode | Inputs | Pure? | Description |
|---|---|---|---|
| gui_window | variadic | Yes | Create window descriptor with children |
| gui_widget | variadic | Yes | Declare widget (type discriminator in value) |
| gui_canvas | 1 | Yes | Declare canvas with draw commands |
| gui_draw | variadic | Yes | Create draw command (shape in value) |
| gui_state | 0-1 | Yes | Declare reactive state variable |
| gui_render | 1 | No | Render window + run event loop (GUI_RENDER capability) |
Tier 5: Meta-Programming (5 opcodes)
| Opcode | Inputs | Capability | Description |
|---|---|---|---|
| emit_node | 3 | PURE_COMPUTE | Construct a node descriptor dict at runtime |
| build_subgraph | 3 | PURE_COMPUTE | Assemble nodes into a validated, hashed subgraph |
| quote | 1 | PURE_COMPUTE | Serialize an existing subgraph to dict |
| reflect | 0 | PURE_COMPUTE | Introspect program structure |
| eval | 1 | META_EVAL | Parse, validate, and execute a dict as a Lexis program |
All 93 Opcodes
Standard Library
32 pre-built, content-addressed subgraphs referenced via std:name. Resolved at load time — the evaluator never sees stdlib references.
Guards (8 subgraphs)
Type-checking predicates for use with MATCH and FILTER.
| Name | Signature | Description |
|---|---|---|
| std:always_true | ANY → BOOL | Always returns True (catch-all guard) |
| std:always_false | ANY → BOOL | Always returns False |
| std:is_num | ANY → BOOL | True if input is a number |
| std:is_str | ANY → BOOL | True if input is a string |
| std:is_bool | ANY → BOOL | True if input is a boolean |
| std:is_dict | ANY → BOOL | True if input is a dictionary |
| std:is_seq | ANY → BOOL | True if input is a sequence |
| std:is_error | ANY → BOOL | True if input is an ErrorValue |
Transforms (17 subgraphs)
Single-input transformation functions for MAP operations.
| Name | Signature | Description |
|---|---|---|
| std:double | NUM → NUM | Multiply by 2 |
| std:negate | NUM → NUM | Multiply by -1 |
| std:identity | ANY → ANY | Return input unchanged |
| std:to_str | ANY → STR | Convert to string |
| std:to_num | STR → NUM | Convert to number |
| std:get_length | ANY → NUM | Get length |
| std:get_first | SEQ → ANY | First element |
| std:get_last | SEQ → ANY | Last element |
| std:is_positive | NUM → BOOL | True if x > 0 |
| std:is_negative | NUM → BOOL | True if x < 0 |
| std:is_zero | NUM → BOOL | True if x == 0 |
| std:is_empty | ANY → BOOL | True if length == 0 |
| std:abs | NUM → NUM | Absolute value |
| std:square | NUM → NUM | x * x |
| std:increment | NUM → NUM | x + 1 |
| std:decrement | NUM → NUM | x - 1 |
| std:not_op | BOOL → BOOL | Logical NOT |
Reducers (7 subgraphs)
Two-input fold functions for REDUCE operations.
| Name | Signature | Description |
|---|---|---|
| std:add | (NUM, NUM) → NUM | Addition |
| std:mul | (NUM, NUM) → NUM | Multiplication |
| std:sub | (NUM, NUM) → NUM | Subtraction |
| std:min | (NUM, NUM) → NUM | Smaller of two |
| std:max | (NUM, NUM) → NUM | Larger of two |
| std:concat | (STR, STR) → STR | String concatenation |
| std:and_op | (BOOL, BOOL) → BOOL | Logical AND |
Usage
Reference stdlib subgraphs via "std:name" in the value field of MAP, FILTER, REDUCE, or SUBGRAPH_DEF nodes:
At load time, load_program() resolves "std:double" to the real BLAKE3 content hash and injects the subgraph. The evaluator never sees "std:" — it only sees hashes.
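A load-time resolution sketch, with an illustrative (not real) registry; the behavior matches the description above — only values starting with "std:" are touched, and the evaluator receives hashes:

```python
# Hypothetical registry: stdlib name -> content hash (values invented).
STDLIB_HASHES = {"std:double": "b3_ab12", "std:add": "b3_cd34"}

def resolve_stdlib(node: dict) -> dict:
    value = node.get("value")
    if isinstance(value, str) and value.startswith("std:"):
        # Replace the symbolic name with the real content hash.
        return {**node, "value": STDLIB_HASHES[value]}
    return node

node = {"id": "m", "op": "map", "inputs": ["xs"], "value": "std:double"}
resolved = resolve_stdlib(node)
```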
Security Model
Closed by default. Two-layer enforcement. Three-way intersection for multi-agent.
Capabilities
| Capability | Trust Level | Operations |
|---|---|---|
| PURE_COMPUTE | 0 | All arithmetic, logic, string, collection ops |
| IO_STDOUT | 1 | print |
| IO_STDIN | 2 | read_stdin |
| FS_READ | 2 | read_file |
| FS_WRITE | 3 | write_file |
| NETWORK_OUT | 3 | http_get, http_post, http_request |
| GUI_RENDER | 3 | gui_render |
| META_EVAL | 3 | eval |
Enforcement Layers
- Layer 1: Static Verifier — At load time, checks every node's required capabilities against the program's declared manifest. Catches violations before any code runs.
- Layer 2: Runtime Sandbox — During execution, enforces capabilities per-node. Even if the verifier is bypassed, the sandbox blocks unauthorized operations.
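A sketch of the static check, borrowing the OP_REQUIRED_CAPABILITIES name mentioned in the Phase 8a notes; the table here is abbreviated and illustrative:

```python
# Abbreviated opcode -> required-capability table (illustrative subset
# of the tier listings above).
OP_REQUIRED_CAPABILITIES = {
    "add": "PURE_COMPUTE", "print": "IO_STDOUT",
    "read_file": "FS_READ", "http_get": "NETWORK_OUT",
}

def verify(nodes, manifest):
    # Every node's required capability must appear in the declared
    # manifest; an empty result means the program may run.
    return [n["id"] for n in nodes
            if OP_REQUIRED_CAPABILITIES[n["op"]] not in manifest]

nodes = [{"id": "n1", "op": "print"}, {"id": "n2", "op": "http_get"}]
assert verify(nodes, {"PURE_COMPUTE", "IO_STDOUT"}) == ["n2"]
```

Sharing one table between the verifier and the sandbox is what makes the two layers agree: there is a single source of truth for what each opcode needs.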
Multi-Agent Security
When multiple agents collaborate, the effective capability set uses a three-way intersection:
- Trust ceiling — Maximum capabilities based on agent trust level (0-3)
- Agent declared — Capabilities the agent claims to need
- Program manifest — Capabilities declared in the program JSON
All three must agree for an operation to be allowed. This prevents trust escalation, undeclared use, and manifest bypass.
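The intersection can be sketched directly with Python sets, using the trust-level-to-capability mapping from the Capabilities table above:

```python
# Capability ceiling per trust level, from the Capabilities table.
TRUST_CEILING = {
    0: {"PURE_COMPUTE"},
    1: {"PURE_COMPUTE", "IO_STDOUT"},
    2: {"PURE_COMPUTE", "IO_STDOUT", "IO_STDIN", "FS_READ"},
    3: {"PURE_COMPUTE", "IO_STDOUT", "IO_STDIN", "FS_READ",
        "FS_WRITE", "NETWORK_OUT", "GUI_RENDER", "META_EVAL"},
}

def effective_caps(trust_level, agent_declared, program_manifest):
    # trust ceiling ∩ agent declared ∩ program manifest
    return TRUST_CEILING[trust_level] & set(agent_declared) & set(program_manifest)

# A level-2 agent cannot escalate to FS_WRITE even if both the agent
# and the program declare it:
caps = effective_caps(2, {"PURE_COMPUTE", "FS_WRITE"},
                         {"PURE_COMPUTE", "FS_WRITE"})
assert caps == {"PURE_COMPUTE"}
```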
Network Security
- Domain allowlist — Programs must declare `allowed_domains`. Empty list = all requests blocked.
- SSRF prevention — DNS pre-resolution + private IP blocking (127.x, 10.x, 172.16.x, 192.168.x, ::1)
- HTTPS only by default — HTTP requires explicit `allow_http: true` in node config
- No redirects by default — When enabled, every hop is re-validated against allowlist + SSRF checks
- Size limits — 10MB response max, 30s timeout default
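A sketch of the pre-connection checks. The real checker performs DNS resolution itself; this offline version takes the resolved IPs as an argument so the logic can be shown in isolation:

```python
import ipaddress
from urllib.parse import urlparse

def check_request(url, allowed_domains, resolved_ips):
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False                 # HTTPS only by default
    if parsed.hostname not in allowed_domains:
        return False                 # closed-by-default allowlist
    for raw in resolved_ips:
        ip = ipaddress.ip_address(raw)
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False             # SSRF: block 127.x, 10.x, 192.168.x, ::1, ...
    return True

assert check_request("https://api.example.com/x", ["api.example.com"], ["93.184.216.34"])
assert not check_request("https://api.example.com/x", ["api.example.com"], ["10.0.0.5"])
assert not check_request("http://api.example.com/x", ["api.example.com"], ["93.184.216.34"])
```

Checking the resolved address rather than the hostname is the key move: a public-looking hostname that resolves to 10.0.0.5 is still rejected.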
CLI Usage
Command-line interface for running, validating, visualizing, and benchmarking Lexis programs.
Commands
| Command | Description |
|---|---|
| lexis run <file> | Execute a Lexis program |
| lexis run <file> --param name=value | Execute with runtime parameters |
| lexis run <file> --debug | Show all node values during execution |
| lexis run <file> --cache | Enable persistent disk caching |
| lexis run <file> --viz | Execute with trace visualization in browser |
| lexis validate <file> | Validate without executing |
| lexis hash <file> | Compute BLAKE3 content hashes |
| lexis viz <file> | Static DAG visualization in browser |
| lexis stdlib | List all stdlib subgraphs |
| lexis cache list|stats|clear|catalog | Manage persistent cache |
| lexis mcp | Start MCP server (stdio) |
| lexis mcp --http | Start MCP server (HTTP) |
| lexis bench --model <name> | Run LLM benchmarks |
| lexis bench session start <name> | Create a benchmark session |
| lexis bench session report <id> | Generate comparison report |
Examples
MCP Server
Model Context Protocol integration — any MCP-compatible AI agent can generate, validate, and execute Lexis programs through standard tool calls.
Tools (7)
| Tool | Input | Description |
|---|---|---|
| lexis_run | program_json, params?, stdin_lines? | Full pipeline execution → output + hashes |
| lexis_validate | program_json | Parse + validate + verify → structured result |
| lexis_generate | program_json, expected_output? | LLM-tolerant validation with auto-fix + suggestions |
| lexis_check | program_json, params?, stdin_lines? | Recommended: validate AND execute in one call |
| lexis_hash | program_json | BLAKE3 content hash for all nodes |
| lexis_stdlib_list | (none) | List all 32 stdlib subgraphs |
| lexis_stdlib_get | name | Get full subgraph JSON by name |
Resources
- `lexis://spec` — Full spec document
- `lexis://examples/{name}` — Example programs
- `lexis://opcodes` — All opcodes with arities and descriptions
The lexis_check Tool
The recommended tool for AI models. Validates AND executes in one call, with LLM-tolerance fixes applied automatically. Returns per-stage results (parse_ok, structure_ok, security_ok, execution_ok) plus a fixes_applied list that tells the model exactly what was auto-corrected. Prevents models from claiming success without actual execution.
Configuration
Phase 0: Foundation
Project Lexis Is Alive
Summary
The very first milestone. Established the core pipeline that every Lexis program passes through: Parse → Validate → Verify Security → Schedule → Execute with Sandbox → Content-Address. Proved the concept works with basic arithmetic.
Core Pipeline
Every program goes through this security-enforced pipeline:
- Parse — JSON graph format → internal data structures
- Validate — Confirm DAG (no cycles), all references resolve, correct arity
- Verify Security — Every node's capability requirements checked against the manifest
- Schedule — Topological sort determines execution order
- Execute with Sandbox — Runtime capability checks before every side-effecting operation
- Content-Address — BLAKE3 hashes computed for every node (identity = semantics)
What's Proven
- `5 + 3 = 8` executes through the full pipeline
- `(10 + 20) * (5 - 2) = 90` — chained operations work
- Security enforcement catches violations before execution
- Content-addressing works: same computation = same hash regardless of node ID
- Cycle detection, dangling reference detection, arity checking all work
- 47 tests, all green, 0.05 seconds
Phase 1a: Core Language
47 → 96 tests • 28 opcodes
Summary
Phase 1a is where Lexis becomes a real language. Added strings, conditionals, error-as-value system, functions (subgraphs), iteration (REDUCE), and sequences. Grew from 47 tests (Phase 0) to 96 tests.
New Capabilities
- Strings: Concatenation, length, type conversion — `"Hello" + ", " + "Lexis!"` works
- Conditionals: `SELECT(10 > 5, "yes", "no")` → `"yes"`. Pure data flow, no control-flow edges. Composable with NOT, AND, OR.
- Errors as Values: Division by zero doesn't crash — it produces an ErrorValue that flows through the graph. Downstream nodes automatically propagate it. TRY_OR catches errors and returns fallbacks. IS_ERROR lets you inspect. Error chains carry causation for debugging.
- Functions (Subgraphs): A function is a self-contained DAG with numbered input/output ports. No parameter names. Call it by its content hash. `double(21) = 42` works. Chaining works: `double(double(5)) = 20`. Error propagation flows through subgraph boundaries.
- Iteration (REDUCE): Fold a subgraph over a sequence. `REDUCE([1,2,3,4,5], 0, add) = 15`. Works with any 2-input-1-output subgraph.
- Sequences: Variadic SEQUENCE node collects values into ordered lists. Foundation for REDUCE and future collection operations.
Key Design Decisions
- Errors are values — ErrorValue flows through the graph, propagated automatically by the `_propagating` decorator on builtins. No exceptions escape to the user.
- Functions are subgraphs — LexisSubgraph with numbered ports. Two different AI models can independently generate the same function and get the same content hash.
- SELECT for conditionals — 3-input data-flow node. No control-flow edges in the graph — everything is data flow.
- REDUCE for iteration — Folds a subgraph over a sequence. No loops, no mutation — just reduction.
Design Philosophy
The error-as-value system is genuinely elegant. In every human language, error handling is bolted on (try/catch, Result types, Option types). In Lexis, errors are just values that flow through edges — the same way data does. An AI debugging a Lexis program doesn't need to parse stack traces; it follows the error value through the graph to find where it originated. The causal chain is built into the ErrorValue itself.
The subgraph system is clean. Functions are just graphs with ports. No naming conventions to agree on, no parameter ordering debates beyond port indices. Two different AI models can independently generate the same function and get the same content hash. That's powerful for a multi-AI ecosystem.
Phase 1b: Benchmarks & Validation
AI Generation Quality: 16/16 (100%)
Summary
Hypothesis validation phase. Tested two claims: (1) Can AI generate valid Lexis programs from a spec alone? (2) How does Lexis token efficiency compare to Python? Results: 100% generation success, but raw JSON costs ~7x more tokens than Python (the gap shrinks to ~1.65x when only semantic tokens are counted).
AI Generation Quality: 16/16 (100%)
| Suite | Parse | Validate | Security | Execute | Correct |
|---|---|---|---|---|---|
| Hand-written baseline | 8/8 | 8/8 | 8/8 | 8/8 | 8/8 |
| AI-generated (spec only) | 8/8 | 8/8 | 8/8 | 8/8 | 8/8 |
The AI-generated programs were written using only the 54-line spec document — no access to existing examples, no trial-and-error. Every program passed all 5 pipeline stages on the first attempt.
Token Efficiency
| Metric | Lexis/Python Ratio | Meaning |
|---|---|---|
| M1 (Raw JSON) | 6.74x | Lexis is ~7x MORE tokens as raw JSON |
| M4 (Semantic-only) | 1.65x | With a binary format, gap shrinks to ~1.6x |
| T4 Error handling | 0.92x (M4) | Lexis wins — TRY_OR is more compact than try/except |
| T3 Conditional | 1.00x (M4) | Tie — SELECT is as compact as if/else |
Phase 2a: Agent Collaboration
Identity, Trust, Scoping, Composition, Audit
Summary
Introduced multi-agent identity, trust levels, capability scoping, graph fragment composition, and audit trails. This phase laid the security foundation for agents collaborating on shared programs.
What Was Built
- Agent Identity — BLAKE3 cryptographic identities, node signing/verification, provenance tracking (separate from content hashing — dedup preserved)
- Trust Levels & Scoped Security — 4-tier trust system (0-3), three-way capability intersection (trust ceiling ∩ agent declared ∩ program manifest), per-agent sandbox enforcement
- Graph Composition — Safe multi-agent fragment merging with content-hash dedup, open input resolution, provenance preservation (no capability laundering)
- Audit Trail — Append-only, BLAKE3-hashed log tracking every node execution by agent
Key Design Decisions
- Provenance separate from content hash — Two agents writing CONST(5) get the same content hash. Authorship is tracked in Provenance metadata, not in the hash.
- Trust levels 0-3 — Simple integer mapping to capability ceiling. Deliberately coarse-grained. Finer permissions come from the 3-way intersection.
- Three-way capability intersection — All three must agree for an operation to be allowed. Prevents: trust escalation, undeclared use, manifest bypass.
- No capability laundering — When fragments are composed, nodes keep their original agent's provenance.
Phase 2b: Binary Format & Wire Protocol
MessagePack ~5x smaller than JSON
Summary
Added MessagePack binary serialization for compact program encoding (~5x smaller than JSON), a wire protocol for agent communication with signed messages and tamper detection, and content-addressed fragment storage.
What Was Built
- MessagePack Serialization — Opcodes as integers (~5x smaller than JSON), full roundtrip fidelity for programs, subgraphs, agents, provenance
- Wire Protocol — 11 message types for fragment exchange, composition proposals, hash verification. Signed messages with tamper detection
- Fragment Store — Thread-safe content-addressed storage for subgraphs and fragments with dedup reporting
Phase 2c: MAP, FILTER, Parallel
184 tests • 31 opcodes
Summary
Added MAP, FILTER, COMPOSE opcodes and parallel execution. The language now supports multi-agent collaboration end-to-end.
What Was Built
- MAP — Apply subgraph to each sequence element (short-circuits on error)
- FILTER — Keep elements where predicate subgraph returns truthy
- COMPOSE — Create new subgraph B(A(x)) via node prefixing and port rewiring — content-addressable
- Parallel Execution — `parallel_evaluate()` using ThreadPoolExecutor with `parallelism_levels()` grouping, deterministic PRINT ordering, thread-safe stores
Key Design Decisions
- MAP/FILTER as opcodes — Not library functions. Making them opcodes communicates parallelism intent to the scheduler.
- Parallel execution — parallelism_levels() + ThreadPoolExecutor. Deterministic PRINT ordering preserved via output buffering per level.
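The level-by-level strategy can be sketched with ThreadPoolExecutor; the node-function and results-passing conventions here are assumptions, not the real evaluator's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_levels(levels, node_fns):
    # Nodes within a level are independent, so each level fans out to a
    # thread pool; waiting on the futures forms a barrier between levels.
    results = {}
    with ThreadPoolExecutor() as pool:
        for level in levels:
            futures = {n: pool.submit(node_fns[n], results) for n in level}
            for n, fut in futures.items():
                results[n] = fut.result()
    return results

fns = {
    "a": lambda r: 5,
    "b": lambda r: 3,
    "sum": lambda r: r["a"] + r["b"],
}
assert run_levels([["a", "b"], ["sum"]], fns)["sum"] == 8
```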
Phase 3a: Hardening
282 tests total • Pure quality & robustness
Summary
First hardening pass. Added CLI, scheduler edge case tests, adversarial security tests, protocol edge case tests, and error path coverage. No new features — pure quality and robustness.
What Was Built
- CLI entry point for `python -m lexis`. Commands: run, validate, hash
- Adversarial security tests (12 tests) — Trust escalation, provenance forgery, capability laundering, sandbox bypass attempts
- Protocol edge cases (8 tests) — Boundary conditions in message encoding/decoding
- Error path hardening (8 tests) — REDUCE/MAP/FILTER edge cases, subgraph errors
Phase 3b: Data Access Opcodes
38 opcodes • 10 examples
Summary
Added 7 opcodes for working with sequences and dictionaries. The language can now parse structured strings, index/slice into sequences and strings, and work with key-value dictionaries — unlocking real data processing programs.
Opcodes Added
- INDEX — Access element by position. Polymorphic: works on both sequences and strings.
- SLICE — Extract sub-range. Polymorphic: works on both sequences and strings.
- SPLIT — Split string by delimiter. Empty delimiter = character-level split.
- DICT — Create dictionary from alternating key-value pairs.
- GET — Retrieve value from dictionary by key.
- KEYS — Get all keys from a dictionary as a sequence.
- VALUES — Get all values from a dictionary as a sequence.
Phase 4: Gap-Closing
307 tests • 40 opcodes
Summary
Closed three functional gaps identified by audit. The foundation is genuinely complete after this phase — no known inconsistencies, no known blockers for the existing feature set.
What Was Fixed
- TO_NUM — String-to-number conversion. Parses strings to int/float, bools to 0/1, numbers pass through.
- APPLY — Dynamic subgraph dispatch. The hash comes from an input edge, making the target dynamic. This completes COMPOSE: COMPOSE returns a runtime hash → APPLY invokes it.
- All ops work inside subgraphs — MAP/FILTER/REDUCE/COMPOSE/APPLY inside a subgraph now work correctly.
Phase 5: Type Guards & Pattern Matching
346 tests • 43 opcodes
Summary
Added type introspection (TYPE_OF, HAS_KEY) and the MATCH opcode for pattern matching with guards and handlers. Completed the control-flow story: SELECT for binary conditions, MATCH for multi-way dispatch.
Key Design Decisions
- TYPE_OF does NOT propagate errors — Returns "ERROR" string for ErrorValue inputs. This is introspection, not computation.
- HAS_KEY does NOT propagate errors — Returns False for non-dict/ErrorValue inputs. Safe guard — never errors itself.
- MATCH is lazy — Only evaluates the matching handler. Guards checked in order, first truthy wins.
- Guards and handlers are subgraphs — Referenced via SUBGRAPH_DEF hash. Consistent with APPLY pattern.
Why This Matters for Weaker Models
MATCH shifts cognitive load to runtime. Weaker models can declare patterns without implementing dispatch logic. A heterogeneous list [10, "20", 30, "40"] can be type-dispatched through MATCH with just two guard-handler pairs — previously requiring nested SELECT chains.
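The two-pair dispatch described above can be sketched with plain Python callables standing in for guard and handler subgraphs:

```python
def match(value, cases):
    # Guards checked in order; first truthy wins; only the winning
    # handler ever runs (laziness).
    for guard, handler in cases:
        if guard(value):
            return handler(value)
    return None

cases = [
    (lambda v: isinstance(v, str), lambda v: int(v)),  # std:is_str-like guard
    (lambda v: True,               lambda v: v),       # std:always_true catch-all
]
assert [match(v, cases) for v in [10, "20", 30, "40"]] == [10, 20, 30, 40]
```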
Phase 6: Standard Library
423 tests • 18 subgraphs • ~75% token savings
Summary
Created a standard library of 18 pre-built, content-addressed subgraphs that programs can reference by name instead of defining inline. The stdlib is resolved at load time — the evaluator is completely untouched.
Token Impact
| Program | Before (inline) | After (stdlib) | Savings |
|---|---|---|---|
| stdlib_showcase.json | ~80 lines | 10 lines | ~75% |
| stdlib_reduce.json | ~25 lines | 7 lines | ~72% |
| Type-dispatch MATCH | ~2,300 tokens | ~1,200 tokens | ~48% |
An AI model that used to emit 5-node subgraph definitions (~400 tokens each) now emits a 15-character string like "std:always_true".
Key Design Decisions
- Stdlib is compile-time expansion — "std:name" resolved to actual content hash at load time. Evaluator unchanged.
- Content-addressed dedup — Stdlib subgraphs get real BLAKE3 hashes. Two programs using
std:doubleshare the same hash. - "std:" prefix convention — Only values starting with "std:" trigger resolution. No collision with user subgraph names.
Phase 7: LLM Generation Benchmarks
478 tests • 15 tasks across 3 tiers
Summary
Built a benchmark harness to test whether AI models can generate valid Lexis programs from the spec + natural language task descriptions. 15 tasks across 3 tiers. CLI command lexis bench. Works with any OpenAI-compatible API.
Key Design Decisions
- OpenAI-compatible API — Works with LM Studio, Ollama, or any compatible endpoint.
- Robust JSON extraction — Handles code fences, prose wrapping, raw JSON, multiple objects.
- Full pipeline validation — Parse → validate → verify → execute → check output.
- No new dependencies — Uses urllib.request (stdlib) for API calls.
Phase 7b: Spec Hardening
492 tests • 3 rounds of refinement
Summary
Ran the benchmark suite against 5 local models on an NVIDIA 4090 GPU. Performed targeted spec hardening based on error analysis across 3 rounds.
Results
| Model | Size | Score | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 14B | 14B | 11/15 (73%) | 5/5 | 3/5 | 3/5 |
| GLM 4.6v Flash | ~9B | 9/15 (60%) | 5/5 | 2/5 | 2/5 |
| Qwen 2.5 Coder 7B | 7B | 9/15 (60%) | 3/5 | 3/5 | 3/5 |
| GLM 4.7 Flash | 30B | 7/15 (47%) | 2/5 | 3/5 | 2/5 |
| Qwen 2.5 Coder 32B | 32B | 5/15 (33%)* | 1/5 | 2/5 | 2/5 |
*32B model hampered by EXTRACT_FAIL (quantization/VRAM pressure issues, not a capability problem).
Key Insight
Models learn flat graphs and stdlib quickly. The capability wall is multi-level nesting: main graph → subgraph definition → port wiring → higher-order call. This is a working memory limitation in small models, not a spec clarity issue. The spec refinement hit diminishing returns after 3 rounds.
Phase 8: Params, File I/O & Stdin
582 tests • 47 opcodes • 19 examples
Phase 8b: PARAM Opcode
Added runtime parameter injection. Parameters are always strings, injected via CLI --param name=value. Pure operation — no capability required. Missing param → ErrorValue, so use TRY_OR for defaults.
Phase 8a: File I/O (READ_FILE, WRITE_FILE)
Added file system operations with a verifier bug fix: both verifier and sandbox now read from the same OP_REQUIRED_CAPABILITIES dict — single source of truth. FS_READ at trust level 2, FS_WRITE at trust level 3. Write-then-read chaining works naturally: WRITE_FILE returns the filename.
Phase 8c: Standard Input (READ_STDIN)
Added READ_STDIN with the stdin_reader callable pattern for testability. EOF → ErrorValue, TRY_OR provides graceful fallback. No CLI changes needed — just pipe: echo "hi" | lexis run prog.json.
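The injection pattern can be sketched as follows; the ErrorValue stand-in here is a plain dict, and the names are illustrative:

```python
def make_read_stdin(stdin_reader):
    # stdin_reader is any zero-argument callable returning one line;
    # in production it would wrap sys.stdin.readline, in tests a stub.
    def read_stdin():
        line = stdin_reader()
        if line == "":                            # EOF
            return {"error": "EOF on stdin"}      # ErrorValue-like result
        return line.rstrip("\n")
    return read_stdin

lines = iter(["hi\n", ""])
op = make_read_stdin(lambda: next(lines))
assert op() == "hi"
assert op() == {"error": "EOF on stdin"}
```

Because the reader is injected, the EOF-to-ErrorValue path is testable without piping anything into the process.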
Phase 8 Complete: Lexis can now read/write files, accept parameters, and read from stdin. It has graduated from a calculator to a real data processor.
Phase 9: AI-Native Networking
HTTP_GET, HTTP_POST, HTTP_REQUEST • Security-first design
Summary
Added 3 HTTP opcodes with a security-first design: closed-by-default domain allowlist, SSRF prevention, HTTPS-only by default, no redirects by default, response size limits, and timeout enforcement.
Key Design Decisions
- 3 AI-native HTTP opcodes — Three tiers of complexity. GET = 1 input (URL). POST = 2 inputs (URL, body). REQUEST = 3 inputs (URL, method, body). LLMs pick the simplest tier.
- Auto-JSON parse/serialize — Responses with JSON Content-Type auto-parsed. POST auto-serializes dict bodies.
- Closed-by-default domain allowlist — Must explicitly declare every domain. Empty = all blocked.
- SSRF prevention — DNS pre-resolution + private IP blocking before any connection.
- Zero new dependencies — Uses Python stdlib: urllib, ipaddress, socket, json, ssl.
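The SSRF check described above can be sketched with exactly the stdlib modules listed — resolve the hostname first, then reject private/loopback/link-local addresses before any connection is opened. This is a minimal sketch (function name and allowlist shape are illustrative), not the actual verifier code:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_url_safe(url: str, allowlist: set) -> bool:
    parsed = urlparse(url)
    if parsed.scheme != "https":               # HTTPS-only by default
        return False
    host = parsed.hostname
    if host is None or host not in allowlist:  # closed-by-default allowlist
        return False
    # DNS pre-resolution: vet every address the name resolves to
    # BEFORE any socket is connected.
    try:
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False                       # SSRF attempt blocked
    return True

assert not is_url_safe("http://example.com/", {"example.com"})  # not HTTPS
assert not is_url_safe("https://evil.com/", {"example.com"})    # not allowlisted
```

Checking resolved addresses rather than the hostname string is the key: `http://169.254.169.254` rebinds and DNS tricks are caught at the IP layer.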
Phase 10: Persistence & Caching
3-layer caching system • Cache invalidation is a non-problem
Summary
Added a 3-layer caching system: L1 runtime memo (always on, within-run), L2 persistent disk cache (opt-in, cross-run), L3 LLM-aware catalog (for token savings in prompts). Content-addressing makes cache invalidation a non-problem.
Key Design Decisions
- Purity as foundation — Only pure subgraphs (no I/O ops transitively) are cached.
- Cache invalidation is a non-problem — Content-addressing guarantees: same inputs + same function hash = same result. The hash IS the cache key. No staleness possible.
- Bool vs int distinction — `True` and `1` hash differently. Semantically different values must have different hashes.
- Layer 1 always on — Zero cost when no subgraph calls occur. Thread-safe via Lock.
- Layer 2 opt-in — DiskCache enabled via the `--cache` CLI flag. Sharded storage with LRU eviction.
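The cache-key idea can be sketched by hashing a canonical, type-tagged encoding of (function hash, inputs). BLAKE3 has no stdlib binding, so `hashlib.blake2b` stands in here; the type-tagging trick, not the hash function, is the point — it is what makes `True` and `1` hash differently:

```python
import hashlib
import json

def canonical(value):
    """Tag each value with its type so bool/int/str never collide."""
    if isinstance(value, bool):        # must test bool BEFORE int:
        return ["bool", value]         # bool is a subclass of int in Python
    if isinstance(value, int):
        return ["int", value]
    return [type(value).__name__, value]

def cache_key(function_hash: str, inputs: list) -> str:
    payload = json.dumps([function_hash, [canonical(v) for v in inputs]],
                         sort_keys=True, separators=(",", ":"))
    return hashlib.blake2b(payload.encode(), digest_size=32).hexdigest()

assert cache_key("f1", [True]) != cache_key("f1", [1])      # bool vs int
assert cache_key("f1", [2, 3]) == cache_key("f1", [2, 3])   # deterministic
```

Same inputs plus same function hash always produce the same key, which is why staleness cannot occur: a changed function is by definition a different key.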
Phase 11: Multi-Agent Runtime
156 + 51 tests • Transport, event loop, discovery, negotiation
Summary
Transformed Lexis from a single-interpreter language into a multi-agent collaborative system. Added transport layer, event loop, enhanced provenance with Hybrid Logical Clocks, content-aware routing, agent discovery, and capability negotiation.
Architecture
- Transport Layer — Abstract interface for agent communication. LocalTransport with thread-safe queues. ContentRouter with IPFS-style Want/Have protocol.
- Agent Event Loop — State machine with 6 states (IDLE, OFFERING, REQUESTING, COMPOSING, EXECUTING, STOPPED). 19 message handlers.
- Enhanced Provenance — Hybrid Logical Clocks combining physical time + logical counter + agent_id. ProvenanceChain for lineage tracking.
- Discovery & Negotiation — TTL-based agent announcements. Capability negotiation with state machine (PROPOSED→ACCEPTED|REJECTED|COUNTERED|EXPIRED).
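The Hybrid Logical Clock combination of physical time, logical counter, and agent_id can be sketched as follows. The update rules follow the standard HLC algorithm; field and method names here are illustrative, not the Phase 11 API:

```python
import time
from dataclasses import dataclass

@dataclass
class HLC:
    physical: int = 0    # last observed wall-clock time (ms)
    logical: int = 0     # counter for events within the same millisecond
    agent_id: str = "a"  # final tiebreaker for a total order

    def _now_ms(self) -> int:
        return int(time.time() * 1000)

    def tick(self):
        """Local event: advance past both the wall clock and the last stamp."""
        wall = self._now_ms()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1
        return (self.physical, self.logical, self.agent_id)

    def recv(self, remote_physical: int, remote_logical: int):
        """Merge a remote timestamp so causality is preserved."""
        wall = self._now_ms()
        m = max(wall, self.physical, remote_physical)
        if m == self.physical == remote_physical:
            self.logical = max(self.logical, remote_logical) + 1
        elif m == self.physical:
            self.logical += 1
        elif m == remote_physical:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        self.physical = m
        return (self.physical, self.logical, self.agent_id)

clock = HLC(agent_id="agent-1")
t1 = clock.tick()
t2 = clock.tick()
assert t2 > t1   # stamps compare as tuples, so ordering is total per agent
```

Because stamps are plain tuples, provenance entries from different agents sort into a single causally consistent timeline.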
Hardening (Phase 11b)
51 new tests: 18 integration, 21 adversarial, 12 stress. Tested trust escalation, provenance forgery, transport attacks, capability laundering, concurrent messaging, HLC causality. No production bugs found — all Phase 11 code held up.
Phase 12: Benchmark Expansion
22 tasks • 4 tiers • 996 tests
Summary
Updated the LLM benchmark suite to cover all features added in Phases 8-11. Updated the spec from 43 to 50 opcodes. Added 7 new tasks in Tier 4 (I/O & Capabilities).
Key Finding: 7B Beats 14B
The Qwen 2.5 Coder 7B model scored 91% vs 82% for the 14B — consistently across two runs each. For DAG-structured JSON output, the smaller model's more constrained generation appears to be an advantage. Less "creativity" means fewer wrong answers.
Phase 13: MCP Server
1056 tests • 6 tools, 3 resources, 1 prompt
Summary
Built a Model Context Protocol (MCP) server so any MCP-compatible AI agent — Claude Code, Cursor, Claude Desktop, Windsurf — can generate, validate, execute, and compose Lexis programs through standard tool calls. This is the fastest path to adoption: turns Lexis from "a language you learn" into "a tool you call."
Design Decisions
- lexis_generate reuses validate_generated_program — The benchmark validation function already handles everything: auto-infer capabilities, auto-flatten inline objects, staged error classification.
- Structured JSON returns — LLMs need machine-parseable responses to self-correct in agentic loops.
- Suggestions field — Maps each error class to specific fix instructions. The self-correction loop that makes agentic workflows work.
Phase 14: ITERATE Opcode
Bounded iteration with guaranteed termination
Summary
Added bounded iteration. MAP/FILTER/REDUCE handle collection processing, but there was no way to express "repeat until done" — retry patterns, numeric convergence, iterative refinement. ITERATE fills this gap with guaranteed termination via max_steps.
Design
- Inputs: 2 — (initial_value, max_steps)
- Step subgraph: 1 input → 1 output (`[new_value, should_stop]`)
- Follows REDUCE pattern — LLMs that can generate REDUCE programs can immediately generate ITERATE programs.
- max_steps as computed value — Allows dynamic iteration limits based on input data.
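The semantics above can be modeled in plain Python (the real step is a subgraph, not a Python function — this is a sketch of the contract only):

```python
def iterate(step, initial_value, max_steps):
    """ITERATE semantics sketch: step returns (new_value, should_stop);
    max_steps guarantees termination even if should_stop never fires."""
    value = initial_value
    for _ in range(max_steps):
        value, should_stop = step(value)
        if should_stop:
            break
    return value

# Numeric convergence: double until >= 100, bounded at 10 steps.
result = iterate(lambda v: (v * 2, v * 2 >= 100), 1, 10)
assert result == 128
```

If the stop flag never becomes truthy, the loop still returns the last value after `max_steps` iterations — termination is structural, not a property the generating model has to get right.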
Phase 15: Enhanced Error Reporting
1111 tests • Actionable error diagnostics
Summary
Made Lexis errors actionable enough that local AI models (7B-14B) can self-correct instead of getting stuck in retry loops.
What Changed
- classify_error() — Pattern-matches exceptions to error classes (PARSE_FAIL, OPCODE_FAIL, ARITY_FAIL, REF_FAIL, CYCLE_FAIL, SECURITY_FAIL, RUNTIME_FAIL)
- "Did you mean?" — Uses
difflib.get_close_matches()for op typos."conts"→"Did you mean: const, concat?" - --debug flag — Prints all node values in topological order to stderr
- CLI error suggestions — Every error handler now attaches actionable fix suggestions
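The typo diagnostic is a thin wrapper over `difflib.get_close_matches` against the known-opcode list. A minimal sketch (the opcode list here is a small illustrative subset, and the message format is approximate):

```python
import difflib

KNOWN_OPS = ["const", "concat", "add", "sub", "mul", "select", "map"]

def did_you_mean(bad_op: str) -> str:
    """Suggest close opcode names for an unknown op."""
    matches = difflib.get_close_matches(bad_op, KNOWN_OPS, n=2, cutoff=0.6)
    if matches:
        return f'Unknown op "{bad_op}". Did you mean: {", ".join(matches)}?'
    return f'Unknown op "{bad_op}".'

msg = did_you_mean("conts")
assert "const" in msg   # "conts" is close to both "const" and "concat"
```

The `cutoff` keeps unrelated opcodes out of the suggestion, which matters for self-correcting models: a wrong suggestion sends a 7B model down a retry spiral.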
Phase 16: Tiered Spec
Reorganized spec document into 3 tiers
Motivation
Local LLMs (7B-14B) kept reaching for advanced features (subgraphs, match) when simple tasks only needed basics. The spec presented all 50 opcodes flat — the model couldn't distinguish simple from complex.
Solution
- Tier 1: Core Language (25 opcodes) — Self-contained: a model reading only Tier 1 can build calculators, string processors, conditional logic
- Tier 2: Functions & Collections (19 opcodes) — Subgraphs, higher-order ops, collections
- Tier 3: I/O & Networking (6 opcodes) — File, stdin, HTTP
- Tier directive at top: "Start with Tier 1. Only use Tier 2 if Tier 1 cannot solve the task."
Phase 17: lexis_check Tool
1122 tests • Validate + execute in one call
Motivation
During local model testing, we discovered that models call lexis_validate and lexis_run separately — and some skip lexis_run entirely, declaring "Task Completed" without verifying output. A model did this with a broken calculator that would have failed at runtime.
Solution
lexis_check is a single MCP tool that validates, runs, and verifies a program in one call. Models can't claim success without actual execution. Returns per-stage results (parse_ok, structure_ok, security_ok, execution_ok) plus fixes_applied that tells models what was auto-corrected.
Phase 18: Native GUI
6 opcodes • 10 widget types • Tkinter backend
Summary
Added 6 GUI opcodes enabling Lexis programs to create native windowed applications with interactive widgets, event handling, and canvas drawing. Backend is tkinter (zero extra dependencies).
Design Philosophy
- Declarative scene-graph — The DAG describes UI as data structures, not imperative API calls
- 6 opcodes, not 20+ — GUI_WIDGET uses a type discriminator; GUI_DRAW uses a shape discriminator. Adding new widget/shape types requires zero opcode changes
- 5 pure + 1 impure — Only GUI_RENDER is side-effecting. The other 5 build descriptor dicts — pure, cacheable, content-addressable
- Callback subgraphs — Event handlers are subgraphs invoked with state dict → return new state dict
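The "5 pure + 1 impure" split can be sketched as plain dict builders — five opcodes only construct data; only GUI_RENDER touches tkinter. Field names below are illustrative, not the actual Lexis descriptor schema:

```python
def gui_widget(widget_type, **props):
    """Pure: builds a descriptor dict, performs no I/O."""
    return {"kind": "widget", "type": widget_type, "props": props}

def gui_layout(container, children):
    """Pure: containers are just descriptors holding child descriptors."""
    return {"kind": "widget", "type": container, "children": children}

scene = gui_layout("vbox", [
    gui_widget("label", text="Count: 0"),
    # Event handler is named, resolving to a subgraph at render time.
    gui_widget("button", text="+1", on_click="increment"),
])

# Being plain data, the scene is hashable/cacheable like any other node;
# only a final GUI_RENDER(scene) call would be side-effecting.
assert scene["children"][1]["props"]["on_click"] == "increment"
```

This is why adding a new widget type needs zero opcode changes: it is a new value for the type discriminator, not a new node in the language.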
Widget Types
label, button, text_input, checkbox, dropdown, slider, vbox, hbox, grid, frame
Example Programs
gui_hello.json (static window), gui_counter.json (buttons + state), gui_canvas_drawing.json (shapes), gui_calculator.json (digit buttons, +, =, C)
Phase 19: DAG Visualization
3 modes: static, trace, live • Cytoscape.js
Summary
Added browser-based DAG visualization system with 3 modes: static (structure view), trace (step-through playback), and live (real-time GUI program tracing). Uses Cytoscape.js + dagre layout (CDN, zero build step).
Features
- 10 opcode color categories (Literal, Arithmetic, Comparison, Logic, String, Collection, Control Flow, Subgraph, I/O, GUI)
- Collapsible subgraph compound nodes
- Click-to-inspect sidebar (node ID, op, value, result, hash)
- Trace playback controls (step, play/pause, speed slider, reset)
- Live mode with pulsing indicator and real-time node highlighting
- Dark theme inspired by VS Code
Phase 20: Meta-Programming
5 opcodes • Self-bootstrapping foundation
Summary
Added 5 meta-programming opcodes that enable Lexis programs to construct, inspect, and execute graph fragments at runtime. This is the foundation for self-hosting: AI agents using Lexis programs to generate, validate, and compose other Lexis programs.
EVAL Security Design
- Capability ceiling: inner caps = (declared ∩ parent caps) − {META_EVAL}
- No privilege escalation: inner code cannot use capabilities the parent lacks
- No recursive eval: META_EVAL stripped from ceiling prevents eval-of-eval chains
- Recursion depth limit: MAX_EVAL_DEPTH = 3
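The ceiling rule above reduces to one line of set arithmetic. A sketch over capability-name sets (set contents are illustrative):

```python
def eval_ceiling(declared: set, parent: set) -> set:
    """Inner caps = (declared ∩ parent) − {META_EVAL}."""
    return (declared & parent) - {"META_EVAL"}

parent_caps = {"STDOUT", "FS_READ", "META_EVAL"}
inner_declared = {"STDOUT", "NET", "META_EVAL"}

assert eval_ceiling(inner_declared, parent_caps) == {"STDOUT"}
# NET dropped: parent lacks it (no privilege escalation).
# META_EVAL stripped: no eval-of-eval chains.
```

The intersection enforces "no escalation" and the subtraction enforces "no recursion"; the MAX_EVAL_DEPTH limit is a separate belt-and-braces check for anything that slips past.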
Phase 21: Production Patterns
FORMAT, TRY_CATCH, RETRY • 64 opcodes
Summary
Added 3 production-pattern opcodes that close the gap between "Lexis can do it in theory" and "Lexis handles it cleanly in practice."
Opcodes
- FORMAT — String interpolation: `FORMAT("Hello {}, count: {}", name, num)`. Eliminates 60-70% of nodes in message-building patterns.
- TRY_CATCH — Unwraps value/error into an inspectable dict. Always returns a dict — never propagates. Enables error inspection that was previously impossible.
- RETRY — Bounded retry of subgraph up to N times until success. Makes API orchestration practical (HTTP 429/503 recovery).
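TRY_CATCH and RETRY can be modeled in plain Python (the real RETRY takes a subgraph, not a function, and the dict shape here is illustrative):

```python
class ErrorValue:
    """Minimal stand-in for the Lexis error value."""
    def __init__(self, message):
        self.message = message

def try_catch(value):
    """Always returns an inspectable dict; never propagates the error."""
    if isinstance(value, ErrorValue):
        return {"ok": False, "value": None, "error": value.message}
    return {"ok": True, "value": value, "error": None}

def retry(subgraph, input_value, max_attempts):
    """Re-run the subgraph until it succeeds or attempts run out."""
    result = ErrorValue("retry: zero attempts")
    for _ in range(max_attempts):
        result = subgraph(input_value)
        if not isinstance(result, ErrorValue):
            return result
    return result   # last error if every attempt failed

attempts = []
def flaky(x):
    """Fails twice (simulated HTTP 503), then succeeds."""
    attempts.append(x)
    return ErrorValue("HTTP 503") if len(attempts) < 3 else x * 2

assert retry(flaky, 21, 5) == 42                    # 3rd attempt succeeds
assert try_catch(ErrorValue("boom"))["ok"] is False  # error is inspectable
```

RETRY is bounded for the same reason ITERATE is: termination is guaranteed by construction, not by the generated program being correct.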
Impact
Together, these three opcodes make Lexis production-ready for AI agent tool chains, regulated computation pipelines, and API orchestration.
Phase 22: Utility Opcodes
DELAY, RANGE, SORT, REVERSE, MERGE • 69 opcodes
Summary
Added 5 utility opcodes that fill the most common practical gaps: generating number sequences, sorting/reversing data, combining dicts, and pausing for retry backoff.
Design Highlights
- DELAY — Pass-through semantics. Returns input value, enables chaining. Max 60 seconds.
- RANGE — Variable arity (2-3). Auto-detects direction. 10,000 element limit.
- SORT — Type-homogeneous only. Mixed types return ErrorValue.
- MERGE — Dict union. Second dict wins on conflicts.
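These semantics can be sketched as plain-Python models (names and limits mirror the text; this is not the actual implementation):

```python
import time

def op_merge(a: dict, b: dict) -> dict:
    return {**a, **b}                      # second dict wins on conflicts

def op_range(start, stop, step=None):
    if step is None:
        step = 1 if stop >= start else -1  # auto-detect direction
    seq = list(range(start, stop, step))
    if len(seq) > 10_000:
        raise ValueError("RANGE: 10,000 element limit")
    return seq

def op_delay(value, seconds):
    time.sleep(min(seconds, 60))           # hard 60-second cap
    return value                           # pass-through enables chaining

assert op_merge({"a": 1, "b": 2}, {"b": 9}) == {"a": 1, "b": 9}
assert op_range(5, 1) == [5, 4, 3, 2]      # descending auto-detected
assert op_delay("x", 0) == "x"
```

DELAY returning its input is what lets it sit inline in a chain (fetch → DELAY → retry) without extra plumbing nodes.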
Implementation Pattern
All 5 opcodes are simple builtins — no evaluator.py changes needed. They're dispatched via BUILTIN_OPS[node.op](*input_values) automatically. The cleanest possible pattern for new opcodes.
Phase 23: Expanded Stdlib
18 → 32 subgraphs • +14 new
Summary
Expanded the stdlib from 18 to 32 subgraphs. All 14 new subgraphs are composed from existing opcodes — validating the composability of the core opcode set.
New Transforms (9)
abs, square, increment, decrement, get_last, not_op, is_negative, is_zero, is_empty
New Reducers (5)
sub, min, max, concat, and_op
Key Decision
No new opcodes needed. All 14 subgraphs are built from existing opcodes (ADD, SUB, MUL, GT, LT, EQ, SELECT, NOT, LENGTH, INDEX, CONCAT, AND). No registry changes — auto-discovery from the subgraph dictionaries.
Phase 24: String Operations
UPPER, LOWER, TRIM, REPLACE, STARTS_WITH, ENDS_WITH, CONTAINS • 76 opcodes
Summary
Added 7 native string manipulation opcodes. These fill the most critical gap AI models face when generating text-processing programs.
Design Decisions
- All pure, no capabilities — String operations have no side effects.
- Auto-coercion via str() — Matches existing CONCAT/SPLIT pattern. Pragmatic for AI models.
- REPLACE replaces ALL occurrences — What users expect.
- Placed in Tier 1 — Fundamental string operations, unlike SPLIT which produces a collection.
Phase 25: Math Operations
FLOOR, CEIL, ROUND, POWER, SQRT, RANDOM • 82 opcodes
Summary
Added 6 math opcodes filling the gap between basic arithmetic and what models need for calculators, converters, scientific computation, and games.
Design Decisions
- Skipped native ABS — stdlib already has `std:abs` (Phase 23).
- RANDOM is impure — In IO_OPS (not cached) but requires no capability. Generating random numbers isn't dangerous.
- ROUND variable arity (1-2) — `round(3.5)` → 4, `round(3.456, 2)` → 3.46.
- Strict type checking — Math ops reject bools and non-numbers with ErrorValue.
Phase 26: Sequence Operations
ZIP, FLATTEN, UNIQUE, TAKE, DROP, ENUMERATE • 88 opcodes
Summary
Added 6 sequence manipulation opcodes filling the gap between basic sequence creation and higher-order ops.
Design Decisions
- ZIP truncates to shorter — Follows Python semantics. No padding, no error.
- FLATTEN is one level only — Safe, predictable, covers 95% of use cases.
- UNIQUE handles unhashable types — Uses repr() fallback. Preserves first-occurrence order.
- TAKE/DROP clamp to bounds — No error on oversized count.
- ENUMERATE starts at 0 — Always. Keeping it simple.
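The decisions above can be modeled directly (illustrative plain-Python sketches, not the actual implementation):

```python
def op_flatten(seq):
    """One level only — nested lists below the first level survive."""
    out = []
    for item in seq:
        if isinstance(item, list):
            out.extend(item)
        else:
            out.append(item)
    return out

def op_unique(seq):
    """First-occurrence order; repr() keys cover unhashable items."""
    seen, out = set(), []
    for item in seq:
        key = repr(item)
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out

def op_take(seq, n):
    """Clamps to bounds — an oversized count is not an error."""
    return seq[:max(n, 0)]

assert op_flatten([[1, 2], [3, [4]]]) == [1, 2, 3, [4]]   # one level
assert op_unique([{"a": 1}, {"a": 1}, 2]) == [{"a": 1}, 2]
assert op_take([1, 2, 3], 99) == [1, 2, 3]                # clamped
```

The repr() fallback in UNIQUE trades a little precision (two values with identical reprs collapse) for never raising on dicts or lists, which is the right default for model-generated programs.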
Phase 27: Benchmark Refresh
22 → 30 tasks • 1652 tests
Summary
Added 8 new benchmark tasks (t23-t30) covering opcodes from Phases 21-26: string ops, math ops, format, and sequence ops. All 30 baselines pass the full validation pipeline.
New Tasks
| ID | Name | Opcodes Tested |
|---|---|---|
| t23 | string_normalize | trim, lower, eq, select |
| t24 | text_transform | upper, replace, contains |
| t25 | pythagorean | power, sqrt, add |
| t26 | rounding | floor, ceil, round |
| t27 | format_string | format |
| t28 | sequence_pipeline | unique, sort, take |
| t29 | zip_enumerate | zip, enumerate, map |
| t30 | flatten_reduce | flatten, reduce |
Phase 28: Dict Operations
SET, DELETE_KEY, ITEMS • 91 opcodes
Summary
Added 3 dict mutation opcodes. All operations are pure (immutable) — they return NEW dicts, consistent with Lexis's functional design. This completes the dict API: create (dict), read (get), update (set), delete (delete_key).
Design Decisions
- All pure (PURE_OPS) — Immutable operations that return new dicts.
- DELETE_KEY is a no-op on missing keys — Returns dict unchanged rather than erroring.
- ITEMS returns [[key, value], ...] — Uses 2-element lists consistent with Lexis collections. Enables dict↔sequence pipelines with MAP/ZIP.
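The immutable dict API can be sketched as follows — every op returns a new dict and leaves its input untouched (plain-Python models; names are illustrative):

```python
def op_set(d: dict, key, value) -> dict:
    """Pure update: the original dict is never mutated."""
    return {**d, key: value}

def op_delete_key(d: dict, key) -> dict:
    """Pure delete; a no-op when the key is missing."""
    return {k: v for k, v in d.items() if k != key}

def op_items(d: dict) -> list:
    """2-element lists, ready for MAP/ZIP pipelines."""
    return [[k, v] for k, v in d.items()]

base = {"a": 1}
updated = op_set(base, "b", 2)
assert base == {"a": 1}                         # input untouched
assert updated == {"a": 1, "b": 2}
assert op_delete_key(base, "missing") == {"a": 1}  # no-op, no error
assert op_items(updated) == [["a", 1], ["b", 2]]
```

Because the outputs are fresh values, content-addressing and caching work on dict pipelines exactly as they do on numbers and strings.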
Phase 29: Developer Tooling
VS Code extension, enhanced debugging, session logger
Phase 29a: VS Code Syntax Highlighting
Created a VS Code extension with TextMate grammar for Lexis JSON programs. Highlights all 93 opcodes, string values, numbers, booleans, node IDs, capabilities, stdlib references, and structural JSON keys. Packaged as a VSIX file for installation.
Phase 29b: Enhanced Debugging
Added structural warnings, program summary, execution snapshot, opcode hints, call stack tracking, output diff, and contextual suggestions. Designed to help both AI models and human developers understand what went wrong and how to fix it.
Phase 29c: Session Logger
Built benchmark session infrastructure for cross-model comparison:
- sessions.py — Session CRUD, import existing results
- analysis.py — Pipeline inference, opcode extraction, failure patterns, cross-model comparison
- reports.py — Markdown report generation (scoreboard, strengths, hardest tasks, failure patterns)
- session_cli.py — CLI: start, list, show, report, import
Workflow: bench session start "Name" → bench -m model --session ID → bench session report ID
Phase 30: COMMENT & ASSERT
Developer tools • 93 opcodes • 1823 tests
New Opcodes
- COMMENT (1 input) — No-op pass-through node. Returns input unchanged. The `value` field holds a label string for documentation. Acts as inline documentation in the data flow graph. Does NOT propagate errors — passes them through silently.
- ASSERT (2 inputs) — Runtime assertion. Input 1: condition (truthy/falsy). Input 2: value to pass through. If the condition is truthy, returns the value unchanged. If falsy, returns ErrorValue with an assertion failure message. Uses @_propagating: error inputs propagate before the assertion check. Recoverable with TRY_OR.
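The two opcodes can be modeled in plain Python (illustrative sketches of the contracts above, not the actual implementation):

```python
class ErrorValue:
    """Minimal stand-in for the Lexis error value."""
    def __init__(self, message):
        self.message = message

def op_comment(value, label=""):
    """Pure pass-through; the label is documentation only.
    Deliberately skips error propagation: errors flow through unchanged."""
    return value

def op_assert(condition, value):
    """@_propagating behavior: existing errors win over the assertion."""
    if isinstance(condition, ErrorValue):
        return condition
    if isinstance(value, ErrorValue):
        return value
    if condition:
        return value
    return ErrorValue("assertion failed")

assert op_comment(42, "the answer") == 42
assert op_assert(True, "ok") == "ok"
assert isinstance(op_assert(False, "ok"), ErrorValue)   # recoverable via TRY_OR
```

The asymmetry is deliberate: an annotation must never change program behavior, while an assertion that masked an upstream error as "passed" would hide the real failure from a self-correcting model.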
Design Decisions
- COMMENT skips @_propagating — Same as TYPE_OF, HAS_KEY. Annotation should never alter semantics.
- ASSERT uses @_propagating — If inputs are already errors, they should propagate rather than masking as "assertion passed."
- Both are PURE_OPS — No I/O capabilities needed.
Session Logger Fixes
- Pipeline Breakdown: `compare_models()` now calls `analyze_run()` per run and includes pipeline_rates. Reports show real Parse/Validate/Security/Execute/Correct percentages instead of dashes.
- Model Deduplication: When multiple runs exist for the same model, keeps the best-scoring one.
Test Results
18 new tests for COMMENT + ASSERT. 2 new tests for pipeline_rates + deduplication. Updated 3 opcode count assertions (91 → 93). Total: 1823 tests passing.