We tested 6 AI models on 30 programming tasks to see how well they understand the Lexis specification and generate correct programs. Here are the results.
Each model receives the Lexis specification and a task description, then generates a complete Lexis JSON program. We run that program through the full pipeline: parse, validate, security check, execute, and verify output.
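As a rough sketch of how each model/task pair is scored: the stage names mirror the pipeline above, but the `Task` fields and the `validate` / `security_check` / `execute` callables are placeholders for the real Lexis harness, whose actual API is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    inputs: dict[str, Any]   # parameters / stdin / files the task provides
    expected: Any            # the output a correct program must produce

def evaluate(program_text: str, task: Task,
             validate: Callable[[dict], bool],
             security_check: Callable[[dict], bool],
             execute: Callable[[dict, dict], Any]) -> str:
    """Return the first stage a generated program fails at, or "correct"."""
    try:
        program = json.loads(program_text)        # 1. parse
    except json.JSONDecodeError:
        return "parse"
    if not validate(program):                     # 2. validate (schema, node references)
        return "validate"
    if not security_check(program):               # 3. security / capability check
        return "security"
    try:
        output = execute(program, task.inputs)    # 4. execute
    except Exception:
        return "execute"
    return "correct" if output == task.expected else "wrong_answer"   # 5. verify
```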
Models ranked by overall pass rate. All models were given the same spec document and the same 30 tasks across 4 difficulty tiers.
| Rank | Model | Type | Overall | Tier 1 Basic | Tier 2 Intermediate | Tier 3 Advanced | Tier 4 I/O & Capabilities | Avg Time |
|---|---|---|---|---|---|---|---|---|
| 1 | GLM-5 | Cloud | 93% (28/30) | 100% | 86% | 80% | 100% | 12.0s |
| 1 | Kimi-K2.5 | Cloud | 93% (28/30) | 100% | 86% | 80% | 100% | 35.2s |
| 3 | Minimax-M2.5 | Cloud | 90% (27/30) | 100% | 86% | 60% | 100% | 17.5s |
| 4 | Qwen3-Coder-Next | Cloud | 87% (26/30) | 100% | 71% | 80% | 86% | 9.2s |
| 5 | Qwen2.5-Coder-14B | Local Q6_K | 73% (22/30) | 82% | 71% | 60% | 71% | 6.3s |
| 6 | Qwen2.5-Coder-7B | Local Q8_0 | 70% (21/30) | 64% | 57% | 60% | 100% | 4.4s |
A ✓ means the model generated a correct program; an error code shows where it failed (see the failure categories below). Tasks are grouped by difficulty tier.
| Task | Tier | GLM-5 | Kimi-K2.5 | Minimax | Qwen3-CN | Qwen-14B | Qwen-7B |
|---|---|---|---|---|---|---|---|
| Tier 1 — Basic (11 tasks) | |||||||
| t01 arithmetic | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t02 strings | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | SEM |
| t03 conditional | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t04 error_handling | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t05 dict_create | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t23 string_normalize | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t24 text_transform | 1 | ✓ | ✓ | ✓ | ✓ | REF | REF |
| t25 pythagorean | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | REF |
| t26 rounding | 1 | ✓ | ✓ | ✓ | ✓ | REF | ✓ |
| t27 format_string | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t28 sequence_pipeline | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | REF |
| Tier 2 — Intermediate (7 tasks) | |||||||
| t06 subgraph_call | 2 | ✓ | ✓ | SEM | ✓ | ✓ | ✓ |
| t07 map | 2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t08 filter | 2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t09 reduce | 2 | ✓ | SEM | ✓ | SEM | ✓ | ✓ |
| t10 pipeline | 2 | ✓ | ✓ | ✓ | SEM | SEM | REF |
| t29 zip_enumerate | 2 | ✓ | ✓ | ✓ | ✓ | REF | REF |
| t30 flatten_reduce | 2 | SEM | ✓ | ✓ | ✓ | ✓ | REF |
| Tier 3 — Advanced (5 tasks) | |||||||
| t11 stdlib_map | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t12 stdlib_reduce | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t13 match_basic | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | SCH |
| t14 match_dict | 3 | VAL | ✓ | SEM | ✓ | SCH | REF |
| t15 compose_apply | 3 | ✓ | SEM | RUN | SEM | REF | ✓ |
| Tier 4 — I/O & Capabilities (7 tasks) | |||||||
| t16 param_default | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t17 param_provided | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t18 read_stdin | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t19 read_file | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t20 write_read | 4 | ✓ | ✓ | ✓ | SEM | SEM | ✓ |
| t21 http_get | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t22 http_post | 4 | ✓ | ✓ | ✓ | ✓ | REF | ✓ |
Not all failures are equal. We categorize errors by where in the pipeline the program broke down.
REF (node reference errors): 12 failures, all of them from the local models. This is the #1 killer for small models: they pass literal strings like "2" or "hello" as node references instead of creating const nodes first. Cloud models almost never make this mistake.
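A minimal sketch of the mistake, using invented field names (nodes, op, args, value) rather than the real Lexis schema:

```python
# What the small models tend to emit: the literal "2" sits directly in an
# argument slot that expects the ID of another node.
broken = {
    "nodes": {
        "x":   {"op": "const", "value": 40},
        "sum": {"op": "add", "args": ["x", "2"]},   # "2" is not a node ID
    }
}

# What the spec expects: the literal gets its own const node, and the
# consuming node refers to it by ID.
fixed = {
    "nodes": {
        "x":   {"op": "const", "value": 40},
        "two": {"op": "const", "value": 2},
        "sum": {"op": "add", "args": ["x", "two"]},  # both args are node IDs
    }
}
```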
SEM (semantic errors): 12 failures, spread across both cloud and local models. The program is structurally valid and executes, but produces the wrong answer. Often caused by reversed compose order, missing type conversions, or subgraph placement errors.
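Reversed compose order is the easiest of these to picture outside Lexis. Under the conventional definition below (compose(f, g) applies g first), swapping the arguments still runs cleanly but yields a different answer, which is exactly the silent wrong-output failure SEM captures; whether Lexis's own compose follows this convention is not assumed here.

```python
def compose(f, g):
    """Conventional composition: apply g first, then f."""
    return lambda x: f(g(x))

def double(x): return x * 2
def inc(x): return x + 1

print(compose(double, inc)(3))  # double(inc(3)) = 8
print(compose(inc, double)(3))  # inc(double(3)) = 7: same pieces, wrong answer
```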
SCH (schema errors): 2 failures. The program JSON doesn't match the expected schema, for example using a stdlib name as an opcode (std:is_num) or using has_key as a stdlib subgraph reference.
VAL and RUN (validation and runtime errors): 2 failures. One validation error (a handler subgraph missing its INPUT_PORT) and one runtime crash (an INPUT_PORT opcode executed in the main graph context). Both stem from subgraph construction mistakes.
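A sketch of the two mistakes, again with invented field names; the only detail taken from the results above is that INPUT_PORT belongs inside a subgraph, where it receives the value the caller passes in.

```python
# Correct shape: the handler subgraph declares an INPUT_PORT and the main
# graph calls it with an argument. Field names are hypothetical.
ok = {
    "graphs": {
        "main": {"nodes": {
            "x":    {"op": "const", "value": 3},
            "call": {"op": "call", "graph": "handler", "args": ["x"]},
        }},
        "handler": {"nodes": {
            "arg": {"op": "INPUT_PORT"},                  # receives the caller's value
            "out": {"op": "add", "args": ["arg", "arg"]},
        }},
    }
}

# The two observed failures: a handler subgraph with no INPUT_PORT at all
# (caught by the validator), and an INPUT_PORT placed in the main graph,
# where no caller ever feeds it a value (crashes at runtime).
```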
All four cloud models scored 87% or higher, with two reaching 93%. Every cloud model achieved 100% on Tier 1 (basic tasks), proving the spec is clear enough for large models to master the fundamentals.
The 7B and 14B models understand Lexis concepts (they solve I/O and subgraph tasks) but make mechanical errors — passing literal values where node IDs are expected. This is a working memory limitation, not a comprehension gap.
t14 (match_dict) and t15 (compose_apply) were solved by only 2 out of 6 models each. These involve complex subgraph patterns that even cloud models struggle with. This tells us where to improve the spec.
Running on an RTX 4090, local models average 4-6 seconds per task compared to 9-35 seconds for cloud models. For iterative development where you need fast feedback, local models still make sense despite lower accuracy.
152 out of 180 evaluations produced correct programs (84.4%). No model needed hand-holding or prompt engineering — they read the spec and wrote working code. That's the promise of an AI-native language.
These tasks had the lowest pass rates across all models.
| Task | Tier | Pass Rate | Failed By | Common Error |
|---|---|---|---|---|
| t14 match_dict | 3 | 33% (2/6) | GLM-5, Minimax, Qwen-14B, Qwen-7B | Subgraph handler construction |
| t15 compose_apply | 3 | 33% (2/6) | Kimi, Minimax, Qwen3-CN, Qwen-14B | Compose order / subgraph placement |
| t10 pipeline | 2 | 50% (3/6) | Qwen3-CN, Qwen-14B, Qwen-7B | Missing type conversions in chains |
| t09 reduce | 2 | 67% (4/6) | Kimi, Qwen3-CN | Subgraph hash placement |
| t30 flatten_reduce | 2 | 67% (4/6) | GLM-5, Qwen-7B | Subgraph nesting / node references |
Every generated program passes through 5 stages. This shows what percentage make it through each stage.
| Model | Parse | Validate | Security | Execute | Correct |
|---|---|---|---|---|---|
| GLM-5 | 100% | 97% | 97% | 97% | 93% |
| Kimi-K2.5 | 100% | 100% | 100% | 100% | 93% |
| Minimax-M2.5 | 100% | 100% | 100% | 97% | 90% |
| Qwen3-Coder-Next | 100% | 100% | 100% | 100% | 87% |
| Qwen2.5-Coder-14B | 100% | 80% | 80% | 80% | 73% |
| Qwen2.5-Coder-7B | 100% | 73% | 73% | 73% | 70% |
Every model achieved 100% parse rate — all generated valid JSON. The gap between cloud and local models opens at the validation stage, where smaller models fail on node reference errors.
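A small sketch, not the harness's actual reporting code, of how the funnel percentages can be tallied from per-task outcomes (the same labels evaluate() above returns). The example data matches the Qwen2.5-Coder-7B column: 8 programs rejected at validation, 1 wrong answer, 21 correct.

```python
from collections import Counter

STAGES = ["parse", "validate", "security", "execute"]

def funnel(outcomes: list[str]) -> dict[str, int]:
    """Cumulative survival per stage, plus the final correct rate."""
    failed_at = Counter(outcomes)
    survivors = len(outcomes)
    pct = {}
    for stage in STAGES:
        survivors -= failed_at[stage]          # drop programs that died at this stage
        pct[stage] = round(100 * survivors / len(outcomes))
    pct["correct"] = round(100 * failed_at["correct"] / len(outcomes))
    return pct

outcomes = ["validate"] * 8 + ["wrong_answer"] + ["correct"] * 21
print(funnel(outcomes))
# {'parse': 100, 'validate': 73, 'security': 73, 'execute': 73, 'correct': 70}
```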
Tests conducted on February 20, 2026 • Lexis v0.1.0 • 93 opcodes • 30 benchmark tasks • Cloud models via Ollama • Local models on NVIDIA RTX 4090 (24GB VRAM) via LM Studio