Benchmark Results

How Well Can AI Write Lexis?

We tested 6 AI models on 30 programming tasks to see how well they understand the Lexis specification and generate correct programs. Here are the results.

6 models tested • 30 tasks each • 180 total evaluations • 84% average pass rate

Test Environment

How We Tested

Each model receives the Lexis specification and a task description, then generates a complete Lexis JSON program. We run that program through the full pipeline: parse, validate, security check, execute, and verify output.
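
For readers who want the shape of that loop, here is a minimal sketch of such a harness. The helper callables (validate, security_check, execute) and the returned labels are illustrative placeholders, not the actual Lexis tooling API; the sketch only shows how a single generation is pushed through the stages and classified.

    import json

    def evaluate(model_output: str, expected: str, validate, security_check, execute) -> str:
        """Push one generated program through the stages and label the outcome.
        validate / security_check / execute are hypothetical callables standing
        in for the real pipeline; the stage order is the point, not the names."""
        try:
            program = json.loads(model_output)          # 1. parse
        except json.JSONDecodeError:
            return "PARSE_FAIL"
        if not validate(program):                       # 2. validate structure and references
            return "VALIDATION_FAIL"
        if not security_check(program):                 # 3. security check
            return "SECURITY_FAIL"
        try:
            actual = execute(program)                   # 4. execute
        except Exception:
            return "RUNTIME_FAIL"
        if str(actual).strip() != expected.strip():     # 5. verify output
            return "SEMANTIC_FAIL"
        return "PASS"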

☁ Cloud Models (via Ollama)

  • GLM-5:cloud
  • Kimi-K2.5:cloud
  • Minimax-M2.5:cloud
  • Qwen3-Coder-Next:cloud

💻 Local Models (RTX 4090 24GB — LM Studio)

  • Qwen2.5-Coder-14B (Q6_K quantization)
  • Qwen2.5-Coder-7B-Instruct (Q8_0 quantization)

Scoreboard

Overall Rankings

Models ranked by overall pass rate. All models were given the same spec document and the same 30 tasks across 4 difficulty tiers.

Rank | Model | Type | Overall | Tier 1 (Basic) | Tier 2 (Intermediate) | Tier 3 (Advanced) | Tier 4 (I/O & Capabilities) | Avg Time
1 | GLM-5 | Cloud | 93% (28/30) | 100% | 86% | 80% | 100% | 12.0s
1 | Kimi-K2.5 | Cloud | 93% (28/30) | 100% | 86% | 80% | 100% | 35.2s
3 | Minimax-M2.5 | Cloud | 90% (27/30) | 100% | 86% | 60% | 100% | 17.5s
4 | Qwen3-Coder-Next | Cloud | 87% (26/30) | 100% | 71% | 80% | 86% | 9.2s
5 | Qwen2.5-Coder-14B | Local (Q6_K) | 73% (22/30) | 82% | 71% | 60% | 71% | 6.3s
6 | Qwen2.5-Coder-7B | Local (Q8_0) | 70% (21/30) | 64% | 57% | 60% | 100% | 4.4s

Task-Level Results

Every Task, Every Model

Tasks are grouped by difficulty tier. For each task we list the failure codes observed across the six models; "none" means all six models solved it. Failure codes are defined in the legend at the end of this section.

Tier 1: Basic (11 tasks)

Task | Failures observed
t01 arithmetic | none
t02 strings | SEM ×1
t03 conditional | none
t04 error_handling | none
t05 dict_create | none
t23 string_normalize | none
t24 text_transform | REF ×2
t25 pythagorean | REF ×1
t26 rounding | REF ×1
t27 format_string | none
t28 sequence_pipeline | REF ×1

Tier 2: Intermediate (7 tasks)

Task | Failures observed
t06 subgraph_call | SEM ×1
t07 map | none
t08 filter | none
t09 reduce | SEM ×2
t10 pipeline | SEM ×2, REF ×1
t29 zip_enumerate | REF ×2
t30 flatten_reduce | SEM ×1, REF ×1

Tier 3: Advanced (5 tasks)

Task | Failures observed
t11 stdlib_map | none
t12 stdlib_reduce | none
t13 match_basic | SCH ×1
t14 match_dict | VAL ×1, SEM ×1, SCH ×1, REF ×1
t15 compose_apply | SEM ×2, RUN ×1, REF ×1

Tier 4: I/O & Capabilities (7 tasks)

Task | Failures observed
t16 param_default | none
t17 param_provided | none
t18 read_stdin | none
t19 read_file | none
t20 write_read | SEM ×2
t21 http_get | none
t22 http_post | REF ×1

Legend: SEM = wrong output; REF = bad node reference; VAL = validation error; SCH = schema error; RUN = runtime crash. ×N is the number of models that failed with that code.

Error Analysis

Why Models Fail

Not all failures are equal. We categorize errors by where in the pipeline the program broke down.

REF_FAIL — Node Reference Errors

12 failures across the local models. The #1 killer for small models. They pass literal strings like "2" or "hello" as node references instead of creating const nodes first. Cloud models almost never make this mistake.
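
To make the mistake concrete, here is a deliberately simplified illustration. The node layout below is hypothetical, not the actual Lexis JSON schema; only the reference discipline it demonstrates (literals must become const nodes before they can be wired into other nodes) reflects the failure mode described above.

    # Hypothetical, simplified node layout, for illustration only.
    # The real Lexis schema may use different field names.

    # Typical small-model output: a literal string where a node id belongs.
    wrong = [
        {"id": "n1", "op": "add", "inputs": ["2", "3"]},   # "2" and "3" are not node ids
    ]

    # What a validator can actually resolve: literals introduced as const
    # nodes, then referenced by id.
    right = [
        {"id": "c1", "op": "const", "value": 2},
        {"id": "c2", "op": "const", "value": 3},
        {"id": "n1", "op": "add", "inputs": ["c1", "c2"]},
    ]

    def unresolved_refs(nodes):
        """Return (node, input) pairs whose input names no existing node id."""
        ids = {n["id"] for n in nodes}
        return [(n["id"], ref) for n in nodes
                for ref in n.get("inputs", []) if ref not in ids]

    print(unresolved_refs(wrong))   # [('n1', '2'), ('n1', '3')] -> a REF-style failure
    print(unresolved_refs(right))   # []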

SEMANTIC_FAIL — Wrong Output

12 failures across all model tiers. The program is structurally valid and executes, but produces the wrong answer. Often caused by reversed compose order, missing type conversions, or subgraph placement errors.
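
Reversed compose order is easy to picture outside of Lexis as well. The snippet below is plain Python, not Lexis syntax; it just shows how composing the same two steps in the wrong direction yields a program that runs fine but gives the wrong answer, which is exactly what a SEMANTIC_FAIL looks like.

    def compose(f, g):
        """Apply g first, then f (right-to-left, like mathematical composition)."""
        return lambda x: f(g(x))

    double = lambda x: x * 2
    increment = lambda x: x + 1

    intended = compose(double, increment)        # double(increment(x))
    reversed_order = compose(increment, double)  # increment(double(x))

    print(intended(5))        # 12, the expected answer
    print(reversed_order(5))  # 11: same pieces, wrong output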

SCHEMA_FAIL — Invalid Structure

2 failures. The program JSON doesn't match the expected schema. Examples: using a stdlib name as an opcode (std:is_num), or using has_key as a stdlib subgraph reference.

VALIDATION_FAIL + RUNTIME_FAIL

2 failures. One validation error (missing INPUT_PORT in a handler subgraph) and one runtime crash (INPUT_PORT opcode executed in main graph context). Both related to subgraph construction mistakes.


Key Findings

What We Learned

01. Cloud models understand Lexis extremely well

All four cloud models scored 87% or higher, with two reaching 93%. Every cloud model achieved 100% on Tier 1 (basic tasks), proving the spec is clear enough for large models to master the fundamentals.

02. Small models struggle with node references, not concepts

The 7B and 14B models understand Lexis concepts (they solve I/O and subgraph tasks) but make mechanical errors — passing literal values where node IDs are expected. This is a working memory limitation, not a comprehension gap.

03. Two tasks are universally hard

t14 (match_dict) and t15 (compose_apply) were solved by only 2 out of 6 models each. These involve complex subgraph patterns that even cloud models struggle with. This tells us where to improve the spec.

04. Local models are up to 8x faster

Running on an RTX 4090, the local models averaged 4.4-6.3 seconds per task, compared with 9.2-35.2 seconds for the cloud models. For iterative development where you need fast feedback, local models still make sense despite their lower accuracy.

05. The Lexis spec works

152 out of 180 evaluations produced correct programs (84.4%). No model needed hand-holding or prompt engineering — they read the spec and wrote working code. That's the promise of an AI-native language.


Hardest Tasks

Where Models Struggled Most

These tasks had the lowest pass rates across all models.

Task | Tier | Pass Rate | Failed By | Common Error
t14 match_dict | 3 | 33% (2/6) | GLM-5, Minimax, Qwen-14B, Qwen-7B | Subgraph handler construction
t15 compose_apply | 3 | 33% (2/6) | Kimi, Minimax, Qwen3-CN, Qwen-14B | Compose order / subgraph placement
t10 pipeline | 2 | 50% (3/6) | Qwen3-CN, Qwen-14B, Qwen-7B | Missing type conversions in chains
t09 reduce | 2 | 67% (4/6) | Kimi, Qwen3-CN | Subgraph hash placement
t30 flatten_reduce | 2 | 67% (4/6) | GLM-5, Qwen-7B | Subgraph nesting / node references

Pipeline Breakdown

Where Programs Break Down

Every generated program passes through five stages. The table below shows what percentage of each model's programs make it through each stage.

Model | Parse | Validate | Security | Execute | Correct
GLM-5 | 100% | 97% | 97% | 97% | 93%
Kimi-K2.5 | 100% | 100% | 100% | 100% | 93%
Minimax-M2.5 | 100% | 100% | 100% | 97% | 90%
Qwen3-Coder-Next | 100% | 100% | 100% | 100% | 87%
Qwen2.5-Coder-14B | 100% | 80% | 80% | 80% | 73%
Qwen2.5-Coder-7B | 100% | 73% | 73% | 73% | 70%

Every model achieved 100% parse rate — all generated valid JSON. The gap between cloud and local models opens at the validation stage, where smaller models fail on node reference errors.
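
A table like the one above can be derived mechanically from per-run records. The sketch below assumes each run stores the first stage at which it failed (or "pass"); the function name and the example row are illustrative, with the counts chosen to reproduce the Qwen2.5-Coder-7B line.

    from collections import Counter

    STAGES = ["parse", "validate", "security", "execute", "correct"]

    def stage_breakdown(first_failures):
        """first_failures: one entry per run, naming the first stage that failed,
        or 'pass'. Returns the percentage of runs surviving each stage."""
        total = len(first_failures)
        failed_at = Counter(first_failures)
        surviving = total
        rates = {}
        for stage in STAGES:
            surviving -= failed_at.get(stage, 0)
            rates[stage] = round(100 * surviving / total)
        return rates

    # Illustrative reconstruction of the Qwen2.5-Coder-7B row:
    # 30 runs, 8 validation failures, 1 wrong-output failure, 21 passes.
    runs = ["validate"] * 8 + ["correct"] * 1 + ["pass"] * 21
    print(stage_breakdown(runs))
    # {'parse': 100, 'validate': 73, 'security': 73, 'execute': 73, 'correct': 70}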


Tests conducted on February 20, 2026 • Lexis v0.1.0 • 93 opcodes • 30 benchmark tasks • Cloud models via Ollama • Local models on NVIDIA RTX 4090 (24GB VRAM) via LM Studio