We tested 6 AI models on 30 programming tasks to see how well they understand the Lexis specification and generate correct programs. Here are the results.
Each model receives the Lexis specification and a task description, then generates a complete Lexis JSON program. We run that program through the full pipeline: parse, validate, security check, execute, and verify output.
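As a rough sketch of how each model/task pair is scored: the stage names mirror the pipeline above, but the `Task` fields and the `validate` / `security_check` / `execute` callables are placeholders for the real Lexis harness, whose actual API is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    inputs: dict[str, Any]   # parameters / stdin / files the task provides
    expected: Any            # the output a correct program must produce

def evaluate(program_text: str, task: Task,
             validate: Callable[[dict], bool],
             security_check: Callable[[dict], bool],
             execute: Callable[[dict, dict], Any]) -> str:
    """Return the first stage a generated program fails at, or "correct"."""
    try:
        program = json.loads(program_text)        # 1. parse
    except json.JSONDecodeError:
        return "parse"
    if not validate(program):                     # 2. validate (schema, node references)
        return "validate"
    if not security_check(program):               # 3. security / capability check
        return "security"
    try:
        output = execute(program, task.inputs)    # 4. execute
    except Exception:
        return "execute"
    return "correct" if output == task.expected else "wrong_answer"   # 5. verify
```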
Models ranked by overall pass rate. All models were given the same spec document and the same 30 tasks across 4 difficulty tiers.
| Rank | Model | Type | Overall | Tier 1 Basic | Tier 2 Intermediate | Tier 3 Advanced | Tier 4 I/O & Capabilities | Avg Time |
|---|---|---|---|---|---|---|---|---|
| 1 | GLM-5 | Cloud | 93% (28/30) | 100% | 86% | 80% | 100% | 12.0s |
| 1 | Kimi-K2.5 | Cloud | 93% (28/30) | 100% | 86% | 80% | 100% | 35.2s |
| 3 | Minimax-M2.5 | Cloud | 90% (27/30) | 100% | 86% | 60% | 100% | 17.5s |
| 4 | Qwen3-Coder-Next | Cloud | 87% (26/30) | 100% | 71% | 80% | 86% | 9.2s |
| 5 | Qwen2.5-Coder-14B | Local Q6_K | 73% (22/30) | 82% | 71% | 60% | 71% | 6.3s |
| 6 | Qwen2.5-Coder-7B | Local Q8_0 | 70% (21/30) | 64% | 57% | 60% | 100% | 4.4s |
A ✓ means the model generated a correct program; an error code shows where it failed (see the failure categories below). Tasks are grouped by difficulty tier.
| Task | Tier | GLM-5 | Kimi-K2.5 | Minimax | Qwen3-CN | Qwen-14B | Qwen-7B |
|---|---|---|---|---|---|---|---|
| Tier 1 — Basic (11 tasks) | |||||||
| t01 arithmetic | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t02 strings | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | SEM |
| t03 conditional | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t04 error_handling | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t05 dict_create | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t23 string_normalize | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t24 text_transform | 1 | ✓ | ✓ | ✓ | ✓ | REF | REF |
| t25 pythagorean | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | REF |
| t26 rounding | 1 | ✓ | ✓ | ✓ | ✓ | REF | ✓ |
| t27 format_string | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t28 sequence_pipeline | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | REF |
| Tier 2 — Intermediate (7 tasks) | |||||||
| t06 subgraph_call | 2 | ✓ | ✓ | SEM | ✓ | ✓ | ✓ |
| t07 map | 2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t08 filter | 2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t09 reduce | 2 | ✓ | SEM | ✓ | SEM | ✓ | ✓ |
| t10 pipeline | 2 | ✓ | ✓ | ✓ | SEM | SEM | REF |
| t29 zip_enumerate | 2 | ✓ | ✓ | ✓ | ✓ | REF | REF |
| t30 flatten_reduce | 2 | SEM | ✓ | ✓ | ✓ | ✓ | REF |
| Tier 3 — Advanced (5 tasks) | |||||||
| t11 stdlib_map | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t12 stdlib_reduce | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t13 match_basic | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | SCH |
| t14 match_dict | 3 | VAL | ✓ | SEM | ✓ | SCH | REF |
| t15 compose_apply | 3 | ✓ | SEM | RUN | SEM | REF | ✓ |
| Tier 4 — I/O & Capabilities (7 tasks) | |||||||
| t16 param_default | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t17 param_provided | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t18 read_stdin | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t19 read_file | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t20 write_read | 4 | ✓ | ✓ | ✓ | SEM | SEM | ✓ |
| t21 http_get | 4 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| t22 http_post | 4 | ✓ | ✓ | ✓ | ✓ | REF | ✓ |
Not all failures are equal. We categorize errors by where in the pipeline the program broke down.
REF (node reference errors): 12 failures, all of them from the local models. This is the #1 killer for small models: they pass literal strings like "2" or "hello" as node references instead of creating const nodes first. Cloud models almost never make this mistake.
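A minimal sketch of the mistake, using invented field names (nodes, op, args, value) rather than the real Lexis schema:

```python
# What the small models tend to emit: the literal "2" sits directly in an
# argument slot that expects the ID of another node.
broken = {
    "nodes": {
        "x":   {"op": "const", "value": 40},
        "sum": {"op": "add", "args": ["x", "2"]},   # "2" is not a node ID
    }
}

# What the spec expects: the literal gets its own const node, and the
# consuming node refers to it by ID.
fixed = {
    "nodes": {
        "x":   {"op": "const", "value": 40},
        "two": {"op": "const", "value": 2},
        "sum": {"op": "add", "args": ["x", "two"]},  # both args are node IDs
    }
}
```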
SEM (semantic errors): 12 failures, spread across both cloud and local models. The program is structurally valid and executes, but produces the wrong answer. Often caused by reversed compose order, missing type conversions, or subgraph placement errors.
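Reversed compose order is the easiest of these to picture outside Lexis. Under the conventional definition below (compose(f, g) applies g first), swapping the arguments still runs cleanly but yields a different answer, which is exactly the silent wrong-output failure SEM captures; whether Lexis's own compose follows this convention is not assumed here.

```python
def compose(f, g):
    """Conventional composition: apply g first, then f."""
    return lambda x: f(g(x))

def double(x): return x * 2
def inc(x): return x + 1

print(compose(double, inc)(3))  # double(inc(3)) = 8
print(compose(inc, double)(3))  # inc(double(3)) = 7: same pieces, wrong answer
```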
SCH (schema errors): 2 failures. The program JSON doesn't match the expected schema, for example using a stdlib name as an opcode (std:is_num) or using has_key as a stdlib subgraph reference.
VAL and RUN (validation and runtime errors): 2 failures. One validation error (a handler subgraph missing its INPUT_PORT) and one runtime crash (an INPUT_PORT opcode executed in the main graph context). Both stem from subgraph construction mistakes.
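A sketch of the two mistakes, again with invented field names; the only detail taken from the results above is that INPUT_PORT belongs inside a subgraph, where it receives the value the caller passes in.

```python
# Correct shape: the handler subgraph declares an INPUT_PORT and the main
# graph calls it with an argument. Field names are hypothetical.
ok = {
    "graphs": {
        "main": {"nodes": {
            "x":    {"op": "const", "value": 3},
            "call": {"op": "call", "graph": "handler", "args": ["x"]},
        }},
        "handler": {"nodes": {
            "arg": {"op": "INPUT_PORT"},                  # receives the caller's value
            "out": {"op": "add", "args": ["arg", "arg"]},
        }},
    }
}

# The two observed failures: a handler subgraph with no INPUT_PORT at all
# (caught by the validator), and an INPUT_PORT placed in the main graph,
# where no caller ever feeds it a value (crashes at runtime).
```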
All four cloud models scored 87% or higher, with two reaching 93%. Every cloud model achieved 100% on Tier 1 (basic tasks), proving the spec is clear enough for large models to master the fundamentals.
The 7B and 14B models understand Lexis concepts (they solve I/O and subgraph tasks) but make mechanical errors — passing literal values where node IDs are expected. This is a working memory limitation, not a comprehension gap.
t14 (match_dict) and t15 (compose_apply) were solved by only 2 out of 6 models each. These involve complex subgraph patterns that even cloud models struggle with. This tells us where to improve the spec.
Running on an RTX 4090, local models average 4-6 seconds per task compared to 9-35 seconds for cloud models. For iterative development where you need fast feedback, local models still make sense despite lower accuracy.
152 out of 180 evaluations produced correct programs (84.4%). No model needed hand-holding or prompt engineering — they read the spec and wrote working code. That's the promise of an AI-native language.
These tasks had the lowest pass rates across all models.
| Task | Tier | Pass Rate | Failed By | Common Error |
|---|---|---|---|---|
| t14 match_dict | 3 | 33% (2/6) | GLM-5, Minimax, Qwen-14B, Qwen-7B | Subgraph handler construction |
| t15 compose_apply | 3 | 33% (2/6) | Kimi, Minimax, Qwen3-CN, Qwen-14B | Compose order / subgraph placement |
| t10 pipeline | 2 | 50% (3/6) | Qwen3-CN, Qwen-14B, Qwen-7B | Missing type conversions in chains |
| t09 reduce | 2 | 67% (4/6) | Kimi, Qwen3-CN | Subgraph hash placement |
| t30 flatten_reduce | 2 | 67% (4/6) | GLM-5, Qwen-7B | Subgraph nesting / node references |
Every generated program passes through 5 stages. This shows what percentage make it through each stage.
| Model | Parse | Validate | Security | Execute | Correct |
|---|---|---|---|---|---|
| GLM-5 | 100% | 97% | 97% | 97% | 93% |
| Kimi-K2.5 | 100% | 100% | 100% | 100% | 93% |
| Minimax-M2.5 | 100% | 100% | 100% | 97% | 90% |
| Qwen3-Coder-Next | 100% | 100% | 100% | 100% | 87% |
| Qwen2.5-Coder-14B | 100% | 80% | 80% | 80% | 73% |
| Qwen2.5-Coder-7B | 100% | 73% | 73% | 73% | 70% |
Every model achieved 100% parse rate — all generated valid JSON. The gap between cloud and local models opens at the validation stage, where smaller models fail on node reference errors.
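A small sketch, not the harness's actual reporting code, of how the funnel percentages can be tallied from per-task outcomes (the same labels evaluate() above returns). The example data matches the Qwen2.5-Coder-7B column: 8 programs rejected at validation, 1 wrong answer, 21 correct.

```python
from collections import Counter

STAGES = ["parse", "validate", "security", "execute"]

def funnel(outcomes: list[str]) -> dict[str, int]:
    """Cumulative survival per stage, plus the final correct rate."""
    failed_at = Counter(outcomes)
    survivors = len(outcomes)
    pct = {}
    for stage in STAGES:
        survivors -= failed_at[stage]          # drop programs that died at this stage
        pct[stage] = round(100 * survivors / len(outcomes))
    pct["correct"] = round(100 * failed_at["correct"] / len(outcomes))
    return pct

outcomes = ["validate"] * 8 + ["wrong_answer"] + ["correct"] * 21
print(funnel(outcomes))
# {'parse': 100, 'validate': 73, 'security': 73, 'execute': 73, 'correct': 70}
```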
Tests conducted on February 20, 2026 • Lexis v0.1.0 • 93 opcodes • 30 benchmark tasks • Cloud models via Ollama • Local models on NVIDIA RTX 4090 (24GB VRAM) via LM Studio