Can Small RNNs Replace Compilers? My Journey into Neural Lua Emulation

Date: 23 Sep 2025

This post is an overview of my undergraduate dissertation; the full paper can be found here.

Abstract

I recently completed my undergraduate dissertation, which examined whether small language models such as RNNs can tackle the complex task of emulating a compiler. Programs are traditionally compiled by strict, deterministic compilers, which can be expensive to implement and run, making them a poor fit for in-editor REPL tools such as Quokka. Previous work such as Codex and AlphaCode relied on transformer models with expensive inference times. This study evaluated Recurrent Neural Networks (RNNs), both standard and minimal, for Lua emulation, augmenting them with semantic embeddings, Mixture of Experts, and attention. Standard models overfit, struggle to learn meaningful patterns, and fail on complex examples. Minimal RNN architectures generalise better thanks to implicit regularisation. Even so, the study was limited by dataset size and quality, which led to over-fitting.

Introduction

Most programs are written by a programmer in a text editor and are then either compiled or interpreted, depending on the language. Both approaches start with a lexer, which performs lexical analysis: it takes the lexemes of the source and converts them into tokens. A parser then performs syntactic analysis, turning those tokens into an intermediate representation (IR) or an Abstract Syntax Tree (AST). Programs also contain data, such as variables holding a range of different types, which are tracked in symbol tables and scope chains. Commonly, the IR is then executed by a virtual machine (VM), which carries out operations like branching and arithmetic.
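
To make the pipeline concrete, here is a toy sketch of the very first stage, a lexer for a tiny Lua-like fragment. It is purely illustrative and not the tooling used in the dissertation; a real front end would hand these tokens to a parser that builds the AST, which a VM would then execute.

```python
import re

# Toy lexer for a tiny Lua-like fragment: raw lexemes -> (kind, text) tokens.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("STRING", r'"[^"]*"'),
    ("IDENT",  r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OP",     r"[=+\-*/(),]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source: str):
    """Yield (kind, text) tokens; a parser would turn this stream into an AST."""
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(lex('x = 1 + 2 print(x)')))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ...]
```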

Traditional code generation relies on rule-based VMs that emit machine code from the IR or syntax tree, giving us strict, deterministic compilers, which is usually exactly what we want. But what if, rather than building a rule-based compiler that emits machine code, we tried to infer the compiler's behaviour from program-output examples? This research explored that idea through language modelling, asking whether small models like RNNs can be useful for such an application.

Method

The approach centred on training five different RNN architectures to learn Lua program execution from input-output examples. Rather than hand-coding compilation rules, I wanted to see whether these models could learn to predict program outputs directly from semantic patterns.
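
As a rough illustration of that framing (a sketch under my own assumptions about layer sizes, not the dissertation's exact setup), the task can be posed as sequence modelling: the program's tokens go in, and the model is trained with teacher forcing to emit the tokens of the program's output.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN = 512, 128   # hypothetical vocabulary and hidden sizes

class TinyEmulator(nn.Module):
    """Embed tokens, run a GRU over them, and predict the next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):               # tokens: (batch, seq)
        states, _ = self.rnn(self.embed(tokens))
        return self.head(states)              # logits: (batch, seq, vocab)

model = TinyEmulator()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on dummy data; a real batch would be
# [program tokens ... SEP ... output tokens], shifted by one position.
seq = torch.randint(0, VOCAB, (8, 64))
logits = model(seq[:, :-1])
loss = loss_fn(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
optimiser.step()
```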

Architecture Selection

I tested five main architectures, spanning the standard recurrent models and their minimal counterparts, MinLSTM and MinGRU.

The minimal architectures were particularly interesting because they use parallel prefix scanning instead of sequential processing, potentially offering both training efficiency and better generalisation.
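
To show why that matters, here is a minimal sketch of a MinGRU-style layer, following the formulation from recent work on minimal RNNs ("Were RNNs All We Needed?"): the gates no longer depend on the previous hidden state, so the recurrence becomes first-order linear and can be evaluated with a prefix scan rather than a step-by-step loop. The cumulative-product form below favours clarity over numerical stability (a real implementation would scan in log space), and none of it is the dissertation's exact code.

```python
import torch

def min_gru_scan(x, W_z, W_h):
    """MinGRU-style recurrence evaluated without a sequential loop.
    z_t = sigmoid(W_z x_t); h~_t = W_h x_t; h_t = (1 - z_t) h_{t-1} + z_t h~_t,
    i.e. a linear recurrence h_t = a_t h_{t-1} + b_t solvable by a prefix scan."""
    z = torch.sigmoid(x @ W_z)        # (T, d) update gates, input-only
    h_tilde = x @ W_h                 # (T, d) candidate states, input-only
    a = 1.0 - z                       # decay coefficients a_t
    b = z * h_tilde                   # input contributions b_t
    A = torch.cumprod(a, dim=0)       # prefix products a_1 * ... * a_t
    # with h_0 = 0:  h_t = A_t * sum_{k<=t} b_k / A_k
    return A * torch.cumsum(b / A.clamp_min(1e-12), dim=0)

T, d = 16, 32                           # toy sequence of 16 token embeddings
x = torch.randn(T, d)
W_z = torch.randn(d, d) / d ** 0.5
W_h = torch.randn(d, d) / d ** 0.5
print(min_gru_scan(x, W_z, W_h).shape)  # torch.Size([16, 32])
```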

Data and Tokenisation

Creating a suitable dataset proved challenging. Most publicly available Lua code consists of large embedded systems or configuration files, which is not ideal for input-output learning. I generated 677 Lua program examples using ChatGPT, ranging from simple operations like string manipulation to more complex programs with loops and functions.
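
For a sense of what one training example looks like, here is a hypothetical pair in the spirit of the generated dataset (not an actual entry from it): the model sees the program and must predict the text it prints.

```python
# Hypothetical dataset entry: a small Lua program paired with its stdout.
example = {
    "program": (
        "local total = 0\n"
        "for i = 1, 5 do\n"
        "  total = total + i\n"
        "end\n"
        "print(total)"
    ),
    "output": "15",
}
```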

The key innovation was semantic tokenisation, using Tree-sitter to parse each program into an Abstract Syntax Tree. Instead of raw code tokens, I converted programs into semantic tokens like STRING(hello) and FUNCTION_CALL(print). This approach strips away syntactic noise while preserving semantic meaning, helping models focus on what the code actually does rather than how it’s written.
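
A sketch of that tokenisation step is below. The way a grammar is loaded differs across py-tree-sitter versions, and the node-type names ("function_call", "string", and so on) depend on the particular Lua grammar, so treat both as placeholders rather than the dissertation's actual code.

```python
from tree_sitter import Language, Parser  # py-tree-sitter bindings

# Placeholder grammar loading: the exact API differs across py-tree-sitter
# versions (newer releases load a wheel such as tree_sitter_lua instead of
# a pre-built .so), so adapt this to your installation.
LUA = Language("build/lua.so", "lua")   # hypothetical pre-built grammar
parser = Parser()
parser.set_language(LUA)

def semantic_tokens(code: str) -> list[str]:
    """Walk the AST, keeping semantically meaningful nodes such as
    FUNCTION_CALL(print) and STRING(hello) while dropping punctuation."""
    tree = parser.parse(code.encode("utf8"))
    tokens = []

    def visit(node):
        text = code[node.start_byte:node.end_byte]
        if node.type == "function_call":          # node-type names vary by grammar
            callee = code[node.children[0].start_byte:node.children[0].end_byte]
            tokens.append(f"FUNCTION_CALL({callee})")
        elif node.type == "string":
            literal = text.strip('"').strip("'")
            tokens.append(f"STRING({literal})")
        elif node.type in ("identifier", "number"):
            tokens.append(f"{node.type.upper()}({text})")
        for child in node.children:
            visit(child)

    visit(tree.root_node)
    return tokens

print(semantic_tokens('print("hello")'))
# e.g. ['FUNCTION_CALL(print)', 'IDENTIFIER(print)', 'STRING(hello)']
```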

Training Enhancements

To improve performance, I incorporated several enhancements on top of the base architectures: semantic token embeddings, attention, and Mixture of Experts (MoE) routing.
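
The attention component, for instance, can sit directly on top of the recurrent hidden states. The sketch below is my own illustration with assumed layer sizes, not the dissertation's exact design.

```python
import torch
import torch.nn as nn

class AttentiveGRU(nn.Module):
    """GRU followed by self-attention over its hidden states, so later
    tokens can look back at earlier parts of the program."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (batch, seq, dim)
        h, _ = self.rnn(x)
        attended, _ = self.attn(h, h, h)        # self-attention over states
        return h + attended                     # residual connection

layer = AttentiveGRU()
print(layer(torch.randn(2, 32, 128)).shape)     # torch.Size([2, 32, 128])
```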

Results

The results revealed both the promise and limitations of small models for compiler emulation.

Performance Hierarchy

Standard RNN architectures performed poorly across the board.

The MinGRU with 2 experts was the standout performer, achieving 0.0115 on Exact Match and 1.1494 on Pass@1, meaning it correctly predicted some complete program outputs.
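
For context on the metrics, Exact Match counts a prediction as correct only when the entire predicted output matches the reference. Here is a rough sketch of how such a score can be computed; the dissertation's exact normalisation rules are an assumption on my part.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of programs whose predicted output matches the reference
    exactly, after simple whitespace trimming (an assumed normalisation)."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

print(exact_match(["15", "hello"], ["15", "world"]))  # 0.5
```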

Inference Speed Advantage

One clear win for RNN models was inference speed. While the transformer baselines (StarCoder2 and Qwen 2.5 Coder) took 4-5 seconds per inference, the RNN models completed the same task in 0.009-0.027 seconds, nearly 200× faster.
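
Latency like this is straightforward to measure; the snippet below shows the kind of harness involved, though the model size, sequence length, and hardware here are my own assumptions, so absolute numbers will differ from the dissertation's.

```python
import time
import torch
import torch.nn as nn

rnn = nn.GRU(128, 128, batch_first=True).eval()   # stand-in recurrent model
example = torch.randn(1, 64, 128)                 # one tokenised program

with torch.no_grad():
    rnn(example)                                   # warm-up pass
    start = time.perf_counter()
    for _ in range(100):
        rnn(example)
    elapsed = (time.perf_counter() - start) / 100
print(f"{elapsed * 1000:.3f} ms per inference")
```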

Baseline Comparison

The performance gap with large language models was stark.

However, this comparison isn’t entirely fair given the massive difference in model size and training data.

Discussion

Why Minimal Architectures Worked Better

The superior performance of MinGRU and MinLSTM likely stems from implicit regularisation effects. Simpler architectures are less prone to overfitting and may have better gradient flow properties. The removal of complex gating mechanisms seems to help rather than hurt performance on this task.

The Overfitting Problem

All models showed clear signs of overfitting, with validation loss consistently higher than training loss despite extensive regularisation attempts. The 677-sample dataset was simply too small for these models to learn meaningful generalisation patterns.

Dataset Quality Issues

Using ChatGPT for data generation created its own limitations. The generated programs, while syntactically correct, lacked the diversity and complexity of real-world code. The models learned shallow token-level patterns rather than deep semantic understanding.

Mixture of Experts Insights

MoE showed promise in reducing training loss but had limited impact on generalisation. Interestingly, using more than 2 experts actually hurt performance, likely due to sparsity issues in small models.
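
For reference, a Mixture-of-Experts layer in this small-model setting can be as simple as the sketch below. I use a dense, softmax-weighted mixture of two expert MLPs for clarity; the sparsity issue mentioned above concerns sparse top-k routing, and none of this is the dissertation's exact implementation.

```python
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    """Dense Mixture of Experts: a gating network produces per-token weights
    that blend the outputs of a small number of expert MLPs."""
    def __init__(self, dim: int = 128, num_experts: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (batch, seq, dim)
        weights = torch.softmax(self.gate(x), dim=-1)        # (B, T, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (outputs * weights.unsqueeze(-2)).sum(dim=-1)

moe = TwoExpertMoE()
print(moe(torch.randn(4, 16, 128)).shape)   # torch.Size([4, 16, 128])
```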

Real-World Implications

Despite the limitations, the dramatic inference-speed advantage suggests potential applications in lightweight, in-editor tooling, such as REPL-style evaluators like Quokka, where fast feedback matters more than perfect accuracy.

Closing Thoughts

This research highlighted both the potential and current limitations of using small models for compiler-like tasks. While RNNs couldn’t match large language models in accuracy, they demonstrated compelling advantages in inference speed and training efficiency.

The key insight is that minimal RNN architectures offer a sweet spot: simpler than traditional RNNs, yet more effective for this task. The implicit regularisation from architectural simplicity proved more valuable than complex gating mechanisms.

However, the fundamental challenge remains data quality and quantity. Future work should focus on building larger and more diverse datasets, ideally drawn from real-world Lua code rather than LLM-generated examples.

While we’re not ready to replace traditional compilers with neural networks, this exploration revealed interesting trade-offs between model complexity, inference speed, and accuracy. In applications where millisecond-scale inference matters more than perfect accuracy, small RNNs might still have a role to play.

The journey from deterministic compilation to learned execution is far from over, but understanding these fundamental trade-offs brings us one step closer to more adaptive programming tools.