Modelwerk: Neural Networks as Machinery
For the longest time children stood inside looms lifting weights to allow the thread to go through on the command of the weaver. They were called draw boys. There's a moment in the history of weaving where the draw boy disappears, quickly, after the introduction of the Jacquard loom, the one with the punch cards, before Babbage, before Hollerith applied the same mechanism to computing. In doing so the loom became legible and predictable: you could look at the cards and understand what the machine would do. I wrote about the loom and its impact here.
I think about this a lot when I look at modern AI, not because the technology is simple, but because arguably the opposite problem has set in. The draw boys are back, except now they are us. We sit inside these systems lifting weights to pull threads—fine-tuning, prompt-engineering, RLHF-ing, RAG-ging, context engineering—and we can't explain what the machine is doing, or why, or even what the final pattern will be. The machines' abstractions have become a mystery. And they are far from predictable or legible.
Modelwerk is a small hobby project that tries to push back on that, at least a little. It implements four foundational neural network architectures in plain Python, with a fifth on the way: no numpy, no PyTorch, no TensorFlow, no Keras. No frameworks at all. Every operation starts from scalar arithmetic and builds up: scalar ops to vector ops, vector ops to matrix ops, matrices to layers, layers to networks, networks to models. You can trace any computation from the final output all the way down to a floating-point multiply. A universe invented for our neural pie.
The idea was to put something together to help developers understand AIs as machinery. Not as APIs, not as services, and definitely not as magical autocomplete. Just as machines, with parts that do specific things for specific mathematical and technical reasons.
The hill climb
The project is structured as a series of lessons, each implementing a landmark paper:
1. The Perceptron (Rosenblatt, 1958) - a single neuron that learns AND, OR, and NAND gates, then fails irreparably on XOR. The original connectionist dream and its first hard limit. For me, one of the most important computing papers ever written.
2. Multi-Layer Networks and Backpropagation (Rumelhart, Hinton & Williams, 1986) - the paper that cracked credit/error assignment. Ours solves XOR with two hidden neurons, and a second network with eight learns nonlinear decision boundaries.
3. LeNet-5 (LeCun et al., 1998) - the convolutional neural network that settled the argument. Weight sharing, pooling, feature hierarchies. The architecture proved you could learn spatial structure instead of hand-engineering it. Ours can recognise handwritten numbers.
4. The Transformer (Vaswani et al., 2017) — self-attention, positional encoding, the whole thing. The one that kicked off the modern AI revolution. A decoder-only language model that trains on Shakespeare sonnets and generates (bad, really bad) text.
5. Continuous Thought Machines (Sakana AI, 2025) — coming soon. I'm still wrapping my head around this one. But it's a credible attempt to advance the state of the art beyond the now dominant transformer architecture.
Each lesson is a runnable Python script. You execute it, it trains the model, it prints a narrative explanation of what just happened. The output walks through the architecture, the math, a worked example, and what the results mean. It's meant to read like a tutorial that also actually runs.
The papers aren’t random. Apart from being important, they form a hill climb through the core ideas: linear classifiers, gradient-based learning, spatial structure, learned attention, and (eventually) internal iterative computation. Each one introduces exactly one major concept that the previous architecture couldn't handle. The Perceptron can't solve XOR. Backprop can. Backprop can't see spatial structure. Convolutions can. Convolutions have fixed receptive fields. Attention makes them data-dependent. And so on, up the hill.
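To make the first rung of that hill concrete, here is a hypothetical stdlib-only sketch (my own toy, not Modelwerk's lesson code) of a Rosenblatt-style perceptron. The classic update rule masters AND, but tops out at three of four on XOR, since no single linear threshold can separate it:

```python
def predict(w, b, x):
    # step activation: fire if the weighted sum clears the threshold
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train(samples, epochs=25, lr=0.1):
    # the Rosenblatt update rule: nudge weights by the prediction error
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, data in [("AND", AND), ("XOR", XOR)]:
    w, b = train(data)
    correct = sum(predict(w, b, x) == t for x, t in data)
    print(name, correct, "/ 4")  # AND converges; XOR never can
```

However long you train, the XOR run stays stuck below four out of four: that's the hard limit the 1986 paper's hidden layer removes.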
The constraint
The hard rule is this: Python standard library only. Matplotlib gets a pass for visualization, but everything else—every matrix multiply, every softmax, every backward pass—is built from `list[float]` and arithmetic.
Now, this is obviously absurd from a performance or even practical standpoint. Training the transformer takes a few minutes, whereas PyTorch would take milliseconds. Our LeNet hack processes a handful of MNIST digits instead of sixty thousand. But that's the point. The constraint exists to keep the code, naive as it is, as legible as possible all the way down. When you call `matrix.mat_mat(Q, K_T)`, you can read the implementation. It's three nested loops. There's nowhere to hide.
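For flavour, here is roughly what such a three-loop multiply looks like in stdlib Python. This is my reconstruction of the idea, not the repo's actual `matrix.mat_mat` implementation:

```python
def mat_mat(a, b):
    # (n x m) times (m x p): three nested loops, nowhere to hide
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                out[i][j] += a[i][k] * b[k][j]
    return out

Q = [[1.0, 2.0], [3.0, 4.0]]
K_T = [[5.0, 6.0], [7.0, 8.0]]
print(mat_mat(Q, K_T))  # → [[19.0, 22.0], [43.0, 50.0]]
```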
The codebase is organized into composable layers, each importing only from below:
1. Primitives: `scalar.py`, `vector.py`, `matrix.py` are named wrappers around the arithmetic common to neural nets. And so, `scalar.add` is literally `a + b`. But naming it means you can grep for it, and it means the code composed up at the higher levels reads as a description of what it's doing rather than as Python syntax.
2. Building blocks: neurons, dense layers, conv layers, pooling, attention, embeddings, backprop, optimizers: the reusable parts.
3. Models: perceptron, MLP, LeNet-5, and transformer: complete architectures assembled from the building blocks.
4. Lessons: the runnable scripts that tie it all together with narrative.
The layering is strict (I hope!). The transformer imports from attention and embedding. Attention imports from matrix and activations. Matrix imports from vector. Vector imports from scalar. You can check this yourself: it's just `import` statements.
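The idea can be sketched like this (hypothetical module contents for illustration, not the repo's actual files): each named level is built only from the one below it.

```python
# scalar level: named wrappers around bare arithmetic
def add(a, b): return a + b
def mul(a, b): return a * b

# vector level: built only from scalar ops
def dot(xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        total = add(total, mul(x, y))
    return total

# matrix level: built only from vector ops
def mat_vec(m, v):
    return [dot(row, v) for row in m]

print(mat_vec([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0]))  # → [17.0, 39.0]
```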
What I learned
I have an undergrad AI degree from the late Nineties. I can recall our lecturers talking about LeNet-5 and conv nets when the paper was published, although we didn't cover them (we did have to do backprop by hand, the horror). Neural nets were kind of dead relative to Symbolic AI (GOFAI) but had been growing slowly back over that decade. Our lecturers set some modules anyway—as one of them said, it might be useful some day. And to be clear it was controversial at the time (I was a mature student and got along with the lecturers as they were more or less my age, and they told me as much). I’m eternally grateful to our course leaders for this: there’s a parallel universe nearby where neural networks were shut down and not on the curriculum. So I can kind of hum along to this stuff karaoke style even though it's been an age since studying. And I learned quite a bit from this exercise. Building a transformer from scalar operations and articulating what it’s doing teaches you things that using PyTorch doesn't. Some examples:
Softmax is stranger than it looks. Most activation functions, `ReLU`, `sigmoid`, `tanh` and so on, are element-wise: each output depends on one input. Softmax couples everything together. Change one logit (the raw output value produced by the final layer of the network) and every probability shifts, because they have to sum to one. The backward pass uses a full Jacobian matrix, not an element-wise derivative. Writing `J_ij = w_i * (delta_ij - w_j)` by hand in the narrative and then seeing it in plain code makes this visceral in a way that `torch.softmax` can't. It's not something I'd ever internalised before, having glossed over it so many times.
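A minimal sketch of that coupling, in the same stdlib-only spirit (my own reconstruction, not Modelwerk's code): the backward pass loops over the full Jacobian, and every input receives a gradient even when only one output's upstream gradient is non-zero.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(s, upstream):
    # dL/dz_j = sum_i upstream_i * s_i * (delta_ij - s_j)
    # every output i contributes to every input j: the full Jacobian
    n = len(s)
    grad = [0.0] * n
    for j in range(n):
        for i in range(n):
            delta = 1.0 if i == j else 0.0
            grad[j] += upstream[i] * s[i] * (delta - s[j])
    return grad

s = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in s])
# upstream gradient touches only output 0, yet every input gets a gradient
print(softmax_backward(s, [1.0, 0.0, 0.0]))
```

Note the gradient components always sum to zero: because the probabilities must sum to one, pushing one up necessarily pushes the others down.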
Convolution backward is not convolution forward. The forward pass correlates the filter with the input. The backward pass for the input gradients is a full convolution with the rotated filter. Writing this out loop by loop, with bounds checking to handle the implicit zero-padding, is the kind of thing that makes you appreciate what autograd does for you. And also makes you understand what autograd is actually doing. This was an insight for me, and something I’m going to have to spend more time on.
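A 1D toy makes the asymmetry visible (again my own sketch, not the repo's conv layer): the forward pass is a "valid" correlation, while the input gradient equals a "full" convolution with the flipped filter, with a bounds check standing in for the implicit zero-padding.

```python
def conv1d_forward(x, w):
    # "valid" cross-correlation: slide w over x with no padding
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def conv1d_backward_input(grad_out, w, input_len):
    # scatter form of the input gradient: each output grad flows back
    # to the k inputs that produced it
    grad_x = [0.0] * input_len
    for i, g in enumerate(grad_out):
        for j, wj in enumerate(w):
            grad_x[i + j] += g * wj
    return grad_x

def conv1d_full(x, w):
    # "full" convolution: flip the filter, pad implicitly with zeros
    k = len(w)
    w_flipped = w[::-1]
    out = []
    for i in range(len(x) + k - 1):
        s = 0.0
        for j in range(k):
            xi = i - (k - 1) + j
            if 0 <= xi < len(x):  # bounds check = implicit zero-padding
                s += x[xi] * w_flipped[j]
        out.append(s)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]
grad_out = [1.0, 1.0, 1.0]  # pretend upstream gradient
print(conv1d_forward(x, w))                        # [-2.0, -2.0, -2.0]
print(conv1d_backward_input(grad_out, w, len(x)))
print(conv1d_full(grad_out, w))                    # same as the line above
```

The scatter form and the flipped-filter full convolution produce identical gradients: they are two ways of writing the same sum.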
Residual connections are about gradient flow, not representation. I knew this intellectually: transformers do better at passing back information to correct the network. But when you write `transformer_backward` and see the gradient split at each residual connection, with one copy flowing through the sublayer, the other sailing straight back through the skip, it helps with understanding why transformers train at all. Without the skip, gradients have to survive every sublayer's backward pass and the information basically gets worn away thanks to vanishing gradients. With it, they get a highway.
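A scalar-sized toy (hypothetical, not the repo's `transformer_backward`) shows the split: at each skip the upstream gradient is copied to both paths and summed, so the identity path survives even when the sublayer's local gradient is vanishingly small.

```python
def residual_backward(grad_out, sublayer_grad):
    # d(x + f(x))/dx = 1 + f'(x): the upstream gradient is copied to
    # the identity path and the sublayer path, then summed
    return grad_out * 1.0 + grad_out * sublayer_grad

tiny = 1e-6  # a sublayer whose local gradient has all but vanished

g = 1.0
for _ in range(12):  # twelve stacked sublayers, with skips
    g = residual_backward(g, tiny)
print(g)  # stays ~1.0: the identity path is a gradient highway

g_no_skip = 1.0
for _ in range(12):  # the same stack without skips
    g_no_skip *= tiny  # gradients multiply and vanish
print(g_no_skip)  # ~1e-72: effectively gone
```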
Attention is just database lookups, made soft. Queries, keys, values. Score every key against the query, normalize with softmax, blend the values proportionally. Once you see the shapes, `(seq_len, d_k)` in, `(seq_len, seq_len)` attention matrix, `(seq_len, d_k)` out, the mechanism loses its mystique. Which is the point. I wrote the narrative with them as mini-databases, since that's how I got my head around the approach originally. But it's different again when you see it in actual code.
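In that soft-lookup spirit, here is a single-query sketch in stdlib Python (a toy reconstruction, not the repo's attention module): score every key, softmax the scores, blend the values.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    d_k = len(query)
    # scaled dot-product scores: one per key
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # blend the values in proportion to their weights
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys   = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
# a query close to the first key mostly retrieves the first value
print(attend([5.0, 0.0], keys, values))
```

A hard database lookup would return the first value exactly; the softmax makes it a weighted blend, which is what keeps the whole thing differentiable.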
The collaboration
I built this with Claude Code, and I want to be clear about what that means. The experience tracks with what I wrote about building the tars assistant in the post Agentic Engineering: Building Without Writing: it's eyes-on, hands-off. I directed the architecture decisions: the compositional layering, which papers to include, the pedagogical approach. I had sketches of the narrative and notes outlined already, as this is something I've wanted to do for a while. Claude Code wrote the vast majority of the implementation. I reviewed everything, caught issues, pushed back on naming choices and comments. For example, we went through multiple optimisation passes on the transformer code to make the backward pass readable for someone who isn't a researcher. As another example, we spent a lot of time optimising the graphics and the ascii art. And I'll take some credit for the network diagrams. As yet another example, there was a good amount of editing of the perceptron code and commentary to optimise it for reading and learning, but once we established the idioms there, Claude converged on the approach and ran with it. This latter step was very similar to my experience nudging it to test more in tars, a hobby project that now has over 1100 tests. Put another way: you can set the bar and the idiom for a codebase with agents if you want, and if you do, you should do it early. They will get the memo.
That said, I wouldn't call it pair programming. It's more like having a very fast, very knowledgeable collaborator who doesn't get bored writing six nested loops for a convolution backward pass, but who needs steering on questions like "will a Python programmer actually follow this?" The answer, for the transformer backward pass, was initially no. We added gradient flow diagrams, step numbering that mirrors the forward pass in reverse, and explicit comments about why the softmax Jacobian exists. It got there in the end. At the same time I'm pretty sure I would have broken my own constraints, especially around code factoring. A lot of the functions are long, and that's intentional: you can see what these machines are doing without bouncing around a bunch of files and functions. Three decades of priors say I would have broken them up into granular functions.
The project is also a test of whether this kind of agentic collaboration produces coherent code. The layering constraint helps enormously here, as it forces consistency. Each level has a clear interface. The naming conventions propagate. When we renamed `ff1`/`ff2` to `ff_expand`/`ff_contract` in the transformer, the intent became obvious at every call site. Small choices like that compound.
Finally, Claude did something I suspected it could do, but wasn't sure it would. It nailed the math parts, pretty much one-shotted them. All the followups were for reading and layout purposes. Like I said, I've wanted to do something like this for a long time, but getting the math right and verifying it actually worked via training runs and tests is a lot of work without libraries to do the heavy lifting. I've run the code through Claude (other Claude), ChatGPT and Codex, and it all checked out. Not to say there are no errors, but it broadly holds up.
Who this is for
Modelwerk is for programmers who want to build intuition for what neural networks are doing mechanically. Not at the ‘a neural network is like a brain’ level, and not at the ‘here's how to fine-tune LLaMA’ level. And definitely not at the shrill edges of the current AI ‘debate’. At the ‘here is a matrix multiply, here is why we're doing it, here is where the gradient flows back through it’ level. If you've ever looked at a transformer diagram and thought ‘but what does multi-head attention actually compute?’, this might be for you. If you've wondered why convolutions work for images but not for language, this might be for you. If you've wanted to understand backpropagation as an algorithm rather than as a library call, this might be for you.
The code is slow. The models are tiny. The Shakespeare output is terrible. None of that really matters. What matters is that every operation is (kind of) legible, every gradient is derived, and every design choice connects back to a specific problem the previous architecture couldn't solve. The repository is at https://github.com/dehora/modelwerk. Run a lesson, read the output, trace the code. The draw boy is gone, but the cards are still there to read.