TIL: training small models can be more energy intensive than training large models

As I end up reading more about AI, I came across this snippet from a recent post by Sayash Kapoor, which initially felt really counterintuitive:

Paradoxically, smaller models require more training to reach the same level of performance. So the downward pressure on model size is putting upward pressure on training compute. In effect, developers are trading off training cost and inference cost.

Source: AI scaling myths by Sayash Kapoor
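
To get a rough feel for why this happens, here's a back-of-the-envelope sketch. It isn't from Kapoor's post; it just plugs illustrative model sizes into the Chinchilla-style scaling law fit from Hoffmann et al. (2022), where predicted loss is L(N, D) = E + A/N^α + B/D^β and training compute is roughly 6·N·D FLOPs. Treat the numbers as assumptions for illustration, not a real forecast:

```python
# Sketch of the training-vs-inference trade-off using the Chinchilla-style
# loss curve L(N, D) = E + A / N**alpha + B / D**beta (Hoffmann et al., 2022).
# Coefficients below are the published fits; model sizes are made up.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted pre-training loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def tokens_to_match(target_loss, n_params):
    """Tokens a model of size n_params needs to reach target_loss."""
    gap = target_loss - E - A / n_params**alpha
    if gap <= 0:
        raise ValueError("A model this small can never reach that loss.")
    return (B / gap) ** (1 / beta)

def train_flops(n_params, n_tokens):
    """Standard approximation: training compute ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

# A 70B-parameter model trained on 1.4T tokens (roughly Chinchilla-optimal)...
big_loss = loss(70e9, 1.4e12)
big_flops = train_flops(70e9, 1.4e12)

# ...versus an 8B-parameter model pushed to the same predicted loss.
small_tokens = tokens_to_match(big_loss, 8e9)
small_flops = train_flops(8e9, small_tokens)

print(f"8B model needs ~{small_tokens / 1e12:.1f}T tokens to match the 70B model")
print(f"Training compute ratio (small / big): {small_flops / big_flops:.1f}x")
```

With these (assumed) numbers, the small model needs tens of trillions of tokens and roughly twice the training FLOPs to hit the same predicted loss, even though it's far cheaper to serve at inference time. That's the trade-off Kapoor is describing.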

I don’t really have a complete mental model of training yet, but it’s not a million miles away from the “make it work, make it right, make it fast” mantra from Kent Beck.

While they are obviously different things, there is often enough overlap between making something fast and making something efficient that the same idea feels like it can be applied here.

If the last few years have been about making it work (with admittedly mixed progress on making it right…), then it makes sense that this wave of small models could be interpreted as the “make it fast” stage of development.

Anyway.

