I came across an interesting paper, How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference, via a social media post from a friend of mine, Asim Hussein, and one of the details surprised me enough that I wanted to capture it here.
What was worth writing down then?
It’s on page 7.
The authors of the paper ran a bunch of tests against various generative AI models, both open and closed, and did some clever maths to infer likely energy usage based on a number of factors. One takeaway was that the energy cost of running a closed OpenAI model, GPT-4o mini, likely differs depending on whether you get it from OpenAI directly or from Microsoft.
The reason for this is that the same model can be deployed onto different kinds of GPU hardware, and newer generations of hardware, whilst consuming more power in absolute terms, are also significantly more efficient per query because they can serve far more requests at once. The GPU clusters operated by Microsoft Azure, for the APIs offered by Microsoft, appear to use newer H100 hardware, whereas the clusters used when you get the model from OpenAI directly run on older A100 hardware.
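To make that "draws more watts but costs less per query" idea concrete, here's a minimal back-of-the-envelope sketch. The power and throughput figures are hypothetical placeholders I've made up for illustration (they're not measurements from the paper); the point is simply that energy per query is power draw divided by throughput, so a hungrier card can still come out cheaper if it serves enough extra requests.

```python
# Illustrative only: the power and throughput numbers below are hypothetical
# placeholders, not figures from the paper.

def energy_per_query_wh(board_power_w: float, queries_per_second: float) -> float:
    """Energy per query in watt-hours: power draw divided by throughput."""
    joules_per_query = board_power_w / queries_per_second
    return joules_per_query / 3600  # joules -> watt-hours

# Older card: lower absolute power draw, but much lower throughput.
older_gpu = energy_per_query_wh(board_power_w=400, queries_per_second=5)

# Newer card: higher absolute power draw, but far higher throughput.
newer_gpu = energy_per_query_wh(board_power_w=700, queries_per_second=15)

print(f"older GPU: {older_gpu:.4f} Wh per query")  # ~0.0222 Wh
print(f"newer GPU: {newer_gpu:.4f} Wh per query")  # ~0.0130 Wh
```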
Here’s the chart visualising it from the paper:

And here’s the quote:
The observed gap is inconsistent with H200 deployment and suggests that GPT-4o mini is running on A100 or H100 systems. Notably, Azure’s version outperforms OpenAI’s by 47% on average, further supporting the likelihood that Azure uses H100 and OpenAI retains A100.
Another bonus takeaway – even a bigger model can have a lower energy cost when run on newer hardware. GPT-4o is seen as a larger, more capable model than GPT-4o mini, but on a per-query basis, when paired with newer kit, it came in at a lower energy cost:
Our findings indicate that infrastructure is a crucial determinant of AI inference sustainability. While model design enhances theoretical efficiency, real-world outcomes can substantially diverge based on deployment conditions and factors such as renewable energy usage and hardware efficiency.
(snip)
For instance, GPT-4o mini, despite its smaller architecture, consumes approximately 20% more energy than GPT-4o on long queries due to reliance on older A100 GPU nodes.
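A rough way to think about that trade-off (my framing, not the paper's, and with invented numbers): energy per query is roughly the compute a query needs divided by how many useful operations the hardware delivers per joule, so a smaller model on less efficient silicon can still lose to a bigger model on more efficient silicon.

```python
# Hypothetical numbers, chosen only to show the shape of the trade-off:
# a smaller model can still cost more energy per query if its hardware
# delivers fewer useful operations per joule.

def energy_joules(flops_per_query: float, flops_per_joule: float) -> float:
    """Rough energy per query: compute required divided by hardware efficiency."""
    return flops_per_query / flops_per_joule

small_model_old_gpu = energy_joules(flops_per_query=1e12, flops_per_joule=5e10)
large_model_new_gpu = energy_joules(flops_per_query=3e12, flops_per_joule=2e11)

print(f"small model, older GPU: {small_model_old_gpu:.0f} J per query")  # 20 J
print(f"large model, newer GPU: {large_model_new_gpu:.0f} J per query")  # 15 J
```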
These details are not the key point of the paper, which does a fairly good job of delivering against what is promised in its title – How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference.
It’s a good read even if you’re not diving into specific details the way I am here.
Minor update: There’s some commentary in the er… comments that these models likely run on clusters of servers, for performance and efficiency reasons, so the assumption that they are running on a single A100 or H100 card is probably wrong. It seems unlikely that you would mix A100 and H100 cards in the same cluster though, because the slower cards would become the bottleneck, meaning you’d have spent loads of money on new hardware without being able to benefit from any of the performance gains.
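To spell out that bottleneck point with a toy example (assuming a serving setup where every GPU in a group has to finish its slice of work before the next step can start, as in tensor-parallel inference, and using made-up timings):

```python
# Toy illustration: in a synchronized group, each step takes as long as the
# slowest GPU, so the faster cards end up waiting. All timings are made up.

def step_time_ms(per_gpu_times_ms: list[float]) -> float:
    """A synchronized step finishes when the slowest GPU in the group does."""
    return max(per_gpu_times_ms)

uniform_newer_cards = step_time_ms([10, 10, 10, 10])  # 10 ms per step
mixed_older_newer = step_time_ms([10, 10, 25, 25])    # 25 ms per step

print(uniform_newer_cards, mixed_older_newer)
```

The faster cards in the mixed group spend most of each step idle, which is the "paid for new hardware, didn't get the speedup" problem.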