I can’t be the only person who is in a situation like this:
- You’re working on an AI project, or might start one.
- You know there is a much-discussed, but frustratingly hard-to-pin-down environmental impact of using AI in general.
- You know tooling exists to measure the footprint associated with using servers, and there are various lifecycle stages of an AI project you need to account for, like training, and inference – maybe even the production / disposal of hardware too.
- You probably don’t want to run hardware directly yourself, as it’s either expensive, hard to get, or both.
- You want some defensible numbers to allow you to make data-informed decisions about how to mitigate any harms, more than to see if “the good outweighs the bad”.
- You’re looking for reliable numbers, but can’t find them, and you’re aware this might represent a regulatory risk to your organisation.
These feel like things you ought to know before you start a project, to help decide if you should even go ahead – but for many folks, that ship has sailed.
So, I’ll try unpacking these points, and share some context that I hope will be useful.
There is a somewhat selfish angle here – I hope to end up with some useful pointers from people who are also looking into the same issue, and to be able to refer to this post as a snapshot of how I see things in August 2024. I’m also hoping it’ll be a conversation starter for some research I’m doing at work.
Let’s run through these together.
You’re working on an AI project, or might do so in future
This is pretty hard to avoid right now, if only because of hype and FOMO, either amongst your peers or further up the food chain with management.
There are signs of the AI bubble popping, sure, but you also know there are people who are using these tools and getting some tangible value from them. If you consider yourself a professional technologist, you might have found this talk from Simon Willison, Imitation intelligence, given at PyCon, to be a really good, hype-free summary (I know I did).
If you don’t consider yourself a professional technologist, you might have found this talk by product designer Maggie Appleton, Home-Cooked Software and Barefoot Developers, given at Local-first Conf to be an interesting argument in favour of experimenting with a subsection of AI, called Generative AI.
Alternatively, you might be interested in loads of the other things people lump in as AI that aren’t LLMs or Generative AI, and can be quite modest in their energy usage, but are nowhere near as hype-y.
You know there is a much-discussed, but frustratingly hard-to-pin-down environmental impact of using AI
Again, it’s not hard to find breathless news stories about the apocalyptic increase in power draw coming from the adoption of AI, and extremely silly stories about how we might meet this power demand. It can feel like it’s beyond question that widespread adoption of AI, and particularly Generative AI, is going to happen.
It’s fairly common to see charts like this one from the IEA‘s Electricity 2024 report, showing massive jumps in energy use in a very short time.
There may be issues with this chart, and the data used to collate it, but it’s from the IEA, who I consider level-headed, rigorous and well resourced. I met some of the people researching this in June last year, when I did a talk for them, and they have a level of access many of us in civil society just do not have.
It’s also worth bearing in mind that there are multiple issues with how we measure the footprint of the electricity used to power AI right now, and a lot of nuance is lost in public discourse – not least because of how large firms talk about their own environmental footprint.
Check out this chart from the recent, excellent piece from the Financial Times (public archive link) about tech firms’ reported carbon footprints, using the location-based footprint for scope 2 emissions versus the market-based footprint for scope 2 emissions.
Location-based footprints and market-based footprints are both considered legitimate under the de facto GHG Protocol standard for reporting the carbon footprint of electricity usage, but they reflect different perspectives.
A location-based carbon footprint takes into account the physical emissions from the grid electricity you use. You might think of this as the average mix of clean and non-clean energy used, before you take into account any agreements a firm might make to buy cleaner power.
A market-based footprint is designed to reflect the agreements you might use to reduce that footprint, whether signing a longer-term power purchase agreement (a PPA) with a generator (typically cheaper and greener than the grid), or buying ‘unbundled’ clean energy certificates (problematic, but accepted under the GHG Protocol).
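To make the difference concrete, here’s a minimal sketch of the arithmetic behind the two approaches – every figure in it (grid intensity, PPA coverage, certificate volumes) is a made-up placeholder for a fictional datacentre, not real data:

# Illustrative placeholder figures for a fictional datacentre – not real data
electricity_used_kwh = 10_000_000   # annual electricity consumption
grid_intensity = 0.35               # average local grid mix, kg CO2e per kWh

# Location-based: what the local grid actually emitted to serve you
location_based_kg = electricity_used_kwh * grid_intensity

# Market-based: contractual instruments displace the grid average
ppa_covered_kwh = 6_000_000         # covered by a PPA with a wind farm
cert_covered_kwh = 4_000_000        # covered by unbundled clean energy certificates
contract_intensity = 0.0            # both instruments claim zero-carbon power

uncovered_kwh = electricity_used_kwh - ppa_covered_kwh - cert_covered_kwh
market_based_kg = (
    uncovered_kwh * grid_intensity
    + (ppa_covered_kwh + cert_covered_kwh) * contract_intensity
)

print(f"Location-based: {location_based_kg / 1000:,.0f} tonnes CO2e")  # 3,500 tonnes
print(f"Market-based:   {market_based_kg / 1000:,.0f} tonnes CO2e")    # 0 tonnes

Same electricity, same grid, wildly different reported numbers – which is exactly the gap the FT chart shows.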
Large firms invariably prefer to use the market-based view because, for the most part at their scale, reducing your scope 2 footprint with a PPA is often cheaper than doing nothing, and buying unbundled clean energy certificates is massively cheaper in the short term than any other measure you might take to reduce your reported scope 2 emissions for electricity usage.
This can make you look really green, while saving a bunch of money or spending relatively little, compared to other measures you might take.
But as you can see from the chart, using a market-based figure can give a very distorted view of an organisation’s actual environmental impact. You can also see that Amazon exclusively reported the market-based figure until recently, which doesn’t exactly help.
Anyway, I digress. My point was that there is a growing but unclear environmental impact associated with tech firms, and AI-specific datacentres are often cited as a driver of this growth. At the same time, it’s a real challenge to find clear, unambiguous statements from these same firms that definitively confirm this, and attempts to appraise the ‘greenness’ of suppliers in this area are often stymied by the way they report their emissions.
You know tooling exists to measure this footprint associated with using servers, and there are various lifecycle stages of an AI project you need to account for
In 2024, tooling exists to make the power usage of servers easier to measure, even if it’s imperfect, and even if the numbers are riddled with caveats.
Let’s assume you are prepared to work with your team to use this kind of tooling, or look for vendors who use something similar.
Getting an idea of the energy use for server-side code
You can use tools like codecarbon.io to get a good starting approximation for basically any Python code, for example. The code below is simplified, but not far from real-world code:
from codecarbon import EmissionsTracker
from project import run_training

tracker = EmissionsTracker()
tracker.start()
try:
    # Compute intensive code goes here
    _ = run_training()
finally:
    # stop() returns the estimated emissions in kg of CO2-equivalent,
    # and codecarbon also writes its estimates to an emissions.csv file
    emissions_kg = tracker.stop()
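If you’d rather not manage the tracker by hand, codecarbon also ships a @track_emissions decorator you can apply to a function instead, which does the same start/stop bookkeeping for you.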
If this isn’t precise enough, there’s a project in the Green Software Foundation called the Real Time Cloud project, which lists many of the tools and where they might be used now. There’s now a massive, publicly visible Miro board listing them if you want to dive in.
Warning – it’s a deep rabbit hole to fall down!
Knowing the lifecycle stages to measure
You might already know there are different lifecycle stages to measure in an AI project – at a very high level, you might call them production, training and inference.
In the production phase, a huge amount of energy has gone into the creation of the hardware. Finding direct figures for this embodied energy is extremely difficult, because most AI hardware manufacturers do not publish them. So, you’re mostly left working with estimates, even when you have the time to read through the academic literature on the subject.
In the training phase, there is at least one default solution now: codecarbon.io is cited in many peer-reviewed papers, and is used in Hugging Face’s AutoTrain product, for example. As an aside, Hugging Face also provide some guidance on how to include CO2 figures in the model cards they have played a large part in popularising. You might already be looking for model cards when choosing models for a project for other reasons, but this is another reason to look for them. If you’re making your own models, you might consider publishing these figures yourself.
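If you do publish your own figures, here’s a rough sketch of what that might look like with the huggingface_hub library. The co2_eq_emissions field is part of the Hub’s model card metadata spec, but treat the exact field names and units here as assumptions to check against the current docs:

from huggingface_hub import ModelCard, ModelCardData

emissions_kg = 1.25  # illustrative figure, e.g. from a codecarbon run like the one above

card_data = ModelCardData(
    license="mit",
    co2_eq_emissions={
        "emissions": emissions_kg * 1000,  # the Hub spec counts emissions in grams of CO2-eq
        "source": "codecarbon",
        "training_type": "fine-tuning",
    },
)
# "my-org/my-model" is a hypothetical repo name, used here for illustration
card = ModelCard.from_template(card_data, model_id="my-org/my-model")
card.save("README.md")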
In the inference phase, things are less clear. You have some leaderboards which give you per-inference figures for models doing common tasks now, so if you know your base model, then llm-perf-leaderboard on Hugging Face can give you some idea, as can ML.energy.
Both play a similar “leaderboard” role, letting you see how different models compare, and inform your decisions about which ones to use.
There are also promising signs of an Energy Star for AI style project you might be following, again from Hugging Face.
If you want to show someone that model choice matters, there’s also a nice tool, GreenCoding.AI, which lets you try various prompts with different models and returns their answers, along with some figures for the environmental impact of the inference you just ran. This lets you see if a small LLM gives you a good enough answer compared to a large one, and lets you compare the relative impact of each. You can see a demonstration of these in this workshop deck I gave at DjangoCon EU in June this year – see slides 47 through 51.
These don’t really help with the specific emissions of your specific system, though. For that, you’re probably looking for something that can be integrated into the system itself, so you might have some kind of usage-based reporting – perhaps a dashboard you can refer to regularly in future. In the meantime, you’d probably want a spreadsheet or similar you can read.
The most developed open source tool I’ve seen so far for your specific use of inference, if you’re using a remote provider like OpenAI, Mistral and so on, is Ecologits.ai.
It works by annotating every response you receive back from a remote provider with some environmental impact information, based on details of the response – like the model used, the number of tokens in the response, and so on.
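Here’s a minimal sketch of what that looks like, based on my reading of the EcoLogits docs – the exact attribute names on the impacts object are an assumption worth checking against the current documentation:

from ecologits import EcoLogits
from openai import OpenAI

# Patch supported provider clients so responses carry impact estimates
EcoLogits.init()

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)

# Per-request estimates, attached by EcoLogits to the response object
print(response.impacts.energy.value, "kWh")    # energy used to serve the request
print(response.impacts.gwp.value, "kgCO2eq")   # global warming potential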
These are estimates, though, based on educated guesses about the number of GPUs and other resources used to service a request. Once again, you do not have meaningful data from the organisations actually operating the hardware, so you’re reduced to guesstimates – this really should be something the providers share, because they’re the ones with access to the hardware!
If you’re like me, you probably haven’t found any providers that report these figures for your project, so something like Ecologits.ai might be your best option for now.
You probably don’t want to run hardware directly yourself, as it’s either expensive, hard to get, or both
Given that we’ve established that some tooling, however imperfect, exists, and that there are significant data gaps at every stage of the lifecycle, one approach you might have considered is to measure resource usage directly from AI hardware you physically own and operate yourself.
In most cases, this is likely a non-starter, because these GPUs are just SO expensive (tens of thousands of EUR for a single card to insert into a server is common), and that’s assuming you can get your hands on one (which, for a long time, has been a real challenge).
Anyway, assuming you did have the money and access, there’s still the significant embodied energy and carbon from production to think about. You might not have direct figures for AI hardware from manufacturers, but if you assume it’s at all like other electronics, it’s reasonable to assume there is a significant carbon footprint here as well – even if you have to estimate it, it’s still likely to be a chunky figure on your books to account for.
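If you did want to put that estimate on your books, the arithmetic is simple enough to sketch – but be warned, every number below is a placeholder I’ve made up for illustration, not a real figure for any piece of hardware:

# All numbers here are purely illustrative placeholders – not real data
cards = 8                    # GPUs in a hypothetical server
embodied_kg_per_card = 150   # assumed embodied carbon per card, kg CO2e
service_life_years = 4       # assumed useful life of the hardware
utilisation = 0.5            # fraction of time spent doing useful work

# A common approach: spread the embodied footprint over the useful life,
# attributing it to the hours the hardware actually worked
hours_in_life = service_life_years * 365 * 24
embodied_kg_per_useful_hour = (cards * embodied_kg_per_card) / (hours_in_life * utilisation)

print(f"{embodied_kg_per_useful_hour * 1000:.1f} g CO2e per useful hour")  # 68.5

The interesting term is utilisation: idle hardware still carries its embodied footprint, so low utilisation makes every useful hour look worse.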
The final reason you probably don’t want hardware yourself is that if the project doesn’t work out, you probably don’t want to be stuck with a really expensive piece of equipment that you then need to figure out what to do with, or how to dispose of responsibly.
You want some defensible numbers to allow you to make data-informed decisions about how to mitigate any harms
Regardless of how you feel about AI, or even Generative AI versus regular old ML, having defensible numbers helps you have a data-informed discussion about the impact of AI.
This is not the same as looking for numbers to help you figure out whether the good outweighs the bad, as if there were a single axis, with “good” at one end and “bad” at the other, and it’s a case of making one cancel the other out.
Just because there are some upsides to a project, it doesn’t automatically free you from the responsibility of thinking about how to mitigate the downsides – and thinking in this way is pretty naive and simplistic.
It’s more like being aware of multiple axes. You want to be informed about the likely harms you need to be mindful of, and what mitigating steps you might take to reduce avoidable harm, regardless of how good or bad the intended outcome of deploying a project might be.
You’re looking for reliable numbers, but can’t find them, and this might represent a regulatory risk
This might be redundant, given the first five points, but it bears repeating, especially given the shifting regulatory landscape. You are trying to do the right thing as a technology professional, and you can’t find these numbers.
This doesn’t feel great, but if all goes well, and this project turns out to be really successful and pivotal to your organisation, then it’s likely that as usage grows, it’ll end up making up a material part of your organisation’s environmental impact.
If this is the case, you’ll definitely need access to these numbers, particularly if you’re working in, or selling to people in, Europe.
From January 2025, activity that makes up a material part of an organisation’s environmental impact will be something they are legally required to report, as a result of the Corporate Sustainability Reporting Directive being passed. While it’s true that this mainly applies to large organisations, these reporting requirements explicitly refer to impacts in an organisation’s supply chain.
Even if your organisation is below the size threshold for reporting, if your impact is considered material to a customer’s impact, they’ll need to report it, which means they’ll need these numbers from you, anyway.
Either way, you probably want to be able to produce these numbers in case someone asks for them in a few months’ time.
Do you recognise any of this?
I’m trying to answer this question myself, and I’ve tried sharing some pointers to promising projects that might help me and others along the way.
However, I’d welcome suggestions, or introductions, because for all the attention going towards AI projects, the story about being able to understand and manage their direct impact needs a lot of work.