As part of the work I’m doing on the Planet Friendly Web, I’m trying to get access to data that I can base the guide on. In some cases this involves creating datasets from existing data. Here I share some findings from a dataset I generated along the way.
For example, to get a figure on how much of the web runs on renewable power, I started with a dataset of the top 1 million domains by traffic from Alexa.com, then ran the list against the API of the Green Web Foundation, which maintains a list of which domains run on renewable power.
Doing this involves making something like 100k API requests, so I created a scraper to carry out the job and take care of retries, failed requests and so on. You can see it here on GitHub.
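For a rough idea of what that kind of retry handling can look like, here is a minimal Python sketch. It is not the actual scraper: the function names, the backoff values and the `fetch` callable (which stands in for whatever makes the real API request) are all assumptions made for illustration.

```python
import time


def check_with_retries(domain, fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(domain), retrying with exponential backoff on failure.

    Returns (result, None) on success, or (None, last_error) once every
    attempt has failed, so one bad domain never stops the whole run.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(domain), None
        except Exception as error:  # assumption: narrow this to network errors in real use
            last_error = error
            if attempt < max_attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay before the next attempt.
                sleep(base_delay * 2 ** attempt)
    return None, last_error


def check_all(domains, fetch, **retry_options):
    """Check every domain, collecting successes and failures separately."""
    results, failures = {}, {}
    for domain in domains:
        result, error = check_with_retries(domain, fetch, **retry_options)
        if error is None:
            results[domain] = result
        else:
            failures[domain] = error
    return results, failures
```

Keeping the failures in their own dictionary means they can be retried in a later pass, rather than being silently lost somewhere in 100k requests.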
I’ve uploaded the resulting dataset to datbase, partly as an experiment in making it available in a decentralised way, but also partly to try out the workflow for publishing data.
So, now we have some data, let’s see what we can do with it, right?
Doing some analysis and some interesting findings
I have an earlier exploration of the data in a notebook on GitHub, but when working with this data, I’m a bit embarrassed to say I forgot how to use the DataFrame filters to slice the data quickly.
So instead, I’ve used OpenRefine. You could probably store this in a Google spreadsheet too, as 100k rows is big, but not THAT big.
Anyway, what do we see?
There are a few interesting findings just from faceting the data like below in OpenRefine, and sorting by count along a few dimensions:
If you’re not familiar with OpenRefine, I’ll summarise what’s visible in this view:
- Youtube.com is now more popular than google.com. Who knew?
- The top three websites in the world run on renewable power. Huzzah!
- Based on the Green Web Foundation’s data, around 7% of the most popular domains on the web run on renewable power.
- Hetzner AG, a German hosting company, hosts more domains running on green power than Google does.
- Amazon doesn’t appear here at all as a green provider.
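If you would rather stay in code, the same kind of faceting can be sketched with nothing but Python’s standard library. This is a minimal sketch, not what I ran: the column names (`green`, `hosted_by`) and the sample rows are made up for illustration, so adjust them to match the real dataset.

```python
from collections import Counter


def facet_counts(rows, column):
    """Count how often each value of `column` appears, most common first."""
    return Counter(row[column] for row in rows).most_common()


def green_share(rows, column="green"):
    """Return the fraction of rows whose `column` value is truthy."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[column]) / len(rows)


# Made-up rows purely to show the shape of the data; these are not real results.
sample_rows = [
    {"domain": "a.example", "green": True, "hosted_by": "Host A"},
    {"domain": "b.example", "green": False, "hosted_by": ""},
    {"domain": "c.example", "green": True, "hosted_by": "Host A"},
    {"domain": "d.example", "green": True, "hosted_by": "Host B"},
]

# Facet only the green rows by hosting provider, then work out the green share.
green_rows = [row for row in sample_rows if row["green"]]
print(facet_counts(green_rows, "hosted_by"))  # [('Host A', 2), ('Host B', 1)]
print(green_share(sample_rows))  # 0.75
```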
After a slow start, I understood Amazon to be a HUGE player here, and while they have a nice shiny page showing off their windfarms and how much renewable power they use, they also run a load of their servers on coal. That they don’t appear may be an artefact of the Green Web Foundation going by an organisation’s entire power mix to decide whether a company is running on green power or not.
I think I need to check with Rene at the Green Web Foundation to see.
Fancy playing too? Come hang out on slack
This is some pretty superficial analysis, but there are already some interesting nuggets here.
If working with this data sounds interesting to you, let me know in the comments – I’m looking for collaborators on the Planet Friendly Web Guide.
Alternatively, come hang out in the sustainableux.com Slack channel, where there’s a nice little community growing around sustainable web design.
If you prefer email