Scraping all the cheese from Wikipedia

6 August 2024

An amazing co-worker created a game called Cheese or Disease in PowerPoint as part of a fun presentation. The game challenges players to tell cheeses and diseases apart. It’s surprisingly hard because both have a lot of random names that catch you out. I really enjoyed it, so I created a digital version the following weekend: cheeseordisease.com

The first thing needed to make this game was (unsurprisingly) a list of cheeses and diseases. Thankfully Wikipedia has both of these.

Unfortunately Wikipedia stores its data as largely unstructured article text, and its API isn’t built for questions like “list every cheese”. So we can’t use it directly, but Wikipedia is just one piece of the Wikimedia empire. Wikimedia has many projects making knowledge available in different ways. The project that can help us with cheese is Wikidata. Wikidata is a free and open knowledge base that stores data in a structured way. It acts as the central store of structured data for the other Wikimedia projects, including Wikipedia.

And the best thing is Wikidata has a query service! https://query.wikidata.org/

The query service uses a query language called SPARQL, which is a bit weird. I don’t fully understand the syntax but I was able to get something working by following their tutorial.

Cheese query

The data in Wikidata is organised into a hierarchy of types and every type has a unique Id. There is a generic cheese type that all the specific types of cheese inherit from. Our cheese query needs to say this: “Give me all items that are a subclass of cheese”.

The Wikidata website is a useful visualization of all the data. Going to the website and finding a cheese shows this hierarchy relationship:

Cheddar cheese in Wikidata

The screenshot shows Cheddar is a subclass of cow’s-milk cheese, which is in turn a subclass of cheese. So there are two levels to traverse to get from cheese to Cheddar.

The cheese query needs the Ids of these things, which can be found on the website. Clicking on the relationship type ‘subclass of’ gives a property Id of P279. Clicking through from ‘cow’s-milk cheese’ to ‘cheese’ gives the item Id of Q10943. With these two Ids all the cheese will be ours!

Here is the working query to return every cheese with its name and a link to its Wikipedia page:

SELECT ?id ?idLabel ?article
WHERE
{
  ?id wdt:P279* wd:Q10943 . # cheese itself plus its subclasses at any depth
  OPTIONAL {
    # English Wikipedia article about the item, if one exists
    ?article schema:about ?id .
    ?article schema:inLanguage "en" .
    FILTER (STRSTARTS(STR(?article), "https://en.wikipedia.org/"))
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

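An important bit of this query is the * after P279. It tells the query to follow the subclass relationship through any number of levels, so we get every descendant of cheese rather than just its direct subclasses. (It also matches zero levels, so the generic cheese item itself is included in the results.)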
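To make the effect of the * concrete, here is a small Python sketch that mimics it on a toy, in-memory copy of the Cheddar hierarchy from the screenshot. It is only an illustration of the idea and not how Wikidata implements it; the dictionary layout and the function names are made up for this example.

```python
from collections import deque

# Toy "subclass of" edges (child -> list of parents), taken from the
# Cheddar example above. The real data lives in Wikidata, not in a dict.
SUBCLASS_OF = {
    "Cheddar": ["cow's-milk cheese"],
    "cow's-milk cheese": ["cheese"],
}


def direct_subclasses(edges, root):
    """Roughly what wdt:P279 (no star) gives: exactly one level below root."""
    return {child for child, parents in edges.items() if root in parents}


def all_subclasses(edges, root):
    """Roughly what wdt:P279* gives: zero or more levels, so root is included."""
    found = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in direct_subclasses(edges, current):
            if child not in found:  # guard against visiting an item twice
                found.add(child)
                queue.append(child)
    return found


print(sorted(direct_subclasses(SUBCLASS_OF, "cheese")))
print(sorted(all_subclasses(SUBCLASS_OF, "cheese")))
```

Without the star only cow's-milk cheese comes back, so Cheddar would never make it into the game.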

And here is the cheese we get back:

The list of cheese returned from Wikidata

Yes this cheese will do nicely… and it’s easy to use the same query for scraping diseases. Just swap out the base class of ‘cheese’ with the Wikidata equivalent for disease. And boom, we have the data for our game.
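For anyone who wants to run this from code rather than the web page, here is a rough Python sketch of the "swap the base class" idea. It assumes the query service's public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results format. The function names, the BASE_CLASS placeholder and the User-Agent string are invented for this example, and the label language is hard-coded to "en" because [AUTO_LANGUAGE] is something the query web page fills in. I believe the disease item is Q12136, but check the item page on the Wikidata website before relying on that.

```python
import json
import re
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Same shape as the cheese query, with a placeholder for the base class Id.
QUERY_TEMPLATE = """
SELECT ?id ?idLabel ?article
WHERE
{
  ?id wdt:P279* wd:BASE_CLASS .
  OPTIONAL {
    ?article schema:about ?id .
    ?article schema:inLanguage "en" .
    FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/")
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_query(base_class_id):
    """Return the query text for one base class, e.g. "Q10943" for cheese."""
    if not re.fullmatch(r"Q\d+", base_class_id):
        raise ValueError(f"not a Wikidata item Id: {base_class_id!r}")
    return QUERY_TEMPLATE.replace("BASE_CLASS", base_class_id)


def parse_bindings(payload):
    """Flatten a SPARQL JSON result into simple dicts for the game."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({
            # Entity URIs end in the item Id, e.g. .../entity/Q10943
            "id": binding["id"]["value"].rsplit("/", 1)[-1],
            "name": binding.get("idLabel", {}).get("value"),
            # The article is OPTIONAL in the query, so it may be missing
            "article": binding.get("article", {}).get("value"),
        })
    return rows


def fetch_subclasses(base_class_id):
    """Run the query against the live service (needs network access)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": build_query(base_class_id)})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "cheese-or-disease-example/0.1",  # hypothetical value
    })
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))


# Example usage (left commented out so nothing hits the network on import):
# cheeses = fetch_subclasses("Q10943")
# diseases = fetch_subclasses("Q12136")  # assumed Id for disease, verify first
```

Keeping the query building and the result parsing separate from the network call makes it easy to try both out on saved responses.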