Advanced: Get a CSV from an API

This article comes from the Silk Data Handbook. It explains how to get data from a data API into a CSV. Once you ended up with a suitable spreadsheet, you can import the spreadsheet into your Silk site.

A data API lets you automatically request data from a source, which will usually be returned in JSON format (which you can convert to a CSV for use in Silk). Many great sources have APIs for accessing their data: IMDB, Museums, Flickr, etc.

TLDR: We first fill a CSV with API queries, and feed that to a python script to put the results in one big JSON. We use OpenRefine to convert the JSON to CSV.

Find and read the documentation of the API

Most APIs have documentation that explain how to structure a request. For example, the Rijksmuseum API has its documentation on Github. In it, you can learn how to manipulate a URL to get a JSON with the results that you want.

Create a list of API queries and put them in a CSV

Getting a dataset from API usually involves 2 steps: first you create a list with the objects you want information on; then you fabricate an API call (URL) for each object and construct your final dataset with the output. You use the API for both steps.

Sometimes you only need one API call to get the results you want, which is great! Other times you might need to iterate more than twice, in which case you can repeat the steps explained below multiple times.

Here is an API URL we constructed by looking at the Rijksmuseum API documentation. It lists all the artworks in the museum from the 17th century with the word ‘Winter’ in the description:

From the resulting JSON, we want to gather the values that the API uses as an unique identifier. In this case, these are the values labeled as ‘objectNumber’. We will use these later on to construct an API URL for each object, so we can get more detailed information.

To extract the unique values from the JSON, import the file in OpenRefine and delete all the other columns; use a text editor like Sublime Text to find the values; or use the command line tool ‘grep’ (included on Linux/OS X by default).

Use Google Sheets or Excel to create a CSV with a new API URL for each object.

For our Rijksmuseum API example, the URL for information on a single object looks like this:

With a bit of messing around in Google Sheets, you can get a list of URLs that look like this:

Create a sheet with just the URLs in the first column and save the file as ‘list.csv’ somewhere.

Construct your initial dataset

To construct your final dataset, we will use the ‘concatenateAPI.py’ script. Download the script here. Move the list.csv file you created earlier in the folder. Go to the folder in your terminal, and run python concatenateAPI.py. It will concatenate the results for each URL in ‘output.json’. You can now fire up OpenRefine to clean up the data and convert it to a CSV file for use with Silk.