Finding, tracking and organising datasets isn’t as difficult as you might think. Here’s our simple guide.

Gemma Ritchie
Media Hack
Published in
5 min readSep 22, 2022

--

Data isn’t always readily available in a clean dataset waiting to be analysed and visualised. But that’s not always the case.

Sometimes there is no data. Or there is information available but not in a spreadsheet, such as academic papers, government white papers or statements in PDF. So, here are some steps to build a dataset so you don’t (hopefully) end up pulling your hair out.

You will need:

  • a hypothesis you want to test;
  • a spreadsheet (Google Sheets or Excel);
  • a variable (or a set of variables) you want to track;
  • a means to keep your dataset up-to-date (news alerts, newsletters, social media, press releases).

A hypothesis

I wanted to find out what was happening in by-elections in South Africa following the local government elections on 1 November 2021, and see whether by-elections were the result of ward councillors dying or resigning from their post. I also wanted to see if one of South Africa’s nine provinces had more by-elections than others.

The Independent Electoral Commission (IEC) supplied data on when the by-election happened, who the previous incumbent was and who had replaced them in each ward.

But the dataset supplied by the IEC didn’t explain why the by-elections took place. However, a few weeks before the by-elections, the IEC would share the reasons for them— without naming the ward councillor — in their press releases. These reasons included death, resignations, political parties terminating memberships and court orders — as in when a court set aside the election results.

I knew what information I wanted to track and where to find it. Now, I needed to organise the information into a dataset to track it over time.

A spreadsheet

There are several ways to keep track of information. My favourite is Google Sheets. It is free and great for collaboration. You can use these same ideas with Microsoft Excel or if you are old-school pen-and-paper.

In a clean spreadsheet, I created several columns to answer the questions from my hypothesis using the journalist crib list: when something happened, where it happened, what happened and why it happened and who was involved.

A set of variables

My first column was my date column: “when”. In this column, I noted the date of the by-election, which is important for fact-checking and keeping your data organised.

Then I had my three “where” columns, which answered where the by-election took place, including the province, the municipality and its unique alpha-numeric code. The alpha-numeric code, though not necessary for general visualisations, like bar charts, is useful if you need to make maps and merge your dataset with GeoJson files.

The what column came next. In my first “what” column, I had a column called “brief description” where I described the events that happened — as whether the by-election was won or lost by the incumbent party and the reason for the by-election.

Then came the “who” columns. Because I wanted to track the incoming parties of the ward councillors, I had the columns “new appointee”, “new party”, “previous party”, and “previous appointee”. I kept track of the ward councillor appointees as the data was available because it made it easier to monitor if a ward councillor changed parties.

And then, to summarise the data I was collecting on whether the party stayed in power or not, I had a column called “party retained”, which I filled in with either yes or no.

But to write a story you have to contextualise an issue. People always want to know the reason — the why — something happened. In this case, why did by-elections happen less than a year following the 2021 local government elections?

To answer this, I added a column called “reason for by-election”. It was one or two-word summaries, such as death, resignation, court order or party removal. In the case of death, since not all the ward councillors died from natural causes, as I found while researching my story, I added the column “reasons for death”, which included the variables: murder (shot at leaving the office or home, or while driving), illness (either sudden or a long battle), car crash and unknown (where there is no information in news articles, social media posts or from statements by the party on the nature of the councillor’s death).

I also had several columns called “source” where I placed links to the information I found. These columns are crucial resources if you need to double-check the data you put into your dataset.

One of the things I learned from making this dataset is that you need to keep your variable names the same throughout the process. Otherwise, it will make it difficult to analyse your data. I made a tab called “about this data”, which helped me remember what details went into my columns and how I phrased variables in my dataset. This will also make it easier for others to use your dataset.

If you want to see a part of the dataset I have been working on, see the Municipality Tracker.

An example of what the spreadsheet looks like

A means to keep up-to-date

You now have your dataset set up. How do you keep it up-to-date? One of the easiest ways to keep track of information is to set up a Google Alert. It will send you web pages, news articles and video links when your keywords are identified. You can have it send you an update once a day, a week or each time it sees an update. Preferably, go for the once-a-day or once-a-week option; otherwise, you might overwhelm your inbox.

The second option is to monitor certain accounts using Twitter lists which you can check throughout the day using Twitter or Tweetdeck. If you have some coding skills, you can put together a Twitter scraper, or if you have a budget, you can pay for a Twitter scraper application like Tweet Archiver, a Google Sheets Extension which is easy to use and costs $4.95 a month. But for the most part, the Google Alerts are effective.

Setting up the structure for your dataset shouldn’t take long, and it is fluid. Over time, you may add more columns. Just keep your information organised so you can use it quickly and painlessly!

So why did South Africa have 28 by-elections by July? Most of the by-elections were because a ward councillor had died. You can read my full story here: Two-thirds of by-elections since November 2021 were because a councillor died.

Note, after publishing, this piece was lightly edited for clarity.

Media Hack Collective’s The Outlier publishes a data journalism newsletter every two weeks. Read the latest issue and subscribe here.

--

--