
The coronavirus pandemic has shown just how much the world depends not only on accurate and readily available datasets but also on the ability of scientists and data analysts to make sense of that data. All of us are at the mercy of those experts to interpret this data correctly – our lives could quite literally depend on it.
Thankfully we live in a world where the tools are available to allow anyone, with a bit of effort, to learn how to analyse data themselves and not just rely on the experts to tell us what is happening.
The programming language Python, coupled with the pandas data analysis library and the Bokeh interactive visualisation library, provides a robust and professional set of tools for getting data of all sorts into the right format and beginning to analyse it.
Data on the coronavirus pandemic is available from lots of sources including the UK’s Office for National Statistics as well as the World Health Organisation. I’ve been using data from DataHub which provides datasets in different formats (CSV, Excel, JSON) across a range of topics including climate change, healthcare, economics and demographics. You can find their coronavirus related datasets here.
I’ve created a set of resources, which I’ve been using to learn Python and some of its related libraries, and these are available on my GitHub page here. You’ll also find the project which I’ve been using to analyse some of the COVID-19 data around the world here.
The snippet of code below shows how to load a CSV file into a pandas DataFrame – a 2-dimensional data structure, similar to a spreadsheet or SQL table, that can store data of different types in columns.
import pandas as pd

# Note: path, coviddata and alternatives are defined elsewhere in the project
# (data directory, CSV file name and a dict of alternate country names).


# Return COVID-19 info for country, province and date.
def covid_info_data(country, province, date):
    df4 = pd.DataFrame()
    if (country != "") and (date != ""):
        try:
            # Read dataset as a pandas DataFrame
            df1 = pd.read_csv(path + coviddata)
            # Check if country has an alternate name for this dataset
            if country in alternatives:
                country = alternatives[country]
            # Get subset of data for specified country/region
            df2 = df1[df1["Country/Region"] == country]
            # Get subset of data for specified date
            df3 = df2[df2["Date"] == date]
            # Get subset of data for specified province. If none specified but there
            # are provinces the current dataframe will contain all with the first one being
            # country and province as 'NaN'. In that case just select country otherwise select
            # province as well.
            if province == "":
                df4 = df3[df3["Province/State"].isnull()]
            else:
                df4 = df3[df3["Province/State"] == province]
        except FileNotFoundError:
            print("Invalid file or path")
    # Return selected covid data from last subset
    return df4
The first five rows of the DataFrame df1 show the data for the first country (Afghanistan).
         Date  Country/Region  Province/State   Lat  Long  Confirmed  Recovered  Deaths
0  2020-01-22     Afghanistan             NaN  33.0  65.0        0.0        0.0     0.0
1  2020-01-23     Afghanistan             NaN  33.0  65.0        0.0        0.0     0.0
2  2020-01-24     Afghanistan             NaN  33.0  65.0        0.0        0.0     0.0
3  2020-01-25     Afghanistan             NaN  33.0  65.0        0.0        0.0     0.0
4  2020-01-26     Afghanistan             NaN  33.0  65.0        0.0        0.0     0.0
Three further subsets of the data are then made. The final one holds the COVID-19 data for a specific country on a particular date (the UK on 7th May in this case).
             Date  Country/Region  Province/State      Lat    Long  Confirmed  Recovered   Deaths
26428  2020-05-07  United Kingdom             NaN  55.3781  -3.436   206715.0        0.0  30615.0
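The chain of subsets can be tried out without downloading anything. The sketch below is a minimal, self-contained version of the same filtering, run against a tiny in-memory CSV rather than the real file. The names `SAMPLE_CSV` and `select_rows` are made up for this illustration, and apart from the UK row shown above, the rows are placeholder values for a fictional country, not real data.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file. Only the United Kingdom row comes from
# the output shown in this post; the "Exampleland" rows are invented placeholders.
SAMPLE_CSV = """Date,Country/Region,Province/State,Lat,Long,Confirmed,Recovered,Deaths
2020-05-07,United Kingdom,,55.3781,-3.436,206715.0,0.0,30615.0
2020-05-07,Exampleland,,1.0,2.0,10.0,0.0,1.0
2020-05-07,Exampleland,North Province,3.0,4.0,5.0,0.0,0.0
2020-05-08,Exampleland,,1.0,2.0,12.0,0.0,1.0
"""


def select_rows(df, country, province, date):
    """Narrow df down by country, then date, then province (hypothetical helper)."""
    by_country = df[df["Country/Region"] == country]
    by_date = by_country[by_country["Date"] == date]
    if province == "":
        # Empty CSV fields are read as NaN, so the country-level row
        # is the one whose province is null.
        return by_date[by_date["Province/State"].isnull()]
    return by_date[by_date["Province/State"] == province]


df1 = pd.read_csv(io.StringIO(SAMPLE_CSV))
uk = select_rows(df1, "United Kingdom", "", "2020-05-07")
north = select_rows(df1, "Exampleland", "North Province", "2020-05-07")
print(uk)
print(north)
```

Each step is just a boolean mask applied to the previous DataFrame, which is why the original row index (26428 in the real dataset) survives all the way through to the final subset.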
Once the dataset has been obtained the information can be printed in a more readable way. Here’s a summary of information for the UK on 9th May.
Date: 2020-05-09
Country: United Kingdom
Province: No province
Confirmed: 215,260
Recovered: 0
Deaths: 31,587
Population: 66,460,344
Confirmed/100,000: 323.89
Deaths/100,000: 47.53
Percent Deaths/Confirmed: 14.67
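The last three figures in that summary are simple ratios. As a rough sketch of how they might be derived (my project may not do it in exactly this way, and the helper name `per_100k` is made up for this example), using the confirmed, deaths and population figures from the summary above:

```python
def per_100k(count, population):
    """Scale a raw count to a rate per 100,000 people (hypothetical helper)."""
    return count / population * 100_000


# Figures taken from the UK summary for 2020-05-09 above
confirmed = 215_260
deaths = 31_587
population = 66_460_344

confirmed_rate = f"{per_100k(confirmed, population):.2f}"
deaths_rate = f"{per_100k(deaths, population):.2f}"
percent_deaths = f"{deaths / confirmed * 100:.2f}"

print(f"Confirmed: {confirmed:,}")
print(f"Deaths: {deaths:,}")
print(f"Confirmed/100,000: {confirmed_rate}")
print(f"Deaths/100,000: {deaths_rate}")
print(f"Percent Deaths/Confirmed: {percent_deaths}")
```

The `:,` format specifier adds the thousands separators and `:.2f` rounds to two decimal places, which matches the layout of the summary.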
There are lots of ways of analysing this dataset and of displaying the results. Graphs are always a good way of showing information, and Bokeh is a nice, relatively simple to use Python library for creating a range of different graphs. Here’s how Bokeh can be used to create a simple line graph of COVID-19 deaths over a period of time.
from datetime import datetime as dt

from bokeh.plotting import figure, output_file, show


def graph_covid_rate(df):
    # the country name is the same in every row, so take it from the first one
    country = df["Country/Region"].iloc[0]
    # convert the date strings to datetime objects for the x axis
    x = [dt.strptime(date, "%Y-%m-%d") for date in df["Date"]]
    y = list(df["Deaths"])
    # output to static HTML file
    output_file("lines.html")
    # create a new plot with a title and axis labels
    p = figure(title="COVID-19 Deaths for " + country, x_axis_label="Date",
               y_axis_label="Deaths", x_axis_type="datetime")
    # add a line renderer with legend and line thickness
    p.line(x, y, legend_label="COVID-19 Deaths for " + country, line_width=3,
           line_color="green")
    # tilt the date labels (angle in radians) so they don't overlap
    p.xaxis.major_label_orientation = 3/4
    # show the results
    show(p)
Bokeh creates an HTML file of an interactive graph. Here’s the one the above code creates, again for the UK, for the period 2020-02-01 to 2020-05-09.
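To graph a particular period, the DataFrame first has to be cut down to the rows between two dates. The sketch below shows one way this might be done with pandas. The `select_period` helper is a name invented for this example rather than something taken from my project, and the rows are placeholder values for a fictional country, not real data.

```python
import pandas as pd


def select_period(df, start, end):
    """Return the rows of df whose Date lies between start and end, inclusive.

    Hypothetical helper: assumes a "Date" column of YYYY-MM-DD strings.
    """
    dates = pd.to_datetime(df["Date"])
    mask = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return df[mask]


# Placeholder rows for a fictional country (not real figures)
sample = pd.DataFrame({
    "Date": ["2020-01-31", "2020-02-01", "2020-05-09", "2020-05-10"],
    "Country/Region": ["Exampleland"] * 4,
    "Deaths": [0.0, 1.0, 2.0, 3.0],
})

period = select_period(sample, "2020-02-01", "2020-05-09")
print(period)
```

Because the Date column is left as strings in the result, a DataFrame filtered like this can still be parsed by `dt.strptime` in a graphing function such as the one above.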
As a recently retired software architect (who has now started a new career working for Digital Innovators, a company addressing the digital skills gap), coding is still important to me. I’m a believer that Architects Don’t Code is an anti-pattern: design and coding are two sides of the same coin, and you cannot design if you cannot code (and you cannot code if you cannot design). With the vast array of resources available to everyone with just a few clicks and Google searches, these days there really is no excuse not to keep your coding skills up to date.
I also see coding not just as a way of keeping my own skills up to date and of teaching others vital digital skills but also, as this article helpfully points out, as a way of helping to solve problems of all kinds. Coding is a skill for life, and it is vitally important that young people entering the workplace have at least a rudimentary understanding of it, both to help them get a job and to help them understand more of the world in these incredibly uncertain times.