Coronavirus/COVID-19 raw data

I had always meant to get hold of the raw data and try analysing it myself. Luckily, the coronavirus data is quite well organised at this stage, and it can be found here: https://github.com/CSSEGISandData/COVID-19. A copy of this data in SQL is also available here: https://www.dolthub.com/repositories/Liquidata/corona-virus/.

After downloading the data from GitHub, I need to combine all the CSV files into one big file, preferably in SQLite. Since the data is still growing, I need to do it in a way that is simple enough to run every day.

I’m going to do this with Python, because it’s free. I’m using Spyder as my IDE, installed through Anaconda.

If you haven’t started using Python yet, you should. It is one of the most widely used programming languages in the world today, in web development and of course big data.

Here is the code to download the files.

import urllib.request
import datetime
from urllib.error import HTTPError


#create a date list from 22nd of January 2020 (the first publicly available data)
base = datetime.datetime(2020,1,22)
numdays = (datetime.datetime.today() - base).days

dateList = [base + datetime.timedelta(days=x) for x in range(numdays)]
#print( dateList[0].strftime("%m-%d-%Y"))

print('Downloading the COVID-19 files...')

for x in dateList:
    #each daily report is named MM-DD-YYYY.csv
    filename = x.strftime("%m-%d-%Y")+'.csv'
    try:
        url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+filename
        #the ../csse_covid_19_daily_reports/ folder must already exist
        urllib.request.urlretrieve(url, '../csse_covid_19_daily_reports/'+filename)
    except HTTPError:
        #usually a 404, meaning this day's report has not been published yet
        print('No more files. First missing file: '+filename)
        break
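Since this has to run every day, one option is to skip the files that are already on disk instead of downloading everything again. Here is a minimal sketch of that idea. The helper name `dates_to_download` is made up for illustration, and it assumes a daily report never changes once it has been published, which may not hold for this repository.

```python
import datetime
import os


def dates_to_download(start, end, folder):
    """Return the dates in [start, end) whose MM-DD-YYYY.csv is not in folder yet.

    Hypothetical helper, not part of the original script.
    """
    numdays = (end - start).days
    wanted = [start + datetime.timedelta(days=x) for x in range(numdays)]
    return [d for d in wanted
            if not os.path.exists(os.path.join(folder, d.strftime("%m-%d-%Y") + '.csv'))]
```

To use it, loop over `dates_to_download(base, datetime.datetime.today(), '../csse_covid_19_daily_reports/')` in the download step, and keep the full `dateList` for the combine step that follows.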

Now that we have the CSV files, we need to convert and combine them into our dataframe.

#Converting the CSVs to pandas dataframe
import pandas

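#read every downloaded file into a list of dataframes
frames = []
for x in dateList:
    try:
        frames.append(pandas.read_csv('../csse_covid_19_daily_reports/'+x.strftime("%m-%d-%Y")+'.csv'))
    except FileNotFoundError:
        #the download loop stopped at the first missing date, so nothing later exists
        print("All files are added!")
        break

#DataFrame.append was removed in pandas 2.0, so combine with concat instead
data = pandas.concat(frames, ignore_index=True, sort=False)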

#dedupe the file, just in case (drop_duplicates returns a new dataframe, so keep the result)
data = data.drop_duplicates(keep='first')
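The plan at the top was to land everything in SQLite, and the code above stops at the dataframe. Below is a rough sketch of that last step using pandas `to_sql` with the standard `sqlite3` module. The function name `save_to_sqlite`, the table name `daily_reports` and the database path in the usage line are all my own placeholders, not something from the data source.

```python
import sqlite3

import pandas


def save_to_sqlite(frame, db_path, table='daily_reports'):
    """Write the dataframe to a SQLite table, replacing any earlier copy.

    Hypothetical helper; the table name is a placeholder.
    """
    con = sqlite3.connect(db_path)
    try:
        #if_exists='replace' drops and recreates the table, so a daily re-run does not pile up rows
        frame.to_sql(table, con, if_exists='replace', index=False)
        con.commit()
    finally:
        con.close()
```

Calling `save_to_sqlite(data, '../covid19.db')` after the dedupe step would rebuild the table from scratch on every run, which keeps the daily job simple at the cost of rewriting the whole file each time.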
