I had always wanted to get the raw data and analyse it myself. Luckily, at this stage the coronavirus data is quite well organised, and it can be found here: https://github.com/CSSEGISandData/COVID-19. A copy of this data in SQL is also available here: https://www.dolthub.com/repositories/Liquidata/corona-virus/.
After downloading the data from GitHub, I need to combine all the CSVs into one big file, preferably in SQLite. Since the data is still growing, I need to do it in a way that is simple enough to run every day.
I’m going to do this with Python, because it’s free. I’m using Spyder (via Anaconda) as my IDE.
If you haven’t started using Python yet, you should. It is one of the most widely used programming languages in the world today, with a home in web development, data science and, of course, big data.
Here is the code to download the files.
```python
import datetime
import urllib.request
from urllib.error import HTTPError

# Create a date list starting from 22 January 2020
# (the first publicly available daily report)
base = datetime.datetime(2020, 1, 22)
numdays = (datetime.datetime.today() - base).days
dateList = [base + datetime.timedelta(days=x) for x in range(numdays)]

print('Downloading the COVID-19 files...')
for x in dateList:
    try:
        url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
               'csse_covid_19_data/csse_covid_19_daily_reports/'
               + x.strftime("%m-%d-%Y") + '.csv')
        urllib.request.urlretrieve(url, '../csse_covid_19_daily_reports/'
                                   + x.strftime("%m-%d-%Y") + '.csv')
    except HTTPError:
        # The file that triggered the error was not published yet
        print('No file for ' + x.strftime("%m-%d-%Y") + '.csv yet; stopping.')
        break
```
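One thing to watch out for: `urlretrieve` will raise a `FileNotFoundError` if the target folder doesn't exist, so it's worth creating it before the loop runs. A minimal guard (the folder name simply matches the relative path used in the download code):

```python
import os

# Create the download directory if it doesn't already exist;
# exist_ok=True makes this safe to run every day
os.makedirs('../csse_covid_19_daily_reports', exist_ok=True)
```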
Now that we have the CSV files, we need to convert and combine them into our dataframe.
```python
# Converting the CSVs to a pandas dataframe
import pandas

# Collect each day's CSV; any date whose file wasn't downloaded is skipped.
# (DataFrame.append is deprecated in recent pandas, so we gather the frames
# in a list and concatenate them once at the end.)
frames = []
for x in range(numdays):
    try:
        frames.append(pandas.read_csv('../csse_covid_19_daily_reports/'
                                      + dateList[x].strftime("%m-%d-%Y") + '.csv'))
    except FileNotFoundError:
        pass
data = pandas.concat(frames, ignore_index=True)
print("All files are added!")

# Dedupe the file, just in case. Note that drop_duplicates returns a new
# dataframe, so the result has to be assigned back.
data = data.drop_duplicates(keep='first')
```
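Since the goal is to end up with the data in SQLite, the last step is a single `to_sql` call. The database filename `covid19.db` and table name `daily_reports` below are my own choices, and a tiny sample dataframe stands in for the combined `data`:

```python
import sqlite3
import pandas

# A small sample standing in for the combined dataframe built above
data = pandas.DataFrame({
    'Country/Region': ['Mainland China', 'Japan'],
    'Confirmed': [547, 2],
})

# Write the table into a local SQLite file; if_exists='replace' drops and
# recreates it, which keeps the daily re-run simple
conn = sqlite3.connect('covid19.db')
data.to_sql('daily_reports', conn, if_exists='replace', index=False)

# Read it back to confirm the rows landed
check = pandas.read_sql('SELECT * FROM daily_reports', conn)
print(len(check))  # 2
conn.close()
```

With `if_exists='replace'`, re-running the whole script each day simply rebuilds the table from the full set of CSVs, so there is no incremental bookkeeping to maintain.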