How to Aggregate Global Data from the Coronavirus Outbreak

April 14, 2020   Sisense

As COVID-19 continues to spread, healthcare groups and companies of all kinds are under pressure to provide care in the face of increasing demand. Healthy Data is your window into how data is helping these organizations address this crisis.

As the rapid spread of COVID-19 continues, data managers around the world are pulling together a wide variety of global data sources to inform governments, the private sector, and the public with the latest on the spread of this disease.

In this article, we discuss how this data is accessed, walk through an example environment and setup for data processing, show sample Python code that demonstrates how simple data transformations are with Pandas, and explain how this architecture can enable you to unlock new insights from the data yourself. Let’s get started.

The importance and impact of reliable data

Reliable data of this scale has helped to promote responsible decision-making in workplaces and communities around the world, including legislative action around international travel, the provision of emergency medical resources, the bolstering of financial markets, support for small business owners and proprietors, and medicine and treatment to those who are infected.  

Households and families are also using this data to prepare for the nuances of everyday life that will change as a result of this virus, whether it be securing a paycheck, buying groceries and essentials, playing with their kids, inviting the neighbors over for a barbecue or walking the dog. 

Having properly prepared data is paramount to the success of our attempts to mitigate and contain the spread of the virus, and the impact of information sharing at this level is truly transformative. But this type of globally aggregated data doesn’t just appear on its own. Reliable sources need to first be discovered, retrieved, parsed, and aggregated before they can be distributed around the world, and never before has that work had to happen in such a short period of time.

Specific challenges involved with data related to the Coronavirus

A reliable source of contagion data for a global event unfolding in real time is not easy to find. The major repositories that cover public health issues and disease outbreaks can be accessed through an API.

Often the data needs to reach the public so quickly after it’s generated that it is first published via more accessible methods, leaving programmatic access via the API for later historical analysis. These methods (PDFs or HTML tables) can be time-consuming to parse and scrape correctly and consistently across a bulk set of examples, yet that wasn’t enough to stop the developer community from getting started.

Since then, a team of researchers affiliated with Johns Hopkins University has been retrieving PDF files posted on the websites of these organizations via programmatic methods and parsing their contents into CSV files stored on a public repository on GitHub. This repository has been starred over 15k times and forked over 7k times. It is being used as source data for engineers around the world to ingest into their data pipelines.

In providing a globally useful dataset, they have connected to over 14 data sources from around the world and aggregated them into their data model. Not all of these sources can be treated the same way; each has specific needs. Some include confirmed cases while others include presumptive cases. Some are Excel spreadsheets or HTML tables while others are location pins on Google Maps. And some are in languages other than English.

The team at JHU has done a fantastic job of performing the bulk of the tricky normalization for us, yet they continue to face challenges every day in conforming to the new shapes, sizes, and formats of data that seem to be coming online exponentially as the virus continues to spread around the world. In the next section, we’ll provide the link to access this data and perform some data cleaning and normalization operations using Python and Pandas!

How to set up your own data environment for analyzing COVID-19 data

Data Access

Data access is provided by Johns Hopkins University. They have built a pipeline that ingests global situation reports from the World Health Organization, translates data from South Korea, China, and Taiwan into English, and pulls from 10+ other global sources that appear to be quite labor-intensive to retrieve (a lot of web scraping).

While some cleaning and normalization has already gone into this dataset, such as converting all timestamps to a UTC time zone and addressing some inconsistencies related to update frequency, there are still plenty of opportunities for us to dive in and focus on cleaning and normalization activities that will unlock real insights.

#   Data Publisher              Scope   Source   Format   Update Cycle   Location
1   Johns Hopkins University*   USA     GitHub   JSON     Daily          https://github.com/CSSEGISandData/COVID-19

* Johns Hopkins University is actively parsing daily situation reports from WHO and integrating them into their data model for the open-source community to access
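
If you just want a quick look before building anything, Pandas can read a single daily report straight from the repository’s raw-file URL. Here is a minimal sketch assuming the repo’s spring-2020 folder layout (check the repository if the paths have since moved):

import pandas as pd

# Raw-file location of the JHU CSSE daily reports (layout as of spring 2020)
base = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
        'master/csse_covid_19_data/csse_covid_19_daily_reports')

# Pandas reads CSVs directly over HTTPS
daily = pd.read_csv(f'{base}/04-10-2020.csv')
print(daily.head())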

Data infrastructure

Here is an example of a simple, cloud-based architecture suitable for rapid deployment of a data pipeline. In the environment created for this article, a Virtual Private Cloud (VPC) containing a Linux EC2 instance, a PostgreSQL database, and an internet gateway was spun up on Amazon Web Services and then connected to an external BI dashboarding tool.

The pipeline manager hosts a Python installation and the Apache Airflow task scheduler (originally developed by Airbnb), which operates the data pipeline. After some simple configuration, Airflow is up and running, executing the Python scripts and writing data to the database every time the source data is updated.
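
As an illustration, a minimal Airflow DAG for this kind of daily schedule might look like the sketch below; the covid_etl module and its run_etl callable are hypothetical stand-ins for the cleaning steps shown in the next section:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from covid_etl import run_etl  # hypothetical module wrapping the steps below

default_args = {
    'owner': 'data-team',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'covid19_daily_reports',
    default_args=default_args,
    start_date=datetime(2020, 3, 1),
    schedule_interval='@daily',  # JHU publishes one report per day
    catchup=False,
)

etl_task = PythonOperator(
    task_id='ingest_clean_write',
    python_callable=run_etl,
    dag=dag,
)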

Meanwhile, the internet gateway allows external BI tools to connect to the data over a trusted connection so the data can be explored visually and reports can be generated to communicate findings to others, all without needing to download a local copy of the data. This architecture allows the data in the database and the downstream reports to stay up to date automatically.

Processing and cleaning the data with Python and Pandas, and writing to a SQL database

Finally, we will walk through a few simple examples of data cleaning and normalization in Python that can be applied to this dataset before inserting it into a SQL table and querying it for valuable insights. The goal is to format the data sources into a common structure for effective bulk processing.

1. Combine data from multiple .csv files and drop any duplicates.

import pandas as pd
import os

# Gather all daily report CSVs from the repository folder
files = [f for f in os.listdir('csse_covid_19_daily_reports') if f.endswith('.csv')]

df = pd.DataFrame()
for filename in files:
    filepath = f'csse_covid_19_daily_reports/{filename}'
    with open(filepath) as f:
        da = pd.read_csv(f)
    df = df.append(da, ignore_index=True, sort=False)

# Remove duplicate rows that can appear across overlapping reports
df.drop_duplicates(inplace=True)

2. Fill null values with empty strings to prevent rows from being removed from subsequent table transformations.

df['Province/State'].fillna('', inplace=True)
df['Country/Region'].fillna('', inplace=True)

3. Sort dates in descending order with most recent at the top.

df.sort_values(['Last Update'], ascending=False, inplace=True)

4. Convert datetimes to date and prepare to group and retrieve the most recent record from each date.

# Keep only the date portion so reports can be grouped by day
df['Last Update'] = df['Last Update'].apply(lambda x: pd.to_datetime(x).strftime('%Y-%m-%d'))

5. Group and aggregate to retrieve the latest report from each day.

# Rows are sorted newest-first, so 'first' picks the latest report per group
df = df.groupby(['Country/Region', 'Province/State', 'Last Update']) \
    .agg({
        'Confirmed': 'first',
        'Deaths': 'first',
        'Recovered': 'first',
        'Latitude': 'first',
        'Longitude': 'first'})

6. Convert floating-point decimal fields to type “int.”

# astype(int) raises on NaN, so fill any remaining gaps with 0 first
for col in ['Confirmed', 'Deaths', 'Recovered']:
    df[col] = df[col].fillna(0).astype(int)

7. Write the data to a SQL database.

df.reset_index(inplace=True)
df.to_sql('cssc_daily_reports', con=connection, index=False, if_exists='replace')
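
The connection object above is assumed to already exist. A minimal sketch using SQLAlchemy to reach the PostgreSQL instance described earlier (the connection string is purely illustrative) could be:

from sqlalchemy import create_engine

# Illustrative URI; substitute your own host, credentials, and database name
connection = create_engine('postgresql://user:password@db-host:5432/covid')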

8. Connect to your SQL database and visualize your newly written data.
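
For example, a BI tool (or Pandas itself) could pull the latest totals per country from the cssc_daily_reports table written in step 7. This query is a sketch and assumes the column names produced by the steps above:

import pandas as pd

# Latest cumulative totals per country, summed across provinces/states
query = """
    SELECT "Country/Region", SUM("Confirmed") AS confirmed
    FROM cssc_daily_reports
    WHERE "Last Update" = (SELECT MAX("Last Update") FROM cssc_daily_reports)
    GROUP BY "Country/Region"
    ORDER BY confirmed DESC;
"""
latest = pd.read_sql(query, con=connection)
print(latest.head(10))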

In this graphic (as of 4/10/2020) we see that China’s cases mostly leveled off after 3-4 weeks, while growth in other countries is still ongoing.

In this graphic (as of 4/10/2020) we see the differences in new cases each day between the United States, Germany, Italy, and Spain.

Fighting fire with data

As we continue to encounter global challenges at this scale, whether it be the coronavirus or another civilization-halting crisis, the importance of data collaboration across countries and state lines should not be underestimated. Not only because data wins debates, but because good data converts talk into consensus and action.

Govind Rajagopalan is a Senior Engineering Manager at Sisense. He has over 15 years of experience across both engineering and management roles with diverse companies, domains, teams, and technologies. He is excited to teach, help his teammates thrive, and have fun improving his craft.
