
Tuesday, March 31, 2020

Facts About Coronavirus Disease 2019 (COVID-19) in 5 Charts created with R and ggplot2

Introduction

The coronavirus pandemic is changing our lifestyle, from daily routines to near- and mid-term plans: it affects relationships at home and at work, adjusts our economic priorities and abilities, makes us reassess the value of goods and services, and arguably impacts all aspects of life. Better knowledge and understanding of the disease, its manifestations, and its dynamics must play a critical role in how we assess current events and make decisions. Below I compiled some useful facts about COVID-19 into 5 charts and included a discussion of the R and ggplot2 techniques used to create them.
At the end of 2019, a novel coronavirus was identified as the cause of a cluster of pneumonia cases in Wuhan, a city in the Hubei Province of China. It rapidly spread, resulting in an epidemic throughout China, followed by an increasing number of cases in other countries throughout the world. In February 2020, the World Health Organization designated the disease COVID-19, which stands for coronavirus disease 2019. The virus that causes COVID-19 is designated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); previously, it was referred to as 2019-nCoV.

Understanding of COVID-19 is evolving. This topic will discuss the epidemiology, clinical features, diagnosis, management, and prevention of COVID-19. 
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

Though not all of the topics above are covered in this post, I reserve the right to publish more charts, so stay tuned.

Clinical Features

 

Incubation Period

The incubation period for COVID-19 is thought to be within 14 days following exposure, with most cases occurring approximately four to five days after exposure [29-31].

Using data from 181 publicly reported, confirmed cases in China with identifiable exposure, one modeling study estimated that symptoms would develop in 2.5 percent of infected individuals within 2.2 days and in 97.5 percent of infected individuals within 11.5 days [32]. The median incubation period in this study was 5.1 days.
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

A common way to display quartiles and extreme percentiles of a continuous distribution is a box plot. I chose against it for a couple of reasons: a) the research above had insufficient information about quartiles and b) box plots are less known outside of the statistical community. Instead, a gauge chart, common in dashboard-type applications, was used:




Implementation details in R


Dataset

The dataset consists of 6 rows: 5 percentiles - 0% (minimum), 2.5% and 97.5% (the bounds of the 0.95 interval), 50% (median), 100% (maximum) - plus one more row for the average:

Using factor() places the gauges in order from least to greatest, and the additional column stext is used to display the value for each gauge in a readable format.
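The original dataset listing is not shown here, so below is a minimal sketch of how such a data frame could be built. It includes only the three values quoted in the text above (2.5th percentile, median, 97.5th percentile); the minimum, maximum, and average rows are left out because their values are not given. The data frame name incubation and the column names stat and days are assumptions; stext is the column named above.

```r
# Incubation-period statistics quoted above (days after exposure).
# Only values stated in the text are used; rows are deliberately unsorted
# to show that factor() controls the display order.
incubation <- data.frame(
  stat = c("97.5%", "Median (50%)", "2.5%"),
  days = c(11.5, 5.1, 2.2),
  stringsAsFactors = FALSE
)

# Order factor levels by value so gauges run from least to greatest.
incubation$stat <- factor(
  incubation$stat,
  levels = incubation$stat[order(incubation$days)]
)

# Human-readable label displayed inside each gauge.
incubation$stext <- sprintf("%.1f days", incubation$days)

print(incubation[order(incubation$stat), ])
```

Ordering the factor levels, rather than sorting the rows, is what keeps the facets in the intended order regardless of how the rows are stored.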

Graphics

First, let's load packages used for plotting: ggplot2, ggthemes, and scales:

The gauge chart implementation in ggplot2 is borrowed from this example, with a few changes explained next:

Line by line explainer:
  • 2-4: prepare rectangles for each value. Each gauge is a pair of overlapping rectangles: one displaying the value with geom_rect() and a constant one, geom_rect(aes(ymax=14, ymin=0, xmax=2, xmin=1), fill ="#ece8bd"), as a background.
  • 10: separate gauges by facets.
  • 5, 6: transform the coordinate system to polar, rotate it to start at 9 o'clock, and trim it to display only the upper half of the gauges.
  • 9: place a text label with the value in the middle of each gauge.
  • 7, 8: color scheme from few_pal().
  • 11: remove guides from the chart.
  • 12-15: title, subtitle, caption, and axis labels.
  • 16-19: customization using the ggthemes package and theme().
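The steps in the explainer can be sketched roughly as follows. This is not the original listing (so the line numbers above do not map onto it) but a minimal reconstruction under assumptions: it uses only the three values quoted in the text, the data frame and column names are made up for illustration, and it relies on ggplot2 defaults where the original used few_pal() and theme settings from ggthemes.

```r
library(ggplot2)

# Assumed data: the three incubation statistics quoted in the text (days).
gauges <- data.frame(
  stat = c("2.5%", "Median (50%)", "97.5%"),
  days = c(2.2, 5.1, 11.5),
  stringsAsFactors = FALSE
)
gauges$stat  <- factor(gauges$stat, levels = gauges$stat[order(gauges$days)])
gauges$stext <- sprintf("%.1f days", gauges$days)

p <- ggplot(gauges, aes(fill = stat)) +
  # Background rectangle spanning the full 0-14 day range.
  geom_rect(aes(ymax = 14, ymin = 0, xmax = 2, xmin = 1), fill = "#ece8bd") +
  # Foreground rectangle whose length encodes the value.
  geom_rect(aes(ymax = days, ymin = 0, xmax = 2, xmin = 1)) +
  # Polar coordinates starting at 9 o'clock; a y range of 0-28 means
  # the 0-14 range occupies only the upper half of the circle.
  coord_polar(theta = "y", start = -pi / 2) +
  xlim(c(0, 2)) +
  ylim(c(0, 28)) +
  # Value label in the middle of each gauge.
  geom_text(aes(x = 0, y = 0, label = stext), size = 5) +
  # One gauge per statistic.
  facet_wrap(~stat, ncol = 3) +
  guides(fill = "none") +
  labs(title = "COVID-19 incubation period",
       subtitle = "Days from exposure to symptom onset") +
  theme_void()

print(p)
```

Doubling the y limit relative to the background rectangle (28 against 14) is what trims each gauge to a half circle.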

Illness Severity

The spectrum of symptomatic infection ranges from mild to critical; most infections are not severe [33,35-40]. Specifically, in a report from the Chinese Center for Disease Control and Prevention that included approximately 44,500 confirmed infections with an estimation of disease severity [41]:
  ● Mild (no or mild pneumonia) was reported in 81 percent.
  ● Severe disease (eg, with dyspnea, hypoxia, or >50 percent lung involvement on imaging within 24 to 48 hours) was reported in 14 percent.
  ● Critical disease (eg, with respiratory failure, shock, or multiorgan dysfunction) was reported in 5 percent.
  ● The overall case fatality rate was 2.3 percent; no deaths were reported among noncritical cases.
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

The obvious choice is a bar chart consisting of 4 bars: 3 for the illness severity spectrum plus the case fatality rate reported in the same study:



Implementation details in R


Dataset


The dataset has 4 rows and 4 columns, where severity is a factor() ordered by percent, percent_label is used to display values above the bars, and severity_label details the illness severity:
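A minimal sketch of such a data frame, using the percentages quoted above, could look like this. The column names follow the description; the data frame name illness, the short category names, and the wording of the labels are assumptions.

```r
# Severity spectrum and case fatality rate from the report quoted above.
# Percentages are stored as fractions so they can be formatted later.
illness <- data.frame(
  severity = c("Mild", "Severe", "Critical", "Fatal"),
  percent  = c(0.81, 0.14, 0.05, 0.023),
  severity_label = c(
    "Mild: no or mild pneumonia",
    "Severe: dyspnea, hypoxia, or >50% lung involvement",
    "Critical: respiratory failure, shock, or multiorgan dysfunction",
    "Fatal: overall case fatality rate"
  ),
  stringsAsFactors = FALSE
)

# Order the factor by percent (largest first) so bars appear in that order.
illness$severity <- factor(
  illness$severity,
  levels = illness$severity[order(-illness$percent)]
)

# Label displayed above each bar.
illness$percent_label <- sprintf("%.1f%%", 100 * illness$percent)

print(illness)
```

Note that the first three rows add up to 100 percent while the fourth row (fatal cases) overlaps with the others, so the four bars should not be read as parts of one whole.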


Graphics

This is a simple bar chart using geom_bar() with stat="identity", enhanced with just a couple of extras: geom_text() and annotate():


Line by line explainer:
  • 1-2: bar chart with stat="identity" displaying 4 bars.
  • 3: placing percent labels above bars.
  • 4: displaying y-axis labels in percent format.
  • 5-6: color schema from few_pal() and custom labeling of the legend.
  • 7-8: text annotation about CFR in the middle of the chart.
  • 9-12: title, subtitle, caption, and axis labels.
  • 13-17: customization using ggthemes package and theme().
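The explainer can be sketched as follows. This is an approximation rather than the original listing: the data frame is rebuilt from the percentages quoted in the text, names other than severity, percent, percent_label, and severity_label are assumptions, and default ggplot2 colors and theme_minimal() stand in for the few_pal() palette and ggthemes styling.

```r
library(ggplot2)

# Assumed data, rebuilt from the report quoted in the text.
illness <- data.frame(
  severity = factor(c("Mild", "Severe", "Critical", "Fatal"),
                    levels = c("Mild", "Severe", "Critical", "Fatal")),
  percent = c(0.81, 0.14, 0.05, 0.023),
  severity_label = c("Mild: no or mild pneumonia",
                     "Severe: dyspnea, hypoxia, or >50% lung involvement",
                     "Critical: respiratory failure, shock, or multiorgan dysfunction",
                     "Fatal: overall case fatality rate"),
  stringsAsFactors = FALSE
)
illness$percent_label <- sprintf("%.1f%%", 100 * illness$percent)

p <- ggplot(illness, aes(x = severity, y = percent, fill = severity)) +
  # Bar heights are taken directly from the percent column.
  geom_bar(stat = "identity") +
  # Percent labels just above the bars.
  geom_text(aes(label = percent_label), vjust = -0.5) +
  # y axis in percent format.
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  # Detailed legend labels, given in factor-level order.
  scale_fill_hue(name = NULL, labels = illness$severity_label) +
  # Free-text note in the middle of the chart.
  annotate("text", x = 3, y = 0.5,
           label = "No deaths were reported\namong noncritical cases") +
  labs(title = "COVID-19 illness severity",
       x = NULL, y = "Share of confirmed cases") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.direction = "vertical")

print(p)
```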

Clinical Manifestations 

Pneumonia appears to be the most frequent serious manifestation of infection, characterized primarily by fever, cough, dyspnea, and bilateral infiltrates on chest imaging [32,36-38]. There are no specific clinical features that can yet reliably distinguish COVID-19 from other viral respiratory infections.

In a study describing 138 patients with COVID-19 pneumonia in Wuhan, the most common clinical features at the onset of illness were [38]:
  ●Fever in 99 percent
  ●Fatigue in 70 percent
  ●Dry cough in 59 percent
  ●Anorexia in 40 percent
  ●Myalgias in 35 percent
  ●Dyspnea in 31 percent
  ●Sputum production in 27 percent
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

A bar chart is used again to display the clinical manifestations of COVID-19 at the onset of illness:

Implementation Details in R


Dataset

This is an example of a bar chart requiring a bare minimum of information: just 2 columns, name and percent, to display 7 bars:


Graphics

Once again code below creates a bar chart using stat = "identity":


Line by line explainer:
  • 1-2: bar chart with stat="identity" displaying 7 bars.
  • 3: displaying y-axis labels in percent format.
  • 4: color schema from few_pal().
  • 5-8: title, subtitle, caption, and axis labels.
  • 9-12: customization using ggthemes package and theme().
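As a rough sketch of the dataset and chart described above (not the original code), the seven percentages from the quoted study can be plotted like this. The data frame name symptoms is an assumption, and ggplot2 defaults replace the few_pal() palette and ggthemes styling.

```r
library(ggplot2)

# Two columns are enough: symptom name and share of patients (as a fraction).
symptoms <- data.frame(
  name = c("Fever", "Fatigue", "Dry cough", "Anorexia",
           "Myalgias", "Dyspnea", "Sputum production"),
  percent = c(0.99, 0.70, 0.59, 0.40, 0.35, 0.31, 0.27),
  stringsAsFactors = FALSE
)

p <- ggplot(symptoms,
            aes(x = reorder(name, -percent), y = percent, fill = name)) +
  # One bar per symptom, height taken directly from percent.
  geom_bar(stat = "identity") +
  # y axis in percent format.
  scale_y_continuous(labels = scales::percent) +
  # The x axis already names each bar, so the legend is dropped.
  guides(fill = "none") +
  labs(title = "COVID-19 clinical features at the onset of illness",
       subtitle = "138 patients with COVID-19 pneumonia in Wuhan",
       x = NULL, y = "Share of patients") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

print(p)
```

reorder() sorts the bars by descending percent without changing the data frame itself.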

Case Fatality Rate

According to a joint World Health Organization (WHO)-China fact-finding mission, the case-fatality rate ranged from 5.8 percent in Wuhan to 0.7 percent in the rest of China [17]. Most of the fatal cases occurred in patients with advanced age or underlying medical comorbidities [20,41]. (See 'Risk factors for severe illness' below.)

The proportion of severe or fatal infections may vary by location. As an example, in Italy, 12 percent of all detected COVID-19 cases and 16 percent of all hospitalized patients were admitted to the intensive care unit; the estimated case fatality rate was 7.2 percent in mid-March [42,43]. In contrast, the estimated case fatality rate in mid-March in South Korea was 0.9 percent [44]. This may be related to distinct demographics of infection; in Italy, the median age of patients with infection was 64 years, whereas in Korea the median age was in the 40s.
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

This chart displays CFRs by age group based on 44,672 confirmed cases in China through February 11, with an overall CFR of 2.3%:


Implementation Details in R

Dataset

The data includes age, deaths, cases, and cfr, computed as the ratio of the last two (deaths divided by cases):
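A sketch of such a data frame is shown below. The per-age counts are not given in this post; the numbers here are the age-group figures as I recall them from the China CDC report on 44,672 confirmed cases, so treat them as an assumption and verify them against the original report before reuse. They do add up to the totals quoted in the text. The data frame name cfr_by_age is also an assumption.

```r
# Age-group counts (assumed, recalled from the China CDC report; verify first).
cfr_by_age <- data.frame(
  age    = c("0-9", "10-19", "20-29", "30-39", "40-49",
             "50-59", "60-69", "70-79", "80+"),
  deaths = c(0L, 1L, 7L, 18L, 38L, 130L, 309L, 312L, 208L),
  cases  = c(416L, 549L, 3619L, 7600L, 8571L, 10008L, 8583L, 3918L, 1408L),
  stringsAsFactors = FALSE
)

# Keep age groups in their natural order rather than alphabetical order.
cfr_by_age$age <- factor(cfr_by_age$age, levels = cfr_by_age$age)

# Case fatality rate per age group: deaths divided by cases.
cfr_by_age$cfr <- cfr_by_age$deaths / cfr_by_age$cases

# Overall CFR across all age groups, for comparison with the quoted 2.3%.
overall_cfr <- sum(cfr_by_age$deaths) / sum(cfr_by_age$cases)

print(cfr_by_age)
print(round(100 * overall_cfr, 1))
```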


Graphics

This chart combines bar and line charts into a single plot that shows how the CFR changes across age groups and additionally reflects the size of each group through bar width:


Line by line explainer:
  • 1,2: line chart over CFR by age groups.
  • 3: horizontal dotted line representing the overall case fatality rate.
  • 1,4: bar chart with stat="identity" displaying CFRs for each age group with bar width adjusted to the number of cases in each group.
  • 5,6: placing text labels with the explicit value and calculation of CFR for each age group.
  • 7: displaying y-axis labels in percent format.
  • 8: color scheme from few_tableau().
  • 9-12: title, subtitle, caption, and axis labels.
  • 13-15: customization using ggthemes package and theme().
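One way to sketch the combination of a line, a reference line, and variable-width bars is shown below. It differs from the original in two labeled ways: the bars are drawn with geom_rect() on a numeric axis instead of geom_bar(stat="identity") with an adjusted width, which makes the width calculation explicit, and the age-group counts are assumed (recalled from the China CDC report, not taken from this post). Names such as idx and half_width are hypothetical helpers.

```r
library(ggplot2)

# Assumed age-group counts; verify against the original report before reuse.
cfr_by_age <- data.frame(
  age    = c("0-9", "10-19", "20-29", "30-39", "40-49",
             "50-59", "60-69", "70-79", "80+"),
  deaths = c(0L, 1L, 7L, 18L, 38L, 130L, 309L, 312L, 208L),
  cases  = c(416L, 549L, 3619L, 7600L, 8571L, 10008L, 8583L, 3918L, 1408L),
  stringsAsFactors = FALSE
)
cfr_by_age$cfr <- cfr_by_age$deaths / cfr_by_age$cases
overall_cfr <- sum(cfr_by_age$deaths) / sum(cfr_by_age$cases)

# Numeric position of each age group and a half-width proportional to
# the number of cases (the largest group gets a bar 0.9 units wide).
cfr_by_age$idx <- seq_len(nrow(cfr_by_age))
cfr_by_age$half_width <- 0.45 * cfr_by_age$cases / max(cfr_by_age$cases)

p <- ggplot(cfr_by_age, aes(x = idx, y = cfr)) +
  # Bars: height is the CFR, width reflects group size.
  geom_rect(aes(xmin = idx - half_width, xmax = idx + half_width,
                ymin = 0, ymax = cfr),
            fill = "steelblue", alpha = 0.6) +
  # Line tracing the CFR across age groups.
  geom_line() +
  # Dotted reference line at the overall CFR.
  geom_hline(yintercept = overall_cfr, linetype = "dotted") +
  # Label with the value and the calculation behind it.
  geom_text(aes(label = sprintf("%.1f%%\n(%d/%d)", 100 * cfr, deaths, cases)),
            vjust = -0.3, size = 3) +
  scale_x_continuous(breaks = cfr_by_age$idx, labels = cfr_by_age$age) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "COVID-19 case fatality rate by age group",
       x = "Age group", y = "Case fatality rate") +
  theme_minimal()

print(p)
```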

 

Epidemiology

Period of infectivity

The interval during which an individual with COVID-19 is infectious is uncertain. Most data informing this issue are from studies evaluating viral RNA detection from respiratory and other specimens. However, detection of viral RNA does not necessarily indicate the presence of infectious virus.

Viral RNA levels appear to be higher soon after symptom onset compared with later in the illness [18]; this raises the possibility that transmission might be more likely in the earlier stage of infection, but additional data are needed to confirm this hypothesis.

The duration of viral shedding is also variable; there appears to be a wide range, which may depend on severity of illness. In one study of 21 patients with mild illness (no hypoxia), 90 percent had repeated negative viral RNA tests on nasopharyngeal swabs by 10 days after the onset of symptoms; tests were positive for longer in patients with more severe illness [19]. In another study of 137 patients who survived COVID-19, the median duration of viral RNA shedding from oropharyngeal specimens was 20 days (range of 8 to 37 days) [20].
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD
 
This chart shows the minimum, median, and maximum duration of viral shedding by infected individuals, using bars that resemble timelines:


 

Implementation Details in R

Dataset

This chart uses bars to imitate timelines of the period of infectivity, based on research into how long individuals shed viral RNA, which identified minimum, median, and maximum times:


Graphics

Yet another example of a bar chart, with an additional hack that uses geom_point() layers to display an improvised icon of the SARS-CoV-2 virus:


Line by line explainer:
  • 1,2: bar chart with stat="identity" displaying 3 very thin bars imitating time line. 
  • 3-6: overlaying 3 different point shapes with varying size to improvise virus icon
  • 7,8: text annotation about the difference between being infectious and viral RNA shedding.
  • 9: flipping x and y axis to display time line horizontally.
  • 10-13: title, subtitle, caption, and axis labels. 
  • 14-16: customization using ggthemes package and theme().
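A minimal sketch of this timeline chart follows, using the minimum, median, and maximum shedding durations quoted above (8, 20, and 37 days). It is not the original listing: the data frame name, the specific point shapes, sizes, and colors chosen for the improvised icon are assumptions, and theme_minimal() stands in for the ggthemes styling.

```r
library(ggplot2)

# Duration of viral RNA shedding from the study quoted in the text (days).
# Levels are reversed so that, after coord_flip(), Minimum is on top.
shedding <- data.frame(
  stat = factor(c("Minimum", "Median", "Maximum"),
                levels = c("Maximum", "Median", "Minimum")),
  days = c(8, 20, 37)
)

p <- ggplot(shedding, aes(x = stat, y = days)) +
  # Very thin bars imitate timelines.
  geom_bar(stat = "identity", width = 0.05, fill = "grey40") +
  # Three overlaid point shapes form a crude virus icon at the end of each bar.
  geom_point(shape = 8, size = 10, colour = "firebrick") +
  geom_point(shape = 16, size = 7, colour = "firebrick") +
  geom_point(shape = 1, size = 3, colour = "white") +
  # Duration label to the right of each icon.
  geom_text(aes(label = paste(days, "days")), hjust = -0.8) +
  # Note on what shedding does and does not imply.
  annotate("text", x = 1.5, y = 30, size = 3,
           label = "Detection of viral RNA does not necessarily\nindicate the presence of infectious virus") +
  # Flip axes so the timelines run horizontally.
  coord_flip() +
  scale_y_continuous(limits = c(0, 45)) +
  labs(title = "Duration of viral RNA shedding",
       x = NULL, y = "Days after onset of symptoms") +
  theme_minimal()

print(p)
```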

Conclusions

Most of the facts above are results of very young research on COVID-19, just a little over 3 months old. There are still many unknowns about both the SARS-CoV-2 virus and the disease. To emphasize this, I compiled a few of the unknowns in the bonus chart; some will seem surprising given the wealth of knowledge scientists have accumulated about other similar diseases:



References


Wednesday, March 18, 2020

Survey Results: What Degree is Best for Data Science?

The Survey

These are the results from the survey What Degree is Best for Data Science? (the survey is still open), collected from February 9 through March 12, 2020. The survey asked participants 4 questions:

  • Answers about self:
    • Q1: What is the highest level of school degree you have completed?
    • Q2: Which of the following best describes the field in which you received your highest degree?
  • Answers about best education:
    • Q3: What level of school degree you consider optimal for successful career in data science?
    • Q4: Which field of study you consider optimal for successful career in data science?

During that period 289 respondents participated and 285 successfully completed all 4 questions, so 4 participants with partial answers were removed from analysis below.

Though simple and short (the average completion time was 55 seconds after removing 6 outliers who took over 500 seconds), the survey's questions possess a certain internal structure in time and subject. In time, the questions form 2 groups: one about the education a participant has already acquired and the other about the participant's recommendations for the best education. By subject, the questions form 2 alternative groups: the 1st and 3rd about degree, and the 2nd and 4th about field of study.


Answers to Each Question




 



Bird's-Eye View



 

Sankey Diagrams: How Data Flows

Sankey diagrams help visualize how answers flow through the questions. We start with pairs of related questions and finish with all 4 questions together. 

Completed Degree and Field of Study (Q1, Q2)


Best Degree and Field of Study (Q3, Q4)

Completed Degree vs. Best Degree (Q1, Q3)

Completed Field vs. Best Field (Q2, Q4)

Complete Flow of Answers For All 4 Questions

Concluding comments

The results are self-evident. The survey is still open so anyone who didn't participate can still do so and let others know about it. 

If you haven't noticed yet, there is a certain bias towards statistics in the answers. This might originate from the fact that a significant part of the respondents reached the survey via the R-bloggers distribution, popular among R users (who often have a background in statistics).

Finally, there is another implicit bias: people with a degree in Math are likely to suggest Math as the best field, and so on for other fields and degrees. This sort of bias is evident from the Sankey diagrams above: see the (Q1, Q3) and (Q2, Q4) diagrams. Removing such bias from the results could be useful. I attempted this exercise, but my DIY approach proved too naive, and the methods in the resources I discovered were too extensive to apply in a short period of time. If you have pointers or, even better, a method for removing such bias from answers, I'd love to hear from you.













Friday, February 21, 2020

Survey: What Degree is Best for Data Science?



TL;DR
Just answer 4 questions about best degree for Data Science here:
https://www.surveymonkey.com/r/7FGGWS7

No doubt, when asking the question "What's the best degree for Data Science?" one shouldn't expect a single opinion, or even just a few (unless everything I know about people practicing data science is all wrong). Stephanie Glen analyzed various sources on the topic to show just that:


Source: Best Degree for Data Science (in One Picture)
https://www.datasciencecentral.com/profiles/blogs/best-degree-for-data-science-in-one-picture

Trying to replicate her analysis with answers from data science practitioners, I constructed a 1-minute anonymous survey asking the same: https://www.surveymonkey.com/r/7FGGWS7
There you will find 4 questions: 2 on what degree you have and 2 on what degree you recommend. After collecting 100+ responses I will share the results. Thank you for participating!

Tuesday, February 11, 2020

H2O.ai Academic Program for Professors and Students: Quick Start with Driverless AI and Paperspace

If you are a professor teaching, or a student enrolled in, a machine learning program or a non-technical program with a hands-on machine learning lab, becoming a member of the H2O.ai Academic Program will get you a free software license for non-commercial use in education and research. In November 2018 H2O.ai (my employer) made its ground-breaking automated machine learning (AutoML) platform Driverless AI available to academia for free.

What Does Driverless AI Do?

H2O.ai defines Driverless AI as  
"an artificial intelligence platform for automatic machine learning"
To find out how Driverless AI automates machine learning activities into an integral and repeatable workflow that seamlessly encompasses feature engineering, model validation, hyper-parameter tuning, model selection and ensembles, custom recipes for transformers, models and scorers, automated model documentation, and finally model deployment, visit the User Guide. Not to be forgotten is the MLI (Machine Learning Interpretability) module, which offers tools for both white and black box model interpretability, model debugging, disparate impact analysis, and what-if (sensitivity) analysis.


H2O.ai Academic Program

To sign up for the H2O.ai Academic Program, launched back in October of 2018, start by filling out this form, provided the following conditions hold true:
  • the intended use is non-commercial, for education and research purposes only, and
  • the person belongs to a higher education institution or is a student currently enrolled in a higher education degree program, and
  • if a student, academic status can be verified by sending a photo of your current student ID to academic@h2o.ai (required).
Upon approval H2O.ai will issue a free license for Driverless AI for non-commercial use only. While waiting to be approved, apply for access to the H2O.ai Community Slack channel here (and don't forget to join #academic).

Driverless AI Installation Options

After receiving a license key, follow the installation instructions for Mac OS X or Windows 10 Pro (installing via the WSL Ubuntu option is highly preferred) to run Driverless AI on your workstation or laptop. While such an approach suffices for small datasets, serious problems demand installing and running Driverless AI on modern data center hardware with multiple CPUs and one or several GPUs for best results.

There are several economical cloud providers for such a solution. For general guidelines and instructions for native DEB installation on Linux Ubuntu see here. Steps below can be traced back to this documentation.

Why Paperspace

Paperspace offers a robust choice of configurations to provision and run Linux Ubuntu VMs with a single GPU (no multi-GPU systems available). The pricing appears competitive enough to suit a thrifty academic budget, starting at around $0.50/hour for GPU systems with 30G of memory, which should comfortably host Driverless AI. It also features a simple, streamlined interface to deploy and manage VMs.


Step-by-Step Guide

Spinning up Linux VM

1. Create Paperspace Account

Start with creating account at paperspace.com:


2. Create a Cloud VM

After successfully creating account proceed to create a cloud VM:


3. Start Adding New Machine

Under Core -> Compute -> Machines on the left select (+) to add new machine:


4. Machine Location

Choose region closer to your location - in my case it was "East Coast (NY2)":


5. Choose Operating System Type

Scroll down to "Choose OS" and click on "Linux Templates":


6. Choose OS Version

Keep default Ubuntu 16.04 server image:


7. Pick Machine Type (How Much to Pay)

Scroll down to choose a machine profile (keep the hourly rate): for a GPU VM pick type "P4000" or a more expensive GPU machine type, while for a CPU-only system pick "C6" or higher (if this instance type is not enabled, instructions to enable it should pop up):
 

8. Enable Public IP

Scroll down to "Public IP" to enable it while keeping other settings unchanged, except maybe for "Storage" and "Auto-Shutdown". While 50G of storage suffices for many applications, if you plan on using larger datasets or creating massive numbers of models, increase your storage accordingly: allocate at least 20 times the size of the largest dataset you plan to use. Lastly, change the auto-shutdown timeout according to your needs:


9. Apply 5NXWB5R Promo Code with Payment

Scroll down to payment to enter credit card information, enter promotion code 5NXWB5R to apply (Paperspace should credit your account $10.00) before finally creating VM with "Create Your Paperspace" button:


10. Creating VM

While new system initializes its state appears as "Provisioning":


11. Wait for System to Start

Wait a minute or two until system state changes to "On/Ready" and click on small gear inside the box in upper right corner to move to system console:


12. System Console

System console displays detailed information about VM including public IP address assigned to your VM:


13. Notification from Paperspace

Next find email from Paperspace with system password:
With public IP address and password you can ssh (on Mac OS X or Linux) or connect using putty (on Windows) to Paperspace VM and install Driverless AI software following steps for vanilla Ubuntu system. This example continues with this install to show all steps in detail. 

Installing Prerequisites

14.  Terminal Access to VM

ssh to the Paperspace VM from Mac OS terminal using Public IP and password as shown in steps 12 and 13 (ssh below is used on Mac OS X - for other OSes adjust accordingly):




15. Change paperspace assigned password (optional):





16. Install core packages (optional):



17. Add support for NVIDIA GPU libraries (CUDA 10):


18. Install other prerequisites and open port Driverless AI listens to:



Installing Driverless AI 


19. H2O Download Page

Leave (do not close) ssh terminal for a browser and locate H2O.ai download page. Choose latest version of Driverless AI product:

20. Download Link

Go to the Linux (X86) tab and then right-click on the "Download" link for the DEB package to copy the link location:

21. Back to Terminal Access

Return to the ssh terminal session connected to the Paperspace VM. If the session timed out or became inactive, repeat step 14.

22. Download and install Driverless AI DEB package:



23. Install Completed

After the installer successfully finishes, it displays the following helpful information:


24. Start Driverless AI

Check that Driverless AI is installed but inactive, then start it and check its status and logs once again:


25. Web Access

Open a browser and enter the URL with the public IP address like this: http://209.51.170.97:12345 (ignore 127.0.0.1 in the screenshots, as I was using port forwarding when taking them):


26. License Agreement

Scroll down to accept the license agreement:


27. Login to Driverless AI

Driverless AI displays the login screen - enter credentials h2oai/h2oai:


28. Activate License

Driverless AI prompts you to Enter License to activate the software license:


29. License Key

Enter the Driverless AI license key received by enrolling in the H2O.ai Academic Program and press Save:


30. All Done

Now the Driverless AI platform is fully enabled to help in your research or studies or both:


Resources