JavaScript Loaders

Monday, April 13, 2020

Academic Program for Professors and Students: Part 2 - Creating Your First (Time Series) Experiment

Part 1 of this blog series discussed how to:
  1. apply for a free academic license of the automated machine learning (AutoML) platform Driverless AI,
  2. spin up a VM with the budget-oriented cloud provider Paperspace to host Driverless AI,
  3. install Driverless AI on the VM, including configuration that utilizes the powerful GPUs available on Paperspace.
In part 2 we'll show how to:
  1. upgrade Driverless AI on a Linux VM
  2. organize a modeling workflow in Driverless AI
  3. manipulate and load datasets
  4. perform automated data exploration
  5. create models to forecast time series
  6. analyze a time series model created with Driverless AI

Back to Paperspace VM

At the end of part 1 we had a fully functional instance of Driverless AI, but
  • it was likely stopped by Paperspace due to inactivity, and
  • a new version of H2O Driverless AI, 1.8.5, has been released since.
So we begin by starting the Paperspace VM and upgrading the H2O software to the latest version; please adjust the steps below to your specific circumstances and a possibly newer version of Driverless AI.

1. Starting VM in Paperspace

When you log back in to Paperspace and go to the console (under Core -> Compute), it will show the VM you created in the state "Off". Next, click anywhere on the machine box:

The next screen displays the machine terminal view, including a button to start the VM:

After pressing the Start machine button, wait for the terminal window to appear, indicating that the VM restarted successfully:

2. Upgrading Driverless AI

Locate the original email Paperspace sent when you created the VM in part 1; it contains the ssh command and password (unless you changed it since). You can use either the web-based terminal window shown above or a terminal application such as macOS Terminal to ssh in (I prefer the latter as it allows easy copy and paste on macOS):

At this point we can upgrade to release 1.8.5 (the latest version at the time of this writing). To locate the installer, point your browser to the Driverless AI download page and click the button for the latest stable version (1.8 LTS at this time):

This brings you to the download page for version 1.8.5 (or later). Make sure the Linux (X86) tab (the first tab) is displayed and copy the location of the installer file by right-clicking the Download link corresponding to the DEB Ubuntu 16.04/Ubuntu 18.04 option:

Go back to the terminal window, enter the wget command, and paste the file location:

After wget finishes downloading, perform the upgrade of Driverless AI with these five commands (don't forget to replace the placeholder file name with your version; given how frequently H2O does releases, you will likely be installing a newer one):
sudo systemctl stop dai
sudo dpkg -i dai_<version>_amd64.deb
sudo nvidia-smi -pm 1
sudo systemctl daemon-reload
sudo systemctl start dai
After executing the commands above, the terminal screen should look similar to this:

Test that the upgrade took place and Driverless AI is running by pointing your browser to the public IP address of your VM on port 12345:

Using the same credentials as in part 1, step 24 (h2oai/h2oai), log in to Driverless AI and go to Resources -> System Info to confirm that the parameters of your system match the Paperspace spec:

Because the disk size I used is rather small, 57% of the disk is already in use. To free up space I recommend removing the installer file(s) used to install Driverless AI:
rm *.deb

This freed up over 10 GB of space in my case.

3. Troubleshooting 

Sometimes Driverless AI doesn't start successfully, which manifests as the browser being unable to establish a connection. In that case, enable persistence mode for the Nvidia GPUs and restart Driverless AI with these commands:
sudo nvidia-smi -pm 1
sudo systemctl stop dai

sudo systemctl start dai
Ultimately, the goal is to run nvidia-smi -pm 1 each time the system starts. One way to accomplish this is with the cron utility, by adding a task that executes each time the system restarts:
  1. run sudo crontab -e to edit the crontab file containing cron tasks for root
  2. if using it for the first time, crontab prompts you to pick an editor
  3. add as the first line (after the comments):
    @reboot nvidia-smi -pm 1


4. Preparing Data

We prepared data to demonstrate how to experiment and analyze models with Driverless AI. Departing from trivial examples like Titanic or other ML "Hello, World!" types, a COVID-19 theme made sense. But it won't be modeling the exponential growth curve of COVID-19 cases, which, while important and extremely powerful, is already covered on the H2O blog in Modeling Currently Infected Case by COVID-19 Using H2O Driverless AI by Marios Michailidis. To complement that analysis, let's look into forecasting demand for certain product groups. We prepared data with the package gtrendsR, utilizing Google Trends as a proxy for demand for popular products during the COVID-19 crisis:

The dataset contains daily Google Trends search interest for products represented by keywords (serving as a proxy for real demand) in the United States, Canada, and Great Britain (majority English-speaking), from 2019-12-28 through 2020-04-04 (before breaking it up, which is discussed next):

Lastly, two datasets are created: one for training, containing data since 2020-01-01 except for the last week, and a test set made up of that last week (2020-03-31 through 2020-04-06), exactly 7 days long. We allocated one week of test data because our model will forecast the next 7 days of demand, and having test data spanning exactly the same time period is ideal.
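The split logic can be sketched in base R; the synthetic `trends` frame below is only a stand-in for the real dataset, assuming the column names date, geo, keyword, and hits used later in the post:

```r
# A sketch of the date-based train/test split, using a synthetic
# stand-in for the Google Trends dataset (columns as in the post).
dates  <- seq(as.Date("2020-01-01"), as.Date("2020-04-06"), by = "day")
trends <- expand.grid(date    = dates,
                      geo     = c("US", "CA", "GB"),
                      keyword = c("milk", "bread"))
set.seed(42)
trends$hits <- sample(0:100, nrow(trends), replace = TRUE)

# The last 7 days form the test set; everything earlier is training data.
cutoff <- max(trends$date) - 6
train  <- trends[trends$date <  cutoff, ]
test   <- trends[trends$date >= cutoff, ]
```

The two frames would then be written out as product_demand_train.csv and product_demand_test.csv.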

In later installments of this series we'll show how to create a dataset with Google Trends inside Driverless AI using data recipes.

Getting Started with Driverless AI

5. Navigating Driverless AI

Several unofficial rules of Driverless AI will guide us throughout this post, starting with how to organize the flow of activities:
An unofficial rule #1:

Always try to follow the same flow of actions as the order of the navigation tabs on top, from left to right:

The tabs that lead you throughout Driverless AI workflow are:
  1. Datasets: displays and manages datasets ingested into Driverless AI system (action: ingesting and preparing data);
  2. AutoViz: displays and manages list of automated visualization dashboards (one per dataset, action: automated exploratory data analysis);
  3. Experiments: displays and manages list of Driverless AI experiments (multiple experiments per dataset, action: creating machine learning models);
  4. Diagnostics: displays and manages list of model diagnostic dashboards (multiple diagnostics per experiment and dataset possible, action: analyzing model performance);
  5. MLI: displays and manages list of Machine Learning Interpretability dashboards (usually one explanation dashboard per experiment, action: explaining and interpreting models);
  6. Deployments: displays and manages list of model deployments (multiple deployments per experiment possible depending on environment, action: deploying models).
To better understand the flow and relationships between Driverless AI artifacts review the following diagram:
Thus, a complete Driverless AI workflow consists of ingesting a dataset, exploring it with AutoViz, creating a model (experiment) trained on a dataset, diagnosing a model, exploring a model in MLI, and finally deploying it.

5. Loading Data

In Driverless AI go to the Datasets tab and click on Add Datasets, then pick the Upload File option:

The browser will open a file picker where you can choose one or multiple files at once:

This will trigger the upload, auto-parsing, and saving of the Google Trends data files from your local machine to Driverless AI storage:

Alternatively, since both files are also available from a public S3 bucket, you can use the Amazon S3 import option by entering the S3 URL: s3://h2o-public-test-data/smalldata/timeSeries/

Driverless AI will warn you but still let you create duplicate datasets if you confirm your intent. Besides wasting disk space, duplicates may introduce confusion down the road, so choose either uploading from your machine or importing from S3, but not both (feel free to try different options, though - you can always delete extra datasets after all).

6. Dataset Details

An unofficial rule #2:
Check after dataset auto-parsing to confirm that data imported as expected, e.g. column names, data types, missing values, etc.
The main reason for checking after Driverless AI is not that its auto-parsing functionality is lacking, but much simpler: there is always ambiguity in data that may result in multiple acceptable formats and/or data types. For example, a categorical column represented with only numeric characters usually parses as a numeric data type. Dataset Details lets the user both review and correct data type decisions made by auto-parsing: in the Datasets tab, click on product_demand_train.csv to see the available actions: Details, Visualize, Split, Predict, Rename, Download, and Delete:

After choosing Details, Driverless AI displays the Dataset Details view:

This view contains summary statistics and a distribution plot for each column. It also offers ways to alter data types and formats to correct auto-parsing, as suggested by rule #2 - one example is backtracking from a numeric type to a string (or categorical) type for values containing only numeric characters.

7. Analyzing Data with AutoViz

An unofficial rule #3:
Never build a model without visualizing data in AutoViz first.
For advanced exploration, choose the next option in the dataset menu - Visualize:

Driverless AI will take you to Visualizations tab:

Click on product_demand_train.csv to display the AutoViz dashboard that contains different types of advanced visualizations selected for the dataset:

So what just happened? By choosing Visualize we triggered a fully automated process that runs statistical tests, unsupervised models, and anomaly detection, then selects interesting observations (leaving out trivial ones), and finally compiles them into a visual dashboard. This workflow is called AutoViz in Driverless AI and is characterized by:
  1. univariate analysis of dataset features: outliers, skewness, spiky distributions, gaps in distributions;
  2. multivariate analysis of the dataset: correlations (including between numeric and categorical features), varying boxplots, heteroscedastic boxplots, biplots, multivariate outliers, k-means clustering, 1-NN, SVD and more;
  3. qualifying which results to include, for example, correlated scatterplots include pairs of features with squared Pearson's r greater than 0.95;
  4. aggregating the data to display larger points: "the bigger the point is, the bigger number of exemplars (aggregated points) the plot covers".
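To make point 3 above concrete, here is a small base R illustration of the squared-Pearson screening rule (the 0.95 threshold comes from the list above; the data is synthetic):

```r
# Two synthetic feature pairs: one nearly collinear, one only loosely related.
set.seed(1)
x      <- rnorm(200)
strong <- 2 * x + rnorm(200, sd = 0.1)
weak   <- x + rnorm(200, sd = 2)

# A pair qualifies for a correlated scatterplot when squared Pearson's r
# exceeds the threshold.
qualifies <- function(a, b, threshold = 0.95) cor(a, b)^2 > threshold
```

Here `qualifies(x, strong)` passes the screen while `qualifies(x, weak)` does not.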
For details, please read the Visualizing Datasets chapter in the docs. We leave AutoViz with an illustration of the k-means clustering analysis, which found 23 clusters (and no multivariate outliers) in the training dataset, displayed as a Parallel Coordinates Plot:

8. Starting Experiment

To begin the AutoML workflow, go to the Datasets tab, click on product_demand_train.csv, then choose Predict:

Driverless AI will switch to a supervised machine learning experiment, prompting you to select a target.
Select hits first and observe that the other options get filled in automatically:

What just happened? Driverless AI:
  1. determined that because the target variable hits is an integer with over 100 unique values, the problem type is regression;
  2. set RMSE as the optimization metric;
  3. detected ~10K rows and 4 features in the training data;
  4. set default settings of accuracy 7, time 2, and interpretability 8. The higher the accuracy (1-10), the more effort is invested to reach better results. The higher the time (1-10), the longer the experiment spends searching for the best transformations and hyperparameters. The higher the interpretability (1-10), the less complex the models and transformations used;
  5. finally, displayed a high-level plan for the experiment pipeline:
    • all training data will be used (no sampling);
    • algorithms to try: Decision Tree, LightGBM, and XGBoost;
    • models and validation schema used during feature evolution phase: 3-fold cross-validation;
    • final model and validation schema to train it on: 6 model ensemble trained on 3-fold cross-validation;
    • feature evolution genetic algorithm phase configuration in terms of the number of individuals and generations (iterations): 8 individuals, 48 iterations;
    • early stopping for feature evolution: 5 iterations of no improvement;
    • any constraints on features: monotonicity constraint and pre-pruning of features based on permutation importance are enabled;
    • number of models to train for:
      • target transform tuning: 36, 
      • model and feature tuning: 192, 
      • feature evolution: 288, and
      • final model: 6;
    • estimated runtime in minutes (usually a very crude estimate);
    • the experiment will auto-finish after 1 day and auto-abort after 7 days.
At this point, if the experiment setup is complete, you can safely start it by pressing the Launch Experiment button. Which brings up
An unofficial rule #4:
When creating your first model, always use the default (or "lower") settings in Driverless AI.
The word lower is quoted because the interpretability setting moves model performance in the opposite direction - from 10 ("lowest") to 1 ("highest"); you can think of it as
 complexity = 11 - interpretability
to have all 3 settings move consistently.

If the data were indeed i.i.d., then the regression setup above would be enough to start experimenting, per the last rule.

9. Time Series Model Setup

But the Google Trends data is a time series problem, and Driverless AI supports it with its Time Series Lag-Based Recipe, so we continue with the experiment setup:

Extra steps to set up the time series model:
  1. set the dataset product_demand_test.csv as Test. This could (better, should - see rule #5 below) have been done for regression or other types of models as well, but in the case of the lag-based time series recipe it has an additional important role: indicating how far into the future we want model predictions to reach (the forecast horizon below).
  2. set the column date as the Time Column, which effectively makes the experiment time series and triggers display of the Time Series Settings on the right side of the screen.
  3. set the columns geo and keyword as Time Groups Columns (TGC) to identify the multiple time series by state (geo) and keyword in the data. This is arguably the most powerful feature of the Driverless AI approach, as it allows a single model to forecast on multiple time series, with access to both individual time series and aggregated data.
  4. the time column is automatically parsed to determine the time dimension, interval, and periodicity, including a proposed forecast horizon based on the time span of the test set (see 1.).
  5. set the scorer to MAE, one of the standard metrics for time series model performance.
  6. observe the new values of the settings - 8/4/8 - and feel free to change them to "lower" values if you like (remember rule #4).
  7. finally, review a few notable changes in the experiment pipeline:
    • the validation schema switched to 4 time-based validation splits (time-based splits are always necessary when a time column is defined, even if the time series lag-based recipe is disabled in Expert Settings);
    • LightGBM and XGBoost are the only models used;
    • new lag-based transformations were added: Lags, EwmaLags, LagsAggregates, and LagsInteraction;
    • a greater number of models will be created due to the increase in accuracy and time settings.
While working through experiment setup we relied on a few more conventions. 
An unofficial rule #5:
Always strive to assign a test dataset in experiment.
Test data (or holdout) is never used during training of the modeling pipeline, which means the final model is the same with or without it (except for Kaggle mode in Expert Settings, disabled by default). Driverless AI computes test predictions to provide an estimate of the generalization (out-of-sample) error at the very end.
An unofficial rule #6:
When creating your first model, avoid using Expert Settings unless absolutely necessary for the experiment setup.
Seldom is an option in Expert Settings necessary for model setup. One example is when data is non-i.i.d. (time dependent), so a time column is set, but the time series lag-based recipe doesn't apply and should be disabled in Expert Settings. The Google Trends dataset is a multiple time series problem that Driverless AI can comfortably handle without advanced customizations to start. That doesn't mean that certain tweaks in Expert Settings - especially inside its Time Series tab - won't come in handy later.
An unofficial rule #7:
Do not leave TGC set to AUTO; rather, set the column or columns identifying multiple time series (i.e. TGC) explicitly.
While Driverless AI is certainly capable of recognizing TGC (the columns that group data into multiple time series) automatically, by setting TGC yourself you eliminate the slightest chance of uncertainty.

10. Time Series Model

Launch the experiment by pressing the namesake button, then wait a couple of minutes while reading and clearing experiment notifications (notifications are always available for review via the Notifications link above the CPU/Memory timeline) so you can observe the current state of the experiment workflow:

When the experiment completes Driverless AI displays final model:

Completed experiment view consists of:
  1. Experiment setup;
  2. Evolution pipeline displaying models generated during feature and model tuning and selection; 
  3. Top variables in terms of feature transformations selected during evolution;
  4. Experiment summary;
  5. Available actions:
    • Deploy to a cloud or locally
    • Interpret the model (MLI)
    • Diagnose the model
    • Score on another dataset
    • Transform on another dataset
    • Download predictions (training or test)
    • Download Python scoring pipeline
    • Download MOJO scoring pipeline
    • Visualize scoring pipeline
    • Download summary and logs
    • Download Autoreport (Auto Documentation)

11. Time Series Model Analysis

Each action deserves its own post, so we focus on model analysis with MLI. Because the experiment used the lag-based time series recipe, Driverless AI will engage a special flavor of MLI for time series. Press the Interpret This Model button to see Driverless AI start processing and computing predictions, their errors, Shapley values, and more:

MLI for time series comes in handy to visualize each time series per group and compare predictions with actuals. When model interpretation completes, it displays the MLI view, including:
  • MAE Time Series plot: errors across the validation (holdout) and forecast horizon, averaged over all time series;
  • Test metrics for the top 5 and bottom 5 groups (time series identified by their TGC values);
  • Actual vs. predicted plot across the holdout and forecast horizon, plus actuals for the training period, for any choice of time series entered by its TGC values as shown:

12. Shapley Values

MLI for time series is a powerful diagnostic tool for analyzing multiple time series models. But it also includes Shapley values that go beyond diagnostics to explain the key factors (features) contributing to each prediction. For example, enter the TGC values US,milk to display that time series and click on the peak value in the forecast horizon period as shown below:

Driverless AI displays a Shapley values bar chart of feature contributions for the prediction on that date. For example, as shown above, on the 4th of April the biggest impact came from the feature representing a 7-day lag (TargetLag:date:geo:keyword.5). You can continue changing dates to see how contributions shift and features become more or less impactful across the forecast horizon timeline. Remember that this analysis is specific to the time series for location US with keyword milk, so switching to a different time series may produce similar or drastically different results for the same Driverless AI model.

13. What's Next?

The model we just created can be considered a baseline. The next step is iterating over experiments to achieve a higher score by means of:
  1. increasing accuracy setting;
  2. increasing time setting;
  3. lowering interpretability setting (increasing complexity);
  4. adjusting Expert Settings.
An unofficial rule #8:
When iterating, apply "atomic" changes to the next experiment so that an increase or decrease in model performance can be attributed to the single factor associated with that "atomic" change.
For example, if you increase accuracy and/or time, keep all other parameters intact. Likewise, if you adjust a certain parameter in Expert Settings, keep the rest of the parameters the same. Then, whether a model gets better (or worse, or stays the same), only that "atomic" change could be responsible for the effect. Embrace the change if performance improved, or discard it if not.

What constitutes an "atomic" change? The easy answer is a change to a single setting or parameter, but it could also be a set of related parameters that work together - typical examples are the accuracy and time settings, or switching algorithms on/off in Expert Settings.

Happy experimenting!

Tuesday, March 31, 2020

Facts About Coronavirus Disease 2019 (COVID-19) in 5 Charts created with R and ggplot2


The coronavirus pandemic is changing our lifestyle, from daily routine to near- and midterm plans, affecting relationships at home and work, adjusting our economic priorities and abilities, making us reassess the value of goods and services, and arguably impacting all aspects of life. Better knowledge and understanding of the disease, its manifestations and dynamics, must play a critical role in assessing current events and the decisions we make. Below I compiled some useful facts about COVID-19 into 5 charts and included a discussion of the R and ggplot2 techniques used to create them.
At the end of 2019, a novel coronavirus was identified as the cause of a cluster of pneumonia cases in Wuhan, a city in the Hubei Province of China. It rapidly spread, resulting in an epidemic throughout China, followed by an increasing number of cases in other countries throughout the world. In February 2020, the World Health Organization designated the disease COVID-19, which stands for coronavirus disease 2019. The virus that causes COVID-19 is designated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); previously, it was referred to as 2019-nCoV.

Understanding of COVID-19 is evolving. This topic will discuss the epidemiology, clinical features, diagnosis, management, and prevention of COVID-19. 
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

Though not all of the topics above are covered in this blog, I reserve the right to publish more charts, so stay tuned.

Clinical Features


Incubation Period

The incubation period for COVID-19 is thought to be within 14 days following exposure, with most cases occurring approximately four to five days after exposure [29-31].

Using data from 181 publicly reported, confirmed cases in China with identifiable exposure, one modeling study estimated that symptoms would develop in 2.5 percent of infected individuals within 2.2 days and in 97.5 percent of infected individuals within 11.5 days [32]. The median incubation period in this study was 5.1 days.
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

A common approach to displaying quartiles and extreme percentiles of a continuous distribution is the box plot. I chose against it for a couple of reasons: a) the research above had insufficient information about quartiles, and b) box plots are less known outside the statistical community. Instead, a gauge chart, common in dashboard-style applications, was used:

Implementation details in R


The dataset consists of 6 rows corresponding to 5 percentiles - 0% (minimum), 2.5% and 97.5% (corresponding to a 0.95 confidence interval), 50% (median), 100% (maximum) - plus one more row for the average:

Using factor() places the gauges in order from least to greatest, and the additional column stext is used to display a value in readable format for each gauge.
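A sketch of such a dataset in base R (the 2.2, 5.1, and 11.5 day values come from the quoted study and 14 days from the quoted upper bound; the minimum and average here are illustrative placeholders):

```r
# Six gauges: five percentiles plus the average. The minimum and average
# values are hypothetical placeholders for illustration only.
incubation <- data.frame(
  label = c("minimum", "2.5%", "median", "average", "97.5%", "maximum"),
  days  = c(1.0, 2.2, 5.1, 5.5, 11.5, 14.0)
)
# factor() with levels ordered by value places gauges from least to greatest
incubation$label <- factor(incubation$label,
                           levels = incubation$label[order(incubation$days)])
# stext holds a readable display value for each gauge
incubation$stext <- paste(incubation$days, "days")
```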


First, let's load packages used for plotting: ggplot2, ggthemes, and scales:

I borrowed the realization of gauge charts in ggplot2 from this example, with a few changes explained next:

Line by line explainer:
  • 2-4: prepare rectangles for each value. Each gauge is a pair of overlapping rectangles - one displaying the value with geom_rect(), overlaid on a constant one, geom_rect(aes(ymax=14, ymin=0, xmax=2, xmin=1), fill ="#ece8bd"), as a background.
  • 10: separate gauges by facets.
  • 5, 6: transform the coordinate system to polar, rotate it to start at the 9 o'clock position, and trim it to display only the upper half of the gauges.
  • 9: place a text label with the value in the middle of each gauge.
  • 7, 8: color schema from few_pal().
  • 11: remove guides from the chart.
  • 12-15: title, subtitle, caption, and axis labels.
  • 16-19: customization using the ggthemes package and theme().

Illness Severity

The spectrum of symptomatic infection ranges from mild to critical; most infections are not severe [33,35-40]. Specifically, in a report from the Chinese Center for Disease Control and Prevention that included approximately 44,500 confirmed infections with an estimation of disease severity [41]:
  ● Mild (no or mild pneumonia) was reported in 81 percent.
  ● Severe disease (eg, with dyspnea, hypoxia, or >50 percent lung involvement on imaging within 24 to 48 hours) was reported in 14 percent.
  ● Critical disease (eg, with respiratory failure, shock, or multiorgan dysfunction) was reported in 5 percent.
  ● The overall case fatality rate was 2.3 percent; no deaths were reported among noncritical cases.
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

The obvious choice is a bar chart consisting of 4 bars - 3 for the illness severity spectrum plus the case fatality rate reported in the same study:

Implementation details in R


A dataset with 4 rows and 4 columns, where severity is a factor() ordered by percent, percent_label is used to display values above the bars, and severity_label details the illness severity:


This is the case of a simple bar chart using geom_bar() with stat="identity", enhanced with just a couple of artifacts: geom_text() and annotate():
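A minimal ggplot2 sketch of this chart (percent values from the quoted report; the few_pal() styling and theming of the original code are omitted, and the bar labels are simplified):

```r
library(ggplot2)
library(scales)

# 4 bars: three severity levels plus the overall case fatality rate,
# as a factor ordered by percent (descending).
severity <- data.frame(
  severity = factor(c("Mild", "Severe", "Critical", "Case fatality"),
                    levels = c("Mild", "Severe", "Critical", "Case fatality")),
  percent  = c(0.81, 0.14, 0.05, 0.023)
)
severity$percent_label <- percent(severity$percent, accuracy = 0.1)

p <- ggplot(severity, aes(x = severity, y = percent, fill = severity)) +
  geom_bar(stat = "identity") +
  # percent labels above the bars
  geom_text(aes(label = percent_label), vjust = -0.5) +
  scale_y_continuous(labels = percent) +
  # annotation in the middle of the chart
  annotate("text", x = 2.5, y = 0.5,
           label = "No deaths were reported among noncritical cases") +
  labs(title = "COVID-19 Illness Severity", x = NULL, y = NULL)
```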

Line by line explainer:
  • 1-2: bar chart with stat="identity" displaying 4 bars.
  • 3: placing percent labels above bars.
  • 4: displaying y-axis labels in percent format.
  • 5-6: color schema from few_pal() and custom labeling of the legend.
  • 7-8: text annotation about CFR in the middle of the chart.
  • 9-12: title, subtitle, caption, and axis labels.
  • 13-17: customization using ggthemes package and theme().

Clinical Manifestations 

Pneumonia appears to be the most frequent serious manifestation of infection, characterized primarily by fever, cough, dyspnea, and bilateral infiltrates on chest imaging [32,36-38]. There are no specific clinical features that can yet reliably distinguish COVID-19 from other viral respiratory infections.

In a study describing 138 patients with COVID-19 pneumonia in Wuhan, the most common clinical features at the onset of illness were [38]:
  ●Fever in 99 percent
  ●Fatigue in 70 percent
  ●Dry cough in 59 percent
  ●Anorexia in 40 percent
  ●Myalgias in 35 percent
  ●Dyspnea in 31 percent
  ●Sputum production in 27 percent
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

We continue using a bar chart to display the clinical manifestations of COVID-19 at the onset of illness:

Implementation Details in R


This is an example of a bar chart requiring a bare minimum of information - just 2 columns, name and percent, to display 7 bars:


Once again, the code below creates a bar chart using stat = "identity":

Line by Line explainer:
  • 1-2: bar chart with stat="identity" displaying 7 bars.
  • 3: displaying y-axis labels in percent format.
  • 4: color schema from few_pal().
  • 5-8: title, subtitle, caption, and axis labels.
  • 9-12: customization using ggthemes package and theme().

Case Fatality Rate

According to a joint World Health Organization (WHO)-China fact-finding mission, the case-fatality rate ranged from 5.8 percent in Wuhan to 0.7 percent in the rest of China [17]. Most of the fatal cases occurred in patients with advanced age or underlying medical comorbidities [20,41]. (See 'Risk factors for severe illness' below.)

The proportion of severe or fatal infections may vary by location. As an example, in Italy, 12 percent of all detected COVID-19 cases and 16 percent of all hospitalized patients were admitted to the intensive care unit; the estimated case fatality rate was 7.2 percent in mid-March [42,43]. In contrast, the estimated case fatality rate in mid-March in South Korea was 0.9 percent [44]. This may be related to distinct demographics of infection; in Italy, the median age of patients with infection was 64 years, whereas in Korea the median age was in the 40s.
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

This chart displays CFRs by age group based on 44,672 confirmed cases in China through February 11, with an overall CFR of 2.3%:

Implementation Details in R


The data includes age, deaths, cases, and cfr computed as the ratio of the last two:


This chart combines bar and line charts into a single plot reflecting the CFR dynamic over age groups, and additionally reflects the size of these groups using bar width:
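A sketch of the combined chart, assuming ggplot2; the age groups, death counts, and case counts below are illustrative placeholders, with cfr = deaths / cases as described:

```r
library(ggplot2)
library(scales)

# Illustrative age-group data; cfr is the ratio of deaths to cases.
cfr_data <- data.frame(
  age    = c("0-39", "40-49", "50-59", "60-69", "70-79", "80+"),
  deaths = c(18, 38, 130, 309, 312, 208),
  cases  = c(8925, 8571, 10008, 8583, 3918, 1408)
)
cfr_data$cfr <- cfr_data$deaths / cfr_data$cases

p <- ggplot(cfr_data, aes(x = age, y = cfr, group = 1)) +
  # bar width proportional to the number of cases in each age group
  geom_bar(aes(width = cases / max(cases)), stat = "identity",
           fill = "steelblue", alpha = 0.5) +
  # line chart of CFR over age groups
  geom_line() + geom_point() +
  # dotted horizontal line for the overall case fatality rate
  geom_hline(yintercept = sum(cfr_data$deaths) / sum(cfr_data$cases),
             linetype = "dotted") +
  scale_y_continuous(labels = percent) +
  labs(title = "Case Fatality Rate by Age Group",
       x = "Age group", y = "CFR")
```

Mapping width inside aes() is the usual ggplot2 trick for variable-width bars; it may emit a warning about an unknown aesthetic but still sizes the bars by case counts.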

Line by line explainer:
  • 1,2: line chart over CFR by age groups.
  • 3: horizontal dotted line representing the overall case fatality rate.
  • 1,4: bar chart with stat="identity" displaying CFR's for each age group with adjusted bar width based on number of cases in each group.
  • 5,6: placing text labels with explicit value and calculation of CFR for each age group.
  • 7: displaying y-axis labels in percent format.
  • 8: color schema from few_tableau().
  • 9-12: title, subtitle, caption, and axis labels.
  • 13-15: customization using ggthemes package and theme().



Period of infectivity

The interval during which an individual with COVID-19 is infectious is uncertain. Most data informing this issue are from studies evaluating viral RNA detection from respiratory and other specimens. However, detection of viral RNA does not necessarily indicate the presence of infectious virus.

Viral RNA levels appear to be higher soon after symptom onset compared with later in the illness [18]; this raises the possibility that transmission might be more likely in the earlier stage of infection, but additional data are needed to confirm this hypothesis.

The duration of viral shedding is also variable; there appears to be a wide range, which may depend on severity of illness. In one study of 21 patients with mild illness (no hypoxia), 90 percent had repeated negative viral RNA tests on nasopharyngeal swabs by 10 days after the onset of symptoms; tests were positive for longer in patients with more severe illness [19]. In another study of 137 patients who survived COVID-19, the median duration of viral RNA shedding from oropharyngeal specimens was 20 days (range of 8 to 37 days) [20].
         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD
This chart shows the minimum, median, and maximum duration of viral shedding by infected individuals, using bars resembling timelines:


Implementation Details in R


This chart uses bars to imitate timelines of the period of infectivity, based on research into how long individuals shed viral RNA, which identified minimum, median, and maximum times:


Yet another example of a bar chart, with an additional hack using geom_point() to display an improvised icon of the SARS-CoV-2 virus:

Line by line explainer:
  • 1,2: bar chart with stat="identity" displaying 3 very thin bars imitating a timeline.
  • 3-6: overlaying 3 different point shapes of varying size to improvise a virus icon.
  • 7,8: text annotation about the difference between being infectious and viral RNA shedding.
  • 9: flipping x and y axis to display time line horizontally.
  • 10-13: title, subtitle, caption, and axis labels. 
  • 14-16: customization using ggthemes package and theme().


Most of the facts above are the results of very young COVID-19 research - just a little over 3 months old. There are still many unknowns about both the virus SARS-CoV-2 and the disease. To emphasize this, I compiled a few of the unknowns into the bonus chart below - some will seem surprising given the wealth of knowledge scientists have accumulated about other similar diseases: