JavaScript Loaders

Tuesday, December 25, 2018

Finally, You Can Plot H2O Decision Trees in R

Creating and plotting decision trees (like the one below) for models created in H2O is the main objective of this post:
Figure 1. Decision Tree Visualization

Decision Trees with H2O

With release 3.22.0.1, H2O-3 (a.k.a. open source H2O or simply H2O) added support for one more tree-based algorithm to its family (which already included DRF, GBM, and XGBoost): Isolation Forest (a random forest for unsupervised anomaly detection). Until then there was no simple way to visualize H2O trees except the clunky (albeit reliable) method of creating a MOJO object and running a combination of Java and dot commands.

That changed in 3.22.0.1 too, with the introduction of a unified Tree API that works with any of the tree-based algorithms above. Data scientists can now utilize powerful visualization tools in R (or Python) without resorting to intermediate artifacts like MOJO and external utilities. Please read this article by Pavel Pscheidl, who did a superb job of explaining the H2O Tree API and its S4 classes in R, before coming back here to take it a step further and visualize the trees.

The Workflow: from Data to Decision Tree

Whether you are still here or came back after reading Pavel's excellent post, let's set the goal straight: create a single decision tree model in H2O and visualize its tree graph. With H2O there is always a choice between using Python or R - the choice of R here will become clear when discussing its graphical and analytical capabilities later.

CART models operate on labeled data (classification and regression) and offer arguably unmatched model interpretability by means of analyzing a tree graph. In data science there is never a single way to solve a given problem, so let's define an end-to-end logical workflow from "raw" data to a visualized decision tree:
Figure 2. Workflow of tasks in this post

One may argue that the choice of executing steps inside H2O or R could be different, but let's follow the outlined plan for this post. The next diagram adds implementation details:
  • R package data.table for data munging
  • H2O grid for hyper-parameter search
  • H2O GBM for modeling a single decision tree
  • H2O Tree API for tree model representation
  • R package data.tree for visualization
Figure 3. Workflow of tasks in this post with implementation details



Discussion of this workflow continues for the rest of this post.

Titanic Dataset

The famous Titanic dataset contains information about the fate of the passengers of the RMS Titanic, which sank after colliding with an iceberg. It regularly serves as a toy dataset for blog exercises like this one.

The H2O public S3 bucket holds the Titanic dataset readily available, and the package data.table makes loading it into R a fast one-liner:
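The original listing is not shown here, but the one-liner looks roughly like this (the S3 path below is the commonly used location of the Titanic file in H2O's public bucket - adjust it if the dataset has moved):

```r
library(data.table)

# Load the Titanic dataset straight from H2O's public S3 bucket (URL assumed)
titanic <- fread("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
```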




Data Engineering

Passenger features from the Titanic dataset are discussed at length online; see, e.g., Predicting the Survival of Titanic Passengers and Predicting Titanic Survival using Five Algorithms. To summarize, the following features were selected and engineered for the decision tree model:
  • survived indicates if a passenger survived the wreck
  • boat and body leak the survival outcome and were dropped completely before modeling
  • name and cabin are too noisy as they are and were only used to derive new features
  • title is parsed from name
  • cabin_type is parsed from cabin
  • family_size and family_type are derived from a combination of the count features sibsp (siblings + spouse) and parch (parents + children)
  • ticket and home.dest are dropped to preserve the simplicity of the model
  • missing values in age and fare are imputed using target encoding (mean) over grouping by the survived, sex, and embarked columns.
The data load and data munging steps above are implemented in R using data.table:
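A sketch of those munging steps with data.table follows; the title regex, the cabin_type rule, and the family_type binning are illustrative choices, not necessarily the post's exact logic:

```r
library(data.table)

# Drop leaky and overly specific columns
titanic[, c("boat", "body", "ticket", "home.dest") := NULL]

# Derive title from name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
titanic[, title := gsub("^.*?,\\s*([^.]+)\\..*$", "\\1", name)]
# Derive cabin_type from the deck letter of cabin ("" -> NA)
titanic[, cabin_type := ifelse(cabin == "", NA_character_, substr(cabin, 1, 1))]
titanic[, c("name", "cabin") := NULL]

# Family features from the sibsp and parch counts
titanic[, family_size := sibsp + parch + 1]
titanic[, family_type := cut(family_size, c(0, 1, 4, Inf),
                             labels = c("single", "small", "large"))]

# Impute age and fare with group means over survived, sex, and embarked
titanic[, age  := ifelse(is.na(age),  mean(age,  na.rm = TRUE), age),
        by = .(survived, sex, embarked)]
titanic[, fare := ifelse(is.na(fare), mean(fare, na.rm = TRUE), fare),
        by = .(survived, sex, embarked)]
```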



Starting with H2O

Creating models with H2O requires running a server process (remote or local) and a client (package h2o in R, available from CRAN), where the latter connects and sends commands to the former. The Tree API was introduced with release 3.22.0.1 (10/26/2018), but due to CRAN policies the h2o package usually lags several versions behind (at the time of this writing CRAN hosted version 3.20.0.8). There are two ways to work around this:
  1. Install and run the package available from CRAN and use strict_version_check=FALSE inside h2o.connect() to communicate with a newer version running on a server
  2. Or install the latest version of h2o available from the H2O repository, either to connect to a remote server or to both connect and run a server locally.
The Tree API is available only with the 2nd option because it requires access to new classes and functions in the h2o package (remember, I asked you to read Pavel's blog). The code below, from the official H2O download page, shows how to download and install the latest version of the package:
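The snippet below follows the pattern from the official H2O download page; the "latest_stable_R" repository path is what that page typically publishes, so check it for the current URL:

```r
# Remove any previously installed h2o R package
if ("package:h2o" %in% search()) detach("package:h2o", unload = TRUE)
if ("h2o" %in% rownames(installed.packages())) remove.packages("h2o")

# Install the latest stable h2o build straight from the H2O repository
install.packages("h2o", type = "source",
                 repos = "https://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")

library(h2o)
h2o.init()  # start a local H2O server, or use h2o.connect() for a remote one
```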


Building Decision Tree with H2O

While H2O offers no dedicated single decision tree algorithm, there are two approaches using its ensemble models, each restricted to a single tree:
  1. GBM with ntrees = 1
  2. DRF (random forest) with ntrees = 1 and mtries set to the total number of features

Choosing the GBM option requires one less line of code (no need to calculate the number of features to set mtries), so it was used for this post. Otherwise both ways result in the same decision tree, and the steps below are fully reproducible using h2o.randomForest() instead of h2o.gbm().

Decision Tree Depth

When building single decision tree models, maximum tree depth stands out as the most important parameter to pick. Shallow trees tend to underfit by failing to capture important relationships in the data, producing similar trees despite varying training data (error due to high bias). On the other hand, trees grown too deep overfit by reacting to noise and slight changes in the data (error due to high variance). Tuning the H2O model parameter max_depth, which limits decision tree depth, aims at balancing the effects of bias and variance. Using H2O to split the data and tune the model, then visualizing results with ggplot to look for the right value, unfolds in R like this:
  1. split the Titanic data into training and validation sets
  2. define a grid search object with the parameter max_depth
  3. launch the grid search on GBM models and the grid object to obtain AUC values (model performance)
  4. plot grid model AUC values vs. max_depth to determine the "inflection point" where AUC growth stops or saturates (see plot below)
  5. register the tree depth value at the inflection point to use in the final model
The code below implements these steps:
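A sketch of steps 1-5 follows. Forcing GBM to behave like a plain decision tree (one tree, no row or column sampling, learning rate 1) is the key trick; the exact parameter ranges and seeds are choices of this sketch:

```r
library(h2o)
library(ggplot2)
h2o.init()

titanicHex <- as.h2o(titanic, destination_frame = "titanic")
titanicHex$survived <- as.factor(titanicHex$survived)

# 1. split into training and validation sets
splits <- h2o.splitFrame(titanicHex, ratios = 0.8, seed = 42)

# 2.-3. grid search over max_depth on single-tree GBM models
grid <- h2o.grid("gbm", grid_id = "titanic_depth_grid",
                 x = setdiff(names(titanicHex), "survived"), y = "survived",
                 training_frame = splits[[1]], validation_frame = splits[[2]],
                 ntrees = 1, sample_rate = 1, col_sample_rate = 1,
                 learn_rate = 1,
                 hyper_params = list(max_depth = 1:15))

# 4. collect validation AUC per depth and plot the trend
sortedGrid <- h2o.getGrid("titanic_depth_grid", sort_by = "auc", decreasing = TRUE)
aucByDepth <- data.frame(
  max_depth = as.numeric(sortedGrid@summary_table$max_depth),
  auc       = as.numeric(sortedGrid@summary_table$auc))
ggplot(aucByDepth, aes(max_depth, auc)) + geom_point() + geom_line()
```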

and produces a chart that points to an inflection point at a maximum tree depth of 5:

Figure 4. Visualization of AUC vs. maximum tree depth hyper-parameter trend
extracted from the H2O grid object after running grid search in H2O.
Marked inflection point indicates when increasing maximum tree depth
no longer improves model performance on validation set

 

Creating Decision Tree

As evident from Figure 4, the optimal decision tree depth is 5. The code below constructs a single decision tree model in H2O and then retrieves its tree representation with the Tree API function h2o.getModelTree(), which creates an instance of the S4 class H2OTree and assigns it to the variable titanicH2oTree:
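A sketch of that final model and the Tree API call (the frame name titanicHex and the split ratio are assumptions carried over from the grid-search step):

```r
library(h2o)

# Reproduce the training/validation split (names and seed assumed)
parts <- h2o.splitFrame(titanicHex, ratios = 0.8, seed = 42)

# A single decision tree: a GBM with one tree, no shrinkage or sampling
titanicGbm <- h2o.gbm(x = setdiff(names(titanicHex), "survived"),
                      y = "survived",
                      training_frame = parts[[1]],
                      validation_frame = parts[[2]],
                      ntrees = 1, sample_rate = 1, col_sample_rate = 1,
                      learn_rate = 1,
                      max_depth = 5,  # the inflection point from Figure 4
                      seed = 42)

# Tree API: fetch the first (and only) tree as an H2OTree S4 object
titanicH2oTree <- h2o.getModelTree(model = titanicGbm, tree_number = 1)
```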


At this point all the action moves back to R, with its unparalleled access to analytical and visualization tools. So before navigating and plotting a decision tree - the final goal of this post - let's have a brief intro to networks in R.


Overview of Network Analysis in R

R offers arguably the richest functionality when it comes to analyzing and visualizing network (graph, tree) objects. Before taking on the task of conquering it, spend some time with a couple of comprehensive articles describing the vast landscape of tools and approaches available: Static and dynamic network visualization with R by Katya Ognyanova and Introduction to Network Analysis with R by Jesse Sadler.

To summarize, there are two commonly used packages to manage and analyze networks in R: network (part of the statnet family) and igraph (a family in itself). Each package implements a namesake class to represent network structures, so there is significant overlap between the two, and they mask each other's functions. The preferred approach is to pick only one of the two: it appears that igraph is more common for general-purpose applications, while network is preferred for social network and statistical analysis (my subjective assessment). And while researching these packages, do not forget about the package intergraph, which seamlessly transforms objects between the network and igraph classes. (And this analysis stopped short of expanding into the universe of R packages hosted on Bioconductor.)

When it comes to visualizing networks, the choices quickly proliferate. Both network and igraph offer graphical functions that use the R base plotting system, but it doesn't stop there. The following packages specialize in advanced visualizations for at least one (or both) of the classes:
  • ggraph
  • ggnet2
  • ggnetwork
  • visNetwork
  • DiagrammeR
  • networkD3

Finally, there is the package data.tree, designed specifically to create and analyze trees in R. It fits the bill of representing and visualizing decision trees perfectly, so it became the tool of choice for this post. Still, visualizing H2O model trees could be fully reproduced with any of the network and visualization packages mentioned above.

Visualizing H2O Trees

In the last step, a decision tree for the model created by GBM moved from H2O cluster memory to an H2OTree object in R by means of the Tree API. The H2OTree object now contains the necessary details about the decision tree, but not in a format understood by R packages such as data.tree.

To fill this gap, the function createDataTree(H2OTree) was created; it traverses a tree and translates it from H2OTree into data.tree, accumulating information about decision tree splits and predictions into node and edge attributes of the tree:
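A sketch of such a traversal follows. The slot names come from the H2OTree S4 classes (H2ONode, H2OSplitNode, H2OLeafNode) described in Pavel's post; the node naming and edge-label choices here are illustrative, not necessarily the post's exact function:

```r
library(data.tree)

# Translate an H2OTree (from h2o.getModelTree) into a data.tree structure
createDataTree <- function(h2oTree) {
  root <- h2oTree@root_node
  dataTree <- Node$new(root@split_feature)
  dataTree$type <- "split"
  addChildren(dataTree, root)
  dataTree
}

addChildren <- function(dtree, node) {
  if (!is(node, "H2OSplitNode")) return(invisible(NULL))

  for (side in c("left", "right")) {
    child <- if (side == "left") node@left_child else node@right_child
    if (is.null(child)) next

    if (is(child, "H2OLeafNode")) {
      # Leaf: name by prediction, with the node id to keep sibling names unique
      childDt <- dtree$AddChild(paste0(round(child@prediction, 3), " (", child@id, ")"))
      childDt$type <- "leaf"
    } else {
      childDt <- dtree$AddChild(paste0(child@split_feature, " (", child@id, ")"))
      childDt$type <- "split"
    }

    # Edge label: numeric threshold or the categorical levels sent this way
    childDt$edgeLabel <- if (!is.na(node@threshold)) {
      paste(if (side == "left") "<" else ">=", round(node@threshold, 3))
    } else {
      lv <- if (side == "left") node@left_levels else node@right_levels
      paste(lv, collapse = ", ")
    }
    addChildren(childDt, child)
  }
  invisible(NULL)
}
```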


Finally, everything is lined up and ready for the final step of plotting the decision tree:
  • a single decision tree model created in H2O
  • its structure made available in R
  • and translated to a specialized data.tree object for network analysis.
Styling and plotting data.tree objects is built around the rich functionality of the DiagrammeR package. For anything that goes beyond simple plotting, read the documentation here, but also remember that for plotting, data.tree takes advantage of:
  • the hierarchical nature of tree structures
  • GraphViz attributes to style graph, node, and edge properties
  • and dynamic callback functions (in this example GetEdgeLabel(node), GetNodeShape(node), GetFontName(node)) to customize the tree's look and feel
The following code produces this moderately customized decision tree for our H2O model:
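A sketch of that plotting code follows; the callback names match those mentioned above, while their bodies (shapes, fonts, layout direction) are illustrative styling choices:

```r
library(data.tree)  # plotting delegates to DiagrammeR/GraphViz

# Dynamic callbacks referenced in the post (bodies here are illustrative)
GetEdgeLabel <- function(node) node$edgeLabel
GetNodeShape <- function(node) switch(node$type, split = "diamond", leaf = "oval")
GetFontName  <- function(node) "Palatino"

titanicDataTree <- createDataTree(titanicH2oTree)

SetEdgeStyle(titanicDataTree, fontname = "Palatino-italic",
             label = GetEdgeLabel, labelfloat = TRUE)
SetNodeStyle(titanicDataTree, fontname = GetFontName, shape = GetNodeShape)
SetGraphStyle(titanicDataTree, rankdir = "LR", dpi = 70)

plot(titanicDataTree)
```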


Figure 5. H2O Decision Tree for Titanic Model Visualized in R using data.tree package

     


Sunday, April 1, 2018

Surviving Shelter: Analysis of Time Spent and Outcome in Dallas Animal Shelters

In the previous post we discovered Dallas Animal Services data sources (available on Dallas Open Data) and analyzed how animals get admitted to and discharged from the city shelters. We loaded actual shelter records and looked at the types of admittance, the different outcomes, and their relationships. In this post we continue this analysis by focusing on the time animals spend in the shelters and the factors that favor or hinder the survival of dogs there. For consistency and representativeness, only the admission types Confiscated, Owner Surrender, and Stray and the outcomes Adoption, Died, Euthanized, Returned to Owner, and Transfer were included. Dead on Arrival was excluded from survival analysis because it preempts the outcome before the stay in a shelter begins.

Time Spent in Shelters

Comparing the distributions of time spent in a shelter for cats and dogs reveals both similarities and differences:



Both distributions are bimodal with relatively fat tails, but they differ in how the major modes compare to the minor ones. As Wikipedia rightly notes, "a bimodal distribution most commonly arises as a mixture of two different unimodal distributions", and dissecting the data by admission and outcome types opens the door to further discovery:



If the former histogram used facets to plot cats and dogs separately, the latter plot switched to dodged bars to pack more information into less space. Some interesting observations:

  • Confiscated admissions have a distinctively different profile and peaks, presumably attributable to legal obligations to owners;
  • Confiscated has distinct bimodal distributions when outcomes are either Returned to Owner or Transfer;
  • Adoption times are similar for both cats and dogs;
  • Most distributions have clear unimodal profiles specific to the types of admission and outcome, varying between dogs and cats in density;
  • Adoption and, to a lesser degree, Owner Surrender distributions are almost indistinguishable between cats and dogs.
Rendering the same data using density curve estimates lets us validate the differences and similarities observed:


The densities demonstrate striking similarity in Adoption times and the most difference in Euthanized outcome times.

Sankeys With Average Times

We already used Sankey diagrams to project the flow from admission to discharge by the total number of occurrences in each transition. This time we decided on a novel approach to Sankeys where thickness reflects the average time spent in a shelter. The first diagram is for cats:




And then for dogs:





The thinner the line, the shorter the average stay between the admission and outcome it connects. And the larger a vertical panel (admission or outcome), the longer an animal spends in a shelter after that admission or before that discharge (on average and unweighted).

Expected Chance of Not Surviving in Shelter

For the purpose of this analysis, any outcome other than Died or Euthanized means the animal survived to leave the shelter alive (most with outcomes Adoption, Foster, Returned to Owner, or Transfer). Remember that we also excluded dogs with the intake type Dead on Arrival (see the introduction).

We begin with rather simple calculations - estimates of the chance of dying in a shelter given that an animal satisfies a certain condition. The plot below contains conditional probabilities of dogs (unless cats are specified) not surviving in a shelter given a certain factor at the time of admission (intake categories):

Two health conditions stand out with the highest rates: untreatable and unmanageable, while another health condition, contagious, is present in 3 out of the top 4 factors.

There is one more factor, breed, which has over 200 values just for dogs. Below we display the chances of dying for the dog breeds with at least 100 recorded admissions:

Note that the probability scale differs between the last two plots. Surprisingly, the breed Chow Chow took the top spot, with the Pit Bull Terrier breeds Staffordshire, Pit Bull, Am Pit Bull Terrier, and American Staffordshire close behind.

Survival Analysis

While applying classic survival analysis to animal shelter data presents certain challenges, we apply the approach while ignoring a few details; any suggestions or comments on how to improve it are welcome. The survival function S(t) = P(T > t) gives the probability that a subject (a pet admitted to a shelter) survives longer than time t.

In this case pets survived if discharged with any outcome other than Died or Euthanized. The time t is always in days since the day of admission, and all animal records included in this analysis are for animals that were already discharged (effectively eliminating both left and right censoring). Survival analysis accounts for censored data - subjects whose last known status is alive with no later information available. In our case all animal records contain an outcome, and thus all animals discharged alive are censored at the discharge date.

Kaplan-Meier Estimator

The Kaplan-Meier (KM) estimate is a non-parametric maximum likelihood estimate of the survival function S(t). It measures the fraction of animals living for a certain number of days t after admission and produces a declining step function (the KM curve) that approximates the real survival function from data. Given a single categorical factor, we can observe and compare KM curves across multiple factor values (univariate analysis). KM curves estimate and visualize survival chances over time just as survival functions do: given time t, what is the probability that a subject survives at least to that time or longer?
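A minimal sketch of how such KM curves can be fit with the survival package; the data.frame shelter and its columns days_in_shelter, died (TRUE for Died/Euthanized), and animal_type are assumptions about the prepared data, not names from the post:

```r
library(survival)

# Animals discharged alive are censored at the discharge date (event = FALSE)
km <- survfit(Surv(days_in_shelter, died) ~ animal_type, data = shelter)

summary(km, times = c(1, 7, 14, 30))  # survival estimates at selected days
plot(km, col = c("blue", "red"), xlab = "Days in shelter", ylab = "S(t)")
legend("topright", legend = c("cats", "dogs"), col = c("blue", "red"), lty = 1)
```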

Cats vs. Dogs KM Curves

First we compare survival curves between cats and dogs:



The survival curve plot (top) is augmented with a bar chart of totals by category and survival outcome (bottom) to give a better understanding of the underlying data. Survival chances for cats are never better than those for dogs, and overall cats fare much worse - see the bar chart above. Zooming in on the most critical first days after admission reveals more differences:



The day of admission is the worst for both, but cats fare twice as badly, with 25% lost right away. Days 4 and 5 are critical for dogs, as their survival plummets on these days. After that, survival rates stabilize and trend in a similar pattern.


KM Curves by Dog Intake Types

To make further analysis more plausible, we include only dog records from this point on. We also exclude pets admitted as Dead on Arrival or Euthanasia Requested, since their outcomes are obvious and immediate.



Confiscated dogs' survival chances are the best in the first 10 days or so, but then they quickly deteriorate, crossing and diving below the other 2 types after 2 weeks. The worst chances, as expected, belong to dogs surrendered by their owners. After 2 weeks all 3 curves cross and become less distinguishable.
 

KM Curves by Dog Origins

Dallas Animal Services also maintains an origin field assigned at admission, with the 3 most prevalent values being Field, Over the Counter, and Sweep. This is how survival curves differ depending on dog origin:


Again, significant shifts in survival chances happen after 5 days and then after 2-3 weeks, when the fortunes of the different origins turn around: after 5 days Over the Counter goes from the worst to the 2nd worst (or 2nd best), and then after 3 weeks to the best. Both Field and Sweep drop after 5 days. In absolute numbers (shown in the bar plots) Field dogs survive the worst.

Health Conditions at Admission

Unhealthy animals have little chance of surviving shelters, as evident from the following:


No surprise that unhealthy animals' survival is significantly below that of healthy ones. Also, a dominant majority of dogs admitted are in unhealthy condition, which is both unsurprising and unfortunate.

There is more information about unhealthy dogs available from shelter records: treatable vs. untreatable and contagious vs. non-contagious. Unfortunately, these values reside inside a single field, so the survival curves include combinations of the health factors:



It clearly shows how each health factor reduces survival chances: from Healthy to Treatable Rehabilitable to Treatable Manageable to Unhealthy Untreatable to, finally, Unhealthy Untreatable Contagious.

If we extract and analyze each health factor separately (ignoring the rest), these relationships become more apparent:





Survival of Dogs with Chips

As of June 17, 2017, all dogs and cats four months and older in the city of Dallas must be microchipped. This relatively new regulation will likely change both the share of chipped dogs in Dallas and the survival curves observed below for 2015 through October 2017:



Still, having a dog microchipped will almost certainly keep its survival chances higher.
 

Dog Breeds

Dallas shelters admitted dogs of over 200 different breeds from 2015 through 2017. Among them, 56 breeds appeared 100 times or more (covering over 95% of all admissions):


The top 4 breeds - Pit Bull, Labrador Retriever, Chihuahua, and German Shepherd - account for almost 60% of all admissions, with the next breed - Cairn Terrier - dropping to just under 3%. The survival curves for these 5 breeds cover almost 2/3 of all dogs admitted to Dallas shelters:

Pit Bulls suffer the worst survival rate of the 5 most admitted breeds: it drops below 50% after just over a week in a shelter. Labrador and German Shepherd reach 50% some time into the third week. Smaller breeds last much longer, as evident from the Chihuahua and Cairn Terrier curves.

It turns out there are more breeds closely related to the Pit Bull: American Staff, Am Pit Bull Ter, and Staffordshire:



The similar pattern for three of the four breeds in the group sharply differs from the 4th - American Staffordshire - for reason(s) beyond this analysis.
 

Next

In the next and final post on Dallas animal shelters we will apply the Cox proportional hazards model - a semi-parametric statistical method - to assess the effect of several factors simultaneously on survival time and outcome.

Resources

The R notebook (source code) with the data pipeline and visualizations can be found here, with a knitted version on RPubs.

Monday, September 4, 2017

How Pets Get Admitted and Later Leave Dallas Animal Shelters

Thanks to Dallas OpenData, anyone has access to the city animal shelter records. If you lost or found a pet, it may well have spent some time in a shelter - I personally took lost dogs there. It's unfortunate, but every year tens of thousands of animals find their way into shelters, with a significant fraction never finding their way out.


What and How Many Animals are Admitted?


The City of Dallas animal shelter dataset contains 5 different animal types, with a solid lead belonging to dogs and cats (hardly a surprise to anyone):



For consistency and plausibility of the analysis we will focus on cats and dogs only.


How Animals get Admitted


Each shelter record has the animal's intake type (the reason the animal was admitted) and outcome (the cause of the animal's discharge). The top 2 reasons why cats and dogs turn up at shelters are Stray (lost or abandoned) and Owner Surrender (willingly brought in by the owner), while Confiscated (abused, no owner, etc.) is #3 for dogs but not cats.



How Animals Leave Shelter


Animals leave shelters (either alive or dead) for 4 main reasons (outcomes): Adoption (good), Euthanized (bad), Returned to Owner (good), and Transfer (neutral):



Unfortunately, for both cats and dogs the top reason to leave a shelter is being euthanized. But that's where the similarity between them ends:


  • cats don't get returned to their owners anywhere near as often as dogs;
  • dogs' adoption and euthanization rates are almost the same, while cats get adopted far less often.

From Admissions to Outcomes with Sankey


So what is the relationship between intake types and outcomes? Which intake types drive which outcomes, and to what extent? The good news is that there is some causal effect, because each stay begins with an intake type and ends with an outcome.

We begin analyzing this relationship with a high-level but visually appealing visualization called a Sankey diagram (or just Sankey). It is a specific type of flow diagram in which the width of the arrows is proportional to the flow quantity. In our case each shelter stay contributes to the size of the pipe flowing from left (an intake type) to right (an outcome). With this we essentially visualize the conditional probabilities of an animal leaving a shelter with a certain outcome given its intake type (the first image illustrates transitions for cats and the second does the same for dogs):
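Such a diagram can be sketched with the networkD3 package; the data.table shelter and its columns intake_type and outcome_type are assumed names for the prepared records:

```r
library(networkD3)
library(data.table)

# Aggregate counts per (intake type, outcome) pair
links <- shelter[, .(value = .N), by = .(intake_type, outcome_type)]

# networkD3 expects 0-based node indices into a single node table
nodes <- data.frame(name = unique(c(links$intake_type, links$outcome_type)))
links[, source := match(intake_type,  nodes$name) - 1L]
links[, target := match(outcome_type, nodes$name) - 1L]

sankeyNetwork(Links = as.data.frame(links), Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name", fontSize = 12)
```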







While the Owner Surrender intake type flows similarly for both, Stray animals differ: cat outcomes are dominated by Euthanized, while dog outcomes are dominated by Adoption, with Transfer and Returned to Owner together matching Euthanized.


Correlations Between Admissions and Outcomes


Next, we go beyond the overall totals used in the Sankey and compute correlations. To correlate intake types and outcomes we construct time series by computing monthly totals for each intake type and outcome, obtaining monthly trends. Then we correlate the monthly trends (separately for cats and dogs) of animals brought into and removed from Dallas animal shelters for each pair of the top intake types (Confiscated, Owner Surrender, and Stray) and outcomes (Adoption, Euthanized, Returned to Owner, and Transfer) - 12 coefficients in total:
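The monthly-trend correlations can be computed roughly as follows; the data.table shelter and its columns intake_date, intake_type, and outcome_type are assumed names:

```r
library(data.table)

# Monthly totals per intake type and per outcome
shelter[, month := format(as.Date(intake_date), "%Y-%m")]
intakeTrends  <- dcast(shelter, month ~ intake_type,
                       value.var = "intake_type", fun.aggregate = length)
outcomeTrends <- dcast(shelter, month ~ outcome_type,
                       value.var = "outcome_type", fun.aggregate = length)

# 3 x 4 matrix: correlation of each intake trend with each outcome trend
intakes  <- c("Confiscated", "Owner Surrender", "Stray")
outcomes <- c("Adoption", "Euthanized", "Returned to Owner", "Transfer")
corMatrix <- sapply(outcomes, function(o)
  sapply(intakes, function(i) cor(intakeTrends[[i]], outcomeTrends[[o]])))
round(corMatrix, 2)
```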




In this case strong correlation implies (at least to some extent) a causal effect, due to the presence of the temporal relationship, consistency, and plausibility criteria (see here and here). A few observations to note:
  • The highest correlation for cats (0.91) and second highest for dogs (0.77) are observed between the intake Surrendered by Owner and the outcome Euthanized, which is almost as obvious as it is unfortunate.
  • The correlation between Stray and Returned to Owner for dogs is the highest at 0.86. This is great news because it means the more dogs get lost, the more of them are found. The higher this correlation, the healthier the city, for 2 reasons: a) lost animals return home and b) a larger share of stray dogs are lost ones and not abandoned (given that the city keeps collecting them).
  • Unfortunately, the trend in Stray cats correlates highly with Euthanized. So while the Stray dog trend drives adoptions and returns, the Stray cat trend affects euthanizations the most (we've seen that in the Sankey as well).
  • No trends are affected by variations in Confiscated dogs, but this is likely due to the smaller share of such admissions.
  • Variation in Stray dogs admitted affects every outcome (but Euthanized). Indeed, the Stray intake type is the largest and is almost twice as big as the 2nd largest dog type, Owner Surrender.
  • The low correlation for dogs between Stray and Euthanized needs additional analysis because it's counter-intuitive.

Monthly Trends


But can we do better than correlations of these trends, which, however sophisticated, are still aggregates? The next visual places time series instead of correlation coefficients inside the same matrix grid, allowing us to see and compare the actual monthly trends:



Note that each plot is a 3 x 4 matrix - the same dimensions as the correlation matrices before. But instead of a correlation coefficient, each cell contains a pair of monthly trends (in fact, each correlation was computed for these exact pairs of trends - hence the reference to its aggregation origin). Each row corresponds to an intake type (the same blue line in each) and each column to an outcome (the same red line in each). Being able to see trends over time, let's record a few observations (following the matrix order top down):
  • The Confiscated intake trend is flat for both cats and dogs, with the only significant spike for dogs in January 2016. This spike is so unusual, relatively big, and contained within a single month or two that it begs additional investigation into a probable external event or procedural change that may have caused it.
  • The number of Confiscated animals is too low to noticeably affect outcomes. Still, if we could reduce the effect of the other intake types, some relationships are possible.
  • The Owner Surrender trend's correlation with the Euthanized outcome is so obvious that this type of visualization is sufficient to find it. Yes, it is unfortunate, but people bring in their old or unhealthy pets for a reason.
  • The same applies to Stray and Owner Surrender, for cats only.
  • Owner Surrender has a significant seasonal component spiking in summer, possibly due to hot weather, the holiday season, or both. For cats only, the seasonal component is also strong in the Stray trend.
  • Euthanized trends together with Owner Surrender, which causes it to a large degree.
  • Stray dogs trend slowly upwards in Dallas, and it's alarming.
  • Adoption also trends upwards, but not steeply enough to compensate for the inflow of dogs into shelters. A targeted campaign to encourage more pet adoptions in the city is due.
  • The Transfer outcome trending upward also compensates for the growth in stray dogs. It's not clear whether that is positive or negative, though, as there is no means to track what happens to dogs after transfer (or is there?).
  • The Stray trend for dogs dipped in January 2016, exactly when the Confiscated trend spiked - it could be a coincidence or related - definitely something to consider when investigating further.
  • For dogs, the Euthanized trend correlates strongly with the Stray intake until the summer of 2016, when they start to diverge in opposite directions - again, some policy or procedural change apparently caused it. Indeed, looking at other outcomes we notice that the Returned to Owner trend began its uptick at around the same time (after I observed this, I found out about this and this - significant changes in Dallas Animal Services leadership and policies around the summer and fall of 2016).
I will be back with more analysis (survival analysis). The R code for data processing, analysis, and visualizations from this post can be found here.