tag:blogger.com,1999:blog-75302188029392524762024-03-15T20:10:22.027-05:00novydenGregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.comBlogger38125tag:blogger.com,1999:blog-7530218802939252476.post-75657443615630273182021-07-19T23:05:00.004-05:002021-07-20T16:43:59.471-05:00Time Travel with py datatable 1.0<p><span style="font-family: georgia;"><span style="font-size: large;">The R package <b>data.table</b> has become a tool of choice for working with big tabular data thanks to its versatility and performance. Its Python counterpart <b>py datatable</b> follows its R cousin in performance and is steadily catching up in functionality. A notable omission, temporal data types, was addressed in version </span></span><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">1.0 </span></span>by means of two new types: </span></span></p><ul style="text-align: left;"><li><span style="font-family: georgia;"><span style="font-size: large;"><i>datatable.Type.date32</i> to represent and store a particular calendar date without a time component and</span></span></li><li><span style="font-family: georgia;"><span style="font-size: large;"><i>datatable.Type.time64</i> to store a specific moment in time (i.e. 
date with a time component) <br /></span></span></li></ul><p><span style="font-family: georgia;"><span style="font-size: large;">and the <i>datatable.time</i> family of functions: <a href="https://datatable.readthedocs.io/en/latest/api/time.html">https://datatable.readthedocs.io/en/latest/api/time.html</a></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">Here is a brief overview of how to use them.</span></span></p><h2 style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;">datatable.Type.date32</span></span></h2><p><span style="font-family: georgia;"><span style="font-size: large;">This type represents a calendar date without a time component and i</span><span style="font-size: large;">nternally stores the date as a 32-bit signed integer
counting the number of days since (positive) or before (negative) the epoch (1970-01-01). Thus, this
type includes dates within the range of approximately ±5.8 million
years, which places the oldest storable date in </span></span><span style="font-family: georgia;"><span style="font-size: large;"><a href="https://www.britannica.com/science/Miocene-Epoch" target="_blank">the Late Miocene Epoch</a> and the latest in the year 5,879,610 of the 58,797th century, a future unknown even to science fiction: <br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></p><div><span style="font-family: georgia;"><span style="font-size: large;">
<code data-gist-file="01-date32-intro.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
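<p>The day-count representation and the size of the range can be checked with a little arithmetic. The sketch below is plain standard-library Python, not datatable itself; the helper names <i>to_days</i> and <i>from_days</i> are made up for illustration. Python's own <i>date</i> only reaches year 9999, so the range is estimated from the integer width alone:</p>

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_days(d: date) -> int:
    """Days since the epoch (negative before it), the quantity a date32 stores."""
    return (d - EPOCH).days

def from_days(n: int) -> date:
    """Inverse of to_days, limited to dates Python's date type can represent."""
    return EPOCH + timedelta(days=n)

# Largest day offset that fits in a 32-bit signed integer.
INT32_MAX = 2**31 - 1
# Average length of a Gregorian year in days.
DAYS_PER_YEAR = 365.2425
span_million_years = INT32_MAX / DAYS_PER_YEAR / 1e6

print(to_days(date(2021, 7, 19)))    # 18827
print(from_days(-1))                 # 1969-12-31
print(round(span_million_years, 2))  # 5.88
```

<p>The last figure is only an estimate of how many years a 32-bit day counter can span in each direction; the exact limits datatable accepts are best taken from its documentation.</p>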
</span></span></div><span style="font-family: georgia;"><span style="font-size: large;">
</span></span><p></p><p><span style="font-family: georgia;"></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJ32BVepHNVmm-3tE6spdqpoTQf0sAm3LtLIxhoETYfgv7QX9axHlg7aBw-oPBJnipWlvm1vJcwZZ8UWth47Hh8SjVluql_HVDBV_ckNaYS9PLFP2QAzBN-VMrxEgYvZNeW-u2F-Yrz1Et/s298/Screen+Shot+2021-07-19+at+14.50.04.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="276" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJ32BVepHNVmm-3tE6spdqpoTQf0sAm3LtLIxhoETYfgv7QX9axHlg7aBw-oPBJnipWlvm1vJcwZZ8UWth47Hh8SjVluql_HVDBV_ckNaYS9PLFP2QAzBN-VMrxEgYvZNeW-u2F-Yrz1Et/w185-h200/Screen+Shot+2021-07-19+at+14.50.04.png" width="185" /></a></div><br /><span style="font-size: large;"><br /> </span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">There are various ways to initialize and/or create <i>date32</i> column inside datatable:</span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></p><div><span style="font-family: georgia;"><span style="font-size: large;">
<code data-gist-file="02-date32-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
</span></span></div><span style="font-family: georgia;"><span style="font-size: large;">
</span></span><p></p><p><span style="font-family: georgia;"></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq2yqziN5K4ktwYP9e1BhWGBYGbgUi6ngZwl-Bhx4dW_g_wjKdcaEk5xci1vKfnuRf3y8dfbgZ5lxzoScx-0L7aYqB1BjWOvmLAa-wUFVqTo7RI4aiBEPvt4amnBEO-5RspGi2fSLdbbs1/s298/Screen+Shot+2021-07-19+at+15.09.30.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="276" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq2yqziN5K4ktwYP9e1BhWGBYGbgUi6ngZwl-Bhx4dW_g_wjKdcaEk5xci1vKfnuRf3y8dfbgZ5lxzoScx-0L7aYqB1BjWOvmLAa-wUFVqTo7RI4aiBEPvt4amnBEO-5RspGi2fSLdbbs1/w185-h200/Screen+Shot+2021-07-19+at+15.09.30.png" width="185" /></a></div><br /><span style="font-size: large;"><br /></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></p><div><span style="font-family: georgia;"><span style="font-size: large;">
<code data-gist-file="03-date32-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
</span></span></div><span style="font-family: georgia;"><span style="font-size: large;">
</span></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;">or</span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></p><div><span style="font-family: georgia;"><span style="font-size: large;">
<code data-gist-file="04-date32-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
</span></span></div><span style="font-family: georgia;"><span style="font-size: large;">
</span></span><p></p><p><span style="font-family: georgia;"></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqC-fLBN3M42rf3RIl2R6h5VvlCfC-oDKbdyN0UA9J1_BudrfDtKSHCwsxHIHlYcKe5WgFskPTb5d7uXiJ6NMmTNWo4CQt8T-oXV0DuICw6oR-Z1jaOilofkXAmB19vBN04iFcxPxkk2fu/s298/Screen+Shot+2021-07-19+at+15.16.49.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="276" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqC-fLBN3M42rf3RIl2R6h5VvlCfC-oDKbdyN0UA9J1_BudrfDtKSHCwsxHIHlYcKe5WgFskPTb5d7uXiJ6NMmTNWo4CQt8T-oXV0DuICw6oR-Z1jaOilofkXAmB19vBN04iFcxPxkk2fu/w185-h200/Screen+Shot+2021-07-19+at+15.16.49.png" width="185" /></a></div><br /><span style="font-size: large;"><br /></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;">Remember to use <a href="https://en.wikipedia.org/wiki/ISO_8601" target="_blank">ISO 8601</a> format when representing dates as strings, otherwise parsing fails silently: </span></span></p><p><br /><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></span></span></p><div><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">
<code data-gist-file="05-date32-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
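<p>To see which strings count as ISO 8601 dates, Python's standard-library parser can serve as a stand-in. This is an illustration only and an assumption about what is useful here: unlike the silent failure described above, <i>date.fromisoformat()</i> raises an error on non-ISO input, so the hypothetical helper <i>parse_iso_date</i> turns that error into a missing value:</p>

```python
from datetime import date
from typing import Optional

def parse_iso_date(text: str) -> Optional[date]:
    """Return a date for an ISO 8601 string (YYYY-MM-DD), or None otherwise."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        # Not an ISO 8601 calendar date: report it as missing.
        return None

print(parse_iso_date("2021-07-19"))  # 2021-07-19
print(parse_iso_date("07/19/2021"))  # None
```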
</span></span></span></span></div><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></span></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"></span></span></span></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFKqD5xiJALTJZpt7FMc4_6NYBhQKrC5wiutyMeGajd_dFC7pBzH2emU4nUQTiHE-POoodrAEsqaQTkqIWdKSVyM3bF0wgan_1T3e-x_3Wo0xQlRMTpEop_eygFUIFyTVpikhELQONA1E0/s298/Screen+Shot+2021-07-19+at+15.28.09.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="276" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFKqD5xiJALTJZpt7FMc4_6NYBhQKrC5wiutyMeGajd_dFC7pBzH2emU4nUQTiHE-POoodrAEsqaQTkqIWdKSVyM3bF0wgan_1T3e-x_3Wo0xQlRMTpEop_eygFUIFyTVpikhELQONA1E0/w185-h200/Screen+Shot+2021-07-19+at+15.28.09.png" width="185" /></a></span></span></div><span style="font-size: large;"><span style="font-size: large;"><br /> </span></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> 
</span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">If a frame already contains dates as strings, a combination of the constructor function <i>datatable.time.ymd()</i> (to create the <i>date32</i> type), the cast function <i>datatable.as_type()</i> (to convert <i>str</i> to <i>int</i>), and the string slicer <i>datatable.str.slice()</i> (to extract the date elements) suffices to parse the strings and create the corresponding <i>date32</i> values, all within the datatable API:</span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></span></span></p><div><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">
<code data-gist-file="06-date32-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
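<p>The same slice, cast, and construct sequence can be sketched on a single string with Python built-ins. This is an analogy rather than the datatable calls named above, and the <i>MM/DD/YYYY</i> input format and the helper name <i>parse_us_date</i> are assumptions chosen for illustration:</p>

```python
from datetime import date

def parse_us_date(text: str) -> date:
    """Parse a fixed-width 'MM/DD/YYYY' string by position."""
    # Slice out each date element (the role of a string slicer).
    month_text = text[0:2]
    day_text = text[3:5]
    year_text = text[6:10]
    # Cast the pieces from str to int, then build the date from its parts.
    return date(int(year_text), int(month_text), int(day_text))

dates = [parse_us_date(s) for s in ["07/19/2021", "12/31/1969"]]
print(dates[0].isoformat())  # 2021-07-19
print(dates[1].isoformat())  # 1969-12-31
```

<p>Positional slicing like this only works when every string has the same fixed-width layout.</p>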
</span></span></span></span></div><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">
</span></span></span></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"></span></span></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQe6odG3vtA7yBgD4HWgL5C0sIMKBEkgWfqkCRn7UXHndM7uywGQQjam-V5zTqhWW4oDMHZYsaDmUjzhEXvbtv_-j_2obHglMNhKc-JCs5HbSEtF3D-WrqGeXV7q4iiaWiPnh32-2-r8qp/s298/Screen+Shot+2021-07-19+at+15.50.34.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="276" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQe6odG3vtA7yBgD4HWgL5C0sIMKBEkgWfqkCRn7UXHndM7uywGQQjam-V5zTqhWW4oDMHZYsaDmUjzhEXvbtv_-j_2obHglMNhKc-JCs5HbSEtF3D-WrqGeXV7q4iiaWiPnh32-2-r8qp/w185-h200/Screen+Shot+2021-07-19+at+15.50.34.png" width="185" /></a></span></div><span style="font-size: large;"><br /><span style="font-size: large;"><br /></span></span><p></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span></p><p><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span></p><h2 style="text-align: left;"><span style="font-family: 
georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">datatable.Type.time64</span></span></span></span></h2><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">This type represents a specific moment in time and is stored i</span></span></span></span><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">nternally as a 64-bit integer containing the number of
nanoseconds since the epoch (</span></span></span></span><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;">1970-01-01</span></span>) in UTC:</span></span></span></span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"></span></span></span></span></p>
<code data-gist-file="07-time64-intro.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
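<p>The nanosecond count can be reproduced with the standard library. The sketch below is plain Python, not datatable, and <i>to_nanoseconds</i> is a hypothetical helper. It also estimates how many years a 64-bit signed nanosecond counter can span on each side of the epoch, which follows from the integer width alone and is not a statement about datatable's exact limits:</p>

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
NS_PER_SECOND = 1_000_000_000

def to_nanoseconds(t: datetime) -> int:
    """Nanoseconds since the epoch in UTC, the quantity a time64 stores."""
    delta = t - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    # Python datetimes carry microseconds, so the last three digits are zero.
    return whole_seconds * NS_PER_SECOND + delta.microseconds * 1_000

moment = datetime(2021, 7, 19, 14, 30, tzinfo=timezone.utc)
print(to_nanoseconds(moment))  # 1626705000000000000

# Years covered by 2**63 nanoseconds, using the average Gregorian year.
NS_PER_YEAR = 365.2425 * 86_400 * NS_PER_SECOND
span_years = 2**63 / NS_PER_YEAR
print(round(span_years))  # 292
```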
<div class="separator" style="clear: both; text-align: center;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHvfMOVzq6Tro2AvT4tpLUMMMHJ1lrp7Vo6A-F6hn4Lo-0zctMV0u-4sZZNNIq_6TQrumrXaCCkUS7g7UpdWKWJiCGFAvO_hQhvnLkOymnlRJmTWpsIPYrcXo03ZH52_3YI1DwMajIdY-u/s404/Screen+Shot+2021-07-19+at+16.27.16.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="404" height="148" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHvfMOVzq6Tro2AvT4tpLUMMMHJ1lrp7Vo6A-F6hn4Lo-0zctMV0u-4sZZNNIq_6TQrumrXaCCkUS7g7UpdWKWJiCGFAvO_hQhvnLkOymnlRJmTWpsIPYrcXo03ZH52_3YI1DwMajIdY-u/w200-h148/Screen+Shot+2021-07-19+at+16.27.16.png" width="200" /></a></span></span></span></span></div><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span><p></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></span></span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span 
style="font-family: georgia;"><span style="font-size: large;">Similarly, <i>time64</i> values can be created in the same fashion as the <i>date32</i> values above, for example:</span></span></span></span></p>
<code data-gist-file="08-time64-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
<div class="separator" style="clear: both; text-align: center;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD_iwxqI6wm5rfKaYnCx2lM4o9HUDkxTUuiKTdTHAGaqGgcnh8E9uvZkka5Qx-GlqfrL-Xi0WG-zxNJ6uaP5hQbURUd2-jc4An5NrQLNdSJloZA2eBWk6gu1Hnh1UFvRP-w0hJ8geyrdp3/s404/Screen+Shot+2021-07-19+at+16.39.48.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="404" height="148" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD_iwxqI6wm5rfKaYnCx2lM4o9HUDkxTUuiKTdTHAGaqGgcnh8E9uvZkka5Qx-GlqfrL-Xi0WG-zxNJ6uaP5hQbURUd2-jc4An5NrQLNdSJloZA2eBWk6gu1Hnh1UFvRP-w0hJ8geyrdp3/w200-h148/Screen+Shot+2021-07-19+at+16.39.48.png" width="200" /></a></span></span></span></span></div><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"><br /></span></span></span></span><p></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: large;"> <br /></span></span></span></span></p><p style="text-align: left;"><span style="font-family: georgia; font-size: large;"><span><span style="font-family: georgia;"><span> </span></span></span></span></p><p style="text-align: left;"><span style="font-size: large;"> </span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;"> </span></span></p><p style="text-align: left;"><span style="font-family: georgia;"><span style="font-size: large;">Again, time strings must follow the ISO 8601 format. To create a time value from its components, use the constructor function <i>datatable.time.ymdt()</i>:<br /></span></span></p>
<code data-gist-file="09-time64-create.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgzZY4m2aDDosvNMe5jifBJM5JxFxMA3tTD5rOm1_yf4S-ZTse1zZasQDlo_xrprmmpXHZTXmplmZzC-gSlSjQNEB4n6lF71-KP34laapUHp8BQ8uYudchKXC1uDFoELTGFiklg4wqrJRz/s916/Screen+Shot+2021-07-19+at+17.45.39.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="398" data-original-width="916" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgzZY4m2aDDosvNMe5jifBJM5JxFxMA3tTD5rOm1_yf4S-ZTse1zZasQDlo_xrprmmpXHZTXmplmZzC-gSlSjQNEB4n6lF71-KP34laapUHp8BQ8uYudchKXC1uDFoELTGFiklg4wqrJRz/w400-h174/Screen+Shot+2021-07-19+at+17.45.39.png" width="400" /></a></div><br /><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><h2 style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;"><span>datatable.time.* Functions</span></span></span></h2><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;"><span>To effectively use datatable <i>date32</i> and <i>time64</i> types there are special functions included that are part of <i>datatable.time</i> family:</span></span></span></p><ul style="text-align: left;"><li><span style="font-size: large;"><span style="font-family: georgia;"><span>constructors <i>ymd()</i> and <i>ymdt()</i> and</span></span></span></li><li><span style="font-size: large;"><span style="font-family: georgia;"><span>date and time part functions: <i>day(), day_of_week(), hour(), minute(), month(), nanosecond(), second(), year()</i><br /></span></span></span></li></ul><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;"><span>Using constructors was showcased already and the part 
functions will come in handy when filtering data, for example:</span></span></span></p><p style="text-align: left;"><span style="font-size: large;">
<code data-gist-file="10-datetime-filter.py" data-gist-hide-footer="true" data-gist-id="064f2a7183a2987893dce398a013d980"></code>
</span></p><p style="text-align: left;"><span style="font-size: large;"></span></p><div style="text-align: left;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhTUqUD1aYcnd4FfaqmOcSNaIT2ZW0VA-Q8tgfZ5tu5GdaYdaLPALA008sogzW_luS9d8qsBk8HXGUKEzSwT7xOZtsD0ILbvTIBynSMPV0l5A0IDUxd6BNMQZ7FjQk6oSg2jC8x9xNwUku/s904/Screen+Shot+2021-07-19+at+22.44.46.png" style="clear: left; margin-bottom: 1em;"><img border="0" data-original-height="178" data-original-width="904" height="79" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhTUqUD1aYcnd4FfaqmOcSNaIT2ZW0VA-Q8tgfZ5tu5GdaYdaLPALA008sogzW_luS9d8qsBk8HXGUKEzSwT7xOZtsD0ILbvTIBynSMPV0l5A0IDUxd6BNMQZ7FjQk6oSg2jC8x9xNwUku/w400-h79/Screen+Shot+2021-07-19+at+22.44.46.png" width="400" /></a></span></div><span style="font-size: large;"><br /></span><br /><p></p>Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-12586744918663274692021-01-06T07:55:00.004-06:002021-01-09T15:34:10.830-06:00IMDb datasets: 3 centuries of movie rankings visualized<h2 style="text-align: left;"><span style="font-size: medium;">The Question <br /></span></h2><p><span style="font-size: medium;"><span style="font-size: large;"><span style="font-family: georgia;">I am a sucker for IMDb ratings so don't judge me. They are my priors before watching almost anything on a screen (home screen that is). But between movies (feature films), TV movies and TV (mini) series IMDb ratings are highly inconsistent. For example, series <a href="https://www.imdb.com/title/tt1190634/">The Boys</a> has rating 8.7 and so does movie <a href="https://www.imdb.com/title/tt0099685/" target="_blank">Goodfellas</a> by Martin Scorsese. Does it make sense The Boys ranked as high as #16 rated movie title in the whole IMDb database (among those with at least 25,000 user votes)? Or, in other words, if and how much apples vs. 
oranges those ratings are? </span></span><br /></span></p><h2 style="text-align: left;"><span style="font-size: medium;"> </span></h2><h2 style="text-align: left;"><span style="font-size: medium;">Rating Distributions <br /></span></h2><p><span style="font-size: large;"><span style="font-family: georgia;"><span>To start I downloaded IMDb datasets (<a href="https://www.imdb.com/interfaces/">here</a>). Let's show distributions of title ratings depending on the types: movie (i.e. feature film), TV movie, TV mini series, and TV series between fiction and documentaries:</span></span></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><span style="font-family: georgia;"><span><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO6ifiFmCRPPgfZQNXJqbeyJjUszHOutmZVJdkuBw_I-C_5JYKfc4-fVj8Rxm_BbuTd0H8xEjgOjp6nWJZRMxonE2Hkc9os2eDV2Ivu6D0Tv1yAnI6SDFjUL1w61JRwCwh5Jj9bIvG5g2C/s900/imdb-ratings-by-title-types.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO6ifiFmCRPPgfZQNXJqbeyJjUszHOutmZVJdkuBw_I-C_5JYKfc4-fVj8Rxm_BbuTd0H8xEjgOjp6nWJZRMxonE2Hkc9os2eDV2Ivu6D0Tv1yAnI6SDFjUL1w61JRwCwh5Jj9bIvG5g2C/w640-h498/imdb-ratings-by-title-types.png" width="640" /></a></span></span></span></div><p></p><p><span style="font-size: medium;"></span></p><p></p><p><span style="font-size: large;"><span style="font-family: georgia;"><span>Title ratings drift towards higher values </span><span><span>depending on their types (shown on the right): movie, TV movie, TV mini series, and TV series. So indeed ratings of movies and TV series come from different distributions representing different things like apples and oranges. But how much different they are? 
(We will focus on fiction titles only from this point on.)</span></span></span></span></p><h2 style="text-align: left;"><span style="font-size: medium;"><span style="font-size: medium;"> </span></span></h2><h2 style="text-align: left;"><span style="font-size: medium;"><span style="font-size: medium;">Percentiles <br /></span></span></h2><p><span style="font-size: large;"><span style="font-family: georgia;"><span>If a title has the all-time best rating then it is no doubt worth a try </span></span></span><span style="font-size: large;"><span style="font-family: georgia;"><span><span style="font-size: large;"><span style="font-family: georgia;"><span>(let's
say among titles with at least 1000 votes; the number of votes is a rather
important consideration, but we let it slide here and may come back to
votes later)</span></span></span>. Why? Because 100% of other titles are rated below or at best the same and that indicates exceptional qualities. In statistics such rating has a name: 100th percentile. Following the same logic 99th percentile represents rating above 99% of all titles in the database (again, don't forget about minimum threshold for number of votes to be considered).</span></span></span></p><p><span style="font-size: medium;"><span style="font-size: large;"><span style="font-family: georgia;">Based on above we can assign IMDb titles to groups based on the highest percentile they belong to: 99% percentile suggests that the title is <i>very best</i>, 95% - <i>excellent</i>, 90% - <i>very good</i>, 75% - <i>good</i>, 50% - <i>average</i>, and 25% - <i>bad</i>. Feel free to assign and name percentiles differently in your analysis but we stick with this convention for this post. Last piece of the puzzle is taking percentiles not across whole IMDb set but rather for each title type separately and compare them: </span></span><br /></span></p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJyrQnchv50xGIF-OzC1vrNAh01V5zFNo6q6NTrjJxfGb-COiQGT6QxqUdHUrgMCbtxdrgvip0kEi4LuOpGOIuPXxOYF2rH-4BlIYrhh8AJxY-R4Y5j6EQ3k_MiEwwUtf9Fl557OqtCKMV/s900/imdb-ratings-by-percentiles-dodge.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJyrQnchv50xGIF-OzC1vrNAh01V5zFNo6q6NTrjJxfGb-COiQGT6QxqUdHUrgMCbtxdrgvip0kEi4LuOpGOIuPXxOYF2rH-4BlIYrhh8AJxY-R4Y5j6EQ3k_MiEwwUtf9Fl557OqtCKMV/w640-h498/imdb-ratings-by-percentiles-dodge.png" width="640" /></a><p></p><p><span style="font-size: large;"><span style="font-family: georgia;"><span>Going back to our example, 8.7 in TV Series places The Boys firmly in "Excellent" (95th percentile), while Goodfellas at 8.7 sits at the top of "Very Best" (99th percentile) in movies - 
a noticeable difference between the two.<br /></span></span></span></p><p><span style="font-size: large;"><span style="font-family: georgia;"><span>The difference becomes even more meaningful when looking at the lower tiers, "Very Good" (90th percentile) and below: while a rating of 7.6 suffices for a movie </span><span><span>(e.g. <a href="https://www.imdb.com/title/tt0314331/">Love Actually</a>) </span>to place in "Very Good", a TV series must achieve a rating of 8.4 to qualify for the same 90th percentile. In fact, a TV series with a 7.6 rating (like <a href="https://www.imdb.com/title/tt0413573/">Grey's Anatomy</a>) places just above the "Average" 50th percentile. Furthermore, a rating of 8 would place a movie firmly in the top 5% while the same 8 for a TV series barely cracks the top 25%.</span></span></span></p><h2 style="text-align: left;"><span style="font-size: medium;"> </span></h2><h2 style="text-align: left;"><span style="font-size: medium;">Percentiles Extra <br /></span></h2><p><span style="font-size: large;"><span style="font-family: georgia;"><span>Comparing and analyzing ratings between title types can be helped by organizing and visualizing the same percentile data in a few different ways:</span></span></span></p><ul style="text-align: left;"><li><span style="font-size: medium;"><span style="font-size: large;"><span style="font-family: georgia;">Overlapping bar charts by title types: <br /><br /></span></span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjg3Y_kR8DW4jK-3hop0Cdeeg_yTDa0mkfatskrIMTCFzx6YdIOgwrnlSF-lFGUPQpH6etSj2Yk6SCh3c9naVVvmrSUhvDnFdYk-atrmzCFaBfdgKTqc5jhgjVzgvo0lCVPzoZDGi7VmVs1/s900/imdb-ratings-by-title-types-identity.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjg3Y_kR8DW4jK-3hop0Cdeeg_yTDa0mkfatskrIMTCFzx6YdIOgwrnlSF-lFGUPQpH6etSj2Yk6SCh3c9naVVvmrSUhvDnFdYk-atrmzCFaBfdgKTqc5jhgjVzgvo0lCVPzoZDGi7VmVs1/w640-h498/imdb-ratings-by-title-types-identity.png" width="640" /></a></div><br /></span></li></ul><p></p><p></p><ul style="text-align: left;"><li><span style="font-size: medium;"><span style="font-size: large;"><span style="font-family: georgia;">Line chart by title types:<br /><br /></span></span></span><span style="font-size: medium;"><span style="font-size: large;"><span style="font-family: georgia;"></span></span></span><span style="font-size: medium;"><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhweEGlg1H8p4BNKVZHLnf2YE18sa_XIsziJXbfGAy91D7rP29jp0Bbfo5hx9iRxzMzbt-I0BXi2-9uRbO_r8zlq765YdlbhHFMGvajwoXTXA4mBn1E0pyObtC0A1lD4_oHqrw7InT9_9Gy/s900/imdb-ratings-by-title-types-lines.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhweEGlg1H8p4BNKVZHLnf2YE18sa_XIsziJXbfGAy91D7rP29jp0Bbfo5hx9iRxzMzbt-I0BXi2-9uRbO_r8zlq765YdlbhHFMGvajwoXTXA4mBn1E0pyObtC0A1lD4_oHqrw7InT9_9Gy/w640-h498/imdb-ratings-by-title-types-lines.png" width="640" /></a></span></div></span><br /></li></ul><p></p><ul style="text-align: left;"><li><span style="font-size: large;"><span style="font-family: georgia;"><span>Line chart by percentiles:<br /><br /></span></span></span><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><span style="font-family: georgia;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCian3SgjYS9d9mnS58VRSU9xuEoWPSf0SGv0u39htMyUeSGV5Vup8cnUXcgLZvdi8E_6oyntWJGpQ-ovjVJdU4tyjmWrL76O6gWsPH1qYzuyGxvO-SLf5YolBZQbBiNOJxg_HsQI9k4PS/s900/imdb-ratings-by-percentiles-lines.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCian3SgjYS9d9mnS58VRSU9xuEoWPSf0SGv0u39htMyUeSGV5Vup8cnUXcgLZvdi8E_6oyntWJGpQ-ovjVJdU4tyjmWrL76O6gWsPH1qYzuyGxvO-SLf5YolBZQbBiNOJxg_HsQI9k4PS/w640-h498/imdb-ratings-by-percentiles-lines.png" width="640" /></a></span></span></div><span style="font-size: medium;"><br /></span></li></ul><br /><h2 style="text-align: left;"><span style="font-size: medium;">What About Documentaries?</span></h2><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;">The title percentiles above excluded documentaries. To be able to compare ratings between fiction and documentary titles the following visual computes and dissects rating percentiles between fiction and documentaries by title types:</span></span></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><span style="font-family: georgia;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH7VuY0pGhfGL-s8tjnHOxlSIvIby3bdEhOg6R5g1kOBZl8oo1Y62whhrIVxQMTEIkHLAMv_q2-9zF7bI-BJTIjwMe4LXE0MhjidPvfZ7lnAcUMj02ZChePLs7ahlWGdB8xGMJu_9BJN5P/s900/imdb-ratings-by-percentiles-fiction-vs-docum.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH7VuY0pGhfGL-s8tjnHOxlSIvIby3bdEhOg6R5g1kOBZl8oo1Y62whhrIVxQMTEIkHLAMv_q2-9zF7bI-BJTIjwMe4LXE0MhjidPvfZ7lnAcUMj02ZChePLs7ahlWGdB8xGMJu_9BJN5P/w640-h498/imdb-ratings-by-percentiles-fiction-vs-docum.png" width="640" /></a></span></span></div><span style="font-size: large;"><span style="font-family: georgia;"><br /></span></span><span style="font-size: medium;"><span style="font-size: large;"><span style="font-family: georgia;">For whatever reason, IMDb users rate documentaries more generously than their fiction counterparts across all title types.</span></span><br /></span><p></p><span style="font-size: medium;"></span><h2 style="text-align: left;"><span style="font-size: medium;"> </span></h2><h2 style="text-align: left;"><span style="font-size: medium;">Historical Perspective Mixed with Film Trivia<br /></span></h2><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;">The oldest film on IMDb, <a href="https://en.wikipedia.org/wiki/Passage_de_V%C3%A9nus" target="_blank">Passage de Venus</a>, was made in 1874; it is rated 6.9 with 1282 votes (as of January 2020) and is filed under title type <i>short</i> and genre <i>Documentary</i>. In chronological order it is followed by 2 titles in 1878 (short animation <i>Le singe musicien</i> and short documentary <i>Sallie Gardner at a Gallop</i>), 1 in 1881 (short documentary <i>Athlete Swinging a Pick</i>), 1 in 1883 (short documentary <i>Buffalo Running</i>), and 1 in 1885 (short animation <i>L'homme machine</i>). Starting with 1887, which produced 45 titles in total, there are no more gap years, but that output was surpassed only in 1894, with 97 titles. The first <i>movie</i> title (and the only one that year), <i>Reproduction of the Corbett and Fitzsimmons Fight</i>, was filmed in 1897 under the <i>Documentary</i>, <i>News</i>, and <i>Sport</i> genres. 
Lastly, the first year in which the total number of titles exceeded the numerical value of the year itself is 1952, with 2059 shorts, movies, etc. Did I just say last? One more factoid, if you'll excuse me: movie production in 2020 (35,109 titles total) took us back exactly 10 years, to 2010, when 35,062 titles were produced, while the absolute record belongs to 2017 with 51,231 films total.<br /></span></span></p><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;">What about visualizing film production over time?</span></span></p><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;"> </span></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPPoVteBezya7avj_sglUQd8ynFS42VkCiepjHCJsAwxV55j8OCPFlOUhKXlamTPaiHNf5rN7eyFhuK0oR2UpuNPdIgvRSVtVBbEx3wBs8mz52mqvTVrrHT3yYmXSp6RAtsYYJArTrwMBs/s900/films-released-all-time-yearly.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="900" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPPoVteBezya7avj_sglUQd8ynFS42VkCiepjHCJsAwxV55j8OCPFlOUhKXlamTPaiHNf5rN7eyFhuK0oR2UpuNPdIgvRSVtVBbEx3wBs8mz52mqvTVrrHT3yYmXSp6RAtsYYJArTrwMBs/w640-h498/films-released-all-time-yearly.png" width="640" /></a></span></div><span style="font-size: large;"><br /></span><br /><p></p><h2 style="text-align: left;"><span style="font-size: medium;">Final Thoughts<br /></span></h2><p><span style="font-size: large;"><span style="font-family: georgia;">The IMDb dataset turned out to be richer and deeper than I expected, and I have only scratched the surface. There is plenty to play with: genres, runtimes, adult movies (yes, probably for compliance, IMDb flags each title as adult or not), and, of course, ratings. 
IMDb uses an adjusted (weighted) rating formula (based on averages and the number of user votes) in its rankings (see <a href="https://help.imdb.com/article/imdb/track-movies-tv/weighted-average-ratings/GWT2DSBYVT2F25SK?ref_=helpart_nav_8#">Weighted Average Ratings</a>), so the title <b>averageRating</b> we looked at can't be taken at face value after all.</span></span></p><p><br /></p><h2 style="text-align: left;"><span style="font-size: medium;">Source Code<br /></span></h2><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;">The IMDb R notebook with all data prep and visualizations for this post can be found here: <a href="https://github.com/grigory93/r-notebooks/blob/master/IMDB-movie-ratings.Rmd">https://github.com/grigory93/r-notebooks/blob/master/IMDB-movie-ratings.Rmd</a></span></span></p><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;"><span style="font-size: small;"><b>Information courtesy of IMDb (<a href="http://www.imdb.com">http://www.imdb.com</a>).<br />
Used with permission.</b></span> <br /></span></span></p><p style="text-align: left;"><span style="font-size: large;"><span style="font-family: georgia;"></span></span></p>
<p>H2O.ai Academic Program for Professors and Students: Part 2 - Creating Your First (Time Series) Experiment<br />Gregory Kanevsky, published 2020-04-13, updated 2021-01-09</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJQARR4dhFw_yoidEtj7_WwKgvmkRxCUnJDfpLy5a95ZOh3Z3C5WjB2EbSap4vb7mXRkNmuPvgCO5rmu-t3JOCHeKhY7lXzlf3qpqkyvBr3C9oRBRruD6HH48413rnQElB_bi0wpolgAGS/s1600/DAI-artifacts+relationships.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="405" data-original-width="720" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJQARR4dhFw_yoidEtj7_WwKgvmkRxCUnJDfpLy5a95ZOh3Z3C5WjB2EbSap4vb7mXRkNmuPvgCO5rmu-t3JOCHeKhY7lXzlf3qpqkyvBr3C9oRBRruD6HH48413rnQElB_bi0wpolgAGS/s640/DAI-artifacts+relationships.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="http://novyden.blogspot.com/2020/02/h2oai-academic-program-for-professors.html" target="_blank">Part 1</a> of this blog series discussed how to:</span></span><br />
<ol>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">apply for a free academic license for Driverless AI, H2O.ai's automated machine learning (AutoML) platform,</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">spin up a VM that can host Driverless AI with the budget-oriented cloud provider Paperspace,</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">install Driverless AI on the VM, including configuration that utilizes the powerful GPUs available on Paperspace.</span></span></li>
</ol>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">In part 2 we'll show how to:</span></span><br />
<ol>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">upgrade Driverless AI on a Linux VM</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">organize the modeling workflow in Driverless AI</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">manipulate and load datasets</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">perform automated data exploration</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">create models to forecast time series</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">analyze a time series model created with Driverless AI</span></span></li>
</ol>
<br />
<h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Back to Paperspace VM </span></span></h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">At the end of part 1 we had a fully functional instance of Driverless AI, but</span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">it was likely stopped by Paperspace due to inactivity, and</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">a new version of H2O Driverless AI, 1.8.5, has been released since.</span></span></li>
</ul>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">So we begin by starting the Paperspace VM and upgrading the H2O software to the latest version. Please adjust the steps below to your specific circumstances and, possibly, to a newer version of Driverless AI.</span></span><br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">1. Starting VM in Paperspace </span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">When you log back in to Paperspace and go to the console (under <span style="font-family: "arial" , "helvetica" , sans-serif;">Core -> Compute</span>), it shows the VM you created in the state <span style="font-family: "arial" , "helvetica" , sans-serif;">"Off"</span>. Click anywhere on the machine box:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDs_Hv5r3zka5FA9eqfuDDZuwIKuZ2SN7WqfgDQrja_aG3y7fIIPin4fqSq8eqAOFUnOPZckLXAJly0ssRis1fAGFZ_R4zxJro7X0VEeknTjmiLLwX_JSXADXM1i_lMz_6aZ6XP0eOLhGK/s1600/Screen+Shot+2020-04-03+at+22.20.32.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="996" data-original-width="1600" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDs_Hv5r3zka5FA9eqfuDDZuwIKuZ2SN7WqfgDQrja_aG3y7fIIPin4fqSq8eqAOFUnOPZckLXAJly0ssRis1fAGFZ_R4zxJro7X0VEeknTjmiLLwX_JSXADXM1i_lMz_6aZ6XP0eOLhGK/s640/Screen+Shot+2020-04-03+at+22.20.32.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The next screen displays the machine's terminal view, including a button to start the VM:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZw22O_gVY1AGn8Eq1UyPcwucqn6hP5oCHspHaqXCTiM3awXpVPuO-ig6AcmczUdz4dlqpMw69CPeRMHrdvgVqT_2VQb7Qd00D3Imxe4SOwh9-FR23zj6gqoWQUip8uhKsPNISVXh-f7-2/s1600/Screen+Shot+2020-04-04+at+20.56.28.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="996" data-original-width="1600" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZw22O_gVY1AGn8Eq1UyPcwucqn6hP5oCHspHaqXCTiM3awXpVPuO-ig6AcmczUdz4dlqpMw69CPeRMHrdvgVqT_2VQb7Qd00D3Imxe4SOwh9-FR23zj6gqoWQUip8uhKsPNISVXh-f7-2/s640/Screen+Shot+2020-04-04+at+20.56.28.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">After pressing the Start machine button, wait for the terminal window to appear, which indicates that the VM restarted successfully:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-6NvqTN1L2uMPn9npt3RWB9Pxok1EEDYx5PHBHt-90COQUE1WqaWWkIUxy8_UXYJnn-6Ada_-PArg5rPqd-xX0Crxr6WRx4t6mjhq4lI3bY8v0v8xJ7Bc7QYP1IgwIhOUQyjawWsyWO5i/s1600/Screen+Shot+2020-04-04+at+21.05.17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="921" data-original-width="1600" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-6NvqTN1L2uMPn9npt3RWB9Pxok1EEDYx5PHBHt-90COQUE1WqaWWkIUxy8_UXYJnn-6Ada_-PArg5rPqd-xX0Crxr6WRx4t6mjhq4lI3bY8v0v8xJ7Bc7QYP1IgwIhOUQyjawWsyWO5i/s640/Screen+Shot+2020-04-04+at+21.05.17.png" width="640" /></a></div>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">2. Upgrading Driverless AI </span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Locate the original email that Paperspace sent when you created the VM in <a href="http://novyden.blogspot.com/2020/02/h2oai-academic-program-for-professors.html" target="_blank">part 1</a>; it contains the <span style="font-family: "arial" , "helvetica" , sans-serif;">ssh</span> command and password (unless you have changed it since). To <span style="font-family: "arial" , "helvetica" , sans-serif;">ssh</span>, you can use either the web-based terminal window shown above or a terminal application like Mac OS Terminal (I prefer the latter as it allows easy copy and paste on Mac OS):</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC39FllghHb8LQKqHfFgDnRAQnIH9RGMFOy0Lg-YljZeYSAXrvfL6zi_vwlRRZTR3sl7Ae0SWVdGnAkg3thnphO4mpSE0Up1ZvstMemxV6L3YN_9W408R4beCfm7_FyssCDfKiyeaqBVVW/s1600/Screen+Shot+2020-04-05+at+05.25.42.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="946" data-original-width="1482" height="408" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC39FllghHb8LQKqHfFgDnRAQnIH9RGMFOy0Lg-YljZeYSAXrvfL6zi_vwlRRZTR3sl7Ae0SWVdGnAkg3thnphO4mpSE0Up1ZvstMemxV6L3YN_9W408R4beCfm7_FyssCDfKiyeaqBVVW/s640/Screen+Shot+2020-04-05+at+05.25.42.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">At this point we can upgrade to release 1.8.5.1 (the latest version at the time of this writing). To locate the installer, point your browser to <a href="https://www.h2o.ai/download/">https://www.h2o.ai/download/</a> and click the button for the latest stable version of Driverless AI (1.8 LTS at this time):</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaQIn8H3zxhQs4HI5jqOXAQTuyv34EqcmNQdkCf3eQFpVY95SmwCnFirZ5OfCHkPEYHr-mRhuZMk30VQLtJCckWazVDoea9rqcS_8oiziYJbUFpc2n6yA6MFOcq4o_50CSzquap7PObQgj/s1600/Screen+Shot+2020-04-05+at+05.31.06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="812" data-original-width="1600" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaQIn8H3zxhQs4HI5jqOXAQTuyv34EqcmNQdkCf3eQFpVY95SmwCnFirZ5OfCHkPEYHr-mRhuZMk30VQLtJCckWazVDoea9rqcS_8oiziYJbUFpc2n6yA6MFOcq4o_50CSzquap7PObQgj/s640/Screen+Shot+2020-04-05+at+05.31.06.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This brings you to the download page for version 1.8.5.1 (or later). Make sure it displays the <span style="font-family: "arial" , "helvetica" , sans-serif;">Linux (X86)</span> tab (the first tab), then copy the location of the installer file by right-clicking the Download link next to the DEB Ubuntu 16.04/Ubuntu 18.04 option:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYXnYpyHK0TR2TyeiTChJQ87GitLdDc4nzCXH2SAUM0hn6Hit8k2yahRkQo9j0JnQCEfYIzUjb32K32bnA9C5QKiUA7fSJoghHUzKm4-2bkKrX-PVrotPEbZ3ZHpj7GIBjXIXxV0TNexEX/s1600/Screen+Shot+2020-04-05+at+05.33.03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1129" data-original-width="1600" height="450" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYXnYpyHK0TR2TyeiTChJQ87GitLdDc4nzCXH2SAUM0hn6Hit8k2yahRkQo9j0JnQCEfYIzUjb32K32bnA9C5QKiUA7fSJoghHUzKm4-2bkKrX-PVrotPEbZ3ZHpj7GIBjXIXxV0TNexEX/s640/Screen+Shot+2020-04-05+at+05.33.03.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Go back to the terminal window, type the <span style="font-family: "arial" , "helvetica" , sans-serif;">wget</span> command, and paste the file location after it:</span></span><br />
<blockquote class="tr_bq">
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-size: medium;"><span style="font-size: x-small;"><span style="font-family: "arial" , "helvetica" , sans-serif;">wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.8.5-64/x86_64-centos7/dai_1.8.5.1_amd64.deb</span></span></span></span></blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqG6b7dZ18eHAUePpW5hAAPwyXRx8xWVcXZHcGZ9dRXMbZNcntnpExGZKZQULwEoXXdfi8NYJ2W9HpbT1RHsZynEdAwp0ca2fIAVb2waqJW3fcBb-urlqKlVDV4haTP0y2ehQmJ50uk_6L/s1600/Screen+Shot+2020-04-05+at+05.45.24.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1002" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqG6b7dZ18eHAUePpW5hAAPwyXRx8xWVcXZHcGZ9dRXMbZNcntnpExGZKZQULwEoXXdfi8NYJ2W9HpbT1RHsZynEdAwp0ca2fIAVb2waqJW3fcBb-urlqKlVDV4haTP0y2ehQmJ50uk_6L/s640/Screen+Shot+2020-04-05+at+05.45.24.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Once <span style="font-family: "arial" , "helvetica" , sans-serif;">wget</span> finishes the download, upgrade Driverless AI with the commands below (don't forget to replace the file name with the one for your version; given how frequently H2O ships releases, you will likely be installing a newer one):</span></span><br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "arial" , "helvetica" , sans-serif;">sudo systemctl stop dai<br />sudo dpkg -i dai_1.8.5.1_amd64.deb<br />sudo nvidia-smi -pm 1<br />sudo systemctl daemon-reload<br />sudo systemctl start dai</span></span></blockquote>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">After executing the commands above, the terminal screen should look similar to this:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIt1C_NsHxKeWqjNEXNo9ZWfDJUJ9egsFg9qf5TiBn83hQRPO_CaJO6XeJEYs3AHyZocORVc0nyKafdsDtcSC_cSTSxRNyH3uyfvw-RFgoW55G8dWQg7U3W1G8ycdjWCyC_WYf9mWkVl2b/s1600/Screen+Shot+2020-04-05+at+05.58.46.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1002" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIt1C_NsHxKeWqjNEXNo9ZWfDJUJ9egsFg9qf5TiBn83hQRPO_CaJO6XeJEYs3AHyZocORVc0nyKafdsDtcSC_cSTSxRNyH3uyfvw-RFgoW55G8dWQg7U3W1G8ycdjWCyC_WYf9mWkVl2b/s640/Screen+Shot+2020-04-05+at+05.58.46.png" width="640" /></a></div>
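<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Since the installer file name changes with every release, the steps above can be sketched as a small shell helper that derives the file name from the download URL and prints the commands for review instead of running them. This is only a sketch: the function names <span style="font-family: "arial" , "helvetica" , sans-serif;">dai_deb_name</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">dai_upgrade_plan</span> are hypothetical, and the commands are simply the ones listed in this post.</span></span><br />

```shell
# Hypothetical helpers (not part of Driverless AI): they only print commands.

# Derive the installer file name from its download URL.
dai_deb_name() {
  basename "$1"
}

# Print the download and upgrade commands from this post, one per line,
# for the installer URL given as the first argument.
dai_upgrade_plan() {
  printf '%s\n' \
    "wget $1" \
    "sudo systemctl stop dai" \
    "sudo dpkg -i $(dai_deb_name "$1")" \
    "sudo nvidia-smi -pm 1" \
    "sudo systemctl daemon-reload" \
    "sudo systemctl start dai"
}
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Calling <span style="font-family: "arial" , "helvetica" , sans-serif;">dai_upgrade_plan</span> with the URL copied from the download page prints the six commands with the matching file name, which you can then run one by one.</span></span><br />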
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">To verify that the upgrade took place and Driverless AI is running, point your browser to the public IP address of your VM on port 12345:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ_B4xvUSo_bu-mMau0bgH7ALjyu3dJG8ZsiA8RK4yUTe1aR6l8Hn5cNQ_5dH5FbZMq_zXbqS66xRQmnOVY4VLI6sSokXts-BWHH7WmLgDHIMMJQofAY58s7nj5_kQiUEmY-4KN3rTf6kd/s1600/Screen+Shot+2020-04-05+at+06.04.11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1334" data-original-width="1600" height="532" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ_B4xvUSo_bu-mMau0bgH7ALjyu3dJG8ZsiA8RK4yUTe1aR6l8Hn5cNQ_5dH5FbZMq_zXbqS66xRQmnOVY4VLI6sSokXts-BWHH7WmLgDHIMMJQofAY58s7nj5_kQiUEmY-4KN3rTf6kd/s640/Screen+Shot+2020-04-05+at+06.04.11.png" width="640" /></a></div>
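<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The same check can be sketched from a terminal. The helpers below are hypothetical names introduced for illustration, and the sketch assumes <span style="font-family: "arial" , "helvetica" , sans-serif;">curl</span> is installed and that the server answers plain HTTP on the port used in this post.</span></span><br />

```shell
# Hypothetical helpers for illustration only.

# Build the Driverless AI URL for a VM address; the port defaults to 12345,
# the one used in this post.
dai_url() {
  printf 'http://%s:%s\n' "$1" "${2:-12345}"
}

# Succeed if anything answers HTTP at that URL within 10 seconds.
dai_is_up() {
  curl --silent --output /dev/null --max-time 10 "$(dai_url "$1" "${2:-12345}")"
}
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">For example, <span style="font-family: "arial" , "helvetica" , sans-serif;">dai_is_up YOUR_VM_IP && echo running</span> prints "running" only when the connection succeeds. On the VM itself, <span style="font-family: "arial" , "helvetica" , sans-serif;">systemctl status dai</span> shows whether the service started.</span></span><br />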
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIjMU0yK6oAdfIZqOEuUwJ6OjSNGkZGepcWnb9Re-7GRJEqY9k-u62uKHDWiP64iBtNTXl5WeAKNJELA8emIWMCtPvTcxAmVdZPt_08RJSD0q9JliaNmSNx_DYUG_UJm-_RNVpgQN6LuO3/s1600/Screen+Shot+2020-04-05+at+06.07.51.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1334" data-original-width="1600" height="532" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIjMU0yK6oAdfIZqOEuUwJ6OjSNGkZGepcWnb9Re-7GRJEqY9k-u62uKHDWiP64iBtNTXl5WeAKNJELA8emIWMCtPvTcxAmVdZPt_08RJSD0q9JliaNmSNx_DYUG_UJm-_RNVpgQN6LuO3/s640/Screen+Shot+2020-04-05+at+06.07.51.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Using the same credentials as in step 24 of part 1 (<span style="font-family: "verdana" , sans-serif;">h2oai/h2oai</span>), log in to Driverless AI and go to Resources -> System Info to confirm that the parameters of your system match the Paperspace spec:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxDzismov2HQAhyrBC7HlYJDjRNIBNoarrjgoc2fknI-zX20pFHAD0NvnuMFheGdhM0FI892d_rhtTPvzicQqaaAWh8dUrMo7tW84jB3uvjajVFg_wKocKpqo3Zp7TocUuEma-4V_MBcDS/s1600/Screen+Shot+2020-04-05+at+06.14.23.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1002" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxDzismov2HQAhyrBC7HlYJDjRNIBNoarrjgoc2fknI-zX20pFHAD0NvnuMFheGdhM0FI892d_rhtTPvzicQqaaAWh8dUrMo7tW84jB3uvjajVFg_wKocKpqo3Zp7TocUuEma-4V_MBcDS/s640/Screen+Shot+2020-04-05+at+06.14.23.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Because the disk I used is rather small, 57% of it is already in use. To free up space, I recommend removing the installer file(s) used to install Driverless AI:</span></span><br />
<blockquote class="tr_bq">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "arial" , "helvetica" , sans-serif;">rm *.deb </span></span></span></blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyLs_uoZDW7AwNESWnsnGGCmcwp49V68qomoKaP5iMJZ2irYUi0310VfkEef9oeHH9hDV6sYIXy7DuRak72VfRFtXGaTl1gSjHANZeHJrfMvRP1xFVDf5vL053gqumBRiP0Lu7m13FZ9gy/s1600/Screen+Shot+2020-04-05+at+06.19.41.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1046" data-original-width="1600" height="418" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyLs_uoZDW7AwNESWnsnGGCmcwp49V68qomoKaP5iMJZ2irYUi0310VfkEef9oeHH9hDV6sYIXy7DuRak72VfRFtXGaTl1gSjHANZeHJrfMvRP1xFVDf5vL053gqumBRiP0Lu7m13FZ9gy/s640/Screen+Shot+2020-04-05+at+06.19.41.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This freed up over 10G of space in my case.</span></span><br />
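<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">A slightly more cautious variant of the cleanup above is sketched below: it lists the installer files first so you can see what will be deleted. The function names are hypothetical, and the sketch assumes a <span style="font-family: "arial" , "helvetica" , sans-serif;">find</span> that supports <span style="font-family: "arial" , "helvetica" , sans-serif;">-maxdepth</span> (GNU and BSD versions do).</span></span><br />

```shell
# Hypothetical helpers for illustration only.

# List .deb files (with sizes) in a directory; defaults to the current one.
list_installers() {
  find "${1:-.}" -maxdepth 1 -type f -name '*.deb' -exec ls -lh {} +
}

# Remove the .deb files in that directory, leaving everything else alone.
remove_installers() {
  find "${1:-.}" -maxdepth 1 -type f -name '*.deb' -exec rm -f {} +
}
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Running <span style="font-family: "arial" , "helvetica" , sans-serif;">df -h /</span> before and after <span style="font-family: "arial" , "helvetica" , sans-serif;">remove_installers</span> shows how much space was recovered.</span></span><br />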
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: small;"> </span></span></span><br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">3. Troubleshooting</span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> </span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Sometimes Driverless AI doesn't start successfully, in which case the browser is unable to establish a connection. If that happens, enable <a href="https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf" target="_blank">persistence mode for Nvidia GPUs</a> and restart Driverless AI with these commands:</span></span><br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">sudo nvidia-smi -pm 1<br />sudo systemctl stop dai</span></span></span><span style="font-size: large;"><br /><span style="font-family: "arial" , "helvetica" , sans-serif;">sudo systemctl start dai</span></span></blockquote>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Ultimately, the goal is to run <span style="font-family: "arial" , "helvetica" , sans-serif;">nvidia-smi -pm 1</span> each time the system starts. One way to accomplish this is with the cron utility, by adding a task that <a href="https://askubuntu.com/a/816/514493" target="_blank">executes each time the system restarts</a>: </span></span><br />
<ol>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">run <span style="font-family: "arial" , "helvetica" , sans-serif;">sudo crontab -e</span> to edit the crontab file containing cron tasks for root</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">if you are using <span style="font-family: "arial" , "helvetica" , sans-serif;">crontab</span> for the first time, it prompts you to pick an editor</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">add as the first line (after the comments): <br /><span style="font-family: "arial" , "helvetica" , sans-serif;">@reboot nvidia-smi -pm 1</span> </span></span></li>
</ol>
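<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The steps above can also be sketched without an interactive editor. The filter below is a hypothetical helper that reads an existing crontab on standard input and prints it with the <span style="font-family: "arial" , "helvetica" , sans-serif;">@reboot</span> line added only if it is missing, so running it twice does not duplicate the entry. Unlike the manual steps, it appends the line at the end, which should not matter for an <span style="font-family: "arial" , "helvetica" , sans-serif;">@reboot</span> task.</span></span><br />

```shell
# Hypothetical helper for illustration only.
# Reads a crontab on stdin and prints it unchanged, followed by the
# @reboot entry from this post unless an identical line already exists.
with_reboot_entry() {
  entry='@reboot nvidia-smi -pm 1'
  awk -v entry="$entry" '
    { print }
    $0 == entry { found = 1 }
    END { if (!found) print entry }
  '
}
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Assuming your cron implementation accepts a crontab on standard input (the one shipped with Ubuntu does), root's crontab could then be updated with <span style="font-family: "arial" , "helvetica" , sans-serif;">sudo crontab -l 2>/dev/null | with_reboot_entry | sudo crontab -</span>.</span></span><br />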
<br />
<h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Dataset </span></span></h2>
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">4. Preparing Data</span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">We prepared data to demonstrate how to run experiments and analyze models with Driverless AI. To depart from trivial examples like Titanic or other ML "Hello, World!" datasets, a COVID-19 theme made sense. We won't model the exponential growth curve of COVID-19 cases, though: that work, while important and extremely powerful, is already covered on the H2O blog in <a href="https://www.h2o.ai/blog/modelling-currently-infected-cases-of-covid-19-using-h2o-driverless-ai/" target="_blank">Modeling Currently Infected Case by COVID-19 Using H2O Driverless AI</a> by <a href="https://www.linkedin.com/in/mariosmichailidis/" target="_blank">Marios Michailidis</a>. To complement that analysis, let's look into forecasting demand for certain product groups. We prepared the data with the R package <span style="font-family: "arial" , "helvetica" , sans-serif;">gtrendsR</span>, which uses Google Trends as a proxy for demand for popular products during the COVID-19 crisis: </span></span><br />
<br />
<div>
<code data-gist-file="covid-19-products-google-trends.R" data-gist-hide-footer="false" data-gist-id="6021d0915f29e32673f3953589eb2899"></code>
</div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The dataset contains daily <a href="https://support.google.com/trends/answer/4365533?hl=en" target="_blank">Google trends (search interest)</a> for products represented by keywords (serving as a proxy for real demand) in the United States, Canada, and Great Britain (majority English-speaking) from 2019-12-28 through 2020-04-04 (before breaking it up, which is discussed next):</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjDViyy0I6B8wQglb-WN7Qs0Pn7_bUDJW7NQE8BNW5SMT9CuYrvh2YMRE3zcC5G_MKoj6Lx1TD8iMmLQx_n5v91PY2aEm2WZuLRMb7sVcaFbswDJ3S2IVZtJzjyNGxUUEqRt8b9APrnYsf/s1600/covid-19-product-google-trends.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1500" data-original-width="1500" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjDViyy0I6B8wQglb-WN7Qs0Pn7_bUDJW7NQE8BNW5SMT9CuYrvh2YMRE3zcC5G_MKoj6Lx1TD8iMmLQx_n5v91PY2aEm2WZuLRMb7sVcaFbswDJ3S2IVZtJzjyNGxUUEqRt8b9APrnYsf/s640/covid-19-product-google-trends.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Lastly, two datasets are created: a training set containing data since 2020-01-01 except for the last week (2020-03-31 through 2020-04-06), and a test set made up of that last week, exactly 7 days long. We allocated one week of test data because our model will forecast the next 7 days of demand, so test data spanning exactly the same time period is ideal.</span></span><br />
<br />
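<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To make the split concrete, here is a minimal Python sketch of the same idea. The actual data preparation was done in R in the gist above; the rows below are made up for illustration (a constant <span style="font-family: "arial" , "helvetica" , sans-serif;">hits</span> value per day) and the variable names are our own. Only the dates come from the description above.</span></span><br />

```python
from datetime import date, timedelta

# Hypothetical rows standing in for the Google Trends data:
# one record per day with a made-up "hits" value.
first_day = date(2020, 1, 1)
last_day = date(2020, 4, 6)
rows = [
    {"date": first_day + timedelta(days=i), "hits": 50}
    for i in range((last_day - first_day).days + 1)
]

# The test window is the last 7 days of the data.
test_start = last_day - timedelta(days=6)

# Everything before the window trains the model,
# the window itself is held out for testing.
train = [r for r in rows if r["date"] < test_start]
test = [r for r in rows if r["date"] >= test_start]

print(test_start, len(train), len(test))
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Splitting on a date rather than at random keeps the test set strictly in the future relative to the training set, which is what a 7-day forecast needs.</span></span><br />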
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">In later installments of this series we show how to create a dataset with Google Trends inside Driverless AI using data recipes. </span></span><br />
<br />
<h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Getting Started with Driverless AI</span></span></h2>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">4. Navigating Driverless AI</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Several <b>unofficial</b> rules of Driverless AI will guide us throughout this post, starting with how to organize the flow of activities:</span></span><br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #1:</span></span><i><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /><br />Always try to follow the same flow of actions as the order of the navigation tabs at the top, from left to right:</span></span></i></blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbggv76xSK2OzNqFw1TSLmKrDvw_kY0lUiLWFw6kzSy-8RzBoz6M1Zzx1S-JA_krP2BOaEeZHOje_lGiW6FNXi_GG3YT2cZdXFypriIjuOThk-GhixOrHq-DH3VvT11oOkEljtwyExwrRC/s1600/Screen+Shot+2020-04-08+at+16.49.52.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="328" data-original-width="1600" height="130" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbggv76xSK2OzNqFw1TSLmKrDvw_kY0lUiLWFw6kzSy-8RzBoz6M1Zzx1S-JA_krP2BOaEeZHOje_lGiW6FNXi_GG3YT2cZdXFypriIjuOThk-GhixOrHq-DH3VvT11oOkEljtwyExwrRC/s640/Screen+Shot+2020-04-08+at+16.49.52.png" width="640" /></a></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The tabs that lead you through the Driverless AI workflow are:</span></span><br />
<ol>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Datasets</span>: displays and manages datasets ingested into Driverless AI system (action: ingesting and preparing data);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">AutoViz</span>: displays and manages list of automated visualization dashboards (one per dataset, action: automated exploratory data analysis);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Experiments</span>: displays and manages list of Driverless AI experiments (multiple experiments per dataset, action: creating machine learning models);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Diagnostics</span>: displays and manages list of model diagnostic dashboards (multiple diagnostics per experiment and dataset possible, action: analyzing model performance);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">MLI</span>: displays and manages list of Machine Learning Interpretability dashboards (usually one explanation dashboard per experiment, action: explaining and interpreting models);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Deployments</span>: displays and manages list of model deployments (multiple deployments per experiment possible depending on environment, action: deploying models).</span></span></li>
</ol>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To better understand the flow and relationships between Driverless AI artifacts, review the following diagram:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMrleW1JprVd4ST6c6fk58Oet2oL6sou4Gum8t7DzXojR62MXvIP7AImPl1LgqgnTm95D1v5enxyaIBDpocOYnPpp8lGkXvmV7tMXKHQUmRHUwglBECp4P4Xwa38v7ce9UptXRvc8OihC6/s1600/DAI-artifacts+relationships.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="405" data-original-width="720" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMrleW1JprVd4ST6c6fk58Oet2oL6sou4Gum8t7DzXojR62MXvIP7AImPl1LgqgnTm95D1v5enxyaIBDpocOYnPpp8lGkXvmV7tMXKHQUmRHUwglBECp4P4Xwa38v7ce9UptXRvc8OihC6/s1600/DAI-artifacts+relationships.png" /></a></span></span></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Thus, a complete Driverless AI workflow consists of ingesting a dataset, exploring it with AutoViz, creating a model (experiment) trained on a dataset, diagnosing a model, exploring a model in MLI, and finally deploying it.</span></span><br />
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">5. Loading Data</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">In Driverless AI, go to the Datasets tab and click on Add Datasets, then pick the Upload File option:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRMnScpism1lADpN6I1RdVYiRdl9HVdqR9RNjhNn8SEyu_xasDoZM3NIKqrhPHl0yb9cvG5_U8gPFiP0NhYZOZ14Mn2gYua3QBx8wUHz0Pqa-p1_ALYXG70m0BvmxFz7y6UYqPKUrs06Is/s1600/Screen+Shot+2020-04-05+at+17.15.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1027" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRMnScpism1lADpN6I1RdVYiRdl9HVdqR9RNjhNn8SEyu_xasDoZM3NIKqrhPHl0yb9cvG5_U8gPFiP0NhYZOZ14Mn2gYua3QBx8wUHz0Pqa-p1_ALYXG70m0BvmxFz7y6UYqPKUrs06Is/s640/Screen+Shot+2020-04-05+at+17.15.10.png" width="640" /></a></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The browser will open a file picker where you can choose one or multiple files at once:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG4GbtAQeaxw5nDbXs32jrtg3LlW6vvC6dYf1gKR2Y7CeRHeG7datdUEB5za8GuASnO0I9wqm_4NXWV3lauv2dbVcOuvagQmSdes1CoqjbcTUG7yAtnwvYr0SUGWWBgqcf9GYID2iFoqwZ/s1600/Screen+Shot+2020-04-05+at+17.23.22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1027" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG4GbtAQeaxw5nDbXs32jrtg3LlW6vvC6dYf1gKR2Y7CeRHeG7datdUEB5za8GuASnO0I9wqm_4NXWV3lauv2dbVcOuvagQmSdes1CoqjbcTUG7yAtnwvYr0SUGWWBgqcf9GYID2iFoqwZ/s640/Screen+Shot+2020-04-05+at+17.23.22.png" width="640" /></a></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">This will trigger the upload, auto-parsing, and saving of the Google Trends data files from your local machine to Driverless AI storage:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidx62ku2qxfHNf-_uRYlhtR5Qw06IN6scy8oVVmGe5RqIRkBoMLjjKHoPUnKtfULeaa8cK6pHf9VYoDQiob5pyScnHbiMldGawVx5ZRr8BM_cx90i-gApxPeKnWtjlU7ybGKmjCWEu_IbX/s1600/Screen+Shot+2020-04-05+at+17.25.55.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1027" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidx62ku2qxfHNf-_uRYlhtR5Qw06IN6scy8oVVmGe5RqIRkBoMLjjKHoPUnKtfULeaa8cK6pHf9VYoDQiob5pyScnHbiMldGawVx5ZRr8BM_cx90i-gApxPeKnWtjlU7ybGKmjCWEu_IbX/s640/Screen+Shot+2020-04-05+at+17.25.55.png" width="640" /></a></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Alternatively, since both files are also available from the public S3 bucket at <span style="font-family: "arial" , "helvetica" , sans-serif;">https://s3.console.aws.amazon.com/s3/buckets/h2o-public-test-data/smalldata/timeSeries/?region=us-east-1</span>, you can import them using the <span style="font-family: "arial" , "helvetica" , sans-serif;">Amazon S3</span> option by entering the S3 URL: <span style="font-family: "arial" , "helvetica" , sans-serif;">s3://h2o-public-test-data/smalldata/timeSeries/</span> </span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgz3zB9mCG42hhhtzjpJy2nl-48iup-zAkCTfPt_34Di3NXQcIEayQI7bbIWfYFAd8dwa2_nfm2ftIrS3DTrFGCG492B9ra9FwC5HLzjLP8ReS1-dnsB4EJjset_b5O2rEJ8IVaRPE3VHiM/s1600/Screen+Shot+2020-04-05+at+17.36.35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1027" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgz3zB9mCG42hhhtzjpJy2nl-48iup-zAkCTfPt_34Di3NXQcIEayQI7bbIWfYFAd8dwa2_nfm2ftIrS3DTrFGCG492B9ra9FwC5HLzjLP8ReS1-dnsB4EJjset_b5O2rEJ8IVaRPE3VHiM/s640/Screen+Shot+2020-04-05+at+17.36.35.png" width="640" /></a></div>
<br />
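<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If the S3 URL format is unfamiliar, the short Python sketch below shows how it breaks down into a bucket name and a folder (key prefix). The helper <span style="font-family: "arial" , "helvetica" , sans-serif;">split_s3_url</span> is our own illustration built on the standard library; it is not part of Driverless AI and it does not contact S3.</span></span><br />

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3:// URL into (bucket, key prefix). Illustrative helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url}")
    # In an s3:// URL the "host" part is the bucket name and
    # the path (minus its leading slash) is the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_url("s3://h2o-public-test-data/smalldata/timeSeries/")
print(bucket)
print(prefix)
```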
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI will warn you but still let you create duplicate datasets if you confirm your intent. Besides wasting disk space, duplicates may introduce confusion down the road, so choose either uploading from your machine or importing from S3, but not both (feel free to try different options, though: you can always delete extra datasets afterwards).</span></span><br />
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">6. Dataset Details</span></span></h3>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #2:</span></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>Check after dataset auto-parsing to confirm that data was imported as expected, e.g. column names, data types, missing values, etc.</i> </span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The main reason for checking is not that Driverless AI's auto-parsing functionality is lacking, but something much simpler: there is always ambiguity in data that may result in multiple acceptable formats and/or data types. For example, a categorical column represented with only numeric values is usually parsed as a numeric data type. Dataset details let the user both review and correct data type decisions made by auto-parsing: in the Datasets tab, click on <span style="font-family: "arial" , "helvetica" , sans-serif;">product_demand_train.csv</span> to see the available actions: <span style="font-family: "arial" , "helvetica" , sans-serif;">Details, Visualize, Split, Predict, Rename, Download,</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">Delete</span>:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg62RCsyhyphenhyphen00xN6T1vaywghcU7LmVT-x0XRNgb6c7vxk6oGDWWXZeJMFrPSkNjZr-maWh90hRXyiAN8yNJ_je2Owya6bQMo7si51wT6yhwWx77xthpRLJhU_HFmZ71Yj67VZSGq7XsR0NIW/s1600/Screen+Shot+2020-04-05+at+17.49.28.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1027" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg62RCsyhyphenhyphen00xN6T1vaywghcU7LmVT-x0XRNgb6c7vxk6oGDWWXZeJMFrPSkNjZr-maWh90hRXyiAN8yNJ_je2Owya6bQMo7si51wT6yhwWx77xthpRLJhU_HFmZ71Yj67VZSGq7XsR0NIW/s640/Screen+Shot+2020-04-05+at+17.49.28.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">After choosing <span style="font-family: "arial" , "helvetica" , sans-serif;">Details</span>, Driverless AI displays the <a href="http://docs.h2o.ai/driverless-ai/1-8-lts/docs/userguide/datasets.html#dataset-details" target="_blank">Dataset Details view</a>:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ1tCI8sRGm_d7761mx2aA7GezaEoF1KakRwReWUY_eG_Gjy5hrYZCSmy_qq0dkJBJIC8oO_zua2IHFw442RSSaQmfMqjrnfx3VSl9ax963uzFqHFo5-tIrRcSrfvjHoM-1qXQQuX3DVBC/s1600/Screen+Shot+2020-04-09+at+16.49.02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="943" data-original-width="1600" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ1tCI8sRGm_d7761mx2aA7GezaEoF1KakRwReWUY_eG_Gjy5hrYZCSmy_qq0dkJBJIC8oO_zua2IHFw442RSSaQmfMqjrnfx3VSl9ax963uzFqHFo5-tIrRcSrfvjHoM-1qXQQuX3DVBC/s640/Screen+Shot+2020-04-09+at+16.49.02.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">This view contains summary statistics and a distribution plot for each column. It also offers ways to alter data types and data formats to correct auto-parsing, as mentioned above in rule #2. One example is backtracking from a numeric type to string (or categorical) for values containing only numeric characters. </span></span><br />
<br />
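<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The ambiguity behind rule #2 is easy to reproduce. The Python sketch below uses a deliberately naive type guesser of our own (<span style="font-family: "arial" , "helvetica" , sans-serif;">infer_type</span>, not the parser Driverless AI uses) and two hypothetical columns to show why a categorical column made of digits ends up looking numeric.</span></span><br />

```python
def infer_type(values):
    """Naive guess: 'numeric' if every value parses as a number, else 'string'."""
    try:
        for v in values:
            float(v)
    except ValueError:
        return "string"
    return "numeric"


# Hypothetical columns: ZIP codes are categories, but they look like numbers.
zip_codes = ["02134", "10001", "94105"]
keywords = ["face mask", "hand sanitizer", "toilet paper"]

print(infer_type(zip_codes))  # guessed from the digits alone
print(infer_type(keywords))

# Treating such a column as numeric silently changes the values:
# the leading zero of "02134" is lost.
print(int("02134"))
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">No parser can tell from the values alone that the first column is a label rather than a quantity, which is why reviewing the detected types is worth the extra minute.</span></span><br />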
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">7. Analyzing Data with AutoViz</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span></h3>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #3: </span></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>Never build a model without visualizing data in AutoViz first.</i> </span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For advanced exploration, choose the next option in the dataset menu, <span style="font-family: "arial" , "helvetica" , sans-serif;">Visualize</span>:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigrFRorib2WDN_3KpLInGukvdOmYEuzZYhmVbXTnxI2b8ZTtleZcWuhY9Ao-7rsByBAuYA151g26SQi0GYEnhWwZR_Ejl_q8i_2LiKN9kNrx-9ScHs07jt6HqWXDIFw1sADSCYYf2tWbGY/s1600/Screen+Shot+2020-04-09+at+13.41.46.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="943" data-original-width="1600" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigrFRorib2WDN_3KpLInGukvdOmYEuzZYhmVbXTnxI2b8ZTtleZcWuhY9Ao-7rsByBAuYA151g26SQi0GYEnhWwZR_Ejl_q8i_2LiKN9kNrx-9ScHs07jt6HqWXDIFw1sADSCYYf2tWbGY/s640/Screen+Shot+2020-04-09+at+13.41.46.png" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI will take you to the Visualizations tab:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8sBZK02fMss_idQXe5fivwSpA46JQeNdBMHoI3I9Co6F8pzRY9w0B-rYlkN3xpc8U0dpd9Nyc-ednQUB_7RES4ACD0zO9eboLM4oVoWt3gbuQWXbm2xrx4RPTimPxZ9YDdtA28ScT3lKY/s1600/Screen+Shot+2020-04-09+at+17.01.50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="943" data-original-width="1600" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8sBZK02fMss_idQXe5fivwSpA46JQeNdBMHoI3I9Co6F8pzRY9w0B-rYlkN3xpc8U0dpd9Nyc-ednQUB_7RES4ACD0zO9eboLM4oVoWt3gbuQWXbm2xrx4RPTimPxZ9YDdtA28ScT3lKY/s640/Screen+Shot+2020-04-09+at+17.01.50.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Click on <span style="font-family: "arial" , "helvetica" , sans-serif;">product_demand_test.csv</span> to display the AutoViz dashboard, which contains different types of advanced visualizations selected for the dataset:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1tT9Vqcwzp52jzkZv9yC7J1dxAaylgC0Jpg-IpAjbt2C10AOe6zZe_vcMmIxyXKQTXIWi7eW4yd-efR2naH2fzG6vv1Cul62ikYIC7A5IRitW1nbMQ0ajLtKM4mTllBkC3RQC2qhRdIX-/s1600/Screen+Shot+2020-04-09+at+17.02.41.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="943" data-original-width="1600" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1tT9Vqcwzp52jzkZv9yC7J1dxAaylgC0Jpg-IpAjbt2C10AOe6zZe_vcMmIxyXKQTXIWi7eW4yd-efR2naH2fzG6vv1Cul62ikYIC7A5IRitW1nbMQ0ajLtKM4mTllBkC3RQC2qhRdIX-/s640/Screen+Shot+2020-04-09+at+17.02.41.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">So what just happened? By choosing <span style="font-family: "arial" , "helvetica" , sans-serif;">Visualize</span> we triggered a fully automated process that runs statistical tests, unsupervised models, and anomaly detection, then selects interesting observations while leaving out trivial ones, and finally compiles them into a visual dashboard. In Driverless AI this workflow is called AutoViz, and it is characterized by:</span></span><br />
<ol>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">univariate analysis on dataset features: outliers, skewness, spiky distributions, gaps in distributions; </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">multivariate analysis on dataset: </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">correlations (including between numeric and categorical features), varying boxplots,<b> </b></span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">heteroscedastic boxplots, biplots, multivariate outliers, k-means clustering, 1-NN, SVD and more;</span></span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">qualifying which results to include, for example, correlated scatterplots include pairs of features </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">with value of squared Pearson’s <span style="font-family: "arial" , "helvetica" , sans-serif;">r</span> </span></span></span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">greater than 0.95;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">aggregating the data to display larger points: </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="http://docs.h2o.ai/driverless-ai/1-8-lts/docs/userguide/datasets.html#the-visualization-page" target="_blank">"the bigger the point is, the bigger number of exemplars (aggregated points) the plot covers"</a>.</span></span></li>
</ol>
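<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To illustrate the qualifying step, here is a minimal Python sketch of the squared Pearson's <span style="font-family: "arial" , "helvetica" , sans-serif;">r</span> check. The 0.95 cutoff comes from the list above; the function, the toy columns, and the way the check is applied are our own assumptions and not AutoViz internals.</span></span><br />

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


THRESHOLD = 0.95  # squared-r cutoff quoted in the text

# Toy columns, made up for illustration.
a = [1, 2, 3, 4, 5]
b = [2.1, 3.9, 6.2, 8.0, 9.8]  # almost a straight line against a
c = [5, 1, 4, 2, 3]            # no clear relationship with a

for name, ys in (("b", b), ("c", c)):
    r_squared = pearson_r(a, ys) ** 2
    # Only pairs above the cutoff would qualify for a correlated scatterplot.
    print(name, round(r_squared, 3), r_squared > THRESHOLD)
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Squaring <span style="font-family: "arial" , "helvetica" , sans-serif;">r</span> makes the check symmetric: strongly negative correlations qualify just like strongly positive ones.</span></span><br />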
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For details, please read the <a href="http://docs.h2o.ai/driverless-ai/1-8-lts/docs/userguide/datasets.html#visualizing-datasets" target="_blank">Visualizing Datasets</a> chapter in the docs. We leave AutoViz with an illustration of the k-means clustering analysis, which found 23 clusters (and no multivariate outliers) in the training dataset, displayed as a Parallel Coordinates Plot:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH2liz1aVfbPnHfdEmhV1nwHCtEHOdeWA5Lk0wmYFIdOMxLAcrwxV9I5_FT6QcxeAMcNBiLxno0r5bWzdDBVqveO6KXv88pFV3GlbwzuvgP6oJxtQ_Aozd4PZapT0koDxw9E4qhbtvh5RH/s1600/Screen+Shot+2020-04-09+at+17.03.35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="872" data-original-width="1600" height="348" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH2liz1aVfbPnHfdEmhV1nwHCtEHOdeWA5Lk0wmYFIdOMxLAcrwxV9I5_FT6QcxeAMcNBiLxno0r5bWzdDBVqveO6KXv88pFV3GlbwzuvgP6oJxtQ_Aozd4PZapT0koDxw9E4qhbtvh5RH/s640/Screen+Shot+2020-04-09+at+17.03.35.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">8. Starting Experiment</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To begin the AutoML workflow, go to the Datasets tab and click on <span style="font-family: "arial" , "helvetica" , sans-serif;">product_demand_train.csv</span>, then choose <span style="font-family: "arial" , "helvetica" , sans-serif;">Predict</span>: </span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAl4DUHFh6fz3slsT7baCJiNEgGoNEDOuHvQlAxHVXXJkNvJPupwCqJvJQDebRmM7X-wLYhNK6Kw9IvbTzB97_pMZQDiWP8jY03gKFHHhdyslQt8RouXwcvUzyr0nE44_DhLNteaO_NPeQ/s1600/Screen+Shot+2020-04-09+at+16.58.33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="943" data-original-width="1600" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAl4DUHFh6fz3slsT7baCJiNEgGoNEDOuHvQlAxHVXXJkNvJPupwCqJvJQDebRmM7X-wLYhNK6Kw9IvbTzB97_pMZQDiWP8jY03gKFHHhdyslQt8RouXwcvUzyr0nE44_DhLNteaO_NPeQ/s640/Screen+Shot+2020-04-09+at+16.58.33.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI will switch to a supervised machine learning experiment, prompting you to select a target:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifddqF_ee1629xy2NgqNJYfU5lkYslbavAr8AxTT6vp6KDRl81TbvRYW7ccwYoz7GI-s84ePKcJKmuWNbZ40wWcWj1kBn24AMP8kP_Ug52L0j7AObc3C8KCPN_lrO0klH8EYg7xIxOiy_m/s1600/Screen+Shot+2020-04-09+at+15.06.58.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="943" data-original-width="1600" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifddqF_ee1629xy2NgqNJYfU5lkYslbavAr8AxTT6vp6KDRl81TbvRYW7ccwYoz7GI-s84ePKcJKmuWNbZ40wWcWj1kBn24AMP8kP_Ug52L0j7AObc3C8KCPN_lrO0klH8EYg7xIxOiy_m/s640/Screen+Shot+2020-04-09+at+15.06.58.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: small;"> </span> </span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Select <span style="font-family: "arial" , "helvetica" , sans-serif;">hits</span> as the target first and observe that the other options get filled in automatically:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlRbMPi6dALLgg38YqZwhX_KVkRMo1_GoN809r99ACC0-Lwx83FJQC3pGc864MfTa_UGcx75z90gHa1T6bDqKspBAOwhrSkvp13PKf-hqEaHhp0a3MavFTXk-BTeUCLSOgE5A6cLQ6cQkR/s1600/Screen+Shot+2020-04-09+at+17.15.12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="935" data-original-width="1600" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlRbMPi6dALLgg38YqZwhX_KVkRMo1_GoN809r99ACC0-Lwx83FJQC3pGc864MfTa_UGcx75z90gHa1T6bDqKspBAOwhrSkvp13PKf-hqEaHhp0a3MavFTXk-BTeUCLSOgE5A6cLQ6cQkR/s640/Screen+Shot+2020-04-09+at+17.15.12.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><br /></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"></span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">What just happened? Driverless AI determined that:</span></span><br />
<ol>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">because the target variable <span style="font-family: "arial" , "helvetica" , sans-serif;">hits</span> is an integer with over 100 unique values, the problem type is regression;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">RMSE is set as the optimization metric; </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">the training data has ~10K rows and 4 features;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">the default settings are <i>accuracy</i> 7, <i>time</i> 2, and <i>interpretability</i> 8. The higher the <i>accuracy</i> (1-10), the more effort is invested to reach better results. The higher the <i>time</i> (1-10), the longer the experiment spends searching for the best transformations and hyperparameters. The higher the <i>interpretability</i> (1-10), the less complex the models and transformations used;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">finally, the high-level plan for the experiment pipeline shows:</span></span></li>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">all training data will be used (no sampling);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">algorithms to try: Decision Tree, </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">LightGBM, and </span></span>XGBoost;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">models and validation schema used during feature evolution phase: 3-fold cross-validation;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">final model and validation schema to train it on: 6 model ensemble trained on 3-fold cross-validation;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">feature evolution genetic algorithm phase configuration in terms of number of individuals and generations (iterations): 8 individuals, 48 iterations;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">early stopping for feature evolution: 5 iterations of no improvement;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">any constraints on features: monotonicity constraint and pre-pruning of features based on permutation importance are enabled;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">number of models to train for:</span></span></li>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">target transform tuning: 36, </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">model and feature tuning: 192, </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">feature evolution: 288, and</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">final model: 6;</span></span></li>
</ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">estimated runtime is in <i>minutes</i> (usually a very crude estimate);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">model will auto-finish after 1 day and model will auto-abort after 7 days.</span></span></li>
</ul>
</ol>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">At this point, if the experiment setup is complete, you can safely start it by pressing the <span style="font-family: "arial" , "helvetica" , sans-serif;">Launch Experiment</span> button. This brings up </span></span><br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #4:</span></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>When creating a model for the first time, always use default (or "lower") settings in Driverless AI.</i></span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The word <i>lower</i> was quoted because <i>interpretability</i> setting moves model performance in opposite direction - from 10 "lowest" to 1 "highest" (you can think of it as </span></span><br />
<div style="text-align: center;">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>complexity</i> = 11 - <i>interpretability</i></span></span></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">to have all 3 settings move consistently). </span></span><br />
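<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">A minimal Python sketch of that mapping, purely for illustration: the function name <i>complexity</i> is made up here and is not part of any Driverless AI API.</span></span><br />

```python
def complexity(interpretability: int) -> int:
    """Map the interpretability dial (1-10) to a hypothetical complexity dial (1-10)."""
    if not 1 <= interpretability <= 10:
        raise ValueError("interpretability must be between 1 and 10")
    return 11 - interpretability


# Default regression settings listed above: accuracy 7, time 2, interpretability 8.
# Expressed with "complexity", all three dials now grow in the same direction.
defaults = {"accuracy": 7, "time": 2, "complexity": complexity(8)}
print(defaults)
```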
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If the data were indeed i.i.d. then a regression setup above would be enough to start experimenting per last rule. </span></span><br />
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">9. Time Series Model Setup</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">But Google Trends data calls for a time series model, and Driverless AI supports it with its <a href="http://209.51.170.97:12345/docs/userguide/expert-settings.html#time-series-lag-based-recipe" target="_blank">Time Series Lag-Based Recipe</a>, so we continue with experiment setup:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrkCY8Zwox6CEYAhs2baaNWYV0dmtyxUhI51GsHGQfvOyWWe-nFwphwBUZOqwdbLuo0AwDydpbkdTJ6JlYKy5dKr1DFSnUzvl25moFzia6B5biDY7wbMZnoM4MrytrHA8ZpsfT54l7Oxmd/s1600/Screen+Shot+2020-04-10+at+22.56.59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="937" data-original-width="1600" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrkCY8Zwox6CEYAhs2baaNWYV0dmtyxUhI51GsHGQfvOyWWe-nFwphwBUZOqwdbLuo0AwDydpbkdTJ6JlYKy5dKr1DFSnUzvl25moFzia6B5biDY7wbMZnoM4MrytrHA8ZpsfT54l7Oxmd/s640/Screen+Shot+2020-04-10+at+22.56.59.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Extra steps to set up a time series model:</span></span><br />
<ol>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">set dataset <span style="font-family: "arial" , "helvetica" , sans-serif;">product_demand_test.csv</span> as Test. This could (better should, see rule #5 below) have been done for regression or other types of models as well, but in case of lag-based time series recipe it has additional important role: indicating how far in the future we want model predictions to be (forecast horizon below).</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">set column <span style="font-family: "arial" , "helvetica" , sans-serif;">date</span> as Time Column which effectively makes experiment time series and triggers displaying of Time Series Settings on the right side of the screen.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">set columns <span style="font-family: "arial" , "helvetica" , sans-serif;">geo</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">keyword</span> in Time Groups Columns (TGC) to identify multiple time series by state (<span style="font-family: "arial" , "helvetica" , sans-serif;">geo</span>) and keyword in the data. This is arguably the most powerful feature of the Driverless AI approach, as it allows a single model to forecast multiple time series while having access to both individual time series and aggregated data.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Time column is automatically parsed to determine time dimension, interval and periodicity including proposed forecast horizon based on the time span in test (see 1.).</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">set scorer to MAE as one of standard metrics for time series model performance.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">observe new values of the settings: 8/4/8 and feel free to change them to "lower" values if you like (remember rule #4).</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">finally, review a few notable changes in the experiment pipeline:</span></span></li>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">validation schema switched to 4 time-based validation splits (time-based splits are always necessary when a time column is defined, even if the time series lag-based recipe is disabled in Expert Settings);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">LightGBM and XGBoost are the only models used;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">new lag-based transformations were added: Lags, EwmaLags, LagsAggregates, and LagsInteraction;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">greater number of models will be created due to increase in <i>accuracy</i> and <i>time</i> settings. </span></span></li>
</ul>
</ol>
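<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To make the idea behind lag-based transformations more concrete, here is a small Python sketch that computes plain lags of the target within each time series identified by its group columns. It is not how Driverless AI implements its Lags transformer: the function <i>add_lags</i>, the column <i>hits</i> and the sample rows are all assumptions made for illustration, and the sketch assumes regularly spaced dates without gaps.</span></span><br />

```python
from collections import defaultdict


def add_lags(rows, group_cols, time_col, target_col, lags):
    """Return copies of rows with lag_<k> columns computed inside each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row)

    result = []
    for series in groups.values():
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        ordered = sorted(series, key=lambda r: r[time_col])
        for i, row in enumerate(ordered):
            new_row = dict(row)
            for k in lags:
                # No history yet for the first k points of a series.
                new_row[f"lag_{k}"] = ordered[i - k][target_col] if i >= k else None
            result.append(new_row)
    return result


# Hypothetical sample: two time series (geo, keyword), three days each.
rows = [
    {"geo": "US-TX", "keyword": "milk", "date": "2020-03-01", "hits": 10},
    {"geo": "US-CA", "keyword": "milk", "date": "2020-03-01", "hits": 20},
    {"geo": "US-TX", "keyword": "milk", "date": "2020-03-02", "hits": 12},
    {"geo": "US-CA", "keyword": "milk", "date": "2020-03-02", "hits": 22},
    {"geo": "US-TX", "keyword": "milk", "date": "2020-03-03", "hits": 11},
    {"geo": "US-CA", "keyword": "milk", "date": "2020-03-03", "hits": 25},
]

lagged = add_lags(rows, ["geo", "keyword"], "date", "hits", lags=[1, 2])
for row in lagged:
    print(row)
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Note that a lag never crosses from one series into another, which is exactly why the group columns matter.</span></span><br />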
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">While working through experiment setup we relied on a few more conventions. </span></span><br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #5:</span></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>Always strive to assign a test dataset in experiment.</i> </span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Test data (or holdout) is never used during training of the modeling pipeline, which means the final model is the same with or without it (except for <a href="http://209.51.170.97:12345/docs/userguide/expert-settings.html#pipeline-building-recipe" target="_blank">Kaggle mode in Expert Settings</a>, disabled by default). Driverless AI computes test predictions to provide an estimate of the generalization (or out-of-sample) error at the very end. </span></span><br />
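<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The role of a time-ordered holdout can be sketched in a few lines of Python. The helpers <i>split_by_time</i> and <i>mean_absolute_error</i>, the sample values and the naive "repeat the last known value" forecast are assumptions for illustration only; they stand in for whatever model is being evaluated and say nothing about Driverless AI internals.</span></span><br />

```python
def split_by_time(rows, time_col, cutoff):
    """Rows strictly before the cutoff go to training, the rest to the holdout."""
    train = [r for r in rows if r[time_col] < cutoff]
    test = [r for r in rows if r[time_col] >= cutoff]
    return train, test


def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between actuals and predictions."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical single time series; ISO dates compare correctly as strings.
rows = [
    {"date": "2020-03-01", "hits": 10},
    {"date": "2020-03-02", "hits": 12},
    {"date": "2020-03-03", "hits": 11},
    {"date": "2020-03-04", "hits": 13},
    {"date": "2020-03-05", "hits": 15},
    {"date": "2020-03-06", "hits": 14},
]

train, test = split_by_time(rows, "date", cutoff="2020-03-05")

# Naive forecast: repeat the last value seen in training over the whole horizon.
last_known = train[-1]["hits"]
predictions = [last_known for _ in test]
actuals = [r["hits"] for r in test]

test_mae = mean_absolute_error(actuals, predictions)
print(test_mae)  # 1.5
```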
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #6:</span></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>When creating first time model avoid using Expert Settings unless absolutely necessary for experiment setup.</i></span></span></span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An option in Expert Settings is seldom necessary for model setup. One example is when data is non-i.i.d. (time dependent), so a time column is set, but the time series lag-based recipe doesn't apply and should be disabled in Expert Settings. The Google Trends dataset is a multiple time series problem that Driverless AI can comfortably handle without advanced customizations to start. That doesn't mean that certain tweaks in Expert Settings - especially inside its Time Series tab - will not come in handy later.</span></span> </span></span><br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #7:</span></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><i>Do not leave TGC set to AUTO but rather set column or columns identifying multiple time series (i.e. TGC) explicitly. </i></span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">While Driverless AI is certainly capable of recognizing TGC (columns that group data into multiple time series) automatically, setting TGC explicitly eliminates the slightest chance of uncertainty.</span></span><br />
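<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">One way to convince yourself that a chosen set of TGC columns is right is to check that every combination of group values and timestamp occurs only once. The Python sketch below does just that; <i>check_tgc</i> and the sample rows are hypothetical and only illustrate the check.</span></span><br />

```python
from collections import Counter


def check_tgc(rows, tgc_cols, time_col):
    """Return (number of series, duplicated (group..., time) keys) for given TGC."""
    counts = Counter(
        tuple(row[c] for c in tgc_cols) + (row[time_col],) for row in rows
    )
    series = {key[:-1] for key in counts}
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return len(series), duplicates


# Hypothetical sample in the shape of the Google Trends data.
rows = [
    {"geo": "US-TX", "keyword": "milk", "date": "2020-03-01"},
    {"geo": "US-TX", "keyword": "milk", "date": "2020-03-02"},
    {"geo": "US-TX", "keyword": "bread", "date": "2020-03-01"},
    {"geo": "US-CA", "keyword": "milk", "date": "2020-03-01"},
]

# Both columns: 3 distinct series and no duplicated timestamps.
print(check_tgc(rows, ["geo", "keyword"], "date"))

# geo alone is not enough: US-TX has two rows for 2020-03-01.
print(check_tgc(rows, ["geo"], "date"))
```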
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">10. Time Series Model </span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Launch the experiment by pressing the namesake button, then wait a couple of minutes while reading and clearing out experiment notifications (notifications are always available to review via the <span style="font-family: "arial" , "helvetica" , sans-serif;">Notifications</span> link above the CPU/Memory timeline) so that you can observe the current state of the experiment workflow:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGrltyHpuvke-80Zl3blNI1P_e4HelB6umtraVylMQcSpDnbcjmDq9HE1anIcMX5KgnajqPDR9kqaTsX2i5zvJ81dRgO4CV4Atbc9aiKkOJcUB-xaUc13nM-VKEEcNML8kyKH3kAloD1sm/s1600/Screen+Shot+2020-04-11+at+11.02.49.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="937" data-original-width="1600" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGrltyHpuvke-80Zl3blNI1P_e4HelB6umtraVylMQcSpDnbcjmDq9HE1anIcMX5KgnajqPDR9kqaTsX2i5zvJ81dRgO4CV4Atbc9aiKkOJcUB-xaUc13nM-VKEEcNML8kyKH3kAloD1sm/s640/Screen+Shot+2020-04-11+at+11.02.49.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">When the experiment completes Driverless AI displays final model:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1Mm2rUM2x6Ksmu5tcjZ0BxrIAlDNAZWxaF1XgRUjhlEZjO0ITlich2VAq5-HY_yVcwEjpv6m97ytTnNbDY-t5XwtGllSq6AsCw9B05zEj3uGglkQjXpj4ciUCj6SvQq5y_blRkNp-NJ0T/s1600/Screen+Shot+2020-04-11+at+17.09.51.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="925" data-original-width="1600" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1Mm2rUM2x6Ksmu5tcjZ0BxrIAlDNAZWxaF1XgRUjhlEZjO0ITlich2VAq5-HY_yVcwEjpv6m97ytTnNbDY-t5XwtGllSq6AsCw9B05zEj3uGglkQjXpj4ciUCj6SvQq5y_blRkNp-NJ0T/s640/Screen+Shot+2020-04-11+at+17.09.51.png" width="640" /></a></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Completed experiment view consists of:</span></span><br />
<ol>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Experiment setup;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Evolution pipeline displaying models generated during feature and model tuning and selection; </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Top variables in terms of feature transformations selected during evolution;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Experiment summary;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Available actions:</span></span></li>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">deploy to a cloud or locally</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Interpret the model (MLI)</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Diagnose the model</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Score on another dataset</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Transform on another dataset</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Download predictions (training or test)</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Download Python scoring pipeline</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Download MOJO scoring pipeline</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Visualize scoring pipeline</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Download summary and logs</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Download Autoreport (Auto Documentation)</span></span></li>
</ul>
</ol>
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">11. Time Series Model Analysis </span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Each action deserves its own post, so here we focus on model analysis with MLI. Because the experiment used the lag-based time series recipe, Driverless AI will engage a special flavor of MLI for time series. Press the Interpret This Model button to see Driverless AI start processing and computing predictions, their errors, Shapley values and more:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXiI84bvwd5ucURK2ifkMAO3pANUyIP3oHiYOAa6pocW_4vBpL0wQJ9EMIzpl23nG9_1NL2fNAjoA7z2BfpcoqAChP7jvLEGG8EQOUvW0zwEBOw0n9rFMO_EupNwyiAVI28ZxWqFSuIS35/s1600/Screen+Shot+2020-04-11+at+17.22.34.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="925" data-original-width="1600" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXiI84bvwd5ucURK2ifkMAO3pANUyIP3oHiYOAa6pocW_4vBpL0wQJ9EMIzpl23nG9_1NL2fNAjoA7z2BfpcoqAChP7jvLEGG8EQOUvW0zwEBOw0n9rFMO_EupNwyiAVI28ZxWqFSuIS35/s640/Screen+Shot+2020-04-11+at+17.22.34.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">MLI for time series comes in handy for visualizing each time series per group and comparing predictions with actuals. When model interpretation completes, it displays the MLI view, including:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">MAE Time Series plot: errors across validation (holdout) and forecast horizon averaged for all time series;<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHbPgx7gKVuYv-sS5d4aF2b4To8-0e0mEU7qzjeExfOwDGARSOuMMzvNxN9hIU5xlfDjJQOvaZ7RcobyYUbTdOzbSxAK37mjmYRWN_SMWoZ72Tp11i-sBb7ELU96MRhh9dRwL9sZ5zu18C/s1600/Screen+Shot+2020-04-11+at+20.00.09.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="787" data-original-width="1600" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHbPgx7gKVuYv-sS5d4aF2b4To8-0e0mEU7qzjeExfOwDGARSOuMMzvNxN9hIU5xlfDjJQOvaZ7RcobyYUbTdOzbSxAK37mjmYRWN_SMWoZ72Tp11i-sBb7ELU96MRhh9dRwL9sZ5zu18C/s640/Screen+Shot+2020-04-11+at+20.00.09.png" width="640" /></a></div>
</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Test metrics for top 5 and bottom 5 groups (time series identified by their TGC values);<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSDLH3yU7pC57y9qyuS5YXUr3L4JkG_32h15sHD3F900OkLncHslWmMx167YNctTUiTcPAoRYjbgQ0EGAoPX_8caSZvEV4DFNNv_DmJZhxzw0QVGqKsEjEZUQus4EF8mTRrD0pyUggTNkn/s1600/Screen+Shot+2020-04-11+at+20.01.06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1038" data-original-width="1600" height="414" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSDLH3yU7pC57y9qyuS5YXUr3L4JkG_32h15sHD3F900OkLncHslWmMx167YNctTUiTcPAoRYjbgQ0EGAoPX_8caSZvEV4DFNNv_DmJZhxzw0QVGqKsEjEZUQus4EF8mTRrD0pyUggTNkn/s640/Screen+Shot+2020-04-11+at+20.01.06.png" width="640" /></a></div>
</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Actual vs. predicted plot across holdout and forecast horizon plus actual for training time for any choice of time series entered with its values as shown:<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDJtgkXFUZ5dX1cQbuIhofQlCiDkzHMr1QrCr4JGcBQSiT4WFD09oge7mJYEGT5ArL2jEZFohHc_oeOG2zvMZ-zcDtIiwKFses2Qaq1kRJorQWFjFqIvDLbREc8yftCp91B-nvT8n_AyaH/s1600/Screen+Shot+2020-04-11+at+20.14.18.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="791" data-original-width="1600" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDJtgkXFUZ5dX1cQbuIhofQlCiDkzHMr1QrCr4JGcBQSiT4WFD09oge7mJYEGT5ArL2jEZFohHc_oeOG2zvMZ-zcDtIiwKFses2Qaq1kRJorQWFjFqIvDLbREc8yftCp91B-nvT8n_AyaH/s640/Screen+Shot+2020-04-11+at+20.14.18.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy5aL9VEP3E_kC9Lt-Mz41esNjrDRbyGyeo_x1b5Obh_Xvwq0CnIKQeI6dO9i8xFgrn4KLFAHTIov5YOpjMNdMO0c2Qzt1rVK43NYqF6D6gN8MocxrOIUWZiVkvEkKfnC7uMFnSBm2tdZT/s1600/Screen+Shot+2020-04-11+at+20.12.26.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="812" data-original-width="1600" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy5aL9VEP3E_kC9Lt-Mz41esNjrDRbyGyeo_x1b5Obh_Xvwq0CnIKQeI6dO9i8xFgrn4KLFAHTIov5YOpjMNdMO0c2Qzt1rVK43NYqF6D6gN8MocxrOIUWZiVkvEkKfnC7uMFnSBm2tdZT/s640/Screen+Shot+2020-04-11+at+20.12.26.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEJH6jSj1TKM87KbKEg2eqKsSx6s18dldl3Z-O_mmEG7KLF9z0cS5tkarxu3MBMPtJd2QPcPXzhYFNM25Lm4ecjNKBGJyF1UJtHU9gAmMWaNT62j8gvlowD3so80iUp6rZkFNWjAJvCKIt/s1600/Screen+Shot+2020-04-11+at+20.06.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="812" data-original-width="1600" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEJH6jSj1TKM87KbKEg2eqKsSx6s18dldl3Z-O_mmEG7KLF9z0cS5tkarxu3MBMPtJd2QPcPXzhYFNM25Lm4ecjNKBGJyF1UJtHU9gAmMWaNT62j8gvlowD3so80iUp6rZkFNWjAJvCKIt/s640/Screen+Shot+2020-04-11+at+20.06.10.png" width="640" /></a></div>
</span></span></li>
</ul>
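<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The per-group metrics shown in MLI can be approximated outside the tool once predictions are downloaded. Below is a hedged Python sketch that computes MAE per time series and ranks the groups from best to worst; the function <i>mae_by_group</i>, the column names <i>actual</i> and <i>predicted</i>, and the numbers are invented for illustration.</span></span><br />

```python
from collections import defaultdict


def mae_by_group(rows, group_cols, actual_col, predicted_col):
    """Return [(group key, MAE)] sorted from lowest to highest error."""
    errors = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        errors[key].append(abs(row[actual_col] - row[predicted_col]))
    scores = [(key, sum(e) / len(e)) for key, e in errors.items()]
    return sorted(scores, key=lambda item: item[1])


# Hypothetical test predictions for three time series.
rows = [
    {"geo": "US-TX", "keyword": "milk", "actual": 10, "predicted": 11},
    {"geo": "US-TX", "keyword": "milk", "actual": 12, "predicted": 12},
    {"geo": "US-CA", "keyword": "milk", "actual": 20, "predicted": 17},
    {"geo": "US-CA", "keyword": "milk", "actual": 22, "predicted": 25},
    {"geo": "US-NY", "keyword": "bread", "actual": 5, "predicted": 5},
    {"geo": "US-NY", "keyword": "bread", "actual": 7, "predicted": 7},
]

ranking = mae_by_group(rows, ["geo", "keyword"], "actual", "predicted")
top, bottom = ranking[:5], ranking[-5:]  # mirrors the top 5 / bottom 5 view
for key, score in ranking:
    print(key, score)
```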
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">12. Shapley Values </span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">MLI for time series is a powerful diagnostics tool for analysis of multiple time series models. But it goes beyond diagnostics: it includes <a href="https://christophm.github.io/interpretable-ml-book/shapley.html" target="_blank">Shapley values</a> that explain the key factors (features) contributing to each prediction. For example, enter TGC values <b>US,milk</b> to display its time series and click on the peak value in the forecast horizon period as shown below: </span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ89YYXYKa-b5VhCo_9sGYEQdQOFeb7sv9JCDv0C8dlp2ZPtfdUhYYk8ITZ3pqQow__3uJQXPokW8XyAgBopbk4SdCyb4QularA2cxtRFRGWo2fjmsR8KLsyw9SyjifYj2-VcHiqS3IZy8/s1600/Screen+Shot+2020-04-12+at+08.47.32.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1563" data-original-width="1600" height="624" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ89YYXYKa-b5VhCo_9sGYEQdQOFeb7sv9JCDv0C8dlp2ZPtfdUhYYk8ITZ3pqQow__3uJQXPokW8XyAgBopbk4SdCyb4QularA2cxtRFRGWo2fjmsR8KLsyw9SyjifYj2-VcHiqS3IZy8/s640/Screen+Shot+2020-04-12+at+08.47.32.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI displays a Shapley values bar chart of feature contributions for the prediction on that date. For example, as shown above, on the 4th of April the biggest impact came from the feature representing the 7-day lag (<span style="font-family: "arial" , "helvetica" , sans-serif;">TargetLag:date:geo:keyword.5</span>). You can continue changing dates to see how contributions shift and features become more or less impactful across the forecast horizon timeline. Remember that this analysis is specific to the time series for location:<b>US</b> with keyword:<b>milk</b>, so switching to a different time series may produce similar or drastically different results for the same Driverless AI model.</span></span></span></span><br />
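<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For intuition about what the bars in such a chart mean, here is a generic Python sketch that computes exact Shapley values for a toy model by averaging marginal contributions over all feature orderings. It is not the computation Driverless AI performs: the toy model, the feature names <i>lag_7</i> and <i>lag_14</i>, and the baseline values are assumptions, and full enumeration is feasible only for a handful of features.</span></span><br />

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline` for `predict`."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference point
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its actual value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical forecast: a weighted sum of two lag features."""
    return 0.5 * x["lag_7"] + 0.25 * x["lag_14"]


instance = {"lag_7": 80.0, "lag_14": 60.0}   # values on the date being explained
baseline = {"lag_7": 50.0, "lag_14": 50.0}   # assumed "typical" values

contributions = shapley_values(toy_model, instance, baseline)
print(contributions)  # {'lag_7': 15.0, 'lag_14': 2.5}
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The contributions add up to the difference between the prediction for the instance and the prediction for the baseline (17.5 here), which is the property that makes a bar chart of them easy to read.</span></span><br />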
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span></span>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">13. What's Next?</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The model we just created can be considered a baseline model. The next step would be iterating over experiments to achieve a higher score by means of:</span></span><br />
<ol>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">increasing <i>accuracy</i> setting;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">increasing <i>time</i> setting;</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">lowering <i>interpretability</i> setting (increasing <i>complexity</i>);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">adjusting Expert Settings.</span></span></li>
</ol>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">An unofficial rule #8:</span></span></blockquote>
<blockquote class="tr_bq">
<i><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">When iterating apply "atomic" changes to next experiment to allow attributing increase or decrease in model performance to a single factor associated with that "atomic" change.</span></span></i> </blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For example, if you increase <i>accuracy</i> and/or <i>time</i> then keep all other parameters intact. Likewise, if you adjust a certain parameter in Expert Settings then keep the rest of the parameters the same. Then, whether the model gets better, worse, or stays the same, only that "atomic" change can be responsible for the effect. Keep the change if performance increased; discard it if not.</span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">What constitutes "atomic" change? The easy answer is a change to single setting or parameter, but it could also be a set of related parameters that work together - typical examples are <i>accuracy</i> and <i>time</i> settings or switching algorithms on/off in Expert Settings.</span></span><br />
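<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If you track experiment settings in your own notes, the "atomic" rule is easy to check mechanically. The Python sketch below compares two settings dictionaries; the helper names and the idea of declaring groups of related settings are assumptions for illustration and have nothing to do with any Driverless AI client API.</span></span><br />

```python
def changed_settings(baseline, candidate):
    """Names of settings whose values differ between two experiments."""
    keys = set(baseline) | set(candidate)
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))


def is_atomic(baseline, candidate, related_groups=()):
    """True if one setting changed, or all changes fall into one related group."""
    changed = set(changed_settings(baseline, candidate))
    if len(changed) == 1:
        return True
    return bool(changed) and any(changed <= set(g) for g in related_groups)


# Settings of the time series experiment above, then three candidate follow-ups.
base = {"accuracy": 8, "time": 4, "interpretability": 8, "scorer": "MAE"}
related = [("accuracy", "time")]  # treated as one knob, as discussed above

step_a = dict(base, accuracy=9)
step_b = dict(base, accuracy=9, time=5)
step_c = dict(base, accuracy=9, interpretability=6)

print(is_atomic(base, step_a, related))  # True
print(is_atomic(base, step_b, related))  # True
print(is_atomic(base, step_c, related))  # False
```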
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Happy experimenting!</span></span><br />
Posted by <a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
Facts About Coronavirus Disease 2019 (COVID-19) in 5 Charts created with R and ggplot2 (published 2020-03-31, updated 2021-01-09)<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Introduction</span></span></h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The coronavirus pandemic is changing our lifestyle from daily routine to near- and midterm plans, affecting relationships at home and work, adjusting our economic priorities and abilities, making us reassess the value of goods and services, and arguably impacting all aspects of life. Better knowledge and understanding of the disease, its manifestations and dynamics must play a critical role in the assessment of current events and the decisions we make. Below I compiled some useful facts about COVID-19 into 5 charts and included a discussion of the <span style="font-family: "arial" , "helvetica" , sans-serif;">R</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">ggplot2</span> techniques used to create them.</span></span><br />
<blockquote class="tr_bq">
<span style="font-family: inherit;"><span style="font-size: large;">At the end of 2019, a novel
coronavirus was identified as the cause of a cluster of pneumonia cases
in Wuhan, a city in the Hubei Province of China. It rapidly spread,
resulting in an epidemic throughout China, followed by an increasing
number of cases in other countries throughout the world. In February
2020, the World Health Organization designated the disease COVID-19,
which stands for coronavirus disease 2019.
The virus that causes COVID-19 is designated severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2); previously, it was referred to as
2019-nCoV.</span></span><br />
<br />
<span style="font-family: inherit;"><span style="font-size: large;">Understanding of COVID-19 is evolving. </span><span style="font-size: large;">This topic will discuss the epidemiology, clinical features, diagnosis, management, and prevention of COVID-19. </span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Though not all topics above are covered in this blog I reserve the right to publish more charts so stay tuned.</span></span><br />
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Clinical Features</span></span></h2>
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Incubation Period</span></span></h3>
<blockquote class="tr_bq">
<div class="headingAnchor" id="H1636679944">
<span style="font-size: large;">The incubation period for
COVID-19 is thought to be within 14 days following exposure, with most
cases occurring approximately four to five days after exposure [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/29-31">29-31</a>].</span></div>
<br />
<span style="font-size: large;">Using
data from 181 publicly reported, confirmed cases in China with
identifiable exposure, one modeling study estimated that symptoms would
develop in 2.5 percent of infected individuals within 2.2 days and in
97.5 percent of infected individuals within 11.5 days [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/32">32</a>]. The median incubation period in this study was 5.1 days.</span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">A common way to display the quartiles and extreme percentiles of a continuous distribution is a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>. I chose against it for a couple of reasons: a) the research above gave insufficient information about quartiles and b) box plots are less known outside the statistical community. Instead I used a gauge chart, which is common in dashboard-style applications:</span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWzz7FyLk1GplvVbk3URxJBcFmJ3dhOvSiJk8_pZqPfTSBWZEA2oyUkM1cjMPiGGeB-6hu5wuooAPUckXX1s4K7aMxlegwJzeR8hozEoq5KnBfRJSmOzOWZvodt8UKOqnvJbB80pqYXu1y/s1600/COVID-19-Incubation-Time.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="750" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWzz7FyLk1GplvVbk3URxJBcFmJ3dhOvSiJk8_pZqPfTSBWZEA2oyUkM1cjMPiGGeB-6hu5wuooAPUckXX1s4K7aMxlegwJzeR8hozEoq5KnBfRJSmOzOWZvodt8UKOqnvJbB80pqYXu1y/s1600/COVID-19-Incubation-Time.png" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Implementation details in R</span></span></h3>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Dataset</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The dataset consists of 6 rows: 5 percentiles - 0% (minimum), 2.5% and 97.5% (the bounds of the central 95% interval), 50% (median), and 100% (maximum) - plus one more row for the average:</span></span><br />
<br />
<div>
<code data-gist-file="1-incubation-time-data.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Using <span style="font-family: "arial" , "helvetica" , sans-serif;">factor()</span> places the gauges in order from least to greatest, and the additional column <span style="font-family: "arial" , "helvetica" , sans-serif;">stext</span> is used to display the value for each gauge in a readable format.</span></span><br />
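<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For readers who cannot load the gist, here is a minimal sketch of such a dataset. It assumes only the values quoted above (2.5%, median, 97.5%, and the 14-day upper bound) and leaves out the minimum and average rows, for which the quote gives no exact figures. The name <span style="font-family: "arial" , "helvetica" , sans-serif;">incubation</span> and the column names other than <span style="font-family: "arial" , "helvetica" , sans-serif;">stext</span> are my assumptions, not necessarily those used in the gist:</span></span><br />
<pre><code>
# Values taken from the quoted text; row and column names are illustrative.
incubation &lt;- data.frame(
  name = c("median", "97.5%", "2.5%", "maximum"),
  days = c(5.1, 11.5, 2.2, 14),
  stringsAsFactors = FALSE
)

# Order the factor levels by value so gauges run from least to greatest.
incubation$name &lt;- factor(incubation$name,
                          levels = incubation$name[order(incubation$days)])

# Readable label to print inside each gauge.
incubation$stext &lt;- paste(incubation$days, "days")
</code></pre>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Without the explicit <span style="font-family: "arial" , "helvetica" , sans-serif;">levels</span>, the factor would sort alphabetically and the gauges would appear out of order.</span></span><br />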
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Graphics</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">First, let's load packages used for plotting: <span style="font-family: "arial" , "helvetica" , sans-serif;">ggplot2</span>, <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span>, and <span style="font-family: "arial" , "helvetica" , sans-serif;">scales</span>:</span></span><br />
<br />
<div>
<code data-gist-file="0-covid-19-blog.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">I borrowed the implementation of gauge charts in <span style="font-family: "arial" , "helvetica" , sans-serif;">ggplot2</span> from <a href="https://pomvlad.blog/2018/05/03/gauges-ggplot2/" target="_blank">this example</a>, with a few changes explained next:</span></span><br />
<br />
<div>
<code data-gist-file="1-incubation-time-plot.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Line by line explainer:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">2-4: prepare rectangles for each value. Each gauge is a pair of overlapping rectangles: one drawn by <span style="font-family: "arial" , "helvetica" , sans-serif;">geom_rect()</span> to display the value, and a constant one, <span style="font-family: "arial" , "helvetica" , sans-serif;">geom_rect(aes(<span class="pl-v">ymax</span><span class="pl-k">=</span><span class="pl-c1">14</span>, <span class="pl-v">ymin</span><span class="pl-k">=</span><span class="pl-c1">0</span>, <span class="pl-v">xmax</span><span class="pl-k">=</span><span class="pl-c1">2</span>, <span class="pl-v">xmin</span><span class="pl-k">=</span><span class="pl-c1">1</span>), <span class="pl-v">fill</span> <span class="pl-k">=</span><span class="pl-s"><span class="pl-pds">"</span>#ece8bd<span class="pl-pds">"</span></span>)</span>, as a background.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">10: separate gauges by facets. </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">5, 6: transform the coordinate system to polar, rotate it to start at the 9 o'clock position, and trim it to display only the upper half of each gauge.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">9: place a text label with the value in the middle of each gauge.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">7, 8: color scheme from <span style="font-family: "arial" , "helvetica" , sans-serif;">few_pal()</span>. </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">11: removing guides from the chart.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">12-15: title, subtitle, caption, and axis labels.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">16-19: customization using <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span> package and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme()</span>.</span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
</li>
</ul>
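<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The core of the technique can be sketched in a few lines. This is a stripped-down version under my own assumptions, not the gist itself: it uses only values from the quote, illustrative names (<span style="font-family: "arial" , "helvetica" , sans-serif;">gauges</span>, <span style="font-family: "arial" , "helvetica" , sans-serif;">days</span>), and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme_void()</span> in place of the colors, titles, and <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span> styling:</span></span><br />
<pre><code>
library(ggplot2)

# Illustrative data: values from the quoted text, names are assumptions.
gauges &lt;- data.frame(
  name = c("2.5%", "median", "97.5%", "maximum"),
  days = c(2.2, 5.1, 11.5, 14),
  stringsAsFactors = FALSE
)
gauges$name &lt;- factor(gauges$name, levels = gauges$name[order(gauges$days)])
gauges$stext &lt;- paste(gauges$days, "days")

p &lt;- ggplot(gauges, aes(ymax = days, ymin = 0, xmax = 2, xmin = 1, fill = name)) +
  # constant background rectangle spanning the full 14-day range
  geom_rect(aes(ymax = 14, ymin = 0, xmax = 2, xmin = 1), fill = "#ece8bd") +
  # value rectangle drawn on top of the background
  geom_rect() +
  # polar coordinates starting at the 9 o'clock position
  coord_polar(theta = "y", start = -pi / 2) +
  # a y range twice the 14-day maximum leaves only the upper half-circle used
  xlim(c(0, 2)) + ylim(c(0, 28)) +
  # value label in the middle of each gauge
  geom_text(aes(x = 0, y = 0, label = stext)) +
  # one gauge per facet
  facet_wrap(~name) +
  guides(fill = "none") +
  theme_void()
</code></pre>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The half-circle shape comes entirely from the y limits: the background covers 0 to 14 while the axis runs 0 to 28, so every gauge fills at most half of the full turn.</span></span><br />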
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Illness Severity</span></span></h3>
<blockquote class="tr_bq">
<span style="font-size: large;">The spectrum of symptomatic infection ranges from mild to critical; most infections are not severe [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/33,35-40">33,35-40</a>]. Specifically, in a report from the Chinese Center for Disease Control
and Prevention that included approximately 44,500 confirmed infections
with an estimation of disease severity [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/41">41</a>]: </span><br />
<div class="bulletIndent1">
<span style="font-size: large;"><span class="glyph"> ●</span> Mild (no or mild pneumonia) was reported in 81 percent.</span></div>
<div class="bulletIndent1">
<span style="font-size: large;"><span class="glyph"> ●</span> Severe
disease (eg, with dyspnea, hypoxia, or >50 percent lung involvement
on imaging within 24 to 48 hours) was reported in 14 percent.</span></div>
<div class="bulletIndent1">
<span style="font-size: large;"><span class="glyph"> ●</span> Critical disease (eg, with respiratory failure, shock, or multiorgan dysfunction) was reported in 5 percent.</span></div>
<div class="bulletIndent1">
<span style="font-size: large;"><span class="glyph"> ●</span> The overall case fatality rate was 2.3 percent; no deaths were reported among noncritical cases. </span></div>
</blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The obvious choice is a <a href="https://en.wikipedia.org/wiki/Bar_chart" target="_blank">bar chart</a> consisting of 4 bars: 3 for the illness severity spectrum plus the case fatality rate reported in the same study:</span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuORXqfWN1xDqnEOiT15q5-68W5UFwdI_wJIGNBqFOVDlO1Fih7jmK6s1DB8p94MR0YYmt3nwHZxoFPGKNld-Gi3C1Pvk0ZgVKsWb5rvdWoCBKBQwH6eZpw0-n3g8mPpNakO_lgMldpYoO/s1600/COVID-19-Illness-Severity.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="750" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuORXqfWN1xDqnEOiT15q5-68W5UFwdI_wJIGNBqFOVDlO1Fih7jmK6s1DB8p94MR0YYmt3nwHZxoFPGKNld-Gi3C1Pvk0ZgVKsWb5rvdWoCBKBQwH6eZpw0-n3g8mPpNakO_lgMldpYoO/s1600/COVID-19-Illness-Severity.png" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Implementation details in R</span></span></h3>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Dataset</span></span></h4>
<span style="font-family: "georgia" , "times new roman" , serif;"></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The dataset has 4 rows and 4 columns: <span style="font-family: "arial" , "helvetica" , sans-serif;">severity</span> is a <span style="font-family: "arial" , "helvetica" , sans-serif;">factor()</span> ordered by <span style="font-family: "arial" , "helvetica" , sans-serif;">percent</span>, <span style="font-family: "arial" , "helvetica" , sans-serif;">percent_label</span> is used to display values above the bars, and <span style="font-family: "arial" , "helvetica" , sans-serif;">severity_label</span> describes each severity level:</span></span><br />
<br />
<div>
<code data-gist-file="2-illness-severity-dataset.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
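<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">A rough sketch of a dataset with this shape, built from the percentages quoted above. The data frame name <span style="font-family: "arial" , "helvetica" , sans-serif;">illness</span>, the descending order of the factor, and the exact label wording are my assumptions and may differ from the gist:</span></span><br />
<pre><code>
# Percentages from the quoted report, stored as fractions.
illness &lt;- data.frame(
  severity = c("Mild", "Severe", "Critical", "CFR"),
  percent  = c(0.81, 0.14, 0.05, 0.023),
  stringsAsFactors = FALSE
)

# Order the bars by percent (descending order is an assumption).
illness$severity &lt;- factor(illness$severity,
                           levels = illness$severity[order(-illness$percent)])

# Text printed above each bar.
illness$percent_label &lt;- sprintf("%.1f%%", 100 * illness$percent)

# Longer descriptions for the legend, paraphrased from the quote.
illness$severity_label &lt;- c(
  "Mild (no or mild pneumonia)",
  "Severe (dyspnea, hypoxia, or &gt;50% lung involvement)",
  "Critical (respiratory failure, shock, or multiorgan dysfunction)",
  "Overall case fatality rate"
)
</code></pre>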
<br />
<h4>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Graphics</span></span></h4>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This is a simple bar chart using <span style="font-family: "arial" , "helvetica" , sans-serif;">geom_bar()</span> with <span style="font-family: "arial" , "helvetica" , sans-serif;">stat='identity'</span>, enhanced with just a couple of additions: <span style="font-family: "arial" , "helvetica" , sans-serif;">geom_text()</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">annotate()</span>:</span></span><br />
<br />
<div>
<code data-gist-file="2-illness-severity-plot.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Line by line explainer:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">1-2: bar chart with <span style="font-family: "arial" , "helvetica" , sans-serif;">stat="identity"</span> displaying 4 bars.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">3: placing percent labels above bars.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">4: displaying y-axis labels in percent format.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">5-6: color scheme from <span style="font-family: "arial" , "helvetica" , sans-serif;">few_pal()</span> and custom labeling of the legend.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">7-8: text annotation about CFR in the middle of the chart.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">9-12:</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> title, subtitle, caption, and axis labels.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">13-17: customization using <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span> package and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme()</span>.</span></span></li>
</ul>
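<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The steps above can be sketched as follows. This is a simplified version under my own assumptions: the data frame is rebuilt inline with illustrative names, the annotation text and position are made up for the example, and the <span style="font-family: "arial" , "helvetica" , sans-serif;">few_pal()</span> colors, titles, and theme customization are left out:</span></span><br />
<pre><code>
library(ggplot2)

# Illustrative data: percentages from the quoted report, stored as fractions.
illness &lt;- data.frame(
  severity = factor(c("Mild", "Severe", "Critical", "CFR"),
                    levels = c("Mild", "Severe", "Critical", "CFR")),
  percent  = c(0.81, 0.14, 0.05, 0.023)
)
illness$percent_label &lt;- sprintf("%.1f%%", 100 * illness$percent)

p &lt;- ggplot(illness, aes(x = severity, y = percent, fill = severity)) +
  # bar heights come straight from the percent column
  geom_bar(stat = "identity") +
  # percent labels just above the bars
  geom_text(aes(label = percent_label), vjust = -0.5) +
  # y-axis tick labels in percent format
  scale_y_continuous(labels = scales::percent) +
  # free-standing note in the middle of the chart (text and position are examples)
  annotate("text", x = 3, y = 0.5, label = "Overall case fatality rate: 2.3%")
</code></pre>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">On a discrete x-axis, the numeric <span style="font-family: "arial" , "helvetica" , sans-serif;">x = 3</span> in <span style="font-family: "arial" , "helvetica" , sans-serif;">annotate()</span> refers to the position of the third bar.</span></span><br />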
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Clinical Manifestations </span></span></h3>
<blockquote class="tr_bq">
<h3>
<span style="font-weight: normal;"><span style="font-family: inherit;"><span style="font-size: large;">Pneumonia appears to be the
most frequent serious manifestation of infection, characterized
primarily by fever, cough, dyspnea, and bilateral infiltrates on chest
imaging [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/32,36-38">32,36-38</a>]. There are no specific clinical features that can yet reliably distinguish COVID-19 from other viral respiratory infections. </span></span></span></h3>
</blockquote>
<blockquote class="tr_bq">
<span style="font-family: inherit;"><span style="font-size: large;">In
a study describing 138 patients with COVID-19 pneumonia in Wuhan, the
most common clinical features at the onset of illness were [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/38">38</a>]:</span></span><br />
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Fever in 99 percent</span></span></div>
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Fatigue in 70 percent</span></span></div>
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Dry cough in 59 percent</span></span></div>
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Anorexia in 40 percent</span></span></div>
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Myalgias in 35 percent</span></span></div>
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Dyspnea in 31 percent</span></span></div>
<div class="bulletIndent1">
<span style="font-family: inherit;"><span style="font-size: large;"><span class="glyph"> ●</span>Sputum production in 27 percent</span></span></div>
</blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">A bar chart again works well to display the clinical manifestations of COVID-19 at the onset of illness:</span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUfsXdCG8BpeeN8Kj7jfRCUq4SgdhK4204lK4J28EOOUQN7Hg9iY8Txz3X38LFaT-e91adU33vxYpT3oCVaWni8CIDRvvKChQulKUIHAUGAmFanly6UF7b1fbC1N7jQZ8Er5kTVMRlEKaG/s1600/COVID-19-Clinical-Manifestations.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="750" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUfsXdCG8BpeeN8Kj7jfRCUq4SgdhK4204lK4J28EOOUQN7Hg9iY8Txz3X38LFaT-e91adU33vxYpT3oCVaWni8CIDRvvKChQulKUIHAUGAmFanly6UF7b1fbC1N7jQZ8Er5kTVMRlEKaG/s1600/COVID-19-Clinical-Manifestations.png" /></a></div>
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Implementation Details in R</span></span></h3>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Dataset</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">This is an example of a bar chart that needs only a bare minimum of information: just 2 columns, <span style="font-family: "arial" , "helvetica" , sans-serif;">name</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">percent</span>, to display 7 bars:</span></span><br />
<br />
<div>
<code data-gist-file="3-clinical-manifestations-data.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
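<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">As a sketch, such a two-column dataset could be built from the percentages quoted above. The data frame name <span style="font-family: "arial" , "helvetica" , sans-serif;">manifestations</span> and the choice to keep the bars in order of frequency are my assumptions:</span></span><br />
<pre><code>
# Seven clinical features and their frequencies from the quoted study.
manifestations &lt;- data.frame(
  name = c("Fever", "Fatigue", "Dry cough", "Anorexia",
           "Myalgias", "Dyspnea", "Sputum production"),
  percent = c(0.99, 0.70, 0.59, 0.40, 0.35, 0.31, 0.27),
  stringsAsFactors = FALSE
)

# Fix the level order so bars follow frequency instead of the alphabet.
manifestations$name &lt;- factor(manifestations$name,
                              levels = manifestations$name)
</code></pre>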
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Graphics</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Once again, the code below creates a bar chart using <span style="font-family: "arial" , "helvetica" , sans-serif;">stat = "identity"</span>:</span></span><br />
<br />
<div>
<code data-gist-file="3-clinical-manifestations-plot.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Line by line explainer:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">1-2: bar chart with <span style="font-family: "arial" , "helvetica" , sans-serif;">stat="identity"</span> displaying 7 bars.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">3: displaying y-axis labels in percent format.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">4: color scheme from <span style="font-family: "arial" , "helvetica" , sans-serif;">few_pal()</span>.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">5-8:</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> title, subtitle, caption, and axis labels.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">9-12: customization using <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span> package and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme()</span>. </span></span></li>
</ul>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Case Fatality Rate</span></span></h3>
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: inherit;">According to a joint World Health Organization (WHO)-China
fact-finding mission, the case-fatality rate ranged from 5.8 percent in
Wuhan to 0.7 percent in the rest of China [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/17">17</a>]. Most of the fatal cases occurred in patients with advanced age or underlying medical comorbidities [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/20,41">20,41</a>]. (See <a class="local" data-see-link-view-event="" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19#H943884075">'Risk factors for severe illness'</a> below.)</span></span></blockquote>
<br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: inherit;">The
proportion of severe or fatal infections may vary by location. As an
example, in Italy, 12 percent of all detected COVID-19 cases and 16
percent of all hospitalized patients were admitted to the intensive care
unit; the estimated case fatality rate was 7.2 percent in mid-March [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/42,43">42,43</a>]. In contrast, the estimated case fatality rate in mid-March in South Korea was 0.9 percent [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/44">44</a>].
This may be related to distinct demographics of infection; in Italy,
the median age of patients with infection was 64 years, whereas in Korea
the median age was in the 40s.</span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> </span></span><span style="font-size: large;"><span style="font-family: inherit;"></span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This chart displays CFRs by age group, based on 44,672 confirmed cases in China through February 11, with an overall CFR of 2.3%:</span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_tgJQBbPMAHJNmQ7WdQIa05H8c77YxIs5waTi7TEPFIJLSV5p1sA_S4Wz1osQsqEvoPyCfKUbJYHaHFyoYQd2IBOxkTjN8Gtlk3QHj3vX_3RRTu9t-yltibRrMrcPbsXRdpTq0VHULeB7/s1600/COVID-19-CFR-by-Age-Groups.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="750" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_tgJQBbPMAHJNmQ7WdQIa05H8c77YxIs5waTi7TEPFIJLSV5p1sA_S4Wz1osQsqEvoPyCfKUbJYHaHFyoYQd2IBOxkTjN8Gtlk3QHj3vX_3RRTu9t-yltibRrMrcPbsXRdpTq0VHULeB7/s1600/COVID-19-CFR-by-Age-Groups.png" /></a></div>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Implementation Details in R</span></span></h3>
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Dataset</span></span></h4>
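<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The data includes <span style="font-family: "arial" , "helvetica" , sans-serif;">age</span>, <span style="font-family: "arial" , "helvetica" , sans-serif;">deaths</span>, <span style="font-family: "arial" , "helvetica" , sans-serif;">cases</span>, and <span style="font-family: "arial" , "helvetica" , sans-serif;">cfr</span>, computed as the ratio of deaths to cases:</span></span><br />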
<br />
<div>
<code data-gist-file="5-cfr-by-age-groups-data.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
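<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">A sketch of how such a dataset and the derived column could look. The age-group counts below are reproduced from the China CDC report to the best of my knowledge and should be checked against the gist; they do add up to the 44,672 cases and the 2.3% overall CFR cited above. The names <span style="font-family: "arial" , "helvetica" , sans-serif;">cfr_by_age</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">overall_cfr</span> are illustrative:</span></span><br />
<pre><code>
# Deaths and confirmed cases per age group (verify against the gist).
cfr_by_age &lt;- data.frame(
  age    = c("0-9", "10-19", "20-29", "30-39", "40-49",
             "50-59", "60-69", "70-79", "80+"),
  deaths = c(0, 1, 7, 18, 38, 130, 309, 312, 208),
  cases  = c(416, 549, 3619, 7600, 8571, 10008, 8583, 3918, 1408),
  stringsAsFactors = FALSE
)

# Case fatality rate per group: deaths divided by cases.
cfr_by_age$cfr &lt;- cfr_by_age$deaths / cfr_by_age$cases

# Overall CFR across all groups, used for the reference line.
overall_cfr &lt;- sum(cfr_by_age$deaths) / sum(cfr_by_age$cases)
</code></pre>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Note that the overall CFR is the ratio of the totals, not the average of the per-group rates, because the groups differ greatly in size.</span></span><br />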
<br />
<h4>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Graphics</span></span></h4>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This chart combines a bar chart and a line chart in a single plot that shows how CFR changes across age groups and also reflects the size of each group through bar width:</span></span><br />
<br />
<div>
<code data-gist-file="5-cfr-by-age-groups-plot.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Line by line explainer:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">1,2: line chart over CFR by age groups.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">3: horizontal dotted line representing overall case fatality rate.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">1,4: bar chart with <span style="font-family: "arial" , "helvetica" , sans-serif;">stat="identity"</span> displaying the CFR for each age group, with bar width adjusted to the number of cases in each group.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">5,6: placing text labels with explicit value and calculation of CFR for each age group.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">7: displaying y-axis labels in percent format.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">8: color scheme from <span style="font-family: "arial" , "helvetica" , sans-serif;">few_tableau()</span>.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">9-12:</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> title, subtitle, caption, and axis labels.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">13-15: customization using <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span> package and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme()</span>. </span></span></li>
</ul>
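<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The combination of line, reference line, and variable-width bars can be sketched as below. This is an approximation under my own assumptions rather than the gist: names are illustrative, the age-group counts should be verified against the source report, bar width is simply scaled by the largest group, and labels, colors, and theming are omitted. Depending on the <span style="font-family: "arial" , "helvetica" , sans-serif;">ggplot2</span> version, mapping <span style="font-family: "arial" , "helvetica" , sans-serif;">width</span> inside <span style="font-family: "arial" , "helvetica" , sans-serif;">aes()</span> may print a warning:</span></span><br />
<pre><code>
library(ggplot2)

# Illustrative data (verify the counts against the gist).
cfr_by_age &lt;- data.frame(
  age    = c("0-9", "10-19", "20-29", "30-39", "40-49",
             "50-59", "60-69", "70-79", "80+"),
  deaths = c(0, 1, 7, 18, 38, 130, 309, 312, 208),
  cases  = c(416, 549, 3619, 7600, 8571, 10008, 8583, 3918, 1408),
  stringsAsFactors = FALSE
)
cfr_by_age$cfr &lt;- cfr_by_age$deaths / cfr_by_age$cases
overall_cfr &lt;- sum(cfr_by_age$deaths) / sum(cfr_by_age$cases)

p &lt;- ggplot(cfr_by_age, aes(x = age, y = cfr)) +
  # line across the discrete age groups; group = 1 joins them into one path
  geom_line(aes(group = 1)) +
  # dotted horizontal line at the overall case fatality rate
  geom_hline(yintercept = overall_cfr, linetype = "dotted") +
  # bars whose width is proportional to the number of cases in the group
  geom_bar(aes(width = cases / max(cases)), stat = "identity", alpha = 0.5) +
  # y-axis tick labels in percent format
  scale_y_continuous(labels = scales::percent)
</code></pre>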
<h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Epidemiology</span></span></h2>
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Period of infectivity</span></span></h3>
<blockquote class="tr_bq">
<div class="headingAnchor" id="H166293483">
<span class="headingEndMark"></span><span style="font-size: large;"><span style="font-family: inherit;">The
interval during which an individual with COVID-19 is infectious is
uncertain. Most data informing this issue are from studies evaluating
viral RNA detection from respiratory and other specimens. However,
detection of viral RNA does not necessarily indicate the presence of
infectious virus. </span></span></div>
</blockquote>
<br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: inherit;">Viral RNA levels appear to be higher soon after symptom onset compared with later in the illness [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/18">18</a>];
this raises the possibility that transmission might be more likely in
the earlier stage of infection, but additional data are needed to
confirm this hypothesis. </span></span></blockquote>
<br />
<blockquote class="tr_bq">
<span style="font-size: large;"><span style="font-family: inherit;">The duration of viral shedding is also
variable; there appears to be a wide range, which may depend on severity
of illness. In one study of 21 patients with mild illness (no hypoxia),
90 percent had repeated negative viral RNA tests on nasopharyngeal
swabs by 10 days after the onset of symptoms; tests were positive for
longer in patients with more severe illness [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/19">19</a>].
In another study of 137 patients who survived COVID-19, the median
duration of viral RNA shedding from oropharyngeal specimens was 20 days
(range of 8 to 37 days) [<a class="abstract_t" href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19/abstract/20">20</a>]. </span></span></blockquote>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This chart shows the minimum, median, and maximum duration of viral shedding by infected individuals, using bars that resemble timelines:</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCyKWbUX0xElqg3gdZ24ULenvuQobTx74PKnUpqR-7S8W1PG3agnR5l4yc-bH42lkmoRHMLc0Hul_gnc03Q88YiNZ1AkQZVoM1yZKMVk_yekIyP1J_8tBai148islapr6QqqbFIPXWNMCl/s1600/COVID-19-Period-of-Infectivity.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="750" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCyKWbUX0xElqg3gdZ24ULenvuQobTx74PKnUpqR-7S8W1PG3agnR5l4yc-bH42lkmoRHMLc0Hul_gnc03Q88YiNZ1AkQZVoM1yZKMVk_yekIyP1J_8tBai148islapr6QqqbFIPXWNMCl/s1600/COVID-19-Period-of-Infectivity.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Implementation Details in R</span></span></h3>
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Dataset</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">This chart uses bars to imitate timelines of the period of infectivity, based on research into how long individuals shed viral RNA that identified minimum, median, and maximum times:</span></span></span></span><br />
<br />
<div>
<code data-gist-file="4-period-of-infectivity-data.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
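<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The embedded gist holds the actual data. As a rough sketch only, assuming the chart uses the 8, 20, and 37 day figures from the 137-patient study quoted above, such a dataset could be built like this (the data frame and column names are hypothetical):</span></span><br />

```r
# Hypothetical reconstruction: one row per statistic of viral RNA shedding
# duration (in days), using the figures from the 137-patient study quoted above.
shedding <- data.frame(
  stat = factor(c("Minimum", "Median", "Maximum"),
                levels = c("Minimum", "Median", "Maximum")),
  days = c(8, 20, 37)
)

print(shedding)
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Keeping <span style="font-family: "arial" , "helvetica" , sans-serif;">stat</span> as an ordered factor fixes the order of the bars on the chart.</span></span><br />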
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Graphics</span></span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Yet another bar chart, this time with an additional hack: several <span style="font-family: "arial" , "helvetica" , sans-serif;">geom_point()</span> layers draw an improvised icon of the SARS-CoV-2 virus:</span></span></span></span><br />
<br />
<div>
<code data-gist-file="4-period-of-infectivity-plot.R" data-gist-hide-footer="true" data-gist-id="903df6826b16f672aa8d8f5073d11057"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Line by line explainer:</span></span></span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">1,2: bar chart with <span style="font-family: "arial" , "helvetica" , sans-serif;">stat="identity"</span> displaying 3 very thin bars that imitate timelines.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">3-6: overlaying 3 different point shapes of varying size to improvise a virus icon.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">7,8: text annotation about the difference between being infectious and shedding viral RNA.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">9: flipping the x and y axes to display the timelines horizontally.</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">10-13:</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> title, subtitle, caption, and axis labels.</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">14-16: customization using <span style="font-family: "arial" , "helvetica" , sans-serif;">ggthemes</span> package and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme()</span>. </span></span></li>
</ul>
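<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For readers who cannot see the gist, here is a minimal sketch of the layers described in the explainer. It is not the original code: the data values, colors, sizes, label wording, and titles are placeholders, and <span style="font-family: "arial" , "helvetica" , sans-serif;">theme_minimal()</span> from ggplot2 stands in for the ggthemes theme so that the sketch needs only one package.</span></span><br />

```r
library(ggplot2)

# Hypothetical data: shedding duration in days (figures from the quoted study).
# Levels are reversed so that "Minimum" ends up on top after coord_flip().
shedding <- data.frame(
  stat = factor(c("Minimum", "Median", "Maximum"),
                levels = c("Maximum", "Median", "Minimum")),
  days = c(8, 20, 37)
)

p <- ggplot(shedding, aes(x = stat, y = days)) +
  # very thin bars imitate timelines
  geom_bar(stat = "identity", width = 0.05, fill = "grey40") +
  # three overlaid point shapes of different sizes improvise a virus icon
  geom_point(shape = 8, size = 8, color = "firebrick") +
  geom_point(shape = 21, size = 5, fill = "firebrick", color = "white") +
  geom_point(shape = 16, size = 2, color = "white") +
  # placeholder annotation text
  annotate("text", x = "Minimum", y = 30, vjust = -1.5, size = 3,
           label = "Shedding viral RNA is not the same as being infectious") +
  # flip the axes so the timelines run horizontally
  coord_flip() +
  labs(title = "COVID-19: Period of Infectivity",
       subtitle = "Duration of viral RNA shedding",
       caption = "Placeholder caption",
       x = NULL, y = "Days of viral RNA shedding") +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank())

print(p)
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Each point layer is drawn at the end of its bar because it inherits the same <span style="font-family: "arial" , "helvetica" , sans-serif;">x</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">y</span> aesthetics as the bars.</span></span><br />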
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Conclusions</span></span></span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Most of the facts above come from very young research on COVID-19, just a little over 3 months old. There are still many unknowns about both the SARS-CoV-2 virus and the disease. To emphasize this, I compiled a few of the unknowns in the bonus chart; some may seem surprising given the wealth of knowledge scientists have accumulated about similar diseases:</span></span></span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMbWbC6WoJKPx7O2axnQQ6rZ-8kJgNVM75Np2ZQwwrIwo3SukA0NjzoGSD1ZnhgYZ6za2LYb7RKzFecOZ6z1yJbkwyDlB92SeYj4S1Qs-WAa3EpZItJUNdBBw9XgPHSKk0xLWdqcD27af5/s1600/COVID-19-What-We-Still-Dont-Know.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="850" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMbWbC6WoJKPx7O2axnQQ6rZ-8kJgNVM75Np2ZQwwrIwo3SukA0NjzoGSD1ZnhgYZ6za2LYb7RKzFecOZ6z1yJbkwyDlB92SeYj4S1Qs-WAa3EpZItJUNdBBw9XgPHSKk0xLWdqcD27af5/s1600/COVID-19-What-We-Still-Dont-Know.png" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span></span></span>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">References</span></span></h2>
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19">Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD</a></span></span></li>
<li><a href="http://weekly.chinacdc.cn/en/article/id/e53946e2-c6c4-41e9-9a9b-fea8db1a8f51" target="_blank"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-weight: normal;">Vital Surveillances: The Epidemiological Characteristics of an Outbreak of 2019 Novel Coronavirus Diseases (COVID-19) — China, 2020</span></span></span></a><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-weight: normal;"> </span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="http://www.cidrap.umn.edu/covid-19/maps-visuals" target="_blank">COVID-19 Maps and Visuals</a></span></span></li>
</ul>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">
</span></span><a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
<b>Survey Results: What Degree is Best for Data Science?</b> (published 2020-03-18, updated 2020-03-25)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihp1REnvZwKLnR9U8drsEsIYwGKJ-PQr01DUuMXYVZPy5owIfQEtGvTS50edS32DZgHfXUF2tMneSlqBW2hs7flAnOzVRTWEJ_ehDG-m0szCE7nSojzNOLie0x7ZYJyQWRDO5AeiAxtEPi/s1600/Screen+Shot+2020-03-12+at+00.29.27.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1110" data-original-width="1214" height="365" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihp1REnvZwKLnR9U8drsEsIYwGKJ-PQr01DUuMXYVZPy5owIfQEtGvTS50edS32DZgHfXUF2tMneSlqBW2hs7flAnOzVRTWEJ_ehDG-m0szCE7nSojzNOLie0x7ZYJyQWRDO5AeiAxtEPi/s400/Screen+Shot+2020-03-12+at+00.29.27.png" width="400" /></a></div>
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The Survey </span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">These are the results of the survey <a href="https://www.surveymonkey.com/r/7FGGWS7">What Degree is Best for Data Science?</a> (the survey is still open), collected from February 9 through March 12, 2020. It asked participants 4 questions:</span></span><br />
<br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Answers about self:</span></span></li>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Q1: What is the highest level of school degree you have completed? </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Q2: Which of the following best describes the field in which you received your highest degree?</span></span></li>
</ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> Answers about best education:</span></span></li>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Q3: What level of school degree you consider optimal for successful career in data science?</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Q4: Which field of study you consider optimal for successful career in data science?</span></span></li>
</ul>
</ul>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"></span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">During that period 289 respondents participated and 285 completed all 4 questions; the 4 participants with partial answers were removed from the analysis below.</span></span></span></span><br />
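<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Dropping partial responses is a one-liner in base R. The sketch below uses a made-up toy data frame (column names and answers are hypothetical) in which <span style="font-family: "arial" , "helvetica" , sans-serif;">NA</span> marks an unanswered question:</span></span><br />

```r
# Hypothetical toy responses; NA marks an unanswered question
responses <- data.frame(
  Q1 = c("Master's", "PhD", NA, "Bachelor's"),
  Q2 = c("Statistics", "Math", "Computer Science", "Computer Science"),
  Q3 = c("Master's", "PhD", "Master's", NA),
  Q4 = c("Statistics", "Math", "Computer Science", "Statistics"),
  stringsAsFactors = FALSE
)

# Keep only respondents who answered all 4 questions
complete <- responses[complete.cases(responses), ]

# Number of partial responses dropped
dropped <- nrow(responses) - nrow(complete)
print(dropped)
```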
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Though simple and short (the average completion time was 55 seconds, after removing 6 outliers who took over 500 seconds), the survey's questions have a certain internal structure in time and subject. In time, the questions form 2 groups: one about the education a participant has already acquired and the other about the participant's recommendations for the best education. By subject, they form 2 alternative groups: the 1st and 3rd questions are about degree, and the 2nd and 4th are about field of study.</span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Answers to Each Question</span></span></h2>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPPc-DN69H7M-sdX2g9bLDBRTkaNFaPnrqKVC20Q-3jbqDRrrOHWodxh609vljNOWo1Tj6iqxDTW__fy6bQzTJeQleGprGosU_IKHlRMqQWklbWO2ZT4O-HWPOuGcjIhKrYVaft_3OV6Nd/s1600/best-data-science-edu-Q1-answers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="458" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPPc-DN69H7M-sdX2g9bLDBRTkaNFaPnrqKVC20Q-3jbqDRrrOHWodxh609vljNOWo1Tj6iqxDTW__fy6bQzTJeQleGprGosU_IKHlRMqQWklbWO2ZT4O-HWPOuGcjIhKrYVaft_3OV6Nd/s1600/best-data-science-edu-Q1-answers.png" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBpTcB5KuiCEQK3Ke438fMeSNaQstRBAnIPb1h5JfehAK-VNuKsun2DA4A2KBBHD-sKbTxEz_HSITmMidWGMt2keCk-nrpXN7Snq6QwYq9CCHLPyG7pN3z81S0o_Ix2S9dsg9nr8iYaPct/s1600/best-data-science-edu-Q2-answers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="459" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBpTcB5KuiCEQK3Ke438fMeSNaQstRBAnIPb1h5JfehAK-VNuKsun2DA4A2KBBHD-sKbTxEz_HSITmMidWGMt2keCk-nrpXN7Snq6QwYq9CCHLPyG7pN3z81S0o_Ix2S9dsg9nr8iYaPct/s1600/best-data-science-edu-Q2-answers.png" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwRbn6EWw6WFQFgY6Gs4DRzu1TjGJkMpUqLbH6mfpytopA_2hLNAAQsRZQ2eyeAnCnM1T2g_a4eEiU8LIsugkyzUQVzeqjW-kuBpva-xlFoZaBn7tUpCU4ccLpSt2Ie2FMMHM0BdDSzrX/s1600/best-data-science-edu-Q3-answers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="459" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwRbn6EWw6WFQFgY6Gs4DRzu1TjGJkMpUqLbH6mfpytopA_2hLNAAQsRZQ2eyeAnCnM1T2g_a4eEiU8LIsugkyzUQVzeqjW-kuBpva-xlFoZaBn7tUpCU4ccLpSt2Ie2FMMHM0BdDSzrX/s1600/best-data-science-edu-Q3-answers.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFlPp3bhn0pOh-nNoYMTjvvM9ywpskk-DE_1trkPeajpKDhyUkjgh5NE9_Siag-xbG645HftNFBgoutybrLrGhPiD_cB3XcopwV6F22MWGoT3IZF13wT5A-3VdcMFrDU21HcQfT7af9RMa/s1600/best-data-science-edu-Q4-answers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="459" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFlPp3bhn0pOh-nNoYMTjvvM9ywpskk-DE_1trkPeajpKDhyUkjgh5NE9_Siag-xbG645HftNFBgoutybrLrGhPiD_cB3XcopwV6F22MWGoT3IZF13wT5A-3VdcMFrDU21HcQfT7af9RMa/s1600/best-data-science-edu-Q4-answers.png" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Bird's-Eye View</span></span></h2>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSWMg0-6IPJX8ayAExpduxqmgeaEyiVmQfgbjKekl_DhJELu2B_VpACOyXeCBFCMlNC-U9pf6Ld_n3Xe4iHt2fmL9Lg8OW2ODlpZlSikyqA2Xcs5_DBhW51-rPidsw6YlJvEnoqrv8cvCS/s1600/best-data-science-edu-Q1-Q2-heatmap.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="459" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSWMg0-6IPJX8ayAExpduxqmgeaEyiVmQfgbjKekl_DhJELu2B_VpACOyXeCBFCMlNC-U9pf6Ld_n3Xe4iHt2fmL9Lg8OW2ODlpZlSikyqA2Xcs5_DBhW51-rPidsw6YlJvEnoqrv8cvCS/s1600/best-data-science-edu-Q1-Q2-heatmap.png" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht7oZkJ37C-y3FPYaSHggOTqesYV_FNjatGCgZZ94AKDBei2oxq5vPpwQXwTGSwcPS09na-84-nJJz5ZNdVbus0h9IGHHUazb3rsY9YBaJCNzh8xz7096nSFwf5lmY4b6cPedP1htN2xAr/s1600/best-data-science-edu-Q3-Q4-heatmap.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="459" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht7oZkJ37C-y3FPYaSHggOTqesYV_FNjatGCgZZ94AKDBei2oxq5vPpwQXwTGSwcPS09na-84-nJJz5ZNdVbus0h9IGHHUazb3rsY9YBaJCNzh8xz7096nSFwf5lmY4b6cPedP1htN2xAr/s1600/best-data-science-edu-Q3-Q4-heatmap.png" /></a></div>
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span></h2>
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Sankey Diagrams: How Data Flows</span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Sankey diagrams help visualize how answers flow through the questions. We start with pairs of related questions and finish with all 4 questions together. </span></span><br />
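<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Whatever package draws the diagram (this post does not name one), the input to a Sankey diagram is typically a table of links: one row per pair of answers with the number of respondents who gave that pair. A minimal base R sketch on made-up answers (the data and the column names <span style="font-family: "arial" , "helvetica" , sans-serif;">source</span>, <span style="font-family: "arial" , "helvetica" , sans-serif;">target</span>, and <span style="font-family: "arial" , "helvetica" , sans-serif;">n</span> are hypothetical):</span></span><br />

```r
# Hypothetical toy answers to Q1 (completed degree) and Q3 (best degree)
answers <- data.frame(
  Q1 = c("Master's", "Master's", "PhD", "Bachelor's", "PhD"),
  Q3 = c("Master's", "PhD", "PhD", "Master's", "PhD"),
  stringsAsFactors = FALSE
)

# Count respondents for every (Q1, Q3) pair; each count becomes a link width
flows <- as.data.frame(
  table(source = answers$Q1, target = answers$Q3),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Drop pairs that nobody chose
flows <- flows[flows$n > 0, ]
print(flows)
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The same counting applies to any other pair of questions, and chaining the pairs (Q1 to Q2, Q2 to Q3, Q3 to Q4) gives the links for a diagram covering all 4 questions.</span></span><br />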
<div style="text-align: center;">
<br />
<b><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Completed Degree and Field of Study (Q1, Q2)</span></span></b><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgF4huiDDCRpcR2IGyXagPD2qyaAuylI9sgvNaJJPN4miGC6HDsbPtBTWLP9kB2w6YRiw42FPPkpc5kOJjU9IxSa19QbDVDmhD7omOlgiBHoev5crbIAh1WMxo3JhZ3zcTb_HirnSLBnZ6u/s1600/best-data-science-edu-Q1-Q2-sankey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1082" data-original-width="1172" height="588" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgF4huiDDCRpcR2IGyXagPD2qyaAuylI9sgvNaJJPN4miGC6HDsbPtBTWLP9kB2w6YRiw42FPPkpc5kOJjU9IxSa19QbDVDmhD7omOlgiBHoev5crbIAh1WMxo3JhZ3zcTb_HirnSLBnZ6u/s640/best-data-science-edu-Q1-Q2-sankey.png" width="640" /></a></span></span></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<b><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Best Degree and Field of Study (Q3, Q4)</span></span></b><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipI8e5TOs-O8MlJD5i238WmuLi5nANImBV2MGMhLFeMSuQNYtZmTRG8YChTewlDPW71Ry-bghXOQ_nE0jeqomv11PRsUc7_YtNh4T3lB80OUInG2xIxvRBk4zpwz8NFrH6po8e6-B_5VVv/s1600/best-data-science-edu-Q3-Q4-sankey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1082" data-original-width="1172" height="590" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipI8e5TOs-O8MlJD5i238WmuLi5nANImBV2MGMhLFeMSuQNYtZmTRG8YChTewlDPW71Ry-bghXOQ_nE0jeqomv11PRsUc7_YtNh4T3lB80OUInG2xIxvRBk4zpwz8NFrH6po8e6-B_5VVv/s640/best-data-science-edu-Q3-Q4-sankey.png" width="640" /></a></div>
<br />
<b><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Completed Degree vs. Best Degree (Q1, Q3)</span></span></b><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8bd_m8KJOb6jJBDaDJk48OkVcEUB3tVsQZr2I0BCXXuutO8Ju3j7I_ZV9i__-hdbvGiK6JG8dKXOHa9ufuE_IelNV-4zUT3k3j0CD3HW-beFjWjwCqlCRPD1irY1SKMiLzHLGzf3EpDO3/s1600/best-data-science-edu-Q1-Q3-sankey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1082" data-original-width="1172" height="590" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8bd_m8KJOb6jJBDaDJk48OkVcEUB3tVsQZr2I0BCXXuutO8Ju3j7I_ZV9i__-hdbvGiK6JG8dKXOHa9ufuE_IelNV-4zUT3k3j0CD3HW-beFjWjwCqlCRPD1irY1SKMiLzHLGzf3EpDO3/s640/best-data-science-edu-Q1-Q3-sankey.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<b><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Completed Field vs. Best Field (Q2, Q4)</span></span></b><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNTwXXWKjvEL_Ri1a0Y5ny-aXdhI828rhn4hjwfO9RRH5TCcZd3F-Dm_hu_muNvGr0rGBnQJ3wFBaCDZvOnRrOqpiiDHZFNxED-DVHNk9zWMl-cyeULD02SHdUQ-BDmsDmqgatcbj3ZwQz/s1600/best-data-science-edu-Q2-Q4-sankey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1082" data-original-width="1172" height="590" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNTwXXWKjvEL_Ri1a0Y5ny-aXdhI828rhn4hjwfO9RRH5TCcZd3F-Dm_hu_muNvGr0rGBnQJ3wFBaCDZvOnRrOqpiiDHZFNxED-DVHNk9zWMl-cyeULD02SHdUQ-BDmsDmqgatcbj3ZwQz/s640/best-data-science-edu-Q2-Q4-sankey.png" width="640" /></a></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><b>Complete Flow of Answers For All 4 Questions</b></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMabEWeMvaUoeV14zqGtwqcBpWSfkqelVkcxVZn0OaIuFgLiHiITGsKLtsFDNlFMlPquwguJtdXXGBtUfLZfLm-28db17pLIlcyLnI62E831BiVuj0T1JgD2YtIGxBh4OfvT38inr4lwBW/s1600/best-data-science-edu-Q1-Q2-Q3-Q4-sankey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1082" data-original-width="1172" height="590" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMabEWeMvaUoeV14zqGtwqcBpWSfkqelVkcxVZn0OaIuFgLiHiITGsKLtsFDNlFMlPquwguJtdXXGBtUfLZfLm-28db17pLIlcyLnI62E831BiVuj0T1JgD2YtIGxBh4OfvT38inr4lwBW/s640/best-data-science-edu-Q1-Q2-Q3-Q4-sankey.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Concluding comments</span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The results are self-evident. The survey is still open so anyone who didn't participate can still do so and let others know about it. </span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If you haven't noticed yet, there is a certain bias towards statistics in the answers. This might be because a significant share of respondents reached the survey via <a href="https://www.r-bloggers.com/">R-bloggers</a>, a distribution channel popular among R users (who often have a background in statistics). </span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Finally, there is another implicit bias: people with a degree in Math are likely to suggest Math as the best field, and so on for other fields and degrees. This sort of bias is evident in the Sankey diagrams above: see the (Q1, Q3) and (Q2, Q4) diagrams. Removing such bias from the results could be useful. I attempted this exercise, but my DIY approach was too naive and the methods in the resources I found were too extensive to work through in a short period of time. If you have pointers or, even better, a method for removing such bias from the answers, I'd love to hear from you.</span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span><a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
<b>Survey: What Degree is Best for Data Science?</b> (published 2020-02-21, updated 2020-02-26)<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjoQajV7xMkXjXwEjfy0L8cixC5YjarfEeLx8k54C6zFmquqZRONhVoUZZdrsCO3ljS6JtBuzoSVGKcwcjH8y6tQBICfFHfp0WDQCFSsSWJVhbpPUDlduMMC5OUHLHflerheHIYotwwN_G/s1600/noun_education_1724963%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjoQajV7xMkXjXwEjfy0L8cixC5YjarfEeLx8k54C6zFmquqZRONhVoUZZdrsCO3ljS6JtBuzoSVGKcwcjH8y6tQBICfFHfp0WDQCFSsSWJVhbpPUDlduMMC5OUHLHflerheHIYotwwN_G/s320/noun_education_1724963%25281%2529.png" width="320" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;"><i><span style="font-family: "georgia" , "times new roman" , serif;">TL;DR<br />Just answer 4 questions about best degree for Data Science here:<br /> <a href="https://www.surveymonkey.com/r/7FGGWS7">https://www.surveymonkey.com/r/7FGGWS7</a></span></i></span> </div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">No doubt anyone asking the question "What's the best degree for Data Science?" should not expect a unified opinion, or even just a few opinions (unless everything I know about people practicing data science is wrong). <a href="https://www.datasciencecentral.com/profile/StephanieGlen">Stephanie Glen</a> analyzed various sources on the topic to show just that: </span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3774712371?profile=original" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="391" data-original-width="800" height="313" src="https://storage.ning.com/topology/rest/1.0/file/get/3774712371?profile=original" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: Best Degree for Data Science (in One Picture) <br />
https://www.datasciencecentral.com/profiles/blogs/best-degree-for-data-science-in-one-picture</td></tr>
</tbody></table>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">To replicate her analysis with answers from data science practitioners, I constructed a 1-minute anonymous survey asking the same question: https://www.surveymonkey.com/r/7FGGWS7</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">There you will find 4 questions: 2 on the degree you have and 2 on the degree you recommend. After collecting 100+ responses I will share the results. Thank you for participating!</span></span><br />
<a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
<b>H2O.ai Academic Program for Professors and Students: Quick Start with Driverless AI and Paperspace</b> (published 2020-02-11, updated 2020-04-17)<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If you are a professor teaching, or a student enrolled in, a machine learning program or a non-technical program with a hands-on machine learning lab, becoming a member of the <a href="https://www.h2o.ai/academic/" target="_blank">H2O.ai Academic Program</a> will get you a free software license for non-commercial use in education and research. In November 2018 H2O.ai (my employer) made its ground-breaking automated machine learning (AutoML) platform <a href="https://www.h2o.ai/products/h2o-driverless-ai/" target="_blank">Driverless AI</a> available to academia for free. </span></span><br />
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">What Does Driverless AI Do?</span></span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">H2O.ai defines Driverless AI as </span></span></span></span><br />
<blockquote class="tr_bq">
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html#overview">"<span style="background-color: #fcfcfc; color: #404040; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">an artificial intelligence platform for automatic machine learning</span>"</a> </span></blockquote>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">To find out how Driverless AI automates machine learning activities into an integral and repeatable workflow that seamlessly encompasses feature engineering, model validation, hyper-parameter tuning, model selection and ensembles, custom recipes for transformers, models and scorers, automated model documentation, and finally model deployment, visit the <a href="https://www.h2o.ai/products/h2o-driverless-ai/#features" target="_blank">User Guide</a>. Not to be forgotten is the MLI (Machine Learning Interpretability) module, which offers tools for both white-box and black-box model interpretability, model debugging, disparate impact analysis, and what-if (sensitivity) analysis.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<h2>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H2O.ai Academic Program</span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To sign up for the H2O.ai Academic Program, launched <a href="https://www.h2o.ai/company/news/h2o-ai-launches-academic-program-to-accelerate-discovery-with-ai/" target="_blank">back in October of 2018</a>, start by filling out <a href="https://www.h2o.ai/academic/#sign-up" target="_blank">this form</a>, provided the following conditions hold true:</span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">the intended use is non-commercial, for education and research purposes only, and</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">you belong to a higher education institution or are a student currently enrolled in a higher education degree program, and</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">if you are a student, your academic status can be verified by sending a photo of your current student ID to <a href="mailto:academic@h2o.ai">academic@h2o.ai</a> (required).</span></li>
</ul>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Upon approval H2O.ai will issue a free license for Driverless AI for non-commercial use only. While waiting to be approved, apply for access to the <a href="http://h2oai-community.slack.com/" target="_blank">H2O.ai Community Slack channel</a> <a href="https://www.h2o.ai/community/#community-form" target="_blank">here</a> (and don't forget to join <i>#academic</i>).</span><br />
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI Installation Options </span></span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">After receiving a license key, follow the installation instructions for <a href="http://prerelease.h2o.ai/docs/userguide/install/mac-osx.html" target="_blank">Mac OS X</a> or <a href="http://prerelease.h2o.ai/docs/userguide/install/windows.html" target="_blank">Windows 10 Pro</a> (installing via the WSL Ubuntu option is highly preferred) to run Driverless AI on your workstation or laptop. While such an approach suffices for small datasets, serious problems demand installing and running Driverless AI on modern data center hardware with multiple CPUs and one or several GPUs for best results.</span></span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Several economical cloud providers offer such hardware. For general guidelines and instructions for native DEB installation on Linux Ubuntu see <a href="http://prerelease.h2o.ai/docs/userguide/install/linux-deb.html" target="_blank">here</a>. The steps below can be traced back to this documentation.</span></span></span><br />
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Why Paperspace </span></span></span></h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://www.paperspace.com/" target="_blank">Paperspace</a> offers a robust choice of configurations to provision and run Linux Ubuntu VMs with a single GPU (no multi-GPU systems are available). The pricing appears competitive enough to suit a thrifty academic budget, starting at around $0.50/hour for GPU systems with 30 GB of memory, which should comfortably host Driverless AI. It also features a simple, streamlined interface to deploy and manage VMs.</span></span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Step-by-Step Guide </span></span></span></h2>
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Spinning up Linux VM </span></span></span></h3>
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">1. Create Paperspace Account</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Start by creating an account at <a href="https://paperspace.io/&R=5NXWB5R" target="_blank">paperspace.com</a>:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5j9vHHwUXOQVN6X13yBRmPY4dnXCeDEd5ocTUdA9Mofegj1mtamDo-S0tGa1YqAoxhDdPXsnv-uV1EZaz6dwne7KirQi3kF85FmfFQxSLSChIoWf2fJOPWnOi690osy4vNeOo_s2Q09-k/s1600/Screen+Shot+2020-02-10+at+00.00.08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5j9vHHwUXOQVN6X13yBRmPY4dnXCeDEd5ocTUdA9Mofegj1mtamDo-S0tGa1YqAoxhDdPXsnv-uV1EZaz6dwne7KirQi3kF85FmfFQxSLSChIoWf2fJOPWnOi690osy4vNeOo_s2Q09-k/s640/Screen+Shot+2020-02-10+at+00.00.08.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">2. Create a Cloud VM</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">After successfully creating an account, proceed to create a cloud VM:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM1gMcj3wMoSpyMHAnS1rsM1Lnd2U5dHkSeyXM2USita_2lliz35FANcY1qJwBM_gg3nT3HLEZnDiFmlcXgoZfZSi3dmetE-M7XStuH4yGBIN8-CCt0MzRqwUegkr3H3h72ZORiZTDtrZu/s1600/Screen+Shot+2020-02-10+at+00.02.03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM1gMcj3wMoSpyMHAnS1rsM1Lnd2U5dHkSeyXM2USita_2lliz35FANcY1qJwBM_gg3nT3HLEZnDiFmlcXgoZfZSi3dmetE-M7XStuH4yGBIN8-CCt0MzRqwUegkr3H3h72ZORiZTDtrZu/s640/Screen+Shot+2020-02-10+at+00.02.03.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">3. Start Adding New Machine</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Under <span style="font-family: "verdana" , sans-serif;">Core -> Compute -> Machines</span> on the left, select <span style="font-family: "verdana" , sans-serif;">(+)</span> to add a new machine:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKoQANwR_aPpFTjXjN0pa6Ik7-S0uvLHH_dgEm6tBGA4kZLG-m5QGNZSmYPS4r5B2JJYiuPBvjyEr-3jJN-jIaMZfqRlkqaPU9ShqRtKAT6daHr02Hi4ormzX7aNetX8Gc9_rjFC6xE_m0/s1600/Screen+Shot+2020-02-10+at+00.02.46.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKoQANwR_aPpFTjXjN0pa6Ik7-S0uvLHH_dgEm6tBGA4kZLG-m5QGNZSmYPS4r5B2JJYiuPBvjyEr-3jJN-jIaMZfqRlkqaPU9ShqRtKAT6daHr02Hi4ormzX7aNetX8Gc9_rjFC6xE_m0/s640/Screen+Shot+2020-02-10+at+00.02.46.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">4. Machine Location</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Choose the region closest to your location; in my case it was <span style="font-family: "verdana" , sans-serif;">"East Coast (NY2)"</span>:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd-U8Xw-9shJkrxPPm_C3GYfAJ5VuZxTKQMil33zIjDjweEEqqh_rhUscTsaMZeud6je6WmAIM2Q_vKzldBbChZXuDg7d79uMKkXQt29u8zr5rqZ9ICWDjMoB4gj3IhyphenhyphenH-r0ZRpdMSThl9/s1600/Screen+Shot+2020-02-10+at+00.03.55.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd-U8Xw-9shJkrxPPm_C3GYfAJ5VuZxTKQMil33zIjDjweEEqqh_rhUscTsaMZeud6je6WmAIM2Q_vKzldBbChZXuDg7d79uMKkXQt29u8zr5rqZ9ICWDjMoB4gj3IhyphenhyphenH-r0ZRpdMSThl9/s640/Screen+Shot+2020-02-10+at+00.03.55.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">5. Choose Operating System Type </span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Scroll down to <span style="font-family: "verdana" , sans-serif;">"Choose OS"</span> and click on <span style="font-family: "verdana" , sans-serif;">"Linux Templates"</span>:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmwmE1TrNeUV2EB-YHC1sh0M-stBhQOKpLVgh8dYSQ385BwIDeWGXkujulTZ_HyG-pRyCdRLzvEl52pBzWvqheQ9r0aUGqMh3XLsKIezmdKmVtKvpiICgfPlFaCN2oHXDlfs9SSRi6HMKV/s1600/Screen+Shot+2020-02-10+at+00.06.34.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmwmE1TrNeUV2EB-YHC1sh0M-stBhQOKpLVgh8dYSQ385BwIDeWGXkujulTZ_HyG-pRyCdRLzvEl52pBzWvqheQ9r0aUGqMh3XLsKIezmdKmVtKvpiICgfPlFaCN2oHXDlfs9SSRi6HMKV/s640/Screen+Shot+2020-02-10+at+00.06.34.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">6. Choose OS Version</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Keep the default Ubuntu 16.04 server image:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXpvB1x459CN2k6D3N3Fn1RIuwt5yiGKGz2B_I44iKw6nCyer5VqlyVgoTh6itghCp_hUELmYaZQTCDP8ijpmBkrj4B7ez75HN3P4mXDSBmQgjsiFbwYR4MI0aPpCi6XnCdOg4y8cp9crh/s1600/Screen+Shot+2020-02-10+at+00.07.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXpvB1x459CN2k6D3N3Fn1RIuwt5yiGKGz2B_I44iKw6nCyer5VqlyVgoTh6itghCp_hUELmYaZQTCDP8ijpmBkrj4B7ez75HN3P4mXDSBmQgjsiFbwYR4MI0aPpCi6XnCdOg4y8cp9crh/s640/Screen+Shot+2020-02-10+at+00.07.10.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">7. Pick Machine Type (How Much to Pay)</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Scroll down to choose a machine profile (keep the hourly rate): for a GPU VM pick type <span style="font-family: "verdana" , sans-serif;">"P4000"</span> or a more expensive machine type with a GPU, while for a CPU-only system pick <span style="font-family: "verdana" , sans-serif;">"C6"</span> or higher (if this instance type is not enabled, instructions to enable it should pop up):</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8sCvGxvkdz59gE05nfUFprcMg_sasKNWYt6520GAS8F3YLBXM73mZIElFhlX6v4u8wia27tHnJjFOD9Ox3NgRqMd-Op_t9dGC7_rAccwiIPSVdAg_wACqTTkFXcjEi9ALQ6kW4lZ-4iOj/s1600/Screen+Shot+2020-02-10+at+00.09.15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8sCvGxvkdz59gE05nfUFprcMg_sasKNWYt6520GAS8F3YLBXM73mZIElFhlX6v4u8wia27tHnJjFOD9Ox3NgRqMd-Op_t9dGC7_rAccwiIPSVdAg_wACqTTkFXcjEi9ALQ6kW4lZ-4iOj/s640/Screen+Shot+2020-02-10+at+00.09.15.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span></span><br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">8. Enable Public IP</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Scroll down to <span style="font-family: "verdana" , sans-serif;">"Public IP"</span> and enable it while keeping other settings unchanged, except perhaps for <span style="font-family: "verdana" , sans-serif;">"Storage"</span> and <span style="font-family: "verdana" , sans-serif;">"Auto-Shutdown"</span>. While 50 GB of storage suffices for many applications, if you plan to use larger datasets or create massive numbers of models, increase your storage accordingly: allocate at least 20 times as much storage as the largest dataset you plan to use. Lastly, change the auto-shutdown timeout according to your needs:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNZA4wtoLJ-E-XNht7mXJQ1luMHEytgMjGMkKtVRaRZg5OTlg4XUIiC8SFk2JtC1r2AZMffOatj0JP9woPeKX4gG2LiQB-Bc5KCVcOTY_pK8xMYCRsZsyBP80uJm09oXxDpwMiQhWtUuGX/s1600/Screen+Shot+2020-02-10+at+00.20.32.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNZA4wtoLJ-E-XNht7mXJQ1luMHEytgMjGMkKtVRaRZg5OTlg4XUIiC8SFk2JtC1r2AZMffOatj0JP9woPeKX4gG2LiQB-Bc5KCVcOTY_pK8xMYCRsZsyBP80uJm09oXxDpwMiQhWtUuGX/s640/Screen+Shot+2020-02-10+at+00.20.32.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
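<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The storage rule of thumb above can be sketched as a small shell helper. This is only an illustration: the function name <span style="font-family: "verdana" , sans-serif;">recommended_storage_gb</span> is hypothetical, and the 50 GB floor simply reuses the size this guide treats as sufficient for many applications:</span></span><br />
<pre><code># Sketch only: recommended_storage_gb is a hypothetical helper, not a Paperspace tool.
# Argument: size of the largest dataset you plan to use, in whole GB.
recommended_storage_gb() {
  dataset_gb="$1"
  multiplier=20    # "at least 20 times" rule of thumb from this guide
  minimum_gb=50    # size this guide treats as enough for many applications
  needed=$(( dataset_gb * multiplier ))
  if [ "$needed" -lt "$minimum_gb" ]; then
    needed="$minimum_gb"
  fi
  echo "$needed"
}

recommended_storage_gb 5    # prints 100
recommended_storage_gb 1    # prints 50
</code></pre>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Treat the result as a lower bound and round up to the nearest storage size Paperspace offers.</span></span><br />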
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">9. Apply <span style="font-family: "verdana" , sans-serif;"><span style="font-weight: normal;">5NXWB5R</span></span> Promo Code with Payment</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Scroll down to the payment section, enter your credit card information, and apply promotion code <span style="font-family: "verdana" , sans-serif;">5NXWB5R</span> (Paperspace should credit your account $10.00) before finally creating the VM with the <span style="font-family: "verdana" , sans-serif;">"Create Your Paperspace"</span> button:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-zrlH9kZSeBFoCNo3BHKingSJZpCDOXvHsIHnhecAihGj6d1hw9w2C9eX3JTAdgcAPUuQp5nRTob6B9JyPdv7TklQ8keHpopygCeUhSfsrXPMTv3Nh7EpCIbftD95RO0CLppbfKk70z-i/s1600/Screen+Shot+2020-02-10+at+00.22.12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-zrlH9kZSeBFoCNo3BHKingSJZpCDOXvHsIHnhecAihGj6d1hw9w2C9eX3JTAdgcAPUuQp5nRTob6B9JyPdv7TklQ8keHpopygCeUhSfsrXPMTv3Nh7EpCIbftD95RO0CLppbfKk70z-i/s640/Screen+Shot+2020-02-10+at+00.22.12.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">10. Creating VM</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">While the new system initializes, its state appears as <span style="font-family: "verdana" , sans-serif;">"Provisioning"</span>:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd7vjOspi7fu6AKWUuwAjjz7d4YeI8CEa0Xlnrzkd-OxZD4FKXTmA8i6Io5t13lvJ7Vup1tGUfe5rTvHyL1UGcN0lQImeruSpPX_iJgiAooiGjHn9h-ZMSybKE43G3c-cel91AHdhftwTU/s1600/Screen+Shot+2020-02-10+at+00.25.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd7vjOspi7fu6AKWUuwAjjz7d4YeI8CEa0Xlnrzkd-OxZD4FKXTmA8i6Io5t13lvJ7Vup1tGUfe5rTvHyL1UGcN0lQImeruSpPX_iJgiAooiGjHn9h-ZMSybKE43G3c-cel91AHdhftwTU/s640/Screen+Shot+2020-02-10+at+00.25.10.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">11. Wait for System to Start</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Wait a minute or two until the system state changes to <span style="font-family: "verdana" , sans-serif;">"On/Ready"</span>, then click on the small gear inside the box in the upper right corner to move to the system console:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjObtWTkUq8iKfrg-0N6vZIg1mstcXic6fJyh1HjiSXWF4kXgF1CY3xcl6aFnfSXP1Rs8CT-GmSZNCbD4aQEn2Km2R9mGm4o_Zrkrds_0tn8_uXr3M2jLtAu4MGOT5V0dEy4b4ANBD0LnF/s1600/Screen+Shot+2020-02-10+at+00.27.28.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjObtWTkUq8iKfrg-0N6vZIg1mstcXic6fJyh1HjiSXWF4kXgF1CY3xcl6aFnfSXP1Rs8CT-GmSZNCbD4aQEn2Km2R9mGm4o_Zrkrds_0tn8_uXr3M2jLtAu4MGOT5V0dEy4b4ANBD0LnF/s640/Screen+Shot+2020-02-10+at+00.27.28.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">12. System Console</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">The system console displays detailed information about the VM, including the public IP address assigned to it:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4QevuKSSSlAjEh5IZS2z-TTMHBBqSrRryVB0pr9GzZK_-GszuRYO0O_mBAtgQDT-6LREAHzmBcV37gJn99HIxakIt0Zf1f6F1X5aa5egwL0vVl_m2kj966SPCEaEg4yDBTWR4WNaPCPXP/s1600/Screen+Shot+2020-02-10+at+00.42.23.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4QevuKSSSlAjEh5IZS2z-TTMHBBqSrRryVB0pr9GzZK_-GszuRYO0O_mBAtgQDT-6LREAHzmBcV37gJn99HIxakIt0Zf1f6F1X5aa5egwL0vVl_m2kj966SPCEaEg4yDBTWR4WNaPCPXP/s640/Screen+Shot+2020-02-10+at+00.42.23.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">13. Notification from Paperspace</span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Next, find the email from Paperspace with the system password:</span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjosZsgBTqScGhzMwniLxt84nwcGNVE1JWh-FCNcEnG7pGEmgxbCyXWkwBQz-Lxow0aBb2cR6RKQfrOYIkacHb2XJcUg_TaXWFWc8b29es8tZuByToaj0OC5OOBL8lutIFLUOSPZiLH4aoa/s1600/Screen+Shot+2020-02-10+at+01.20.09.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1528" data-original-width="1486" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjosZsgBTqScGhzMwniLxt84nwcGNVE1JWh-FCNcEnG7pGEmgxbCyXWkwBQz-Lxow0aBb2cR6RKQfrOYIkacHb2XJcUg_TaXWFWc8b29es8tZuByToaj0OC5OOBL8lutIFLUOSPZiLH4aoa/s640/Screen+Shot+2020-02-10+at+01.20.09.png" width="622" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">With the public IP address and password you can <span style="font-family: "verdana" , sans-serif;">ssh</span> (on Mac OS X or Linux) or connect using <b>putty</b> (on Windows) to the Paperspace VM and install the Driverless AI software by following the steps for a <a href="http://prerelease.h2o.ai/docs/userguide/install/linux-deb.html#" target="_blank">vanilla Ubuntu system</a>. This example continues with that install to show all steps in detail. </span></span></span><br />
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Installing Prerequisites</span></span></span></h3>
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">14. <span style="font-family: "georgia" , "times new roman" , serif;">Terminal Access to VM</span></span></span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Use <span style="font-family: "verdana" , sans-serif;">ssh</span> to connect to the Paperspace VM from a terminal using the public IP and password shown in steps 12 and 13 (ssh below is used on Mac OS X; for other OSes adjust accordingly):</span></span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVhNd-Q_zGWrCubuvpf6SAoXdVXukMLks62egCx8s8_7TrOX76XDsM8NTaEkhNX1Kf1JAsjr4LPvrmEokE8LHso8JQstD0hnsT1diXFYGxXC6FOopVAaorSLytSQ3y5uV3D_5LIDfMeKvA/s1600/Screen+Shot+2019-01-13+at+18.24.17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="322" data-original-width="1256" height="164" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVhNd-Q_zGWrCubuvpf6SAoXdVXukMLks62egCx8s8_7TrOX76XDsM8NTaEkhNX1Kf1JAsjr4LPvrmEokE8LHso8JQstD0hnsT1diXFYGxXC6FOopVAaorSLytSQ3y5uV3D_5LIDfMeKvA/s640/Screen+Shot+2019-01-13+at+18.24.17.png" width="640" /></a></span></span></span></div>
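<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The connection step can be sketched in text form as follows. Both values are assumptions to replace with your own: <span style="font-family: "verdana" , sans-serif;">203.0.113.10</span> is a placeholder for the public IP from step 12, and <span style="font-family: "verdana" , sans-serif;">paperspace</span> is assumed to be the default login user on Paperspace Linux templates (check the email from step 13 if yours differs):</span></span><br />
<pre><code># Sketch only: both values below are assumptions to replace with your own.
PUBLIC_IP="203.0.113.10"   # placeholder; use the public IP from the system console (step 12)
VM_USER="paperspace"       # assumed default login user on Paperspace Linux templates
SSH_TARGET="${VM_USER}@${PUBLIC_IP}"

# Print the command to double-check it, then run it yourself and
# enter the password from the Paperspace email (step 13) when prompted.
echo "ssh ${SSH_TARGET}"
</code></pre>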
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">15. Change the Paperspace-assigned password (optional):</span></span></span></h4>
<br />
<div>
<code data-gist-file="5-change-paperspace-passwd.sh" data-gist-hide-footer="true" data-gist-id="cbcfaeb292897b76d713e006f2638736"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">16. Install core packages (optional):</span></span></span></h4>
<br />
<div>
<code data-gist-file="1-install-core-packages.sh" data-gist-hide-footer="true" data-gist-id="cbcfaeb292897b76d713e006f2638736"></code>
</div>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">17. Add support for NVIDIA GPU libraries (CUDA 10):</span></span></span></h4>
<br />
<div>
<code data-gist-file="2-install-nvidia-for-gpu.sh" data-gist-hide-footer="true" data-gist-id="cbcfaeb292897b76d713e006f2638736"></code>
</div>
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">18. Install other prerequisites and open the port Driverless AI listens on:</span></span></h4>
<br />
<div>
<code data-gist-file="2a-install-other-prerequisites.sh" data-gist-hide-footer="true" data-gist-id="cbcfaeb292897b76d713e006f2638736"></code>
</div>
<div class="toc-pro">
</div>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Installing Driverless AI </span></span></span></h3>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">19. H2O Download Page</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Leave the <span style="font-family: "verdana" , sans-serif;">ssh</span> terminal open, switch to a browser, and go to the H2O.ai <a href="https://www.h2o.ai/download/" target="_blank">download page</a>. Choose the latest version of Driverless AI:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzqp9CXKHqNuvciXD2cP1xCnVFVZzwdvqh-z_G9gz348ipajA07W2olvI8F53rd3ILn_q5M_THpsBECOM0sQ_iW980iBqWciA73QIQodCMFMXbJp9PcQd5SpcxrmM8UtjINnH98n6EZVWv/s1600/Screen+Shot+2020-02-10+at+00.50.29.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzqp9CXKHqNuvciXD2cP1xCnVFVZzwdvqh-z_G9gz348ipajA07W2olvI8F53rd3ILn_q5M_THpsBECOM0sQ_iW980iBqWciA73QIQodCMFMXbJp9PcQd5SpcxrmM8UtjINnH98n6EZVWv/s640/Screen+Shot+2020-02-10+at+00.50.29.png" width="640" /></a></div>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">20. Download Link</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Go to the <span style="font-family: "verdana" , sans-serif;">Linux (X86)</span> tab, then right-click the <span style="font-family: "verdana" , sans-serif;">"Download"</span> link for the DEB package and copy the link location:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9WVJzAEinEMgZDM1x0K-ET3DVY_9LJnGEjctQ7_BEIomzjMXLV51KC9JKalJAWWxnzQZxHjIGQYb8HS2VU0CNdLr1bfIuLK9pdUCZroJwkuzczWkbbPnaqx0cRDVHGYxQPl5O9ICPTfNg/s1600/Screen+Shot+2020-02-10+at+00.52.03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9WVJzAEinEMgZDM1x0K-ET3DVY_9LJnGEjctQ7_BEIomzjMXLV51KC9JKalJAWWxnzQZxHjIGQYb8HS2VU0CNdLr1bfIuLK9pdUCZroJwkuzczWkbbPnaqx0cRDVHGYxQPl5O9ICPTfNg/s640/Screen+Shot+2020-02-10+at+00.52.03.png" width="640" /></a></div>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">21. Back to Terminal Access</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Return to the <span style="font-family: "verdana" , sans-serif;">ssh</span> terminal session connected to the Paperspace VM. If the session timed out or became inactive, repeat step 14.</span></span><br />
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">22. Download and install the Driverless AI DEB package:</span></span></h4>
<br />
<div>
<code data-gist-file="3-install-dai-deb.sh" data-gist-hide-footer="true" data-gist-id="cbcfaeb292897b76d713e006f2638736"></code>
</div>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">23. Install Completed</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">After the installer finishes successfully, it displays the following helpful information:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhg0Qz28NMipsuP7cw_2dY5HYAN_-5QOsjnlaX52YDCZvP_wqwhjm8YmRe1B40Yi0RP9MQepQMaxedguBxh1Ov2-1KZoxC1DNH0SFLEiijJfUgqT8kh8ezbjWe5roD6GxdWFLT9QJOwcEw5/s1600/Screen+Shot+2019-03-11+at+22.03.46.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="748" data-original-width="1600" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhg0Qz28NMipsuP7cw_2dY5HYAN_-5QOsjnlaX52YDCZvP_wqwhjm8YmRe1B40Yi0RP9MQepQMaxedguBxh1Ov2-1KZoxC1DNH0SFLEiijJfUgqT8kh8ezbjWe5roD6GxdWFLT9QJOwcEw5/s640/Screen+Shot+2019-03-11+at+22.03.46.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">24. Start Driverless AI</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Check that Driverless AI is installed but inactive, then start it and check its status and logs again:</span></span><br />
<br />
<div>
<code data-gist-file="4a-start-dai.sh" data-gist-hide-footer="true" data-gist-id="cbcfaeb292897b76d713e006f2638736"></code>
</div>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">25. Web Access</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Open a browser and enter a URL with the VM's public IP address, for example http://209.51.170.97:12345 (ignore the 127.0.0.1 in the screenshots; I was using port forwarding when I took them):</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIoIFtKhci3yZy709vxvYSHU9tcar8hS8zipg4sa5ZngZ5bpZovtsC0Zx6gjf2qlv7Dhnvj9qNjZwD1frEnGEsLQzvfFDZnM-dL-frp596v95PSJMmaOchw7l7hUpRV_QjZl3OcaB2nHEr/s1600/Screen+Shot+2020-02-10+at+01.22.01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIoIFtKhci3yZy709vxvYSHU9tcar8hS8zipg4sa5ZngZ5bpZovtsC0Zx6gjf2qlv7Dhnvj9qNjZwD1frEnGEsLQzvfFDZnM-dL-frp596v95PSJMmaOchw7l7hUpRV_QjZl3OcaB2nHEr/s640/Screen+Shot+2020-02-10+at+01.22.01.png" width="640" /></a></div>
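<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If the page does not load, it can help to confirm that the port is reachable from your own machine before troubleshooting the browser. The snippet below is only a sketch of such a check: the helper name <span style="font-family: "verdana" , sans-serif;">is_port_open</span> is made up for illustration, and it assumes Driverless AI listens on port 12345 as in the URL above.</span></span><br />

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and unreachable hosts
        return False
```

<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Calling <span style="font-family: "verdana" , sans-serif;">is_port_open("209.51.170.97", 12345)</span> with your own public IP should return <span style="font-family: "verdana" , sans-serif;">True</span> once Driverless AI is running and the port has been opened; <span style="font-family: "verdana" , sans-serif;">False</span> suggests revisiting the steps that open the port and start the service.</span></span><br />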
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">26. License Agreement</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Scroll down and accept the license agreement:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0tKhfh2YZ0RgIfhqAY4xV7DzwGzhqCFCEkrhA6yRMyuBTBPFm2hYwICs6dWYuRGmafV4koXfHdO7W4thm6QrdGR4147Z-1KanN3Tfgi1OQBrhW9sE8HiMskSa43205SEuk_4g6oW-_YoE/s1600/Screen+Shot+2020-02-10+at+01.22.48.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0tKhfh2YZ0RgIfhqAY4xV7DzwGzhqCFCEkrhA6yRMyuBTBPFm2hYwICs6dWYuRGmafV4koXfHdO7W4thm6QrdGR4147Z-1KanN3Tfgi1OQBrhW9sE8HiMskSa43205SEuk_4g6oW-_YoE/s640/Screen+Shot+2020-02-10+at+01.22.48.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">27. Login to Driverless AI</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI displays the login screen; enter the credentials <span style="font-family: "verdana" , sans-serif;">h2oai/h2oai</span>:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixJD_Cou4ry0XjFIpKTFNFsIGAt-uGdZTt1o-aWLsnGmU8xj2KM1osTyCyHHzCilfpav7YrZO8yNHGgwr3diQfhhtNgADobHCHZC8tIAPwm8l28HeDQbJ_FKNIaeTJwzzD-XERMkddWMwe/s1600/Screen+Shot+2020-02-10+at+01.23.39.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixJD_Cou4ry0XjFIpKTFNFsIGAt-uGdZTt1o-aWLsnGmU8xj2KM1osTyCyHHzCilfpav7YrZO8yNHGgwr3diQfhhtNgADobHCHZC8tIAPwm8l28HeDQbJ_FKNIaeTJwzzD-XERMkddWMwe/s640/Screen+Shot+2020-02-10+at+01.23.39.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">28. Activate License</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI prompts you to <span style="font-family: "verdana" , sans-serif;">Enter License</span> to activate the software license:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg38vxXhb6ONo72C4kGb_q4PkwEJuQwEcAGAl8vV8kJ9jKzm0D8N59UTE3LmRclL026_-OOMJ5gbDrzuDE7-jRL2M5z-fcF2mVhEaL4I7kZ46LvKIBUnMFOu6Mjt4HHcC9Oq9TEyJUgXqdf/s1600/Screen+Shot+2020-02-10+at+01.24.19.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg38vxXhb6ONo72C4kGb_q4PkwEJuQwEcAGAl8vV8kJ9jKzm0D8N59UTE3LmRclL026_-OOMJ5gbDrzuDE7-jRL2M5z-fcF2mVhEaL4I7kZ46LvKIBUnMFOu6Mjt4HHcC9Oq9TEyJUgXqdf/s640/Screen+Shot+2020-02-10+at+01.24.19.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">29. License Key</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Enter the Driverless AI license key you received by enrolling in the H2O.ai Academic Program and press <span style="font-family: "verdana" , sans-serif;">Save</span>:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifcPUOzH6jeMphAHgDuyEcZaSjC9Oq0fU5bTFFEpGnCe10e82IvAWuU4xmzmD6jSw2bekTybODjNuLXb0tPdLKbK_6VFR_3okQ8DEi7Dgt4Dtt88gp5uLRpMWqVBl3tKhZMivCi12UQvdd/s1600/Screen+Shot+2020-02-10+at+01.25.49.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifcPUOzH6jeMphAHgDuyEcZaSjC9Oq0fU5bTFFEpGnCe10e82IvAWuU4xmzmD6jSw2bekTybODjNuLXb0tPdLKbK_6VFR_3okQ8DEi7Dgt4Dtt88gp5uLRpMWqVBl3tKhZMivCi12UQvdd/s640/Screen+Shot+2020-02-10+at+01.25.49.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">30. All Done</span></span></h4>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The Driverless AI platform is now fully enabled to help with your research, your studies, or both:</span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUVEOkJ3Id4CSSkJMHJWRHEESZOwXsd95MoZR1USWOxPyDf32UIViKrLafLrFkGc3DahGTnPoB-ikZniDAQvP9tlseFW-LT1zZqEjcu2ub17NRYpkFkGotzry3jor_PYWmX7nmtpopR23m/s1600/Screen+Shot+2020-02-10+at+01.27.04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUVEOkJ3Id4CSSkJMHJWRHEESZOwXsd95MoZR1USWOxPyDf32UIViKrLafLrFkGc3DahGTnPoB-ikZniDAQvP9tlseFW-LT1zZqEjcu2ub17NRYpkFkGotzry3jor_PYWmX7nmtpopR23m/s640/Screen+Shot+2020-02-10+at+01.27.04.png" width="640" /></a></div>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Resources </span></span></h2>
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://www.h2o.ai/blog/" target="_blank">H2O.ai blog</a> </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://h2oai.github.io/tutorials/" target="_blank">H2O.ai tutorials</a> </span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Try Driverless AI for free at <a href="https://aquarium.h2o.ai/" title="https://aquarium.h2o.ai/"><span style="color: #03386d;">https://aquarium.h2o.ai/</span></a></span></span></li>
<li><a href="https://support.paperspace.com/hc/en-us/articles/236362888-Public-IPs"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="color: #03386d;">Paperspace public IP addresses</span></span></span></span></a></li>
<li><a href="https://support.paperspace.com/hc/en-us/articles/115001876167-How-to-SSH-into-your-Paperspace-machine"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="color: #03386d;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">How to SSH into your Paperspace machine</span></span></span></span></span></span></a></li>
<li><a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="color: #03386d;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI User Guide</span></span></span></span></span></span></a>
</li>
</ul>
<br />
Gregory Kanevsky<br />
<br />
How H2O propels data scientists ahead of itself: enhancing Driverless AI models with advanced options, recipes and visualizations (published 2019-12-14, updated 2019-12-16)<br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">H2O engineers continually innovate and introduce new techniques by adopting the latest research, working on cutting-edge use cases, and participating in and winning machine learning competitions such as those on Kaggle. But thanks to the explosion of AI research and applications, even the most advanced automated machine learning platform, such as <a href="https://www.h2o.ai/products/h2o-driverless-ai/" target="_blank">H2O.ai Driverless AI</a>, cannot come with all the bells and whistles needed to satisfy every data scientist out there. This means there may be a feature or algorithm that a customer wants but does not yet find in the <a href="http://docs.h2o.ai/" target="_blank">H2O docs</a>.</span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">With that in mind, H2O engineers designed several mechanisms to help data scientists lead the way with Driverless AI instead of waiting or looking elsewhere. The idea is to let users extend functionality, with little or sometimes more involved effort, by plugging into the Driverless AI workflow and model pipeline. These mechanisms are:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">experiment configuration profile</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">transformer recipes (custom feature engineering)</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">model recipes (custom algorithms)</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">scorer recipes (custom loss functions)</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">data recipes (data load, prep and augmentation; starting with 1.8.1)</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Client APIs for both Python and R</span></span></li>
</ul>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">This post explains what they mean and how they work, and finishes with a more elaborate example of using the R client to enhance model analysis with visualizations.</span></span><br />
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Experiment Configuration</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">All configuration options available in Driverless AI can be found in the <span style="font-family: "arial" , "helvetica" , sans-serif;">config.toml</span> file (see <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/config_toml.html?highlight=config%20toml#sample-config-toml-file" target="_blank">here</a>). Any experiment (<i>experiment</i> is the Driverless AI term for the complete AutoML workflow that produces a finished model) can selectively override applicable options in Expert Settings through the <b><a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/expert-settings.html?highlight=expert%20settings%20config%20toml#add-to-config-toml-via-toml-string" target="_blank">Add to config.toml via toml String</a></b> entry, which limits the override to that experiment only.</span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For example, while Driverless AI completely automates tuning and selection of the built-in algorithms (GLM, LightGBM, XGBoost, TensorFlow, RuleFit, FTRL), it cannot foresee every use case or control and tune every parameter. The following configuration settings therefore let the user customize parameters for each algorithm:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">LightGBM: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_lightgbm</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_lightgbm</span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">XGBoost GBM: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_xgboost</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_xgboost</span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">XGBoost Dart: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_dart</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_dart</span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">TensorFlow: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tensorflow</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_tensorflow</span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">GLM: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_gblinear</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_gblinear</span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">RuleFit: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_rulefit</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_rulefit</span></span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">FTRL: <span style="font-family: "arial" , "helvetica" , sans-serif;">params_ftrl</span> and <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_ftrl</span></span></span></li>
</ul>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Thus, to adjust the architecture of TensorFlow models trained in your experiment, use </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">params_tensorflow</span></span></span>:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); -webkit-text-stroke-width: 0px; background: rgba(var(--sk_foreground_min,29,28,29),0.04); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace !important; font-size: 12px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: none; font-weight: 400; letter-spacing: normal; line-height: 1.50001; margin: 4px 0px; orphans: 2; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: normal; word-spacing: 0px;">params_tensorflow = "{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30, 'layers': (100, 100), 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3, 'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}"</pre>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> or to override LightGBM parameters </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;">params_lightgbm</span></span></span></span></span>: </span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: small;"> </span></span></span><br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); -webkit-text-stroke-width: 0px; background: rgba(var(--sk_foreground_min,29,28,29),0.04); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace !important; font-size: 12px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: none; font-weight: 400; letter-spacing: normal; line-height: 1.50001; margin: 4px 0px; orphans: 2; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: normal; word-spacing: 0px;"><span style="color: black;"><span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;">params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64, 'random_state': 1234}</span></span>"</pre>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">or use <span style="font-family: "arial" , "helvetica" , sans-serif;">params_tune_xxxx</span> to provide a grid that limits or extends the search of the </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">hyperparameter space per algorithm, e.g. for XGBoost GBM</span></span>:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); -webkit-text-stroke-width: 0px; background: rgba(var(--sk_foreground_min,29,28,29),0.04); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; font-family: Monaco, Menlo, Consolas, "Courier New", monospace !important; font-size: 12px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: none; font-weight: 400; letter-spacing: normal; line-height: 1.50001; margin: 4px 0px; overflow-wrap: break-word; padding: 8px; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-break: normal; word-spacing: 0px;"><span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;">params_tune_xgboost = "{'max_leaves': [8, 16, 32, 64]}"</span></pre>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To set multiple parameters via Expert Settings, wrap the whole configuration string in double double quotes ("") and separate the parameters with a new line (\n):</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); -webkit-text-stroke-width: 0px; background: rgba(var(--sk_foreground_min,29,28,29),0.04); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace !important; font-size: 12px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: none; font-weight: 400; letter-spacing: normal; line-height: 1.50001; margin: 4px 0px; orphans: 2; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: normal; word-spacing: 0px;">""params_tensorflow = "{'lr': 0.01, 'epochs': 30, 'activation': 'selu'}" \n <span style="color: black;"><span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;">params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64}</span></span>" \n <span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;">params_tune_xgboost = "{'max_leaves': [8, 16, 32, 64]}"""</span></pre>
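<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Assembling this string by hand is error-prone because of the nested quotes. As a minimal sketch, the same string can be generated in Python from ordinary dictionaries. The helper name <span style="font-family: "arial" , "helvetica" , sans-serif;">build_expert_settings</span> is hypothetical and not part of Driverless AI; it only assumes the quoting and the literal \n separator shown above:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre">
```python
def build_expert_settings(overrides):
    """Join config overrides into one Expert Settings string.

    Hypothetical helper, not part of Driverless AI: it only reproduces
    the quoting convention shown in this post.
    """
    # Each value is a dict rendered as a Python literal inside double quotes.
    parts = ['{} = "{}"'.format(name, params) for name, params in overrides.items()]
    # Entries are separated by a literal backslash-n, and the whole string
    # is wrapped in double double quotes.
    return '""' + " \\n ".join(parts) + '""'


overrides = {
    "params_tensorflow": {"lr": 0.01, "epochs": 30, "activation": "selu"},
    "params_tune_xgboost": {"max_leaves": [8, 16, 32, 64]},
}
print(build_expert_settings(overrides))
```
</pre>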
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To confirm that the settings took effect, view the experiment's log file (to access logs while an experiment is running see <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/logging.html?highlight=download%20log#while-an-experiment-is-running" target="_blank">here</a>, or for a completed experiment <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/logging.html?highlight=download%20log#after-an-experiment-has-finished" target="_blank">here</a>) and find the Config Settings section near the top of the logs. Overridden settings should appear with an asterisk and their assigned values:</span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: small;"> </span></span></span><br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); -webkit-text-stroke-width: 0px; background: rgba(var(--sk_foreground_min,29,28,29),0.04); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace !important; font-size: 12px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: none; font-weight: 400; letter-spacing: normal; line-height: 1.50001; margin: 4px 0px; orphans: 2; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: normal; word-spacing: 0px;">params_tensorflow *: {'lr': 0.01, 'epochs': 30, 'activation': 'selu'}
<span style="color: black;"><span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;">params_lightgbm *: {'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64}</span></span>
<span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;">params_tune_xgboost *: {'max_leaves': [8, 16, 32, 64]}</span><span style="background-color: rgba(29 , 28 , 29 , 0.04); display: inline; float: none; font-family: "monaco" , "menlo" , "consolas" , "courier new" , monospace; font-size: 12px; font-style: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: pre-wrap; word-spacing: 0px;"></span></pre>
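<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Checking a long log by eye is tedious. As a minimal sketch, assuming log lines follow the "name *: value" layout of the excerpt above, overridden settings can be collected with a few lines of Python. The function name and the non-overridden sample line are made up for illustration:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre">
```python
import ast


def overridden_settings(log_lines):
    """Collect settings flagged with an asterisk in a Config Settings listing.

    Assumes the 'name *: value' layout shown in the log excerpt of this post.
    """
    found = {}
    for line in log_lines:
        name, marker, value = line.partition(" *: ")
        if not marker:
            continue  # no asterisk marker, so the line is skipped
        try:
            # Values in the excerpt are Python literals (dicts, lists, numbers).
            found[name.strip()] = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            found[name.strip()] = value.strip()  # keep unparsable values as text
    return found


sample = [
    "params_tensorflow *: {'lr': 0.01, 'epochs': 30, 'activation': 'selu'}",
    "params_tune_xgboost *: {'max_leaves': [8, 16, 32, 64]}",
    "some_other_setting: 1",  # made-up line without an asterisk
]
print(overridden_settings(sample))
```
</pre>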
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Transformer Recipes</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Starting with version 1.7.0 (July 2019), Driverless AI supports the Bring Your Own Recipe (BYOR) framework, which seamlessly integrates user extensions into its workflow. </span></span>Feature engineering and selection make up a significant part of the automated machine learning (AutoML) workflow, which uses a Genetic Algorithm (GA) and a set of built-in feature transformers and interactions to maximize model performance. The following high-level, simplified view of the Driverless AI AutoML workflow illustrates how pieces like GA, BYOR, and model tuning fit together:</span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF-ZwFWuLhPilF0BYD5KGDEEnRqliO21cuoPhDhTPoLEAg9WtBJPyPwQWJ1ezRTDx4GQ80xwfnmhMy7k3pHnpaCOZQALY_bfou8ZmooXoEtqz71Fe8BiplHVuaNd_AtKSfKNOiv97Is0sC/s1600/DAI-workflow.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="868" data-original-width="1600" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF-ZwFWuLhPilF0BYD5KGDEEnRqliO21cuoPhDhTPoLEAg9WtBJPyPwQWJ1ezRTDx4GQ80xwfnmhMy7k3pHnpaCOZQALY_bfou8ZmooXoEtqz71Fe8BiplHVuaNd_AtKSfKNOiv97Is0sC/s640/DAI-workflow.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1. Driverless AI GA and BYOR workflow</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Still, the variety of data and ever more complex use cases sometimes demand more specialized feature transformations and interactions. BYOR transformers </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">(or transformer recipes) extend core functionality to include any transformations and interactions written in Python according to the BYOR specification. Implemented in Python with access to any Python package, transformer recipes integrate into the GA workflow to compete with built-in transformations and interactions.</span></span></span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Such fair competition inside Driverless AI is good for both models and users: models improve with better features, and users </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">benefit from exchanging ideas and solutions in the form of recipes. With BYOR, Driverless AI realizes the democratization of AI that H2O.ai stands for. To get started with custom transformers, look for recipes in the public H2O BYOR repo, </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">in its transformer section: <a href="https://github.com/h2oai/driverlessai-recipes/tree/master/transformers">h2oai/driverlessai-recipes/transformers</a></span></span>. For help and examples on creating your first recipe, see <a href="https://www.h2o.ai/blog/how-to-write-a-transformer-recipe-for-driverlessai/" target="_blank">How to Write a Transformer Recipe</a>. </span></span><br />
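<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The idea behind a transformer recipe can be illustrated without Driverless AI installed. The sketch below is a stand-in built on assumptions: the class name and the plain-list inputs are made up for illustration, and it does not subclass the BYOR base class that real recipes in the repository build on. It only shows the central pattern of learning statistics on training data in <span style="font-family: "arial" , "helvetica" , sans-serif;">fit_transform</span> and reusing them on new data in <span style="font-family: "arial" , "helvetica" , sans-serif;">transform</span>:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre">
```python
class FrequencyEncoder:
    """Stand-in for a transformer recipe (hypothetical, standard library only)."""

    def fit_transform(self, values):
        # Learn how often each category occurs in the training data.
        self.counts = {}
        for value in values:
            self.counts[value] = self.counts.get(value, 0) + 1
        # Guard against division by zero when fitted on empty data.
        self.total = max(len(values), 1)
        return self.transform(values)

    def transform(self, values):
        # Reuse the learned frequencies; unseen categories map to 0.0.
        return [self.counts.get(value, 0) / self.total for value in values]


encoder = FrequencyEncoder()
train = encoder.fit_transform(["S", "C", "S", "Q"])
new = encoder.transform(["C", "X"])
print(train, new)
```
</pre>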
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Model Recipes</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">XGBoost and LightGBM consistently deliver top models and carry most transactional (i.i.d. data) and time series use cases in </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Driverless AI. Another workhorse algorithm </span></span>that delivers top models for NLP and multi-class use cases is TensorFlow. Still more algorithms - Random Forest, GLM, and FTRL - compete for the best model in Driverless AI (see Figure 1). This competition is not closed: the BYOR framework lets any algorithm written in Python to <a href="https://github.com/h2oai/driverlessai-recipes/blob/rel-1.8.1/models/model_template.py">the interface spec</a> compete for the top positions on the leaderboard. Model recipes are classification or regression algorithms plugged into the Driverless AI workflow, which in turn tunes them and combines them with powerful feature engineering and selection enabled by GA. Depending on the experiment's accuracy setting, Driverless AI either picks the best model or builds an ensemble from the top models on the leaderboard. For examples of existing model recipes, refer to <a href="https://github.com/h2oai/driverlessai-recipes/tree/master/models" target="_blank">h2oai/driverlessai-recipes/models</a>.</span></span><br />
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Scorer Recipes</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Data scientists often swear by their favorite scorer, so Driverless AI includes a large set of <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/scorers.html" target="_blank">built-in scorers</a> for both classification and regression. But we don't pretend to have all the answers and, again, the BYOR framework allows extending the Driverless AI workflow with any scoring (loss) function. Whether it comes from the latest research papers or is made to specific business requirements, all it takes is implementing it per the <a href="https://github.com/h2oai/driverlessai-recipes/blob/master/scorers/scorer_template.py">BYOR scorer interface spec</a>. A representative and useful collection of scorers can be found in the <a href="https://github.com/h2oai/driverlessai-recipes/tree/master/scorers" target="_blank">h2oai/driverlessai-recipes/scorers</a> repository, while a tutorial on using custom scorers is found in the Driverless AI <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/custom-recipes-scorer.html" target="_blank">docs</a>. </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Remember that Driverless AI uses custom scorers in the GA workflow to select the best features and models, but not inside the algorithms themselves, where doing so is likely not desirable. </span></span></span></span><br />
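<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">A scorer made to a business requirement can be very small. The sketch below is a plain Python function, not the BYOR scorer interface itself: the function name, the cost values, and the threshold are all made-up assumptions, and a real recipe would wrap the same logic according to the scorer template linked above. It computes an average misclassification cost in which a false negative is five times as expensive as a false positive:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre">
```python
def expected_cost(actual, predicted, threshold=0.5, fp_cost=1.0, fn_cost=5.0):
    """Average misclassification cost; lower is better (hypothetical metric)."""
    total = 0.0
    for label, probability in zip(actual, predicted):
        # Turn the predicted probability into a hard 0/1 decision.
        decision = 1 if probability >= threshold else 0
        if decision == 1 and label == 0:
            total += fp_cost  # false positive
        elif decision == 0 and label == 1:
            total += fn_cost  # false negative
    # Guard against division by zero on empty input.
    return total / max(len(actual), 1)


actual = [1, 0, 1, 0]
predicted = [0.9, 0.6, 0.2, 0.1]
print(expected_cost(actual, predicted))
```
</pre>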
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Data Recipes </span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Starting with version 1.8.1 (December 2019), a new BYOR feature - the data recipe - was added to Driverless AI. The concept is simple: bring your Python code into Driverless AI to create new datasets or manipulate existing ones to enhance data and elevate models. </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Data recipes use data APIs, datatable, pandas, numpy, and other third-party Python libraries</span></span> and belong to one of two types:</span></span><br />
<ul>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">a producing data recipe creates one or more datasets by prototyping connectors, bringing data in, and processing it. Such recipes are similar to data connectors in that they import and process data from external sources (see <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/datasets.html#supported-file-types">here</a>);</span></span></li>
<li><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">a modifying data recipe creates one or more datasets by transforming a copy of an existing Driverless AI dataset (see <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/datasets-describing.html#modify-by-recipe">here</a>). A variety of data preprocessing (data prep) use cases fall into this category, including data munging, data quality, labeling, and unsupervised algorithms such as clustering, latent topic analysis, anomaly detection, dimensionality reduction, etc. </span></span></li>
</ul>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">One important difference between data recipes and the other BYOR kinds (transformer, model, and scorer) is their relation to model scoring pipelines. While the latter integrate into the Python scoring pipeline, and sometimes into MOJO, so that they are deployed with models, the former manipulate data before the modeling workflow takes place and do not take part in scoring. </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For the recipe specification see <a href="https://github.com/h2oai/driverlessai-recipes/blob/master/data/data_template.py">here</a>, and for various examples refer to the <a href="https://github.com/h2oai/driverlessai-recipes/tree/master/data" target="_blank">h2oai/driverlessai-recipes/data</a> repository.</span></span></span></span><br />
<span style="font-size: small;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span>
<br />
<h3>
<span style="font-size: small;">
</span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Python Client </span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">All Driverless AI features and actions found in the web interface are also available via the Python Client API. See the <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/python_install_client.html">docs</a> for instructions on how to install the Python package, with more examples <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/python_client.html#the-python-client" target="_blank">here</a>. For Driverless AI users who are proficient in Python, scripting repeatable and reusable tasks with the Python Client is the next logical step in adopting the Driverless AI automated workflow. Examples of such tasks are re-fitting on the latest data, deploying scoring pipelines, executing business-driven workflows that combine data prep and Driverless AI modeling, computing business reports and KPIs using models, and implementing the Reject Inference method for credit approval.</span></span><br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: small;"> </span></span></span>
<br />
<h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">R Client</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The Driverless AI R Client parallels the functionality of the Python Client and emphasizes consistency with R language conventions, which appeals to data scientists practicing R. With access to the unparalleled visualization libraries in R, users can extend model analysis beyond the already powerful tools and features found in the Driverless AI user interface and <a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/experiment-summary.html?highlight=autodoc#experiment-autoreport">Autoreport</a>. Let's conclude with an example that uses the <span style="font-family: "arial" , "helvetica" , sans-serif;"><b>ggplot2</b></span> </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">package, </span></span>based on <a href="https://smile.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/" target="_blank">The Grammar of Graphics</a> by Leland Wilkinson (Chief Scientist at H2O.ai), to create</span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> a Response Distribution Chart (RDC) for analyzing binary classification models. An RDC shows the distribution of responses (probabilities) generated by the model and helps assess the quality of the model by how well it separates the two classes (see <a href="https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com" target="_blank">150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com, section 6</a>).</span></span></span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The process below shows the full sequence of using the R Client: how to connect, import and split data, run an experiment that creates a model, score data, and finally plot an RDC.</span></span><br />
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To start, the Driverless AI R client package needs to be downloaded from the server and installed:</span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYfGvroUGMz0z3CAX0ivBJN21tBWSWtSFLM218rH4JzpH70k5-cShc5OKY8T40bz40iW92chh2S_ofCe-kFGFOoblyHsj7WKvWpbhNmTbGyXAWuugxf3slWcXkuhx7YW7tZcbSPWzRooiv/s1600/dai-r-package-download.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="606" data-original-width="1600" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYfGvroUGMz0z3CAX0ivBJN21tBWSWtSFLM218rH4JzpH70k5-cShc5OKY8T40bz40iW92chh2S_ofCe-kFGFOoblyHsj7WKvWpbhNmTbGyXAWuugxf3slWcXkuhx7YW7tZcbSPWzRooiv/s640/dai-r-package-download.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2. Downloading Driverless AI Client R package</td></tr>
</tbody></table>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">After the download completes, RStudio lets you find and install the package from its menu <span style="font-family: "arial" , "helvetica" , sans-serif;">Tools -> Install Packages...</span> </span></span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2ukMhF44pEBtMPL0pnF9uk7ZAROmb9ZuqRLWuOljKMoozXKpbz5f7hR_xJeW0ww9P__BdxXD7kYSfangi416c64OO9MzKFv85TQFtQC-lttFshQmnJEf7haqsxtYwDmJDL1d_xCV7a9_8/s1600/dai-r-install-rstudio.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="704" data-original-width="1024" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2ukMhF44pEBtMPL0pnF9uk7ZAROmb9ZuqRLWuOljKMoozXKpbz5f7hR_xJeW0ww9P__BdxXD7kYSfangi416c64OO9MzKFv85TQFtQC-lttFshQmnJEf7haqsxtYwDmJDL1d_xCV7a9_8/s400/dai-r-install-rstudio.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3. Installing dai package in RStudio</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"></span></span></div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">With the <b><span style="font-family: "courier new" , "courier" , monospace;">dai</span></b> package installed, every script begins by connecting to a running Driverless AI instance (change its name, user id, and password):</span></span><br />
<br />
<div>
<code data-gist-file="01-connect-to-DAI.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">For our example we will use the well-known Titanic dataset, which I slightly enhanced and saved both on my local machine and in an S3 bucket <a href="https://s3.console.aws.amazon.com/s3/buckets/h2o-public-test-data/smalldata/titanic/?region=us-east-1&tab=overview" target="_blank">here</a>. The following commands upload the data file from the local machine or from S3 into Driverless AI (pick one):</span></span><br />
<br />
<div>
<code data-gist-file="02-upload-titanic-data.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">While the H2O pipeline automates the machine learning workflow, including creating and using validation splits, it is best practice to provide a separate test set so that Driverless AI can produce an out-of-sample score estimate for its final model. Splitting data on an appropriate target, fold, or time column is built-in functionality:</span></span><br />
<br />
<div>
<code data-gist-file="03-split-titanic-data.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Now we can start the automated machine learning workflow to predict survival chances for Titanic passengers, which results in a complete and fully featured classification model:</span></span><br />
<br />
<div>
<code data-gist-file="04-create-titanic-model-555.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">If you log in to Driverless AI, you can observe the newly created model in the browser UI:</span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDLcejdoedDQdF8xpNW3HODTjSeUkHSG-8mM_9EE-YQzIs337G_yPuepTHi_2Lsgxg_Tcw15KTzmfWUq9L8Bsh2iKbr_sR3ktSuZCYbooO_MxcOg0E4LcXQhQ2Dni_qT3AxP9fOb7327qI/s1600/Screen+Shot+2019-12-09+at+19.52.40.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="917" data-original-width="1600" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDLcejdoedDQdF8xpNW3HODTjSeUkHSG-8mM_9EE-YQzIs337G_yPuepTHi_2Lsgxg_Tcw15KTzmfWUq9L8Bsh2iKbr_sR3ktSuZCYbooO_MxcOg0E4LcXQhQ2Dni_qT3AxP9fOb7327qI/s640/Screen+Shot+2019-12-09+at+19.52.40.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4. Driverless AI in action</td></tr>
</tbody></table>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">With a Driverless AI classifier in hand, there are many ways to obtain predictions. One way is to download the file with computed test predictions to the client and then read it into R:</span></span><br />
<br />
<div>
<code data-gist-file="05a-predict-titanic-test.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Because we want to use features from the model in visualizations, we can instead score a dataset and attach hand-picked features to the results (scoring all Titanic data in this case):</span></span><br />
<br />
<div>
<code data-gist-file="05b-predict-titanic-test.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">At this point, with predictions saved to an R data frame, the full power of R graphics is available to produce additional visualizations of the model. As promised, we show how to implement the method of Response Distribution Analysis:</span></span><br />
<blockquote class="tr_bq">
<span style="font-size: small;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: sans-serif;">The method is based on the Response Distribution Chart (RDC), which is simply a histogram of the output of the model. The simple observation that the RDC of an ideal model should have one peak at 0 and one peak at 1 (with heights given by the class proportion). Source: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com</span></span></span></blockquote>
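<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">Before turning to the R code, the definition quoted above can be made concrete in a few lines. The following is a minimal sketch in plain Python: the function name and the bin count are assumptions, and the probabilities are made-up values rather than output of the Titanic model. It counts predicted probabilities per equal-width bin, which is all an RDC displays:</span></span><br />
<br />
<pre class="c-mrkdwn__pre" data-stringify-type="pre">
```python
def response_distribution(probabilities, bins=10):
    """Count predicted probabilities per equal-width bin on [0, 1]."""
    counts = [0] * bins
    for p in probabilities:
        # Clamp the index so that p == 1.0 falls into the last bin.
        index = min(int(p * bins), bins - 1)
        counts[index] += 1
    return counts


# Made-up scores: a well-separating model piles up near 0 and near 1.
scores = [0.02, 0.05, 0.08, 0.11, 0.48, 0.91, 0.95, 0.97, 1.0]
for left, count in enumerate(response_distribution(scores, bins=5)):
    # Print a text histogram, one row per bin.
    print("{:.1f}-{:.1f} {}".format(left / 5, (left + 1) / 5, "#" * count))
```
</pre>
<br />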
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">First, we plot the RDC on all data:</span></span><br />
<br />
<div>
<code data-gist-file="06a-rdc-titanic-all.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0JR47zwKwzyV2l6k23tto7KVAzacMzzkQeYDH_4AqXrxKrz2YTdnys_YUZ2yRqxr7BNcP005a9-Jxnr2Bzj-b5Ofn39rWzshKrPsOyX8PoLSUXY_1zkojm4MLOIxnASg3zikVXHFVKGsU/s1600/dai-custom-rdc-all.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="433" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0JR47zwKwzyV2l6k23tto7KVAzacMzzkQeYDH_4AqXrxKrz2YTdnys_YUZ2yRqxr7BNcP005a9-Jxnr2Bzj-b5Ofn39rWzshKrPsOyX8PoLSUXY_1zkojm4MLOIxnASg3zikVXHFVKGsU/s1600/dai-custom-rdc-all.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 5. Cumulative RDC on Titanic model</td></tr>
</tbody></table>
<br />
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">A few more examples of RDCs follow, first with separate distributions for passengers who survived and those who did not:</span></span><br />
<br />
<div>
<code data-gist-file="06b-rdc-titanic-all.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN2LAWJz2I7RnvlMSNvBdGYwbsCuKz4YK72n2DfgDtL-BA3ZZv4i_-mhJXaSFyLqh1T_1yhpCRn2DAyDzavsZ6VKZXx2mxx3DWcch2gvckX7_n5irvrGPGABtaojPkH7raFdjOllP_1Nsy/s1600/dai-custom-rdc-by-survived.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="433" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN2LAWJz2I7RnvlMSNvBdGYwbsCuKz4YK72n2DfgDtL-BA3ZZv4i_-mhJXaSFyLqh1T_1yhpCRn2DAyDzavsZ6VKZXx2mxx3DWcch2gvckX7_n5irvrGPGABtaojPkH7raFdjOllP_1Nsy/s1600/dai-custom-rdc-by-survived.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 6. RDC by actual outcome</td></tr>
</tbody></table>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">The next plot compares RDCs for male and female passengers:</span></span><br />
<br />
<div>
<code data-gist-file="07-rdc-titanic-by-sex.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjASAdfCoYRSASIR8Pca7UPGSnsqOqErKXP9NM7WqChPrJi8In69xFhe4twfL0f_oKIlOGxt0dfYA-j5d4mPB727UjCdr9insXiPP4mVFRYKExhr74LETFPeWvaeeE7Cl66m06EKh3uRbjJ/s1600/dai-custom-rdc-by-survived-facet.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="433" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjASAdfCoYRSASIR8Pca7UPGSnsqOqErKXP9NM7WqChPrJi8In69xFhe4twfL0f_oKIlOGxt0dfYA-j5d4mPB727UjCdr9insXiPP4mVFRYKExhr74LETFPeWvaeeE7Cl66m06EKh3uRbjJ/s1600/dai-custom-rdc-by-survived-facet.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 7. RDC by passenger sex</td></tr>
</tbody></table>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Finally, RDCs by port of embarkation:</span></span><br />
<br />
<div>
<code data-gist-file="08-rdc-titanic-by-embarked.R" data-gist-hide-footer="true" data-gist-id="33a34a895d5e8c8305db20c8221b1ca6"></code>
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><img border="0" data-original-height="433" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqEWjs7TEFY-4L0juoVYUbuVczRdLsyrI2MqCEjyW57afldNGeQd8gxrxldBalC06UEHeKAKcLGFK9jz8xbBwFvh5nJFDmFWjl4QyLRPChDpAqmOaw5qkdZKXdtgaQOhZI8VUcnAEfPkdN/s1600/dai-custom-rdc-by-embarked.png" style="margin-left: auto; margin-right: auto;" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 8. RDC by port of embarkation</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Once again, H2O engineers continually innovate and introduce new techniques, so chances are RDC may become another feature of the Driverless AI model diagnostics module. Even so, this example still illustrates how to enhance models with practically any type of analysis and visualization using the R Client.</span></span></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<h3 class="separator" style="clear: both; text-align: left;">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> Resources and References</span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> </span></span></h3>
<ol>
<li><a href="https://www.h2o.ai/products/h2o-driverless-ai/"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H2O.ai Driverless AI home page</span></span></a></li>
<li><a href="http://docs.h2o.ai/"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Driverless AI docs online</span></span></a></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">GitHub project <a href="https://github.com/h2oai/driverlessai-recipes">Recipes for H2O Driverless AI</a> </span></span></li>
<li><a href="https://www.h2o.ai/blog/how-to-write-a-transformer-recipe-for-driverlessai/"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">How to Write a Transformer Recipe for Driverless AI</span></span></a></li>
<li><a href="http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/scorers.html"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Driverless AI Scorers</span></span></a></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="https://smile.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/">The Grammar of Graphics by Leland Wilkinson</a> </span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com" target="_blank">150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com</a></span></span></span></span></span></span></li>
<li><a href="https://gist.github.com/grigory93/33a34a895d5e8c8305db20c8221b1ca6"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">GitHub Gist with source code for RDC visualization with R Client</span></span></span></span></span></span></a> </li>
</ol>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-33874728166604839622018-12-25T23:23:00.001-06:002020-02-25T15:17:28.488-06:00Finally, You Can Plot H2O Decision Trees in R<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Creating and plotting decision trees (like the one below) for models built in H2O is the main objective of this post:</span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCaVfFHbYCPu5xY-AfPQdwthCdM1KFVnpsjqtHgsGqOu0bWhyphenhyphen76UZP_JF5gOJqyYT5IyuxknzJh3Exy-6vuavN25bFW_4D1kzj8DIleNUnQkpVujcsHMaZMqNzI97MagkLkJmRxbUtqNGs/s1600/Titanic-Decision-tree-h2o.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="414" data-original-width="501" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCaVfFHbYCPu5xY-AfPQdwthCdM1KFVnpsjqtHgsGqOu0bWhyphenhyphen76UZP_JF5gOJqyYT5IyuxknzJh3Exy-6vuavN25bFW_4D1kzj8DIleNUnQkpVujcsHMaZMqNzI97MagkLkJmRxbUtqNGs/s1600/Titanic-Decision-tree-h2o.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;"><span style="font-family: "georgia" , "times new roman" , serif;">Figure 1. Decision Tree Visualization </span></span></td></tr>
</tbody></table>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Decision Trees with H<sub>2</sub>O</span></span></span></span></span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">With release <a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html">3.22.0.1</a>, H<sub>2</sub>O-3 (a.k.a. open source H<sub>2</sub>O or simply H<sub>2</sub>O) added support for one more algorithm to its family of tree-based algorithms (which already included <a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html">DRF</a>, <a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html">GBM</a>, and <a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html">XGBoost</a>): <a href="https://www.h2o.ai/blog/anomaly-detection-with-isolation-forests-using-h2o/">Isolation Forest (random forest for unsupervised anomaly detection)</a>. Until then there was no simple way to visualize H<sub>2</sub>O trees other than the clunky (albeit reliable) method of <a href="https://dzone.com/articles/visualizing-h2o-gbm-and-random-forest-mojo-models">creating a MOJO object and running a combination of Java and dot commands</a>.</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">That changed in 3.22.0.1 too, with the introduction of a unified Tree API that works with any of the tree-based algorithms above. Data scientists can now use powerful visualization tools in R (or Python) without producing intermediate artifacts like MOJO and running external utilities. Please read <a href="https://dzone.com/articles/inspecting-decision-trees-in-h2o">this article by Pavel Pscheidl</a>, who did a superb job of explaining the H<sub>2</sub>O Tree API and S4 classes in R, before coming back to take it a step further and visualize trees.</span></span><br />
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The Workflow: from Data to Decision Tree </span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Whether you are still here or came back after reading Pavel's excellent post, let's set the goal straight: create a single decision tree model in H<sub>2</sub>O and visualize its tree graph. With H<sub>2</sub>O there is always a choice between using Python or R; the choice of R here will become clear when we discuss its graphical and analytical capabilities later.</span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees_(CART)">CART models</a> operate on labeled data (classification and regression) and offer arguably unmatched model interpretability by means of analyzing a tree graph. In data science there is never a single way to solve a given problem, so let's define an end-to-end logical workflow from "raw" data to a visualized decision tree:</span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQXrxe-fy6DKW0j-h962fZICR-j6d_vgyg3xi7SZrQgB90GdUcLH_lDg3JSRJVbkf9yDzPml1i2BpsiFQbR5O8u_rlmsQP2071C7AwOitQMrxFmLRRVbhOWNGCbZdHvVLvQYydURJ_Brc-/s1600/h2o-tree-visual-flow-abstract.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="168" data-original-width="816" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQXrxe-fy6DKW0j-h962fZICR-j6d_vgyg3xi7SZrQgB90GdUcLH_lDg3JSRJVbkf9yDzPml1i2BpsiFQbR5O8u_rlmsQP2071C7AwOitQMrxFmLRRVbhOWNGCbZdHvVLvQYydURJ_Brc-/s1600/h2o-tree-visual-flow-abstract.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;"><span style="font-family: "georgia" , "times new roman" , serif;">Figure 2. Workflow of tasks in this post</span></span></td></tr>
</tbody></table>
<br />
<div style="text-align: left;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">One may argue that the split of steps between H<sub>2</sub>O and R could be different, but let's follow the outlined plan for this post. The next diagram adds implementation details:</span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">R package <span style="font-family: "verdana" , sans-serif;">data.table</span> for data munging</span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O</span></span></span></span> grid for hyper-parameter search</span></span></span></span></span></span></span></span></span></span></span></span> </span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O GBM for modeling a single decision tree</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O</span></span></span></span> Tree API for tree model representation</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">R package <span style="font-family: "verdana" , sans-serif;">data.tree</span> for visualization </span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> </span></span></span></span></li>
</ul>
<div style="text-align: left;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsztScKoBECXiJ6qAlOoquaZhCxVYpRB-W_ibG603T6z987xkDt3MuY9POMrpqtfU6HA8dnfFQaeu3Lhkry5Wad1opsRVmgoTq3ZTj-B8HewGvlazY0kgy9I8eX8cMauDpnHsKPDxlp8x/s1600/h2o-tree-visual-flow.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="166" data-original-width="814" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsztScKoBECXiJ6qAlOoquaZhCxVYpRB-W_ibG603T6z987xkDt3MuY9POMrpqtfU6HA8dnfFQaeu3Lhkry5Wad1opsRVmgoTq3ZTj-B8HewGvlazY0kgy9I8eX8cMauDpnHsKPDxlp8x/s1600/h2o-tree-visual-flow.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;"><span style="font-family: "georgia" , "times new roman" , serif;">Figure 3. Workflow of tasks in this post with implementation details</span></span></td></tr>
</tbody></table>
<br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Discussion of this workflow continues for the rest of this post.</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><br />
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Titanic Dataset</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The famous <a href="https://www.kaggle.com/c/titanic">Titanic dataset</a> contains information about the fate of passengers of the <a href="https://en.wikipedia.org/wiki/RMS_Titanic">RMS Titanic</a> that sank after colliding with an iceberg. It regularly serves as toy data for blog exercises like this.</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O</span></span></span></span> public S3 bucket keeps the Titanic dataset readily available, and using package </span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: 
large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "verdana" , sans-serif;">data.table</span></span></span></span></span> makes it fast one-liner </span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: 
large;">to load into R</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span>:</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span></span></span>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"></span></span></span></span>
<br />
<div>
<code data-gist-file="load-titanic-data-from-S3.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
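<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">For readers who cannot see the embedded gist, a minimal sketch of such a one-liner might look like the following. The bucket path is an assumption (this post does not spell it out), and <i>load_titanic</i> is a hypothetical helper name used only for illustration:</span></span><br />

```r
library(data.table)

# ASSUMPTION: location of the Titanic CSV in the H2O public test-data bucket;
# replace with the path you actually use.
titanic_url <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"

# Hypothetical helper: fread() accepts a URL, a local file path, or literal CSV text.
load_titanic <- function(input = titanic_url) {
  fread(input)
}

# Usage (needs network access, so it is left commented out here):
# titanic <- load_titanic()
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Keeping the source as an argument makes it easy to point the same call at a local copy of the file.</span></span><br />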
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Data Engineering</span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Passenger features from the Titanic dataset are discussed at length online, e.g. see <a href="https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8">Predicting the Survival of Titanic Passengers</a> and <a href="https://www.kaggle.com/thilakshasilva/predicting-titanic-survival-using-five-algorithms">Predicting Titanic Survival using Five Algorithms</a>. To summarize, the following features were selected and engineered for the decision tree model:</span></span></span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>survived</i> indicates if passenger survived the wreck</span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>boat</i> and <i>body</i> leak survival outcome and were dropped completely before modeling</span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>name</i> and <i>cabin</i> are too noisy as they are and are used only to derive new features</span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>title</i> is parsed from <i>name</i></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>cabin_type</i> is parsed from <i>cabin</i></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>family_size</i> and <i>family_type</i> are derived from a combination of the count features <i>sibsp</i> (siblings+spouse) and <i>parch</i> (parents+children)</span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><i>ticket</i> and <i>home.dest</i> are dropped to preserve simplicity of the model</span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">missing values in <i>age</i> and <i>fare</i> are imputed using target encoding (mean) over grouping by <i>survived</i>, <i>sex</i>, and <i>embarked</i> columns. </span></span></span></span></span></span></span></span></li>
</ul>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The data load and data munging steps above are implemented in R using <i>data.table</i>:</span></span></span></span></span></span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span></span></span></span></span></span></span>
<br />
<div>
<code data-gist-file="decision-tree-visual-with-H2O.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
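<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The feature steps listed above can be sketched on a handful of made-up rows as follows. This is an illustration under stated assumptions, not the code from the gist: the toy values, the regular expression for <i>title</i>, the "U" placeholder for an unknown cabin, and the <i>family_type</i> cut-offs are all hypothetical choices.</span></span><br />

```r
library(data.table)

# Toy rows with Titanic-like columns (invented values, not the real data)
dt <- data.table(
  survived = c(1L, 0L, 1L, 0L, 1L, 0L),
  sex      = c("female", "male", "female", "male", "female", "male"),
  embarked = c("S", "S", "S", "S", "C", "S"),
  name     = c("Doe, Mrs. Jane", "Roe, Mr. John", "Poe, Miss. Ann",
               "Moe, Mr. Sam", "Loe, Mrs. Eve", "Koe, Mr. Tom"),
  cabin    = c("C85", "", "B28", "", "C22", ""),
  sibsp    = c(1L, 0L, 0L, 2L, 0L, 0L),
  parch    = c(0L, 0L, 0L, 1L, 0L, 0L),
  age      = c(38, 22, NA, 35, 30, NA),
  fare     = c(71.3, 7.25, 80, NA, 50, 8.05)
)

# title: the text between the comma and the first period in name
dt[, title := trimws(sub("^[^,]*, *([^.]*)[.].*$", "\\1", name))]

# cabin_type: first letter of cabin, "U" (assumed placeholder) when empty
dt[, cabin_type := ifelse(is.na(cabin) | cabin == "", "U", substr(cabin, 1, 1))]

# family_size: the passenger plus relatives; family_type: assumed coarse buckets
dt[, family_size := sibsp + parch + 1L]
dt[, family_type := ifelse(family_size == 1L, "single",
                    ifelse(family_size <= 3L, "small", "large"))]

# Impute missing age and fare with the mean of the (survived, sex, embarked) group
dt[, age  := ifelse(is.na(age),  mean(age,  na.rm = TRUE), age),
   by = .(survived, sex, embarked)]
dt[, fare := ifelse(is.na(fare), mean(fare, na.rm = TRUE), fare),
   by = .(survived, sex, embarked)]

# Drop columns that leak the outcome or are only used for derivation, when present
drop_cols <- intersect(c("boat", "body", "name", "cabin", "ticket", "home.dest"),
                       names(dt))
dt[, (drop_cols) := NULL]
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Note that a group with no observed values would leave NaN after this imputation, so real data may need a fallback such as the overall mean.</span></span><br />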
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Starting with H<sub>2</sub>O</span></span></span></span></span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Creating models with </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O</span></span></span></span> requires running a server process (remote or local) and a client (package <span style="font-family: "verdana" , sans-serif;">h2o</span> in R available from CRAN) where the latter connects and sends commands to the former. 
The Tree API was introduced with release </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">3.22.0.1</span></span></span></span></span></span></span></span> (10/26/2018), but due to CRAN policies the <span style="font-family: "verdana" , sans-serif;">h2o</span> package there usually lags several versions behind (at the time of this writing, CRAN hosted version 3.20.0.8). There are two ways to work around this:</span></span></span></span></span></span></span></span><br />
<ol>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Install and run the package available from CRAN and use <span style="font-family: "verdana" , sans-serif;">strict_version_check=FALSE</span> inside <span style="font-family: "verdana" , sans-serif;">h2o.connect()</span> to communicate with a newer version running on the server</span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Or install the latest version of <span style="font-family: "verdana" , sans-serif;">h2o</span> available from </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O</span></span></span></span></span></span></span></span></span></span></span></span> repository either to connect to remote server or to both connect and run server locally.</span></span></span></span></span></span></span></span></li>
</ol>
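<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">As a rough sketch, the two options could be wrapped like this. The helper names and the default host and port are hypothetical; only <i>h2o.connect()</i>, <i>h2o.init()</i> and the <i>strict_version_check</i> argument come from the <i>h2o</i> package itself:</span></span><br />

```r
# Option 1: older CRAN client talking to a newer server that is already running.
# ip and port are placeholders; adjust them to your server.
connect_to_newer_server <- function(ip = "localhost", port = 54321) {
  h2o::h2o.connect(ip = ip, port = port, strict_version_check = FALSE)
}

# Option 2: with the latest h2o package installed, start (or attach to)
# a local server whose version matches the client.
start_local_h2o <- function() {
  h2o::h2o.init()
}

# Usage (requires the h2o package and, for option 1, a running server):
# conn <- connect_to_newer_server("localhost", 54321)
# start_local_h2o()
```
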
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The Tree API is available only with the second option because it requires access to new classes and functions in the </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "verdana" , sans-serif;">h2o</span></span></span></span></span></span></span></span></span> package (remember, I asked you to read Pavel's blog). 
The code below, from the official </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O</span></span></span></span></span></span></span></span></span></span></span></span> download page, shows how to download and install the latest version of the</span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> package: </span></span></span></span></span></span></span></span><br />
<br />
<div>
<code data-gist-file="install-latest-h2o-package.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Building Decision Tree with H<sub>2</sub>O</span></span></span></span></span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">While </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">H<sub>2</sub>O offers </span></span></span></span></span></span></span></span>no dedicated single decision tree algorithm, <a href="https://0xdata.atlassian.net/browse/PUBDEV-4324">there are two approaches using superseding models</a>:</span></span></span></span></span></span></span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html">Distributed Random Forest (DRF)</a> function </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span 
style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; 
word-spacing: 0px;">h2o.randomForest()</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> with arguments</span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br />ntrees = 1</span></span></span></span></span><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br />mtries = number of features <span style="font-family: "georgia" , "times new roman" , serif;">(would be determined dynamically at runtime)</span></span></span></span></span></span></span><span style="font-size: large;"><span 
style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br />sample_rate = 1<br />min_rows = 1</span></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html">Gradient Boosting Machine (GBM)</a> function </span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span 
style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: 
large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">h2o.gbm()</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> with arguments</span></span></span></span></span></span></span></span><span style="font-family: 
"georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br />ntrees = 1</span></span></span></span></span></span><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br />min_rows = 1</span></span></span></span></span></span></span></span></span></span></span></span><span style="font-size: large;"><span style="font-size: 
large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br />sample_rate = 1<br />col_sample_rate = 1</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> </span></span></span></span></span></span></span></span></li>
</ul>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Choosing the GBM option requires one fewer line of code (there is no need to calculate the number of features to set <span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">mtries</span>), so it was used for this post. Otherwise both ways result in the same decision tree, and the steps below are fully reproducible using <span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">h2o.randomForest()</span> instead of <span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">h2o.gbm()</span>.</span></span><br />
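<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">As a rough sketch, the two single-tree configurations listed above could be wrapped like this. The helper names and their arguments (<code>train</code>, <code>response</code>, <code>features</code>, <code>depth</code>) are hypothetical and not taken from the post's gist; only the H<sub>2</sub>O arguments themselves come from the lists above. It assumes the h2o package is installed and a cluster has been started with <code>h2o.init()</code>.</span></span><br />
<pre><code># Sketch only: helper names and arguments are assumptions for illustration.
# `train` is assumed to be an H2OFrame, `response` a column name and
# `features` a character vector of predictor column names.

# Option 1: GBM restricted to a single tree built on all rows and all columns.
single_tree_gbm = function(train, response, features, depth) {
  h2o::h2o.gbm(x = features, y = response, training_frame = train,
               ntrees = 1, min_rows = 1,
               sample_rate = 1, col_sample_rate = 1,
               max_depth = depth)
}

# Option 2: random forest restricted to a single tree.
single_tree_rf = function(train, response, features, depth) {
  h2o::h2o.randomForest(x = features, y = response, training_frame = train,
                        ntrees = 1,
                        # the extra step GBM avoids: mtries must equal the
                        # number of features so every split sees all of them
                        mtries = length(features),
                        sample_rate = 1, min_rows = 1,
                        max_depth = depth)
}
</code></pre>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The only practical difference is the <code>mtries = length(features)</code> line, which is why the GBM variant is the shorter one.</span></span><br />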
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Decision Tree Depth</span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">When building a single decision tree model, maximum tree depth is the most important parameter to pick. Shallow trees tend to underfit: they fail to capture important relationships in the data and produce similar trees despite varying training data (error due to high bias). Trees grown too deep, on the other hand, overfit by reacting to noise and slight changes in the data (error due to high variance). Tuning the H<sub>2</sub>O model parameter <span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">max_depth</span>, which limits decision tree depth, aims to balance the effects of bias and variance. In R, using H<sub>2</sub>O to split the data and tune the model, and then visualizing the results with <span style="font-family: "verdana" , sans-serif;">ggplot</span> to look for the right value, unfolds like this:</span></span><br />
<ol>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">split Titanic data into training and validation sets</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">define grid search object with parameter </span></span><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">max_depth</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> </span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">run the grid search over GBM models with the grid object to obtain AUC values (model performance) </span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">plot the grid models' AUC values vs. <span style="background-color: white; color: black; display: inline; float: none; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">max_depth</span> values to determine the "inflection point" where AUC growth stops or saturates (see plot below)</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;">record the tree depth value at the <a href="https://en.wikipedia.org/wiki/Inflection_point">inflection point</a> to use in the final model</span></span></span></li>
</ol>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The code below implements these steps:</span></span>
<br />
<div>
<code data-gist-file="grid-search-and-1tree-model.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">and produces a chart that places the inflection point at a maximum tree depth of 5:</span></span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifhxqLOYMLOYm-O0CxujzPRSIQvYrCELKJQsQbOJUCPnPsa5KRe7OAXIHkmlap3fkUS4i2fLpFJA03XbvpB3y6N4smfuaI_m1Ph1YAzYhmjTMIDqyo3t_Qq-EIBlNG_XxytuHd7PCWmIB1/s1600/decision-tree-grid-max-depth.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="555" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifhxqLOYMLOYm-O0CxujzPRSIQvYrCELKJQsQbOJUCPnPsa5KRe7OAXIHkmlap3fkUS4i2fLpFJA03XbvpB3y6N4smfuaI_m1Ph1YAzYhmjTMIDqyo3t_Qq-EIBlNG_XxytuHd7PCWmIB1/s1600/decision-tree-grid-max-depth.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: "georgia" , "times new roman" , serif;">Figure 4. AUC vs. the maximum tree depth hyper-parameter, </span><br />
<span style="font-family: "georgia" , "times new roman" , serif;">extracted from the H2O grid object after running the grid search in H2O. </span><br />
<span style="font-family: "georgia" , "times new roman" , serif;">The marked inflection point indicates where increasing maximum tree depth </span><br />
<span style="font-family: "georgia" , "times new roman" , serif;">no longer improves model performance on the validation set.</span></td></tr>
</tbody></table>
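<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">For orientation, the five steps above might be sketched as follows. This is not the code from the gist: the function names, the grid id <code>depth_grid</code>, the 80/20 split, the seed and the depth range are all assumptions made for illustration, and the last helper is a simple hypothetical heuristic standing in for reading the inflection point off the chart by eye. It assumes the h2o and ggplot2 packages are installed and <code>h2o.init()</code> has been called.</span></span><br />
<pre><code># Sketch only: all names and constants below are assumptions, not the gist's.
# `titanic_hex` is assumed to be an H2OFrame with a factor response column.

# Steps 1-3: split the data, define the max_depth grid, run the grid search
# over single-tree GBMs and collect AUC per max_depth value.
depth_grid_auc = function(titanic_hex, response, features, depths = 1:10) {
  splits = h2o::h2o.splitFrame(titanic_hex, ratios = 0.8, seed = 1234)
  h2o::h2o.grid(algorithm = "gbm", grid_id = "depth_grid",
                x = features, y = response,
                training_frame = splits[[1]], validation_frame = splits[[2]],
                ntrees = 1, min_rows = 1,
                sample_rate = 1, col_sample_rate = 1,
                hyper_params = list(max_depth = depths))
  sorted = h2o::h2o.getGrid("depth_grid", sort_by = "auc", decreasing = TRUE)
  results = as.data.frame(sorted@summary_table)
  # summary table values may arrive as strings, so convert explicitly
  data.frame(max_depth = as.numeric(as.character(results$max_depth)),
             auc = as.numeric(as.character(results$auc)))
}

# Step 4: plot AUC against max_depth.
plot_depth_auc = function(results) {
  ggplot2::ggplot(results, ggplot2::aes(x = max_depth, y = auc)) +
    ggplot2::geom_line() +
    ggplot2::geom_point()
}

# Step 5 (hypothetical heuristic): smallest depth whose AUC is within `tol`
# of the best AUC, i.e. where further depth stops paying off.
first_plateau_depth = function(results, tol = 0.005) {
  min(results$max_depth[results$auc >= max(results$auc) - tol])
}
</code></pre>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The tolerance is a judgment call; inspecting the chart, as done in this post, remains the more reliable way to pick the depth.</span></span><br />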
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Creating Decision Tree</span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">As Figure 4 shows, the optimal decision tree depth is 5. The code below constructs a single decision tree model in H<sub>2</sub>O and then retrieves the tree representation from the GBM model with the Tree API function <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">h2o.getModelTree()</code>, which creates an instance of the S4 class <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">H2OTree</code> that is assigned to the variable <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">titanicH2oTree</code>:</span></span><br />
<br />
<div>
<code data-gist-file="creaate-decision-tree-and-get-tree.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">At this point all the action moves back into R, with its unparalleled access to analytical and visualization tools. So before navigating and plotting a decision tree, the final goal of this post, let's have a brief introduction to networks in R.</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><br /></span></span>
<br />
<h3>
<span style="font-size: small;">
</span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Overview of Network Analysis in R</span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">R offers arguably the richest functionality for analyzing and visualizing network (graph, tree) objects. Before trying to conquer it, spend some time with a couple of comprehensive articles that describe the vast landscape of available tools and approaches: <a href="http://kateto.net/network-visualization">Static and dynamic network visualization with R</a> by Katya Ognyanova and <a href="https://www.jessesadler.com/post/network-analysis-with-r/">Introduction to Network Analysis with R</a> by Jesse Sadler.</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">To summarize, there are two commonly used packages for managing and analyzing networks in R: <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">network</code> (part of the statnet family) and <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">igraph</code> (a family in itself).
Each package implements a namesake class to represent network structures, so there is significant overlap between the two, and they mask each other's functions. The preferred approach is to pick only one of the two: <a href="https://www.jessesadler.com/post/network-analysis-with-r/">it appears</a> that <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">igraph</code> is more common for general-purpose applications, while <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">network</code> is preferred for social network and statistical analysis (my subjective assessment). And while researching these packages, do not forget about the package <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">intergraph</code>, which seamlessly transforms objects between the <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">network</code> and <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">igraph</code> classes.
(This analysis stops short of expanding into the universe of R packages hosted on <a href="http://bioconductor.org/">Bioconductor</a>.)</span></span><br />
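<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">As a minimal sketch of the points above, the snippet below builds a small directed graph with igraph and converts it to a network object and back with intergraph. The edge list is made up for illustration and is not taken from the Titanic model. Calling functions with an explicit namespace prefix is one way (an assumption of this sketch, not a requirement of either package) to sidestep the masking problem without attaching both packages.</span></span><br />

```r
# Toy edge list describing a tiny tree (hypothetical data)
edges <- data.frame(
  from = c("root", "root", "left", "left"),
  to   = c("left", "right", "leaf1", "leaf2"),
  stringsAsFactors = FALSE
)

# Build a directed igraph object from the edge list
g <- igraph::graph_from_data_frame(edges, directed = TRUE)

# Convert igraph -> network and network -> igraph with intergraph
net <- intergraph::asNetwork(g)
g2  <- intergraph::asIgraph(net)

# Inspect each object with functions from its own package
igraph::vcount(g)              # number of vertices in the igraph object
network::network.size(net)     # number of vertices in the network object
network::network.edgecount(net)
igraph::ecount(g2)             # number of edges after the round trip
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Because neither package is attached with <code>library()</code>, no function from one can silently mask a function of the same name from the other.</span></span><br />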
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">When it comes to visualizing networks, the choices quickly proliferate. Both <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">network</code> and <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">igraph</code> offer plotting functions built on the R base graphics system, but it doesn't stop there. The following packages specialize in advanced visualizations for one or both of these classes:</span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">ggraph</code></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">ggnet2</code> (a function available in the <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">GGally</code> package)</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">ggnetwork</code></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">visNetwork</code></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">DiagrammeR </code></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">networkD3 </code></span></span></span></span></span></span></span></span></span></span></span></span></span></li>
</ul>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Finally, there is the package <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">data.tree</code>, designed specifically for creating and analyzing trees in R. It is a perfect fit for representing and visualizing decision trees, so it became the tool of choice for this post.
Still, the visualization of H<sub>2</sub>O model trees could be reproduced with any of the network and visualization packages mentioned above.</span></span><br />
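<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">To give a feel for the package, here is a minimal sketch that builds a toy tree by hand with <code>data.tree</code> and queries a few of its properties. The node names are hypothetical splits chosen for illustration; they are not output of the H<sub>2</sub>O model discussed in this post.</span></span><br />

```r
library(data.tree)

# Hand-built toy tree; node names are made-up splits for illustration only
tree  <- Node$new("sex")
women <- tree$AddChild("female")
men   <- tree$AddChild("male")
men$AddChild("age < 10")
men$AddChild("age >= 10")

print(tree)        # text rendering of the hierarchy

tree$totalCount    # number of nodes in the whole tree
tree$leafCount     # number of leaf nodes
tree$height        # number of levels, counting the root
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Each call to <code>AddChild()</code> returns the newly created node, which is why the <code>men</code> node can be extended further in the same way.</span></span><br />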
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;">Visualizing </span></span></span></span></span>H<sub>2</sub>O<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"> Trees</span></span></span></span></span></span></span></span></span></span></span></span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;">In the last step a decision tree for the model created by GBM</span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"> moved from </span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span 
style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">H<sub>2</sub>O</span></span></span></span> cluster</span></span></span></span></span></span></span></span></span></span></span></span></span> memory to </span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="background-color: transparent; border-radius: 4px; 
box-sizing: border-box; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">H2OTree</code><span style="background-color: white; display: inline; float: none; font-family: "cambria" , serif; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span class="Apple-converted-space"> </span></span></span>object in R by means of Tree API. Still, specific to </span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , 
serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">H<sub>2</sub>O </span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , 
serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: 
large;">the </span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="background-color: transparent; border-radius: 4px; box-sizing: border-box; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">H2OTree</code><span style="background-color: white; display: inline; float: none; font-family: "cambria" , serif; font-style: normal; font-weight: normal; letter-spacing: normal; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span class="Apple-converted-space"> </span></span></span>object now contains necessary details about decision tree, but not in the format understood by R packages such 
as</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , 
"times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; 
background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 
0px;">data.tree.</code></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></code></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span 
style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">To fill this gap function</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: 
"georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: 
"georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "verdana" , sans-serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="background-color: transparent; border-radius: 4px; box-sizing: border-box; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"> createDataTree(H2OTree)</code></span></span></span></span></span></span></span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> created that traverses a tree and translates it from </span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" 
, serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: 
large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: 
"georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "verdana" , sans-serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="background-color: transparent; border-radius: 4px; box-sizing: border-box; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-decoration: 
none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "verdana" , sans-serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="background-color: transparent; border-radius: 4px; box-sizing: border-box; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">H2OTree</code></span></span></span></span></span></span></span></span></code></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span>into</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new 
roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: 
large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"> </span></span><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "verdana" , sans-serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "verdana" , sans-serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><code style="background-color: transparent; border-radius: 4px; box-sizing: border-box; font-family: monospace; font-style: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 
0px;">data.tree</code></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> accumulating information about decision tree splits and predictions into node and edge attributes of a tree:</span></span><br />
<br />
<div>
<code data-gist-file="map-h2otree-to-datatree.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Finally, everything is lined up and ready for the final step of plotting the decision tree:</span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">single decision tree model created in H2O</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">its structure made available in R</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">and translated to specialized </span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: 
"georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; 
letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 
0px;">data.tree</code></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></code></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> for network analysis.</span></span></li>
</ul>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Styling and plotting <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">data.tree</code> objects is built around the rich functionality of the <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">DiagrammeR</code> package. For anything beyond simple plotting, read the documentation <a href="http://rich-iannone.github.io/DiagrammeR/docs.html">here</a>, but also remember that for plotting <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">data.tree</code> takes advantage of:</span></span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">hierarchical nature of tree structures</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><a href="https://graphviz.gitlab.io/_pages/doc/info/attrs.html">GraphViz attributes</a> to style graph, node and edge properties</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">and dynamic callback functions (in this example <code style="-moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: transparent; border-radius: 4px; box-sizing: border-box; color: black; font-family: monospace; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; padding: 2px 4px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">GetEdgeLabel(node), GetNodeShape(node), GetFontName(node)</code>) to customize the tree's look and feel</span></span></li>
</ul>
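<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The three ingredients listed above (a hierarchy, GraphViz attributes, and per-node callbacks) can be sketched in a language-neutral way. The Python snippet below is only an illustration and not the <code>data.tree</code> API: the <code>Node</code> class, the <code>to_dot</code> helper, the font names, and the tiny Titanic-like tree are all hypothetical. It walks a tree and calls one callback per styling decision to emit GraphViz DOT text:</span></span><br />

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a decision tree node (not data.tree's API)."""
    name: str
    edge_label: str = ""  # split condition on the edge coming from the parent
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


# Callbacks: each one receives a node and returns a GraphViz attribute value.
def get_edge_label(node: Node) -> str:
    return node.edge_label


def get_node_shape(node: Node) -> str:
    # Leaves (predictions) are boxes, split nodes are ellipses.
    return "box" if node.is_leaf else "ellipse"


def get_font_name(node: Node) -> str:
    # Font names are illustrative assumptions.
    return "Helvetica-Bold" if node.is_leaf else "Helvetica"


def to_dot(
    root: Node,
    shape: Callable[[Node], str] = get_node_shape,
    font: Callable[[Node], str] = get_font_name,
    edge: Callable[[Node], str] = get_edge_label,
) -> str:
    """Traverse the tree depth-first and emit GraphViz DOT text."""
    lines = ["digraph tree {", "  rankdir=LR;"]
    counter = 0

    def visit(node: Node, parent_id: Optional[str]) -> None:
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        lines.append(
            f'  {node_id} [label="{node.name}", shape={shape(node)}, '
            f'fontname="{font(node)}"];'
        )
        if parent_id is not None:
            lines.append(f'  {parent_id} -> {node_id} [label="{edge(node)}"];')
        for child in node.children:
            visit(child, node_id)

    visit(root, None)
    lines.append("}")
    return "\n".join(lines)


# A made-up tree, loosely shaped like a Titanic survival model.
tree = Node("Sex", children=[
    Node("Died", edge_label="male"),
    Node("Pclass", edge_label="female", children=[
        Node("Survived", edge_label="1, 2"),
        Node("Died", edge_label="3"),
    ]),
])
dot = to_dot(tree)
print(dot)
```

<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Keeping the styling decisions in callbacks means the traversal never changes when the look of the tree does; only the small functions passed in are swapped.</span></span><br />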
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">The following code will produce this moderately customized decision tree for our </span></span><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span 
style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">H<sub>2</sub>O</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span> model: </span></span><br />
<br />
<div>
<code data-gist-file="plot-h2o-decision-tree.R" data-gist-hide-footer="true" data-gist-id="149fb361fdfe933cd6317601e8a7107c"></code>
</div>
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGcbfWsEtnZdJuosRc3JyOc5hPf5DkSyLIRP0EMN-OmPs9TDygDY4DQBi3drPKB20_q9f5PaFqwijUDZdVgx8_ms8__LOcQdcgpENaDm9ZYsv5QFiK6ObX0n_TYUsp7ETZ95BCeKa5DKJa/s1600/decision-tree-titanic-final.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1000" data-original-width="1600" height="399" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGcbfWsEtnZdJuosRc3JyOc5hPf5DkSyLIRP0EMN-OmPs9TDygDY4DQBi3drPKB20_q9f5PaFqwijUDZdVgx8_ms8__LOcQdcgpENaDm9ZYsv5QFiK6ObX0n_TYUsp7ETZ95BCeKa5DKJa/s640/decision-tree-titanic-final.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;"><span style="font-family: "georgia" , "times new roman" , serif;">Figure 5. H2O Decision Tree for Titanic Model Visualized in R using the data.tree package </span></span></td></tr>
</tbody></table>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;"> </span></span><br />
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">References</span></span></h3>
<ul>
<li><a href="https://www.h2o.ai/blog/anomaly-detection-with-isolation-forests-using-h2o/"><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;">Anomaly Detection with Isolation Forests using H2O</span></span></a></li>
<li><span style="font-size: small;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://github.com/h2oai/h2o-3/blob/master/Changes.md#xia-32201---10262018">Changes in H2O Xia (3.22.0.1) - 10/26/2018</a></span> </span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://dzone.com/articles/visualizing-h2o-gbm-and-random-forest-mojo-models">Visualizing H2O GBM and Random Forest MOJO Model Trees</a></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://dzone.com/articles/inspecting-decision-trees-in-h2o">Inspecting Decision Trees in H2O</a> </span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees_(CART)">Classification and Regression Trees</a> </span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: small;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8">Predicting the Survival of Titanic Passengers</a> </span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://www.kaggle.com/thilakshasilva/predicting-titanic-survival-using-five-algorithms">Predicting Titanic Survival using Five Algorithms</a></span></span></span></span> </span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html">Distributed Random Forest (DRF)</a></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html">Gradient Boosting Machine (GBM)</a></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://en.wikipedia.org/wiki/Inflection_point">Inflection Point</a></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="http://kateto.net/network-visualization">Static and dynamic network visualization with R</a> by Katya Ognyanova</span></span></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://www.jessesadler.com/post/network-analysis-with-r/">Introduction to Network Analysis with R</a> by Jesse Sadler</span></span> </span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://cran.r-project.org/web/views/gR.html">CRAN Graphical Models in R Task View</a></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="http://bioconductor.org/packages/release/BiocViews.html#___GraphAndNetwork">Bioconductor GraphAndNetwork packages</a> </span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html">Introduction to data.tree</a> </span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://github.com/rich-iannone/DiagrammeR">DiagrammeR package on github</a></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://graphviz.gitlab.io/_pages/doc/info/attrs.html">Node, Edge, and Graph attributes for Graphviz tools</a></span></span></span></span></span></span></span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: small;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: medium;"><a href="https://gist.github.com/grigory93/149fb361fdfe933cd6317601e8a7107c">Public GitHub gist with source code</a> </span></span></span></span></span></span></span></span></li>
</ul>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-78275305782656868862018-04-01T22:19:00.000-05:002018-04-02T18:02:01.730-05:00Surviving Shelter: Analysis of Time Spent and Outcome in Dallas Animal Shelters<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">In a previous <a href="https://novyden.blogspot.com/2017/08/dallas-animal-services-shelter-intake.html">post</a> we discovered the Dallas Animal Services data sources (</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">available on </span><span style="font-size: large;"><a href="https://www.dallasopendata.com/City-Services/FY-2017-Dallas-Animal-Shelter-Data/sjyj-ydcj" style="font-family: "georgia", "times new roman", serif;">Dallas Open Data</a></span>) and analyzed how animals are admitted to and discharged from the city shelters</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">. </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">We loaded </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">actual shelter records</span> and looked at the types of admission, the different outcomes, and their relationships. </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">In this post we continue the analysis by focusing on the <i>time</i> animals spend in shelters and the factors that favor or hinder the <i>survival</i> of dogs there. 
</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">For consistency and representativeness, only the admission types <b>Confiscated</b>, <b>Owner Surrender</b>, and <b>Stray</b> and the outcomes <b>Adoption</b>, <b>Died</b>, <b>Euthanized</b>, <b>Returned to Owner</b>, and <b>Transfer</b> were included. </span><span style="font-size: large;"><b style="font-family: georgia, "times new roman", serif;"><span style="font-size: medium;">Dead on Arrival</span></b><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"> was excluded from the survival analysis because it preempts the outcome before the stay in the shelter begins.</span></span><br />
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Time Spent in Shelters</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Compare the distributions of time spent in shelter for cats and dogs to note both similarities and differences:</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmbJ9TL4c1QL515s37tLnc-HP7KjGr16ENfR2OK1VH4GqxFJayerHJZTyDakWcCRAbd5JfpWccCPc5K-NyxKncVbCoUckE-4nbcHBkhx5r0t5nJrAXfebzm63glrS_ld27ZkReFbaUblTG/s1600/dallas-animal-shelters-cats-dogs-days-in-shelter-hist.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="445" data-original-width="700" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmbJ9TL4c1QL515s37tLnc-HP7KjGr16ENfR2OK1VH4GqxFJayerHJZTyDakWcCRAbd5JfpWccCPc5K-NyxKncVbCoUckE-4nbcHBkhx5r0t5nJrAXfebzm63glrS_ld27ZkReFbaUblTG/s1600/dallas-animal-shelters-cats-dogs-days-in-shelter-hist.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Both distributions are bimodal with relatively fat tails, but they differ in how the major modes compare to the minor ones. As Wikipedia notes, "</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="background-color: white; color: #222222;">a <a href="https://en.wikipedia.org/wiki/Multimodal_distribution">bimodal distribution</a> most commonly arises as a mixture of two different </span><a class="mw-redirect" href="https://en.wikipedia.org/wiki/Unimodal" style="background: none rgb(255, 255, 255); color: #0b0080; text-decoration-line: none;" title="Unimodal">unimodal</a><span style="background-color: white;"><span style="color: #222222;"> distributions", and dissecting the data by admission and outcome types opens the door to further discovery:</span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZehnRoGEM08y_LwgIdSd1KoCRcnCz4vsQI2CVOl0aiYY2TiS2IRXngTsM5lOUvGLko0qHz6YT1bcw-SD0dLdJXngL14Ltcl7hQdeR5IcEzkey8DJUJHHX2eXf21ovdpFKRlPmSRQdgiFZ/s1600/dallas-animal-shelters-cats-dogs-days-in-shelter-density.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="993" data-original-width="1000" height="635" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZehnRoGEM08y_LwgIdSd1KoCRcnCz4vsQI2CVOl0aiYY2TiS2IRXngTsM5lOUvGLko0qHz6YT1bcw-SD0dLdJXngL14Ltcl7hQdeR5IcEzkey8DJUJHHX2eXf21ovdpFKRlPmSRQdgiFZ/s640/dallas-animal-shelters-cats-dogs-days-in-shelter-density.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Whereas the first histogram used facets to plot cats and dogs separately, the second plot uses dodged bars to pack more information into less space. Some interesting observations:</span><br />
<br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Confiscated</b> admissions have a distinctly different profile, with peaks presumably attributable to legal obligations to owners;</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Confiscated</b> admissions have distinctly bimodal distributions when the outcome is either <b>Returned to Owner</b> or <b>Transfer</b>;</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Adoption</b> times are similar for both cats and dogs;</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Most distributions have clear unimodal profiles specific to the types of admission and outcome that vary between dogs and cats in density;</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Adoption</b> and, to a lesser degree, <b>Owner Surrender</b> distributions are almost indistinguishable between cats and dogs.</span></li>
</ul>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Rendering the same data using density curve estimates lets us validate the differences and similarities observed:</span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL8xBtrI3gR-PTrN3IHU4iAUHXL75QW0FR5vjBOH6dzeu9VYBk9xNQHoz9jk0F3vfV6foYkfSWsy9ey6BWGwHQlB7X28eDkBAZCeyfbCqA5G9t6gBkREMtcE-vJ3nkDh5Heamo2DhGB4bQ/s1600/dallas-animal-shelters-cats-dogs-days-in-shelter-density-estimates2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="993" data-original-width="1000" height="634" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL8xBtrI3gR-PTrN3IHU4iAUHXL75QW0FR5vjBOH6dzeu9VYBk9xNQHoz9jk0F3vfV6foYkfSWsy9ey6BWGwHQlB7X28eDkBAZCeyfbCqA5G9t6gBkREMtcE-vJ3nkDh5Heamo2DhGB4bQ/s640/dallas-animal-shelters-cats-dogs-days-in-shelter-density-estimates2.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The densities demonstrate a striking similarity in <b>Adoption</b> times and the greatest differences in <b>Euthanized</b> outcome times. </span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Sankeys With Average Times</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">We already used Sankey diagrams to project the flow from admission to discharge by the total number of occurrences of each transition. This time we take a novel approach to Sankeys in which thickness reflects the average time spent in the shelter. The first diagram is for cats:</span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<iframe allowfullscreen="" frameborder="1" height="600px" marginheight="0px" marginwidth="0px" name="myiFrame" scrolling="no" src="https://rpubs.com/grigory/IntakeToOutcomeCatsAvgTimeSankey" style="border: 0px #ffffff none;" width="800px"></iframe>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br />And then for dogs:</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<iframe allowfullscreen="" frameborder="1" height="600px" marginheight="0px" marginwidth="0px" name="myiFrame" scrolling="no" src="https://rpubs.com/grigory/IntakeToOutcomeDogsAvgTimeSankey" style="border: 0px #ffffff none;" width="800px"></iframe>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The thinner the line, the shorter the average stay between the admission and the outcome it connects. And the larger a vertical panel (admission or outcome), the longer an animal spends in the shelter after that admission or before that discharge (on average and unweighted). </span><br />
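<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The link widths described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code behind the diagrams: the function names, the record layout (intake type, outcome type, days in shelter), and the sample values are all hypothetical.</span><br />

```python
from collections import defaultdict

def average_stay_links(records):
    """Return {(intake, outcome): mean days in shelter} to use as Sankey link widths."""
    totals = defaultdict(lambda: [0.0, 0])  # (intake, outcome) -> [sum of days, count]
    for intake, outcome, days in records:
        acc = totals[(intake, outcome)]
        acc[0] += days
        acc[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}

def node_sizes(links):
    """Unweighted panel sizes: sum of the average stays of the links touching each node."""
    intake_size, outcome_size = defaultdict(float), defaultdict(float)
    for (intake, outcome), avg_days in links.items():
        intake_size[intake] += avg_days
        outcome_size[outcome] += avg_days
    return dict(intake_size), dict(outcome_size)

# Hypothetical records: (intake type, outcome type, days in shelter)
records = [
    ("Stray", "Adoption", 10),
    ("Stray", "Adoption", 20),
    ("Stray", "Euthanized", 4),
    ("Confiscated", "Returned to Owner", 12),
    ("Owner Surrender", "Adoption", 6),
]
links = average_stay_links(records)
intake_size, outcome_size = node_sizes(links)
```

<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Because each link carries an average rather than a count, a panel's size is a sum of averages and ignores how many animals followed each path, which is what "unweighted" means here.</span><br />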
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Expected Chance of Not Surviving in Shelter</span></h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">For the purpose of this analysis, any outcome other than <b>Died</b> or <b>Euthanized</b> means the animal left the shelter alive (most with outcomes <b>Adoption</b>, <b>Foster</b>, <b>Returned to Owner</b>, or <b>Transfer</b>). Remember that we also excluded dogs with intake type <b>Dead on Arrival</b> (see introduction).</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">We begin with rather simple calculations: estimates of the chance of dying in the shelter given that an animal satisfies a certain condition. The plot below contains conditional probabilities of dogs (unless cats are specified) <b>not</b> surviving in the shelter given a certain factor at the time of admission (intake categories):</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi56bRfDomQpITgrIUAiKpW5uk-RrPm1gjcHgVJZT1zMasbLDwXYHa3aVTKzvw0y_CyTXWGbqlaYwlZ7-7pkEPVQwmnn-9OcfbBmPE0lhyneHQ_nfSLpfV4qyJdZIuxHqNq3pcrPMwR5DeS/s1600/dallas-animal-shelters-by-intake-categories-chances.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="842" data-original-width="900" height="598" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi56bRfDomQpITgrIUAiKpW5uk-RrPm1gjcHgVJZT1zMasbLDwXYHa3aVTKzvw0y_CyTXWGbqlaYwlZ7-7pkEPVQwmnn-9OcfbBmPE0lhyneHQ_nfSLpfV4qyJdZIuxHqNq3pcrPMwR5DeS/s640/dallas-animal-shelters-by-intake-categories-chances.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Two health conditions stand out with the highest rates: <i>untreatable</i> and <i>unmanageable</i>, while another health condition, <i>contagious</i>, is present in 3 of the top 4 factors.</span></span><br />
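<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">For readers who want to reproduce this kind of estimate, here is a minimal sketch of the conditional probability of not surviving given a factor value, computed as deaths divided by admissions. The function name, the record layout, and the sample values are assumptions made for illustration; the <i>min_count</i> parameter is a hypothetical knob for skipping rarely seen values.</span><br />

```python
from collections import Counter

DEATH_OUTCOMES = {"Died", "Euthanized"}

def death_rate_by_factor(records, min_count=1):
    """Estimate P(not surviving | factor value) as deaths / admissions per value."""
    admitted, died = Counter(), Counter()
    for factor_value, outcome in records:
        admitted[factor_value] += 1
        if outcome in DEATH_OUTCOMES:
            died[factor_value] += 1
    # Keep only factor values seen at least min_count times.
    return {
        value: died[value] / n
        for value, n in admitted.items()
        if n >= min_count
    }

# Hypothetical records: (factor value at intake, outcome type)
records = [
    ("Contagious", "Euthanized"),
    ("Contagious", "Adoption"),
    ("Contagious", "Died"),
    ("Contagious", "Euthanized"),
    ("Healthy", "Adoption"),
    ("Healthy", "Transfer"),
    ("Healthy", "Euthanized"),
    ("Healthy", "Returned to Owner"),
]
rates = death_rate_by_factor(records)
```

<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">With these made-up records the estimate is 0.75 for Contagious and 0.25 for Healthy. Estimates built on only a handful of admissions are noisy, which is the reason to filter by a minimum count.</span><br />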
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">One more factor, <i>breed</i>, has over 200 values just for dogs. Below we display the chances of dying for dog breeds with at least 100 recorded admissions:</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFLBW18IpFzLMGpK2Ov68JSnlpGXX5QCVk_yrYhizGUoSkO2anM48BjhxFnBfo0kMHDdtPZnWSKKCzSB8IK-_cZZHKgvOIsZyp9K2wNZsRN1etmDddArTSnfm79dJ6nIaXTjFiPjyi284l/s1600/dallas-animal-shelters-dogs-by-breed-survival-chances.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="842" data-original-width="900" height="598" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFLBW18IpFzLMGpK2Ov68JSnlpGXX5QCVk_yrYhizGUoSkO2anM48BjhxFnBfo0kMHDdtPZnWSKKCzSB8IK-_cZZHKgvOIsZyp9K2wNZsRN1etmDddArTSnfm79dJ6nIaXTjFiPjyi284l/s640/dallas-animal-shelters-dogs-by-breed-survival-chances.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Note that the probability scale differs between the last two plots. Surprisingly, the breed <b>Chow Chow</b> took the top spot, with the Pit Bull Terrier breeds <b>Staffordshire</b>, <b>Pit Bull</b>, <b>Am Pit Bull Ter</b>rier, and <b>American Stafford</b>shire close behind. </span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
</div>
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Survival Analysis</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">While applying classic survival analysis to animal shelter data presents certain challenges, we apply the approach anyway, ignoring a few details; any suggestions or comments on how to improve it are welcome. The survival function <i>S(t)</i> gives the probability that the subject (a pet admitted to the shelter) survives longer than time </span><span style="font-size: large;"><i style="font-family: "georgia", "times new roman", serif;">t</i><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;">. </span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">In this case a pet survived if it was discharged with any outcome other than <b>Died</b> or <b>Euthanized</b>. The time <i>t</i> is always measured in days since the day of admission, and all records included in this analysis are for animals that were discharged, so no record lacks a known admission or discharge date. Survival analysis accounts for censored data: subjects last known to be alive with no later information available. In our case every record contains an outcome, and thus every animal discharged alive is treated as censored at its discharge date.</span></div>
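<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The encoding described above can be sketched as a small helper that turns one shelter record into a (duration, event) pair. The function name and argument layout are assumptions for illustration only; an event value of 1 marks a death and 0 marks an animal censored at discharge.</span></div>

```python
from datetime import date

DEATH_OUTCOMES = {"Died", "Euthanized"}

def to_survival_record(intake_date, outcome_date, outcome):
    """Return (days in shelter, event): event is 1 for a death, 0 if censored at discharge."""
    duration = (outcome_date - intake_date).days  # day of admission counts as day 0
    event = 1 if outcome in DEATH_OUTCOMES else 0
    return duration, event

# Hypothetical records
adopted = to_survival_record(date(2017, 3, 1), date(2017, 3, 6), "Adoption")
euthanized = to_survival_record(date(2017, 3, 1), date(2017, 3, 1), "Euthanized")
```

<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Here the adopted animal becomes (5, 0), censored after five days, and the animal euthanized on the day of admission becomes (0, 1).</span></div>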
<span style="font-size: large;">
</span>
<br />
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Kaplan-Meier Estimator</span></h4>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059453/">Kaplan-Meier (KM) estimate</a> is a non-parametric maximum likelihood estimate of the survival function, <i>S(t)</i>. It measures the fraction of animals still alive a given number of days <i>t</i> after admission and produces a declining step function (the KM curve) that approximates the true survival function from the data. Given a single categorical factor, we can compare KM curves among its values (univariate analysis). Like survival functions, KM curves estimate and visualize survival chances over time: for a given time <i>t</i>, what is the probability that a subject survives at least to that time?</span></div>
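<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">To make the estimator concrete, here is a minimal pure-Python sketch of the standard product-limit formula: at each day with at least one death, the running survival estimate is multiplied by one minus the ratio of deaths on that day to animals still at risk. It is not the code used to produce the plots in this post; the function name and the sample data are hypothetical, and durations and events follow the convention that 1 is a death and 0 is censored.</span></div>

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] for every time t with at least one death.

    S(t) is the product over death times t_i <= t of (1 - d_i / n_i), where
    d_i is the number of deaths at t_i and n_i the number at risk just before t_i.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # everyone leaving at time t, dead or censored
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            survival *= 1 - deaths[t] / at_risk
            curve.append((t, survival))
        # Animals censored at t still counted as at risk for deaths at t.
        at_risk -= exits[t]
    return curve

# Hypothetical data: days in shelter and event flag (1 = death, 0 = censored)
durations = [0, 0, 2, 3, 3, 5, 8, 10]
events = [1, 0, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
```

<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The curve drops only on days 0, 2, 3, and 8, when deaths occur; censored discharges shrink the at-risk group without causing a drop, which is why later deaths produce larger steps.</span></div>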
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Cats vs. Dogs KM Curves</span></h4>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">First we compare survival curves between cats and dogs:</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyeNnBzw-XsnDGnAn7Q0KPyu0z5pyO7CYPkH8gMDMScHxPHR1cG8MVVQnVB73ztqb-g4gdlWAqmM4V6e7s_gMydMxuUexd2hdehAgY3k3C1WQeDVrWKrPmgm_DE9PedUU7r2j9hDv-W9n1/s1600/dallas-animal-shelters-km-curves-by-dogscats.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="635" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyeNnBzw-XsnDGnAn7Q0KPyu0z5pyO7CYPkH8gMDMScHxPHR1cG8MVVQnVB73ztqb-g4gdlWAqmM4V6e7s_gMydMxuUexd2hdehAgY3k3C1WQeDVrWKrPmgm_DE9PedUU7r2j9hDv-W9n1/s640/dallas-animal-shelters-km-curves-by-dogscats.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">
</span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The survival curve plot (top) is augmented with a bar chart of totals by category and survival outcome (bottom) to give a better understanding of the underlying data. Survival chances for cats are never better than those for dogs, and overall cats fare much worse than dogs (see the bar chart above). Zooming in on the most critical first days after admission reveals more differences:</span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk3Ulqa6s4eJj_WK8k48qJq7rLlG6MtPH5q5kgo-aB-onTnW1Qv6uHWuSwPMJ9AE0w0EO1sc7AL0CTal-rUuAZEKTmdbaIbikXXs4te9W9f5rYO0donW5pZYkMKaCw2NNyLzPAopjt0qZM/s1600/dallas-animal-shelters-km-curves-by-dogscats-2weeks.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="635" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk3Ulqa6s4eJj_WK8k48qJq7rLlG6MtPH5q5kgo-aB-onTnW1Qv6uHWuSwPMJ9AE0w0EO1sc7AL0CTal-rUuAZEKTmdbaIbikXXs4te9W9f5rYO0donW5pZYkMKaCw2NNyLzPAopjt0qZM/s640/dallas-animal-shelters-km-curves-by-dogscats-2weeks.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The day of admission is the worst for both, but cats fare twice as badly, with 25% lost right away. Days 4 and 5 are critical for dogs, as their survival plummets on these days. After that, survival rates stabilize and follow a similar pattern.</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">KM Curves by Dog Intake Types</span></h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">To keep further analysis manageable, we include only dog records from this point on. We also exclude pets</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> admitted as </span><span style="font-size: large;"><b style="font-family: "georgia", "times new roman", serif;">Dead on Arrival</b><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"> or </span><b style="font-family: "georgia", "times new roman", serif;">Euthanasia Requested</b><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"> since their outcomes are obvious and immediate.</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBzJwSY_EX87aCjED-SOz453QXKiiOZRPgA-jirqrjgANGLZIWU4x0epdKMnk3CcSf3ebZO-ktBLfATLqGvp28Z1Yb6JBYYhkhklfW17fou1XnGXOap6FG88ErQx8nlDI7Tk0IABGQtHzi/s1600/dallas-animal-shelters-km-curves-dogs-by-intake-types.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="635" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBzJwSY_EX87aCjED-SOz453QXKiiOZRPgA-jirqrjgANGLZIWU4x0epdKMnk3CcSf3ebZO-ktBLfATLqGvp28Z1Yb6JBYYhkhklfW17fou1XnGXOap6FG88ErQx8nlDI7Tk0IABGQtHzi/s640/dallas-animal-shelters-km-curves-dogs-by-intake-types.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Confiscated dogs' survival chances are the best in the first 10 days or so, but then they quickly deteriorate, crossing and diving below the 2 other types after 2 weeks. The worst chances, as expected, belong to dogs surrendered by their owners. After 2 weeks all 3 curves cross and become less distinguishable.</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> </span></div>
<div>
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">KM Curves by Dog Origins</span></h4>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Dallas Animal Services also maintains an origin field, assigned at admission, with the 3 most prevalent values being <b>Field</b>, <b>Over the Counter</b>, and <b>Sweep</b>. This is how survival curves differ depending on dog origin:</span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7Z5GKZH05fnhL1imk11NBYCdr6ahuDqXo_tRpF6_ZTA-4jlQyvtlJQj28kqi0hk67vZY3-PtNNDTaSgHlkR4wBXX2Lut4eXl2jzvIHv_gXzDLqbD8fdyz_2qDVZfBOMQCCIX2Wwuj6OBh/s1600/dallas-animal-shelters-km-curves-dogs-by-origin.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="635" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7Z5GKZH05fnhL1imk11NBYCdr6ahuDqXo_tRpF6_ZTA-4jlQyvtlJQj28kqi0hk67vZY3-PtNNDTaSgHlkR4wBXX2Lut4eXl2jzvIHv_gXzDLqbD8fdyz_2qDVZfBOMQCCIX2Wwuj6OBh/s640/dallas-animal-shelters-km-curves-dogs-by-origin.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Again, significant shifts in survival chances happen after 5 days and then after 2-3 weeks, when the fortunes of different origins turn around: after 5 days <b>Over the Counter</b> goes from the worst to the second worst (or second best), and then after 3 weeks to the best. Both <b>Field</b> and <b>Sweep</b> drop after 5 days. In absolute numbers (shown in the bar plots) <b>Field </b>dogs survive the worst.</span></div>
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Health Conditions at Admission</span></h4>
</div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Unhealthy animals have little chance to survive shelters as evident from the following:</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDVCvmW74pYE63RkrcjvNmam3aMmLfVAoILQRG_kp3qFsWQjmTand9Tnn5RoT4PEHeYcNzmncZtO2lriY_p2M960JAZF8-5FCRfnQ6y4NRcOxXx1JzoDyC1TJgn29yKLR55exX9ytr4NyX/s1600/dallas-animal-shelters-km-curves-dogs-by-unhealthy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="634" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDVCvmW74pYE63RkrcjvNmam3aMmLfVAoILQRG_kp3qFsWQjmTand9Tnn5RoT4PEHeYcNzmncZtO2lriY_p2M960JAZF8-5FCRfnQ6y4NRcOxXx1JzoDyC1TJgn29yKLR55exX9ytr4NyX/s640/dallas-animal-shelters-km-curves-dogs-by-unhealthy.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">It is no surprise that unhealthy animals' survival is significantly below that of healthy ones. Also, the vast majority of dogs accepted are in unhealthy condition, which is both unsurprising and unfortunate. </span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">More information about unhealthy dogs is available from shelter records: treatable vs. untreatable and contagious vs. non-contagious. Unfortunately, these values reside inside a single field, so the survival curves include combinations of the health factors:</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3h2jVZnoFBQi8syBbnrT9rn6Mz0tmZNV-rFeyDufRpXdH_jDRzeChPs0dj6TP_pOpPfWbcbMSUmMTmmYgOw7Jza3Eky2uui-dnB5wApF9aGeqgh_2EL7RzSLgaOz1ZdDFvFxxAPO16aXG/s1600/dallas-animal-shelters-km-curves-dogs-by-health-factors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="634" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3h2jVZnoFBQi8syBbnrT9rn6Mz0tmZNV-rFeyDufRpXdH_jDRzeChPs0dj6TP_pOpPfWbcbMSUmMTmmYgOw7Jza3Eky2uui-dnB5wApF9aGeqgh_2EL7RzSLgaOz1ZdDFvFxxAPO16aXG/s640/dallas-animal-shelters-km-curves-dogs-by-health-factors.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">It clearly shows how each health factor reduces survival chances: from <b>Healthy</b> to <b>Treatable Rehabilitable</b> to <b>Treatable Manageable</b> to <b>Unhealthy Untreatable</b> to finally </span></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Unhealthy Untreatable Contagious</b></span></span></b>. </span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">If we extract and analyze each health factor (ignoring the rest) then these relationships become more apparent:</span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtzJhEBMOa16EBHHxyAD13KBN35qp9RJcmclWUqXVH_vNtrkVDGrZIKtwlSF4ue4ssINakgK8vqms6llwQ27I3hHaWhFMu_qO54EnzWZrsMaTpP1bxLDJuJueMWXTGGyus2Au3dwiDOgEw/s1600/Rplotdallas-animal-shelters-km-curves-dogs-by-health-contagious.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="633" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtzJhEBMOa16EBHHxyAD13KBN35qp9RJcmclWUqXVH_vNtrkVDGrZIKtwlSF4ue4ssINakgK8vqms6llwQ27I3hHaWhFMu_qO54EnzWZrsMaTpP1bxLDJuJueMWXTGGyus2Au3dwiDOgEw/s640/Rplotdallas-animal-shelters-km-curves-dogs-by-health-contagious.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI90CH31MNbNkPLwvCHiaDPCV6CKAoBy-0uysGalaj4GWjUPFAGMup6h_FEWJ_7lsvjKpZxc9cJSXNIv00rN3yDUrpbhCBI-1BJ5kYZ4hH1lmRIFh8epJ5XPNlhbiHQr9zazqgm_yOnEya/s1600/dallas-animal-shelters-km-curves-dogs-by-health-treatable.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="633" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI90CH31MNbNkPLwvCHiaDPCV6CKAoBy-0uysGalaj4GWjUPFAGMup6h_FEWJ_7lsvjKpZxc9cJSXNIv00rN3yDUrpbhCBI-1BJ5kYZ4hH1lmRIFh8epJ5XPNlhbiHQr9zazqgm_yOnEya/s640/dallas-animal-shelters-km-curves-dogs-by-health-treatable.png" width="640" /></a></div>
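<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Extracting individual health factors from a combined field, as done for the two plots above, might look like the sketch below. The exact label format in the shelter records is an assumption here, as are the function name and the flag names; the parser works on whole words so that "Untreatable" is not mistaken for "Treatable" and "Non-Contagious" is not mistaken for "Contagious".</span></div>

```python
def parse_condition(condition):
    """Split a combined intake condition label into separate health flags."""
    # Compare whole words; "Non-Contagious" becomes the tokens "non", "contagious".
    tokens = condition.lower().replace("-", " ").split()
    non_contagious = "non" in tokens
    return {
        "healthy": tokens == ["healthy"],
        "treatable": "treatable" in tokens,
        "untreatable": "untreatable" in tokens,
        "contagious": "contagious" in tokens and not non_contagious,
    }

# Hypothetical labels in the style of the categories named in this post
worst = parse_condition("Unhealthy Untreatable Contagious")
managed = parse_condition("Treatable Manageable Non-Contagious")
healthy = parse_condition("Healthy")
```

<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Each flag can then serve as its own grouping factor for a separate set of survival curves, ignoring the other flags.</span></div>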
<br />
<br />
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Survival of Dogs with Chips </span></span></h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><a href="http://dallascityhall.com/departments/dallas-animal-services/Pages/microchipping-FAQ.aspx">As of June 17, 2017, all dogs and cats four months and older in the city of Dallas must be microchipped.</a> This relatively new regulation will likely change both the share of chipped dogs in Dallas and survival curves as observed below from 2015 through October 2017:</span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><br /></span></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyw2UymjrMWgglVTJ-foOxV_HKXcHqfF7HGMKUEJ-df9rJMfS4kOD64VECtxv4qoYYm6m9W1gWorpspZ-eSiItYEhUkmGoBRnCZXtQ45EfM76jeLkodm-MPG4fvi6bts86IlAO2rUiFTwm/s1600/dallas-animal-shelters-km-curves-dogs-by-chipped.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="794" data-original-width="800" height="634" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyw2UymjrMWgglVTJ-foOxV_HKXcHqfF7HGMKUEJ-df9rJMfS4kOD64VECtxv4qoYYm6m9W1gWorpspZ-eSiItYEhUkmGoBRnCZXtQ45EfM76jeLkodm-MPG4fvi6bts86IlAO2rUiFTwm/s640/dallas-animal-shelters-km-curves-dogs-by-chipped.png" width="640" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><br /></span></span></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"><span id="DeltaPlaceHolderMain">Still, having a dog microchipped will almost certainly keep its survival chances higher.</span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"> </span></span></span>
<br />
<h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain">Dog Breeds</span></span></span></h4>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain">Dallas shelters admitted dogs of over 200 different breeds from 2015 through 2017. Among them 56 breeds appeared 100 times or more (over 95% of all admissions): </span></span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqTRnQeCSfPMlAz_iBI0jYWZnbwjVdaeiDspwwO7c_gw8KciGKK9zUXy4R5NBjwqsfJ8Y2phBa6ObGE9zgnUqiylt-dDtDdGlPr9MOho50KniT2A40lwSU6k3Zo7duyelqlz8ATqtbWfI-/s1600/dallas-animal-shelters-dogs-by-breed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="847" data-original-width="900" height="602" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqTRnQeCSfPMlAz_iBI0jYWZnbwjVdaeiDspwwO7c_gw8KciGKK9zUXy4R5NBjwqsfJ8Y2phBa6ObGE9zgnUqiylt-dDtDdGlPr9MOho50KniT2A40lwSU6k3Zo7duyelqlz8ATqtbWfI-/s640/dallas-animal-shelters-dogs-by-breed.png" width="640" /></a></span></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain">The top 4 breeds - <b>Pit Bull</b>, <b>Labrador Retriever</b>, <b>Chihuahua</b>, and <b>German Shepherd</b> - account for almost 60% of all admissions, with the next breed - <b>Cairn Terrier</b> - dropping to just under 3%. The survival curves for these 5 breeds cover almost 2/3 of all dogs admitted to Dallas shelters:</span></span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI-cxkg3W71-Py-6-E8aJZXoWWnGDGSje0aSsic7VWsz5rlS8G-agQkdZ1XSJzAEv9rKoNk1F6T4cV9dVvnSDXEtI6oOycTHXu_tbqe8f_TcLnqe_sIhPCSxjcWqFESuNf9ltG_FOyWZ6U/s1600/dallas-animal-shelters-km-curves-dogs-by-top-5-breeds.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="753" data-original-width="800" height="602" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI-cxkg3W71-Py-6-E8aJZXoWWnGDGSje0aSsic7VWsz5rlS8G-agQkdZ1XSJzAEv9rKoNk1F6T4cV9dVvnSDXEtI6oOycTHXu_tbqe8f_TcLnqe_sIhPCSxjcWqFESuNf9ltG_FOyWZ6U/s640/dallas-animal-shelters-km-curves-dogs-by-top-5-breeds.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><b>Pit Bull</b></span></span></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain">s suffer the worst survival rate of the 5</span></span></span> most admitted breeds: it drops below 50% after just over a week at the shelter. <b>Labrador</b> and <b>German Shepherd</b> reach 50% some time into the third week. Smaller breeds fare much better, as is evident from the <b>Chihuahua</b> and <b>Cairn Terrier</b> curves.</span></span></span><br />
<br />
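<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The 50% readings above are median survival times taken from Kaplan-Meier curves. As a rough illustration only (the actual analysis lives in the R notebook), here is a minimal Python sketch of the estimator. The function names, the toy numbers, and the choice to treat euthanasia as the event and every other outcome as censoring are assumptions made for this example, not details taken from the notebook:</span><br />

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate as a list of (time, survival) pairs.

    durations: days spent in the shelter (hypothetical input).
    observed:  1 if the stay ended with the event of interest,
               0 if it was censored (assumed here: any other outcome).
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)          # events and censorings per day
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of at-risk animals surviving day t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]             # everyone leaving at t drops out
    return curve


def median_survival(curve):
    """First time at which estimated survival is 50% or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None                         # curve never reaches 50%


if __name__ == "__main__":
    # Made-up stays, not Dallas shelter data.
    days = [2, 3, 3, 5, 8, 8, 10, 12]
    event = [1, 1, 0, 1, 1, 1, 0, 1]
    km = kaplan_meier(days, event)
    print(km)
    print("median survival (days):", median_survival(km))
```

<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Running this once per breed would give one curve per breed, which is the comparison the plots make.</span><br />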
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain">It turns out there are more breeds closely related to Pit Bull: <b>American Staff</b>, <b>Am Pit Bull Ter</b>, and <b>Staffordshire</b>:</span></span></span>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span id="DeltaPlaceHolderMain"><br /></span></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgrFBcCen9J4GdGpBicJqcSRShQsDbfUd6jGQkDsX8AAAnp9pM4cjl361pBGdjtnn22rSsws6uMf6wFecVflxR012kHSiVXvrGkHxVF0ZSFG2SnN9fDyndcUPis5ub6LDK3A7etbe_7ugU/s1600/Dallas-animal-shelters-km-curves-dogs-by-pit-bull-group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="748" data-original-width="800" height="598" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgrFBcCen9J4GdGpBicJqcSRShQsDbfUd6jGQkDsX8AAAnp9pM4cjl361pBGdjtnn22rSsws6uMf6wFecVflxR012kHSiVXvrGkHxVF0ZSFG2SnN9fDyndcUPis5ub6LDK3A7etbe_7ugU/s640/Dallas-animal-shelters-km-curves-dogs-by-pit-bull-group.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Three of the four breeds in the group follow a similar pattern that differs sharply from the fourth, <b>American Staff</b>ordshire, for reasons beyond this analysis.</span></span><br />
<span style="font-size: large;"> </span><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span><br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Next</span></span></h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">In the next and final post on Dallas animal shelters we will apply the semi-parametric Cox proportional hazards model </span><span style="font-size: large;">to assess simultaneously the effect of several factors on survival time and outcome.</span></span><br />
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: large;">Resources</span></span></h3>
<span style="font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The R notebook (source code) with data pipeline and visualizations can be found </span><span style="font-size: large;"><a href="https://github.com/grigory93/r-notebooks/commit/8ce1b2ba6f3c710ddd29a712af797648c976d2a6" style="font-family: "georgia", "times new roman", serif;">here</a></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> with knitted version on </span><span style="font-size: large;"><a href="http://rpubs.com/grigory/DASSurvivalAnalysis" style="font-family: "georgia", "times new roman", serif;">RPubs</a></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">.</span></span> </span></div>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com6tag:blogger.com,1999:blog-7530218802939252476.post-81648425028316192342017-09-04T01:56:00.000-05:002017-09-04T15:05:32.190-05:00How Pets Get Admitted and Later Leave Dallas Animal Shelters<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Thanks to <a href="https://www.dallasopendata.com/City-Services/FY-2017-Dallas-Animal-Shelter-Data/sjyj-ydcj">Dallas OpenData</a> anyone has access to the city animal shelter records. If you lost or found a pet it could be that <a href="http://www.writersdigest.com/editor-blogs/questions-and-quandaries/grammar/how-to-handle-animal-pronouns-he-she-or-it">he or she</a> spent some time in a shelter - I personally took lost dogs there</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">. It's unfortunate, but every year tens of thousands of animals find their way to shelters, with a significant fraction never finding their way out. </span><br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></h3>
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">What and How Many Animals are Admitted?</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The City of Dallas animal shelter dataset contains 5 different animal types, with dogs and cats holding a solid lead (hardly a surprise to anyone):</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNf10SqjT8jxDE4gp8FDz9xLIxD-D3oCOHCkshpUAZeAGmNO7X05qGhVn56hp-nKkY9Q1mYgdw5UYgt9CQCsgZHKzJRgRdRRmHIffmfqLJMvsDgeQqOM4CKMr5AghldCR529oyr4rzyCgk/s1600/dallas-animal-shelters-animal-totals.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="906" data-original-width="900" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNf10SqjT8jxDE4gp8FDz9xLIxD-D3oCOHCkshpUAZeAGmNO7X05qGhVn56hp-nKkY9Q1mYgdw5UYgt9CQCsgZHKzJRgRdRRmHIffmfqLJMvsDgeQqOM4CKMr5AghldCR529oyr4rzyCgk/s640/dallas-animal-shelters-animal-totals.png" width="632" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">For consistency and plausibility of analysis we will focus on cats and dogs only</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">. </span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">How Animals Get Admitted</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Each shelter record has the animal's intake type (the reason the animal was admitted) and outcome (the cause of the animal's discharge). </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The top 2 reasons why cats and dogs turn up at shelters are</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Stray</b> (lost or abandoned) </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">and </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Owner Surrender</b> (willingly brought in by owner), while</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Confiscated</b> (abused, no owner, etc.) is #3 for dogs but not cats</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">: </span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQhu89p6lY6CvHl27rklFnlmgZgp3gTyaGA_VX-vyjbuVA-I7Hy4gxRXOGhI9pbhD7Cxp822UAPvRI3196AKWIfSqpZismwLV0GM5URyhqS2kd96_lEMtSo5zitpm5aCe_X6STUfTtHW7D/s1600/dallas-animal-shelters-intake-totals.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="906" data-original-width="900" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQhu89p6lY6CvHl27rklFnlmgZgp3gTyaGA_VX-vyjbuVA-I7Hy4gxRXOGhI9pbhD7Cxp822UAPvRI3196AKWIfSqpZismwLV0GM5URyhqS2kd96_lEMtSo5zitpm5aCe_X6STUfTtHW7D/s640/dallas-animal-shelters-intake-totals.png" width="632" /></a></div>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">How Animals Leave Shelter</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Animals leave shelters (either alive or dead) for 4 main reasons (outcomes): </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Adoption</b> (good),</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Euthanized</b> (bad), <b>Returned to Owner</b> (good), and <b>Transfer</b> (neutral):</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6xE55xrweSYXfNZssXNYyo0Bz2cBJP_U6wqQDN6QYMGXUZGI05BK3rGFMnO3SUH57kml7YlA4I6YjN-LRNgLcuSmDqdnYqGv5ilPJYiSFq-yqSXi-strdeyBrkxFoxPis2yUo4j4oKI74/s1600/dallas-animal-shelters-outcome-totals.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="906" data-original-width="900" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6xE55xrweSYXfNZssXNYyo0Bz2cBJP_U6wqQDN6QYMGXUZGI05BK3rGFMnO3SUH57kml7YlA4I6YjN-LRNgLcuSmDqdnYqGv5ilPJYiSFq-yqSXi-strdeyBrkxFoxPis2yUo4j4oKI74/s640/dallas-animal-shelters-outcome-totals.png" width="632" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Unfortunately, for both cats and dogs the top reason for leaving the shelter is being euthanized. But that's where the similarity between them ends: </span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><br /></span>
<br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">cats don't get returned to owner anywhere near as often as dogs;</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">dogs' adoption and euthanasia rates are almost the same while cats get adopted far less often. </span></li>
</ul>
<br />
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">From Admissions to Outcomes with Sankey</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">So what is the relationship between intake types and outcomes? Which intake types drive outcomes, and to what extent? The good news is that some causal effect is plausible because each stay begins with an intake type and ends with an outcome. </span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-family: "georgia" , "times new roman" , serif;">We begin analyzing this relationship with a higher-level (in this case) but visually appealing visualization called a sankey diagram (or just sankey)</span><span style="font-family: "georgia" , "times new roman" , serif;">. It is </span></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="background-color: white; color: #222222;">a specific type of flow diagram,</span><span style="background-color: white; color: #222222;"> in which the width of the arrows is proportional to the flow quantity. In our case e</span></span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">ach shelter stay contributes to the width of the pipe flowing from left (an intake type) to right (an outcome). With this we essentially visualize the conditional probabilities of an animal leaving the shelter with a certain outcome given its intake type (the first image illustrates </span><span style="background-color: white; color: #222222; font-family: "georgia" , "times new roman" , serif; font-size: large;">transitions for cats and the second does the same for dogs):</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<iframe allowfullscreen="" frameborder="1" height="600px" marginheight="0px" marginwidth="0px" name="myiFrame" scrolling="no" src="https://rpubs.com/grigory/IntakeToOutcomeCatsSankey" style="border: 0px #ffffff none;" width="800px"></iframe>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<iframe allowfullscreen="" frameborder="1" height="600px" marginheight="0px" marginwidth="0px" name="myiFrame" scrolling="no" src="https://rpubs.com/grigory/IntakeToOutcomeSankey" style="border: 0px #ffffff none;" width="800px"></iframe>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">While the <b>Owner Surrender</b> intake type flows similarly for both, the <b>Stray</b> intake type does not: cat outcomes are dominated by <b>Euthanized</b>, but dog outcomes are dominated by <b>Adoption</b>, with <b>Transfer</b> and <b>Returned to Owner</b> outcomes together matching <b>Euthanized</b>.</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
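<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The pipe widths in a sankey are just counts of (intake type, outcome) pairs, and dividing each count by its intake total gives the conditional probability of an outcome given the intake type. A minimal Python sketch of that arithmetic follows; the function name and the sample stays are hypothetical and are not taken from the Dallas dataset or the original R code:</span><br />

```python
from collections import Counter, defaultdict


def conditional_probabilities(stays):
    """Estimate P(outcome | intake type) from (intake, outcome) pairs."""
    pair_counts = Counter(stays)                       # sankey pipe widths
    intake_totals = Counter(intake for intake, _ in stays)
    probs = defaultdict(dict)
    for (intake, outcome), n in pair_counts.items():
        probs[intake][outcome] = n / intake_totals[intake]
    return dict(probs)


if __name__ == "__main__":
    # Made-up shelter stays for illustration only.
    stays = [
        ("Stray", "Adoption"),
        ("Stray", "Euthanized"),
        ("Stray", "Adoption"),
        ("Owner Surrender", "Euthanized"),
    ]
    for intake, outcomes in conditional_probabilities(stays).items():
        print(intake, outcomes)
```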
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Correlations Between Admissions and Outcomes</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Next, we go beyond the overall totals used in the sankey and compute correlations. To correlate intake types and outcomes we construct time series by computing monthly totals for each intake type and outcome, which gives us monthly trends. Then, separately for cats and dogs, we </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">correlate the monthly trends of animals brought into and removed from Dallas animal shelters for each pair of top intake types (<b>Confiscated</b>, <b>Owner Surrender</b>, and <b>Stray</b>) and outcomes (<b>Adoption</b>, <b>Euthanized</b>, <b>Returned to Owner</b>, and <b>Transfer</b>) - 12 coefficients in total:</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtPonVbnvwXkg5_doFq4qDQxY67Os7nYPUVJdTUog23vQAl8wzmLeTD7Eq4689nZzU-a9Q1qVJvtDA7u57NUq_MuTDYHYVFRTVBLQS1qFDUfGltnNYZ81w7a2yW2NipWISZxjgx9Wvr2Dl/s1600/dallas-animal-shelters-multiple-cats-monthly-trends-corrs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="800" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtPonVbnvwXkg5_doFq4qDQxY67Os7nYPUVJdTUog23vQAl8wzmLeTD7Eq4689nZzU-a9Q1qVJvtDA7u57NUq_MuTDYHYVFRTVBLQS1qFDUfGltnNYZ81w7a2yW2NipWISZxjgx9Wvr2Dl/s640/dallas-animal-shelters-multiple-cats-monthly-trends-corrs.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvbtsnmCLMzT36wxE-YMjqBeBj9Nm4WeDx3XbTU71vOz34GmytVJWYpo-3WrbozaSP2nZ9JDDXf39pK1klECLUF_QBLWUFq0Tqxe4VOWYuxHoUUbH0KOO50ZMynuYwVl5yv0rso8EBdG4g/s1600/dallas-animal-shelters-multiple-dogs-monthly-trends-corrs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="800" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvbtsnmCLMzT36wxE-YMjqBeBj9Nm4WeDx3XbTU71vOz34GmytVJWYpo-3WrbozaSP2nZ9JDDXf39pK1klECLUF_QBLWUFq0Tqxe4VOWYuxHoUUbH0KOO50ZMynuYwVl5yv0rso8EBdG4g/s640/dallas-animal-shelters-multiple-dogs-monthly-trends-corrs.png" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">In this case a strong correlation implies (at least to some extent) a causal effect because the temporal relationship, consistency, and plausibility criteria are met (see <a href="https://stats.stackexchange.com/a/2536/8464">here</a> and <a href="http://epiville.ccnmtl.columbia.edu/assets/pdfs/Hill_1965.pdf">here</a>). </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">A few observations to note:</span><br />
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The highest correlation for cats (0.91) and the second highest for dogs (0.77) are observed between intake <b>Owner Surrender</b> and outcome <b>Euthanized</b>, which is as obvious as it is unfortunate.</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The correlation between <b>Stray</b> and <b>Returned to Owner</b> for dogs is the highest at 0.86. This is great news because it means the more dogs get lost, the more of them are found. The higher this correlation, the healthier the city, for 2 reasons: a) lost animals return home and b) a larger share of stray dogs are lost rather than abandoned (given that the city keeps collecting them).</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: medium;"><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Unfortunately, the trend in <b>Stray</b> cats correlates highly with <b>Euthanized</b>. So while the <b>Stray</b> <b>dog</b> trend drives adoptions and returns, the <b>Stray</b> <b>cat</b> trend affects euthanizations the most (we've seen that in the sankey as well).</span></span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">No trends are affected by variations in <b>Confiscated</b> dogs, but this is likely due to smaller share of such admissions.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Variation in <b>Stray</b> dogs admitted affects every outcome except <b>Euthanized</b>. Indeed, the <b>Stray</b> intake type is the largest, almost twice as big as the 2nd largest dog intake type, <b>Owner Surrender</b>. </span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Low correlation for dogs between <b>Stray</b> and <b>Euthanized</b> needs additional analysis because it's counter-intuitive.</span></li>
</ul>
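<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The coefficients discussed above come from pairing monthly totals of one intake type with monthly totals of one outcome. Below is a minimal Python sketch of that computation, assuming Pearson correlation and records stored as (month, category) pairs; the helper names and the data layout are assumptions for illustration, and the original analysis was done in R:</span><br />

```python
from collections import Counter
from math import sqrt


def monthly_totals(records, category, months):
    """Monthly counts of one category; records are (month, category) pairs."""
    counts = Counter(month for month, cat in records if cat == category)
    return [counts[m] for m in months]    # missing months count as 0


def pearson(x, y):
    """Pearson correlation of two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(intakes, outcomes, months):
    """One coefficient per (intake type, outcome) pair of monthly trends."""
    intake_types = sorted({cat for _, cat in intakes})
    outcome_types = sorted({cat for _, cat in outcomes})
    return {
        (i, o): pearson(monthly_totals(intakes, i, months),
                        monthly_totals(outcomes, o, months))
        for i in intake_types
        for o in outcome_types
    }
```

<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">With 3 intake types and 4 outcomes, <i>correlation_matrix</i> would return the 12 coefficients per animal type shown in the plots.</span><br />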
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div>
<h3>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Monthly Trends</span></h3>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">But can we do better than correlations of these trends, </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">which are technically sophisticated but </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">still</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"> </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">aggregates</span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">? </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The next visual places time series </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">instead of correlation coefficients </span><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">inside the same matrix grid, allowing us to see and compare the actual monthly trends:</span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSZi_BgJqlWtTqOdrE5OmktBcdGqiWBz7wlsJVex55NUhD5-oklSwDjNThyphenhyphenLp6ZIXj98eg0WBKa2eKAP6hp2qeCohhWOWBozV3A3Ki6TCWksUQoySkXY5zLAIwSmVgjyDTrkN_PbNBh1QO/s1600/dallas-animal-shelters-multiple-cats-monthly-trends.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSZi_BgJqlWtTqOdrE5OmktBcdGqiWBz7wlsJVex55NUhD5-oklSwDjNThyphenhyphenLp6ZIXj98eg0WBKa2eKAP6hp2qeCohhWOWBozV3A3Ki6TCWksUQoySkXY5zLAIwSmVgjyDTrkN_PbNBh1QO/s640/dallas-animal-shelters-multiple-cats-monthly-trends.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq6ve3w-lOhNxCLpTSWMfzbTN-lqt3k0CKyaXzHZH1CfpbMxGYbMp8IlGnBL9E-L2onc10kNnhDgD4moz5cS3W2G5ahAaR8baMkSg9wAW4bEhEV8PLhn5xFBJggs9vLphJkKo-qznEAYG4/s1600/dallas-animal-shelters-multiple-dogs-monthly-trends.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq6ve3w-lOhNxCLpTSWMfzbTN-lqt3k0CKyaXzHZH1CfpbMxGYbMp8IlGnBL9E-L2onc10kNnhDgD4moz5cS3W2G5ahAaR8baMkSg9wAW4bEhEV8PLhn5xFBJggs9vLphJkKo-qznEAYG4/s640/dallas-animal-shelters-multiple-dogs-monthly-trends.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span></div>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Note that each plot is a 3 x 4 matrix - the same dimensions as the correlation matrices before. But instead of a correlation coefficient each cell contains a pair of monthly trends (in fact, each correlation was computed for these exact pairs of trends, hence the reference to its aggregation origin). Each row corresponds to an intake type (the same blue line in each) and each column to an outcome (the same red line in each). Being able to see trends over time, let's record a few observations (following the matrix order top down):</span></div>
<ul>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Confiscated</b> intake trends flat for both cats and dogs, with the only significant spike for dogs in January 2016. This spike is so unusual, relatively big, and contained within a single month or two that it begs additional investigation into a probable external event or procedural change that may have caused it.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The number of <b>Confiscated</b> animals is too low to noticeably affect outcomes. Still, if we could reduce the effect of other intake types, some relationships might emerge.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Owner Surrender</b> trend correlation with <b>Euthanized</b> outcome is so obvious that this type of visualization is sufficient to find it. Yes, it is unfortunate but people bring their old or unhealthy pets for a reason. </span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The same applies to <b>Stray</b> and <b>Owner Surrender</b> for cats only. </span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Owner Surrender</b> has a significant seasonal component, spiking in summer, possibly due to hot weather or holiday season or both. For cats only, the seasonal component is also strong in the <b>Stray</b> trend.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Euthanized</b> trends together with <b>Owner Surrender</b> which causes it to a large degree.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Stray</b> dogs trend slowly upwards in Dallas and it's alarming.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Adoption</b> also trends upwards but not steep enough to compensate for inflow of dogs into shelters. Targeted campaign to encourage more adoptions of pets in the city is due.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Transfer</b> outcome trending upward also compensates for the growth in stray dogs. It's not clear if it's positive or negative though, as there is no means to track what happens to dogs after transfer (or is there?).</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>Stray</b> trend for dogs dipped in January 2016 exactly when the <b>Confiscated</b> trend spiked - it could be a coincidence or related - for sure something to consider when investigating further.</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">For dogs <b>Euthanized</b> trend correlates strongly with <b>Stray</b> intake until the summer of 2016 when they start to diverge in opposite directions - again some policy or procedural change apparently caused it. Indeed, if we observe other outcomes we notice that <b>Returned to Owner</b> trend began its uptick at around the same time (indeed, after I observed this I found out about <a href="http://www.latimes.com/nation/la-na-stray-dogs-20160915-snap-story.html">this</a> and <a href="https://www.dallasnews.com/news/dallas-city-hall/2016/09/19/regime-change-dallas-animal-services-police-commanders-set-tackle-loose-dog-crisis">this</a> - significant changes in Dallas Animal Services leadership and policies around summer and fall of 2016).</span></li>
</ul>
<div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">I will be back with more analysis (survival analysis). R code for data processing, analysis, and visualizations from this post can be found <a href="http://rpubs.com/grigory/296999">here</a>.</span></div>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0Dallas, TX, USA32.7766642 -96.79698789999997632.3473512 -97.445181399999981 33.2059772 -96.148794399999971tag:blogger.com,1999:blog-7530218802939252476.post-53397029459969474242017-07-05T14:01:00.000-05:002017-07-05T17:00:21.474-05:00The Role of Small Data and Vacation Recap Example<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">Wikipedia defines <a href="https://en.wikipedia.org/wiki/Small_data">small data</a> as <i>'small' enough for human comprehension</i> but then goes further by qualifying it as <i>data in a volume and format that makes it accessible, informative and actionable</i>. I am not certain the latter is <b>always</b> true: a smaller footprint doesn't automatically qualify data as informative and actionable without more work. In my book, small data usually scales to kilobytes and has just a handful of dimensions. But its main feature remains <i>human comprehension</i>, which really means there is a simple story behind it. </span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">In the grand scheme of big data, the small data story is the last mile of data science analysis. It still requires interpretation (or representation) in the form of a visualization or an application.</span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<br />
<div>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">A case in point is the Google spreadsheet I kept while on vacation in Italy, with daily recordings of miles and steps walked. Later I added the main attractions for each day. The result was my personal small data covering about 2 weeks of touring Italy, with bases in Rome and later in Sicily (this sentence is the story):</span><span style="font-size: large;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwn-pZoGib9bkQTL6DKERnR978YzIXYLFZPANA2X6Bla45ZOfhHzr88sGBjll7PcNtUYWf4s0ngbXXIpXaR0ybJHty7tkvkhox3sf5Pm0zht2NV6RHLJgVA9F5kXNrMoBZ9GIK2qTibWHk/s1600/vacation-rome-siracusa-data-screen.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="511" data-original-width="863" height="377" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwn-pZoGib9bkQTL6DKERnR978YzIXYLFZPANA2X6Bla45ZOfhHzr88sGBjll7PcNtUYWf4s0ngbXXIpXaR0ybJHty7tkvkhox3sf5Pm0zht2NV6RHLJgVA9F5kXNrMoBZ9GIK2qTibWHk/s640/vacation-rome-siracusa-data-screen.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Google sheet of activities while on vacation in Italy</td></tr>
</tbody></table>
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">As is, this spreadsheet is destined for the Google archives, contributing to the ever-growing collection of docs I created and happily forgot about. So I created this visualization, which represents both most of the data and the story:</span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: large;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPVXUmDKMqv3m6VT8HqAU9SG3bRk-OuY6vHfCR39Umpr8s3p1bA6zIhFrFlreB9_9qGSmp7F4P7LnPkT-nSWxy-K8YokK4hlsJlt6Z3dP8z4twE0oVNUq0m6fNCEUkd4Gbs4PgWFojBf3g/s1600/Italy-vacation-Jun-2017.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="827" data-original-width="1200" height="441" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPVXUmDKMqv3m6VT8HqAU9SG3bRk-OuY6vHfCR39Umpr8s3p1bA6zIhFrFlreB9_9qGSmp7F4P7LnPkT-nSWxy-K8YokK4hlsJlt6Z3dP8z4twE0oVNUq0m6fNCEUkd4Gbs4PgWFojBf3g/s640/Italy-vacation-Jun-2017.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Small data visualization</td></tr>
</tbody></table>
<span style="font-family: "times" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">Before explaining how this visualization was created with R, I ought to acknowledge that Google spreadsheets support <a href="https://support.google.com/docs/answer/63728?co=GENIE.Platform%3DDesktop&hl=en">adding a chart or graph to a document</a>. But that functionality appears rather limited without resorting to the JavaScript API.</span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<span style="font-family: Times, Times New Roman, serif; font-size: x-large;">Using the R googlesheets package to source Google docs makes them an integral part of the data sources available from within R code:</span><span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span><script src="https://gist.github.com/grigory93/4c0d970a1ca47f22f251796edce5e635.js?file=vacation-recap-Italy-googledocs.R"></script><span style="font-family: "times" , "times new roman" , serif; font-size: medium;"><span style="font-size: x-large;">For details on how the code above authenticates with Google servers and processes documents, see the very detailed <a href="https://github.com/jennybc/googlesheets#vignettes">vignettes</a>. </span></span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: medium;"><br /><span style="font-size: x-large;">Now we can get back to small data and its simple story, which means that a single visualization may include most, if not all, of it. With small data the goal is to design such a chart without sacrificing clarity. </span><br /><br /><span style="font-size: x-large;"> The core attributes, days (</span><b style="font-size: xx-large;">Date</b><span style="font-size: x-large;">) and miles walked (</span><b style="font-size: xx-large;">Distance</b><span style="font-size: x-large;">; I chose miles over </span><b style="font-size: xx-large;">Steps</b><span style="font-size: x-large;"> for simplicity), suggest a line chart with the timeline along the </span><i style="font-size: xx-large;">x</i><span style="font-size: x-large;">-axis and distance on the </span><i style="font-size: xx-large;">y</i><span style="font-size: x-large;">-axis. But there are 2 more factors to incorporate: </span><b style="font-size: xx-large;">Place</b><span style="font-size: x-large;">, indicating where the base city was each day, and </span><b style="font-size: xx-large;">Label</b><span style="font-size: x-large;"> for major attractions. </span></span><span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">The base city is identified by color, with deep red for <a href="https://en.wikipedia.org/wiki/Rome">Rome</a> and olive green for <a href="https://en.wikipedia.org/wiki/Syracuse,_Sicily">Syracuse</a>. The text for major attractions is attached to each point, with justification chosen so that it fits inside the chart:</span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<script src="https://gist.github.com/grigory93/4c0d970a1ca47f22f251796edce5e635.js?file=vacation-recap-Italy-visual.R"></script>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">Had I kept a more detailed log, I would have ended up with more dimensions to use: for example, miles traveled by car or train, time spent at leisure versus touring, the number of cities and places visited, historical marker attributes, and so on. But that moves us further away from the small data domain: as footprint and dimensions grow, the story becomes less comprehensible. One indicator of this is that it becomes harder to collect the data manually. Instead, there are apps that would do it for me, for example, <a href="http://www.northcube.com/lifecycle/">Life Cycle</a> or Apple Health.</span><br />
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif; font-size: x-large;">Ultimately, any big data problem is reduced to one or more small data problems by aggregation, regression, clustering, or some other data science method. The path to big data insights is a journey from big to small data in search of a simple story. So learning how to deal with small data is where it all both ends and begins.</span></div>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-29666613991052998172017-06-23T11:11:00.000-05:002020-02-12T20:49:00.024-06:00Logarithmic Scale Explained with U.S. Trade Balance<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
Skewed data prevail in real life. Unless you observe trivial or near-constant processes, data is skewed one way or another due to outliers, long tails, errors, or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.</div>
<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
Consider the<span class="Apple-converted-space"> </span><a href="http://goo.gl/2nDR0p" rel="nofollow noopener" style="background: transparent; border: 0px; color: #8c68cb; cursor: pointer; font-family: inherit; font-size: 21px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: normal; line-height: inherit; margin: 0px; outline: none; padding: 0px; text-decoration: none; vertical-align: baseline; word-wrap: break-word;" target="_blank">U.S. 2016 merchandise trade partner balances</a><span class="Apple-converted-space"> </span>data set, where each point is a country with 2 features: U.S. imports from it and U.S. exports to it:<br />
</div>
<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
<span style="color: black;"><span style="display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">Suppose we decided to visualize the top 30 U.S. trading partners using a<span class="Apple-converted-space"> </span></span><a href="http://en.wikipedia.org/wiki/Bubble_chart" rel="nofollow noopener" style="background: transparent none repeat scroll 0% 0%; border: 0px none; cursor: pointer; font-family: "source serif pro",serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; outline: medium none; overflow-wrap: break-word; padding: 0px; text-decoration: none; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; word-spacing: 0px;" target="_blank">bubble chart</a><span style="display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">, which is simply a 2D<span class="Apple-converted-space"> </span></span><a href="http://en.wikipedia.org/wiki/Scatter_plot" rel="nofollow noopener" style="background: transparent none repeat scroll 0% 0%; border: 0px none; cursor: pointer; font-family: "source serif pro",serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; outline: medium none; overflow-wrap: break-word; padding: 0px; text-decoration: none; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; word-spacing: 0px;" target="_blank">scatter plot</a><span style="display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal;
letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span class="Apple-converted-space"> </span>with the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for<span class="Apple-converted-space"> </span></span><i style="background: transparent none repeat scroll 0% 0%; border: 0px none; font-family: "georgia","source serif pro",serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; outline: 0px none; padding: 0px; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; word-spacing: 0px;">xy</i><span style="display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span class="Apple-converted-space"> </span>coordinates and trade balance (</span><i style="background: transparent none repeat scroll 0% 0%; border: 0px none; font-family: "georgia","source serif pro",serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; outline: 0px none; padding: 0px; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; word-spacing: 0px;">abs(export - import)</i></span><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="color: black;">) for size:</span> </span></div>
<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh81rzRWx_GC6sWUtJHfDb3H8V1WdjxdOEqUQIOg-1JRLXp18G9fXFfQzWf45uXia5yA-D3xIV1G4-0Yh_gqIBrgTsgeL3UgVu3QSxYU-DBjcCDym6_gcnhG_6BCh7YrUxgs6Cu6UTkYsjY/s1600/US-Merchandise-Trade-Balance-top30-linear.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1200" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh81rzRWx_GC6sWUtJHfDb3H8V1WdjxdOEqUQIOg-1JRLXp18G9fXFfQzWf45uXia5yA-D3xIV1G4-0Yh_gqIBrgTsgeL3UgVu3QSxYU-DBjcCDym6_gcnhG_6BCh7YrUxgs6Cu6UTkYsjY/s640/US-Merchandise-Trade-Balance-top30-linear.png" width="640" /></a></div>
</div>
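<div>
The mapping behind the chart above can be sketched in a few lines. This is a minimal illustration in Python rather than the R used for the charts in this post. The partner names and figures are hypothetical placeholders (not the real 2016 data), and putting imports on <i>x</i> and exports on <i>y</i> is my assumption about the axis order:</div>

```python
# Hypothetical figures in billions of USD, for illustration only
# (these are NOT the real 2016 trade numbers).
partners = [
    {"country": "Partner A", "exports": 120.0, "imports": 460.0},
    {"country": "Partner B", "exports": 270.0, "imports": 280.0},
    {"country": "Partner C", "exports": 55.0, "imports": 45.0},
]


def bubble(partner):
    """Map one trading partner to bubble chart attributes: x, y, and size."""
    return {
        "country": partner["country"],
        "x": partner["imports"],  # assumed axis order: imports on x
        "y": partner["exports"],  # and exports on y
        # Size is the trade balance regardless of sign: abs(export - import).
        "size": abs(partner["exports"] - partner["imports"]),
    }


bubbles = [bubble(p) for p in partners]
for b in bubbles:
    print(b)
```

<div>
Because size uses the absolute value, a surplus and a deficit of the same magnitude produce disks of the same size, so the sign of the balance would have to be encoded some other way (for example, by color).</div>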
<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
<span style="color: black;"><span style="display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">China, Canada, and Mexico run far larger balances than the other 27 countries, which causes most data points to collapse into the crowded lower left corner. One way to "solve" this problem is to eliminate these 3 outliers from the picture:</span></span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKPM7rMZ8myf6BbYUOK33BlmaEB2Ui0nVD2r4JjyQ63tC7Miq1Fkino9QeYBywwdBRX8xMAnX3RJSXxifakqg4_p0ENSqHMHNGS8xsWE02jQ2T1cdDibYZyhMwjBEn6oEkznWkaERWFVIy/s1600/US-Merchandise-Trade-Balance-top27-linear.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1200" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKPM7rMZ8myf6BbYUOK33BlmaEB2Ui0nVD2r4JjyQ63tC7Miq1Fkino9QeYBywwdBRX8xMAnX3RJSXxifakqg4_p0ENSqHMHNGS8xsWE02jQ2T1cdDibYZyhMwjBEn6oEkznWkaERWFVIy/s640/US-Merchandise-Trade-Balance-top27-linear.png" width="640" /></a></div>
</div>
<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
While this plot does look better, it no longer serves its original purpose of displaying<span class="Apple-converted-space"> </span><b style="background: transparent; border: 0px; font-family: inherit; font-size: 21px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: bold; line-height: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">all</b><span class="Apple-converted-space"> </span>top trading partners. And the undesirable effect of outliers, though reduced, persists with new ones: Japan, Germany, and the U.K. So let us bring all countries back into the mix by trying a logarithmic scale.</div>
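<div>
To see why a logarithmic scale helps with crowding, here is a rough sketch in Python (not the R code behind the charts in this post). The values are hypothetical and only meant to span a few orders of magnitude, and axis_positions is a helper name I made up for the illustration. It computes where each value would land along an axis, as a fraction of the axis length, under a linear and a log base 10 scale:</div>

```python
import math

# Hypothetical values spanning several orders of magnitude
# (placeholders, not the real trade data).
values = [2.0, 5.0, 10.0, 40.0, 500.0]


def axis_positions(data, transform=lambda v: v):
    """Return each value's position along an axis as a fraction from 0 to 1."""
    scaled = [transform(v) for v in data]
    low, high = min(scaled), max(scaled)
    return [(s - low) / (high - low) for s in scaled]


linear = axis_positions(values)
logarithmic = axis_positions(values, math.log10)

for v, lin, lg in zip(values, linear, logarithmic):
    print(f"{v:>6}: linear={lin:.3f}  log10={lg:.3f}")
```

<div>
With these numbers the four smaller values are squeezed into the first tenth of the linear axis, while on the log axis they spread over more than half of it, and the largest value still sits at the far end.</div>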
<div style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: 32px; margin: 3.2rem 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">
A quick refresher from algebra: the<span class="Apple-converted-space"> </span><a href="https://en.wikipedia.org/wiki/Logarithm" rel="nofollow noopener" style="background: transparent; border: 0px; color: #8c68cb; cursor: pointer; font-family: inherit; font-size: 21px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: normal; line-height: inherit; margin: 0px; outline: none; padding: 0px; text-decoration: none; vertical-align: baseline; word-wrap: break-word;" target="_blank">log function</a><span class="Apple-converted-space"> </span>(in this example log base 10, but the same applies to natural log or log base 2) is commonly used to transform positive real numbers because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers<span class="Apple-converted-space"> </span><i style="background: transparent; border: 0px; font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">A</i>,<span class="Apple-converted-space"> </span><i style="background: transparent; border: 0px; font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">B</i>, and<span class="Apple-converted-space"> </span><i style="background: transparent; border: 0px; font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">C<span class="Apple-converted-space"> </span></i>such that<br />
<br />
<div style="text-align: left;">
`A*B=C and A,B,C > 0`</div>
<br />
applying <i style="background: transparent; border: 0px; font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">log<span class="Apple-converted-space"> </span></i> results in an additive relationship:<br />
<br />
<div style="text-align: left;">
`log(A) + log(B) = log(C)`</div>
<div style="text-align: left;">
<br />
<span style="color: #444444;"><span style="display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">For example, let<span class="Apple-converted-space"> </span></span></span><i style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">A=100</i><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">,<span class="Apple-converted-space"> </span></span><i style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">B=1000</i><span style="color: rgba(0 , 0 , 0 , 
0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">, <span style="color: black;">and</span><span class="Apple-converted-space"> </span></span><i style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: rgba(0, 0, 0, 0.7); font-family: Georgia, "Source Serif Pro", serif; font-size: 0.975em; font-stretch: inherit; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; orphans: 2; outline: 0px; padding: 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px;">C=100000</i><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span class="Apple-converted-space"> </span><span style="color: black;">then</span></span><br />
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="color: black;"> </span></span>
<br />
<div style="text-align: left;">
`100 * 1000 = 100000`<br />
<br /></div>
so that after transformation it becomes
<br />
<div style="text-align: left;">
<br />
`log(100) + log(1000) = log(100000)`
or
`2 + 3 = 5`<br />
<br />
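<div style="text-align: left;">
The identity above can be checked numerically. The following is a minimal Python sketch using only the standard library and the same three numbers from the example; the variable names are chosen here for illustration.</div>

```python
import math

# Values from the worked example above
A, B, C = 100, 1000, 100000

# Multiplicative relationship on the original scale
product_holds = (A * B == C)

# Additive relationship after the base-10 log transform
log_sum = math.log10(A) + math.log10(B)
log_c = math.log10(C)

print(product_holds, log_sum, log_c)
```

<div style="text-align: left;">
The same check works with <i>math.log</i> or <i>math.log2</i>, since the property does not depend on the base.</div>
<br />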
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">Observe this on a number line:</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-r0SE8_1WSw7UPW_R6ky7FayT8BaBRK_rm1hM_Nrgjkm5iw0Ff1jlQ9zVQOahUnLp_Ak6fhujhYCL9-ksr3gXiGJxtU8REmVH3_kQnBGwRilCEGm-9p6uIP0qAZgdIcc1a7ecF1DOsOQ1/s1600/before-log-transform.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="200" data-original-width="700" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-r0SE8_1WSw7UPW_R6ky7FayT8BaBRK_rm1hM_Nrgjkm5iw0Ff1jlQ9zVQOahUnLp_Ak6fhujhYCL9-ksr3gXiGJxtU8REmVH3_kQnBGwRilCEGm-9p6uIP0qAZgdIcc1a7ecF1DOsOQ1/s1600/before-log-transform.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYqmqxhpWUfRaN_03Ch_kHfoXo9yV-9rtpDXQ7cIxgh3u0erqTwnugQzV_b6YdSsxoyk9un3fV1HKOjcu64S6Pa-hw_ZfzkUnSpu7FvNGUoserRFeUDPZbPEK2pBxRsIdco-ZrriEwSTfL/s1600/after-log-transform.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="165" data-original-width="700" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYqmqxhpWUfRaN_03Ch_kHfoXo9yV-9rtpDXQ7cIxgh3u0erqTwnugQzV_b6YdSsxoyk9un3fV1HKOjcu64S6Pa-hw_ZfzkUnSpu7FvNGUoserRFeUDPZbPEK2pBxRsIdco-ZrriEwSTfL/s1600/after-log-transform.png" /> </a></div>
<div style="text-align: left;">
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"></span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><br /></span></div>
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">
</span></div>
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">A logarithmic scale is simply a log transformation applied to all values of a feature before plotting them. In our example we applied it to both features of the trading partners, imports and exports, which gives the bubble chart a new look:</span></div>
<div style="text-align: left;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjby9-Ehu_pIcUOkCWj1L6yJsfqXercIF1gCu0SrHOEIphzCAPR4cNK8qYuxo2mACjhwPg2eD8nQkZFYCm7f8Yne2_FQLyR-dLtGVR69g_N6XdVKBdSeBk-ajSJVT46IUqVoZTHNHXD_mKW/s1600/US-Merchandise-Trade-Balance-top30-log.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1200" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjby9-Ehu_pIcUOkCWj1L6yJsfqXercIF1gCu0SrHOEIphzCAPR4cNK8qYuxo2mACjhwPg2eD8nQkZFYCm7f8Yne2_FQLyR-dLtGVR69g_N6XdVKBdSeBk-ajSJVT46IUqVoZTHNHXD_mKW/s640/US-Merchandise-Trade-Balance-top30-log.png" width="640" /></a></span></div>
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">
</span><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"> </span></div>
<div style="text-align: left;">
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">The same data displayed on a logarithmic scale appear almost uniform, but keep in mind that the farther points are from 0, the more orders of magnitude apart they are on the actual scale (observe this by scrolling back to the original plot). The main advantage of using a log scale in this plot is the ability to observe relationships among all top 30 countries without losing the whole picture and without collapsing smaller points together.</span></div>
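<div style="text-align: left;">
To make the "orders of magnitude" point concrete, here is a small Python sketch. The partner names and dollar amounts are hypothetical, invented for illustration, and are not the trade data behind the chart.</div>

```python
import math

# Hypothetical export volumes in USD (illustrative only, not the chart's data)
exports = {
    "small partner": 2e9,
    "mid partner": 2e10,
    "large partner": 2e11,
}

# Position of each point on a base-10 logarithmic axis
log_positions = {name: math.log10(value) for name, value in exports.items()}

for name, position in log_positions.items():
    print(f"{name}: {position:.2f}")

# One unit of distance on the log axis is a factor of 10 on the original scale
gap = log_positions["large partner"] - log_positions["mid partner"]
```

<div style="text-align: left;">
On the original scale the three values span two orders of magnitude, yet on the log axis they sit at evenly spaced positions one unit apart.</div>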
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;">For a more detailed discussion of the logarithmic scale refer to<span class="Apple-converted-space"> </span></span><a href="https://www.forbes.com/sites/naomirobbins/2012/01/19/when-should-i-use-logarithmic-scales-in-my-charts-and-graphs/#60dca2055e67" rel="nofollow noopener" style="-webkit-text-stroke-width: 0px; background: transparent; border: 0px; color: #8c68cb; cursor: pointer; font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-variant-numeric: inherit; font-weight: normal; letter-spacing: normal; line-height: inherit; margin: 0px; orphans: 2; outline: none; padding: 0px; text-align: start; text-decoration: none; text-indent: 0px; text-transform: none; vertical-align: baseline; white-space: normal; widows: 2; word-spacing: 0px; word-wrap: break-word;" target="_blank">When Should I Use Logarithmic Scales in My Charts and Graphs?</a><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span class="Apple-converted-space"> </span>Oh, and how about that trade deficit with China?</span></span></div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"><span style="font-size: small;"><span style="font-family: "georgia" , "source serif pro" , serif;"><i>This is a re-post from the original <a href="https://www.linkedin.com/pulse/logarithmic-scale-action-example-gregory-kanevsky">blog</a> on LinkedIn.</i></span></span></span> </span><br />
<span style="color: rgba(0 , 0 , 0 , 0.7); display: inline; float: none; font-family: "source serif pro" , serif; font-size: 21px; font-style: normal; font-weight: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;"></span></div>
</div>
</div>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-69722868081028509562017-05-26T00:46:00.001-05:002021-08-27T10:36:10.885-05:00MapReduce in Two Modern Paintings<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">Two years ago we had a rare family outing at the <a class="jive-link-external-small" href="http://www.dma.org" rel="nofollow" target="_blank">Dallas Museum of Art</a>
(my son is a teenager and he's into sport after all). DMA hosted an excellent
exhibition of modern art and allowed taking pictures. Two hours and
a dozen pictures later my weekend was over, but thanks to Google Photos
I stumbled upon those pictures again. Suddenly, I realized that
two of the paintings I captured make up an illustration of one of the most
important frameworks in big data - MapReduce. </span><br />
<div style="min-height: 8pt; padding: 0px;">
<br /></div><p>
<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">There
are multiple papers, tutorials and web pages about it, and to
truly understand and use a framework like this one should study at least a few of them thoroughly. There are also illustrations of the MapReduce architecture and principles <a href="https://www.google.com/search?q=mapreduce+principle&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjqpr7Pv9HyAhXkc98KHc2IAEgQ_AUoAXoECAEQAw&biw=1680&bih=940">out there</a> too.</span><br />
<br />
<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">But art can express more with less, and with just the two paintings below I will try to illustrate this for MapReduce. </span><span style="font-family: "times new roman" , "times" , serif; font-size: 22px;"> </span></p><p><span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">First, we have the work by Erró, </span><span style="font-family: "times new roman" , "times" , serif; font-size: 22px;"><i>Foodscape</i>, 1964:</span><br />
<br />
</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGtWpwnvAspjMCLClyBSq745zoCOgc9EPJP-huA-JasskrPe_TeJOWRrPksAhXOgjgQHmocZznRMP3UwPYGCmZ3eF33sap5G5Gkv4vXZqizZN8EiT9J8ayWpwr5cwH58mDuM4UWyTyFqhd/s1600/IMG_5659.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGtWpwnvAspjMCLClyBSq745zoCOgc9EPJP-huA-JasskrPe_TeJOWRrPksAhXOgjgQHmocZznRMP3UwPYGCmZ3eF33sap5G5Gkv4vXZqizZN8EiT9J8ayWpwr5cwH58mDuM4UWyTyFqhd/s640/IMG_5659.JPG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div class="artwork-metadata__artist">
<a class="entity-link" href="https://www.artsy.net/artist/erro">Erró</a></div>
<span class="artwork-metadata__title"><i>Foodscape</i>, 1964</span><br />
<div class="artwork-metadata__medium">
Oil on canvas </div>
</td></tr>
</tbody></table>
<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">It
illustrates variety, richness, potential for insight (if consumed
properly), and of course, scale. The painting is boundless,
with no end to the table surface in any of the four directions. Observe the many types of food and drink, packaging and presentation, varying in color, texture and origin </span><span style="font-family: "times new roman", times, serif; font-size: 22px;">(better quality image found </span><a class="jive-link-external-small" href="https://www.perrotin.com/artists/Erro/239/foodscape/35749" rel="nofollow" style="font-family: "times new roman", times, serif; font-size: 22px;" target="_blank">here</a><span style="font-family: "times new roman", times, serif; font-size: 22px;">)</span><span style="font-family: "times new roman", times, serif; font-size: 22px;">. Thus the painting represents big data so much better than
any flowchart or diagram.</span><br />
<div style="min-height: 8pt; padding: 0px;">
<br /></div>
<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">The second and final painting is by Wayne Thiebaud, <span><i>Salads, Sandwiches, and Desserts</i>, 1962</span>:</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsbDpBDLZtTt5iXHzNCH0cw3WqHX_XdmUL9VR3w_Fzy5vKajVOdn7APNQuSDy9loCNkKMBmH7DCpekjmzv2iTOAiR1B4ZbOzZEURz_ZXJGbqbVPBDmrbGR-KNaBoxzEbk8WhymioJJ1hpM/s1600/IMG_5658.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsbDpBDLZtTt5iXHzNCH0cw3WqHX_XdmUL9VR3w_Fzy5vKajVOdn7APNQuSDy9loCNkKMBmH7DCpekjmzv2iTOAiR1B4ZbOzZEURz_ZXJGbqbVPBDmrbGR-KNaBoxzEbk8WhymioJJ1hpM/s640/IMG_5658.JPG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div class="artwork-metadata__artist">
<a class="entity-link" href="https://www.artsy.net/artist/wayne-thiebaud">Wayne Thiebaud</a></div>
<span class="artwork-metadata__title"><i>Salads, Sandwiches, and Desserts</i>, 1962</span><br />
<div class="artwork-metadata__medium">
Oil on canvas</div>
</td></tr>
</tbody></table>
<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">If we think of how MapReduce works, this seemingly infinite table, fittingly resembling a
conveyor belt, looks like the result of split-apply-combine on food items from the <i>Foodscape</i> universe. Indeed, each vertical group consists of finished and plated food of the same type, combined into variably sized groups and ready to serve (better quality image found <a class="jive-link-external-small" href="https://arthive.com/artists/11194~Wayne_Thiebaud/works/478149~Salads_sandwiches_and_desserts" rel="nofollow" target="_blank">here</a>). One can imagine the invisible hand of a MapReduce process grouping and arranging items as they flow along the conveyor belt.</span><br />
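<div>
For readers who prefer code to canvas, the same idea can be sketched in a few lines of Python. This is a toy, single-machine illustration of the map, shuffle and reduce steps, not a distributed implementation; the list of food items and the function names are made up for the example.</div>

```python
from collections import defaultdict

# Hypothetical items standing in for the unordered food in Foodscape
foodscape = ["salad", "sandwich", "pie", "salad", "cake", "sandwich", "salad"]


def map_phase(items):
    """Emit a (key, value) pair for every input item."""
    return [(item, 1) for item in items]


def shuffle_phase(pairs):
    """Group all values that share a key, like dishes lined up by type."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group into a single result, here a count per dish."""
    return {key: sum(values) for key, values in groups.items()}


plated = reduce_phase(shuffle_phase(map_phase(foodscape)))
print(plated)
```

<div>
In a real MapReduce system each phase runs across many machines, which is the part the paintings, and this sketch, leave out.</div>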
<div style="min-height: 8pt; padding: 0px;">
<br /></div>
<span style="font-family: "times new roman" , "times" , serif; font-size: 22px;">As with any art there is much about MapReduce that was left out of the picture. That's why we still have <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">papers</a>, <a href="https://www.amazon.com/MapReduce-Design-Patterns-Effective-Algorithms/dp/1449327176/">books</a>, and <a href="https://en.wikipedia.org/wiki/MapReduce">Wikipedia</a>. And again, I'd like to remind you of the importance of taking your kids to a museum.</span>Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-38719395616282906602016-12-20T14:04:00.000-06:002016-12-21T10:19:47.559-06:00Correlation Primer with Aster and R<span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;">Calculating correlations is often the starting point before more advanced analytical steps take place. Big data (long data) always presents computational challenges of both scale and distributed nature. These, in turn, may be aggravated by the presence of a large number of features (wide data). </span><span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;">But the challenges do not stop here, as complex relationships call for analysis of correlations across subsets and groups.</span><br />
<span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;"><br /></span>
<span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;">Such a mix of long and wide data is becoming more common in the age of the internet of things, sensor and machine data, with non-human data sources dominating analytical use cases. </span><br />
<span style="color: #444444; font-family: inherit;">
<span style="font-family: "source serif pro" , serif; font-size: 21px;">Thus, when computing correlations on big data the following capabilities matter:</span></span><br />
<ul>
<li><span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;"><span style="font-family: inherit; font-size: 21px;">scale on large distributed data sets (long data)</span></span></li>
<li><span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;"><span style="font-family: inherit; font-size: 21px;">scale on wide distributed data sets (wide data / large number of features)</span></span></li>
<li><span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;"><span style="font-family: inherit; font-size: 21px;">flexibility on wide data sets (ability to permute features such as Cartesian combinations, one-to-many, etc.)</span></span></li>
<li><span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;"><span style="font-family: inherit; font-size: 21px;">correlations on subsets and groups.</span></span></li>
</ul>
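<div>
The last two capabilities, feature combinations and per-group correlations, can be illustrated at toy scale with plain Python. This is only a sketch to show what is being computed: the rows, group labels and column names are hypothetical, and nothing here addresses the scale requirements in the first two bullets.</div>

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical rows: (group, a, b, c)
rows = [
    ("g1", 1.0, 2.0, 9.0),
    ("g1", 2.0, 4.0, 7.0),
    ("g1", 3.0, 6.0, 8.0),
    ("g2", 1.0, 3.0, 1.0),
    ("g2", 2.0, 2.0, 2.0),
    ("g2", 3.0, 1.0, 3.0),
]
columns = {"a": 1, "b": 2, "c": 3}  # column name -> position in a row


def correlations_by_group(rows, columns):
    """Correlate every pair of columns separately within each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    result = {}
    for group, members in groups.items():
        for left, right in combinations(columns, 2):
            x = [r[columns[left]] for r in members]
            y = [r[columns[right]] for r in members]
            result[(group, left, right)] = pearson(x, y)
    return result


result = correlations_by_group(rows, columns)
for key, value in result.items():
    print(key, round(value, 3))
```

<div>
Note how columns <i>a</i> and <i>b</i> are perfectly correlated in one group and perfectly anti-correlated in the other, something a single correlation over the whole table would hide.</div>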
<div>
<span style="color: #444444; font-family: "source serif pro" , serif; font-size: 21px;"><span style="font-family: inherit; font-size: 21px;">Correlation in R comes standard with the <b>stats</b> function <i>cor</i>, but it doesn't provide most of the capabilities above. As always, the <a href="http://www.teradata.com/products-and-services/analytics-from-aster-overview/">Teradata Aster big data analytical platform</a> offers both scalability and functionality far exceeding the capabilities above. And thanks to the Aster R (<b>TeradataAsterR</b>) package it is available without leaving the R environment.</span></span></div>
<span style="font-family: inherit;"><br /></span>
<span style="color: #444444; font-family: "source" serif "pro" , serif;"><span style="font-family: "source serif pro" , serif; font-size: 21px;">With Aster and R integration there are multiple ways to compute correlations on datasets. Before sending you to the link for a detailed discussion, here is a summary of the approaches discussed there, by capability:</span></span><br />
<br />
<style type="text/css">.nobrtable br { display: none }</style>
<br />
<div class="nobrtable">
<table border="2" bordercolor="#0033FF" cellpadding="3" cellspacing="0" style="background-color: #99ffff; border-collapse: collapse; width: 100%;">
<tbody>
<tr>
<th>Method / Solution features</th>
<th>Variable (columns) Permutations</th>
<th>Calculating for Groups</th>
<th>SQL-MR</th>
<th>In-database R</th>
</tr>
<tr>
<td><b>Aster R</b> <i>ta.cor</i></td>
<td><div style="text-align: center;">
<b>N</b></div>
</td>
<td><div style="text-align: center;">
<b>N</b></div>
</td>
<td><div style="text-align: center;">
<b>Y</b></div>
</td>
<td><div style="text-align: center;">
<b>N</b></div>
</td>
</tr>
<tr>
<td><b>Aster R</b> in-database <i>ta.tapply</i></td>
<td><div style="text-align: center;">
<b>N</b></div>
</td>
<td><div style="text-align: center;">
<b>Y</b></div>
</td>
<td><div style="text-align: center;">
<b>N</b></div>
</td>
<td><div style="text-align: center;">
<b>Y</b></div>
</td>
</tr>
<tr>
<td><b>toaster</b> <i>computeCorrelations</i></td>
<td><div style="text-align: center;">
<b>Y</b></div>
</td>
<td><div style="text-align: center;">
<b>Y</b></div>
</td>
<td><div style="text-align: center;">
<b>Y</b></div>
</td>
<td><div style="text-align: center;">
<b>N</b></div>
</td>
</tr>
</tbody>
</table>
</div>
<br />
<div>
<span style="color: rgba(0 , 0 , 0 , 0.701961); font-family: "source serif pro" , serif; font-size: 21px;">Please visit my latest RPubs </span><a href="http://goo.gl/XEkAKj" rel="nofollow noopener" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border: 0px; color: #8c68cb; cursor: pointer; font-family: "Source Serif Pro", serif; font-size: 21px; font-stretch: inherit; font-variant-numeric: inherit; line-height: inherit; margin: 0px; outline: none; padding: 0px; text-decoration: none; vertical-align: baseline; word-wrap: break-word;" target="_blank">post</a><span style="color: rgba(0 , 0 , 0 , 0.701961); font-family: "source serif pro" , serif; font-size: 21px;"> for a detailed discussion and comparison of these methods.</span><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0Dallas, TX, USA32.7766642 -96.79698789999997632.3496592 -97.442434899999981 33.2036692 -96.151540899999972tag:blogger.com,1999:blog-7530218802939252476.post-22402766310017348752016-05-31T15:08:00.002-05:002016-05-31T16:18:41.851-05:00Running similar but independent jobs in parallel on Aster with R<span style="font-size: large;">No surprise that Teradata Aster runs each SQL, SQL-MR, and
SQL-GR command in parallel on many clusters with distributed data. But
when faced with the task of running many similar but independent jobs one has to do extra work to parallelize them in Aster. When running a SQL
script the next command has to wait for the previous to finish. This makes
sense when commands form a pipeline, with the results of each job passed down to the next one. But what if the jobs are independent and each produces its own results? For example, <a class="jive-link-external-small" href="https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29" rel="nofollow" target="_blank">cross-validation</a> of linear regression or other models is divided into independent jobs each working with its respective partition (of total <em>K</em> in case of <a class="jive-link-external-small" href="https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#k-fold_cross-validation" rel="nofollow" target="_blank">K-fold cross-validation</a>). These jobs could run in parallel in Aster with a little help from R. This post will illustrate how to run <strong><em>K</em></strong> linear regression models <strong>in parallel</strong> in Aster as part of the <em>K</em>-fold cross-validation procedure.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<h3 id="jive_content_id_The_Problem">
<span style="font-size: large;">
The Problem</span></h3>
<div>
<span style="font-size: large;">Cross-validation is an important technique in machine learning that receives its own chapters in textbooks (e.g. see Chapter 7 <a class="jive-link-external-small" href="http://statweb.stanford.edu/~tibs/ElemStatLearn/" rel="nofollow" target="_blank">here</a>). In our examples we implement a <em>K</em>-fold
cross-validation method to demonstrate how to run parallel jobs in
Aster with R. The implementation of K-fold cross-validation that will be
given is neither exhaustive nor exemplary, as it introduces a certain bias
(based on month of the year) into the models. But this approach could
definitely lead to a general solution for cross-validation and other
problems involving execution of many similar but independent tasks on
the Aster platform.</span><br />
<div style="min-height: 8pt; padding: 0px;">
<span style="font-size: large;"><br /></span></div>
<span style="font-size: large;">Furthermore, the examples will be concerned only with the step in <em>K</em>-fold cross-validation that creates <em><strong>K</strong></em> models on overlapping but different partitions of the training dataset. We will show how to construct <strong><em>K</em></strong> independent linear regression models in parallel on Aster, each for one of the <strong><em>K</em></strong> partitions of the table (not the same as table partitioning in Aster).</span><br />
<span style="font-size: large;"><br /></span>
<br />
<h3 id="jive_content_id_Data_and_R_Packages">
<span style="font-size: large;">
Data and R Packages</span></h3>
<span style="font-size: large;">We will use the Dallas Open Data dataset available from <a class="jive-link-external-small" href="https://github.com/teradata-aster-field/toaster/wiki/Demo-and-Examples#dallas-open-data-dataset" rel="nofollow" target="_blank">here</a> (including Aster load scripts).</span><br />
<span style="font-size: large;">To simplify the examples we will also use the R package <a class="jive-link-external-small" href="https://cran.r-project.org/web/packages/toaster/index.html" rel="nofollow" target="_blank"><strong>toaster</strong></a> for Aster and several other packages - all available from CRAN:</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=install-packages.R"></script>
</div>
<br />
<h3 id="jive_content_id_Data_set_Model_and_K_Folds">
<span style="font-size: large;">
Data set, Model and K Folds</span></h3>
<div>
<span style="font-size: large;">Dallas Open Data has information on building permits across the city of
Dallas for the period from January 2012 through May 2014, stored in
the table <span style="font-family: "courier new" , "courier" , monospace;">dallasbuildingpermits</span>. We can quickly analyze this table from R with toaster and see its numerical columns:</span><br />
<span style="font-size: large;"><br /></span></div>
<div>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=data-set-model-and-K-folds.R"></script>
</div>
<div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">which results in:</span></div>
<blockquote class="tr_bq">
<span style="font-size: large;">[1] "area" "value" "lon" "lat"</span></blockquote>
<span style="font-size: large;">These 4 fields will make up our simple linear model to determine the
value of construction using its area and location. And now the same in R terms:</span><br />
<span style="font-size: large;"><br /></span>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=linreg-formula-in.R"></script>
<br />
<span style="font-size: large;">This problem is not beyond R memory limits, but our goal is to execute linear regression in Aster. We enlist <strong>toaster</strong>'s <em>computeLm</em> function, which returns an R <em>lm</em> object:</span><br />
<span style="font-size: large;"><br /></span>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=linreg-with-toaster.R"></script><span style="font-size: large;">
Lastly, we need to define the folds (partitions) on the table to build
a linear regression model on each of them. Usually, this step performs
equal and random division into partitions. Doing this with R and Aster
is actually not extremely difficult but will take us beyond the scope of
the main topic. For this reason alone we propose a <strong>quick and dirty</strong> method of dividing building permits into 12 partitions (<strong><em>K=12</em></strong>) using issue date's month value (in SQL):
</span><br />
<span style="font-size: large;"><br /></span>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=k-folds-create.sql"></script><span style="font-size: large;">
Again, do not replicate this method in a real cross-validation task; use it as a template or a prototype only.</span><br />
<span style="font-size: large;">To make each fold's complement (used to train 12 models later) we simply
exclude each month's data, e.g. selecting the complement of fold 6
in its entirety (in SQL):</span><br />
<span style="font-size: large;"><br /></span>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=k-folds-select.sql"></script>
<br />
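<span style="font-size: large;">The same complement selection can be generated for every fold from R. The following is only a minimal sketch: the column name <i>fold</i> is a hypothetical placeholder for whatever column the SQL above creates, and the resulting strings are the kind of predicate one would pass as a <i>where</i> condition:</span><br />
<pre class="r" style="font-family: monospace;"># Hypothetical fold column name; substitute the column created by the SQL above
fold.column = "fold"
K = 12

# One SQL predicate per fold: keep every row that is NOT in fold k
where.complements = sprintf("%s != %d", fold.column, 1:K)

# Predicate selecting the complement of fold 6
where.complements[6]
</pre>
<br />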
<h3 id="jive_content_id_Computing_CrossValidation_Models_in_Aster_with_R">
<span style="font-size: large;">Computing Cross-Validation Models in Aster with R</span></h3>
<div>
<span style="font-size: large;">Before we get to parallel execution with R, we show how to script
the Aster cross-validation of linear regression in R. To begin, we use a standard R <em><strong>for</strong></em> loop and <em>computeLm</em> with the argument <em><strong>where</strong></em>, which limits data to the required fold just like in the SQL example above:</span></div>
<span style="font-size: large;"><br /></span>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=execute-cv-seq.R"></script>
<br />
<span style="font-size: large;">This results in the list <em><strong>fit.folds</strong></em> that contains 12 linear regression models, one per fold.</span><br />
<span style="font-size: large;">Next, we replace the <em><strong>for</strong> </em>loop with the specialized <em>foreach</em>
function designed for parallel execution in R. There is no parallel
execution yet, but all the structure needed for the transition to parallel
processing is in place:</span><br />
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=execute-cv-seq-par.R"></script>
<br />
<span style="font-size: large;"><em>foreach</em> performs the same iterations from 1 to 12 as the <em><strong>for</strong> </em>loop and combines the results into a list by default.</span><br />
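<span style="font-size: large;">For readers without access to Aster, the same loop structure can be tried locally. This is a sketch under stated assumptions, not the code from the gist: the data frame <i>permits</i> is synthetic (made-up values), the <i>fold</i> column is hypothetical, and base R <i>lm</i> stands in for <i>computeLm</i>:</span><br />
<pre class="r" style="font-family: monospace;">library(foreach)

# Synthetic stand-in for the permits table (values are made up for illustration)
set.seed(42)
permits = data.frame(area = runif(240, 500, 5000),
                     lon  = runif(240, -97.0, -96.5),
                     lat  = runif(240, 32.6, 33.0),
                     fold = rep(1:12, times = 20))
permits$value = 90 * permits$area + rnorm(240, sd = 5000)

# Sequential foreach: one model per fold, each trained on the fold's complement
fit.folds = foreach(fold = 1:12) %do% {
  lm(value ~ area + lon + lat, data = permits[permits$fold != fold, ])
}

length(fit.folds)       # number of models in the resulting list
class(fit.folds[[1]])   # class of each element
</pre>
<span style="font-size: large;">Because no <i>.combine</i> argument is given, the results come back as a plain list, which is the default behavior described above.</span><br />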
<span style="font-size: large;"><br /></span>
<br />
<h3 id="jive_content_id_Parallel_Computing_in_Aster_with_R">
<span style="font-size: large;">
Parallel Computing in Aster with R</span></h3>
</div>
<div>
<span style="font-size: large;">Finally, we are ready to enable parallel execution in R. For more details on using package <a class="jive-link-external-small" href="https://cran.r-project.org/web/packages/doParallel/index.html" rel="nofollow" target="_blank"><strong>doParallel</strong> </a>see <a class="jive-link-external-small" href="https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf" rel="nofollow" target="_blank">here</a>, but the following suffices to enable a parallel backend in R on Windows:</span></div>
<div>
<span style="font-size: large;"><br /></span></div>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=execute-cv-par-setup.R"></script>
<br />
<div>
<span style="font-size: large;"><br /></span></div>
<div>
<span style="font-size: large;">After that, <em>foreach</em> with the operator <em>%dopar%</em> automatically recognizes the parallel backend <strong><em>cl</em> </strong>and runs its iterations in parallel:</span></div>
<div>
<span style="font-size: large;"><br /></span></div>
<script src="https://gist.github.com/grigory93/060b1788f498275f86d311b0e0dac26f.js?file=execute-cv-par-run.R"></script>
<br />
<div>
<span style="font-size: large;"><br /></span></div>
<div>
<span style="font-size: large;">Compared with the earlier <em>foreach</em> <em>%do%</em>, notice the extra handling of the ODBC connection inside <em>foreach %dopar%</em>.
This is necessary because the same database
connection cannot be shared between parallel processes (or threads, depending on the
backend implementation). Effectively, each iteration reconnects to
Aster with a brand new connection, reusing the original connection's
configuration via the function <em>odbcReConnect</em>.</span><br />
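<span style="font-size: large;">The overall pattern can be sketched locally without a database. The example below is an illustration only: it assumes two local workers, uses synthetic data and base R <i>lm</i> in place of <i>computeLm</i>, and marks with a comment the place where a per-iteration connection would be opened:</span><br />
<pre class="r" style="font-family: monospace;">library(doParallel)   # also attaches foreach and parallel

# Two local workers are an arbitrary choice for this sketch
cl = makeCluster(2)
registerDoParallel(cl)

# Synthetic data with a hypothetical fold column (values are made up)
set.seed(42)
permits = data.frame(area = runif(240, 500, 5000),
                     fold = rep(1:12, times = 20))
permits$value = 90 * permits$area + rnorm(240, sd = 5000)

fit.folds = foreach(fold = 1:12) %dopar% {
  # In the Aster version a fresh ODBC connection would be opened here,
  # used for this iteration only, and closed before the iteration returns.
  lm(value ~ area, data = permits[permits$fold != fold, ])
}

# Always release the workers when done
stopCluster(cl)
length(fit.folds)
</pre>
<span style="font-size: large;">Opening the resource inside the loop body keeps every worker independent, at the price of one connection setup per iteration.</span><br />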
<span style="font-size: large;"><br /></span>
<br />
<h3>
<span style="font-size: large;">
Elapsed Time</span></h3>
</div>
<div>
<span style="font-size: large;">Lastly, let's see if the whole thing was worth the effort. The chart below
shows elapsed times (in seconds) for all 3 types of loops: the <em><strong>for</strong></em> loop in R, <em>foreach %do%</em> (sequential), and <em>foreach %dopar%</em> (parallel):</span><br />
<span style="font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDKIhnA3BkruLvakX1Q5W15hXMN7I8AtMFYFYUylW_zOoeMOqSJFEjJmkhvFFyevfgwTTgoq8TOSniH_MHQ7eAFB-sM96Hjs-aJrV7uhTgtmgTSON9aCaTLwo6UFoq7XxPK-vTOK6uKdUk/s1600/parallel-exec-in-Aster-with-R.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDKIhnA3BkruLvakX1Q5W15hXMN7I8AtMFYFYUylW_zOoeMOqSJFEjJmkhvFFyevfgwTTgoq8TOSniH_MHQ7eAFB-sM96Hjs-aJrV7uhTgtmgTSON9aCaTLwo6UFoq7XxPK-vTOK6uKdUk/s1600/parallel-exec-in-Aster-with-R.png" /></span></a></div>
<div>
<br /></div>
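<span style="font-size: large;">Elapsed times of this kind can be collected with base R's <i>system.time</i>. The following is a minimal sketch on a stand-in workload: the function <i>fit.one.fold</i> is hypothetical and simply pauses instead of calling Aster, so the numbers it produces say nothing about the chart above:</span><br />
<pre class="r" style="font-family: monospace;">library(foreach)

# Hypothetical stand-in for one fold's model fit; the 0.05 second pause is arbitrary
fit.one.fold = function(fold) {
  Sys.sleep(0.05)
  fold
}

# Time the plain for loop
elapsed.for = system.time({
  results.for = list()
  for (fold in 1:12) results.for[[fold]] = fit.one.fold(fold)
})["elapsed"]

# Time the sequential foreach loop
elapsed.do = system.time({
  results.do = foreach(fold = 1:12) %do% fit.one.fold(fold)
})["elapsed"]

# Elapsed seconds side by side
round(c(for.loop = unname(elapsed.for), foreach.do = unname(elapsed.do)), 2)
</pre>
<br />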
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-22427492462033833672016-04-24T15:49:00.001-05:002016-04-29T10:25:13.296-05:00Map of the Windows Fonts Registered with R<span style="font-size: large;">If you have already found the package <b>extrafont</b>, you probably know how to load and use Windows fonts in R visualizations. But just in case, everything needed to get started with <b>extrafont</b> is found <a href="https://github.com/wch/extrafont" target="_blank">here</a> and is summarized below for using Windows fonts in on-screen or bitmap output:</span><br />
<br />
<div>
<script src="https://gist.github.com/grigory93/50f613f3fc8aea94a7eba4953f8a3ad7.js?file=windows-fonts-r.R"></script>
</div>
<br />
<span style="font-size: large;">One thing to add is a summary of all Windows fonts registered in R. This will come in handy when designing new visualizations and deciding which font, or combination of fonts and their faces, to use. The code below produces a table where rows are fonts and columns are faces, with the font name printed in each cell using both the font and the face (if available):</span><br />
<span style="font-size: large;"><br /></span>
<div>
<script src="https://gist.github.com/grigory93/50f613f3fc8aea94a7eba4953f8a3ad7.js?file=font-table-map-summary.R"></script>
</div>
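<span style="font-size: large;">The idea behind such a table can be sketched with <b>ggplot2</b> alone. This is not the code from the gist but a reduced illustration: it assumes only the three generic families available on every R graphics device (<i>sans</i>, <i>serif</i>, <i>mono</i>) instead of the registered Windows fonts, and draws each family name in each face:</span><br />
<pre class="r" style="font-family: monospace;">library(ggplot2)

# Generic device families used as placeholders for registered Windows fonts
fonts.df = expand.grid(family = c("sans", "serif", "mono"),
                       face = c("plain", "bold", "italic", "bold.italic"),
                       stringsAsFactors = FALSE)

# One cell per font/face pair; the label is drawn in that font and face
p = ggplot(fonts.df, aes(x = face, y = family, label = family)) +
  geom_text(aes(family = family, fontface = face), size = 6) +
  labs(x = "Face", y = "Font") +
  theme_minimal()

print(p)
</pre>
<span style="font-size: large;">Replacing the <i>family</i> vector with the font names registered through <b>extrafont</b> would extend the same grid to Windows fonts.</span><br />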
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">The resulting table is this handy visual:</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkERQj0fbfDAKD3uUUtmE2qzkqPkSpHsqRhHQBgRXu0TVKODBzMSwiH78-uV7k_nLzG81BkWTwKFd23T8jfLBIl2GLVhe30l7CQ4p6NLSR23UbMGFFmz32J1_anNZvDWHGk5oe-jreYRXK/s1600/font_ggplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkERQj0fbfDAKD3uUUtmE2qzkqPkSpHsqRhHQBgRXu0TVKODBzMSwiH78-uV7k_nLzG81BkWTwKFd23T8jfLBIl2GLVhe30l7CQ4p6NLSR23UbMGFFmz32J1_anNZvDWHGk5oe-jreYRXK/s1600/font_ggplot.png" /></a></div>
<br />
<span style="font-size: large;">You can download this image or produce your own with the code above.</span>Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-1543104401257703882016-04-16T02:05:00.000-05:002020-02-29T15:58:19.351-06:00Creating and Tweaking Bubble Chart with ggplot2<span style="font-size: large;">This article will take us step-by-step over incremental changes to produce a bubble chart using <b>ggplot2</b> that looks like this:</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwboLEL1iq4MfTrj8Wq2rzB_n3fwCeQo0My5xUGRK8JhQA9c79tHGPhw3LxsVt1oSysSxH_ySJ-ofmU9jd04sWHWFD7YU8uMFKTeXmJAO6smnd2B1ciZpt1x7D692RZpYlACcnc4Mkrf4O/s1600/toaster-vs-TeradataAsterR.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwboLEL1iq4MfTrj8Wq2rzB_n3fwCeQo0My5xUGRK8JhQA9c79tHGPhw3LxsVt1oSysSxH_ySJ-ofmU9jd04sWHWFD7YU8uMFKTeXmJAO6smnd2B1ciZpt1x7D692RZpYlACcnc4Mkrf4O/s1600/toaster-vs-TeradataAsterR.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: Georgia, "Times New Roman", serif;">Data and Setup </span></span></h2>
<span style="font-size: large;">We'll encounter the plot above once again at the very end, after explaining each step with code changes and observing the intermediate plots. Without getting into the details of what it means (the curious reader can find out <a href="https://github.com/teradata-aster-field/toaster/wiki#r-packages-for-aster" target="_blank">here</a>), the dataset behind it is defined as:</span><br />
<br />
<div>
<script src="https://gist.github.com/grigory93/f370c5eb997fc74b7b7ec83e73d4dffa.js?file=ggplot2-data.R"></script>
</div>
<br />
<span style="font-size: large;">It contains 2 data points and 4 attributes: three numerical (<i>Aster_experience</i>, <i>R_experience</i>, and <i>coverage</i>) and one categorical (<i>product</i>). Remember that <b>the data won't change a bit</b> while the plot progression unfolds.</span><br />
<span style="font-size: large;"><br /></span>
<h2>
<span style="font-family: Georgia, "Times New Roman", serif;"><span style="font-size: large;">As-Is Scatterplot</span></span><span style="font-size: large;"></span></h2>
<span style="font-size: large;">The starting plot is a simple scatterplot that uses <i>Aster_experience</i> and <i>R_experience</i> as coordinates <i>x</i> and <i>y</i> (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 3</span>), <i>coverage</i> as point size, and <i>product</i> as point color (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 4</span>). This type of scatterplot has a special name, <a href="https://en.wikipedia.org/wiki/Bubble_chart" target="_blank">bubble chart</a>:</span><br />
<br />
<div>
<code data-gist-file="ggplot2-initial-plot.R" data-gist-highlight-line="3,4" data-gist-id="f370c5eb997fc74b7b7ec83e73d4dffa"></code>
</div>
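<span style="font-size: large;">For readers who cannot load the gist, here is a minimal self-contained sketch of such a starting plot. The column names come from the text above, but the product names and all numeric values are placeholders assumed for illustration, not the values from the original dataset:</span><br />
<pre class="r" style="font-family: monospace;">library(ggplot2)

# Placeholder data: 2 points, 4 attributes (values are assumed, not the original ones)
d = data.frame(product = c("toaster", "TeradataAsterR"),
               Aster_experience = c(3, 8),
               R_experience = c(8, 3),
               coverage = c(20, 80))

# Bubble chart: position from the two experience scores,
# size from coverage, color from product
p = ggplot(d) +
  geom_point(aes(x = Aster_experience, y = R_experience,
                 size = coverage, color = product))

print(p)
</pre>
<br />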
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr5avsiYFyjmUrdkMRCyiSrtjxBlKj99XkxiRitSekGAc1CAdOtUHxSQS2reAKJnx_9WqbGXBBKJGqIDCPWEkyzdAIw6U3r6ykFC_B9m9Rg06L9ndv_58XTr4uNQebR0xogoJuy1W0963T/s1600/toaster-vs-TeradataAsterR-1.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr5avsiYFyjmUrdkMRCyiSrtjxBlKj99XkxiRitSekGAc1CAdOtUHxSQS2reAKJnx_9WqbGXBBKJGqIDCPWEkyzdAIw6U3r6ykFC_B9m9Rg06L9ndv_58XTr4uNQebR0xogoJuy1W0963T/s1600/toaster-vs-TeradataAsterR-1.0.png" /></a></div>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: Georgia, "Times New Roman", serif;">Fixing Point Sizes </span></span></h2>
<span style="font-size: large;">An immediate fix is to make the smaller point big enough to see with the help of the <i>scale_size</i> function and its <i>range</i> argument (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 3</span>), which specifies the minimum and maximum size of the plotting symbol after transformation<sup><a href="https://www.blogger.com/blogger.g?blogID=7530218802939252476#1" name="top1">1</a> </sup> (strangely enough, the sibling function <i>scale_size_area</i> has no such argument):</span><br />
<br />
<div>
<code data-gist-file="ggplot2-scale-size.R" data-gist-highlight-line="3" data-gist-id="f370c5eb997fc74b7b7ec83e73d4dffa"></code>
</div>
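<span style="font-size: large;">A self-contained sketch of the same idea follows. The data values and the range <i>c(5, 20)</i> are assumptions chosen for illustration and may differ from those in the gist:</span><br />
<pre class="r" style="font-family: monospace;">library(ggplot2)

# Placeholder data (assumed values)
d = data.frame(product = c("toaster", "TeradataAsterR"),
               Aster_experience = c(3, 8),
               R_experience = c(8, 3),
               coverage = c(20, 80))

# range sets the drawn size of the smallest and the largest point
p = ggplot(d) +
  geom_point(aes(x = Aster_experience, y = R_experience,
                 size = coverage, color = product)) +
  scale_size(range = c(5, 20))

print(p)
</pre>
<span style="font-size: large;">With this scale the point with the smallest <i>coverage</i> is drawn at size 5 and the one with the largest at size 20, whatever the underlying values are.</span><br />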
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZYnd6g_DmE0LgGgo_k28n5cHZJTjBPqsnOZ-otde936Qpeblky1kjuNIlW3fn2cDX6QBWO_s-WCT02gOCbWoYMTdBVrK119GbbVWTYgg0BAmj8TrtqHygtgXPUydW7yL-JeTMRP-nXh2Z/s1600/toaster-vs-TeradataAsterR-2.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZYnd6g_DmE0LgGgo_k28n5cHZJTjBPqsnOZ-otde936Qpeblky1kjuNIlW3fn2cDX6QBWO_s-WCT02gOCbWoYMTdBVrK119GbbVWTYgg0BAmj8TrtqHygtgXPUydW7yL-JeTMRP-nXh2Z/s1600/toaster-vs-TeradataAsterR-2.0.png" /></a></div>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: Georgia, "Times New Roman", serif;">Magic Quadrant: adding lines and customizing axes</span> </span></h2>
<span style="font-size: large;">The next refinement aims at the <a href="http://www.gartner.com/technology/research/methodologies/research_mq.jsp" target="_blank">magic quadrant</a> concept, which fits this data well. In this case it's "R Experience" vs. "Aster Experience" and whether there is more or less of each. Achieving this effect involves fake axes drawn with <i>geom_hline </i>and <i>geom_vline</i> (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 3</span>), and customizing the actual axes using <i>scale</i> (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">lines 5-6</span>) and <i>theme</i> functions (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">lines 8-12</span>):</span><br />
<br />
<div>
<code data-gist-file="ggplot2-magic-quadrant.R" data-gist-highlight-line="3,5,6,8-12" data-gist-id="f370c5eb997fc74b7b7ec83e73d4dffa"></code>
</div>
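<span style="font-size: large;">The quadrant layout can be sketched as follows. Everything numeric here is assumed for illustration: the placeholder data, the 0 to 10 axis limits, the crossing point at 5, and the "less"/"more" tick labels:</span><br />
<pre class="r" style="font-family: monospace;">library(ggplot2)

# Placeholder data (assumed values)
d = data.frame(product = c("toaster", "TeradataAsterR"),
               Aster_experience = c(3, 8),
               R_experience = c(8, 3),
               coverage = c(20, 80))

p = ggplot(d, aes(x = Aster_experience, y = R_experience)) +
  # Fake axes crossing in the middle of the plot
  geom_hline(yintercept = 5) +
  geom_vline(xintercept = 5) +
  geom_point(aes(size = coverage, color = product)) +
  scale_size(range = c(5, 20)) +
  # Real axes reduced to two qualitative labels each
  scale_x_continuous("Aster Experience", limits = c(0, 10),
                     breaks = c(2.5, 7.5), labels = c("less", "more")) +
  scale_y_continuous("R Experience", limits = c(0, 10),
                     breaks = c(2.5, 7.5), labels = c("less", "more")) +
  # Hide tick marks and grid so that only the quadrant lines remain
  theme(axis.ticks = element_blank(),
        panel.grid = element_blank())

print(p)
</pre>
<br />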
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7nc8pQl17adAPDzneslwyWk4yoROD-xYcrk3iJh5lYb1BCJmjamgmIQiKuHlLPyA0HzHT_AQ1RCVzBUg9Q_IECOh2WfIrvV7Sbzc60pzliZtfa4goo-aSKAp_F1oAChgKrxs_I0GTJyN6/s1600/toaster-vs-TeradataAsterR-3.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7nc8pQl17adAPDzneslwyWk4yoROD-xYcrk3iJh5lYb1BCJmjamgmIQiKuHlLPyA0HzHT_AQ1RCVzBUg9Q_IECOh2WfIrvV7Sbzc60pzliZtfa4goo-aSKAp_F1oAChgKrxs_I0GTJyN6/s1600/toaster-vs-TeradataAsterR-3.0.png" /></a></div>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: Georgia, "Times New Roman", serif;">Adding Text and Color to Points </span></span></h2>
<span style="font-size: large;">As is typical for bubble charts, the points get both colored and labeled, which makes the color legend redundant. We use <i>geom_text </i>to label points (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 5</span>) and <i>scale_color_manual </i>to assign new colors and remove the color legend (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 11</span>):</span><br />
<br />
<div>
<code data-gist-file="ggplot2-text-labels.R" data-gist-highlight-line="5,11" data-gist-id="f370c5eb997fc74b7b7ec83e73d4dffa"></code>
</div>
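<span style="font-size: large;">A minimal sketch of labeling and recoloring is shown below. The data values, the two hex colors, and the black label color are assumptions made for this illustration:</span><br />
<pre class="r" style="font-family: monospace;">library(ggplot2)

# Placeholder data (assumed values)
d = data.frame(product = c("toaster", "TeradataAsterR"),
               Aster_experience = c(3, 8),
               R_experience = c(8, 3),
               coverage = c(20, 80))

p = ggplot(d, aes(x = Aster_experience, y = R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  # Label each bubble with its product name
  geom_text(aes(label = product), color = "black") +
  scale_size(range = c(5, 20)) +
  # Assign colors by product name and drop the now redundant color legend
  scale_color_manual(values = c(toaster = "#1b9e77", TeradataAsterR = "#d95f02"),
                     guide = "none")

print(p)
</pre>
<br />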
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYVwcS0byQO7_G3X-voM8Zgt9vl-x5F5rtmPM4Do3EVgdfdjvDx4RGRDwbtxxjtFahad7WVBj1fbzBAXPX6wQf3Jf5f5P887z60y95MADeSjuPpOflXL7snaQP3qDvhNFiS7dIjHmwKmkq/s1600/toaster-vs-TeradataAsterR-4.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYVwcS0byQO7_G3X-voM8Zgt9vl-x5F5rtmPM4Do3EVgdfdjvDx4RGRDwbtxxjtFahad7WVBj1fbzBAXPX6wQf3Jf5f5P887z60y95MADeSjuPpOflXL7snaQP3qDvhNFiS7dIjHmwKmkq/s1600/toaster-vs-TeradataAsterR-4.0.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h2>
<span style="font-size: large;"><span style="font-family: Georgia, "Times New Roman", serif;">Customizing Legend</span> </span></h2>
<span style="font-size: large;">The next step turned out to tackle the most advanced problem encountered while working on the plot. The guide legend for size above looks rather awkward. Ideally, it would match the two points we have in both color and size. It turned out (and rightly so) that the function <i>scale_size </i>is responsible for its appearance (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 8</span>). In particular, the argument <i>breaks</i> sets the number of legend entries, and the appearance of the legend, including its colors, is controlled with <i>guide_legend</i> and <i>override.aes</i>:</span><br />
<br />
<div>
<code data-gist-file="ggplot2-size-legend.R" data-gist-highlight-line="8" data-gist-id="f370c5eb997fc74b7b7ec83e73d4dffa"></code>
</div>
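<span style="font-size: large;">The legend customization can be sketched like this. The data, the break values (taken equal to the two placeholder <i>coverage</i> values), and the colors are all assumed for illustration:</span><br />
<pre class="r" style="font-family: monospace;">library(ggplot2)

# Placeholder data (assumed values)
d = data.frame(product = c("toaster", "TeradataAsterR"),
               Aster_experience = c(3, 8),
               R_experience = c(8, 3),
               coverage = c(20, 80))

product.colors = c(toaster = "#1b9e77", TeradataAsterR = "#d95f02")

p = ggplot(d, aes(x = Aster_experience, y = R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  scale_color_manual(values = product.colors, guide = "none") +
  # breaks: one legend entry per data point;
  # override.aes: draw each legend entry in the matching product color
  scale_size(range = c(5, 20),
             breaks = d$coverage,
             guide = guide_legend(
               override.aes = list(colour = unname(product.colors[d$product]))))

print(p)
</pre>
<span style="font-size: large;">Since the breaks coincide with the two <i>coverage</i> values, the size legend shows exactly two bubbles, each matching one point on the plot in size and color.</span><br />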
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0pV9fKDxXgJeUFVrDpn-USh8V2-6Gfc6796NFPSmv7F9ywN9m_SM8iW9irGPP86-LAyCtdHHC-zB8iyFWaXZs1hXsTv67CyR-RF1pDCi06B95dptXcLXdSNoXYN6eCd2OCZVE5FQmiJSL/s1600/toaster-vs-TeradataAsterR-5.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0pV9fKDxXgJeUFVrDpn-USh8V2-6Gfc6796NFPSmv7F9ywN9m_SM8iW9irGPP86-LAyCtdHHC-zB8iyFWaXZs1hXsTv67CyR-RF1pDCi06B95dptXcLXdSNoXYN6eCd2OCZVE5FQmiJSL/s1600/toaster-vs-TeradataAsterR-5.0.png" /></a></div>
<br />
<h2>
<span style="font-size: large;"><span style="font-family: Georgia, "Times New Roman", serif;">Finishing Touch with Custom Theme</span> </span></h2>
<span style="font-size: large;">We finish cleaning the plot using package <b>ggthemes</b> and its <i>theme_tufte</i> function (<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">line 10</span>):</span><br />
<br />
<div>
<code data-gist-file="ggplot2-final.R" data-gist-highlight-line="10" data-gist-id="f370c5eb997fc74b7b7ec83e73d4dffa"></code>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixxrBjqlfBWuQ95HwyUS3m6ldbnyGt8nwELdFmMMu6uzhpKd3Xw7eIOMxG7sFU6cSJap3HrlyuztrO3rhKXBLzwv60Gd11xQa9FF-b5vpZM-i7rKl8tL-T1HVXkf6ZDa3SH88nEOpMSDOO/s1600/toaster-vs-TeradataAsterR.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixxrBjqlfBWuQ95HwyUS3m6ldbnyGt8nwELdFmMMu6uzhpKd3Xw7eIOMxG7sFU6cSJap3HrlyuztrO3rhKXBLzwv60Gd11xQa9FF-b5vpZM-i7rKl8tL-T1HVXkf6ZDa3SH88nEOpMSDOO/s1600/toaster-vs-TeradataAsterR.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">As promised, we finished exactly where we started.</span><br />
<hr width="80%" />
<span class="Apple-style-span" style="font-size: x-small;"><br />
<a href="https://www.blogger.com/null" name="1"><b>1 </b></a><a href="http://docs.ggplot2.org/current/scale_size.html">Scale size (area or radius).</a><a href="https://www.blogger.com/blogger.g?blogID=7530218802939252476#top1"><sup>↩</sup></a><br />
</span>Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-81356758520290616602016-01-30T02:42:00.000-06:002016-04-16T23:26:29.375-05:00R Graph Objects: igraph vs. network<span style="font-size: large;">While working on new graph functions for my package <a href="https://github.com/teradata-aster-field/toaster/wiki" target="_blank">toaster</a> I had to pick from the R packages that represent graphs. The choice was between <i>network </i>and <i>graph</i> objects from the <a href="https://cran.r-project.org/web/packages/network/index.html" target="_blank"><b>network</b></a> and <a href="http://igraph.org/r/" target="_blank"><b>igraph</b></a> packages respectively, the two most prominent packages for creating and manipulating graphs and networks in R.</span><br />
<br />
<h2>
Interchangeability of <span style="font-weight: normal;">network</span> and <span style="font-weight: normal;">graph</span> objects</h2>
<div>
<br /></div>
<span style="font-size: large;">One can always use them interchangeably with little effort using package <b><a href="http://intergraph.r-forge.r-project.org/" target="_blank">intergraph</a>.</b> Its sole purpose is providing "coercion routines for network data objects". Simply use its <i>asNetwork</i> and <i>asIgraph </i>functions to convert from one network representation to another:</span><br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/igraph">igraph</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/network">network</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>intergraph<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># igraph </span>
pkg.igraph = graph_from_edgelist<span style="color: #009900;">(</span>edges.mat<span style="color: #339933;">,</span> directed = <span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span>
pkg.network.from.igraph = asNetwork<span style="color: #009900;">(</span>pkg.igraph<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/all.equal"><span style="color: #003399; font-weight: bold;">all.equal</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>get.edgelist<span style="color: #009900;">(</span>pkg.igraph<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.matrix"><span style="color: #003399; font-weight: bold;">as.matrix</span></a><span style="color: #009900;">(</span>pkg.network.from.igraph<span style="color: #339933;">,</span> <span style="color: blue;">"edgelist"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># network</span>
pkg.network = <a href="http://inside-r.org/packages/cran/network">network</a><span style="color: #009900;">(</span>edges.mat<span style="color: #009900;">)</span>
pkg.igraph.from.network = asIgraph<span style="color: #009900;">(</span>pkg.network<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/all.equal"><span style="color: #003399; font-weight: bold;">all.equal</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.matrix"><span style="color: #003399; font-weight: bold;">as.matrix</span></a><span style="color: #009900;">(</span>pkg.network<span style="color: #339933;">,</span> <span style="color: blue;">"edgelist"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>get.edgelist<span style="color: #009900;">(</span>pkg.igraph.from.network<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<span style="font-size: large;">For more on using <b>intergraph </b>functions see the <a href="http://intergraph.r-forge.r-project.org/howto.html" target="_blank">tutorial</a>.</span><br />
<br />
<h2>
Package dependencies with <span style="font-weight: normal;">miniCRAN</span></h2>
<div>
<span style="font-weight: normal;"><br /></span></div>
<div>
<span style="font-size: large;">To assess the relative importance of the packages <b>network </b>and <b>igraph </b>we will use the package <b>miniCRAN</b>. It gives access to CRAN packages' metadata, including the dependencies declared via "Depends", "Imports", and "Suggests", which provides the necessary information about package relationships. The built-in <i>makeDepGraph</i> function recursively retrieves these dependencies and builds the corresponding graph:</span></div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>miniCRAN<span style="color: #009900;">)</span>
cranInfo = pkgAvail<span style="color: #009900;">(</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>makeDepGraph<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"network"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> availPkgs = cranInfo<span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>makeDepGraph<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"igraph"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> availPkgs = cranInfo<span style="color: #009900;">)</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<br />
<table border="1">
<tbody>
<tr>
<td> <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB3jHpLc7Mc9zjWdK8PmulCumm_06G_xSxqEHWDK-3lCxJ8zA-EvNR5NYaOE6400JAY5YpLO6kgggVFW5v-JhnMNrLRjcH1bhc4Fuy7iXOMxG2CMOmdDNJ91tc9cqzVd2GMzCpB3fMqT1F/s1600/network-depgraph.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB3jHpLc7Mc9zjWdK8PmulCumm_06G_xSxqEHWDK-3lCxJ8zA-EvNR5NYaOE6400JAY5YpLO6kgggVFW5v-JhnMNrLRjcH1bhc4Fuy7iXOMxG2CMOmdDNJ91tc9cqzVd2GMzCpB3fMqT1F/s320/network-depgraph.png" width="320" /></a></div>
</td>
<td><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZuuptghenpgvG03rrBy4JkeFrhyOe1D9oC_AkX_1fietKRM282Z4ODm7bJe2vHWGKmEmEZbWLoM7ggC-KDy3cYo18Vw7FK_JVjf_XedjOH8MBksdDto8W20R1nDq8HjVg2o8baXm5pGL0/s1600/igraph-depgraph.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZuuptghenpgvG03rrBy4JkeFrhyOe1D9oC_AkX_1fietKRM282Z4ODm7bJe2vHWGKmEmEZbWLoM7ggC-KDy3cYo18Vw7FK_JVjf_XedjOH8MBksdDto8W20R1nDq8HjVg2o8baXm5pGL0/s320/igraph-depgraph.png" width="320" /></a></div>
<br /></td>
</tr>
</tbody></table>
<br />
<span style="font-size: large;">Unfortunately, these dependency graphs show how <b>network </b>and <b>igraph </b>depend on other CRAN packages while the goal is to evaluate relationships the other way around: how much other CRAN packages depend on the two.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">This will require some assembly as we construct a network of packages manually with edges being directed relationships (one of "Depends", "Imports", or "Suggests") as defined in <span style="font-family: "courier new" , "courier" , monospace;">DESCRIPTION</span><span style="font-family: inherit;"> for all packages</span>. The following code builds this <i>igraph</i> object (we chose <b>igraph</b> for its functions utilized later):</span><br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
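<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>plyr<span style="color: #009900;">)</span> <span style="color: #666666; font-style: italic;"># provides ddply used below</span>
cranInfoDF = <a href="http://inside-r.org/r-doc/base/as.data.frame"><span style="color: #003399; font-weight: bold;">as.data.frame</span></a><span style="color: #009900;">(</span>cranInfo<span style="color: #339933;">,</span> stringsAsFactors = <span style="color: black; font-weight: bold;">FALSE</span><span style="color: #009900;">)</span>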
edges = ddply<span style="color: #009900;">(</span>cranInfoDF<span style="color: #339933;">,</span> .<span style="color: #009900;">(</span>Package<span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/function"><span style="color: #003399; font-weight: bold;">function</span></a><span style="color: #009900;">(</span>x<span style="color: #009900;">)</span> <span style="color: #009900;">{</span>
<span style="color: #666666; font-style: italic;"># split all implied (depends, imports, and suggests) packages and then concat into single array</span>
l = <a href="http://inside-r.org/r-doc/base/unlist"><span style="color: #003399; font-weight: bold;">unlist</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sapply"><span style="color: #003399; font-weight: bold;">sapply</span></a><span style="color: #009900;">(</span>x<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">'Depends'</span><span style="color: #339933;">,</span><span style="color: blue;">'Imports'</span><span style="color: #339933;">,</span><span style="color: blue;">'Suggests'</span><span style="color: #009900;">)</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/strsplit"><span style="color: #003399; font-weight: bold;">strsplit</span></a><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/split"><span style="color: #003399; font-weight: bold;">split</span></a>=<span style="color: blue;">"(,|, |,<span style="color: #000099; font-weight: bold;">\n</span>|<span style="color: #000099; font-weight: bold;">\n</span>,| ,| , )"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># remove version info and empty fields that became NA</span>
l = <a href="http://inside-r.org/r-doc/base/gsub"><span style="color: #003399; font-weight: bold;">gsub</span></a><span style="color: #009900;">(</span><span style="color: blue;">"^([^ <span style="color: #000099; font-weight: bold;">\n</span>(]+).*$"</span><span style="color: #339933;">,</span> <span style="color: blue;">"<span style="color: #000099; font-weight: bold;">\\</span>1"</span><span style="color: #339933;">,</span> l<span style="color: #009900;">[</span>!<a href="http://inside-r.org/r-doc/base/is.na"><span style="color: #003399; font-weight: bold;">is.na</span></a><span style="color: #009900;">(</span>l<span style="color: #009900;">)</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># take care of empty arrays</span>
<span style="color: black; font-weight: bold;">if</span> <span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/is.null"><span style="color: #003399; font-weight: bold;">is.null</span></a><span style="color: #009900;">(</span>l<span style="color: #009900;">)</span> || <a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>l<span style="color: #009900;">)</span> == <span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span>
<span style="color: black; font-weight: bold;">NULL</span>
<span style="color: black; font-weight: bold;">else</span>
<a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span>Package = x<span style="color: #009900;">[</span><span style="color: blue;">'Package'</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span> Implies = l<span style="color: #339933;">,</span> stringsAsFactors = <span style="color: black; font-weight: bold;">FALSE</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span> <span style="color: #009900;">)</span>
edges.mat = <a href="http://inside-r.org/r-doc/base/as.matrix"><span style="color: #003399; font-weight: bold;">as.matrix</span></a><span style="color: #009900;">(</span>edges<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/ncol"><span style="color: #003399; font-weight: bold;">ncol</span></a>=<span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/dimnames"><span style="color: #003399; font-weight: bold;">dimnames</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">'from'</span><span style="color: #339933;">,</span><span style="color: blue;">'to'</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
pkg.graph = graph_from_edgelist<span style="color: #009900;">(</span>edges.mat<span style="color: #339933;">,</span> directed = <span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
<span style="font-size: large;">The resulting network <span style="font-family: "courier new" , "courier" , monospace;">pkg.graph</span> contains all CRAN packages and their relationships. Let's extract and compare the neighborhoods for the two packages we are interested in:</span><br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># build subgraphs for each package</span>
subgraphs = make_ego_graph<span style="color: #009900;">(</span>pkg.graph<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/order"><span style="color: #003399; font-weight: bold;">order</span></a>=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> nodes=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"igraph"</span><span style="color: #339933;">,</span><span style="color: blue;">"network"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/mode"><span style="color: #003399; font-weight: bold;">mode</span></a> = <span style="color: blue;">"in"</span><span style="color: #009900;">)</span>
g.igraph = subgraphs<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>
g.network = subgraphs<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>
<span style="color: #666666; font-style: italic;"># plotting subgraphs</span>
V<span style="color: #009900;">(</span>g.igraph<span style="color: #009900;">)</span>$color = <a href="http://inside-r.org/r-doc/base/ifelse"><span style="color: #003399; font-weight: bold;">ifelse</span></a><span style="color: #009900;">(</span>V<span style="color: #009900;">(</span>g.igraph<span style="color: #009900;">)</span>$name == <span style="color: blue;">"igraph"</span><span style="color: #339933;">,</span> <span style="color: blue;">"orange"</span><span style="color: #339933;">,</span> <span style="color: blue;">"lightblue"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>g.igraph<span style="color: #339933;">,</span> main=<span style="color: blue;">"Packages pointing to igraph"</span><span style="color: #009900;">)</span>
V<span style="color: #009900;">(</span>g.network<span style="color: #009900;">)</span>$color = <a href="http://inside-r.org/r-doc/base/ifelse"><span style="color: #003399; font-weight: bold;">ifelse</span></a><span style="color: #009900;">(</span>V<span style="color: #009900;">(</span>g.network<span style="color: #009900;">)</span>$name == <span style="color: blue;">"network"</span><span style="color: #339933;">,</span> <span style="color: blue;">"orange"</span><span style="color: #339933;">,</span> <span style="color: blue;">"lightblue"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>g.network<span style="color: #339933;">,</span> main=<span style="color: blue;">"Packages pointing to network"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7l-pXWyijG9OIj9Uuakag-tKvUeeMs2mUrWK5ixT7FUvVTAddRLShEEzyAUF1MmGytjOu8T99gtczlJ_gdPIfyAetuVC8NEh37agun84iVatR-xpogFJtighjglaydHIjJ2qRBsJrpZwJ/s1600/igraph-neighborhood-in.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7l-pXWyijG9OIj9Uuakag-tKvUeeMs2mUrWK5ixT7FUvVTAddRLShEEzyAUF1MmGytjOu8T99gtczlJ_gdPIfyAetuVC8NEh37agun84iVatR-xpogFJtighjglaydHIjJ2qRBsJrpZwJ/s400/igraph-neighborhood-in.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiU3Pm-iHIpxHUcxNtbLgi1tNutHHPjcULKDolgVeokzIcOrOQuIM5lz1UZ0oG89cL_aytsg2j9A6muXpEzi4BKuLnXRt5Jh4FsDFpszFwzbvK9-Od5JSLpYnp8HsEY_9-ok1WFTQ9ev0bk/s1600/network-neighborhood-in.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiU3Pm-iHIpxHUcxNtbLgi1tNutHHPjcULKDolgVeokzIcOrOQuIM5lz1UZ0oG89cL_aytsg2j9A6muXpEzi4BKuLnXRt5Jh4FsDFpszFwzbvK9-Od5JSLpYnp8HsEY_9-ok1WFTQ9ev0bk/s400/network-neighborhood-in.png" width="400" /></a></div>
<span style="font-size: large;">The <b>igraph</b> neighborhood is a much more densely populated subgraph than the <b>network</b> neighborhood, which suggests that its importance and acceptance are higher.</span><br />
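<br />
<span style="font-size: large;">To put a number on "more densely populated", one option is to count the vertices and edges of each ego graph. The sketch below uses a made-up four-edge dependency list (the names <i>pkgA</i>, <i>pkgB</i>, <i>pkgC</i> and all <i>toy.*</i> variables are hypothetical) rather than the real CRAN data, so it only illustrates the technique:</span><br />

```r
library(igraph)

# Toy dependency edges (hypothetical): each row reads "package -> package it implies"
toy.edges <- matrix(c("pkgA", "igraph",
                      "pkgB", "igraph",
                      "pkgC", "igraph",
                      "pkgA", "network"),
                    ncol = 2, byrow = TRUE)
toy.graph <- graph_from_edgelist(toy.edges, directed = TRUE)

# Same call as in the post: first-order neighborhoods following incoming edges
toy.subgraphs <- make_ego_graph(toy.graph, order = 1,
                                nodes = c("igraph", "network"), mode = "in")

# Size of each neighborhood: number of vertices and number of edges
sizes <- data.frame(package  = c("igraph", "network"),
                    vertices = sapply(toy.subgraphs, vcount),
                    edges    = sapply(toy.subgraphs, ecount))
print(sizes)
```

<span style="font-size: large;">Applied to <span style="font-family: "courier new" , "courier" , monospace;">g.igraph</span> and <span style="font-family: "courier new" , "courier" , monospace;">g.network</span>, the same <i>vcount()</i> and <i>ecount()</i> calls would give the sizes of the two real neighborhoods.</span><br />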
<br />
<h2>
Package Centrality Scores</h2>
<div>
<br /></div>
<div>
<span style="font-size: large;">The <b>igraph</b> package can compute various centrality measures on the nodes of a graph. In particular, PageRank and eigenvector centrality scores are principal indicators of the importance of a node in a given graph. We finish this exercise by using centrality scores to validate our initial conclusion that the <b>igraph</b> package is more accepted and utilized across the CRAN ecosystem than the <b>network</b> package:</span></div>
<div>
<br /></div>
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># PageRank</span>
pkg.pagerank = page_rank<span style="color: #009900;">(</span>pkg.graph<span style="color: #339933;">,</span> directed = <span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Eigenvector Centrality</span>
pkg.ev = eigen_centrality<span style="color: #009900;">(</span>pkg.graph<span style="color: #339933;">,</span> directed = <span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span>
toplot = <a href="http://inside-r.org/r-doc/base/rbind"><span style="color: #003399; font-weight: bold;">rbind</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span>centrality=<span style="color: blue;">"pagerank"</span><span style="color: #339933;">,</span> type = <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">'igraph'</span><span style="color: #339933;">,</span><span style="color: blue;">'network'</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
value = pkg.pagerank$vector<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">'igraph'</span><span style="color: #339933;">,</span><span style="color: blue;">'network'</span><span style="color: #009900;">)</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span>centrality=<span style="color: blue;">"eigenvector"</span><span style="color: #339933;">,</span> type = <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">'igraph'</span><span style="color: #339933;">,</span><span style="color: blue;">'network'</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
value = pkg.ev$vector<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">'igraph'</span><span style="color: #339933;">,</span><span style="color: blue;">'network'</span><span style="color: #009900;">)</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/ggplot2">ggplot2</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>ggthemes<span style="color: #009900;">)</span>
<a href="http://inside-r.org/packages/cran/ggplot">ggplot</a><span style="color: #009900;">(</span>toplot<span style="color: #009900;">)</span> +
geom_bar<span style="color: #009900;">(</span>aes<span style="color: #009900;">(</span>type<span style="color: #339933;">,</span> value<span style="color: #339933;">,</span> fill=type<span style="color: #009900;">)</span><span style="color: #339933;">,</span> stat=<span style="color: blue;">"identity"</span><span style="color: #009900;">)</span> +
facet_wrap<span style="color: #009900;">(</span>~centrality<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/ncol"><span style="color: #003399; font-weight: bold;">ncol</span></a> = <span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis8NsxQIRqsBGQZ9yWNqUS53zQOwV5C6mG4CN4_nqtIEZ9Sav2koO8isgFSiX0VxuAhzVJuClAu2aypBBa82C7aauZo5BXeJgamzVJ7X8J2Y5r5_X9RluugXUGx-M8BkL390J0-ZpMrJoO/s1600/network-igraph-centrality.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis8NsxQIRqsBGQZ9yWNqUS53zQOwV5C6mG4CN4_nqtIEZ9Sav2koO8isgFSiX0VxuAhzVJuClAu2aypBBa82C7aauZo5BXeJgamzVJ7X8J2Y5r5_X9RluugXUGx-M8BkL390J0-ZpMrJoO/s400/network-igraph-centrality.png" width="400" /></a></div>
<br />
<h2>
Conclusion</h2>
</div>
<div>
<br /></div>
<div>
<span style="font-size: large;">Both <b>igraph</b> and <b>network</b> are widely used across the CRAN ecosystem. Thanks to its versatility and rich set of functions, <b>igraph</b> leads in acceptance and importance. As far as graph objects are concerned, though, the choice between the two packages' objects in R still depends on the requirements at hand.</span></div>
Posted by <a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
<h2>
VW Big Data Play</h2>
Published 2015-09-22<br />
<br />
Volkswagen recently made headlines for cheating U.S. EPA regulators. But let's pay some respect to their engineers.<br />
<br />
Apparently, there is no button or switch that tells the car it is being tested; indeed, that would be an obvious flaw in the emission test protocol. So VW engineers designed and deployed a sophisticated algorithm that detects when the car is undergoing emission testing and turns emission control on just in time to pass the test with flying colors. Then, after the test is over, it recognizes normal driving conditions and switches the car software back to running the diesel engine in its normal mode (which creates smog at up to 40 times the legal limit).<br />
<br />
Having such a feature run flawlessly in real-time conditions on hundreds of thousands of cars all over the world deserves special recognition. In fact, it was pure accident that this "cheating device" was found (<a href="http://goo.gl/LBb43m">here</a> is Bloomberg's story of how). At the very least, let's congratulate VW data scientists and software engineers - but not their execs - on quite an accomplishment.<br />
<br />
Posted by <a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
<h2>
How to expand color palette with ggplot and RColorBrewer</h2>
Published 2013-09-13, updated 2017-09-16<br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Histograms and bar charts are almost always part of a data analysis presentation. When made with the R <a href="http://docs.ggplot2.org/current/geom_histogram.html" target="_blank">ggplot package</a> functions <i>geom_histogram()</i> or <i>geom_bar()</i>, a bar chart may look like this:</span><br />
<br />
<code data-gist-file="ggplot_mtcars_barplot.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSYWFK5nuFFf3MyHr_DWPvNH2urEZczHi-ILpg2O6cgdnj1s6jaLCKAJz8L0lsj4uTEo0ARWt34cW4LI4ErV7jq-3HAQ-icOxuPIEdlCNJweCMXlkiiSb1GH56tdmUgGOUSLb3XjN_cK6X/s1600/ggplot-hist-mtcars-simple.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSYWFK5nuFFf3MyHr_DWPvNH2urEZczHi-ILpg2O6cgdnj1s6jaLCKAJz8L0lsj4uTEo0ARWt34cW4LI4ErV7jq-3HAQ-icOxuPIEdlCNJweCMXlkiiSb1GH56tdmUgGOUSLb3XjN_cK6X/s1600/ggplot-hist-mtcars-simple.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The elegance of <i>ggplot</i> functions lies in a simple yet compact expression of the visualization formula that hides many options assumed by default. Hidden doesn't mean lacking, as most options are just a step away. For example, for color selection use one of the methods from the scale family of functions such as <i>scale_fill_brewer()</i>:</span><br />
<br />
<code data-gist-file="ggplot_mtcars_fill_brewer.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh736_Q7_KJFMZTaj4DiPZ7-A9DgDkHGWIIec-FKFFH_UCG4w9cCuT0C5gOc-D_nHwPqBCjm67fCpvBfYfqNexgJOBH87BSC0Y8cIJwfqUCToeuRDvzZbHkpuVZ-xPv9RfxpHTvr873iorj/s1600/ggplot-hist-mtcars-scale-brewer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh736_Q7_KJFMZTaj4DiPZ7-A9DgDkHGWIIec-FKFFH_UCG4w9cCuT0C5gOc-D_nHwPqBCjm67fCpvBfYfqNexgJOBH87BSC0Y8cIJwfqUCToeuRDvzZbHkpuVZ-xPv9RfxpHTvr873iorj/s1600/ggplot-hist-mtcars-scale-brewer.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The <i>palette</i> argument controls the choice of colors in <i>scale_fill_brewer()</i>:</span><br />
<br />
<code data-gist-file="ggplot_mtcars_fill_brewer_palette.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivUSzzbATqokomyIP3yQjjUlqKt-kSEZkhNf0g8aj4H7pFXKwCOsSRjUSfFHYVQU1NOoRO9zVcK2kb3UFFobVHoa4YUjgsY0p9kNFQVgAN3M6glfDkdAPDXjOMdTPsbauPag4OMT2ngMgS/s1600/ggplot-hist-mtcars-simple-palette-Set1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivUSzzbATqokomyIP3yQjjUlqKt-kSEZkhNf0g8aj4H7pFXKwCOsSRjUSfFHYVQU1NOoRO9zVcK2kb3UFFobVHoa4YUjgsY0p9kNFQVgAN3M6glfDkdAPDXjOMdTPsbauPag4OMT2ngMgS/s1600/ggplot-hist-mtcars-simple-palette-Set1.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">The palettes used live in the package <i>RColorBrewer</i>; to see all available choices simply run <i>display.brewer.all()</i>: </span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<code data-gist-file="rcolorbrewer_palettes.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8FxDlR1WqVbFwm0t_HlW4kCqG1PKFea-KvIbWT0Ps70efXrcAO34bTDADHQhuUEeYR2NFY7xIyPu2l4vDJL1MiolqsFCRO_UGq5IvOYLjkv0vuP5gNh2kTV7CQxvN6q3xxMsTAup2S_at/s1600/RColorBrewer-palettes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8FxDlR1WqVbFwm0t_HlW4kCqG1PKFea-KvIbWT0Ps70efXrcAO34bTDADHQhuUEeYR2NFY7xIyPu2l4vDJL1MiolqsFCRO_UGq5IvOYLjkv0vuP5gNh2kTV7CQxvN6q3xxMsTAup2S_at/s1600/RColorBrewer-palettes.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo7X0JGBX-8Qepc98BIB5OXhsLutraA_zS3VKF4b-xcUJn7uSnBMxiWHDrFHkkOU4IXbwlefuGZM_D2XRnYsJIqbYEjOdyUph6B7i7ySoCBdCB1yZACZBO3KvBis6Khu5oXv7zkhlvNfCt/s1600/RColorBrewer-palette-Set1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="695" data-original-width="700" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo7X0JGBX-8Qepc98BIB5OXhsLutraA_zS3VKF4b-xcUJn7uSnBMxiWHDrFHkkOU4IXbwlefuGZM_D2XRnYsJIqbYEjOdyUph6B7i7ySoCBdCB1yZACZBO3KvBis6Khu5oXv7zkhlvNfCt/s1600/RColorBrewer-palette-Set1.png" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-size: medium;"><br /></span></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">There are 3 types of palettes (sequential, diverging, and qualitative), with each palette containing from 8 to 12 colors (see the data frame <i>brewer.pal.info</i> or the help page <i>?RColorBrewer</i> for more detail).</span><br />
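<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">A quick way to check these numbers is to query <i>brewer.pal.info</i> directly. This is a minimal sketch that relies only on the <i>maxcolors</i> and <i>category</i> columns of that data frame; the variable names are chosen here for illustration:</span><br />

```r
library(RColorBrewer)

# One row per palette; columns include maxcolors and category (div, qual, seq)
head(brewer.pal.info)

# Number of palettes of each type
palette.types <- table(brewer.pal.info$category)
print(palette.types)

# Smallest and largest number of colors any palette offers
palette.range <- range(brewer.pal.info$maxcolors)
print(palette.range)
```
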
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">A curious reader may notice that if a bar chart contains 13 or more bars we run into trouble with colors, as in the next plot:</span><br />
<br />
<code data-gist-file="ggplot_mtcars_hp.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbUpHgIgJx_a1KCVtNE51A8deyEaDJbmX6D20Oxy7tG4Bho-bzGlZPXLnJKrBvZo_RoHO3m5Zbg6FGfVqk2GVYFqzjTcTDIi5cNpAlt8cVbAgTdoQIJBe2eMuhg2K1v2kPu5A_hvFqxXPG/s1600/ggplot-mtcars-hist-hp-trouble.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbUpHgIgJx_a1KCVtNE51A8deyEaDJbmX6D20Oxy7tG4Bho-bzGlZPXLnJKrBvZo_RoHO3m5Zbg6FGfVqk2GVYFqzjTcTDIi5cNpAlt8cVbAgTdoQIJBe2eMuhg2K1v2kPu5A_hvFqxXPG/s1600/ggplot-mtcars-hist-hp-trouble.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Indeed, <i>length(unique(mtcars$hp))</i> finds 22 unique values for the horsepower attribute, while the palette <i>Set2</i> has only 8 colors to choose from. The lack of colors in the palette causes <i>ggplot</i> to issue a warning like this (and invalidates the plot, as seen above):</span><br />
<blockquote class="tr_bq">
<span style="color: #990000;">1: In brewer.pal(n, pal) :<br /> n too large, allowed maximum for palette Set2 is 8<br />Returning the palette you asked for with that many colors</span></blockquote>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">R gives us a way to produce larger palettes by interpolating existing ones with the constructor function <i>colorRampPalette()</i> (it comes from the base <i>grDevices</i> package and is used here together with <i>RColorBrewer</i> palettes). It returns a function that does the actual job of building palettes with an arbitrary number of colors by interpolating an existing palette. Thus, expanding the palette <i>Set1</i> of 9 colors to 22 (the number of unique horsepower values in <i>mtcars</i>) gives:</span><br />
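<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">A minimal sketch of this expansion follows. The variable names (<i>colourCount</i>, <i>getPalette</i>, <i>myPalette</i>) are chosen here for illustration, and <i>geom_bar()</i> is assumed as the layer because it counts a discrete variable directly; the exact code behind the plot may differ:</span><br />

```r
library(ggplot2)
library(RColorBrewer)

# Number of distinct horsepower values that need a color
colourCount <- length(unique(mtcars$hp))

# colorRampPalette() returns a function; calling it with n yields n interpolated colors
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
myPalette <- getPalette(colourCount)

# fill is passed outside aes(): one color per bar, but nothing is mapped to a scale
p <- ggplot(mtcars) +
  geom_bar(aes(factor(hp)), fill = myPalette) +
  theme(legend.position = "right")
```

<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Because the colors are supplied as a fixed vector rather than through an aesthetic mapping, this variant colors the bars without registering a fill scale.</span><br />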
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCLZoj6TdI-awVkJbmwLDt6uE_shZ3BnVYHANqM-N8vGPETKlJvPLpGlf5apxqrHg9oL4UkF5Qn43JFBBcTFkYnppZyrwIrKpDnq-Wn1FSvu21RIJEB9WGkFc_3iaP8TK_wcKNHYSpulNN/s1600/ggplot-mtcars-mypalette-nolegend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCLZoj6TdI-awVkJbmwLDt6uE_shZ3BnVYHANqM-N8vGPETKlJvPLpGlf5apxqrHg9oL4UkF5Qn43JFBBcTFkYnppZyrwIrKpDnq-Wn1FSvu21RIJEB9WGkFc_3iaP8TK_wcKNHYSpulNN/s1600/ggplot-mtcars-mypalette-nolegend.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">While we addressed the color palette deficiency, other interesting things happened: even though all bars are back and distinctly colored, we lost the color legend. I intentionally added <i>theme(legend.position=...)</i> to showcase this fact: despite the explicit position request in <i>theme()</i>, the legend is no longer part of the plot.</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-size: medium;"><br /></span>
<span style="font-size: medium;">The difference: the <i>fill</i> parameter was moved outside of the histogram <i>aes()</i> function, which effectively removed color information from the <i>ggplot()</i> aesthetics mapping. Hence, there is nothing to apply a legend to.</span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="font-size: medium;"><br /></span>
<span style="font-size: medium;">To fix this, move <i>fill</i> back into <i>aes()</i> and use <i>scale_fill_manual()</i> to define a custom palette:</span></span><br />
<br />
<code data-gist-file="ggplot_mtcars_hp_aes.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF2IIi8sChjcJ8f1FOr3EG2uuQPDOZN2B_p-IJha9JnDeGhljsSaJxDLQxLukyZRWB1PuIWcsOhOv8NjIOu2AAebVbTStspyKx0M9yx7RzeOqilySQCelvm0J0QFPBDBGr9Nui8_ByG_Mi/s1600/ggplot-mtcars-manual-fill-scale.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF2IIi8sChjcJ8f1FOr3EG2uuQPDOZN2B_p-IJha9JnDeGhljsSaJxDLQxLukyZRWB1PuIWcsOhOv8NjIOu2AAebVbTStspyKx0M9yx7RzeOqilySQCelvm0J0QFPBDBGr9Nui8_ByG_Mi/s1600/ggplot-mtcars-manual-fill-scale.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Another likely problem with a large number of bars in plots like the one above is the placement and layout of the legend. Adjust the legend position and layout using the <i>theme()</i> and <i>guide_legend()</i> functions as follows:</span><br />
<br />
<code data-gist-file="ggplot_mtcars_legend_position.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-hjWFxrW0Ej4v-WGF2uVAMesH7XfJ7h9GhlLUV0Nzzazhhwm0bkt2TNbjE92YJZMCfR0pHnO-qtjE9Srhb8Flw6ojfgnzZifJwmO8DxymOGNhql-CzEweu1Ryn5OX7A3xLPmBc-x3BK-P/s1600/ggplot-mtcars-hist-final-legend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-hjWFxrW0Ej4v-WGF2uVAMesH7XfJ7h9GhlLUV0Nzzazhhwm0bkt2TNbjE92YJZMCfR0pHnO-qtjE9Srhb8Flw6ojfgnzZifJwmO8DxymOGNhql-CzEweu1Ryn5OX7A3xLPmBc-x3BK-P/s1600/ggplot-mtcars-hist-final-legend.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">Finally, here is the same example using an in-place palette constructor with a different choice of library palette:</span><br />
<br />
<code data-gist-file="ggplot_mtcars_legend_example.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8Of5uhKwaaxAm8TNXxOcR2_2U5MqPoGb_Kt7fIfTahHrtS8bzgJCIPpu-XNJZ4ZG_4y_VZhEAMEEdsM7KlbykfGkfVku4Ph45N_xysDrJUTBK4geHPQY29zEzgrv74vfyMRKuIJzVlJXx/s1600/ggplot-mtcars-hist-final-inplace.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8Of5uhKwaaxAm8TNXxOcR2_2U5MqPoGb_Kt7fIfTahHrtS8bzgJCIPpu-XNJZ4ZG_4y_VZhEAMEEdsM7KlbykfGkfVku4Ph45N_xysDrJUTBK4geHPQY29zEzgrv74vfyMRKuIJzVlJXx/s1600/ggplot-mtcars-hist-final-inplace.png" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">There are quite a few more <a href="http://docs.ggplot2.org/0.9.3.1/scale_brewer.html">scale functions</a> to choose from, depending on the aesthetic type (<i>colour</i>, <i>fill</i>), the color type (gradient, hue, etc.), and the data values (discrete or continuous).</span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><b>UPDATE (09.16.17)</b></span><br />
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;">This is not to undermine the usefulness of <i>RColorBrewer</i>, but there are more choices available in R. One example is the <a href="https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html">package <i>ggthemes</i></a>, which, besides offering complete themes and scales for <i>ggplot2</i>, contains themed color palettes: </span><br />
<br />
<code data-gist-file="ggplot_mtcars_solarized.R" data-gist-hide-footer="true" data-gist-id="ba4dca9636b4a6228ce5a8d5c0167968"></code>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWfEvSedioHxbFc3T3CgWhh6r3NpC_S6aqkvVft1NLF0T9ndrl8bChAmLi2Zomo9dtZDzdGqH4Yty-0p4ryZx3_iRvwkvjmICa4swO52TX86k_YCf1h-9L2D2xjh91F6twMHYH34tmCZyf/s1600/ggthemes-solarized-palette.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="625" data-original-width="630" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWfEvSedioHxbFc3T3CgWhh6r3NpC_S6aqkvVft1NLF0T9ndrl8bChAmLi2Zomo9dtZDzdGqH4Yty-0p4ryZx3_iRvwkvjmICa4swO52TX86k_YCf1h-9L2D2xjh91F6twMHYH34tmCZyf/s1600/ggthemes-solarized-palette.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9GtlMsoORbSMBHwbFy0AGi8g0KZRT8FOuPd6zU7Pp5DyhQtFuNyaTK0DxkLG63EDHI4h9H568MNGizYYoVFnAZ7olIkn2Rx8jWAkWHmmxB5Crd5KA8Rxf3m23DonWGnIj0GcKNr-DkRUG/s1600/ggplot-mtcars-manual-fill-solirized.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="624" data-original-width="630" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9GtlMsoORbSMBHwbFy0AGi8g0KZRT8FOuPd6zU7Pp5DyhQtFuNyaTK0DxkLG63EDHI4h9H568MNGizYYoVFnAZ7olIkn2Rx8jWAkWHmmxB5Crd5KA8Rxf3m23DonWGnIj0GcKNr-DkRUG/s1600/ggplot-mtcars-manual-fill-solirized.png" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><br /></span>
Posted by <a href="http://www.blogger.com/profile/09179130896383881927">Gregory Kanevsky</a><br />
<br />
<h2>
Quick R tip: ggplot in functions needs some extra care</h2>
Published 2013-07-31, updated 2013-10-18<br />
<br />
When building visualizations with <b>ggplot2</b> in R, I decided to create specialized functions that encapsulate the plotting logic for some of my creations. In this case, instead of the commonly used <b><i>aes</i></b> function I had to use its alternative, <b><i>aes_string</i></b>, for aesthetic mapping from a string.<br />
<br />
And now goes this handy tip:<br />
while original aesthetic mapping function <b><i>aes </i></b>accepts<b><i> x</i></b> and <b><i>y</i></b> parameters by position:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">p = <a href="http://inside-r.org/packages/cran/ggplot">ggplot</a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #339933;">,</span> aes<span style="color: #009900;">(</span>x<span style="color: #339933;">,</span> y<span style="color: #009900;">)</span><span style="color: #009900;">)</span> + ...</pre>
</div>
</div>
<br />
<b><i>aes_string</i></b>, even though it silently accepts them, won't work like this:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">my_plot_fun = <a href="http://inside-r.org/r-doc/base/function"><span style="color: #003399; font-weight: bold;">function</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #339933;">,</span> xname<span style="color: #339933;">,</span> yname<span style="color: #009900;">)</span> <span style="color: #009900;">{</span>
p = <a href="http://inside-r.org/packages/cran/ggplot">ggplot</a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #339933;">,</span> aes_string<span style="color: #009900;">(</span>xname<span style="color: #339933;">,</span> yname<span style="color: #009900;">)</span><span style="color: #009900;">)</span> + ...
<span style="color: #009900;">}</span></pre>
</div>
</div>
<br />
The function builds the plot object without problems, but when the plot <b><i>p</i></b> (returned from the function <b><i>my_plot_fun</i></b>) is rendered, this rather cryptic error appears:<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">
Error in as.environment(where) : 'where' is missing</span></b>
<br />
<b><span style="font-family: Courier New, Courier, monospace;"><br /></span></b>What it means is that <b>ggplot </b>never got its aesthetics defined correctly. This is because the <b><i>aes_string</i></b> function does not take the same positional parameters as its <b><i>aes </i></b>counterpart above. Instead, define both <b><i>x</i></b> and <b><i>y</i></b> parameters (and others if necessary) by name:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">p = <a href="http://inside-r.org/packages/cran/ggplot">ggplot</a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #339933;">,</span> aes_string<span style="color: #009900;">(</span>x=xname<span style="color: #339933;">,</span> y=yname<span style="color: #009900;">)</span><span style="color: #009900;">)</span> + ...</pre>
</div>
</div>
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<b><span style="font-family: Courier New, Courier, monospace;"><br /></span></b>
<br />
<h4>
UPDATE:</h4>
<br />
One more vote for using <i><b>aes_string</b></i> in place of <b><i>aes </i></b>comes from CRAN submission policy, i.e.:<br />
<blockquote class="tr_bq">
<span style="background-color: white;">In principle, packages must pass </span><code style="background-color: white;">R CMD check</code><span style="background-color: white;"> without warnings or significant notes to be admitted to the main </span><acronym style="background-color: white;">CRAN</acronym><span style="background-color: white;"> package area. If there are warnings or notes you cannot eliminate (for example because you believe them to be spurious) send an explanatory note as part of your covering email, or as a comment on the submission form.</span> </blockquote>
<blockquote class="tr_bq">
(source: <a href="http://cran.r-project.org/web/packages/policies.html">CRAN Repository Policy</a>)</blockquote>
What happens is that <span style="background-color: white;"> </span><code style="background-color: white;">R CMD check</code><span style="background-color: white;"> </span> reports notes like this for every <b><i>aes </i></b>call:<br />
<br />
<pre class="default prettyprint prettyprinted" style="background-color: #eeeeee; border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, serif; font-size: 14px; line-height: 18px; margin-bottom: 10px; max-height: 600px; overflow: auto; padding: 5px; vertical-align: baseline; width: auto; word-wrap: normal;"><code style="border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, serif; margin: 0px; padding: 0px; vertical-align: baseline;"><span class="kwd" style="background-color: transparent; border: 0px; color: darkblue; margin: 0px; padding: 0px; vertical-align: baseline;">no</span><span class="pln" style="background-color: transparent; border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;"> visible binding </span><span class="kwd" style="background-color: transparent; border: 0px; color: darkblue; margin: 0px; padding: 0px; vertical-align: baseline;">for</span><span class="pln" style="background-color: transparent; border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;"> </span><span class="kwd" style="background-color: transparent; border: 0px; color: darkblue; margin: 0px; padding: 0px; vertical-align: baseline;">global</span><span class="pln" style="background-color: transparent; border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;"> variable </span><span class="pun" style="background-color: transparent; border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">[</span><span class="pln" style="background-color: transparent; border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">variable name</span><span class="pun" style="background-color: transparent; border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">]</span></code></pre>
It turns out that the most sensible solution is <a href="http://stackoverflow.com/q/9439256/59470">using <b><i>aes_string</i></b> instead</a>.<br />
Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-5550378224455634642012-07-26T23:40:00.000-05:002012-07-26T23:40:31.019-05:00My New Calculator(s)We all need a calculator from time to time. I used to reach for the <b>Start</b> button and type <i>Calc</i> in the <b>Run </b>(or <b>Search</b>) box to get to the <b>Calculator </b>app (<i>Windows</i>). Until recently, that is. Now I simply start Octave and do my calculations there. Sometimes I already have a Python prompt open, and then I do my calculations there.<br />
<br />
For example, compute the sample variance of 10 coin flips: 4 tails (0) and 6 heads (1), with estimated mean <i>p=0.6</i>:
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave-3.2.4.exe>(4*(0-0.6)**2 + 6*(1-0.6)**2)/(10-1)<br />
ans = 0.26667</blockquote>
This calculator really works for me. Sometimes I have a Python window and it works just as well:
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
>>> (4*(0-0.6)**2 + 6*(1-0.6)**2)/(10-1)<br />
0.2666666666666667</blockquote>
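The hand-typed formula can also be cross-checked with Python's standard library. This is only a minimal sketch; the names <i>flips</i>, <i>p_hat</i> and <i>sample_var</i> are illustrative and not part of the sessions above.<br />

```python
import statistics

# 10 coin flips: 4 tails (0) and 6 heads (1)
flips = [0] * 4 + [1] * 6

p_hat = statistics.mean(flips)            # estimated mean of the sample
sample_var = statistics.variance(flips)   # sample variance, divides by (n - 1)

print(p_hat, sample_var)
```

The value of <i>sample_var</i> should agree with the expression typed by hand, since both divide the sum of squared deviations by 10-1.<br />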
Having said that, Octave beats Python in ease of use when calculating with vectors (or time series or sequences or anything that can be represented as a vector). Let's suppose we test whether a certain coin is fair (not loaded) by flipping it 14 times. We test the hypothesis that the coin is fair (p=0.5) at the 95% confidence level, which calls for a two-tailed test. Suppose that 14 flips resulted in only 3 heads. First, we build a critical interval - the numbers of heads that would lead us to reject coin fairness:
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
critical_interval = binocdf(0:14,14,0.5) < 0.025 | binocdf(0:14,14,0.5) - binopdf(0:14,14,0.5) > 0.975
</blockquote>
<i>critical_interval</i> is a Boolean vector where the i-th element corresponds to <i>(i-1)</i> heads: if it is true then that outcome falls in the critical interval and we reject fairness at the 95% confidence level. The expression is a logical OR of 2 conditions: the first covers the left tail (the probability of at most that many heads is below 0.025) and the second covers the right tail (the probability of strictly fewer heads is above 0.975). Octave handles vectors as seamlessly as plain numbers: I can change this to 1000 flips with minimum keystrokes.<br />
<br />
Thus, given 3 heads we get<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave-3.2.4.exe> critical_interval(3+1)<br />
ans = 0 </blockquote>
We use <i>(3+1)</i> because the 1st element corresponds to 0 heads. Hence we cannot reject the hypothesis that our coin is fair at the 95% confidence level, because 3 heads does not belong to the critical interval. Similarly, we have to reject this coin as loaded if the number of heads happens to be 12: <br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave-3.2.4.exe> critical_interval(12+1)<br />
ans = 1 </blockquote>
<br />
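For readers who want to check the two answers above without Octave, here is a rough sketch using only Python's standard library. It assumes a two-tailed region defined as "P(X &lt;= k) below 0.025, or P(X &lt; k) above 0.975"; the helper names <i>binom_pmf</i> and <i>critical_region</i> are hypothetical, not library functions.<br />

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_region(n, p=0.5, alpha=0.05):
    """For each count k in 0..n, flag whether k lies in the two-tailed rejection region."""
    flags = []
    below = 0.0  # running P(X < k)
    for k in range(n + 1):
        at_most = below + binom_pmf(k, n, p)   # P(X <= k)
        left = at_most < alpha / 2             # left tail
        right = below > 1 - alpha / 2          # right tail
        flags.append(left or right)
        below = at_most
    return flags

region = critical_region(14)
print(region[3], region[12])
```

Python lists are 0-based, so <i>region[3]</i> is the entry for a count of 3 and no <i>+1</i> shift is needed.<br />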
I leave the Python example as an exercise for the reader, but I am certain that the result won't come close to Octave in either transparency or conciseness. The Octave solution leaves me the right to keep calling this use calculator-like - not programming.Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com0tag:blogger.com,1999:blog-7530218802939252476.post-57298965586001522982012-06-23T16:11:00.000-05:002012-06-23T16:20:27.983-05:00Intro to Octave for Coursera StudentsOctave as a programming language has a lot to offer. To give you a taste, this post attempts to showcase some of the cooler features of the language. It also serves as an introduction to Octave (or Matlab) for those who are taking or considering taking <a href="https://www.coursera.org/course/ml" target="_blank">Coursera Machine Learning class by Professor Andrew Ng</a> (great great idea). Not incidentally, most of the examples were inspired by the homework assignments for the course.<br />
<br />
<div id="disclaimer" style="border: 1px solid #f00; text-align: center; width: 600px;">
<br />
<i><b>Disclaimer:</b> this post contains no concrete references, examples, excerpts or solutions for any of the Coursera courses, exercises, or homework assignments.</i>
<br />
<br /></div>
<br />
<h3>
Matrix Basics</h3>
Suppose we have a 3 by 5 matrix <b><i>A</i></b> like this:
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> A = [1 2 3 4 5; 2 3 4 5 6; 3 4 5 6 7]
<br />
<br />
A =<br />
<br />
1 2 3 4 5<br />
2 3 4 5 6<br />
3 4 5 6 7</blockquote>
<br />
Extracting a single element from <b><i>A</i></b> then works much as in most other languages:
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> A(2,3)<br />
<br />
ans = 4 </blockquote>
Just remember that following <a href="http://en.wikipedia.org/wiki/Matrix_%28mathematics%29" target="_blank">mathematical conventions</a> Octave indexes start with 1.<br />
<br />
Almost everything in Octave is an array (vector or matrix or similar) and an index is no exception. Let's take a range for example. A range is a row vector with evenly spaced elements, e.g.:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> 1:5<br />
ans =<br />
<br />
1 2 3 4 5<br />
<br />
octave:> A(1:3,1:5)<br />
ans =<br />
<br />
1 2 3 4 5<br />
2 3 4 5 6<br />
3 4 5 6 7</blockquote>
Operation <i><b>A(1:3, 1:5)</b></i> returns the whole of A again because it selects all of its rows and columns. Ranges can exist by themselves as vectors, but there is a special form that is available only inside a matrix index:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> A(2:end,3:end)<br />
ans =<br />
<br />
4 5 6<br />
5 6 7</blockquote>
Ranges <i><b>2:end</b></i> and <i><b>3:end</b></i> are defined only within the context of a concrete matrix, as the keyword <i><b>end</b></i> stands for the last row or column position of that matrix. You can even select the elements in the last 2 rows and columns like this:<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> A(end-1:end,end-1:end)<br />
ans =<br />
<br />
5 6<br />
6 7</blockquote>
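For readers coming from Python, the closest counterpart to <i><b>end</b></i> is negative indexing in slices. A minimal sketch, assuming the matrix is stored as a plain list of rows (the names <i>lower_right</i> and <i>last_two</i> are illustrative):<br />

```python
A = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
]

# Counterpart of Octave A(2:end, 3:end): rows 2..last, columns 3..last (1-based in Octave)
lower_right = [row[2:] for row in A[1:]]

# Counterpart of Octave A(end-1:end, end-1:end): last two rows and last two columns
last_two = [row[-2:] for row in A[-2:]]

print(lower_right)
print(last_two)
```

Keep in mind that Python slices are 0-based and exclude the upper bound, while Octave ranges are 1-based and include it.<br />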
<h3>
Logical Operations on Matrices </h3>
Logical arrays in Octave contain only logical (true/false) elements and are usually the result of applying relational operators to vectors and matrices, like this:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> A != 3<br />
ans =<br />
<br />
1 1 0 1 1<br />
1 0 1 1 1<br />
0 1 1 1 1</blockquote>
One cool application of this is flipping the 0s and 1s of an identity matrix (a logical negation, not a matrix inverse):<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> I = eye(3)<br />
I =<br />
<br />
Diagonal Matrix<br />
<br />
1 0 0<br />
0 1 0<br />
0 0 1<br />
<br />
octave:> I == 0<br />
ans =<br />
<br />
0 1 1<br />
1 0 1<br />
1 1 0</blockquote>
<h3>
Euclidean Distance</h3>
Given two vectors of the same size, find the Euclidean distance between them. <br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
a = [0 0 0];<br />
b = [1 2 2];<br />
distance = sqrt(sumsq(a-b));</blockquote>
If you find yourself writing in Octave more complex solutions for similar problems with vectors then stop and review your vectorization approach.<br />
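<br />
The same computation can be sketched in plain Python for comparison; the generator expression plays the role of <i>sumsq(a-b)</i>. The vectors reuse the values from the Octave snippet.<br />

```python
import math

a = [0, 0, 0]
b = [1, 2, 2]

# Square root of the sum of squared coordinate differences
distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance)
```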
<br />
<h3>
Vectorizing indexes</h3>
Suppose you have a collection of <i><b>m </b></i>vectors in <i><b>n-</b></i>dimensional space stored as <i><b>m </b></i>x <i><b>n </b></i>matrix <i><b>X</b></i>. Suppose that the value of last (n-th) coordinate of these vectors is always 0 or 1. Then we want to produce 2 subsets of X - one subset of vectors with last coordinate equal to 0 and the other subset where vectors have last coordinate equal to 1:<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
n = size(X, 2);<br />
X0 = X( find( X(:, n) == 0 ), :);<br />
X1 = X( find( X(:, n) == 1 ), :);</blockquote>
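A rough Python counterpart filters the rows with list comprehensions. The data below is made up purely for illustration; only the last coordinate of each row matters.<br />

```python
# Hypothetical data: each row is a vector whose last coordinate is 0 or 1
X = [
    [0.5, 1.2, 0],
    [2.0, 0.3, 1],
    [1.1, 4.0, 0],
    [3.3, 2.2, 1],
]

# Split the rows by the value of their last coordinate
X0 = [row for row in X if row[-1] == 0]
X1 = [row for row in X if row[-1] == 1]

print(len(X0), len(X1))
```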
<h3>
Randomness</h3>
Random numbers appear in many problems. The basic approach is <i><b>rand</b></i>, which returns a matrix filled with random numbers from the uniform distribution on the interval (0, 1):<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> rand(3,4)<br />
ans =<br />
<br />
0.347937 0.317482 0.630678 0.245148<br />
0.917634 0.649125 0.634592 0.837635<br />
0.994745 0.092818 0.154936 0.966380</blockquote>
But Octave offers a few shortcuts. For one, such common distributions as normal, exponential, Poisson, and gamma each receive their own function <i><b>randn</b></i>, <i><b>rande</b></i>, <i><b>randp</b></i>, and <i><b>randg</b></i>.<br />
<br />
Function <i><b>randperm </b></i>produces a row vector of randomly permuted integers from 1 to <b><i>n</i></b>:<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> randperm(5)<br />
ans =<br />
<br />
3 1 4 5 2</blockquote>
If you are looking for a vector of <i><b>n</b></i> distinct random integers between 1 and <i><b>N</b></i> <i><b>(n<=N)</b></i> then <i><b>randperm </b></i>gets it done (in this case <i><b>N = 100</b></i> and <i><b>n = 10</b></i>):<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> x = randperm(100)(1:10)<br />
x =<br />
<br />
88 1 89 25 76 19 78 38 99 34</blockquote>
<br />
This, combined with vector indexing, accomplishes a rather elaborate task in a short one-liner: suppose we have <i><b>m n</b></i>-dimensional vectors stored as an <i><b>m </b></i>x<i><b> n</b></i> matrix <i><b>X</b></i> and we need to pick <i><b>k</b></i> vectors (<i><b>k < m</b></i>) randomly. The following gets this done:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
X(randperm(size(X, 1))(1:k), :);</blockquote>
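In Python the standard library covers the same task with <i>random.sample</i>, which draws without replacement. A minimal sketch with made-up data (the matrix contents and <i>k = 3</i> are assumptions for illustration):<br />

```python
import random

X = [[i, i * i] for i in range(1, 11)]   # hypothetical 10 x 2 data, one vector per row
k = 3

# k distinct rows of X, chosen at random and returned in random order
picked = random.sample(X, k)

print(picked)
```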
<br />
Newer releases of Octave (I use 3.2.4) added functions <i><b>randi </b></i>and <i><b>randperm(n, m)</b></i> that offer even nicer features.<br />
<br />
<h3>
Binary Singleton Expansion Function (bsxfun)</h3>
This function reminds me of the Python <i><b><a href="http://docs.python.org/library/functions.html#map" target="_blank">map</a> </b></i>function, but having it in Octave is a necessity (unlike in Python). When vectorizing in Octave we have a few options: 1) both parameters are of the same dimensions for element-wise application; 2) parameters are of compatible sizes for matrix operations like multiplication; 3) one parameter is a matrix and the other is a scalar.<br />
<br />
And then we use <i><b>bsxfun </b></i>for vectorizing everything else, for example applying a vector to each row:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
X = rand(5, 3);<br />
mu = mean(X);<br />
sigma = std(X); <br />
X_norm = bsxfun(@minus, X, mu);<br />
X_norm = bsxfun(@rdivide, X_norm, sigma);</blockquote>
<br />
The operations above resulted in normalizing all 5 vectors (rows) from <i><b>X</b></i>: <i><b>X_norm </b></i>contains vectors with 0 means and standard deviations 1 for all 3 dimensions. In just 2 lines <i><b>bsxfun</b></i> applied <i><b>mu </b></i>(means) and <i><b>sigma </b></i>(standard deviations) to each row of <i><b>X </b></i>and <i><b>X_norm</b></i>.<br />
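<br />
For comparison, here is a sketch of the same per-column normalization in plain Python. The data is made up, and <i>statistics.stdev</i> is used because, like Octave's default <i>std</i>, it divides by n - 1.<br />

```python
import statistics

# Hypothetical data: 5 vectors (rows) in 3 dimensions
X = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 300.0],
    [3.0, 30.0, 200.0],
    [4.0, 40.0, 500.0],
    [5.0, 50.0, 400.0],
]

columns = list(zip(*X))                              # transpose: one tuple per dimension
mu = [statistics.mean(col) for col in columns]       # per-column means
sigma = [statistics.stdev(col) for col in columns]   # per-column sample standard deviations

# Subtract the column mean and divide by the column deviation, row by row
X_norm = [
    [(x - m) / s for x, m, s in zip(row, mu, sigma)]
    for row in X
]
```

The inner <i>zip</i> does by hand what <i>bsxfun</i> does implicitly: it pairs every element of a row with the statistic of its own column.<br />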
<br />
<h3>
Timing operations</h3>
Use functions <i><b>tic </b></i>and <i><b>toc </b></i> to measure execution time in Octave to tune performance when necessary:<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> tic; A * A'; toc<br />
Elapsed time is 2.58e-008 seconds.</blockquote>
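<br />
A comparable pattern in Python uses <i>time.perf_counter</i>. The sketch below times a hand-written product of a matrix with its own transpose; the 50 x 50 matrix is an arbitrary assumption, and the measured time will differ from machine to machine.<br />

```python
import time

A = [[float(i + j) for j in range(50)] for i in range(50)]   # hypothetical 50 x 50 matrix

start = time.perf_counter()
# A times its transpose: entry (i, j) is the dot product of rows i and j of A
product = [[sum(x * y for x, y in zip(r1, r2)) for r2 in A] for r1 in A]
elapsed = time.perf_counter() - start

print(f"Elapsed time is {elapsed:.3g} seconds.")
```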
<br />Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com2tag:blogger.com,1999:blog-7530218802939252476.post-55736771704536235062012-06-04T14:57:00.000-05:002012-07-26T15:39:40.707-05:00Gentle Intro to Octave or MatlabI began using Octave for homework assignments from the online <a href="https://class.coursera.org/ml/class/index" target="_blank">Machine Learning class</a>. Having worked with languages like Python, Groovy and JavaScript I never expected a system designed for numerical computations to include such a complete and unique programming language. But it does and I can't resist sharing some examples.<br />
<br />
There are two important things one should know about Octave (or Matlab as Octave is usually portable to Matlab):<br />
<ul>
<li>Octave is a <b>high level</b> language just like Python or Groovy</li>
<li>Using Octave without <b>matrices or vectors</b> is like using Java without objects </li>
</ul>
Just these by themselves are worth a whole book on Octave, but instead I go on with a few cool examples (leaving the book for later :-). <br />
<br />
<h4>
Matrices and Vectors</h4>
Creating a vector or matrix in Octave is simple:<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> A = [1 2 3; 4 5 6; 7 8 9]<br />
A =<br />
<br />
1 2 3<br />
4 5 6<br />
7 8 9 </blockquote>
defines a 3x3 matrix of integers.<br />
<br />
Use special functions to define special matrices, e.g. identity: <br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> I = eye(3)<br />
I =<br />
<br />
Diagonal Matrix<br />
<br />
1 0 0<br />
0 1 0<br />
0 0 1</blockquote>
<br />
all zeros:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> allZeros = zeros(2,4) <br />
allZeros =<br />
<br />
0 0 0 0<br />
0 0 0 0 </blockquote>
vector (number of columns is 1) of all ones: <br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> allOnes = ones(3,1) <br />
allOnes =<br />
<br />
1<br />
1<br />
1 </blockquote>
or matrix with random values:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> X = rand(3, 5)<br />
X =<br />
<br />
0.400801 0.091597 0.951333 0.063074 0.018309<br />
0.690633 0.194094 0.417911 0.658953 0.624323<br />
0.848887 0.696741 0.213559 0.363656 0.632738</blockquote>
And finally, getting a vector of values from 1 to N (a row vector; here N is 10):<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> 1:N<br />
ans =<br />
<br />
1 2 3 4 5 6 7 8 9 10</blockquote>
and column (vector above transposed):<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> (1:N)'<br />
ans =<br />
<br />
1<br />
2<br />
3<br />
4<br />
5<br />
6<br />
7<br />
8<br />
9<br />
10</blockquote>
<br />
To stir things up a bit I make the following claim:<br />
<blockquote class="tr_bq" style="border:3px solid gray;margin:1ex;padding:1ex;">
<b><i>In Octave for any given problem there is higher than 50% chance that using matrices alone solves the problem with less code and more efficiently than when using loop and condition statements. </i></b></blockquote>
Being a high level language, Octave has control statements <i><b>if, switch, </b></i>loops <i><b>for</b></i> and <i><b>while </b></i>but using them in Octave is often your second choice. The reason is that the many matrix operators and functions Octave offers can often accomplish a task without invoking a single control statement, and in a fraction of the time.<br />
<br />
Suppose you have a matrix <i><b>X</b></i> and you need to insert a column of <i><b>1s</b></i> in front. Then I just concatenate a vector of <i><b>1s</b></i> of proper size and <i><b>X</b></i>:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
X = [ ones(size(X, 1), 1), X ];</blockquote>
<br />
<u>What just happened:</u> function <i>size </i>returned the 1st dimension of array <i><b>X</b></i> (its number of rows); then function <i>ones </i>generated a column vector (2nd dimension is 1) of <i><b>1s</b></i>; and finally we concatenated that column and <i>X</i>. But this is only the beginning.<br />
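<br />
For comparison, a plain Python sketch of the same step, assuming the matrix is a list of rows (the sample values are made up):<br />

```python
# Hypothetical 3 x 2 matrix stored as a list of rows
X = [
    [2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0],
]

# Build a new matrix whose first column is all ones
X = [[1.0] + row for row in X]

print(X)
```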
<br />
<h4>
Matrix magic</h4>
This example illustrates why and how things may work out better without control statements in Octave. Suppose I have a row vector (we may call it also an array but ultimately it is a single row matrix) of numbers from 1 to 10:<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
octave:> y = randperm(10)<br />
y =<br />
<br />
4 3 1 8 6 10 9 2 7 5 </blockquote>
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
<br />
octave:> y = repmat(y, 1, 10);</blockquote>
<br />
Function <i><b>randperm </b></i>returned a row vector containing random permutation of numbers 1 through 10. Then I used function <i><b>repmat </b></i>to make vector <i><b>y</b></i> 10 times its original length by repeating it 10 times (which results in 1 by 100 matrix <i><b>y</b></i>).<br />
<br />
Now, to the problem we will solve. Let's say each number from 1 to 10 corresponds to a 10-dimensional vector with all of its elements set to 0 except for a 1 at the position of the number. Thus, <i><b>1</b></i> corresponds to vector <i>(1 0 0 0 0 0 0 0 0 0)</i>, <b><i>2</i></b> corresponds to vector <i>(0 1 0 0 0 0 0 0 0 0)</i> and so on until <b><i>10</i></b>, which corresponds to <i>(0 0 0 0 0 0 0 0 0 1)</i>. Then, given some vector <i><b>y</b></i> like above (where each element is a number from 1 to 10), we want to produce the series of vectors that correspond to the elements of <i><b>y</b></i>.<br />
<br />
If you have never worked with Octave before then the solution may amaze you; if you have, then this might be your normal routine:<br />
<br />
<blockquote class="tr_bq" style="font-family: "Courier New",Courier,monospace; font-weight: bold;">
A = eye(10)<br />
<br />
A =<br />
<br />
Diagonal Matrix<br />
<br />
1 0 0 0 0 0 0 0 0 0<br />
0 1 0 0 0 0 0 0 0 0<br />
0 0 1 0 0 0 0 0 0 0<br />
0 0 0 1 0 0 0 0 0 0<br />
0 0 0 0 1 0 0 0 0 0<br />
0 0 0 0 0 1 0 0 0 0<br />
0 0 0 0 0 0 1 0 0 0<br />
0 0 0 0 0 0 0 1 0 0<br />
0 0 0 0 0 0 0 0 1 0<br />
0 0 0 0 0 0 0 0 0 1<br />
<br />
result = A(:, y);</blockquote>
<br />
<u>What just happened:</u> first we created an identity matrix <i><b>A</b></i> of size 10 - note that it consists of exactly the 10 vectors we are mapping to, each in the right column position. Now we just plug our original vector y into the column index of matrix A. This extracts elements from A: all rows, and exactly the columns named in y. Thus <i><b>A(:, 2)</b></i> gives us a matrix which is the 2nd column of <i><b>A</b></i>, <i><b>A(:, 2:4)</b></i> gives us a matrix with columns 2 to 4 of <i><b>A</b></i>, the same is accomplished with <i><b>A(:, [2:4])</b></i>, and <i><b>A(:, [1 4 9])</b></i> selects the 1st, 4th and 9th columns of <i><b>A</b></i>. Finally, we can plug an arbitrary vector into the column index - in our case vector <i><b>y</b></i>, which is just what we need - and <i><b>result </b></i>becomes a 10 by 100 matrix where each column corresponds to an element of <i><b>y</b></i>.<br />
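<br />
The same lookup idea can be sketched in plain Python. This is only an illustration: the name <i>identity</i> is mine, and the indicator vectors are stored as rows of a list rather than as matrix columns.<br />

```python
import random

n = 10
# Random permutation of 1..10, repeated 10 times (100 elements in total)
y = random.sample(range(1, n + 1), n) * 10

# Row i of the identity matrix is the indicator vector for the number i + 1
identity = [[1 if r == c else 0 for c in range(n)] for r in range(n)]

# One indicator vector per element of y, looked up by position
result = [identity[value - 1] for value in y]

print(len(result), len(result[0]))
```

The <i>value - 1</i> shift is needed because Python lists are 0-based while the numbers in <i>y</i> start at 1.<br />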
<br />
As unconventional as it may sound, I keep thinking of this solution in terms of ranges when indexing arrays. An array can be indexed by an integer, a range, or an array of numbers. The latter is just a vector, and that is all there is to it.Gregory Kanevskyhttp://www.blogger.com/profile/09179130896383881927noreply@blogger.com4