Wednesday, July 5, 2017

The Role of Small Data and Vacation Recap Example

Wikipedia defines small data as 'small' enough for human comprehension, but then it goes further by qualifying it as data in a volume and format that makes it accessible, informative and actionable. I am not certain the latter is always true: a smaller footprint doesn't automatically make data informative and actionable without more work. In my book small data usually scales to kilobytes and has just a handful of dimensions. But its main feature remains human comprehension, which really means there is a simple story behind it.

In the grand scheme of big data things, the small data story is the last mile of data science analysis. It still requires interpretation (or representation) in the form of a visualization or an application.

A case in point is the Google spreadsheet I kept while on vacation in Italy, with daily recordings of miles and steps walked. Later I added the main attractions for each day. The result was my personal small data covering about 2 weeks of touring Italy with bases in Rome and later in Sicily (this sentence was the story):

Google sheet of activities while on vacation in Italy

As-is, this spreadsheet is destined for the Google archives, contributing to the ever-growing collection of docs I created and happily forgot about. So I created this visualization that represents both most of the data and the story:

Small data visualization

Before explaining how this visualization was created with R, I ought to acknowledge that Google spreadsheets offer adding a chart or graph to a document. But that functionality appears rather limited without resorting to the JavaScript API.

Using the R googlesheets package to source Google docs makes them an integral part of the data sources available from within R code:
For details on how the code above authenticates with Google servers and processes documents, see the very detailed vignettes.

Now we can get back to small data and its simple story, which means a single visualization may include most if not all of it. In the case of small data the goal is designing such a chart without sacrificing clarity.

The core attributes, days (Date) and miles walked (Distance; I chose miles over Steps for simplicity), suggest a line chart with the timeline along the x-axis and distance on the y-axis. But there are 2 more factors to incorporate: Place, indicating where the base city was each day, and Label for major attractions.

The base city receives color identification, with deep red for Rome and olive green for Syracuse. The major attractions text was attached to each point with smart justifications to fit inside the chart:
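A chart along these lines could be sketched with ggplot2 roughly as follows. This is only an illustration under assumptions: the data frame is invented (the real values live in the spreadsheet), the column names Date, Distance, Place and Label follow the text, and the hex colors and the simple left/right hjust rule are my stand-ins for the "smart justifications" mentioned above.

```r
library(ggplot2)

# Hypothetical stand-in for the vacation sheet; all values are invented.
trip <- data.frame(
  Date     = as.Date("2017-06-10") + 0:5,
  Distance = c(6.2, 8.1, 7.4, 3.0, 5.5, 4.8),
  Place    = c("Rome", "Rome", "Rome", "Syracuse", "Syracuse", "Syracuse"),
  Label    = paste("Attraction", LETTERS[1:6])
)

# Left-justify labels in the first half of the timeline and right-justify
# them in the second half so the text stays inside the plotting area.
trip$hjust <- ifelse(seq_len(nrow(trip)) <= nrow(trip) / 2, 0, 1)

p <- ggplot(trip, aes(x = Date, y = Distance)) +
  geom_line(color = "grey60") +
  geom_point(aes(color = Place), size = 3) +
  geom_text(aes(label = Label, hjust = hjust), vjust = -1, size = 3) +
  # Deep red for Rome, olive green for Syracuse (assumed hex codes).
  scale_color_manual(values = c(Rome = "#8B0000", Syracuse = "#6B8E23")) +
  labs(x = NULL, y = "Miles walked", color = "Base city")
```

Printing `p` draws the chart. With a real log the only change would be reading `trip` from the sheet instead of typing it in.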

Had I kept a more detailed log I would have ended up with more dimensions to use. For example, miles traveled by car or train, time spent at leisure versus touring, number of cities and places visited, historical marker attributes and so on. But that moves us further away from the small data domain as footprint and dimensions grow and the story becomes less comprehensible. One indicator of this is that it becomes harder to collect data manually. Instead, there are apps that would do it for me, for example, Life Cycle or Apple Health.

Ultimately any big data problem is reduced to one or more small data ones by aggregation, regression, clustering or some other data science method. The path to big data insights is a journey from big to small data in search of a simple story. So learning how to deal with small data is where it all both ends and begins.

Friday, June 23, 2017

Logarithmic Scale Explained with U.S. Trade Balance

Skewed data prevail in real life. Unless you observe trivial or near-constant processes, data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.
Consider the U.S. 2016 merchandise trade partner balances data set, where each point is a country with 2 features: U.S. imports from it and U.S. exports to it:

Suppose we decided to visualize the top 30 U.S. trading partners using a bubble chart, which is simply a 2D scatter plot with the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for xy coordinates and trade balance (abs(export - import)) for size:
China, Canada, and Mexico run far larger balances than the other 27 countries, which causes most data points to collapse into the crowded lower left corner. One way to "solve" this problem is to eliminate the 3 mentioned outliers from the picture:

While this plot does look better, it no longer serves its original purpose of displaying all top trading partners. And the undesirable effect of outliers, though reduced, still presents itself with new ones: Japan, Germany, and the U.K. So let us bring all countries back into the mix by trying a logarithmic scale.
Quick refresher from algebra. The log function (in this example log base 10, but the same applies to natural log or log base 2) is commonly used to transform positive real numbers, all because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers A, B, and C such that

`A*B=C and A,B,C > 0`

applying log results in an additive relationship:

`log(A) + log(B) = log(C)`

For example, let A=100, B=1000, and C=100000 then
`100 * 1000 = 100000`

so that after transformation it becomes

`log(100) + log(1000) = log(100000)`  or   `2 + 3 = 5`
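The same arithmetic can be checked in R with a few lines. This is just a quick sketch using base R; `all.equal` is used instead of `==` because logs are floating-point numbers.

```r
A <- 100
B <- 1000
C <- A * B                       # multiplicative relationship: A * B = C

lhs <- log10(A) + log10(B)       # additive relationship after the transform
rhs <- log10(C)

same_base10  <- isTRUE(all.equal(lhs, rhs))
# The property does not depend on the base of the logarithm.
same_natural <- isTRUE(all.equal(log(A) + log(B), log(C)))
same_base2   <- isTRUE(all.equal(log2(A) + log2(B), log2(C)))
```

All three flags come out `TRUE`, which is the multiplicative-to-additive property in action.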

Observe this on a 1D scale:

A logarithmic scale is simply a log transformation applied to all of a feature's values before plotting them. In our example we used it on both of the trading partners' features, imports and exports, which gives the bubble chart a new look:
The same data displayed on a logarithmic scale appear almost uniform, but do not forget that the farther points sit from the origin, the more orders of magnitude larger they are on the actual scale (observe this by scrolling back to the original plot). The main advantage of using a log scale in this plot is the ability to observe relationships between all top 30 countries without losing the whole picture and without collapsing smaller points together.
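For readers who want to try this, here is a minimal ggplot2 sketch of the linear versus logarithmic versions of such a bubble chart. The figures are invented for illustration and are not the actual trade data; the column names imports, exports and balance are my own choice.

```r
library(ggplot2)

# Invented figures with a few very large partners; NOT real trade data.
trade <- data.frame(
  partner = paste("Partner", 1:8),
  imports = c(500, 300, 300, 120, 100, 50, 20, 8),
  exports = c(100, 250, 200, 60, 50, 40, 12, 9)
)
trade$balance <- abs(trade$exports - trade$imports)

base_plot <- ggplot(trade, aes(x = imports, y = exports, size = balance)) +
  geom_point(alpha = 0.5) +
  scale_size_area(max_size = 20)

# Linear axes: the small partners crowd into the lower left corner.
linear_plot <- base_plot

# Log axes: both features are log10-transformed before plotting.
log_plot <- base_plot + scale_x_log10() + scale_y_log10()

# On the raw scale imports span hundreds of units,
# on the log10 scale they span fewer than 2 orders of magnitude.
raw_spread <- diff(range(trade$imports))
log_spread <- diff(range(log10(trade$imports)))
```

Comparing `linear_plot` and `log_plot` side by side shows the same effect described above: the log version spreads the small points apart while keeping the large ones in view.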
For a more detailed discussion of logarithmic scales refer to When Should I Use Logarithmic Scales in My Charts and Graphs? Oh, and how about that trade deficit with China?
This is a re-post from the original blog on LinkedIn.

Friday, May 26, 2017

MapReduce in Two Modern Paintings

Two years ago we had a rare family outing to the Dallas Museum of Art (my son is a teenager and he's into sports after all). It had an excellent exhibition of modern art and the DMA allowed taking pictures. Two hours and a dozen pictures later my weekend was over, but thanks to Google Photos I just stumbled upon those pictures again. Suddenly, I realized that two paintings I captured make up an illustration of one of the most important concepts in big data.

There are multiple papers, tutorials and web pages about MapReduce and to truly understand and use it one should study at least a few thoroughly. And there are many illustrations of MapReduce structure and architecture out there.

But the power of art can express more with less, with just two paintings. First, we have the work by Erró, Foodscape, 1964:

It illustrates variety, richness, potential for insight (if consumed properly), and of course, scale. The painting is boundless, with no end to the table surface in all 4 directions. Also observe the many types of food and drinks, packaging, and presentations, varying in color, texture and origin (better quality image here). All these represent big data so much better than any kind of flowchart diagram.

The second and final painting is by Wayne Thiebaud, Salads, Sandwiches, and Desserts, 1962:

If we think of how MapReduce works, this seemingly infinite table (also fittingly resembling a conveyor line) looks like the result of split-apply-combine executed on the Foodscape items. Indeed, each vertical group is a combination of the same type of finished and plated food, combined into variably sized groups and ready to serve (better quality image here).
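For the less artistically inclined, the split-apply-combine idea can be sketched in a few lines of base R. The "menu" below is invented, and this is of course a single-machine toy rather than MapReduce itself: `split` plays the role of grouping items by key, `lapply` applies a function to each group, and `unlist` combines the results.

```r
# Invented Foodscape-like pile of items; each row is one item on the table.
foodscape <- data.frame(
  item     = c("salad", "sandwich", "dessert", "salad",
               "dessert", "sandwich", "salad"),
  servings = c(2, 1, 3, 4, 1, 2, 1)
)

groups <- split(foodscape$servings, foodscape$item)  # split: group by item type
totals <- lapply(groups, sum)                        # apply: summarize each group
result <- unlist(totals)                             # combine: one named vector
```

`result` holds one total per item type, the plated and lined-up version of the original pile.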

As with any art, there is much about MapReduce that was left out of the picture. That's why we still have papers, books, and Wikipedia. And again, I'd like to remind you of the importance of taking your kids to a museum.

Tuesday, December 20, 2016

Correlation Primer with Aster and R

Calculating correlations is often the starting point before more advanced analytical steps take place. Big data (long data) always presents computational challenges of both scale and distributed nature. In turn, these may be aggravated by the presence of a large number of features (wide data). But the challenges do not stop there, as complex relationships call for analysis of correlations across subsets and groups.

Such a mix of long and wide becomes more common in the age of the internet of things, sensor and machine data, with non-human data sources dominating analytical use cases.
Thus, when computing correlations on big data the following capabilities matter:
  • scale on large distributed data sets (long data)
  • scale on wide distributed data sets (wide data / large number of features)
  • flexibility on wide data sets (ability to permute features such as Cartesian combinations, one-to-many, etc.)
  • correlations on subsets and groups.
Correlation in R comes standard with the stats function cor, but it doesn't meet most of the capabilities above. As always, the Teradata Aster big data analytical platform offers both scalability and functionality far exceeding the capabilities above. And thanks to the Aster R (TeradataAsterR) package it is available without leaving the R environment.
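To make the baseline concrete, here is a small sketch of what plain `cor` does and how correlations per group can be pieced together by hand in base R. It uses the built-in `mtcars` data set purely as an example; none of this touches Aster, and it only works for data that fits in memory.

```r
# Correlation matrix across a few numeric columns (the "wide" direction).
vars    <- c("mpg", "wt", "hp")
overall <- cor(mtcars[, vars])

# Correlation of one pair of columns within each group:
# split the rows by cylinder count, then apply cor to every piece.
by_cyl <- sapply(
  split(mtcars, mtcars$cyl),
  function(d) cor(d$mpg, d$wt)
)
```

`overall` is a 3 x 3 matrix and `by_cyl` has one coefficient per cylinder group. Both the grouping and the choice of column pairs had to be coded manually, which is exactly the kind of work the approaches below take care of at scale.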

With Aster and R integration there are multiple ways of computing correlations on datasets. Before sending you to the link for a detailed discussion, I summarized the approaches discussed there by capability:

Method / Solution features | Variable (columns) Permutations | Calculating for Groups | SQL-MR | In-database R
Aster R ta.cor
Aster R in-database ta.tapply
toaster computeCorrelations

Please visit my latest RPubs post for a detailed discussion and comparison of these methods.

Tuesday, May 31, 2016

Running similar but independent jobs in parallel on Aster with R

It is no surprise that Teradata Aster runs each SQL, SQL-MR, and SQL-GR command in parallel on many clusters with distributed data. But when faced with the task of running many similar but independent jobs, one has to do extra work to parallelize them in Aster. When running a SQL script, the next command has to wait for the previous one to finish. This makes sense when commands contribute to a pipeline with the results of each job passed down to the next one. But what if the jobs are independent and each produces its own results? For example, cross-validation of linear regression or other models is divided into independent jobs, each working with its respective partition (of K in total in the case of K-fold cross-validation). These jobs could run in parallel in Aster with a little help from R. This post will illustrate how to run K linear regression models in parallel in Aster as part of the K-fold cross-validation procedure.

The Problem

Cross-validation is an important technique in machine learning that receives its own chapters in the textbooks (e.g. see Chapter 7 here). In our examples we implement a K-fold cross-validation method to demonstrate how to run parallel jobs in Aster with R. The implementation of K-fold cross-validation given here is neither exhaustive nor exemplary, as it introduces a certain bias (based on the month of the year) into the models. But this approach could definitely lead to a general solution for cross-validation and other problems involving the execution of many similar but independent tasks on the Aster platform.

Furthermore, the examples will be concerned only with the step in K-fold cross-validation that creates K models on overlapping but different partitions of the training dataset. We will show how to construct K independent linear regression models in parallel on Aster, each for one of the K partitions of the table (not the same as table partitioning in Aster).

Data and R Packages

We will use the Dallas Open Data data set available from here (including Aster load scripts).
To simplify the examples we will also use the R package toaster for Aster and several other packages, all available from CRAN:

Data set, Model and K Folds

Dallas Open Data has information on building permits across the city of Dallas for the period from January 2012 through May 2014, stored in the table dallasbuildingpermits. We can quickly analyze this table from R with toaster and see its numerical columns:

which results in:
[1] "area" "value" "lon" "lat"
These 4 fields will make up our simple linear model to determine the value of construction using its area and location. And now the same in R terms:

This problem is not beyond R memory limits, but our goal is to execute linear regression in Aster. We enlist toaster's computeLm function, which returns an R lm object:

Lastly, we need to define the folds (partitions) on the table to build a linear regression model on each of them. Usually, this step performs an equal and random division into partitions. Doing this with R and Aster is actually not extremely difficult but will take us beyond the scope of the main topic. For this reason alone we propose a quick and dirty method of dividing building permits into 12 partitions (K=12) using the issue date's month value (in SQL):

Again, do not replicate this method in a real cross-validation task but use it as a template or a prototype only.
To make each fold's complement (used to train the 12 models later) we simply exclude each month's data, e.g. selecting the complement to fold 6 in its entirety (in SQL):

Computing Cross-Validation Models in Aster with R

Before we get to parallel execution with R, we show how to script Aster cross-validation of linear regression in R. To begin, we use a standard R for loop and computeLm with the argument where, which limits data to the required fold just like in the SQL example above:

This results in the list fit.folds that contains 12 linear regression models, one for each fold.
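For readers without an Aster instance at hand, the shape of that loop can be imitated locally. The sketch below swaps toaster's computeLm on Aster for plain `lm` on an in-memory data frame, so it is an analogy rather than the original code: the `permits` data frame is randomly generated, the `month` column is my stand-in for the month of the issue date, and only the names value, area, lon, lat and fit.folds come from the text.

```r
set.seed(1)

# Randomly generated local stand-in for the building permits table.
n <- 600
permits <- data.frame(
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.6),
  lat   = runif(n, 32.6, 33.0),
  month = sample(1:12, n, replace = TRUE)   # assumed month of issue date
)
permits$value <- 100 * permits$area + rnorm(n, sd = 1e4)

fit.folds <- list()
for (fold in 1:12) {
  # Train on the complement of the fold: every month except `fold`.
  # On Aster this filter would be expressed through the where argument.
  train <- permits[permits$month != fold, ]
  fit.folds[[fold]] <- lm(value ~ area + lon + lat, data = train)
}
```

Each element of `fit.folds` is an ordinary lm object, so `coef(fit.folds[[6]])` or `summary(fit.folds[[6]])` work as usual.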
Next, we replace the for loop with the specialized foreach function designed for parallel execution in R. There is no parallel execution yet, but all the necessary structure for the transition to parallel processing is in place:

foreach performs the same iterations from 1 to 12 as the for loop and combines the results into a list by default.

Parallel Computing in Aster with R

Finally, we are ready to enable parallel execution in R. For more details on using package doParallel see here, but the following suffices to enable a parallel backend in R on Windows:

After that, foreach with the operator %dopar% automatically recognizes the parallel backend cl and runs its iterations in parallel:

Comparing with foreach %do% earlier, notice the extra handling of the ODBC connection inside foreach %dopar%. This is necessary because the same database connection cannot be shared between parallel processes (or threads, depending on the backend implementation). Effectively, in each iteration we reconnect to Aster with a brand new connection, reusing the original connection's configuration in the function odbcReConnect.
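The move from %do% to %dopar% can also be tried locally. As before, this sketch replaces computeLm on Aster with `lm` on a randomly generated data frame, so no database connection is involved; the two-worker cluster size and the `permits` data are assumptions made for illustration. In the Aster version each iteration would additionally open its own ODBC connection, as described above.

```r
library(foreach)
library(doParallel)

set.seed(1)

# Randomly generated local stand-in for the building permits table.
n <- 600
permits <- data.frame(
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.6),
  lat   = runif(n, 32.6, 33.0),
  month = sample(1:12, n, replace = TRUE)
)
permits$value <- 100 * permits$area + rnorm(n, sd = 1e4)

# Sequential foreach: same iterations as a for loop, results returned as a list.
fit.seq <- foreach(fold = 1:12) %do% {
  lm(value ~ area + lon + lat, data = permits[permits$month != fold, ])
}

# Parallel foreach: register a backend, then switch %do% to %dopar%.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)
fit.par <- foreach(fold = 1:12) %dopar% {
  # With a database backend, a fresh connection would be opened here.
  lm(value ~ area + lon + lat, data = permits[permits$month != fold, ])
}
parallel::stopCluster(cl)
```

Both lists hold the same 12 models. Only the loop operator and the backend registration differ, which is what makes foreach a convenient stepping stone.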

Elapsed Time

Lastly, let's see if the whole thing was worth the effort. The chart below contains elapsed times (in seconds) for all 3 types of loops: for loop in R, foreach %do% (sequential), and foreach %dopar% (parallel):

Sunday, April 24, 2016

Map of the Windows Fonts Registered with R

If you already found the package extrafont then you probably found out how to load and use Windows fonts in R visualizations. But just in case, everything to get started with extrafont is found here and summarized below for using fonts in Windows for on-screen or bitmap output:

One thing to add is a summary of all Windows fonts registered in R. This will come in handy when designing new visualizations and deciding on which font or combination of fonts and their faces to use. The code below produces a table where rows are fonts and columns are faces, with the font name printed using both the font and the face (if available) in each table cell:
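The layout of such a table can be sketched with ggplot2 alone. To keep the example runnable on any machine it uses only the generic families sans, serif and mono; with extrafont loaded, the vector returned by `extrafont::fonts()` could presumably be substituted for `families`. The sketch also does not check whether a face is actually available for a font, which the full version would need to do.

```r
library(ggplot2)

# Generic families available on every device (an assumption made so the
# sketch runs anywhere); registered Windows fonts would go here instead.
families <- c("sans", "serif", "mono")
faces    <- c("plain", "bold", "italic", "bold.italic")

# One row per (font, face) cell of the table.
grid <- expand.grid(family = families, face = faces, stringsAsFactors = FALSE)

p <- ggplot(grid, aes(x = face, y = family, label = family,
                      family = family, fontface = face)) +
  geom_text(size = 5) +                                   # name drawn in its own font and face
  scale_x_discrete(limits = faces, position = "top") +    # faces as column headers
  labs(x = NULL, y = NULL) +
  theme_minimal()
```

Printing `p` draws a 3 x 4 grid. With a long list of fonts the plot height would need to grow accordingly when saving it to a file.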

The resulting table is this handy visual:

You can download this image or produce your own with the code above.

Saturday, April 16, 2016

Creating and Tweaking Bubble Chart with ggplot2

This article will take us step-by-step over incremental changes to produce a bubble chart using ggplot2 that looks like this:

We'll encounter the plot above once again at the very end, after explaining each step with code changes and observing intermediate plots. Without getting into details of what it means (the curious reader can find out here), the dataset behind it is defined as:

It contains 2 data points and 4 attributes: three numerical (Aster_experience, R_experience, and coverage) and one categorical (product). Remember that the data won't change a bit while the plot progression unfolds.

The starting plot is a simple scatterplot using Aster_experience and R_experience as the x and y coordinates (line 3), coverage as point size, and product as point color (line 4) (this type of scatterplot has a special name: bubble chart):

An immediate fix would be making the smaller point big enough to see with the help of the scale_size function and its range argument (line 3), which specifies the minimum and maximum size of the plotting symbol after transformation¹ (strangely enough, the sibling function scale_size_area doesn't have such an argument):
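Since the data values are not shown in the text, here is a hedged reconstruction of these first two steps with an invented two-row data frame. Only the column names come from the article; the product names, the numbers and the chosen range are placeholders.

```r
library(ggplot2)

# Two invented rows standing in for the original data set.
d <- data.frame(
  product          = c("Product A", "Product B"),
  Aster_experience = c(2, 8),
  R_experience     = c(7, 3),
  coverage         = c(1, 40)
)

# Step 1: plain bubble chart; the small point is barely visible.
bubble <- ggplot(d) +
  geom_point(aes(x = Aster_experience, y = R_experience,
                 size = coverage, color = product))

# Step 2: widen the size range so the smaller point is big enough to see.
bubble_ranged <- bubble + scale_size(range = c(5, 25))
```

The `range` values are in the same units as point `size`, so they can be tuned by eye until both bubbles look right.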

The next refinement aims at the magic quadrant concept, which fits this data well. In this case it's "R Experience" vs. "Aster Experience" and whether there is more or less of each. Achieving this effect involves fake axes using geom_hline and geom_vline (line 3), and customizing the actual axes using scale (lines 5-6) and theme functions (lines 8-12):

As is typical for bubble charts, its points get both colored and labeled, which also makes the color legend obsolete. We use geom_text to label points (line 5) and scale_color_manual to assign new colors and remove the color legend (line 11):
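The quadrant and labeling steps could look roughly like the sketch below. The data frame is again invented, and the quadrant position (5), the axis limits, the "less"/"more" tick labels and the two colors are my assumptions rather than the original settings. The `guide = "none"` spelling assumes a reasonably recent ggplot2.

```r
library(ggplot2)

# Two invented rows standing in for the original data set.
d <- data.frame(
  product          = c("Product A", "Product B"),
  Aster_experience = c(2, 8),
  R_experience     = c(7, 3),
  coverage         = c(1, 40)
)
cols <- c("Product A" = "#1B9E77", "Product B" = "#D95F02")

quadrant <- ggplot(d, aes(x = Aster_experience, y = R_experience)) +
  # Fake axes crossing in the middle create the four quadrants.
  geom_hline(yintercept = 5) +
  geom_vline(xintercept = 5) +
  geom_point(aes(size = coverage, color = product)) +
  # Label each bubble with its product, in the same color.
  geom_text(aes(label = product, color = product), vjust = -2, size = 4) +
  scale_size(range = c(5, 25)) +
  scale_x_continuous("Aster Experience", limits = c(0, 10),
                     breaks = c(2.5, 7.5), labels = c("less", "more")) +
  scale_y_continuous("R Experience", limits = c(0, 10),
                     breaks = c(2.5, 7.5), labels = c("less", "more")) +
  # Manual colors; the labels make the color legend redundant, so drop it.
  scale_color_manual(values = cols, guide = "none") +
  theme(panel.grid = element_blank(), axis.ticks = element_blank())
```

Placing the tick labels at the quadrant centers is one simple way to get the "less or more of each" reading without numeric axes.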

The next step happened to tackle the most advanced problem I met while working on the plot. The guide legend for size above looks rather awkward. Ideally, it matches the two points we have in both color and size. It turned out (and rightly so) that the function scale_size is responsible for its appearance (line 8). In particular, the number of legend positions is set with the argument breaks, and controlling the appearance of the legend, including its colors, is done with guide_legend and override.aes:
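A possible shape of that legend fix is sketched below, once more with invented data and colors. The idea, as I read the paragraph above, is to give scale_size exactly one break per data point and then recolor the legend keys through override.aes; the labels and the removed legend title are my own choices.

```r
library(ggplot2)

# Two invented rows standing in for the original data set.
d <- data.frame(
  product          = c("Product A", "Product B"),
  Aster_experience = c(2, 8),
  R_experience     = c(7, 3),
  coverage         = c(1, 40)
)
cols <- c("Product A" = "#1B9E77", "Product B" = "#D95F02")

legend_fixed <- ggplot(d, aes(x = Aster_experience, y = R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  scale_color_manual(values = cols, guide = "none") +
  scale_size(
    range  = c(5, 25),
    breaks = d$coverage,        # one legend key per data point
    labels = d$product,         # name the keys after the products
    guide  = guide_legend(
      title = NULL,
      # Color each size key like the bubble it stands for.
      override.aes = list(colour = unname(cols[d$product]))
    )
  )
```

Because the breaks coincide with the actual coverage values, the legend keys end up the same size and color as the two bubbles in the plot.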

We finish by cleaning up the plot using the package ggthemes and its theme_tufte function (line 10):

As promised, we finished exactly where we started.

1 Scale size (area or radius).