Calculating correlations is often a starting point before more advanced analytical steps take place. Big data (long data) presents computational challenges of both scale and distribution. These may be aggravated by a large number of features (wide data). But the challenges do not stop there: complex relationships call for analysis of correlations across subsets and groups. Such a mix of long and wide data becomes more common in the age of the internet of things, with sensor and machine data and other non-human sources dominating analytical use cases. Thus, when computing correlations on big data the following capabilities matter:
scale on large distributed data sets (long data)
scale on wide distributed data sets (wide data / large number of features)
flexibility on wide data sets (ability to permute features: Cartesian combinations, one-to-many, etc.)
correlations on subsets and groups.
R supports correlations out of the box with the stats function cor, but it meets few of the capabilities above. The Teradata Aster big data analytical platform offers both the scalability and the functionality to cover all of them, and thanks to the Aster R (TeradataAsterR) package this is available without leaving the R environment.
With the Aster and R integration there are multiple ways of computing correlations on data sets. Before sending you to the link for the detailed discussion, I summarize the approaches discussed there by these capabilities:
| Method / Solution features    | Variable (columns) permutations | Calculating for groups | SQL-MR | In-database R |
|-------------------------------|---------------------------------|------------------------|--------|---------------|
| Aster R ta.cor                | N                               | N                      | Y      | N             |
| Aster R in-database ta.tapply | N                               | Y                      | N      | Y             |
| toaster computeCorrelations   | Y                               | Y                      | Y      | N             |
Please visit my latest RPubs post for a detailed discussion and comparison of these methods.
No surprise that Teradata Aster runs each SQL, SQL-MR, and SQL-GR command in parallel across the cluster on distributed data. But when faced with running many similar but independent jobs, one has to do extra work to parallelize them in Aster. When running a SQL script, each command has to wait for the previous one to finish. This makes sense when commands form a pipeline, with the results of each job passed down to the next. But what if the jobs are independent and each produces its own results? For example, cross-validation of linear regression or other models divides into independent jobs, each working on its respective partition (of K total, in the case of K-fold cross-validation). These jobs could run in parallel in Aster with a little help from R. This post illustrates how to run K linear regression models in parallel in Aster as part of the K-fold cross-validation procedure.
The Problem
Cross-validation is an important technique in machine learning that receives its own chapters in textbooks (e.g. see Chapter 7 here). In our examples we implement a K-fold cross-validation method to demonstrate how to run parallel jobs in Aster with R. The implementation of K-fold cross-validation given here is neither exhaustive nor exemplary, as it introduces a certain bias (based on the month of the year) into the models. But this approach could definitely lead to a general solution for cross-validation and other problems involving execution of many similar but independent tasks on the Aster platform.
Furthermore, the examples are concerned only with the step of K-fold cross-validation that creates K models on overlapping but different partitions of the training data set. We will show how to construct K independent linear regression models in parallel on Aster, each for one of the K partitions of the table (not to be confused with table partitioning in Aster).
Data and R Packages
We will use the Dallas Open Data data set available from here (including Aster load scripts). To simplify the examples we will also use the R package toaster for Aster and several other packages, all available from CRAN:
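A minimal sketch of the setup (the package list follows the rest of this post):

```r
# all available from CRAN
library(toaster)     # Aster analytics and utilities from R
library(RODBC)       # ODBC connectivity to Aster
library(foreach)     # loop construct with pluggable parallel backends
library(doParallel)  # parallel backend for foreach
```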
Dallas Open Data has information on building permits across the city of Dallas for the period from January 2012 through May 2014, stored in the table dallasbuildingpermits. We can quickly analyze this table from R with toaster and see its numerical columns:
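A sketch of this step with toaster (the ODBC DSN name is illustrative):

```r
conn = odbcConnect("AsterDSN")  # illustrative DSN for the Aster cluster
permitsInfo = getTableSummary(conn, 'dallasbuildingpermits')
getNumericColumns(permitsInfo)
```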
These 4 fields will make up our simple linear model to determine the
value of construction using its area and location. And now the same in R terms:
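In R the model is just a formula over those fields (the column names here are illustrative):

```r
# value of construction explained by its area and location
value ~ area + lat + lon
```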
This problem is not beyond R's memory limits, but our goal is to execute linear regression in Aster. We enlist toaster's computeLm function, which returns an R lm object:
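A call along these lines, where conn is an open RODBC connection to Aster and the formula is illustrative:

```r
fit = computeLm(conn, 'dallasbuildingpermits',
                value ~ area + lat + lon)
summary(fit)  # behaves like a regular lm fit
```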
Lastly, we need to define the folds (partitions) on the table in order to build a linear regression model on each of them. Usually this step performs an equal and random division into partitions. Doing this with R and Aster is not especially difficult, but it would take us beyond the scope of the main topic. For this reason alone we propose a quick-and-dirty method of dividing building permits into 12 partitions (K=12) using the issue date's month value (in SQL):
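The idea in SQL looks roughly like this (the issue date column name is an assumption):

```sql
SELECT p.*,
       EXTRACT(MONTH FROM issue_date) AS fold
  FROM dallasbuildingpermits p;
```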
Again, do not replicate this method in a real cross-validation task; use it as a template or a prototype only. To make each fold's complement (used to train the 12 models later) we simply exclude that month's data, e.g. selecting the complement to fold 6 in its entirety (in SQL):
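Roughly (again assuming the issue date column name):

```sql
SELECT *
  FROM dallasbuildingpermits
 WHERE EXTRACT(MONTH FROM issue_date) <> 6;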
Before we get to parallel execution with R, we show how to script Aster cross-validation of linear regression in R. To begin we use a standard R for loop and computeLm with the where argument that limits data to the required fold, just like in the SQL example above:
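A sketch of the loop; conn, the formula, and the where clause are illustrative:

```r
fit.folds = list()
for(i in 1:12) {
  # fit on the complement of fold i
  fit.folds[[i]] = computeLm(conn, 'dallasbuildingpermits',
                             value ~ area + lat + lon,
                             where = paste0("EXTRACT(MONTH FROM issue_date) <> ", i))
}
```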
This results in the list fit.folds containing 12 linear regression models, one for each fold. Next, we replace the for loop with the specialized foreach function designed for parallel execution in R. There is no parallel execution yet, but all the structure necessary for the transition to parallel processing is in place:
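The same computation expressed with foreach and %do% (still sequential; conn and the formula are illustrative):

```r
library(foreach)

fit.folds = foreach(i = 1:12) %do% {
  computeLm(conn, 'dallasbuildingpermits',
            value ~ area + lat + lon,
            where = paste0("EXTRACT(MONTH FROM issue_date) <> ", i))
}
```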
foreach performs the same iterations from 1 to 12 as the for loop and combines the results into a list by default.
Parallel Computing in Aster with R
Finally, we are ready to enable parallel execution in R. For more details on using the package doParallel see here, but the following suffices to enable a parallel backend in R on Windows:
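Registering a parallel backend (the number of workers is a tuning choice):

```r
library(doParallel)

cl = makeCluster(4)     # e.g. 4 local worker processes
registerDoParallel(cl)
```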
After that, foreach with the operator %dopar% automatically recognizes the parallel backend cl and runs its iterations in parallel:
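A sketch of the parallel version; note the per-worker reconnection with odbcReConnect (conn and the formula remain illustrative):

```r
fit.folds = foreach(i = 1:12, .packages = c('toaster', 'RODBC')) %dopar% {
  # a database connection cannot be shared between workers,
  # so each iteration opens its own from the saved configuration
  parallelConn = odbcReConnect(conn)
  fit = computeLm(parallelConn, 'dallasbuildingpermits',
                  value ~ area + lat + lon,
                  where = paste0("EXTRACT(MONTH FROM issue_date) <> ", i))
  odbcClose(parallelConn)
  fit
}
stopCluster(cl)
```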
Compared with foreach %do% earlier, notice the extra handling for the ODBC connection inside foreach %dopar%. This is necessary because the same database connection cannot be shared between parallel processes (or threads, depending on the backend implementation). Effectively, with each loop we reconnect to Aster with a brand new connection by reusing the original connection's configuration in the function odbcReConnect.
Elapsed Time
Lastly, let's see if the whole thing was worth the effort. The chart below contains elapsed times (in seconds) for all 3 types of loops: the for loop in R, foreach %do% (sequential), and foreach %dopar% (parallel):
If you have already found the package extrafont then you probably know how to load and use Windows fonts in R visualizations. But just in case, everything needed to get started with extrafont is found here and summarized below for using Windows fonts with on-screen or bitmap output:
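Getting started boils down to three steps:

```r
install.packages("extrafont")
library(extrafont)

font_import()              # one-time: find and import installed fonts
loadfonts(device = "win")  # register fonts for Windows screen/bitmap devices
```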
One thing to add is a summary of all Windows fonts registered in R. This will come in handy when designing new visualizations and deciding which font, or combination of fonts and their faces, to use. The code below produces a table where rows are fonts and columns are faces, with the font name printed using both the font and the face (if available) in each table cell:
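A sketch of such a summary using extrafont's font table (the layout details are a matter of taste):

```r
library(extrafont)
library(ggplot2)

ft = fonttable()
ft$face = with(ft, ifelse(Bold & Italic, "bold.italic",
                   ifelse(Bold, "bold",
                   ifelse(Italic, "italic", "plain"))))
# rows are font families, columns are faces,
# each cell rendered in its own font and face
ggplot(ft, aes(x = face, y = FamilyName)) +
  geom_text(aes(label = FamilyName, family = FamilyName, fontface = face)) +
  theme_minimal()
```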
This article will take us step by step through incremental changes to produce a bubble chart using ggplot2 that looks like this:
Data and Setup
We'll encounter the plot above once again at the very end, after explaining each step with code changes and observing intermediate plots. Without getting into the details of what it means (the curious reader can find out here), the data set behind it is defined as:
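The data set could be defined as follows (the values are illustrative; only the structure matters for the plot progression):

```r
data = data.frame(Aster_experience = c(8, 2),
                  R_experience     = c(2, 8),
                  coverage         = c(90, 10),
                  product          = c("Aster R", "toaster"))
```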
It contains 2 data points and 4 attributes: three numerical, Aster_experience, R_experience, and coverage, and one categorical, product. Remember that the data won't change a bit while the plot progression unfolds.
As-Is Scatterplot
The starting plot is a simple scatterplot using Aster_experience and R_experience as the x and y coordinates (line 3), coverage as point size, and product as point color (line 4). This type of scatterplot has a special name: a bubble chart:
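As a sketch, assuming a data frame data with the four attributes just described:

```r
library(ggplot2)

ggplot(data,
       aes(Aster_experience, R_experience)) +
  geom_point(aes(size = coverage, color = product))
```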
An immediate fix is making the smaller point big enough to see with the help of the scale_size function and its range argument (line 3), which specifies the minimum and maximum size of the plotting symbol after transformation (strangely enough, the sibling function scale_size_area doesn't have such an argument):
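Adding scale_size with an explicit range (the range values are illustrative):

```r
ggplot(data,
       aes(Aster_experience, R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  scale_size(range = c(5, 25))
```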
Magic Quadrant: adding lines and customizing axes
The next refinement aims at the magic quadrant concept, which fits this data well. In this case it's "R Experience" vs. "Aster Experience" and whether there is more or less of each. Achieving this effect involves fake axes using geom_hline and geom_vline (line 3), and customizing the actual axes using scale (lines 5-6) and theme functions (lines 8-12):
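A sketch of the quadrant treatment (axis limits and line positions are illustrative):

```r
ggplot(data, aes(Aster_experience, R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  geom_hline(yintercept = 5) + geom_vline(xintercept = 5) +
  scale_size(range = c(5, 25)) +
  scale_x_continuous("Aster Experience", limits = c(0, 10)) +
  scale_y_continuous("R Experience", limits = c(0, 10)) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())
```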
As is typical for bubble charts, the points get both colored and labeled, which also makes the color bar legend obsolete. We use geom_text to label the points (line 5) and scale_color_manual to assign new colors and remove the color bar legend (line 11):
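Labeling the points and switching to manual colors (the palette and label placement are illustrative):

```r
ggplot(data, aes(Aster_experience, R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  geom_hline(yintercept = 5) + geom_vline(xintercept = 5) +
  geom_text(aes(label = product), vjust = 3) +
  scale_size(range = c(5, 25)) +
  scale_x_continuous("Aster Experience", limits = c(0, 10)) +
  scale_y_continuous("R Experience", limits = c(0, 10)) +
  scale_color_manual(values = c("darkblue", "darkred"), guide = "none") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())
```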
The next step tackles the most advanced problem encountered while working on the plot. The guide legend for size above looks rather awkward; ideally, it should match the two points we have in both color and size. It turns out (and rightly so) that the function scale_size is responsible for its appearance (line 8). In particular, the number of legend positions is overridden with the argument breaks, and the appearance of the legend, including its colors, is controlled with guide_legend and override.aes:
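A sketch of the fixed size legend with explicit breaks and override.aes (colors again illustrative):

```r
ggplot(data, aes(Aster_experience, R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  geom_hline(yintercept = 5) + geom_vline(xintercept = 5) +
  geom_text(aes(label = product), vjust = 3) +
  scale_x_continuous("Aster Experience", limits = c(0, 10)) +
  scale_y_continuous("R Experience", limits = c(0, 10)) +
  scale_color_manual(values = c("darkblue", "darkred"), guide = "none") +
  scale_size(range = c(5, 25), breaks = data$coverage,
             guide = guide_legend(override.aes =
                                    list(colour = c("darkblue", "darkred")))) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())
```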
We finish cleaning up the plot using the package ggthemes and its theme_tufte function (line 10):
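The final version, which only swaps in the Tufte theme (the rest of the call is the illustrative sketch built up so far):

```r
library(ggthemes)

ggplot(data, aes(Aster_experience, R_experience)) +
  geom_point(aes(size = coverage, color = product)) +
  geom_hline(yintercept = 5) + geom_vline(xintercept = 5) +
  geom_text(aes(label = product), vjust = 3) +
  scale_x_continuous("Aster Experience", limits = c(0, 10)) +
  scale_y_continuous("R Experience", limits = c(0, 10)) +
  scale_color_manual(values = c("darkblue", "darkred"), guide = "none") +
  scale_size(range = c(5, 25), breaks = data$coverage,
             guide = guide_legend(override.aes =
                                    list(colour = c("darkblue", "darkred")))) +
  theme_tufte() +
  theme(axis.text = element_blank(), axis.ticks = element_blank())
```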
While working on new graph functions for my package toaster I had to pick from the R packages that represent graphs. The choice was between the network and graph objects from the packages network and igraph correspondingly, the two most prominent packages for creating and manipulating graphs and networks in R.
Interchangeability of network and graph objects
One can always use them interchangeably with little effort using the package intergraph. Its sole purpose is to provide "coercion routines for network data objects". Simply use its asNetwork and asIgraph functions to convert from one network representation to the other:
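For example:

```r
library(igraph)
library(intergraph)

g = make_ring(5)   # an igraph graph object
n = asNetwork(g)   # the same graph as a network object
g2 = asIgraph(n)   # and back to igraph
```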
For more on using intergraph functions see the tutorial.
Package dependencies with miniCRAN
To assess the relative importance of the packages network and igraph we will use the package miniCRAN. Its access to CRAN packages' metadata, including dependencies via "Depends", "Imports", and "Suggests", provides the necessary information about package relationships. The built-in makeDepGraph function recursively retrieves these dependencies and builds the corresponding graph:
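For example (the plot options are illustrative):

```r
library(miniCRAN)

dg = makeDepGraph(c("igraph", "network"), suggests = TRUE)
plot(dg, legendPosition = c(-1, 1), vertex.size = 15)
```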
Unfortunately, these dependency graphs show how network and igraph depend on other CRAN packages, while the goal is to evaluate the relationships the other way around: how much other CRAN packages depend on the two. This requires some assembly, as we construct a network of packages manually, with edges being directed relationships (one of "Depends", "Imports", or "Suggests") as defined in the DESCRIPTION files of all packages. The following code builds this igraph object (we chose igraph for its functions utilized later):
# cranInfo: CRAN package metadata, e.g. from available.packages()
cranInfoDF = as.data.frame(cranInfo, stringsAsFactors = FALSE)

edges = ddply(cranInfoDF, .(Package), function(x) {
  # split all implied (depends, imports, and suggests) packages
  # and then concat into a single array
  l = unlist(sapply(x[c('Depends','Imports','Suggests')], strsplit,
                    split = "(,|, |,\n|\n,| ,| , )"))
  # remove version info and empty fields that became NA
  l = gsub("^([^ \n(]+).*$", "\\1", l[!is.na(l)])
  # take care of empty arrays
  if (is.null(l) || length(l) == 0) NULL
  else data.frame(Package = x['Package'], Implies = l, stringsAsFactors = FALSE)
})

edges.mat = as.matrix(edges, ncol = 2, dimnames = c('from','to'))
pkg.graph = graph_from_edgelist(edges.mat, directed = TRUE)
The resulting network pkg.graph contains all CRAN packages and their relationships. Let's extract and compare the neighborhoods for the two packages we are interested in:
# build subgraphs for each package
subgraphs = make_ego_graph(pkg.graph, order = 1,
                           nodes = c("igraph", "network"), mode = "in")
g.igraph = subgraphs[[1]]
g.network = subgraphs[[2]]

# plotting subgraphs
V(g.igraph)$color = ifelse(V(g.igraph)$name == "igraph", "orange", "lightblue")
plot(g.igraph, main = "Packages pointing to igraph")

V(g.network)$color = ifelse(V(g.network)$name == "network", "orange", "lightblue")
plot(g.network, main = "Packages pointing to network")
The igraph neighborhood is a much more densely populated subgraph than the network neighborhood, and hence its importance and acceptance must be higher.
Package Centrality Scores
Package igraph can produce various centrality measures on the nodes of a graph. In particular, PageRank centrality and eigenvector centrality scores are principal indicators of the importance of a node in a given graph. We finish this exercise by validating, with centrality scores, our initial conclusion that the igraph package is more accepted and utilized across the CRAN ecosystem than the network package:
# PageRank
pkg.pagerank = page.rank(pkg.graph, directed = TRUE)

# Eigenvector Centrality
pkg.ev = evcent(pkg.graph, directed = TRUE)

toplot = rbind(data.frame(centrality = "pagerank", type = c('igraph','network'),
                          value = pkg.pagerank$vector[c('igraph','network')]),
               data.frame(centrality = "eigenvector", type = c('igraph','network'),
                          value = pkg.ev$vector[c('igraph','network')]))

library(ggplot2)
library(ggthemes)

ggplot(toplot) +
  geom_bar(aes(type, value, fill = type), stat = "identity") +
  facet_wrap(~centrality, ncol = 2)
Both packages igraph and network are widely used across the CRAN ecosystem. Due to its versatility and rich set of functions, igraph leads in acceptance and importance. But as far as graph objects are concerned, it remains a matter of the requirements at hand whether to prefer one's or the other's objects in R.