The Problem
Cross-validation is an important technique in machine learning that receives its own chapters in textbooks (e.g. see Chapter 7 here). In our examples we implement a K-fold cross-validation method to demonstrate how to run parallel jobs in Aster with R. The implementation given here is neither exhaustive nor exemplary, as it introduces a certain bias (based on month of the year) into the models. But this approach could lead to a general solution for cross-validation and for other problems that involve executing many similar but independent tasks on the Aster platform.
Furthermore, the examples are concerned only with the step in K-fold cross-validation that creates K models on overlapping but different partitions of the training dataset. We will show how to construct K independent linear regression models in parallel on Aster, one for each of the K partitions of the table (not the same as table partitioning in Aster).
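To make "overlapping but different partitions" concrete, here is a minimal sketch in plain R on a toy index of 12 rows and K = 4 folds. The names fold_id, held_out and train_set are illustrative only and are not part of the Aster examples.

```r
# Hypothetical toy setup: 12 rows split into K = 4 folds.
K <- 4
n <- 12
set.seed(1)
fold_id <- sample(rep(seq_len(K), length.out = n))  # random, equal-sized folds

# Rows held out by each fold, and the complement used to train its model.
held_out  <- lapply(seq_len(K), function(k) which(fold_id == k))
train_set <- lapply(seq_len(K), function(k) which(fold_id != k))

# Each training set has n - n/K rows; any two of them share n - 2*n/K rows.
sapply(train_set, length)
length(intersect(train_set[[1]], train_set[[2]]))
```

Each model is trained on one element of train_set, so any two models share most, but not all, of their training rows.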
Data and R Packages
We will use the Dallas Open Data data set available from here (including Aster load scripts). To simplify the examples we will also use the R package toaster for Aster and several other packages, all available from CRAN:
Data set, Model and K Folds
Dallas Open Data has information on building permits across the city of Dallas for the period from January 2012 through May 2014, stored in the table dallasbuildingpermits. We can quickly analyze this table from R with toaster and see its numerical columns:
which results in:
[1] "area" "value" "lon" "lat"
These 4 fields will make up our simple linear model to determine the value of construction using its area and location. And now the same in R terms:
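In R, a model like this is written as a formula with the response on the left and the predictors on the right. A minimal sketch, using the four column names listed above; the variable name model_formula is illustrative:

```r
# Response on the left, predictors on the right, named after the table columns.
model_formula <- value ~ area + lon + lat

# Variables referenced by the formula, in order of appearance.
all.vars(model_formula)
```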
This problem is not beyond R memory limits, but our goal is to execute linear regression in Aster. We enlist toaster's computeLm function, which returns an R lm object:
Lastly, we need to define the folds (partitions) on the table so that we can build a linear regression model on each of them. Usually, this step divides the data randomly into partitions of equal size. Doing this with R and Aster is not especially difficult, but it would take us beyond the scope of the main topic. For this reason alone we propose a quick and dirty method of dividing building permits into 12 partitions (K=12) using the month of the issue date (in SQL):
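A minimal sketch of such a fold definition, built as a SQL string in R. It assumes that the issue date lives in a date column named issued (a hypothetical name) and that the database accepts the standard SQL EXTRACT function; adjust both to the actual schema.

```r
# Assumption: the issue date is stored in a date column named `issued`.
table_name  <- "dallasbuildingpermits"
date_column <- "issued"   # hypothetical column name

# SQL expression that maps each permit to a fold number from 1 to 12.
fold_expr <- sprintf("EXTRACT(MONTH FROM %s)", date_column)

# Query that counts permits per fold; unequal counts show how uneven this split is.
fold_sql <- sprintf(
  "SELECT %s AS fold, COUNT(*) AS permits FROM %s GROUP BY 1 ORDER BY 1",
  fold_expr, table_name
)
cat(fold_sql, "\n")
```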
Again, do not replicate this method in a real cross-validation task; use it as a template or a prototype only.
To make each fold's complement (used to train the 12 models later) we simply exclude that month's data, e.g. selecting the complement of fold 6 in its entirety (in SQL):
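The complement conditions for all 12 folds can be generated in one step. This is a sketch under the same assumptions as before: the date column name issued is hypothetical, and EXTRACT is assumed to be available in the database.

```r
date_column <- "issued"   # hypothetical column name
K <- 12

# One WHERE condition per fold: keep every month except the fold's own.
where_complement <- sprintf("EXTRACT(MONTH FROM %s) <> %d", date_column, seq_len(K))

# Condition selecting the complement of fold 6.
where_complement[6]
```

Each element of where_complement is a plain string, so it can be appended to a SELECT statement or passed to any function that accepts a SQL condition.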
Computing Cross-Validation Models in Aster with R
Before we get to parallel execution with R, we show how to script cross-validation of linear regression on Aster in R. To begin, we use a standard R for loop and computeLm with the argument where, which limits the data to the required fold's complement just like in the SQL example above:
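The shape of that loop can be tried without an Aster connection. The sketch below is a local stand-in, not the Aster code: it simulates a small permits data frame (all values are made up) and uses base R lm where the Aster version would call computeLm with a where condition.

```r
# Simulated stand-in for the permits table (assumption: no Aster connection here).
set.seed(42)
n <- 600
permits <- data.frame(
  month = sample(1:12, n, replace = TRUE),
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.6),
  lat   = runif(n, 32.6, 33.0)
)
permits$value <- 100 * permits$area + rnorm(n, sd = 10000)

model_formula <- value ~ area + lon + lat

fit.folds <- list()
for (fold in 1:12) {
  # Train on the complement of the fold: every month except `fold`.
  # On Aster the same condition would be passed as computeLm's `where` string.
  fit.folds[[fold]] <- lm(model_formula, data = permits[permits$month != fold, ])
}
length(fit.folds)
```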
This results in the list fit.folds that contains 12 linear regression models, one for each fold.
Next, we replace the for loop with the specialized foreach function designed for parallel execution in R. There is no parallel execution yet, but all the structure needed for the transition to parallel processing is in place:
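A sequential foreach version can be sketched on simulated local data, with base R lm standing in for computeLm. The data frame and its values are assumptions made for illustration only.

```r
library(foreach)

# Simulated stand-in for the permits table (assumption: no Aster connection here).
set.seed(42)
n <- 600
permits <- data.frame(
  month = sample(1:12, n, replace = TRUE),
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.6),
  lat   = runif(n, 32.6, 33.0)
)
permits$value <- 100 * permits$area + rnorm(n, sd = 10000)
model_formula <- value ~ area + lon + lat

# %do% evaluates the body sequentially; the value of each iteration
# becomes one element of the returned list.
fit.folds <- foreach(fold = 1:12) %do% {
  lm(model_formula, data = permits[permits$month != fold, ])
}
length(fit.folds)
```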
foreach performs the same iterations from 1 to 12 as the for loop and combines the results into a list by default.
Parallel Computing in Aster with R
Finally, we are ready to enable parallel execution in R. For more details on using the package doParallel see here, but the following suffices to enable a parallel backend in R on Windows:
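A minimal sketch of registering such a backend with doParallel. The worker count of 2 is an arbitrary choice for illustration, and the cluster is stopped at the end only to keep the sketch self-contained; in practice it stays open until all parallel work is done.

```r
library(doParallel)   # also attaches foreach and parallel

# Start two worker processes and register them as the foreach backend.
cl <- makeCluster(2)
registerDoParallel(cl)

# Number of workers that %dopar% will use.
workers <- getDoParWorkers()
workers

# Release the workers once all parallel work is finished.
stopCluster(cl)
```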
After that, foreach with the operator %dopar% automatically recognizes the parallel backend cl and runs its iterations in parallel:
Compared with the earlier foreach %do%, notice the extra handling of the ODBC connection inside foreach %dopar%. This is necessary because the same database connection cannot be shared between parallel processes (or threads, depending on the backend implementation). Effectively, each iteration reconnects to Aster with a brand new connection, reusing the original connection's configuration through the function odbcReConnect.
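The parallel loop itself can be sketched on simulated local data, again with base R lm as a stand-in for computeLm. Because no database is involved here, the reconnect step appears only as a comment; the data frame, its values and the worker count are assumptions.

```r
library(doParallel)   # also attaches foreach and parallel

cl <- makeCluster(2)  # two workers, an arbitrary choice
registerDoParallel(cl)

# Simulated stand-in for the permits table (assumption: no Aster connection here).
set.seed(42)
n <- 600
permits <- data.frame(
  month = sample(1:12, n, replace = TRUE),
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.6),
  lat   = runif(n, 32.6, 33.0)
)
permits$value <- 100 * permits$area + rnorm(n, sd = 10000)
model_formula <- value ~ area + lon + lat

# %dopar% sends iterations to the registered workers; foreach exports the
# variables referenced in the body (permits, model_formula) to them.
fit.folds <- foreach(fold = 1:12) %dopar% {
  # With a database backend, each iteration would first open its own
  # connection here (the text does this with odbcReConnect).
  lm(model_formula, data = permits[permits$month != fold, ])
}

stopCluster(cl)
length(fit.folds)
```

Only the operator changes between the sequential and parallel versions, which is why the earlier switch from for to foreach was worth making.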
Elapsed Time
Lastly, let's see if the whole thing was worth the effort. The chart below contains elapsed times (in seconds) for all 3 types of loops: the for loop in R, foreach %do% (sequential), and foreach %dopar% (parallel):