Because best way to understand is explaining.

Tuesday, September 22, 2015

VW Big Data Play

Volkswagen made headlines lately for cheating U.S. EPA regulators. But let's pay some respect to their engineers.

Apparently, there is no button or switch that tells car it's being tested - indeed - that would be obvious flaw in the emission test protocol. So VW engineers designed and deployed sophisticated algorithm that detects car is undergoing emission testing and turns emission control on just in time to pass it with flying colors. Then, after the test is over, it recognizes normal driving conditions and switches car software back to run diesel engine in its normal mode (that creates smog at up to 40 times the legal limit).

Having such feature running flawlessly in real time conditions on hundred thousand cars all over the world deserves special recognition. In fact, it was pure accident that this "cheating device" was found (here is Bloomberg's story how). At least, let's congratulate VW data scientists and software engineers - but not their execs - with quite an accomplishment.

Friday, September 13, 2013

How to expand color palette with ggplot and RColorBrewer

Histograms are almost always a part of data analysis presentation. If it is made with R ggplot package then it may look like this:

ggplot(mtcars) +
  geom_histogram(aes(factor(cyl), fill=factor(cyl)))

The elegance of ggplot functions is in simple yet compact expression of visualization formula while hiding many options assumed by default. Hiding doesn't mean lacking as most options are just a step away. For example, color selection can change with one of scale functions such as scale_fill_brewer:

ggplot(mtcars) +
  geom_histogram(aes(factor(cyl), fill=factor(cyl))) +

In turn, scale_fill_brewer palette can be changed too:

ggplot(mtcars) +
  geom_histogram(aes(factor(cyl), fill=factor(cyl))) +

Palettes used live in the package RColorBrewer - to see all available choices simply attach the package with library(RColorBrewer) and run display.brewer.all() to show this:
There are 3 types of palettes, sequential, diverging, and qualitative; each containing from 8 to 12 colors (see data frame and help ?RColorBrewer for details).

Curious reader may notice that if a histogram contains 13 or more bars (bins in case of continuous data) we may get in trouble with colors:

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) +

Indeed length(unique(mtcars$hp)) finds 22 unique values for horse power in mtcars, while specified palette Set2 has 8 colors to choose from. Lack of colors in the palette triggers ggplot warnings like this (and invalidates plot as seen above):
1: In brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
RColorBrewer gives us a way to produce larger palettes by interpolating existing ones with constructor function colorRampPalette. It generates functions that do actual job: they build palettes with arbitrary number of colors by interpolating existing palette. To interpolate palette Set1 to 22 colors (number of colors is stored in colourCount variable for examples to follow):

colourCount = length(unique(mtcars$hp))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
ggplot(mtcars) + 
  geom_histogram(aes(factor(hp)), fill=getPalette(colourCount)) + 

While we addressed color deficit other interesting things happened: even though all bars are back and are distinctly colored we lost color legend. I intentionally added theme(legend.position=...) to showcase this fact: despite explicit position request the legend is no more part of the plot.

The difference: fill parameter was moved outside of histogram aes function - this effectively removed fill information from aesthetics data set for ggplot. Hence, there is nothing to apply legend to.

To fix it place fill back into aes and use scale_fill_manual to define custom palette:

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) + 
  scale_fill_manual(values = getPalette(colourCount))

Another likely problem with large number of bars in histogram plots is placing of the legend. Adjust legend position and layout using theme and guides functions as follows :

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) + 
  scale_fill_manual(values = getPalette(colourCount)) +
  theme(legend.position="bottom") +

Finally, the same example using  in place palette constructor with different choice of library palette:

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) + 
  scale_fill_manual(values = colorRampPalette(brewer.pal(12, "Accent"))(colourCount)) +
  theme(legend.position="bottom") +
Created by Pretty R at

There are many more scale functions to choose from depending on aesthetics type (colour, fill), color types (gradient, hue, etc.), data values (discrete or continuous).

Wednesday, July 31, 2013

Quick R tip: ggplot in functions needs some extra care

When building visualizations with ggplot2 in R I decided to create specialized functions that encapsulate plotting logic for some of my creations. In this case instead of commonly used aes function I had to use its alternative -  aes_string - for aesthetic mapping from a string.

And now goes this handy tip:
while original aesthetic mapping function aes accepts x and y parameters by position:

p = ggplot(data, aes(x, y)) + ...

aes_string even though silently accepts them won't work like this:

my_plot_fun = function(data, xname, yname) {
  p = ggplot(data, aes_string(xname, yname)) + ...

It will run to compile plot object without problems but when plot p (returned from the function my_plot_fun) executed this rather cryptic error appears:

Error in as.environment(where) : 'where' is missing

What it means is that ggplot never got aesthetics defined right. This is due to aes_string function lacking the same position parameters as in its aes counterpart above. Instead, define both x and y parameters (and others if necessary) by name:

p = ggplot(data, aes_string(x=xname, y=yname)) + ...

Created by Pretty R at


One more vote for using aes_string in place of aes comes from CRAN submission policy, i.e.:
In principle, packages must pass R CMD check without warnings or significant notes to be admitted to the main CRAN package area. If there are warnings or notes you cannot eliminate (for example because you believe them to be spurious) send an explanatory note as part of your covering email, or as a comment on the submission form. 
 (source: CRAN Repository Policy)
 What happens is that  R CMD check  reports notes like this for every aes call:

no visible binding for global variable [variable name]
It turns out that the most sensible solution is using aes_string instead.

Thursday, July 26, 2012

My New Calculator(s)

We all need a calculator from time to time. I used to reach to Start button, type Calc in the Run (or Search) box to get to Calculator app (Windows). Until recently that is. Now I simply start Octave and do my calculations there. Sometimes, I already have Python prompt and then I do my calculations there.

For example compute a variance for the sample of 10 coin flips: 4 Tails (0) and 6 Heads (1) (estimated mean p=0.6):
 octave-3.2.4.exe>(4*(0-0.6)**2 + 6*(1-0.6)**2)/(10-1)
 ans =  0.15360
This calculator really works for me. Sometimes I have a Python window and it works just as well:
>>> (4*(0-0.6)**2 + 6*(1-0.6)**2)/(10-1)
Having said that Octave beats Python in easiness when calculating with vectors (or time series or sequences or anything that can be represented as a vector). Let's suppose we test if certain coin is not loaded (is fair) by flipping it 14 times. We would like to be 95% certain that coin is fair (i.e. p=0.5 which is equivalent to two-tailed test). Suppose that 14 flips resulted in only 3 heads. First, we build a critical interval - the number of tails that would result in rejecting coin fairness given number of heads:
critical_interval = binopdf(0:14,14,0.5) > 0.025 | binocdf(0:14,14,0.5)- binopdf(0:14,14,0.5) > 0.975
critical_interval is a Boolean vector where i-th element corresponds to (i-1) number of tails: if it's true then with 95% certainty it is a fair coin. This expression is a logical OR of 2 expressions: first for left tail and second for right tail. Octave seamlessly handles any vectors just as if it were a number: I can change this to 1000 flips with minimum keystrokes.

Thus, given number of tails 3 we get
octave-3.2.4.exe> critical_interval(3+1)
ans = 0
We use (3+1) as 1st element corresponds to 0 tails. Hence, we accept the fact that our coin is fair with 95% certainty because 3 heads do not belong to the critical interval. Similarly we have to reject this coin as loaded if number of tails happens to be 12:

octave-3.2.4.exe> critical_interval(12+1)
ans = 1

I leave Python example as an exercise to a reader but I am certain that result won't be close to Octave  in neither transparency nor conciseness. The Octave solution does leave me with the right to keep claiming this use similar to a calculator - not a programming.

Saturday, June 23, 2012

Intro to Octave for Coursera Students

Octave as a programming language has a lot to offer. To give you a taste this post attempts to showcase some of the cooler features of the language. But it also serves a purpose of introduction to Octave (or Matlab) for those who are taking or considering taking Coursera Machine Learning class by Professor Andrew Ng (great great idea). Not incidentally most of the examples were inspired by the homework assignments for the course.

Disclaimer: this post contains no concrete references, examples, excerpts or solutions for any of the Coursera courses, exercises, or homework assignments.

Matrix Basics

Suppose we have a 3 by 5 matrix A like this:
octave:> A = [1 2 3 4 5; 2 3 4 5 6; 3 4 5 6 7]

A =

   1 2 3 4 5
   2 3 4 5 6
   3 4 5 6 7

Then to extract a single element from A is just like in most other languages:
octave:> A(2,3)

ans = 4 
Just remember that following mathematical conventions Octave indexes start with 1.

Almost everything in Octave is array (vector or matrix or similar) and index is no exception. Let's take a range for example. Range is a row vector with evenly spaced elements, e.g.:

octave:> 1:5
ans =

   1   2   3   4   5

octave:> A(1:3,1:5)
ans =

   1   2   3   4   5
   2   3   4   5   6
   3   4   5   6   7
Operation A(1:3, 1:5) returns whole A again because it selects all of its rows and columns. Ranges can exist by themselves as vectors but there is special type which is available only in the context of matrix index:

octave:> A(2:end,3:end)
ans =

   4   5   6
   5   6   7
Ranges 2:end and 3:end are defined only within concrete matrix context as keyword end indicates last row or column position within matrix. You can even select elements for last 2 rows and columns like this:
octave:> A(end-1:end,end-1:end)
ans =

   5   6
   6   7

Logical Operations on Matrices

Logical arrays in Octave contain all logical elements and are usually results of relational operators with vectors and matrices like this:

octave:> A != 3
ans =

   1   1   0   1   1
   1   0   1   1   1
   0   1   1   1   1
One cool application of this is inverting identity matrix:
octave:> I = eye(3)
I =

Diagonal Matrix

   1   0   0
   0   1   0
   0   0   1

octave:> I == 0
ans =

   0   1   1
   1   0   1
   1   1   0

Euclidean Distance

Given two vectors (same size) find Euclidean distance between them.

a = [0 0 0];
b = [1 2 2];
distance = sqrt(sumsq(a-b));
If you find yourself writing in Octave more complex solutions for similar problems  with vectors  then stop and review your vectorization approach.

Vectorizing indexes

Suppose you have a collection of m vectors in n-dimensional space stored as m x n matrix X.  Suppose that the value of last (n-th) coordinate of these vectors is always 0 or 1. Then we want to produce 2 subsets of X - one subset of vectors with last coordinate equal to 0 and the other subset where vectors have last coordinate equal to 1:
n = size(X, 2);
X0 = X( find( X(:, n) == 0 ), :);
X1 = X( find( X(:, n) == 1 ), :);


Random numbers appear in many problems. Basic approach is a matrix filled with random numbers from uniform distribution on the interval (0, 1):
octave:> rand(3,4)
ans =

   0.347937   0.317482   0.630678   0.245148
   0.917634   0.649125   0.634592   0.837635
   0.994745   0.092818   0.154936   0.966380
But Octave offers a few shortcuts. For one, such common distributions as normal, exponential, Poisson, and gamma each receive their own function randn, rande, randp, and randg.

Function randperm produces a row vector of randomly permuted integers from 1 to n:
octave:> randperm(5)
ans =

   3   1   4   5   2
If you are looking for an arbitrary vector with values between 0 and N of size n (n<=N) then randperm gets id done (in this case N = 100 and n = 10):
octave:> x = randperm(100)(1:10)
x =

   88    1   89   25   76   19   78   38   99   34

This combined with vector indexing accomplishes rather elaborate task in short one-liner: suppose we have m n-dimensional vectors stored as m x n matrix X and we need to pick k vectors (k < m) randomly. The following gets this done:

X(randperm(size(X, 1))(1:k), :);

Newer releases of Octave (I use 3.2.4) added functions randi and randperm(n, m) that offer even nice features.

Binary Singleton Expansion Function (bsxfun)

This function reminds me of Python map function, but having it in Octave is necessity (unlike in Python). When vectorizing in Octave we have a few options: 1) both parameters are of the same dimensions for element-wise application; 2) parameters are of compatible sizes for matrix operations like multiplication; 3) one parameter is a matrix and the other is a scalar.

And then we use bsxfun for vectorizing everything else, for example applying a vector to each row:

X = rand(5, 3);
mu = mean(X);
sigma = std(X);
X_norm = bsxfun(@minus, X, mu);
X_norm = bsxfun(@divide, X_norm, sigma);

The operations above resulted in normalizing all 5 vectors (rows) from XX_norm contains vectors with 0 means and standard deviations 1 for all 3 dimensions. In just 2 lines bsxfun applied mu (means) and sigma (standard deviations) to each row of X and X_norm.

Timing operations

Use functions tic and toc to measure execution time in Octave to tune performance when necessary:
octave:> tic; A * A'; toc
Elapsed time is 2.58e-008 seconds.

Monday, June 4, 2012

Gentle Intro to Octave or Matlab

I began using Octave for homework assignments from the online Machine Learning class. Having worked with languages like Python, Groovy and JavaScript I never expected a system designed for numerical computations to include such a complete and unique programming language. But it does and I can't resist sharing some examples.

There are two important things one should know about Octave (or Matlab as Octave is usually portable to Matlab):
  • Octave is a high level language just like Python or Groovy
  • Using Octave without matrices or vectors is like using Java without objects
Just these by themselves are worth a whole book on Octave but instead I go on with few cool  examples (leaving the book for later :-).

Matrices and Vectors

Creating vector or matrix in Octave is simple:
octave:> A = [1 2 3; 4 5 6; 7 8 9]
A =

   1   2   3
   4   5   6
   7   8   9
defines 3x3 matrix of integers.

Use special functions to define special matrices, e.g. identity:
octave:> I = eye(3)
I =

Diagonal Matrix

   1   0   0
   0   1   0
   0   0   1

all zeros:

octave:> allZeros = zeros(2,4)
allZeros =

   0   0   0   0
   0   0   0   0
vector (number of columns is 1) of all ones:

octave:> allOnes = ones(3,1)
allOnes =

or matrix with random values:

octave:> X = rand(3, 5)
X =

   0.400801   0.091597   0.951333   0.063074   0.018309
   0.690633   0.194094   0.417911   0.658953   0.624323
   0.848887   0.696741   0.213559   0.363656   0.632738
And finally getting a vector of values from 1 to N (row vector):
octave:> 1:N
ans =

    1    2    3    4    5    6    7    8    9   10
 and column (vector above transposed):
octave:> (1:N)'
ans =


To stir things up a bit I make the following claim:
In Octave for any given problem there is higher than 50% chance that using matrices alone solves the problem with less code and more efficiently than when using loop and condition statements. 
Being a high level language Octave has control statements if, switch, loops for and while but using them in Octave is often your second choice. The reason are many matrix operators and functions Octave offers may accomplish a task without ever invoking a single control statement in a fraction of time.

Suppose you have a matrix X and you need to insert a column of 1s in front. Then I just concatenate a vector of 1s of proper size and X:

 X = [ ones(size(X, 1), 1),  X ];

What just happened: function size returned 1st dimension of array X (number of rows); then function ones generated a vector (2d dimension is 1) of 1s, and finally we concatenated column and X. But this is only beginning.

Matrix magic

This example illustrates why and how things may work out better without control statements in Octave. Suppose I have a row vector (we may call it also an array but ultimately it is a single row matrix) of numbers from 1 to 10:
octave:> y = randperm(10)
y =

    4    3    1    8    6   10    9    2    7    5

octave:> y = repmat(y, 1, 10);

Function randperm returned a row vector containing random permutation of numbers 1 through 10. Then I used function repmat to make vector y 10 times its original length by repeating it 10 times (which results in 1 by 100 matrix y).

Now, to the problem we will solve. Let's say each number from 1 to 10 corresponds to 10-dimensional vector of 0s and 1 with all of its elements set to 0 except the position of the number. Thus, 1 corresponds to vector (1 0 0 0 0 0 0 0 0 0), 2 corresponds to vector (0 1 0 0 0 0 0 0 0 0) and so on until 10 that corresponds to (0 0 0 0 0 0 0 0 0 1). Then given some vector y like above (where each element is a number from 1 to 10) we want to produce series of vectors that correspond to elements in y.

If you never worked with Octave before then solution may amaze you, if you did then this might be your normal routine:

A = eye(10)

A =

Diagonal Matrix

   1   0   0   0   0   0   0   0   0   0
   0   1   0   0   0   0   0   0   0   0
   0   0   1   0   0   0   0   0   0   0
   0   0   0   1   0   0   0   0   0   0
   0   0   0   0   1   0   0   0   0   0
   0   0   0   0   0   1   0   0   0   0
   0   0   0   0   0   0   1   0   0   0
   0   0   0   0   0   0   0   1   0   0
   0   0   0   0   0   0   0   0   1   0
   0   0   0   0   0   0   0   0   0   1

result = A(:, y);

What just happened: first we created an identity matrix A of size 10 - note that it consists of exactly 10 vectors we are mapping to and each is in right column position. Now, we just plug our original vector y into column index of matrix A. This will extract elements from A: all for rows and precisely right columns. Thus A(:, 2) gives us matrix which is 2nd column of A, A(:, 2:4) gives us matrix with columns of A from 2 to 4, same is accomplished with A(:, [2:4]), and A(:, [1 4 9]) selects 1st, 4th and 9th columns of A. Finally, we can plug an arbitrary vector in column index - in our case vector y just what we need and result becomes 10 by 100 matrix where each column corresponds to element of y.

As unconventional as it may sound I keep thinking of this solution in terms of ranges when indexing arrays. Indexing array could be via an integer, range, or array of numbers. The latter is just a vector and that is all to it.

Sunday, October 23, 2011

AutoComplete Using Google App Engine and YUI in two parts (part 2)

Part 2: Autocomplete with YUI3 and GAE RPC Handlers

This is second and final part of 2-part series on implementing AutoComplete with GAE. Recall that in part 1 we built a foundation for keyword autocomplete lookup service for our document search application. Both the service itself and its HTML/JavaScript client will materialize below.

Let's for a moment switch to JavaScript side where I intend to use YUI3 AutoComplete. It supports variety of sources to query for available auto-complete choices including XHR (XMLHttpRequest) style and JSONP style URL sources. While working within bounds of the same application XHR URL will suffice the simplicity of both YUI widget and GAE RPC service support will let us do both with almost no extra work (JSONP service allows access from third-party web site pages which should not be taken lightly for security concerns).

The choice of YUI3 widget versus other libraries such as jQuery with its Autocomplete plugin is not important as one can swap plugins with few lines of JavaScript. YUI3 Library offers rich set of built-in options, wide variety of other compatible widgets, utilities, infrastructure and its API resembles jQuery now (I believe Yahoo stole - not copied - according to Picasso).

Great article by Paul Peavyhouse contains building blocks for the RPC handlers in GAE. We begin with RPCHandler class:

class RPCHandler(webapp.RequestHandler):
    """ Allows the functions defined in the RPCMethods class to be RPCed."""

    def __init__(self):
        self.methods = RPCMethods()

    def get(self):
        func = None

        action = self.request.get('action')
        if action:
            if action[0] == '_':
                self.error(403) # access denied
                func = getattr(self.methods, action, None)

        if not func:
            self.error(404) # file not found

        args = ()
        while True:
            key = 'arg%d' % len(args)
            val = self.request.get(key)
            if val:
                args += (val,)
        # Checking if result is cached
        cache_key = action + ';' + ';'.join(args)
        result = memcache.get(cache_key) #@UndefinedVariable
        # Query if it's not
        if result is None:
            result = func(*args)
            memcache.add(cache_key, result, 900) #@UndefinedVariable
        return_data = self.prepare_result(result)
        self.response.headers['Content-Type'] = "application/json"

Actually RPCHandler takes care of roughly 90% of the job:
  • it retrieves action from request and matches it to appropriate RPC method from RPCMethods class via reflection (lines: 4-6 and 9-21)
  • it extracts service parameters from the request (parameter names matching argN) to pass to RPC method (lines: 23-30)
  • it forms a key to cache this call and checks if it's already available from memcache (lines: 32-43)
  • it calls RPC method and saves results in cache (lines: 36-39)
  • it formats results and sends them back to a client (lines: 41-43)

RPCHandler is an abstract class - concrete handlers extend using template method pattern : single abstract method prepare_result lets us have both XHR and JSONP style handlers:

class JSONPHandler(RPCHandler):
    def prepare_result(self, result):
        callback_name = self.request.get('callback')
        json_data = simplejson.dumps(result)
        return_data = callback_name + '(' + json_data + ');'
        return return_data
class XHRHandler(RPCHandler):
    def prepare_result(self, result):
        json_data = simplejson.dumps(result)
        return json_data

While XHRHandler formats data in JSON, JSONPHandler adds callback function to reply as expected by JSONP client (on top of generated JSON). Django provided simplejson encoder implementation imported from django.utils is part of App Engine environment.

With RPC plumbing done class RPCMethods does actual work: its method for keyword autocomplete action is ac_keywords (later you can offer more services by adding methods in RPCMethods):

class RPCMethods:
    def ac_keywords(self, *args):
        prefix = args[0]
        limit = int(args[1])
        query = Query(Keyword, keys_only=True)        
        query.filter('words >=', prefix)
        query.filter('words <=', unicode(prefix) + u"\ufffd")
        keyword_keys = query.fetch(limit, 0)
        result = map(lambda x:, keyword_keys) 
        return result
The method ac_keywords executes a search that matches all keywords starting with prefix and returns normalized version of corresponding keyword using retrieved key. In first part we called this approach embedded RIE exactly for this reason: retrieving key as data using search over string list property.
Now that everything is ready on GAE side (well, almost: last piece of code I left for the very end), we can implement take care of the html inside the browser. I start with defining a form containing input field to enter keywords:
<form method="post" action="#" id="search_form">
<input id="search_field" class="search" type="text" name="search_field" 
               value="Enter keywords..." 
        <input name="search" type="image" src="images/search.png" 
               alt="Search" title="Search" />
New empty input box will contain phrase Enter keywords... that disappears as soon as user focuses on the filed:

With auto-complete enabled it will look like this:

Adding YUI3 AutoComplete plugin is just few lines of JavaScript that also include extra customization to control highlighting, filtering, and delimiter for matching words (far from all options available to tune this plugin for one's needs):


I used queryDelimiter to activate autocomplete for each word user enters: feel free to play and change these and other attributes plugin offers. The line with source that is commented out defines source URL for JSONP style service, while the line with source left active is for XHR URL.

Finally, last piece of server-side Python code that enables both URLs using webapp framework for Python (in

application = webapp.WSGIApplication(
            [( '/',                 MainHandler),
             ( '/rpc.jsonp',        JSONPHandler),
             ( '/rpc.xhr',          XHRHandler)

I wanted to finish emphasizing how efficient this solution is. YUI3 AutoComplete plugin caches responses from JSONP and XHR URL sources automatically based on the query value for the duration of the pageview. Python RPC services implemented in GAE use memcache automatically and transparently to cache results for each action type and query value. Finally, querying for matching keywords in the datastore uses key only queries which are least expensive. Given that autocomplete feature on a busy site must be quite popular all these will contribute to both performance and savings on GAE.