Because best way to understand is explaining.

Sunday, April 24, 2016

Map of the Windows Fonts Registered with R

If you already found package extrafont then you probably found how to load and use Windows fonts in R visualizations. But just in case, everything to get started with extrafont is found here and summarized for using fonts in Windows for on-screen or bitmap output below:


One thing to add is the summary of all registered Windows fonts in R. Such summary comes handy when designing new visualizations and deciding on which font or combination of fonts and their faces to use.

This results in the large and very handy visual:



You can download this image or produce your own with the code above.

Saturday, April 16, 2016

Creating and Tweaking Bubble Chart with ggplot2

This article will take us step-by-step over incremental changes to produce a bubble chart using ggplot2 that looks like this:


We'll encounter the plot above once again at the very end after explaining each step with code changes and observing intermediate plots. Without getting into details what it means (curios reader can find out here) the dataset behind is defined as:


It contains 2 data points and 4 attributes: three numerical Aster_experience, R_experience, and  coverage, and one categorical product. Remember that the data won't change a bit while the plot progression unfolds.

The starting plot is simple scatterplot using coordinates x and y as Aster_experience, R_experience (line 3), point size as coverage, and point color as product (line 4) (this type of scatterplot has a special name - bubble chart):




Immediate fix would be making the smaller point big enough to see it with the help of scale_size function and its range argument (line 3) (strange enough but sibling function scale_size_area doesn't have such argument) that specifies the minimum and maximum size of the plotting symbol after transformation1 :



Next refinement aims at the magic quadrant concept which fits this data well. In this case it's "R Experience" vs. "Aster Experience" and whether there is more or less of each. Achieving this effect involves fake axes using geom_hline and geom_vline (line 3), and customizing actual axes using scale (line 5-6) and theme functions (line 8-12):




Typical for bubble charts its points get both colored and labeled, which also makes color bar legend obsolete. We use geom_text to label points (line 5) and scale_color_manual to assign new colors and remove color bar legend (line 11):



The next step happened to tackle the most advanced problem while working on the plot. The guide legend for size above looks rather awkward. Ideally, it matches the two points we have in both color and size. It turned out (and rightly so) that the function scale_size is responsible for its appearance (line 8). In particular, number of legend positions overrides argument breaks, and controling appearance including colors of the legend performed with guide_legend and override.aes:



We finish cleaning the plot using package ggthemes and its theme_tufte function (line 10):




As promised, we finished exactly where we started.


1 Scale size (area or radius).

Saturday, January 30, 2016

R Graph Objects: igraph vs. network

While working on new graph functions for my package toaster I had to pick from the R packages that represent graphs. The choice was between network and graph objects from the network and igraph correspondingly - the two most prominent packages for creating and manipulating graphs and networks in R.

Interchangeability of network and graph objects


One can always use them interchangeably with little effort using package intergraph. Its sole purpose is providing "coercion routines for network data objects". Simply use its asNetwork and asIgraph functions to convert from one network representation to another:

library(igraph)
library(network)
library(intergraph)
 
# igraph 
pkg.igraph = graph_from_edgelist(edges.mat, directed = TRUE)
pkg.network.from.igraph = asNetwork(pkg.igraph)
all.equal(length(get.edgelist(pkg.igraph)), length(as.matrix(pkg.network.from.igraph, "edgelist")))
 
# network
pkg.network = network(edges.mat)
pkg.igraph.from.network = asIgraph(pkg.network)
all.equal(length(as.matrix(pkg.network, "edgelist")), length(get.edgelist(pkg.igraph.from.network)))
Created by Pretty R at inside-R.org

For more on using intergraph functions see tutorial.

Package dependencies with miniCRAN


To assess relative importance of packages network and igraph we will use package miniCRAN. Its access to CRAN packages' metadata including dependencies via "Depends", "Imports", "Suggests" provides necessary information about package relationships. Built-in makeDepGraph function recursively retrieves these dependencies and builds corresponding graph:

library(miniCRAN)
 
cranInfo = pkgAvail()
 
plot(makeDepGraph(c("network"), availPkgs = cranInfo))
plot(makeDepGraph(c("igraph"), availPkgs = cranInfo))
Created by Pretty R at inside-R.org

               
                   


Unfortunately, these dependency graphs show how network and igraph depend on other CRAN packages while the goal is to evaluate relationships the other way around: how much other CRAN packages depend on the two.

This will require some assembly as we construct a network of packages manually with edges being directed relationships (one of "Depends", "Imports", or "Suggests") as defined in DESCRIPTION for all packages. The following code builds this igraph object (we chose igraph for its functions utilized later):

cranInfoDF = as.data.frame(cranInfo, stringsAsFactors = FALSE)
 
edges = ddply(cranInfoDF, .(Package), function(x) {
  # split all implied (depends, imports, and suggests) packages and then concat into single array
  l = unlist(sapply(x[c('Depends','Imports','Suggests')], strsplit, split="(,|, |,\n|\n,| ,| , )"))
 
  # remove version info and empty fields that became NA
  l = gsub("^([^ \n(]+).*$", "\\1", l[!is.na(l)])
 
  # take care of empty arrays
  if (is.null(l) || length(l) == 0) 
    NULL
  else
    data.frame(Package = x['Package'], Implies = l, stringsAsFactors = FALSE)
} )
 
edges.mat = as.matrix(edges, ncol=2, dimnames=c('from','to'))
pkg.graph = graph_from_edgelist(edges.mat, directed = TRUE)
Created by Pretty R at inside-R.org

The resulting network pkg.graph contains all CRAN packages and their relationships. Let's extract and compare the neighborhoods for the two packages we are interested in:

# build subgraphs for each package
subgraphs = make_ego_graph(pkg.graph, order=1, nodes=c("igraph","network"), mode = "in")
g.igraph = subgraphs[[1]]
g.network = subgraphs[[2]]
 
# plotting subgraphs
V(g.igraph)$color = ifelse(V(g.igraph)$name == "igraph", "orange", "lightblue")
plot(g.igraph, main="Packages pointing to igraph")
V(g.network)$color = ifelse(V(g.network)$name == "network", "orange", "lightblue")
plot(g.network, main="Packages pointing to network")
Created by Pretty R at inside-R.org



The igraph neighborhood is much denser populated subgraph than the network neighborhood and hence its importance and acceptance must be higher.

Package Centrality Scores


Package igraph can produce various centrality measures on the nodes of a graph. In particular, pagerank centrality and eigenvector centrality scores are principal indicators of the importance of a node in given graph. We finish this exercise with validation using centrality scores for our initial conclusion that igraph package is more accepted and utilized across CRAN ecosystem than network package:

# PageRank
pkg.pagerank = page.rank(pkg.graph, directed = TRUE)
 
# Eigenvector Centrality
pkg.ev = evcent(pkg.graph, directed = TRUE)
 
toplot = rbind(data.frame(centrality="pagerank", type = c('igraph','network'), 
                          value = pkg.pagerank$vector[c('igraph','network')]),
               data.frame(centrality="eigenvector", type = c('igraph','network'), 
                          value = pkg.ev$vector[c('igraph','network')]))
 
library(ggplot2)
library(ggthemes)
ggplot(toplot) +
  geom_bar(aes(type, value, fill=type), stat="identity") +
  facet_wrap(~centrality, ncol = 2)
Created by Pretty R at inside-R.org



Conclusion


Both packages igraph and network are widely used across CRAN ecosystem. Due to its versatility and rich set of functions igraph leads in acceptance and importance. But as far as graph objects concern it is still a matter of the requirements to prefer one's or another's objects in R.

Tuesday, September 22, 2015

VW Big Data Play

Volkswagen made headlines lately for cheating U.S. EPA regulators. But let's pay some respect to their engineers.

Apparently, there is no button or switch that tells car it's being tested - indeed - that would be obvious flaw in the emission test protocol. So VW engineers designed and deployed sophisticated algorithm that detects car is undergoing emission testing and turns emission control on just in time to pass it with flying colors. Then, after the test is over, it recognizes normal driving conditions and switches car software back to run diesel engine in its normal mode (that creates smog at up to 40 times the legal limit).

Having such feature running flawlessly in real time conditions on hundred thousand cars all over the world deserves special recognition. In fact, it was pure accident that this "cheating device" was found (here is Bloomberg's story how). At least, let's congratulate VW data scientists and software engineers - but not their execs - with quite an accomplishment.

Friday, September 13, 2013

How to expand color palette with ggplot and RColorBrewer

Histograms are almost always a part of data analysis presentation. If it is made with R ggplot package then it may look like this:

data(mtcars)
 
ggplot(mtcars) +
  geom_histogram(aes(factor(cyl), fill=factor(cyl)))


The elegance of ggplot functions is in simple yet compact expression of visualization formula while hiding many options assumed by default. Hiding doesn't mean lacking as most options are just a step away. For example, color selection can change with one of scale functions such as scale_fill_brewer:

ggplot(mtcars) +
  geom_histogram(aes(factor(cyl), fill=factor(cyl))) +
  scale_fill_brewer()


In turn, scale_fill_brewer palette can be changed too:

ggplot(mtcars) +
  geom_histogram(aes(factor(cyl), fill=factor(cyl))) +
  scale_fill_brewer(palette="Set1")



Palettes used live in the package RColorBrewer - to see all available choices simply attach the package with library(RColorBrewer) and run display.brewer.all() to show this:
There are 3 types of palettes, sequential, diverging, and qualitative; each containing from 8 to 12 colors (see data frame brewer.pal.info and help ?RColorBrewer for details).

Curious reader may notice that if a histogram contains 13 or more bars (bins in case of continuous data) we may get in trouble with colors:

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) +
  scale_fill_brewer(palette="Set2")


Indeed length(unique(mtcars$hp)) finds 22 unique values for horse power in mtcars, while specified palette Set2 has 8 colors to choose from. Lack of colors in the palette triggers ggplot warnings like this (and invalidates plot as seen above):
1: In brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
RColorBrewer gives us a way to produce larger palettes by interpolating existing ones with constructor function colorRampPalette. It generates functions that do actual job: they build palettes with arbitrary number of colors by interpolating existing palette. To interpolate palette Set1 to 22 colors (number of colors is stored in colourCount variable for examples to follow):

colourCount = length(unique(mtcars$hp))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
 
ggplot(mtcars) + 
  geom_histogram(aes(factor(hp)), fill=getPalette(colourCount)) + 
  theme(legend.position="right")


While we addressed color deficit other interesting things happened: even though all bars are back and are distinctly colored we lost color legend. I intentionally added theme(legend.position=...) to showcase this fact: despite explicit position request the legend is no more part of the plot.

The difference: fill parameter was moved outside of histogram aes function - this effectively removed fill information from aesthetics data set for ggplot. Hence, there is nothing to apply legend to.

To fix it place fill back into aes and use scale_fill_manual to define custom palette:

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) + 
  scale_fill_manual(values = getPalette(colourCount))


Another likely problem with large number of bars in histogram plots is placing of the legend. Adjust legend position and layout using theme and guides functions as follows :

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) + 
  scale_fill_manual(values = getPalette(colourCount)) +
  theme(legend.position="bottom") +
  guides(fill=guide_legend(nrow=2))


Finally, the same example using  in place palette constructor with different choice of library palette:

ggplot(mtcars) + 
  geom_histogram(aes(factor(hp), fill=factor(hp))) + 
  scale_fill_manual(values = colorRampPalette(brewer.pal(12, "Accent"))(colourCount)) +
  theme(legend.position="bottom") +
  guides(fill=guide_legend(nrow=2))
Created by Pretty R at inside-R.org


There are many more scale functions to choose from depending on aesthetics type (colour, fill), color types (gradient, hue, etc.), data values (discrete or continuous).

Wednesday, July 31, 2013

Quick R tip: ggplot in functions needs some extra care

When building visualizations with ggplot2 in R I decided to create specialized functions that encapsulate plotting logic for some of my creations. In this case instead of commonly used aes function I had to use its alternative -  aes_string - for aesthetic mapping from a string.

And now goes this handy tip:
while original aesthetic mapping function aes accepts x and y parameters by position:

p = ggplot(data, aes(x, y)) + ...

aes_string even though silently accepts them won't work like this:

my_plot_fun = function(data, xname, yname) {
  p = ggplot(data, aes_string(xname, yname)) + ...
}

It will run to compile plot object without problems but when plot p (returned from the function my_plot_fun) executed this rather cryptic error appears:

Error in as.environment(where) : 'where' is missing

What it means is that ggplot never got aesthetics defined right. This is due to aes_string function lacking the same position parameters as in its aes counterpart above. Instead, define both x and y parameters (and others if necessary) by name:

p = ggplot(data, aes_string(x=xname, y=yname)) + ...

Created by Pretty R at inside-R.org


UPDATE:


One more vote for using aes_string in place of aes comes from CRAN submission policy, i.e.:
In principle, packages must pass R CMD check without warnings or significant notes to be admitted to the main CRAN package area. If there are warnings or notes you cannot eliminate (for example because you believe them to be spurious) send an explanatory note as part of your covering email, or as a comment on the submission form. 
 (source: CRAN Repository Policy)
 What happens is that  R CMD check  reports notes like this for every aes call:

no visible binding for global variable [variable name]
It turns out that the most sensible solution is using aes_string instead.

Thursday, July 26, 2012

My New Calculator(s)

We all need a calculator from time to time. I used to reach to Start button, type Calc in the Run (or Search) box to get to Calculator app (Windows). Until recently that is. Now I simply start Octave and do my calculations there. Sometimes, I already have Python prompt and then I do my calculations there.

For example compute a variance for the sample of 10 coin flips: 4 Tails (0) and 6 Heads (1) (estimated mean p=0.6):
 octave-3.2.4.exe>(4*(0-0.6)**2 + 6*(1-0.6)**2)/(10-1)
 ans =  0.15360
This calculator really works for me. Sometimes I have a Python window and it works just as well:
>>> (4*(0-0.6)**2 + 6*(1-0.6)**2)/(10-1)
0.2666666666666667
Having said that Octave beats Python in easiness when calculating with vectors (or time series or sequences or anything that can be represented as a vector). Let's suppose we test if certain coin is not loaded (is fair) by flipping it 14 times. We would like to be 95% certain that coin is fair (i.e. p=0.5 which is equivalent to two-tailed test). Suppose that 14 flips resulted in only 3 heads. First, we build a critical interval - the number of tails that would result in rejecting coin fairness given number of heads:
critical_interval = binopdf(0:14,14,0.5) > 0.025 | binocdf(0:14,14,0.5)- binopdf(0:14,14,0.5) > 0.975
critical_interval is a Boolean vector where i-th element corresponds to (i-1) number of tails: if it's true then with 95% certainty it is a fair coin. This expression is a logical OR of 2 expressions: first for left tail and second for right tail. Octave seamlessly handles any vectors just as if it were a number: I can change this to 1000 flips with minimum keystrokes.

Thus, given number of tails 3 we get
octave-3.2.4.exe> critical_interval(3+1)
ans = 0
We use (3+1) as 1st element corresponds to 0 tails. Hence, we accept the fact that our coin is fair with 95% certainty because 3 heads do not belong to the critical interval. Similarly we have to reject this coin as loaded if number of tails happens to be 12:

octave-3.2.4.exe> critical_interval(12+1)
ans = 1

I leave Python example as an exercise to a reader but I am certain that result won't be close to Octave  in neither transparency nor conciseness. The Octave solution does leave me with the right to keep claiming this use similar to a calculator - not a programming.