Wednesday, July 5, 2017

The Role of Small Data and Vacation Recap Example

Wikipedia defines small data 'small' enough for human comprehension but then it goes further by qualifying data in a volume and format that makes it accessible, informative and actionable. I am not certain the latter is always true: smaller footprint doesn't automatically qualify data as informative and actionable without more work. In my book small data usually scales to kilobytes and has just a handful of dimensions. But its main feature remains human comprehension which really means there is simple story behind it. 

In the grand scheme of big data things small data story is the last mile of data science analysis. It still requires interpretation (or representation) in the form of visualization or application.


Case in point could be Google spreadsheet I kept while on vacation in Italy with daily recordings of miles and steps walked. Later I added main attractions for each day. The result was my personal small data covering about 2 weeks of touring Italy with bases in Rome and later in Sicily (this sentence was the story):

Google sheet of activities while on vacation in Italy

As-is this spreadsheet is destined to Google archives contributing to ever growing collection of docs I created and happily forgot about. So I created this visualization that represents both most of data and the story:


Small data visualization

Before explaining how this visualization was created with R I ought to acknowledge that Google spreadsheets offer adding a chart or graph to a document. But its functionality appears rather limited without resorting to JavaScript API.

Using R googlesheets package to source Google docs makes them integral part of data sources available from within R code:
For details on how code above authenticates with Google servers and processes documents see very detailed vignettes

Now we can get back to small data and its simple story. Which means single visualization may include most if not all of it. In case of small data the goal is designing such chart without sacrificing clarity.

Core attributes days (Date) and miles walked (Distance; I chose miles over Steps for simplicity) suggest a line chart with timeline along x-axis and distance for y-axis. But there are 2 more factors to incorporate: Place indicating where the base city was each day and Label for major attractions.


Base city receives color identification with deep red for Rome and olive green for Syracuse. Major attractions text was attached to each point with smart justifications to fit inside the chart:


Had I kept more detailed log I would have ended up with more dimensions to use. For example, miles driven by car or train, time spent at leisure versus touring, number of cities and places visited, historical marker attributes and so on. But that moves us further away from small data domain as footprint and dimensions grow and story becomes less comprehensible. One of indicators of this is that it becomes harder to collect data manually. Instead, there are apps that would do it for me, for example, Life Cycle or Apple Health.

Ultimately any big data problem is reduced to one or more small data ones by aggregating, regressions, clustering or some other data science method. The path to big data insights is a journey from big to small data in search of simple story. So learning how to deal with small data is where it all both ends and begins.