BrightPoint Consulting, Inc.

Thomas Gonzalez

What can others learn from the successes and failures of the way you’ve unlocked the value of big data?


Federal Budget   -   Political Influence   -   US Trade Deficit

Unlocking the value in big data requires finding the signal within the noise. It is part science, part art, and part intuition: looking for and identifying patterns in the data that tell a compelling story.

The first step is identifying the story you want to tell. Sometimes you have a clear thesis that you want to prove visually with the data; other times you have a hypothesis that requires exploration to understand further. In all cases your goal is to create meaning from the data that isn’t immediately apparent. In data visualization most of this meaning is derived through visual relationships and patterns. Using size, shape, color, position, contrast, and a host of other stylistic elements, you can create a myriad of visual relationships that impart meaning to the end user.

One of the challenges in working with big data is that there is usually more data than can be discretely visualized at once. On desktop, tablet, and mobile, we are constrained by how many pixels we have on a screen and thus by the number of data points that can be displayed at once. Techniques such as drill down, drill across, and dynamic user interaction allow us to visually present data in smaller chunks, but a summarization process of the data is still required. This summarization can be seen quite clearly in the US Federal Budget Deficit, where we use the visual metaphor of a tree with expanding leaf nodes to show data in ever smaller chunks.

But what happens when the data we have is too large (tens of thousands of data points and higher) to present in an interactive visualization all at once? We end up having to make hard decisions and be more selective about what data we will use to tell our story. When you look at the Political Influence visualization you will see that I made a choice to only display the top 20 PACs. Early in my research of the data I had decided on the general visual metaphor I wanted to use: the PACs surrounding the senate/congress like a petri dish, with the candidates as cells in the middle, symbolizing the intermixed and contained relationships between government and industry. Based on this general metaphor I knew that displaying all of the PACs around a circle was not going to be possible, as you are limited to perhaps a couple dozen that you can display clearly with only 360 degrees to work with. So I made a decision to only display the top 20, thinking this would be enough to tell the story.

This was a hard decision because of the technical work it implied. The data for this visualization came from half a dozen sources on the FEC website representing millions of rows of contribution data, which I then had to import into a SQL database. I had to cross-reference PACs and contributions with political candidates for both the house and the senate. I then needed to figure out who the top 20 contributors were (filtering by dozens of different contribution types and donor types) by joining the contributions table to the PAC table. From this I then had to make two separate temporary SQL tables that held only the contributions from the respective house and senate top 20 contributors. From these tables I could then run aggregate functions to determine how much was contributed to each candidate, by whom, and when. The result was the distillation of several million rows of data into several hundred that could then be visualized.
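The distillation described above can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the actual schema or queries used for the project: the table names (candidates, contributions, top_contrib), the column names, and the helper functions are all hypothetical, and the real work also filtered on contribution and donor types.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE candidates (candidate_id TEXT PRIMARY KEY, chamber TEXT);
        CREATE TABLE contributions (pac_id TEXT, candidate_id TEXT, amount REAL);
        """
    )
    conn.executemany(
        "INSERT INTO candidates VALUES (?, ?)",
        [("s1", "senate"), ("s2", "senate"), ("h1", "house")],
    )
    conn.executemany(
        "INSERT INTO contributions VALUES (?, ?, ?)",
        [("pacA", "s1", 100), ("pacA", "s2", 50), ("pacB", "s1", 30),
         ("pacC", "s2", 10), ("pacC", "h1", 1000)],
    )
    return conn

def distill(conn, chamber, top_n=20):
    """Return (candidate, PAC, total) rows for the top_n PACs of one chamber."""
    cur = conn.cursor()
    # Temporary table holding only contributions from the top PACs.
    cur.execute("DROP TABLE IF EXISTS temp.top_contrib")
    cur.execute(
        "CREATE TEMP TABLE top_contrib "
        "(pac_id TEXT, candidate_id TEXT, amount REAL)"
    )
    cur.execute(
        """
        INSERT INTO top_contrib
        SELECT c.pac_id, c.candidate_id, c.amount
        FROM contributions AS c
        JOIN candidates AS k ON k.candidate_id = c.candidate_id
        WHERE k.chamber = :chamber
          AND c.pac_id IN (
              SELECT c2.pac_id
              FROM contributions AS c2
              JOIN candidates AS k2 ON k2.candidate_id = c2.candidate_id
              WHERE k2.chamber = :chamber
              GROUP BY c2.pac_id
              ORDER BY SUM(c2.amount) DESC
              LIMIT :top_n
          )
        """,
        {"chamber": chamber, "top_n": top_n},
    )
    # Aggregate the reduced table into the rows that get visualized.
    return cur.execute(
        """
        SELECT candidate_id, pac_id, SUM(amount) AS total
        FROM top_contrib
        GROUP BY candidate_id, pac_id
        ORDER BY candidate_id, pac_id
        """
    ).fetchall()
```

Calling distill(build_demo_db(), "senate", top_n=2) keeps only the two largest senate contributors and returns one summed row per candidate and PAC pair.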

Along the way I kept checking that the amount of data in each discrete part (number of PACs, number of congress people, number of contributions) matched how I generally wanted to visualize it. Once I had the distilled data, I was able to work on the visualization itself.

From a creative and visual perspective this is the fun part. You know you have solid data to work with, and now you get to start creating and see what patterns the data creates. I knew the general visual metaphor I wanted (angular arcs representing the PACs and circles representing the congress). But there were still lots of details to work out, like the circle packing algorithm to sort out democrats and republicans, how to scale the size of each congress member based on total donations, and how to show multiple donations to a particular congress member (nested semi-transparent circles).

I always strive to make my visualizations aesthetically pleasing, as it helps to draw the user in and makes them want to understand the data in more depth. In this visualization one of the challenges was how to show the contributions from the PACs to the congress. Simply using straight lines looked ugly and confusing, so I spent several days refining the quadratic curve algorithms to get curving lines with a zero degree arc at the end terminating at the congress member, and the relative arc showing the size of the PAC contribution. Further refinements were then applied with specific stroke and fill transparencies, adding animation showing the contributions being applied over time. The end result is a rather dense visualization of the top 20 PAC contributions for the year. It looks a bit messy and organic, which was part of what I was going for. It was my hope that people would see how intertwined special interest groups are with congress and each other based on their contributions across a wide spectrum of candidates from both parties.
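For readers unfamiliar with quadratic curves, here is a minimal sketch of how points on a quadratic Bezier path can be computed. It is only the textbook formula, not the refined algorithm described above; the function name quadratic_bezier and the choice of control point are assumptions made for illustration.

```python
def quadratic_bezier(p0, p1, p2, steps=16):
    """Sample points on a quadratic Bezier curve.

    p0 is the start (for example a point on a PAC arc), p2 is the end
    (for example a congress member's circle), and p1 is the control
    point that bends the line. All three are (x, y) tuples.
    """
    points = []
    for i in range(steps + 1):
        t = i / steps
        a = (1 - t) ** 2        # weight of the start point
        b = 2 * (1 - t) * t     # weight of the control point
        c = t ** 2              # weight of the end point
        x = a * p0[0] + b * p1[0] + c * p2[0]
        y = a * p0[1] + b * p1[1] + c * p2[1]
        points.append((x, y))
    return points
```

Moving the control point changes how strongly the path bends, which is the kind of parameter that would be tuned by eye while refining the look of the links.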

But in order to help make sense of the data and engage the user further, I needed to add user interaction. In this case I wanted the user to easily see, for a specific PAC, all the candidates it contributed to; for a given congress member, where all their contributions came from; and for a given contribution, where it came from and to whom it went. In order to accomplish this I created mouse over events for each candidate to show their total contributions and where they came from. For each PAC I created a mouse over on their name label that showed their total contributions and where they went, and for each contribution itself I created mouse over events for both the radius arc and the Bezier path itself. In all of these cases the mouse over events highlight the related data visually to help the user see the relationships and patterns within the data. The user can then see these relationships and have the visualization animate in an aesthetically pleasing and engaging manner.

What were your expectations of the value hidden in your data, and how did they influence the design of your solution?

Each of the three visualizations presented here was a bit of a gamble. Even though I had some general sense of what I would see, I was still surprised by some of the outcomes. For instance, for the Federal Budget, I knew that at the federal level Defense spending was far greater than Education spending, but looking at the raw numbers does not have the same impact as the visual. It was a pleasant surprise (although in hindsight not unexpected) to see that at the local level Education spending far outstrips Defense, as Education is mostly funded through county property taxes, while our military is funded through federal tax dollars.

Another big surprise in the US Trade Deficit visualization was seeing the amount of trade we do with Canada and Mexico, and that in many years our largest importer is Canada, not China.

In all three visualizations I had some guesses as to what I would see, and I generally follow a process to test my intuition about the data before investing too much time in it. First I look at the size of the data set, to help determine if it is even possible to visualize it, and what kind of summary techniques I will need (data pre-processing, client side pre-processing, user drill down/across, unique interaction metaphors, etc.). Then I will generally form a hypothesis about the data and the story I want to tell. I will usually query the data directly (either via Excel or SQL) to look for key metrics that support or refute the hypothesis. In the case of the Federal Budget it was easy to see how much is spent on Defense versus other categories, and that was compelling enough for me to build the visual, which wasn’t a huge effort. On the other hand, the Political Influence visualization was a substantial effort and I didn’t have any key metrics to validate. What I did look for was the number of contributions going to both democrats and republicans, to help validate the hypothesis that special interest group contributions ran across parties and would support the visual of the interconnected nature of our political system.

Based on what I expected to see in the data, I had some good guesses about the visual approaches I would take, but sometimes you run into a few issues. For instance, for the Federal Budget, the tree metaphor broke down when I got to some of the end leaves that had hundreds of nodes to display. In this case, the vertical layout algorithm tries to squeeze all of the nodes into a space too small to display them in an appealing manner. So instead of working out a different visual at that level, I simply opted to use an Alert dialog telling the user there were too many nodes to display. While a new visual for the high density of nodes would have been the better solution, I felt that the visualization told enough of a story as it was and didn’t merit the further design and engineering work that a new metaphor would require (with clients working within a budget, these types of decisions get made all the time).

Conversely, how did the design of your solution affect your understanding of the potential value of your data?

I take a rather iterative approach to design that doesn’t really lend itself to big surprises that create a major change in design direction. This is mostly to mitigate the risk of expending a lot of effort in one direction only to realize the path isn’t going to bear any fruit. That’s not to say that in the concepting stage of a visual I don’t throw away ideas and start fresh, but this happens more with visual prototypes that I use to flesh out a new metaphor than with something being driven by real data.

Generally I iterate over the data itself and its structures, validating that the metrics support the hypothesis, that I can make the necessary summaries and relationships within the data, and that I can output or process the data to a point where it can be visualized. This is before I start doing any visualization work.

After I have done some preliminary data validation and have come up with some ideas on how I want to visualize the data, I go through an iterative design process, especially when it is a completely new visual metaphor that hasn’t been used before, such as in the US Trade Deficit or the Political Influence visualization.

This process starts with hand sketches to get on paper what I am conceptualizing in my mind’s eye. I can quickly see if a visual idea has merit and is worth exploring further or if I need to explore different options. This is a very quick process, and it is easy to see dead ends versus potential solutions. If I get stuck and can’t come up with a good solution I sit on it for a few days and re-approach with pen and paper in hand. Once I have a sketch of the direction I am going in, I may refine the sketch a little, but pretty quickly I get into prototyping the visual.

I use prototyping with dummy data to vet the details of a visual, especially when I am experimenting with new shapes (such as the Bezier curved paths used to represent contributions in Political Influence, or the arc segments for the US Trade Deficit). Without having to map to actual data I can focus on the actual geometries, colors, shading, and interactions of the visualizations, as well as the engineering aspects. I usually do this prototyping directly in code with frameworks like Axiis, Processing, or D3.js. Very quickly I can see if the sketches translate to the screen, and then I can work on visual refinements and stylistic treatments. Once I have the prototypes to a point I am satisfied with, I refactor the prototype code and clean it up into more usable code that I can leverage across more than just one project, and use this as the basis for the visualization.

With the abstract visualizations working in code I then begin the process of integrating the data with the visuals.  It is at this stage where I start working on refinements to the project, working on scales, stroke weights, colors, gradients and other visual cues that help tell a nuanced story with the data.

Describe the aspects of the design of your solution that do the most to expose meaning in data that would otherwise be harder to discern.


Federal Budget:

The area of each circle in the visualization represents aggregate spending for a specific federal budget line item. In this case, scale becomes an interesting measure. If I were to map the actual spending to the area of each circle in a linear fashion we would have some huge outliers (like defense) that would render other nodes at a sub-pixel level. Essentially users would see a few HUGE circles and some little dots next to them, which is not very helpful for seeing finer nuances in the data. Also, since each level of the tree represents an aggregate of the levels below it, the upper levels would overshadow the lower levels. So I took a two-fold approach to solving this problem and gave up data-to-pixel accuracy to more effectively communicate the story. I don’t believe my choices skew the data; rather, they help to communicate the overall story more clearly.

First, to solve the problem of the aggregate levels, I made a decision that all nodes for a given level would be visualized relative to each other. So I calculate the min and max values for each level and display the data relative to those min and max values.

Second, because of some outlier data (huge spending or nominal spending) I used an inverse log scale to represent the radius (and thus area) of the circle and mapped this scale back to the relative value of the node in comparison to the min and max of that level. This allows the large spending to appear not too large in visual area, and the small spending not too small. I also used a min and max for the total radius to cap values. I had to play around with these caps and scales with the actual data until I felt I had a happy medium between telling the story and accuracy. The end result allows users to easily see relative differences in spending without large values taking up the whole screen or small values disappearing.
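The two steps above (per-level min/max, then a compressed scale with radius caps) could be sketched like this. It is only an approximation of the idea: a plain logarithmic mapping stands in for the scale described in the text, and the function name level_radii and the cap values r_min and r_max are assumptions, not the values used in the Federal Budget visualization.

```python
import math

def level_radii(values, r_min=4.0, r_max=40.0):
    """Map the spending values of ONE tree level to circle radii.

    Values are compared only against the min and max of their own level,
    and a log scale compresses the range so that outliers neither fill
    the screen nor shrink their neighbours to sub-pixel dots.
    """
    if not values:
        return []
    positive = [max(v, 1.0) for v in values]  # guard against log(0)
    lo = math.log(min(positive))
    hi = math.log(max(positive))
    span = hi - lo
    radii = []
    for v in positive:
        # Position of this value between the level's min and max, 0..1.
        t = 0.5 if span == 0 else (math.log(v) - lo) / span
        radii.append(r_min + t * (r_max - r_min))
    return radii
```

With values of 1, 10, and 100 the radii come out evenly spaced between r_min and r_max, whereas a linear mapping would place the middle value much closer to the smallest one.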

To more easily guide the user to places of spending I decided to match the stroke weight of each curved link going from a higher node to a lower node to the radius of the circle of that lower node. In other words you will see thicker lines where more spending is occurring and thinner lines where less spending is occurring. This provides the user an additional visual cue to track spending. I also adjusted the transparency of these links to about 30% so they would be visual cues while not being distracting with too much visual mass and contrast. This also allows the user to more easily track the source of multiple spending lines, as you can see the layers of spending due to the 30% transparency.

To show relative spending across geographic levels (Federal/State/Local) I incorporated some animation. Animation can often be overused in a gratuitous manner in visualizations, offering nothing more than eye-candy rather than a communicative cue. One area where animation DOES work well in visualization work is showing changing values of data. In this case, animation is employed to show the changing radius of circles and stroke weights of links when the user toggles between Federal/State/Local spending. The animation quickly draws the user’s eye to areas of growth or shrinkage in spending levels across different categories. It is also used as a tool to further engage the user to interact with the data.

The final piece of the visualization, which I incorporate in almost all visualizations, is a data tip on mouse hover. Data tips provide a great way to give more detailed information to a user without overwhelming them with detail. In this case the data tip shows categorical spending across Federal, State, and Local levels so the user can see the numbers behind the visualization and compare all spending levels at once. To further aid the user, the particular locality that is being visualized is also highlighted in the data tip with the darker shaded box. This then allows the user to more easily correlate the numbers with what is being shown.

US Trade Deficit:

One of my main goals in visualizing the US Trade Deficit was to help people understand what is meant by a trade deficit.   At a basic level the trade deficit shows that the US pays for (imports) more goods than it sells (exports.) I wanted the overall visualization to show the smaller amount of money coming into the country and larger amount of money going out.  This was the overall design goal when coming up with the visual and thus the motivation for the shapes that I decided upon.

My initial sketches had a circle in the middle with arcs/rays of money flowing into the circle and arcs/rays flowing out of the circle, with the circle representing the US, exports (money flowing in to the US) at the top, and imports (money flowing out of the US) at the bottom. I struggled a bit with how to show imports and exports, as the flow of goods is the opposite of the flow of money (we pay to bring goods in and get paid to send goods out). I ended up coming down on the side of showing the money flowing in and out.

When I started prototyping I played around with using country flags/colors as symbols by country labels and a bigger US icon in the middle. I found this to be too literal a translation that proved distracting to the eye, and I ended up gravitating to a more abstract and subtle use of shapes and colors.

Initially I had a circle of arc segments radiating from a center point. But having a homogeneous circle with all arcs sharing the same radius and centroid did not offer enough differentiation (outside of color) between imports and exports. I wanted to separate imports and exports more visually with shape. First I gave exports 70 degrees of arc at the top and imports 180 degrees at the bottom, but that wasn’t enough visual separation. So then I created two different arcs with the same radius but shifted the center point of each arc on the y-axis by about 100 pixels, but that still wasn’t enough separation. I finally adjusted the top (exports) arcs with a slightly tighter radius and separated them further. This finally gave me the overall differentiation I was seeking with the shapes themselves. To make it easier to see the greatest exports and imports I arranged them in a counter-clockwise direction, with the size of the import/export for a given country decreasing clockwise. This allows the user to always look at one end of the exports/imports to determine which country is the greatest, and at the other end for which is the least. Much like the Political Influence piece, I capped the visual to the top 20 importers/exporters.
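The ranking and placement described above might be sketched as follows. This is an illustration under assumptions rather than the layout code behind the visualization: the function name layout_arcs is hypothetical, each country is simply given an equal angular slot, and the direction in which angles increase is left to the drawing code.

```python
def layout_arcs(totals, start_deg, sweep_deg, top_n=20):
    """Rank countries by trade value and assign each an angular slot.

    totals maps a country name to its import (or export) value. The
    largest country gets the first slot, so one end of the arc always
    holds the greatest value and the other end the least.
    Returns a list of (country, slot_start_deg, slot_end_deg) tuples.
    """
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:top_n]  # cap the visual to the top N countries
    if not ranked:
        return []
    slot = sweep_deg / len(ranked)
    return [
        (name, start_deg + i * slot, start_deg + (i + 1) * slot)
        for i, (name, _value) in enumerate(ranked)
    ]
```

The same function can be called twice, once for the exports arc and once for the imports arc, each with its own start angle and sweep.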

Since I wanted the focus to be on the overall deficit I made a choice to position the actual numeric deficit in the center of the visual, and I wanted to show it growing over time so I made it the cumulative deficit. Because I wanted to show the deficit month-by-month over a period of years I once again used animation to show the change in data values as the visualization automatically advances to a new data set each month. To help emphasize the growing deficit I have the center numeric label grow in font size as the value increases, another subtle indicator calling out the importance of this growing number.
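A running total and a label that grows with it can be sketched in a few lines. The function names and the point sizes below are assumptions for illustration; the actual font range used in the visualization is not stated.

```python
from itertools import accumulate

def cumulative_deficit(monthly_imports, monthly_exports):
    """Running total of (imports - exports), one entry per month."""
    monthly = (i - e for i, e in zip(monthly_imports, monthly_exports))
    return list(accumulate(monthly))

def label_font_size(value, max_value, min_pt=18.0, max_pt=48.0):
    """Scale the center label linearly between min_pt and max_pt."""
    if max_value <= 0:
        return min_pt
    t = min(max(value / max_value, 0.0), 1.0)  # clamp to 0..1
    return min_pt + t * (max_pt - min_pt)
```

Here max_value would be the final cumulative deficit in the data set, so the label reaches its largest size at the last month shown.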

Originally for the animation, I simply had the labels for each arc, and the arcs themselves, changing in text and size as they swapped positions (from greatest to least). But the result was that it became very difficult to see how countries had moved up or down in their relative position of exports/imports; there was just a lot of movement going on that wasn’t very fluid. The user would just see labels changing and arc sizes changing but had no way to track movement, because the arcs did not animate from one position to the next. I was not satisfied with this and had to re-engineer the whole visual to support a more intuitive animation. I wanted the arcs and labels to animate into position from their current position while maintaining the greatest to least order. After this rework of the code I was able to support a more intuitive animation where you see countries moving into new positions month by month, and the user can glean more insight into how our top importers and exporters swap positions and move on a regular basis.
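The key change described above is that each arc is tracked by country rather than by slot, so it can travel from its old position to its new one. A minimal sketch of that idea, with a hypothetical function name and plain linear interpolation assumed:

```python
def interpolate_positions(old, new, t):
    """Blend each country's angle from its old slot to its new slot.

    old and new map a country name to an angle in degrees; t runs from
    0.0 (start of the transition) to 1.0 (end). Countries that were not
    on screen before simply appear at their new position.
    """
    frame = {}
    for country, target in new.items():
        start = old.get(country, target)
        frame[country] = start + (target - start) * t
    return frame
```

Calling this once per animation frame with an increasing t produces the effect of countries sliding past each other instead of labels being swapped in place.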

The time scale was an element of the visualization that I wanted to serve as an indicator of what point in time the visual represented but also as a way for the user to interact with the data.   I came up with a simple fish-eye algorithm of tick mark sizes for the months and labels for the year so the user can quickly glance at the timescale to know the year and month – with the label for the month animating into view.  In order to interact with the time scale the user can click on a specific year to move the visual forward or backward in time and see the deficit for the first month of that year.  The user can also use the “play/pause” button to activate or deactivate the automated animation.
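One simple way to get a fish-eye effect on tick marks is to make each tick's height fall off with its distance from the current month. The sketch below assumes a linear falloff and hypothetical parameter names; the actual algorithm and sizes used in the visualization are not described in detail.

```python
def fisheye_tick_heights(n_ticks, focus, base=4.0, peak=16.0, radius=3):
    """Return a height for each tick, tallest at the focused index.

    Ticks further than `radius` positions from the focus stay at the
    base height; ticks inside that window grow toward the peak height.
    """
    heights = []
    for i in range(n_ticks):
        distance = abs(i - focus)
        t = max(0.0, 1.0 - distance / radius)  # 1 at focus, 0 outside
        heights.append(base + t * (peak - base))
    return heights
```

Recomputing the heights as the focus index advances each month gives the appearance of a magnified region sliding along the time scale.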

Color also played a role in this visualization, helping to clarify the imports versus exports. I don’t like to rely upon color as a primary visual indicator (although contrast and grey-scale shades can be used that way). In this case I wanted the exports (money coming in) to be “good”, thus green, and the imports (money going out) to be “bad”, thus red. Having these color bands meet in the center of the visual, feeding the growing numeric deficit label, further reinforces this metaphor.

Like most visualizations I also employ data tips so the user can get further detail if desired by mousing over elements on the screen.