I had the pleasure of attending the ie Data Viz Summit in Santa Clara, CA on April 6th and 7th. While applicable data visualization tips were few and far between, there was a richer undertone that allowed for a more robust overall set of takeaways. Below are my notes, annotated with my musings. I make no claim to this being original content, nor any claims regarding the accuracy of these notes; think of them as my own interpretive, regurgitated remix.
David Longstreet: Microsoft
Motion Graphics and Visual Explanations
Mybooksucks.com: “Party More Study Less!”
Location, category, motion, contrast, size, sequential
Data Science is one of the “sexiest jobs.” -Harvard Business Review, October 2012
Data Science is…
Statistics, economics, and psychology, but mostly psychology
The best analogy for a data scientist is a gold miner.
SQL, Excel, C#, Python, R, SPSS, Tableau, Illustrator/Photoshop
Better tools allow you to do bad things faster. Visuals that look really cool but tell us nothing are an example of that. Visualizations are not new; they are just becoming more prominent and relevant in our lives.
Experts are great at building mental visualizations; Novices have trouble building those mental visualizations. Visualizations reduce the cognitive overhead, making it easier to bridge that expert-novice gap.
Motion graphics allow these mental visualizations to be picked up with greater ease. Showing the effects on price and quantity is a lot easier when the entire demand line visibly shifts relative to its original position along supply. Or showing standard deviation in terms of the normal distribution: 95% "normal" and the outliers being 2.5% of "luck" on each side of the distribution.
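As a quick sanity check on those figures (this check is mine, not from the talk): the 95% / 2.5% split corresponds to roughly +/-1.96 standard deviations of a normal distribution, which Python's statistics module confirms:

```python
from statistics import NormalDist

# The 95% / 2.5% split quoted in the talk corresponds to +/-1.96
# standard deviations of a standard normal distribution.
nd = NormalDist(mu=0, sigma=1)
within = nd.cdf(1.96) - nd.cdf(-1.96)
each_tail = nd.cdf(-1.96)

print(f"within +/-1.96 SD: {within:.3f}")    # 0.950
print(f"each tail:         {each_tail:.3f}")  # 0.025
```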
A single day of the week offers greater insight into trends; by grouping on a single day, correlations become easier to visualize. Tuesday is the best day for this.
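A minimal sketch of that single-day grouping in pandas (the DataFrame and metric here are hypothetical, invented for illustration):

```python
import pandas as pd

# Hypothetical daily metric; filtering to a single weekday (e.g. Tuesdays)
# strips out weekly seasonality so week-over-week trends are easier to see.
df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=90, freq="D"),
    "value": range(90),
})
tuesdays = df[df["date"].dt.day_name() == "Tuesday"]
print(tuesdays["value"].tolist())
```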
Universal Principles of Design
The Laws of Simplicity
Create a report and a presentation; documentation is a bad presentation and a presentation is bad documentation. A report will allow for all of the information and analysis to be available to those who desire a greater level of understanding (i.e. specialists), while the presentation allows for everyone to understand the information in a palatable way (i.e. stakeholders). A 10:1 ratio of time spent working on a presentation to analysis is typical.
An advanced form of A/B testing, measuring elapsed time and correlating it with user understanding of the information presented, is the best way to establish consistent metrics for gauging that understanding. Really, the best test is to always step back, imagine you are a novice, and then present at a normalized novice level.
Krist Wagsuphasamat: Twitter
From Data to Visualizations: What Happens in Between?
Visualizations are the power of the eyes, allowing one to understand data and discover hidden facts.
Exploratory Data Analysis
1. What type of data is it?
Whether working with 2 to n dimensions or network maps, the visualization chosen must match the data that it is being powered by.
A user's intents should be anticipated and met with tooltips, filters, highlighting, and whatever else the visualization and data support, within reason.
2. What do I need from the data?
Who is my audience? (Data scientists or executives)
What are the goals? (Storytelling implying fewer interactions; exploratory providing a richer experience)
Tableau, Gephi, NodeXL
MapReduce (to change data formats or to match data needs)
To Visual Production:
Preprocess the data
What data do I have?
What do I want from the data?
Pick appropriate tools/visuals
Ahmad R Yaghoobi: Boeing
Visualization at Boeing
Vision is the broadest communication channel to the human brain, acquiring approximately 80% of all data communicated to a human. The brain must process data to gain insights.
In the early 2000s Boeing created IVT, which is a real-time massive-model-rendering part repository. IVT allowed the entire process to be modeled in a way that lets interactions between parts be explored in a full flight simulator, enabling users to change parameters on the fly.
Analysis vs Analytics
Analysis is the detailed examination of data collections of any size and number.
Analytics is the science of analysis: the cognitive process that allows data to be explained. Analytics is the set of tools and techniques that allow common patterns to be identified; the overall goal of analytics is to let people reach an optimal or reasonable decision based on existing data. Visual analytics can be defined as the science of analytical reasoning facilitated by interactive visual interfaces: rapidly exploring large, complex datasets to gain new business insights using interactive visualization. Essentially, visual analytics lets us detect the expected and discover the unexpected.
Types of analysis (several of these are well suited for advanced visual analytic tools that allow interactive exploration and assessment of complex data):
Situational Awareness: Command and control
Tracking and Visibility: Determine status
Causal Chain Analysis: Determine why something happened
Hypothesis Testing: Explore possible explanations
Detecting Anomalies and Correlations: Prevent event occurrences
Prediction/Forecasting: Improve quantities ordered
What-if Studies: Explore alternatives
Summarizing: Communicate results
Minimum Viable Visualization requires the following:
A place to host the data or the ability to generate the data on the fly.
A mechanism with which a supplier or customer can navigate to and retrieve the model they want to view.
Low cost visualization tools
Identification of a suitable data format
High-end Visualizations requires the following:
‘Instant’ load time (less than 1 minute)
Performance capabilities to interactively manipulate up to 1 billion polygons at 10 Hz or faster
Product structure with 200,000 separate parts and 2 million instances
Selection and feedback in less than 0.25 seconds
Access to internal data can be problematic due to:
Unclear understanding of the benefits (Managing Expectations)
Organizational preservation/immune responses
Quality of the data can be unfit for analysis due to:
Too little data (Managing Expectations)
Inconsistent or improper formatting (Managing Expectations)
Irrelevance to the analysis task
Text provides numerous opportunities for misspelling and vagueness
The best way to overcome these is patience; leveraging the organic situation is the best way to promote your goals. Culture will evolve better with open data.
“Moving advanced technology into practice is a contact sport.”
Clayton Clouse: Fedex
The most powerful takeaway from this talk was using a word cloud-map as a tool in the analysis of rhetoric, for clustering terms and gauging overall tone. Hovering over a word should present the user with its frequency distribution.
Gephi and R
Stanford, Yahoo Labs, Datamob
Sudeep Daus: What Diners are Talking About
Topic/trend analysis based on geographic segmentation
Identify the topic and categories
Map topics back to restaurants
For each restaurant and topic, surface relevant reviews.
Topic Clustering Analysis:
Restaurants by topics; Food, Drinks, Ambiance, Value, Service, and Special Occasions. Generic topics are ignored. Manually label the first crop of topics and then use a distance-based classifier to relabel any new crop of topics against the labeled ones. The distributed weight of each review across each topic is then combined to identify the top topics for an individual restaurant.
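The relabeling step described above could be sketched as a nearest-centroid classifier; the vector representation, labels, and numbers here are hypothetical, invented for illustration:

```python
import numpy as np

# Hand-labeled first crop of topics, represented as vectors
# (e.g. word-distribution vectors); labels are illustrative.
labeled = {
    "Food":    np.array([[1.0, 0.1], [0.9, 0.2]]),
    "Service": np.array([[0.1, 1.0], [0.2, 0.9]]),
}
centroids = {name: vecs.mean(axis=0) for name, vecs in labeled.items()}

def relabel(topic_vec):
    # Assign a new, unlabeled topic the label of the nearest centroid
    # by Euclidean distance.
    return min(centroids, key=lambda n: np.linalg.norm(topic_vec - centroids[n]))

print(relabel(np.array([0.95, 0.15])))  # Food
print(relabel(np.array([0.10, 0.90])))  # Service
```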
Regional Nuances: The interesting point that surfaced was the attention to differences in how regional dialects affect what we call things. For example, the way a view is described reveals much about the surrounding area. On Sunday in the US the meal that consists of breakfast and lunch is called Brunch, but in London it is referred to as Sunday Roast. Some topics are regionally dominant.
Rank restaurants by topic: Reviews are scored according to the weight of that topic to surface the top reviews. Finding relevant restaurants based on topic is now possible; “Find me all restaurants with a great view.”
Future plans include recursive sentiment analysis
Trending topic sentiments
Surfaces new keywords for new demographics
Latent Dirichlet Allocation (LDA) (Blei et al. 2003)
Non-negative Matrix Factorization (NMF) (Arora et al. 2012)
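A toy illustration of the LDA side of this pipeline using scikit-learn; the corpus and parameters are invented for the example and are not OpenTable's actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for restaurant reviews (illustrative only).
reviews = [
    "great food amazing pasta delicious dessert",
    "tasty food wonderful pizza delicious sauce",
    "slow service rude waiter long wait",
    "friendly service quick waiter attentive staff",
]
counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each review gets a weight distribution across topics; combining these
# per restaurant surfaces its top topics, as described above.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```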
Apache Spark is the future of production-ready analysis systems like this.
During the QA session a great point was made: the data they were using might be skewed by the fact that people on the internet tend to leave reviews only when the experience was either really bad or really good. And relative context becomes harder to analyze in an age when people are excessively sarcastic. The distribution curve widens on the sentiment of a term like "how silly," as it can mean a whole lot of things to a whole lot of people.
Building Distributed Systems: Kapil Surlaker
3 States of Change:
Online - reflected immediately; updates indexes and stores at the same time
Nearline - reflected soon (not real time, but not in days); batches the update to the index and stores, with the stores' ETL recursively adding to the batch processes to correct any errors
Offline - reflected later; batches the update to the index and stores, with the stores' ETL recursively adding to the batch processes to correct any errors
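A minimal sketch of the change-capture idea underlying those three states (the mechanics here are assumed for illustration, not from the talk): a shared change log feeds an online consumer immediately and a batch consumer later.

```python
from collections import deque

# Minimal change-log sketch: the source of truth appends changes; an
# online consumer applies each change immediately, while a nearline or
# offline consumer drains them in batches later.
change_log = deque()
online_index = {}
batch_index = {}

def write(key, value):
    change_log.append((key, value))
    online_index[key] = value          # online: reflected immediately

def run_batch():
    while change_log:                  # nearline/offline: reflected later
        key, value = change_log.popleft()
        batch_index[key] = value

write("user:1", "alice")
write("user:2", "bob")
print(online_index)   # already current
run_batch()
print(batch_index)    # caught up after the batch
```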
INSERT DATASTACK POCKET KNIFE VISUAL HERE
Top 5 Lessons Learned While Building
5. Change Capture is Cool!
Having data flowing between these systems is critical
Data consistency with the source is critical: Change log system that propagates the changes with no points of failure
Source of truth >> change >> Change Processor >> derived change >> derived data: Serving System
Zero loss, low latency, high availability
Have changes propagate to a database node before being pushed upstream to the data store; these changes have to be managed from the frontend perspective.
4. Stop Re-inventing and Start Inventing
Issues will arise that are not solvable with traditional models when scaling in the midst of designing, building, deploying, and maintaining complex distributed systems. The replication of effort that took place warranted generalizing the problem (like ZooKeeper), but it is harder to make it all work in production. High-level primitives must be made available through concepts more abstract than traditional locks, etc. Defining the transition of behaviors is crucial to having consistency when using these higher-level primitives (for example, allowing no more than 5 of your nodes to be doing replication). This base-level abstraction of states and transitions allows the solution to be deployed efficiently across numerous tools company-wide.
3. 8 Habits of Highly Effective Distributed Systems
Never assume slides will be made available!
2. Less is More
Usage parity, not feature parity
Don't shoot for full feature-replication parity; rather, aim for use-case feature parity. The more features you add to the data layer, the less predictable that system becomes, which makes planning and scaling difficult.
1. It’s not real until it is in Production!
No matter how much planning takes place, issues will arise.
Big Data Platform of the World's Largest Fan-to-Fan Marketplace
Sastry Malladi: Stubhub
Big Data is defined by its use cases. The best way to understand what big data is, is by examining the 4 V's of big data.
Variety: structured, unstructured, semi-structured; different data types and formats, which are really hard to represent in a traditional data store.
Volume: The sheer amount compared to a traditional transactional system
Velocity: The rapid changing of the data and how relevant it is before decay sets in
Veracity: The truthfulness of the data
Manually curate the identification of the metadata, then automate it with machine learning.
How to pick a proper Data-stack Distro:
Manageability: Seamless automated ingestion
Open Source-Compliant: must have the openness and interoperability of an evolving system.
Scalability: Being locked into a system is not conducive to future scaling.
Integration with Viz tools: Business and analyst teams want to be able to interact with the data on a visual-level.
Adaptability: Data Import and ease of consuming between services/channels.
Flexibility: Data Export and ease of sharing between services/channels.
During ETL, transforming takes on a "cleaning" role. Cleansing the data may produce inconsistent results when two different systems perform that cleansing; the solution is for the ETL cleansing to happen once and then propagate the cleaned data to the other systems.
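A sketch of that "cleanse once, propagate everywhere" pattern; the function, field names, and downstream systems are illustrative, not StubHub's actual pipeline:

```python
# One transform function owns the cleaning rules, and every downstream
# system receives its output, so no two systems apply slightly
# different rules. (Illustrative only.)
def cleanse(record):
    return {
        "name": record["name"].strip().title(),
        "price": round(float(record["price"]), 2),
    }

downstream = {"warehouse": [], "search_index": []}

def ingest(record):
    clean = cleanse(record)            # cleaned exactly once
    for system in downstream.values():
        system.append(clean)           # identical copy to every consumer

ingest({"name": "  giants tickets ", "price": "19.999"})
print(downstream["warehouse"][0])  # {'name': 'Giants Tickets', 'price': 20.0}
```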
Building a Data Science Team
Todd Holloway: Trulia
Trulia is the big data company of real estate, offering its employees the unique opportunity to dig into data that has not previously been combed through in the ways Trulia does.
A typical Data Scientist at Trulia has the following skills:
Storytelling, BI/Analyst, and Product Dev.
The data science teams are broken into sections based on relevant skills and interests:
Data Science Team
1. Transform data into new context
2. Improve content relevance
3. Improve monetization
What they look for in a person:
Passion when discussing a past project(s)
Opinions about the product
Willingness to get involved in the non-modeling parts of the job – product, engineering, operations
(Desired Team / Raw )Skills:
Prospective data scientists should have attended at least 2 of the following conferences: AAAI, SIGIR, NIPS, and KDD.
They started hosting meetups to find talent; these are two of the largest data science meetups anywhere.
SF Data Mining @ Downtown SF: 5200+ members, 38 meetups (most at Trulia), held two job fairs
Bay Area Data Viz: 4500+ members, 40 past meetups (many at Trulia)
Workflow: a 0-6 month chronological window; steps 1, 2, 3, 4, 5, and then 5, 3, 2, 1 when 1, 2, 3, 4, 5 doesn't work.
1. Understand the problem
2. Process Data
Trulia gets their city/community crime data by simply asking for it to be opened up as an API/web service/web-scrapable content. Peter Black, who used to work at Stamen Design, was the gentleman doing this sort of work before Trulia hired him.
——————————-BEGIN DAY TWO—————————————————————-
Heidi Roller: FOX Sports
Data Designer as Storyteller
Start with a good concept; then research and distill deep while maintaining a scope that is audience appropriate. Produce a report that allows for the “vision” to be relayed by anyone who reads it.
The transition to infographics began with the publishing of an internal report. Having the low accountability and overhead of an internal report being turned into an infographic allows for a greater degree of freedom when being creative. Over time the infographics have grown in size, complexity, and scope.
Rules Distilled from Storytelling Expert Bruce D Perry, M.D., Ph.D:
1. State the fact, unrelated: George Washington was 6’4” tall
2. State the relation: The average height of men during the Revolutionary War was only 5’4”
3. Set the scene: Washington, at the darkest moment in the Revolution, when his soldiers were deep in the despair of defeat, starving and freezing at Valley Forge, slowly rose to his full height and,
4. The what: using his dominant personality was able to motivate his discouraged soldiers to re-enlist and continue fighting.
Key moments can be articulated chronologically using a photogrid timeline. Use good storytelling to spin the ending in the desired tone.
A Good vs Evil dichotomy makes for an easy/great infographic. When appropriately relating entities, they don't have to tell a full story; maybe just an excerpt or a prequel.
“Find the story narrative that fits the data using organic archetypal story narratives.”
Visualizations should be vetted through the end user to determine understanding.
Passion and creativity are greater than technical skill alone.
Take baby steps when creating visuals.
Put yourself into the technical-skill person's role; walk a mile in their shoes.
Typical time spent on a visualization is 1 day to 2 weeks.
Users have time to read at most 28% of the words during an average website visit.
83% of human learning occurs visually.
The brain processes images 60,000 times faster than text alone.
“Facts are empty without being linked to context and concepts.” -Bruce D Perry, M.D., Ph.D
Brian Wilt: Jawbone
Dream Big: Data Visualization at Jawbone
Artwork completing/complementing an object to make the experience better.
…Tells a story.
…Makes you feel good.
Data Stories – Sleep Elevator Pitch: 60 million hours of sleep, 600 billion steps measured with Jawbone
Correlations should reflect the real impact of social events rather than the classical lab conditions of typical correlative studies. While those studies produce a good starting point, the volume of data being captured makes the implied correlations clear (albeit segmented to those who can afford a $150 fitness tracker).
Data products allow consumers to see relations/insights; the product side of things allows for those users of the product to test correlations and provide a healthy feedback loop into the insights being provided. In the case of UpCoffee, having the users provide insights into the types of caffeinated beverages they were consuming provided valuable feedback into the who, what, and when.
Providing identity to see how you stack up against others is critical. Having this information be easily sharable is the best approach to maximizing user engagement.
Make light of unpredictable insights and provide prose to explain them, using data science to leverage a better marketing position.
Krystal St Julian: Modcloth
Data Visualization GUIs and the Advantage of Teaching Non-Expert Analysts
Pick it or skip it: A feature that integrates feedback into process of creating custom clothing
Create training specific to the tools being used internally to give the whole team a chance at getting up to speed. The purpose of this training is to reduce arbitrary BS work and to focus on innovating the future, especially given the hurdles that come with teaching non-technical stakeholders as "first-time analysts."
Misunderstood jargon/misaligned communication
What metric gives me the data I need? What do the values translate into? How should I use this information?
Difficulty approaching the data/asking relevant questions
Ask stakeholders to create a particular visualization, then, ask the stakeholder to explain which problems are solvable with the data shown.
Lack of knowledge of the tool’s full capability and the data available
Enticing visuals make all of the difference.
The training phase consists of a main training block and follow-up sessions, grouped by team and/or topic. Offer office hours to allow a window for disruptions; the rules for office hours are that they relate to specific topics, and everyone is encouraged to share what they have been doing as a way to instill confidence. Once everyone is up to speed, the training need only move into the support phase, where office hours are the primary method of instruction.
It is at this point that you and your team have moved from Reports to Insights.
Michael Pell: Microsoft
Information as Interface
Self-evident visuals make it clearer to get to the what of what's wrong, so the optimal action is readily apparent. Information is a gradient that includes wisdom, knowledge, insight, visuals, writing, music, art, data, thing, atom; the visuals, writing, music, and art are our primary methods of interacting with the information.
Visualization is the new interface, but we are focusing on the wrong things. Initial impressions, instead of deep understanding. Novelty, instead of well understood patterns. Cleverness, instead of predictability. Data density, instead of surfacing insights. Parameterization, instead of flexible exploration. Mouse/keyboard or touch, instead of multi-modal input. Competition, instead of community collaboration. Beauty, instead of clarity.
“The beauty we seek is not found in first impressions, but rather in how the truth reveals itself.”
Reaching that moment of clarity faster and easier should be the goal.
Clear, not clever. Insightful, not dense. Fluid, not constrained.
Provide a high-level summary, but allow one to drill down into the full article. Then even further: being able to drill down and understand the key points' sources.
The layering of data density is: overview, insights, and then details.
"Just tell me" visuals are for consumers; SMEs love the abstract details and the data.
“Function follows form (when interacting with content)”
The best way to provide experience standardization is by providing a… Beautiful clear experience, rather than a clearly beautiful experience.
Elegant disclosure; transitions between experiences should be fluid and smooth.
Provide insights, not just the data. "200 GB, 2 GB FREE" should be "2 days until storage is full."
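Turning that raw number into the insight is simple arithmetic; a toy version, assuming a recent average fill rate is known (the function and figures are mine, for illustration):

```python
# "2 GB free of 200 GB" is data; "2 days until storage is full" is the
# insight. Convert one into the other from an observed fill rate.
def days_until_full(free_gb, avg_gb_per_day):
    if avg_gb_per_day <= 0:
        return None  # not filling up; no deadline to report
    return free_gb / avg_gb_per_day

print(f"{days_until_full(2, 1.0):.0f} days until storage is full")  # 2 days
```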
Being able to drill down and expose patterns should be a matter of wanting to drill wider or deeper into the data. We need to shift our focus to designing and engineering to reach that moment of clarity faster.
Reference successful transformations.
History of great design; the essence of great design.
Lead with the insight, backup with the data
The burden of the filter bubble should be placed on the author to supply the meta data, which will make the tailoring of that meta data more specific to the consumer.
When working with SME’s force them to boil down their domain expertise to one simple word or sentence.
Vikas Sabnani: Glassdoor
Building Data Products: There’s More to it Than Just Data
The categories/levels of data products are:
1. Data is the product: Some aggregation takes place, but for the most part the data is the product
2. Data forms the core fabric of the product: Learning from the data to provide insights. The Nest thermostat is a good example.
3. Data is invisible to the product: The product is constantly learning and providing recommendations without prompting. Google self-driving car is an example of this.
Teams consist of six people: three engineers, one data scientist, one designer, and one product manager. These teams are run by realistic benchmarks. The right personalities need to be in place for this to work.
Get good benchmarks and make good assumptions; create ambitious goals.
Balancing pricing vs user acquisition: Start small and scrappy. Iterate really really fast. Don’t piss off (a lot of) users.
Email Case Study: 24% improvement.
"You must be willing to cannibalize your product." This is in regard to maintaining a legacy system and the best way to sunset it. You can build the world's best recommendation system, but not offering it in the right context will impede its full potential. After moving recommendations into the email itself, there was a 24% improvement.
Monday Hack Case study: 100% improvement.
A bug was introduced into the system that caused it to surface jobs from all of time. The data was sliced along different dimensions; slicing by day of week finally revealed the highest engagement on Mondays. Few new jobs are posted over the weekend, leaving Mondays with the fewest fresh listings. So on Mondays, poll the last 14 days of jobs rather than the typical 1 day.
Optimizing Send Times Case Study: 15% improvement
Scaling issues had increased the amount of time needed to batch-process the emails. 24% of users were split into 24 buckets to measure engagement times.
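One way such an experiment could be bucketed (the mechanics here are assumed, not described in the talk): hash each user id into one of 24 hourly send-time buckets, then compare engagement across buckets.

```python
import hashlib

# Deterministically assign each user to one of 24 send-hour buckets;
# engagement can then be compared across buckets to find the best
# send times. (Illustrative sketch.)
def send_hour(user_id, buckets=24):
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % buckets

hours = [send_hour(uid) for uid in range(1000)]
print(len(set(hours)))  # number of distinct buckets actually used
```

Hashing (rather than random assignment) keeps a user's bucket stable across runs, which matters when measuring engagement over time.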
When training models, the goal is not to maximize predictions; the data scientist's job is to maximize the success of the product. The data scientist should do whatever it takes to hit product goals, not optimize their models. An okay algorithm and a good product beat a great algorithm and a bad product. Building a data product is a collaborative effort; create shared goals across the team, because that is how the best products get built.