Why graph? And why, in particular, use innovative and unfamiliar graphing techniques? I started this blog without addressing these questions, but a recent blog post by Adam Crymble, critical of “shock and awe” graphs made me realize the need to explain EDA (Exploratory Data Analysis) and data visualization. Crymble wisely challenged data visualization practitioners to ask themselves the following questions: “Is this Good for Scholarship? Or am I just trying to overwhelm my reviewers and my audience?” This is sound advice, and Crymble’s concerns strike me as genuine. But, upon reflection, his post led me to think that “shock and awe” are evitable parts of any bold scholarly intervention. Feminist scholarship provoked genuine anger when it asserted that academic conventions were rife with sexist assumptions. The linguistic turn alarmed traditional scholars with its new understandings of literary production. Certainly these interventions produced (and continue to produce) needlessly complex, derivative prattle. But can anyone seriously argue that the humanities are not richer for these intellectual challenges?
What follows, therefore, is a defense of “shock and awe”: a justification for data visualizations that are unfamiliar, challenging, and demand news ways of thinking.
Why graph instead of just showing the numbers?
By just “show the numbers,” humanities researchers often refer to tables. The problem with this preference for tables it that is assumes that tables are somehow more transparent and accessible than graphs. In fact, the opposite is true. A list of data values is like a phone directory: a wonderful way to look up individual data points, but a terrible means of discerning or discovering patterns. (Kastellec and Leoni 2007; Gelman, Pasarica, and Dodhia 2002) Alternately, a table of individual data points is analogous to collection of primary text sources: it’s the raw material of research, not research. Further most published tables are not transparent, “raw” data. On the contrary, tables in most research consolidate observations into groups, listing, for example, average wages for “skilled craftsman in Flanders 1830-35,” or “Osaka dyers 1740-80.” But why those years ranges and those occupational categories? Why 1830-35 instead of 1830-1840? Why Osaka dyers and not the broader category of Osaka textile workers? Those groupings may be conceptually valid, but they are interpretative and preclude other interpretations. Certainly we can lie with graphs, but we can also lie with tables. And since a good graph is better than the best table, DH researchers need to use good graphs.
Why these novel, unfamiliar graphs?
The data visualization movement has certainly produced some bad graphs —obfuscating rather than illuminating. But it is impossible to argue that newer graph forms are more misleading than the status quo. The pie chart, for example, is easy to misuse and the many variants supported by Excel are simply awful. With a 3D exploding pie chart, even a novice can make 5% look larger than 10% or even 15%. Can you correctly guess the absolute and relative sizes of the slices in this graph?
(See answers below). Since pie charts are familiar, they are accessible, but that simply makes them easier to misuse. Are conventional bad graphs such as pie charts “better” than newer chart forms because they provide easier access to faulty conclusions? Is “schlock” worse that “shock”?
My survey of graphing techniques in history journals tuned up an alarming result. Historians rely primarily on graphing techniques developed over 200 years ago: the pie chart, bar chart, and line chart. It is hard not to shock the academy with strange graphs, when “strange” means anything developed in the past two centuries. Many new graphing techniques, such as parallel coordinate plots, are still controversial, difficult to use, and difficult to interpret. But many others are readily accessible and widely used, except in the humanities, The boxplot, developed in 1977 by John Tukey, is now recommended for middle school instruction by The National Council of Teachers of Mathematics. The intellectual pedigree of the boxplot is beyond question: Tukey, a professor of statistics at Princeton and researcher at Bell Labs, is widely considered a giant in 20th century statistics. So, what to do when humanities researchers are flummoxed by a boxplot? I now append a description of how to read a boxplot, but isn’t it an obligation of quantitative DH to push the boundaries of professional knowledge? And shouldn’t humanities Ph.D.’s have the quantitative literacy of clever eighth graders? In short, since our baseline of graphing skills in the humanities is so outdated and rudimentary, there is no avoiding some “shock and awe.”
A graph in seven-dimensions? What are you talking about? You must be trying to trick me!
Certainly “seven dimensions” sounds like a conceit designed to confuse the audience, or intimidate them into acquiescence. But a “dimension” in data visualization is simply a variable, a measurement. Decades ago Tufte showed how an elegant visualization, Menard’s graph of Napoleon’s invasion of Russia, could show six dimension on a 2D page: the position of the army (latitude and longitude), size of the army, structure of the Russian army, direction of movement, date, and temperature. Hans Rosling’s gapminder graphs use motion to represent time, thereby freeing up the x-axis. By adding size, color and text, Rosling famously fit six dimensions on a flat screen: country name, region, date, per capita GDP, life expectancy, and total population. These are celebrated and influential data visualizations, the graphic equivalents of famously compelling, yet succinct prose. While Crymble assumes that needlessly complex graphics stems from bad faith (a desire to intimidate and deceive), I am more inclined to assume that the researcher was reaching for Menard or Rosling but failed.
“How do you know there hasn’t been a dramatic mistake in the way the information was put on the graph? How do you know the data are even real? You can’t. You don’t.”
This concern strikes me as overwrought and dangerous. Liars will lie. They will quote non-existent archival documents, forge lab results, and delete inconvenient data points. When do we discover this type of deceit? When someone tries to replicate the research: combing through the archives, running a similar experiment, or trying to replicate a graph. How are complex graphics more suspect, or more prone for misuse than any other form of scholarly communication? Is there any reason to be more suspicious of complex graphs than any other research form?
I can optimistically read Crymble’s challenge as a sort of graphic counterpart of Orwell’s rules for writers. But Crymble seems to view data viz as uniquely suspect. To me this resembles the petulant grousing that greeted Foucault, Derrida, Lyotard, Lacan, etc some three decades ago – “what is this impenetrable French crap!” “You’re just talking nonsense!” Certainly many of those texts are needlessly opaque. But much of it was difficult because the ideas were new and challenging. The academy benefitted from being shocked and awed. Data visualization can and should have the same impact. The academy needs to be shocked — that how change works.
Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s Practice What We Preach: Turning Tables into Graphs.” The American Statistician 56 (2): 121-30.
Kastellec, Jonathan P., and Eduardo L. Leoni. 2007. “Using Graphs Instead of Tables in Political Science.” Perspectives on Politics 5 (4): 755-71.
The pie chart: