When I started my first job at Amazon.com, as the first analyst in the strategic planning department, I inherited the work of producing the Analytics Package. I capitalize the term both because it was a serious tool for making our business legible and because the job of its production each month ruled my life for over a year.
Back in 1997, analytics wasn't even a real word. I know because I tried to look up the term, hoping to clarify just what I was meant to be doing, and I couldn't find it, not in the dictionary, not on the internet. You can age yourself by the volume of search results the average search engine returned when you first began using the internet in force. I remember when pockets of wisdom were hidden in eclectic newsgroups, when Yahoo organized a directory of the web by hand, and later when many Google searches returned very little, if not nothing. Back then, if Russians wanted to hack an election, they might have planted some stories somewhere in rec.arts.comics and radicalized a few nerds, but that's about it.
Though I couldn't find a definition of the word, it wasn't difficult to guess what it was. Some noun form of analysis. More than that, the Analytics Package itself was self-describing. Literally. It came with a single page cover letter, always with a short preamble describing its purpose, and then jumped into a textual summary of the information within, almost like a research paper abstract, or a Letter to Shareholders. I like to think Jeff Bezos' famous company policy, instituted many years later, banning Powerpoint in favor of written essays, had some origins in the Analytics Package cover letter way back when. The animating idea was the same: if you can't explain something in writing to another human, do you really understand it yourself?
My interview loop at Amazon ended with an hour with the head of recruiting at the time, Ryan Sawyer. After having gone through a gauntlet of interviews that included almost all the senior executives, among them Jeff Bezos and Joy Covey, some of the most brilliant people I've ever met in my life, I thought perhaps the requisite HR interview would be a letup. But then Ryan asked me to explain the most complex thing I understood in a way he'd understand. It would be good preparation for my job.
What was within the Analytics Package that required a written explanation? Graphs. Page after page of graphs, on every aspect of Amazon's business. Revenue. Editorial. Marketing. Operations. Customer Service. Headcount. G&A. Customer sentiment. Market penetration. Lifetime value of a customer. Inventory turns. Usually four graphs to a page, laid out landscape.
The word Package might seem redundant if Analytics is itself a noun. But if you saw one of these, you knew why it was called a Package. When I started at Amazon in 1997, the Analytics Package was maybe thirty to forty pages of graphs. When I moved over to product management, over a year later, it was pushing a hundred pages, and I was working on a supplemental report on customer order trends in addition. Analytics might refer to a deliverable or the practice of analysis, but the Analytics Package was like the phone book, or the Restoration Hardware catalog, in its heft.
This was back in the days before entire companies focused on building internal dashboards and analytical tools, so the Analytics Package was done with what we might today consider as comparable to twigs and dirt in sophistication. I entered the data by hand into Excel tables, generated and laid out the charts in Excel, and printed the paper copies.
One of the worst parts of the whole endeavor was getting the page numbers in the entire package correct. Behind the Analytics Package was a whole folder of linked spreadsheets. Since different charts came from different workbooks, I had to print out an entire Analytics Package, get the ordering correct, then insert page numbers by hand in some obscure print settings menu. Needless to say, ensuring page breaks landed where you wanted them was like defusing a bomb.
Nowadays, companies hang flat screen TVs on the walls, all of them running 24/7 to display a variety of charts. Most everyone ignores them. The spirit is right, to be transparent all the time, but the understanding of human nature is not. We ignore things that are shown to us all the time. However, if once a month, a huge packet of charts dropped on your desk, with a cover letter summarizing the results, and if the CEO and your peers received the same package the same day, and that piece of work included charts on how your part of the business was running, you damn well paid attention, like any person turning to the index of a book on their company to see if they were mentioned. Ritual matters.
The package went to senior managers around the company. At first that was defined by your official level in the hierarchy, though, as most such things go, it became a source of monthly contention as to who to add to the distribution. One might suspect this went to my head, owning the distribution list, but in fact I only cared because I had to print and collate the physical copies every month.
I rarely use copy machines these days, but that year of my life I used them more than in all the days that came before and all the days still to come combined, and so I can say with some confidence that they are among the least reliable machines ever made by mankind.
It was a game, one whose only goal was to minimize pain. A hundred copies of a hundred page document. The machine will break down at some point. A sheet will jam somewhere. The ink cartridge will go dry. How many collated copies do you risk printing at once? Too few and you have to go through the setup process again. Too many and you risk a mid-job error, which then might cascade into a series of ever more complex tasks, like trying to collate just the pages still remaining and then merging them with the pages that were already completed. [If you wondered why I had to insert page numbers by hand, it wasn't just for ease of referencing particular graphs in discussion; it was also so I could figure out which pages were missing from which copies when the copy machine crapped out.]
You could try just resuming the task after clearing the paper jam, but in practice it never really worked. I learned that copy machine jams on jobs of this magnitude were, for all practical purposes, failures from which the machine could not recover.
I became a shaman to all the copy machines in our headquarters at the Columbia building. I knew which ones were capable of this heavy duty task, how reliable each one was. Each machine's reliability fluctuated through some elusive alchemy of time and usage and date of the last service visit. Since I generally worked late into every night, I'd save the mass copy tasks for the end of my day, when I had the run of all the building's copy machines.
Sometimes I could sense a paper jam coming just by the sound of the machine's internal rollers and gears. An unhealthy machine would wheeze, like a smoker, and sometimes I'd put my hands on a machine as it performed its service for me, like a healer laying hands on a sick patient. I would call myself a copy machine whisperer, but when I addressed them it was always a slew of expletives, never whispered. Late in my tenure as analyst, I got budget to hire a temp to help with the actual printing of the monthly Analytics Package, and we keep in touch to this day, bonded by having endured that Sisyphean labor.
My other source of grief was another tool of deep fragility: linked spreadsheets in Excel 97. I am, to this day, an advocate for Excel, the best tool in the Microsoft Office suite, and still, if you're doing serious work, the top spreadsheet on the planet. However, I'll never forget the nightmare of linked workbooks in Excel 97, an idea which sounded so promising in theory and worked so inconsistently in practice.
Why not just use one giant workbook? Various departments had to submit data for different graphs, and back then it was a complete mess to have multiple people work in the same Excel spreadsheet simultaneously. Figuring out whose changes stuck, that whole process of diffs, was untenable. So I created Excel workbooks for all the different departments. Some of the data I'd collect myself and enter by hand, while some departments had younger employees with the time and wherewithal to enter and maintain the data for their organization.
Even with that process, much could go wrong. While I tried to create guardrails to preserve formulas linking all the workbooks, everything from locked cells to bold and colorful formatting to indicate editable cells, no spreadsheet survives engagement with a casual user. Someone might insert a column here or a row there, or delete a formula by mistake. One month, a user might rename a sheet, or decide to add a summary column by quarter where none had existed before. Suddenly a slew of #ERROR's show up in cells all over the place, or if you're unlucky, the figures remain, but they're wrong and you don't realize it.
Thus some part of every month was going through each spreadsheet and fixing all the links and pointers, reconnecting charts that were searching for a table that was no longer there, or more insidiously, that were pointing to the wrong area of the right table.
Even after all that was done, though, sometimes the cells would not calculate correctly. This should have been deterministic. That's the whole idea of a spreadsheet, that the only error should be user error. A cell in my master workbook would point at a cell in another workbook. They should match in value. Yet, when I opened both workbooks up, one would display 1,345 while the other would display 1,298. The button to force a recalculation of every cell was F9. I'd press it repeatedly. Sometimes that would do it. Sometimes it wouldn't. Sometimes I'd try Ctrl - Alt - Shift - F9. Sometimes I'd pray.
One of the only times I cried at work was late one night, a short time after my mom had passed away from cancer, my left leg in a cast from an ACL/MCL rupture, when I could not understand why my workbooks weren't checking out, and I lost the will, for a moment, to wrestle it and the universe into submission. This wasn't a circular reference, which I knew could be fixed once I pursued it to the ends of the earth, or at least the bounds of the workbook. No, this inherent fragility in linked workbooks in Excel 97 was a random flaw in a godless program, and I felt I was likely the person in the entire universe most fated to suffer its arbitrary punishment.
I wanted to leave the office, but I was too tired to go far on my crutches. No one was around that section of the office at that hour. I turned off the computer, turned out the lights, put my head down on my desk for a while until the moment passed. Then I booted the PC back up, opened the two workbooks, and looked at the two cells in question. They still differed. I pressed F9. They matched.
Most months, after I had finished collating all the copies of the Analytics Package, clipping each with a small, then later medium, and finally a large binder clip, I'd deliver most copies by hand, dropping them on each recipient's empty desk late at night. It was a welcome break to get up from my desk and stroll through the offices, maybe stop to chat with whoever was burning the midnight oil. I felt like a paper boy on his route, and often we'd be up at the same hour.
For all the painful memories that cling to the Analytics Package, I consider it one of the formative experiences of my career. In producing it, I felt the entire organism of our business laid bare before me, its complexity and inner workings made legible. The same way I imagine programmers visualizing data moving through tables in three dimensional space, I could trace the entire ripple out from a customer's desire to purchase a book, how a dollar of cash flowed through the entire anatomy of our business. I knew the salary of every employee, and could sense the cost of their time from each order as the book worked its way from a distributor to our warehouse, from a shelf to a conveyor belt, into a box, then into a delivery truck. I could predict, like a blackjack player counting cards in the shoe, what % of customers from every hundred orders would reach out to us with an issue, and what % of those would be about what types of issues.
I knew, if we gained a customer one month, how many of their friends and family would become new customers the next month, through word of mouth. I knew if a hundred customers made their first order in January of 1998, what % of them would order again in February, and March, and so on, and what the average basket size of each order would be. As we grew, and as we gained some leverage, I could see the impact on our cash flow from negotiating longer payable days with publishers and distributors, and I'd see our gross margins inch upwards every time we negotiated better discounts off of list prices.
What comfort to live in the realm of frequent transactions and normal distributions, a realm where the law of large numbers was the rule of law. Observing the consistency and predictability of human purchases of books (and later CDs and DVDs) each month was like spotting some crystal structure in Nature under a microscope. I don't envy companies like Snapchat or Twitter or Pinterest, social networks who have gone public or likely have to someday, trying to manage investor expectations when their businesses are so large and yet still so volatile, their revenue streams even more so. It is fun to grow with the exponential trajectory of a social network, but not fun if you're Twitter trying to explain every quarter why you missed numbers again, and less fun when you have to pretend to know what will happen to revenue one quarter out, let alone two or three.
At Amazon, I could see our revenue next quarter to within a few percentage points of accuracy, and beyond. The only decision was how much to tell Wall Street we anticipated our revenue being. Back then, we always underpromised on revenue; we knew we'd overdeliver, the only question was how much we should do so and still maintain a credible sense of surprise on the next earnings call.
The depth of our knowledge of our own business continues to exceed that of any company I've worked at since. Much of the credit goes to Jeff for demanding that level of detail. No one can set a standard for accountability like the person at the top. Much credit goes to Joy and my manager Keith for making the Analytics Package one of the strategic planning department's central tasks. That Keith pushed me into the arms of Tufte changed everything. And still more credit belongs to all the people who helped gather obscure bits of data from all parts of the business, from my colleagues in accounting to those in every department in the company, many of whom built their own models for their own areas, maintaining and iterating them with a regular cadence because they knew every month I'd come knocking and asking questions.
I'm convinced that because Joy knew every part of our business as well or better than almost anyone running them, she was one of those rare CFOs that can play offense in addition to defense. Almost every other CFO I've met hews close to the stereotype: always reining in spending, urging more fiscal conservatism, casting a skeptical eye on any bold financial transactions. Joy could do that better than the next CFO, but when appropriate she would urge us to spend more with a zeal that matched Jeff's. She, like many visionary CEOs, knew that sometimes the best defense is offense, especially when it comes to internet markets, with their pockets of winner-take-all contests, first mover advantages, and network effects.
It still surprises me how many companies don't help their employees understand the numeric workings of their business. One goes through orientation and hears about culture, travel policies, where the supply cabinet is, maybe some discussion of mission statements. All valuable, of course. But when was the last time any orientation featured any graphs on the business? Is it that we don't trust the numeracy of our employees? Do we fear that level of radical transparency will overwhelm them? Or perhaps it's a mechanism of control, a sort of "don't worry your little mind about the numbers" and just focus on your piece of the puzzle?
Knowing the numbers isn't enough in and of itself, but as books like Moneyball make clear, doing so can reveal hidden truths, unknown vectors of value (for example, in the case of Billy Beane and the Oakland A's, on base percentage). To this day, people still commonly talk about Amazon not being able to turn a profit for so many years as if it is some Ponzi scheme. Late one night in 1997, a few days after I had started, and about my third or fourth time reading the most recent edition of the Analytics Package cover to back, I knew our hidden truth: all the naysaying about Amazon's profitless business model was a lie. Every dollar of our profit we didn't reinvest into the business, and every dollar we didn't raise from investors to add to that investment, would be just kneecapping ourselves. The only governor of our potential was the breadth of our ambition.
What does this have to do with line graphs? A month or two into my job, my manager sent me to a seminar that passed through Seattle. It was a full day course centered around the wisdom in one book, taught by the author. The book was The Visual Display of Quantitative Information, a cult bestseller on Amazon.com, the type of long tail book that, in the age before Amazon, might have remained some niche reference book, and the author was Edward Tufte. It's difficult to conjure, on demand, a full list of the most important books I've read, but this is one.
My manager sent me to the seminar so I could apply the principles of that book to the charts in the Analytics Package. My copy of the book sits on my shelf at home, and it's the book I recommend most to work colleagues.
In contrast to this post, which has buried the lede so far you may never find it, Tufte's book opens with a concise summary of its key principles.
Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency. Graphical displays should
- show the data
- induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
- avoid distorting what the data have to say
- present many numbers in a small space
- make large data sets coherent
- encourage the eye to compare different pieces of data
- reveal the data at several levels of detail, from a broad overview to the fine structure
- serve a reasonably clear purpose: description, exploration, tabulation, or decoration
- be closely integrated with the statistical and verbal descriptions of a data set.
Graphics reveal data. Indeed graphics can be more precise and revealing than conventional statistical computations.
That's it. The rest of the book is just one beautiful elaboration after another of those first principles. The world in one page.
Of all the graphs, the line graph is the greatest. Of its many forms, the most iconic form, the one I used the most in the Analytics Package, has time as the x-axis and the dimension to be measured as the y-axis. Data trended across time.
One data point is one data point. Two data points, trended across time, tell a story. [I'm joking, please don't tell a story using just two data points.] The line graph tells us where we've been, and it points to where things are going. In contemplating why the line points up or down, or why it is flat, one grapples with the fundamental mechanism of what's under study.
It wasn't until I'd produced the Analytics Package graphs for several months that my manager granted me the responsibility of writing the cover letter. It was a momentous day, but the actual task of writing the summary of the state of the business wasn't hard. By looking at each graph and investigating why each had changed in which way from last month to this, I had all the key points worth writing up. Building the graphs was more than half the battle.
So many of the principles in Tufte's book made their way into the Analytics Package. For example, where relevant, each page showed a series of small multiples, with the same scale on X and Y axes, back in the age before small multiples were a thing in spreadsheet programs.
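For anyone building small multiples outside a spreadsheet, the rule that matters is that every panel shares the same axis range, so the eye can compare panels directly. Here's a minimal sketch in Python of how that shared range might be computed. The function name, the padding parameter, and the figures are all made up for illustration; none of it comes from the Analytics Package.

```python
def shared_axis_limits(panels, padding=0.05):
    """panels: {panel_name: list of y values}.
    Returns one (low, high) pair to apply to every panel."""
    values = [v for series in panels.values() for v in series]
    low, high = min(values), max(values)
    margin = (high - low) * padding  # a little breathing room above and below
    return low - margin, high + margin

# Made-up figures, three hypothetical panels
panels = {"Books": [10, 14, 19], "Music": [2, 3, 5], "Video": [1, 1, 2]}
limits = shared_axis_limits(panels, padding=0.0)
```

Applying the same limits to each panel is what keeps a small line in one panel from looking as dramatic as a large line in another.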
Nowhere was Tufte's influence more felt than in our line graphs. How good can a line graph be? After all, in its components, a line graph is really simple. That's a strength, not a weakness. The advice here is simple, so simple, in fact, one might think all of it is common practice already. It isn't. When I see line graphs shared online, even those from some of the smartest people I follow, almost all of them adhere to very little of what I'm going to counsel.
Perhaps Tufte isn't well read enough, his ideas not taught in institutions like business schools that require their students to use Excel. That may all be true, but I prefer a simpler explanation: users are lazy, the Excel line graph defaults are poor, and Excel is the most popular charting tool on the planet.
By way of illustration, let's take a data set, build a line graph in Excel, and walk through some of what I had to do when making the Analytics Package each month.
I couldn't find the raw data behind most charts shared online, and I didn't want to use any proprietary data. My friend Dan Wang pointed me at the Google Public Data Explorer, a lot of which seems to be built off the World Bank Data Catalog, from which I pulled some raw data, just to save me the time of making up figures.
I used health expenditure per capita (current US$). I picked eight countries and used the data for the full range of years available, spanning 1995-2014. I chose a mix of countries I've visited or lived in, plus some that people have spoken to me about in reference to their healthcare systems, but the important point here is that limiting the data series on a line graph matters if the graph is going to be readable. How many data series depends on what you want to study and how closely the lines cluster, how large the spread is. Sometimes it's hard to anticipate until you produce the graph, but suffice it to say that generating a graph just to make one is silly if the result is illegible.
Here's what the latest version of Excel on my Mac produced when I pressed the line graph button after highlighting the data (oddly enough, I found a Recommended Charts dropdown button and the three graphs it recommended were three bar graphs of a variety of forms, definitely not the right choice here, among many places where Excel's default logic is poor). I didn't do anything to this graph, just saved it directly as is, at the exact size and formatting Excel selected.
Not great. Applying Richard Thaler and Cass Sunstein's philosophy from Nudge, if we just improved the defaults in Excel and Powerpoint, the graphic excellence the world over would improve by leaps and bounds. If someone out there works on the charting features in Excel and Powerpoint, hear my cries! The power to elevate your common man is in your hands. Please read Tufte.
As an aside, after the Tufte seminar, I walked up to him and asked what software he used for the graphics in his book. His response? Adobe Illustrator. To produce the results he wanted, he, and presumably his assistants, laid out every pixel by hand. Not that helpful for me in producing the Analytics Package monthly on top of my other job duties, but a comment on the charting quality in Excel that rings true even today.
Let's take my chart above and start editing it for the better, as I did back in my Analytics Package days. Let's start with some obvious problems:
- The legend is nearly the same height as the graph itself
- A lot of the lines are really close to each other
- The figures in the left column could be made more readable with thousands comma separators
- The chart needs a title
I expanded the graph inside the worksheet to make it easier to see (it was the size of about four postage stamps in the sheet for some reason) and fixed the problems above. Here's that modified version.
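The thousands separator fix, at least, is a one-liner if you ever generate axis labels in code rather than in Excel. A minimal Python sketch, where axis_label and its currency argument are names I'm inventing for illustration and the tick values are placeholders:

```python
def axis_label(value, currency=""):
    """Format a y-axis value with thousands separators, e.g. 9403 -> '9,403'."""
    return f"{currency}{value:,.0f}"

# Placeholder tick values, not read from the actual chart
ticks = [0, 2000, 4000, 6000, 8000, 10000]
labels = [axis_label(t, currency="$") for t in ticks]
```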
Excel should add comma separators for thousands by default. The graph is somewhat better, but the labels are still really small, even if you click and expand the photo above to full size. In addition to adjusting the scale of labels and title, however, what else can we do to improve matters?
I began this post just wanting to share the following simple point, the easiest way to upgrade your Excel line graph:
Remove the legend.
That alone will make your line graph so much better that if it's the only thing you remember for the rest of your life, a generation of audiences will thank you.
The problem with a legend is that it asks the user to bounce their eyes back and forth from the graph to the legend, over and over, trying to hold what is usually some color coding system in their short-term memory.
Look at the chart above. Every time I have to see which line is which country, I have to look down to the legend and then back to the graph. If I decide to compare any two data series, I have to look back down and memorize two colors, then look back at the chart. Forget even trying to do it for three countries, or for all of them, which is the whole point of the line graph. In forcing the viewer to interpret your legend, your line graph has already dampened much of its explanatory efficiency.
If you have just two data series, a legend isn't egregious, but it is still inferior to removing the legend. Of course, removing the legend isn't enough.
Remove the legend and label the data series directly on the plot.
Unfortunately, here is where your work gets harder, because, much to my disbelief, there is no option in Excel to label data series in place in a line graph. The only automated option is to use the legend.
If I am wrong, and I would love nothing more than to be wrong, please let me know, but I tried going through every one of Excel's various chart menus, which can be brought up by right clicking on different hot spots on the chart, and couldn't find this option. That Excel forces you to right click on so many obscure hot spots just to bring up various options is bad enough. That you can't find this option at all among the dozens of other useless options is a travesty.
The only fix is to create data series labels by hand. You can find an Insert Text Box option somewhere in the Excel menus and ribbons and bars, so I'll make one for each data series and put them roughly in place so you know which data series is which. Next, the moment of truth.
Select the legend. And delete it.
Undo, then delete it again, just to feel the rush as your chart now expands in size to fill the white space left behind. Feels good.
Next, shrink the actual plot area of the graph by selecting it and opening up some margin on the right side of the graph for placing your labels if there isn't enough. Since people read time series left to right, and since the most recent data is on the far right, you'll want your labels to be there, where the viewers' eyes will flow naturally.
Don't move the labels into exact position just yet. First adjust the sizing of the data labels of the axes and the scale of the graph. Unfortunately, since these text boxes are floating and not attached to the data series, every time the scale of your chart changes, you have to reposition all the data series labels by hand. So do that last.
I have not used this latest version of Excel before; the charting options seem even more complex than before. To change the format of the labels on the x and y-axis, you right click each axis and select Format Axis. I changed the y-axis text format to currency. But to change the size of the labels on each axis, you have to right click each and then select Font. That those are in separate menus is part of the Excel experience.
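If you script your charts instead of building them in Excel, the by-hand nudging can be approximated. Here's a rough sketch in Python, under the assumption that each label starts at the height of its line's last value and gets pushed up when it would crowd the label beneath it. The function name, the min_gap parameter, and the numbers are hypothetical, and a real chart would likely need something smarter when many lines bunch together.

```python
def place_end_labels(series, min_gap):
    """series: {name: list of y values}. Returns {name: label_y}.
    Each label sits at its line's last value, nudged upward if it
    would land within min_gap of the label below it."""
    ordered = sorted(series, key=lambda name: series[name][-1])
    positions = {}
    previous_y = None
    for name in ordered:
        y = series[name][-1]
        if previous_y is not None and y - previous_y < min_gap:
            y = previous_y + min_gap  # push up just enough to clear the neighbor
        positions[name] = y
        previous_y = y
    return positions

# Made-up values in y-axis units; A and B end close together
lines = {"A": [100, 900], "B": [120, 950], "C": [90, 4000]}
label_y = place_end_labels(lines, min_gap=200)
```

Because the positions are computed from the data, they can be recalculated whenever the scale changes, which is exactly what floating text boxes can't do.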
In expanding the font size of the x-axis, I decided it was too crowded so I went with every other year. I left aligned the data series labels and tried to position them as precisely as possible by eye. I seem to remember Excel used to allow selecting text boxes and moving them one pixel at a time with the arrow keys, but it didn't work for me, so you may have to find the object alignment dropdown somewhere and select align left for all your labels.
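The every-other-year trick is simple to express in code as well. A minimal sketch in Python, assuming you want to keep every tick position but blank out alternate labels. The name thin_ticks is my own invention:

```python
def thin_ticks(years, step=2):
    """Keep every step-th label and blank the rest, so tick spacing is unchanged."""
    return [str(year) if i % step == 0 else "" for i, year in enumerate(years)]

years = list(range(1995, 2015))  # the 1995-2014 span used for this chart
tick_labels = thin_ticks(years)
```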
Here's the next iteration of the chart.
You can click on it to see it larger. Already, we're better off than the Excel auto-generated chart by quite a margin. If this were the default, I'd be fairly happy. But there's room for improvement.
The use of color can be helpful, especially with lines that are closely stacked, but what about the color blind? If we were to stick with the coloring scheme, I might change the data series labels to match the color of each line. Again, since the labels are added by hand, you'd have to change each label by hand to match the color scheme Excel had selected, and again, it wouldn't fix the issue for color blind viewers. [I didn't have the patience to do this for illustrative purposes, but you can see how matching the coloring of labels to the lines helps if you view this data in Google Data Explorer.]
In The Visual Display of Quantitative Information, Tufte uses very little color. When producing the Analytics Package, I was working with black and white printers and copy machines. Color was a no go, even if it provides an added dimension for your graphics, as for elevation on maps.
While color has the advantage of making it easier to distinguish between two lines which are close to each other, it introduces all sorts of mental associations that are difficult to anticipate and which may just be a distraction. When making a chart like, say, one of the U.S. Presidential Election, using blue for Democrats and red for Republicans is a good idea since the color scheme is widely agreed upon. When distinguishing between departments in your company, or product lines, arbitrary color choices can be noise, or worse, a source of contention (NSFW language warning).
The safer alternative is to use different line styles, regardless of whether your final deliverable is capable of displaying color. Depending on how many data series you have to chart, that may or may not be an option. I looked at the data series line format options, which are labeled Dash Type in this version of Excel, and found a total of eight options, or just enough for my example chart. It takes some work to assign options for maximum legibility; you should choose which country receives which style based on maximum contrast between lines that cluster.
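One way to take some of the guesswork out of that assignment, sketched in Python: rank the lines by their average value, then hand out styles from a list you've ordered so that adjacent entries look as different as possible. This is only a heuristic I'm assuming for illustration. The function name and the style names are hypothetical and don't correspond to Excel's Dash Type options.

```python
def assign_dash_styles(series, styles):
    """Rank lines by average value and assign styles in that order, so lines
    that sit next to each other on the chart receive adjacent (and, by how
    the styles list is ordered, contrasting) styles."""
    if len(series) > len(styles):
        raise ValueError("more lines than styles; consider small multiples")
    ranked = sorted(series, key=lambda name: sum(series[name]) / len(series[name]))
    return {name: styles[i] for i, name in enumerate(ranked)}

# Made-up data; X and Z cluster together near the bottom of the chart
data = {"X": [1, 2, 3], "Y": [10, 11, 12], "Z": [2, 2, 3]}
style_for = assign_dash_styles(data, ["solid", "dotted", "dashed"])
```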
After a random pass at that, the monochrome line graph looked like this.
No issues for color blind users, but we're stretching the limits of line styles past where I'm comfortable. To me, it's somewhat easier with the colored lines above to trace different countries across time versus each other, though this monochrome version isn't terrible. Still, this chart reminds me, in many ways, of the monochromatic look of my old Amazon Analytics Package, though it is missing data labels (wouldn't fit here) and has horizontal gridlines (mine never did).
We're running into some of these tradeoffs because of the sheer number of data series in play. Eight is not just enough, it is probably too many. Past some number of data series, it's often easier and cleaner to display these as a series of small multiples. It all depends on the goal and what you're trying to communicate.
At some point, no set of principles is one size fits all, and as the communicator you have to make some subjective judgments. For example, at Amazon, I knew that Joy wanted to see the data values marked on the graph, whenever they could be displayed. She was that detail-oriented. Once I included data values, gridlines were repetitive, and y-axis labels could be reduced in number as well.
Tufte advocates reducing non-data-ink, within reason, and gridlines are often just that. In some cases, if data values aren't possible to fit onto a line graph, I sometimes include gridlines to allow for easy calculation of the relative ratio of one value to another (simply count gridlines between the values), but that's an edge case.
For sharp changes, like an anomalous reversal in the slope of a line graph, I often inserted a note directly on the graph, to anticipate and head off any viewer questions. For example, in the graph above, if fewer data series were included, but Greece remained, one might wish to explain the decline in health expenditures starting in 2008 by adding a note in the plot area near that data point, noting the beginning of the Greek financial crisis (I don't know if that's the actual cause, but whatever the reason or theory, I'd place it there).
If we had company targets for a specific metric, I'd note those on the chart(s) in question as a labeled asymptote. You can never remind people of goals often enough.
Just as an example, here's another version of that chart, with fewer data series, data labels, no gridlines, fewer y-axis labels. Also, since the lines aren't clustered together, we no longer need different line styles adding visual noise.
At that size, the data values aren't really readable, but if I were making a chart for Joy or Jeff, I'd definitely add the labels because I knew they'd want that level of detail. At Amazon, also, I typically limited our charts to rolling four or eight quarters, so we'd never have as many data points as on the graph above. Again, at some point you have to determine your audience and your goals and modify your chart to match.
Like a movie, work on a chart is a continuous process. I could generate a couple more iterations on the chart above for different purposes, but you get the idea. At some point you have to print it. Just as you'd add the end credits to a film, the last touch here would be to put a source for the data below the graph, so people can follow up on the raw data themselves.
Before I set off on this exercise, I didn't know much about health care expenditures per capita around the world, except that the United States is the world leader by a wide margin. The graph reveals that, and by what magnitude. Look at China by comparison. What explains China's low expenditures? I might hypothesize a number of reasons, including obvious ones like the huge population there, but it would take further investigation, and perhaps more charts. One reason the Analytics Package grew in time was that some charts beget further charts.
Why did Greece's expenditures per capita go into decline starting in 2008? Was it the financial crisis? Why has Japan reversed its upward trajectory starting in 2012? Should we include some other countries for comparison, and how might we choose the most illuminating set?
Every month that first year at Amazon, I'd spend most of my waking hours gathering figures and confirming their accuracy, producing these graphs, and then puzzling over the stories behind their contours. The process of making line graphs was prelude to understanding.
To accelerate that understanding, upgrade your line graphs to be efficient and truthful. Some broadly applicable principles should guide you to the right neighborhood. To recap:
- Don't include a legend; instead, label data series directly in the plot area. Usually labels to the right of the most recent data point are best. Some people argue that a legend is okay if you have more than one data series. My belief is that they're never needed on any well-constructed line graph.
- Use thousands comma separators to make large figures easier to read
- Related to that, never include more precision than is needed in data labels. For example, Excel often chooses two decimal places for currency formats, but most line graphs don't need that, and often you can round to 000's or millions to reduce data label size. If you're measuring figures in the billions and trillions, we don't need to see all those zeroes; in fact, they make the figures harder to read.
- Format axis labels to match the format of the figures being measured; if it's US dollars, for example, format the labels as currency.
- Look at the spacing of axis labels and increase the interval if they are too crowded. As Tufte counsels, always reduce non-data-ink as much as possible without losing communicative power.
- Start your y-axis at zero (assuming you don't have negative values)
- Try not to have too many data series; five to eight seems the usual limit, depending on how closely the lines cluster. On rare occasion, it's fine to exceed this; sometimes the sheer volume of data series is the point, to show a bunch of lines clustered. These are edge cases for a reason, however.
- If you have too many data series, consider using small multiples if the situation warrants, for example if the y-axes can match in scale across all the multiples.
- Respect color blind users and those who may not be able to see your charts with color, for example on a black and white printout, and have options for distinguishing data series beyond color, like line styles. At Amazon, as I dealt with so many figures, I always formatted negative numbers to be red and enclosed in parentheses for those who wouldn't see the figures in color.
- Include explanations for anomalous events directly on the graph; you may not always be there in person to explain your chart if it travels to other audiences.
- Always note, usually below the graph, the source for the data.
Some other suggestions which are sometimes applicable:
- Display actual data values on the graph if people are just going to ask what the figures are anyway, and if they fit cleanly. If you include data labels, gridlines may not be needed. In fact, they may not be needed even if you don't include data labels.
- Include targets for figures as asymptotes to help audiences see if you're on track to reach them.
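A few of the formatting principles above (comma separators, no excess precision, currency formatting, parentheses for negatives, a y-axis that starts at zero with uncrowded labels) are mechanical enough to sketch in Python. Treat this as an illustration under my own assumptions rather than a recipe: the function names, parameters, and the 1/2/5 step rule for axis intervals are all hypothetical choices, not anything Excel does for you.

```python
import math


def format_label(value, currency="", scale=1, suffix="", decimals=0):
    """Format a data label: scaled, rounded, comma-separated.

    Negative values are wrapped in parentheses so they remain
    distinguishable without color (for example on a black and white printout).
    """
    rounded = round(value / scale, decimals)
    text = f"{currency}{abs(rounded):,.{decimals}f}{suffix}"
    return f"({text})" if rounded < 0 else text


def axis_ticks(max_value, max_labels=6):
    """Return evenly spaced y-axis ticks starting at zero.

    The step is rounded up to 1, 2, or 5 times a power of ten so that
    no more than max_labels labels are needed (max_labels must be >= 2).
    """
    if max_value <= 0:
        return [0]
    raw_step = max_value / (max_labels - 1)
    magnitude = 10 ** math.floor(math.log10(raw_step))
    for multiplier in (1, 2, 5, 10):
        step = multiplier * magnitude
        if step >= raw_step:
            break
    count = math.ceil(max_value / step)
    return [i * step for i in range(count + 1)]


print(format_label(1234567))                                          # 1,234,567
print(format_label(1234567, currency="$", scale=1_000_000,
                   suffix="M", decimals=1))                           # $1.2M
print(format_label(-8500, currency="$"))                              # ($8,500)
print(axis_ticks(9500))                          # [0, 2000, 4000, 6000, 8000, 10000]
```

Lowering `max_labels` is the code equivalent of increasing the interval when axis labels look crowded.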
Why is The Visual Display of Quantitative Information such a formative text in my life? If it were merely a groundbreaking book on graphic excellence, it would remain one of my trusted references, sitting next to Garner's Modern American Usage, always within arm's reach. It wouldn't be a book I would push on those who never make graphs and charts.
The reason the book influenced me so deeply is that it is actually a book about the pursuit of truth through knowledge. It is ostensibly about producing better charts; what stays with you is the principles for general clarity of thought. Reading the book, chiseling away at my line graphs late nights, talking to people all over the company to understand what might explain each of them, gave me a path towards explaining the past and predicting the future. Ask anyone about any work of art they love, whether it's a book or a movie or an album, and it's never just about what it's about. I haven't read Zen and the Art of Motorcycle Maintenance; I'm guessing it wasn't written just for motorcycle enthusiasts.
A good line graph is a fusion of right and left brain, of literacy and numeracy. Just numbers alone aren't enough to explain the truth, but accurate numbers, represented truthfully, are a check on our anecdotal excesses, confirmation biases, tribal affiliations.
I'm reminded of Tufte's book whenever I brush against tendrils of many movements experiencing a moment online: rationalism, the Nate Silver/538 school of statistics-backed journalism, infographics, UX/UI/graphic design, pop economics, big history. And, much to my dismay, I'm reminded of the book most every time I see a line graph that could use some visual editing. Most people are lazy, most people use the defaults, and the defaults of the most popular charting application on the planet, Excel, are poor.
[Some out there may ask about Apple's Numbers. I tried it a bit, and while it's aesthetically cleaner than Excel, it's such a weak spreadsheet overall that I couldn't make the switch. I dropped Powerpoint for Keynote, though both have some advantages. Neither, unfortunately, includes a great charting tool, though they are simpler in function than the one in Excel. Google Sheets is, like Numbers, a really weak spreadsheet, and it's just plain ugly. If someone out there knows of a superior charting tool, one that doesn't require making charts in Illustrator like Tufte does, please let me know.]
I love this exchange early on in Batman Begins between Liam Neeson's Ra's al Ghul (though he was undercover as Henri Ducard at the time) and Christian Bale's Bruce Wayne.
Bruce Wayne: You're vigilantes.
Henri Ducard: No, no, no. A vigilante is just a man lost in the scramble for his own gratification. He can be destroyed, or locked up. But if you make yourself more than just a man, if you devote yourself to an ideal, and if they can't stop you, then you become something else entirely.
Bruce Wayne: Which is?
Henri Ducard: Legend, Mr. Wayne.
It is absurdly self-serious in the way that nerds love their icons to be treated by mainstream pop culture, and I love it for its broad applicability. I've been known to drop some version of it in meetings all the time, my own Rickroll, but no one seems to find it amusing.
In this case, the passage needs some tweaking. But please do still read it with Liam Neeson's trademark gravitas.
A line graph is just another ugly chart lost in the scramble for its own gratification in a slide deck no one wants to read. It can be disregarded, forgotten. But if you make your graph more than just the default Excel format, if you devote yourself to Tufte's ideals, then your graph becomes something else entirely.
A line graph without a legend. Remove the legend, Mr. Wayne, and become a legend.