Alex’s Beastery of Graphics–How to Pick the Right Graph Type For Your Data Every Time Using Just Three Simple Questions

Use the Table of Contents below to navigate this article quickly!

  1. The Three Questions to Pick the Right Graph Type Every. Single. Time!
  2. How Knowing the Answer to the First Question Helps Us
  3. Alex’s “Beastery of Graphics!”
    1. One Variable
      1. Graph Types For One Categorical Variable
      2. Graph Types For One Quantitative Variable
    2. Two Variables
      1. Graph Types For Two Quantitative Variables
      2. Graph Types For Two Categorical Variables
      3. Graph Types For One Categorical Variable and One Quantitative Variable
    3. Three (or More) Variables
      1. Graph Types For All Quantitative Variables
      2. Graph Types For One or More Categorical Variables and One or More Quantitative Variables
      3. Graph Types For All Categorical Variables
  4. What Goes Where On Graphs?
  5. How Complex Can I Make My Graphs?
  6. How Complex Should I Make My Graphs??
  7. A Handy Handyout

Recently, I was asked to speak to the Graduate Research Fellows in our Research Center (the Minnesota Aquatic Invasive Species Research Center, or MAISRC) on the topic of “communicating science to the public.” As I prepared, I got to thinking–“What is ‘core knowledge’ when it comes to communicating science, in my opinion?”

The answer was obvious to me: Graphs. While text, diagrams, infographics, etc. all have their place in conveying scientific results, there is no more essential a tool in a scientist’s communication toolkit than a well-designed and intentional graph. After all, science is all about the acquisition of new knowledge, aka new data; a graph conveys (new) data in a form distilled for easy digestion!

There’s just one problem–there are a lot of kinds of graphs out there! And one place where I think our college stats courses may tend to fail us is in helping us discern which graph types we should reach for when.

Case in point: If I handed you a new data set and asked you to produce a graph of the data therein, can you confidently say you’d pick the right graphs for those data every single time? Did you just whimper? What was that? You’re scared because I said the “right graphs?” Yes, of course there are “right” and “wrong” graphs!

Ok, ok, I’ll take it from all that blubbering that, no, you don’t feel confident about this! Well, take heart my friend: What if I told you that you could pick the right graph type for your data and purpose every single time by answering just three simple questions? Well, it’s true! There, that dried your tears, didn’t it? What if I then told you that two of those three questions don’t even require much complex thought at all? That put a smile back on your face, didn’t it?

Yes, answer just three questions and you’ll be able to pick the right graph type for your data and purpose every time. In this post, I’ll first introduce you to the three questions you’ll need to answer. I’ll then briefly explain how answering the first question narrows our options. Then, for the rest of the post, I’ll explain how answering the other two questions gets us the rest of the way. In the process, I’ll compile a “Beastery of Graphics” you can consult so that you know all your conventional options, regardless of what types of data you may have to graph. I’ll close with a few odds and ends about graph structure as well as a discussion about complexity and how to get the right amount of complexity in your graphs to suit your purposes.

The Three Questions to Pick the Right Graph Type Every. Single. Time!

Ready? Without further ado, here are the three questions to ask yourself whenever you go to make a graph; together, they will help you choose the right graph type for your data every time:

  1. “How many variables (columns or discrete sets of data) are you plotting (one, two, or more than two)?”
  2. “Are (each of) those variables quantitative (aka “real, math-y numbers”) or categorical (aka words or arbitrary numbers)?”
  3. “How complex can/should your graph be to suit your purposes and audience?”

Before I expand on these, a quick clarification of what I mean by “complex.” Many things can make a graph more complex. These include when:

  • More raw data are being shown (100 points on a scatterplot will be more complex than 10).
  • More details are being included (e.g., confidence bands in addition to a best-fit line).
  • More element types are being used (e.g., colors, shapes, shadings, clusterings, etc.).
  • More uncertainty is being represented (e.g., showing not just group means but also standard deviations).
  • More complex patterns are being highlighted (e.g., when a non-linear rather than a linear trend is the focus).

While I will discuss why we may want more or less complexity and how to choose the appropriate amount of complexity towards the end of this post, for now, we don’t actually need to know! At this point, all we need to know is that different desired amounts of complexity will alter which graph type is best for our circumstances.

How Knowing the Answer to the First Question Helps Us

Now that we’ve met our three all-powerful questions, let’s get to know the first one a little better: “How many variables are we plotting here?”

The reasons this first question is useful are twofold. First, most graphs are built to accommodate only so many variables. Histograms, for example, assume we are plotting just a single variable, whereas grouped, stacked bar charts assume we must have at least two but likely more than two variables. So, by knowing how many variables we are trying to represent, we can quickly whittle down our options of graph types to consider.

Second, different numbers of variables usually imply different purposes for our graph:

  • If we have just one variable, we are probably trying to highlight the distribution of our data. That is, we are trying to show off what values are common/uncommon, where the “centers” of the data are, and how variable the data are around those centers.
  • If we have exactly two variables, we are probably trying to highlight an association between the two variables (or, perhaps, the lack of one). That is, we’re trying to show that there is (or isn’t) a predictable, interesting, or usable relationship between the two variables, such that as one varies, so too does the other in some specific way.
  • If we have more than two variables, we probably hate ourselves, hate our audience, or both! Ok, ok, I’m just kidding. Here again, we’re probably trying to show an association between two (or more) variables. However, for us to want to plot all these variables at the same time, the association must be complex–maybe how the two “main” variables associate with each other itself depends on how a third variable varies, for example.

Alex’s hot take: Making and relaying one- and two-variable graphs is relatively straightforward. Making and sharing three-plus-variable graphs is often not. While 3+ variable graphs exist and do have their places and times, they are often hard to make, hard to interpret, or both. They will require extra time to assemble and polish as well as extra time and space to explain to your audience, meaning they will start at a high level of complexity from the outset. They are also at least sometimes “beside the point;” unless you have a bunch of important interaction terms in your model, for example, representing trivariate (or even more complex) relationships in a single graph is more often going to complicate your story rather than enhance it.

Ultimately, we want to use our graphs to tell a rich but accessible story. That often means keeping things as simple as we can, especially if we know we want to add complexity in other ways. In that sense, it’s best to not plan from the outset to produce graphs depicting 3+ variables unless you’re sure you need to.

So, we now appreciate that there are graphs especially suited to different numbers of variables and to different intents. As it turns out, each type of graph is also best-suited for a certain type (or mix) of data, with some graph types catering towards categorical data and other graph types catering towards quantitative data–that’s how Question #2 helps us!

For most of the rest of this post, I break down the appropriate graph types to select, based first on the number of variables you are plotting and then by the types of data being plotted. Within each subgroup, I will further break down our options along a scale of complexity, with simplest options first and the most complex options last.
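If it helps to see the first two questions as a lookup, here is a minimal Python sketch. The dictionary only covers the one- and two-variable cases described in this post, the option names are shorthand for the graph types discussed here, and the function names and the 0-to-1 complexity scale are assumptions made up for this illustration.

```python
# Conventional options from this post, ordered from simplest to most complex.
# Plain bar plots are left out of the mixed case because the post argues
# they are "too simple."
BEASTERY = {
    ("categorical",): [
        "frequency table", "word cloud", "single stacked bar graph"],
    ("quantitative",): [
        "density curve", "histogram", "dot plot", "box plot"],
    ("quantitative", "quantitative"): [
        "trend line only", "trend line with uncertainty bounds", "scatterplot"],
    ("categorical", "categorical"): [
        "contingency table", "stacked bar plots", "mosaic plot"],
    ("categorical", "quantitative"): [
        "bar plot with raw points",
        "bar plot with side density curves or histograms",
        "box plot"],
}


def graph_options(variable_types):
    """Return candidate graph types, simplest first, for one or two variables."""
    key = tuple(sorted(variable_types))  # the order of the variables is irrelevant
    if key not in BEASTERY:
        raise ValueError(f"No entry for variable types {variable_types!r}")
    return BEASTERY[key]


def pick_graph(variable_types, complexity):
    """Pick an option by complexity in [0, 1]: 0 is simplest, 1 is most complex."""
    options = graph_options(variable_types)
    return options[round(complexity * (len(options) - 1))]


print(pick_graph(["quantitative", "categorical"], 1.0))
print(pick_graph(["categorical"], 0))
```

The third question (how complex the graph should be) is reduced to a single number here, which is a big simplification of the judgment call the rest of this post describes.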

Alex’s “Beastery of Graphics!”

One Variable

Graph Types For One Categorical Variable

Simplest: The simplest way to represent one categorical variable is actually in a frequency table, not in a “true graph.” Usually, the different categories are listed in one column and their occurrence rates are listed in a second column in count or percentage form.

So long as there aren’t too many categories, frequency tables are easy for most audiences to digest, and they are incredibly “ink-efficient”–they convey a lot of information relative to the ink and space needed to depict them. A “true graph” rarely makes the data easier to understand. That said, consider whether you can enhance the reader’s understanding by sorting the table in some way (e.g., alphabetically or in decreasing order of frequency).

Here, the categories are listed in the first column and their frequencies are listed in the third column.
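As a rough sketch of how such a table could be assembled, the Python below counts a made-up list of category labels and sorts the result in decreasing order of frequency. The data and variable names are invented for illustration.

```python
from collections import Counter

# Hypothetical observations of one categorical variable (made-up data).
species = ["oak", "maple", "oak", "birch", "oak", "maple", "pine"]

counts = Counter(species)
total = sum(counts.values())

# most_common() returns categories in decreasing order of frequency.
rows = [(name, n, round(100 * n / total, 1)) for name, n in counts.most_common()]

print(f"{'Category':<10}{'Count':>7}{'Percent':>9}")
for name, n, pct in rows:
    print(f"{name:<10}{n:>7}{pct:>9}")
```

Swapping `counts.most_common()` for `sorted(counts.items())` would sort the table alphabetically instead.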

More complex: An option that is more complex but that also works well for a single categorical variable with a larger number of categories is a word cloud. In a word cloud, categories are represented by their names and their relative frequencies are represented by some characteristic of those names, such as their size, shape, color, or orientation.

Of these options, size is the most accessible, as some fonts, shape distortions, or printing angles can make names harder to read, and not all colors are colorblind-friendly. Ensure that the least frequent category still prints at a legible size; log-transforming the frequency data can help to reduce size disparities in extreme instances.
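To see why log-transforming can help, here is a small sketch that maps frequencies to font sizes with and without the transform. The function name, the size limits, and the frequencies are all assumptions for illustration, and the sketch assumes every frequency is greater than zero.

```python
import math


def font_sizes(frequencies, min_size=10.0, max_size=48.0, use_log=True):
    """Map category frequencies to font sizes between min_size and max_size."""
    scale = math.log if use_log else float
    scaled = {name: scale(freq) for name, freq in frequencies.items()}
    low, high = min(scaled.values()), max(scaled.values())
    span = (high - low) or 1.0  # avoid dividing by zero if all frequencies match
    return {
        name: min_size + (value - low) / span * (max_size - min_size)
        for name, value in scaled.items()
    }


# Made-up frequencies spanning three orders of magnitude.
frequencies = {"zebra mussel": 1000, "spiny waterflea": 100, "starry stonewort": 10}
linear = font_sizes(frequencies, use_log=False)
logged = font_sizes(frequencies, use_log=True)

for name in frequencies:
    print(f"{name:<18}linear={linear[name]:5.1f}  log={logged[name]:5.1f}")
```

On the linear scale, the middle category prints barely larger than the rarest one; on the log scale it lands halfway between the extremes. Pinning the rarest category to `min_size` is one way to keep it legible.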

One drawback of word clouds is that there is not an easy way to convey absolute frequencies, so if those are needed, the next or previous graph types may be preferable.

Here, the categories are represented by their names, and their size is mapped to their frequency of occurrence (although color also indicates subgroupings, by the looks of it! So, there are actually three variables in this graph).

Most complex: The best strictly graphical option for showing one categorical variable is in a single stacked bar graph. The height of the bar can either be 100% or the total number of occurrences, and the height of each subdivision of the bar (one for each category) indicates the contribution of that category to the total. Colors, shapes, or shading can be used to distinguish categories.

The biggest asset of a stacked bar graph is that it makes variation in the frequency data more obvious. Because size/area reflects frequency, really abundant categories dwarf rare categories, which can sometimes become practically invisible by comparison (which can be good or bad). Rare or non-focal categories can often be collapsed into a single subdivision to make them larger, if that works for your purposes.

Here, the bar is a “whole,” and each color indicates a category that made up part of that “whole.” The vertical height of each colored rectangle indicates the relative frequency of each category, and the frequency data are printed within each subdivision for reader convenience (see tip below). Is this easier to read and more efficient than a table? Probably not, but it is more visually stimulating, and the magnitude of variation between the categories is more readily apparent.

Again, this option works best with fewer than twelve-ish categories; any more than that are better represented in a table or word cloud.

For a single stacked bar graph, consider adding raw frequency data to the graph; either print the values inside the respective subdivisions (if they will all fit and still be of legible size) or else use lines or arrows to connect values outside the bar area to their respective subdivisions. If these values are not printed, extracting them for subdivisions towards the top of the bar requires subtraction, which some may find tedious.
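Here is a minimal sketch of the arithmetic behind a single 100% stacked bar, including collapsing rare categories into one subdivision. The function name, the 5% threshold, and the counts are assumptions for illustration.

```python
def stack_segments(counts, min_share=0.0):
    """Return (label, bottom, top) tuples for one stacked bar, in percent.

    Categories whose share is below min_share are collapsed into "Other".
    """
    total = sum(counts.values())
    kept = {k: v for k, v in counts.items() if v / total >= min_share}
    other = total - sum(kept.values())
    if other:
        kept["Other"] = other
    segments, bottom = [], 0.0
    # The largest categories go at the base of the bar.
    for label, n in sorted(kept.items(), key=lambda kv: kv[1], reverse=True):
        top = bottom + 100 * n / total
        segments.append((label, bottom, top))
        bottom = top
    return segments


counts = {"A": 50, "B": 30, "C": 15, "D": 3, "E": 2}  # made-up frequencies
segments = stack_segments(counts, min_share=0.05)
for label, bottom, top in segments:
    print(f"{label:<6}from {bottom:5.1f}% to {top:5.1f}%  (share: {top - bottom:.1f}%)")
```

Notice that a reader who only sees the bar has to compute `top - bottom` for every subdivision above the first, which is exactly the subtraction that printing the values on the graph spares them.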

Alex’s hot take: “Alex, what about pie charts? Wouldn’t they fit into this part of the beastery too?”

Yes, you’re right, they would–pie charts are technically suitable for representing one categorical variable. However, I did not include them because they are hot trash. You should not use pie charts (probably).

The TL;DR reasoning is this: While a stacked bar plot represents frequency in height/width, a pie chart represents frequency in the arc length of each “wedge.” Arc lengths are as awkward to compare as they are to talk about–basically, humans are lousy at comparing the relative sizes of pie wedges, especially when they are at different angles. We’re much better at comparing the relative sizes of rectangles, even when they are at different angles. Additionally, a stacked bar plot is still often readable when shrunk to a small size compared to a pie chart. Lastly, pie charts are so overused that most professionals view them as trite and juvenile. If you need more convincing, check out this video.

Graph Types For One Quantitative Variable

Simplest: An underutilized option, density curve plots are the simplest choice for showing the distribution of a single quantitative variable. Typically, the horizontal axis depicts the range of values the variable took, and the height of a curving line depicts the frequency at which each value was observed. Usually, the line is “smoothed,” so it serves as more of a “trend line” for the data, helping tendencies to jump out. By adjusting the level of smoothness, the complexity level of the graph can be turned up or down somewhat.

Here, we see a histogram (see next graph type) plotted along with a density curve (solid line). Note that it doesn’t follow the data exactly because of the level of smoothing used, so it’s better suited to conveying the “general feel” of the data rather than their specific shape.

More complex: A more complex but still “standard” option for showing the distribution of a single quantitative variable is a histogram. Like with a density curve, the horizontal axis of a histogram depicts the range of values observed for the variable. Then, the heights of bars depict the frequency with which each value was observed. Histograms are great for showing the “shape” of a distribution, and they do so in a less idealized way than a density curve does.

Also like with a density curve, the data must be simplified to represent them this way. So, the data are chunked into “bins” of a given width (e.g., 5-10, 10-15, 15-20, etc.), and the heights of the bars represent frequencies for each bin rather than any one value from within a bin. As such, the data can be made more “smooth” or more “jagged” by dividing the data into fewer or more bins, respectively. It’s best practice to make the bins have equal widths; readers often assume this is the case and often don’t notice when it isn’t (so it’s a way data are often distorted to support a particular agenda!).
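To make the binning step concrete, here is a small sketch that counts made-up measurements in equal-width bins at two different widths. The function name and data are assumptions, and the sketch assumes `start` is no larger than the smallest value.

```python
import math


def bin_counts(values, bin_width, start=None):
    """Count values in equal-width bins [edge, edge + bin_width)."""
    start = min(values) if start is None else start
    n_bins = math.floor((max(values) - start) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / bin_width)] += 1
    edges = [start + i * bin_width for i in range(n_bins + 1)]
    return edges, counts


lengths = [5, 6, 7, 11, 12, 12, 13, 14, 16, 18, 24]  # made-up measurements

edges_wide, counts_wide = bin_counts(lengths, bin_width=10, start=0)
edges_narrow, counts_narrow = bin_counts(lengths, bin_width=5, start=0)

print(edges_wide, counts_wide)      # fewer bins: a "smoother" picture
print(edges_narrow, counts_narrow)  # more bins: a more "jagged" picture
```

Because every bin is built from the same `bin_width`, the equal-width convention is enforced by construction.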

Histograms are more complex than density curves because they show more elements (many bars instead of a single line) and tend to show more “noise” in the data than a density curve does (see example above).

An example histogram. Note how we can see here that rather high values are somewhat more common than rather low values (i.e., the data have a skew to them). Histograms are well-suited to showing the “shape” of a distribution, including some of its intricacies.

A little more complex still: Strangely rare, dot plots are another option for graphing a single quantitative variable. They are very similar to histograms and density curves except that each observation is represented by a single circle (or “dot”). When a value occurs frequently, more circles need to be plotted at the same place along the horizontal axis, and the dots “stack up” to reflect that frequency.

One potential advantage of a dot plot is that it requires much less “simplification” than a density curve or histogram. As such, it’s more capable of showing the raw data, “warts and all.” However, some small level of binning is usually still needed to ensure the stacks can fit next to each other. Also, if frequencies get above ~10, each circle needs to start representing multiple observations.

Another big advantage of a dot plot is that it often makes for a much more interesting interactive plot than density curves or histograms do–for example, a tooltip could emerge from each dot when hovered over that identifies the individual that dot corresponds to. Symbols, numbers, or nicknames can also be printed inside each dot (or used in place of each dot) to convey identification or other information. Such things can’t be done in histograms, which lump many observations together into one “bin.”

Two example dot plots, wherein each dot represents the value of a single observation and the height of each “stack” indicates frequency. Note how much more granular these data are than those depicted in a histogram.

Most complex: Our most complex option for displaying a single quantitative variable is a box plot (also sometimes called a box-and-whisker plot). These are easier to explain using a diagram:

The “whiskers” of the box plot are placed at the minimum and maximum observed values…unless there are “extreme values” (which can be defined many ways), in which case these are represented out past the “whiskers” as dots. The middle value (the median, or 50th percentile) is marked with a line, as are the 25th and 75th percentiles (the first and third quartiles). A box is then drawn between those quartiles.
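The pieces of a box plot can be computed directly. In the sketch below, “extreme values” are defined as anything more than 1.5 times the interquartile range beyond the box; that is one common convention, and it is an assumption here since extremes can be defined many ways. The function name and data are also made up, and the quartile method is one of several in use.

```python
import statistics


def box_plot_stats(values, whisker_factor=1.5):
    """Return the quartiles, whisker ends, and extreme values for a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers reach the most extreme observations that are not "extreme."
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "extremes": sorted(v for v in values if v < low_fence or v > high_fence),
    }


stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])  # made-up data
print(stats)
```

With these data, the value 30 falls past the upper fence, so it would be drawn as a dot and the upper whisker would stop at 9.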

Boxplots show the center of a distribution (the median line), the range of a distribution (the whiskers and outliers), and the shape of a distribution (how symmetrical the box and its whiskers are). Histograms and dot plots do this too, but less abstractly. This means a box plot is like a “distillation” of a histogram, which means it isn’t always easy to imagine what the underlying distribution would look like.

One solution to this problem, though it adds complexity, is to plot the raw data over top of a boxplot as jittered points (jittering is adding random fluctuations to data to ensure points don’t plot over the top of one another).
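Jittering itself is only a line or two of code. This sketch adds a small uniform random offset to each plotting position; the function name, the offset size, and the fixed seed (used so the result is repeatable) are assumptions for illustration.

```python
import random


def jitter(values, amount=0.2, seed=0):
    """Add a uniform random offset in [-amount, amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Seven points that would otherwise plot at just two horizontal positions.
x_positions = [1, 1, 1, 1, 2, 2, 2]
jittered = jitter(x_positions, amount=0.2)
print([round(x, 3) for x in jittered])
```

Only the plotting position is jittered (here, the horizontal one); the data values themselves should stay untouched.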

Here, we’re able to see the raw data in all their glory but also distillations of their distributions via the box plots. In this particular case, the points have been made to “stack” as in a dot plot so that they don’t plot over top of one another, which is a nice touch.

If this feels like too much complexity for you, consider instead adding some of the info that a box plot conveys to a histogram, such as the median, using arrows, labels, or dashed lines, as in the example below.

Conversely, if there are distinct advantages of both density curves and box plots that you want to exploit, you could try a violin plot, which is like a hybrid of the two plot types. These are box plots whose sides are density curves of the same data.

Here, we can engage with the “distilled” data by focusing on the box plots, and we can engage with the “raw” data by tipping the plots sideways and examining their density curves.

Alex’s hot take: “But Alex, wouldn’t a single bar plot showing the mean and standard deviation of my data go in this subcategory too?”

It sure would–but, at best, it isn’t the best choice and, at worst, it’d be a waste of ink!

Think about it: A bar plot of the mean and standard deviation shows center and spread (kind of) but it doesn’t show the shape of your data. Even the way it shows spread is more abstract and less informative than the way even a density curve would show it. In this light, a density plot can show everything a bar plot with error bars can show and more while remaining just as simple to interpret.

In fact, I’d argue that if you’re uninterested in showing the shape of your distribution, you probably don’t need a graph at all–just write the mean +/- (sd) in text! It’ll save a lot of ink/space, and no interpretability will be lost.
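For completeness, here is a tiny sketch that produces that text. It uses the sample standard deviation; the function name, unit, and data are made up.

```python
import statistics


def mean_sd_text(values, unit="", digits=1):
    """Format the mean and sample standard deviation as inline text."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample (n - 1) standard deviation
    suffix = f" {unit}" if unit else ""
    return f"{mean:.{digits}f} +/- {sd:.{digits}f}{suffix}"


summary = mean_sd_text([12, 15, 11, 14, 13], unit="cm")  # made-up data
print(summary)
```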

We’ll hammer on this again later in this post, but suffice it to say: Bar plots are overused and vastly overrated!

Two Variables

Graph Types For Two Quantitative Variables

Simplest: If we’re trying to show how (or if) two quantitative variables are related, the simplest graph to do that with is one that shows a best-fit or trend line only. By not plotting the raw data, there are fewer elements to look at, so the trend or pattern becomes the sole focus. The line can be smoothed or not to adjust complexity, and the “line” could really be a curve if the data share a non-linear relationship.
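One common way to get a straight best-fit line is ordinary least squares, sketched below. That is an assumption on my part; any smoother could stand in for it. The function name and data are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5]             # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# A trend-line-only graph needs just the two end points of the line.
line_ends = [(x, intercept + slope * x) for x in (min(xs), max(xs))]
print(round(slope, 2), round(intercept, 2), line_ends)
```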

Trend-line-only graphs are produced so rarely (for no good reason) that it was hard to find a good example! This one should give you the idea. However, note that this is actually a line graph. See my hot take at the end of this subsection for details!

More complex: A trend-line-only graph only tells us about the “central tendency” of the relationship. If we want to provide a sense of how “messy” that relationship might be, we can add uncertainty bounds to our trend line graph. These tell us, for any given value of our independent variable, which values of our dependent variable are plausible (those are the ones within the uncertainty bounds).

This isn’t a reason not to use this graph type, but a lot of readers will be initially unfamiliar with uncertainty bounds, so your figure caption may need to briefly explain them.

Here, the uncertainty bounds indicate that the relationship may actually be much “higher” or “lower” than the trend line makes it seem. However, the bounds are also telling us that the relationship could be much “steeper” or “shallower” than it seems too. Essentially, any straight line that stays fully within the bounds is “plausible,” based on these uncertainty bounds.

One alternative to the uncertainty bounds is to plot the “steepest” and “shallowest” plausible lines on the graph, perhaps as dashed lines, along with the trend line. That way, the reader can see the plausible extremes without having to envision them themselves.
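There are several ways to compute uncertainty bounds. The sketch below uses a simple bootstrap (refit the line to many resampled copies of the data, then take percentiles), which is a technique I am swapping in for illustration rather than the method behind any figure in this post. The function names, the 95% level, and the data are all assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_lines(xs, ys, n_boot=500, seed=1):
    """Refit the line to n_boot resamples (with replacement) of the points."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    fits = []
    while len(fits) < n_boot:
        sample = [rng.choice(pairs) for _ in pairs]
        sx, sy = zip(*sample)
        if len(set(sx)) > 1:  # a slope is undefined if every x is identical
            fits.append(fit_line(sx, sy))
    return fits


def percentile(sorted_values, p):
    """Rough percentile: the element nearest to position p in a sorted list."""
    return sorted_values[round(p * (len(sorted_values) - 1))]


def band_at(fits, x, lower=0.025, upper=0.975):
    """Lower and upper uncertainty bounds for the trend line at one x value."""
    predictions = sorted(b + m * x for m, b in fits)
    return percentile(predictions, lower), percentile(predictions, upper)


xs = list(range(1, 11))  # made-up data
ys = [2.3, 3.1, 4.8, 5.2, 7.9, 7.1, 9.6, 10.2, 11.5, 13.4]
fits = bootstrap_lines(xs, ys)

band_low, band_high = band_at(fits, x=5)
slopes = sorted(m for m, _ in fits)
shallowest, steepest = percentile(slopes, 0.025), percentile(slopes, 0.975)
print(band_low, band_high, shallowest, steepest)
```

Evaluating `band_at` across a grid of x values traces out the bounds, while `shallowest` and `steepest` are the slopes you could draw as dashed “plausible extreme” lines.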

Most complex: A scatterplot is the “standard” way to represent two quantitative variables graphically, but it’s also the most complex way to do it. In a scatterplot, individual points represent individual observations. An observation’s point will lie at the intersection of that observation’s data values for both variables. See? It’s even hard to describe!

By including a trend line, scatterplots can be made to show the “central tendency” of the relationship, and by including uncertainty bounds around that trend line, scatterplots can be made to show uncertainty in that relationship. However, the points themselves add a lot of unique information! Through them, the reader sees the “noise” or “spread” around that relationship, and whether or not the relationship is being driven largely by a few points.

Here, points are plotted to be semi-transparent. So, when more points have the same values, they stack on top of each other and appear darker, allowing the reader to see “local density” where points would otherwise overlap. Trend lines with uncertainty bounds are also included.

One problem with scatterplots is overplotting–this is when multiple points have the same or very similar values and thus are plotted over top of one another. This can give the reader the false sense that there are fewer observations in a region than there were. While jittering the points is one option to solve this, a better solution is the use of semi-transparency. Semi-transparent points by themselves will appear faint, but many semi-transparent points stacked on top of each other will appear darker.
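How quickly stacked points darken can be worked out directly. The sketch below assumes standard alpha compositing, where each new layer covers a fraction `alpha` of whatever still shows through; the function name and the opacity value are made up.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n_points identical overlapping points."""
    return 1 - (1 - alpha) ** n_points


for n in (1, 2, 5, 10):
    print(n, round(stacked_opacity(0.2, n), 3))
```

With an opacity of 0.2 per point, ten overlapping points are still not fully opaque, so differences in local density remain visible well past the point where opaque markers would have merged into one blob.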

In my opinion, to minimize complexity, one should view a trend line as mandatory and points as optional in most situations. With points alone, it is easy for humans to see (or imagine!) a trend through them that is not well-supported by the data. By contrast, with a trend line (and uncertainty bounds) alone, a reader can often still imagine what the data might look like. Still, a lot of information (about spread, density, noise, influential observations, etc.) is conveyed by points alone. If your story needs to emphasize that information, you will need to include the points, even if you have to turn the complexity of the graph down in other ways.

Alex’s hot take: “Hey Alex, what about line graphs? I’d have assumed you’d be talking about them here…”

Ok, line graphs look a lot like scatterplots, but they are different. A line graph is only appropriate when separate observations are logically linked to each other in some way. Usually, this linkage is due to non-independence–the observations came from the same organism across multiple days, or occurred in the same lake along the same transect, or came from three different siblings in the same family (notice that “same” word in each of these!). Non-independence means we shouldn’t think of each observation as being a “standalone observation” but rather just one part of a “package deal.”

Lines are used to connect those “package deals” together so that we can easily see and engage with that non-independence. For example, in the graph below, the lines indicate that the percentage share of power capacity of coal in 2020 is not independent of its percentage share of power capacity in 2015–it was high in 2015, so it’s likely to still be high in 2020 because many things can only change so quickly through time. Those two observations are connected logically, so we connect them physically in our graph.

So, yes, time is a quantitative variable, so a line graph is appropriate for plotting two quantitative variables against each other. However, it should only be used when time, space, or some other “linking” variable is on the X axis, and even then only when that factor is likely to cause one’s observations to be non-independent.
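In data terms, drawing a line graph means grouping observations by whatever links them and then ordering each group along the X axis. Here is a minimal sketch with made-up repeated measurements; the lake names and values are invented.

```python
from collections import defaultdict

# Made-up repeated measurements: (unit, year, value).
records = [
    ("Lake A", 2020, 4.0), ("Lake B", 2015, 1.5), ("Lake A", 2015, 3.2),
    ("Lake B", 2020, 2.1), ("Lake A", 2010, 2.9), ("Lake B", 2010, 1.4),
]

series = defaultdict(list)
for unit, year, value in records:
    series[unit].append((year, value))

# Sort within each unit so its connecting line runs forward through time.
lines = {unit: sorted(points) for unit, points in series.items()}

for unit, points in lines.items():
    print(unit, points)
```

Each entry in `lines` is one “package deal” and gets its own connected line; points from different lakes are never joined to each other.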

So, I guess just remember that line graphs are weird! Don’t draw connecting lines on your scatterplots for the fun of it–connecting lines have a very specific meaning.

Graph Types For Two Categorical Variables

Simplest: The simplest (and most ink-efficient!) way to display the relationship between two categorical variables isn’t with a graph at all–it’s with a contingency table. A contingency table is just a fancy name for a frequency table that has sets of categories along both the top row and first column, with frequencies of each subgroup in the table’s cells.

Much as with a frequency table, it’s easier for readers to pull out patterns if the table can be meaningfully sorted to highlight them. However, shading, formatting, and other “tricks” can be used to draw attention to certain cells or trends too.

A two-by-five contingency table showing frequencies of different sizes of trees of different species.
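A contingency table is just a count of every pairing of categories. The sketch below builds a small one from made-up observations; the category names and their ordering are assumptions for illustration.

```python
from collections import Counter

# Made-up observations: one (species, size class) pair per tree.
observations = [
    ("oak", "small"), ("oak", "small"), ("oak", "large"),
    ("maple", "small"), ("maple", "medium"), ("maple", "medium"),
    ("maple", "large"),
]

cells = Counter(observations)
row_names = sorted({species for species, _ in observations})
col_names = ["small", "medium", "large"]  # ordered meaningfully, not alphabetically

# A Counter returns 0 for pairings that were never observed.
table = {row: [cells[(row, col)] for col in col_names] for row in row_names}

print(f"{'':<8}" + "".join(f"{col:>8}" for col in col_names))
for row, values in table.items():
    print(f"{row:<8}" + "".join(f"{v:>8}" for v in values))
```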

A “true graph” rarely adds much–contingency tables are very easy to read, they show the raw data so nothing is “hidden,” and they can be tailored to make trends apparent. That said, if you are concerned that the magnitude of a pattern won’t be obvious from a contingency table, one cool trick is to add bars inside the contingency table’s cells whose length reflects frequency. Making the text visible against the background both where the bar is present and where it is absent can be a challenge, but it’s possible.

Adding bars inside of cells is doable in Excel (as shown here), but it’s also possible using tools in the DT package in R.

More complex: If a graph rather than a table is desired, the least complex way to represent two categorical variables is with stacked bar plots, with one bar per category in one variable and one rectangle inside each bar per category in the other variable.

Many of the same challenges and opportunities that apply to single stacked bar plots apply here as well. In particular, these graphs become difficult to interpret when there are too many categories in one or both variables or when some subcategories are quite rare. It can also become difficult to distinguish subcategories adequately or provide frequency values inside the graph area, forcing readers to do math to get these values if they want them.

A nice example of a stacked bar plot. Notice how this graph features multiple shades of red and green, which could be difficult for colorblind readers, and the graph could be less interpretable if shown in black and white because color hues are largely similar. Also, consider that the bars could have been sorted in some way to highlight a pattern–for example, perhaps in descending order of the relative frequency of Quercus species from left to right. Also, note that sample size information is not presented here, but it could be–the widths of the bars could be varied as sample size varies, for example.

Most complex: Though underutilized, mosaic plots (also called tile plots) are a visually striking way to represent two categorical variables. These are sort of like “deconstructed stacked bar plots” that make sample sizes explicit. Each subcategory gets a rectangle whose width indicates its relative frequency across one categorical variable and whose height reflects its relative frequency in the other variable.

Otherwise, the same readability challenges encountered with stacked bar plots tend to arise with mosaic plots as well. For readability, it can be good to restrict to around three or fewer categories in one or both variables, “binning” data as needed to accomplish this.
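The geometry of a mosaic plot comes down to two proportions per tile. In this sketch, the counts are invented for illustration (they are not the data behind any figure in this post), and the function name is hypothetical.

```python
def mosaic_tiles(table):
    """Return {(column, row): (width, height)} as proportions.

    table maps each category of one variable (the columns of the plot) to a
    dict of counts for each category of the other variable.
    """
    grand_total = sum(sum(inner.values()) for inner in table.values())
    tiles = {}
    for col, inner in table.items():
        col_total = sum(inner.values())
        width = col_total / grand_total  # the column's share of all observations
        for row, n in inner.items():
            tiles[(col, row)] = (width, n / col_total)  # share within that column
    return tiles


table = {  # invented counts
    "Ground-nesting birds": {"positive": 30, "negative": 10},
    "Other birds": {"positive": 36, "negative": 24},
}
tiles = mosaic_tiles(table)
for key, (width, height) in tiles.items():
    print(key, round(width, 2), round(height, 2), round(width * height, 2))
```

Each tile’s area (width times height) equals that subcategory’s share of all observations, which is how a mosaic plot makes sample sizes explicit.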

A simple mosaic plot, with just two categories in each variable. We can see that the “Other birds” width is larger than the “Ground-nesting birds” width, so we know that somewhat more observations occurred in the former than the latter. The combined height of the “negative” boxes is lower than that of the “positive” boxes, so we also know “negatives” were observed less often overall. See here for a more complex example.

Graph Types For One Categorical Variable and One Quantitative Variable

Too simple: The “standard” graph type for when we have one categorical variable and one quantitative variable has been bar plots (with error bars) for far too long. Even when error bars are provided to give some sense of uncertainty/spread, as I argued in one of my hot takes earlier in this post, bar plots are crazily ink-inefficient. A table or even just text of the means +/- the standard deviations conveys the same information with (virtually) the same interpretability. Beyond that, bar plots do nothing to show us the shape of the distribution, which makes it hard to discern how “fluky” vs. “consistent” any pattern might be.

Consider: How much less ink and space would it have used to represent the means and standard errors here in text instead? What is this graph really adding?

One additional problem with bar plots is “Y-axis hacking.” By not anchoring the lower Y axis limit at 0, or by inserting a continuity break along the Y axis, a bar graph can easily be made to intentionally misrepresent the patterns in the data. Yes, continuity breaks can be represented with symbols, but not everyone knows what those symbols mean or will notice them.
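The effect of an un-anchored Y axis is easy to quantify: compare the ratio of the drawn bar heights to the ratio of the underlying values. The function name and numbers below are made up for illustration.

```python
def apparent_ratio(value_a, value_b, axis_min=0.0):
    """Ratio of the drawn bar heights when the Y axis starts at axis_min."""
    return (value_a - axis_min) / (value_b - axis_min)


honest = apparent_ratio(52, 50)               # axis anchored at 0
hacked = apparent_ratio(52, 50, axis_min=49)  # axis starts at 49 instead

print(round(honest, 2), round(hacked, 2))
```

Two means that differ by 4% look like a threefold difference once the axis starts at 49. A continuity break works in the opposite direction, shrinking how large a real difference appears.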

Consider that the Paris-June data are actually around 7 times higher than the Munich-June data even though, at first glance, they look only modestly different here. One has to pay close attention to the symbology on the graph (and instinctively know what it means!) to spot the difference!

Here’s a nice scientific article that goes deeper into why we should all just stop making basic bar graphs.

Just simple enough: One solution to the problem that bar plots don’t show “shape” well is to plot the raw data as points (jittered or made semi-transparent to reduce overplotting, as needed) over top of a bar plot. This can add a lot of visual elements for the reader to interpret, but it also conveys a lot more information in an intuitive way.

A nice example pulled from an article aptly named “Show the dots in plots.” Consider the “Flights of stairs climbed” data from May-Aug 2015. We can see that the error bars are only as wide as they are (and the mean only as high as it is) because of one observation out of four. Notice also that plotting points makes sample sizes explicit–results based on 4 points are a lot less certain than those based on 400! Note that the points have been jittered here to prevent them from plotting on top of one another.
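Jittering itself is simple: each point gets a small random horizontal nudge so that identical values stop hiding behind one another. Here is one way it might be sketched in plain Python; the `jitter` function, its `width` default, and the sample positions are all my own assumptions for illustration (most plotting software has its own built-in version).

```python
import random


def jitter(positions, width=0.2, seed=0):
    """Shift each position by a random offset within +/- width / 2.

    A fixed seed keeps the figure identical from one run to the next.
    """
    rng = random.Random(seed)
    return [x + rng.uniform(-width / 2, width / 2) for x in positions]


# Four observations that all belong to category 1 on the X axis.
x_positions = [1, 1, 1, 1]
jittered = jitter(x_positions)
print(jittered)
```

Only the categorical axis gets jittered. The quantitative values stay exactly where they are, so no information is distorted.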

More complex: Another way to “dress up” bar plots to compensate for their shortcomings is to plot them with density plots or histograms turned on their sides. This can present the same information that plotting the points would while adding far fewer elements for the reader to interpret. This can be somewhat difficult to achieve in some graphical software, though.

A bar plot with plotted points and a smoothed density curve (actually, an idealized normal curve) plotted on the side. Notice what this reveals: judging by the bars alone, the C group (far right) may not look much different, but there is a bulk of points towards the top of the graph in that group that would not be evident without these other features added in.

Most complex: While bar plots are still the “standard” for plotting these kinds of data, as I’ve noted elsewhere in this post, boxplots have some distinct advantages. Chief among these is that they rarely need “added elements,” such as overplotted points, to relay considerably more information than a bar plot. However, for added complexity, violin plots can be substituted for boxplots, or points can be plotted over top, as desired.

Consider how much more information and nuance is available from a boxplot than from a bar plot. Here, we get information about “lopsidedness,” outliers, range of the “central region,” and more.
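For a sense of what a boxplot is actually encoding, here is a small sketch that computes the usual ingredients by hand. It assumes the common 1.5 × IQR rule for flagging outliers and the “inclusive” quartile method from Python’s standard library; different software may use slightly different quartile definitions, and the data are invented.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Compute the pieces a typical boxplot draws."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are drawn as outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical data with one suspiciously large value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

A bar plot of these same ten values would show a mean of 7.5 and nothing else. The summary above additionally tells us that the median is only 5.5 and that a single value of 30 is doing most of the work.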

Three (or More) Variables

Graph Types For All Quantitative Variables

Oof…are you sure? Like, I mean, are you really sure? Is such a graph essential for your narrative and related to your analytical approach? If not, you should consider…

  • Just plotting two variables against each other at a time, even if that requires three graph panels instead of one.
  • Using ordination to “collapse” the trends in your data down to fewer axes.

But if you’re really sure…

Your only options: You will need to use a three-dimensional plot. These are as fun to make and to interpret as they sound like they’d be! This requires choosing which variable/variables will occupy the “horizontal” and “vertical” axes (more resolution will often be visible for the one vertical variable than for the two horizontal ones). It also requires choosing a “viewing frame” to present to your audience (where will you be peering into the “cube” from?), an issue not encountered in two-dimensional space.

Consider that this graph actually contains a fourth variable–something categorical is being represented by different colors. Hopefully, you can see why I say that this is not a road that should be traveled lightly. For me, it’s hard to get a sense of what the overall trend is here, let alone what I should be learning from the various spreads of points. And a color-blind person might not extract much from this graph at all!

If you do choose to go this route, my best advice would be to:

  • Restrict yourself to just three variables–any more, and it could overload the plotting area and also your reader’s cognitive faculties.
  • Consider not plotting the points and instead plotting just the “trend plane,” if the focus can be on the overall trend and not the individual data points (a “confidence plane” can be added, as desired).
  • Build the graph in an interactive environment where the viewer can freely spin the graph to view it from whatever angle they choose. The selection of the “viewing frame” can distort interpretation of a graph (making some patterns more or less obvious), and that bias can be removed by putting the reader in control of their own “viewing frame.”
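As for where a “trend plane” might come from: one option (my assumption here, since other models are possible) is an ordinary least-squares fit of z = a + b·x + c·y. The sketch below does that fit in plain Python by solving the normal equations. `fit_plane` is a hypothetical name and the data are made up so that they sit exactly on a known plane.

```python
def fit_plane(xs, ys, zs):
    """Least-squares fit of z = a + b*x + c*y; returns (a, b, c)."""
    n = len(xs)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs))
    # Augmented matrix of the normal equations.
    m = [
        [float(n), sx, sy, sz],
        [sx, sxx, sxy, sxz],
        [sy, sxy, syy, syz],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("x and y do not span a plane; no unique fit")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(3):
            if r != col:
                factor = m[r][col]
                m[r] = [rv - factor * cv for rv, cv in zip(m[r], m[col])]
    return m[0][3], m[1][3], m[2][3]


# Hypothetical points lying exactly on z = 1 + 2x + 3y.
xs = [0, 1, 0, 1, 2]
ys = [0, 0, 1, 1, 3]
zs = [1 + 2 * x + 3 * y for x, y in zip(xs, ys)]
a, b, c = fit_plane(xs, ys, zs)
print(round(a, 6), round(b, 6), round(c, 6))
```

With real (noisy) data, the three coefficients define the plane you would draw in place of the cloud of points.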

Graph Types For One or More Categorical Variables and One or More Quantitative Variables

These are actually generally alright, both to make and to interpret, but because they contain so much more information that has to be interpreted by the reader and “coded” within the graph by the creator, they start out complex. For that reason, you ought to consider ways of reducing complexity wherever possible.

Your graph type options here include grouped boxplots or bar plots and multiple scatterplots. Everything said elsewhere in this post about these graph types applies here as well.

A grouped boxplot like this one uses “nestedness” on the X axis, as well as (in this case) color, to distinguish the boxes for each combination of the two categorical variables. The one quantitative variable then occupies the Y axis.

A multiple scatterplot, with color (and a legend) distinguishing two categorical groups from each other for both the individual points and for the trend lines. Consider how else you could have indicated the different groups here besides by using color–there are other options!

Some rules to remember for these graphs:

  • You will need to use nesting or different colors/shapes/line types, etc. to differentiate between different subcategories.
  • Don’t always reach for color first. Consider using text, lines, shadings, point shapes, size, and other graphic parameters to indicate category affiliation. Many of these are as effective as color at differentiation, if not more so, and they frequently go underutilized.
  • Remember that some people are color-blind and that others print or view media primarily in black and white. If you do choose to use color, make sure to opt for a colorblind-friendly palette.
  • Don’t use redundant indicators (e.g., colors and line types both used to indicate category affiliation). Not only does this make the graph more complex to build than it needs to be, but it increases cognitive load for your reader–many readers will find it counter-intuitive that two different kinds of elements convey the same information and will waste time trying to figure out if that is or isn’t the case.
  • Don’t overdo it! These graphs are already complex thanks to the volume of data being plotted–eliminate extraneous complexity wherever you can and leave any associations involving 4+ variables to discussions in words rather than in graphics.
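A couple of these rules can be baked into a small helper so you don’t have to remember them every time. The sketch below is an illustration under my own assumptions: `assign_styles` is a hypothetical function, the marker and line-type codes follow matplotlib’s naming, and the colors are the Okabe-Ito colorblind-friendly palette. It picks exactly one visual channel per categorical variable (no redundant indicators) and refuses to run out of distinguishable options.

```python
# Okabe-Ito colorblind-friendly palette (hex codes).
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]

# Marker and line-type codes written the way matplotlib spells them.
CHANNELS = {
    "marker": ["o", "s", "^", "D", "v", "x"],
    "linestyle": ["-", "--", ":", "-."],
    "color": OKABE_ITO,
}


def assign_styles(categories, channel="marker"):
    """Map each category to ONE visual cue from the chosen channel."""
    options = CHANNELS[channel]
    unique = sorted(set(categories))
    if len(unique) > len(options):
        raise ValueError(
            f"{len(unique)} categories is too many for '{channel}'; simplify the graph"
        )
    return {cat: {channel: opt} for cat, opt in zip(unique, options)}


# Hypothetical category labels, invented for illustration.
styles = assign_styles(["native", "invasive", "native"])
print(styles)
# {'invasive': {'marker': 'o'}, 'native': {'marker': 's'}}
```

Notice that the default channel is point shape, not color, and that the error when you exceed the available options is a hint that the graph itself may be getting too complex.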

Graph Types For All Categorical Variables

…Are you sure? I’ll be honest–this doesn’t come up very much! A contingency table that uses a nested structure is probably your best bet…

Here, the columns represent categories in one categorical variable. Then, the rows mark different drugs, another categorical variable. The “sub-rows” here then reflect quantitative data on age and marker level, but you could imagine these instead holding counts of, say, sexes of individuals being subjected to each drug.
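Building the counts for a nested contingency table like that takes very little code. Here is a minimal sketch using only Python’s standard library; the records, the variable names (`drug`, `sex`, `outcome`), and both helper functions are hypothetical and exist only to show the shape of the idea.

```python
from collections import Counter


def nested_counts(records, row_keys, col_key):
    """Tally records by (nested row categories, column category)."""
    counts = Counter()
    for rec in records:
        row = tuple(rec[k] for k in row_keys)
        counts[(row, rec[col_key])] += 1
    return counts


def format_table(counts):
    """Lay the tallies out as tab-separated text, one nested row per line."""
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    lines = ["\t".join(["group"] + cols)]
    for row in rows:
        cells = [str(counts[(row, col)]) for col in cols]
        lines.append("\t".join([" / ".join(row)] + cells))
    return "\n".join(lines)


# Hypothetical records with three categorical variables each.
records = [
    {"drug": "A", "sex": "F", "outcome": "improved"},
    {"drug": "A", "sex": "F", "outcome": "improved"},
    {"drug": "A", "sex": "M", "outcome": "no change"},
    {"drug": "B", "sex": "F", "outcome": "no change"},
    {"drug": "B", "sex": "M", "outcome": "improved"},
]

counts = nested_counts(records, row_keys=("drug", "sex"), col_key="outcome")
print(format_table(counts))
```

Each printed row is one drug-by-sex combination, and each column is one outcome, which is the same nesting the table above uses.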

However, for something a bit more novel, you could try a “slanty” stacked bar plot, such as the one shown below:

On this graph, the left versus the right sides of each bar reflect different subcategories, as do different colors and different bars, so the relative frequencies of three categorical variables are represented.

What Goes Where On Graphs?

“Ok, Alex, I feel much better about choosing graphs for my data! But how do I decide which variables should go where on my graphs?”

Good question! For the most part, there is no choice–your independent variable (or explanatory variable, or experimental variable) goes on the horizontal axis and the dependent variable goes on the vertical axis. This is what most people find intuitive and is convention. So, go against this convention only when you have a good reason!

If you really don’t have an independent variable (your work wasn’t hypothesis-driven, perhaps), then you should choose the orientation that makes the pattern you are trying to highlight easier to notice. Usually, one configuration will work better than the other for this, so try both initially. Get some feedback from some trusted colleagues or mentors if needed!

How Complex Can I Make My Graphs?

“So, Alex, when are you going to explain how I decide on the appropriate level of complexity for my graphs–don’t think I haven’t noticed that you haven’t done that yet!”

You’re right! Let’s do that right now. And, here again, I have good news: You can figure out how much complexity is right for your graph by answering just three simple questions:

  1. Who’s your audience? How much complexity would they expect? How much complexity can they handle?
  2. What’s your message? To what extent is complexity a part of it?
  3. How transparent about your data do you need to be?

If your audience is technically trained (many aren’t!) and technically oriented (many aren’t, even if they are technically trained!), then high complexity is fair. That doesn’t mean you should aim for high complexity, but it does mean you’d have “permission” to do so. A general audience, meanwhile, should be insulated from complexity unless and until they ask for it.

Sometimes, our message is that systems/processes are complex! When that is the case, complexity becomes “the point,” and we need to show it in our graphs. Other times, complexity would only detract from the message we are trying to convey while adding little else–in those instances, we should opt for a simpler option.

And, of course, we also have responsibilities as scientists to be open and honest about our data. When our data are “messy” and our patterns are vague or uncertain, we have an obligation to represent that uncertainty in a graph so our readers can decide how they feel about those patterns themselves, knowing as much as we do when they make that call.

How Complex Should I Make My Graphs??

“Alex, what kind of half-answer was that?! Can’t you just tell me how complex I should make my graphs? All these sets of three questions are making my head spin!”

Alright, fine–while no answer about complexity will work in every instance (see the previous section!!!), here’s my hot take on this:

Alex’s hot take: If you’re really not sure, start out making as complex a graph as you possibly can. Here are some benefits I see to this approach:

  1. You can show your complex graph to a sample audience and see if they can follow along and get “the point.” If they can’t, you can cut back on complexity knowing that you needed to do it.
  2. Worst-case scenario, you will hopefully know how to make the most complex version of your graph you could ever need! Who knows–maybe you’ll need one like it in the future and won’t have to learn how to do it at that point.
  3. By making a more complex graph, one where all the nuances of your data set are on full display, something new and cool might emerge that you wouldn’t have discovered otherwise!
  4. If you reveal more complexity, no one can accuse you of “hiding behind your statistics.” Raw data can’t lie, so if you show them in all their messy glory, readers are empowered to make all their own decisions.
  5. Remember that graphical elements don’t need to have equal “weight,” or visual emphasis. You can include “complexity elements” but downweight their visual emphasis by using transparency, gray tones, thinner lines, etc. to force them into the “background” so that the patterns and trends come to the “foreground” for most readers.

A Handy Handyout

That’s everything I have to say on this topic (for now)! That said, in case you’d like to share this resource with others who are strapped for time, or you would like to have a shorter version of it to reference, here is a “take-away” version as a Word Document. Feel free to share it! Many of the links will likely be broken–sorry about that!