Fast, easy improvements to bar charts in ggplot2

geom_col() makes a straightforward bar chart in ggplot2, but at default settings, the result may be unusable.

Consider this plot of Catholicism in 19th century Switzerland. Notice (1) the horizontal axis is unreadable, (2) the bars don’t seem to be in any meaningful order, and (3) the whole plot is a boring grayscale.

library(datasets)
library(data.table)
library(ggplot2)

swiss = as.data.table(swiss, keep.rownames = TRUE)

#> str(swiss)
#Classes ‘data.table’ and 'data.frame':	47 obs. of  7 variables:
# $ rn              : chr  "Courtelary" "Delemont" "Franches-Mnt" "Moutier" ...
# $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
# $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
# $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
# $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
# $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
# $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

ggplot(swiss, aes(rn, Catholic)) +
  geom_col()

[plot: geom_col_defaults]

Let’s solve the first problem by rotating the chart with coord_flip(). We’ll also relabel “rn” to something more descriptive. Of note, in ggplot2, once an x axis, always an x axis: even after coord_flip(), the province axis is still addressed through x, which is why we use labs(x = ...).

ggplot(swiss, aes(rn, Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province")

[plot: geom_col_flipped_xlab]

With the province names unobstructed, we can see the bars were in an order all along (reverse alphabetical, reading top to bottom), but you’re more likely to want to sort A to Z or by percent Catholic.

Either choice requires just one extra line of code with scale_x_discrete(), whose limits argument accepts a character vector of category names in the desired order. One subtlety: coord_flip() draws the first element of limits at the bottom of the plot, so reading A to Z from the top means passing the names in reverse alphabetical order, hence order(-rn) below.

## A-to-Z order
## (plot not shown)
ggplot(swiss, aes(rn, Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province") +
  scale_x_discrete(limits = swiss[order(-rn), rn])

## descending order: most to least Catholic
## for ascending order, use order(-Catholic) instead
ggplot(swiss, aes(rn, Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province") +
  scale_x_discrete(limits = swiss[order(Catholic), rn])

[plot: geom_col_ordered]

To finish, we’ll round out the labels and add color for flair.

ggplot(swiss, aes(rn, Catholic)) +
  geom_col(color = "red", # outline of each bar
           fill = "red", # main color
           alpha = 0.8) + # 80% opacity for the fill
  coord_flip() +
  labs(x = "province", 
       y = "percent of residents", 
       title = "Catholicism in 19th century Switzerland") +
  scale_x_discrete(limits = swiss[order(Catholic), rn]) +
  theme_minimal() # white background

[plot: geom_col_color]

The swiss help page in the datasets package (?swiss) contains a more extensive description of the data visualized here, if you’re curious.
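As an aside, the same sort can be achieved without scale_x_discrete() by reordering the factor directly inside aes() with base R’s reorder() — a sketch, assuming the same swiss data.table built above:

```r
library(datasets)
library(data.table)
library(ggplot2)

swiss = as.data.table(swiss, keep.rownames = TRUE)

## reorder(rn, Catholic) turns rn into a factor whose levels
## follow percent Catholic, so no scale_x_discrete(limits = ...)
## call is needed
ggplot(swiss, aes(reorder(rn, Catholic), Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province")
```

Because reorder() sorts the levels ascending by Catholic, the bars run least-to-most from the bottom, matching the limits approach.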

Linear regression by hand

… most companies would be taking a huge step forward if they got somebody who knows how to do linear regression.

Hacker News user ‘mindcrime’ on the necessary skills for data science

When you have x_1, \ldots, x_k predictors of a scalar-valued outcome y with observations indexed by i and residuals denoted \varepsilon, a model of the form

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i

or (equivalently) in matrix notation

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

is best[1] estimated using ordinary least squares, the workhorse of linear regression. The underlying math is a fair bit of matrix algebra which, when all is said and done, returns

\boldsymbol{\hat{\beta}} = \left(\mathbf{X^\prime X}\right)^{-1} \mathbf{X^\prime y}.

This was the one equation my graduate school program director urged every student to know by heart.

X prime X inverse X prime y, “prime” meaning transpose, yields OLS estimates of linear regression coefficients. On the left-hand side, the hat symbol denotes an estimate from a sample as opposed to a true value in the population.

We can walk through a real-data example using mtcars, an automobile-themed dataset built into R. Regressing fuel economy (mpg) on weight (wt) and number of cylinders (cyl),

\text{mpg}_i = \beta_0 + \beta_1 \text{wt}_i + \beta_2 \text{cyl}_i + \varepsilon_i,

by

summary(lm(mpg ~ wt + cyl, data=mtcars))

will give you

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 ** 

wherein the estimates (first column of numbers) from top to bottom correspond to

\boldsymbol{\hat{\beta}} = \begin{bmatrix} \hat{\beta_0} \\  \hat{\beta_1} \\  \hat{\beta_2} \end{bmatrix}.

Per the foregoing all-important equation, only two objects are necessary to compute the estimates manually: (1) the matrix \mathbf{X} and (2) the vector \mathbf{y}. Both are easy to extract from mtcars.

## getting `X` the easy but obfuscated way
X = model.matrix(mpg ~ wt + cyl, data=mtcars)

## getting `X` a more transparent way:
## grab only the `wt` and `cyl` columns from mtcars
## and prepend a column of ones to represent the intercept
## (cbind() names the column of ones "1", which is why the manual
## output below shows "1" where lm() prints "(Intercept)")
X = cbind(1, mtcars[, c("wt", "cyl")])
X = as.matrix(X) # convert the data.frame to a matrix

## getting `y`
y = mtcars[, "mpg"]

Given \mathbf{X} and \mathbf{y}, all that’s left are matrix operations. Mathematically they are described in this tutorial from Harvey Mudd College. Computationally, t() transposes, solve() inverts, and %*% multiplies matrices.

Now for the moment of truth—X prime X inverse X prime y.

solve(t(X) %*% X) %*% t(X) %*% y
#          [,1]
# 1   39.686261
# wt  -3.190972
# cyl -1.507795
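As a quick sanity check, the hand-rolled estimates agree with lm() to floating-point precision (assuming X and y built as above):

```r
## compare manual OLS estimates with R's lm()
X = as.matrix(cbind(1, mtcars[, c("wt", "cyl")]))
y = mtcars[, "mpg"]

beta_manual = solve(t(X) %*% X) %*% t(X) %*% y
beta_lm = coef(lm(mpg ~ wt + cyl, data = mtcars))

## unname() because the manual intercept column is labeled "1",
## not "(Intercept)"
all.equal(unname(drop(beta_manual)), unname(beta_lm))
```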

This formula doesn’t get us p-values, but who needs those anyway. 😉
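If you did want them, a few more lines suffice. A sketch, assuming X and y as above: the coefficient variances are the diagonal of sigma-hat-squared times (X′X) inverse, with the residual variance estimated as RSS / (n − k).

```r
## standard errors, t statistics, and p-values by hand
X = as.matrix(cbind(1, mtcars[, c("wt", "cyl")]))
y = mtcars[, "mpg"]

beta_hat = drop(solve(t(X) %*% X) %*% t(X) %*% y)
resid = y - X %*% beta_hat
n = nrow(X)
k = ncol(X)                              # coefficients, intercept included
sigma2 = sum(resid^2) / (n - k)          # estimated residual variance
se = sqrt(diag(sigma2 * solve(t(X) %*% X)))
t_stat = beta_hat / se
p_val = 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)

round(cbind(Estimate = beta_hat, `Std. Error` = se,
            `t value` = t_stat, `Pr(>|t|)` = p_val), 6)
```

The resulting table reproduces the summary(lm(...)) output shown earlier.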


[1] OLS is BLUE—the best linear unbiased estimator—under certain assumptions that are very important but beyond the scope of this post.