# Fast, easy improvements to bar charts in ggplot2

geom_col() makes a straightforward bar chart in ggplot2, but at default settings, the result may be unusable.

Consider this plot of Catholicism in 19th century Switzerland. Notice (1) the horizontal axis is unreadable, (2) the bars don’t seem to be in any meaningful order, and (3) the whole plot is a boring grayscale.

library(datasets)
library(data.table)
library(ggplot2)

swiss = as.data.table(swiss, keep.rownames = TRUE)

#> str(swiss)
# Classes ‘data.table’ and 'data.frame':	47 obs. of  7 variables:
#  $ rn              : chr  "Courtelary" "Delemont" "Franches-Mnt" "Moutier" ...
#  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
#  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
#  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
#  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
#  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
#  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

ggplot(swiss, aes(rn, Catholic)) +
  geom_col()


Let’s solve the first problem by rotating the chart with coord_flip(). We’ll also relabel “rn” to something more descriptive. Note that in ggplot2, once a variable is mapped to the x aesthetic it stays the x aesthetic, irrespective of coord_flip() — so the province label is still set with labs(x = ...).

ggplot(swiss, aes(rn, Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province")


With the province names unobstructed, we can see the bars actually were in an order all along (reverse alphabetical, reading top to bottom), but you’re more likely to want A-to-Z order or a sort by percentage Catholic.

Either choice requires just one extra line of code: scale_x_discrete(), whose limits argument accepts a character vector giving the desired order.

## A-to-Z order
## (plot not shown)
ggplot(swiss, aes(rn, Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province") +
  scale_x_discrete(limits = swiss[order(-rn), rn])

## descending order: most to least Catholic
## for ascending order, use order(-Catholic) instead
ggplot(swiss, aes(rn, Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province") +
  scale_x_discrete(limits = swiss[order(Catholic), rn])
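As an aside (my sketch, not part of the original post), the same most-to-least ordering can be achieved without scale_x_discrete() by re-leveling the axis variable directly with base R’s reorder():

```r
library(ggplot2)

## rebuild the plotting data from the built-in dataset,
## so this snippet stands alone without data.table
swiss_df <- datasets::swiss
swiss_df$rn <- rownames(swiss_df)

## reorder() sorts the provinces by percent Catholic inside aes(),
## so no separate axis-limits call is needed
p <- ggplot(swiss_df, aes(reorder(rn, Catholic), Catholic)) +
  geom_col() +
  coord_flip() +
  labs(x = "province")
```

Which approach is better is largely taste: scale_x_discrete() keeps the ordering logic out of aes(), while reorder() keeps the whole specification in one place.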



To finish, we’ll round out the labels and add color for flair.

ggplot(swiss, aes(rn, Catholic)) +
  geom_col(color = "red",  # outline of each bar
           fill = "red",   # main color
           alpha = 0.8) +  # 80% opacity for the fill
  coord_flip() +
  labs(x = "province",
       y = "percent of residents",
       title = "Catholicism in 19th century Switzerland") +
  scale_x_discrete(limits = swiss[order(Catholic), rn]) +
  theme_minimal()  # white background


The datasets package documentation contains a more extensive description of what is visualized here, if you’re curious.

# Linear regression by hand

> … most companies would be taking a huge step forward if they got somebody who knows how to do linear regression.

When you have $x_1, \ldots, x_k$ predictors of a scalar-valued outcome $y$ with observations indexed by $i$ and residuals denoted $\varepsilon$, a model of the form

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$

or (equivalently) in matrix notation

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$

is best[1] estimated using ordinary least squares, the workhorse of linear regression. The underlying math is a fair bit of matrix algebra which, when all is said and done, returns

$\boldsymbol{\hat{\beta}} = \left(\mathbf{X^\prime X}\right)^{-1} \mathbf{X^\prime y}.$

This was the one equation my graduate school program director urged every student to know by heart.

X prime X inverse X prime y, “prime” meaning transpose, yields OLS estimates of linear regression coefficients. On the left-hand side, the hat symbol denotes an estimate from a sample as opposed to a true value in the population.
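For readers curious where that formula comes from, a quick sketch (my gloss, not part of the original post): OLS chooses $\boldsymbol{\hat{\beta}}$ to minimize the sum of squared residuals, and setting the gradient to zero yields the normal equations,

$S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\prime (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \qquad \frac{\partial S}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^\prime\mathbf{y} + 2\mathbf{X}^\prime\mathbf{X}\boldsymbol{\beta} = \mathbf{0} \;\Rightarrow\; \mathbf{X}^\prime\mathbf{X}\,\boldsymbol{\hat{\beta}} = \mathbf{X}^\prime\mathbf{y},$

which can be solved for $\boldsymbol{\hat{\beta}}$ whenever $\mathbf{X}^\prime\mathbf{X}$ is invertible.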

We can walk through a real-data example using mtcars, an automobile-themed dataset built into R. Regressing fuel economy (mpg) on weight (wt) and number of cylinders (cyl),

$\text{mpg}_i = \beta_0 + \beta_1 \text{wt}_i + \beta_2 \text{cyl}_i + \varepsilon_i,$

by

summary(lm(mpg ~ wt + cyl, data = mtcars))


will give you

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 **


wherein the estimates (first column of numbers) from top to bottom correspond to

$\boldsymbol{\hat{\beta}} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}.$

Per the foregoing all-important equation, only two objects are necessary to compute the estimates manually: (1) the matrix $\mathbf{X}$ and (2) the vector $\mathbf{y}$. Both are easy to extract from mtcars.

## getting X the easy but obfuscated way
X = model.matrix(mpg ~ wt + cyl, data = mtcars)

## getting X a more transparent way:
## grab only the wt and cyl columns from mtcars
## and prepend a column of ones to represent the intercept
X = cbind(1, mtcars[, c("wt", "cyl")])
X = as.matrix(X) # transform data.frame to matrix type

## getting y
y = mtcars[, "mpg"]


Given $\mathbf{X}$ and $\mathbf{y}$, all that’s left are matrix operations. Mathematically they are described in this tutorial from Harvey Mudd College. Computationally, t() transposes, solve() inverts, and %*% multiplies matrices.
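If those operators are new to you, here is a tiny standalone check (my example, not from the post) that they behave as claimed — in particular, that a matrix times its solve()-computed inverse gives the identity:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # a small invertible matrix

t(A)               # transpose: rows become columns
A_inv <- solve(A)  # with one argument, solve() returns the inverse
A %*% A_inv        # matrix product; equals the 2x2 identity

stopifnot(all.equal(A %*% A_inv, diag(2)))
```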

Now for the moment of truth: X prime X inverse X prime y.

solve(t(X) %*% X) %*% t(X) %*% y

         [,1]
1   39.686261
wt  -3.190972
cyl -1.507795


This formula doesn’t get us p-values, but who needs those anyway. 😉
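As a sanity check (an addition of mine, not in the original), the hand-computed coefficients should match lm()’s to floating-point tolerance:

```r
## rebuild X and y as above, from the built-in mtcars data
X <- cbind(1, as.matrix(mtcars[, c("wt", "cyl")]))
y <- mtcars[, "mpg"]

## X prime X inverse X prime y, by hand
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

## compare against lm()'s estimates
fit <- lm(mpg ~ wt + cyl, data = mtcars)
stopifnot(all.equal(unname(drop(beta_hat)), unname(coef(fit))))
```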

[1] OLS is BLUE—the best linear unbiased estimator—under certain assumptions that are very important but beyond the scope of this post.