Doing a PCA in R is easy: Just run the function `prcomp()`

on your matrix of scaled numeric predictor variables. There’s just one problem, however. The result is an object of class `prcomp`

that doesn’t fit nicely into the tidyverse framework, e.g. for visualization. While it’s reasonably easy to extract the relevant info with some base-R manipulations, I’ve never been happy with this approach. But now, I’ve realized that all the necessary functions to do this tidyverse-style are available in the broom package.

For our PCA example, we’ll need the packages tidyverse and broom. Note that as of this writing, we need the current development version of broom because of a bug in `tidy.prcomp()`

. We’ll also use the cowplot package for plot themes.

`library(tidyverse)`

`# ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──`

```
# ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
# ✓ tibble 3.0.3 ✓ dplyr 1.0.2
# ✓ tidyr 1.1.2 ✓ stringr 1.4.0
# ✓ readr 1.3.1 ✓ forcats 0.5.0
```

```
# ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
# x dplyr::filter() masks stats::filter()
# x dplyr::lag() masks stats::lag()
```

```
library(broom) # devtools::install_github("tidymodels/broom")
library(cowplot)
```

We’ll be analyzing the `biopsy`

dataset, which comes originally from the MASS package. It is a breast cancer dataset from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumors for 699 patients; each of nine attributes was scored on a scale of 1 to 10. The true outcome (benign/malignant) is also known.

`biopsy <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/biopsy.csv")`

```
# Parsed with column specification:
# cols(
# clump_thickness = col_double(),
# uniform_cell_size = col_double(),
# uniform_cell_shape = col_double(),
# marg_adhesion = col_double(),
# epithelial_cell_size = col_double(),
# bare_nuclei = col_double(),
# bland_chromatin = col_double(),
# normal_nucleoli = col_double(),
# mitoses = col_double(),
# outcome = col_character()
# )
```

In general, when performing PCA, we’ll want to do (at least) three things:

- Look at the data in PC coordinates.
- Look at the rotation matrix.
- Look at the variance explained by each PC.

Let’s do these three things in turn.

## Look at the data in PC coordinates

We start by running the PCA and storing the result in a variable `pca_fit`

. There are two issues to consider here. First, the `prcomp()`

function can only deal with numeric columns, so we need to remove all non-numeric columns from the data. This is straightforward using the `where(is.numeric)`

tidyselect construct. Second, we normally want to scale the data values to unit variance before PCA. We do so by using the argument `scale = TRUE`

in `prcomp()`

.

```
pca_fit <- biopsy %>%
select(where(is.numeric)) %>% # retain only numeric columns
prcomp(scale = TRUE) # do PCA on scaled data
```

As an alternative to `scale = TRUE`

, we could also have scaled the data by explicitly invoking the `scale()`

function.

```
pca_fit <- biopsy %>%
select(where(is.numeric)) %>% # retain only numeric columns
scale() %>% # scale data
prcomp() # do PCA
```

Now, we want to plot the data in PC coordinates. In general, this means combining the PC coordinates with the original dataset, so we can color points by categorical variables present in the original data but removed for the PCA. We do this with the `augment()`

function from broom, which takes as arguments the fitted model and the original data. The columns containing the fitted coordinates are called `.fittedPC1`

, `.fittedPC2`

, etc.

```
pca_fit %>%
augment(biopsy) %>% # add original dataset back in
ggplot(aes(.fittedPC1, .fittedPC2, color = outcome)) +
geom_point(size = 1.5) +
scale_color_manual(
values = c(malignant = "#D55E00", benign = "#0072B2")
) +
theme_half_open(12) + background_grid()
```

## Look at the data in PC coordinates

Next, we plot the rotation matrix. The rotation matrix is stored as `pca_fit$rotation`

, but here we’ll extract it using the `tidy()`

function from broom. When applied to `prcomp`

objects, the `tidy()`

function takes an additional argument `matrix`

, which we set to `matrix = "rotation"`

to extract the rotation matrix.

```
# extract rotation matrix
pca_fit %>%
tidy(matrix = "rotation")
```

```
# # A tibble: 81 x 3
# column PC value
# <chr> <dbl> <dbl>
# 1 clump_thickness 1 -0.302
# 2 clump_thickness 2 -0.141
# 3 clump_thickness 3 0.866
# 4 clump_thickness 4 -0.108
# 5 clump_thickness 5 0.0803
# 6 clump_thickness 6 -0.243
# 7 clump_thickness 7 -0.00852
# 8 clump_thickness 8 0.248
# 9 clump_thickness 9 -0.00275
# 10 uniform_cell_size 1 -0.381
# # … with 71 more rows
```

Now in the context of a plot:

```
# define arrow style for plotting
arrow_style <- arrow(
angle = 20, ends = "first", type = "closed", length = grid::unit(8, "pt")
)
# plot rotation matrix
pca_fit %>%
tidy(matrix = "rotation") %>%
pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
ggplot(aes(PC1, PC2)) +
geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
geom_text(
aes(label = column),
hjust = 1, nudge_x = -0.02,
color = "#904C2F"
) +
xlim(-1.25, .5) + ylim(-.5, 1) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal_grid(12)
```

## Look at the variance explained by each PC

Finally, we’ll plot the variance explained by each PC. We can again extract this information using the `tidy()`

function from broom, now by setting the `matrix`

argument to `matrix = "eigenvalues"`

.

```
pca_fit %>%
tidy(matrix = "eigenvalues")
```

```
# # A tibble: 9 x 4
# PC std.dev percent cumulative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2.43 0.656 0.656
# 2 2 0.881 0.0862 0.742
# 3 3 0.734 0.0599 0.802
# 4 4 0.678 0.0511 0.853
# 5 5 0.617 0.0422 0.895
# 6 6 0.549 0.0335 0.928
# 7 7 0.543 0.0327 0.961
# 8 8 0.511 0.0290 0.990
# 9 9 0.297 0.00982 1
```

Now in the context of a plot.

```
pca_fit %>%
tidy(matrix = "eigenvalues") %>%
ggplot(aes(PC, percent)) +
geom_col(fill = "#56B4E9", alpha = 0.8) +
scale_x_continuous(breaks = 1:9) +
scale_y_continuous(
labels = scales::percent_format(),
expand = expansion(mult = c(0, 0.01))
) +
theme_minimal_hgrid(12)
```

The first component captures 65% of the variation in the data and, as we can see from the first plot in this post, nicely separates the benign samples from the malignant samples.