Descriptive statistics for categorical variables in R

Below are some basic commands for calculating descriptive statistics and generating the associated graphs. After that, I showcase the table1 package, which makes it easy to generate a table of summary statistics automatically. Lastly, I link to some helpful data visualization resources and showcase the patchwork package, which combines multiple graphs into a single display.

Packages Needed for Descriptive Statistics and Data Visualization

This code checks that the packages required for this chapter are installed, installs any that are missing, and loads them into your session.

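# Build an unevaluated call to require(); x is supplied inside the function below
req <- substitute(require(x, character.only = TRUE))
libs <- c("psych", "tidyverse", "table1", "patchwork")
# Try to load each package; if loading fails, install it and try again
sapply(libs, function(x) eval(req) || {install.packages(x); eval(req)})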

Interval or Continuous Variables

There are a variety of packages and commands that will return various descriptive statistics. Here are some options:

psych::describe(mydata, digits = 2)
psych::describe(mydata$intvar, digits = 2)

You can also get descriptive statistics for interval variables broken out by the levels of a grouping (categorical) variable.

psych::describeBy(mydata, group = mydata$groupvar, digits = 2)
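As a rough cross-check of what a by-group summary reports, the sketch below computes the count, mean, and standard deviation per group in base R. It uses the built-in mtcars dataset as a stand-in for mydata, with mpg playing the role of intvar and cyl the role of groupvar. Those substitutions are assumptions for illustration only.

```r
# Stand-in data: mtcars ships with R; mpg ~ intvar, cyl ~ groupvar (assumed names)
mydata <- mtcars

# Count, mean, and standard deviation of mpg within each level of cyl
group_n    <- tapply(mydata$mpg, mydata$cyl, length)
group_mean <- tapply(mydata$mpg, mydata$cyl, mean)
group_sd   <- tapply(mydata$mpg, mydata$cyl, sd)

# Collect the results in one table, one row per group
by_group <- data.frame(
  group = names(group_n),
  n     = as.vector(group_n),
  mean  = round(as.vector(group_mean), 2),
  sd    = round(as.vector(group_sd), 2)
)
print(by_group)
```

The same pattern works for any other statistic: swap in median, min, or max as the function passed to tapply().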

Histograms (and related density and area plots) and boxplots are all useful for visualizing continuous variables. All of these can be refined by adding/changing arguments.

ggplot2::ggplot(data = mydata, aes(x = intvar)) + geom_histogram(binwidth = 5)

ggplot2::ggplot(data = mydata, aes(x = intvar)) + geom_density(kernel = "gaussian")

ggplot2::ggplot(data = mydata, aes(x = intvar)) + geom_area(stat = "bin")

ggplot2::ggplot(data = mydata, aes(x = intvar)) + geom_boxplot()
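To illustrate what refining the arguments can look like, here is a minimal sketch of a histogram with a data-driven bin width and explicit axis labels. The Freedman-Diaconis rule used for the bin width is my own choice for the example, not something the commands above require, and mtcars (with mpg standing in for intvar) is an assumed stand-in dataset.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R; mpg plays the role of intvar (assumption)
mydata <- mtcars

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
x  <- mydata$mpg
bw <- 2 * IQR(x) / length(x)^(1/3)

# Histogram using the computed bin width, with outlined bars and axis labels
p <- ggplot(mydata, aes(x = mpg)) +
  geom_histogram(binwidth = bw, fill = "grey70", colour = "black") +
  xlab("Miles per gallon") +
  ylab("Frequency")
print(p)
```

A fixed value such as binwidth = 5 is fine too; the point is that the bin width strongly shapes how the distribution looks, so it is worth trying a few.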

Categorical Variables

For simple frequency counts:

table(mydata$catvar)

Calculating proportions for a categorical variable is a two-step process:

object <- table(mydata$catvar)
prop.table(object)
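Here is the same two-step process as a small runnable sketch, with the proportions also converted to percentages. The built-in mtcars dataset is an assumed stand-in for mydata, with cyl playing the role of catvar.

```r
# Stand-in data: mtcars ships with R; cyl plays the role of catvar (assumption)
mydata <- mtcars

freq  <- table(mydata$cyl)   # step 1: frequency counts per category
props <- prop.table(freq)    # step 2: counts divided by the total

# Express the proportions as percentages rounded to one decimal place
pct <- round(100 * props, 1)
print(freq)
print(pct)
```

Because prop.table() divides by the overall total, the proportions always sum to 1.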

Bar charts are most often used to visualize categorical variables. You can have the bars reflect frequencies or percentages/proportions.

# Frequency Bar Graph

ggplot2::ggplot(data = mydata, aes(x = catvar)) +
   geom_bar() +
   xlab("X-axis label") +
   ylab("Frequency")

# Percentage/Proportion Bar Graph

ggplot2::ggplot(data = mydata, aes(x = catvar)) +
   geom_bar(aes(y = after_stat(count / sum(count)))) +
   xlab("X-axis label") +
   scale_y_continuous(labels = scales::percent_format(), name = "Proportion")
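An alternative sketch, if you would rather compute the proportions yourself and hand them to ggplot2: build a small data frame of proportions and draw it with geom_col(). The mtcars data (cyl standing in for catvar) and the object names prop_df and p are assumptions for illustration.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R; cyl plays the role of catvar (assumption)
mydata <- mtcars

# Turn the table of proportions into a data frame with one row per category
prop_df <- as.data.frame(prop.table(table(catvar = mydata$cyl)))
names(prop_df) <- c("catvar", "proportion")

# geom_col() draws bars whose heights are the values supplied in y
p <- ggplot(prop_df, aes(x = catvar, y = proportion)) +
  geom_col() +
  xlab("X-axis label") +
  scale_y_continuous(labels = scales::percent_format(), name = "Proportion")
print(p)
```

Precomputing the proportions makes it easy to inspect or reuse the numbers behind the graph, at the cost of one extra step.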

Generating a Summary Statistics Table

There are a variety of packages that have been created to facilitate the production of summary statistics tables. I’ll showcase table1 here. This site offers some helpful insights on how to make the most of the table1 package/function.

Before attempting to generate a table, you will want to first reclassify your categorical variables as factor variables.

# Classify your categorical variables as factor variables

mydata$catvar <- factor(mydata$catvar) 

# If you want to add value labels at the same time:

mydata$catvar <- factor(mydata$catvar, levels = c(1,2,3), 
   labels = c("label1", "label2", "label3"))
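One thing to watch when supplying levels: any value not listed in levels is silently converted to NA. A minimal sketch with made-up codes (the vector codes and the stray value 9 are hypothetical):

```r
# Hypothetical numeric codes; 9 is a code that is not listed in `levels`
codes <- c(1, 2, 3, 2, 1, 9)

catvar <- factor(codes, levels = c(1, 2, 3),
                 labels = c("label1", "label2", "label3"))

print(catvar)                          # the 9 becomes NA
print(table(catvar, useNA = "ifany"))  # counts per label, plus the NA
```

Running table() with useNA = "ifany" after recoding is a quick way to confirm that no codes were dropped by accident.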

# If using only some of the variables in a dataset, create a subset of your data.

mydata2 <- subset(mydata, select = c(var1, var2, var3))

# If you only need to exclude a variable or two (here var1 and var2):

mydata2 <- subset(mydata, select = -c(var1, var2))

# Generate your table of summary statistics (I'm including some arguments you may want to change)

table1::table1(~var1 + var2 + var3, data = mydata, na.rm = TRUE, digits = 1, format.number = TRUE)

# If you want to break out your summary statistics by groups:

table1::table1(~var1 + var2 + var3 | groupvar, data = mydata, na.rm = TRUE, digits = 1, format.number = TRUE)

# Copy and paste the table into your document!

Here’s an example of a summary statistics table generated by table1.

[Figure: summary statistics table generated by table1]

Patchwork for Combining Graphs

The patchwork package is also quite useful for displaying multiple graphs at once. Each graph is assigned to an object. They are then simply patched together using a few different options.

# Generate graphs, assigning each to a distinct object

p1 <- ggplot2::ggplot(data = mydata, aes(x = intvar1, y = intvar2)) + geom_point() + ggtitle("Graph title")

p2 <- ggplot2::ggplot(data = mydata, aes(x = catvar, y = intvar)) + geom_boxplot() + ggtitle("Graph title")

p3 <- ggplot2::ggplot(data = mydata, aes(x = intvar1, y = intvar2)) + geom_smooth() + ggtitle("Graph title")

p4 <- ggplot2::ggplot(data = mydata, aes(x = catvar)) + geom_bar() + ggtitle("Graph title")

# Patch the objects together

p1 / p2 / p3    # This stacks the graphs vertically

p1 + p2 + p3    # This aligns them horizontally

p1 / (p2 + p3)  # p1 is placed above p2 and p3, which are horizontal to one another

# To add an overall title, subtitle, and caption:

object <- p1 / (p2 + p3)
object + plot_annotation(title = "Title of overall graph", subtitle = "Subtitle if desired", 
   caption = "Caption at bottom of graph if desired")
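For a version you can run as-is, here is a minimal sketch that builds three graphs from the built-in mtcars dataset (an assumed stand-in for mydata) and patches them together. It also uses plot_layout() to make the top row taller and tag_levels to label the panels; those two refinements are my additions for illustration.

```r
library(ggplot2)
library(patchwork)

# Stand-in data: mtcars ships with R; cyl is treated as a categorical variable
mydata <- mtcars
mydata$cyl <- factor(mydata$cyl)

p1 <- ggplot(mydata, aes(x = wt, y = mpg)) + geom_point() + ggtitle("Scatterplot")
p2 <- ggplot(mydata, aes(x = cyl, y = mpg)) + geom_boxplot() + ggtitle("Boxplot")
p3 <- ggplot(mydata, aes(x = cyl)) + geom_bar() + ggtitle("Bar chart")

# p1 on top, p2 and p3 side by side below; the top row is twice as tall
combined <- p1 / (p2 + p3) +
  plot_layout(heights = c(2, 1)) +
  plot_annotation(title = "Title of overall graph", tag_levels = "A")
print(combined)
```

Saving the patched display to an object, as here, lets you print it or pass it to ggsave() later.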

Here are a couple of examples of patchwork at work:

(p1 + p2 + p3) / p4

Check out Little Miss Data’s R-bloggers post on patchwork for more information and examples.

This webpage is useful for adding titles, subtitles, captions, and tags. An example from that page:

What descriptive statistics are used for categorical variables?

Descriptive statistics used to analyse data for a single categorical variable include frequencies, percentages, fractions and/or relative frequencies (which are simply frequencies divided by the sample size) obtained from the variable's frequency distribution table.

How do you describe categorical data in statistics?

Categorical data is information that is divided into groups. For example, if an organisation or agency collects biodata on its employees, the resulting data is referred to as categorical.

What is the best way to visualize the descriptive statistics of a categorical variable?

Bar charts are most often used to visualize categorical variables. You can have the bars reflect frequencies or percentages/proportions.

How do I show descriptive statistics in R?

The descr() function (from the summarytools package) can display only a selection of descriptive statistics of your choice. For example, stats = c("mean", "sd") returns the mean and standard deviation, and stats = "fivenum" returns the minimum, first quartile, median, third quartile, and maximum.