A Simple Modification of Missingness Maps

 

Source: itc2.utk.edu

I am one of those who are becoming increasingly convinced that data cleaning should be done in a way that is open to scrutiny. This is where point-and-click software for data management falls short: the steps taken are not recorded for others to inspect. As an aspiring R Jedi, I was working on a relatively large dataset (37,000+ records) with a lot of missing values. I loaded it, counted the missing values in each column, and plotted them in R using the following code:


# Load and explore data
pop <- read.csv("consolidated data.csv", na.strings = "")
dim(pop)
str(pop)
names(pop)

# Display and quantify missing values per column
colSums(is.na(pop))

# Plot missing values
library(Amelia)
missmap(pop)

The ensuing plot turned out like this:

[Figure: missingmap1, the default missmap output]

As one can see, this pretty plot could do with some customization, so that we can tell the audience what we are mapping and get rid of some superfluous detail. The R documentation for the missmap function lists the following arguments and their defaults (see the Amelia package documentation for the full details):

missmap(obj, legend = TRUE, col = c("wheat", "darkred"), main,
        y.cex = 0.8, x.cex = 0.8, y.labels, y.at, csvar = NULL,
        tsvar = NULL, rank.order = TRUE, ...)

I am okay with the legend and colour scheme, but would like to give the plot a better title. Also, because there are so many records, the y-axis labels are completely illegible, and I would love to get rid of them. As I tinkered with this, I discovered that the customization is slightly different from what I would have done with the built-in graphics package. For instance, base graphics uses ylab for the y-axis title, whereas missmap takes the row labels for the y-axis through y.labels. Also, any adjustment to y.labels has to be matched in the y.at argument, which gives the positions of those labels.
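The pairing of labels and positions can be tried out without Amelia. The sketch below is only an illustration in base graphics, not missmap's own code: it builds a toy missingness map with image() and labels a few rows with axis(), whose labels and at arguments play the role that y.labels and y.at appear to play in missmap. The data frame toy and the names row_idx and row_labs are made up for this example.

```r
# Toy data (hypothetical; not the dataset from this post)
toy <- data.frame(a = c(1, NA, 3, NA, 5, 6),
                  b = c(NA, "x", "y", "z", NA, "w"),
                  c = 1:6)

# Logical matrix: TRUE where a value is missing
miss <- is.na(toy)
n <- nrow(miss)

# Label only some rows; labels and positions must have the same length
row_idx  <- c(1, 3, 6)
row_labs <- paste("row", row_idx)
stopifnot(length(row_idx) == length(row_labs))

# image() expects z with one row per x value, so transpose;
# rows are reversed so that row 1 is drawn at the top
image(x = seq_len(ncol(miss)), y = seq_len(n),
      z = t(miss[n:1, , drop = FALSE]) * 1,
      col = c("wheat", "darkred"), axes = FALSE,
      xlab = "", ylab = "", main = "Toy missingness map")

# Column names along the bottom, selected row labels on the left
axis(1, at = seq_len(ncol(miss)), labels = colnames(miss), lwd = 0)
axis(2, at = n + 1 - row_idx, labels = row_labs, las = 1, lwd = 0)
```

Because the rows are reversed for plotting, row r sits at position n + 1 - r on the y-axis. Changing row_labs without changing row_idx to match is exactly the kind of mismatch that puts labels in the wrong place or raises an error.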

After a little trial and not a few errors, I eventually came up with this:


missmap(pop,
 main = "Missingness Map of CCT Dataset",
 y.labels = NULL,
 y.at = NULL)

And this is what the plot now looks like:

[Figure: missingmap2, the customized missmap output]

This looked a bit better to me (though not perfect). I’ve given it my own title and removed the jumbled-up labels on the y-axis.

The moral of the story is that, by examining the R documentation and with a little luxury of time, I can tweak my output to suit my needs, tastes and quirks, as well as provide greater appeal for the target audience. Also, R helps us to make our data cleaning more reproducible, and therefore more transparent and credible.
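In the same reproducible spirit, the per-column counts from the start of the post can be turned into a small summary table that is easy to share alongside the map. This is a minimal sketch on a made-up data frame; toy, miss_summary and the column names are hypothetical and stand in for the real dataset.

```r
# Toy data frame (hypothetical; stands in for the real dataset)
toy <- data.frame(
  id     = 1:5,
  age    = c(23, NA, 31, NA, 40),
  region = c("N", "S", NA, "E", "W"),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
n_missing   <- colSums(is.na(toy))
pct_missing <- round(100 * n_missing / nrow(toy), 1)

# One row per variable, most incomplete first
miss_summary <- data.frame(
  variable    = names(n_missing),
  n_missing   = unname(n_missing),
  pct_missing = unname(pct_missing)
)
miss_summary <- miss_summary[order(-miss_summary$n_missing), ]

print(miss_summary, row.names = FALSE)
```

Sorting by the number of missing values puts the most incomplete variables at the top, which is usually where the cleaning effort needs to start.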

Filed under Computers & Internet
