I am increasingly convinced that data cleaning should be done in a way that is open to scrutiny. Lack of such transparency is one of the disadvantages of using point-and-click software for data management. As an aspiring R Jedi, I was working on a relatively large dataset (37,000+ records) that has a lot of missing values. I loaded it and plotted the missing values in R using the following code:
# Load and explore data
pop <- read.csv("consolidated data.csv", na.strings = "")
dim(pop)
str(pop)
names(pop)

# Display and quantify missing values
apply(pop, 2, function(x) sum(is.na(x)))

# Plot missing values
library(Amelia)
missmap(pop)
The ensuing plot turned out like this:
As one can see, this pretty plot could do with some customization, so that we can tell the audience what we are mapping and get rid of some superfluous detail. On looking at the R documentation for the missmap function, I discovered the following default parameters (see the Amelia package documentation for the full details):
missmap(obj, legend = TRUE, col = c("wheat", "darkred"), main,
        y.cex = 0.8, x.cex = 0.8, y.labels, y.at, csvar = NULL, tsvar =
        NULL, rank.order = TRUE, ...)
I am okay with the legend and colour scheme, but would like to give the plot a better title. Also, because there are so many records, the y-axis labels are completely illegible, and I would love to get rid of them. As I tinkered with this, I discovered that the customization is slightly different from what I would have done with R's built-in graphics package. For instance, the argument for y-axis labels in the latter is ylab, as against y.labels in Amelia. Also, any adjustment to y.labels had to be matched by a corresponding adjustment to the y.at argument.
After a little trial and not a few errors, I eventually came up with this:
missmap(pop, main = "Missingness Map of CCT Dataset", y.labels = NULL, y.at = NULL)
And this is what the plot now looks like:
This looked a bit better to me (though not perfect). I’ve given it my own title and removed the jumbled up axis labels on the y-axis.
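For anyone who would rather keep a few readable row labels than drop them all, here is a minimal sketch of setting y.labels and y.at together. It uses a made-up toy data frame (the column names and values are hypothetical, not from the CCT dataset), and it assumes that y.at takes the row numbers at which the corresponding y.labels entries are drawn, so the two vectors must have the same length.

```r
# Toy data standing in for the real dataset (hypothetical columns and values).
set.seed(1)
toy <- data.frame(
  age    = sample(c(18:60, NA), 200, replace = TRUE),
  income = sample(c(1000, 2000, 3000, NA), 200, replace = TRUE),
  region = sample(c("north", "south", NA), 200, replace = TRUE)
)

# Label every 50th row instead of all 200 rows.
# The same vector is used for the label text and the label positions.
label_rows <- seq(50, nrow(toy), by = 50)

# Only draw the map if the Amelia package is installed.
if (requireNamespace("Amelia", quietly = TRUE)) {
  Amelia::missmap(toy,
                  main = "Missingness Map of Toy Data",
                  y.labels = label_rows,
                  y.at = label_rows)
}
```

With 37,000+ records, a much larger step (say, every 5,000th row) would probably be needed to keep the axis legible.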
The moral of the story is that, by examining the R documentation and with a little luxury of time, I can tweak my output to suit my needs, tastes and quirks, as well as give it greater appeal for the target audience. Also, R helps us to make our data cleaning more reproducible and therefore more transparent and credible.
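As a small illustration of the reproducibility point, the missing-value count from my first script can be turned into a summary table that anyone can re-run and check. This is only a sketch on a hypothetical toy data frame; the names toy, missing_count, missing_share and missing_summary are mine, and the empty strings are converted to NA by hand to mimic what na.strings = "" does in read.csv().

```r
# Toy data frame with known gaps (hypothetical values).
toy <- data.frame(
  id     = 1:6,
  age    = c(23, NA, 31, NA, 45, 52),
  region = c("north", "", "south", "", "", "east"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing, as na.strings = "" would on import.
toy$region[toy$region == ""] <- NA

# Count and share of missing values per column.
missing_count <- colSums(is.na(toy))
missing_share <- round(missing_count / nrow(toy), 2)

# Combine into one table, one row per column of the data.
missing_summary <- data.frame(missing_count, missing_share)
print(missing_summary)
```

colSums(is.na(toy)) gives the same counts as the apply() call I used earlier, just a little more directly.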