Tag Archives: visualization

My experience doing R trainings at work

Recently, the office decided to set up a small team to manage its social media presence. Because I had somewhat encouraged the development, I was asked to work with them, at least as a facilitator.

Somewhere down the line, I suggested to some on the team that they should consider carrying out analysis of the social media data, at least beyond the metrics that were already available on most of those sites.

I quickly put together a very rudimentary but useful Shiny app (not without some inspiration from this guy) just to demonstrate a bit of what was possible, and they were eager for me to train them in the use of R. I will share more about the app sometime later.

Application that plots social media data

Screenshot of the Shiny app developed for the team

My aim was (and still is) to get them to a point where they could carry out basic analyses on their own and grow from there. I tried to keep the material as basic and non-intimidating as possible – some of the students admitted to a morbid fear of statistics and I didn’t want to scare them off with anything too tough.

I consider myself a beginner still, so this experience really broadened my own understanding of the language. And I had a lot of fun doing it.

Well, I put together some slides on the training sessions and felt I should share them and hopefully get some feedback. Here they are:

  1. Introduction to R Programming
  2. R Data Structures – starting them off on vectors
  3. R Data Structures (Pt. II) – diving into the basics of data frames
  4. R Data Structures (Pt. III) – examining ways of working with matrices
  5. R Data Structures (Pt. IV) – lists (and lists)

The good thing is that some friends and colleagues (outside the office) have told me that, in the coming year, they would like me to train them as well in the use of R.

For me, it is yet another opportunity to learn more.

Filed under Computers & Internet

A Simple Modification of Missingness Maps

 

Source: itc2.utk.edu


I am one of those who are becoming increasingly convinced that data cleaning should be done in a way that is open to scrutiny. The lack of such openness is one of the disadvantages of using point-and-click software for data management. As an aspiring R Jedi, I was working on a relatively large dataset (37,000+ records) that has a lot of missing values. I explored and plotted the missing values in R using the following code:


#Load and explore data
pop <- read.csv("consolidated data.csv", na.strings = "")
dim(pop)
str(pop)
names(pop)

#Display and quantify missing values
apply(pop, 2, function(x) sum(is.na(x)))

# Plot missing values
library(Amelia) 
missmap(pop)

The ensuing plot turned out like this:

[Figure: default missingness map produced by missmap(pop)]
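For readers without access to the dataset, the missing-value count can be tried on a toy data frame. This is only a minimal sketch: the data frame `toy` and its columns are made up for illustration, and `colSums(is.na())` is used as a vectorized alternative that should give the same counts as the `apply()` call.

```r
# Hypothetical stand-in for the real dataset (not the author's data)
toy <- data.frame(
  id     = 1:6,
  age    = c(23, NA, 31, NA, 45, 52),
  region = c("north", "south", NA, "east", "west", "north"),
  stringsAsFactors = FALSE
)

# Count of missing values per column
na_count <- colSums(is.na(toy))

# Share of missing values per column, as a percentage
na_pct <- round(100 * colMeans(is.na(toy)), 1)

# Side-by-side summary, one row per column of `toy`
print(data.frame(na_count, na_pct))
```

Reporting the percentage alongside the raw count makes it easier to judge which columns are too sparse to be useful.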

As one can see, this pretty plot could do with some customization so that we can tell the audience what we are mapping as well as get rid of some superfluous detail. On looking at the R documentation on the missmap function, I discovered the following default parameters (click here for full documentation):

missmap(obj, legend = TRUE, col = c("wheat", "darkred"), main,
        y.cex = 0.8, x.cex = 0.8, y.labels, y.at, csvar = NULL,
        tsvar = NULL, rank.order = TRUE, ...)

I am okay with the legend and colour scheme, but would like to give the plot a better title. Also, because there are so many records, the y-axis labels are completely illegible, and I would love to get rid of them. As I tinkered with this, I discovered that the customization is slightly different from what I would have done with the built-in graphics package. For instance, the argument for the y-axis label in the latter is ylab, as against y.labels in Amelia. Also, any adjustment to y.labels had to be matched by a corresponding change to the y.at argument.
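The pairing of labels with positions is not peculiar to Amelia: base graphics' `axis()` also takes an `at` vector and a `labels` vector that have to line up. Here is a rough, hand-rolled missingness map in base R to illustrate the idea. It is not Amelia's implementation, and the data frame `toy` is made up for the purpose.

```r
# Hypothetical data with scattered missing values (not the author's data)
set.seed(1)
m <- matrix(rnorm(60), nrow = 20, dimnames = list(NULL, c("a", "b", "c")))
m[sample(length(m), 12)] <- NA
toy <- as.data.frame(m)

miss <- is.na(toy)   # logical matrix: TRUE where a value is missing
n <- nrow(miss)

# Label only every fifth row; positions and labels must have equal length
row_at     <- seq(5, n, by = 5)
row_labels <- as.character(row_at)

# image() puts the first dimension of z on the x-axis, so transpose to get
# variables along the x-axis and rows along the y-axis
image(x = seq_len(ncol(miss)), y = seq_len(n), z = t(miss) * 1,
      col = c("wheat", "darkred"), axes = FALSE,
      xlab = "", ylab = "", main = "Missingness map (base graphics sketch)")
axis(1, at = seq_len(ncol(miss)), labels = colnames(miss))
axis(2, at = row_at, labels = row_labels, las = 1)
```

Thinning the labels this way is one alternative to removing them altogether when there are too many rows to label individually.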

After a little trial and not a few errors, I eventually came up with this:


missmap(pop,
 main = "Missingness Map of CCT Dataset",
 y.labels = NULL,
 y.at = NULL)

And this is what the plot now looks like:

[Figure: missingness map with a custom title and no y-axis labels]

This looked a bit better to me (though not perfect). I’ve given it my own title and removed the jumbled-up labels on the y-axis.

The moral of the story is that, by examining the R documentation and with a little luxury of time, I can tweak my output to suit my needs, tastes and quirks, as well as provide greater appeal for the target audience. Also, R helps us to make our data cleaning more reproducible and therefore more transparent and credible.
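To make the point about reproducibility concrete, here is a minimal sketch of a scripted cleaning step that leaves a visible trail. The helper `drop_sparse_columns()`, its 50% threshold and the data frame `toy` are all hypothetical; they are not part of the analysis described in this post.

```r
# Hypothetical helper: drop columns whose share of missing values exceeds
# `max_missing`, and report what was dropped so the step can be audited
drop_sparse_columns <- function(df, max_missing = 0.5) {
  share   <- colMeans(is.na(df))
  dropped <- names(share)[share > max_missing]
  message("Dropped columns: ",
          if (length(dropped)) paste(dropped, collapse = ", ") else "none")
  df[, share <= max_missing, drop = FALSE]
}

# Made-up data: `notes` is missing in three of four rows
toy <- data.frame(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  notes = c(NA, NA, NA, "ok")
)

clean <- drop_sparse_columns(toy, max_missing = 0.5)
names(clean)
```

Because the rule and its threshold live in the script, anyone reviewing the work can see exactly which columns were removed and why.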


Facial recognition for data visualization

Yesterday, I discovered a very interesting capability in R. There’s a package called aplpack that allows you to plot Chernoff faces. Chernoff faces are a way of presenting multivariate data in which the output looks like the faces of cartoon characters. One thing that I immediately noticed was that, in using this package, I could quickly recognize (at a glance) which observations in the dataset are similar or dissimilar.

Following the example of the article I read, I tested it using the built-in dataset in R known as mtcars (Motor Trend Car Road Tests), which looks like this:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

When you run the faces() function from aplpack on it, the dataset looks like this:

[Figure: Chernoff faces for the cars in the mtcars dataset]

With this visualization of the data, one can quickly see which cars in the dataset are similar across the variables measured.
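A plot of this kind can be produced with very little code. The following is a minimal sketch that assumes the aplpack package is installed; it is not necessarily the exact call used for the figure in this post, and the name `cars` is just an illustrative choice.

```r
# mtcars ships with R; faces() works on numeric data, so use a matrix
cars <- as.matrix(mtcars)

# Draw one face per car, but only if aplpack is available
if (requireNamespace("aplpack", quietly = TRUE)) {
  aplpack::faces(cars)
} else {
  message("Install aplpack first: install.packages(\"aplpack\")")
}
```

Each row of the data becomes one face, and the columns are mapped to facial features such as the height of the face or the width of the mouth.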

Neat, eh?

For more information on this, take a look at the original blog post.
