Wednesday 29 October 2014

Point size by accuracy

When you do a scatter plot, try setting the size of each point based on how certain you are of its location! That sentence is essentially all I wanted to say, but to make it a blog post rather than a tweet, let me elaborate a bit and show an example.

When we work with data, we often have some kind of idea of how accurate the numbers are. You could even argue that the numbers are useless if you don't have any idea of the uncertainty. It then naturally follows that it is often a good idea to show the accuracy of the numbers when you plot your data, even if you don't know exactly what the errors are. Normally errors are shown with error bars,

[Figure: error bars on a handful of points. Yep.]

which is perfectly fine normally, but when you have too many points to look at each one individually, it isn't a good idea any longer (such as when visualising 'omics and other big data (so many buzzwords, it must be true)).

[Figure: error bars on a dense cloud of points. Nope.]
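For reference, base R has no dedicated error-bar function; the usual idiom is arrows() with flat heads. A minimal sketch with made-up values:

```r
# A few points with made-up uncertainties (illustration only)
x = 1:5
y = c(2.1, 2.9, 3.8, 5.2, 5.9)
w = c(0.2, 0.5, 0.3, 0.4, 0.25)

plot(x, y, pch = 16, ylim = range(y - w, y + w))
# Vertical error bars: arrows with flat heads (angle = 90) at both ends (code = 3)
arrows(x, y - w, x, y + w, angle = 90, code = 3, length = 0.05)
```

This works fine for five points, but it is exactly the approach that breaks down once the points number in the thousands.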

Point size by accuracy
A good way of displaying the accuracy of each dot is to make the dot larger the smaller the uncertainty is. It makes it a lot easier to spot collective effects among many points, and helps you understand the big picture of the plot mainly based on the most accurate points. You have to put some thought into exactly how the point size should depend on the uncertainty, but it is in general worth the effort.

Some guidelines:
  • When the uncertainty is so large that the point is essentially useless, the dot should be so small that it is barely visible.
  • When the uncertainty is so small that it no longer matters, cap the point size. If nothing else, cap the point size once the error is smaller than the point itself!
  • If possible, try to have the area of the point proportional to the amount of data the point is based on (number of reads for example). Then two points based on 50 reads each will carry the same visual impact as a single point based on 100 reads, which feels intuitive.
  • Use dots, i.e. pch=16 (or 19) in R.
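The third guideline can be sketched in R. Since the area of a plotted point scales roughly as cex squared, cex should scale as the square root of the amount of data; the read counts and scaling constant below are made up for illustration:

```r
# Hypothetical read counts per point (not from the original post)
reads = c(50, 50, 100, 400)

# Area proportional to reads, and area ~ cex^2, so cex ~ sqrt(reads).
# Scale so that 100 reads gives a full-size dot (cex = 1),
# and cap at 1.5 as suggested above.
cex = pmin(1.5, sqrt(reads / 100))

plot(seq_along(reads), rep(1, length(reads)), pch = 16, cex = cex)
```

With this scaling, the two 50-read points together cover the same area as the single 100-read point, matching the intuition in the guideline.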

A toy example: differential expression
As an example, let's take differential expression. Say you have two knockout (KO) experiments: you do your RNA-seq, align, count reads and run limma-voom to get your fit object with log fold changes (LFC) etc. Let's simulate some very naive toy data: 10000 genes, 100 genes upregulated by a factor of 2 in one KO, and 100 other genes upregulated by a factor of 2^1.5 in the other KO. Let's assume that the genes have a wide spread of variances (heteroscedastic data), and that we know/have accurately measured the variance. In R, the simulation would be something like:
w = abs(rnorm(10000, 0, 1))   #the width of the genes

#9900 null genes, 100 last DE with true LFC of 1.
x = c(rnorm(9900, 0, w[1:9900]),
      rnorm(100, 1, w[9901:10000]))

#first 100 genes DE with true LFC 1.5
y = c(rnorm(100, 1.5, w[1:100]),
      rnorm(9900, 0, w[101:10000]))

Now let's say you want to scatter plot the LFCs against each other, to get an overview. A plain scatter plot would look something like this:
plot(x,y, pch=16, cex=0.3, xlim=c(-3,3), ylim=c(-3,3))
[Figure: the resulting plain scatter plot. Nope.]

Do you see the DE genes at (1, 0) and (0, 1.5)? Hmm, nope, not really. Well, we know the uncertainties w of the genes (a fit object from limma-voom has a width of fit$coefficients/fit$t), so let's use them! Point size inversely proportional to the width seems reasonable; divide by ten, and cap at a point size of 1.5. That way, a width of 1 gives a point size of 0.1 (barely visible), while a gene with a width of 0.1 or smaller gets full-size or larger dots.
plot(x,y, pch=16, cex=pmin(1.5, 0.1/w),
     xlim=c(-3,3), ylim=c(-3,3))
[Figure: the same scatter plot with point size set by accuracy. Yep.]
Yes, the point coordinates are the same, but now the two DE clusters are the first things you spot! The simulation is very artificial, but I think it illustrates the usefulness*.
 *DISCLAIMER: this method of plotting will not create neat clusters of DE genes in your data.

There are plenty of other examples, and I use this principle in most of the point-based plots I make. I wouldn't mind if this became a standard way of showing uncertainty when you have too many points for normal error bars. Maybe call them splatter plots? :)
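The principle is easy to wrap in a small helper; here is a hypothetical sketch (the function name and the default scale and cap are my choices, following the rules of thumb above, not an established convention):

```r
# Hypothetical helper: scatter plot with point size from uncertainty.
# 'w' is the per-point width (e.g. standard error); points shrink as
# the width grows, capped at 'cap' for the most accurate points.
splatterPlot = function(x, y, w, scale = 0.1, cap = 1.5, ...) {
  plot(x, y, pch = 16, cex = pmin(cap, scale / w), ...)
}

# Usage, on the toy data from the previous post's example:
# splatterPlot(x, y, w, xlim = c(-3, 3), ylim = c(-3, 3))
```

With the defaults, a width of 1 gives the barely visible cex = 0.1, and anything below a width of 0.0667 or so hits the cap.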

Wednesday 22 October 2014

"I am doing bioinformatics"

Sometimes when I am at work, people pass by, look at my screen and ask me what I am doing. I often reply that "I am doing bioinformatics". Let me explain why this is not meant as a short, uninformative, almost rude answer.

Going from theoretical physics to bioinformatics, I quickly realised that some things were different. The change from "physics" to "bio" wasn't as big a deal as I thought it would be; what caught me off guard was the "theoretical" to "informatics".

I was writing code all the way through my physics PhD, but it was almost exclusively within my own code: there was very little interaction with other people's methods or data. A few tab-separated text files with predictions from other groups' methods, and data from a couple of experiments, and that was all the relevant data there was for my project. The following cartoon describes more or less what was going on. The areas of the boxes and the widths of the arrows represent time and effort spent.

[Cartoon: the theoretical-physics workflow.]
Despite my romantic hopes as a master's student, most of the time I was not standing in front of the blackboard unravelling the mysteries of the universe. Nonetheless, I got some of that, and the rest was essentially writing my own (well, within my group) code for my Monte Carlo simulation. A minor annoyance was bringing in external data (grey box), as I couldn't control the format of the data (red arrow), but it probably happened less than ten times over the course of five years.

Bioinformatics, apart from my own analysis, involves an effectively endless list of interactions with data and methods that I have no control over. The corresponding cartoon would be something like this:

[Cartoon: the bioinformatics workflow, with many more grey boxes and red arrows.]

As you can see, there are a lot more grey boxes and red arrows, which seems to be an important difference between "theoretical" and "informatics". While I may have exaggerated a bit with the ratio of grey area to red area, I still spend a significant amount of time getting public tools and data to work and fit into the rest of my analysis. I know exactly what I want to do, but it can still take me a lot of time.

Sometimes when I am at work, in the middle of converting between file formats or some other grey box, people pass by, look at my screen full of messy terminal tabs and ask me what I am doing. By then there is no point in going into detailed explanations, so I just stick with a ":/" and "I am doing bioinformatics".