Wednesday 19 November 2014

Gene blind analysis

The problem
I am starting to recognise a lot of known cancer genes, and while that is useful in many ways, it also introduces a risk of confirmation bias:

Mutated Genes              Score
Cancer gene                 9068
Another Cancer Gene         9018
My Favourite Gene           9009
Another Interesting Gene    9002

Time to publish!! :D

Mutated Genes              Score
Not Known Cancer Gene       8997
Unknown Gene                8965
Gene Of Competing Group     8965
Some Other Random Gene      8654

Umm, better go through the analysis again...

This is a well-known problem, and no one is immune. But it is hard to avoid, and avoiding it would mean (potentially) less sexy articles for more work, so it isn't a very tempting problem to attack. I try my best to treat all genes equally, as does everyone else, but I don't think there are many bioinformaticians who quality control an analysis that gives expected genes as thoroughly as one that gives unexpected genes.

The problem in bioinformatics is that both the data and the methods usually have loopholes that systematically let through false positives. We are often able to recognise them easily by eye, but it can be time consuming to remove them automatically, and reviewer 2 (always reviewer 2...) is guaranteed to heavily disapprove if you try to publish an analysis that involves manually removing genes from the top table.

A solution? Gene blind analysis
So the other day I was changing some things at the very start of my pipeline, involving amongst other things the translation of gene names from Ensembl IDs (just a bunch of numbers) to symbols (the usual human-readable gene names). My changes of course messed up the translation of the gene names (a typo in the regular expression that was supposed to extract the Ensembl name from the capture regions), and I ended up with plots and tables with gene names like
Mutated Genes     Score
ENS00034569023
ENS00197459001
ENS00034638974
ENS01132418973

Is this right? No idea...

While it was incredibly frustrating not to know whether the changed pipeline still picked out that [Cancer Gene] mutation, I felt a healthy urge to go through the manual quality control of the top hits without knowing which genes they were. Of course I couldn't, because the moment I open IGV at the mutation it shows me the gene symbol, and even without that I recognise the chromosome-basepair coordinates of some of the most common cancer genes.

Anyway, the moment gave me the idea of gene blind analysis: going through the entire analysis with randomised gene names and coordinates (and whatever more is needed to make the genes completely anonymous). So not only would everything be automated (hopefully the computer isn't biased... (it isn't, is it? :o right???)), but we would also be allowed to manually look through top tables of randomised gene names (in IGV or wherever) and remove (or flag) what looks like false positives. We would be able to curate the top few hits manually and still be sure not to introduce any bias. It'd also create an exciting/horrifying moment when the randomised gene names are revealed! Be sure to gather everyone involved for that! :)
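A minimal sketch of what the randomisation could look like in R (nothing here is from a real pipeline; the gene symbols and file name are hypothetical): generate an opaque alias for every gene, save the key away from the analysis, and only merge the real names back in after the manual curation is done.

set.seed(1)   #optional: lets the key be regenerated if the file is lost
genes = c("TP53", "KRAS", "MYC")   #hypothetical example symbols
key = data.frame(symbol = genes,
                 alias = paste0("blindGene", sample(length(genes))))
write.csv(key, "gene_blind_key.csv", row.names=FALSE)

#run the whole analysis on key$alias instead of key$symbol. Coordinates
#and anything else that gives the gene away would need scrambling too.
#After curation, reveal with something like:
#topTable = merge(topTable, key, by.x="gene", by.y="alias")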

Surely, not even Reviewer 2 could complain about that procedure?

Wednesday 29 October 2014

Point size by accuracy

When you do a scatter plot, try setting the size of each point from how certain you are of the point's location! That sentence is essentially all I wanted to say, but to make it a blog post rather than a tweet, let me elaborate a bit and show an example.

When we work with data, we often have some kind of idea of how accurate the numbers are. You could even argue that the numbers are useless if you don't have any idea of the uncertainty. It then naturally follows that it is often a good idea to show the accuracy of the numbers when you plot your data, even if you don't know exactly what the errors are. Normally errors are shown with error bars
[Figure: a scatter plot with error bars. Caption: "Yep."]
which is perfectly fine normally, but when you have too many points to look at each one individually, it isn't a good idea any longer (such as when visualising 'omics and other big data (so many buzzwords, it must be true)).
[Figure: a scatter plot with too many points for error bars. Caption: "Nope."]
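For reference, the classic error bars are easy to draw in base R; a minimal sketch (the arrows() call with flat, double-ended heads is a common idiom, and x, y and err are made up here):

x = 1:10
y = rnorm(10)
err = runif(10, 0.2, 0.5)
plot(x, y, pch=16, ylim=range(y-err, y+err))
arrows(x, y-err, x, y+err, angle=90, code=3, length=0.05)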

Point size by accuracy
A good way of displaying the accuracy of each point is to make the dot larger the smaller the uncertainty is. It makes it a lot easier to spot collective effects between many points, and helps you understand the big picture of the plot mainly from the most accurate points. You have to put some thought into exactly how the point size should depend on the uncertainty, but it is generally worth the effort.

Some guidelines:
  • When the uncertainty is so large that the point is essentially useless, the dot should be so small that it is barely visible.
  • When the uncertainty is so small that it doesn't matter any longer, cap the point size. If nothing else, cap the point size when the error is smaller than the point!
  • If possible, try to have the area of the point proportional to the amount of data the point is based on (number of reads, for example). Then two points based on 50 reads each will carry the same visual impact as a single point based on 100 reads, which feels intuitive (see the sketch after this list).
  • Use dots, i.e. pch=16 (or 19) in R.
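To make the third guideline concrete, here is a minimal sketch (the function name and the reference count are my own, not from any package): since the area of a dot scales as cex squared, setting cex to the square root of the read count makes the printed area proportional to the data behind the point.

cexFromReads = function(reads, reference=100, maxCex=1.5)
  pmin(maxCex, sqrt(reads/reference))   #cap as in the second guideline

cexFromReads(c(50, 50, 100))   #0.71 0.71 1.00: two 50-read points carry
                               #the same total area as one 100-read point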

A toy example: differential expression
As an example, let's take differential expression. Say you have two knockout (KO) experiments and you do your RNA-seq: you align, count reads and run limma-voom to get your fit object with log fold changes (LFCs) etc. Let's simulate some very naive toy data: 10000 genes, 100 genes upregulated a factor 2 in one KO, 100 other genes upregulated a factor 2^1.5 in the other KO. Let's assume that the genes have a wide spread of variances (heteroscedastic data), and that we know/have accurately measured the variance. In R, the simulation would be something like:
w = abs(rnorm(10000, 0, 1))   #the widths (uncertainties) of the genes

#x: 9900 null genes, then the last 100 DE with a true LFC of 1
x = c(rnorm(9900, 0, w[1:9900]),
      rnorm(100, 1, w[9901:10000]))

#y: the first 100 genes DE with a true LFC of 1.5, then 9900 null genes
y = c(rnorm(100, 1.5, w[1:100]),
      rnorm(9900, 0, w[101:10000]))

Now let's say you want to scatter plot the LFCs against each other, to get an overview. A plain scatter plot would look something like this:
plot(x,y, pch=16, cex=0.3, xlim=c(-3,3), ylim=c(-3,3))
[Figure: the plain scatter plot; the DE genes do not stand out. Caption: "Nope."]

Do you see the DE genes at (1, 0) and (0, 1.5)? Hmm, nope, not really. Well, we know the uncertainties w of the genes (from a limma-voom fit object you can get the width as fit$coefficients/fit$t), so let's use them! Point size inversely proportional to the width seems reasonable; divide by ten, and cap at a point size of 1.5. That way, a width of 1 gives a point size of 0.1 (barely visible), while a gene with a width of 0.1 or smaller gets a full-size or larger dot.
plot(x,y, pch=16, cex=pmin(1.5, 0.1/w),
     xlim=c(-3,3), ylim=c(-3,3))
[Figure: the same scatter plot with point size by accuracy; the two DE clusters stand out. Caption: "Yep."]
Yes, the point coordinates are the same, but now the two DE clusters are the first things you spot! The simulation is very artificial, but I think it illustrates the usefulness*.
*DISCLAIMER: this method of plotting will not create neat clusters of DE genes in your data.
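For real data, a hedged sketch of how the same plot could be made straight from a limma-voom fit object (plotSplatter and its arguments are my own names, not limma functions; the width is the fit$coefficients/fit$t one quoted above, and the fit is assumed to come from the usual voom -> lmFit -> eBayes chain):

library(limma)

plotSplatter = function(fit, coefA, coefB, scale=0.1, maxCex=1.5) {
  wA = abs(fit$coefficients[,coefA]/fit$t[,coefA])
  wB = abs(fit$coefficients[,coefB]/fit$t[,coefB])
  w = pmax(wA, wB)   #be conservative: use the larger of the two widths
  plot(fit$coefficients[,coefA], fit$coefficients[,coefB],
       pch=16, cex=pmin(maxCex, scale/w))
}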

There are plenty of other examples, and I use this principle in most of the point-based plots I make. I wouldn't mind if this became a standard way of showing uncertainty when you have too many points for normal error bars. Maybe call them splatter plots? :)

Wednesday 22 October 2014

"I am doing bioinformatics"

Sometimes when I am at work, people pass by, look at my screen and ask me what I am doing. I often reply that "I am doing bioinformatics". Let me explain why this is not meant as a short, uninformative, almost rude answer.

Going from theoretical physics to bioinformatics, I quickly realised that some things were different. The change from "physics" to "bio" wasn't as big a deal as I thought it would be; what caught me off guard was the "theoretical" to "informatics".

I was writing code all the way through my physics PhD, but it was almost exclusively my own code: there was very little interaction with other people's methods or data. A few tab-separated text files with predictions from other groups' methods and data from a couple of experiments, and that was all the relevant data there was for my project. The following cartoon describes more or less what was going on. The area of the boxes and the width of the arrows represent time and effort spent.

[Cartoon: the theoretical physics workflow.]
Despite my romantic hopes as a master's student, most of the time I was not standing in front of the blackboard unravelling the mysteries of the universe. Nonetheless, I got some of that, and the rest was essentially writing my own (well, my group's) code for my Monte Carlo simulation. A minor annoyance was bringing in external data (grey box), as I couldn't control the format of the data (red arrow), but it probably happened less than ten times over the course of five years.

Bioinformatics, apart from my own analysis, involves an effectively endless list of interactions with data and methods that I have no control over. The corresponding cartoon would be something like this:

[Cartoon: the bioinformatics workflow.]
As you can see, there are a lot more grey boxes and red arrows, which seems to be an important difference between "theoretical" and "informatics". While I may have exaggerated the ratio of grey area to red area a bit, I still spend a significant amount of time getting public tools and data to work and fit into the rest of my analysis. I know exactly what I want to do, but it can still take me a lot of time.

Sometimes when I am at work, in the middle of converting between file formats or some other grey box, people pass by, look at my screen full of messy terminal tabs and ask me what I am doing. At that point there is no use going into detailed explanations, so I stick with a ":/" and "I am doing bioinformatics".