Wednesday, 19 November 2014

Gene blind analysis

The problem
I am starting to recognise a lot of known cancer genes, and while that is useful in many ways it also introduces a risk of confirmation bias:

Mutated Genes Score
Cancer gene 9068
Another Cancer Gene 9018
My Favourite Gene 9009
Another Interesting Gene 9002
            Time to publish!! :D

Mutated GenesScore
Not Known Cancer Gene8997
Unknown Gene8965
Gene Of Competing Group8965
Some Other Random Gene8654
Umm, better go through the analysis again...

This is a well known problem, and no one is immune. But it is hard to avoid, and would result in (potentially) less sexy articles for more work, so it isn't a very tempting problem to attack. I try my best to treat all genes equally, as do everyone else, but I don't think there are many bioinformaticians that quality control an analysis giving expected genes as much as an analysis giving unexpected genes.

The problem in bioinformatics is that both the data and the methods usually have loopholes that can let through false positives systematically. We are often able to recognise them easily by eye, but it can be time consuming to remove them automatically, and reviewer 2 (always reviewer 2...) is guaranteed to heavily disapprove if you try to publish an analysis that involves manual removal of genes from the top table.

A solution? gene blind analysis
So the other day I was changing some things at the very start of my pipeline, involving amongst other things the translation of gene names from ensembl (just a bunch of numbers) to symbol (the usual human-readable gene names). My changes of course messed up the translation of the gene names (typo in the regular expression that was supposed to extract the ensembl name from the capture regions), and I end up with plots and tables with gene names like
Mutated GenesScore
ENS00034569023
ENS00197459001
ENS00034638974
ENS01132418973
 Is this right? No idea...

While it was incredibly frustrating to not know if the changed pipeline still picked out that [Cancer Gene] mutation, I felt a healthy urge to go through the manual quality control of the top hits without knowing which genes they are. Of course I couldn't because the moment I open IGV at the mutation, it'll show me the symbol gene name, and even without that I recognise the chromosome-basepair coordinates of some of the most common cancer genes.

Anyway, the moment gave me the idea of gene blind analysis: to go through the entire analysis with randomised gene names and coordinates (and whatever more is needed to make the genes completely anonymous). So not only automate everything (hopefully the computer isn't biased... (it isn't, is it? :o right???)), but we would be allowed to manually look through top tables of randomised gene names (in IGV or whatever) and remove (or flag) what looks like false positive. We would be able to curate the top few hits manually, and still be sure to not introduce any bias. It'd also create an exciting/horrifying moment when the randomised gene names are revealed! Be sure to gather everyone involved for that! :)

Surely, not even Reviewer 2 could complain on that procedure?