Yet Another BioInformatics Blog: November 2014

The problem
I am starting to recognise a lot of known cancer genes, and while that is useful in many ways it also introduces a risk of confirmation bias:

Mutated Genes	Score
Cancer gene	9068
Another Cancer Gene	9018
My Favourite Gene	9009
Another Interesting Gene	9002

Time to publish!! :D

Mutated Genes	Score
Not Known Cancer Gene	8997
Unknown Gene	8965
Gene Of Competing Group	8965
Some Other Random Gene	8654

Umm, better go through the analysis again...

This is a well-known problem, and no one is immune. But it is hard to avoid, and tackling it would result in (potentially) less sexy articles for more work, so it isn't a very tempting problem to attack. I try my best to treat all genes equally, as does everyone else, but I don't think there are many bioinformaticians who quality control an analysis giving expected genes as thoroughly as an analysis giving unexpected genes.

The problem in bioinformatics is that both the data and the methods usually have loopholes that can systematically let through false positives. We are often able to recognise them easily by eye, but it can be time-consuming to remove them automatically, and reviewer 2 (always reviewer 2...) is guaranteed to heavily disapprove if you try to publish an analysis that involves manual removal of genes from the top table.

A solution? Gene blind analysis
So the other day I was changing some things at the very start of my pipeline, involving amongst other things the translation of gene names from Ensembl IDs (just a bunch of numbers) to symbols (the usual human-readable gene names). My changes of course messed up the translation of the gene names (a typo in the regular expression that was supposed to extract the Ensembl ID from the capture regions), and I ended up with plots and tables with gene names like:

Mutated Genes	Score
ENS0003456	9023
ENS0019745	9001
ENS0003463	8974
ENS0113241	8973

Is this right? No idea...

While it was incredibly frustrating to not know if the changed pipeline still picked out that [Cancer Gene] mutation, I felt a healthy urge to go through the manual quality control of the top hits without knowing which genes they are. Of course I couldn't because the moment I open IGV at the mutation, it'll show me the symbol gene name, and even without that I recognise the chromosome-basepair coordinates of some of the most common cancer genes.

Anyway, the moment gave me the idea of gene blind analysis: going through the entire analysis with randomised gene names and coordinates (and whatever else is needed to make the genes completely anonymous). So we would not only automate everything (hopefully the computer isn't biased... (it isn't, is it? :o right???)), but also be allowed to manually look through top tables of randomised gene names (in IGV or whatever) and remove (or flag) what look like false positives. We would be able to curate the top few hits manually and still be sure not to introduce any bias. It'd also create an exciting/horrifying moment when the randomised gene names are revealed! Be sure to gather everyone involved for that! :)
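To make the idea a bit more concrete, here is a minimal Python sketch of the name-blinding part only. Everything in it is an assumption for illustration: the function names `blind_genes` and `unblind`, the `GENE00001`-style labels and the toy top table are all made up, and it is not taken from my actual pipeline. It also does nothing about coordinates, which would need their own anonymisation before opening anything in a genome browser.

```python
import random


def blind_genes(gene_names, seed=None):
    """Replace each gene name with a random anonymous label.

    Returns (blinded_names, key), where key maps label -> real name.
    The key should stay hidden until curation is finished.
    """
    rng = random.Random(seed)
    unique = sorted(set(gene_names))
    # Shuffle the label numbers so a label says nothing about the gene
    # (not even its alphabetical position).
    numbers = list(range(1, len(unique) + 1))
    rng.shuffle(numbers)
    to_label = {gene: f"GENE{n:05d}" for gene, n in zip(unique, numbers)}
    key = {label: gene for gene, label in to_label.items()}
    return [to_label[g] for g in gene_names], key


def unblind(blinded_names, key):
    """Translate anonymous labels back to the real gene names."""
    return [key[b] for b in blinded_names]


# Hypothetical top table: (gene, score) pairs.
top_table = [
    ("Cancer gene", 9068),
    ("Unknown Gene", 8965),
    ("My Favourite Gene", 9009),
]
names = [gene for gene, _ in top_table]

# Blind the names before anyone looks at the table.
blinded, key = blind_genes(names)
blinded_table = [(b, score) for b, (_, score) in zip(blinded, top_table)]

# Manual curation happens on the blinded table only. Here the second
# row is flagged as a false positive, purely as an example.
flagged = {blinded_table[1][0]}
curated = [(b, score) for b, score in blinded_table if b not in flagged]

# The big reveal, once curation is locked in.
revealed = [(key[b], score) for b, score in curated]
```

The important design choice is that the key is produced once and kept away from whoever does the curation, so the flagging step cannot depend on which genes are which.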

Surely, not even Reviewer 2 could complain about that procedure?

Wednesday, 19 November 2014

Gene blind analysis