Saturday 5 November 2016

Density Scatter with plotColourScatter

Common, that looks pretty fancy! :)
It's made with a public function in my superFreq package, caller plotColourScatter(). It behaves essentially like the base plot() function, but transitions nicely into density plot when the dots start to overlap. It also has bit different defaults, such as dots rather than circles, and colour. There are a few different density plotting functions around, such as hexbin or the built in smoothScatter. They all have different advantages and disadvantages.

Hexbin, as you can see, bins the values and then heatmaps the number of hits in each bin. This solves the issue with overplotting as you can see arbitrarily large numbers of dots in each bin (depending on the colour scale you choose ofc). It also avoids plotting all the dots, which allows small vectorised images for any number of input point. Also hexagons looks so much cooler than squares. They used squares in the 90ies, common, we are past that now. You do lose some in resolution though. You can't easily see if the outer squares contain 1 or 3 dots (unless you gear your colour scale very much in that direction), and you can't see where in the square the dot(s) is. Similarly, you won't pick up on density structures anywhere if they are smaller than your hexagon size. You get a pixelation effect (but with hexagonal pixels, which is cooler). You can just make smaller hexagons though, which mitigates this.

SmoothScatter finds a smoothed density, and then plots the first 100 "low density" points. This is done through some complicated-looking stats that I won't pretend to understand. I think it's clear from the plot above what it does though: it just smears out each point a bit. Here you also don't have to plot all points, so I assume smoothScatter produces decent sized vectorised images. This allows you to better see the structure of the points in the sparse regions, but only for the first 100 dots. You can of course that parameter to see the dot structure in denser regions as well. This comes at the cost of getting a weird-looking transition where the dots are no longer plotted. When you lean back, it almost looks like the density drops a bit just when the dots are no longer plotted. This can be mitigated by plotting fewer (or no) points, so you have to find a compromise here. The smearing also removes structures smaller than the smearing radius.

plotColourScatter actually plots all the points. Then plots them another 3 times, each time with a smaller size, more transparent and with a brighter colour. The first round of plotting picks out the dots and overplots as usual with the default plot() function. The replotting is just barely visible for single dots, but adds up in high density regions. This shows exactly what is going on in the sparse regions, makes a smooth transition into dense regions, and it at the same time very good at picking up small structures in the dense regions with the increasingly small dot sizes. The price is that the output files will be large if you use vector format and have many points. 30000 points (typical for genomics) gives an output file of around 5MB, which is acceptable but can be cumbersome if you have many. If you start going up into many hundreds of thousands  or millions of points, you more or less have to rasterise as PNG.

To display what happens when you have smaller structures, let's look at

x = rnorm(10000)^3
y = x + rnorm(10000)^3

the cube on the normal distribution makes it spike at 0, so we will get lines at x=0 and x=y, and a point at the intersection x=y=0:

 
It also allows a natural weighting of the dots. Just change cex if you want some dots to have larger impact that others. This is done in the opening plot at the top, showing GC-content correction. For example, take normal distributions and weight with the sum of the decimal part of the coordinates: 

x = rnorm(30000)
y = x + rnorm(30000)
w = x - floor(x) + y - floor(y)
plotColourScatter(x,y, cex=w)

This shows weight of points in both sparse and dense regions, as well as making a smooth transition.  Also, plotColourScatter comes with a red colour scheme! :)

plotColourScatter(x,y,col='defaultRed')

 

So well. I think all three methods presented here works well. They have slightly different limitations and strengths, but in most cases it comes down to preference, and what people find most aesthetically pleasing. In case you like what you see here and want to use it, go install superFreq at https://github.com/ChristofferFlensburg/superFreq and get plotting:


install.packages('devtools')
library(devtools)
install_github('ChristofferFlensburg/superFreq')


library(superFreq)

x = rnorm(30000)
y = x + rnorm(30000)
w = x - floor(x) + y - floor(y)
plotColourScatter(x,y, cex=w)
plotColourScatter(x,y, cex=w, col='defaultRed')

This is done in base graphics, so I assume it can't easily be merged into ggplot, sorry about that all you ggplot buddies. If you really want to though, it assume it wouldn't be much work to do the same thing in ggplot. Either from scratch (just overplot with smaller, brighter transparent dots), or from the pretty short source found around line 700 of https://github.com/ChristofferFlensburg/superFreq/blob/master/R/runDEexon.R (as of version 0.9.15).

If you make some nice plots with this, feel free to share to make me happy. :)

"3D" plots

It is very common to perform linear analysis in many dimensions. We experience a 3 (approximately) linear spatial dimensions in our world, so we have very good intuition in up to three dimensional linear space. This makes it tempting to visualise things in three dimensions, giving rise to "3D" plots. Going into high impact literature, let's google image "nature 3D scatter plot". the first hit is found on http://www.nature.com/app_notes/nmeth/2006/060328/fig_tab/nmeth870_F3.html from 2006:

This is a pretty representative plot if you continue to scroll down the image search. The problem is that "3D" scatter plots like this doesn't actually display 3 dimensions, making it a harder-to-read version of a normal 2D scatter plot with 3D inspired cosmetics. I don't want to isolate this specific figure, publication or journal (I've no idea about the context), I just wanted to show that these plots are present all the way up into the most prestigious journals. In fact, a google image search for just "3D scatter plots" yields the same kind of hard-to-read 2D plots. Some are better than others, but the majority are plots like these:

To be fair, these two specific plots are from a question on stackoverflow, and a screenshot from a plotting suite where you can also rotate the figure. Point is that static 2D renders of 3D plots like the ones above do not allow you to read out the third space dimension.

Let me elaborate:

The number of dimensions of a scatter plot is how many values you can read out from each dot. A typical scatter plot, as you know, will have 2 dimensions: the x and y coordinate of the dot.

A typical 2D scatter plot


Sometimes, or even frequently, you want to display more than two numbers for each dot though, and there are ways to do that.
Using points size, point type (discrete), colour and support lines to display a third dimension apart from x and y.
These methods have their advantages and drawback, but they all make it possible to read out more information than from the basic 2D scatter plot.

The "3D" scatter plot is a basic 2D scatter plot, but with various 3D-looking graphical effects like reflections on the dots or angled grids in the background. An example I made using a cute little online tool I found called highcharts is this:
A "3D" scatter plot: dot reflections and a 3D grid, but neither allow me to read out more than 2 numbers from the each dot.
We cannot read out the third dimension of the dots as we cannot see how far into the screen each dot is, so this is no more than a basic 2D scatter plot where I read the x and y coordinate on the screen, despite the cosmetics. The choice of camera angle decides which direction ends up as the depth dimension that we cannot see, and which directions (perpendicular to the depth) we can see. So we lose as much information as if we had only plotted dimension 1 and 2, but exactly which information we lose is much less transparent, for both the plotter and the reader.

So I am arguing that, unless you have a very good reason for your choice of camera angle, a "3D" scatter is a worse way to plot three-dimensional data than to just plot the first and second dimension, where you at least know what you are looking at, and what you are leaving out. In fact, I suggest we from now on refer to this kind of plots as "3D" scatter plots, with the quotation marks, as a derogatory term, and we leave the term 3D scatter plot for plots that actually display three values per dot. This can conveniently be done in spoken language with air-quotes, and a slightly disgusted facial expression for good measure. "3D" plots can be used in general for any plots displaying less than three dimensional data, but still using perspective effects. "3D" pie charts jumps to mind. (Credit to @JovMaksimovic for tweeting this one)

This chart shows, uhh... the fraction of colours, umm.. of a circular staircase?

It is always easy (and fun!) to complain, so let's discuss what we should do with our 3 dimensional data that we want to make a picture from. First, think hard about what we actually want to show with the picture, as we may not need all three dimensions. Do we just want to show that the red dots cluster separately from the blue dots? That is actually a 1-dimensional problem, and we should be fine with just finding the direction they separate in (which will be a linear combination of the three dimensions your data is in) and plot that dimension only, while being honest in the legend that we picked this direction from the three dimensional space to show separation. Not as sexy figure? Sorry, the message we are trying to convey isn't either.

Second, can some of the dimensions be conveniently displayed through colour, size or other means? That can look pretty fancy as well, especially if you match colour scales with other plots in your paper.

But assume that we actually do want to show all three dimensions on equal footing. This is a problem that astronomers have had essentially from the founding of the field, and a good solution they use is this (a kind of support line method):
Possibly the best way to 3D scatter plot with the three dimensions on equal footing. (from European Southern Obervatory) Note that the plot also use size (I assume luminosity) and colour (representing, I guess, colour of the star), so it is actually a 5D scatter plot.

With that I wish you all happy sensible plotting, as well as much enjoyable complaining. :)

Sunday 13 March 2016

slug plots

I made an (overly) fancy plot for a paper that we decided to not include in the paper, as the biological/clinical (why isn't bioclinical a word? bioinformatics is...) implications are unclear, or uninteresting, or both. Probably both.

The data is about 10 patients with transforming cancers. We have RNA from diagnosis and from transformed disease, which range from a few months to many years down the track. One of the obvious analyses you can do is differential expression. You can do this for each patient separately, or you can group up all diagnosis samples and all transformed samples. Doing the grouped analysis yields a bunch of differentially expressed genes, and some are the usual suspects: MYC et. al. Top tables, volcano plots, etc. All good.

The question we pursued but didn't bother to put in the paper was if this DE signal was related to the time between diagnosis and transformation. We already know that the patients on average have the top DE genes change between timepoints (as they came up as DE), but is the DE signal stronger in some patients than in others. A first approach is to take the the top N genes (default FDR < 0.05 I'd guess...), and check the log fold change for each patient. A signal here would be a correlation between the by-patient mean LFC and time to transformation. It'd be tempting to look at things like average log(p-value) to see how significant the genes are in each patient, but we would then bias by power: we would get stronger signals from less noise or deeper sequenced samples, which we don't care about. Using the uncorrected LFC mostly avoids that confounding factor.


Doing that, we do see a bit of a correlation. Unfortunately, the correlation is driven by variation in the transformed sample (the diagnosis samples don't depend much, if at all, on the transformation time). It's a bit unfortunate, as a correlation on the diagnosis side could in theory be used to predict time to transformation when a patient comes in at diagnosis. If you redo the analysis on more than 10 patients that is, and actually test that the correlation is consistent and reliable over several data sets. As it is, the correlation could be nothing more than the effect of the treatments the patient is undergoing.

Now, after having explained why this isn't in the paper, let's move to the subject of the blog: how would you make a figure out of that?

Well, first approach would be to just plot the average LFC by patient over the N significantly up and down regulated (separately ofc) genes, and order the patients by time to transformation. You'd then get a scissors like plot, where the shortest time to transformation have a modest LFC for both up and down regulated genes, and you'd see how they split apart for patients with increasing time to transformation:




This shows that the split increases as you go right on the plot: patients with longer time to relapse has a larger difference in LFC between the top up and down regulated genes. You can probably do something a bit more exciting though, like showing the variance of the LFC in some way... Box plots for example, with standard deviation and percentiles or whatever would already be a lot more useful. It'd look something like this:



The choice of SD and quantiles are a bit arbitrary of course, and if we want to go the full stretch and show the complete distribution of the LFCs of the top N genes we can always go to violin plots, that are essentially sideways smoothed histograms:




Well, that's the end of that I guess. Or is it?? This blog wouldn't be here if it were, so anyone familiar with the anthropic principle (with the universe replaced by this blog) will realise that there is another step. Also you can see on the scroll bar that there is still some left of the blog. You also haven't seen the plot that was used to advertise the blog (unless you've scrolled down, which you probably have), which is kindof a tell. Anyway, onwards.

Do the most DE genes in the group comparison separate the patients better? Is it better to go down to FDR < 0.5 to get more genes and more power? No reviewer would object to the choice of FDR < 0.05, or even think of it twice, but that doesn't change the fact that the cut-off is very arbitrary.

We can get rid of it if we don't plot just a subset, but we plot the ranking. That is, we plot the violin plot for the top N genes in red/blue, Inside that we plot the top N/2 genes with half width in orange/cyan. And so on. All on top of each other. With just two overlapping violins, you get this:


This shows a bit more information, but it's also messy. Let's start by doing two half-violins for the up and down regulated gene sets. Then do the above violins-inside-violins, but put a gradient from black to red to yellow to white for the upregulated genes, and black to blue to cyan to white for the downregulated genes. So make a continuous number of violins (well, in the computer it'll always be discreet, but small enough steps to look continuous). So the red colour shows the distribution of the largest gene set, and as we strip down the gene set to only the very top ranked genes, the colours shift to brighter colours. You then get this:



It kindof looks like slugs slugging their way up and down a wall, doesn't it? Imagine frames of a stop-motion of a slug movie. The blue slug going down, the red going up. So well, I'm calling them slug plots. :)

At the end of the day, this is a pretty plot more than a useful plot. Yes, it contains a lot of information in the sense that it shows the distribution for any top N genes, but it is not really immediately obvious if you are looking at a significant signal or not. It is not clear where the magical FDR = 5% limit is. You do however see that the higher ranked genes have a stronger signal (the white parts of the slugs climb up higher than the red parts), which is neat.

Anyway, maybe there is a use for someone. Essentially it can be used when you have several DE contrasts you want to compare to a ranked gene list (potentially from another DE contrastt). If you want to make your own slug plots, there is an R package called slugPlot now. :)

library(devtools)
install_github('ChristofferFlensburg/slugPlot')
library(slugPlot)
example('slugPlot')
?slugPlot

It is all a bit thrown together, so feel free to improve or add things if you want.

Thursday 10 March 2016

Frequency scatters in superFreq

SuperFreq is a cancer exome analysis pipeline developed in my lab, handling SNVs, CNAs and clonal tracking. This blog goes a bit more in depth of why and how we constructed one of the plots in superFreq, and how you can make the most of it.

I will talk about the variant frequency scatter plot. The idea behind this plot is to show the data in a relatively unprocessed way, to make it easier to understand and quality control the data. I often use these plots to verify that signals in the analysis are real. You can also use it for discovery to some extent, but it is not the main purpose. The plot is simply a scatter plot of the variant frequencies in two samples from the same individual. All variant provided in the input .vcf are shown in the default plot. If you plot only that, you get a plot like this:
As example, I will show a follicular lymphoma sample that transformed into large diffuse B-cell lymphoma. The frequency of the SNVs in the first time point is on the x-axis, and the frequencies of the SNVs in the transformed disease is on the y-axis. This is not a whole exome, but sequencing over exons of a panel of around 700 genes. I use this as example to make the plots less cluttered as we walk through the steps, but I will show a whole exome further down.


This plot above, to be frank, isn't very useful. Only thing we can say from here is that there are a lot of variants, and we see clusters around (0,0), (0.5,0.5) and (1,1). The absence of clusters at (0,1) and (1,0) tells us that the samples probably are from the same individual, ie correctly matched samples, but hard to say much more.


Point size by accuracy

The first improvement to the plot is the usual point size by accuracy. Here, the error of the frequency scales inversely with the square root of coverage, so the inverse of that (ie, the square root of the coverage) seems like a reasonable size of the points. This also makes the area of the points proportional to the coverage, which is a nice property. The total area of the points in a cluster describes the total number of reads that supports the cluster. Cool! There are more details in this, such as the geometric mean of the coverage between the two samples (we want a variant with low coverage only one sample to show up as small) and the cut-off for small (let's show at least a tiny dot) and large (once a dot is pretty accurate we don't care enough to clutter up the plot) dots, but let's not get into too much details.
Point size by accuracy. Area of dots proportional to read depth.

This defines the (0.5,0.5) cluster a lot more, and most of the points outside the main clusters turn out to just be low coverage variants, with high uncertainty in the frequency. Probably nothing to get excited about. There a few large point well outside the main clusters though. And maybe a few of them even seem to cluster up a bit? Hmm...


Grey low quality variants

Next, let's go QC the variants. SuperFreq checks the .bam of each file, looking for poor mapping or base quality. Chances are that the caller you used to produce the .vcf already did something similar, but just in case... SuperFreq also cross-checks the variants against a set of reference normal samples, and flags variants that behaves suspiciously in the normal ("suspicious" in this context amounts to be present, with some leeway given to potential population variants around 50% or 100% frequency). Let's replace the flagged variants with grey rings, and put then in the background:

Flagged variants are shows as grey rings.

Well, most variants seem to pass the filters. Mainly the low frequency variants around (0.1,0.1) are affected. This tempts us to believe that the rest of the low frequency variants (at (0.05,0.05)) probably also are not real, or at the very least not reliable enough to take seriously. Probably the lower frequency goes for the reference normals as well, and the variants slipped below the noise-level cuts.


crosses for dbSNP

The (0.5,0.5) and (1,1) clusters are likely the germline heterozygous and homozygous variants, so they should be enriched for common population variants. Let's cross-check with dbSNP and replace population variants with black crosses instead of blue dots.
dbSNP variants shown as black crosses, or grey crosses if flagged.
What a difference! The (0.5, 0.5) and (1,1) clusters are mainly dbSNP variants, as expected! The clusters at (0.45, 0.1) and (0.55,0.9) seem to also be dbSNP variants. There are a few non-dbSNP variants in the (0.5,0.5) cluster, a large fraction of the (0.05, 0.05) noise SNVs are also non-dbSNP, There is a non-dbSNP cluster around (0.1, 0.4).

This information gives us quite some power to tell what the cluster are.The non-dbSNP clusters are probably not germline variant, and thus either noise or somatic SNVs. It is fair to assume that the low frequency variants are mostly noise, as they cluster with the flagged grey variants. The (01, 0.4) non-dbSNP variants is probably the most interesting one, as it likely is somatic variants. They are high frequency in one sample, and low in the other, making it unlikely to be noise. Being a blood cancer that typically is mostly diploid, these frequencies correspond to clonalities of 0.2 and 0.8 in the two samples. So these SNVs were present to low degree at the first timepoint and then dominates the relapse. This is where we will find the mutations that are causing the disease transformation.

There is then a pair of dbSNP clusters at (0.45, 0.1) and (0.55, 0.9) that is probably related to a copy number loss. We already suspect that there is a population of cell at ~20% clonality in the first sample and 80% in the second, and a loss in these cells would give this signal: the lost heterozygous germline SNPs would shift down to around 40-45% in the first sample, and to around 10% in the second, and we would have the mirror image for the retained heterozygous germline SNPs. The fact that these are mostly dbSNPs leads us to this conclusion. If we find a cluster of mostly non-dbSNP at (0.45, 0.1) we would assume it is somatic SNVs, so the dots vs crosses allows us to separate these two kinds of clusters.


colour by significantly different frequency

In general, things along the x=y diagonal are less interesting, and we want to know if clusters are significantly separated from the diagonal. To do this, we run a fisher exact test (reference and variant counts in both samples) to check for significant difference in frequency between the samples. Add in a multiple hypothesis correction, and colour the dots or crosses red as they approach global significance. We set the limit so that a null set (replicate samples, expecting only x=y) will have a couple of dots turning a deep red, so that a shiny red dot or cross usually means significant signal.
Colour red if significantly different frequency in the two samples.

We see that the three clusters we talked about are indeed significantly off-diagonal. Essentially this is just a bit of a confirmation that we didn't imagine the clusters, and it helps draw the eye to them when we look at the plot the first time.


gene names

We now want to figure out a bit more what the new somatic SNVs are, as they can be biologically relevant. For this, let's add gene names over all SNVs that are sufficiently red.
Add gene names over red dots or crosses.

We get four gene names in the somatic SNV cluster, which are candidates for playing a role in transformation. We also note that some of the genes in the dbSNP clusters have SNPs in both the upper and lower cluster, which confirms that the two clusters are associated with each other. If we want to, we can go and check up which chromosomes FRK, TTK and EPHA7 are on, and it'll turn out that they are all on the chr6q arm. For copy number calls, probably better to look at the copy number plots otherwise, which are described in the manual on github (link on top).


Highlight protein changing SNVs

Finally, we want to know if the somatic SNVs damage the protein of their gene. Running the somatic variants through VEP, we mark the protein altering SNVs with an orange border (thicker border for more severe, such as nonsense or frameshift mutations).

Add rings around protein changing somatic SNVs.

We see now that CCND3 is the only one of the names somatic SNVs that is actually protein altering, making it a prime candidate for causing the relapse. Going to the literature, you can find support for CCND3 playing a role in transforming blood cancers, which is encouraging.


Whole exome example

The above plots are from an older version of superFreq, and on the a capture set of only around 700 of the 20k genes. By now, superFreq also highlights protein changing SNVs that in the cosmic census of cancer related genes. A typical (blood cancer, so mostly diploid) scatter between two cancer timepoint with whole exomes will look something like this (showing the version without gene names for less clutter). You can see how it grows out of the raw scatter in this animation:
Animation of gradual introduction of the steps explained above (plus the COSMIC highlighting) in a scatter plot between two different time points of a blood cancer. The title of the plot shows the step currently being introduced.
The final version looks like this in a bit higher resolution.
A typical scatter between two different time points in a blood cancer. The green rings highlight somatic protein changing SNVs in COSMIC census genes.


 I'm leaving the interpretation of this plot to the reader as an exercise. How many clusters can you find, and what do they mean? There are at least 11 clusters. Post in comments or on twitter or whatever media may have lead you to read this. :)

As closing remark, let me say that while it is possible to read a lot of information from these scatters, many of them can more conveniently be read from other output in superFreq. The big strength of the scatters is that they show the data in a comparably raw and unprocessed manner. You don't just get the CNA calls, or just a top table of somatic SNVs according to some frequency cuts. Looking at this plot it is often easy to convince yourself if you are looking at a real signal or noise, which can be hard in more processed downstream results. So before you publish anything, check in the scatters that the signal actually look reasonable. ;)

This concludes the first blog on superFreq. I'll try to keep it up, and explain different parts of the pipeline, both what is going on underneath the hood, and how you can interpret the output. I'm open for requests if you are curious on some specific part. More details are in the manual on github (linked on top). Thanks for reading!

Wednesday 19 November 2014

Gene blind analysis

The problem
I am starting to recognise a lot of known cancer genes, and while that is useful in many ways it also introduces a risk of confirmation bias:

Mutated Genes Score
Cancer gene 9068
Another Cancer Gene 9018
My Favourite Gene 9009
Another Interesting Gene 9002
            Time to publish!! :D

Mutated GenesScore
Not Known Cancer Gene8997
Unknown Gene8965
Gene Of Competing Group8965
Some Other Random Gene8654
Umm, better go through the analysis again...

This is a well known problem, and no one is immune. But it is hard to avoid, and would result in (potentially) less sexy articles for more work, so it isn't a very tempting problem to attack. I try my best to treat all genes equally, as do everyone else, but I don't think there are many bioinformaticians that quality control an analysis giving expected genes as much as an analysis giving unexpected genes.

The problem in bioinformatics is that both the data and the methods usually have loopholes that can let through false positives systematically. We are often able to recognise them easily by eye, but it can be time consuming to remove them automatically, and reviewer 2 (always reviewer 2...) is guaranteed to heavily disapprove if you try to publish an analysis that involves manual removal of genes from the top table.

A solution? gene blind analysis
So the other day I was changing some things at the very start of my pipeline, involving amongst other things the translation of gene names from ensembl (just a bunch of numbers) to symbol (the usual human-readable gene names). My changes of course messed up the translation of the gene names (typo in the regular expression that was supposed to extract the ensembl name from the capture regions), and I end up with plots and tables with gene names like
Mutated GenesScore
ENS00034569023
ENS00197459001
ENS00034638974
ENS01132418973
 Is this right? No idea...

While it was incredibly frustrating to not know if the changed pipeline still picked out that [Cancer Gene] mutation, I felt a healthy urge to go through the manual quality control of the top hits without knowing which genes they are. Of course I couldn't because the moment I open IGV at the mutation, it'll show me the symbol gene name, and even without that I recognise the chromosome-basepair coordinates of some of the most common cancer genes.

Anyway, the moment gave me the idea of gene blind analysis: to go through the entire analysis with randomised gene names and coordinates (and whatever more is needed to make the genes completely anonymous). So not only automate everything (hopefully the computer isn't biased... (it isn't, is it? :o right???)), but we would be allowed to manually look through top tables of randomised gene names (in IGV or whatever) and remove (or flag) what looks like false positive. We would be able to curate the top few hits manually, and still be sure to not introduce any bias. It'd also create an exciting/horrifying moment when the randomised gene names are revealed! Be sure to gather everyone involved for that! :)

Surely, not even Reviewer 2 could complain on that procedure?

Wednesday 29 October 2014

Point size by accuracy

When you do a scatter plot, try setting the size of the point from how certain you are of the location of the point! That sentence is essentially all I wanted to say, but to make it a blog post, rather than a twitter tweet, let me elaborate a bit and show an example.

When we work with data, we often have some kind of idea of how accurate the numbers are. You could even argue that the number are useless if you don't have any idea of the uncertainty. It then naturally follows that it often is a good idea to show the accuracy of the number when you plot your data, even if you don't know exactly what the error are. Normally errors are shown with error bars
Yep.
which is perfectly fine normally, but when you have too many points to look at each one individually, it isn't a good idea any longer (such as when visualising 'omics and other big data (so many buzzwords, it must be true)).
Nope.

Point size by accuracy
A good way of displaying the accuracy of each dot is to make the dot larger the smaller the uncertainty is. It will make it a lot easier to spot collective effects between many points, and helps you understand the big picture of the plot mainly based on the most accurate points. You have to put some thought into exactly how the point size should depend on the uncertainty, but it is in general worth the effort.

Some guidelines:
  • when the uncertainty is so large that the point is essentially useless, then the dot should be so small that it is barely visible.
  • When the uncertainty is so small that it doesn't matter any longer, then cap the point size. If not else, cap the point size when the error is smaller than the point!
  • If possible, try to have the area of the point proportional to the amount of data the point is based on (number of reads for example). Then two points based on 50 reads each will carry the same visual impact as a single point based on 100 reads, which feels intuitive.
  • Use dots, ie pch=16 (or 19) in R.

A toy example: differential expression
As an example, let's take differential expression. Say you have two knockout (KO) experiments, and you do your RNA-seq, you align, count reads and run limma-voom to get your fit object with log fold changes (LFC) etc. Let's simulate some very naive toy data: 10000 genes, 100 genes upregulated by 2 in one KO, 100 other genes upregulated a factor 2^1.5 in the other KO. Let's assume that the genes have a wide spread of variances (heteroscedastic data), and that we know/have accurately measured the variance. In R, the simulation would be something like:
w = abs(rnorm(10000, 0, 1))   #the width of the genes

#9900 null genes, 100 last DE with true LFC of 1.
x = c(rnorm(9900, 0, w[1:9900]),
      rnorm(100, 1, w[9901:10000]))

#first 100 genes DE with true LFC 1.5
y = c(rnorm(100, 1.5, w[1:100]),
      rnorm(9900, 0, w[101:10000]))

Now let's say you want to scatter plot the LFCs against each other, to get an overview. A plain scatter plot would look something like this:
plot(x,y, pch=16, cex=0.3, xlim=c(-3,3), ylim=c(-3,3))
Nope.

Do you see the DE genes at (1, 0) and (0, 1.5)? Hmm, nope, not really. Well, we know the uncertainties w off the genes (a fit object from limma-voom has a width of fit$coefficients/fit$t), so let's use them! Point size inversely proportional to the width seems reasonable, divide by ten, and cap at a point size of 1.5. That way, a width of 1 gives a point size of 0.1 (barely visible), while a gene with width of 0.1 or smaller will get full size or larger dots.
plot(x,y, pch=16, cex=pmin(1.5, 0.1/w),
     xlim=c(-3,3), ylim=c(-3,3))
Yep.
Yes, the point coordinates are the same, but now the two DE clusters are the first things you spot! The simulation is very artificial, but it think it illustrates the usefulness*.
 *DISCALIMER: this method of plotting will not create neat clusters of DE genes in your data.

There are plenty other examples, and I use this principle in most of the point-based plots I make. I wouldn't mind if this becomes a standard way of showing uncertainty when you have too many points for normal error bars. Maybe call them splatter plots? :)

Wednesday 22 October 2014

"I am doing bioinformatics"

Sometimes when I am at work people pass by, look at my screen and ask me what I am doing. I often reply that "I am doing bioinformatics". Let me explain why this is not meant as a short uninformative, almost a bit rude, answer.

Going from theoretical physics to bioinformatics, I quickly realised that some things were different. The change from "physics" to "bio" wasn't as big a deal as I thought it would be; what caught me off guard was the "theoretical" to "informatics".

I was writing code all the way through my physics PhD but it was almost exclusively within my own code: there were very little interaction with other peoples methods or data. A few tab-separated text files with prediction from other groups methods and data from a couple of experiments, and that all the relevant data there was for my project. The following cartoon describes more or less what was going on. Area of boxes and width of arrows represents time and effort spent.
  
 Despite my romantic hopes as a master student, most of the time I was not standing in front of the blackboard unravelling the mysteries of the universe. Nonetheless, I got some of that, and the rest was essentially writing my own (well, within my group) code for my Monte Carlo simulation. A minor annoyance was bringing external data (grey box) as I couldn't control the format of the data (red arrow), but it probably happened less than ten times over the course of five years.

 Bioinformatics, apart from my own analysis, involve an effectively endless list of interactions with data and methods that I have no control over. The corresponding cartoon would be something like this:
As you can see, there is a lot more grey boxes and red arrows, which seems to be an important difference between "theoretical" and "informatics". While I may have exaggerated a bit with the ratio of grey area to red area, I still spend a significant amount of time on getting public tools and data to work to fit into the rest of my analysis. I know exactly what I want to do, but it can still take me a lot of time.

Sometimes when I am at work in the middle of converting between file formats or some other grey box, people pass by, look at my screen full of messy terminal tabs and ask me what I am doing. At that point there is no point in going into detailed explanations, but I stick with a ":/" and "I am doing bioinformatics".