When you make a scatter plot, try setting the size of each point by how certain you are of its location! That sentence is essentially all I wanted to say, but to make it a blog post rather than a tweet, let me elaborate a bit and show an example.
When we work with data, we often have some kind of idea of how accurate the numbers are. You could even argue that the numbers are useless if you don't have any idea of the uncertainty. It naturally follows that it is often a good idea to show the accuracy of the numbers when you plot your data, even if you don't know exactly what the errors are. Errors are normally shown with error bars, which is perfectly fine most of the time, but when you have too many points to look at each one individually, it isn't a good idea any longer (such as when visualising 'omics and other big data (so many buzzwords, it must be true)).
Point size by accuracy
A good way of displaying the accuracy of each dot is to make the dot larger the smaller its uncertainty is. This makes it a lot easier to spot collective effects across many points, and lets you read the big picture of the plot mainly from the most accurate points. You have to put some thought into exactly how the point size should depend on the uncertainty, but in general it is worth the effort.
Some guidelines:
- When the uncertainty is so large that the point is essentially useless, the dot should be so small that it is barely visible.
- When the uncertainty is so small that it doesn't matter any longer, cap the point size. If nothing else, cap the point size when the error is smaller than the point!
- If possible, try to make the area of the point proportional to the amount of data the point is based on (number of reads, for example). Then two points based on 50 reads each will carry the same visual impact as a single point based on 100 reads, which feels intuitive.
- Use dots, i.e. pch=16 (or 19) in R.
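The guidelines above can be condensed into a couple of small helpers. This is only a sketch: the function names `uncertaintyToCex` and `countToCex` and all the default values are made up for illustration and don't come from any package. Since `cex` scales the diameter of the dot, the area grows as `cex^2`, so an area proportional to the read count needs a `cex` proportional to its square root.

```r
# Hypothetical helper: point size inversely proportional to the uncertainty w,
# capped at maxCex so that very accurate points don't grow without bound.
# scale is the width that gives cex = 1; both defaults are illustrative.
uncertaintyToCex = function(w, scale = 0.1, maxCex = 1.5) {
  pmin(maxCex, scale / w)
}

# Hypothetical helper: dot area proportional to the amount of data (e.g. reads).
# cex scales the diameter, so the area goes as cex^2 and we take the square root.
# refCount is the count that gives cex = 1.
countToCex = function(n, refCount = 100, maxCex = 1.5) {
  pmin(maxCex, sqrt(n / refCount))
}

# Widths of 1, 0.1 and 0.01: barely visible, full size, capped.
sizesByWidth = uncertaintyToCex(c(1, 0.1, 0.01))

# Relative areas for 50 and 100 reads: two 50-read dots add up to one 100-read dot.
areasByCount = countToCex(c(50, 100))^2
```

Either result can be passed straight to the `cex` argument of `plot` or `points`, together with `pch=16`.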
A toy example: differential expression
As an example, let's take differential expression. Say you have two knockout (KO) experiments: you do your RNA-seq, align, count reads and run limma-voom to get your fit object with log fold changes (LFC) etc. Let's simulate some very naive toy data: 10000 genes, 100 genes upregulated by a factor of 2 (LFC 1) in one KO, and 100 other genes upregulated by a factor of 2^1.5 (LFC 1.5) in the other KO. Let's assume that the genes have a wide spread of variances (heteroscedastic data), and that we know, or have accurately measured, the variance. In R, the simulation would be something like:
w = abs(rnorm(10000, 0, 1))  # the width (standard deviation) of each gene
# first KO: 9900 null genes, then the last 100 genes DE with a true LFC of 1
x = c(rnorm(9900, 0, w[1:9900]),
      rnorm(100, 1, w[9901:10000]))
# second KO: the first 100 genes DE with a true LFC of 1.5, then 9900 null genes
y = c(rnorm(100, 1.5, w[1:100]),
      rnorm(9900, 0, w[101:10000]))
Now let's say you want to scatter plot the LFCs against each other, to get an overview. A plain scatter plot would look something like this:
plot(x,y, pch=16, cex=0.3, xlim=c(-3,3), ylim=c(-3,3))
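Do you see the DE genes at (1, 0) and (0, 1.5)? Hmm, nope, not really. Well, we know the uncertainties w of the genes (for a fit object from limma-voom, the width is fit$coefficients/fit$t), so let's use them! A point size inversely proportional to the width seems reasonable: take 1/w, divide by ten, and cap at a point size of 1.5. That way, a width of 1 gives a point size of 0.1 (barely visible), while a gene with a width of 0.1 or smaller gets a full-size or larger dot.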
plot(x, y, pch=16, cex=pmin(1.5, 0.1/w),
     xlim=c(-3,3), ylim=c(-3,3))
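Yes, the point coordinates are the same, but now the two DE clusters are the first things you spot! The simulation is very artificial, but I think it illustrates the usefulness*.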
*DISCLAIMER: this method of plotting will not create neat clusters of DE genes in your data.
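For real data, the width comes from the fit rather than from a simulation. Here is a minimal sketch of that step, using a mock list with made-up numbers in place of a real limma-voom fit. It assumes only that the object has `coefficients` and `t` matrices with one column per contrast (in limma, the `t` slot should be there once the fit has been through `eBayes`). The column name `KO1` and the gene names are hypothetical.

```r
# Mock object with the same two slots as a limma fit (all values are made up).
genes = c("geneA", "geneB", "geneC")
fit = list(
  coefficients = matrix(c(1.0, -0.5, 2.0), ncol = 1,
                        dimnames = list(genes, "KO1")),
  t            = matrix(c(10, -1, 4), ncol = 1,
                        dimnames = list(genes, "KO1"))
)

# t is the LFC divided by its standard error, so LFC / t recovers the width.
# Note that an LFC of exactly 0 would give 0/0 = NaN and needs special care.
w = abs(fit$coefficients[, "KO1"] / fit$t[, "KO1"])

# Same rule as in the toy example: inversely proportional, capped at 1.5.
cexByWidth = pmin(1.5, 0.1 / w)
```

With two KOs, each axis has its own width, so you also have to decide how to combine them into a single point size; the toy example sidesteps that by using the same w for both.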
There are plenty of other examples, and I use this principle in most of the point-based plots I make. I wouldn't mind if this became a standard way of showing uncertainty when you have too many points for normal error bars. Maybe call them splatter plots? :)