Yet Another BioInformatics Blog: Frequency scatters in superFreq

SuperFreq is a cancer exome analysis pipeline developed in my lab, handling SNVs, CNAs and clonal tracking. This blog goes a bit more in depth of why and how we constructed one of the plots in superFreq, and how you can make the most of it.

I will talk about the variant frequency scatter plot. The idea behind this plot is to show the data in a relatively unprocessed way, to make it easier to understand and quality control the data. I often use these plots to verify that signals in the analysis are real. You can also use it for discovery to some extent, but it is not the main purpose. The plot is simply a scatter plot of the variant frequencies in two samples from the same individual. All variant provided in the input .vcf are shown in the default plot. If you plot only that, you get a plot like this:

As example, I will show a follicular lymphoma sample that transformed into large diffuse B-cell lymphoma. The frequency of the SNVs in the first time point is on the x-axis, and the frequencies of the SNVs in the transformed disease is on the y-axis. This is not a whole exome, but sequencing over exons of a panel of around 700 genes. I use this as example to make the plots less cluttered as we walk through the steps, but I will show a whole exome further down.

This plot above, to be frank, isn't very useful. Only thing we can say from here is that there are a lot of variants, and we see clusters around (0,0), (0.5,0.5) and (1,1). The absence of clusters at (0,1) and (1,0) tells us that the samples probably are from the same individual, ie correctly matched samples, but hard to say much more.

Point size by accuracy

The first improvement to the plot is the usual point size by accuracy. Here, the error of the frequency scales inversely with the square root of coverage, so the inverse of that (ie, the square root of the coverage) seems like a reasonable size of the points. This also makes the area of the points proportional to the coverage, which is a nice property. The total area of the points in a cluster describes the total number of reads that supports the cluster. Cool! There are more details in this, such as the geometric mean of the coverage between the two samples (we want a variant with low coverage only one sample to show up as small) and the cut-off for small (let's show at least a tiny dot) and large (once a dot is pretty accurate we don't care enough to clutter up the plot) dots, but let's not get into too much details.

Point size by accuracy. Area of dots proportional to read depth.

This defines the (0.5,0.5) cluster a lot more, and most of the points outside the main clusters turn out to just be low coverage variants, with high uncertainty in the frequency. Probably nothing to get excited about. There a few large point well outside the main clusters though. And maybe a few of them even seem to cluster up a bit? Hmm...

Grey low quality variants

Next, let's go QC the variants. SuperFreq checks the .bam of each file, looking for poor mapping or base quality. Chances are that the caller you used to produce the .vcf already did something similar, but just in case... SuperFreq also cross-checks the variants against a set of reference normal samples, and flags variants that behaves suspiciously in the normal ("suspicious" in this context amounts to be present, with some leeway given to potential population variants around 50% or 100% frequency). Let's replace the flagged variants with grey rings, and put then in the background:

Flagged variants are shows as grey rings.

Well, most variants seem to pass the filters. Mainly the low frequency variants around (0.1,0.1) are affected. This tempts us to believe that the rest of the low frequency variants (at (0.05,0.05)) probably also are not real, or at the very least not reliable enough to take seriously. Probably the lower frequency goes for the reference normals as well, and the variants slipped below the noise-level cuts.

crosses for dbSNP

The (0.5,0.5) and (1,1) clusters are likely the germline heterozygous and homozygous variants, so they should be enriched for common population variants. Let's cross-check with dbSNP and replace population variants with black crosses instead of blue dots.

dbSNP variants shown as black crosses, or grey crosses if flagged.

What a difference! The (0.5, 0.5) and (1,1) clusters are mainly dbSNP variants, as expected! The clusters at (0.45, 0.1) and (0.55,0.9) seem to also be dbSNP variants. There are a few non-dbSNP variants in the (0.5,0.5) cluster, a large fraction of the (0.05, 0.05) noise SNVs are also non-dbSNP, There is a non-dbSNP cluster around (0.1, 0.4).

This information gives us quite some power to tell what the cluster are.The non-dbSNP clusters are probably not germline variant, and thus either noise or somatic SNVs. It is fair to assume that the low frequency variants are mostly noise, as they cluster with the flagged grey variants. The (01, 0.4) non-dbSNP variants is probably the most interesting one, as it likely is somatic variants. They are high frequency in one sample, and low in the other, making it unlikely to be noise. Being a blood cancer that typically is mostly diploid, these frequencies correspond to clonalities of 0.2 and 0.8 in the two samples. So these SNVs were present to low degree at the first timepoint and then dominates the relapse. This is where we will find the mutations that are causing the disease transformation.

There is then a pair of dbSNP clusters at (0.45, 0.1) and (0.55, 0.9) that is probably related to a copy number loss. We already suspect that there is a population of cell at ~20% clonality in the first sample and 80% in the second, and a loss in these cells would give this signal: the lost heterozygous germline SNPs would shift down to around 40-45% in the first sample, and to around 10% in the second, and we would have the mirror image for the retained heterozygous germline SNPs. The fact that these are mostly dbSNPs leads us to this conclusion. If we find a cluster of mostly non-dbSNP at (0.45, 0.1) we would assume it is somatic SNVs, so the dots vs crosses allows us to separate these two kinds of clusters.

colour by significantly different frequency

In general, things along the x=y diagonal are less interesting, and we want to know if clusters are significantly separated from the diagonal. To do this, we run a fisher exact test (reference and variant counts in both samples) to check for significant difference in frequency between the samples. Add in a multiple hypothesis correction, and colour the dots or crosses red as they approach global significance. We set the limit so that a null set (replicate samples, expecting only x=y) will have a couple of dots turning a deep red, so that a shiny red dot or cross usually means significant signal.

Colour red if significantly different frequency in the two samples.

We see that the three clusters we talked about are indeed significantly off-diagonal. Essentially this is just a bit of a confirmation that we didn't imagine the clusters, and it helps draw the eye to them when we look at the plot the first time.

gene names

We now want to figure out a bit more what the new somatic SNVs are, as they can be biologically relevant. For this, let's add gene names over all SNVs that are sufficiently red.

Add gene names over red dots or crosses.

We get four gene names in the somatic SNV cluster, which are candidates for playing a role in transformation. We also note that some of the genes in the dbSNP clusters have SNPs in both the upper and lower cluster, which confirms that the two clusters are associated with each other. If we want to, we can go and check up which chromosomes FRK, TTK and EPHA7 are on, and it'll turn out that they are all on the chr6q arm. For copy number calls, probably better to look at the copy number plots otherwise, which are described in the manual on github (link on top).

Highlight protein changing SNVs

Finally, we want to know if the somatic SNVs damage the protein of their gene. Running the somatic variants through VEP, we mark the protein altering SNVs with an orange border (thicker border for more severe, such as nonsense or frameshift mutations).

Add rings around protein changing somatic SNVs.

We see now that CCND3 is the only one of the names somatic SNVs that is actually protein altering, making it a prime candidate for causing the relapse. Going to the literature, you can find support for CCND3 playing a role in transforming blood cancers, which is encouraging.

Whole exome example

The above plots are from an older version of superFreq, and on the a capture set of only around 700 of the 20k genes. By now, superFreq also highlights protein changing SNVs that in the cosmic census of cancer related genes. A typical (blood cancer, so mostly diploid) scatter between two cancer timepoint with whole exomes will look something like this (showing the version without gene names for less clutter). You can see how it grows out of the raw scatter in this animation:

Animation of gradual introduction of the steps explained above (plus the COSMIC highlighting) in a scatter plot between two different time points of a blood cancer. The title of the plot shows the step currently being introduced.

The final version looks like this in a bit higher resolution.

A typical scatter between two different time points in a blood cancer. The green rings highlight somatic protein changing SNVs in COSMIC census genes.

I'm leaving the interpretation of this plot to the reader as an exercise. How many clusters can you find, and what do they mean? There are at least 11 clusters. Post in comments or on twitter or whatever media may have lead you to read this. :)

As closing remark, let me say that while it is possible to read a lot of information from these scatters, many of them can more conveniently be read from other output in superFreq. The big strength of the scatters is that they show the data in a comparably raw and unprocessed manner. You don't just get the CNA calls, or just a top table of somatic SNVs according to some frequency cuts. Looking at this plot it is often easy to convince yourself if you are looking at a real signal or noise, which can be hard in more processed downstream results. So before you publish anything, check in the scatters that the signal actually look reasonable. ;)

This concludes the first blog on superFreq. I'll try to keep it up, and explain different parts of the pipeline, both what is going on underneath the hood, and how you can interpret the output. I'm open for requests if you are curious on some specific part. More details are in the manual on github (linked on top). Thanks for reading!

Yet Another BioInformatics Blog

Thursday, 10 March 2016

Frequency scatters in superFreq