My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA". I'm guessing that the calculation has reached some sort of a limit in this case but would like to know how to interpret/report this value?
Thanks
It's more likely that you have an outlier sample and this gene got flagged because of that. Have a look at section 1.4.2 of the vignette and then section 3.5 ("Dealing with count outliers").
I should note that I've also been bitten by this once, where a hemizygous gene in one group simply had a weird distribution due to the genetic modification. So, the outlier detection isn't exactly infallible (not that that'll surprise anyone).
Dear ExMachina
Have a look at the documentation of the package, i.e. the vignette or the man page of the
results
function. There it says:
By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to
NA
. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the
filtered_p
function of
genefilter
are provided here to control or turn off the independent filtering behavior.
In addition, results by default assigns a p-value of
NA
to genes containing count outliers, as identified using Cook's distance. See the
cooksCutoff
argument for control of this behavior. Cook's distances for each sample are accessible as a matrix "
cooks
" stored in the
assays()
list. This measure is useful for identifying rows where the observed counts might not fit to a negative binomial distribution.
In your case, it sounds like the second one applies?
PS I assume you are using a recent version of the package (always post the output of sessionInfo() for such posts).
Kind regards
Wolfgang
Thanks Wolfgang and dpryan.
Yes, it seems like it is getting flagged for not fitting the NegBin dist. So it sounds like adjusting the Cook's cutoff is what I need to look into?
Is there a simplified explanation of what this really "means"?--I'm shortly going to have to explain to a bunch of people
why
a gene that appears so obviously differentially expressed (in terms the distributions and magnitudes of the read counts) is being flagged by DESeq2. The complicating factor is that the DE of this gene is very well established in one of the cases I'm looking at; consequently, I'm sure many will raise a concern about the interpretation (and appropriateness) of "NA" genes in the cases of other genes.
FYI, sessionInfo:
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] gplots_2.14.1 RColorBrewer_1.0-5 vsn_3.32.0 Biobase_2.24.0
[5] DESeq2_1.4.5 RcppArmadillo_0.4.320.0 Rcpp_0.11.2 GenomicRanges_1.16.4
[9] GenomeInfoDb_1.0.2 IRanges_1.22.10 BiocGenerics_0.10.0 BiocInstaller_1.14.2
loaded via a namespace (and not attached):
[1] affy_1.42.3 affyio_1.32.0 annotate_1.42.1 AnnotationDbi_1.26.0
[5] bitops_1.0-6 caTools_1.17 DBI_0.2-7 gdata_2.13.3
[9] genefilter_1.46.1 geneplotter_1.42.0 grid_3.1.1 gtools_3.4.1
[13] KernSmooth_2.23-12 lattice_0.20-29 limma_3.20.8 locfit_1.5-9.1
[17] preprocessCore_1.26.1 RSQLite_0.11.4 splines_3.1.1 stats4_3.1.1
[21] survival_2.37-7 tools_3.1.1 XML_3.98-1.1 xtable_1.7-3
[25] XVector_0.4.0 zlibbioc_1.10.0
You can find plain language explanations for the Cook's filtering in two places: the vignette has a simplified explanation in the 'Theory' section and the citation of the package also explains it.
We use the Cook's distance as a diagnostic to tell if a single sample has a count which has a disproportionate impact on the log fold change and p-values. You can turn off the filtering and inspect the genes manually with:
results(dds, cooksCutoff=FALSE)
The maximum of Cook's distances for a row is in the column:
mcols(dds)$maxCooks
Or the matrix for all genes and samples:
assays(dds)[["cooks"]]
What I meant to ask was, is an "NA" in this case best interpreted as as potential false positive requiring further inspection, or is it a warning that any assigned p-value may very well be meaningless? The former is what it
sounds
like it means, but I know of some people taking a Cook's Cutoff assigned "NA" as an indication that that gene somehow failed to show DE.
Is it wrong-headed (or simply naive) to tinker with the Cooks cutoff until the cutoff is able to accommodate a gene that is known to be DE? Or, if a sample is truly causing a problem, would it be better to just remove that sample than risk letting a lot of artifacts leak through.
Thanks again. The answers here have already been very helpful.
The NA p-value is a warning about a potential outlier. It's not "failed to show DE".
I myself have turned off Cook's cutoff and inspected the counts of rows with high maxCooks manually, for an experiment which had low variance of counts for all but one condition, which had highly variable counts due to details of the protocol.
Additionally, working with this and a few other datasets with one highly variable condition, I have made a small change in the development branch (will be released in 2 months) which helps to avoid too much outlier calling in such cases.
Thank you for the additional perspective. So basically a good starting strategy would seem be to run DESeq2 twice (once with and once without Cooks) and then append the Cooks flags to the non-Cooks output. Then, visually inspect from there. Then perhaps rerun with an adjusted cutoff (or a sample discarded). Sound reasonable?
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...