-
Notifications
You must be signed in to change notification settings - Fork 0
/
data-exploration.Rmd
59 lines (48 loc) · 2.51 KB
/
data-exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
---
output:
html_document: default
---
# Healthy Gupta samples with low correlation
The heatmap of healthy samples from Gupta exhibited a weird pattern: first 5 samples, namely DF, CQ, CT, AJ, AN, are poorly correlated with the rest. A possible explanation are drastic differences in sequencing depth
```{r}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
meta <- read.table('data/otutable_metadata.csv',
sep = ",", header = TRUE, row.names = 1)
# select heakthy cols
rows <- grep("^GupDM_H", rownames(meta), value = TRUE)
meta <- meta[rows, ]
rownames(meta) <- gsub("GupDM_H", "", rownames(meta))
```
```{r}
library(data.table)
library(knitr)
kable(
meta[order(meta['number_reads']), ][c('number_bases', 'minimum_read_length', 'location')])
```
Not all of the samples that do not comply have low coverage. It might be explained with overall low DNA concentration in a sample as a consequence of sample collection and DNA extraction.
# Lost phylum after CLR normalization
On a heatmap that illustrates the correlation in species, CLR was the only method to loose some phyla, which might be the result of preparation step prior to calculating correlations.
```{r}
sdzerocols <- function(method, healthy = TRUE){
pathtoftable <- paste('camp_normalization_out/normalization/otutable_', method, '.csv', sep='')
ftable <- read.table(pathtoftable,
sep = ",", header = TRUE, row.names = 1)
if (healthy){
colnames(ftable) <- gsub("GupDM_H", "", colnames(ftable))
# sort out the healthy samples
ftable <- ftable[, rownames(meta)]
}
# transpose the feature table
ftable <- data.frame(t(ftable), check.names = FALSE)
sum(sapply(ftable, function(x) sd(x) == 0))
}
```
```{r}
df <- data.frame()
for (method in c('prepped', 'clr', 'blom', 'css', 'combat')){
df['only healthy',method] <- sdzerocols(method)
df['all samples',method] <- sdzerocols(method, healthy = FALSE)
}
kable(df, caption = "Number of zero std columns")
```
The table contains numbers of columns with zero standard deviation, which were excluded prior to correlation analysis.Turns out CLR does reduce the variation by introducing all zero species rows (116 in the case of this dataset), but zero standard deviation and the consequent inability to perform correlation analysis is the result of subsampling (e.g. selecting only healthy samples). Interestingly enough, Blom transforms the data to approximate the normal distribution, which in turn leaves the dataset zero-free in any case either with or without subsampling.