Statistics_genomics

1. Load the Montgomery and Pickrell eSet:

library(ballgown)
library(Biobase)
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)

What percentage of variation is explained by the 1st principal component in the data set if you:

Do no transformations?
log2(data + 1) transform?
log2(data + 1) transform and subtract row means?

# No transformations
svd1 = svd(edata)
ori_pca = svd1$d^2/sum(svd1$d^2)
ori_pca[1]

## [1] 0.8873421

# log2 transform
edata_log2 = log2(edata + 1)
svd2 = svd(edata_log2)
log2_pca = svd2$d^2/sum(svd2$d^2)
log2_pca[1]

## [1] 0.9737781

# log2 transform, subtract row means
edata_centered = edata_log2 - rowMeans(edata_log2)
svd3 = svd(edata_centered)
centered_data_pca = svd3$d^2/sum(svd3$d^2)
centered_data_pca[1]

## [1] 0.3463729

2. Load the Montgomery and Pickrell eSet:

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)

Perform the log2(data + 1) transform and subtract row means from the samples. Set the seed to 333 and use k-means to cluster the samples into two clusters. Use svd to calculate the singular vectors. What is the correlation between the first singular vector and the sample clustering indicator?

edata_log2 = log2(edata + 1)
edata_centered = edata_log2 - rowMeans(edata_log2)

# use svd to calculate the singular vectors
set.seed(333)
svd1 = svd(edata_centered)

edata_kmeans = kmeans(t(edata_centered), centers=2)
cor.test(svd1$v[,1], edata_kmeans$cluster)

## 
##  Pearson's product-moment correlation
## 
## data:  svd1$v[, 1] and edata_kmeans$cluster
## t = -19.683, df = 127, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9049326 -0.8176191
## sample estimates:
##        cor 
## -0.8678247

3. Load the Bodymap data with the following command

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
edata = exprs(bm)
pdata_bm=pData(bm)

Fit a linear model relating the first gene’s counts to the number of technical replicates, treating the number of replicates as a factor. Plot the data for this gene versus the covariate. Can you think of why this model might not fit well?

# fit linear model
lm1 = lm(edata[1,] ~ pdata_bm$num.tech.reps)

# plot the data
plot(pdata_bm$num.tech.reps,edata[1,])
abline(lm1$coeff[1], lm1$coeff[2], col=2, lwd=3)

There are very few samples with more than 2 replicates so the estimates for those values will not be very good.

4. Fit a linear model relating he first gene’s counts to the age of the person and the sex of the samples. What is the value and interpretation of the coefficient for age?

# fit linear model
lm2 = lm(edata[1,] ~ pdata_bm$age + pdata_bm$gender)
summary(lm2)

## 
## Call:
## lm(formula = edata[1, ] ~ pdata_bm$age + pdata_bm$gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -734.35 -229.31   -3.26  243.02  768.09 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2331.581    438.181   5.321 0.000139 ***
## pdata_bm$age      -23.913      6.488  -3.686 0.002744 ** 
## pdata_bm$genderM -207.257    236.431  -0.877 0.396610    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 469.6 on 13 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5147, Adjusted R-squared:   0.44 
## F-statistic: 6.894 on 2 and 13 DF,  p-value: 0.009102

This coefficient means that for each additional year of age, the count goes down by an average of 23.91 for a fixed sex.

5. Load the Montgomery and Pickrell eSet:

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)

Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the lm.fit function (hint: don’t forget the intercept). What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?

edata = log2(edata + 1)

# fit a regression model to each sample, using population as the outcome
mod = model.matrix(~ pdata$population)
fit = lm.fit(mod, t(edata))

# dimension of the residual matrix
dim(fit$residuals)

## [1]   129 52580

# dimension of the effects matrix
dim(fit$effects)

## [1]   129 52580

# dimension of the coefficients matrix
dim(fit$coefficients)

## [1]     2 52580

6. Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the lm.fit function (hint: don’t forget the intercept). What is the effects matrix?

?lm.fit

7. Load the Bodymap data with the following command

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
edata = exprs(bm)
pdata_bm=pData(bm)

Fit many regression models to the expression data where age is the outcome variable using the lmFit function from the limma package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the coefficient for age for the 1,000th gene? Make a plot of the data and fitted values for this gene. Does the model fit well?

library(devtools)
library(Biobase)
library(limma)
library(edge)

# subset the expression data to the samples without mimssing values of age
pdata_bm = na.omit(pdata_bm)
edata = edata[,rownames(pdata_bm), drop=FALSE]

# fit many regression models to the expression data where age is the outcome
mod_adj = model.matrix(~ pdata_bm$age)
fit_limma = lmFit(edata,mod_adj)

fit_limma$coefficients[1000,]

##  (Intercept) pdata_bm$age 
##   2469.87375    -27.61178

# make a plot of the 1,000th gene and fitted values
intercept = fit_limma$coefficients[1000,][1]
slope = fit_limma$coefficients[1000,][2]
x = edata[1000,]*slope+intercept

plot(x,pdata_bm$age)

The model doesn’t fit well since there are two large outlying values and the rest of the values are near zero.

8. Fit many regression models to the expression data where age is the outcome variable and tissue.type is an adjustment variable using the lmFit function from the limma package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is wrong with this model?

pdata_bm$tissue.type

##  [1] adipose          adrenal          brain            breast          
##  [5] colon            heart            kidney           liver           
##  [9] lung             lymphnode        ovary            prostate        
## [13] skeletal_muscle  testes           thyroid          white_blood_cell
## 17 Levels: adipose adrenal brain breast colon heart kidney liver ... white_blood_cell

tissue.type has 18 levels but there are only 16 data points per gene, so this model can’t fit a unique solution.

9. Why is it difficult to distinguish the study effect from the population effect in the Montgomery Pickrell dataset from ReCount?

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)

head(pdata)

##         sample.id num.tech.reps population      study
## NA06985   NA06985             1        CEU Montgomery
## NA06986   NA06986             1        CEU Montgomery
## NA06994   NA06994             1        CEU Montgomery
## NA07000   NA07000             1        CEU Montgomery
## NA07037   NA07037             1        CEU Montgomery
## NA07051   NA07051             1        CEU Montgomery

The effects are difficult to distinguish because each study only measured one population.

10. Load the Bodymap data with the following command

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
edata = exprs(bm)
pdata_bm=pData(bm)

Set the seed using the command set.seed(33353) then estimate a single surrogate variable using the sva function after log2(data + 1) transforming the expression data, removing rows with rowMeans less than 1, and treating age as the outcome (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the correlation between the estimated surrogate for batch and age? Is the surrogate more highly correlated with race or gender?

library(devtools)
library(Biobase)
library(sva)
library(bladderbatch)
library(snpStats)

# preprocessing the data
set.seed(33353)
pheno = na.omit(pdata_bm)
edata = edata[,rownames(pheno), drop=FALSE]
edata = log2(edata + 1)
edata = edata[rowMeans(edata) > 1,]

# fit a sva model
mod = model.matrix(~age, data=pheno)
mod0 = model.matrix(~1, data=pheno)
sva1 = sva(edata, mod,mod0, n.sv=2)

## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

# correlation between surrogate for batch and age
cor(sva1$sv, pheno$age)

##            [,1]
## [1,] -0.1965417
## [2,] -0.1560322

# correlation between surrogate for batch and race
cor(sva1$sv, as.numeric(pheno$race))

##            [,1]
## [1,] -0.2265909
## [2,] -0.1051495

# correlation between surrogate for batch and gender
cor(sva1$sv, as.numeric(pheno$gender))

##             [,1]
## [1,] -0.35780610
## [2,]  0.04154497

sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] snpStats_1.36.0     Matrix_1.2-18       survival_3.1-12    
##  [4] bladderbatch_1.24.0 sva_3.34.0          BiocParallel_1.20.1
##  [7] genefilter_1.68.0   mgcv_1.8-31         nlme_3.1-147       
## [10] edge_2.18.0         limma_3.42.2        devtools_2.3.0     
## [13] usethis_1.6.1       Biobase_2.46.0      BiocGenerics_0.32.0
## [16] ballgown_2.18.0    
## 
## loaded via a namespace (and not attached):
##  [1] minqa_1.2.4                 colorspace_1.4-1           
##  [3] ellipsis_0.3.0              rprojroot_1.3-2            
##  [5] qvalue_2.18.0               corpcor_1.6.9              
##  [7] XVector_0.26.0              GenomicRanges_1.38.0       
##  [9] fs_1.4.1                    remotes_2.1.1              
## [11] bit64_0.9-7                 AnnotationDbi_1.48.0       
## [13] fansi_0.4.1                 ClusterR_1.2.1             
## [15] splines_3.6.3               knitr_1.28                 
## [17] pkgload_1.0.2               nloptr_1.2.2.1             
## [19] Rsamtools_2.2.3             annotate_1.64.0            
## [21] cluster_2.1.0               lfa_1.16.0                 
## [23] compiler_3.6.3              backports_1.1.6            
## [25] assertthat_0.2.1            cli_2.0.2                  
## [27] htmltools_0.4.0             prettyunits_1.1.1          
## [29] tools_3.6.3                 gmp_0.5-13.6               
## [31] rsvd_1.0.3                  gtable_0.3.0               
## [33] glue_1.4.0                  GenomeInfoDbData_1.2.2     
## [35] reshape2_1.4.4              dplyr_0.8.5                
## [37] Rcpp_1.0.4.6                vctrs_0.2.4                
## [39] Biostrings_2.54.0           rtracklayer_1.46.0         
## [41] xfun_0.13                   stringr_1.4.0              
## [43] ps_1.3.2                    testthat_2.3.2             
## [45] lme4_1.1-23                 lifecycle_0.2.0            
## [47] irlba_2.3.3                 gtools_3.8.2               
## [49] statmod_1.4.34              XML_3.99-0.3               
## [51] zlibbioc_1.32.0             MASS_7.3-51.6              
## [53] scales_1.1.0                SummarizedExperiment_1.16.1
## [55] RColorBrewer_1.1-2          yaml_2.2.1                 
## [57] memoise_1.1.0               ggplot2_3.3.0              
## [59] stringi_1.4.6               RSQLite_2.2.0              
## [61] S4Vectors_0.24.4            desc_1.2.0                 
## [63] boot_1.3-25                 pkgbuild_1.0.7             
## [65] GenomeInfoDb_1.22.1         rlang_0.4.6                
## [67] pkgconfig_2.0.3             matrixStats_0.56.0         
## [69] bitops_1.0-6                evaluate_0.14              
## [71] lattice_0.20-41             purrr_0.3.4                
## [73] GenomicAlignments_1.22.1    bit_1.1-15.2               
## [75] processx_3.4.2              tidyselect_1.0.0           
## [77] plyr_1.8.6                  magrittr_1.5               
## [79] R6_2.4.1                    IRanges_2.20.2             
## [81] jackstraw_1.3               DelayedArray_0.12.3        
## [83] DBI_1.1.0                   pillar_1.4.3               
## [85] withr_2.2.0                 RCurl_1.98-1.2             
## [87] tibble_3.0.1                crayon_1.3.4               
## [89] rmarkdown_2.1               grid_3.6.3                 
## [91] snm_1.34.0                  blob_1.2.1                 
## [93] callr_3.4.3                 digest_0.6.25              
## [95] xtable_1.8-4                stats4_3.6.3               
## [97] munsell_0.5.0               sessioninfo_1.1.1

Statistics_genomics_quiz2

Shengyuan Wang

4/17/2020