1. The yeastRNASeq experiment data package contains FASTQ files from an RNA seq experiment in yeast. When the package is installed, you can access one of the FASTQ files by the path given by
library(yeastRNASeq)
fastqFilePath <- system.file("reads", "wt_1_f.fastq.gz", package = "yeastRNASeq")
Question: What fraction of reads in this file has an A nucleotide in the 5th base of the read?
library(ShortRead)
library(Biostrings)

# read fastq file
reads <- readFastq(fastqFilePath)
DNAStringSet <- sread(reads)

# fraction of reads has an A in the 5th base
cm <- consensusMatrix(DNAStringSet, as.prob=TRUE, baseOnly=TRUE)
cm['A', 5]
##        A 
## 0.363841
2. Question: What is the average numeric quality value of the 5th base of these reads?
mean(as(quality(reads), "matrix")[,5])
## [1] 28.93346
3. The leeBamViews experiment data package contains aligned BAM files from an RNA seq experiment in yeast (the same experiment as in Questions 1 and 2, but that is not pertinent to the question). You can access one of the BAM files by the path given by
library(leeBamViews)
bamFilePath <- system.file("bam", "isowt5_13e.bam", package="leeBamViews")
These reads are short reads (36bp) and have been aligned to the genome using a standard aligner, ie. potential junctions have been ignored (this makes some sense as yeast has very few junctions and the reads are very short). A read duplicated by position is a read where at least one more read shares the same position. We will focus on the interval from 800,000 to 801,000 on yeast chromosome 13.
Question: In this interval, how many reads are duplicated by position?
library(Rsamtools)

bamFile <- BamFile(bamFilePath)

# focus on Scchr13, interval from 800,000 to 801,000
gr <- GRanges(seqnames = "Scchr13", ranges = IRanges(start = c(800000), end = c(801000)))
params <- ScanBamParam(which = gr, what = scanBamWhat())
aln <- scanBam(bamFile, param = params)

# find duplicates
sum(table(aln[[1]]$pos)) - sum(table(aln[[1]]$pos) == 1)
## [1] 129
4. The package contains 8 BAM files in total, representing 8 different samples from 4 groups. A full list of file paths can be had as
bpaths <- list.files(system.file("bam", package="leeBamViews"), pattern = "bam$", full=TRUE)
An objective of the original paper was the discovery of novel transcribed regions in yeast. One such region is Scchr13:807762-808068.
Question: What is the average number of reads across the 8 samples falling in this interval?
# focus on the novel transcribed regions
bamView <- BamViews(bpaths)
gr_nt <- GRanges(seqnames="Scchr13", ranges=IRanges(start = c(807762), end = c(808068)))
bamRanges(bamView) <- gr_nt
aln_nt <- scanBam(bamView)

# get sequences for each sample
alns <- lapply(aln_nt, function(xx) xx[[1]]$seq)

# calculate the average number of reads across 8 the samples
alns_len_sum = 0
for (i in 1:length(alns)){
  alns_len_sum = alns_len_sum + length(alns[i][[1]])
}
alns_len_sum / length(alns)
## [1] 90.25
5. In the lecture on the oligo package an ExpressionSet with 18 samples is constructed, representing normalized data from an Affymetrix gene expression microarray. The samples are divided into two groups given by the group variable.
Question: What is the average expression across samples in the control group for the “8149273” probeset (this is a character identifier, not a row number).
library(oligo)
library(GEOquery)

# get data
getGEOSuppFiles("GSE38792")
untar("GSE38792/GSE38792_RAW.tar", exdir = "GSE38792/CEL")

# read data
celfiles <- list.files("GSE38792/CEL", full = TRUE)
rawData <- read.celfiles(celfiles)

# parse pData
filename <- sampleNames(rawData)
pData(rawData)$filename <- filename
sampleNames <- sub(".*_", "", filename)
sampleNames <- sub(".CEL.gz$", "", sampleNames)
sampleNames(rawData) <- sampleNames
pData(rawData)$group <- ifelse(grepl("^OSA", sampleNames(rawData)), "OSA", "Control")

# find "8149273" probeset
normData <- rma(rawData)
loc <- match("8149273", rownames(normData))
# average expression in control group
mean(exprs(normData[loc,])[1:8])
## [1] 7.02183
6. Use the limma package to fit a two group comparison between the control group and the OSA group, and borrow strength across the genes using eBayes(). Include all 18 samples in the model fit.
Question: What is the absolute value of the log foldchange (logFC) of the gene with the lowest P.value.
library(limma)

# use limma to fit between control group and OSA group
normData$group <- factor(normData$group)
design <- model.matrix(~normData$group)
fit <- lmFit(normData, design)
fit <- eBayes(fit)

# absolute value of logFC which has lowest P.value
abs(topTable(fit)$logFC[1])
## [1] 0.7126484
7. Question: How many genes are differentially expressed between the two groups at an adj.P.value cutoff of 0.05?
fit_toptable <- topTable(fit)
de <- subset(fit_toptable, adj.P.Val < 0.05)
de
## [1] logFC     AveExpr   t         P.Value   adj.P.Val B        
## <0 rows> (or 0-length row.names)
8. An example 450k dataset is contained in the minfiData package. This dataset contains 6 samples; 3 cancer and 3 normals. Cancer has been shown to be globally hypo-methylated (less methylated) compared to normal tissue of the same kind.
Take the RGsetEx dataset in this package and preprocess it with the preprocessFunnorm function. For each sample, compute the average Beta value (percent methylation) across so-called OpenSea loci.
Question: What is the mean difference in beta values between the 3 normal samples and the 3 cancer samples, across OpenSea CpGs?
library(minfi)
library(minfiData)

# get OpenSea loci in RGsetEx with preprocess
rgSet <- preprocessFunnorm(RGsetEx)
rg_opensea <- rgSet[c(getIslandStatus(rgSet) == "OpenSea")]

# get Beta value in both group
rg_beta <- getBeta(rg_opensea)
normal <- mean(rg_beta[, c(1,2,5)])
cancer <- mean(rg_beta[, c(3,4,6)])

# mean difference between normal and cancer group
normal - cancer
## [1] 0.08863657
9. The Caco2 cell line is a colon cancer cell line profiled by ENCODE. Obtain the narrowPeak DNase hyper sensitive sites computed by the analysis working group (AWG).
Question: How many of these DNase hypersensitive sites contain one or more CpGs on the 450k array?
library(AnnotationHub)

# get Caco2 data
ah <- AnnotationHub()
ah <- subset(ah, species=="Homo sapiens")
ah_Caco2 <- query(ah, c("Caco2", "AWG"))
ah_Caco2 <- ah_Caco2[["AH22442"]]

CpG_450K <- granges(rgSet)

unique(findOverlaps(CpG_450K, ah_Caco2, type="within"))
## Hits object with 68183 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]        40           9
##       [2]        51          10
##       [3]        52          10
##       [4]        53          10
##       [5]        71          11
##       ...       ...         ...
##   [68179]    485086      122978
##   [68180]    485087      122978
##   [68181]    485090      122979
##   [68182]    485091      122979
##   [68183]    485274      123007
##   -------
##   queryLength: 485512 / subjectLength: 123048
10. The zebrafishRNASeq package contains summarized data from an RNA-seq experiment in zebrafish in the form of a data.frame called zfGenes. The experiment compared 3 control samples to 3 treatment samples. Each row is a transcript; the data.frame contains 92 rows with spikein transcripts; these have a rowname starting with “ERCC”. Exclude these rows from the analysis.
Use DESeq2 to perform a differential expression analysis between control and treatment. Do not discard (filter) genes and use the padj results output as the p-value.
Question: How many features are differentially expressed between control and treatment (ie. padj<=0.05)?
library(DESeq2)
library(zebrafishRNASeq)

# get and parse data
data("zfGenes")
zf <- zfGenes[grep("^ERCC", rownames(zfGenes), invert = T), ]
zf <- as.matrix(zf)
colData <- DataFrame(sampleID = colnames(zf), group = as.factor(c("control", "control", "control", "treatment", "treatment", "treatment")))

# perform DESeq2
dds <- DESeqDataSetFromMatrix(zf, colData, design = ~ group)
dds <- DESeq(dds)

# find differentially expressed features
res <- results(dds)
sigRes <- subset(res, padj <= 0.05)
dim(sigRes)[1]
## [1] 116
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] zebrafishRNASeq_1.6.0                             
##  [2] DESeq2_1.26.0                                     
##  [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0                 
##  [4] AnnotationHub_2.18.0                              
##  [5] BiocFileCache_1.10.2                              
##  [6] dbplyr_1.4.3                                      
##  [7] minfiData_0.32.0                                  
##  [8] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
##  [9] IlluminaHumanMethylation450kmanifest_0.4.0        
## [10] minfi_1.32.0                                      
## [11] bumphunter_1.28.0                                 
## [12] locfit_1.5-9.4                                    
## [13] iterators_1.0.12                                  
## [14] foreach_1.5.0                                     
## [15] limma_3.42.2                                      
## [16] pd.hugene.1.0.st.v1_3.14.1                        
## [17] DBI_1.1.0                                         
## [18] RSQLite_2.2.0                                     
## [19] GEOquery_2.54.1                                   
## [20] oligo_1.50.0                                      
## [21] oligoClasses_1.48.0                               
## [22] leeBamViews_1.22.0                                
## [23] BSgenome_1.54.0                                   
## [24] rtracklayer_1.46.0                                
## [25] ShortRead_1.44.3                                  
## [26] GenomicAlignments_1.22.1                          
## [27] SummarizedExperiment_1.16.1                       
## [28] DelayedArray_0.12.3                               
## [29] matrixStats_0.56.0                                
## [30] Biobase_2.46.0                                    
## [31] Rsamtools_2.2.3                                   
## [32] GenomicRanges_1.38.0                              
## [33] GenomeInfoDb_1.22.1                               
## [34] Biostrings_2.54.0                                 
## [35] XVector_0.26.0                                    
## [36] IRanges_2.20.2                                    
## [37] S4Vectors_0.24.4                                  
## [38] BiocParallel_1.20.1                               
## [39] BiocGenerics_0.32.0                               
## [40] yeastRNASeq_0.24.0                                
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.7               Hmisc_4.4-0                  
##   [3] plyr_1.8.6                    splines_3.6.3                
##   [5] ggplot2_3.3.0                 digest_0.6.25                
##   [7] htmltools_0.4.0               checkmate_2.0.0              
##   [9] magrittr_1.5                  memoise_1.1.0                
##  [11] cluster_2.1.0                 readr_1.3.1                  
##  [13] annotate_1.64.0               askpass_1.1                  
##  [15] siggenes_1.60.0               prettyunits_1.1.1            
##  [17] colorspace_1.4-1              jpeg_0.1-8.1                 
##  [19] blob_1.2.1                    rappdirs_0.3.1               
##  [21] xfun_0.13                     dplyr_0.8.5                  
##  [23] crayon_1.3.4                  RCurl_1.98-1.2               
##  [25] genefilter_1.68.0             survival_3.1-12              
##  [27] glue_1.4.1                    gtable_0.3.0                 
##  [29] zlibbioc_1.32.0               Rhdf5lib_1.8.0               
##  [31] HDF5Array_1.14.4              scales_1.1.1                 
##  [33] rngtools_1.5                  Rcpp_1.0.4.6                 
##  [35] htmlTable_1.13.3              xtable_1.8-4                 
##  [37] progress_1.2.2                foreign_0.8-76               
##  [39] bit_1.1-15.2                  mclust_5.4.6                 
##  [41] preprocessCore_1.48.0         Formula_1.2-3                
##  [43] htmlwidgets_1.5.1             httr_1.4.1                   
##  [45] RColorBrewer_1.1-2            acepack_1.4.1                
##  [47] ellipsis_0.3.1                ff_2.2-14.2                  
##  [49] pkgconfig_2.0.3               reshape_0.8.8                
##  [51] XML_3.99-0.3                  nnet_7.3-14                  
##  [53] tidyselect_1.1.0              rlang_0.4.6                  
##  [55] later_1.0.0                   AnnotationDbi_1.48.0         
##  [57] munsell_0.5.0                 BiocVersion_3.10.1           
##  [59] tools_3.6.3                   fastmap_1.0.1                
##  [61] evaluate_0.14                 stringr_1.4.0                
##  [63] yaml_2.2.1                    knitr_1.28                   
##  [65] bit64_0.9-7                   beanplot_1.2                 
##  [67] scrime_1.3.5                  purrr_0.3.4                  
##  [69] nlme_3.1-147                  doRNG_1.8.2                  
##  [71] mime_0.9                      nor1mix_1.3-0                
##  [73] xml2_1.3.2                    biomaRt_2.42.1               
##  [75] rstudioapi_0.11               compiler_3.6.3               
##  [77] curl_4.3                      png_0.1-7                    
##  [79] interactiveDisplayBase_1.24.0 affyio_1.56.0                
##  [81] geneplotter_1.64.0            tibble_3.0.1                 
##  [83] stringi_1.4.6                 GenomicFeatures_1.38.2       
##  [85] lattice_0.20-41               Matrix_1.2-18                
##  [87] multtest_2.42.0               vctrs_0.3.0                  
##  [89] pillar_1.4.4                  lifecycle_0.2.0              
##  [91] BiocManager_1.30.10           data.table_1.12.8            
##  [93] bitops_1.0-6                  httpuv_1.5.2                 
##  [95] R6_2.4.1                      latticeExtra_0.6-29          
##  [97] hwriter_1.3.2                 promises_1.1.0               
##  [99] gridExtra_2.3                 affxparser_1.58.0            
## [101] codetools_0.2-16              MASS_7.3-51.6                
## [103] assertthat_0.2.1              rhdf5_2.30.1                 
## [105] openssl_1.4.1                 GenomeInfoDbData_1.2.2       
## [107] hms_0.5.3                     rpart_4.1-15                 
## [109] quadprog_1.5-8                grid_3.6.3                   
## [111] tidyr_1.0.3                   base64_2.0                   
## [113] rmarkdown_2.1                 DelayedMatrixStats_1.8.0     
## [115] illuminaio_0.28.0             base64enc_0.1-3              
## [117] shiny_1.4.0.2