All the data and code are available but the codebook does not fully explain the experimental design and all protocols for patient recruitment.
knitr::opts_chunk$set(cache=TRUE)
Then create the following code chunks
x = rnorm(10)
plot(x,pch=19,col="dodgerblue")
y = rbinom(20,size=1,prob=0.5)
table(y)
The plot is random the first time you knit the document. It is identical to the first time the second time you knit the document. After removing the folders test_cache and test_files they generate new random versions.
library(Biobase)
library(GenomicRanges)
library(SummarizedExperiment)
data(sample.ExpressionSet, package = "Biobase")
se = makeSummarizedExperimentFromExpressionSet(sample.ExpressionSet)
Look up the help files for summarizedExperiment with the code summarizedExperiment. How do you access the genomic data for this object? How do you access the phenotype table? How do you access the feature data? What is the unique additional information provided by rowRanges(se)?
??summarizedExperiment
Get the genomic table with assay(se), get the phenotype table with colData(se), get the feature data with rowRanges(se). rowRanges(se) gives information on the genomic location and structure of the measured
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(file=con)
close(con)
bot = bottomly.eset
pdata_bot=pData(bot)
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
pdata_bm=pData(bm)
Just considering the phenotype data what are some reasons that the Bottomly data set is likely a better experimental design than the Bodymap data? Imagine the question of interest in the Bottomly data is to compare strains and in the Bodymap data it is to compare tissues.
The covariates in the Bottomly data set (experiment number, lane number) are balanced with respect to strain. The covariates in the Bodymap data set (gender, age, number of technical replicates) are not balanced with respect to tissue.
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
pdata_bm=pData(bm)
library(plotrix)
pie3D(pdata_bm$num.tech.reps,labels=pdata_bm$tissue.type)
The “mixture” category is split across multiple wedges.
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
edata = exprs(bm)
Which of the following code chunks will make a heatmap of the 500 most highly expressed genes (as defined by total count), without re-ordering due to clustering? Are the highly expressed samples next to each other in sample order?
row_sums = rowSums(edata)
edata = edata[order(-row_sums),]
index = 1:500
heatmap(edata[index,],Rowv=NA,Colv=NA)
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
pdata = pData(bm)
edata = exprs(bm)
Make an MA-plot of the first sample versus the second sample using the log2 transform (hint: you may have to add 1 first) and the rlog transform from the DESeq2 package. How are the two MA-plots different? Which kind of genes appear most different in each plot?
# make an MA-plot
mm = log2(edata[,1]+1) - log2(edata[,2]+1)
aa = log2(edata[,1]+1) + log2(edata[,2]+1)
plot(aa,mm,col=2)
library(DESeq2)
rld <- rlog(exprs(bm))
y_rld = rld[,1] - rld[,2]
x_rld = rld[,1] - rld[,2]
plot(x_rld, y_rld, col = "blue", type = "p")
The plots look pretty similar, but the rlog transform seems to shrink the low abundance genes more. In both cases, the genes in the middle of the expression distribution show the biggest differences.
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)
Cluster the data in three ways:
How do the methods compare in terms of how well they cluster the data by study? Why do you think that is?
#With no changes to the data
dist1 = dist(t(edata))
hclust1 = hclust(dist1)
par(mar=c(0, 4, 4, 2))
plot(hclust1, hang = -1, main="origin", labels=FALSE)
#After filtering all genes with rowMeans less than 100
low_genes = rowMeans(edata) < 100
filter_edata = filter(edata, !low_genes)
f_dist1 = dist(t(filter_edata))
f_hclust1 = hclust(f_dist1)
par(mar=c(0, 4, 4, 2))
plot(f_hclust1, hang = -1, main="remove low expression", labels=FALSE)
#After taking the log2 transform of the data without filtering
log_edata = log2(edata + 1)
l_dist1 = dist(t(log_edata))
l_hclust1 = hclust(l_dist1)
par(mar=c(0, 4, 4, 2))
plot(l_hclust1, hang=-1, main="perform log2 transform", labels=FALSE)
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)
Cluster the samples using k-means clustering after applying the log2 transform (be sure to add 1). Set a seed for reproducible results (use set.seed(1235)). If you choose two clusters, do you get the same two clusters as you get if you use the cutree function to cluster the samples into two groups? Which cluster matches most closely to the study labels?
edata = log2(edata + 1)
# perfrom k-means clustering
set.seed(1235)
k2 = kmeans(edata,centers=2)
matplot(t(k2$centers),col=1:2,type="l",lwd=3)
dist1 = dist(t(edata))
hclust1 = hclust(dist1)
tree = cutree(hclust1, 2)
par(mar=c(0, 4, 4, 2))
plot(hclust1, tree, main="cutree")
They produce different answers. The k-means clustering matches study better. Hierarchical clustering would look better if we went farther down the tree but the top split doesn’t perfectly describe the study variable.