For this project, it is recommended that you use the VMBox virtual environment provided with the Course package and the tools therein. You may also use your own system and software, however make sure that appropriate versions are installed. The answers are compatible with the following versions of the software: samtools v.1.2, bedtools v.2.24.0.
As part of a larger project cataloging genetic variation in the plant Arabidopsis thaliana, you sequenced and assembled the genome of one strain (‘wu_0_A’), then mapped back the reads to the assembled genome. The resulting BAM file is included in the package ‘gencommand_proj2_data.tar.gz’. Using SAMtools and BEDtools as well as other Unix commands introduced in this course, examine the files and answer the following questions. NOTE: Input data have been obtained and modified from those generated by the 1001 Genomes Project, accession ‘Wu_0_A’.
Apply these rules and steps to the questions marked above each rule.
Questions 1 - 5:
For the original set of alignments (file ‘athal_wu_0_A.bam’)
Questions 6 - 10:
Extract only the alignments in the range “Chr3:11,777,000-11,794,000”, corresponding to a locus of interest.
Questions 11 - 15:
Determine general information about the alignment process from the original BAM file.
Questions 16 - 20:
Using BEDtools, examine how many of the alignments at point 2 overlap exons at the locus of interest. Use the BEDtools ‘-wo’ option to only report non-zero overlaps. The list of exons is given in the included ‘athal_wu_0_A_annot.gtf’ GTF file.
samtools view athal_wu_0_A.bam | wc -l
## 221372
samtools view athal_wu_0_A.bam | cut -f7 | grep "*" | wc -l
## 65521
samtools view athal_wu_0_A.bam | cut -f6 | grep "D" | wc -l
## 2451
samtools view athal_wu_0_A.bam | cut -f7 | grep "=" | wc -l
## 150913
samtools view athal_wu_0_A.bam | cut -f6 | grep "N" | wc -l
## 0
# sort the bam_file
samtools sort athal_wu_0_A.bam -o athal_wu_0_A.sort.bam
# index the bam_file
samtools index athal_wu_0_A.sort.bam
# extract range “Chr3:11,777,000-11,794,000”
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | wc -l
## 7081
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f7 | grep "*" | wc -l
## 1983
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f6 | grep "D" | wc -l
## 31
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f7 | grep "=" | wc -l
## 4670
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f6 | grep "N" | wc -l
## 0
samtools view -H athal_wu_0_A.bam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:Chr1 LN:29923332 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr2 LN:19386101 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr3 LN:23042017 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr4 LN:18307997 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr5 LN:26567293 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:chloroplast LN:154478 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:mitochondria LN:366924 AS:wu_0.v7.fas SP:wu_0.v7.fas
@RG ID:H100223_GAII05_0002 PL:SLX LB:wu_PII PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_PER01 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_PER02 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_SR03 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@PG ID:stampy
## 7
## 29923332
## stampy
samtools view athal_wu_0_A.bam | head
GAII05_0002:1:113:7822:3886#0 1187 Chr3 11699950 60 51M = 11700332 433 AAAAAAAATGTAAAACTGCTAAATCTCTCCTCTCTAAAGAACTCGTCCCCG CCCCCCBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCBAAB??@ACBBCCCD PQ:i:21 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002
## GAII05_0002:1:113:7822:3886#0
## chr3:11700332
bedtools bamtobed -i athal_wu_0_A.bam > athal_wu_0_A.bed
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | wc -l
## 3101
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | cut -f16 > count.txt
awk '$1>9{c++} END{print c+0}' count.txt
## 2899
## 3101
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | cut -f4 | sort -u | wc -l
## 21
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | cut -f9 | cut -d " " -f4 | sort -u | wc -l
## 4