For this project, it is recommended that you use the VMBox virtual environment provided with the Course package and the tools therein. You may also use your own system and software, however make sure that appropriate versions are installed. The answers are compatible with the following versions of the software: samtools v.1.2, bowtie v.2.2.5 and bcftools v.1.2.
As part of the effort to catalog genetic variation in the plant Arabidopsis thaliana, you re-sequenced the genome of one strain (‘wu_0_A’; genome file: ‘wu_0.v7.fas’), to determine genetic variants in this organism. The sequencing reads produced are in the file ‘wu_0_A_wgs.fastq’. Using the tools bowtie2, samtools and bcftools, develop a pipeline for variant calling in this genome. NOTE: Genome and re-sequencing data have been obtained and modified from those generated by the 1001 Genomes Project, accession ‘Wu_0_A’.
Apply to questions 1 - 5:
Generate a bowtie2 index of the wu_0_A genome using bowtie2-build, with the prefix ‘wu_0’.
Apply to questions 6 - 14:
Run bowtie2 to align the reads to the genome, under two scenarios: first, to report only full-length matches of the reads; and second, to allow partial (local) matches. All other parameters are as set by default.
For the following set of questions (11 - 20), use the set of full-length alignments calculated under scenario 1 only. Convert this SAM file to BAM, then sort the resulting BAM file.
Apply to questions 15 - 19:
Compile candidate sites of variation using SAMtools mpileup for further evaluation with BCFtools. Provide the reference fasta genome and use the option “-uv” to generate the output in uncompressed VCF format for easy examination.
Apply to questions 20 - 24:
Call variants using ‘BCFtools call’ with the multiallelic-caller model. For this, you will need to first re-run SAMtools mpileup with the BCF output format option (‘-g’) to generate the set of candidate sites to be evaluated by BCFtools. In the output to BCFtools, select to show only the variant sites, in uncompressed VCF format for easy examination.
cat wu_0.v7.fas | grep ">" | wc -l
## 7
cat wu_0.v7.fas | grep ">"
## Chr3
cat wu_0.v7.fas | grep ">"
## mitochondria
bowtie2-build wu_0.v7.fas wu_0/wu_0
## 6
## bt2
cat wu_0_A_wgs.fastq | wc -l
## 147354
bowtie2 -p 4 -x wu_0/wu_0 wu_0_A_wgs.fastq -S wu_0.bt2.sam
147354 reads; of these:
147354 (100.00%) were unpaired; of these:
9636 (6.54%) aligned 0 times
93780 (63.64%) aligned exactly 1 time
43938 (29.82%) aligned >1 times
93.46% overall alignment rate
## 137719
## 137719
## 43939
bowtie2 -p 4 --local -x wu_0/wu_0 wu_0_A_wgs.fastq -S wu_0.bt2.local.sam
147354 reads; of these:
147354 (100.00%) were unpaired; of these:
6823 (4.63%) aligned 0 times
88935 (60.35%) aligned exactly 1 time
51596 (35.01%) aligned >1 times
95.37% overall alignment rate
## 141044
## 141044
## 56105
cat wu_0.bt2.sam | cut -f6 | grep "I" | grep "D" | wc -l
## 2782
cat wu_0.bt2.local.sam | cut -f6 | grep "I" | grep "D" | wc -l
## 2614
samtools view -bT wu_0.v7.fas wu_0.bt2.sam > wu_0.bt2.bam
samtools sort wu_0.bt2.bam -o wu_0.bt2.sorted.bam
samtools index wu_0.bt2.sorted.bam
samtools mpileup -uv -f wu_0.v7.fas wu_0.bt2.sorted.bam > wu_0.vcf
cat wu_0.vcf | cut -f1 | grep "Chr3" | wc -l
## 360295
cat wu_0.vcf | cut -f4 | awk '$1 == "A"' | wc -l
## 1150985
cat wu_0.vcf | cut -f8 | grep "DP=20" | wc -l
## 1816
cat wu_0.vcf | cut -f8 | grep "INDEL" | wc -l
## 1972
cat wu_0.vcf | cut -f1,2 | grep "Chr1" | awk '$2 == "175672"' | wc -l
## 2
zcat < wu_0.vcf.gz | cut -f1 | grep "Chr3" | wc -l
## 398
zcat < wu_0.vcf.gz | cut -f4,5 | awk '$1 == "A" && $2 =="T"' | wc -l
## 392
zcat < wu_0.vcf.gz | cut -f8 | grep "INDEL" | wc -l
## 320
zcat < wu_0.vcf.gz | cut -f8 | grep "DP=20" | wc -l
## 2
zcat < wu_0.vcf.gz | grep "Chr3" | awk '$2 == "11937923"'
DP=20;VDB=0.0587288;SGB=-0.556411;RPB=0.639909;MQB=0.931063;BQB=0.972484;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=16,0,4,0;MQ=39 GT:PL 0/1:48,0,137
## SNP