For this project, it is recommended that you use the VMBox virtual environment provided with the Course package and the tools therein. You may also use your own system and software, however make sure that appropriate versions are installed. The answers are compatible with the following versions of the software: samtools v.1.2, bedtools v.2.24.0.

As part of a larger project cataloging genetic variation in the plant Arabidopsis thaliana, you sequenced and assembled the genome of one strain (‘wu_0_A’), then mapped back the reads to the assembled genome. The resulting BAM file is included in the package ‘gencommand_proj2_data.tar.gz’. Using SAMtools and BEDtools as well as other Unix commands introduced in this course, examine the files and answer the following questions. NOTE: Input data have been obtained and modified from those generated by the 1001 Genomes Project, accession ‘Wu_0_A’.

Apply these rules and steps to the questions marked above each rule.

Questions 1 - 5:
For the original set of alignments (file ‘athal_wu_0_A.bam’)

Questions 6 - 10:
Extract only the alignments in the range “Chr3:11,777,000-11,794,000”, corresponding to a locus of interest.

Questions 11 - 15:
Determine general information about the alignment process from the original BAM file.

Questions 16 - 20:
Using BEDtools, examine how many of the alignments at point 2 overlap exons at the locus of interest. Use the BEDtools ‘-wo’ option to only report non-zero overlaps. The list of exons is given in the included ‘athal_wu_0_A_annot.gtf’ GTF file.

1. How many alignments does the set contain?
samtools view athal_wu_0_A.bam | wc -l

## 221372

2. How many alignments show the read’s mate unmapped?
samtools view athal_wu_0_A.bam | cut -f7 | grep "*" | wc -l

## 65521

3. How many alignments contain a deletion (D)?
samtools view athal_wu_0_A.bam | cut -f6 | grep "D" | wc -l

## 2451

4. How many alignments show the read’s mate mapped to the same chromosome?
samtools view athal_wu_0_A.bam | cut -f7 | grep "=" | wc -l

## 150913

5. How many alignments are spliced?
samtools view athal_wu_0_A.bam | cut -f6 | grep "N" | wc -l

## 0

6. How many alignments does the set contain?
# sort the bam_file
samtools sort athal_wu_0_A.bam -o athal_wu_0_A.sort.bam

# index the bam_file
samtools index athal_wu_0_A.sort.bam

# extract range “Chr3:11,777,000-11,794,000”
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | wc -l

## 7081

7. How many alignments show the read’s mate unmapped?
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f7 | grep "*" | wc -l

## 1983

8. How many alignments contain a deletion (D)?
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f6 | grep "D" | wc -l

## 31

9. How many alignments show the read’s mate mapped to the same chromosome?
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f7 | grep "=" | wc -l

## 4670

10. How many alignments are spliced?
samtools view athal_wu_0_A.sort.bam "Chr3:11,777,000-11,794,000" | cut -f6 | grep "N" | wc -l

## 0

11. How many sequences are in the genome file?
12. What is the length of the first sequence in the genome file?
13. What alignment tool was used?
samtools view -H athal_wu_0_A.bam

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:Chr1 LN:29923332 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr2 LN:19386101 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr3 LN:23042017 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr4 LN:18307997 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr5 LN:26567293 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:chloroplast LN:154478 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:mitochondria LN:366924 AS:wu_0.v7.fas SP:wu_0.v7.fas
@RG ID:H100223_GAII05_0002 PL:SLX LB:wu_PII PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_PER01 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_PER02 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_SR03 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@PG ID:stampy

## 7
## 29923332
## stampy

14. What is the read identifier (name) for the first alignment?
samtools view athal_wu_0_A.bam | head

GAII05_0002:1:113:7822:3886#0 1187 Chr3 11699950 60 51M = 11700332 433 AAAAAAAATGTAAAACTGCTAAATCTCTCCTCTCTAAAGAACTCGTCCCCG CCCCCCBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCBAAB??@ACBBCCCD PQ:i:21 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002

## GAII05_0002:1:113:7822:3886#0

15. What is the start position of this read’s mate on the genome? Give this as ‘chrom:pos’ if the read was mapped, or ‘*” if unmapped.

## chr3:11700332

16. How many overlaps (each overlap is reported on one line) are reported?
bedtools bamtobed -i athal_wu_0_A.bam > athal_wu_0_A.bed
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | wc -l

## 3101

17. How many of these are 10 bases or longer?
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | cut -f16 > count.txt
awk '$1>9{c++} END{print c+0}' count.txt

## 2899

18. How many alignments overlap the annotations?

## 3101

19. Conversely, how many exons have reads mapped to them?
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | cut -f4 | sort -u | wc -l

## 21

20. If you were to convert the transcript annotations in the file “athal_wu_0_A_annot.gtf” into BED format, how many BED records would be generated?
bedtools intersect -wo -a athal_wu_0_A_annot.gtf -b athal_wu_0_A.bed | cut -f9 | cut -d " " -f4 | sort -u | wc -l

## 4