Modern FASTQ to Eigenstrat Format with AADR Merge

Jovialis · Apr 1, 2025

This bioinformatics pipeline processes raw genomic data from paired-end FASTQ files to generate merged genotype data compatible with the AADR (Allen Ancient DNA Resource) dataset, enabling population genetics analysis. It begins by setting up a working environment and installing essential tools like samtools, bwa, fastp, Picard, pileupCaller, and EIGENSOFT. The pipeline then downloads and prepares the GRCh37 reference genome (with decoy sequences), adjusts its FASTA headers, and indexes it for alignment. Paired-end FASTQ files are trimmed and filtered using fastp, aligned to the reference with bwa mem, and converted to sorted, indexed BAM files with samtools. Duplicates are removed using Picard, and the resulting BAM is processed with pileupCaller to extract genotypes at 1240k SNP positions from the AADR dataset, producing EIGENSTRAT files. Finally, the pipeline merges the sample’s genotype data with the AADR dataset using mergeit, creating a unified dataset for downstream analysis, with verification steps throughout to ensure data integrity:

Code:

# Set Your Working Directory
WORKDIR="<WORKDIR>"
cd "$WORKDIR" || { echo "Error: Unable to change directory to $WORKDIR"; exit 1; }

# 1. Update Packages and Install Required Tools
sudo apt-get update && sudo apt-get install -y samtools bwa wget gzip fastp parallel
# Install Picard manually
wget https://github.com/broadinstitute/picard/releases/download/3.2.0/picard.jar -O <OUTPUT_DIR>/picard.jar
# Install pileupCaller (sequenceTools)
wget https://github.com/stschiff/sequenceTools/archive/refs/tags/v1.6.0.tar.gz
tar -xzf v1.6.0.tar.gz
cd sequenceTools-1.6.0
make
sudo cp pileupCaller /usr/local/bin/
cd "$WORKDIR"
# Install EIGENSOFT (for mergeit)
wget https://github.com/DReichLab/EIG/archive/refs/tags/v8.0.0.tar.gz
tar -xzf v8.0.0.tar.gz
cd EIG-8.0.0/src
make
sudo cp mergeit /usr/local/bin/
cd "$WORKDIR"

# 2. Download the Reference Genome
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz -P <OUTPUT_DIR>/

# 3. Verify the Download
ls -lh <OUTPUT_DIR>/hs37d5.fa.gz

# 4. Decompress the FASTA File
gunzip <OUTPUT_DIR>/hs37d5.fa.gz

# 5. Inspect FASTA Headers
grep "^>" <OUTPUT_DIR>/hs37d5.fa | head -n 20

# 6. Adjust FASTA Headers
sed -E 's/^(>[^ ]+).*/\1/' <OUTPUT_DIR>/hs37d5.fa > <OUTPUT_DIR>/hs37d5.adjusted.fa

# 7. Verify Adjusted Headers
grep "^>" <OUTPUT_DIR>/hs37d5.adjusted.fa | head -n 20

# 8. Index the Adjusted FASTA with samtools
samtools faidx <OUTPUT_DIR>/hs37d5.adjusted.fa

# 9. Index the Adjusted FASTA with bwa
bwa index <OUTPUT_DIR>/hs37d5.adjusted.fa

# 10. Trim Paired-End FASTQ Files with fastp
fastp \
  -i <INPUT_DIR>/<SAMPLE_NAME>_1.fq.gz \
  -I <INPUT_DIR>/<SAMPLE_NAME>_2.fq.gz \
  -o <OUTPUT_DIR>/<SAMPLE_NAME>_1.trim.fq.gz \
  -O <OUTPUT_DIR>/<SAMPLE_NAME>_2.trim.fq.gz \
  -q 20 \
  -u 30 \
  --detect_adapter_for_pe \
  -h <OUTPUT_DIR>/fastp_report.html \
  -j <OUTPUT_DIR>/fastp_report.json

# 11. Align Trimmed FASTQ Files to Reference and Convert to BAM
bwa mem -M <OUTPUT_DIR>/hs37d5.adjusted.fa \
  <OUTPUT_DIR>/<SAMPLE_NAME>_1.trim.fq.gz \
  <OUTPUT_DIR>/<SAMPLE_NAME>_2.trim.fq.gz | \
samtools view -bS - > <OUTPUT_DIR>/<SAMPLE_NAME>.raw.bam

# 12. Sort the Raw BAM File
samtools sort -o <OUTPUT_DIR>/<SAMPLE_NAME>.sorted.bam \
  <OUTPUT_DIR>/<SAMPLE_NAME>.raw.bam

# 13. Index the Sorted BAM File
samtools index <OUTPUT_DIR>/<SAMPLE_NAME>.sorted.bam

# 14. Generate Alignment Statistics
samtools flagstat <OUTPUT_DIR>/<SAMPLE_NAME>.sorted.bam

# 15. Mark and Remove Duplicates with Picard
java -jar <OUTPUT_DIR>/picard.jar MarkDuplicates \
  I=<OUTPUT_DIR>/<SAMPLE_NAME>.sorted.bam \
  O=<OUTPUT_DIR>/<SAMPLE_NAME>.dedup.bam \
  M=<OUTPUT_DIR>/<SAMPLE_NAME>.dup_metrics.txt \
  REMOVE_DUPLICATES=true
samtools index <OUTPUT_DIR>/<SAMPLE_NAME>.dedup.bam

# 16. Verify the Deduplicated BAM
samtools view -H <OUTPUT_DIR>/<SAMPLE_NAME>.dedup.bam | head

# 17. Obtain AADR Dataset (Manual Step)
# Download <AADR_VERSION>.{geno,snp,ind} manually and place in <AADR_DIR>/
# Verify
ls -lh <AADR_DIR>/<AADR_VERSION>.*

# 18. Create BED File for Targeted Positions
awk '{print $2 "\t" ($4-1) "\t" $4}' <AADR_DIR>/<AADR_VERSION>.snp > <OUTPUT_DIR>/AADR_positions.bed
head <OUTPUT_DIR>/AADR_positions.bed

# 19. Generate EIGENSTRAT Files
samtools mpileup -B -q30 -Q30 -f <OUTPUT_DIR>/hs37d5.adjusted.fa \
  -l <OUTPUT_DIR>/AADR_positions.bed \
  <OUTPUT_DIR>/<SAMPLE_NAME>.dedup.bam | \
pileupCaller --randomHaploid --sampleNames <SAMPLE_NAME> --eigenstratOut <SAMPLE_NAME> \
  --snpFile <AADR_DIR>/<AADR_VERSION>.snp

# 20. Fix .ind File for Sample
echo "<SAMPLE_NAME>    <SEX>    <POPULATION>" > <WORKDIR>/<SAMPLE_NAME>.ind
cat <WORKDIR>/<SAMPLE_NAME>.ind

# 21. Fix AADR .ind File
awk 'NF==2 {print $0 "\tUnknown"} NF==3 {print $0}' <AADR_DIR>/<AADR_VERSION>.ind > temp.ind && mv temp.ind <AADR_DIR>/<AADR_VERSION>.ind
grep "<SAMPLE_CHECK>" <AADR_DIR>/<AADR_VERSION>.ind

# 22. Create Merge Parameter File
cat << EOF > <WORKDIR>/merge.par
geno1: <AADR_DIR>/<AADR_VERSION>.geno
snp1: <AADR_DIR>/<AADR_VERSION>.snp
ind1: <AADR_DIR>/<AADR_VERSION>.ind
geno2: <WORKDIR>/<SAMPLE_NAME>.geno
snp2: <WORKDIR>/<SAMPLE_NAME>.snp
ind2: <WORKDIR>/<SAMPLE_NAME>.ind
genooutfilename: <WORKDIR>/merged_output.geno
snpoutfilename: <WORKDIR>/merged_output.snp
indoutfilename: <WORKDIR>/merged_output.ind
strandcheck: NO
EOF

# 23. Run the Merge
mergeit -p <WORKDIR>/merge.par

# 24. Verify Merged Files
ls -lh <WORKDIR>/merged_output.*
grep "<SAMPLE_NAME>" <WORKDIR>/merged_output.ind

baeticvs · Aug 26, 2025

what if i have access to .bam files directly? they come in different "formats" and im unsure how to proceed with them, for example

Modern FASTQ to Eigenstrat Format with AADR Merge

Jovialis

Advisor

baeticvs

Regular Member