Alignment

The alignment module performs quality control, trimming, alignment, deduplication, and outputting of reads to a binary alignment map (BAM) file.

  1. Quality Control: A report of the FASTQ file quality is generated using the third-party software fastqc.

  2. Trimming: Adapters and low-quality base calls are trimmed from the ends of every read using the third-party software cutadapt.

  3. Trimmed Quality Control: Another report of FASTQ file quality after trimming is generated with fastqc.

  4. Alignment: The trimmed FASTQ files are aligned to a FASTA file containing one or more reference sequences, yielding a sequence alignment map (SAM) file, using third-party software bowtie2.

  5. Deduplication: Reads that align equally well to multiple locations in the reference sequences are removed from the SAM file using a DREEM internal function.

  6. Outputting: The deduplicated SAM file is converted into a BAM file, sorted positionally, indexed, and split into one BAM file per reference sequence using the third-party software samtools.

Examples

CLI

Basic usage of the align module, showing how to align a pair of FASTQ files containing paired-end reads (DMS_100mM_R1.fq.gz and DMS_100mM_R2.fq.gz) to a set of reference sequences in human_transcriptome.fasta:

dreem align -1 DMS_100mM_R1.fq.gz -2 DMS_100mM_R2.fq.gz -f human_transcriptome.fasta

Python

Basic usage of the Python API, showing the same example as with the CLI:

>>> from dreem import align
>>> bam_files = align.main.run(fastq1=("DMS_100mM_R1.fq.gz",),
...                            fastq2=("DMS_100mM_R2.fq.gz",),
...                            fasta="human_transcriptome.fasta")

API Reference

The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.

CLI cmds

dreem align

dreem align [OPTIONS]

Options

--fasta <fasta>

FASTA file of all reference sequences in the project

--fastqs <fastqs>

FASTQ files of single-end reads

--fastqi <fastqi>

FASTQ files of interleaved paired reads

--fastq1 <fastq1>

FASTQ files of mate 1 paired-end reads

--fastq2 <fastq2>

FASTQ files of mate 2 paired-end reads

--fastqs-dir <fastqs_dir>

Directory containing demultiplexed FASTQ files of single-end reads from one sample

--fastqi-dir <fastqi_dir>

Directory containing demultiplexed FASTQ files of interleaved paired-end reads from one sample

--fastq12-dir <fastq12_dir>

Directory containing demultiplexed pairs of FASTQ files of mate 1 and mate 2 reads from one sample

--phred-enc <phred_enc>

Phred score encoding in FASTQ/SAM/BAM files

--out-dir <out_dir>

Where to output all finished files

--temp-dir <temp_dir>

Where to write all temporary files

--rerun, --no-rerun

Whether to regenerate files that already exist

--save-temp, --erase-temp

Whether to save or erase temporary files after the program exits

--parallel, --no-parallel

Whether to run multiple jobs in parallel

--max-procs <max_procs>

Maximum number of simultaneous processes

--fastqc, --no-fastqc

Whether to check quality of FASTQ files

--qc-extract, --qc-no-extract

Whether to unzip FASTQC reports

--cut, --no-cut

Whether to trim reads with Cutadapt before alignment

--cut-a1 <cut_a1>

3’ adapter for read 1

--cut-g1 <cut_g1>

5’ adapter for read 1

--cut-a2 <cut_a2>

3’ adapter for read 2

--cut-g2 <cut_g2>

5’ adapter for read 2

--cut-O <cut_o>

Minimum overlap of read and adapter

--cut-e <cut_e>

Error tolerance for adapters

--cut-q1 <cut_q1>

Phred score for read 1 quality trimming

--cut-q2 <cut_q2>

Phred score for read 2 quality trimming

--cut-m <cut_m>

Discard reads shorter than this length after trimming

--cut-indels, --cut-no-indels

Whether to allow indels in adapters

--cut-discard-trimmed, --cut-keep-trimmed

Whether to discard reads in which an adapter was found

--cut-discard-untrimmed, --cut-keep-untrimmed

Whether to discard reads in which no adapter was found

--cut-nextseq, --cut-no-nextseq

Whether to trim high-quality Gs from 3’ end

--bt2-local, --bt2-end-to-end

Whether to perform local or end-to-end alignment

--bt2-discordant, --bt2-no-discordant

Whether to output discordant alignments

--bt2-mixed, --bt2-no-mixed

Whether to align individual mates of unaligned pairs

--bt2-dovetail, --bt2-no-dovetail

Whether to treat dovetailed mate pairs as concordant

--bt2-contain, --bt2-no-contain

Whether to treat nested mate pairs as concordant

--bt2-unal, --bt2-no-unal

Whether to output unaligned reads

--bt2-I <bt2_i>

Minimum fragment length for valid paired-end alignments

--bt2-X <bt2_x>

Maximum fragment length for valid paired-end alignments

--bt2-score-min <bt2_score_min>

Minimum score for a valid alignment

--bt2-i <bt2_s>

Seed interval

--bt2-L <bt2_l>

Seed length

--bt2-gbar <bt2_gbar>

Minimum distance of a gap from end of a read

--bt2-D <bt2_d>

Maximum number of failed seed extensions

--bt2-R <bt2_r>

Maximum number of times to re-seed

--bt2-dpad <bt2_dpad>

Width of padding on alignment matrix, to allow gaps

--bt2-orient <bt2_orient>

Valid orientations of paired-end mates

Options

fr | rf | ff

Python args

dreem.align.run(*, fasta: str, fastqs: tuple[str] = (), fastqi: tuple[str] = (), fastq1: tuple[str] = (), fastq2: tuple[str] = (), fastqs_dir: tuple[str] = (), fastqi_dir: tuple[str] = (), fastq12_dir: tuple[str] = (), phred_enc: int = 33, out_dir: str = './output', temp_dir: str = './temp', save_temp: bool = False, rerun: bool = False, max_procs: int = 2, parallel: bool = True, fastqc: bool = True, qc_extract: bool = False, cut: bool = True, cut_q1: int = 25, cut_q2: int = 25, cut_g1: tuple[str] = (), cut_a1: tuple[str] = ('AGATCGGAAGAGC',), cut_g2: tuple[str] = (), cut_a2: tuple[str] = ('AGATCGGAAGAGC',), cut_o: int = 6, cut_e: float = 0.1, cut_indels: bool = True, cut_nextseq: bool = False, cut_discard_trimmed: bool = False, cut_discard_untrimmed: bool = False, cut_m: int = 20, bt2_local: bool = True, bt2_discordant: bool = False, bt2_mixed: bool = False, bt2_dovetail: bool = False, bt2_contain: bool = True, bt2_unal: bool = False, bt2_score_min: str = 'L,1,0.5', bt2_i: int = 0, bt2_x: int = 600, bt2_gbar: int = 4, bt2_l: int = 12, bt2_s: str = 'L,1,0.1', bt2_d: int = 4, bt2_r: int = 2, bt2_dpad: int = 2, bt2_orient: str = 'fr') tuple[str, ...]

Run the alignment module.

Align the reads to the set of reference sequences and output one BAM file for each sample aligned to each reference in the directory ‘output’. Temporary intermediary files are written in the directory ‘temp’ and then deleted after they are no longer needed.

Parameters
  • fasta (str) – FASTA file of all reference sequences in the project [keyword-only]

  • fastqs (tuple) – FASTQ files of single-end reads [keyword-only, default: ()]

  • fastqi (tuple) – FASTQ files of interleaved paired reads [keyword-only, default: ()]

  • fastq1 (tuple) – FASTQ files of mate 1 paired-end reads [keyword-only, default: ()]

  • fastq2 (tuple) – FASTQ files of mate 2 paired-end reads [keyword-only, default: ()]

  • fastqs_dir (tuple) – Directory containing demultiplexed FASTQ files of single-end reads from one sample [keyword-only, default: ()]

  • fastqi_dir (tuple) – Directory containing demultiplexed FASTQ files of interleaved paired-end reads from one sample [keyword-only, default: ()]

  • fastq12_dir (tuple) – Directory containing demultiplexed pairs of FASTQ files of mate 1 and mate 2 reads from one sample [keyword-only, default: ()]

  • phred_enc (int) – Phred score encoding in FASTQ/SAM/BAM files [keyword-only, default: 33]

  • out_dir (str) – Where to output all finished files [keyword-only, default: ‘./output’]

  • temp_dir (str) – Where to write all temporary files [keyword-only, default: ‘./temp’]

  • save_temp (bool) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]

  • rerun (bool) – Whether to regenerate files that already exist [keyword-only, default: False]

  • max_procs (int) – Maximum number of simultaneous processes [keyword-only, default: 2]

  • parallel (bool) – Whether to run multiple jobs in parallel [keyword-only, default: True]

  • fastqc (bool) – Whether to check quality of FASTQ files [keyword-only, default: True]

  • qc_extract (bool) – Whether to unzip FASTQC reports [keyword-only, default: False]

  • cut (bool) – Whether to trim reads with Cutadapt before alignment [keyword-only, default: True]

  • cut_q1 (int) – Phred score for read 1 quality trimming [keyword-only, default: 25]

  • cut_q2 (int) – Phred score for read 2 quality trimming [keyword-only, default: 25]

  • cut_g1 (tuple) – 5’ adapter for read 1 [keyword-only, default: ()]

  • cut_a1 (tuple) – 3’ adapter for read 1 [keyword-only, default: (‘AGATCGGAAGAGC’,)]

  • cut_g2 (tuple) – 5’ adapter for read 2 [keyword-only, default: ()]

  • cut_a2 (tuple) – 3’ adapter for read 2 [keyword-only, default: (‘AGATCGGAAGAGC’,)]

  • cut_o (int) – Minimum overlap of read and adapter [keyword-only, default: 6]

  • cut_e (float) – Error tolerance for adapters [keyword-only, default: 0.1]

  • cut_indels (bool) – Whether to allow indels in adapters [keyword-only, default: True]

  • cut_nextseq (bool) – Whether to trim high-quality Gs from 3’ end [keyword-only, default: False]

  • cut_discard_trimmed (bool) – Whether to discard reads in which an adapter was found [keyword-only, default: False]

  • cut_discard_untrimmed (bool) – Whether to discard reads in which no adapter was found [keyword-only, default: False]

  • cut_m (int) – Discard reads shorter than this length after trimming [keyword-only, default: 20]

  • bt2_local (bool) – Whether to perform local or end-to-end alignment [keyword-only, default: True]

  • bt2_discordant (bool) – Whether to output discordant alignments [keyword-only, default: False]

  • bt2_mixed (bool) – Whether to align individual mates of unaligned pairs [keyword-only, default: False]

  • bt2_dovetail (bool) – Whether to treat dovetailed mate pairs as concordant [keyword-only, default: False]

  • bt2_contain (bool) – Whether to treat nested mate pairs as concordant [keyword-only, default: True]

  • bt2_unal (bool) – Whether to output unaligned reads [keyword-only, default: False]

  • bt2_score_min (str) – Minimum score for a valid alignment [keyword-only, default: ‘L,1,0.5’]

  • bt2_i (int) – Minimum fragment length for valid paired-end alignments [keyword-only, default: 0]

  • bt2_x (int) – Maximum fragment length for valid paired-end alignments [keyword-only, default: 600]

  • bt2_gbar (int) – Minimum distance of a gap from end of a read [keyword-only, default: 4]

  • bt2_l (int) – Seed length [keyword-only, default: 12]

  • bt2_s (str) – Seed interval [keyword-only, default: ‘L,1,0.1’]

  • bt2_d (int) – Maximum number of failed seed extensions [keyword-only, default: 4]

  • bt2_r (int) – Maximum number of times to re-seed [keyword-only, default: 2]

  • bt2_dpad (int) – Width of padding on alignment matrix, to allow gaps [keyword-only, default: 2]

  • bt2_orient (str) – Valid orientations of paired-end mates [keyword-only, default: ‘fr’]

I/O files

Input files

The alignment module requires two input files:

Endedness: single vs. paired

The alignment module accepts both single-end and paired-end reads. Each FASTQ file must contain one type (not a mix of both), although if multiple FASTQ files are given, they may be of different types from each other. The accepted types of FASTQ files are as follows:

  • Single-end: 1 file, 4 lines per read. Example showing 2 reads:

name.fq

@Read_1_ID
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_2_ID
ATCGTCATGTGT
+
FFFFFFF:FFFF
  • Paired-end, separate files for mate 1 and mate 2 reads: 2 files, 4 lines per mate in each file; mates must be in the same order in both files. Example showing 2 reads:

name_R1.fq

@Read_1_ID/1
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_2_ID/1
ATCGTCATGTGT
+
FFFFFFF:FFFF

name_R2.fq

@Read_1_ID/2
TACGTCGTCGTC
+
FFFFF:FF:F::
@Read_2_ID/2
CACGAGCGATAG
+
FFFF:FF:::F:

Note that for every mate 1 file given, a mate 2 file with the same sample name must be given, and vice versa.

  • Paired-end, interleaved mate 1 and mate 2 reads: 1 file, 8 lines per paired read (4 for mate 1, then 4 for mate 2). Example showing 2 reads:

name.fq

@Read_1_ID/1
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_1_ID/2
TACGTCGTCGTC
+
FFFFF:FF:F::
@Read_2_ID/1
ATCGTCATGTGT
+
FFFFFFF:FFFF
@Read_2_ID/2
CACGAGCGATAG
+
FFFF:FF:::F:

File extensions

The following file extensions are accepted for each file format:

  • FASTA: .fasta, .fna, .fa

  • FASTQ: .fastq, .fq, .fastq.gz, .fq.gz

Note that FASTQ files may be compressed with gzip (and end with .gz), while FASTA files may not. As FASTQ files from deep sequencing may occupy many gigabytes when uncompressed, it is recommended to keep them compressed (.gz) unless they must be demultiplexed. The file extension will be preserved throughout the pipeline; that is, if an input FASTQ file is compressed and has the extension .fq.gz, then the temporary trimmed FASTQ file will also be compressed and end in .fq.gz.

For paired-end reads in which mate 1s and mate 2s are in separate files, the file names must have one of the following suffixes immediately before the file extension:

  • FASTQ 1: _R1, _mate1, _1_sequence, _R1_001, _mate1_001, _1_sequence_001

  • FASTQ 2: _R2, _mate2, _2_sequence, _R2_001, _mate2_001, _2_sequence_001

If you would like future versions of DREEM to support additional file extensions, please create a new issue on GitHub.

File nomenclature

The paths of the input files are parsed to determine the names of the sample and reference.

  • For a FASTQ file containing reads from an entire sample (arguments fastqs, fastqi, or fastq1 and fastq2), the sample name is the name of the file, minus its extension (e.g. .fq) and suffix indicating mate 1 or 2 (e.g. _R1). For example, if the path is /home/dms/foo_R2.fq.gz, then the sample is foo. The name of the reference sequence comes from the label of the sequence inside the FASTA file (the name of the FASTA file itself is used only in temporary files).

  • For a FASTQ file containing reads from only one reference or construct (i.e. output from the demultiplexing module), the sample name is the directory in which the FASTQ file is located, and the reference is the name of the file, minus the suffix and extension. For example, if the path is /home/dms/bar/baz_mate1.fq.gz, then the sample is bar and the reference is baz. (A sequence named baz must then be present in the given FASTA file.)

Sequence alphabets

Reference sequences must contain only the uppercase characters A, C, G, and T. Read sequences may contain any uppercase characters, but all characters besides A, C, G, and T (including degenerate bases defined by the IUPAC) are treated as any nucleotide (i.e. N).

Quality score encodings

The Phred quality scores in FASTQ files are encoded by adding an integer N to the Phred score (Phred+N). Most modern Illumina instruments output FASTQ files with Phred+33 encoding (which is the default in DREEM), but Phred+64 is also common. The quality score encoding can be set to a non-default value (in this example, Phred+64) as follows:

  • CLI: --phred-enc 64

  • API: phred_enc=64

Output files

All output files and directories are written in the user-specified directory output_dir.

Alignment maps

For each sample named sample_name, given as a FASTQ file:
For each reference sequence named reference_name in the input FASTA file:

FASTQC reports

For each input FASTQ file named file_name.x coming from sample sample_name:
  • A FASTQC report of the input file. Zipped by default; can extract automatically using --qc-extract (CLI) or qc_extract=True (API): {output_dir}/alignment/{sample_name}/qc-inp/{file_name}_fastqc.zip

  • A FASTQC report of the file after trimming with Cutadapt: {output_dir}/alignment/{sample_name}/qc-cut/{file_name}_fastqc.zip

Bowtie2 Indexes

Temporary files

All temporary files and directories are written in the user-specified directory temp_dir. By default, temporary files (but not directories) are deleted as soon as they are no longer needed. Specifying --save-temp (CLI) or save_temp=True (API) prevents temporary files from being deleted (e.g. for troubleshooting).

FASTQ files

If Cutadapt is enabled (the default, but it can be disabled with --no-cut (CLI) or cut=False (API)):
For each input FASTQ file named file_name.x:
  • A FASTQ file trimmed with Cutadapt

Key algorithm(s)

[A description of the key algorithms for this module. Keep it short and use examples.]

Edge cases handling

[What are the edge cases that need to be handled? How are they handled?]

Contributors: [write who contributed to the module here.]