Alignment
The alignment module performs quality control, trimming, alignment, deduplication, and outputting of reads to a binary alignment map (BAM) file.
Quality Control: A report of the FASTQ file quality is generated using the third-party software
fastqc
.Trimming: Adapters and low-quality base calls are trimmed from the ends of every read using the third-party software
cutadapt
.Trimmed Quality Control: Another report of FASTQ file quality after trimming is generated with
fastqc
.Alignment: The trimmed FASTQ files are aligned to a FASTA file containing one or more reference sequences, yielding a sequence alignment map (SAM) file, using third-party software
bowtie2
.Deduplication: Reads that align equally well to multiple locations in the reference sequences are removed from the SAM file using a DREEM internal function.
Outputting: The deduplicated SAM file is converted into a BAM file, sorted positionally, indexed, and split into one BAM file per reference sequence using the third-party software
samtools
.
Examples
CLI
Basic usage of the align module, showing how to align a pair of FASTQ files containing paired-end reads
(DMS_100mM_R1.fq.gz
and DMS_100mM_R2.fq.gz
) to a set of reference sequences in human_transcriptome.fasta
:
dreem align -1 DMS_100mM_R1.fq.gz -2 DMS_100mM_R2.fq.gz -f human_transcriptome.fasta
Python
Basic usage of the Python API, showing the same example as with the CLI:
>>> from dreem import align
>>> bam_files = align.main.run(fastq1=("DMS_100mM_R1.fq.gz",),
... fastq2=("DMS_100mM_R2.fq.gz",),
... fasta="human_transcriptome.fasta")
API Reference
The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.
CLI cmds
dreem align
dreem align [OPTIONS]
Options
- --fasta <fasta>
FASTA file of all reference sequences in the project
- --fastqs <fastqs>
FASTQ files of single-end reads
- --fastqi <fastqi>
FASTQ files of interleaved paired reads
- --fastq1 <fastq1>
FASTQ files of mate 1 paired-end reads
- --fastq2 <fastq2>
FASTQ files of mate 2 paired-end reads
- --fastqs-dir <fastqs_dir>
Directory containing demultiplexed FASTQ files of single-end reads from one sample
- --fastqi-dir <fastqi_dir>
Directory containing demultiplexed FASTQ files of interleaved paired-end reads from one sample
- --fastq12-dir <fastq12_dir>
Directory containing demultiplexed pairs of FASTQ files of mate 1 and mate 2 reads from one sample
- --phred-enc <phred_enc>
Phred score encoding in FASTQ/SAM/BAM files
- --out-dir <out_dir>
Where to output all finished files
- --temp-dir <temp_dir>
Where to write all temporary files
- --rerun, --no-rerun
Whether to regenerate files that already exist
- --save-temp, --erase-temp
Whether to save or erase temporary files after the program exits
- --parallel, --no-parallel
Whether to run multiple jobs in parallel
- --max-procs <max_procs>
Maximum number of simultaneous processes
- --fastqc, --no-fastqc
Whether to check quality of FASTQ files
- --qc-extract, --qc-no-extract
Whether to unzip FASTQC reports
- --cut, --no-cut
Whether to trim reads with Cutadapt before alignment
- --cut-a1 <cut_a1>
3’ adapter for read 1
- --cut-g1 <cut_g1>
5’ adapter for read 1
- --cut-a2 <cut_a2>
3’ adapter for read 2
- --cut-g2 <cut_g2>
5’ adapter for read 2
- --cut-O <cut_o>
Minimum overlap of read and adapter
- --cut-e <cut_e>
Error tolerance for adapters
- --cut-q1 <cut_q1>
Phred score for read 1 quality trimming
- --cut-q2 <cut_q2>
Phred score for read 2 quality trimming
- --cut-m <cut_m>
Discard reads shorter than this length after trimming
- --cut-indels, --cut-no-indels
Whether to allow indels in adapters
- --cut-discard-trimmed, --cut-keep-trimmed
Whether to discard reads in which an adapter was found
- --cut-discard-untrimmed, --cut-keep-untrimmed
Whether to discard reads in which no adapter was found
- --cut-nextseq, --cut-no-nextseq
Whether to trim high-quality Gs from 3’ end
- --bt2-local, --bt2-end-to-end
Whether to perform local or end-to-end alignment
- --bt2-discordant, --bt2-no-discordant
Whether to output discordant alignments
- --bt2-mixed, --bt2-no-mixed
Whether to align individual mates of unaligned pairs
- --bt2-dovetail, --bt2-no-dovetail
Whether to treat dovetailed mate pairs as concordant
- --bt2-contain, --bt2-no-contain
Whether to treat nested mate pairs as concordant
- --bt2-unal, --bt2-no-unal
Whether to output unaligned reads
- --bt2-I <bt2_i>
Minimum fragment length for valid paired-end alignments
- --bt2-X <bt2_x>
Maximum fragment length for valid paired-end alignments
- --bt2-score-min <bt2_score_min>
Minimum score for a valid alignment
- --bt2-i <bt2_s>
Seed interval
- --bt2-L <bt2_l>
Seed length
- --bt2-gbar <bt2_gbar>
Minimum distance of a gap from end of a read
- --bt2-D <bt2_d>
Maximum number of failed seed extensions
- --bt2-R <bt2_r>
Maximum number of times to re-seed
- --bt2-dpad <bt2_dpad>
Width of padding on alignment matrix, to allow gaps
- --bt2-orient <bt2_orient>
Valid orientations of paired-end mates
- Options
fr | rf | ff
Python args
- dreem.align.run(*, fasta: str, fastqs: tuple[str] = (), fastqi: tuple[str] = (), fastq1: tuple[str] = (), fastq2: tuple[str] = (), fastqs_dir: tuple[str] = (), fastqi_dir: tuple[str] = (), fastq12_dir: tuple[str] = (), phred_enc: int = 33, out_dir: str = './output', temp_dir: str = './temp', save_temp: bool = False, rerun: bool = False, max_procs: int = 2, parallel: bool = True, fastqc: bool = True, qc_extract: bool = False, cut: bool = True, cut_q1: int = 25, cut_q2: int = 25, cut_g1: tuple[str] = (), cut_a1: tuple[str] = ('AGATCGGAAGAGC',), cut_g2: tuple[str] = (), cut_a2: tuple[str] = ('AGATCGGAAGAGC',), cut_o: int = 6, cut_e: float = 0.1, cut_indels: bool = True, cut_nextseq: bool = False, cut_discard_trimmed: bool = False, cut_discard_untrimmed: bool = False, cut_m: int = 20, bt2_local: bool = True, bt2_discordant: bool = False, bt2_mixed: bool = False, bt2_dovetail: bool = False, bt2_contain: bool = True, bt2_unal: bool = False, bt2_score_min: str = 'L,1,0.5', bt2_i: int = 0, bt2_x: int = 600, bt2_gbar: int = 4, bt2_l: int = 12, bt2_s: str = 'L,1,0.1', bt2_d: int = 4, bt2_r: int = 2, bt2_dpad: int = 2, bt2_orient: str = 'fr') tuple[str, ...]
Run the alignment module.
Align the reads to the set of reference sequences and output one BAM file for each sample aligned to each reference in the directory ‘output’. Temporary intermediary files are written in the directory ‘temp’ and then deleted after they are no longer needed.
- Parameters
fasta (
str
) – FASTA file of all reference sequences in the project [keyword-only]fastqs (
tuple
) – FASTQ files of single-end reads [keyword-only, default: ()]fastqi (
tuple
) – FASTQ files of interleaved paired reads [keyword-only, default: ()]fastq1 (
tuple
) – FASTQ files of mate 1 paired-end reads [keyword-only, default: ()]fastq2 (
tuple
) – FASTQ files of mate 2 paired-end reads [keyword-only, default: ()]fastqs_dir (
tuple
) – Directory containing demultiplexed FASTQ files of single-end reads from one sample [keyword-only, default: ()]fastqi_dir (
tuple
) – Directory containing demultiplexed FASTQ files of interleaved paired-end reads from one sample [keyword-only, default: ()]fastq12_dir (
tuple
) – Directory containing demultiplexed pairs of FASTQ files of mate 1 and mate 2 reads from one sample [keyword-only, default: ()]phred_enc (
int
) – Phred score encoding in FASTQ/SAM/BAM files [keyword-only, default: 33]out_dir (
str
) – Where to output all finished files [keyword-only, default: ‘./output’]temp_dir (
str
) – Where to write all temporary files [keyword-only, default: ‘./temp’]save_temp (
bool
) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]rerun (
bool
) – Whether to regenerate files that already exist [keyword-only, default: False]max_procs (
int
) – Maximum number of simultaneous processes [keyword-only, default: 2]parallel (
bool
) – Whether to run multiple jobs in parallel [keyword-only, default: True]fastqc (
bool
) – Whether to check quality of FASTQ files [keyword-only, default: True]qc_extract (
bool
) – Whether to unzip FASTQC reports [keyword-only, default: False]cut (
bool
) – Whether to trim reads with Cutadapt before alignment [keyword-only, default: True]cut_q1 (
int
) – Phred score for read 1 quality trimming [keyword-only, default: 25]cut_q2 (
int
) – Phred score for read 2 quality trimming [keyword-only, default: 25]cut_g1 (
tuple
) – 5’ adapter for read 1 [keyword-only, default: ()]cut_a1 (
tuple
) – 3’ adapter for read 1 [keyword-only, default: (‘AGATCGGAAGAGC’,)]cut_g2 (
tuple
) – 5’ adapter for read 2 [keyword-only, default: ()]cut_a2 (
tuple
) – 3’ adapter for read 2 [keyword-only, default: (‘AGATCGGAAGAGC’,)]cut_o (
int
) – Minimum overlap of read and adapter [keyword-only, default: 6]cut_e (
float
) – Error tolerance for adapters [keyword-only, default: 0.1]cut_indels (
bool
) – Whether to allow indels in adapters [keyword-only, default: True]cut_nextseq (
bool
) – Whether to trim high-quality Gs from 3’ end [keyword-only, default: False]cut_discard_trimmed (
bool
) – Whether to discard reads in which an adapter was found [keyword-only, default: False]cut_discard_untrimmed (
bool
) – Whether to discard reads in which no adapter was found [keyword-only, default: False]cut_m (
int
) – Discard reads shorter than this length after trimming [keyword-only, default: 20]bt2_local (
bool
) – Whether to perform local or end-to-end alignment [keyword-only, default: True]bt2_discordant (
bool
) – Whether to output discordant alignments [keyword-only, default: False]bt2_mixed (
bool
) – Whether to align individual mates of unaligned pairs [keyword-only, default: False]bt2_dovetail (
bool
) – Whether to treat dovetailed mate pairs as concordant [keyword-only, default: False]bt2_contain (
bool
) – Whether to treat nested mate pairs as concordant [keyword-only, default: True]bt2_unal (
bool
) – Whether to output unaligned reads [keyword-only, default: False]bt2_score_min (
str
) – Minimum score for a valid alignment [keyword-only, default: ‘L,1,0.5’]bt2_i (
int
) – Minimum fragment length for valid paired-end alignments [keyword-only, default: 0]bt2_x (
int
) – Maximum fragment length for valid paired-end alignments [keyword-only, default: 600]bt2_gbar (
int
) – Minimum distance of a gap from end of a read [keyword-only, default: 4]bt2_l (
int
) – Seed length [keyword-only, default: 12]bt2_s (
str
) – Seed interval [keyword-only, default: ‘L,1,0.1’]bt2_d (
int
) – Maximum number of failed seed extensions [keyword-only, default: 4]bt2_r (
int
) – Maximum number of times to re-seed [keyword-only, default: 2]bt2_dpad (
int
) – Width of padding on alignment matrix, to allow gaps [keyword-only, default: 2]bt2_orient (
str
) – Valid orientations of paired-end mates [keyword-only, default: ‘fr’]
I/O files
Input files
The alignment module requires two input files:
A list of sequencing reads in FASTQ format
A list of one or more reference sequences in FASTA format
Endedness: single vs. paired
The alignment module accepts both single-end and paired-end reads. Each FASTQ file must contain one type (not a mix of both), although if multiple FASTQ files are given, they may be of different types from each other. The accepted types of FASTQ files are as follows:
Single-end: 1 file, 4 lines per read. Example showing 2 reads:
name.fq
@Read_1_ID
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_2_ID
ATCGTCATGTGT
+
FFFFFFF:FFFF
Paired-end, separate files for mate 1 and mate 2 reads: 2 files, 4 lines per mate in each file; mates must be in the same order in both files. Example showing 2 reads:
name_R1.fq
@Read_1_ID/1
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_2_ID/1
ATCGTCATGTGT
+
FFFFFFF:FFFF
name_R2.fq
@Read_1_ID/2
TACGTCGTCGTC
+
FFFFF:FF:F::
@Read_2_ID/2
CACGAGCGATAG
+
FFFF:FF:::F:
Note that for every mate 1 file given, a mate 2 file with the same sample name must be given, and vice versa.
Paired-end, interleaved mate 1 and mate 2 reads: 1 file, 8 lines per paired read (4 for mate 1, then 4 for mate 2). Example showing 2 reads:
name.fq
@Read_1_ID/1
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_1_ID/2
TACGTCGTCGTC
+
FFFFF:FF:F::
@Read_2_ID/1
ATCGTCATGTGT
+
FFFFFFF:FFFF
@Read_2_ID/2
CACGAGCGATAG
+
FFFF:FF:::F:
File extensions
The following file extensions are accepted for each file format:
FASTA:
.fasta
,.fna
,.fa
FASTQ:
.fastq
,.fq
,.fastq.gz
,.fq.gz
Note that FASTQ files may be compressed with gzip
(and end with .gz
), while FASTA files may not.
As FASTQ files from deep sequencing may occupy many gigabytes when uncompressed, it is recommended to keep them compressed (.gz
) unless they must be demultiplexed.
The file extension will be preserved throughout the pipeline; that is, if an input FASTQ file is compressed and has the extension .fq.gz
, then the temporary trimmed FASTQ file will also be compressed and end in .fq.gz
.
For paired-end reads in which mate 1s and mate 2s are in separate files, the file names must have one of the following suffixes immediately before the file extension:
FASTQ 1:
_R1
,_mate1
,_1_sequence
,_R1_001
,_mate1_001
,_1_sequence_001
FASTQ 2:
_R2
,_mate2
,_2_sequence
,_R2_001
,_mate2_001
,_2_sequence_001
If you would like future versions of DREEM to support additional file extensions, please create a new issue on GitHub.
File nomenclature
The paths of the input files are parsed to determine the names of the sample and reference.
For a FASTQ file containing reads from an entire sample (arguments
fastqs
,fastqi
, orfastq1
andfastq2
), the sample name is the name of the file, minus its extension (e.g..fq
) and suffix indicating mate 1 or 2 (e.g._R1
). For example, if the path is/home/dms/foo_R2.fq.gz
, then the sample isfoo
. The name of the reference sequence comes from the label of the sequence inside the FASTA file (the name of the FASTA file itself is used only in temporary files).For a FASTQ file containing reads from only one reference or construct (i.e. output from the demultiplexing module), the sample name is the directory in which the FASTQ file is located, and the reference is the name of the file, minus the suffix and extension. For example, if the path is
/home/dms/bar/baz_mate1.fq.gz
, then the sample isbar
and the reference isbaz
. (A sequence namedbaz
must then be present in the given FASTA file.)
Sequence alphabets
Reference sequences must contain only the uppercase characters A
, C
, G
, and T
.
Read sequences may contain any uppercase characters, but all characters besides A
, C
, G
, and T
(including degenerate bases defined by the IUPAC) are treated as any nucleotide (i.e. N
).
Quality score encodings
The Phred quality scores in FASTQ files are encoded by adding an integer N to the Phred score (Phred+N). Most modern Illumina instruments output FASTQ files with Phred+33 encoding (which is the default in DREEM), but Phred+64 is also common. The quality score encoding can be set to a non-default value (in this example, Phred+64) as follows:
CLI:
--phred-enc 64
API:
phred_enc=64
Output files
All output files and directories are written in the user-specified directory output_dir
.
Alignment maps
- For each sample named
sample_name
, given as a FASTQ file: - For each reference sequence named
reference_name
in the input FASTA file: A set of all reads that aligned to the reference in binary alignment map (BAM) format:
{output_dir}/alignment/{sample_name}/{reference_name}.bam
- For each reference sequence named
FASTQC reports
- For each input FASTQ file named
file_name.x
coming from samplesample_name
: A FASTQC report of the input file. Zipped by default; can extract automatically using
--qc-extract
(CLI) orqc_extract=True
(API):{output_dir}/alignment/{sample_name}/qc-inp/{file_name}_fastqc.zip
A FASTQC report of the file after trimming with Cutadapt:
{output_dir}/alignment/{sample_name}/qc-cut/{file_name}_fastqc.zip
Bowtie2 Indexes
Temporary files
All temporary files and directories are written in the user-specified directory temp_dir
.
By default, temporary files (but not directories) are deleted as soon as they are no longer needed.
Specifying --save-temp
(CLI) or save_temp=True
(API) prevents temporary files from being deleted (e.g. for troubleshooting).
FASTQ files
- If Cutadapt is enabled (the default, but it can be disabled with
--no-cut
(CLI) orcut=False
(API)): - For each input FASTQ file named
file_name.x
: A FASTQ file trimmed with Cutadapt
- For each input FASTQ file named
Key algorithm(s)
[A description of the key algorithms for this module. Keep it short and use examples.]
Edge cases handling
[What are the edge cases that need to be handled? How are they handled?]
Contributors: [write who contributed to the module here.]