Vectorization
The vectorization module determines whether each base in each sequencing read matches the reference RNA sequence. It outputs a matrix in which each column is a base in the reference sequence, each row is a read, and each element is a byte indicating the relationship between the bases in the read and reference sequences. This matrix can then be used to calculate mutation frequencies or to cluster the reads to find alternative structures.
Examples
CLI
Basic usage showing how to create mutation vectors from the BAM file DMS_100mM/MAPT-3R.bam
over section 231 - 417 of the transcript named “MAPT-3R”,
with the reference sequence defined in human_transcriptome.fasta
:
dreem vector -b DMS_100mM/MAPT-3R.bam -f human_transcriptome.fasta -c MAPT-3R 231 417
Python
Basic usage of the Python API to perform the same operation as shown in the CLI example.
>>> from dreem import vector
>>> report_files = vector.main.run(bamf=("DMS_100mM/MAPT-3R.bam",),
... fasta="human_transcriptome.fasta",
... coords=(("MAPT-3R", 231, 417),))
API Reference
The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.
Command line interface
dreem vector
Hook the command line interface to the `run`
function.
dreem vector [OPTIONS]
Options
- --fasta <fasta>
FASTA file of all reference sequences in the project
- --bamf <bamf>
BAM file
- --bamd <bamd>
Directory of BAM files
- --library <library>
Library CSV file
- --autosect, --no-autosect
Whether, for every reference that was not explicitly given at least one section (by –initial_coords or –primers), to generate coordinates covering the entire reference sequence automatically
- -c, --coords <coords>
Reference name, 5’ end, and 3’ end of a section; coordinates are 1-indexed and include both ends
- -p, --primers <primers>
Reference name, forward primer, and reverse primer of a section; reverse primer must be given 5’ to 3’
- --primer-gap <primer_gap>
Number of bases to leave as a gap between the end of a primer and the end of the section
- --out-dir <out_dir>
Where to output all finished files
- --temp-dir <temp_dir>
Where to write all temporary files
- --phred-enc <phred_enc>
Phred score encoding in FASTQ/SAM/BAM files
- --min-phred <min_phred>
Minimum Phred score to use a base call
- --ambid, --no-ambid
Whether to find and label all ambiguous insertions and deletions (improves accuracy but runs slower)
- --strict-pairs, --no-strict-pairs
Whether to require that every paired read that maps to a section also have a mate that maps to the section
- -z, --batch-size <batch_size>
Maximum size of each batch of mutation vectors, in millions of base calls
- --max-procs <max_procs>
Maximum number of simultaneous processes
- --parallel, --no-parallel
Whether to run multiple jobs in parallel
- --rerun, --no-rerun
Whether to regenerate files that already exist
- --save-temp, --erase-temp
Whether to save or erase temporary files after the program exits
Python interface
- dreem.vector.run(fasta: str, *, bamf: tuple[str] = (), bamd: tuple[str] = (), library: str = '', autosect: bool = False, coords: tuple[tuple[str, int, int], ...] = (), primers: tuple[tuple[str, str, str], ...] = (), primer_gap: int = 2, out_dir: str = './output', temp_dir: str = './temp', phred_enc: int = 33, min_phred: int = 25, ambid: bool = True, strict_pairs: bool = True, batch_size: float = 32.0, max_procs: int = 2, parallel: bool = True, rerun: bool = False, save_temp: bool = False)
- Run the vectoring step. Generate a vector encoding mutations for
each read (or read pair, if paired-end).
- Parameters
fasta (
str
) – FASTA file of all reference sequences in the project [positional or keyword]bamf (
tuple
) – BAM file [keyword-only, default: ()]bamd (
tuple
) – Directory of BAM files [keyword-only, default: ()]library (
str
) – Library CSV file [keyword-only, default: ‘’]autosect (
bool
) – Whether, for every reference that was not explicitly given at least one section (by –initial_coords or –primers), to generate coordinates covering the entire reference sequence automatically [keyword-only, default: False]coords (
tuple
) – Reference name, 5’ end, and 3’ end of a section; coordinates are 1-indexed and include both ends [keyword-only, default: ()]primers (
tuple
) – Reference name, forward primer, and reverse primer of a section; reverse primer must be given 5’ to 3’ [keyword-only, default: ()]primer_gap (
int
) – Number of bases to leave as a gap between the end of a primer and the end of the section [keyword-only, default: 2]out_dir (
str
) – Where to output all finished files [keyword-only, default: ‘./output’]temp_dir (
str
) – Where to write all temporary files [keyword-only, default: ‘./temp’]phred_enc (
int
) – Phred score encoding in FASTQ/SAM/BAM files [keyword-only, default: 33]min_phred (
int
) – Minimum Phred score to use a base call [keyword-only, default: 25]ambid (
bool
) – Whether to find and label all ambiguous insertions and deletions (improves accuracy but runs slower) [keyword-only, default: True]strict_pairs (
bool
) – Whether to require that every paired read that maps to a section also have a mate that maps to the section [keyword-only, default: True]batch_size (
float
) – Maximum size of each batch of mutation vectors, in millions of base calls [keyword-only, default: 32.0]max_procs (
int
) – Maximum number of simultaneous processes [keyword-only, default: 2]parallel (
bool
) – Whether to run multiple jobs in parallel [keyword-only, default: True]rerun (
bool
) – Whether to regenerate files that already exist [keyword-only, default: False]save_temp (
bool
) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]
I/O files
This module needs the files to be sorted into a specific folder structure. The folder structure is described below.
Input file structure
The module takes one or more files of aligned reads in binary alignment map (BAM) format. Each file must be named after the reference sequence to which it was aligned, end with the file extension .bam
, and reside in a directory named after the sample from which the reads came:
|- sample_1/
|- reference_1.bam
|- reference_2.bam
|- sample_2/
|- reference_1.bam
|- reference_3.bam
Output file structure
The module outputs one report file (JSON format) and one or more mutation vector batch files (Apache ORC format) for each section of each reference for each sample. Each sample corresponds to one directory that contains one sub-directory for each reference to which the sample was aligned. Each reference directory contains one sub-directory for each section of the reference for which mutation vectors were computed. Each section directory contains one or more mutation vector batch files (numbered 0.orc
, 1.orc
, …) and a report file (report.json
) that contains metadata about the vectors:
|- sample_1/
|- reference_1/
|- section_1.1/
|- 0.orc
|- 1.orc
|- report.json
|- section_1.2
|- 0.orc
|- report.json
|- reference_2
|- section_2.1
|- 0.orc
|- 1.orc
|- 2.orc
|- report.json
|- sample_2/
|- reference_1/
|- section_1.1/
|- 0.orc
|- report.json
|- section_1.2
|- 0.orc
|- 1.orc
|- report.json
|- reference_3/
|- section_3.1/
|- 0.orc
|- report.json
Key algorithm(s)
Create SAM files to vectorize
Purpose: In order to turn BAM files into mutation vectors, the BAM files must be converted into SAM files containing only reads that align to the references and sections of interest.
For each pair of 5’/3’ coordinates (
--coords
) or forward/reverse primers (--primers
) given for every reference, a section (i.e. subsequence between two coordinates) of its reference is defined.If the option
--autosect
(CLI) orautosect=True
(API) is given, then for every reference for which no coordinates/primers were given, a section equal to the full reference sequence is defined automatically.For each given BAM file (i.e. alignment of a sample to a reference sequence), and for each section defined for its reference sequence, a temporary SAM file of only the reads aligning to the section is generated using
samtools
.
In the following steps, each temporary SAM file is then processed independently, either in parallel (with Python’s multiprocessing
module) or in series.
Define batches of reads
Purpose: memory and parallelization
Parse CIGAR strings into vectors
Find ambiguous mutations
Write output files
Edge cases handling
[What are the edge cases that need to be handled? How are they handled?]
Contributors: Matthew Allan