Vectorization

The vectorization module determines whether each base in each sequencing read matches the reference RNA sequence. It outputs a matrix in which each column is a base in the reference sequence, each row is a read, and each element is a byte indicating the relationship between the bases in the read and reference sequences. This matrix can then be used to calculate mutation frequencies or to cluster the reads to find alternative structures.

Examples

CLI

Basic usage showing how to create mutation vectors from the BAM file DMS_100mM/MAPT-3R.bam over section 231 - 417 of the transcript named “MAPT-3R”, with the reference sequence defined in human_transcriptome.fasta:

dreem vector -b DMS_100mM/MAPT-3R.bam -f human_transcriptome.fasta -c MAPT-3R 231 417

Python

Basic usage of the Python API to perform the same operation as shown in the CLI example.

>>> from dreem import vector
>>> report_files = vector.main.run(bamf=("DMS_100mM/MAPT-3R.bam",),
...                                fasta="human_transcriptome.fasta",
...                                coords=(("MAPT-3R", 231, 417),))

API Reference

The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.

Command line interface

dreem vector

Hook the command line interface to the `run` function.

dreem vector [OPTIONS]

Options

--fasta <fasta>

FASTA file of all reference sequences in the project

--bamf <bamf>

BAM file

--bamd <bamd>

Directory of BAM files

--library <library>

Library CSV file

--autosect, --no-autosect

Whether, for every reference that was not explicitly given at least one section (by –initial_coords or –primers), to generate coordinates covering the entire reference sequence automatically

-c, --coords <coords>

Reference name, 5’ end, and 3’ end of a section; coordinates are 1-indexed and include both ends

-p, --primers <primers>

Reference name, forward primer, and reverse primer of a section; reverse primer must be given 5’ to 3’

--primer-gap <primer_gap>

Number of bases to leave as a gap between the end of a primer and the end of the section

--out-dir <out_dir>

Where to output all finished files

--temp-dir <temp_dir>

Where to write all temporary files

--phred-enc <phred_enc>

Phred score encoding in FASTQ/SAM/BAM files

--min-phred <min_phred>

Minimum Phred score to use a base call

--ambid, --no-ambid

Whether to find and label all ambiguous insertions and deletions (improves accuracy but runs slower)

--strict-pairs, --no-strict-pairs

Whether to require that every paired read that maps to a section also have a mate that maps to the section

-z, --batch-size <batch_size>

Maximum size of each batch of mutation vectors, in millions of base calls

--max-procs <max_procs>

Maximum number of simultaneous processes

--parallel, --no-parallel

Whether to run multiple jobs in parallel

--rerun, --no-rerun

Whether to regenerate files that already exist

--save-temp, --erase-temp

Whether to save or erase temporary files after the program exits

Python interface

dreem.vector.run(fasta: str, *, bamf: tuple[str] = (), bamd: tuple[str] = (), library: str = '', autosect: bool = False, coords: tuple[tuple[str, int, int], ...] = (), primers: tuple[tuple[str, str, str], ...] = (), primer_gap: int = 2, out_dir: str = './output', temp_dir: str = './temp', phred_enc: int = 33, min_phred: int = 25, ambid: bool = True, strict_pairs: bool = True, batch_size: float = 32.0, max_procs: int = 2, parallel: bool = True, rerun: bool = False, save_temp: bool = False)
Run the vectoring step. Generate a vector encoding mutations for

each read (or read pair, if paired-end).

Parameters
  • fasta (str) – FASTA file of all reference sequences in the project [positional or keyword]

  • bamf (tuple) – BAM file [keyword-only, default: ()]

  • bamd (tuple) – Directory of BAM files [keyword-only, default: ()]

  • library (str) – Library CSV file [keyword-only, default: ‘’]

  • autosect (bool) – Whether, for every reference that was not explicitly given at least one section (by –initial_coords or –primers), to generate coordinates covering the entire reference sequence automatically [keyword-only, default: False]

  • coords (tuple) – Reference name, 5’ end, and 3’ end of a section; coordinates are 1-indexed and include both ends [keyword-only, default: ()]

  • primers (tuple) – Reference name, forward primer, and reverse primer of a section; reverse primer must be given 5’ to 3’ [keyword-only, default: ()]

  • primer_gap (int) – Number of bases to leave as a gap between the end of a primer and the end of the section [keyword-only, default: 2]

  • out_dir (str) – Where to output all finished files [keyword-only, default: ‘./output’]

  • temp_dir (str) – Where to write all temporary files [keyword-only, default: ‘./temp’]

  • phred_enc (int) – Phred score encoding in FASTQ/SAM/BAM files [keyword-only, default: 33]

  • min_phred (int) – Minimum Phred score to use a base call [keyword-only, default: 25]

  • ambid (bool) – Whether to find and label all ambiguous insertions and deletions (improves accuracy but runs slower) [keyword-only, default: True]

  • strict_pairs (bool) – Whether to require that every paired read that maps to a section also have a mate that maps to the section [keyword-only, default: True]

  • batch_size (float) – Maximum size of each batch of mutation vectors, in millions of base calls [keyword-only, default: 32.0]

  • max_procs (int) – Maximum number of simultaneous processes [keyword-only, default: 2]

  • parallel (bool) – Whether to run multiple jobs in parallel [keyword-only, default: True]

  • rerun (bool) – Whether to regenerate files that already exist [keyword-only, default: False]

  • save_temp (bool) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]

I/O files

This module needs the files to be sorted into a specific folder structure. The folder structure is described below.

Input file structure

The module takes one or more files of aligned reads in binary alignment map (BAM) format. Each file must be named after the reference sequence to which it was aligned, end with the file extension .bam, and reside in a directory named after the sample from which the reads came:

|- sample_1/
    |- reference_1.bam
    |- reference_2.bam
|- sample_2/
    |- reference_1.bam
    |- reference_3.bam

Output file structure

The module outputs one report file (JSON format) and one or more mutation vector batch files (Apache ORC format) for each section of each reference for each sample. Each sample corresponds to one directory that contains one sub-directory for each reference to which the sample was aligned. Each reference directory contains one sub-directory for each section of the reference for which mutation vectors were computed. Each section directory contains one or more mutation vector batch files (numbered 0.orc, 1.orc, …) and a report file (report.json) that contains metadata about the vectors:

|- sample_1/
    |- reference_1/
        |- section_1.1/
            |- 0.orc
            |- 1.orc
            |- report.json
        |- section_1.2
            |- 0.orc
            |- report.json
    |- reference_2
        |- section_2.1
            |- 0.orc
            |- 1.orc
            |- 2.orc
            |- report.json
|- sample_2/
    |- reference_1/
        |- section_1.1/
            |- 0.orc
            |- report.json
        |- section_1.2
            |- 0.orc
            |- 1.orc
            |- report.json
    |- reference_3/
        |- section_3.1/
            |- 0.orc
            |- report.json

Key algorithm(s)

Create SAM files to vectorize

Purpose: In order to turn BAM files into mutation vectors, the BAM files must be converted into SAM files containing only reads that align to the references and sections of interest.

  1. For each pair of 5’/3’ coordinates (--coords) or forward/reverse primers (--primers) given for every reference, a section (i.e. subsequence between two coordinates) of its reference is defined.

  2. If the option --autosect (CLI) or autosect=True (API) is given, then for every reference for which no coordinates/primers were given, a section equal to the full reference sequence is defined automatically.

  3. For each given BAM file (i.e. alignment of a sample to a reference sequence), and for each section defined for its reference sequence, a temporary SAM file of only the reads aligning to the section is generated using samtools.

In the following steps, each temporary SAM file is then processed independently, either in parallel (with Python’s multiprocessing module) or in series.

Define batches of reads

Purpose: memory and parallelization

Parse CIGAR strings into vectors

Find ambiguous mutations

Write output files

Edge cases handling

[What are the edge cases that need to be handled? How are they handled?]

Contributors: Matthew Allan