Aggregation

Aggregation turns the bitvectors into a DREEM output file.

Key features:

Computing the mutation population average for all bitvectors.
Computing the mutation population average for clusters of bitvectors.
Predicting the free energy and the structure from the sequence using RNAstructure.

Get started

Let’s aggregate sample_1.

You need:

the path to the bitvectors of sample_1 stored as shown in the I/O files section.
the reference genome under the fasta format reference.fasta.

You get:

sample_1.json under the DREEM output format.

CLI

Run the command:

dreem aggregate --bv-files path/to/sample_1 —-fasta path/to/reference.fasta

Python

Run the following code:

from dreem import aggregate
aggregate.run(
    bv_files =('path/to/sample_1',),
    fasta='path/to/reference.fasta'
    )

See the API Reference for more details.

I/O files

This module needs the files to be sorted into a specific folder structure. The folder structure is described below.

Input files structure

As input, aggregate requires one or more files of Mutation Vector batches for each section of a reference. The folder structure is as follows:

sample_1/              # <=> a fastq file <<< give this path to aggregate the entire sample
    |- reference_1/    # <=> a single reference in the fasta file
        |- section_1/  # <=> a section (a sub-sequence of a reference)
            |- 0.orc   # <=> a batch of bitvectors
            |- 1.orc
            |- report.json  # <=> a report file <<< give this path to aggregate a single section
        |- section_2  # <=> another section for this reference
            |- 0.orc
            |- report.json
    |- reference_2    # <=> another reference
        |- section_3
            |- 0.orc
            |- 1.orc
            |- 2.orc
            |- report.json

Output file structure

sample_1.json

API Reference

The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.

CLI

dreem aggregate

dreem aggregate [OPTIONS]

Options

--fasta <fasta>: FASTA file of all reference sequences in the project

-bv, --bv-files <bv_files>: Tuple of paths. Give the path to the sample folder to process every section. Give the path to a report to process a single section.

--library <library>: Library CSV file

--samples <samples>: Samples file

-cf, --clustering-file <clustering_file>: Path to the json clustering file from dreem clustering.

-rs, --rnastructure-path <rnastructure_path>: Path to the RNAstructure executable folder (e.g. /home/user/RNAstructure/exe/). Use this option if RNAstructure is not in your PATH.

-rst, --rnastructure-use-temp <rnastructure_use_temp>: Use the temperature signal to make predictions with RNAstructure

-rsa, --rnastructure-fold-args <rnastructure_fold_args>: Additional arguments to pass to RNAstructure’s Fold command

-rsd, --rnastructure-use-dms <rnastructure_use_dms>: Use the DMS signal to make predictions with RNAstructure

-rsdmin, --rnastructure-dms-min-unpaired-value <rnastructure_dms_min_unpaired_value>: Minimum unpaired value for using the dms signal as an input for RNAstructure

-rsdmax, --rnastructure-dms-max-paired-value <rnastructure_dms_max_paired_value>: Maximum paired value for using the dms signal as an input for RNAstructure

-rspa, --rnastructure-deltag-ensemble <rnastructure_deltag_ensemble>: Use RNAstructure partition function to predict free energy

-rspr, --rnastructure_probability <rnastructure_probability>: Use RNAstructure partition function to predict per-base pairing probability

-v, --verbose: Print info or info+debug messages to stdout

--out-dir <out_dir>: Where to output all finished files

--temp-dir <temp_dir>: Where to write all temporary files

--save-temp, --erase-temp: Whether to save or erase temporary files after the program exits

Python

dreem.aggregate.run(fasta: str, bv_files: list = (), *, out_dir: str = './output', temp_dir: str = './temp', save_temp: bool = False, library: str = '', samples: str = '', clustering_file: str = '', rnastructure_path: str = '', rnastructure_use_temp: bool = False, rnastructure_fold_args: str = '', rnastructure_use_dms: str = False, rnastructure_dms_min_unpaired_value: float = 0.07, rnastructure_dms_max_paired_value: float = 0.01, rnastructure_deltag_ensemble: bool = False, rnastructure_probability: bool = False, verbose: bool = 0)

Run the aggregate module.

Parameters

fasta (str) – FASTA file of all reference sequences in the project [positional or keyword]
bv_files (list) – Tuple of paths. Give the path to the sample folder to process every section. Give the path to a report to process a single section. [positional or keyword, default: ()]
out_dir (str) – Where to output all finished files [keyword-only, default: ‘./output’]
temp_dir (str) – Where to write all temporary files [keyword-only, default: ‘./temp’]
save_temp (bool) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]
library (str) – Library CSV file [keyword-only, default: ‘’]
samples (str) – Samples file [keyword-only, default: ‘’]
clustering_file (str) – Path to the json clustering file from dreem clustering. [keyword-only, default: ‘’]
rnastructure_path (str) – Path to the RNAstructure executable folder (e.g. /home/user/RNAstructure/exe/). Use this option if RNAstructure is not in your PATH. [keyword-only, default: ‘’]
rnastructure_use_temp (bool) – Use the temperature signal to make predictions with RNAstructure [keyword-only, default: False]
rnastructure_fold_args (str) – Additional arguments to pass to RNAstructure’s Fold command [keyword-only, default: ‘’]
rnastructure_use_dms (str) – Use the DMS signal to make predictions with RNAstructure [keyword-only, default: False]
rnastructure_dms_min_unpaired_value (float) – Minimum unpaired value for using the dms signal as an input for RNAstructure [keyword-only, default: 0.07]
rnastructure_dms_max_paired_value (float) – Maximum paired value for using the dms signal as an input for RNAstructure [keyword-only, default: 0.01]
rnastructure_deltag_ensemble (bool) – Use RNAstructure partition function to predict free energy [keyword-only, default: False]
rnastructure_probability (bool) – Use RNAstructure partition function to predict per-base pairing probability [keyword-only, default: False]
verbose (bool) – Print info or info+debug messages to stdout [keyword-only, default: 0]

Key algorithms

Counting mutations for the entire bitvector

The following table describes the aggregation of the bitvector into a set of per-residue counts (except for sub_hist which is per-read).

Please refer to Mutation Vector for the definition of the bits and the bitvector format.

Aggregation	Description
`cov`	Per-residue count of covered bases
`del`	Per-residue count of deleted bases
`ins`	Per-residue count of inserted bases
`info`	Per-residue count of bases that are substituted for another base or non-mutated. Doesn’t include deleted bases.
`sub_A`	Per-residue count of bases that substituted to A
`sub_C`	Per-residue count of bases that substituted to C
`sub_G`	Per-residue count of bases that substituted to G
`sub_T`	Per-residue count of bases that substituted to T
`sub_N`	Per-residue count of bases that substituted to any base
`sub_rate`	Per-residue ratio `sub_N/info`
`sub_hist`	Per-read count of mutations histogram (i.e. number of reads with 0 mutations, 1 mutation, 2 mutations, etc.)

Example:

# example.orc
                # Bitvector
                C    A    C    A    A    T    G    T    G   # reference sequence
                9    5    1    128  1    0    1    1    2   # read 1
                1    1    128  1    0    1    16   2    1   # read 2
                64   1    2    1    1    1    16   1    1   # read 3

    # Read 1
cov             1    1    1    1    1    0    1    1    1
del             0    0    0    0    0    0    0    0    1
ins             0    1    0    0    0    0    0    0    0
info            1    1    1    1    1    0    1    1    0
sub_A           0    0    0    0    0    0    0    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    0    1    0    0    0    0    0
sub_N           0    0    0    1    0    0    0    0    0
sub_rate        0    0    0    1    0    0    0    0    0

# Read 2
cov             1    1    1    1    0    1    1    1    1
del             0    0    0    0    0    0    0    1    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    1    1    0    1    1    0    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    1    0    0    0    0    0    0
sub_N           0    0    1    0    0    0    1    0    0
sub_rate        0    0    1    0    0    0    1    0    0

# Read 3
cov             1    1    1    1    1    1    1    1    1
del             0    0    1    0    0    0    0    1    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    0    1    1    1    1    1    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           1    0    0    0    0    0    0    0    0
sub_T           0    0    0    0    0    0    0    0    0
sub_N           1    0    0    0    0    0    1    0    0
sub_rate        1    0    0    0    0    0    1    0    0

# DREEM output = sum of these

# Aggregation ---------------------------------------------
cov             3    3    3    3    2    2    3    3    3
del             0    0    0    0    0    0    0    1    1
ins             0    1    0    0    0    0    0    0    0
info            3    3    2    3    2    2    3    2    2
sub_A           0    0    0    0    0    0    2    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           1    0    0    0    0    0    0    0    0
sub_T           0    0    1    1    0    0    0    0    0
sub_N           1    0    1    1    0    0    2    0    0
sub_N           1    0    1    1    0    0    2    0    0
sub_rate        0.33 0.00 0.5  0.33 0.00 0.00 0.66 0.00 0.00

# This one is per read, not per base
sub_hist   0  1  2  0  0  0  0  0  0  0

Counting mutations for a certain cluster #TODO

Clustering outputs for each read the likelihood to belong to a certain cluster. The sum of the likelihoods for different clusters for a read is 1.

When counting mutations for a certain cluster, we weight the mutations by the likelihood of the read to belong to that cluster.

Let’s add likelihoods to belong to a cluster K2_1 to the example above:

# example.orc
                # Bitvector
                C    A    C    A    A    T    G    T    G   # reference sequence
                9    5    1    128  1    0    1    1    2   # read 1, likelihood=1
                1    1    128  1    0    1    16   2    1   # read 2  likelihood=0.5
                64   1    2    1    1    1    16   1    1   # read 3  likelihood=0.1

# Read 1, likelihood=1
cov             1    1    1    1    1    0    1    1    1
del             0    0    0    0    0    0    0    0    1
ins             0    1    0    0    0    0    0    0    0
info            1    1    1    1    1    0    1    1    0
sub_A           0    0    0    0    0    0    0    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    0    1    0    0    0    0    0
sub_N           0    0    0    1    0    0    0    0    0
sub_rate        0    0    0    1    0    0    0    0    0

# Read 2, likelihood=0.5
cov             1    1    1    1    0    1    1    1    1
del             0    0    0    0    0    0    0    1    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    1    1    0    1    1    0    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    1    0    0    0    0    0    0
sub_N           0    0    1    0    0    0    1    0    0
sub_rate        0    0    1    0    0    0    1    0    0

# Read 3, likelihood=0.1
cov             1    1    1    1    1    1    1    1    1
del             0    0    1    0    0    0    0    0    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    0    1    1    1    1    1    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           1    0    0    0    0    0    0    0    0
sub_T           0    0    0    0    0    0    0    0    0
sub_N           1    0    0    0    0    0    1    0    0
sub_rate        1    0    0    0    0    0    1    0    0

# DREEM output = sum weighted by the likelihood: read1 + 0.5*read2 + 0.1*read3

# Aggregation ---------------------------------------------
cov             1.6  1.6  1.6  1.6  1.1  0.6  1.6  1.6  1.6
del             0    0    0.1  0    0    0    0    0.5  1
ins             0    1    0    0    0    0    0    0    0
info            1.6  1.6  1.5  1.6  1.1  0.6  1.6  1.1  0.6
sub_A           0    0    0    0    0    0    0.6  0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0.1  0    0    0    0    0    0    0    0
sub_T           0    0    0.5  1    0    0    0    0    0
sub_N           0.1  0    0.5  1    0    0    0.6  0    0
sub_rate        0.06 0    0.33 0.62 0    0    0.37 0    0

# This one is per read, not per base
sub_hist  0  1  0.6

Predicting the structure and the free energy

We use the following script:

echo ">ref\nGGCGACACAGTCGACGGTTTTCACA">temp.fasta
Fold temp.fasta temp.ct
ct2dot temp.ct 1 temp_dot.txt
cat temp_dot.txt

Note

The results are stored in a temporary json file ledger.json, so that if you predict multiple times the same sequence, you will not have to wait for the prediction to be done again.

Edge cases handling

Insertions are counted on the 3’UTR side. See example above.

Contributors: Yves Martin des Taillades