Aggregation

Aggregation turns the bitvectors into a DREEM output file.

Key features:
  • Computing the mutation population average for all bitvectors.

  • Computing the mutation population average for clusters of bitvectors.

  • Predicting the free energy and the structure from the sequence using RNAstructure.

Get started

Let’s aggregate sample_1.

You need:
  • the path to the bitvectors of sample_1 stored as shown in the I/O files section.

  • the reference genome under the fasta format reference.fasta.

You get:

CLI

Run the command:

dreem aggregate --bv-files path/to/sample_1 —-fasta path/to/reference.fasta

Python

Run the following code:

from dreem import aggregate
aggregate.run(
    bv_files =('path/to/sample_1',),
    fasta='path/to/reference.fasta'
    )

See the API Reference for more details.

I/O files

This module needs the files to be sorted into a specific folder structure. The folder structure is described below.

Input files structure

As input, aggregate requires one or more files of Mutation Vector batches for each section of a reference. The folder structure is as follows:

sample_1/              # <=> a fastq file <<< give this path to aggregate the entire sample
    |- reference_1/    # <=> a single reference in the fasta file
        |- section_1/  # <=> a section (a sub-sequence of a reference)
            |- 0.orc   # <=> a batch of bitvectors
            |- 1.orc
            |- report.json  # <=> a report file <<< give this path to aggregate a single section
        |- section_2  # <=> another section for this reference
            |- 0.orc
            |- report.json
    |- reference_2    # <=> another reference
        |- section_3
            |- 0.orc
            |- 1.orc
            |- 2.orc
            |- report.json

Output file structure

sample_1.json

API Reference

The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.

CLI

dreem aggregate

dreem aggregate [OPTIONS]

Options

--fasta <fasta>

FASTA file of all reference sequences in the project

-bv, --bv-files <bv_files>

Tuple of paths. Give the path to the sample folder to process every section. Give the path to a report to process a single section.

--library <library>

Library CSV file

--samples <samples>

Samples file

-cf, --clustering-file <clustering_file>

Path to the json clustering file from dreem clustering.

-rs, --rnastructure-path <rnastructure_path>

Path to the RNAstructure executable folder (e.g. /home/user/RNAstructure/exe/). Use this option if RNAstructure is not in your PATH.

-rst, --rnastructure-use-temp <rnastructure_use_temp>

Use the temperature signal to make predictions with RNAstructure

-rsa, --rnastructure-fold-args <rnastructure_fold_args>

Additional arguments to pass to RNAstructure’s Fold command

-rsd, --rnastructure-use-dms <rnastructure_use_dms>

Use the DMS signal to make predictions with RNAstructure

-rsdmin, --rnastructure-dms-min-unpaired-value <rnastructure_dms_min_unpaired_value>

Minimum unpaired value for using the dms signal as an input for RNAstructure

-rsdmax, --rnastructure-dms-max-paired-value <rnastructure_dms_max_paired_value>

Maximum paired value for using the dms signal as an input for RNAstructure

-rspa, --rnastructure-deltag-ensemble <rnastructure_deltag_ensemble>

Use RNAstructure partition function to predict free energy

-rspr, --rnastructure_probability <rnastructure_probability>

Use RNAstructure partition function to predict per-base pairing probability

-v, --verbose

Print info or info+debug messages to stdout

--out-dir <out_dir>

Where to output all finished files

--temp-dir <temp_dir>

Where to write all temporary files

--save-temp, --erase-temp

Whether to save or erase temporary files after the program exits

Python

dreem.aggregate.run(fasta: str, bv_files: list = (), *, out_dir: str = './output', temp_dir: str = './temp', save_temp: bool = False, library: str = '', samples: str = '', clustering_file: str = '', rnastructure_path: str = '', rnastructure_use_temp: bool = False, rnastructure_fold_args: str = '', rnastructure_use_dms: str = False, rnastructure_dms_min_unpaired_value: float = 0.07, rnastructure_dms_max_paired_value: float = 0.01, rnastructure_deltag_ensemble: bool = False, rnastructure_probability: bool = False, verbose: bool = 0)

Run the aggregate module.

Parameters
  • fasta (str) – FASTA file of all reference sequences in the project [positional or keyword]

  • bv_files (list) – Tuple of paths. Give the path to the sample folder to process every section. Give the path to a report to process a single section. [positional or keyword, default: ()]

  • out_dir (str) – Where to output all finished files [keyword-only, default: ‘./output’]

  • temp_dir (str) – Where to write all temporary files [keyword-only, default: ‘./temp’]

  • save_temp (bool) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]

  • library (str) – Library CSV file [keyword-only, default: ‘’]

  • samples (str) – Samples file [keyword-only, default: ‘’]

  • clustering_file (str) – Path to the json clustering file from dreem clustering. [keyword-only, default: ‘’]

  • rnastructure_path (str) – Path to the RNAstructure executable folder (e.g. /home/user/RNAstructure/exe/). Use this option if RNAstructure is not in your PATH. [keyword-only, default: ‘’]

  • rnastructure_use_temp (bool) – Use the temperature signal to make predictions with RNAstructure [keyword-only, default: False]

  • rnastructure_fold_args (str) – Additional arguments to pass to RNAstructure’s Fold command [keyword-only, default: ‘’]

  • rnastructure_use_dms (str) – Use the DMS signal to make predictions with RNAstructure [keyword-only, default: False]

  • rnastructure_dms_min_unpaired_value (float) – Minimum unpaired value for using the dms signal as an input for RNAstructure [keyword-only, default: 0.07]

  • rnastructure_dms_max_paired_value (float) – Maximum paired value for using the dms signal as an input for RNAstructure [keyword-only, default: 0.01]

  • rnastructure_deltag_ensemble (bool) – Use RNAstructure partition function to predict free energy [keyword-only, default: False]

  • rnastructure_probability (bool) – Use RNAstructure partition function to predict per-base pairing probability [keyword-only, default: False]

  • verbose (bool) – Print info or info+debug messages to stdout [keyword-only, default: 0]

Key algorithms

Counting mutations for the entire bitvector

The following table describes the aggregation of the bitvector into a set of per-residue counts (except for sub_hist which is per-read).

Please refer to Mutation Vector for the definition of the bits and the bitvector format.

Aggregation

Description

cov

Per-residue count of covered bases

del

Per-residue count of deleted bases

ins

Per-residue count of inserted bases

info

Per-residue count of bases that are substituted for another base or non-mutated. Doesn’t include deleted bases.

sub_A

Per-residue count of bases that substituted to A

sub_C

Per-residue count of bases that substituted to C

sub_G

Per-residue count of bases that substituted to G

sub_T

Per-residue count of bases that substituted to T

sub_N

Per-residue count of bases that substituted to any base

sub_rate

Per-residue ratio sub_N/info

sub_hist

Per-read count of mutations histogram (i.e. number of reads with 0 mutations, 1 mutation, 2 mutations, etc.)

Example:

# example.orc
                # Bitvector
                C    A    C    A    A    T    G    T    G   # reference sequence
                9    5    1    128  1    0    1    1    2   # read 1
                1    1    128  1    0    1    16   2    1   # read 2
                64   1    2    1    1    1    16   1    1   # read 3

    # Read 1
cov             1    1    1    1    1    0    1    1    1
del             0    0    0    0    0    0    0    0    1
ins             0    1    0    0    0    0    0    0    0
info            1    1    1    1    1    0    1    1    0
sub_A           0    0    0    0    0    0    0    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    0    1    0    0    0    0    0
sub_N           0    0    0    1    0    0    0    0    0
sub_rate        0    0    0    1    0    0    0    0    0

# Read 2
cov             1    1    1    1    0    1    1    1    1
del             0    0    0    0    0    0    0    1    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    1    1    0    1    1    0    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    1    0    0    0    0    0    0
sub_N           0    0    1    0    0    0    1    0    0
sub_rate        0    0    1    0    0    0    1    0    0

# Read 3
cov             1    1    1    1    1    1    1    1    1
del             0    0    1    0    0    0    0    1    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    0    1    1    1    1    1    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           1    0    0    0    0    0    0    0    0
sub_T           0    0    0    0    0    0    0    0    0
sub_N           1    0    0    0    0    0    1    0    0
sub_rate        1    0    0    0    0    0    1    0    0

# DREEM output = sum of these

# Aggregation ---------------------------------------------
cov             3    3    3    3    2    2    3    3    3
del             0    0    0    0    0    0    0    1    1
ins             0    1    0    0    0    0    0    0    0
info            3    3    2    3    2    2    3    2    2
sub_A           0    0    0    0    0    0    2    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           1    0    0    0    0    0    0    0    0
sub_T           0    0    1    1    0    0    0    0    0
sub_N           1    0    1    1    0    0    2    0    0
sub_N           1    0    1    1    0    0    2    0    0
sub_rate        0.33 0.00 0.5  0.33 0.00 0.00 0.66 0.00 0.00

# This one is per read, not per base
sub_hist   0  1  2  0  0  0  0  0  0  0

Counting mutations for a certain cluster #TODO

Clustering outputs for each read the likelihood to belong to a certain cluster. The sum of the likelihoods for different clusters for a read is 1.

When counting mutations for a certain cluster, we weight the mutations by the likelihood of the read to belong to that cluster.

Let’s add likelihoods to belong to a cluster K2_1 to the example above:

# example.orc
                # Bitvector
                C    A    C    A    A    T    G    T    G   # reference sequence
                9    5    1    128  1    0    1    1    2   # read 1, likelihood=1
                1    1    128  1    0    1    16   2    1   # read 2  likelihood=0.5
                64   1    2    1    1    1    16   1    1   # read 3  likelihood=0.1

# Read 1, likelihood=1
cov             1    1    1    1    1    0    1    1    1
del             0    0    0    0    0    0    0    0    1
ins             0    1    0    0    0    0    0    0    0
info            1    1    1    1    1    0    1    1    0
sub_A           0    0    0    0    0    0    0    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    0    1    0    0    0    0    0
sub_N           0    0    0    1    0    0    0    0    0
sub_rate        0    0    0    1    0    0    0    0    0

# Read 2, likelihood=0.5
cov             1    1    1    1    0    1    1    1    1
del             0    0    0    0    0    0    0    1    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    1    1    0    1    1    0    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0    0    0    0    0    0    0    0    0
sub_T           0    0    1    0    0    0    0    0    0
sub_N           0    0    1    0    0    0    1    0    0
sub_rate        0    0    1    0    0    0    1    0    0

# Read 3, likelihood=0.1
cov             1    1    1    1    1    1    1    1    1
del             0    0    1    0    0    0    0    0    0
ins             0    0    0    0    0    0    0    0    0
info            1    1    0    1    1    1    1    1    1
sub_A           0    0    0    0    0    0    1    0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           1    0    0    0    0    0    0    0    0
sub_T           0    0    0    0    0    0    0    0    0
sub_N           1    0    0    0    0    0    1    0    0
sub_rate        1    0    0    0    0    0    1    0    0

# DREEM output = sum weighted by the likelihood: read1 + 0.5*read2 + 0.1*read3

# Aggregation ---------------------------------------------
cov             1.6  1.6  1.6  1.6  1.1  0.6  1.6  1.6  1.6
del             0    0    0.1  0    0    0    0    0.5  1
ins             0    1    0    0    0    0    0    0    0
info            1.6  1.6  1.5  1.6  1.1  0.6  1.6  1.1  0.6
sub_A           0    0    0    0    0    0    0.6  0    0
sub_C           0    0    0    0    0    0    0    0    0
sub_G           0.1  0    0    0    0    0    0    0    0
sub_T           0    0    0.5  1    0    0    0    0    0
sub_N           0.1  0    0.5  1    0    0    0.6  0    0
sub_rate        0.06 0    0.33 0.62 0    0    0.37 0    0

# This one is per read, not per base
sub_hist  0  1  0.6

Predicting the structure and the free energy

We use the following script:

echo ">ref\nGGCGACACAGTCGACGGTTTTCACA">temp.fasta
Fold temp.fasta temp.ct
ct2dot temp.ct 1 temp_dot.txt
cat temp_dot.txt

Note

The results are stored in a temporary json file ledger.json, so that if you predict multiple times the same sequence, you will not have to wait for the prediction to be done again.

Edge cases handling

  • Insertions are counted on the 3’UTR side. See example above.

Contributors: Yves Martin des Taillades