Aggregation
Aggregation turns the bitvectors into a DREEM output file.
- Key features:
Computing the mutation population average for all bitvectors.
Computing the mutation population average for clusters of bitvectors.
Predicting the free energy and the structure from the sequence using RNAstructure.
Get started
Let’s aggregate sample_1
.
- You need:
the path to the bitvectors of
sample_1
stored as shown in the I/O files section.the reference genome under the fasta format
reference.fasta
.
- You get:
sample_1.json
under the DREEM output format.
CLI
Run the command:
dreem aggregate --bv-files path/to/sample_1 —-fasta path/to/reference.fasta
Python
Run the following code:
from dreem import aggregate
aggregate.run(
bv_files =('path/to/sample_1',),
fasta='path/to/reference.fasta'
)
See the API Reference for more details.
I/O files
This module needs the files to be sorted into a specific folder structure. The folder structure is described below.
Input files structure
As input, aggregate requires one or more files of Mutation Vector batches for each section of a reference. The folder structure is as follows:
sample_1/ # <=> a fastq file <<< give this path to aggregate the entire sample
|- reference_1/ # <=> a single reference in the fasta file
|- section_1/ # <=> a section (a sub-sequence of a reference)
|- 0.orc # <=> a batch of bitvectors
|- 1.orc
|- report.json # <=> a report file <<< give this path to aggregate a single section
|- section_2 # <=> another section for this reference
|- 0.orc
|- report.json
|- reference_2 # <=> another reference
|- section_3
|- 0.orc
|- 1.orc
|- 2.orc
|- report.json
Output file structure
sample_1.json
API Reference
The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.
CLI
dreem aggregate
dreem aggregate [OPTIONS]
Options
- --fasta <fasta>
FASTA file of all reference sequences in the project
- -bv, --bv-files <bv_files>
Tuple of paths. Give the path to the sample folder to process every section. Give the path to a report to process a single section.
- --library <library>
Library CSV file
- --samples <samples>
Samples file
- -cf, --clustering-file <clustering_file>
Path to the json clustering file from dreem clustering.
- -rs, --rnastructure-path <rnastructure_path>
Path to the RNAstructure executable folder (e.g. /home/user/RNAstructure/exe/). Use this option if RNAstructure is not in your PATH.
- -rst, --rnastructure-use-temp <rnastructure_use_temp>
Use the temperature signal to make predictions with RNAstructure
- -rsa, --rnastructure-fold-args <rnastructure_fold_args>
Additional arguments to pass to RNAstructure’s Fold command
- -rsd, --rnastructure-use-dms <rnastructure_use_dms>
Use the DMS signal to make predictions with RNAstructure
- -rsdmin, --rnastructure-dms-min-unpaired-value <rnastructure_dms_min_unpaired_value>
Minimum unpaired value for using the dms signal as an input for RNAstructure
- -rsdmax, --rnastructure-dms-max-paired-value <rnastructure_dms_max_paired_value>
Maximum paired value for using the dms signal as an input for RNAstructure
- -rspa, --rnastructure-deltag-ensemble <rnastructure_deltag_ensemble>
Use RNAstructure partition function to predict free energy
- -rspr, --rnastructure_probability <rnastructure_probability>
Use RNAstructure partition function to predict per-base pairing probability
- -v, --verbose
Print info or info+debug messages to stdout
- --out-dir <out_dir>
Where to output all finished files
- --temp-dir <temp_dir>
Where to write all temporary files
- --save-temp, --erase-temp
Whether to save or erase temporary files after the program exits
Python
- dreem.aggregate.run(fasta: str, bv_files: list = (), *, out_dir: str = './output', temp_dir: str = './temp', save_temp: bool = False, library: str = '', samples: str = '', clustering_file: str = '', rnastructure_path: str = '', rnastructure_use_temp: bool = False, rnastructure_fold_args: str = '', rnastructure_use_dms: str = False, rnastructure_dms_min_unpaired_value: float = 0.07, rnastructure_dms_max_paired_value: float = 0.01, rnastructure_deltag_ensemble: bool = False, rnastructure_probability: bool = False, verbose: bool = 0)
Run the aggregate module.
- Parameters
fasta (
str
) – FASTA file of all reference sequences in the project [positional or keyword]bv_files (
list
) – Tuple of paths. Give the path to the sample folder to process every section. Give the path to a report to process a single section. [positional or keyword, default: ()]out_dir (
str
) – Where to output all finished files [keyword-only, default: ‘./output’]temp_dir (
str
) – Where to write all temporary files [keyword-only, default: ‘./temp’]save_temp (
bool
) – Whether to save or erase temporary files after the program exits [keyword-only, default: False]library (
str
) – Library CSV file [keyword-only, default: ‘’]samples (
str
) – Samples file [keyword-only, default: ‘’]clustering_file (
str
) – Path to the json clustering file from dreem clustering. [keyword-only, default: ‘’]rnastructure_path (
str
) – Path to the RNAstructure executable folder (e.g. /home/user/RNAstructure/exe/). Use this option if RNAstructure is not in your PATH. [keyword-only, default: ‘’]rnastructure_use_temp (
bool
) – Use the temperature signal to make predictions with RNAstructure [keyword-only, default: False]rnastructure_fold_args (
str
) – Additional arguments to pass to RNAstructure’s Fold command [keyword-only, default: ‘’]rnastructure_use_dms (
str
) – Use the DMS signal to make predictions with RNAstructure [keyword-only, default: False]rnastructure_dms_min_unpaired_value (
float
) – Minimum unpaired value for using the dms signal as an input for RNAstructure [keyword-only, default: 0.07]rnastructure_dms_max_paired_value (
float
) – Maximum paired value for using the dms signal as an input for RNAstructure [keyword-only, default: 0.01]rnastructure_deltag_ensemble (
bool
) – Use RNAstructure partition function to predict free energy [keyword-only, default: False]rnastructure_probability (
bool
) – Use RNAstructure partition function to predict per-base pairing probability [keyword-only, default: False]verbose (
bool
) – Print info or info+debug messages to stdout [keyword-only, default: 0]
Key algorithms
Counting mutations for the entire bitvector
The following table describes the aggregation of the bitvector into a set of per-residue counts (except for sub_hist
which is per-read).
Please refer to Mutation Vector for the definition of the bits and the bitvector format.
Aggregation |
Description |
|
Per-residue count of covered bases |
|
Per-residue count of deleted bases |
|
Per-residue count of inserted bases |
|
Per-residue count of bases that are substituted for another base or non-mutated. Doesn’t include deleted bases. |
|
Per-residue count of bases that substituted to A |
|
Per-residue count of bases that substituted to C |
|
Per-residue count of bases that substituted to G |
|
Per-residue count of bases that substituted to T |
|
Per-residue count of bases that substituted to any base |
|
Per-residue ratio |
|
Per-read count of mutations histogram (i.e. number of reads with 0 mutations, 1 mutation, 2 mutations, etc.) |
Example:
# example.orc
# Bitvector
C A C A A T G T G # reference sequence
9 5 1 128 1 0 1 1 2 # read 1
1 1 128 1 0 1 16 2 1 # read 2
64 1 2 1 1 1 16 1 1 # read 3
# Read 1
cov 1 1 1 1 1 0 1 1 1
del 0 0 0 0 0 0 0 0 1
ins 0 1 0 0 0 0 0 0 0
info 1 1 1 1 1 0 1 1 0
sub_A 0 0 0 0 0 0 0 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 0 0 0 0 0 0 0 0 0
sub_T 0 0 0 1 0 0 0 0 0
sub_N 0 0 0 1 0 0 0 0 0
sub_rate 0 0 0 1 0 0 0 0 0
# Read 2
cov 1 1 1 1 0 1 1 1 1
del 0 0 0 0 0 0 0 1 0
ins 0 0 0 0 0 0 0 0 0
info 1 1 1 1 0 1 1 0 1
sub_A 0 0 0 0 0 0 1 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 0 0 0 0 0 0 0 0 0
sub_T 0 0 1 0 0 0 0 0 0
sub_N 0 0 1 0 0 0 1 0 0
sub_rate 0 0 1 0 0 0 1 0 0
# Read 3
cov 1 1 1 1 1 1 1 1 1
del 0 0 1 0 0 0 0 1 0
ins 0 0 0 0 0 0 0 0 0
info 1 1 0 1 1 1 1 1 1
sub_A 0 0 0 0 0 0 1 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 1 0 0 0 0 0 0 0 0
sub_T 0 0 0 0 0 0 0 0 0
sub_N 1 0 0 0 0 0 1 0 0
sub_rate 1 0 0 0 0 0 1 0 0
# DREEM output = sum of these
# Aggregation ---------------------------------------------
cov 3 3 3 3 2 2 3 3 3
del 0 0 0 0 0 0 0 1 1
ins 0 1 0 0 0 0 0 0 0
info 3 3 2 3 2 2 3 2 2
sub_A 0 0 0 0 0 0 2 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 1 0 0 0 0 0 0 0 0
sub_T 0 0 1 1 0 0 0 0 0
sub_N 1 0 1 1 0 0 2 0 0
sub_N 1 0 1 1 0 0 2 0 0
sub_rate 0.33 0.00 0.5 0.33 0.00 0.00 0.66 0.00 0.00
# This one is per read, not per base
sub_hist 0 1 2 0 0 0 0 0 0 0
Counting mutations for a certain cluster #TODO
Clustering outputs for each read the likelihood to belong to a certain cluster. The sum of the likelihoods for different clusters for a read is 1.
When counting mutations for a certain cluster, we weight the mutations by the likelihood of the read to belong to that cluster.
Let’s add likelihoods to belong to a cluster K2_1 to the example above:
# example.orc
# Bitvector
C A C A A T G T G # reference sequence
9 5 1 128 1 0 1 1 2 # read 1, likelihood=1
1 1 128 1 0 1 16 2 1 # read 2 likelihood=0.5
64 1 2 1 1 1 16 1 1 # read 3 likelihood=0.1
# Read 1, likelihood=1
cov 1 1 1 1 1 0 1 1 1
del 0 0 0 0 0 0 0 0 1
ins 0 1 0 0 0 0 0 0 0
info 1 1 1 1 1 0 1 1 0
sub_A 0 0 0 0 0 0 0 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 0 0 0 0 0 0 0 0 0
sub_T 0 0 0 1 0 0 0 0 0
sub_N 0 0 0 1 0 0 0 0 0
sub_rate 0 0 0 1 0 0 0 0 0
# Read 2, likelihood=0.5
cov 1 1 1 1 0 1 1 1 1
del 0 0 0 0 0 0 0 1 0
ins 0 0 0 0 0 0 0 0 0
info 1 1 1 1 0 1 1 0 1
sub_A 0 0 0 0 0 0 1 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 0 0 0 0 0 0 0 0 0
sub_T 0 0 1 0 0 0 0 0 0
sub_N 0 0 1 0 0 0 1 0 0
sub_rate 0 0 1 0 0 0 1 0 0
# Read 3, likelihood=0.1
cov 1 1 1 1 1 1 1 1 1
del 0 0 1 0 0 0 0 0 0
ins 0 0 0 0 0 0 0 0 0
info 1 1 0 1 1 1 1 1 1
sub_A 0 0 0 0 0 0 1 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 1 0 0 0 0 0 0 0 0
sub_T 0 0 0 0 0 0 0 0 0
sub_N 1 0 0 0 0 0 1 0 0
sub_rate 1 0 0 0 0 0 1 0 0
# DREEM output = sum weighted by the likelihood: read1 + 0.5*read2 + 0.1*read3
# Aggregation ---------------------------------------------
cov 1.6 1.6 1.6 1.6 1.1 0.6 1.6 1.6 1.6
del 0 0 0.1 0 0 0 0 0.5 1
ins 0 1 0 0 0 0 0 0 0
info 1.6 1.6 1.5 1.6 1.1 0.6 1.6 1.1 0.6
sub_A 0 0 0 0 0 0 0.6 0 0
sub_C 0 0 0 0 0 0 0 0 0
sub_G 0.1 0 0 0 0 0 0 0 0
sub_T 0 0 0.5 1 0 0 0 0 0
sub_N 0.1 0 0.5 1 0 0 0.6 0 0
sub_rate 0.06 0 0.33 0.62 0 0 0.37 0 0
# This one is per read, not per base
sub_hist 0 1 0.6
Predicting the structure and the free energy
We use the following script:
echo ">ref\nGGCGACACAGTCGACGGTTTTCACA">temp.fasta
Fold temp.fasta temp.ct
ct2dot temp.ct 1 temp_dot.txt
cat temp_dot.txt
Note
The results are stored in a temporary json file ledger.json
, so that if you predict multiple times the same sequence, you will not have to wait for the prediction to be done again.
Edge cases handling
Insertions are counted on the 3’UTR side. See example above.
Contributors: Yves Martin des Taillades