Clustering

Cluster reads into groups corresponding to alternative structures.

Key features:
  • Pre-process and compress the bitvector to speed up computation

  • Cluster the reads using the Expectation-Maximisation (EM) algorithm

  • Find the optimal number of clusters using the Bayesian Information Criterion (BIC)

Get started

Let’s cluster the bitvectors of sample_1.

You need:
  • the path to the bitvectors of sample_1 stored as shown in the I/O files section.

You get:
  • the likelihood of each read belonging to each cluster, stored in best_cluster_reads.json under the Clustering report format.

CLI

Run the command:

dreem cluster --mp-report /path/to/report.json --out-dir /path/to/output_dir

Python

Run the following code:

from dreem import cluster

# Important for multiprocessing
if __name__ == '__main__':
    cluster.run(
        mp_report=['/path/to/report.json'],
        out_dir='/path/to/output_dir',
    )

See the API Reference for more details.

I/O files

This module expects its input files to be organised in the folder structure described below.

Input files structure

As input, cluster requires one or more batches of Mutation Vectors for each section of a reference. The folder structure is as follows:

sample_1/              # <=> a fastq file <<< give this path to cluster the entire sample
    |- reference_1/    # <=> a single reference in the fasta file
        |- section_1/  # <=> a section (a sub-sequence of a reference)
            |- 0.orc   # <=> a batch of bitvectors
            |- 1.orc
            |- report.json  # <=> a report file <<< give this path to cluster a single section
        |- section_2/  # <=> another section of this reference
            |- 0.orc
            |- report.json
    |- reference_2/    # <=> another reference
        |- section_3/
            |- 0.orc
            |- 1.orc
            |- 2.orc
            |- report.json

Note that only the report.json file is required as an argument to load the batches of bitvectors.
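For example, here is a minimal sketch (paths are placeholders) that gathers every report.json below a sample directory laid out as above and passes them all to the clustering module:

from pathlib import Path

from dreem import cluster

# Important for multiprocessing
if __name__ == '__main__':
    # Hypothetical sample directory following the layout shown above.
    sample_dir = Path('/path/to/sample_1')

    # One report.json per section of each reference; each report gives
    # cluster access to the batches of bitvectors (0.orc, 1.orc, ...)
    # stored next to it.
    reports = sorted(str(p) for p in sample_dir.glob('*/*/report.json'))

    cluster.run(
        mp_report=reports,
        out_dir='/path/to/output_dir',
    )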

Output file structure

best_cluster_reads.json
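The contents of best_cluster_reads.json follow the Clustering report format. As a purely illustrative sketch of loading it (the key structure assumed here is not the documented schema):

import json

# Placeholder path; the actual layout of the file is defined by the
# Clustering report format, not by this sketch.
with open('/path/to/output_dir/best_cluster_reads.json') as f:
    cluster_likelihoods = json.load(f)

# Print a few entries to inspect the per-read, per-cluster likelihoods
# (assumes the top level is a JSON object).
for key, value in list(cluster_likelihoods.items())[:5]:
    print(key, value)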

API Reference

The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.

CLI cmds

dreem cluster

dreem cluster [OPTIONS]

Options

--mp-report <mp_report>

Path to the bit vector folder or list of paths to the bit vector folders.

--out-dir <out_dir>

Where to output all finished files

--max-procs <max_procs>

Maximum number of simultaneous processes

--max-clusters <max_clusters>

Maximum number of clusters.

-n, --num-runs <num_runs>

Number of times to run the clustering algorithm.

--signal-thresh <signal_thresh>

Minimum mutation fraction to keep a base.

--include-gu, --exclude-gu

Whether to include G and U bases in reads.

--include-del, --exclude-del

Whether to include deletions in reads.

--polya-max <polya_max>

Maximum length of poly(A) sequences to include.

--min-iter <min_iter>

Minimum number of iterations before checking convergence of EM.

--max-iter <max_iter>

Maximum number of iterations before stopping EM.

--convergence-cutoff <convergence_cutoff>

EM stops once the difference in log-likelihood between two consecutive iterations falls below this value.

--min-reads <min_reads>

Minimum number of reads to start clustering.

Python args

dreem.cluster.run(mp_report: tuple[str] = (), *, out_dir: str = './output', max_procs: int = 2, max_clusters: int = 3, num_runs: int = 10, signal_thresh: float = 0.005, include_gu: bool = False, include_del: bool = False, polya_max: int = 4, min_iter: int = 100, max_iter: int = 500, convergence_cutoff: float = 0.5, min_reads: int = 1000)

Run the clustering module.

Parameters
  • mp_report (tuple) – Path to the bit vector folder or list of paths to the bit vector folders. [positional or keyword, default: ()]

  • out_dir (str) – Where to output all finished files [keyword-only, default: ‘./output’]

  • max_procs (int) – Maximum number of simultaneous processes [keyword-only, default: 2]

  • max_clusters (int) – Maximum number of clusters. [keyword-only, default: 3]

  • num_runs (int) – Number of times to run the clustering algorithm. [keyword-only, default: 10]

  • signal_thresh (float) – Minimum mutation fraction to keep a base. [keyword-only, default: 0.005]

  • include_gu (bool) – Whether to include G and U bases in reads. [keyword-only, default: False]

  • include_del (bool) – Whether to include deletions in reads. [keyword-only, default: False]

  • polya_max (int) – Maximum length of poly(A) sequences to include. [keyword-only, default: 4]

  • min_iter (int) – Minimum number of iterations before checking convergence of EM. [keyword-only, default: 100]

  • max_iter (int) – Maximum number of iterations before stopping EM. [keyword-only, default: 500]

  • convergence_cutoff (float) – EM stops once the difference in log-likelihood between two consecutive iterations falls below this value. [keyword-only, default: 0.5]

  • min_reads (int) – Minimum number of reads to start clustering. [keyword-only, default: 1000]
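For instance, here is a sketch of a call that tunes a few of the parameters above (paths are placeholders; only parameters documented in this section are used):

from dreem import cluster

# Important for multiprocessing
if __name__ == '__main__':
    cluster.run(
        mp_report=['/path/to/report.json'],
        out_dir='/path/to/output_dir',
        max_clusters=2,       # try at most 2 clusters
        num_runs=20,          # more EM runs to pick a more robust best run
        signal_thresh=0.01,   # drop bases with a mutation fraction below 1%
        min_reads=5000,       # require more reads before clustering
    )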

Key algorithms

Preprocessing the bitvector

The bitvector is pre-processed by the following steps:
  • Remove the G and U bases (if the option --include-gu is set to False)

  • Remove the bases with a low mutation fraction (threshold defined by --signal-thresh)

  • Remove the reads with too many mutations (more than 3 standard deviations from the mean)

  • Remove the reads with deletions (if the option --include-del is set to False)

  • Remove the reads with mutations closer than 4 bases apart

  • Convert the bitvector to binary format by counting substitutions as 1

  • Keep only unique reads, and keep track of their occurrence

Here is an example of an unprocessed bitvector, as output by the vectoring module. Please refer to Mutation Vector for more information about the bitvector format.

                    G1  C2  G3  T4  C5  T6  T7  A8  G9  T10  G11  C12  A13  T14  A15
__index_level_0__
reference_1_read_1   1   1   1   1  64   1   1   1   1    1    1    1    1    1    1
reference_1_read_2   1   1   1   1   1   1   1   1   1   16    1    1    1    1    1
reference_1_read_3   1   1   1   1   1   1   1   1   1    1    1    1    1    1   32
reference_1_read_4   1   1   1   1   1   1   1   1   1    1    1    1    1    1    1
reference_1_read_5   1   1   1   1   1   1   1   1   1    1    1    1    1    1    1
reference_1_read_6   1   1   1   1   1   1 128   1   1    1    1    1    1    1    1
reference_1_read_7   1   1   1   1   1   1   1   1   1    1    1    1    1    1    1
reference_1_read_8   1   1   1   1   1   1   1   1   1    1    1  256    1    1    1
reference_1_read_9   1   4   1   1   1   1   1   1   1    1    1    1    1    1    1

After pre-processing:

                     C5   T7   C12  A15  occurrence
__index_level_0__
idx_0                 1   0    0    0    1
idx_1                 0   1    0    0    1
idx_3                 0   0    1    0    1
idx_4                 0   0    0    1    1
idx_5                 0   0    0    0    4
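As an illustration of the final compression step, here is a minimal pandas sketch that collapses an already-binarised bitvector into unique reads with an occurrence count. The column names mirror the example above; the binarisation itself depends on the Mutation Vector byte codes and is not reproduced here:

import pandas as pd

# Toy binarised bitvector: one row per read, 1 = mutated, 0 = not mutated.
binary = pd.DataFrame(
    {
        'C5':  [1, 0, 0, 0, 0, 0, 0, 0],
        'T7':  [0, 1, 0, 0, 0, 0, 0, 0],
        'C12': [0, 0, 1, 0, 0, 0, 0, 0],
        'A15': [0, 0, 0, 1, 0, 0, 0, 0],
    }
)

# Keep only the unique reads and count how often each one occurs.
unique = (
    binary.groupby(list(binary.columns))
          .size()
          .reset_index(name='occurrence')
)
print(unique)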

Clustering algorithm

We use the Expectation-Maximisation (EM) algorithm with a Bernoulli mixture model. The model is adapted to account for the fact that one read cannot have mutations closer than 4 bases apart (i.e., the base mutations are not independent).

The expectation step computes the probability that each read belongs to each cluster:

\[P(x_n \mid \mu_k) = \frac{ \prod_{i=1}^{D} \mu_{ki}^{x_{ni}} \, (1-\mu_{ki})^{1-x_{ni}} }{d_k}\]
with:
  • \(x_n\) the bitvector of read \(n\)

  • \(\mu_k\) the mutation profile of cluster \(k\)

  • \(D\) the number of bases in the read

  • \(d_k\) the correction factor accounting for the fact that one read cannot have mutations closer than 4 bases apart

The maximisation step computes the new mutation profile \(\mu_k\) and proportion \(\pi_k\) of each cluster. We use a numerical solver to find the maximum-likelihood solution.
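To make the two steps concrete, here is a simplified numpy sketch of one EM pass on binarised reads. It ignores the \(d_k\) correction factor and uses the closed-form Bernoulli update instead of the numerical solver, so it illustrates the structure of the algorithm rather than reproducing the module's exact computation:

import numpy as np

def em_step(x, counts, mu, pi):
    """One EM iteration for a plain Bernoulli mixture (no d_k correction).

    x      : (N, D) array of unique binarised reads
    counts : (N,) occurrence of each unique read
    mu     : (K, D) per-cluster mutation profiles, strictly between 0 and 1
    pi     : (K,) cluster proportions
    """
    # E-step: log P(x_n | mu_k) + log pi_k for every read/cluster pair.
    log_lik = (x @ np.log(mu).T) + ((1 - x) @ np.log(1 - mu).T) + np.log(pi)
    # Normalise per read to get responsibilities P(cluster k | read n).
    resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step (closed form for an uncorrected Bernoulli mixture).
    weights = resp * counts[:, None]          # weight each unique read by its occurrence
    n_k = weights.sum(axis=0)                 # effective number of reads per cluster
    mu_new = (weights.T @ x) / n_k[:, None]   # new mutation profiles
    pi_new = n_k / counts.sum()               # new cluster proportions
    return mu_new, pi_new, resp

Each of the num_runs restarts would iterate such a step at least min_iter times and stop once the log-likelihood change falls below convergence_cutoff (or after max_iter iterations).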

See the following paper for more details: Determination of RNA structural diversity and its role in HIV-1 RNA splicing.
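The number of clusters is then chosen with the Bayesian Information Criterion mentioned in the key features. Using the standard definition, with \(\hat{L}\) the maximised likelihood, \(p\) the number of free parameters of the mixture, and \(N\) the number of reads:

\[BIC = p \ln(N) - 2 \ln(\hat{L})\]

The number of clusters with the lowest BIC is selected; the exact parameter count used by the module is not detailed here.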

Edge cases handling

[What are the edge cases that need to be handled? How are they handled?]

Contributors: Alberic de Lajarte, Yves Martin des Taillades, Harish Swaminathan