Clustering
Cluster separate reads into groups of reads corresponding to alternative structures.
- Key features:
Pre-process and compress bitvector to speed up computation
Cluster each read using Expectation-Maximisation algorithm
Find optimal number of clusters using Bayesian Information Criterion (BIC)
Get started
Let’s cluster the bitvectors of sample_1
.
- You need:
the path to the bitvectors of
sample_1
stored as shown in the I/O files section.
- You get:
the likelihod of each read belonging to each cluster in
best_cluster_reads.json
under the Clustering report format.
CLI
Run the command:
dreem cluster --mp-report /path/to/report.json --out-dir /path/to/output_dir
Python
Run the following code:
from dreem import cluster
# Important for multiprocessing
if __name__ == '__main__':
cluster.run(
mp_report=['/path/to/report.json'],
out_dir='/path/to/output_dir'
)
See the API Reference for more details.
I/O files
This module needs the files to be sorted into a specific folder structure. The folder structure is described below.
Input files structure
As input, cluster requires one or more files of Mutation Vector batches for each section of a reference. The folder structure is as follows:
sample_1/ # <=> a fastq file <<< give this path to aggregate the entire sample
|- reference_1/ # <=> a single reference in the fasta file
|- section_1/ # <=> a section (a sub-sequence of a reference)
|- 0.orc # <=> a batch of bitvectors
|- 1.orc
|- report.json # <=> a report file <<< give this path to aggregate a single section
|- section_2 # <=> another section for this reference
|- 0.orc
|- report.json
|- reference_2 # <=> another reference
|- section_3
|- 0.orc
|- 1.orc
|- 2.orc
|- report.json
Note that only the report.json
file is required as argument to load the batches of bitvectors.
Output file structure
best_cluster_reads.json
API Reference
The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.
CLI cmds
dreem cluster
dreem cluster [OPTIONS]
Options
- --mp-report <mp_report>
Path to the bit vector folder or list of paths to the bit vector folders.
- --out-dir <out_dir>
Where to output all finished files
- --max-procs <max_procs>
Maximum number of simultaneous processes
- --max-clusters <max_clusters>
Maximum number of clusters.
- -n, --num-runs <num_runs>
Number of time to run the clustering algorithm.
- --signal-thresh <signal_thresh>
Minimum Mutation fraction to keep a base.
- --include-gu, --exclude-gu
Whether to include G and U bases in reads.
- --include-del, --exclude-del
Whether to include deletions in reads.
- --polya-max <polya_max>
Maximum length of poly(A) sequences to include.
- --min-iter <min_iter>
Minimum number of iteration before checking convergence of EM.
- --max-iter <max_iter>
Maximum number of iteration before stopping EM.
- --convergence-cutoff <convergence_cutoff>
Minimum difference between the log-likelihood of two consecutive iterations to stop EM.
- --min-reads <min_reads>
Minimum number of reads to start clustering.
Python args
- dreem.cluster.run(mp_report: tuple[str] = (), *, out_dir: str = './output', max_procs: int = 2, max_clusters: int = 3, num_runs: int = 10, signal_thresh: float = 0.005, include_gu: bool = False, include_del: bool = False, polya_max: int = 4, min_iter: int = 100, max_iter: int = 500, convergence_cutoff: float = 0.5, min_reads: int = 1000)
Run the clustering module.
- Parameters
mp_report (
tuple
) – Path to the bit vector folder or list of paths to the bit vector folders. [positional or keyword, default: ()]out_dir (
str
) – Where to output all finished files [keyword-only, default: ‘./output’]max_procs (
int
) – Maximum number of simultaneous processes [keyword-only, default: 2]max_clusters (
int
) – Maximum number of clusters. [keyword-only, default: 3]num_runs (
int
) – Number of time to run the clustering algorithm. [keyword-only, default: 10]signal_thresh (
float
) – Minimum Mutation fraction to keep a base. [keyword-only, default: 0.005]include_gu (
bool
) – Whether to include G and U bases in reads. [keyword-only, default: False]include_del (
bool
) – Whether to include deletions in reads. [keyword-only, default: False]polya_max (
int
) – Maximum length of poly(A) sequences to include. [keyword-only, default: 4]min_iter (
int
) – Minimum number of iteration before checking convergence of EM. [keyword-only, default: 100]max_iter (
int
) – Maximum number of iteration before stopping EM. [keyword-only, default: 500]convergence_cutoff (
float
) – Minimum difference between the log-likelihood of two consecutive iterations to stop EM. [keyword-only, default: 0.5]min_reads (
int
) – Minimum number of reads to start clustering. [keyword-only, default: 1000]
Key algorithms
Preprocessing the bitvector
- The bitvector is pre-processed by the following steps:
Remove the G and U bases (if the option
--include-gu
is set to False)Remove the low Mutation fraction bases (threshold defined by
--signal-thresh
)Remove the reads with too many mutations (more than 3 standard deviations from the mean)
Remove the reads with deletions (if the option
--include-del
is set to False)Remove the reads with mutations closer than 4 bases apart
Convert the bitvector to binary format by counting substitution as 1
Keep only unique reads, and keep track of their occurence
Here is an example of non processed bitvector, as outputted by the vectoring module. Please refer to Mutation Vector for more information about the bitvector format.
G1 C2 G3 T4 C5 T6 T7 A8 G9 T10 G11 C12 A13 T14 A15
__index_level_0__
reference_1_read_1 1 1 1 1 64 1 1 1 1 1 1 1 1 1 1
reference_1_read_2 1 1 1 1 1 1 1 1 1 16 1 1 1 1 1
reference_1_read_3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 32
reference_1_read_4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
reference_1_read_5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
reference_1_read_6 1 1 1 1 1 1 128 1 1 1 1 1 1 1 1
reference_1_read_7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
reference_1_read_8 1 1 1 1 1 1 1 1 1 1 1 256 1 1 1
reference_1_read_9 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1
After pre-processing:
C5 T7 C12 A15 occurence
__index_level_0__
idx_0 1 0 0 0 1
idx_1 0 1 0 0 1
idx_3 0 0 1 0 1
idx_4 0 0 0 1 1
idx_5 0 0 0 0 4
Clustering algorithm
We use Expectation-Maximisation (EM) algorithm with a Bernoulli mixture model. This model is adapted to account for the fact that one read cannot have mutations closer than 4 bases apart (i.e. the bases mutations are not independent).
The expectation step computes the probability that each read belongs to each cluster:
- with :
\(x_n\) the bitvector of the read n
\(\mu_k\) the mutation profile of cluster k
\(D\) the number of bases in the read
\(d_k\) the correction factor to account for the fact that one read cannot have consecutive mutations
The maximisation step computes the new mutation profile of each cluster \(\mu_k\) and their proportion \(\pi_k\). We use a numerical solver to find the maximum likelihood solution.
See the following paper for more details: Determination of RNA structural diversity and its role in HIV-1 RNA splicing.
Edge cases handling
[What are the edge cases that need to be handled? How are they handled?]
Contributors: Alberic de Lajarte, Yves Martin des Taillades, Harish Swaminathan