seismicrna.cluster package

Submodules

class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)

Bases: ClusterReadBatch, PartialRegionMutsBatch

property read_weights

Weights for each read when computing counts.

class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)

Bases: PartialReadBatch

property num_reads: Series

Number of reads.

property read_nums

Read numbers.

class seismicrna.cluster.data.ClusterAbundanceTable

Bases: AbundanceTable, PartialTable, ClusterFile, ABC

classmethod get_header_type()

Type of the header for the table.

classmethod get_index_depth()

Number of columns in the index.

classmethod get_load_function()

LoadFunction for all Dataset types for this Table.

property proportions

Proportion of each item.

class seismicrna.cluster.data.ClusterAbundanceTableLoader(table_file: str | Path, **kwargs)

Bases: TableLoader, ClusterAbundanceTable

Load cluster data indexed by cluster.

property data: Series

Table’s data.

class seismicrna.cluster.data.ClusterAbundanceTableWriter(tabulator: Tabulator)

Bases: AbundanceTableWriter, ClusterAbundanceTable

class seismicrna.cluster.data.ClusterBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, num_cpus: int = 1, **kwargs)

Bases: BatchTabulator, ClusterTabulator

class seismicrna.cluster.data.ClusterCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)

Bases: CountTabulator, ClusterTabulator

class seismicrna.cluster.data.ClusterDataset(report_file: str | Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset of clustered data.

class seismicrna.cluster.data.ClusterDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: PartialDatasetTabulator, ClusterTabulator

classmethod init_kws()

Attributes of the dataset to use as keyword arguments in super().__init__().

class seismicrna.cluster.data.ClusterMutsDataset(dataset2_report_file: Path, **kwargs)

Bases: ClusterDataset, MultistepDataset, UnbiasDataset

Merge cluster responsibilities with mutation data.

property best_k

Best number of clusters.

classmethod get_dataset1_load_func()

Function to load Dataset 1.

classmethod get_dataset2_type()

Type of Dataset 2.

property ks

Numbers of clusters.

property min_mut_gap

Minimum gap between two mutations.

property pattern

Pattern of mutations to count.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

property region

Region of the dataset.

class seismicrna.cluster.data.ClusterPositionTable

Bases: ClusterTable, PartialPositionTable, ABC

class seismicrna.cluster.data.ClusterPositionTableLoader(table_file: str | Path, **kwargs)

Bases: PositionTableLoader, ClusterPositionTable

Load cluster data indexed by position.

class seismicrna.cluster.data.ClusterPositionTableWriter(tabulator: Tabulator)

Bases: PositionTableWriter, ClusterPositionTable

class seismicrna.cluster.data.ClusterReadDataset(report_file: str | Path, verify_times: bool = True)

Bases: ClusterDataset, LoadedDataset

Load clustering results.

property best_k

Best number of clusters.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property ks

Numbers of clusters.

property pattern

Pattern of mutations to count.

class seismicrna.cluster.data.ClusterTable

Bases: RelTypeTable, ClusterFile, ABC

classmethod get_header_type()

Type of the header for the table.

classmethod get_load_function()

LoadFunction for all Dataset types for this Table.

class seismicrna.cluster.data.ClusterTabulator(*, ks: list[int], **kwargs)

Bases: PartialTabulator, ABC

property clust_header

Header of the per-cluster data.

property data_per_clust

Number of reads in each cluster.

classmethod table_types()

Types of tables that this tabulator can write.

class seismicrna.cluster.data.JoinClusterMutsDataset(*args, **kwargs)

Bases: ClusterDataset, JoinMutsDataset, MergedUnbiasDataset

property best_k

Best number of clusters.

property clusts

Index of k and cluster numbers.

classmethod get_batch_type()

Type of batch.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

property ks

Numbers of clusters.

classmethod name_batch_attrs()

Name the attributes of each batch.

seismicrna.cluster.data.get_clust_params(dataset: ClusterMutsDataset, num_cpus: int = 1)

Get the mutation rates and proportion of each cluster. If table files already exist, use them to obtain the parameters; otherwise, calculate the parameters from the dataset.

class seismicrna.cluster.em.EMRun(uniq_reads: UniqReads, k: int, seed: int | None = None, *, min_iter: int, max_iter: int, em_thresh: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float)

Bases: object

Run expectation-maximization to cluster the given reads into the specified number of clusters.

property bic

Bayesian Information Criterion of the model.

get_resps(batch_num: int)

Cluster memberships of the reads in the batch.

property jackpot_quotient

Jackpotting quotient: the jackpotting score divided by the score of the median null model.

property jackpot_score

Jackpotting score: the expected value of the G-anomaly.

property log_exp

Expected log count of each unique read (with names).

property log_exp_values

Expected log count of each unique read (values only).

property log_like

Return the current log likelihood, which is the last item in the trajectory of log likelihood values.

property max_arcd_vs_ens_avg

Maximum arcsine distance between the mutation rates of the clusters and the ensemble average.

property max_gini

Maximum Gini coefficient among all clusters.

property max_pearson

Maximum Pearson correlation between any pair of clusters.

property min_marcd

Minimum MARCD between any pair of clusters.

property mus

Mutation rate of each position in each cluster, corrected for observer bias.

property mus_ens_avg

Mutation rate of each position in the ensemble average, corrected for observer bias.

property n_reads

Total number of reads (including redundant reads).

property pis

Proportion of each cluster, corrected for observer bias.

property region_end5

5’ end of the region.

property semi_g_anomalies

Semi-G-anomaly of each read.
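
To make the interface above concrete, here is a minimal, hypothetical usage sketch. It assumes that a UniqReads object (see seismicrna.cluster.uniq below) is already available, that the EM fit runs during construction, and that the keyword values shown are merely illustrative rather than recommended settings.

   from seismicrna.cluster.em import EMRun

   # Hypothetical sketch: `uniq_reads` is a UniqReads object built elsewhere,
   # e.g. with UniqReads.from_dataset (see seismicrna.cluster.uniq below).
   run = EMRun(uniq_reads,
               k=2,                    # number of clusters to fit
               seed=42,                # optional random seed
               min_iter=10,            # illustrative values, not defaults
               max_iter=500,
               em_thresh=0.37,
               jackpot=True,
               jackpot_conf_level=0.95,
               max_jackpot_quotient=1.1)

   # Inspect the fitted model through the documented properties.
   print(run.log_like)   # final log likelihood
   print(run.bic)        # Bayesian Information Criterion
   print(run.pis)        # cluster proportions, corrected for observer bias
   print(run.mus)        # per-position mutation rates in each cluster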

class seismicrna.cluster.emk.EMRunsK(runs: list[EMRun], max_pearson_run: float, min_marcd_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float)

Bases: object

One or more EM runs with the same number of clusters.

best_index(**kwargs) → int

Index of the best valid run.

enough_runs_passing(**kwargs)

Whether enough runs passed.

get_valid_index(i: int | list[int] | ndarray, **kwargs)

Index(es) of valid run number(s) i.

loglike_vs_best(**kwargs)

Log likelihood difference between the best and second-best runs.

marcd_vs_best(**kwargs)

Minimum MARCD between the best run and any other run.

n_runs_passing(**kwargs)

Number of runs passing the filters.

passing(**kwargs)

Whether this number of clusters passes the filters.

pearson_vs_best(**kwargs)

Maximum Pearson correlation between the best run and any other run.

run_passing(allow_underclustered: bool = False)

Whether each run passed the filters.

subopt_indexes(**kwargs)

Indexes of the valid suboptimal runs.

summarize(**kwargs)

Summarize the results of the runs.

seismicrna.cluster.emk.assign_clusterings(mus1: ndarray, mus2: ndarray)

Optimally assign clusters from two groups to each other.
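
To illustrate what optimal cluster assignment means (not necessarily how this function implements it), the sketch below pairs the clusters of two runs by maximizing the total Pearson correlation between their mutation-rate profiles with scipy's Hungarian-algorithm solver; the correlation-based cost is an assumption made for illustration.

   import numpy as np
   from scipy.optimize import linear_sum_assignment

   def match_clusters(mus1: np.ndarray, mus2: np.ndarray) -> np.ndarray:
       """Illustrative only: pair each cluster (column) of mus1 with one
       cluster of mus2 so that the summed Pearson correlation is maximal.
       Returns the column of mus2 assigned to each column of mus1."""
       k = mus1.shape[1]
       # Correlation of every cluster in mus1 with every cluster in mus2.
       corr = np.corrcoef(mus1.T, mus2.T)[:k, k:]
       # Maximizing total correlation == minimizing its negative (the cost).
       _, cols = linear_sum_assignment(-corr)
       return cols

For example, with mus1 and mus2 both of shape (positions, clusters), match_clusters(mus1, mus2)[i] is the cluster of mus2 matched to cluster i of mus1.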

seismicrna.cluster.emk.calc_mean_arcsine_distance_clusters(mus1: ndarray, mus2: ndarray)

MARCD between all clusters after matching.

seismicrna.cluster.emk.calc_mean_pearson_clusters(mus1: ndarray, mus2: ndarray)

Pearson correlation between all clusters after matching.

seismicrna.cluster.emk.find_best_k(ks: Iterable[EMRunsK], **kwargs)

Find the best number of clusters.

seismicrna.cluster.emk.get_common_k(runs: list[EMRun])

Find the common number of clusters (k) among EM clustering runs. If the runs have multiple values of k, raise a ValueError.

seismicrna.cluster.emk.sort_runs(runs: list[EMRun])

Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first.

class seismicrna.cluster.io.ClusterBatchIO(*, resps: DataFrame, **kwargs)

Bases: ClusterReadBatch, ReadBatchIO, RegBrickleIO, ClusterIO

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.cluster.io.ClusterBatchWriter(dataset: MaskMutsDataset, ks: list[EMRunsK], brotli_level: int, top: Path, branch: str)

Bases: object

property branches

get_read_nums(batch_num: int)

Get the read numbers for one batch.

property ks_written

write_batches()

Save the batches.

class seismicrna.cluster.io.ClusterFile

Bases: HasRegFilePath, ABC

classmethod get_step()

Step of the workflow.

class seismicrna.cluster.io.ClusterIO

Bases: ClusterFile, RegFileIO, ABC

seismicrna.cluster.jackpot.bootstrap_jackpot_scores(uniq_end5s: ndarray, uniq_end3s: ndarray, counts_per_uniq: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray, real_jackpot_score: float, confidence_level: float, max_jackpot_quotient: float)

Bootstrap jackpotting scores from the null model.

Parameters:
  • uniq_end5s (np.ndarray) – 1D (unique reads) array of the 5’ ends (0-indexed) of the unique reads.

  • uniq_end3s (np.ndarray) – 1D (unique reads) array of the 3’ ends (0-indexed) of the unique reads.

  • counts_per_uniq (np.ndarray) – 1D (unique reads) array of the number of times each unique read was observed.

  • p_mut (np.ndarray) – 2D (positions x clusters) array of the mutation rate at each position in each cluster.

  • p_ends (np.ndarray) – 2D (positions x positions) array of the probability distribution of 5’/3’ end coordinates.

  • p_clust (np.ndarray) – 1D (clusters) probability that a read comes from each cluster.

  • min_mut_gap (int) – Minimum number of non-mutated positions between two mutations.

  • unmasked (np.ndarray) – Positions (0-indexed) of p_mut that are not masked.

  • real_jackpot_score (float) – Jackpotting score of the real dataset.

  • confidence_level (float) – Confidence level for computing a confidence interval of the jackpotting quotient.

  • max_jackpot_quotient (float) – Stop bootstrapping once the confidence interval lies entirely above or entirely below this threshold.

Returns:

Array of normalized G-test statistics for the null model.

Return type:

np.ndarray

seismicrna.cluster.jackpot.calc_jackpot_quotient(real_jackpot_score: float, null_jackpot_score: float)

Calculate the jackpotting quotient.

The jackpotting quotient indicates how much more overrepresented the average read is in the real dataset compared to the null dataset.

Since the jackpotting score is the expected log-ratio of a read’s observed to expected count, exponentiating it yields the (geometric-mean) observed-to-expected ratio, which measures jackpotting:

JS = average{log(O / E)}

exp(JS) = exp(average{log(O / E)})

The jackpotting quotient is then the exponentiated jackpotting score of the real dataset divided by that of the null dataset:

JQ = exp(JS_real) / exp(JS_null) = exp(JS_real - JS_null)

Parameters:
  • real_jackpot_score (float) – Jackpotting score of the real dataset.

  • null_jackpot_score (float) – Jackpotting score of the null model.

Returns:

Jackpotting quotient

Return type:

float
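
As a worked example of the formula above, using made-up jackpotting scores purely for illustration:

   import math

   # Hypothetical jackpotting scores (not real data).
   js_real = 0.25   # real dataset
   js_null = 0.10   # median null model

   # JQ = exp(JS_real) / exp(JS_null) = exp(JS_real - JS_null)
   jq = math.exp(js_real - js_null)
   print(round(jq, 4))   # 1.1618: the average read is about 16% more
                         # overrepresented than expected under the null model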

seismicrna.cluster.jackpot.calc_jackpot_score(semi_g_anomalies: ndarray, n_reads: int)

Calculate the jackpotting score.

The jackpotting score is defined as the average log of the ratio of observed to expected counts for a read, weighted by the count of each read:

JS = sum{O_i * log(O_i / E_i)} / sum{O_i}, where i indexes each unique read.

This formula is based on that of the G-test statistic, which is identical except that the latter is multiplied by 2 and not divided by the number of observations.
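
The defining formula can be reproduced directly with NumPy. This is an illustrative reimplementation with made-up counts, not a call to the package's own routine:

   import numpy as np

   # Hypothetical observed (O) and expected (E) counts of unique reads.
   obs = np.array([5.0, 2.0, 1.0, 1.0])
   exp = np.array([3.2, 2.5, 1.6, 0.7])

   # JS = sum{O_i * log(O_i / E_i)} / sum{O_i}
   js = np.sum(obs * np.log(obs / exp)) / np.sum(obs)

   # The G-test statistic differs only by a factor of 2 and by not dividing
   # by the number of observations: G = 2 * sum{O_i * log(O_i / E_i)}.
   g = 2.0 * np.sum(obs * np.log(obs / exp))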

seismicrna.cluster.jackpot.calc_jackpot_score_ci(jackpot_scores: Iterable[float], confidence_level: float)

Calculate the confidence interval of the mean of an array of jackpotting scores.

seismicrna.cluster.jackpot.calc_semi_g_anomaly(num_obs: int | ndarray, log_exp: float | ndarray)

Calculate each read’s semi-G-anomaly, i.e. half its contribution to the G-test statistic.
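
Per the definition above, each semi-G-anomaly is O_i * log(O_i / E_i). A sketch in terms of the documented arguments, assuming log_exp is the natural log of the expected count (an assumption, not confirmed by this page):

   import numpy as np

   def semi_g_anomaly_sketch(num_obs: np.ndarray, log_exp: np.ndarray) -> np.ndarray:
       """Illustrative: half of each read's contribution to the G-test
       statistic, i.e. O * log(O / E) = O * (log(O) - log(E))."""
       return num_obs * (np.log(num_obs) - log_exp)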

seismicrna.cluster.jackpot.linearize_ends_matrix(p_ends: ndarray)

Linearize an N x N matrix of end coordinate probabilities/counts into a length-N list of 5’/3’ coordinates and their values.

seismicrna.cluster.jackpot.sim_obs_exp(end5s: ndarray, end3s: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray)

Simulate observed and expected counts.

seismicrna.cluster.main.run(input_path: Iterable[str | Path], *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, brotli_level: int = 10, num_cpus: int = 4, force: bool = False) → list[Path]

Infer alternative structures by clustering reads’ mutations.

Parameters:
  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]

  • tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • min_clusters (int) – Start at this many clusters [keyword-only, default: 1]

  • max_clusters (int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]

  • min_em_runs (int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]

  • max_em_runs (int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]

  • min_em_iter (int) – Run EM for at least this many iterations [keyword-only, default: 10]

  • max_em_iter (int) – Run EM for at most this many iterations [keyword-only, default: 500]

  • em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]

  • min_marcd_run (float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]

  • max_pearson_run (float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]

  • max_arcd_vs_ens_avg (float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]

  • max_gini_run (float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]

  • jackpot (bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]

  • jackpot_conf_level (float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]

  • max_jackpot_quotient (float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]

  • max_loglike_vs_best (float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]

  • min_pearson_vs_best (float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]

  • max_marcd_vs_best (float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]

  • try_all_ks (bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]

  • write_all_ks (bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]

  • cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]

  • cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]

  • verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
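
A minimal usage sketch for this function is shown below. The input path is hypothetical and should point to report files produced by an earlier Mask step; the keyword values are illustrative, not recommendations.

   from seismicrna.cluster.main import run

   # Hypothetical input: report files from the Mask step of the workflow.
   cluster_reports = run(
       ["out/my_sample/mask/my_ref/full"],  # illustrative path
       max_clusters=3,      # try k = 1, 2, 3 (0 would mean no limit)
       min_em_runs=6,       # independent EM runs per number of clusters
       num_cpus=4,
       force=False,
   )
   print(cluster_reports)   # list of Path objects for the cluster reports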

seismicrna.cluster.marginal.calc_marginal(*args, **kwargs)

seismicrna.cluster.marginal.calc_marginal_resps(*args, **kwargs)

seismicrna.cluster.obsexp.assemble_log_obs_exp(uniq_reads: UniqReads, ks: list[EMRunsK])

Assemble the expected and observed log counts of each read.

seismicrna.cluster.obsexp.format_exp_count_col(k: int)

seismicrna.cluster.obsexp.graph_log_obs_exp(log_obs_exp: DataFrame, ks: list[EMRunsK], to_dir: Path)

Graph the expected vs. observed log counts of unique reads.

seismicrna.cluster.obsexp.table_log_obs_exp(log_obs_exp: DataFrame, to_dir: Path)

Write the expected and observed log counts of unique reads.

seismicrna.cluster.obsexp.write_obs_exp_counts(uniq_reads: UniqReads, ks: list[EMRunsK], to_dir: Path)

seismicrna.cluster.params.get_table_path(top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, table: str, k: int, run: int)

Build a path for a table of clustering results.

seismicrna.cluster.params.write_mus(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)

seismicrna.cluster.params.write_pis(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)

seismicrna.cluster.params.write_single_run_table(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int, *, attr: str, table: str)

Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file.

class seismicrna.cluster.report.BaseClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: RegReport, ClusterIO, ABC

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, BaseClusterReport

classmethod from_clusters(ks: list[EMRunsK], uniq_reads: UniqReads, *, min_clusters: int, max_clusters: int, min_em_runs: int, max_em_runs: int, try_all_ks: bool, write_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, ks_written: list[int], checksums: list[str], began: datetime, ended: datetime)

Create a ClusterReport from EmClustering objects.

classmethod get_file_seg_type()

Type of the last segment in the path.

classmethod get_param_report_fields()

Parameter fields of the report.

classmethod get_result_report_fields()

Result fields of the report.

class seismicrna.cluster.report.JoinClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: JoinReport, BaseClusterReport

classmethod get_param_report_fields()

Parameter fields of the report.

seismicrna.cluster.summary.graph_attr(attr: Series, passing_text: list[str] | None = None)

Graph one attribute.

seismicrna.cluster.summary.graph_attrs(table: DataFrame, to_dir: Path)

Graph every attribute.

seismicrna.cluster.summary.tabulate(ks: list[EMRunsK])

Tabulate all attributes.

seismicrna.cluster.summary.tabulate_attr(ks: list[EMRunsK], attr: str)

Tabulate the values for one attribute.

seismicrna.cluster.summary.write_summaries(ks: list[EMRunsK], to_dir: Path)

seismicrna.cluster.summary.write_table(table: DataFrame, cluster_dir: Path)

class seismicrna.cluster.uniq.UniqReads(sample: str, branches: dict[str, str], region: Region, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)

Bases: EndCoords

Collection of bit vectors of unique reads.

classmethod from_dataset(dataset: MaskMutsDataset, branch: str, **kwargs)

Get unique reads from a dataset.

classmethod from_dataset_contig(dataset: MaskMutsDataset, branch: str)

Get unique reads from a dataset of contiguous reads.

get_cov_matrix()

Full boolean matrix of the covered positions.

get_mut_matrix()

Full boolean matrix of the mutations.

property log_obs

Log number of times each read was observed.

property num_batches

Number of batches.

property num_nonuniq: int

Total number of reads (including non-unique reads).

property num_obs

Number of times each read was observed.

property num_uniq

Number of unique reads.

property read_end3s_zero

3’ end coordinates (0-indexed in the region).

property read_end5s_zero

5’ end coordinates (0-indexed in the region).

property ref

Reference name.

property seg_end3s_zero

3’ end of every segment (0-indexed in the region).

property seg_end5s_zero

5’ end of every segment (0-indexed in the region).

property uniq_names

Unique read names as byte strings.
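
A hedged sketch of the typical entry point, assuming a MaskMutsDataset has already been loaded elsewhere (the loading API is outside the scope of this page):

   from seismicrna.cluster.uniq import UniqReads

   # `mask_dataset` is a MaskMutsDataset obtained from the Mask step;
   # loading it is not shown here.
   uniq_reads = UniqReads.from_dataset(mask_dataset, branch="")

   print(uniq_reads.num_uniq)     # number of unique reads
   print(uniq_reads.num_nonuniq)  # total reads, counting duplicates
   print(uniq_reads.num_obs)      # times each unique read was observed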

seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[RegionMutsBatch], **kwargs)

seismicrna.cluster.write.cluster(dataset: MaskMutsDataset | JoinMaskMutsDataset, *, branch: str, min_clusters: int, max_clusters: int, try_all_ks: bool, write_all_ks: bool, max_em_runs: int, num_cpus: int, brotli_level: int, force: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, tmp_pfx, keep_tmp, **kwargs)

Cluster unique reads from one mask dataset.

seismicrna.cluster.write.run_k(uniq_reads: UniqReads, k: int, num_runs: int, *, num_cpus: int, **kwargs) → list[EMRun]

Run EM with a specific number of clusters.

seismicrna.cluster.write.run_ks(uniq_reads: UniqReads, ks: Iterable[int], *, min_em_runs: int, max_em_runs: int, try_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, num_cpus: int, top: Path, **kwargs)

Run EM with multiple numbers of clusters.