seismicrna.cluster package

Submodules

class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)

Bases: ClusterReadBatch, PartialRegionMutsBatch

property read_weights

Weights for each read when computing counts.

class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)

Bases: PartialReadBatch

property num_reads: Series

Number of reads.

property read_nums

Read numbers.

class seismicrna.cluster.data.ClusterAbundanceTable

Bases: AbundanceTable, PartialTable, ClusterFile, ABC

classmethod get_header_type()

Type of the header for the table.

classmethod get_index_depth()

Number of columns in the index.

classmethod get_load_function()

LoadFunction for all Dataset types for this Table.

property proportions

Proportion of each item.

class seismicrna.cluster.data.ClusterAbundanceTableLoader(table_file: str | Path, **kwargs)

Bases: TableLoader, ClusterAbundanceTable

Load cluster data indexed by cluster.

property data: Series

Table’s data.

class seismicrna.cluster.data.ClusterAbundanceTableWriter(tabulator: Tabulator)

Bases: AbundanceTableWriter, ClusterAbundanceTable

class seismicrna.cluster.data.ClusterBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, num_cpus: int = 1, **kwargs)

Bases: BatchTabulator, ClusterTabulator

class seismicrna.cluster.data.ClusterCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)

Bases: CountTabulator, ClusterTabulator

class seismicrna.cluster.data.ClusterDataset(report_file: str | Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset of clustered data.

class seismicrna.cluster.data.ClusterDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: PartialDatasetTabulator, ClusterTabulator

classmethod init_kws()

Attributes of the dataset to use as keyword arguments in super().__init__().

class seismicrna.cluster.data.ClusterMutsDataset(dataset2_report_file: Path, **kwargs)

Bases: ClusterDataset, MultistepDataset, UnbiasDataset

Merge cluster responsibilities with mutation data.

property best_k

Best number of clusters.

classmethod get_dataset1_load_func()

Function to load Dataset 1.

classmethod get_dataset2_type()

Type of Dataset 2.

property ks

Numbers of clusters.

property min_mut_gap

Minimum gap between two mutations.

property pattern

Pattern of mutations to count.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

property region

Region of the dataset.

class seismicrna.cluster.data.ClusterPositionTable

Bases: ClusterTable, PartialPositionTable, ABC

class seismicrna.cluster.data.ClusterPositionTableLoader(table_file: str | Path, **kwargs)

Bases: PositionTableLoader, ClusterPositionTable

Load cluster data indexed by position.

class seismicrna.cluster.data.ClusterPositionTableWriter(tabulator: Tabulator)

Bases: PositionTableWriter, ClusterPositionTable

class seismicrna.cluster.data.ClusterReadDataset(report_file: str | Path, verify_times: bool = True)

Bases: ClusterDataset, LoadedDataset

Load clustering results.

property best_k

Best number of clusters.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property ks

Numbers of clusters.

property pattern

Pattern of mutations to count.

class seismicrna.cluster.data.ClusterTable

Bases: RelTypeTable, ClusterFile, ABC

classmethod get_header_type()

Type of the header for the table.

classmethod get_load_function()

LoadFunction for all Dataset types for this Table.

class seismicrna.cluster.data.ClusterTabulator(*, ks: list[int], **kwargs)

Bases: PartialTabulator, ABC

property clust_header

Header of the per-cluster data.

property data_per_clust

Number of reads in each cluster.

classmethod table_types()

Types of tables that this tabulator can write.

class seismicrna.cluster.data.JoinClusterMutsDataset(*args, **kwargs)

Bases: ClusterDataset, JoinMutsDataset, MergedUnbiasDataset

property best_k

Best number of clusters.

property clusts

Index of k and cluster numbers.

classmethod get_batch_type()

Type of batch.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

property ks

Numbers of clusters.

classmethod name_batch_attrs()

Name the attributes of each batch.

seismicrna.cluster.data.get_clust_params(dataset: ClusterMutsDataset, num_cpus: int = 1)

Get the mutation rates and proportion of each cluster. If table files already exist, use them to obtain the parameters; otherwise, calculate the parameters from the dataset.

class seismicrna.cluster.em.EMRun(uniq_reads: UniqReads, k: int, seed: int | None = None, *, min_iter: int, max_iter: int, em_thresh: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float)

Bases: object

Run expectation-maximization to cluster the given reads into the specified number of clusters.

property bic

Bayesian Information Criterion of the model.

get_resps(batch_num: int)

Cluster memberships of the reads in the batch.

property jackpot_quotient

Jackpotting quotient: the jackpotting score divided by the score of the median null model.

property jackpot_score

Jackpotting score: the expected value of the G-anomaly.

property log_exp

Expected log count of each unique read (with names).

property log_exp_values

Expected log count of each unique read (values only).

property log_like

Return the current log likelihood, which is the last item in the trajectory of log likelihood values.

property max_arcd_vs_ens_avg

Maximum arcsine distance between the mutation rates of the clusters and the ensemble average.

property max_gini

Maximum Gini coefficient among all clusters.

property max_pearson

Maximum Pearson correlation between any pair of clusters.

property min_marcd

Minimum MARCD between any pair of clusters.

property mus

Mutation rate of each position in each cluster, corrected for observer bias.

property mus_ens_avg

Mutation rate of each position in the ensemble average, corrected for observer bias.

property n_reads

Total number of reads (including redundant reads).

property pis

Proportion of each cluster, corrected for observer bias.

property region_end5

5’ end of the region.

property semi_g_anomalies

Semi-G-anomaly of each read.
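
To make the interface above concrete, here is a minimal, hypothetical usage sketch. It assumes that a UniqReads object (see seismicrna.cluster.uniq below) is already available, that the EM fit runs during construction, and that the keyword values shown are merely illustrative rather than recommended settings.

   from seismicrna.cluster.em import EMRun

   # Hypothetical sketch: `uniq_reads` is a UniqReads object built elsewhere,
   # e.g. with UniqReads.from_dataset (see seismicrna.cluster.uniq below).
   run = EMRun(uniq_reads,
               k=2,                    # number of clusters to fit
               seed=42,                # optional random seed
               min_iter=10,            # illustrative values, not defaults
               max_iter=500,
               em_thresh=0.37,
               jackpot=True,
               jackpot_conf_level=0.95,
               max_jackpot_quotient=1.1)

   # Inspect the fitted model through the documented properties.
   print(run.log_like)   # final log likelihood
   print(run.bic)        # Bayesian Information Criterion
   print(run.pis)        # cluster proportions, corrected for observer bias
   print(run.mus)        # per-position mutation rates in each cluster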

class seismicrna.cluster.emk.EMRunsK(runs: list[EMRun], max_pearson_run: float, min_marcd_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float)

Bases: object

One or more EM runs with the same number of clusters.

best_index(**kwargs) → int

Index of the best valid run.

enough_runs_passing(**kwargs)

Whether enough runs passed.

get_valid_index(i: int | list[int] | ndarray, **kwargs)

Index(es) of valid run number(s) i.

loglike_vs_best(**kwargs)

Log likelihood difference between the best and second-best runs.

marcd_vs_best(**kwargs)

Minimum MARCD between the best run and any other run.

n_runs_passing(**kwargs)

Number of runs passing the filters.

passing(**kwargs)

Whether this number of clusters passes the filters.

pearson_vs_best(**kwargs)

Maximum Pearson correlation between the best run and any other run.

run_passing(allow_underclustered: bool = False)

Whether each run passed the filters.

subopt_indexes(**kwargs)

Indexes of the valid suboptimal runs.

summarize(**kwargs)

Summarize the results of the runs.

seismicrna.cluster.emk.assign_clusterings(mus1: ndarray, mus2: ndarray)

Optimally assign clusters from two groups to each other.
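
To illustrate what optimal cluster assignment means (not necessarily how this function implements it), the sketch below pairs the clusters of two runs by maximizing the total Pearson correlation between their mutation-rate profiles with scipy's Hungarian-algorithm solver; the correlation-based cost is an assumption made for illustration.

   import numpy as np
   from scipy.optimize import linear_sum_assignment

   def match_clusters(mus1: np.ndarray, mus2: np.ndarray) -> np.ndarray:
       """Illustrative only: pair each cluster (column) of mus1 with one
       cluster of mus2 so that the summed Pearson correlation is maximal.
       Returns the column of mus2 assigned to each column of mus1."""
       k = mus1.shape[1]
       # Correlation of every cluster in mus1 with every cluster in mus2.
       corr = np.corrcoef(mus1.T, mus2.T)[:k, k:]
       # Maximizing total correlation == minimizing its negative (the cost).
       _, cols = linear_sum_assignment(-corr)
       return cols

For example, with mus1 and mus2 both of shape (positions, clusters), match_clusters(mus1, mus2)[i] is the cluster of mus2 matched to cluster i of mus1.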

seismicrna.cluster.emk.calc_mean_arcsine_distance_clusters(mus1: ndarray, mus2: ndarray)

MARCD between all clusters after matching.

seismicrna.cluster.emk.calc_mean_pearson_clusters(mus1: ndarray, mus2: ndarray)

Pearson correlation between all clusters after matching.

seismicrna.cluster.emk.find_best_k(ks: Iterable[EMRunsK], **kwargs)

Find the best number of clusters.

seismicrna.cluster.emk.get_common_k(runs: list[EMRun])

Find the common number of clusters (k) among EM clustering runs. If the runs have multiple values of k, raise a ValueError.

seismicrna.cluster.emk.sort_runs(runs: list[EMRun])

Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first.

class seismicrna.cluster.io.ClusterBatchIO(*, resps: DataFrame, **kwargs)

Bases: ClusterReadBatch, ReadBatchIO, RegBrickleIO, ClusterIO

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.cluster.io.ClusterBatchWriter(dataset: MaskMutsDataset, ks: list[EMRunsK], brotli_level: int, top: Path, branch: str)

Bases: object

property branches

get_read_nums(batch_num: int)

Get the read numbers for one batch.

property ks_written

write_batches()

Save the batches.

class seismicrna.cluster.io.ClusterFile

Bases: HasRegFilePath, ABC

classmethod get_step()

Step of the workflow.

class seismicrna.cluster.io.ClusterIO

Bases: ClusterFile, RegFileIO, ABC

seismicrna.cluster.jackpot.bootstrap_jackpot_scores(uniq_end5s: ndarray, uniq_end3s: ndarray, counts_per_uniq: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray, real_jackpot_score: float, confidence_level: float, max_jackpot_quotient: float)

Bootstrap jackpotting scores from the null model.

Parameters:
  • uniq_end5s (np.ndarray) – 1D (unique reads) array of the 5’ ends (0-indexed) of the unique reads.

  • uniq_end3s (np.ndarray) – 1D (unique reads) array of the 3’ ends (0-indexed) of the unique reads.

  • counts_per_uniq (np.ndarray) – 1D (unique reads) array of the number of times each unique read was observed.

  • p_mut (np.ndarray) – 2D (positions x clusters) array of the mutation rate at each position in each cluster.

  • p_ends (np.ndarray) – 2D (positions x positions) array of the probability distribution of 5’/3’ end coordinates.

  • p_clust (np.ndarray) – 1D (clusters) probability that a read comes from each cluster.

  • min_mut_gap (int) – Minimum number of non-mutated positions between two mutations.

  • unmasked (np.ndarray) – Positions (0-indexed) of p_mut that are not masked.

  • real_jackpot_score (float) – Jackpotting score of the real dataset.

  • confidence_level (float) – Confidence level for computing a confidence interval of the jackpotting quotient.

  • max_jackpot_quotient (float) – Stop bootstrapping once the confidence interval lies entirely above or entirely below this threshold.

Returns:

Array of normalized G-test statistics for the null model.

Return type:

np.ndarray

seismicrna.cluster.jackpot.calc_jackpot_quotient(real_jackpot_score: float, null_jackpot_score: float)

Calculate the jackpotting quotient.

The jackpotting quotient indicates how much more overrepresented the average read is in the real dataset compared to the null dataset.

Since the jackpotting score is the expected log-ratio of a read’s observed to expected count, exponentiating it yields the (geometric-mean) observed-to-expected ratio, which measures jackpotting:

JS = average{log(O / E)}

exp(JS) = exp(average{log(O / E)})

The jackpotting quotient is then the exponentiated jackpotting score of the real dataset divided by that of the null dataset:

JQ = exp(JS_real) / exp(JS_null) = exp(JS_real - JS_null)

Parameters:
  • real_jackpot_score (float) – Jackpotting score of the real dataset.

  • null_jackpot_score (float) – Jackpotting score of the null model.

Returns:

Jackpotting quotient

Return type:

float
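
As a worked example of the formula above, using made-up jackpotting scores purely for illustration:

   import math

   # Hypothetical jackpotting scores (not real data).
   js_real = 0.25   # real dataset
   js_null = 0.10   # median null model

   # JQ = exp(JS_real) / exp(JS_null) = exp(JS_real - JS_null)
   jq = math.exp(js_real - js_null)
   print(round(jq, 4))   # 1.1618: the average read is about 16% more
                         # overrepresented than expected under the null model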

seismicrna.cluster.jackpot.calc_jackpot_score(semi_g_anomalies: ndarray, n_reads: int)

Calculate the jackpotting score.

The jackpotting score is defined as the average log of the ratio of observed to expected counts for a read, weighted by the count of each read:

JS = sum{O_i * log(O_i / E_i)} / sum{O_i}, where i indexes each unique read.

This formula is based on that of the G-test statistic, which is identical except that the latter is multiplied by 2 and not divided by the number of observations.
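
The defining formula can be reproduced directly with NumPy. This is an illustrative reimplementation with made-up counts, not a call to the package's own routine:

   import numpy as np

   # Hypothetical observed (O) and expected (E) counts of unique reads.
   obs = np.array([5.0, 2.0, 1.0, 1.0])
   exp = np.array([3.2, 2.5, 1.6, 0.7])

   # JS = sum{O_i * log(O_i / E_i)} / sum{O_i}
   js = np.sum(obs * np.log(obs / exp)) / np.sum(obs)

   # The G-test statistic differs only by a factor of 2 and by not dividing
   # by the number of observations: G = 2 * sum{O_i * log(O_i / E_i)}.
   g = 2.0 * np.sum(obs * np.log(obs / exp))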

seismicrna.cluster.jackpot.calc_jackpot_score_ci(jackpot_scores: Iterable[float], confidence_level: float)

Calculate the confidence interval of the mean of an array of jackpotting scores.

seismicrna.cluster.jackpot.calc_semi_g_anomaly(num_obs: int | ndarray, log_exp: float | ndarray)

Calculate each read’s semi-G-anomaly, i.e. half its contribution to the G-test statistic.
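
Per the definition above, each semi-G-anomaly is O_i * log(O_i / E_i). A sketch in terms of the documented arguments, assuming log_exp is the natural log of the expected count (an assumption, not confirmed by this page):

   import numpy as np

   def semi_g_anomaly_sketch(num_obs: np.ndarray, log_exp: np.ndarray) -> np.ndarray:
       """Illustrative: half of each read's contribution to the G-test
       statistic, i.e. O * log(O / E) = O * (log(O) - log(E))."""
       return num_obs * (np.log(num_obs) - log_exp)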

seismicrna.cluster.jackpot.linearize_ends_matrix(p_ends: ndarray)

Linearize an N x N matrix of end coordinate probabilities/counts into a length-N list of 5’/3’ coordinates and their values.

seismicrna.cluster.jackpot.sim_obs_exp(end5s: ndarray, end3s: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray)

Simulate observed and expected counts.

seismicrna.cluster.main.run(input_path: Iterable[str | Path], *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, brotli_level: int = 10, num_cpus: int = 4, force: bool = False) → list[Path]

Infer alternative structures by clustering reads’ mutations.

Parameters:
  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]

  • tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • min_clusters (int) – Start at this many clusters [keyword-only, default: 1]

  • max_clusters (int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]

  • min_em_runs (int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]

  • max_em_runs (int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]

  • min_em_iter (int) – Run EM for at least this many iterations [keyword-only, default: 10]

  • max_em_iter (int) – Run EM for at most this many iterations [keyword-only, default: 500]

  • em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]

  • min_marcd_run (float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]

  • max_pearson_run (float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]

  • max_arcd_vs_ens_avg (float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]

  • max_gini_run (float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]

  • jackpot (bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]

  • jackpot_conf_level (float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]

  • max_jackpot_quotient (float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]

  • max_loglike_vs_best (float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]

  • min_pearson_vs_best (float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]

  • max_marcd_vs_best (float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]

  • try_all_ks (bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]

  • write_all_ks (bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]

  • cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]

  • cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]

  • verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
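
A minimal usage sketch for this function is shown below. The input path is hypothetical and should point to report files produced by an earlier Mask step; the keyword values are illustrative, not recommendations.

   from seismicrna.cluster.main import run

   # Hypothetical input: report files from the Mask step of the workflow.
   cluster_reports = run(
       ["out/my_sample/mask/my_ref/full"],  # illustrative path
       max_clusters=3,      # try k = 1, 2, 3 (0 would mean no limit)
       min_em_runs=6,       # independent EM runs per number of clusters
       num_cpus=4,
       force=False,
   )
   print(cluster_reports)   # list of Path objects for the cluster reports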

seismicrna.cluster.marginal.calc_marginal(*args, **kwargs)

seismicrna.cluster.marginal.calc_marginal_resps(*args, **kwargs)

seismicrna.cluster.obsexp.assemble_log_obs_exp(uniq_reads: UniqReads, ks: list[EMRunsK])

Assemble the expected and observed log counts of each read.

seismicrna.cluster.obsexp.format_exp_count_col(k: int)

seismicrna.cluster.obsexp.graph_log_obs_exp(log_obs_exp: DataFrame, ks: list[EMRunsK], to_dir: Path)

Graph the expected vs. observed log counts of unique reads.

seismicrna.cluster.obsexp.table_log_obs_exp(log_obs_exp: DataFrame, to_dir: Path)

Write the expected and observed log counts of unique reads.

seismicrna.cluster.obsexp.write_obs_exp_counts(uniq_reads: UniqReads, ks: list[EMRunsK], to_dir: Path)

seismicrna.cluster.params.get_table_path(top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, table: str, k: int, run: int)

Build a path for a table of clustering results.

seismicrna.cluster.params.write_mus(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)

seismicrna.cluster.params.write_pis(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)

seismicrna.cluster.params.write_single_run_table(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int, *, attr: str, table: str)

Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file.

class seismicrna.cluster.report.BaseClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: RegReport, ClusterIO, ABC

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, BaseClusterReport

classmethod from_clusters(ks: list[EMRunsK], uniq_reads: UniqReads, *, min_clusters: int, max_clusters: int, min_em_runs: int, max_em_runs: int, try_all_ks: bool, write_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, ks_written: list[int], checksums: list[str], began: datetime, ended: datetime)

Create a ClusterReport from EmClustering objects.

classmethod get_file_seg_type()

Type of the last segment in the path.

classmethod get_param_report_fields()

Parameter fields of the report.

classmethod get_result_report_fields()

Result fields of the report.

class seismicrna.cluster.report.JoinClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: JoinReport, BaseClusterReport

classmethod get_param_report_fields()

Parameter fields of the report.

seismicrna.cluster.summary.graph_attr(attr: Series, passing_text: list[str] | None = None)

Graph one attribute.

seismicrna.cluster.summary.graph_attrs(table: DataFrame, to_dir: Path)

Graph every attribute.

seismicrna.cluster.summary.tabulate(ks: list[EMRunsK])

Tabulate all attributes.

seismicrna.cluster.summary.tabulate_attr(ks: list[EMRunsK], attr: str)

Tabulate the values for one attribute.

seismicrna.cluster.summary.write_summaries(ks: list[EMRunsK], to_dir: Path)

seismicrna.cluster.summary.write_table(table: DataFrame, cluster_dir: Path)

class seismicrna.cluster.uniq.UniqReads(sample: str, branches: dict[str, str], region: Region, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)

Bases: EndCoords

Collection of bit vectors of unique reads.

classmethod from_dataset(dataset: MaskMutsDataset, branch: str, **kwargs)

Get unique reads from a dataset.

classmethod from_dataset_contig(dataset: MaskMutsDataset, branch: str)

Get unique reads from a dataset of contiguous reads.

get_cov_matrix()

Full boolean matrix of the covered positions.

get_mut_matrix()

Full boolean matrix of the mutations.

property log_obs

Log number of times each read was observed.

property num_batches

Number of batches.

property num_nonuniq: int

Total number of reads (including non-unique reads).

property num_obs

Number of times each read was observed.

property num_uniq

Number of unique reads.

property read_end3s_zero

3’ end coordinates (0-indexed in the region).

property read_end5s_zero

5’ end coordinates (0-indexed in the region).

property ref

Reference name.

property seg_end3s_zero

3’ end of every segment (0-indexed in the region).

property seg_end5s_zero

5’ end of every segment (0-indexed in the region).

property uniq_names

Unique read names as byte strings.
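
A hedged sketch of the typical entry point, assuming a MaskMutsDataset has already been loaded elsewhere (the loading API is outside the scope of this page):

   from seismicrna.cluster.uniq import UniqReads

   # `mask_dataset` is a MaskMutsDataset obtained from the Mask step;
   # loading it is not shown here.
   uniq_reads = UniqReads.from_dataset(mask_dataset, branch="")

   print(uniq_reads.num_uniq)     # number of unique reads
   print(uniq_reads.num_nonuniq)  # total reads, counting duplicates
   print(uniq_reads.num_obs)      # times each unique read was observed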

seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[RegionMutsBatch], **kwargs)

seismicrna.cluster.write.cluster(dataset: MaskMutsDataset | JoinMaskMutsDataset, *, branch: str, min_clusters: int, max_clusters: int, try_all_ks: bool, write_all_ks: bool, max_em_runs: int, num_cpus: int, brotli_level: int, force: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, tmp_pfx, keep_tmp, **kwargs)

Cluster unique reads from one mask dataset.

seismicrna.cluster.write.run_k(uniq_reads: UniqReads, k: int, num_runs: int, *, num_cpus: int, **kwargs) → list[EMRun]

Run EM with a specific number of clusters.

seismicrna.cluster.write.run_ks(uniq_reads: UniqReads, ks: Iterable[int], *, min_em_runs: int, max_em_runs: int, try_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, num_cpus: int, top: Path, **kwargs)

Run EM with multiple numbers of clusters.