seismicrna.cluster package
Subpackages
- seismicrna.cluster.tests package- Submodules- TestAssignClusterings
- TestBootstrapJackpotScores- TestBootstrapJackpotScores.REF
- TestBootstrapJackpotScores.REFS
- TestBootstrapJackpotScores.SAMPLE
- TestBootstrapJackpotScores.SIM_DIR
- TestBootstrapJackpotScores.calc_confidence_interval()
- TestBootstrapJackpotScores.setUp()
- TestBootstrapJackpotScores.sim_jackpot_quotient()
- TestBootstrapJackpotScores.tearDown()
- TestBootstrapJackpotScores.test_ideal_jackpot()
 
- TestCalcSemiGAnomaly
- TestSimClusters
- TestSimReads
- TestMarginalResps
 
 
- Submodules
Submodules
- class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)
- Bases: - ClusterReadBatch,- PartialRegionMutsBatch- property read_weights
- Weights for each read when computing counts. 
 
- class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)
- Bases: - PartialReadBatch- property num_reads: Series
- Number of reads. 
 - property read_nums
- Read numbers. 
 
- class seismicrna.cluster.data.ClusterAbundanceTable
- Bases: - AbundanceTable,- PartialTable,- ClusterFile,- ABC- classmethod get_header_type()
- Type of the header for the table. 
 - classmethod get_index_depth()
- Number of columns in the index. 
 - classmethod get_load_function()
- LoadFunction for all Dataset types for this Table. 
 - property proportions
- Proportion of each item. 
 
- class seismicrna.cluster.data.ClusterAbundanceTableLoader(table_file: str | Path, **kwargs)
- Bases: - TableLoader,- ClusterAbundanceTable- Load cluster data indexed by cluster. - property data: Series
- Table’s data. 
 
- class seismicrna.cluster.data.ClusterBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, num_cpus: int = 1, **kwargs)
- Bases: - BatchTabulator,- ClusterTabulator
- class seismicrna.cluster.data.ClusterCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)
- Bases: - CountTabulator,- ClusterTabulator
- class seismicrna.cluster.data.ClusterDataset(report_file: str | Path, verify_times: bool = True)
- 
Dataset of clustered data. 
- class seismicrna.cluster.data.ClusterDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)
- Bases: - PartialDatasetTabulator,- ClusterTabulator- classmethod init_kws()
- Attributes of the dataset to use as keyword arguments in super().__init__(). 
 
- class seismicrna.cluster.data.ClusterMutsDataset(dataset2_report_file: Path, **kwargs)
- Bases: - ClusterDataset,- MultistepDataset,- UnbiasDataset- Merge cluster responsibilities with mutation data. - property best_k
- Best number of clusters. 
 - classmethod get_dataset1_load_func()
- Function to load Dataset 1. 
 - classmethod get_dataset2_type()
- Type of Dataset 2. 
 - property ks
- Numbers of clusters. 
 - property min_mut_gap
- Minimum gap between two mutations. 
 - property pattern
- Pattern of mutations to count. 
 - property quick_unbias
- Use the quick heuristic for unbiasing. 
 - property quick_unbias_thresh
- Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing. 
 - property region
- Region of the dataset. 
 
- class seismicrna.cluster.data.ClusterPositionTable
- Bases: - ClusterTable,- PartialPositionTable,- ABC
- class seismicrna.cluster.data.ClusterPositionTableLoader(table_file: str | Path, **kwargs)
- Bases: - PositionTableLoader,- ClusterPositionTable- Load cluster data indexed by position. 
- class seismicrna.cluster.data.ClusterReadDataset(report_file: str | Path, verify_times: bool = True)
- Bases: - ClusterDataset,- LoadedDataset- Load clustering results. - property best_k
- Best number of clusters. 
 - classmethod get_batch_type()
- Type of batch. 
 - classmethod get_report_type()
- Type of report. 
 - property ks
- Numbers of clusters. 
 - property pattern
- Pattern of mutations to count. 
 
- class seismicrna.cluster.data.ClusterTable
- Bases: - RelTypeTable,- ClusterFile,- ABC- classmethod get_header_type()
- Type of the header for the table. 
 - classmethod get_load_function()
- LoadFunction for all Dataset types for this Table. 
 
- class seismicrna.cluster.data.ClusterTabulator(*, ks: list[int], **kwargs)
- Bases: - PartialTabulator,- ABC- property clust_header
- Header of the per-cluster data. 
 - property data_per_clust
- Number of reads in each cluster. 
 - classmethod table_types()
- Types of tables that this tabulator can write. 
 
- class seismicrna.cluster.data.JoinClusterMutsDataset(*args, **kwargs)
- Bases: - ClusterDataset,- JoinMutsDataset,- MergedUnbiasDataset- property best_k
- Best number of clusters. 
 - property clusts
- Index of k and cluster numbers. 
 - classmethod get_batch_type()
- Type of batch. 
 - classmethod get_dataset_load_func()
- Function to load one constituent dataset. 
 - classmethod get_report_type()
- Type of report. 
 - property ks
- Numbers of clusters. 
 - classmethod name_batch_attrs()
- Name the attributes of each batch. 
 
- seismicrna.cluster.data.get_clust_params(dataset: ClusterMutsDataset, num_cpus: int = 1)
- Get the mutation rates and proportion for each cluster. If table files already exist, then use them to get the parameters; otherwise, calculate the parameters from the dataset. 
- class seismicrna.cluster.em.EMRun(uniq_reads: UniqReads, k: int, seed: int | None = None, *, min_iter: int, max_iter: int, em_thresh: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float)
- Bases: - object- Run expectation-maximization to cluster the given reads into the specified number of clusters. - property bic
- Bayesian Information Criterion of the model. 
 - property jackpot_quotient
- jackpotting score divided by the score of the median null model. - Type:
- Jackpotting quotient 
 
 - property jackpot_score
- expected value of the G-anomaly. - Type:
- Jackpotting score 
 
 - property log_exp
- Expected log count of each unique read (with names). 
 - property log_exp_values
- Expected log count of each unique read (values only). 
 - property log_like
- Return the current log likelihood, which is the last item in the trajectory of log likelihood values. 
 - property max_arcd_vs_ens_avg
- Maximum arcsine distance between the mutation rates of the clusters and the ensemble average. 
 - property max_gini
- Maximum Gini coefficient among all clusters. 
 - property max_pearson
- Maximum Pearson correlation among any pair of clusters. 
 - property min_marcd
- Minimum MARCD among any pair of clusters. 
 - property mus
- Mutation rate of each position in each cluster, corrected for observer bias. 
 - property mus_ens_avg
- Mutation rate of each position in the ensemble average, corrected for observer bias. 
 - property n_reads
- Total number of reads (including redundant reads). 
 - property pis
- Proportion of each cluster, corrected for observer bias. 
 - property region_end5
- 5’ end of the region. 
 - property semi_g_anomalies
- Semi-G-anomaly of each read. 
 
- class seismicrna.cluster.emk.EMRunsK(runs: list[EMRun], max_pearson_run: float, min_marcd_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float)
- Bases: - object- One or more EM runs with the same number of clusters. - enough_runs_passing(**kwargs)
- Whether enough runs passed. 
 - loglike_vs_best(**kwargs)
- Log likelihood difference between the best and second-best runs. 
 - marcd_vs_best(**kwargs)
- Minimum MARCD between the best run and any other run. 
 - n_runs_passing(**kwargs)
- Number of runs passing the filters. 
 - passing(**kwargs)
- Whether this number of clusters passes the filters. 
 - pearson_vs_best(**kwargs)
- Maximum Pearson correlation between the best run and any other run. 
 - subopt_indexes(**kwargs)
- Indexes of the valid suboptimal runs. 
 - summarize(**kwargs)
- Summarize the results of the runs. 
 
- seismicrna.cluster.emk.assign_clusterings(mus1: ndarray, mus2: ndarray)
- Optimally assign clusters from two groups to each other. 
- seismicrna.cluster.emk.calc_mean_arcsine_distance_clusters(mus1: ndarray, mus2: ndarray)
- MARCD between all clusters after matching. 
- seismicrna.cluster.emk.calc_mean_pearson_clusters(mus1: ndarray, mus2: ndarray)
- Pearson correlation between all clusters after matching. 
- seismicrna.cluster.emk.find_best_k(ks: Iterable[EMRunsK], **kwargs)
- Find the best number of clusters. 
- seismicrna.cluster.emk.get_common_k(runs: list[EMRun])
- Find the number of clusters (k) from among EM clustering runs. If there are multiple ks, then raise a ValueError. 
- seismicrna.cluster.emk.sort_runs(runs: list[EMRun])
- Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first. 
- class seismicrna.cluster.io.ClusterBatchIO(*, resps: DataFrame, **kwargs)
- Bases: - ClusterReadBatch,- ReadBatchIO,- RegBrickleIO,- ClusterIO- classmethod get_file_seg_type()
- Type of the last segment in the path. 
 
- class seismicrna.cluster.io.ClusterBatchWriter(dataset: MaskMutsDataset, ks: list[EMRunsK], brotli_level: int, top: Path, branch: str)
- Bases: - object- property branches
 - property ks_written
 - write_batches()
- Save the batches. 
 
- class seismicrna.cluster.io.ClusterFile
- Bases: - HasRegFilePath,- ABC- classmethod get_step()
- Step of the workflow. 
 
- class seismicrna.cluster.io.ClusterIO
- Bases: - ClusterFile,- RegFileIO,- ABC
- seismicrna.cluster.jackpot.bootstrap_jackpot_scores(uniq_end5s: ndarray, uniq_end3s: ndarray, counts_per_uniq: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray, real_jackpot_score: float, confidence_level: float, max_jackpot_quotient: float)
- Bootstrap jackpotting scores from the null model. - Parameters:
- uniq_end5s ( - np.ndarray) – 1D (unique reads) array of the 5’ ends (0-indexed) of the unique reads.
- uniq_end3s ( - np.ndarray) – 1D (unique reads) array of the 3’ ends (0-indexed) of the unique reads.
- counts_per_uniq ( - np.ndarray) – 1D (unique reads) array of the number of times each unique read was observed.
- p_mut ( - np.ndarray) – 2D (positions x clusters) array of the mutation rate at each position in each cluster.
- p_ends ( - np.ndarray) – 2D (positions x positions) array of the probability distribution of 5’/3’ end coordinates.
- p_clust ( - np.ndarray) – 1D (clusters) probability that a read comes from each cluster.
- min_mut_gap ( - int) – Minimum number of non-mutated positions between two mutations.
- unmasked ( - np.ndarray) – Positions (0-indexed) of p_mut that are not masked.
- real_jackpot_score ( - float) – Jackpotting score of the real dataset.
- confidence_level ( - float) – Confidence level for computing a confidence interval of the jackpotting quotient.
- max_jackpot_quotient ( - float) – Stop bootstrapping once the confidence interval lies entirely above or entirely below this threshold.
 
- Returns:
- Array of normalized G-test statistics for the null model. 
- Return type:
- np.ndarray
 
- seismicrna.cluster.jackpot.calc_jackpot_quotient(real_jackpot_score: float, null_jackpot_score: float)
- Calculate the jackpotting quotient. - The jackpotting quotient indicates how much more overrepresented the average read is in the real dataset compared to the null dataset. - Since the jackpotting score is the expected log-ratio of a read’s observed to expected count, raising e to the power of it yields the observed-to-expected ratio, which measures jackpotting. - JS = average{log(O / E)} exp(JS) = exp(average{log(O / E)}) - Then, the jackpotting quotient is the jackpotting score of the real dataset divided by that of the null dataset: - JQ = exp(JS_real) / exp(JS_null)
- = exp(JS_real - JS_null) 
 
- seismicrna.cluster.jackpot.calc_jackpot_score(semi_g_anomalies: ndarray, n_reads: int)
- Calculate the jackpotting score. - The jackpotting score is defined as the average log of the ratio of observed to expected counts for a read, weighted by the count of each read: - JS = sum{O_i * log(O_i / E_i)} / sum{O_i} where i is each unique read. - This formula is based on that of the G-test statistic, which is identical except that the latter is multiplied by 2 and not divided by the number of observations. 
- seismicrna.cluster.jackpot.calc_jackpot_score_ci(jackpot_scores: Iterable[float], confidence_level: float)
- Calculate the confidence interval of the mean of an array of jackpotting scores. 
- seismicrna.cluster.jackpot.calc_semi_g_anomaly(num_obs: int | ndarray, log_exp: float | ndarray)
- Calculate each read’s semi-G-anomaly, i.e. half its contribution to the G-test statistic. 
- seismicrna.cluster.jackpot.linearize_ends_matrix(p_ends: ndarray)
- Linearize an N x N matrix of end coordinate probabilities/counts into a length-N list of 5’/3’ coordinates and their values. 
- seismicrna.cluster.jackpot.sim_obs_exp(end5s: ndarray, end3s: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray)
- Simulate observed and expected counts. 
- seismicrna.cluster.main.run(input_path: Iterable[str | Path], *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, brotli_level: int = 10, num_cpus: int = 4, force: bool = False) list[Path]
- Infer alternative structures by clustering reads’ mutations. - Parameters:
- branch ( - str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]
- tmp_pfx ( - str | pathlib._local.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
- keep_tmp ( - bool) – Keep temporary files after finishing [keyword-only, default: False]
- min_clusters ( - int) – Start at this many clusters [keyword-only, default: 1]
- max_clusters ( - int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]
- min_em_runs ( - int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]
- max_em_runs ( - int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]
- min_em_iter ( - int) – Run EM for at least this many iterations [keyword-only, default: 10]
- max_em_iter ( - int) – Run EM for at most this many iterations [keyword-only, default: 500]
- em_thresh ( - float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]
- min_marcd_run ( - float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]
- max_pearson_run ( - float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]
- max_arcd_vs_ens_avg ( - float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]
- max_gini_run ( - float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]
- jackpot ( - bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]
- jackpot_conf_level ( - float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]
- max_jackpot_quotient ( - float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]
- max_loglike_vs_best ( - float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]
- min_pearson_vs_best ( - float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]
- max_marcd_vs_best ( - float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]
- try_all_ks ( - bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]
- write_all_ks ( - bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]
- cluster_pos_table ( - bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]
- cluster_abundance_table ( - bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]
- verify_times ( - bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
- brotli_level ( - int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
- num_cpus ( - int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- force ( - bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
 
 
- seismicrna.cluster.marginal.calc_marginal(*args, **kwargs)
- seismicrna.cluster.marginal.calc_marginal_resps(*args, **kwargs)
- seismicrna.cluster.obsexp.assemble_log_obs_exp(uniq_reads: UniqReads, ks: list[EMRunsK])
- Assemble the expected and observed log counts of each read. 
- seismicrna.cluster.obsexp.graph_log_obs_exp(log_obs_exp: DataFrame, ks: list[EMRunsK], to_dir: Path)
- Graph the expected vs. observed log counts of unique reads. 
- seismicrna.cluster.obsexp.table_log_obs_exp(log_obs_exp: DataFrame, to_dir: Path)
- Write the expected and observed log counts of unique reads. 
- seismicrna.cluster.obsexp.write_obs_exp_counts(uniq_reads: UniqReads, ks: list[EMRunsK], to_dir: Path)
- seismicrna.cluster.params.get_table_path(top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, table: str, k: int, run: int)
- Build a path for a table of clustering results. 
- seismicrna.cluster.params.write_mus(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)
- seismicrna.cluster.params.write_pis(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)
- seismicrna.cluster.params.write_single_run_table(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int, *, attr: str, table: str)
- Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file. 
- class seismicrna.cluster.report.BaseClusterReport(**kwargs: Any | Callable[[Report], Any])
- Bases: - RegReport,- ClusterIO,- ABC- classmethod get_file_seg_type()
- Type of the last segment in the path. 
 
- class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])
- Bases: - BatchedReport,- BaseClusterReport- classmethod from_clusters(ks: list[EMRunsK], uniq_reads: UniqReads, *, min_clusters: int, max_clusters: int, min_em_runs: int, max_em_runs: int, try_all_ks: bool, write_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, ks_written: list[int], checksums: list[str], began: datetime, ended: datetime)
- Create a ClusterReport from EmClustering objects. 
 - classmethod get_file_seg_type()
- Type of the last segment in the path. 
 - classmethod get_param_report_fields()
- Parameter fields of the report. 
 - classmethod get_result_report_fields()
- Result fields of the report. 
 
- class seismicrna.cluster.report.JoinClusterReport(**kwargs: Any | Callable[[Report], Any])
- Bases: - JoinReport,- BaseClusterReport- classmethod get_param_report_fields()
- Parameter fields of the report. 
 
- seismicrna.cluster.summary.graph_attr(attr: Series, passing_text: list[str] | None = None)
- Graph one attribute. 
- seismicrna.cluster.summary.tabulate_attr(ks: list[EMRunsK], attr: str)
- Tabulate the values for one attribute. 
- class seismicrna.cluster.uniq.UniqReads(sample: str, branches: dict[str, str], region: Region, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)
- Bases: - EndCoords- Collection of bit vectors of unique reads. - classmethod from_dataset(dataset: MaskMutsDataset, branch: str, **kwargs)
- Get unique reads from a dataset. 
 - classmethod from_dataset_contig(dataset: MaskMutsDataset, branch: str)
- Get unique reads from a dataset of contiguous reads. 
 - get_cov_matrix()
- Full boolean matrix of the covered positions. 
 - get_mut_matrix()
- Full boolean matrix of the mutations. 
 - property log_obs
- Log number of times each read was observed. 
 - property num_batches
- Number of batches. 
 - property num_obs
- Number of times each read was observed. 
 - property num_uniq
- Number of unique reads. 
 - property read_end3s_zero
- 3’ end coordinates (0-indexed in the region). 
 - property read_end5s_zero
- 5’ end coordinates (0-indexed in the region). 
 - property ref
- Reference name. 
 - property seg_end3s_zero
- 3’ end of every segment (0-indexed in the region). 
 - property seg_end5s_zero
- 5’ end of every segment (0-indexed in the region). 
 - property uniq_names
- Unique read names as byte strings. 
 
- seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[RegionMutsBatch], **kwargs)
- seismicrna.cluster.write.cluster(dataset: MaskMutsDataset | JoinMaskMutsDataset, *, branch: str, min_clusters: int, max_clusters: int, try_all_ks: bool, write_all_ks: bool, max_em_runs: int, num_cpus: int, brotli_level: int, force: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, tmp_pfx, keep_tmp, **kwargs)
- Cluster unique reads from one mask dataset. 
- seismicrna.cluster.write.run_k(uniq_reads: UniqReads, k: int, num_runs: int, *, num_cpus: int, **kwargs) list[EMRun]
- Run EM with a specific number of clusters. 
- seismicrna.cluster.write.run_ks(uniq_reads: UniqReads, ks: Iterable[int], *, min_em_runs: int, max_em_runs: int, try_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, num_cpus: int, top: Path, **kwargs)
- Run EM with multiple numbers of clusters.