seismicrna.cluster package
Subpackages
- seismicrna.cluster.tests package
- Submodules
TestAssignClusterings
TestBootstrapJackpotScores
TestBootstrapJackpotScores.REF
TestBootstrapJackpotScores.REFS
TestBootstrapJackpotScores.SAMPLE
TestBootstrapJackpotScores.SIM_DIR
TestBootstrapJackpotScores.calc_confidence_interval()
TestBootstrapJackpotScores.setUp()
TestBootstrapJackpotScores.sim_jackpot_quotient()
TestBootstrapJackpotScores.tearDown()
TestBootstrapJackpotScores.test_ideal_jackpot()
TestCalcSemiGAnomaly
TestSimClusters
TestSimReads
TestMarginalResps
- Submodules
Submodules
- class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)
Bases:
ClusterReadBatch
,PartialRegionMutsBatch
- property read_weights
Weights for each read when computing counts.
- class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)
Bases:
PartialReadBatch
- property num_reads: Series
Number of reads.
- property read_nums
Read numbers.
- class seismicrna.cluster.dataset.ClusterDataset(report_file: Path, verify_times: bool = True)
-
Dataset of clustered data.
- class seismicrna.cluster.dataset.ClusterMutsDataset(dataset2_report_file: Path, **kwargs)
Bases:
ClusterDataset
,MultistepDataset
,UnbiasDataset
Merge cluster responsibilities with mutation data.
- property best_k
Best number of clusters.
- classmethod get_dataset1_load_func()
Function to load Dataset 1.
- classmethod get_dataset2_type()
Type of Dataset 2.
- property ks
Numbers of clusters.
- property min_mut_gap
Minimum gap between two mutations.
- property pattern
Pattern of mutations to count.
- property quick_unbias
Use the quick heuristic for unbiasing.
- property quick_unbias_thresh
Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.
- property region
Region of the dataset.
- class seismicrna.cluster.dataset.ClusterReadDataset(report_file: Path, verify_times: bool = True)
Bases:
ClusterDataset
,LoadedDataset
Load clustering results.
- property best_k
Best number of clusters.
- classmethod get_batch_type()
Type of batch.
- classmethod get_report_type()
Type of report.
- property ks
Numbers of clusters.
- property pattern
Pattern of mutations to count.
- class seismicrna.cluster.dataset.JoinClusterMutsDataset(*args, **kwargs)
Bases:
ClusterDataset
,JoinMutsDataset
,MergedUnbiasDataset
- property best_k
Best number of clusters.
- property clusts
Index of k and cluster numbers.
- classmethod get_batch_type()
Type of batch.
- classmethod get_dataset_load_func()
Function to load one constituent dataset.
- classmethod get_report_type()
Type of report.
- property ks
Numbers of clusters.
- classmethod name_batch_attrs()
Name the attributes of each batch.
- seismicrna.cluster.dataset.get_clust_params(dataset: ClusterMutsDataset, max_procs: int = 1)
Get the mutation rates and proportion for each cluster. If table files already exist, then use them to get the parameters; otherwise, calculate the parameters from the dataset.
- class seismicrna.cluster.em.EMRun(uniq_reads: UniqReads, k: int, seed: int | None = None, *, min_iter: int, max_iter: int, em_thresh: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float)
Bases:
object
Run expectation-maximization to cluster the given reads into the specified number of clusters.
- property bic
Bayesian Information Criterion of the model.
- property jackpot_quotient
jackpotting score divided by the score of the median null model.
- Type:
Jackpotting quotient
- property jackpot_score
expected value of the G-anomaly.
- Type:
Jackpotting score
- property log_exp
Expected log count of each unique read (with names).
- property log_exp_values
Expected log count of each unique read (values only).
- property log_like
Return the current log likelihood, which is the last item in the trajectory of log likelihood values.
- property max_pearson
Maximum Pearson correlation among any pair of clusters.
- property min_marcd
Minimum MARCD among any pair of clusters.
- property mus
Log mutation rate at each position for each cluster.
- property n_reads
Total number of reads (including redundant reads).
- property pis
Real and observed log proportion of each cluster.
- property region_end5
5’ end of the region.
- property semi_g_anomalies
Semi-G-anomaly of each read.
- class seismicrna.cluster.emk.EMRunsK(runs: list[EMRun], max_pearson_run: float, min_marcd_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float)
Bases:
object
One or more EM runs with the same number of clusters.
- enough_runs_passing(**kwargs)
Whether enough runs passed.
- loglike_vs_best(**kwargs)
Log likelihood difference between the best and second-best runs.
- marcd_vs_best(**kwargs)
Minimum MARCD between the best run and any other run.
- n_runs_passing(**kwargs)
Number of runs passing the filters.
- passing(**kwargs)
Whether this number of clusters passes the filters.
- pearson_vs_best(**kwargs)
Maximum Pearson correlation between the best run and any other run.
- subopt_indexes(**kwargs)
Indexes of the valid suboptimal runs.
- summarize(**kwargs)
Summarize the results of the runs.
- seismicrna.cluster.emk.assign_clusterings(mus1: ndarray, mus2: ndarray)
Optimally assign clusters from two groups to each other.
- seismicrna.cluster.emk.calc_mean_arcsine_distance_clusters(mus1: ndarray, mus2: ndarray)
Mean MARCD between the clusters.
- seismicrna.cluster.emk.calc_mean_pearson_clusters(mus1: ndarray, mus2: ndarray)
Mean Pearson correlation between the clusters.
- seismicrna.cluster.emk.find_best_k(ks: Iterable[EMRunsK], **kwargs)
Find the best number of clusters.
- seismicrna.cluster.emk.get_common_k(runs: list[EMRun])
Find the number of clusters (k) from among EM clustering runs. If there are multiple ks, then raise a ValueError.
- seismicrna.cluster.emk.sort_runs(runs: list[EMRun])
Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first.
- class seismicrna.cluster.io.ClusterBatchIO(*, reg: str, **kwargs)
Bases:
ReadBatchIO
,ClusterIO
,ClusterReadBatch
- classmethod file_seg_type()
Type of the last segment in the path.
- class seismicrna.cluster.io.ClusterBatchWriter(dataset: MaskMutsDataset, ks: list[EMRunsK], brotli_level: int, top: Path)
Bases:
object
- property ks_written
- write_batches()
Save the batches.
- class seismicrna.cluster.io.ClusterIO(*, reg: str, **kwargs)
-
- classmethod auto_fields()
Names and automatic values of selected fields.
- seismicrna.cluster.jackpot.bootstrap_jackpot_scores(uniq_end5s: ndarray, uniq_end3s: ndarray, counts_per_uniq: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray, real_jackpot_score: float, confidence_level: float, max_jackpot_quotient: float)
Bootstrap jackpotting scores from the null model.
- Parameters:
uniq_end5s (
np.ndarray
) – 1D (unique reads) array of the 5’ ends (0-indexed) of the unique reads.uniq_end3s (
np.ndarray
) – 1D (unique reads) array of the 3’ ends (0-indexed) of the unique reads.counts_per_uniq (
np.ndarray
) – 1D (unique reads) array of the number of times each unique read was observed.p_mut (
np.ndarray
) – 2D (positions x clusters) array of the mutation rate at each position in each cluster.p_ends (
np.ndarray
) – 2D (positions x positions) array of the probability distribution of 5’/3’ end coordinates.p_clust (
np.ndarray
) – 1D (clusters) probability that a read comes from each cluster.min_mut_gap (
int
) – Minimum number of non-mutated positions between two mutations.unmasked (
np.ndarray
) – Positions (0-indexed) of p_mut that are not masked.real_jackpot_score (
float
) – Jackpotting score of the real dataset.confidence_level (
float
) – Confidence level for computing a confidence interval of the jackpotting quotient.max_jackpot_quotient (
float
) – Stop bootstrapping once the confidence interval lies entirely above or entirely below this threshold.
- Returns:
Array of normalized G-test statistics for the null model.
- Return type:
np.ndarray
- seismicrna.cluster.jackpot.calc_jackpot_quotient(real_jackpot_score: float, null_jackpot_score: float)
Calculate the jackpotting quotient.
The jackpotting quotient indicates how much more overrepresented the average read is in the real dataset compared to the null dataset.
Since the jackpotting score is the expected log-ratio of a read’s observed to expected count, raising e to the power of it yields the observed-to-expected ratio, which measures jackpotting.
JS = average{log(O / E)} exp(JS) = exp(average{log(O / E)})
Then, the jackpotting quotient is the jackpotting score of the real dataset divided by that of the null dataset:
- JQ = exp(JS_real) / exp(JS_null)
= exp(JS_real - JS_null)
- seismicrna.cluster.jackpot.calc_jackpot_score(semi_g_anomalies: ndarray, n_reads: int)
Calculate the jackpotting score.
The jackpotting score is defined as the average log of the ratio of observed to expected counts for a read, weighted by the count of each read:
JS = sum{O_i * log(O_i / E_i)} / sum{O_i} where i is each unique read.
This formula is based on that of the G-test statistic, which is identical except that the latter is multiplied by 2 and not divided by the number of observations.
- seismicrna.cluster.jackpot.calc_jackpot_score_ci(jackpot_scores: Iterable[float], confidence_level: float)
Calculate the confidence interval of the mean of an array of jackpotting scores.
- seismicrna.cluster.jackpot.calc_semi_g_anomaly(num_obs: int | ndarray, log_exp: float | ndarray)
Calculate each read’s semi-G-anomaly, i.e. half its contribution to the G-test statistic.
- seismicrna.cluster.jackpot.linearize_ends_matrix(p_ends: ndarray)
Linearize an N x N matrix of end coordinate probabilities/counts into a length-N list of 5’/3’ coordinates and their values.
- seismicrna.cluster.jackpot.sim_obs_exp(end5s: ndarray, end3s: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray)
Simulate observed and expected counts.
- seismicrna.cluster.main.run(input_path: Iterable[str | Path], *, tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, min_clusters: int = 1, max_clusters: int = 0, em_runs: int = 12, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.0175, max_pearson_run: float = 0.9, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.98, max_marcd_vs_best: float = 0.005, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, brotli_level: int = 10, max_procs: int = 4, force: bool = False) list[Path]
Infer alternative structures by clustering reads’ mutations.
- Parameters:
tmp_pfx (
str | pathlib._local.Path
) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]keep_tmp (
bool
) – Keep temporary files after finishing [keyword-only, default: False]min_clusters (
int
) – Start at this many clusters [keyword-only, default: 1]max_clusters (
int
) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]em_runs (
int
) – Run EM this many times for each number of clusters (K) except K = 1 [keyword-only, default: 12]min_em_iter (
int
) – Run EM for at least this many iterations (times number of clusters) [keyword-only, default: 10]max_em_iter (
int
) – Run EM for at most this many iterations (times number of clusters) [keyword-only, default: 500]em_thresh (
float
) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]min_marcd_run (
float
) – Remove runs with two clusters different by less than this MARCD [keyword-only, default: 0.0175]max_pearson_run (
float
) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]jackpot (
bool
) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]jackpot_conf_level (
float
) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]max_jackpot_quotient (
float
) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]max_loglike_vs_best (
float
) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]min_pearson_vs_best (
float
) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.98]max_marcd_vs_best (
float
) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.005]try_all_ks (
bool
) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]write_all_ks (
bool
) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]cluster_pos_table (
bool
) – Tabulate relationships per position for cluster data [keyword-only, default: True]cluster_abundance_table (
bool
) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]verify_times (
bool
) – Verify that report files from later steps have later timestamps [keyword-only, default: True]brotli_level (
int
) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]max_procs (
int
) – Run up to this many processes simultaneously [keyword-only, default: 4]force (
bool
) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
- seismicrna.cluster.marginal.calc_marginal(*args, **kwargs)
- seismicrna.cluster.marginal.calc_marginal_resps(*args, **kwargs)
Cluster – Names Module
Auth: Matty
Define names for the indexes of the cluster tables.
- seismicrna.cluster.obsexp.assemble_log_obs_exp(uniq_reads: UniqReads, ks: list[EMRunsK])
Assemble the expected and observed log counts of each read.
- seismicrna.cluster.obsexp.graph_log_obs_exp(log_obs_exp: DataFrame, ks: list[EMRunsK], to_dir: Path)
Graph the expected vs. observed log counts of unique reads.
- seismicrna.cluster.obsexp.table_log_obs_exp(log_obs_exp: DataFrame, to_dir: Path)
Write the expected and observed log counts of unique reads.
- seismicrna.cluster.obsexp.write_obs_exp_counts(uniq_reads: UniqReads, ks: list[EMRunsK], to_dir: Path)
- seismicrna.cluster.params.get_table_path(top: Path, sample: str, ref: str, reg: str, table: str, k: int, run: int)
Build a path for a table of clustering results.
- seismicrna.cluster.params.write_mus(run: EMRun, top: Path, sample: str, ref: str, reg: str, rank: int)
- seismicrna.cluster.params.write_pis(run: EMRun, top: Path, sample: str, ref: str, reg: str, rank: int)
- seismicrna.cluster.params.write_single_run_table(run: EMRun, top: Path, sample: str, ref: str, reg: str, rank: int, *, attr: str, table: str)
Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file.
- class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])
Bases:
BatchedReport
,ClusterIO
- classmethod auto_fields()
Names and automatic values of selected fields.
- classmethod fields()
All fields of the report.
- classmethod file_seg_type()
Type of the last segment in the path.
- classmethod from_clusters(ks: list[EMRunsK], uniq_reads: UniqReads, *, min_clusters: int, max_clusters: int, em_runs: int, try_all_ks: bool, write_all_ks: bool, min_marcd_run: float, max_pearson_run: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, ks_written: list[int], checksums: list[str], began: datetime, ended: datetime)
Create a ClusterReport from EmClustering objects.
- class seismicrna.cluster.report.JoinClusterReport(**kwargs: Any | Callable[[Report], Any])
Bases:
JoinReport
- classmethod auto_fields()
Names and automatic values of selected fields.
- classmethod fields()
All fields of the report.
- classmethod file_seg_type()
Type of the last segment in the path.
- seismicrna.cluster.summary.graph_attr(attr: Series, passing_text: list[str] | None = None)
Graph one attribute.
- seismicrna.cluster.summary.graph_attrs(table: DataFrame, to_dir: Path)
Graph every attribute.
- seismicrna.cluster.summary.tabulate_attr(ks: list[EMRunsK], attr: str)
Tabulate the values for one attribute.
- seismicrna.cluster.summary.write_table(table: DataFrame, cluster_dir: Path)
- class seismicrna.cluster.table.ClusterDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)
Bases:
PartialDatasetTabulator
,ClusterTabulator
- classmethod init_kws()
Attributes of the dataset to use as keyword arguments in super().__init__().
- classmethod load_function()
LoadFunction for all Dataset types for this Tabulator.
- class seismicrna.cluster.uniq.UniqReads(sample: str, region: Region, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)
Bases:
EndCoords
Collection of bit vectors of unique reads.
- classmethod from_dataset(dataset: MaskMutsDataset, **kwargs)
Get unique reads from a dataset.
- classmethod from_dataset_contig(dataset: MaskMutsDataset)
Get unique reads from a dataset of contiguous reads.
- get_cov_matrix()
Full boolean matrix of the covered positions.
- get_mut_matrix()
Full boolean matrix of the mutations.
- property log_obs
Log number of times each read was observed.
- property num_batches
Number of batches.
- property num_obs
Number of times each read was observed.
- property num_uniq
Number of unique reads.
- property read_end3s_zero
3’ end coordinates (0-indexed in the region).
- property read_end5s_zero
5’ end coordinates (0-indexed in the region).
- property ref
Reference name.
- property seg_end3s_zero
3’ end of every segment (0-indexed in the region).
- property seg_end5s_zero
5’ end of every segment (0-indexed in the region).
- property uniq_names
Unique read names as byte strings.
- seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[RegionMutsBatch], **kwargs)
- seismicrna.cluster.write.cluster(mask_report_file: Path, *, min_clusters: int, max_clusters: int, try_all_ks: bool, write_all_ks: bool, em_runs: int, n_procs: int, brotli_level: int, force: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, tmp_pfx, keep_tmp, **kwargs)
Cluster unique reads from one mask dataset.
- seismicrna.cluster.write.run_k(uniq_reads: UniqReads, k: int, em_runs: int, *, n_procs: int, **kwargs) list[EMRun]
Run EM with a specific number of clusters.
- seismicrna.cluster.write.run_ks(uniq_reads: UniqReads, ks: Iterable[int], em_runs: int, *, try_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, n_procs: int, top: Path, **kwargs)
Run EM with multiple numbers of clusters.