seismicrna.cluster package
Subpackages
- seismicrna.cluster.tests package
- Submodules
TestAssignClusterings
TestBootstrapJackpotScores
TestBootstrapJackpotScores.REF
TestBootstrapJackpotScores.REFS
TestBootstrapJackpotScores.SAMPLE
TestBootstrapJackpotScores.SIM_DIR
TestBootstrapJackpotScores.calc_confidence_interval()
TestBootstrapJackpotScores.setUp()
TestBootstrapJackpotScores.sim_jackpot_quotient()
TestBootstrapJackpotScores.tearDown()
TestBootstrapJackpotScores.test_ideal_jackpot()
TestCalcSemiGAnomaly
TestSimClusters
TestSimReads
TestMarginalResps
Submodules
- class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)
Bases: ClusterReadBatch, PartialRegionMutsBatch
- property read_weights
Weights for each read when computing counts.
- class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)
Bases: PartialReadBatch
- property num_reads: Series
Number of reads.
- property read_nums
Read numbers.
- class seismicrna.cluster.data.ClusterAbundanceTable
Bases: AbundanceTable, PartialTable, ClusterFile, ABC
- classmethod get_header_type()
Type of the header for the table.
- classmethod get_index_depth()
Number of columns in the index.
- classmethod get_load_function()
LoadFunction for all Dataset types for this Table.
- property proportions
Proportion of each item.
- class seismicrna.cluster.data.ClusterAbundanceTableLoader(table_file: str | Path, **kwargs)
Bases: TableLoader, ClusterAbundanceTable
Load cluster data indexed by cluster.
- property data: Series
Table’s data.
- class seismicrna.cluster.data.ClusterBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, num_cpus: int = 1, **kwargs)
Bases: BatchTabulator, ClusterTabulator
- class seismicrna.cluster.data.ClusterCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)
Bases: CountTabulator, ClusterTabulator
- class seismicrna.cluster.data.ClusterDataset(report_file: str | Path, verify_times: bool = True)
Dataset of clustered data.
- class seismicrna.cluster.data.ClusterDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)
Bases: PartialDatasetTabulator, ClusterTabulator
- classmethod init_kws()
Attributes of the dataset to use as keyword arguments in super().__init__().
- class seismicrna.cluster.data.ClusterMutsDataset(dataset2_report_file: Path, **kwargs)
Bases: ClusterDataset, MultistepDataset, UnbiasDataset
Merge cluster responsibilities with mutation data.
- property best_k
Best number of clusters.
- classmethod get_dataset1_load_func()
Function to load Dataset 1.
- classmethod get_dataset2_type()
Type of Dataset 2.
- property ks
Numbers of clusters.
- property min_mut_gap
Minimum gap between two mutations.
- property pattern
Pattern of mutations to count.
- property quick_unbias
Use the quick heuristic for unbiasing.
- property quick_unbias_thresh
Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.
- property region
Region of the dataset.
- class seismicrna.cluster.data.ClusterPositionTable
Bases: ClusterTable, PartialPositionTable, ABC
- class seismicrna.cluster.data.ClusterPositionTableLoader(table_file: str | Path, **kwargs)
Bases: PositionTableLoader, ClusterPositionTable
Load cluster data indexed by position.
- class seismicrna.cluster.data.ClusterReadDataset(report_file: str | Path, verify_times: bool = True)
Bases: ClusterDataset, LoadedDataset
Load clustering results.
- property best_k
Best number of clusters.
- classmethod get_batch_type()
Type of batch.
- classmethod get_report_type()
Type of report.
- property ks
Numbers of clusters.
- property pattern
Pattern of mutations to count.
- class seismicrna.cluster.data.ClusterTable
Bases: RelTypeTable, ClusterFile, ABC
- classmethod get_header_type()
Type of the header for the table.
- classmethod get_load_function()
LoadFunction for all Dataset types for this Table.
- class seismicrna.cluster.data.ClusterTabulator(*, ks: list[int], **kwargs)
Bases: PartialTabulator, ABC
- property clust_header
Header of the per-cluster data.
- property data_per_clust
Number of reads in each cluster.
- classmethod table_types()
Types of tables that this tabulator can write.
- class seismicrna.cluster.data.JoinClusterMutsDataset(*args, **kwargs)
Bases: ClusterDataset, JoinMutsDataset, MergedUnbiasDataset
- property best_k
Best number of clusters.
- property clusts
Index of k and cluster numbers.
- classmethod get_batch_type()
Type of batch.
- classmethod get_dataset_load_func()
Function to load one constituent dataset.
- classmethod get_report_type()
Type of report.
- property ks
Numbers of clusters.
- classmethod name_batch_attrs()
Name the attributes of each batch.
- seismicrna.cluster.data.get_clust_params(dataset: ClusterMutsDataset, num_cpus: int = 1)
Get the mutation rates and proportion for each cluster. If table files already exist, then use them to get the parameters; otherwise, calculate the parameters from the dataset.
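A minimal usage sketch (the report path is hypothetical, and it is assumed from the signatures above that ClusterMutsDataset can be constructed directly from its report file):

    from pathlib import Path

    from seismicrna.cluster.data import ClusterMutsDataset, get_clust_params

    # Hypothetical path to a report file produced by the cluster step.
    report = Path("out/sample/cluster/ref/region/cluster-report.json")
    dataset = ClusterMutsDataset(report)
    params = get_clust_params(dataset, num_cpus=4)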
- class seismicrna.cluster.em.EMRun(uniq_reads: UniqReads, k: int, seed: int | None = None, *, min_iter: int, max_iter: int, em_thresh: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float)
Bases: object
Run expectation-maximization to cluster the given reads into the specified number of clusters (see the usage sketch after the properties below).
- property bic
Bayesian Information Criterion of the model.
- property jackpot_quotient
Jackpotting quotient: jackpotting score divided by the score of the median null model.
- property jackpot_score
Jackpotting score: expected value of the G-anomaly.
- property log_exp
Expected log count of each unique read (with names).
- property log_exp_values
Expected log count of each unique read (values only).
- property log_like
Return the current log likelihood, which is the last item in the trajectory of log likelihood values.
- property max_arcd_vs_ens_avg
Maximum arcsine distance between the mutation rates of the clusters and the ensemble average.
- property max_gini
Maximum Gini coefficient among all clusters.
- property max_pearson
Maximum Pearson correlation among any pair of clusters.
- property min_marcd
Minimum MARCD among any pair of clusters.
- property mus
Mutation rate of each position in each cluster, corrected for observer bias.
- property mus_ens_avg
Mutation rate of each position in the ensemble average, corrected for observer bias.
- property n_reads
Total number of reads (including redundant reads).
- property pis
Proportion of each cluster, corrected for observer bias.
- property region_end5
5’ end of the region.
- property semi_g_anomalies
Semi-G-anomaly of each read.
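A hedged sketch of fitting a single EM run, assuming uniq_reads is a seismicrna.cluster.uniq.UniqReads object built beforehand (e.g. with UniqReads.from_dataset) and that constructing an EMRun performs the fit, as the class docstring suggests; the keyword values mirror the defaults of seismicrna.cluster.main.run:

    from seismicrna.cluster.em import EMRun

    # `uniq_reads` is assumed to be a UniqReads object prepared earlier.
    run = EMRun(uniq_reads,
                k=2,                      # number of clusters to fit
                seed=42,
                min_iter=10,
                max_iter=500,
                em_thresh=0.37,           # stop when log likelihood gain < 0.37
                jackpot=True,
                jackpot_conf_level=0.95,
                max_jackpot_quotient=1.1)
    print(run.log_like)  # final log likelihood
    print(run.bic)       # Bayesian Information Criterion
    print(run.pis)       # cluster proportions, corrected for observer bias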
- class seismicrna.cluster.emk.EMRunsK(runs: list[EMRun], max_pearson_run: float, min_marcd_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float)
Bases: object
One or more EM runs with the same number of clusters.
- enough_runs_passing(**kwargs)
Whether enough runs passed.
- loglike_vs_best(**kwargs)
Log likelihood difference between the best and second-best runs.
- marcd_vs_best(**kwargs)
Minimum MARCD between the best run and any other run.
- n_runs_passing(**kwargs)
Number of runs passing the filters.
- passing(**kwargs)
Whether this number of clusters passes the filters.
- pearson_vs_best(**kwargs)
Maximum Pearson correlation between the best run and any other run.
- subopt_indexes(**kwargs)
Indexes of the valid suboptimal runs.
- summarize(**kwargs)
Summarize the results of the runs.
- seismicrna.cluster.emk.assign_clusterings(mus1: ndarray, mus2: ndarray)
Optimally assign clusters from two groups to each other.
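For illustration, one standard way to perform such an optimal matching is the Hungarian algorithm via scipy.optimize.linear_sum_assignment; this sketch is an assumption about the approach, not necessarily how assign_clusterings is implemented, and the cost metric (mean absolute difference) is chosen only for illustration:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_clusters(mus1: np.ndarray, mus2: np.ndarray) -> np.ndarray:
        """Illustrative optimal matching of clusters between two runs.

        mus1 and mus2 are (positions x clusters) arrays of mutation rates;
        returns, for each cluster of mus1, the index of its partner in mus2.
        """
        # Cost of pairing cluster i of mus1 with cluster j of mus2:
        # mean absolute difference of their mutation rates over positions.
        cost = np.mean(np.abs(mus1[:, :, None] - mus2[:, None, :]), axis=0)
        _, cols = linear_sum_assignment(cost)
        return cols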
- seismicrna.cluster.emk.calc_mean_arcsine_distance_clusters(mus1: ndarray, mus2: ndarray)
MARCD between all clusters after matching.
- seismicrna.cluster.emk.calc_mean_pearson_clusters(mus1: ndarray, mus2: ndarray)
Pearson correlation between all clusters after matching.
- seismicrna.cluster.emk.find_best_k(ks: Iterable[EMRunsK], **kwargs)
Find the best number of clusters.
- seismicrna.cluster.emk.get_common_k(runs: list[EMRun])
Find the number of clusters (k) from among EM clustering runs. If there are multiple ks, then raise a ValueError.
- seismicrna.cluster.emk.sort_runs(runs: list[EMRun])
Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first.
- class seismicrna.cluster.io.ClusterBatchIO(*, resps: DataFrame, **kwargs)
Bases: ClusterReadBatch, ReadBatchIO, RegBrickleIO, ClusterIO
- classmethod get_file_seg_type()
Type of the last segment in the path.
- class seismicrna.cluster.io.ClusterBatchWriter(dataset: MaskMutsDataset, ks: list[EMRunsK], brotli_level: int, top: Path, branch: str)
Bases: object
- property branches
- property ks_written
- write_batches()
Save the batches.
- class seismicrna.cluster.io.ClusterFile
Bases: HasRegFilePath, ABC
- classmethod get_step()
Step of the workflow.
- class seismicrna.cluster.io.ClusterIO
Bases: ClusterFile, RegFileIO, ABC
- seismicrna.cluster.jackpot.bootstrap_jackpot_scores(uniq_end5s: ndarray, uniq_end3s: ndarray, counts_per_uniq: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray, real_jackpot_score: float, confidence_level: float, max_jackpot_quotient: float)
Bootstrap jackpotting scores from the null model.
- Parameters:
- uniq_end5s (np.ndarray) – 1D (unique reads) array of the 5’ ends (0-indexed) of the unique reads.
- uniq_end3s (np.ndarray) – 1D (unique reads) array of the 3’ ends (0-indexed) of the unique reads.
- counts_per_uniq (np.ndarray) – 1D (unique reads) array of the number of times each unique read was observed.
- p_mut (np.ndarray) – 2D (positions x clusters) array of the mutation rate at each position in each cluster.
- p_ends (np.ndarray) – 2D (positions x positions) array of the probability distribution of 5’/3’ end coordinates.
- p_clust (np.ndarray) – 1D (clusters) probability that a read comes from each cluster.
- min_mut_gap (int) – Minimum number of non-mutated positions between two mutations.
- unmasked (np.ndarray) – Positions (0-indexed) of p_mut that are not masked.
- real_jackpot_score (float) – Jackpotting score of the real dataset.
- confidence_level (float) – Confidence level for computing a confidence interval of the jackpotting quotient.
- max_jackpot_quotient (float) – Stop bootstrapping once the confidence interval lies entirely above or entirely below this threshold.
- Returns:
Array of normalized G-test statistics for the null model.
- Return type:
np.ndarray
- seismicrna.cluster.jackpot.calc_jackpot_quotient(real_jackpot_score: float, null_jackpot_score: float)
Calculate the jackpotting quotient.
The jackpotting quotient indicates how much more overrepresented the average read is in the real dataset compared to the null dataset.
Since the jackpotting score is the expected log-ratio of a read’s observed to expected count, raising e to the power of it yields the observed-to-expected ratio, which measures jackpotting.
JS = average{log(O / E)}
exp(JS) = exp(average{log(O / E)})
Then, the jackpotting quotient is the exponentiated jackpotting score of the real dataset divided by that of the null dataset:
JQ = exp(JS_real) / exp(JS_null) = exp(JS_real - JS_null)
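As a worked illustration of the identity above (not the library function itself):

    import numpy as np

    def jackpot_quotient(js_real: float, js_null: float) -> float:
        # JQ = exp(JS_real) / exp(JS_null) = exp(JS_real - JS_null)
        return float(np.exp(js_real - js_null))

    # Example: js_real = 0.30, js_null = 0.25 gives JQ = exp(0.05) ≈ 1.05.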
- seismicrna.cluster.jackpot.calc_jackpot_score(semi_g_anomalies: ndarray, n_reads: int)
Calculate the jackpotting score.
The jackpotting score is defined as the average log of the ratio of observed to expected counts for a read, weighted by the count of each read:
JS = sum{O_i * log(O_i / E_i)} / sum{O_i}, where i indexes the unique reads.
This formula is based on that of the G-test statistic, which is identical except that the latter is multiplied by 2 and not divided by the number of observations.
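A minimal re-implementation of this formula for illustration (not the library's calc_jackpot_score, whose inputs are semi-G-anomalies and a read count):

    import numpy as np

    def jackpot_score_from_counts(obs: np.ndarray, exp: np.ndarray) -> float:
        # JS = sum{O_i * log(O_i / E_i)} / sum{O_i}, i indexing unique reads.
        obs = np.asarray(obs, dtype=float)
        exp = np.asarray(exp, dtype=float)
        return float(np.sum(obs * np.log(obs / exp)) / np.sum(obs))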
- seismicrna.cluster.jackpot.calc_jackpot_score_ci(jackpot_scores: Iterable[float], confidence_level: float)
Calculate the confidence interval of the mean of an array of jackpotting scores.
- seismicrna.cluster.jackpot.calc_semi_g_anomaly(num_obs: int | ndarray, log_exp: float | ndarray)
Calculate each read’s semi-G-anomaly, i.e. half its contribution to the G-test statistic.
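Based on that description and the signature above, a hedged sketch of the per-read quantity (the G-test statistic is twice the sum of these over reads):

    import numpy as np

    def semi_g_anomaly(num_obs, log_exp):
        # Half a read's contribution to the G-test statistic:
        # O * log(O / E) = O * (log(O) - log(E)), with log(E) given as log_exp.
        return np.asarray(num_obs) * (np.log(num_obs) - np.asarray(log_exp))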
- seismicrna.cluster.jackpot.linearize_ends_matrix(p_ends: ndarray)
Linearize an N x N matrix of end coordinate probabilities/counts into a length-N list of 5’/3’ coordinates and their values.
- seismicrna.cluster.jackpot.sim_obs_exp(end5s: ndarray, end3s: ndarray, p_mut: ndarray, p_ends: ndarray, p_clust: ndarray, min_mut_gap: int, unmasked: ndarray)
Simulate observed and expected counts.
- seismicrna.cluster.main.run(input_path: Iterable[str | Path], *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, brotli_level: int = 10, num_cpus: int = 4, force: bool = False) list[Path]
Infer alternative structures by clustering reads’ mutations (a usage sketch follows the parameter list below).
- Parameters:
- branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]
- tmp_pfx (str | pathlib._local.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
- keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
- min_clusters (int) – Start at this many clusters [keyword-only, default: 1]
- max_clusters (int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]
- min_em_runs (int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]
- max_em_runs (int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]
- min_em_iter (int) – Run EM for at least this many iterations [keyword-only, default: 10]
- max_em_iter (int) – Run EM for at most this many iterations [keyword-only, default: 500]
- em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]
- min_marcd_run (float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]
- max_pearson_run (float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]
- max_arcd_vs_ens_avg (float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]
- max_gini_run (float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]
- jackpot (bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]
- jackpot_conf_level (float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]
- max_jackpot_quotient (float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]
- max_loglike_vs_best (float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]
- min_pearson_vs_best (float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]
- max_marcd_vs_best (float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]
- try_all_ks (bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]
- write_all_ks (bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]
- cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]
- cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]
- verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
- brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
- num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
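A hedged usage sketch of the Python API (the input path is hypothetical; every other option keeps its documented default except the number of CPUs):

    from seismicrna.cluster.main import run

    # Cluster the reads of one masked dataset, passing its Mask report file.
    out_files = run(["out/sample/mask/ref/region/mask-report.json"],
                    num_cpus=8)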
- seismicrna.cluster.marginal.calc_marginal(*args, **kwargs)
- seismicrna.cluster.marginal.calc_marginal_resps(*args, **kwargs)
- seismicrna.cluster.obsexp.assemble_log_obs_exp(uniq_reads: UniqReads, ks: list[EMRunsK])
Assemble the expected and observed log counts of each read.
- seismicrna.cluster.obsexp.graph_log_obs_exp(log_obs_exp: DataFrame, ks: list[EMRunsK], to_dir: Path)
Graph the expected vs. observed log counts of unique reads.
- seismicrna.cluster.obsexp.table_log_obs_exp(log_obs_exp: DataFrame, to_dir: Path)
Write the expected and observed log counts of unique reads.
- seismicrna.cluster.obsexp.write_obs_exp_counts(uniq_reads: UniqReads, ks: list[EMRunsK], to_dir: Path)
- seismicrna.cluster.params.get_table_path(top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, table: str, k: int, run: int)
Build a path for a table of clustering results.
- seismicrna.cluster.params.write_mus(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)
- seismicrna.cluster.params.write_pis(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int)
- seismicrna.cluster.params.write_single_run_table(run: EMRun, top: Path, branches: dict[str, str], sample: str, ref: str, reg: str, rank: int, *, attr: str, table: str)
Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file.
- class seismicrna.cluster.report.BaseClusterReport(**kwargs: Any | Callable[[Report], Any])
Bases: RegReport, ClusterIO, ABC
- classmethod get_file_seg_type()
Type of the last segment in the path.
- class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])
Bases: BatchedReport, BaseClusterReport
- classmethod from_clusters(ks: list[EMRunsK], uniq_reads: UniqReads, *, min_clusters: int, max_clusters: int, min_em_runs: int, max_em_runs: int, try_all_ks: bool, write_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, ks_written: list[int], checksums: list[str], began: datetime, ended: datetime)
Create a ClusterReport from the results of EM clustering.
- classmethod get_file_seg_type()
Type of the last segment in the path.
- classmethod get_param_report_fields()
Parameter fields of the report.
- classmethod get_result_report_fields()
Result fields of the report.
- class seismicrna.cluster.report.JoinClusterReport(**kwargs: Any | Callable[[Report], Any])
Bases: JoinReport, BaseClusterReport
- classmethod get_param_report_fields()
Parameter fields of the report.
- seismicrna.cluster.summary.graph_attr(attr: Series, passing_text: list[str] | None = None)
Graph one attribute.
- seismicrna.cluster.summary.tabulate_attr(ks: list[EMRunsK], attr: str)
Tabulate the values for one attribute.
- class seismicrna.cluster.uniq.UniqReads(sample: str, branches: dict[str, str], region: Region, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)
Bases: EndCoords
Collection of bit vectors of unique reads.
- classmethod from_dataset(dataset: MaskMutsDataset, branch: str, **kwargs)
Get unique reads from a dataset.
- classmethod from_dataset_contig(dataset: MaskMutsDataset, branch: str)
Get unique reads from a dataset of contiguous reads.
- get_cov_matrix()
Full boolean matrix of the covered positions.
- get_mut_matrix()
Full boolean matrix of the mutations.
- property log_obs
Log number of times each read was observed.
- property num_batches
Number of batches.
- property num_obs
Number of times each read was observed.
- property num_uniq
Number of unique reads.
- property read_end3s_zero
3’ end coordinates (0-indexed in the region).
- property read_end5s_zero
5’ end coordinates (0-indexed in the region).
- property ref
Reference name.
- property seg_end3s_zero
3’ end of every segment (0-indexed in the region).
- property seg_end5s_zero
5’ end of every segment (0-indexed in the region).
- property uniq_names
Unique read names as byte strings.
- seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[RegionMutsBatch], **kwargs)
- seismicrna.cluster.write.cluster(dataset: MaskMutsDataset | JoinMaskMutsDataset, *, branch: str, min_clusters: int, max_clusters: int, try_all_ks: bool, write_all_ks: bool, max_em_runs: int, num_cpus: int, brotli_level: int, force: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, tmp_pfx, keep_tmp, **kwargs)
Cluster unique reads from one mask dataset.
- seismicrna.cluster.write.run_k(uniq_reads: UniqReads, k: int, num_runs: int, *, num_cpus: int, **kwargs) list[EMRun]
Run EM with a specific number of clusters.
- seismicrna.cluster.write.run_ks(uniq_reads: UniqReads, ks: Iterable[int], *, min_em_runs: int, max_em_runs: int, try_all_ks: bool, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_jackpot_quotient: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, min_iter: int, max_iter: int, em_thresh: float, num_cpus: int, top: Path, **kwargs)
Run EM with multiple numbers of clusters.