seismicrna.ensembles package

Submodules

class seismicrna.ensembles.io.EnsemblesFile

Bases: HasRegFilePath, ABC

classmethod get_step()

Step of the workflow.

class seismicrna.ensembles.io.EnsemblesIO

Bases: EnsemblesFile, RegFileIO, ABC

seismicrna.ensembles.main.run(input_path: Iterable[str | Path] = Sentinel.UNSET, *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, tile_length: int = 0, tile_min_overlap: float = 0.5, erase_tiles: bool = True, pair_fdr: float = 0.05, min_pairs: int = 2, threshold_multiplier: float = 1.0, min_cluster_length: int = 20, max_cluster_length: int = 1200, gap_mode: str = 'omit', mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), primer_gap: int = 0, mask_regions_file: str | None = None, count_del: bool = True, count_ins: bool = True, no_mut: Iterable[str] = (), only_mut: Iterable[str] = (), probe: str = 'DMS', mask_a: bool | None = None, mask_c: bool | None = None, mask_g: bool | None = None, mask_u: bool | None = None, mask_polya: int = 5, mask_pos: Iterable[tuple[str, int]] = (), mask_pos_file: Iterable[str | Path] = (), mask_read: Iterable[str] = (), mask_read_file: Iterable[str | Path] = (), mask_discontig: bool = True, min_ncov_read: int = 1, min_fcov_read: float = 0.0, min_finfo_read: float = 0.95, max_fmut_read: float = 1.0, min_mut_gap: int | None = None, mut_collisions: str = 'auto', min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, quick_unbias: bool = True, quick_unbias_thresh: float = 0.001, max_mask_iter: int = 0, mask_pos_table: bool = True, mask_read_table: bool = True, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, max_jackpot_sims: int = 12, jackpot_max_data: int = 268435456, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, seed: int | None = None)

Infer independent structure ensembles along an entire RNA.
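The tile_length and tile_min_overlap parameters imply that the RNA is split into overlapping tiles before clustering. The sketch below shows one way such a tiling could be laid out. It is an illustration under stated assumptions, not SEISMIC-RNA's actual tiling code: plan_tiles and resolve_tile_length are hypothetical helpers, coordinates are assumed to be 1-based and inclusive, and tiles are assumed to be spread evenly along the region.

```python
import math
import statistics


def resolve_tile_length(tile_length, read_lengths):
    """Hypothetical helper: a tile_length of 0 means 2x the median read length."""
    if tile_length > 0:
        return tile_length
    return int(2 * statistics.median(read_lengths))


def plan_tiles(region_length, tile_length, tile_min_overlap):
    """Hypothetical helper: return 1-based inclusive (end5, end3) tiles."""
    if region_length < 1 or tile_length < 1:
        raise ValueError("region_length and tile_length must be positive")
    if not 0.0 <= tile_min_overlap < 1.0:
        raise ValueError("tile_min_overlap must be in [0, 1)")
    if tile_length >= region_length:
        # One tile is enough to cover the whole region.
        return [(1, region_length)]
    # Largest shift between adjacent tiles that preserves the minimum overlap.
    max_step = max(1, math.floor(tile_length * (1.0 - tile_min_overlap)))
    # Distance between the 5' ends of the first and last tiles.
    span = region_length - tile_length
    num_tiles = math.ceil(span / max_step) + 1
    tiles = []
    for i in range(num_tiles):
        # Spread the 5' ends evenly so the last tile ends at the region's 3' end.
        end5 = 1 + (i * span) // (num_tiles - 1)
        tiles.append((end5, end5 + tile_length - 1))
    return tiles
```

With a 1000-position region, 300-position tiles, and a minimum overlap of 0.5, this layout shifts each tile by at most 150 positions relative to its neighbor.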

Parameters:
  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: '']

  • tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0-11) [keyword-only, default: 10]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • tile_length (int) – Make each tile this length (if 0, use 2x the median read length) [keyword-only, default: 0]

  • tile_min_overlap (float) – Make adjacent tiles overlap by at least this fraction of length [keyword-only, default: 0.5]

  • erase_tiles (bool) – Erase the mask reports/batches from the tiling step [keyword-only, default: True]

  • pair_fdr (float) – Find correlated pairs at this false discovery rate (FDR) [keyword-only, default: 0.05]

  • min_pairs (int) – Cluster only the regions with at least this many correlated pairs [keyword-only, default: 2]

  • threshold_multiplier (float) – Multiply the threshold for detecting modules by this factor [keyword-only, default: 1.0]

  • min_cluster_length (int) – Cluster only the regions with at least this many positions [keyword-only, default: 20]

  • max_cluster_length (int) – Cluster only the regions with no more than this many positions [keyword-only, default: 1200]

  • gap_mode (str) – If there are gaps between regions to cluster, OMIT (do not cluster) the gaps, INSERT a new region into each gap, or EXPAND the existing regions to fill the gaps [keyword-only, default: 'omit']

  • mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • primer_gap (int) – Leave a gap of this many bases between the primer and the region [keyword-only, default: 0]

  • mask_regions_file (str | None) – Select regions of references from coordinates/primers in a CSV file [keyword-only, default: None]

  • count_del (bool) – Count deletions as mutations [keyword-only, default: True]

  • count_ins (bool) – Count insertions as mutations [keyword-only, default: True]

  • no_mut (Iterable) – Do not count this type of mutation (overrides --count-del/ins) [keyword-only, default: ()]

  • only_mut (Iterable) – Count only this type of mutation (overrides other mutation settings) [keyword-only, default: ()]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: 'DMS']

  • mask_a (bool | None) – Mask positions with base A [keyword-only, default: None]

  • mask_c (bool | None) – Mask positions with base C [keyword-only, default: None]

  • mask_g (bool | None) – Mask positions with base G [keyword-only, default: None]

  • mask_u (bool | None) – Mask positions with base U [keyword-only, default: None]

  • mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]

  • mask_pos (Iterable) – Mask this position in this reference [keyword-only, default: ()]

  • mask_pos_file (Iterable) – Mask positions in references from a file [keyword-only, default: ()]

  • mask_read (Iterable) – Mask the read with this name [keyword-only, default: ()]

  • mask_read_file (Iterable) – Mask the reads with names in this file [keyword-only, default: ()]

  • mask_discontig (bool) – Mask paired-end reads with discontiguous mates [keyword-only, default: True]

  • min_ncov_read (int) – Mask reads with fewer than this many bases covering the region [keyword-only, default: 1]

  • min_fcov_read (float) – Mask reads covering less than this fraction of the region [keyword-only, default: 0.0]

  • min_finfo_read (float) – Mask reads with less than this fraction of informative base calls [keyword-only, default: 0.95]

  • max_fmut_read (float) – Mask reads with more than this fraction of mutated base calls [keyword-only, default: 1.0]

  • min_mut_gap (int | None) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: None]

  • mut_collisions (str) – If two mutations are closer than --min-mut-gap positions, MERGE the mutations, DROP the read, or AUTO-select based on the probe [keyword-only, default: 'auto']

  • min_ninfo_pos (int) – Mask positions with fewer than this many informative base calls [keyword-only, default: 1000]

  • max_fmut_pos (float) – Mask positions with more than this fraction of mutated base calls [keyword-only, default: 1.0]

  • quick_unbias (bool) – Correct observer bias using a quick (typically linear time) heuristic [keyword-only, default: True]

  • quick_unbias_thresh (float) – Treat mutated fractions under this threshold as 0 with --quick-unbias [keyword-only, default: 0.001]

  • max_mask_iter (int) – Stop masking after this many iterations (0 for no limit) [keyword-only, default: 0]

  • mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]

  • mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]

  • min_clusters (int) – Start at this many clusters [keyword-only, default: 1]

  • max_clusters (int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]

  • min_em_runs (int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]

  • max_em_runs (int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]

  • jackpot (bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]

  • jackpot_conf_level (float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]

  • max_jackpot_quotient (float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]

  • max_jackpot_sims (int) – Maximum number of simulations to compute the jackpotting quotient [keyword-only, default: 12]

  • jackpot_max_data (int) – Skip calculating the jackpotting quotient if reads × positions exceeds this limit [keyword-only, default: 268435456]

  • min_em_iter (int) – Run EM for at least this many iterations [keyword-only, default: 10]

  • max_em_iter (int) – Run EM for at most this many iterations [keyword-only, default: 500]

  • em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]

  • min_marcd_run (float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]

  • max_pearson_run (float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]

  • max_arcd_vs_ens_avg (float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]

  • max_gini_run (float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]

  • max_loglike_vs_best (float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]

  • min_pearson_vs_best (float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]

  • max_marcd_vs_best (float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]

  • try_all_ks (bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]

  • write_all_ks (bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]

  • cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]

  • cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]

  • verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]
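
Since every parameter after input_path is keyword-only and has a default, a call to run() only needs to name the settings that differ. The following is a minimal sketch of that pattern. check_overrides, run_ensembles, and FRACTION_KEYS are hypothetical names introduced here, and the assumption that "fraction" and "rate" parameters lie in [0, 1] is inferred from their descriptions above rather than confirmed by the library.

```python
from pathlib import Path

# Overrides for a few of the keyword arguments documented above; every other
# parameter keeps its listed default.
overrides = {
    "num_cpus": 8,
    "brotli_level": 10,
    "tile_min_overlap": 0.5,
    "pair_fdr": 0.05,
    "max_clusters": 3,
    "seed": 0,
}

# Keywords described as fractions or rates (hypothetical grouping for this check).
FRACTION_KEYS = ("tile_min_overlap", "pair_fdr", "min_fcov_read",
                 "min_finfo_read", "max_fmut_read", "max_fmut_pos")


def check_overrides(kwargs):
    """Hypothetical helper: reject values outside the documented ranges."""
    level = kwargs.get("brotli_level", 10)
    if not 0 <= level <= 11:
        raise ValueError(f"brotli_level must be 0-11, but got {level}")
    for key in FRACTION_KEYS:
        # Assumption: fractions and rates must lie in [0, 1].
        if key in kwargs and not 0.0 <= kwargs[key] <= 1.0:
            raise ValueError(f"{key} must be in [0, 1], but got {kwargs[key]}")
    return kwargs


def run_ensembles(input_paths, **kwargs):
    """Validate the overrides, then call run() (needs seismicrna installed)."""
    checked = check_overrides(kwargs)
    # Imported lazily so the helpers above work without seismicrna.
    from seismicrna.ensembles.main import run
    return run([Path(p) for p in input_paths], **checked)
```

A call would then look like run_ensembles(["path/to/relate-report.json"], **overrides), where the path is a placeholder. That input_path accepts relate report files is an assumption based on the relate_report_file argument of seismicrna.ensembles.write.ensembles.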

class seismicrna.ensembles.report.EnsemblesReport(**kwargs: Any | Callable[[Report], Any])

Bases: RegReport, EnsemblesIO

classmethod get_file_seg_type()

Type of the last segment in the path.

classmethod get_param_report_fields()

Parameter fields of the report.

classmethod get_result_report_fields()

Result fields of the report.

seismicrna.ensembles.write.ensembles(relate_report_file: Path, *, branch: str, tmp_pfx: str | Path, keep_tmp: bool, brotli_level: int, force: bool, num_cpus: int, tile_length: int, tile_min_overlap: float, erase_tiles: bool, pair_fdr: float, min_pairs: int, threshold_multiplier: float, min_cluster_length: int, max_cluster_length: int, gap_mode: str, mask_coords: Iterable[tuple[str, int, int]], mask_primers: Iterable[tuple[str, DNA, DNA]], primer_gap: int, mask_regions_file: str | None, count_del: bool, count_ins: bool, no_mut: Iterable[str], only_mut: Iterable[str], probe: str, mask_a: bool | None, mask_c: bool | None, mask_g: bool | None, mask_u: bool | None, mask_polya: int, mask_pos: Iterable[tuple[str, int]], mask_pos_file: Iterable[str | Path], mask_read: Iterable[str], mask_read_file: Iterable[str | Path], mask_discontig: bool, min_ncov_read: int, min_fcov_read: float, min_finfo_read: float, max_fmut_read: float, min_mut_gap: int | None, mut_collisions: str, min_ninfo_pos: int, max_fmut_pos: float, quick_unbias: bool, quick_unbias_thresh: float, max_mask_iter: int, mask_pos_table: bool, mask_read_table: bool, min_clusters: int, max_clusters: int, min_em_runs: int, max_em_runs: int, jackpot: bool, jackpot_conf_level: float, max_jackpot_quotient: float, max_jackpot_sims: int, jackpot_max_data: int, min_em_iter: int, max_em_iter: int, em_thresh: float, min_marcd_run: float, max_pearson_run: float, max_arcd_vs_ens_avg: float, max_gini_run: float, max_loglike_vs_best: float, min_pearson_vs_best: float, max_marcd_vs_best: float, try_all_ks: bool, write_all_ks: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, seed: int | None)

Run one relate report through the full ensembles pipeline.