seismicrna package

Subpackages

Submodules

exception seismicrna.ensembles.CalcRefRegionLengthError

Bases: ValueError

Error when calculating mutation densities.

class seismicrna.ensembles.RegionInfo(reg: str, end5: int, end3: int, ks: Iterable[int], report_file: Path, verify_times: bool, num_cpus: int)

Bases: object

property clust_params

property ends

seismicrna.ensembles.calc_ref_region_length(datasets: Iterable[MutsDataset], pattern: RelPattern, mask_discontig: bool, min_mut_gap: int)

seismicrna.ensembles.calc_regions(total_end5: int, total_end3: int, region_length: int, region_min_overlap: float)

seismicrna.ensembles.generate_regions(input_path: Iterable[str | Path], coords: Iterable[tuple[str, int, int]], primers: Iterable[tuple[str, DNA, DNA]], primer_gap: int, regions_file: str | None, region_length: int, region_min_overlap: float, mask_del: bool, mask_ins: bool, mask_mut: list[str], count_mut: list[str], mask_discontig: bool, min_mut_gap: int): For each reference, list the regions over which to mask.

seismicrna.ensembles.group_clusters(cluster_dirs: Iterable[Path], max_marcd_join, verify_times: bool, num_cpus: int)

seismicrna.ensembles.run(input_path: Iterable[str | Path], *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), primer_gap: int = 0, mask_regions_file: str | None = None, mask_del: bool = True, mask_ins: bool = True, mask_mut: Iterable[str] = (), count_mut: Iterable[str] = (), mask_polya: int = 5, mask_gu: bool = True, mask_pos: Iterable[tuple[str, int]] = (), mask_pos_file: Iterable[str | Path] = (), mask_read: Iterable[str] = (), mask_read_file: Iterable[str | Path] = (), mask_discontig: bool = True, min_ncov_read: int = 1, min_finfo_read: float = 0.95, max_fmut_read: float = 1.0, min_mut_gap: int = 4, min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, quick_unbias: bool = True, quick_unbias_thresh: float = 0.001, max_mask_iter: int = 0, mask_pos_table: bool = True, mask_read_table: bool = True, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, joined: str = '', region_length: int = 0, region_min_overlap: float = 0.6666666666666666, max_marcd_join: float = 0.016)

Infer independent structure ensembles along an entire RNA.

Parameters:

branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]
tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]
primer_gap (int) – Leave a gap of this many bases between the primer and the region [keyword-only, default: 0]
mask_regions_file (str | None) – Select regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
mask_del (bool) – Mask deletions [keyword-only, default: True]
mask_ins (bool) – Mask insertions [keyword-only, default: True]
mask_mut (Iterable) – Mask this type of mutation [keyword-only, default: ()]
count_mut (Iterable) – Count only this type of mutation [keyword-only, default: ()]
mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]
mask_gu (bool) – Mask G and U bases [keyword-only, default: True]
mask_pos (Iterable) – Mask this position in this reference [keyword-only, default: ()]
mask_pos_file (Iterable) – Mask positions in references from a file [keyword-only, default: ()]
mask_read (Iterable) – Mask the read with this name [keyword-only, default: ()]
mask_read_file (Iterable) – Mask the reads with names in this file [keyword-only, default: ()]
mask_discontig (bool) – Mask paired-end reads with discontiguous mates [keyword-only, default: True]
min_ncov_read (int) – Mask reads with fewer than this many bases covering the region [keyword-only, default: 1]
min_finfo_read (float) – Mask reads with less than this fraction of informative base calls [keyword-only, default: 0.95]
max_fmut_read (float) – Mask reads with more than this fraction of mutated base calls [keyword-only, default: 1.0]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
min_ninfo_pos (int) – Mask positions with fewer than this many informative base calls [keyword-only, default: 1000]
max_fmut_pos (float) – Mask positions with more than this fraction of mutated base calls [keyword-only, default: 1.0]
quick_unbias (bool) – Correct observer bias using a quick (typically linear time) heuristic [keyword-only, default: True]
quick_unbias_thresh (float) – Treat mutated fractions under this threshold as 0 with --quick-unbias [keyword-only, default: 0.001]
max_mask_iter (int) – Stop masking after this many iterations (0 for no limit) [keyword-only, default: 0]
mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]
mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]
min_clusters (int) – Start at this many clusters [keyword-only, default: 1]
max_clusters (int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]
min_em_runs (int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]
max_em_runs (int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]
jackpot (bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]
jackpot_conf_level (float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]
max_jackpot_quotient (float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]
min_em_iter (int) – Run EM for at least this many iterations [keyword-only, default: 10]
max_em_iter (int) – Run EM for at most this many iterations [keyword-only, default: 500]
em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]
min_marcd_run (float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]
max_pearson_run (float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]
max_arcd_vs_ens_avg (float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]
max_gini_run (float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]
max_loglike_vs_best (float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]
min_pearson_vs_best (float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]
max_marcd_vs_best (float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]
try_all_ks (bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]
write_all_ks (bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]
cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]
cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
joined (str) – Name of the region formed by joining other regions [keyword-only, default: ‘’]
region_length (int) – Make each region this length (if 0, then calculate the length over which the average read has 2 mutations) [keyword-only, default: 0]
region_min_overlap (float) – Make adjacent regions overlap by at least this fraction of length [keyword-only, default: 0.6666666666666666]
max_marcd_join (float) – Join regions with the same numbers of clusters only if the mean arcsine distance (MARCD) of their mutation rates and proportions does not exceed this threshold [keyword-only, default: 0.016]
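
The interplay of region_length and region_min_overlap can be illustrated with a small tiling routine. This is a minimal sketch, not the implementation of calc_regions: the function name tile_regions, the even spacing of region starts, and the rounding rules are all assumptions made for illustration.

```python
def tile_regions(total_end5: int, total_end3: int,
                 region_length: int, region_min_overlap: float):
    """Tile [total_end5, total_end3] with equal-length, overlapping regions.

    Hypothetical helper: returns a list of (end5, end3) tuples, 1-indexed
    and inclusive, where adjacent regions share at least
    region_min_overlap * region_length positions.
    """
    if not 0.0 <= region_min_overlap < 1.0:
        raise ValueError("region_min_overlap must be in [0, 1)")
    total_length = total_end3 - total_end5 + 1
    if region_length < 1 or total_length < 1:
        raise ValueError("lengths must be positive")
    if region_length >= total_length:
        # One region covers everything.
        return [(total_end5, total_end3)]
    # Largest shift between adjacent region starts that still keeps the
    # required overlap (at least 1 so that the tiling makes progress).
    max_step = max(int(region_length * (1.0 - region_min_overlap)), 1)
    # Distance between the first and last region starts.
    span = total_length - region_length
    # Ceiling division: fewest steps whose average size is <= max_step.
    num_steps = -(-span // max_step)
    regions = []
    for i in range(num_steps + 1):
        # Integer arithmetic spreads the starts evenly without rounding drift.
        end5 = total_end5 + (i * span) // num_steps
        regions.append((end5, end5 + region_length - 1))
    return regions


if __name__ == "__main__":
    for region in tile_regions(1, 100, 40, 0.5):
        print(region)
```

With a total span of 1 to 100, a region length of 40, and a minimum overlap of 0.5, the sketch yields four regions whose starts are 20 positions apart, so each adjacent pair shares 20 positions.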

seismicrna.interface.dataset_from_report(report_path: str | Path, verify_times: bool = True) → MutsDataset

Load a dataset from a report file.

Parameters:

report_path (str | Path) – The path to a report json file from the relate, mask, or cluster steps.
verify_times (bool = True) – Ensure that the report file does not have a timestamp that is earlier than that of one of its constituents.

Returns:

The type of MutsDataset returned depends on the report file.

Return type:

RelateMutsDataset | MaskMutsDataset | ClusterMutsDataset
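
The idea behind verify_times can be sketched as a comparison of timestamps. This is only an illustration under assumptions: the helper name check_report_times is hypothetical, and how SEISMIC-RNA actually stores and compares report timestamps is not shown in this reference.

```python
from datetime import datetime


def check_report_times(report_time: datetime, constituent_times) -> bool:
    """Raise ValueError if a report predates any of its constituents.

    Hypothetical helper: report_time is the timestamp of the report being
    loaded; constituent_times are timestamps of the reports it was built from.
    """
    latest = max(constituent_times, default=None)
    if latest is not None and report_time < latest:
        raise ValueError(
            f"Report time {report_time} is earlier than "
            f"constituent time {latest}"
        )
    return True
```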

seismicrna.interface.table_from_dataset(dataset: MutsDataset, table: str = 'pos') → TableWriter

Tabulate a dataset to generate a TableWriter.

Parameters:

dataset (RelateMutsDataset | MaskMutsDataset | ClusterMutsDataset) – A dataset from the Relate, Mask, or Cluster steps.
table (str = "pos") – The type of table to generate. Valid options include “pos” for per-position table, “read” for per-read table, and “abundance” for a cluster abundance table.

Returns:

The type of TableWriter returned depends on the Dataset type.

Return type:

TableWriter
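
The two functions above can be chained to go from a report file to a table. The following is a minimal sketch, assuming seismicrna is installed: the wrapper load_table and the constant VALID_TABLES are hypothetical names, and the up-front check is stricter than the documentation requires, since it only says that valid options "include" these three.

```python
from pathlib import Path

# Table types named in the documentation of table_from_dataset.
VALID_TABLES = ("pos", "read", "abundance")


def load_table(report_path, table: str = "pos", verify_times: bool = True):
    """Load a dataset from a report file, then tabulate it.

    Hypothetical wrapper around the two documented functions.
    """
    if table not in VALID_TABLES:
        raise ValueError(f"table must be one of {VALID_TABLES}, got {table!r}")
    # Imported lazily so that the argument check works without seismicrna.
    from seismicrna.interface import dataset_from_report, table_from_dataset
    dataset = dataset_from_report(Path(report_path), verify_times=verify_times)
    return table_from_dataset(dataset, table=table)
```

A call such as load_table("path/to/report.json", table="read") would then return a TableWriter for the per-read table; the path here is a placeholder.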

seismicrna.interface.table_from_report(report_path: str | Path, verify_times: bool = True, table: str = 'pos')

seismicrna.join.join_regions(out_dir: Path, name: str, sample: str, branches_flat: Iterable[str], ref: str, regs: Iterable[str], clustered: bool, *, clusts: dict[str, dict[int, dict[int, int]]], mask_pos_table: bool, mask_read_table: bool, cluster_pos_table: bool, cluster_abundance_table: bool, verify_times: bool, num_cpus: int, force: bool, tmp_pfx, keep_tmp)

Join one or more regions (horizontally).

Parameters:

out_dir (pathlib.Path) – Output directory.
name (str) – Name of the joined region.
branches_flat (Iterable[str]) – Branches of the datasets being joined.
sample (str) – Name of the sample.
ref (str) – Name of the reference.
regs (Iterable[str]) – Names of the regions being joined.
clustered (bool) – Whether the dataset is clustered.
tmp_dir (Path) – Temporary directory.
clusts (dict[str, dict[int, dict[int, int]]]) – For each region, for each number of clusters, the cluster from the original region to use as the cluster in the joined region (ignored if clustered is False).
mask_pos_table (bool) – Tabulate relationships per position for mask data.
mask_read_table (bool) – Tabulate relationships per read for mask data.
cluster_pos_table (bool) – Tabulate relationships per position for cluster data.
cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data.
verify_times (bool) – Verify that report files from later steps have later timestamps.
num_cpus (int) – Number of processors to use.
force (bool) – Force the report to be written, even if it exists.

Returns:

Path of the Join report file.

Return type:

pathlib.Path
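
The nested shape of the clusts argument is easier to see with a concrete value. The sketch below is an illustration only: the region names are invented, the helper validate_clusts is hypothetical, and it assumes that clusters are numbered from 1 and that the innermost dict maps each cluster of the joined region to the cluster of the original region to use for it.

```python
def validate_clusts(clusts: dict) -> bool:
    """Check a {region: {k: {joined_cluster: original_cluster}}} mapping.

    Hypothetical helper: for every region and number of clusters k, both the
    keys and the values must be exactly the cluster numbers 1..k.
    """
    for region, by_k in clusts.items():
        for k, mapping in by_k.items():
            expected = set(range(1, k + 1))
            if set(mapping) != expected:
                raise ValueError(f"{region}, k={k}: keys must be {expected}")
            if set(mapping.values()) != expected:
                raise ValueError(f"{region}, k={k}: values must be {expected}")
    return True


# Example: join two regions; for k=2, swap the clusters of the second region.
example_clusts = {
    "region_a": {1: {1: 1}, 2: {1: 1, 2: 2}},
    "region_b": {1: {1: 1}, 2: {1: 2, 2: 1}},
}
```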

seismicrna.join.joined_mask_report_exists(top: Path, sample: str, branches_flat: Iterable[str], ref: str, joined: str, regs: Iterable[str]): Return whether a mask report for the joined region exists.

seismicrna.join.parse_join_clusts_file(file: str | Path): Parse a file of joined clusters.

seismicrna.join.run(input_path: Iterable[str | Path], *, joined: str = '', join_clusts: str | None = None, mask_pos_table: bool = True, mask_read_table: bool = True, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, num_cpus: int = 4, force: bool = False) → list[Path]

Merge regions (horizontally) from the Mask or Cluster step.

Parameters:

joined (str) – Name of the region formed by joining other regions [keyword-only, default: ‘’]
join_clusts (str | None) – Specify which clusters to join using this CSV file [keyword-only, default: None]
mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]
mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]
cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]
cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

seismicrna.join.write_report(report_type: type[JoinReport], out_dir: Path, **kwargs)

seismicrna.lists.find_pos(table: PositionTable, max_fmut_pos: float, complement: bool)

seismicrna.lists.iter_tables(input_path: Iterable[str | Path], **kwargs): Iterate through all types of List and all Tables from which each type of List can be created.

seismicrna.lists.run(input_path: Iterable[str | Path], *, branch: str = '', min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, force: bool = False, num_cpus: int = 4) → list[Path]

List positions to mask.

Parameters:

branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]
min_ninfo_pos (int) – Mask positions with fewer than this many informative base calls [keyword-only, default: 1000]
max_fmut_pos (float) – Mask positions with more than this fraction of mutated base calls [keyword-only, default: 1.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
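
The two thresholds above can be read as a simple per-position filter. This is a minimal sketch of that reading, not the code behind find_pos or write_list: the function name positions_to_list and the input format (a dict mapping each position to its counts of informative and mutated base calls) are assumptions.

```python
def positions_to_list(counts: dict,
                      min_ninfo_pos: int = 1000,
                      max_fmut_pos: float = 1.0) -> list:
    """Return the positions that the thresholds would mask.

    Hypothetical helper: counts maps position -> (ninfo, nmut), where ninfo
    is the number of informative base calls and nmut the number of mutated
    base calls at that position.
    """
    listed = []
    for pos, (ninfo, nmut) in sorted(counts.items()):
        if ninfo < min_ninfo_pos:
            # Too few informative base calls.
            listed.append(pos)
        elif ninfo > 0 and nmut / ninfo > max_fmut_pos:
            # Fraction of mutated base calls is too high.
            listed.append(pos)
    return listed


if __name__ == "__main__":
    example = {1: (500, 10), 2: (2000, 100), 3: (2000, 1200)}
    print(positions_to_list(example, min_ninfo_pos=1000, max_fmut_pos=0.5))
```

In the example, position 1 is listed for having too few informative base calls and position 3 for exceeding the mutated fraction limit; position 2 passes both checks.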

seismicrna.lists.write_list(table: PositionTableLoader, list_type: type[List], *, branch: str, min_ninfo_pos: int, max_fmut_pos: float, force: bool): Write a List based on a Table.

seismicrna.logo.compute_arc_points(center, radius, theta1, theta2, n=100)

seismicrna.logo.draw_seismic_logo(report: bool = False, out_svg: str | Path | None = None, dpi: int = 300)

exception seismicrna.migrate.FindAndReplaceError: Bases: ValueError

seismicrna.migrate.find_and_replace(file: Path, find: str, replace: str, count: int = 1)

seismicrna.migrate.migrate_align_dir(align_dir: Path)

seismicrna.migrate.migrate_cluster_ref_dir(ref_dir: Path, num_cpus: int)

seismicrna.migrate.migrate_cluster_reg_dir(reg_dir: Path)

seismicrna.migrate.migrate_fold_ref_dir(ref_dir: Path, num_cpus: int)

seismicrna.migrate.migrate_fold_reg_dir(reg_dir: Path)

seismicrna.migrate.migrate_mask_ref_dir(ref_dir: Path, num_cpus: int)

seismicrna.migrate.migrate_mask_reg_dir(reg_dir: Path)

seismicrna.migrate.migrate_out_dir(out_dir: Path, num_cpus: int)

seismicrna.migrate.migrate_relate_ref_dir(ref_dir: Path)

seismicrna.migrate.migrate_sample_dir(sample_dir: Path, num_cpus: int)

seismicrna.migrate.run(input_path: Iterable[str | Path], *, out_dir: str | Path = './out', num_cpus: int = 4)

Migrate output directories from v0.23 to v0.24.

Parameters:

out_dir (str | pathlib.Path) – Write all output files to this directory [keyword-only, default: ‘./out’]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.migrate.update_brickle(file: str | Path, md5_checksum: str)

seismicrna.migrate.update_brickles(report: Report, top: str | Path)

seismicrna.pool.pool_samples(out_dir: Path, name: str, branches_flat: Iterable[str], ref: str, samples: Iterable[str], *, relate_pos_table: bool, relate_read_table: bool, verify_times: bool, num_cpus: int, force: bool, tmp_pfx, keep_tmp)

Pool one or more samples (vertically).

Parameters:

out_dir (pathlib.Path) – Output directory.
name (str) – Name of the pooled sample.
branches_flat (Iterable[str]) – Branches of the datasets being pooled.
ref (str) – Name of the reference.
samples (Iterable[str]) – Names of the samples in the pool.
tmp_dir (Path) – Temporary directory.
relate_pos_table (bool) – Tabulate relationships per position for relate data.
relate_read_table (bool) – Tabulate relationships per read for relate data.
verify_times (bool) – Verify that report files from later steps have later timestamps.
num_cpus (int) – Number of processors to use.
force (bool) – Force the report to be written, even if it exists.

Returns:

Path of the Pool report file.

Return type:

pathlib.Path

seismicrna.pool.run(input_path: Iterable[str | Path], *, pooled: str = '', relate_pos_table: bool = True, relate_read_table: bool = False, verify_times: bool = True, tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, num_cpus: int = 4, force: bool = False) → list[Path]

Merge samples (vertically) from the Relate step.

Parameters:

pooled (str) – Pooled sample name [keyword-only, default: ‘’]
relate_pos_table (bool) – Tabulate relationships per position for relate data [keyword-only, default: True]
relate_read_table (bool) – Tabulate relationships per read for relate data [keyword-only, default: False]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
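
One consequence of merging samples vertically is that per-position counts add up across samples. The sketch below illustrates that arithmetic only: it assumes pooling treats the reads of all samples as one sample, and the function name pool_counts and the dict-of-counts input are hypothetical rather than part of seismicrna.pool.

```python
def pool_counts(samples) -> dict:
    """Sum per-position counts over samples.

    Hypothetical helper: each sample maps position -> (ninfo, nmut), the
    numbers of informative and mutated base calls at that position.
    """
    pooled = {}
    for counts in samples:
        for pos, (ninfo, nmut) in counts.items():
            old_info, old_mut = pooled.get(pos, (0, 0))
            pooled[pos] = (old_info + ninfo, old_mut + nmut)
    return pooled


if __name__ == "__main__":
    sample_a = {1: (1000, 10)}   # mutated fraction 0.01
    sample_b = {1: (3000, 90)}   # mutated fraction 0.03
    ninfo, nmut = pool_counts([sample_a, sample_b])[1]
    print(ninfo, nmut, nmut / ninfo)
```

Under this assumption, the pooled mutated fraction is weighted by coverage (100 / 4000 = 0.025 in the example) and is not the plain mean of the two per-sample fractions (0.02).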

seismicrna.pool.write_report(out_dir: Path, **kwargs)

seismicrna.table.get_dataset_flags(dataset: MutsDataset, relate_pos_table: bool, relate_read_table: bool, mask_pos_table: bool, mask_read_table: bool, cluster_pos_table: bool, cluster_abundance_table: bool): Return the tabulator and table flags for a dataset.

seismicrna.table.get_tabulator_type(dataset_type: type[Dataset], count: bool = False): Determine which class of Tabulator can process the dataset.

seismicrna.table.load_all_datasets(input_path: Iterable[str | Path], **kwargs): Load datasets from all steps of the workflow.

seismicrna.table.run(input_path: Iterable[str | Path], *, relate_pos_table: bool = True, relate_read_table: bool = False, mask_pos_table: bool = True, mask_read_table: bool = True, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, num_cpus: int = 4, force: bool = False) → list[Path]

Tabulate counts of relationships per read and position.

Parameters:

relate_pos_table (bool) – Tabulate relationships per position for relate data [keyword-only, default: True]
relate_read_table (bool) – Tabulate relationships per read for relate data [keyword-only, default: False]
mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]
mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]
cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]
cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
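
Every run function in this reference takes input_path positionally and all other options after a bare *, so they must be passed by keyword. The stand-in below mirrors a subset of the signature of seismicrna.table.run to show the calling convention; run_stub is a hypothetical function that only echoes its arguments and does not tabulate anything.

```python
def run_stub(input_path, *,
             relate_pos_table: bool = True,
             relate_read_table: bool = False,
             num_cpus: int = 4,
             force: bool = False) -> dict:
    """Stand-in with the same calling convention as the documented run()."""
    return {
        "input_path": tuple(str(path) for path in input_path),
        "relate_pos_table": relate_pos_table,
        "relate_read_table": relate_read_table,
        "num_cpus": num_cpus,
        "force": force,
    }


if __name__ == "__main__":
    # Options are passed by keyword.
    print(run_stub(["out/sample1"], num_cpus=2, force=True))
    # Passing an option positionally raises TypeError.
    try:
        run_stub(["out/sample1"], True)
    except TypeError as error:
        print("TypeError:", error)
```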

seismicrna.table.tabulate(dataset: MutsDataset, tabulator_type: type[DatasetTabulator], pos_table: bool, read_table: bool, clust_table: bool, force: bool, num_cpus: int)

seismicrna.urls.open_url(url: str)

seismicrna.wf.flatten(nested)

seismicrna.wf.run(fasta: str | Path, input_path: Iterable[str | Path], *, out_dir: str | Path = './out', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, fastqz: Iterable[str | Path] = (), fastqy: Iterable[str | Path] = (), fastqx: Iterable[str | Path] = (), phred_enc: int = 33, demulti_overwrite: bool = False, demult_on: bool = False, parallel_demultiplexing: bool = False, clipped: int = 0, mismatch_tolerence: int = 0, index_tolerance: int = 0, barcode_start: int = 0, barcode_end: int = 0, dmfastqz: Iterable[str | Path] = (), dmfastqy: Iterable[str | Path] = (), dmfastqx: Iterable[str | Path] = (), fastp: bool = True, fastp_5: bool = False, fastp_3: bool = True, fastp_w: int = 6, fastp_m: int = 25, fastp_poly_g: str = 'auto', fastp_poly_g_min_len: int = 10, fastp_poly_x: bool = False, fastp_poly_x_min_len: int = 10, fastp_adapter_trimming: bool = True, fastp_adapter_1: str = '', fastp_adapter_2: str = '', fastp_adapter_fasta: str | None = None, fastp_detect_adapter_for_pe: bool = True, fastp_min_length: int = 9, bt2_local: bool = True, bt2_discordant: bool = False, bt2_mixed: bool = False, bt2_dovetail: bool = False, bt2_contain: bool = True, bt2_score_min_e2e: str = 'L,-1,-0.8', bt2_score_min_loc: str = 'L,1,0.8', bt2_i: int = 0, bt2_x: int = 600, bt2_gbar: int = 4, bt2_l: int = 20, bt2_s: str = 'L,1,0.1', bt2_d: int = 4, bt2_r: int = 2, bt2_dpad: int = 2, bt2_orient: str = 'fr', bt2_un: bool = True, min_mapq: int = 25, sep_strands: bool = False, f1r2_fwd: bool = False, rev_label: str = '-rev', min_phred: int = 25, min_reads: int = 1000, insert3: bool = True, ambindel: bool = True, overhangs: bool = True, clip_end5: int = 4, clip_end3: int = 4, batch_size: int = 65536, write_read_names: bool = False, relate_pos_table: bool = True, relate_read_table: bool = False, relate_cx: bool = True, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), 
primer_gap: int = 0, mask_regions_file: str | None = None, mask_del: bool = True, mask_ins: bool = True, mask_mut: Iterable[str] = (), count_mut: Iterable[str] = (), mask_polya: int = 5, mask_gu: bool = True, mask_pos: Iterable[tuple[str, int]] = (), mask_pos_file: Iterable[str | Path] = (), mask_read: Iterable[str] = (), mask_read_file: Iterable[str | Path] = (), mask_discontig: bool = True, min_ncov_read: int = 1, min_finfo_read: float = 0.95, max_fmut_read: float = 1.0, min_mut_gap: int = 4, min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, quick_unbias: bool = True, quick_unbias_thresh: float = 0.001, max_mask_iter: int = 0, mask_pos_table: bool = True, mask_read_table: bool = True, cluster: bool = False, min_clusters: int = 1, max_clusters: int = 0, min_em_runs: int = 6, max_em_runs: int = 30, jackpot: bool = True, jackpot_conf_level: float = 0.95, max_jackpot_quotient: float = 1.1, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, min_marcd_run: float = 0.016, max_pearson_run: float = 0.9, max_arcd_vs_ens_avg: float = 0.2, max_gini_run: float = 0.667, max_loglike_vs_best: float = 0.0, min_pearson_vs_best: float = 0.97, max_marcd_vs_best: float = 0.008, try_all_ks: bool = False, write_all_ks: bool = False, cluster_pos_table: bool = True, cluster_abundance_table: bool = True, verify_times: bool = True, fold: bool = False, fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_full: bool = True, fold_vienna: bool = False, fold_temp: float = 310.15, fold_fpaired: float = 0.5, fold_mu_eps: float = 0.005, fold_constraint: str | None = None, fold_commands: str | None = None, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, draw: bool = False, struct_num: Iterable[int] = (), color: bool = True, draw_svg: bool = True, draw_png: bool = False, update_rnartistcore: bool = False, export: bool = False, samples_meta: str 
= None, refs_meta: str = None, all_pos: bool = True, cgroup: str = 'k', hist_bins: int = 10, hist_margin: float = 0.1, struct_file: Iterable[str | Path] = (), terminal_pairs: bool = True, window: int = 45, winmin: int = 9, csv: bool = True, html: bool = True, svg: bool = False, pdf: bool = False, png: bool = False, graph_mprof: bool = True, graph_tmprof: bool = True, graph_ncov: bool = True, graph_mhist: bool = True, graph_abundance: bool = True, graph_giniroll: bool = False, graph_roc: bool = True, graph_aucroll: bool = False, graph_poscorr: bool = False, graph_mutdist: bool = False, mutdist_null: bool = True, collate: bool = True, name: str = 'collated', verbose_name: bool = False, include_svg: bool = True, include_graph: bool = True, group: str = 'sample', portable: bool = False, collate_out_dir: str | Path | None = None)

Run the entire workflow.
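
A call to this function might be assembled as follows. This is a sketch under assumptions, not a tested recipe: the helpers build_wf_kwargs and run_workflow are hypothetical, and it assumes that fastqx accepts a directory of paired FASTQ files and that input_path may be empty when the workflow starts from FASTQ files. Only keyword names that appear in the signature above are used.

```python
from pathlib import Path


def build_wf_kwargs(fastq_pairs_dir, out_dir="./out",
                    cluster: bool = False, num_cpus: int = 4) -> dict:
    """Collect a few keyword-only options for seismicrna.wf.run()."""
    return {
        "fastqx": (Path(fastq_pairs_dir),),  # mates 1 and 2 in separate files
        "out_dir": Path(out_dir),
        "cluster": bool(cluster),
        "num_cpus": int(num_cpus),
    }


def run_workflow(fasta, **kwargs):
    """Call seismicrna.wf.run() with a FASTA file and keyword options."""
    # Imported lazily so that build_wf_kwargs works without seismicrna.
    from seismicrna import wf
    # Assumption: no pre-existing input files, so input_path is empty.
    return wf.run(Path(fasta), (), **kwargs)
```

All options that are not set explicitly keep the defaults listed under Parameters.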

Parameters:

out_dir (str | pathlib.Path) – Write all output files to this directory [keyword-only, default: ‘./out’]
tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
fastqz (Iterable) – FASTQ file(s) of single-end reads [keyword-only, default: ()]
fastqy (Iterable) – FASTQ file(s) of paired-end reads with mates 1 and 2 interleaved [keyword-only, default: ()]
fastqx (Iterable) – FASTQ files of paired-end reads with mates 1 and 2 in separate files [keyword-only, default: ()]
phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]
demulti_overwrite (bool) – Whether to overwrite the grepped FASTQ files; use only when changing settings for the same sample [keyword-only, default: False]
demult_on (bool) – Enable demultiplexing [keyword-only, default: False]
parallel_demultiplexing (bool) – Whether to run demultiplexing at maximum speed by submitting multithreaded grep functions [keyword-only, default: False]
clipped (int) – Number of clipped patterns to search for in the sample; larger values increase computation time [keyword-only, default: 0]
mismatch_tolerence (int) – Maximum number of mismatches allowed for a string to still count as a valid pattern match; increases non-parallel computation at a factorial rate, so use caution above 2 mismatches; does not apply to clipped sequences [keyword-only, default: 0]
index_tolerance (int) – Maximum distance from the reference index at which the pattern may be found in a read [keyword-only, default: 0]
barcode_start (int) – Index of start of barcode [keyword-only, default: 0]
barcode_end (int) – Length of barcode [keyword-only, default: 0]
dmfastqz (Iterable) – Demultiplexed FASTQ files of single-end reads [keyword-only, default: ()]
dmfastqy (Iterable) – Demultiplexed FASTQ files of paired-end reads interleaved in one file [keyword-only, default: ()]
dmfastqx (Iterable) – Demultiplexed FASTQ files of mate 1 and mate 2 reads [keyword-only, default: ()]
fastp (bool) – Use fastp to QC, filter, and trim reads before alignment [keyword-only, default: True]
fastp_5 (bool) – Trim low-quality bases from the 5’ ends of reads [keyword-only, default: False]
fastp_3 (bool) – Trim low-quality bases from the 3’ ends of reads [keyword-only, default: True]
fastp_w (int) – Use this window size (nt) for --fastp-5 and --fastp-3 [keyword-only, default: 6]
fastp_m (int) – Use this mean quality threshold for --fastp-5 and --fastp-3 [keyword-only, default: 25]
fastp_poly_g (str) – Trim poly(G) tails (two-color sequencing artifacts) from the 3’ end [keyword-only, default: ‘auto’]
fastp_poly_g_min_len (int) – Minimum number of Gs to consider a poly(G) tail for --fastp-poly-g [keyword-only, default: 10]
fastp_poly_x (bool) – Trim poly(X) tails (i.e. of any nucleotide) from the 3’ end [keyword-only, default: False]
fastp_poly_x_min_len (int) – Minimum number of bases to consider a poly(X) tail for --fastp-poly-x [keyword-only, default: 10]
fastp_adapter_trimming (bool) – Trim adapter sequences from the 3’ ends of reads [keyword-only, default: True]
fastp_adapter_1 (str) – Trim this adapter sequence from the 3’ ends of read 1s [keyword-only, default: ‘’]
fastp_adapter_2 (str) – Trim this adapter sequence from the 3’ ends of read 2s [keyword-only, default: ‘’]
fastp_adapter_fasta (str | None) – Trim adapter sequences in this FASTA file from the 3’ ends of reads [keyword-only, default: None]
fastp_detect_adapter_for_pe (bool) – Automatically detect the adapter sequences for paired-end reads [keyword-only, default: True]
fastp_min_length (int) – Discard reads shorter than this length [keyword-only, default: 9]
bt2_local (bool) – Align reads in local mode rather than end-to-end mode [keyword-only, default: True]
bt2_discordant (bool) – Output paired-end reads whose mates align discordantly [keyword-only, default: False]
bt2_mixed (bool) – Attempt to align individual mates of pairs that fail to align [keyword-only, default: False]
bt2_dovetail (bool) – Consider dovetailed mate pairs to align concordantly [keyword-only, default: False]
bt2_contain (bool) – Consider nested mate pairs to align concordantly [keyword-only, default: True]
bt2_score_min_e2e (str) – Discard alignments that score below this threshold in end-to-end mode [keyword-only, default: ‘L,-1,-0.8’]
bt2_score_min_loc (str) – Discard alignments that score below this threshold in local mode [keyword-only, default: ‘L,1,0.8’]
bt2_i (int) – Discard paired-end alignments shorter than this many bases [keyword-only, default: 0]
bt2_x (int) – Discard paired-end alignments longer than this many bases [keyword-only, default: 600]
bt2_gbar (int) – Do not place gaps within this many bases from the end of a read [keyword-only, default: 4]
bt2_l (int) – Use this seed length for Bowtie2 [keyword-only, default: 20]
bt2_s (str) – Seed Bowtie2 alignments at this interval [keyword-only, default: ‘L,1,0.1’]
bt2_d (int) – Discard alignments if over this many consecutive seed extensions fail [keyword-only, default: 4]
bt2_r (int) – Re-seed reads with repetitive seeds up to this many times [keyword-only, default: 2]
bt2_dpad (int) – Pad the alignment matrix with this many bases (to allow gaps) [keyword-only, default: 2]
bt2_orient (str) – Require paired mates to have this orientation [keyword-only, default: ‘fr’]
bt2_un (bool) – Output unaligned reads to a FASTQ file [keyword-only, default: True]
min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]
sep_strands (bool) – Separate each alignment map into forward- and reverse-strand reads [keyword-only, default: False]
f1r2_fwd (bool) – With --sep-strands, consider forward mate 1s and reverse mate 2s to be forward-stranded [keyword-only, default: False]
rev_label (str) – With --sep-strands, add this label to each reverse-strand reference [keyword-only, default: ‘-rev’]
min_phred (int) – Mark base calls with Phred scores lower than this threshold as ambiguous [keyword-only, default: 25]
min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]
insert3 (bool) – Mark each insertion on the base to its 3’ (True) or 5’ (False) side [keyword-only, default: True]
ambindel (bool) – Mark all ambiguous insertions and deletions (indels) [keyword-only, default: True]
overhangs (bool) – Retain the overhangs of paired-end mates that dovetail [keyword-only, default: True]
clip_end5 (int) – Clip this many bases from the 5’ end of each read [keyword-only, default: 4]
clip_end3 (int) – Clip this many bases from the 3’ end of each read [keyword-only, default: 4]
batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]
write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]
relate_pos_table (bool) – Tabulate relationships per position for relate data [keyword-only, default: True]
relate_read_table (bool) – Tabulate relationships per read for relate data [keyword-only, default: False]
relate_cx (bool) – Use a fast (C extension module) version of the relate algorithm; the slow (Python) version is still available as a fallback if the C extension cannot be loaded, and for debugging/benchmarking [keyword-only, default: True]
mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]
primer_gap (int) – Leave a gap of this many bases between the primer and the region [keyword-only, default: 0]
mask_regions_file (str | None) – Select regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
mask_del (bool) – Mask deletions [keyword-only, default: True]
mask_ins (bool) – Mask insertions [keyword-only, default: True]
mask_mut (Iterable) – Mask this type of mutation [keyword-only, default: ()]
count_mut (Iterable) – Count only this type of mutation [keyword-only, default: ()]
mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]
mask_gu (bool) – Mask G and U bases [keyword-only, default: True]
mask_pos (Iterable) – Mask this position in this reference [keyword-only, default: ()]
mask_pos_file (Iterable) – Mask positions in references from a file [keyword-only, default: ()]
mask_read (Iterable) – Mask the read with this name [keyword-only, default: ()]
mask_read_file (Iterable) – Mask the reads with names in this file [keyword-only, default: ()]
mask_discontig (bool) – Mask paired-end reads with discontiguous mates [keyword-only, default: True]
min_ncov_read (int) – Mask reads with fewer than this many bases covering the region [keyword-only, default: 1]
min_finfo_read (float) – Mask reads with less than this fraction of informative base calls [keyword-only, default: 0.95]
max_fmut_read (float) – Mask reads with more than this fraction of mutated base calls [keyword-only, default: 1.0]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
min_ninfo_pos (int) – Mask positions with fewer than this many informative base calls [keyword-only, default: 1000]
max_fmut_pos (float) – Mask positions with more than this fraction of mutated base calls [keyword-only, default: 1.0]
quick_unbias (bool) – Correct observer bias using a quick (typically linear time) heuristic [keyword-only, default: True]
quick_unbias_thresh (float) – Treat mutated fractions under this threshold as 0 with --quick-unbias [keyword-only, default: 0.001]
max_mask_iter (int) – Stop masking after this many iterations (0 for no limit) [keyword-only, default: 0]
mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]
mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]
cluster (bool) – Cluster reads to find alternative structures [keyword-only, default: False]
min_clusters (int) – Start at this many clusters [keyword-only, default: 1]
max_clusters (int) – Stop at this many clusters (0 for no limit) [keyword-only, default: 0]
min_em_runs (int) – Run EM (successfully) at least this number of times for each K [keyword-only, default: 6]
max_em_runs (int) – Run EM (successfully or not) at most this number of times for each K [keyword-only, default: 30]
jackpot (bool) – Calculate the jackpotting quotient to find over-represented reads [keyword-only, default: True]
jackpot_conf_level (float) – Confidence level for the jackpotting quotient confidence interval [keyword-only, default: 0.95]
max_jackpot_quotient (float) – Remove runs whose jackpotting quotient exceeds this limit [keyword-only, default: 1.1]
min_em_iter (int) – Run EM for at least this many iterations [keyword-only, default: 10]
max_em_iter (int) – Run EM for at most this many iterations [keyword-only, default: 500]
em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]
min_marcd_run (float) – Remove runs with two clusters that differ by less than this MARCD [keyword-only, default: 0.016]
max_pearson_run (float) – Remove runs with two clusters more similar than this correlation [keyword-only, default: 0.9]
max_arcd_vs_ens_avg (float) – Remove runs where a cluster differs by more than this ARCD from the ensemble average at any position [keyword-only, default: 0.2]
max_gini_run (float) – Remove runs where any cluster’s Gini coefficient exceeds this limit [keyword-only, default: 0.667]
max_loglike_vs_best (float) – Remove Ks with a log likelihood gap larger than this (0 for no limit) [keyword-only, default: 0.0]
min_pearson_vs_best (float) – Remove Ks where every run has less than this correlation vs. the best [keyword-only, default: 0.97]
max_marcd_vs_best (float) – Remove Ks where every run has more than this MARCD vs. the best [keyword-only, default: 0.008]
try_all_ks (bool) – Try all numbers of clusters (Ks), even after finding the best number [keyword-only, default: False]
write_all_ks (bool) – Write all numbers of clusters (Ks), rather than only the best number [keyword-only, default: False]
cluster_pos_table (bool) – Tabulate relationships per position for cluster data [keyword-only, default: True]
cluster_abundance_table (bool) – Tabulate number of reads per cluster for cluster data [keyword-only, default: True]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
fold (bool) – Predict the secondary structure using the RNAstructure Fold program [keyword-only, default: False]
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_full (bool) – If no regions are specified, fold the full region (True) or the table’s region (False) [keyword-only, default: True]
fold_vienna (bool) – Use RNAfold from ViennaRNA as the folding engine [keyword-only, default: False]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_fpaired (float) – Scale mutation rates assuming this is the fraction of paired bases [keyword-only, default: 0.5]
fold_mu_eps (float) – Clip folding mutation rates to [eps, 1 - eps] to avoid division by 0 [keyword-only, default: 0.005]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_commands (str | None) – Command file for ViennaRNA [keyword-only, default: None]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
draw (bool) – Draw secondary structures with RNArtist. [keyword-only, default: False]
struct_num (Iterable) – Draw the specified structure (zero-indexed) or -1 for all structures. By default, draw the structure with the best AUROC. [keyword-only, default: ()]
color (bool) – Color bases by their reactivity [keyword-only, default: True]
draw_svg (bool) – Output each drawing in a Scalable Vector Graphics file [keyword-only, default: True]
draw_png (bool) – Output each drawing in a Portable Network Graphics file [keyword-only, default: False]
update_rnartistcore (bool) – Check for and install updates to RNArtistCore. [keyword-only, default: False]
export (bool) – Export each sample to SEISMICgraph (https://seismicrna.org) [keyword-only, default: False]
samples_meta (str | None) – Add sample metadata from this CSV file to exported results [keyword-only, default: None]
refs_meta (str | None) – Add reference metadata from this CSV file to exported results [keyword-only, default: None]
all_pos (bool) – Export all positions (not just unmasked positions) [keyword-only, default: True]
cgroup (str) – Put each Cluster in its own file, each K in its own file, or All clusters in one file [keyword-only, default: ‘k’]
hist_bins (int) – Number of bins in each histogram; must be ≥ 1 [keyword-only, default: 10]
hist_margin (float) – Autofill margins of at most this width in histograms of ratios [keyword-only, default: 0.1]
struct_file (Iterable) – Compare mutational profiles to the structure(s) in this CT file [keyword-only, default: ()]
terminal_pairs (bool) – Include terminal base pairs (at the ends of stems) in ROC curves [keyword-only, default: True]
window (int) – Use a sliding window of this many bases [keyword-only, default: 45]
winmin (int) – Mask sliding windows with fewer than this many data points [keyword-only, default: 9]
csv (bool) – Output the data for each graph in a Comma-Separated Values file [keyword-only, default: True]
html (bool) – Output each graph in an interactive HyperText Markup Language file [keyword-only, default: True]
svg (bool) – Output each graph in a Scalable Vector Graphics file [keyword-only, default: False]
pdf (bool) – Output each graph in a Portable Document Format file [keyword-only, default: False]
png (bool) – Output each graph in a Portable Network Graphics file [keyword-only, default: False]
graph_mprof (bool) – Graph mutational profiles [keyword-only, default: True]
graph_tmprof (bool) – Graph typed mutational profiles [keyword-only, default: True]
graph_ncov (bool) – Graph coverages per position [keyword-only, default: True]
graph_mhist (bool) – Graph histograms of mutations per read [keyword-only, default: True]
graph_abundance (bool) – Graph abundance of each cluster [keyword-only, default: True]
graph_giniroll (bool) – Graph rolling Gini coefficients [keyword-only, default: False]
graph_roc (bool) – Graph receiver operating characteristic (ROC) curves [keyword-only, default: True]
graph_aucroll (bool) – Graph rolling areas under ROC curves (AUC-ROC) [keyword-only, default: False]
graph_poscorr (bool) – Graph phi correlations between positions [keyword-only, default: False]
graph_mutdist (bool) – Graph distances between mutations [keyword-only, default: False]
mutdist_null (bool) – Include the null distribution of distances between mutations [keyword-only, default: True]
collate (bool) – Collate HTML graphs and SVG drawings into an HTML report file. [keyword-only, default: True]
name (str) – Prefix the HTML report with this name. [keyword-only, default: ‘collated’]
verbose_name (bool) – Add collated file information to report name. [keyword-only, default: False]
include_svg (bool) – Include RNA structure drawings from the draw module. [keyword-only, default: True]
include_graph (bool) – Include graphs from the graph module. [keyword-only, default: True]
group (str) – Group collated graphs by one of ‘sample’, ‘graph’, ‘branches’, ‘region’, or ‘all’. [keyword-only, default: ‘sample’]
portable (bool) – Embed collated graphs into the output HTML file for portability at the expense of live updates and file size. [keyword-only, default: False]
collate_out_dir (str | pathlib.Path | None) – Write collated report to this directory. By default, write to the lowest-level directory common to all input graphs. [keyword-only, default: None]
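
The string defaults of bt2_score_min_e2e and bt2_score_min_loc above (‘L,-1,-0.8’ and ‘L,1,0.8’) use Bowtie 2's function-option syntax: a letter for the function type (C constant, L linear, S square root, G natural log), then a constant term and a coefficient, evaluated on the read length. The sketch below shows how such a string could be turned into a numeric threshold. It is illustrative only: eval_bt2_function is a hypothetical helper, not part of seismicrna or Bowtie 2, and it assumes these strings are passed to Bowtie 2 unchanged.

```python
import math


def eval_bt2_function(spec: str, read_length: int) -> float:
    """Evaluate a Bowtie 2 function option such as 'L,-1,-0.8'.

    Hypothetical helper for illustration; not part of seismicrna.
    """
    parts = spec.split(",")
    kind = parts[0].strip().upper()
    const = float(parts[1])
    # The coefficient may be omitted for a constant function.
    coeff = float(parts[2]) if len(parts) > 2 else 0.0
    if kind == "C":
        return const
    if kind == "L":
        return const + coeff * read_length
    if kind == "S":
        return const + coeff * math.sqrt(read_length)
    if kind == "G":
        return const + coeff * math.log(read_length)
    raise ValueError(f"Unknown Bowtie 2 function type: {kind!r}")


# Minimum alignment scores for a 100-base read with the defaults listed above.
min_score_e2e = eval_bt2_function("L,-1,-0.8", 100)  # end-to-end mode
min_score_loc = eval_bt2_function("L,1,0.8", 100)    # local mode
print(min_score_e2e, min_score_loc)
```

With a linear function, the threshold scales with read length, so longer reads must reach a higher score in local mode and are allowed a larger penalty in end-to-end mode.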