seismicrna.sim package

Subpackages

Submodules

seismicrna.sim.abstract.abstract_seismicgraph_file(seismicgraph_file: Path, min_aucroc: float = 0.0)
seismicrna.sim.abstract.abstract_table(table: MaskPositionTableLoader, struct_file: str | Path, min_aucroc: float = 0.0)
seismicrna.sim.abstract.get_acgt_parameters()
seismicrna.sim.abstract.get_other_parameters()
seismicrna.sim.abstract.new_parameter_dict()
seismicrna.sim.abstract.run(input_path: Iterable[str | Path] = Sentinel.UNSET, *, struct_file: Iterable[str | Path] = (), min_aucroc: float = 0.85, print_params: bool = True, verify_times: bool = True, num_cpus: int = 4)

Abstract simulation parameters from existing datasets.

Parameters:
  • struct_file (Iterable) – Compare mutational profiles to the structure(s) in this CT file [keyword-only, default: ()]

  • min_aucroc (float) – Skip tables/profiles where the AUC-ROC is less than this value [keyword-only, default: 0.85]

  • verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.clusts.load_pclust(pclust_file: Path)

Load cluster proportions from a file.

seismicrna.sim.clusts.run(*, ct_file: Iterable[str | Path] = (), clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4, seed: int | None = None)

Simulate cluster proportions for the structure(s) in each CT file.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]

seismicrna.sim.clusts.sim_pclust(num_clusters: int, concentration: float | None = None, sort: bool = True, seed: int | None = None)

Simulate proportions of clusters using a Dirichlet distribution.

Parameters:
  • num_clusters (int) – Number of clusters to simulate; must be ≥ 1.

  • concentration (float | None) – Concentration parameter for Dirichlet distribution; defaults to 1 / (num_clusters - 1); must be > 0.

  • sort (bool) – Sort the cluster proportions from greatest to least.

Returns:

Simulated proportion of each cluster.

Return type:

pd.Series
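
As a rough illustration of the Dirichlet sampling described above, the following standard-library sketch draws cluster proportions by normalizing independent gamma variates, which is a standard way to sample a symmetric Dirichlet distribution. The name sketch_sim_pclust, the plain list return value (the documented function returns a pd.Series), and the special case for a single cluster are assumptions made for this example; it is not the library's implementation.

```python
import random


def sketch_sim_pclust(num_clusters, concentration=None, sort=True, seed=None):
    """Hypothetical stand-in for sim_pclust; returns a plain list."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be >= 1")
    if num_clusters == 1:
        # Assumption: a single cluster trivially has proportion 1.
        return [1.0]
    if concentration is None:
        # Documented default: 1 / (num_clusters - 1).
        concentration = 1.0 / (num_clusters - 1)
    if concentration <= 0.0:
        raise ValueError("concentration must be > 0")
    rng = random.Random(seed)
    # Normalized Gamma(concentration, 1) draws follow a symmetric Dirichlet.
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_clusters)]
    total = sum(draws)
    proportions = [draw / total for draw in draws]
    if sort:
        # Greatest to least, matching the documented sort option.
        proportions.sort(reverse=True)
    return proportions


print(sketch_sim_pclust(3, seed=0))
```

Passing the same seed twice should give the same proportions, which is the usual reason to expose a seed parameter in a simulation function.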

seismicrna.sim.clusts.sim_pclust_ct(ct_file: Path, *, concentration: float, force: bool, seed: int | None)

Simulate cluster proportions for a CT file and write them to disk.

The number of clusters is inferred from the number of structures in the CT file.

Parameters:
  • ct_file (Path) – Path to the connectivity table (CT) file whose structures define the number of clusters.

  • concentration (float) – Concentration parameter for the Dirichlet distribution used to simulate cluster proportions; must be > 0.

  • force (bool) – Whether to overwrite an existing output file.

  • seed (int | None) – Random seed for reproducibility; None for no fixed seed.

Returns:

Path of the written cluster proportions CSV file.

Return type:

Path

seismicrna.sim.ends.load_pends(pends_file: Path)

Load end coordinate proportions from a file.

seismicrna.sim.ends.run(*, ct_file: Iterable[str | Path] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, force: bool = False, num_cpus: int = 4)

Simulate read end coordinate proportions for the region in each CT file.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]

  • center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]

  • length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]

  • length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.ends.sim_pends(end5: int, end3: int, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, keep_empty_reads: bool = True)

Simulate segment end coordinate probabilities.

Parameters:
  • end5 (int) – 5’ end of the region (minimum allowed 5’ end coordinate).

  • end3 (int) – 3’ end of the region (maximum allowed 3’ end coordinate).

  • center_fmean (float) – Mean read center, as a fraction of the reference length.

  • center_fvar (float) – Variance of the read center, as a fraction of its maximum.

  • length_fmean (float) – Mean read length, as a fraction of the available length.

  • length_fvar (float) – Variance of the read length, as a fraction of its maximum.

  • keep_empty_reads (bool) – Whether to keep reads whose lengths are 0.

Returns:

5’ and 3’ coordinates and their probabilities.

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]

seismicrna.sim.ends.sim_pends_ct(ct_file: Path, *, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, force: bool)

Simulate read-end coordinate probabilities for a CT file region.

Determines the region boundaries from the CT file, simulates the probability distribution over all (5’, 3’) end coordinate pairs, and writes the result to a CSV file.

Parameters:
  • ct_file (Path) – Path to the connectivity table (CT) file defining the region.

  • center_fmean (float) – Mean read center as a fraction of the region length (0 to 1).

  • center_fvar (float) – Variance of the read center as a fraction of its maximum (0 to 1).

  • length_fmean (float) – Mean read length as a fraction of the available length (0 to 1).

  • length_fvar (float) – Variance of the read length as a fraction of its maximum (0 to 1).

  • force (bool) – Whether to overwrite an existing output file.

Returns:

Path of the written end-coordinate proportions CSV file.

Return type:

Path

seismicrna.sim.fastq.from_param_dir(param_dir: Path, *, sample: str, profile: str, read_length: int, paired: bool, p_rev: float, fq_gzip: bool, force: bool, seed: int | None, **kwargs)

Simulate a FASTQ file from parameter files.

seismicrna.sim.fastq.from_report(report_file: Path, *, read_length: int, p_rev: float, fq_gzip: bool, force: bool, seed: int | None)

Simulate a FASTQ file from a Relate report.

seismicrna.sim.fastq.generate_fastq(top: Path, sample: str, ref: str, refseq: DNA, paired: bool, read_length: int, batches: Iterable[tuple[RelateRegionMutsBatch, ReadNamesBatch]], p_rev: float = 0.5, fq_gzip: bool = True, force: bool = False, seed: int | None = None)

Generate FASTQ file(s) from a dataset.

seismicrna.sim.fastq.generate_fastq_record(name: str, rels: ndarray, refseq: str, adapter: str, read_length: int, reverse: bool = False, hi_qual: str = 'I', lo_qual: str = '!')

Generate a FASTQ record for a read.

seismicrna.sim.fastq.run(*, input_path: Iterable[str | Path] = Sentinel.UNSET, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, probe: str = 'DMS', min_mut_gap: int | None = None, min_mut_gap_weights: str = '', mut_collisions: str = 'auto', mut_probs: str | None = '0.2,0.04,0.008', fq_gzip: bool = True, num_reads: int = 65536, num_cpus: int = 4, force: bool = False, seed: int | None = None)

Simulate FASTQ file(s) from relate reports or parameter directories.

Parameters:
  • input_path (Iterable[str | Path]) – Paths to relate report files or directories containing them; used to generate FASTQ files from existing relate data.

  • param_dir (Iterable) – Paths to simulation parameter directories; used to generate FASTQ files from CT/parameter files [keyword-only, default: ()]

  • profile_name (str) – Name of the mutation profile to use from the parameter directory [keyword-only, default: ‘simulated’]

  • sample (str) – Sample name to embed in the output FASTQ paths [keyword-only, default: ‘sim-sample’]

  • paired_end (bool) – Whether to simulate paired-end (True) or single-end (False) reads [keyword-only, default: True]

  • read_length (int) – Length of each simulated read, in base calls [keyword-only, default: 151]

  • reverse_fraction (float) – Fraction of reads where mate 1 is reverse-complemented [keyword-only, default: 0.5]

  • probe (str) – Chemical probe (e.g. DMS); sets the default mask options, including the default min_mut_gap [keyword-only, default: ‘DMS’]

  • min_mut_gap (int | None) – Mask reads with two mutations separated by fewer than this many bases; None to use the probe default [keyword-only, default: None]

  • min_mut_gap_weights (str) – Comma-separated gap:weight pairs defining a mixture of min_mut_gap biases, e.g. ‘0:0.18,2:0.05,3:0.21’; when given, overrides --min-mut-gap; empty string to use the single min_mut_gap [keyword-only, default: ‘’]

  • mut_collisions (str) – How to handle two mutations closer than --min-mut-gap positions: merge the mutations (‘merge’), drop the read (‘drop’), or select automatically based on the probe (‘auto’) [keyword-only, default: ‘auto’]

  • mut_probs (str | None) – Comma-separated probabilities of injecting a mutation at each successive position 5’ of an existing mutation (used with --mut-collisions merge) [keyword-only, default: ‘0.2,0.04,0.008’]

  • fq_gzip (bool) – Whether to gzip-compress the output FASTQ files (otherwise they are written as plain text) [keyword-only, default: True]

  • num_reads (int) – Total number of reads to simulate per param_dir run [keyword-only, default: 65536]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • seed (int | None) – Seed for the random number generator; None for no fixed seed [keyword-only, default: None]

Returns:

Paths of all generated FASTQ files.

Return type:

list[Path]

seismicrna.sim.fold.fold_region(region: Region, *, sim_dir: Path, tmp_dir: Path, profile_name: str, fold_backend: str, fold_constraint: Path | None, fold_temp: float, fold_md: int, fold_mfe: bool, fold_max: int, fold_min: int, fold_percent: float, keep_tmp: bool, force: bool, num_cpus: int)

Predict RNA secondary structures for one region using RNAstructure.

Parameters:
  • region (Region) – Sequence region to fold.

  • sim_dir (Path) – Simulation output directory; CT file is written under its parameter subdirectory.

  • tmp_dir (Path) – Directory for temporary FASTA and intermediate CT files.

  • profile_name (str) – Profile label embedded in the output CT file path.

  • fold_constraint (Path | None) – Path to a folding constraint file; None for no constraints.

  • fold_temp (float) – Folding temperature; interpreted as Celsius if in the typical physiological range, otherwise as Kelvin.

  • fold_md (int) – Maximum distance between paired bases (0 for no limit).

  • fold_mfe (bool) – Whether to predict only the minimum free energy structure.

  • fold_max (int) – Maximum number of structures to predict.

  • fold_percent (float) – Maximum percent energy difference from the MFE structure.

  • keep_tmp (bool) – Whether to retain temporary files after folding.

  • force (bool) – Whether to overwrite an existing output CT file.

  • num_cpus (int) – Number of CPU cores to use.

Returns:

Path of the written CT file.

Return type:

Path

seismicrna.sim.fold.get_ct_path(top: Path, region: Region, profile: str)

Get the path of a connectivity table (CT) file.

seismicrna.sim.fold.run(fasta: str | Path = Sentinel.UNSET, *, sim_dir: str | Path = './sim', profile_name: str = 'simulated', probe: str = 'DMS', fold_backend: str = 'auto', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 37.0, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_min: int = 1, fold_percent: float = 20.0, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4, tmp_pfx='./tmp')

Fold regions of a reference FASTA file and write CT files.

Parameters:
  • fasta (str | Path) – Path to the reference FASTA file.

  • sim_dir (str | Path) – Simulation output directory; all simulated files, including the CT parameter files, are written here [keyword-only, default: ‘./sim’]

  • profile_name (str) – Profile label given to the simulated structure and parameters and embedded in each output CT file path [keyword-only, default: ‘simulated’]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]

  • fold_backend (str) – Model RNA structures using Fold (RNAstructure), ShapeKnots (RNAstructure), or RNAfold (ViennaRNA); auto selects Fold for DMS and RNAfold for other probes [keyword-only, default: ‘auto’]

  • fold_coords (Iterable) – Explicit (ref, end5, end3) coordinate tuples, each defining a region of a reference to fold by its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • fold_primers (Iterable) – Forward and reverse primer sequences that define the boundaries of a region of a reference to fold [keyword-only, default: ()]

  • fold_regions_file (str | None) – Path to a CSV file listing regions of references to fold by coordinates/primers; None to use fold_coords/fold_primers [keyword-only, default: None]

  • fold_constraint (str | None) – Path to a file of folding constraints that force bases to be paired/unpaired; None for no constraints [keyword-only, default: None]

  • fold_temp (float) – Predict structures at this temperature (Celsius or Kelvin) [keyword-only, default: 37.0]

  • fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]

  • fold_mfe (bool) – Whether to predict only the minimum free energy (MFE) structure [keyword-only, default: False]

  • fold_max (int) – Output at most this many structures per region (overridden by --fold-mfe) [keyword-only, default: 20]

  • fold_min (int) – Require at least this many structures per region (overridden by --fold-mfe) [keyword-only, default: 1]

  • fold_percent (float) – Stop outputting structures when the percent difference in energy from the MFE structure exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • tmp_dir (Path) – Directory for temporary files (injected by run_func).

  • force (bool) – Force all tasks to run, overwriting any existing output files, including CT files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

Returns:

Paths of all written CT files.

Return type:

list[Path]

seismicrna.sim.muts.load_pmut(pmut_file: Path)

Load mutation rates from a file.

seismicrna.sim.muts.make_pmut_means(*, ploq: float, pam: float, pac: float, pag: float, pat: float, pcm: float, pca: float, pcg: float, pct: float, pgm: float, pga: float, pgc: float, pgt: float, ptm: float, pta: float, ptc: float, ptg: float, pnm: float, pnd: float)

Generate mean mutation rates.

Mutations are assumed to behave as follows:

  • A base n mutates with probability pnm.

    • If it mutates, then it is a substitution with probability (pna + pnc + png + pnt).

      • If it is a substitution, then it is high-quality with probability (1 - ploq).

      • Otherwise, it is low-quality.

    • Otherwise, it is a deletion.

  • Otherwise, it is low-quality with probability ploq.

So the overall probability of being low-quality is the probability given a mutation, pnm * (pna + pnc + png + pnt) * ploq, plus the probability given no mutation, (1 - pnm) * ploq, which equals ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))).
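
The algebra behind that closed form can be checked numerically. Writing s = pna + pnc + png + pnt, factoring ploq out of the sum gives ploq * (pnm * s + 1 - pnm) = ploq * (1 - pnm * (1 - s)). The sketch below compares the two expressions; the function names and the numeric values are illustrative assumptions, not part of the package.

```python
def p_low_quality_sum(pnm, p_sub, ploq):
    """Add the low-quality probabilities of the two branches."""
    given_mutation = pnm * p_sub * ploq
    given_no_mutation = (1.0 - pnm) * ploq
    return given_mutation + given_no_mutation


def p_low_quality_closed(pnm, p_sub, ploq):
    """Closed form obtained by factoring ploq out of the sum."""
    return ploq * (1.0 - pnm * (1.0 - p_sub))


# Illustrative values only; p_sub stands for pna + pnc + png + pnt.
pnm, p_sub, ploq = 0.25, 0.9, 0.02
difference = (p_low_quality_sum(pnm, p_sub, ploq)
              - p_low_quality_closed(pnm, p_sub, ploq))
assert abs(difference) < 1e-12
```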

Parameters:
  • ploq (float) – Probability that a base is low-quality.

  • pam (float) – Probability that an A is mutated.

  • pac (float) – Probability that a mutated A is a substitution to C.

  • pag (float) – Probability that a mutated A is a substitution to G.

  • pat (float) – Probability that a mutated A is a substitution to T.

  • pcm (float) – Probability that a C is mutated.

  • pca (float) – Probability that a mutated C is a substitution to A.

  • pcg (float) – Probability that a mutated C is a substitution to G.

  • pct (float) – Probability that a mutated C is a substitution to T.

  • pgm (float) – Probability that a G is mutated.

  • pga (float) – Probability that a mutated G is a substitution to A.

  • pgc (float) – Probability that a mutated G is a substitution to C.

  • pgt (float) – Probability that a mutated G is a substitution to T.

  • ptm (float) – Probability that a T is mutated.

  • pta (float) – Probability that a mutated T is a substitution to A.

  • ptc (float) – Probability that a mutated T is a substitution to C.

  • ptg (float) – Probability that a mutated T is a substitution to G.

  • pnm (float) – Probability that an N is mutated.

  • pnd (float) – Probability that a mutated N is a deletion.

Returns:

Mean rate of each type of mutation (column) and each base (row).

Return type:

pd.DataFrame

seismicrna.sim.muts.make_pmut_means_paired(probe: str, **kwargs: float)

Generate mean mutation rates for paired bases.

seismicrna.sim.muts.make_pmut_means_unpaired(probe: str, **kwargs: float)

Generate mean mutation rates for unpaired bases.

seismicrna.sim.muts.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, probe: str = 'DMS', mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), force: bool = False, num_cpus: int = 4, seed: int | None = None)

Simulate the rate of each kind of mutation at each position.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]

  • pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]

  • vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]

  • vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]

  • mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]

seismicrna.sim.muts.run_struct(ct_file: Path, pmut_paired: Iterable[tuple[str, float]], pmut_unpaired: Iterable[tuple[str, float]], vmut_paired: float, vmut_unpaired: float, probe: str, force: bool, seed: int | None, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = ())

Simulate per-position mutation rates for a CT file and write them.

For each structure in the CT file, mutation rates are simulated using a Dirichlet distribution, with separate mean rates for paired and unpaired bases.

Parameters:
  • ct_file (Path) – Path to the connectivity table (CT) file defining structures and base-pairing.

  • pmut_paired (Iterable[tuple[str, float]]) – Mutation-type/probability pairs for paired bases, passed to make_pmut_means_paired.

  • pmut_unpaired (Iterable[tuple[str, float]]) – Mutation-type/probability pairs for unpaired bases, passed to make_pmut_means_unpaired.

  • vmut_paired (float) – Relative variance of mutation rates for paired bases (0 to 1).

  • vmut_unpaired (float) – Relative variance of mutation rates for unpaired bases (0 to 1).

  • force (bool) – Whether to overwrite an existing output file.

  • seed (int | None) – Random seed for reproducibility; None for no fixed seed.

Returns:

Path of the written mutation-rate CSV file.

Return type:

Path

seismicrna.sim.muts.sim_pmut(positions: Index, mean: DataFrame, relative_variance: float, end5: int | None, end3: int | None, seed: int | None)

Simulate mutation rates using a Dirichlet distribution.

Parameters:
  • positions (pd.Index) – Index of positions and bases.

  • mean (pd.DataFrame) – Mean of the mutation rates for each type of base.

  • relative_variance (float) – Variance of the mutation rates, as a fraction of its supremum.

Returns:

Mutation rates, with the same index as positions.

Return type:

pd.DataFrame
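
One way to read "variance as a fraction of its supremum" uses a standard property of the Dirichlet distribution: with mean m_i and total concentration a0, the variance of component i is m_i * (1 - m_i) / (a0 + 1), which approaches its supremum m_i * (1 - m_i) as a0 approaches 0. A relative variance v therefore corresponds to a0 = 1 / v - 1. The sketch below applies that relation with the standard library; whether sim_pmut parametrizes the distribution this way is an assumption, and the function names and the example mean vector are hypothetical.

```python
import random


def concentration_from_relative_variance(relative_variance):
    """Total Dirichlet concentration a0 giving variance = v * supremum."""
    if not 0.0 < relative_variance < 1.0:
        raise ValueError("relative_variance must be in (0, 1)")
    # Var(X_i) = m_i * (1 - m_i) / (a0 + 1), so v = 1 / (a0 + 1).
    return 1.0 / relative_variance - 1.0


def sketch_dirichlet(mean, relative_variance, rng):
    """Draw one vector of rates whose expectation is mean (sums to 1)."""
    a0 = concentration_from_relative_variance(relative_variance)
    # A component with mean 0 gets rate 0 (gammavariate needs alpha > 0).
    draws = [rng.gammavariate(m * a0, 1.0) if m > 0.0 else 0.0 for m in mean]
    total = sum(draws)
    return [draw / total for draw in draws]


rng = random.Random(0)
# Illustrative mean: 90% no mutation, 8% substitution, 2% deletion.
rates = sketch_dirichlet([0.90, 0.08, 0.02], 0.02, rng)
print(rates)
```

Under this reading, a small relative variance such as the documented default of 0.001 for paired bases gives a large concentration, so simulated rates stay close to their means.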

seismicrna.sim.muts.verify_proportions(p: Any)

Verify that p is a valid set of proportions:

  • Every element of p must be ≥ 0 and ≤ 1.

  • The sum of p must equal 1.

Parameters:

p (Any) – Proportions to verify; must be a NumPy array or convertible into a NumPy array.
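
The two conditions can be sketched with the standard library as follows. The name sketch_verify_proportions, the tolerance on the sum, and the choice to raise ValueError are assumptions for illustration; the documented function accepts NumPy arrays and may report failures differently.

```python
import math


def sketch_verify_proportions(p, tol=1e-9):
    """Hypothetical check; raises ValueError if p is not valid."""
    values = [float(x) for x in p]
    for value in values:
        # Every element must lie in [0, 1]; NaN fails this comparison.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"proportion out of range [0, 1]: {value}")
    # fsum reduces rounding error when adding many small proportions.
    total = math.fsum(values)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"proportions sum to {total}, not 1")


sketch_verify_proportions([0.5, 0.3, 0.2])
```

Comparing the sum to 1 with a tolerance rather than exact equality avoids rejecting valid inputs because of floating-point rounding.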

seismicrna.sim.params.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, probe: str = 'DMS', mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4, seed: int | None = None)

Simulate parameter files.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]

  • pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]

  • vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]

  • vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]

  • mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]

  • center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]

  • length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]

  • length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]

  • clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]

seismicrna.sim.ref.get_fasta_path(top: Path, ref: str)

Get the path of a FASTA file.

seismicrna.sim.ref.run(*, sim_dir: str | Path = './sim', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, force: bool = False, seed: int | None = None)

Simulate a random reference sequence and write it to a FASTA file.

Parameters:
  • sim_dir (str | Path) – Simulation output directory; all simulated files are written here, with the FASTA file under its references subdirectory [keyword-only, default: ‘./sim’]

  • refs (str) – Name of the file of simulated references (used for the FASTA file name) [keyword-only, default: ‘sim-refs’]

  • ref (str) – Name of the single simulated reference sequence record in the FASTA file [keyword-only, default: ‘sim-ref’]

  • reflen (int) – Length of the reference sequence to generate, in bases [keyword-only, default: 280]

  • force (bool) – Force all tasks to run, overwriting any existing output files, including an existing FASTA file [keyword-only, default: False]

  • seed (int | None) – Seed for the random number generator; None for no fixed seed [keyword-only, default: None]

Returns:

Path of the written FASTA file.

Return type:

Path

seismicrna.sim.relate.parse_min_mut_gap_weights(min_mut_gap_weights: str) dict[int, float]

Parse a comma-separated ‘gap:weight’ string into a dict.
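
A minimal sketch of such a parser is shown below, using the example string ‘0:0.18,2:0.05,3:0.21’ from the option help. The name sketch_parse_gap_weights, the handling of an empty string, and the rejection of negative values are assumptions; the real function may validate or normalize its input differently.

```python
def sketch_parse_gap_weights(text):
    """Hypothetical parser for a 'gap:weight,gap:weight' string."""
    weights = {}
    if not text.strip():
        # Assumption: an empty string means no mixture was given.
        return weights
    for item in text.split(","):
        # Each item must contain exactly one colon.
        gap_str, weight_str = item.split(":")
        gap = int(gap_str)
        weight = float(weight_str)
        if gap < 0 or weight < 0.0:
            raise ValueError(f"invalid gap:weight pair: {item!r}")
        weights[gap] = weight
    return weights


print(sketch_parse_gap_weights("0:0.18,2:0.05,3:0.21"))
# prints {0: 0.18, 2: 0.05, 3: 0.21}
```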

seismicrna.sim.relate.run(*, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', branch: str = '', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, probe: str = 'DMS', min_mut_gap: int | None = None, min_mut_gap_weights: str = '', mut_collisions: str = 'auto', mut_probs: str | None = '0.2,0.04,0.008', num_reads: int = 65536, batch_size: int = 65536, write_read_names: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, seed: int | None = None, tmp_pfx='./tmp', keep_tmp=False)

Simulate a Relate dataset.

Parameters:
  • param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]

  • profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]

  • sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]

  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]

  • paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]

  • read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]

  • reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]

  • min_mut_gap (int | None) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: None]

  • min_mut_gap_weights (str) – Comma-separated gap:weight pairs defining a mixture of min_mut_gap biases, e.g. ‘0:0.18,2:0.05,3:0.21’. When given, overrides --min-mut-gap. [keyword-only, default: ‘’]

  • mut_collisions (str) – If two mutations are closer than --min-mut-gap positions, MERGE the mutations, DROP the read, or AUTO-select based on the probe. [keyword-only, default: ‘auto’]

  • mut_probs (str | None) – Comma-separated probabilities of injecting a mutation at each successive position 5’ of an existing mutation (used with --mut-collisions merge) [keyword-only, default: ‘0.2,0.04,0.008’]

  • num_reads (int) – Simulate this many reads [keyword-only, default: 65536]

  • batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]

  • write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • keep_tmp – Keep temporary files after finishing [keyword-only, default: False]

seismicrna.sim.total.run(*, sim_dir: str | Path = './sim', tmp_pfx: str | Path = './tmp', sample: str = 'sim-sample', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, profile_name: str = 'simulated', fold_backend: str = 'auto', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 37.0, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_min: int = 1, fold_percent: float = 20.0, pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), mask_a: bool | None = None, mask_c: bool | None = None, mask_g: bool | None = None, mask_u: bool | None = None, mask_polya: int = 5, max_fraction_ident: float = 1.0, max_pearson_sim: float = 1.0, min_marcd_sim: float = 0.0, max_tries: int = 10, paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, probe: str = 'DMS', min_mut_gap: int = None, min_mut_gap_weights: str = '', mut_collisions: str = 'auto', mut_probs: str | None = '0.2,0.04,0.008', fq_gzip: bool = True, num_reads: int = 65536, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4, seed: int | None = None)

Simulate FASTQ files from scratch.

Parameters:
  • sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]

  • tmp_pfx (str | Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]

  • refs (str) – Give this name to the file of simulated references [keyword-only, default: ‘sim-refs’]

  • ref (str) – Give this name to the simulated reference [keyword-only, default: ‘sim-ref’]

  • reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]

  • profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]

  • fold_backend (str) – Model RNA structures using Fold (RNAstructure), ShapeKnots (RNAstructure), or RNAfold (ViennaRNA); ‘auto’ selects Fold for DMS and RNAfold for other probes [keyword-only, default: ‘auto’]

  • fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]

  • fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]

  • fold_temp (float) – Predict structures at this temperature (Celsius) [keyword-only, default: 37.0]

  • fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]

  • fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]

  • fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]

  • fold_min (int) – Require at least this many structures (overridden by --fold-mfe) [keyword-only, default: 1]

  • fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]

  • pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]

  • pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]

  • vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]

  • vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]

  • center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]

  • center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]

  • length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]

  • length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]

  • clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]

  • mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • mask_a (bool | None) – Mask positions with base A [keyword-only, default: None]

  • mask_c (bool | None) – Mask positions with base C [keyword-only, default: None]

  • mask_g (bool | None) – Mask positions with base G [keyword-only, default: None]

  • mask_u (bool | None) – Mask positions with base U [keyword-only, default: None]

  • mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]

  • max_fraction_ident (float) – Retry if any two clusters have more than this fraction of identical mutation rates [keyword-only, default: 1.0]

  • max_pearson_sim (float) – Retry if any two clusters have a Pearson correlation larger than this [keyword-only, default: 1.0]

  • min_marcd_sim (float) – Retry if any two clusters have a MARCD less than this (0 disables) [keyword-only, default: 0.0]

  • max_tries (int) – Simulate the parameters with up to this many attempts [keyword-only, default: 10]

  • paired_end (bool) – Simulate paired-end (True) or single-end (False) reads [keyword-only, default: True]

  • read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]

  • reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]

  • min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: None]

  • min_mut_gap_weights (str) – Comma-separated gap:weight pairs defining a mixture of min_mut_gap biases, e.g. ‘0:0.18,2:0.05,3:0.21’. When given, overrides --min-mut-gap. [keyword-only, default: ‘’]

  • mut_collisions (str) – If two mutations are closer than --min-mut-gap positions, MERGE the mutations, DROP the read, or AUTO-select based on the probe. [keyword-only, default: ‘auto’]

  • mut_probs (str | None) – Comma-separated probabilities of injecting a mutation at each successive position 5’ of an existing mutation (used with --mut-collisions merge) [keyword-only, default: ‘0.2,0.04,0.008’]

  • fq_gzip (bool) – Simulate FASTQ files with gzip compression (True) or as plain text (False) [keyword-only, default: True]

  • num_reads (int) – Simulate this many reads [keyword-only, default: 65536]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]
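
The min_mut_gap_weights and mut_probs parameters above are passed as delimited strings. The sketch below shows one way to check such strings before passing them to seismicrna.sim.total.run. It is not SEISMIC-RNA's own parser: parse_gap_weights and parse_mut_probs are hypothetical helper names, and the validation rules (non-negative gaps and weights, probabilities between 0 and 1, no repeated gaps) are assumptions based only on the parameter descriptions.

```python
from __future__ import annotations


def parse_gap_weights(text: str) -> dict[int, float]:
    """Parse 'gap:weight,gap:weight' into {gap: weight}.

    Hypothetical helper; the format is inferred from the documented
    example '0:0.18,2:0.05,3:0.21'. An empty string yields an empty dict.
    """
    weights: dict[int, float] = {}
    if not text.strip():
        return weights
    for item in text.split(","):
        gap_str, sep, weight_str = item.partition(":")
        if not sep:
            raise ValueError(f"Expected 'gap:weight', but got {item!r}")
        gap = int(gap_str)
        weight = float(weight_str)
        # Assumption: gaps and weights cannot be negative.
        if gap < 0 or weight < 0.0:
            raise ValueError(f"Negative gap or weight in {item!r}")
        # Assumption: each gap may appear only once.
        if gap in weights:
            raise ValueError(f"Gap {gap} is given more than once")
        weights[gap] = weight
    return weights


def parse_mut_probs(text: str | None) -> list[float]:
    """Parse comma-separated probabilities into a list of floats.

    Hypothetical helper; None or an empty string yields an empty list.
    """
    if text is None or not text.strip():
        return []
    probs = [float(item) for item in text.split(",")]
    for prob in probs:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"Probability {prob} is not in [0, 1]")
    return probs


# The documented example and default values.
gap_weights = parse_gap_weights("0:0.18,2:0.05,3:0.21")
mut_probs = parse_mut_probs("0.2,0.04,0.008")
print(gap_weights)
print(mut_probs)
```

Checking the strings up front makes a malformed value fail immediately with a specific message, rather than partway through a simulation. Whether the weights must sum to a particular total is not stated in the parameter description, so the sketch does not enforce one.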