seismicrna.sim package

Submodules

seismicrna.sim.clusts.load_pclust(pclust_file: Path)

Load cluster proportions from a file.

seismicrna.sim.clusts.run(*, ct_file: Iterable[str | Path] = (), clust_conc: float = 0.0, force: bool = False, max_procs: int = 4)

Simulate the proportions of clusters for the structure(s) in each CT file.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

seismicrna.sim.clusts.sim_pclust(num_clusters: int, concentration: float | None = None, sort: bool = True)

Simulate proportions of clusters using a Dirichlet distribution.

Parameters:
  • num_clusters (int) – Number of clusters to simulate; must be ≥ 1.

  • concentration (float | None = None) – Concentration parameter for Dirichlet distribution; defaults to 1 / (num_clusters - 1); must be > 0.

  • sort (bool = True) – Sort the cluster proportions from greatest to least.

Returns:

Simulated proportion of each cluster.

Return type:

pd.Series
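As a rough illustration of the idea (not SEISMIC-RNA's own code), cluster proportions can be drawn from a symmetric Dirichlet distribution with NumPy. The name `draw_cluster_proportions`, the `seed` argument, the 1-based cluster labels, and the handling of a single cluster are assumptions made for this sketch; the fallback concentration follows the default described above.

```python
import numpy as np
import pandas as pd


def draw_cluster_proportions(num_clusters, concentration=None, sort=True, seed=0):
    """Hypothetical helper: draw cluster proportions from a symmetric Dirichlet."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be >= 1")
    if num_clusters == 1:
        # Assumption: a single cluster trivially has proportion 1.
        props = np.ones(1)
    else:
        if concentration is None:
            # Default described above: 1 / (num_clusters - 1).
            concentration = 1.0 / (num_clusters - 1)
        if concentration <= 0.0:
            raise ValueError("concentration must be > 0")
        rng = np.random.default_rng(seed)
        # Every cluster gets the same concentration (symmetric Dirichlet).
        props = rng.dirichlet(np.full(num_clusters, concentration))
    if sort:
        # Order from greatest to least.
        props = np.sort(props)[::-1]
    # Assumption: label the clusters 1..num_clusters.
    index = pd.RangeIndex(1, num_clusters + 1, name="Cluster")
    return pd.Series(props, index=index)


pclust = draw_cluster_proportions(3)
```

Smaller concentrations tend to give more uneven proportions, while larger ones push the clusters toward equal shares.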

seismicrna.sim.clusts.sim_pclust_ct(ct_file: Path, *, concentration: float, force: bool)
seismicrna.sim.ends.load_pends(pends_file: Path)

Load end coordinate proportions from a file.

seismicrna.sim.ends.run(*, ct_file: Iterable[str | Path] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, force: bool = False, max_procs: int = 4)

Simulate end coordinate proportions for the structure(s) in each CT file.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]

  • center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]

  • length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]

  • length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

seismicrna.sim.ends.sim_pends(end5: int, end3: int, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, keep_empty_reads: bool = True)

Simulate segment end coordinate probabilities.

Parameters:
  • end5 (int) – 5’ end of the region (minimum allowed 5’ end coordinate).

  • end3 (int) – 3’ end of the region (maximum allowed 3’ end coordinate).

  • center_fmean (float) – Mean read center, as a fraction of the reference length.

  • center_fvar (float) – Variance of the read center, as a fraction of its maximum.

  • length_fmean (float) – Mean read length, as a fraction of the available length.

  • length_fvar (float) – Variance of the read length, as a fraction of its maximum.

  • keep_empty_reads (bool) – Whether to keep reads whose lengths are 0.

Returns:

5’ and 3’ coordinates and their probabilities.

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]
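One way to read "variance as a fraction of its maximum" is through moment matching on a beta distribution, sketched below. This is an interpretation, not a description of how `sim_pends` is implemented: the helper `beta_params` is hypothetical, and the sketch assumes that the maximum variance for a mean m on [0, 1] is m * (1 - m).

```python
def beta_params(fmean, fvar):
    """Hypothetical helper: Beta(a, b) parameters from a mean and a
    variance expressed as a fraction of its maximum."""
    if not 0.0 < fmean < 1.0:
        raise ValueError("fmean must be in (0, 1)")
    if not 0.0 < fvar < 1.0:
        raise ValueError("fvar must be in (0, 1)")
    # Assumption: var = fvar * m * (1 - m), where m * (1 - m) is the
    # largest variance possible for mean m on [0, 1].
    # A beta distribution has var = m * (1 - m) / (a + b + 1),
    # so a + b = 1 / fvar - 1.
    total = 1.0 / fvar - 1.0
    return fmean * total, (1.0 - fmean) * total


# Defaults listed for center_fvar and length_fvar in seismicrna.sim.ends.run.
a_center, b_center = beta_params(0.5, 1 / 3)
a_length, b_length = beta_params(0.5, 1 / 81)
```

Under this assumption, a mean of 0.5 with a variance fraction of 1/3 corresponds to Beta(1, 1), which is uniform, and a variance fraction of 1/81 (0.0123...) corresponds to Beta(40, 40), which is tightly concentrated around the mean.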

seismicrna.sim.ends.sim_pends_ct(ct_file: Path, *, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, force: bool)
seismicrna.sim.fastq.from_param_dir(param_dir: Path, *, sample: str, profile: str, read_length: int, paired: bool, p_rev: float, fq_gzip: bool, force: bool, **kwargs)

Simulate a FASTQ file from parameter files.

seismicrna.sim.fastq.from_report(report_file: Path, *, read_length: int, p_rev: float, fq_gzip: bool, force: bool)

Simulate a FASTQ file from a Relate report.

seismicrna.sim.fastq.generate_fastq(top: Path, sample: str, ref: str, refseq: DNA, paired: bool, read_length: int, batches: Iterable[tuple[RelateBatch, ReadNamesBatch]], p_rev: float = 0.5, fq_gzip: bool = True, force: bool = False)

Generate FASTQ file(s) from a dataset.

seismicrna.sim.fastq.generate_fastq_record(name: str, rels: ndarray, refseq: str, adapter: str, read_length: int, reverse: bool = False, hi_qual: str = 'I', lo_qual: str = '!')

Generate a FASTQ record for a read.
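For orientation, a FASTQ record spans four lines: a header starting with `@`, the base calls, a `+` separator, and one quality character per base. The sketch below formats such a record from an already-built sequence and a mask of low-quality positions. The helper `fastq_record` and its `low_quality` argument are hypothetical; it does not decode the `rels` array, append an adapter, or trim to `read_length` as the documented function's signature suggests.

```python
def fastq_record(name, seq, low_quality, hi_qual="I", lo_qual="!"):
    """Hypothetical helper: format one four-line FASTQ record."""
    if len(seq) != len(low_quality):
        raise ValueError("seq and low_quality must have the same length")
    # Emit one quality character per base call.
    quals = "".join(lo_qual if low else hi_qual for low in low_quality)
    return f"@{name}\n{seq}\n+\n{quals}\n"


record = fastq_record("read1", "ACGT", [False, False, True, False])
```

In the common Phred+33 encoding, `I` stands for a quality score of 40 and `!` for a score of 0, which is why they serve as natural high- and low-quality markers.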

seismicrna.sim.fastq.run(*, input_path: Iterable[str | Path], param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 3, fq_gzip: bool = True, num_reads: int = 65536, max_procs: int = 4, force: bool = False)
Parameters:
  • param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]

  • profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]

  • sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]

  • paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]

  • read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]

  • reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]

  • min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]

  • fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]

  • num_reads (int) – Simulate this many reads [keyword-only, default: 65536]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

seismicrna.sim.fold.fold_region(region: Region, *, sim_dir: Path, tmp_dir: Path, profile_name: str, fold_constraint: Path | None, fold_temp: float, fold_md: int, fold_mfe: bool, fold_max: int, fold_percent: float, keep_tmp: bool, force: bool, n_procs: int)
seismicrna.sim.fold.get_ct_path(top: Path, region: Region, profile: str)

Get the path of a connectivity table (CT) file.

seismicrna.sim.fold.run(fasta: str | Path, *, sim_dir: str | Path = './sim', profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, keep_tmp: bool = False, force: bool = False, max_procs: int = 4, tmp_pfx='./tmp')
Parameters:
  • sim_dir (str | pathlib.Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]

  • profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]

  • fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]

  • fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]

  • fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]

  • fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]

  • fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]

  • fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]

  • fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

seismicrna.sim.muts.calc_pmut_pattern(pmut: DataFrame, pattern: RelPattern)

Calculate the rate of a given type of mutation.

seismicrna.sim.muts.load_pmut(pmut_file: Path)

Load mutation rates from a file.

seismicrna.sim.muts.make_pmut_means(*, ploq: float = 0.02, pam: float, pac: float = 0.3, pag: float = 0.16, pat: float = 0.5, pcm: float, pca: float = 0.32, pcg: float = 0.32, pct: float = 0.32, pgm: float, pga: float = 0.32, pgc: float = 0.32, pgt: float = 0.32, ptm: float, pta: float = 0.32, ptc: float = 0.32, ptg: float = 0.32, pnm: float = 0.0, pnd: float = 0.04)

Generate mean mutation rates.

Mutations are assumed to behave as follows:

  • A base n mutates with probability pnm.

    • If it mutates, then it is a substitution with probability (pna + pnc + png + pnt).

      • If it is a substitution, then it is high-quality with probability (1 - ploq).

      • Otherwise, it is low-quality.

    • Otherwise, it is a deletion.

  • Otherwise, it is low-quality with probability ploq.

So the overall probability of being low-quality is the probability given a mutation, pnm * (pna + pnc + png + pnt) * ploq, plus the probability given no mutation, (1 - pnm) * ploq, which equals ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))).
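The factoring in this derivation can be checked numerically. The sketch below sums the two branches for a generic base n and compares the result with the closed form ploq * (1 - pnm * (1 - p_subst)), where `p_subst` stands for (pna + pnc + png + pnt). The function names and the input values are illustrative only and are not part of the package.

```python
def p_low_quality(pnm, p_subst, ploq):
    """Overall low-quality probability, summed over the two branches."""
    # Mutated, substituted, and then low-quality.
    given_mutation = pnm * p_subst * ploq
    # Not mutated, and then low-quality.
    given_no_mutation = (1.0 - pnm) * ploq
    return given_mutation + given_no_mutation


def p_low_quality_closed_form(pnm, p_subst, ploq):
    """The same quantity after factoring out ploq."""
    return ploq * (1.0 - pnm * (1.0 - p_subst))


# Illustrative values: 5% of bases mutate, 96% of mutations are
# substitutions (so 4% are deletions), and ploq is 0.02.
branches = p_low_quality(0.05, 0.96, 0.02)
closed = p_low_quality_closed_form(0.05, 0.96, 0.02)
```

Both expressions give 0.01996 for these inputs, slightly below ploq because deletions can never be low-quality in this model.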

Parameters:
  • ploq (float) – Probability that a base is low-quality.

  • pam (float) – Probability that an A is mutated.

  • pac (float) – Probability that a mutated A is a substitution to C.

  • pag (float) – Probability that a mutated A is a substitution to G.

  • pat (float) – Probability that a mutated A is a substitution to T.

  • pcm (float) – Probability that a C is mutated.

  • pca (float) – Probability that a mutated C is a substitution to A.

  • pcg (float) – Probability that a mutated C is a substitution to G.

  • pct (float) – Probability that a mutated C is a substitution to T.

  • pgm (float) – Probability that a G is mutated.

  • pga (float) – Probability that a mutated G is a substitution to A.

  • pgc (float) – Probability that a mutated G is a substitution to C.

  • pgt (float) – Probability that a mutated G is a substitution to T.

  • ptm (float) – Probability that a T is mutated.

  • pta (float) – Probability that a mutated T is a substitution to A.

  • ptc (float) – Probability that a mutated T is a substitution to C.

  • ptg (float) – Probability that a mutated T is a substitution to G.

  • pnm (float) – Probability that an N is mutated.

  • pnd (float) – Probability that a mutated N is a deletion.

Returns:

Mean rate of each type of mutation (column) and each base (row).

Return type:

pd.DataFrame

seismicrna.sim.muts.make_pmut_means_paired(pam: float = 0.005, pcm: float = 0.003, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs)

Generate mean mutation rates for paired bases.

seismicrna.sim.muts.make_pmut_means_unpaired(pam: float = 0.045, pcm: float = 0.039, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs)

Generate mean mutation rates for unpaired bases.

seismicrna.sim.muts.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, force: bool = False, max_procs: int = 4)

Simulate the rate of each kind of mutation at each position.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]

  • pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]

  • vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]

  • vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

seismicrna.sim.muts.run_struct(ct_file: Path, pmut_paired: Iterable[tuple[str, float]], pmut_unpaired: Iterable[tuple[str, float]], vmut_paired: float, vmut_unpaired: float, force: bool)
seismicrna.sim.muts.sim_pmut(positions: Index, mean: DataFrame, relative_variance: float)

Simulate mutation rates using a Dirichlet distribution.

Parameters:
  • positions (pd.Index) – Index of positions and bases.

  • mean (pd.DataFrame) – Mean of the mutation rates for each type of base.

  • relative_variance (float) – Variance of the mutation rates, as a fraction of its supremum.

Returns:

Mutation rates, with the same index as

Return type:

pd.DataFrame

seismicrna.sim.muts.verify_proportions(p: Any)

Verify that p is a valid set of proportions:

  • Every element of p must be ≥ 0 and ≤ 1.

  • The sum of p must equal 1.

Parameters:

p (Any) – Proportions to verify; must be a NumPy array or convertible into a NumPy array.
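The two conditions above can be sketched with NumPy as follows. This is a minimal illustration rather than the package's implementation: the name `check_proportions`, the choice to raise `ValueError`, and the absolute tolerance `atol` used when comparing the sum to 1 are all assumptions.

```python
import numpy as np


def check_proportions(p, atol=1e-9):
    """Hypothetical helper: raise ValueError unless p is a valid set of proportions."""
    # Accept anything that NumPy can convert into a float array.
    arr = np.asarray(p, dtype=float)
    if np.any(arr < 0.0) or np.any(arr > 1.0):
        raise ValueError("every proportion must be >= 0 and <= 1")
    # Assumption: compare the sum to 1 within a small tolerance,
    # because floating-point sums are rarely exact.
    if not np.isclose(arr.sum(), 1.0, rtol=0.0, atol=atol):
        raise ValueError(f"proportions must sum to 1, but got {arr.sum()}")
    return arr


valid = check_proportions([0.2, 0.3, 0.5])
```

A tolerance is used because sums such as 0.1 + 0.2 + 0.7 may differ from 1 by a rounding error; an exact equality test would reject them.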

seismicrna.sim.params.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, force: bool = False, max_procs: int = 4)

Simulate parameter files.

Parameters:
  • ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]

  • pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]

  • pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]

  • vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]

  • vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]

  • center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]

  • center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]

  • length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]

  • length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]

  • clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

seismicrna.sim.ref.get_fasta_path(top: Path, ref: str)

Get the path of a FASTA file.

seismicrna.sim.ref.run(*, sim_dir: str | Path = './sim', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, force: bool = False)
Parameters:
  • sim_dir (str | pathlib.Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]

  • refs (str) – Give this name to the file of simulated references [keyword-only, default: ‘sim-refs’]

  • ref (str) – Give this name to the simulated reference [keyword-only, default: ‘sim-ref’]

  • reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

seismicrna.sim.relate.from_param_dir(param_dir: Path, profile: str, min_mut_gap: int, **kwargs)

Simulate a Relate dataset given parameter files.

seismicrna.sim.relate.get_param_dir_fields(param_dir: Path)
seismicrna.sim.relate.load_param_dir(param_dir: Path, profile: str)

Load all parameters for a profile in a directory.

seismicrna.sim.relate.run(*, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 3, num_reads: int = 65536, batch_size: int = 65536, write_read_names: bool = False, brotli_level: int = 10, force: bool = False, max_procs: int = 4, tmp_pfx='./tmp', keep_tmp=False)

Simulate a Relate dataset.

Parameters:
  • param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]

  • profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]

  • sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]

  • paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]

  • read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]

  • reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]

  • min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]

  • num_reads (int) – Simulate this many reads [keyword-only, default: 65536]

  • batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]

  • write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • keep_tmp – Keep temporary files after finishing [keyword-only, default: False]

seismicrna.sim.total.run(*, sim_dir: str | Path = './sim', tmp_pfx: str | Path = './tmp', sample: str = 'sim-sample', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 3, fq_gzip: bool = True, num_reads: int = 65536, keep_tmp: bool = False, force: bool = False, max_procs: int = 4)

Simulate FASTQ files from scratch.

Parameters:
  • sim_dir (str | pathlib.Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]

  • tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]

  • refs (str) – Give this name to the file of simulated references [keyword-only, default: ‘sim-refs’]

  • ref (str) – Give this name to the simulated reference [keyword-only, default: ‘sim-ref’]

  • reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]

  • profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]

  • fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]

  • fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]

  • fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]

  • fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]

  • fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]

  • fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]

  • fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]

  • pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]

  • pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]

  • vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]

  • vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]

  • center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]

  • center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]

  • length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]

  • length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]

  • clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]

  • paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]

  • read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]

  • reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]

  • min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]

  • fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]

  • num_reads (int) – Simulate this many reads [keyword-only, default: 65536]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]