seismicrna.sim package

Subpackages

seismicrna.sim.tests package
- Submodules
  - TestSimPEnds

Submodules

seismicrna.sim.abstract.abstract_seismicgraph_file(seismicgraph_file: Path)

seismicrna.sim.abstract.abstract_table(table: MaskPositionTableLoader, struct_file: str | Path)

seismicrna.sim.abstract.get_acgt_parameters()

seismicrna.sim.abstract.get_other_parameters()

seismicrna.sim.abstract.new_parameter_dict()

seismicrna.sim.abstract.run(input_path: Iterable[str | Path], *, struct_file: Iterable[str | Path] = (), print_params: bool = True, verify_times: bool = True, num_cpus: int = 4)

Abstract simulation parameters from existing datasets.

Parameters:

struct_file (Iterable) – Compare mutational profiles to the structure(s) in this CT file [keyword-only, default: ()]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.clusts.load_pclust(pclust_file: Path): Load cluster proportions from a file.

seismicrna.sim.clusts.run(*, ct_file: Iterable[str | Path] = (), clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4)

Simulate the proportions of clusters.

Parameters:

ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.clusts.sim_pclust(num_clusters: int, concentration: float | None = None, sort: bool = True)

Simulate proportions of clusters using a Dirichlet distribution.

Parameters:

num_clusters (int) – Number of clusters to simulate; must be ≥ 1.
concentration (float | None) – Concentration parameter for Dirichlet distribution; defaults to 1 / (num_clusters - 1); must be > 0.
sort (bool) – Sort the cluster proportions from greatest to least.

Returns:

Simulated proportion of each cluster.

Return type:

pd.Series
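The sampling described for sim_pclust can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the name sim_cluster_proportions and the rng argument are hypothetical, the Dirichlet is assumed to be symmetric (one shared concentration), and the result is a plain list rather than a pd.Series.

```python
import random


def sim_cluster_proportions(num_clusters, concentration=None, sort=True, rng=None):
    """Draw cluster proportions from a symmetric Dirichlet distribution.

    Hypothetical stand-in for illustration; returns a list of floats.
    """
    if num_clusters < 1:
        raise ValueError(f"num_clusters must be >= 1, but got {num_clusters}")
    if num_clusters == 1:
        # A single cluster necessarily has proportion 1.
        return [1.0]
    if concentration is None:
        # Default given in the parameter description above.
        concentration = 1.0 / (num_clusters - 1)
    if concentration <= 0.0:
        raise ValueError(f"concentration must be > 0, but got {concentration}")
    if rng is None:
        rng = random.Random()
    # Independent Gamma(concentration, 1) draws, normalized to sum to 1,
    # follow a symmetric Dirichlet distribution.
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_clusters)]
    total = sum(draws)
    proportions = [draw / total for draw in draws]
    if sort:
        # Greatest to least, as the sort parameter describes.
        proportions.sort(reverse=True)
    return proportions
```

Passing a seeded random.Random instance as rng makes the draw reproducible; smaller concentrations tend to give more uneven proportions.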

seismicrna.sim.clusts.sim_pclust_ct(ct_file: Path, *, concentration: float, force: bool)

seismicrna.sim.ends.load_pends(pends_file: Path): Load end coordinate proportions from a file.

seismicrna.sim.ends.run(*, ct_file: Iterable[str | Path] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, force: bool = False, num_cpus: int = 4)

Simulate the probabilities of read end coordinates.

Parameters:

ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.ends.sim_pends(end5: int, end3: int, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, keep_empty_reads: bool = True)

Simulate segment end coordinate probabilities.

Parameters:

end5 (int) – 5’ end of the region (minimum allowed 5’ end coordinate).
end3 (int) – 3’ end of the region (maximum allowed 3’ end coordinate).
center_fmean (float) – Mean read center, as a fraction of the reference length.
center_fvar (float) – Variance of the read center, as a fraction of its maximum.
length_fmean (float) – Mean read length, as a fraction of the available length.
length_fvar (float) – Variance of the read length, as a fraction of its maximum.
keep_empty_reads (bool) – Whether to keep reads whose lengths are 0.

Returns:

5’ and 3’ coordinates and their probabilities.

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]
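The default values 0.3333333333333333 and 0.012345679012345678 equal 1/3 and 1/81. One possible reading of "variance as a fraction of its maximum", which is an assumption here and not confirmed by this page, is that a quantity bounded in [0, 1] with mean m can have a variance of at most m(1 - m). Under that reading, a Beta distribution with the requested mean and fractional variance has the parameters computed below; beta_params is a hypothetical helper, not part of the package.

```python
def beta_params(fmean, fvar):
    """Return (alpha, beta) of a Beta distribution with mean fmean and
    variance fvar * fmean * (1 - fmean).

    Hypothetical helper; assumes the maximum variance is fmean * (1 - fmean).
    """
    if not 0.0 < fmean < 1.0:
        raise ValueError(f"fmean must be > 0 and < 1, but got {fmean}")
    if not 0.0 < fvar < 1.0:
        raise ValueError(f"fvar must be > 0 and < 1, but got {fvar}")
    # For a Beta distribution, variance = m * (1 - m) / (alpha + beta + 1),
    # so the fraction of the maximum variance is 1 / (alpha + beta + 1).
    total = 1.0 / fvar - 1.0
    return fmean * total, (1.0 - fmean) * total
```

With a mean of 0.5, a fractional variance of 1/3 gives alpha = beta = 1 (a uniform distribution), and 1/81 gives alpha = beta = 40 (tightly concentrated around the mean).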

seismicrna.sim.ends.sim_pends_ct(ct_file: Path, *, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, force: bool)

seismicrna.sim.fastq.from_param_dir(param_dir: Path, *, sample: str, profile: str, read_length: int, paired: bool, p_rev: float, fq_gzip: bool, force: bool, **kwargs): Simulate a FASTQ file from parameter files.

seismicrna.sim.fastq.from_report(report_file: Path, *, read_length: int, p_rev: float, fq_gzip: bool, force: bool): Simulate a FASTQ file from a Relate report.

seismicrna.sim.fastq.generate_fastq(top: Path, sample: str, ref: str, refseq: DNA, paired: bool, read_length: int, batches: Iterable[tuple[RelateRegionMutsBatch, ReadNamesBatch]], p_rev: float = 0.5, fq_gzip: bool = True, force: bool = False): Generate FASTQ file(s) from a dataset.

seismicrna.sim.fastq.generate_fastq_record(name: str, rels: ndarray, refseq: str, adapter: str, read_length: int, reverse: bool = False, hi_qual: str = 'I', lo_qual: str = '!'): Generate a FASTQ line for a read.
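For orientation, a FASTQ record consists of four lines: a name line starting with "@", the base calls, a "+" separator, and one quality character per base call. In the common Phred+33 encoding, the defaults 'I' and '!' correspond to quality scores 40 and 0. The sketch below only formats such a record; format_fastq_record and its low_quality argument are hypothetical, and it does not reproduce how generate_fastq_record derives the read from rels, refseq, and adapter.

```python
def format_fastq_record(name, seq, low_quality=(), hi_qual="I", lo_qual="!"):
    """Format one four-line FASTQ record.

    low_quality holds 0-based indexes of base calls to mark as low-quality
    (a hypothetical argument for illustration).
    """
    low = set(low_quality)
    # One quality character per base call.
    qual = "".join(lo_qual if i in low else hi_qual for i in range(len(seq)))
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

For example, format_fastq_record("read1", "ACGT", [2]) marks only the third base call as low-quality.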

seismicrna.sim.fastq.run(*, input_path: Iterable[str | Path], param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 4, fq_gzip: bool = True, num_reads: int = 65536, num_cpus: int = 4, force: bool = False)

Parameters:

param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]
sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
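The min_mut_gap criterion can be illustrated with a small check. This is a sketch under an explicit assumption: "separated by fewer than this many bases" is taken to mean the number of bases strictly between two mutated positions. The function name has_close_mutations is hypothetical, and the package may count the distance differently.

```python
def has_close_mutations(mut_positions, min_mut_gap):
    """Return True if any two mutations are separated by fewer than
    min_mut_gap bases.

    Assumes the separation is the count of bases strictly between the
    two mutated positions (hypothetical interpretation).
    """
    ordered = sorted(set(mut_positions))
    # Only adjacent mutations need checking: they are the closest pairs.
    for left, right in zip(ordered, ordered[1:]):
        if right - left - 1 < min_mut_gap:
            return True
    return False
```

Under this interpretation and the default of 4, mutations at positions 10 and 14 (three bases between them) would cause the read to be masked, while mutations at 10 and 15 would not.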

seismicrna.sim.fold.fold_region(region: Region, *, sim_dir: Path, tmp_dir: Path, profile_name: str, fold_constraint: Path | None, fold_temp: float, fold_md: int, fold_mfe: bool, fold_max: int, fold_percent: float, keep_tmp: bool, force: bool, num_cpus: int)

seismicrna.sim.fold.get_ct_path(top: Path, region: Region, profile: str): Get the path of a connectivity table (CT) file.

seismicrna.sim.fold.run(fasta: str | Path, *, sim_dir: str | Path = './sim', profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4, tmp_pfx='./tmp')

Parameters:

sim_dir (str | pathlib.Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

seismicrna.sim.muts.calc_pmut_pattern(pmut: DataFrame, pattern: RelPattern): Calculate the rate of a given type of mutation.

seismicrna.sim.muts.load_pmut(pmut_file: Path): Load mutation rates from a file.

seismicrna.sim.muts.make_pmut_means(*, ploq: float = 0.02, pam: float, pac: float = 0.3, pag: float = 0.16, pat: float = 0.5, pcm: float, pca: float = 0.32, pcg: float = 0.32, pct: float = 0.32, pgm: float, pga: float = 0.32, pgc: float = 0.32, pgt: float = 0.32, ptm: float, pta: float = 0.32, ptc: float = 0.32, ptg: float = 0.32, pnm: float = 0.0, pnd: float = 0.04)

Generate mean mutation rates.

Mutations are assumed to behave as follows:

- A base n mutates with probability pnm.
  - If it mutates, then it is a substitution with probability (pna + pnc + png + pnt).
    - If it is a substitution, then it is high-quality with probability (1 - ploq).
    - Otherwise, it is low-quality.
  - Otherwise, it is a deletion.
- Otherwise, it is low-quality with probability ploq.

So the overall probability of being low-quality is the probability given a mutation, pnm * (pna + pnc + png + pnt) * ploq, plus the probability given no mutation, (1 - pnm) * ploq, which equals ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))).
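The decision tree above can be checked numerically. The sketch below enumerates the outcomes for one base and compares the summed low-quality probability with the closed form ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))). The function outcome_probabilities and its outcome labels are hypothetical, and the remaining case (unmutated and high-quality) is implied by the model rather than listed. The example values are the defaults shown on this page for an unpaired A.

```python
def outcome_probabilities(pnm, psub, ploq):
    """Probability of each outcome for one base.

    psub is the sum pna + pnc + png + pnt, i.e. the probability that a
    mutated base is a substitution. Labels are hypothetical.
    """
    return {
        "substitution_high_quality": pnm * psub * (1.0 - ploq),
        "substitution_low_quality": pnm * psub * ploq,
        "deletion": pnm * (1.0 - psub),
        "unmutated_low_quality": (1.0 - pnm) * ploq,
        # Implied remainder: neither mutated nor low-quality.
        "unmutated_high_quality": (1.0 - pnm) * (1.0 - ploq),
    }


# Defaults for an unpaired A: pam = 0.045, pac + pag + pat = 0.96, ploq = 0.02.
pnm, psub, ploq = 0.045, 0.3 + 0.16 + 0.5, 0.02
probs = outcome_probabilities(pnm, psub, ploq)
low_quality = probs["substitution_low_quality"] + probs["unmutated_low_quality"]
closed_form = ploq * (1.0 - pnm * (1.0 - psub))
```

The five outcomes sum to 1, and low_quality agrees with closed_form up to floating-point rounding.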

Parameters:

ploq (float) – Probability that a base is low-quality.
pam (float) – Probability that an A is mutated.
pac (float) – Probability that a mutated A is a substitution to C.
pag (float) – Probability that a mutated A is a substitution to G.
pat (float) – Probability that a mutated A is a substitution to T.
pcm (float) – Probability that a C is mutated.
pca (float) – Probability that a mutated C is a substitution to A.
pcg (float) – Probability that a mutated C is a substitution to G.
pct (float) – Probability that a mutated C is a substitution to T.
pgm (float) – Probability that a G is mutated.
pga (float) – Probability that a mutated G is a substitution to A.
pgc (float) – Probability that a mutated G is a substitution to C.
pgt (float) – Probability that a mutated G is a substitution to T.
ptm (float) – Probability that a T is mutated.
pta (float) – Probability that a mutated T is a substitution to A.
ptc (float) – Probability that a mutated T is a substitution to C.
ptg (float) – Probability that a mutated T is a substitution to G.
pnm (float) – Probability that an N is mutated.
pnd (float) – Probability that a mutated N is a deletion.

Returns:

Mean rate of each type of mutation (column) and each base (row).

Return type:

pd.DataFrame

seismicrna.sim.muts.make_pmut_means_paired(pam: float = 0.005, pcm: float = 0.003, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs): Generate mean mutation rates for paired bases.

seismicrna.sim.muts.make_pmut_means_unpaired(pam: float = 0.045, pcm: float = 0.039, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs): Generate mean mutation rates for unpaired bases.

seismicrna.sim.muts.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, force: bool = False, num_cpus: int = 4)

Simulate the rate of each kind of mutation at each position.

Parameters:

ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.muts.run_struct(ct_file: Path, pmut_paired: Iterable[tuple[str, float]], pmut_unpaired: Iterable[tuple[str, float]], vmut_paired: float, vmut_unpaired: float, force: bool)

seismicrna.sim.muts.sim_pmut(positions: Index, mean: DataFrame, relative_variance: float)

Simulate mutation rates using a Dirichlet distribution.

Parameters:

positions (pd.Index) – Index of positions and bases.
mean (pd.DataFrame) – Mean of the mutation rates for each type of base.
relative_variance (float) – Variance of the mutation rates, as a fraction of its supremum.

Returns:

Mutation rates, with the same index as

Return type:

pd.DataFrame

seismicrna.sim.muts.verify_proportions(p: Any)

Verify that p is a valid set of proportions:

Every element of p must be ≥ 0 and ≤ 1.
The sum of p must equal 1.

Parameters:

p (Any) – Proportions to verify; must be a NumPy array or convertible into a NumPy array.
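The two conditions can be sketched as follows. This is an illustration only: check_proportions is a hypothetical name, it accepts any iterable of numbers instead of a NumPy array, and both the ValueError and the tolerance on the sum are assumptions, since this page does not say how verify_proportions reports a failure or compares the sum.

```python
import math


def check_proportions(p, tol=1e-9):
    """Validate that p is a set of proportions and return it as a list.

    Hypothetical stand-in; raises ValueError on failure (an assumption).
    """
    values = [float(x) for x in p]
    for x in values:
        # This comparison is also False for NaN, so NaN is rejected.
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"Every proportion must be >= 0 and <= 1, but got {x}")
    total = math.fsum(values)
    # Allow for floating-point rounding when comparing the sum to 1.
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"Proportions must sum to 1, but got {total}")
    return values
```

A tolerance is used because sums of floating-point proportions rarely equal 1 exactly.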

seismicrna.sim.params.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4)

Simulate parameter files.

Parameters:

ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

seismicrna.sim.ref.get_fasta_path(top: Path, ref: str): Get the path of a FASTA file.

seismicrna.sim.ref.run(*, sim_dir: str | Path = './sim', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, force: bool = False)

Parameters:

sim_dir (str | pathlib.Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]
refs (str) – Give this name to the file of simulated references [keyword-only, default: ‘sim-refs’]
ref (str) – Give this name to the simulated reference [keyword-only, default: ‘sim-ref’]
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
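To make the reflen and ref parameters concrete, the sketch below draws a random DNA sequence of a given length and formats it as a FASTA entry. Both helper names are hypothetical, and drawing each base uniformly from A, C, G, and T is an assumption; this page does not state how the package chooses bases or wraps sequence lines.

```python
import random


def random_dna(length, rng=None):
    """Return a random DNA sequence with uniformly drawn bases (assumption)."""
    if length < 1:
        raise ValueError(f"length must be >= 1, but got {length}")
    if rng is None:
        rng = random.Random()
    return "".join(rng.choices("ACGT", k=length))


def format_fasta(ref, seq, width=60):
    """Format one FASTA entry, wrapping the sequence at width characters."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{ref}\n" + "\n".join(lines) + "\n"
```

For example, format_fasta("sim-ref", random_dna(280)) yields a header line followed by five sequence lines (four of 60 bases and one of 40).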

seismicrna.sim.relate.from_param_dir(param_dir: Path, profile: str, min_mut_gap: int, **kwargs): Simulate a Relate dataset given parameter files.

seismicrna.sim.relate.get_param_dir_fields(param_dir: Path)

seismicrna.sim.relate.load_param_dir(param_dir: Path, profile: str): Load all parameters for a profile in a directory.

seismicrna.sim.relate.run(*, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', branch: str = '', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 4, num_reads: int = 65536, batch_size: int = 65536, write_read_names: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, tmp_pfx='./tmp', keep_tmp=False)

Simulate a Relate dataset.

Parameters:

param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]
sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]
branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]
write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
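The relationship between num_reads and batch_size can be pictured with a simple split. This sketch shows one scheme that satisfies "at most this many reads" per batch (full batches followed by one smaller remainder); batch_sizes is a hypothetical helper, and the package may divide reads among batches differently.

```python
def batch_sizes(num_reads, batch_size):
    """Split num_reads into batches of at most batch_size reads each.

    Hypothetical helper: full batches first, then any remainder.
    """
    if num_reads < 0:
        raise ValueError(f"num_reads must be >= 0, but got {num_reads}")
    if batch_size < 1:
        raise ValueError(f"batch_size must be >= 1, but got {batch_size}")
    full, rest = divmod(num_reads, batch_size)
    return [batch_size] * full + ([rest] if rest else [])
```

With the defaults shown above (65536 reads and a batch size of 65536), this scheme produces a single batch.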

seismicrna.sim.total.run(*, sim_dir: str | Path = './sim', tmp_pfx: str | Path = './tmp', sample: str = 'sim-sample', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 4, fq_gzip: bool = True, num_reads: int = 65536, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4)

Simulate FASTQ files from scratch.

Parameters:

sim_dir (str | pathlib.Path) – Write all simulated files to this directory [keyword-only, default: ‘./sim’]
tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]
refs (str) – Give this name to the file of simulated references [keyword-only, default: ‘sim-refs’]
ref (str) – Give this name to the simulated reference [keyword-only, default: ‘sim-ref’]
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]