seismicrna.sim package
Subpackages
Submodules
- seismicrna.sim.abstract.abstract_table(table: MaskPositionTableLoader, struct_file: str | Path)
- seismicrna.sim.abstract.get_acgt_parameters()
- seismicrna.sim.abstract.get_other_parameters()
- seismicrna.sim.abstract.new_parameter_dict()
- seismicrna.sim.abstract.run(input_path: Iterable[str | Path], *, struct_file: Iterable[str | Path] = (), print_params: bool = True, verify_times: bool = True, num_cpus: int = 4)
Abstract simulation parameters from existing datasets.
- Parameters:
struct_file (Iterable) – Compare mutational profiles to the structure(s) in this CT file [keyword-only, default: ()]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.clusts.run(*, ct_file: Iterable[str | Path] = (), clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4)
Simulate the proportion of each cluster.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.clusts.sim_pclust(num_clusters: int, concentration: float | None = None, sort: bool = True)
Simulate proportions of clusters using a Dirichlet distribution.
- Parameters:
- Returns:
Simulated proportion of each cluster.
- Return type:
pd.Series
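The idea behind drawing cluster proportions from a Dirichlet distribution can be sketched as follows. This is a minimal illustration using NumPy and pandas, not the library's implementation: the name `sketch_sim_pclust`, the `seed` argument, the cluster labels 1..n, and the requirement that the concentration be positive are all assumptions made here. How `sim_pclust` itself treats `concentration=None` (or the 0.0 default of `run`) is not reproduced.

```python
import numpy as np
import pandas as pd


def sketch_sim_pclust(num_clusters, concentration=1.0, sort=True, seed=None):
    """Hypothetical stand-in for sim_pclust (not the library's code)."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    if concentration <= 0.0:
        raise ValueError("this sketch requires a positive concentration")
    rng = np.random.default_rng(seed)
    # Symmetric Dirichlet: every cluster gets the same concentration.
    props = rng.dirichlet(np.full(num_clusters, concentration))
    if sort:
        # Put the most abundant cluster first.
        props = np.sort(props)[::-1]
    # Label clusters 1..num_clusters (an assumed labeling).
    index = pd.RangeIndex(1, num_clusters + 1, name="cluster")
    return pd.Series(props, index=index)


pclust = sketch_sim_pclust(3, concentration=2.0, seed=0)
```

Smaller concentrations tend to produce more uneven proportions, while larger ones pull every cluster toward 1 / num_clusters.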
- seismicrna.sim.ends.run(*, ct_file: Iterable[str | Path] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, force: bool = False, num_cpus: int = 4)
Simulate the probabilities of read end coordinates.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.ends.sim_pends(end5: int, end3: int, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, keep_empty_reads: bool = True)
Simulate segment end coordinate probabilities.
- Parameters:
end5 (int) – 5’ end of the region (minimum allowed 5’ end coordinate).
end3 (int) – 3’ end of the region (maximum allowed 3’ end coordinate).
center_fmean (float) – Mean read center, as a fraction of the reference length.
center_fvar (float) – Variance of the read center, as a fraction of its maximum.
length_fmean (float) – Mean read length, as a fraction of the available length.
length_fvar (float) – Variance of the read length, as a fraction of its maximum.
keep_empty_reads (bool) – Whether to keep reads whose lengths are 0.
- Returns:
5’ and 3’ coordinates and their probabilities.
- Return type:
tuple[np.ndarray, np.ndarray, np.ndarray]
- seismicrna.sim.ends.sim_pends_ct(ct_file: Path, *, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, force: bool)
- seismicrna.sim.fastq.from_param_dir(param_dir: Path, *, sample: str, profile: str, read_length: int, paired: bool, p_rev: float, fq_gzip: bool, force: bool, **kwargs)
Simulate a FASTQ file from parameter files.
- seismicrna.sim.fastq.from_report(report_file: Path, *, read_length: int, p_rev: float, fq_gzip: bool, force: bool)
Simulate a FASTQ file from a Relate report.
- seismicrna.sim.fastq.generate_fastq(top: Path, sample: str, ref: str, refseq: DNA, paired: bool, read_length: int, batches: Iterable[tuple[RelateRegionMutsBatch, ReadNamesBatch]], p_rev: float = 0.5, fq_gzip: bool = True, force: bool = False)
Generate FASTQ file(s) from a dataset.
- seismicrna.sim.fastq.generate_fastq_record(name: str, rels: ndarray, refseq: str, adapter: str, read_length: int, reverse: bool = False, hi_qual: str = 'I', lo_qual: str = '!')
Generate a FASTQ record for a read.
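For orientation, a FASTQ record is four lines: a header beginning with `@`, the base calls, a `+` separator, and one quality character per base. The sketch below only illustrates that layout; `sketch_fastq_record` is a hypothetical helper, not the library's `generate_fastq_record`, and it takes the sequence and qualities directly instead of deriving them from `rels`. The defaults `hi_qual='I'` and `lo_qual='!'` correspond to Phred scores 40 and 0 under the common Phred+33 encoding.

```python
def sketch_fastq_record(name: str, seq: str, quals: str) -> str:
    """Hypothetical helper: format one four-line FASTQ record."""
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    # Header, base calls, separator, quality string.
    return f"@{name}\n{seq}\n+\n{quals}\n"


def phred33(char: str) -> int:
    """Decode one Phred+33 quality character to its score."""
    return ord(char) - 33


# Third base is marked low-quality ('!'), the rest high-quality ('I').
record = sketch_fastq_record("read1", "ACGT", "II!I")
```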
- seismicrna.sim.fastq.run(*, input_path: Iterable[str | Path], param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 4, fq_gzip: bool = True, num_reads: int = 65536, num_cpus: int = 4, force: bool = False)
- Parameters:
param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
- seismicrna.sim.fold.fold_region(region: Region, *, sim_dir: Path, tmp_dir: Path, profile_name: str, fold_constraint: Path | None, fold_temp: float, fold_md: int, fold_mfe: bool, fold_max: int, fold_percent: float, keep_tmp: bool, force: bool, num_cpus: int)
- seismicrna.sim.fold.get_ct_path(top: Path, region: Region, profile: str)
Get the path of a connectivity table (CT) file.
- seismicrna.sim.fold.run(fasta: str | Path, *, sim_dir: str | Path = './sim', profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4, tmp_pfx='./tmp')
- Parameters:
sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: './sim']
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
- seismicrna.sim.muts.calc_pmut_pattern(pmut: DataFrame, pattern: RelPattern)
Calculate the rate of a given type of mutation.
- seismicrna.sim.muts.make_pmut_means(*, ploq: float = 0.02, pam: float, pac: float = 0.3, pag: float = 0.16, pat: float = 0.5, pcm: float, pca: float = 0.32, pcg: float = 0.32, pct: float = 0.32, pgm: float, pga: float = 0.32, pgc: float = 0.32, pgt: float = 0.32, ptm: float, pta: float = 0.32, ptc: float = 0.32, ptg: float = 0.32, pnm: float = 0.0, pnd: float = 0.04)
Generate mean mutation rates.
Mutations are assumed to behave as follows:
A base n mutates with probability pnm.
  - If it mutates, then it is a substitution with probability (pna + pnc + png + pnt).
    - If it is a substitution, then it is high-quality with probability (1 - ploq).
    - Otherwise, it is low-quality.
  - Otherwise, it is a deletion.
Otherwise, it is low-quality with probability ploq.
So the overall probability of being low-quality is the probability of a low-quality substitution, pnm * (pna + pnc + png + pnt) * ploq, plus the probability of a low-quality unmutated base, (1 - pnm) * ploq, which equals ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))).
- Parameters:
ploq (float) – Probability that a base is low-quality.
pam (float) – Probability that an A is mutated.
pac (float) – Probability that a mutated A is a substitution to C.
pag (float) – Probability that a mutated A is a substitution to G.
pat (float) – Probability that a mutated A is a substitution to T.
pcm (float) – Probability that a C is mutated.
pca (float) – Probability that a mutated C is a substitution to A.
pcg (float) – Probability that a mutated C is a substitution to G.
pct (float) – Probability that a mutated C is a substitution to T.
pgm (float) – Probability that a G is mutated.
pga (float) – Probability that a mutated G is a substitution to A.
pgc (float) – Probability that a mutated G is a substitution to C.
pgt (float) – Probability that a mutated G is a substitution to T.
ptm (float) – Probability that a T is mutated.
pta (float) – Probability that a mutated T is a substitution to A.
ptc (float) – Probability that a mutated T is a substitution to C.
ptg (float) – Probability that a mutated T is a substitution to G.
pnm (float) – Probability that an N is mutated.
pnd (float) – Probability that a mutated N is a deletion.
- Returns:
Mean rate of each type of mutation (column) and each base (row).
- Return type:
pd.DataFrame
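The low-quality probability in the description of `make_pmut_means` can be checked numerically. The sketch below evaluates the two-term sum and the factored form for an A base, using the defaults listed above (`pac=0.3`, `pag=0.16`, `pat=0.5`, `ploq=0.02`) and `pam=0.045` from `make_pmut_means_unpaired`. The function names are hypothetical and exist only for this check.

```python
import math


def p_low_quality_sum(pnm: float, psub: float, ploq: float) -> float:
    """Low-quality substitution plus low-quality unmutated base."""
    low_qual_substitution = pnm * psub * ploq
    low_qual_unmutated = (1.0 - pnm) * ploq
    return low_qual_substitution + low_qual_unmutated


def p_low_quality_factored(pnm: float, psub: float, ploq: float) -> float:
    """Factored form: ploq * (1 - pnm * (1 - psub))."""
    return ploq * (1.0 - pnm * (1.0 - psub))


# Values for an unpaired A, taken from the defaults in this section.
pam = 0.045
psub_a = 0.3 + 0.16 + 0.5  # pac + pag + pat
ploq = 0.02

two_term = p_low_quality_sum(pam, psub_a, ploq)
factored = p_low_quality_factored(pam, psub_a, ploq)
# Both forms should agree up to floating-point rounding.
agree = math.isclose(two_term, factored)
```

The remaining 1 - psub_a = 0.04 of mutated A bases are deletions, which are not assigned a quality in this model, so the low-quality probability falls slightly below `ploq`.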
- seismicrna.sim.muts.make_pmut_means_paired(pam: float = 0.005, pcm: float = 0.003, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs)
Generate mean mutation rates for paired bases.
- seismicrna.sim.muts.make_pmut_means_unpaired(pam: float = 0.045, pcm: float = 0.039, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs)
Generate mean mutation rates for unpaired bases.
- seismicrna.sim.muts.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, force: bool = False, num_cpus: int = 4)
Simulate the rate of each kind of mutation at each position.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.muts.run_struct(ct_file: Path, pmut_paired: Iterable[tuple[str, float]], pmut_unpaired: Iterable[tuple[str, float]], vmut_paired: float, vmut_unpaired: float, force: bool)
- seismicrna.sim.muts.sim_pmut(positions: Index, mean: DataFrame, relative_variance: float)
Simulate mutation rates using a Dirichlet distribution.
- Parameters:
positions (pd.Index) – Index of positions and bases.
mean (pd.DataFrame) – Mean of the mutation rates for each type of base.
relative_variance (float) – Variance of the mutation rates, as a fraction of its supremum.
- Returns:
Mutation rates, with the same index as
- Return type:
pd.DataFrame
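One way to read "variance as a fraction of its supremum" is through the Dirichlet's total concentration. For a Dirichlet with mean vector m and total concentration a0, component i has variance m_i * (1 - m_i) / (a0 + 1), which approaches its supremum m_i * (1 - m_i) as a0 approaches 0. Setting a0 = 1 / v - 1 therefore gives a relative variance of v. The sketch below applies that parameterization with NumPy; it is an assumption about how such a simulation could work, not a transcription of `sim_pmut`, and `sketch_dirichlet_from_mean` and its arguments are hypothetical.

```python
import numpy as np


def sketch_dirichlet_from_mean(mean, relative_variance, size, seed=None):
    """Hypothetical helper: sample proportions with a given mean vector
    and variance expressed as a fraction of its supremum."""
    mean = np.asarray(mean, dtype=float)
    if np.any(mean <= 0.0) or not np.isclose(mean.sum(), 1.0):
        raise ValueError("mean must be positive and sum to 1")
    if not 0.0 < relative_variance < 1.0:
        raise ValueError("relative_variance must be between 0 and 1")
    # Total concentration a0 such that 1 / (a0 + 1) == relative_variance.
    total = 1.0 / relative_variance - 1.0
    rng = np.random.default_rng(seed)
    # Each row is one draw of proportions; rows sum to 1.
    return rng.dirichlet(total * mean, size=size)


# Example: no mutation, substitution, deletion (illustrative values only).
rates = sketch_dirichlet_from_mean([0.90, 0.06, 0.04], 0.02, size=5, seed=0)
```

With this parameterization, a smaller relative variance concentrates every draw near the mean, which matches the smaller default for paired bases (0.001) than for unpaired bases (0.02).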
- seismicrna.sim.muts.verify_proportions(p: Any)
Verify that p is a valid set of proportions:
Every element of p must be ≥ 0 and ≤ 1.
The sum of p must equal 1.
- Parameters:
p (Any) – Proportions to verify; must be a NumPy array or convertible into a NumPy array.
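The two conditions on p can be sketched as a small check. This is an illustration under assumptions, not the library's `verify_proportions`: the name `sketch_verify_proportions`, the use of `ValueError`, and the floating-point tolerance of `np.isclose` for the sum are choices made here.

```python
import numpy as np


def sketch_verify_proportions(p) -> None:
    """Hypothetical check that p is a valid set of proportions."""
    arr = np.asarray(p, dtype=float)
    # Every element must lie in [0, 1].
    if np.any(arr < 0.0) or np.any(arr > 1.0):
        raise ValueError(f"All proportions must be in [0, 1], but got {arr}")
    # The elements must sum to 1, up to floating-point tolerance (assumed).
    if not np.isclose(arr.sum(), 1.0):
        raise ValueError(f"Proportions must sum to 1, but got {arr.sum()}")


sketch_verify_proportions([0.2, 0.3, 0.5])  # passes silently
```

Comparing the sum with a tolerance instead of exact equality avoids rejecting inputs such as `[0.1] * 10`, whose floating-point sum is not exactly 1.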
- seismicrna.sim.params.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4)
Simulate parameter files.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.ref.run(*, sim_dir: str | Path = './sim', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, force: bool = False)
- Parameters:
sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: './sim']
refs (str) – Give this name to the file of simulated references [keyword-only, default: 'sim-refs']
ref (str) – Give this name to the simulated reference [keyword-only, default: 'sim-ref']
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
- seismicrna.sim.relate.from_param_dir(param_dir: Path, profile: str, min_mut_gap: int, **kwargs)
Simulate a Relate dataset given parameter files.
- seismicrna.sim.relate.load_param_dir(param_dir: Path, profile: str)
Load all parameters for a profile in a directory.
- seismicrna.sim.relate.run(*, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', branch: str = '', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 4, num_reads: int = 65536, batch_size: int = 65536, write_read_names: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, tmp_pfx='./tmp', keep_tmp=False)
Simulate a Relate dataset.
- Parameters:
param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
branch (str) – Create a new branch of the workflow with this name [keyword-only, default: '']
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]
write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
- seismicrna.sim.total.run(*, sim_dir: str | Path = './sim', tmp_pfx: str | Path = './tmp', sample: str = 'sim-sample', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 4, fq_gzip: bool = True, num_reads: int = 65536, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4)
Simulate FASTQ files from scratch.
- Parameters:
sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: './sim']
tmp_pfx (str | Path) – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
refs (str) – Give this name to the file of simulated references [keyword-only, default: 'sim-refs']
ref (str) – Give this name to the simulated reference [keyword-only, default: 'sim-ref']
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 4]
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]