seismicrna.sim package
Subpackages
Submodules
- seismicrna.sim.clusts.load_pclust(pclust_file: Path)
Load cluster proportions from a file.
- seismicrna.sim.clusts.run(*, ct_file: Iterable[str | Path] = (), clust_conc: float = 0.0, force: bool = False, max_procs: int = 4)
Simulate the proportions of clusters.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
- seismicrna.sim.clusts.sim_pclust(num_clusters: int, concentration: float | None = None, sort: bool = True)
Simulate proportions of clusters using a Dirichlet distribution.
- Parameters:
num_clusters (int) – Number of clusters to simulate; must be ≥ 1.
concentration (float | None = None) – Concentration parameter for Dirichlet distribution; defaults to 1 / (num_clusters - 1); must be > 0.
sort (bool = True) – Sort the cluster proportions from greatest to least.
- Returns:
Simulated proportion of each cluster.
- Return type:
pd.Series
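The sampling described for sim_pclust can be sketched with NumPy. This is a minimal illustration under stated assumptions, not SEISMIC-RNA's implementation: the name `sketch_sim_pclust`, the `seed` argument, the handling of a single cluster, and the plain-array return value (the documented function returns a pd.Series) are all hypothetical. Only the symmetric Dirichlet draw, the default concentration of 1 / (num_clusters - 1), and the optional sort come from the description.

```python
import numpy as np


def sketch_sim_pclust(num_clusters, concentration=None, sort=True, seed=None):
    """Draw cluster proportions from a symmetric Dirichlet distribution."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be >= 1")
    if num_clusters == 1:
        # Assumption: a single cluster takes the whole population, which
        # also avoids dividing by zero in the default concentration.
        return np.ones(1)
    if concentration is None:
        concentration = 1.0 / (num_clusters - 1)
    if concentration <= 0.0:
        raise ValueError("concentration must be > 0")
    rng = np.random.default_rng(seed)
    props = rng.dirichlet(np.full(num_clusters, concentration))
    if sort:
        # Sort from greatest to least.
        props = np.sort(props)[::-1]
    return props
```

Smaller concentrations tend to produce more uneven proportions, while larger ones pull the proportions toward 1 / num_clusters.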
- seismicrna.sim.ends.load_pends(pends_file: Path)
Load end coordinate proportions from a file.
- seismicrna.sim.ends.run(*, ct_file: Iterable[str | Path] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, force: bool = False, max_procs: int = 4)
Simulate the proportions of read end coordinates.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
- seismicrna.sim.ends.sim_pends(end5: int, end3: int, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, keep_empty_reads: bool = True)
Simulate segment end coordinate probabilities.
- Parameters:
end5 (int) – 5’ end of the region (minimum allowed 5’ end coordinate).
end3 (int) – 3’ end of the region (maximum allowed 3’ end coordinate).
center_fmean (float) – Mean read center, as a fraction of the reference length.
center_fvar (float) – Variance of the read center, as a fraction of its maximum.
length_fmean (float) – Mean read length, as a fraction of the available length.
length_fvar (float) – Variance of the read length, as a fraction of its maximum.
keep_empty_reads (bool) – Whether to keep reads whose lengths are 0.
- Returns:
5’ and 3’ coordinates and their probabilities.
- Return type:
tuple[np.ndarray, np.ndarray, np.ndarray]
- seismicrna.sim.ends.sim_pends_ct(ct_file: Path, *, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, force: bool)
- seismicrna.sim.fastq.from_param_dir(param_dir: Path, *, sample: str, profile: str, read_length: int, paired: bool, p_rev: float, fq_gzip: bool, force: bool, **kwargs)
Simulate a FASTQ file from parameter files.
- seismicrna.sim.fastq.from_report(report_file: Path, *, read_length: int, p_rev: float, fq_gzip: bool, force: bool)
Simulate a FASTQ file from a Relate report.
- seismicrna.sim.fastq.generate_fastq(top: Path, sample: str, ref: str, refseq: DNA, paired: bool, read_length: int, batches: Iterable[tuple[RelateBatch, ReadNamesBatch]], p_rev: float = 0.5, fq_gzip: bool = True, force: bool = False)
Generate FASTQ file(s) from a dataset.
- seismicrna.sim.fastq.generate_fastq_record(name: str, rels: ndarray, refseq: str, adapter: str, read_length: int, reverse: bool = False, hi_qual: str = 'I', lo_qual: str = '!')
Generate a FASTQ line for a read.
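For orientation, a FASTQ record is four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The sketch below is a simplified illustration, not the signature or logic of generate_fastq_record: the name `sketch_fastq_record` and the `low_quality` argument are hypothetical. It reuses only the documented default quality characters, which in Phred+33 encoding correspond to quality 40 (`I`) and quality 0 (`!`).

```python
def sketch_fastq_record(name, seq, low_quality, hi_qual="I", lo_qual="!"):
    """Format one four-line FASTQ record.

    low_quality is an iterable of booleans, one per base, that is True
    where the base call should receive the low-quality character.
    """
    flags = list(low_quality)
    if len(flags) != len(seq):
        raise ValueError("low_quality must have one flag per base")
    # One quality character per base, in the same order as the sequence.
    qual = "".join(lo_qual if flag else hi_qual for flag in flags)
    return f"@{name}\n{seq}\n+\n{qual}\n"
```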
- seismicrna.sim.fastq.run(*, input_path: Iterable[str | Path], param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 3, fq_gzip: bool = True, num_reads: int = 65536, max_procs: int = 4, force: bool = False)
- Parameters:
param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
- seismicrna.sim.fold.fold_region(region: Region, *, sim_dir: Path, tmp_dir: Path, profile_name: str, fold_constraint: Path | None, fold_temp: float, fold_md: int, fold_mfe: bool, fold_max: int, fold_percent: float, keep_tmp: bool, force: bool, n_procs: int)
- seismicrna.sim.fold.get_ct_path(top: Path, region: Region, profile: str)
Get the path of a connectivity table (CT) file.
- seismicrna.sim.fold.run(fasta: str | Path, *, sim_dir: str | Path = './sim', profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, keep_tmp: bool = False, force: bool = False, max_procs: int = 4, tmp_pfx='./tmp')
- Parameters:
sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: './sim']
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
- seismicrna.sim.muts.calc_pmut_pattern(pmut: DataFrame, pattern: RelPattern)
Calculate the rate of a given type of mutation.
- seismicrna.sim.muts.load_pmut(pmut_file: Path)
Load mutation rates from a file.
- seismicrna.sim.muts.make_pmut_means(*, ploq: float = 0.02, pam: float, pac: float = 0.3, pag: float = 0.16, pat: float = 0.5, pcm: float, pca: float = 0.32, pcg: float = 0.32, pct: float = 0.32, pgm: float, pga: float = 0.32, pgc: float = 0.32, pgt: float = 0.32, ptm: float, pta: float = 0.32, ptc: float = 0.32, ptg: float = 0.32, pnm: float = 0.0, pnd: float = 0.04)
Generate mean mutation rates.
Mutations are assumed to behave as follows:
A base n mutates with probability pnm.
  - If it mutates, then it is a substitution with probability (pna + pnc + png + pnt).
    - If it is a substitution, then it is high-quality with probability (1 - ploq). Otherwise, it is low-quality.
    - Otherwise, it is a deletion.
  - Otherwise, it is low-quality with probability ploq.
So the overall probability of being low-quality is the probability given a mutation, pnm * (pna + pnc + png + pnt) * ploq, plus the probability given no mutation, (1 - pnm) * ploq, which equals ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))).
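As a quick check of this arithmetic, the two-term sum and its factored form can be compared numerically, writing pnm for the mutation probability of base n throughout. The function names and the example values below are illustrative assumptions, not part of the package; `psub` stands for the sum (pna + pnc + png + pnt).

```python
def p_low_quality(pnm, psub, ploq):
    """Overall probability that a base is low-quality, as a sum of cases.

    pnm:  probability that the base mutates.
    psub: probability that a mutation is a substitution.
    ploq: probability that a base call is low-quality.
    """
    given_mutation = pnm * psub * ploq      # substitution, then low-quality
    given_no_mutation = (1.0 - pnm) * ploq  # no mutation, then low-quality
    return given_mutation + given_no_mutation


def p_low_quality_factored(pnm, psub, ploq):
    """Same quantity, factored as ploq * (1 - pnm * (1 - psub))."""
    return ploq * (1.0 - pnm * (1.0 - psub))


# Illustrative values only (not the package defaults).
pnm, psub, ploq = 0.04, 0.96, 0.02
direct = p_low_quality(pnm, psub, ploq)
factored = p_low_quality_factored(pnm, psub, ploq)
assert abs(direct - factored) < 1e-12
```

Deletions contribute nothing to the low-quality probability in this model, which is why the result falls below ploq whenever psub < 1 and pnm > 0.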
- Parameters:
ploq (float) – Probability that a base is low-quality.
pam (float) – Probability that an A is mutated.
pac (float) – Probability that a mutated A is a substitution to C.
pag (float) – Probability that a mutated A is a substitution to G.
pat (float) – Probability that a mutated A is a substitution to T.
pcm (float) – Probability that a C is mutated.
pca (float) – Probability that a mutated C is a substitution to A.
pcg (float) – Probability that a mutated C is a substitution to G.
pct (float) – Probability that a mutated C is a substitution to T.
pgm (float) – Probability that a G is mutated.
pga (float) – Probability that a mutated G is a substitution to A.
pgc (float) – Probability that a mutated G is a substitution to C.
pgt (float) – Probability that a mutated G is a substitution to T.
ptm (float) – Probability that a T is mutated.
pta (float) – Probability that a mutated T is a substitution to A.
ptc (float) – Probability that a mutated T is a substitution to C.
ptg (float) – Probability that a mutated T is a substitution to G.
pnm (float) – Probability that an N is mutated.
pnd (float) – Probability that a mutated N is a deletion.
- Returns:
Mean rate of each type of mutation (column) and each base (row).
- Return type:
pd.DataFrame
- seismicrna.sim.muts.make_pmut_means_paired(pam: float = 0.005, pcm: float = 0.003, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs)
Generate mean mutation rates for paired bases.
- seismicrna.sim.muts.make_pmut_means_unpaired(pam: float = 0.045, pcm: float = 0.039, pgm: float = 0.003, ptm: float = 0.001, pnm: float = 0.002, **kwargs)
Generate mean mutation rates for unpaired bases.
- seismicrna.sim.muts.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, force: bool = False, max_procs: int = 4)
Simulate the rate of each kind of mutation at each position.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
- seismicrna.sim.muts.run_struct(ct_file: Path, pmut_paired: Iterable[tuple[str, float]], pmut_unpaired: Iterable[tuple[str, float]], vmut_paired: float, vmut_unpaired: float, force: bool)
- seismicrna.sim.muts.sim_pmut(positions: Index, mean: DataFrame, relative_variance: float)
Simulate mutation rates using a Dirichlet distribution.
- Parameters:
positions (pd.Index) – Index of positions and bases.
mean (pd.DataFrame) – Mean of the mutation rates for each type of base.
relative_variance (float) – Variance of the mutation rates, as a fraction of its supremum.
- Returns:
Mutation rates, with the same index as positions.
- Return type:
pd.DataFrame
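One way to read "variance as a fraction of its supremum" is through the Dirichlet variance formula: with mean vector m and total concentration a0, component i has variance m_i (1 - m_i) / (a0 + 1), which approaches its supremum m_i (1 - m_i) as a0 approaches 0. Under that reading, a relative variance v corresponds to a0 = 1 / v - 1. The sketch below illustrates this for a single probability vector; it is an assumption about the parameterization, not the code of sim_pmut, and the name `sketch_sim_rates` and its `size` and `seed` arguments are hypothetical.

```python
import numpy as np


def sketch_sim_rates(mean, relative_variance, size=1, seed=None):
    """Draw probability vectors from a Dirichlet with a given mean.

    Assumes mean is a full probability vector (positive, summing to 1)
    and that relative_variance lies strictly between 0 and 1.
    """
    mean = np.asarray(mean, dtype=float)
    if np.any(mean <= 0.0) or not np.isclose(mean.sum(), 1.0):
        raise ValueError("mean must be positive and sum to 1")
    if not 0.0 < relative_variance < 1.0:
        raise ValueError("relative_variance must be in (0, 1)")
    # Var[x_i] = m_i * (1 - m_i) / (a0 + 1), so v = 1 / (a0 + 1).
    a0 = 1.0 / relative_variance - 1.0
    rng = np.random.default_rng(seed)
    # Each row of the result is one simulated probability vector.
    return rng.dirichlet(a0 * mean, size=size)
```

Each row of the result sums to 1, and a smaller relative_variance keeps the simulated rates closer to mean.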
- seismicrna.sim.muts.verify_proportions(p: Any)
Verify that p is a valid set of proportions:
Every element of p must be ≥ 0 and ≤ 1.
The sum of p must equal 1.
- Parameters:
p (Any) – Proportions to verify; must be a NumPy array or convertible into a NumPy array.
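The two conditions can be sketched with NumPy as follows. The name `sketch_verify_proportions`, the `atol` tolerance on the sum, and the choice to raise ValueError are assumptions made for illustration; the documentation above does not say how the package compares the sum to 1 or how it reports a failure.

```python
import numpy as np


def sketch_verify_proportions(p, atol=1e-9):
    """Check that p is a valid set of proportions and return it as an array."""
    arr = np.asarray(p, dtype=float)
    # Every element must be >= 0 and <= 1.
    if np.any(arr < 0.0) or np.any(arr > 1.0):
        raise ValueError("every proportion must be >= 0 and <= 1")
    # The sum must equal 1, up to floating-point tolerance (an assumption).
    total = float(arr.sum())
    if not np.isclose(total, 1.0, rtol=0.0, atol=atol):
        raise ValueError(f"proportions must sum to 1, but got {total}")
    return arr
```

A tolerance is used because sums of floating-point proportions rarely equal 1 exactly.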
- seismicrna.sim.params.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, force: bool = False, max_procs: int = 4)
Simulate parameter files.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
- seismicrna.sim.ref.run(*, sim_dir: str | Path = './sim', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, force: bool = False)
- Parameters:
sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: './sim']
refs (str) – Give this name to the file of simulated references [keyword-only, default: 'sim-refs']
ref (str) – Give this name to the simulated reference [keyword-only, default: 'sim-ref']
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
- seismicrna.sim.relate.from_param_dir(param_dir: Path, profile: str, min_mut_gap: int, **kwargs)
Simulate a Relate dataset given parameter files.
- seismicrna.sim.relate.get_param_dir_fields(param_dir: Path)
- seismicrna.sim.relate.load_param_dir(param_dir: Path, profile: str)
Load all parameters for a profile in a directory.
- seismicrna.sim.relate.run(*, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 3, num_reads: int = 65536, batch_size: int = 65536, write_read_names: bool = False, brotli_level: int = 10, force: bool = False, max_procs: int = 4, tmp_pfx='./tmp', keep_tmp=False)
Simulate a Relate dataset.
- Parameters:
param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]
write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
- seismicrna.sim.total.run(*, sim_dir: str | Path = './sim', tmp_pfx: str | Path = './tmp', sample: str = 'sim-sample', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, profile_name: str = 'simulated', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 310.15, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_percent: float = 20.0, pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, min_mut_gap: int = 3, fq_gzip: bool = True, num_reads: int = 65536, keep_tmp: bool = False, force: bool = False, max_procs: int = 4)
Simulate FASTQ files from scratch.
- Parameters:
sim_dir (str | Path) – Write all simulated files to this directory [keyword-only, default: './sim']
tmp_pfx (str | Path) – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
refs (str) – Give this name to the file of simulated references [keyword-only, default: 'sim-refs']
ref (str) – Give this name to the simulated reference [keyword-only, default: 'sim-ref']
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Kelvin) [keyword-only, default: 310.15]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]