seismicrna.sim package
Subpackages
- seismicrna.sim.tests package
- Submodules
TestSimCLIInvocation
  - TestSimCLIInvocation.setUp()
  - TestSimCLIInvocation.tearDown()
  - TestSimCLIInvocation.test_abstract_empty()
  - TestSimCLIInvocation.test_abstract_help()
  - TestSimCLIInvocation.test_clusts_empty()
  - TestSimCLIInvocation.test_clusts_help()
  - TestSimCLIInvocation.test_ends_empty()
  - TestSimCLIInvocation.test_ends_help()
  - TestSimCLIInvocation.test_fastq_empty()
  - TestSimCLIInvocation.test_fastq_help()
  - TestSimCLIInvocation.test_fold_help()
  - TestSimCLIInvocation.test_muts_empty()
  - TestSimCLIInvocation.test_muts_help()
  - TestSimCLIInvocation.test_params_empty()
  - TestSimCLIInvocation.test_params_help()
  - TestSimCLIInvocation.test_ref_empty()
  - TestSimCLIInvocation.test_ref_help()
  - TestSimCLIInvocation.test_relate_empty()
  - TestSimCLIInvocation.test_relate_help()
  - TestSimCLIInvocation.test_total_empty()
  - TestSimCLIInvocation.test_total_help()
TestSimCLIParams
  - TestSimCLIParams.test_abstract()
  - TestSimCLIParams.test_clusts()
  - TestSimCLIParams.test_ends()
  - TestSimCLIParams.test_fastq()
  - TestSimCLIParams.test_fold()
  - TestSimCLIParams.test_muts()
  - TestSimCLIParams.test_params()
  - TestSimCLIParams.test_ref()
  - TestSimCLIParams.test_relate()
  - TestSimCLIParams.test_total()
TestSimPEnds
TestParseMinMutGapWeights
  - TestParseMinMutGapWeights.test_duplicate_gap()
  - TestParseMinMutGapWeights.test_empty_string()
  - TestParseMinMutGapWeights.test_gap_float_string()
  - TestParseMinMutGapWeights.test_gap_not_integer()
  - TestParseMinMutGapWeights.test_multiple_pairs()
  - TestParseMinMutGapWeights.test_multiple_zero_weights_excluded()
  - TestParseMinMutGapWeights.test_negative_gap()
  - TestParseMinMutGapWeights.test_negative_weight()
  - TestParseMinMutGapWeights.test_output_sorted_by_gap()
  - TestParseMinMutGapWeights.test_pair_extra_colon()
  - TestParseMinMutGapWeights.test_pair_missing_colon()
  - TestParseMinMutGapWeights.test_single_pair_gap0()
  - TestParseMinMutGapWeights.test_single_pair_nonzero_gap()
  - TestParseMinMutGapWeights.test_single_pair_weight_zero_invalid()
  - TestParseMinMutGapWeights.test_weight_above_1()
  - TestParseMinMutGapWeights.test_weight_not_float()
  - TestParseMinMutGapWeights.test_weights_do_not_sum_to_1()
  - TestParseMinMutGapWeights.test_weights_exceed_1_in_total()
  - TestParseMinMutGapWeights.test_weights_sum_to_1_within_float_tolerance()
  - TestParseMinMutGapWeights.test_zero_weight_excluded()
TestSimulateBatches
- Submodules
Submodules
- seismicrna.sim.abstract.abstract_seismicgraph_file(seismicgraph_file: Path, min_aucroc: float = 0.0)
- seismicrna.sim.abstract.abstract_table(table: MaskPositionTableLoader, struct_file: str | Path, min_aucroc: float = 0.0)
- seismicrna.sim.abstract.get_acgt_parameters()
- seismicrna.sim.abstract.get_other_parameters()
- seismicrna.sim.abstract.new_parameter_dict()
- seismicrna.sim.abstract.run(input_path: Iterable[str | Path] = Sentinel.UNSET, *, struct_file: Iterable[str | Path] = (), min_aucroc: float = 0.85, print_params: bool = True, verify_times: bool = True, num_cpus: int = 4)
Abstract simulation parameters from existing datasets.
- Parameters:
struct_file (Iterable) – Compare mutational profiles to the structure(s) in this CT file [keyword-only, default: ()]
min_aucroc (float) – Skip tables/profiles where the AUC-ROC is less than this value [keyword-only, default: 0.85]
verify_times (bool) – Verify that report files from later steps have later timestamps [keyword-only, default: True]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.clusts.run(*, ct_file: Iterable[str | Path] = (), clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4, seed: int | None = None)
Simulate the proportion of each cluster for the structure(s) in each CT file.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
seed (int | None) – Seed for the random number generator [keyword-only, default: None]
- seismicrna.sim.clusts.sim_pclust(num_clusters: int, concentration: float | None = None, sort: bool = True, seed: int | None = None)
Simulate proportions of clusters using a Dirichlet distribution.
- Parameters:
- Returns:
Simulated proportion of each cluster.
- Return type:
pd.Series
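As a rough illustration of drawing cluster proportions from a Dirichlet distribution, the sketch below uses only the Python standard library. It is not the library's implementation: the function name `sketch_sim_pclust`, the symmetric concentration vector, the descending sort order, and the plain-list return value (instead of a pd.Series) are all assumptions made for this example, and it does not model how `sim_pclust` treats `concentration=None`.

```python
import random


def sketch_sim_pclust(num_clusters, concentration=1.0, sort=True, seed=None):
    """Draw cluster proportions from a symmetric Dirichlet distribution.

    Hypothetical stand-in for illustration only; returns a list of floats.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be >= 1")
    if concentration <= 0.0:
        raise ValueError("concentration must be > 0")
    rng = random.Random(seed)
    # A Dirichlet sample is a vector of independent Gamma(alpha, 1) draws
    # divided by their sum.
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_clusters)]
    total = sum(draws)
    props = [draw / total for draw in draws]
    if sort:
        # Assumption: sorted means largest cluster first.
        props.sort(reverse=True)
    return props


if __name__ == "__main__":
    print(sketch_sim_pclust(3, concentration=1.0, seed=0))
```

Larger concentration values give more even proportions, while values near zero tend to put most of the mass on one cluster.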
- seismicrna.sim.clusts.sim_pclust_ct(ct_file: Path, *, concentration: float, force: bool, seed: int | None)
Simulate cluster proportions for a CT file and write them to disk.
The number of clusters is inferred from the number of structures in the CT file.
- Parameters:
ct_file (Path) – Path to the connectivity table (CT) file whose structures define the number of clusters.
concentration (float) – Concentration parameter for the Dirichlet distribution used to simulate cluster proportions; must be > 0.
force (bool) – Whether to overwrite an existing output file.
seed (int | None) – Random seed for reproducibility; None for no fixed seed.
- Returns:
Path of the written cluster proportions CSV file.
- Return type:
Path
- seismicrna.sim.ends.run(*, ct_file: Iterable[str | Path] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, force: bool = False, num_cpus: int = 4)
Simulate read-end coordinate probabilities for the region in each CT file.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
- seismicrna.sim.ends.sim_pends(end5: int, end3: int, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, keep_empty_reads: bool = True)
Simulate segment end coordinate probabilities.
- Parameters:
end5 (int) – 5’ end of the region (minimum allowed 5’ end coordinate).
end3 (int) – 3’ end of the region (maximum allowed 3’ end coordinate).
center_fmean (float) – Mean read center, as a fraction of the reference length.
center_fvar (float) – Variance of the read center, as a fraction of its maximum.
length_fmean (float) – Mean read length, as a fraction of the available length.
length_fvar (float) – Variance of the read length, as a fraction of its maximum.
keep_empty_reads (bool) – Whether to keep reads whose lengths are 0.
- Returns:
5’ and 3’ coordinates and their probabilities.
- Return type:
tuple[np.ndarray,np.ndarray,np.ndarray]
- seismicrna.sim.ends.sim_pends_ct(ct_file: Path, *, center_fmean: float, center_fvar: float, length_fmean: float, length_fvar: float, force: bool)
Simulate read-end coordinate probabilities for a CT file region.
Determines the region boundaries from the CT file, simulates the probability distribution over all (5’, 3’) end coordinate pairs, and writes the result to a CSV file.
- Parameters:
ct_file (Path) – Path to the connectivity table (CT) file defining the region.
center_fmean (float) – Mean read center as a fraction of the region length (0 to 1).
center_fvar (float) – Variance of the read center as a fraction of its maximum (0 to 1).
length_fmean (float) – Mean read length as a fraction of the available length (0 to 1).
length_fvar (float) – Variance of the read length as a fraction of its maximum (0 to 1).
force (bool) – Whether to overwrite an existing output file.
- Returns:
Path of the written end-coordinate proportions CSV file.
- Return type:
Path
- seismicrna.sim.fastq.from_param_dir(param_dir: Path, *, sample: str, profile: str, read_length: int, paired: bool, p_rev: float, fq_gzip: bool, force: bool, seed: int | None, **kwargs)
Simulate a FASTQ file from parameter files.
- seismicrna.sim.fastq.from_report(report_file: Path, *, read_length: int, p_rev: float, fq_gzip: bool, force: bool, seed: int | None)
Simulate a FASTQ file from a Relate report.
- seismicrna.sim.fastq.generate_fastq(top: Path, sample: str, ref: str, refseq: DNA, paired: bool, read_length: int, batches: Iterable[tuple[RelateRegionMutsBatch, ReadNamesBatch]], p_rev: float = 0.5, fq_gzip: bool = True, force: bool = False, seed: int | None = None)
Generate FASTQ file(s) from a dataset.
- seismicrna.sim.fastq.generate_fastq_record(name: str, rels: ndarray, refseq: str, adapter: str, read_length: int, reverse: bool = False, hi_qual: str = 'I', lo_qual: str = '!')
Generate a FASTQ line for a read.
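For orientation, a FASTQ record occupies four lines: a header starting with `@`, the base calls, a `+` separator, and one quality character per base call. The sketch below formats such a record using the same default quality characters as the signature above (`'I'` is Phred 40 and `'!'` is Phred 0 in the Phred+33 encoding). The function `sketch_fastq_record` and its `low_quality` argument are hypothetical; the real function derives the read from `rels`, `refseq`, and `adapter`, which this example does not attempt to reproduce.

```python
def sketch_fastq_record(name, seq, low_quality=(), hi_qual="I", lo_qual="!"):
    """Format one four-line FASTQ record as a string (illustration only)."""
    low = set(low_quality)
    # One quality character per base call: lo_qual at the flagged positions
    # (0-indexed here), hi_qual everywhere else.
    quals = "".join(lo_qual if i in low else hi_qual for i in range(len(seq)))
    return f"@{name}\n{seq}\n+\n{quals}\n"


if __name__ == "__main__":
    print(sketch_fastq_record("read1", "ACGT", low_quality=[2]), end="")
```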
- seismicrna.sim.fastq.run(*, input_path: Iterable[str | Path] = Sentinel.UNSET, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, probe: str = 'DMS', min_mut_gap: int | None = None, min_mut_gap_weights: str = '', mut_collisions: str = 'auto', mut_probs: str | None = '0.2,0.04,0.008', fq_gzip: bool = True, num_reads: int = 65536, num_cpus: int = 4, force: bool = False, seed: int | None = None)
Simulate FASTQ file(s) from relate reports or parameter directories.
- Parameters:
input_path (Iterable[str | Path]) – Paths to relate report files or directories containing them; used to generate FASTQ files from existing relate data.
param_dir (Iterable) – Paths to simulation parameter directories; used to generate FASTQ files from CT/parameter files [keyword-only, default: ()]
profile_name (str) – Name of the mutation profile to use from the parameter directory [keyword-only, default: ‘simulated’]
sample (str) – Sample name to embed in the output FASTQ paths [keyword-only, default: ‘sim-sample’]
paired_end (bool) – Whether to simulate paired-end (True) or single-end (False) reads [keyword-only, default: True]
read_length (int) – Length of each simulated read, in base calls [keyword-only, default: 151]
reverse_fraction (float) – Fraction of reads where mate 1 is reverse-complemented [keyword-only, default: 0.5]
probe (str) – Probe type (e.g. DMS); used to set default mask options such as min_mut_gap [keyword-only, default: ‘DMS’]
min_mut_gap (int | None) – Minimum gap between mutations: mask reads with two mutations separated by fewer than this many bases; None to use the probe default [keyword-only, default: None]
min_mut_gap_weights (str) – Comma-separated gap:weight pairs defining a mixture of min_mut_gap biases, e.g. ‘0:0.18,2:0.05,3:0.21’; when given, overrides --min-mut-gap; empty string to use the single min_mut_gap [keyword-only, default: ‘’]
mut_collisions (str) – How to handle two mutations closer than --min-mut-gap positions: MERGE the mutations, DROP the read, or AUTO-select based on the probe [keyword-only, default: ‘auto’]
mut_probs (str | None) – Comma-separated probabilities of injecting a mutation at each successive position 5’ of an existing mutation (used with --mut-collisions merge) [keyword-only, default: ‘0.2,0.04,0.008’]
fq_gzip (bool) – Whether to gzip-compress the output FASTQ files (True) or write them as plain text (False) [keyword-only, default: True]
num_reads (int) – Total number of reads to simulate per param_dir run [keyword-only, default: 65536]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
seed (int | None) – Seed for the random number generator, for reproducibility; None for no fixed seed [keyword-only, default: None]
- Returns:
Paths of all generated FASTQ files.
- Return type:
list[Path]
- seismicrna.sim.fold.fold_region(region: Region, *, sim_dir: Path, tmp_dir: Path, profile_name: str, fold_backend: str, fold_constraint: Path | None, fold_temp: float, fold_md: int, fold_mfe: bool, fold_max: int, fold_min: int, fold_percent: float, keep_tmp: bool, force: bool, num_cpus: int)
Predict RNA secondary structures for one region using RNAstructure.
- Parameters:
region (Region) – Sequence region to fold.
sim_dir (Path) – Simulation output directory; CT file is written under its parameter subdirectory.
tmp_dir (Path) – Directory for temporary FASTA and intermediate CT files.
profile_name (str) – Profile label embedded in the output CT file path.
fold_constraint (Path | None) – Path to a folding constraint file; None for no constraints.
fold_temp (float) – Folding temperature; interpreted as Celsius if in the typical physiological range, otherwise as Kelvin.
fold_md (int) – Maximum distance between paired bases (0 for no limit).
fold_mfe (bool) – Whether to predict only the minimum free energy structure.
fold_max (int) – Maximum number of structures to predict.
fold_percent (float) – Maximum percent energy difference from the MFE structure.
keep_tmp (bool) – Whether to retain temporary files after folding.
force (bool) – Whether to overwrite an existing output CT file.
num_cpus (int) – Number of CPU cores to use.
- Returns:
Path of the written CT file.
- Return type:
Path
- seismicrna.sim.fold.get_ct_path(top: Path, region: Region, profile: str)
Get the path of a connectivity table (CT) file.
- seismicrna.sim.fold.run(fasta: str | Path = Sentinel.UNSET, *, sim_dir: str | Path = './sim', profile_name: str = 'simulated', probe: str = 'DMS', fold_backend: str = 'auto', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 37.0, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_min: int = 1, fold_percent: float = 20.0, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4, tmp_pfx='./tmp')
Fold regions of a reference FASTA file and write CT files.
- Parameters:
fasta (str | Path) – Path to the reference FASTA file.
sim_dir (str | Path) – Simulation output directory; all simulated files, including CT parameter files, are written here [keyword-only, default: ‘./sim’]
profile_name (str) – Profile label given to the simulated structure and parameters and embedded in each output CT file path [keyword-only, default: ‘simulated’]
probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]
fold_backend (str) – Model RNA structures using Fold (RNAstructure), ShapeKnots (RNAstructure), or RNAfold (ViennaRNA); auto selects Fold for DMS and RNAfold for other probes [keyword-only, default: ‘auto’]
fold_coords (Iterable) – Explicit (ref, end5, end3) coordinate tuples, each defining a region of a reference to fold by its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Primer sequences, each pair defining a region of a reference to fold by its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Path to a CSV file listing regions to fold by coordinates/primers; None to use coords/primers [keyword-only, default: None]
fold_constraint (str | None) – Path to a file of folding constraints that force bases to be paired/unpaired; None for no constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Celsius or Kelvin) [keyword-only, default: 37.0]
fold_md (int) – Maximum pairing distance: limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures per region (overridden by --fold-mfe) [keyword-only, default: 20]
fold_min (int) – Require at least this many structures per region (overridden by --fold-mfe) [keyword-only, default: 1]
fold_percent (float) – Stop outputting structures when the % difference in energy from the MFE structure exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
tmp_dir (Path) – Directory for temporary files (injected by run_func).
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
- Returns:
Paths of all written CT files.
- Return type:
list[Path]
- seismicrna.sim.muts.make_pmut_means(*, ploq: float, pam: float, pac: float, pag: float, pat: float, pcm: float, pca: float, pcg: float, pct: float, pgm: float, pga: float, pgc: float, pgt: float, ptm: float, pta: float, ptc: float, ptg: float, pnm: float, pnd: float)
Generate mean mutation rates.
Mutations are assumed to behave as follows:
- A base n mutates with probability pnm.
  - If it mutates, then it is a substitution with probability (pna + pnc + png + pnt).
    - If it is a substitution, then it is high-quality with probability (1 - ploq).
    - Otherwise, it is low-quality.
  - Otherwise, it is a deletion.
- Otherwise, it is low-quality with probability ploq.
So the overall probability of being low-quality is the probability given a mutation, pnm * (pna + pnc + png + pnt) * ploq, plus the probability given no mutation, (1 - pnm) * ploq, which equals ploq * (1 - pnm * (1 - (pna + pnc + png + pnt))).
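The arithmetic can be checked numerically. The sketch below sums the two low-quality branches and compares the result with the factored form ploq * (1 - pnm * (1 - p_sub)), where `p_sub` is shorthand introduced here for (pna + pnc + png + pnt). The function names and the example values are illustrative assumptions, not defaults taken from the library.

```python
import math


def p_low_quality_branches(pnm, p_sub, ploq):
    """Sum the two branches in which a base ends up low-quality."""
    given_mutation = pnm * p_sub * ploq       # mutated, substitution, low-quality
    given_no_mutation = (1.0 - pnm) * ploq    # not mutated, low-quality
    return given_mutation + given_no_mutation


def p_low_quality_factored(pnm, p_sub, ploq):
    """Factored form of the same probability."""
    return ploq * (1.0 - pnm * (1.0 - p_sub))


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formula.
    pnm, p_sub, ploq = 0.05, 0.9, 0.02
    branches = p_low_quality_branches(pnm, p_sub, ploq)
    factored = p_low_quality_factored(pnm, p_sub, ploq)
    assert math.isclose(branches, factored)
    print(branches)
```

With these values, the branches contribute 0.0009 and 0.019, for a total of about 0.0199.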
- Parameters:
ploq (float) – Probability that a base is low-quality.
pam (float) – Probability that an A is mutated.
pac (float) – Probability that a mutated A is a substitution to C.
pag (float) – Probability that a mutated A is a substitution to G.
pat (float) – Probability that a mutated A is a substitution to T.
pcm (float) – Probability that a C is mutated.
pca (float) – Probability that a mutated C is a substitution to A.
pcg (float) – Probability that a mutated C is a substitution to G.
pct (float) – Probability that a mutated C is a substitution to T.
pgm (float) – Probability that a G is mutated.
pga (float) – Probability that a mutated G is a substitution to A.
pgc (float) – Probability that a mutated G is a substitution to C.
pgt (float) – Probability that a mutated G is a substitution to T.
ptm (float) – Probability that a T is mutated.
pta (float) – Probability that a mutated T is a substitution to A.
ptc (float) – Probability that a mutated T is a substitution to C.
ptg (float) – Probability that a mutated T is a substitution to G.
pnm (float) – Probability that an N is mutated.
pnd (float) – Probability that a mutated N is a deletion.
- Returns:
Mean rate of each type of mutation (column) and each base (row).
- Return type:
pd.DataFrame
- seismicrna.sim.muts.make_pmut_means_paired(probe: str, **kwargs: float)
Generate mean mutation rates for paired bases.
- seismicrna.sim.muts.make_pmut_means_unpaired(probe: str, **kwargs: float)
Generate mean mutation rates for unpaired bases.
- seismicrna.sim.muts.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, probe: str = 'DMS', mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), force: bool = False, num_cpus: int = 4, seed: int | None = None)
Simulate the rate of each kind of mutation at each position.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]
mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
seed (int | None) – Seed for the random number generator [keyword-only, default: None]
- seismicrna.sim.muts.run_struct(ct_file: Path, pmut_paired: Iterable[tuple[str, float]], pmut_unpaired: Iterable[tuple[str, float]], vmut_paired: float, vmut_unpaired: float, probe: str, force: bool, seed: int | None, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = ())
Simulate per-position mutation rates for a CT file and write them.
For each structure in the CT file, mutation rates are simulated using a Dirichlet distribution, with separate mean rates for paired and unpaired bases.
- Parameters:
ct_file (Path) – Path to the connectivity table (CT) file defining structures and base-pairing.
pmut_paired (Iterable[tuple[str, float]]) – Mutation-type/probability pairs for paired bases, passed to make_pmut_means_paired.
pmut_unpaired (Iterable[tuple[str, float]]) – Mutation-type/probability pairs for unpaired bases, passed to make_pmut_means_unpaired.
vmut_paired (float) – Relative variance of mutation rates for paired bases (0 to 1).
vmut_unpaired (float) – Relative variance of mutation rates for unpaired bases (0 to 1).
force (bool) – Whether to overwrite an existing output file.
seed (int | None) – Random seed for reproducibility; None for no fixed seed.
- Returns:
Path of the written mutation-rate CSV file.
- Return type:
Path
- seismicrna.sim.muts.sim_pmut(positions: Index, mean: DataFrame, relative_variance: float, end5: int | None, end3: int | None, seed: int | None)
Simulate mutation rates using a Dirichlet distribution.
- Parameters:
positions (pd.Index) – Index of positions and bases.
mean (pd.DataFrame) – Mean of the mutation rates for each type of base.
relative_variance (float) – Variance of the mutation rates, as a fraction of its supremum.
- Returns:
Mutation rates, with the same index as
- Return type:
pd.DataFrame
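To make the "fraction of its supremum" parameterization concrete, the sketch below handles the two-outcome case (mutated or not), where a Dirichlet distribution reduces to a Beta distribution. For a Beta distribution with mean m, the variance is m(1 - m) / (a + b + 1), which is always below m(1 - m); a relative variance r in (0, 1) therefore corresponds to a + b = 1/r - 1. This reading of `relative_variance`, and the names `beta_params` and `sketch_sim_rate`, are assumptions for illustration; the library works on a table of several mutation types per position.

```python
import random


def beta_params(mean, relative_variance):
    """Convert a mean and a relative variance into Beta shape parameters."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must be between 0 and 1 (exclusive)")
    if not 0.0 < relative_variance < 1.0:
        raise ValueError("relative_variance must be between 0 and 1 (exclusive)")
    # variance = relative_variance * mean * (1 - mean)  =>  a + b = 1/r - 1
    total = 1.0 / relative_variance - 1.0
    return mean * total, (1.0 - mean) * total


def sketch_sim_rate(mean, relative_variance, size, seed=None):
    """Draw `size` mutation rates with the requested mean and spread."""
    rng = random.Random(seed)
    a, b = beta_params(mean, relative_variance)
    return [rng.betavariate(a, b) for _ in range(size)]


if __name__ == "__main__":
    rates = sketch_sim_rate(0.05, 0.02, size=5, seed=1)
    print(rates)
```

A smaller relative variance concentrates the simulated rates around the mean, which matches the smaller default for paired bases (0.001) than for unpaired bases (0.02) in the `run` signatures.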
- seismicrna.sim.muts.verify_proportions(p: Any)
Verify that p is a valid set of proportions:
Every element of p must be ≥ 0 and ≤ 1.
The sum of p must equal 1.
- Parameters:
p (Any) – Proportions to verify; must be a NumPy array or convertible into a NumPy array.
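The two conditions can be expressed in a few lines. The following sketch uses plain Python sequences rather than NumPy arrays, and its name (`sketch_verify_proportions`), the exception type, and the floating-point tolerance are assumptions made for this example rather than details confirmed by the source.

```python
import math


def sketch_verify_proportions(p, tol=1e-9):
    """Check that every element is in [0, 1] and that the elements sum to 1."""
    values = [float(x) for x in p]
    for x in values:
        # NaN fails this comparison, so it is rejected as well.
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"proportion out of range [0, 1]: {x}")
    total = math.fsum(values)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"proportions sum to {total}, not 1")
    return values


if __name__ == "__main__":
    print(sketch_verify_proportions([0.2, 0.3, 0.5]))
```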
- seismicrna.sim.params.run(*, ct_file: Iterable[str | Path] = (), pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, probe: str = 'DMS', mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, force: bool = False, num_cpus: int = 4, seed: int | None = None)
Simulate parameter files.
- Parameters:
ct_file (Iterable) – Simulate parameters using the structure(s) in this CT file [keyword-only, default: ()]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]
mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
seed (int | None) – Seed for the random number generator [keyword-only, default: None]
- seismicrna.sim.ref.run(*, sim_dir: str | Path = './sim', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, force: bool = False, seed: int | None = None)
Simulate a random reference sequence and write it to a FASTA file.
- Parameters:
sim_dir (str | Path) – Simulation output directory; all simulated files are written here, with the FASTA file under its references subdirectory [keyword-only, default: ‘./sim’]
refs (str) – Name of the reference set (used for the FASTA file name) [keyword-only, default: ‘sim-refs’]
ref (str) – Name of the single reference sequence record in the FASTA file [keyword-only, default: ‘sim-ref’]
reflen (int) – Length of the reference sequence to generate, in bases [keyword-only, default: 280]
force (bool) – Force all tasks to run, overwriting any existing output files, including an existing FASTA file [keyword-only, default: False]
seed (int | None) – Seed for the random number generator, for reproducibility; None for no fixed seed [keyword-only, default: None]
- Returns:
Path of the written FASTA file.
- Return type:
Path
- seismicrna.sim.relate.parse_min_mut_gap_weights(min_mut_gap_weights: str) dict[int, float]
Parse a comma-separated ‘gap:weight’ string into a dict.
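A minimal sketch of this kind of parser follows. The validation rules are guesses based on the string format and on the names of the test cases listed for this package (duplicate gaps, negative gaps, non-integer gaps, weights outside 0 to 1, malformed pairs, output sorted by gap); the library's actual rules may differ. In particular, the source does not state how the sum of the weights is checked, whether zero weights are dropped, or how an empty string is handled, so this sketch performs no sum check, keeps zero weights, and returns an empty dict for an empty string. The name `sketch_parse_gap_weights` is hypothetical.

```python
def sketch_parse_gap_weights(text):
    """Parse 'gap:weight,gap:weight,...' into a dict ordered by gap."""
    weights = {}
    if not text:
        # Assumption: an empty string means no mixture was specified.
        return weights
    for pair in text.split(","):
        fields = pair.split(":")
        if len(fields) != 2:
            raise ValueError(f"expected 'gap:weight', got {pair!r}")
        gap = int(fields[0])        # raises ValueError for '1.5' or 'x'
        weight = float(fields[1])   # raises ValueError for non-numeric text
        if gap < 0:
            raise ValueError(f"gap must be >= 0, got {gap}")
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight must be in [0, 1], got {weight}")
        if gap in weights:
            raise ValueError(f"duplicate gap: {gap}")
        weights[gap] = weight
    # Dicts preserve insertion order, so this returns the gaps in ascending order.
    return dict(sorted(weights.items()))


if __name__ == "__main__":
    print(sketch_parse_gap_weights("0:0.18,2:0.05,3:0.21"))
```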
- seismicrna.sim.relate.run(*, param_dir: Iterable[str | Path] = (), profile_name: str = 'simulated', sample: str = 'sim-sample', branch: str = '', paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, probe: str = 'DMS', min_mut_gap: int | None = None, min_mut_gap_weights: str = '', mut_collisions: str = 'auto', mut_probs: str | None = '0.2,0.04,0.008', num_reads: int = 65536, batch_size: int = 65536, write_read_names: bool = False, brotli_level: int = 10, force: bool = False, num_cpus: int = 4, seed: int | None = None, tmp_pfx='./tmp', keep_tmp=False)
Simulate a Relate dataset.
- Parameters:
param_dir (Iterable) – Simulate data using parameter files in this directory [keyword-only, default: ()]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: ‘simulated’]
sample (str) – Give this name to the simulated sample [keyword-only, default: ‘sim-sample’]
branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
probe (str) – Use default mask options for this chemical probe [keyword-only, default: ‘DMS’]
min_mut_gap (int | None) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: None]
min_mut_gap_weights (str) – Comma-separated gap:weight pairs defining a mixture of min_mut_gap biases, e.g. ‘0:0.18,2:0.05,3:0.21’. When given, overrides --min-mut-gap. [keyword-only, default: ‘’]
mut_collisions (str) – If two mutations are closer than --min-mut-gap positions, MERGE the mutations, DROP the read, or AUTO-select based on the probe. [keyword-only, default: ‘auto’]
mut_probs (str | None) – Comma-separated probabilities of injecting a mutation at each successive position 5’ of an existing mutation (used with --mut-collisions merge) [keyword-only, default: ‘0.2,0.04,0.008’]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]
write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
seed (int | None) – Seed for the random number generator [keyword-only, default: None]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
- seismicrna.sim.total.run(*, sim_dir: str | Path = './sim', tmp_pfx: str | Path = './tmp', sample: str = 'sim-sample', refs: str = 'sim-refs', ref: str = 'sim-ref', reflen: int = 280, profile_name: str = 'simulated', fold_backend: str = 'auto', fold_coords: Iterable[tuple[str, int, int]] = (), fold_primers: Iterable[tuple[str, DNA, DNA]] = (), fold_regions_file: str | None = None, fold_constraint: str | None = None, fold_temp: float = 37.0, fold_md: int = 0, fold_mfe: bool = False, fold_max: int = 20, fold_min: int = 1, fold_percent: float = 20.0, pmut_paired: Iterable[tuple[str, float]] = (), pmut_unpaired: Iterable[tuple[str, float]] = (), vmut_paired: float = 0.001, vmut_unpaired: float = 0.02, center_fmean: float = 0.5, center_fvar: float = 0.3333333333333333, length_fmean: float = 0.5, length_fvar: float = 0.012345679012345678, clust_conc: float = 0.0, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), mask_a: bool | None = None, mask_c: bool | None = None, mask_g: bool | None = None, mask_u: bool | None = None, mask_polya: int = 5, max_fraction_ident: float = 1.0, max_pearson_sim: float = 1.0, min_marcd_sim: float = 0.0, max_tries: int = 10, paired_end: bool = True, read_length: int = 151, reverse_fraction: float = 0.5, probe: str = 'DMS', min_mut_gap: int = None, min_mut_gap_weights: str = '', mut_collisions: str = 'auto', mut_probs: str | None = '0.2,0.04,0.008', fq_gzip: bool = True, num_reads: int = 65536, keep_tmp: bool = False, force: bool = False, num_cpus: int = 4, seed: int | None = None)
Simulate FASTQ files from scratch.
- Parameters:
sim_dir (str | pathlib._local.Path) – Write all simulated files to this directory [keyword-only, default: './sim']
tmp_pfx (str | pathlib._local.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
sample (str) – Give this name to the simulated sample [keyword-only, default: 'sim-sample']
refs (str) – Give this name to the file of simulated references [keyword-only, default: 'sim-refs']
ref (str) – Give this name to the simulated reference [keyword-only, default: 'sim-ref']
reflen (int) – Simulate a reference sequence with this many bases [keyword-only, default: 280]
profile_name (str) – Give the simulated structure and parameters this profile name [keyword-only, default: 'simulated']
fold_backend (str) – Model RNA structures using Fold (RNAstructure), ShapeKnots (RNAstructure), or RNAfold (ViennaRNA); auto selects Fold for DMS and RNAfold for other probes [keyword-only, default: 'auto']
fold_coords (Iterable) – Fold a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
fold_primers (Iterable) – Fold a region of a reference given its forward and reverse primers [keyword-only, default: ()]
fold_regions_file (str | None) – Fold regions of references from coordinates/primers in a CSV file [keyword-only, default: None]
fold_constraint (str | None) – Force bases to be paired/unpaired from a file of constraints [keyword-only, default: None]
fold_temp (float) – Predict structures at this temperature (Celsius) [keyword-only, default: 37.0]
fold_md (int) – Limit base pair distances to this number of bases (0 for no limit) [keyword-only, default: 0]
fold_mfe (bool) – Predict only the minimum free energy (MFE) structure [keyword-only, default: False]
fold_max (int) – Output at most this many structures (overridden by --fold-mfe) [keyword-only, default: 20]
fold_min (int) – Require at least this many structures (overridden by --fold-mfe) [keyword-only, default: 1]
fold_percent (float) – Stop outputting structures when the % difference in energy exceeds this value (overridden by --fold-mfe) [keyword-only, default: 20.0]
pmut_paired (Iterable) – Set the mean rate of each kind of mutation for paired bases [keyword-only, default: ()]
pmut_unpaired (Iterable) – Set the mean rate of each kind of mutation for unpaired bases [keyword-only, default: ()]
vmut_paired (float) – Set the relative variance of mutation rates of paired bases [keyword-only, default: 0.001]
vmut_unpaired (float) – Set the relative variance of mutation rates of unpaired bases [keyword-only, default: 0.02]
center_fmean (float) – Set the mean read center as a fraction of the region length [keyword-only, default: 0.5]
center_fvar (float) – Set the variance of the read center as a fraction of its maximum [keyword-only, default: 0.3333333333333333]
length_fmean (float) – Set the mean read length as a fraction of the region length [keyword-only, default: 0.5]
length_fvar (float) – Set the variance of the read length as a fraction of its maximum [keyword-only, default: 0.012345679012345678]
clust_conc (float) – Set the concentration parameter for simulating cluster proportions [keyword-only, default: 0.0]
mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]
mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]
mask_a (bool | None) – Mask positions with base A [keyword-only, default: None]
mask_c (bool | None) – Mask positions with base C [keyword-only, default: None]
mask_g (bool | None) – Mask positions with base G [keyword-only, default: None]
mask_u (bool | None) – Mask positions with base U [keyword-only, default: None]
mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]
max_fraction_ident (float) – Retry if any two clusters have more than this fraction of identical mutation rates [keyword-only, default: 1.0]
max_pearson_sim (float) – Retry if any two clusters have a Pearson correlation larger than this [keyword-only, default: 1.0]
min_marcd_sim (float) – Retry if any two clusters have a MARCD less than this (0 disables) [keyword-only, default: 0.0]
max_tries (int) – Simulate the parameters with up to this many attempts [keyword-only, default: 10]
paired_end (bool) – Simulate paired-end or single-end reads [keyword-only, default: True]
read_length (int) – Simulate reads with this many base calls [keyword-only, default: 151]
reverse_fraction (float) – Simulate this fraction of reverse-oriented reads [keyword-only, default: 0.5]
probe (str) – Use default mask options for this chemical probe [keyword-only, default: 'DMS']
min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: None]
min_mut_gap_weights (str) – Comma-separated gap:weight pairs defining a mixture of min_mut_gap biases, e.g. '0:0.18,2:0.05,3:0.21'. When given, overrides --min-mut-gap. [keyword-only, default: '']
mut_collisions (str) – If two mutations are closer than --min-mut-gap positions, MERGE the mutations, DROP the read, or AUTO-select based on the probe. [keyword-only, default: 'auto']
mut_probs (str | None) – Comma-separated probabilities of injecting a mutation at each successive position 5’ of an existing mutation (used with --mut-collisions merge) [keyword-only, default: '0.2,0.04,0.008']
fq_gzip (bool) – Simulate FASTQ files with gzip compression or as plain text [keyword-only, default: True]
num_reads (int) – Simulate this many reads [keyword-only, default: 65536]
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
seed (int | None) – Seed for the random number generator [keyword-only, default: None]
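Because every parameter of seismicrna.sim.total.run is keyword-only, a call can be assembled as a dictionary of overrides and unpacked. The following is a minimal sketch, not the package's recommended workflow. The helper names build_total_kwargs and simulate, the output layout, and the chosen values are assumptions made for illustration; only the parameter names come from the signature above. Any parameter left out keeps its documented default. The return value of run is not described in this section, so the sketch passes it through unchanged, and the call presumably requires one of the folding backends named under fold_backend to be installed.

```python
from pathlib import Path


def build_total_kwargs(out_root: Path, seed: int = 0) -> dict:
    """Collect keyword arguments for seismicrna.sim.total.run.

    Hypothetical helper: only the keys mirror the documented signature;
    the values are arbitrary choices for a small, quick simulation.
    """
    return {
        "sim_dir": out_root / "sim",   # simulated files go here
        "tmp_pfx": out_root / "tmp",   # prefix for the temporary directory
        "sample": "demo-sample",
        "ref": "demo-ref",
        "reflen": 120,                 # shorter than the default of 280
        "paired_end": False,           # single-end reads
        "read_length": 100,
        "num_reads": 1024,             # far fewer than the default of 65536
        "fq_gzip": False,              # plain-text FASTQ output
        "num_cpus": 1,
        "seed": seed,                  # fixed seed for the random number generator
    }


def simulate(out_root: Path, seed: int = 0):
    """Run the simulation with the overrides above (hypothetical wrapper)."""
    # Imported lazily so build_total_kwargs works without SEISMIC-RNA installed.
    from seismicrna.sim.total import run

    # All parameters are keyword-only, so unpack the dict with **.
    return run(**build_total_kwargs(out_root, seed))
```

Calling simulate(Path("demo"), seed=42) would then write under demo/sim, assuming the package and a folding backend are available. Keeping the overrides in a separate function makes them easy to inspect or log before the simulation starts.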