seismicrna.core package

Subpackages

seismicrna.core.arg package
- Subpackages
  - seismicrna.core.arg.tests package
    - Submodules
- Submodules
seismicrna.core.batch package
- Subpackages
  - seismicrna.core.batch.tests package
    - Submodules
- Submodules
seismicrna.core.extern package
- Subpackages
  - seismicrna.core.extern.tests package
- Submodules
seismicrna.core.io package
- Subpackages
  - seismicrna.core.io.tests package
    - Submodules
- Submodules
seismicrna.core.mu package
- Subpackages
  - seismicrna.core.mu.tests package
    - Submodules
- Submodules
seismicrna.core.ngs package
- Subpackages
  - seismicrna.core.ngs.tests package
    - Submodules
- Submodules
seismicrna.core.rel package
- Subpackages
  - seismicrna.core.rel.tests package
    - Submodules
- Submodules
  - HalfRelPattern
  - RelPattern
seismicrna.core.rna package
- Subpackages
  - seismicrna.core.rna.tests package
    - Submodules
- Submodules
seismicrna.core.seq package
- Subpackages
  - seismicrna.core.seq.tests package
    - Submodules
- Submodules
seismicrna.core.table package
- Submodules
seismicrna.core.tests package
- Submodules

Submodules

seismicrna.core.array.calc_inverse(target: ndarray, require: int = -1, fill: bool = False, fill_rev: bool = False, fill_default: int | None = None, verify: bool = True, what: str = 'array')

Calculate the inverse of target, such that if element i of target has value x, then element x of the inverse has value i.

>>> list(calc_inverse(np.array([3, 2, 7, 5, 1])))
[-1, 4, 1, 0, -1, 3, -1, 2]
>>> list(calc_inverse(np.arange(5)))
[0, 1, 2, 3, 4]

Parameters:

target (np.ndarray) – Target values; must be a 1-dimensional array of non-negative integers with no duplicate values.
require (int = -1) – Require the inverse to contain all indexes up to and including require (i.e. that its length is at least require + 1); ignored if require is -1; must be ≥ -1.
fill (bool = False) – Fill missing indexes (that do not appear in target).
fill_rev (bool = False) – Fill missing indexes in reverse order instead of forward order; only used if fill is True.
fill_default (int | None = None) – Value with which to fill before the first non-missing value has been encountered; if fill_rev is True, defaults to the length of target, otherwise to -1.
verify (bool = True) – Verify that all target values are unique, non-negative integers. If they are not, then ValueError is raised if verify is True, and the results of this function are incorrect if verify is False. Set verify to False only if you have already verified that target contains unique, non-negative integers.
what (str = "array") – What to name the array (only used for error messages).

Returns:

Inverse of target.

Return type:

np.ndarray
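
The inversion described above can be sketched in pure Python. This is a minimal illustration, not the seismicrna implementation: `inverse_sketch` is a hypothetical name, it works on lists instead of NumPy arrays, and it omits the fill options.

```python
def inverse_sketch(target, require=-1):
    """Illustrative stand-in for calc_inverse without filling."""
    target = list(target)
    if require < -1:
        raise ValueError(f"require must be >= -1, but got {require}")
    if any(x < 0 for x in target):
        raise ValueError("target must contain only non-negative integers")
    if len(set(target)) != len(target):
        raise ValueError("target must not contain duplicate values")
    # The inverse must be long enough to hold the largest target value
    # and, if require is given, the index require.
    length = max(max(target, default=-1), require) + 1
    # Indexes that never appear in target keep the placeholder -1.
    inverse = [-1] * length
    for i, x in enumerate(target):
        inverse[x] = i
    return inverse


print(inverse_sketch([3, 2, 7, 5, 1]))
```

The printed list matches the first doctest above: every index that does not occur in target stays at -1.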

seismicrna.core.array.check_naturals(values: ndarray, what: str = 'values'): Raise ValueError if the values are not monotonically increasing natural numbers.

seismicrna.core.array.ensure_order(array1: ndarray, array2: ndarray, what1: str = 'array1', what2: str = 'array2', gt_eq: bool = False)

Ensure that array1 is ≤ or ≥ array2, element-wise.

Parameters:

array1 (np.ndarray) – Array 1 (same length as array2).
array2 (np.ndarray) – Array 2 (same length as array1).
what1 (str = "array1") – What array1 contains (only used for error messages).
what2 (str = "array2") – What array2 contains (only used for error messages).
gt_eq (bool = False) – Ensure array1 ≥ array2 if True, otherwise array1 ≤ array2.

Returns:

Shared length of array1 and array2.

Return type:

int
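
As a rough illustration of the element-wise check, the sketch below uses plain sequences. The name `ensure_order_sketch` is hypothetical, and raising ValueError on a violation is an assumption, since the type of error is not stated above.

```python
def ensure_order_sketch(array1, array2, gt_eq=False):
    """Check array1 <= array2 (or >= if gt_eq) element-wise."""
    if len(array1) != len(array2):
        raise ValueError("array1 and array2 must have the same length")
    for a, b in zip(array1, array2):
        in_order = a >= b if gt_eq else a <= b
        if not in_order:
            # Assumption: a violation is reported with ValueError.
            raise ValueError(f"{a} and {b} are out of order")
    # Return the shared length, as the documented function does.
    return len(array1)


shared_length = ensure_order_sketch([1, 2, 3], [1, 5, 9])
```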

seismicrna.core.array.ensure_same_length(arr1: ndarray, arr2: ndarray, what1: str = 'array1', what2: str = 'array2')

seismicrna.core.array.find_dims(dims: Sequence[Sequence[str | None]], arrays: Sequence[ndarray], names: Sequence[str] | None = None, nonzero: Iterable[str] | bool = False): Check the dimensions of the arrays.

seismicrna.core.array.get_length(array: ndarray, what: str = 'array') → int

seismicrna.core.array.list_naturals(n: int): List natural numbers up to and including n.

seismicrna.core.array.locate_elements(collection: ndarray, *elements: ndarray, what: str = 'collection', verify: bool = True)

Find the index at which each element of elements occurs in collection.

>>> list(locate_elements(np.array([4, 1, 2, 7, 5, 3]), np.array([5, 2, 5])))
[4, 2, 4]

Parameters:

collection (np.ndarray) – Collection in which to find each element in elements; must be a 1-dimensional array of non-negative integers with no duplicate values.
*elements (np.ndarray) – Elements to find; must be a 1-dimensional array that is a subset of collection, although duplicate values are permitted.
what (str = "collection") – What to name the collection (only used for error messages).
verify (bool = True) – Verify that all values in collection are unique, non-negative integers and that all items in elements are in collection.

Returns:

Index of each element of elements in collection.

Return type:

np.ndarray
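
One way to picture the lookup is a dictionary from each value to its index. The sketch below is an assumption-laden illustration in pure Python (`locate_sketch` is a hypothetical name, and it handles a single sequence of elements), not the library's method.

```python
def locate_sketch(collection, elements):
    """Find the index in collection of each item in elements."""
    # Map each value in the collection to the index where it occurs.
    index_of = {value: i for i, value in enumerate(collection)}
    if len(index_of) != len(collection):
        raise ValueError("collection must not contain duplicate values")
    try:
        return [index_of[element] for element in elements]
    except KeyError as error:
        raise ValueError(f"{error} is not in collection") from error


print(locate_sketch([4, 1, 2, 7, 5, 3], [5, 2, 5]))
```

The output agrees with the doctest above; duplicates in elements are allowed because each is looked up independently.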

seismicrna.core.array.sanitize_values(values: Iterable[int], lower_limit: int, upper_limit: int, whats: str = 'values'): Validate and sort values, and return them as an array.

seismicrna.core.array.triangular(n: int)

The nth triangular number (n ≥ 0): number of items in an equilateral triangle with n items on each side.

Parameters: n (int) – Index of the triangular number to return; equivalently, the side length of the equilateral triangle.
Returns: The triangular number with index n; equivalently, the number of items in the equilateral triangle of side length n.
Return type: int
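
The nth triangular number equals n(n + 1) / 2. A minimal sketch of that formula follows; `triangular_sketch` is a hypothetical name, and the error raised for negative n is an assumption.

```python
def triangular_sketch(n):
    """Return the nth triangular number using n * (n + 1) / 2."""
    if n < 0:
        # Assumption: negative indexes are rejected with ValueError.
        raise ValueError(f"n must be >= 0, but got {n}")
    # n * (n + 1) is always even, so floor division is exact.
    return n * (n + 1) // 2


first_five = [triangular_sketch(n) for n in range(5)]
```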

exception seismicrna.core.dataset.BadTimeStampError

Bases: RuntimeError

A dataset has a timestamp that is earlier than a dataset that should have been written before it.

class seismicrna.core.dataset.Dataset(report_file: str | Path, verify_times: bool = True)

Bases: ABC

Dataset comprising batches of data.

property batch_nums: Numbers of the batches.

property best_k: int: Best number of clusters.

property branches: Branches of the workflow.

abstract property data_dirs: list[Path]: All directories containing data for the dataset.

property dir: Path: Directory containing the dataset.

abstractmethod get_batch(batch_num: int) → ReadBatch: Get a specific batch of data.

abstractmethod classmethod get_report_type() → type[Report]: Type of report.

property is_clustered: Whether the dataset is clustered.

iter_batches(): Yield each batch.

property ks: list[int]: Numbers of clusters.

link_data_dirs_to_tmp(tmp_dir: Path): Make links to a dataset in a temporary directory.

abstract property num_batches: int: Number of batches.

property num_reads: Number of reads in the dataset.

abstract property pattern: RelPattern | None: Pattern of mutations to count.

property ref: str: Name of the reference.

property sample: str: Name of the sample.

property time_began: datetime: Time at which writing the data began.

property time_ended: datetime: Time at which writing the data ended.

property top: Path: Top-level directory of the dataset.
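
The relationship among num_batches, batch_nums, get_batch, and iter_batches can be sketched with a small abstract base class. Everything below is hypothetical: the class names are invented for illustration, and numbering batches from 0 is an assumption.

```python
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Toy model of a dataset comprising numbered batches."""

    @property
    @abstractmethod
    def num_batches(self) -> int:
        """Number of batches."""

    @abstractmethod
    def get_batch(self, batch_num: int):
        """Get a specific batch of data."""

    @property
    def batch_nums(self):
        # Assumption: batches are numbered 0, 1, ..., num_batches - 1.
        return range(self.num_batches)

    def iter_batches(self):
        # Yield each batch in order of its number.
        for batch_num in self.batch_nums:
            yield self.get_batch(batch_num)


class ListDatasetSketch(DatasetSketch):
    """Concrete toy dataset backed by a list of batches."""

    def __init__(self, batches):
        self._batches = list(batches)

    @property
    def num_batches(self):
        return len(self._batches)

    def get_batch(self, batch_num):
        return self._batches[batch_num]


dataset = ListDatasetSketch(["batch A", "batch B"])
batches = list(dataset.iter_batches())
```

A subclass only needs to supply the abstract members; the iteration logic is inherited.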

exception seismicrna.core.dataset.FailedToLoadDatasetError

Bases: RuntimeError

A dataset failed to load.

class seismicrna.core.dataset.LoadFunction(dataset_type: type[Dataset], /, *more_types: type[Dataset])

Bases: object

Function to load a dataset.

__call__(report_file: str | Path, **kwargs): Load a dataset from the report file.

build_report_path(path_fields: dict[str, Any]): Build the path of a report file.

iterate(input_path: Iterable[str | Path], *, raise_on_error: bool = False, **kwargs): Yield a Dataset from each report file in input_path.

property report_path_auto_fields: Automatic field values of the report file path.

property report_path_seg_types: Segment types of the report file path.

class seismicrna.core.dataset.LoadedDataset(report_file: str | Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset created by loading directly from a Report.

property data_dirs: All directories containing data for the dataset.

get_batch(batch_num: int) → ReadBatchIO | MutsBatchIO: Get a specific batch of data.

get_batch_checksum(batch_num: int): Get the checksum of a specific batch from the report.

get_batch_path(batch_num: int): Get the path to a batch of a specific number.

abstractmethod classmethod get_batch_type() → type[ReadBatchIO | MutsBatchIO]: Type of batch.

classmethod get_btype_name(): Name of the type of batch.

abstractmethod classmethod get_report_type() → type[BatchedReport]: Type of report.

property num_batches: Number of batches.

class seismicrna.core.dataset.MergedDataset(report_file: str | Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset made by merging one or more constituent datasets.

property data_dirs: All directories containing data for the dataset.

property datasets: list[Dataset]: Constituent datasets that were merged.

abstractmethod classmethod get_dataset_load_func() → LoadFunction: Function to load one constituent dataset.

property pattern: Pattern of mutations to count.

class seismicrna.core.dataset.MergedRegionDataset(report_file: str | Path, verify_times: bool = True)

Bases: MergedDataset, RegionDataset, ABC

property refseq: Sequence of the reference.

class seismicrna.core.dataset.MergedUnbiasDataset(*args, masked_read_nums: dict[int, list] | None = None, **kwargs)

Bases: MergedDataset, UnbiasDataset, ABC

MergedDataset with attributes for correcting observer bias.

property min_mut_gap: Minimum gap between two mutations.

property quick_unbias: Use the quick heuristic for unbiasing.

property quick_unbias_thresh: Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

exception seismicrna.core.dataset.MissingBatchError

Bases: RuntimeError

A dataset does not have a batch of a given type and number.

exception seismicrna.core.dataset.MissingBatchTypeError

Bases: MissingBatchError

A dataset does not have a batch of a given type.

class seismicrna.core.dataset.MultistepDataset(dataset2_report_file: Path, **kwargs)

Bases: MutsDataset, ABC

Dataset made by integrating two datasets from different steps of the workflow.

property data_dirs: All directories containing data for the dataset.

get_batch(batch_num: int): Get a specific batch of data.

abstractmethod classmethod get_dataset1_load_func() → LoadFunction: Function to load Dataset 1.

classmethod get_dataset1_report_file(dataset2_report_file: Path, verify_times: bool): Given the report file for Dataset 2, determine the report file for Dataset 1.

classmethod get_dataset2_load_func(): Function to load Dataset 2.

abstractmethod classmethod get_dataset2_type() → type[RegionDataset]: Type of Dataset 2.

classmethod get_report_type(): Type of report.

classmethod load_dataset1(dataset2_report_file: Path, verify_times: bool): Load Dataset 1.

classmethod load_dataset2(dataset2_report_file: Path, verify_times: bool): Load Dataset 2.

property num_batches: Number of batches.

property refseq: Sequence of the reference.

class seismicrna.core.dataset.MutsDataset(report_file: str | Path, verify_times: bool = True)

Bases: RegionDataset, ABC

Dataset with a known region and explicit mutational data.

abstractmethod get_batch(batch_num: int) → RegionMutsBatch: Get a specific batch of data.

get_batch_count_all(batch_num: int, **kwargs): Calculate the counts for a specific batch of data.

iter_batches(): Yield each batch.

class seismicrna.core.dataset.RegionDataset(report_file: str | Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset with a known reference sequence and region.

property reflen: Length of the reference sequence.

abstract property refseq: DNA: Sequence of the reference.

property region: Region: Region of the dataset.

class seismicrna.core.dataset.TallDataset(*args, **kwargs)

Bases: MergedDataset, ABC

Dataset made by vertically pooling other datasets from one or more samples aligned to the same reference sequence.

property datasets: Constituent datasets that were merged.

get_batch(batch_num: int): Get a specific batch of data.

property num_batches: Number of batches.

property nums_batches: list[int]: Number of batches in each dataset in the pool.

property samples: list[str]: Names of all samples in the pool.

class seismicrna.core.dataset.UnbiasDataset(*args, masked_read_nums: dict[int, list] | None = None, **kwargs)

Bases: Dataset, ABC

Dataset with attributes for correcting observer bias.

abstract property min_mut_gap: int: Minimum gap between two mutations.

abstract property quick_unbias: bool: Use the quick heuristic for unbiasing.

abstract property quick_unbias_thresh: float: Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

class seismicrna.core.dataset.WideDataset(report_file: str | Path, verify_times: bool = True)

Bases: MergedRegionDataset, ABC

Dataset made by horizontally joining other datasets from one or more regions of the same reference sequence.

property datasets: Constituent datasets that were merged.

get_batch(batch_num: int): Get a specific batch of data.

property num_batches: Number of batches.

property region: Region of the dataset.

property region_names: Names of all joined regions.

class seismicrna.core.dataset.WideMutsDataset(report_file: str | Path, verify_times: bool = True)

Bases: WideDataset, MutsDataset, ABC

WideDataset with mutation data.

Generic Exceptions

exception seismicrna.core.error.DuplicateValueError

Bases: ValueError

A value occurred more than once when all must be unique.

exception seismicrna.core.error.IncompatibleOptionsError

Bases: ValueError

Two or more options are incompatible.

exception seismicrna.core.error.IncompatibleValuesError

Bases: ValueError

Two or more values are individually valid, but their combination is not.

exception seismicrna.core.error.InconsistentValueError

Bases: ValueError

Two or more values differ when they should be equal.

exception seismicrna.core.error.NoDataError

Bases: RuntimeError

Data were required, but none were provided.

exception seismicrna.core.error.OutOfBoundsError

Bases: ValueError

A numeric value is outside its proper bounds.

class seismicrna.core.header.ClustHeader(*, ks: Iterable[int], **kwargs)

Bases: Header

Header of clusters.

property clusts: Tracks of data: clusters for clustered data, otherwise one track of the average.

classmethod get_is_clustered(): Whether the header has clusters.

classmethod get_levels(): Levels of the index.

property index: Index of the header.

iter_clust_indexes(): For each cluster, yield an Index/MultiIndex of every column that is part of the cluster.

property ks: Numbers of clusters.

property signature: Signature of the header, which will generate an identical header if passed as keyword arguments to make_header.

class seismicrna.core.header.Header

Bases: ABC

Header for a table.

property clusts: list[tuple[int, int]]: Tracks of data: clusters for clustered data, otherwise one track of the average.

get_clust_header(): Corresponding ClustHeader.

abstractmethod classmethod get_is_clustered() → bool: Whether the header has clusters.

classmethod get_level_keys(): Level keys of the index.

classmethod get_level_names(): Level names of the index.

abstractmethod classmethod get_levels(): Levels of the index.

classmethod get_num_levels(): Number of levels.

get_rel_header(): Corresponding RelHeader.

property index: Index: Index of the header.

abstractmethod iter_clust_indexes(): For each cluster, yield an Index/MultiIndex of every column that is part of the cluster.

abstract property ks: list[int]: Numbers of clusters.

modified(**kwargs)

Return a new header with a possibly modified signature.

Parameters: **kwargs – Keyword arguments for modifying the signature of the header. Each argument given here will be passed to make_header and override the attribute (if any) with the same name in this header’s signature. Attributes of this header’s signature that are not overridden will also be passed to make_header.
Returns: New header with a possibly modified signature.
Return type: Header

property names: Formatted name of each track.

select(**kwargs) → Index: Select and return items from the header as an Index.

property signature: Signature of the header, which will generate an identical header if passed as keyword arguments to make_header.

property size: Number of items in the Header.

class seismicrna.core.header.RelClustHeader(*, ks: Iterable[int], **kwargs)

Bases: ClustHeader, RelHeader

Header of relationships and clusters.

property index: Index of the header.

class seismicrna.core.header.RelHeader(*, rels: Iterable[str], **kwargs)

Bases: Header

Header of relationships.

property clusts: Tracks of data: clusters for clustered data, otherwise one track of the average.

classmethod get_is_clustered(): Whether the header has clusters.

classmethod get_levels(): Levels of the index.

property index: Index of the header.

iter_clust_indexes(): For each cluster, yield an Index/MultiIndex of every column that is part of the cluster.

property ks: Numbers of clusters.

property rels: Relationships.

property signature: Signature of the header, which will generate an identical header if passed as keyword arguments to make_header.

seismicrna.core.header.deduplicate_rels(rels: Iterable)

Remove duplicate relationships while preserving their order.

Parameters: rels (Iterable) – Relationships
Returns: Relationships with duplicates removed, in the original order.
Return type: list[str]
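
A common Python idiom for order-preserving deduplication relies on dictionaries keeping insertion order. The sketch below shows that idiom; it is not necessarily how deduplicate_rels is written, and `deduplicate_sketch_rels` is a hypothetical name.

```python
def deduplicate_sketch_rels(rels):
    """Remove duplicates while keeping the first occurrence of each."""
    # dict.fromkeys keeps only the first occurrence of each key and
    # preserves insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(rels))


unique_rels = deduplicate_sketch_rels(["Mutated", "Matched", "Mutated"])
```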

seismicrna.core.header.format_clust_name(k: int, clust: int)

Format a pair of k and cluster numbers into a name.

Parameters:

k (int) – Number of clusters
clust (int) – Cluster number

Returns:

Name specifying k and clust, or “average” if k is 0.

Return type:

str

seismicrna.core.header.format_clust_names(clusts: Iterable[tuple[int, int]], allow_duplicates: bool = False)

Format pairs of k and clust into a list of names.

Parameters:

clusts (Iterable[tuple[int, int]]) – Zero or more pairs of k and cluster numbers.
allow_duplicates (bool = False) – Allow k and clust pairs to be duplicated.

Returns:

List of names of the pairs of k and clust.

Return type:

list[str]

Raises:

ValueError – If allow_duplicates is False and clusts has duplicates.

seismicrna.core.header.list_clusts(k: int)

List all cluster numbers for one k.

Parameters: k (int) – Number of clusters (≥ 1)
Returns: List of cluster numbers.
Return type: list[int]

seismicrna.core.header.list_k_clusts(k: int)

List k and cluster numbers as 2-tuples for one k.

Parameters: k (int) – Number of clusters (≥ 1)
Returns: List wherein each item is a tuple of the number of clusters and the cluster number.
Return type: list[tuple[int, int]]

seismicrna.core.header.list_ks_clusts(ks: Iterable[int])

List k and cluster numbers as 2-tuples.

Parameters: ks (Iterable[int])
Returns: List wherein each item is a tuple of the number of clusters and the cluster number.
Return type: list[tuple[int, int]]
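
The three listing functions above build on one another. The sketch below illustrates that layering under a loudly stated assumption: cluster numbers for a given k run from 1 to k. All three function names are hypothetical.

```python
def list_clusts_sketch(k):
    """List cluster numbers for one k."""
    # Assumption: clusters are numbered 1, 2, ..., k.
    return list(range(1, k + 1))


def list_k_clusts_sketch(k):
    """List (k, cluster) pairs for one k."""
    return [(k, clust) for clust in list_clusts_sketch(k)]


def list_ks_clusts_sketch(ks):
    """List (k, cluster) pairs for every k in ks."""
    return [pair for k in ks for pair in list_k_clusts_sketch(k)]


pairs = list_ks_clusts_sketch([1, 2])
```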

seismicrna.core.header.make_header(*, rels: Iterable[str] | None = None, ks: Iterable[int] | None = None)

Make a new Header of an appropriate type.

Parameters:

rels (Iterable[str] | None = None) – Relationships in the header
ks (Iterable[int] | None = None) – Numbers of clusters

Returns:

Header of the appropriate type.

Return type:

Header

seismicrna.core.header.parse_header(index: Index | MultiIndex)

Parse an Index into a Header of an appropriate type.

Parameters: index (pd.Index | pd.MultiIndex) – Index to parse.
Returns: New Header whose index is index.
Return type: Header

seismicrna.core.header.validate_k_clust(k: int, clust: int)

Validate a pair of k and cluster numbers.

Parameters:

k (int) – Number of clusters
clust (int) – Cluster number

Returns:

None, if the k and cluster numbers form a valid pair; otherwise, an error is raised.

Return type:

None

Raises:

TypeError – If k or clust is not an integer.
ValueError – If k and clust do not form a valid pair.

seismicrna.core.header.validate_ks(ks: Iterable)

Validate and sort numbers of clusters.

Parameters: ks (Iterable) – Numbers of clusters
Returns: Sorted numbers of clusters
Return type: list[int]
Raises: ValueError – If any k is not positive or is repeated.
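
The validation rules stated above (every k positive, no k repeated, result sorted) can be sketched as follows. `validate_ks_sketch` is a hypothetical name and the error messages are invented for illustration.

```python
def validate_ks_sketch(ks):
    """Validate numbers of clusters and return them sorted."""
    seen = set()
    for k in ks:
        if k < 1:
            raise ValueError(f"k must be >= 1, but got {k}")
        if k in seen:
            raise ValueError(f"Duplicate k: {k}")
        seen.add(k)
    return sorted(seen)


sorted_ks = validate_ks_sketch([3, 1, 2])
```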

class seismicrna.core.join.JoinMutsDataset(report_file: str | Path, verify_times: bool = True)

Bases: WideMutsDataset, ABC

classmethod check_batch_type(batch: MutsBatch): Raise TypeError if the batch is the incorrect type.

abstractmethod classmethod get_batch_type() → type[MutsBatch]: Type of batch.

property min_mut_gap

abstractmethod classmethod name_batch_attrs() → list[str]: Name the attributes of each batch.

class seismicrna.core.join.JoinReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, RegFileIO, ABC

Report for a joined dataset.

class seismicrna.core.lists.List(*, sample: str, branches: Iterable[str], ref: str, data: DataFrame, **kwargs)

Bases: RefFileIO, ABC

List base class.

abstractmethod classmethod from_table(table: RelTypeTable, branch: str, **kwargs) → Self: Create a list from a table.

classmethod get_auto_path_fields(): Default values of the path fields.

classmethod get_by_read() → bool: Whether the list is of reads.

abstractmethod classmethod get_column_names() → list[str]: Names of the index columns.

classmethod get_ext(): File extension.

classmethod get_is_gzip(): Whether the file is compressed with gzip.

classmethod get_path_from_table(table: RelTypeTable, branch: str): Get the path of a list given a table.

abstractmethod classmethod get_table_type() → type[RelTypeTableLoader]: Type of table that this type of list can process.

classmethod load(file: str | Path, **kwargs): Load an object from a file.

classmethod load_data(file: str | Path, only_ref: bool = False)

save(top: Path, force: bool = False): Save the object to a file.

classmethod validate_data(data: DataFrame)

class seismicrna.core.lists.PositionList(*, sample: str, branches: Iterable[str], ref: str, data: DataFrame, **kwargs)

Bases: List, ABC

List of positions.

MASK_FMUT = 'list_max_fmut_pos'

MASK_NINFO = 'list_min_ninfo_pos'

classmethod from_table(table: PositionTableLoader, branch: str, *, min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0): Create a list from a table.

classmethod get_column_names(): Names of the index columns.

classmethod get_data_type()

classmethod get_file_seg_type(): Type of the last segment in the path.

classmethod list_init_table_attrs(): List the table attribute names to pass to __init__().

class seismicrna.core.lists.ReadList(*, sample: str, branches: Iterable[str], ref: str, data: DataFrame, **kwargs)

Bases: List, ABC

List of reads.

classmethod get_column_names(): Names of the index columns.

classmethod get_data_type()

classmethod get_file_seg_type(): Type of the last segment in the path.

class seismicrna.core.logs.AnsiCode

Bases: object

Format text with ANSI codes.

BOLD = 1

END = 'm'

RESET = 0

START = '\x1b['

classmethod format(code: int): Make a format string for one ANSI code.

classmethod format_color(color: int): Make a format string for one 256-color code.

classmethod reset(): Convenience function to end formatting.
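
The constants above combine into ANSI escape sequences of the form START + code + END. The sketch below shows that assembly with hypothetical function names; the `38;5;` prefix is the standard SGR sequence for a 256-color foreground, and using it here is an assumption about what format_color produces.

```python
START = "\x1b["
END = "m"
BOLD = 1
RESET = 0


def format_sketch(code):
    """Make an escape sequence for one ANSI code."""
    return f"{START}{code}{END}"


def format_color_sketch(color):
    """Make an escape sequence for one 256-color foreground code."""
    # Assumption: the color applies to the foreground (SGR 38;5;n).
    return f"{START}38;5;{color}{END}"


# Wrap text in bold, then reset formatting afterwards.
bold_text = f"{format_sketch(BOLD)}warning{format_sketch(RESET)}"
```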

class seismicrna.core.logs.ConsoleStream(filterer: Filterer, formatter: Formatter)

Bases: Stream

Log to the console’s stderr stream.

filterer

formatter

property stream: Text stream to which messages will be logged after filtering and formating.

class seismicrna.core.logs.FileStream(file_path: str | Path, *args, **kwargs)

Bases: Stream

Log to a file.

close(): Close the file stream.

file_path

property stream: Text stream to which messages will be logged after filtering and formating.

class seismicrna.core.logs.Filterer(verbosity: int)

Bases: object

Filter messages before logging.

verbosity

class seismicrna.core.logs.Formatter(formatter: Callable[[Message], str])

Bases: object

Format messages before logging.

formatter

class seismicrna.core.logs.Level(*values)

Bases: IntEnum

Level of a logging message.

ACTION = 2

DETAIL = 4

ERROR = -2

FATAL = -3

ROUTINE = 3

STATUS = 0

TASK = 1

WARNING = -1
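
Because Level is an IntEnum, levels compare as integers, with more severe levels having lower values. The sketch below reproduces the values listed above and pairs them with a hypothetical filtering rule: a message is logged when its level is at most the verbosity. That rule is an assumption about how Filterer uses verbosity, not documented behavior.

```python
from enum import IntEnum


class LevelSketch(IntEnum):
    """Local copy of the documented level values."""
    FATAL = -3
    ERROR = -2
    WARNING = -1
    STATUS = 0
    TASK = 1
    ACTION = 2
    ROUTINE = 3
    DETAIL = 4


def should_log(level, verbosity):
    # Assumed rule: log messages whose level is <= the verbosity.
    return level <= verbosity


kept = [level.name for level in LevelSketch if should_log(level, 0)]
```

Under this assumed rule, a verbosity of 0 keeps fatal errors, errors, warnings, and status messages.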

class seismicrna.core.logs.Logger(console_stream: ConsoleStream | None = None, file_stream: FileStream | None = None, exit_on_error: bool = False)

Bases: object

Log messages to the console and to files.

action(content: object)

console_stream

detail(content: object)

error(content: object)

exit_on_error

fatal(content: object)

file_stream

routine(content: object)

status(content: object)

task(content: object)

warning(content: object)

class seismicrna.core.logs.LoggerConfig(verbosity, log_file_path, log_color, exit_on_error)

Bases: tuple

exit_on_error: Alias for field number 3

log_color: Alias for field number 2

log_file_path: Alias for field number 1

verbosity: Alias for field number 0

class seismicrna.core.logs.Message(level: Level, content: object)

Bases: object

Message with a logging level.

content

level

class seismicrna.core.logs.Stream(filterer: Filterer, formatter: Formatter)

Bases: ABC

Log to a stream, such as to the console or to a file.

filterer

formatter

log(message: Message): Log a message to the stream.

abstract property stream: TextIO: Text stream to which messages will be logged after filtering and formatting.

seismicrna.core.logs.erase_config(): Erase the existing logger configuration.

seismicrna.core.logs.exc_info(): Whether to log exception information.

seismicrna.core.logs.format_console_color(message: Message): Format a message to log on the console with color.

seismicrna.core.logs.format_console_plain(message: Message): Format a message to log on the console without color.

seismicrna.core.logs.format_logfile(message: Message): Format a message to write into the log file.

seismicrna.core.logs.get_config(): Get the configuration parameters of a logger.

seismicrna.core.logs.log_exceptions(default: Callable | None): If any exception occurs, catch it and return the default.

seismicrna.core.logs.restore_config(func: Callable): After the function exits, restore the logging configuration that was in place before the function ran.
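
A decorator with the behavior described for restore_config can be sketched with try/finally. Everything here is hypothetical: the module-level `_config` dictionary and the `*_sketch` functions stand in for the real logger configuration.

```python
import functools

# Hypothetical stand-in for the logger's configuration state.
_config = {"verbosity": 0, "log_color": True}


def get_config_sketch():
    return dict(_config)


def set_config_sketch(**kwargs):
    _config.update(kwargs)


def restore_config_sketch(func):
    """Restore the configuration after func exits, even on error."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        saved = get_config_sketch()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned normally or raised.
            _config.clear()
            _config.update(saved)
    return wrapper


@restore_config_sketch
def run_verbosely():
    set_config_sketch(verbosity=4)
    return get_config_sketch()["verbosity"]


verbosity_inside = run_verbosely()
verbosity_after = get_config_sketch()["verbosity"]
```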

seismicrna.core.logs.set_config(verbosity: int = 0, log_file_path: str | Path | None = None, log_color: bool = True, exit_on_error: bool = False): Configure the main logger with handlers and verbosity.

class seismicrna.core.path.BranchesPathField

Bases: PathField

The field for branches requires special functions.

build(val: Any): Validate a value and return it as a string.

dtype

is_ext

options

parse(text: str): Parse a value from a string, validate it, and return it.

pattern

validate(val: Any): Validate a value before turning it into a string.

class seismicrna.core.path.HasFilePath

Bases: ABC

Object that corresponds to the path of a file (which may or may not actually exist on the file system).

classmethod build_path(path_fields: dict[str, Any]): Build the file path from the given field values.

classmethod get_auto_path_fields() → dict[str, Any]: Names and path fields that have automatic values.

abstractmethod classmethod get_dir_seg_types() → list[PathSegment]: Types of the directory segments in the path.

classmethod get_ext(): File extension.

abstractmethod classmethod get_file_seg_type() → PathSegment: Type of the last segment in the path.

get_path(top: str | Path): Return the file path.

get_path_field_values(top: str | Path | None = None, exclude_auto: bool = False, exclude: Iterable[str] = ()): Path field values as a dict.

classmethod get_path_fields(): Path fields for the file type.

classmethod get_path_seg_types(): Types of the segments in the path.

classmethod parse_path(file: str | Path, exclude_auto: bool = False): Parse a file path to determine the field values.

class seismicrna.core.path.HasRefFilePath

Bases: HasSampleFilePath, ABC

Object that has a path with a reference.

classmethod get_dir_seg_types(): Types of the directory segments in the path.

class seismicrna.core.path.HasRegFilePath

Bases: HasRefFilePath, ABC

Object that has a path with a region.

classmethod get_dir_seg_types(): Types of the directory segments in the path.

class seismicrna.core.path.HasSampleFilePath

Bases: HasFilePath, ABC

Object that has a path with a sample, step, and branches.

classmethod get_auto_path_fields(): Names and path fields that have automatic values.

classmethod get_dir_seg_types(): Types of the directory segments in the path.

abstractmethod classmethod get_step() → str: Step of the workflow.

class seismicrna.core.path.Path(seg_types: Iterable[PathSegment])

Bases: object

build(fields: dict[str, Any]): Return a pathlib.Path instance by assembling the given fields into a full path.

parse(path: str | Path): Return the field names and values from a given path.

exception seismicrna.core.path.PathError

Bases: Exception

Any error involving a path.

exception seismicrna.core.path.PathExistsError

Bases: PathError, FileExistsError

Path exists but should not.

class seismicrna.core.path.PathField(dtype: type[str | int | Path | list], options: Iterable = (), is_ext: bool = False, pattern: str = '')

Bases: object

build(val: Any): Validate a value and return it as a string.

dtype

is_ext

options

parse(text: str) → Any: Parse a value from a string, validate it, and return it.

pattern

validate(val: Any): Validate a value before turning it into a string.

exception seismicrna.core.path.PathNotFoundError

Bases: PathError, FileNotFoundError

Path does not exist but should.

class seismicrna.core.path.PathSegment(segment_name: str, field_types: dict[str, PathField], *, order: int = 0, frmt: str | None = None)

Bases: object

build(vals: dict[str, Any])

property ext_type: Type of the segment’s file extension, or None if it has no file extension.

property exts: list[str]: Valid file extensions of the segment.

match_longest_ext(text: str): Find the longest extension of the given text that matches a valid file extension. If none match, return None.

parse(text: str)
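
The idea behind match_longest_ext matters for compound extensions, where a shorter valid extension is a suffix of a longer one. A minimal sketch, assuming the valid extensions are given as a list of strings (the function name and the example extensions are hypothetical):

```python
def match_longest_ext_sketch(text, exts):
    """Return the longest extension in exts that text ends with."""
    matches = [ext for ext in exts if text.endswith(ext)]
    # If nothing matches, max returns the default, None.
    return max(matches, key=len, default=None)


ext = match_longest_ext_sketch("reads.fq.gz", [".gz", ".fq", ".fq.gz"])
```

Choosing the longest match keeps "reads" as the stem, rather than "reads.fq".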

exception seismicrna.core.path.PathTypeError

Bases: PathError, TypeError

Use of the wrong type of path or segment.

exception seismicrna.core.path.PathValueError

Bases: PathError, ValueError

Invalid value of a path segment field.

exception seismicrna.core.path.WrongFileExtensionError

Bases: PathValueError

A file has the wrong extension.
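
Because the path exceptions listed above inherit from both PathError and a built-in exception, callers can catch them either way. The sketch below re-declares part of the hierarchy locally (these are illustrative copies, not imports from seismicrna) to show the effect.

```python
# Local re-declarations mirroring the documented bases.
class PathError(Exception):
    """Any error involving a path."""


class PathNotFoundError(PathError, FileNotFoundError):
    """Path does not exist but should."""


class PathValueError(PathError, ValueError):
    """Invalid value of a path segment field."""


class WrongFileExtensionError(PathValueError):
    """A file has the wrong extension."""


try:
    raise PathNotFoundError("missing.txt")
except FileNotFoundError as error:
    # The built-in handler also catches the path-specific error.
    caught = error
```

Code written against the built-in exceptions keeps working, while code that wants every path-related failure can catch PathError alone.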

seismicrna.core.path.add_branch(step: str, branch: str, ancestors: dict[str, str]): Add a new branch to a dict of branches.

seismicrna.core.path.build(segment_types: Iterable[PathSegment], field_values: dict[str, Any]): Return a pathlib.Path from the segment types and field values.

seismicrna.core.path.builddir(segment_types: Iterable[PathSegment], field_values: dict[str, Any]): Build the path and create it on the file system as a directory if it does not already exist.

seismicrna.core.path.buildpar(segment_types: Iterable[PathSegment], field_values: dict[str, Any]): Build a path and create its parent directory if it does not already exist.

seismicrna.core.path.cast_path(input_path: str | Path, input_segments: Sequence[PathSegment], output_segments: Sequence[PathSegment], override: dict[str, Any] | None = None)

Cast input_path made of input_segments to a new path made of output_segments.

Parameters:

input_path (str | pathlib.Path) – Input path from which to take the path fields.
input_segments (Sequence[PathSegment]) – Path segments to use to determine the fields in input_path.
output_segments (Sequence[PathSegment]) – Path segments to use to determine the fields in output_path.
override (dict[str, Any] | None) – Override and supplement the fields in input_path.

Returns:

Path comprising output_segments made of fields in input_path (as determined by input_segments).

Return type:

pathlib.Path

seismicrna.core.path.check_file_extension(file: str | Path, extensions: Iterable[str] | PathField)

seismicrna.core.path.create_path_type(segment_types: tuple[PathSegment, ...]): Create and cache a Path instance from the segment types.

seismicrna.core.path.deduplicate(paths: Iterable[str | Path], warn: bool = False): Yield the non-redundant paths.

seismicrna.core.path.deduplicated(func: Callable): Decorate a Path generator to yield non-redundant paths.

seismicrna.core.path.fill_whitespace(path: str | Path, fill: str = '_') → str | Path: Replace all whitespace in path with fill.
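As a rough illustration of the behavior described for fill_whitespace, the sketch below replaces each whitespace character and returns the same kind of object it was given. The name fill_whitespace_sketch is hypothetical, and whether the library replaces each character or each run of whitespace is an assumption here.

```python
import re
from pathlib import Path


def fill_whitespace_sketch(path, fill="_"):
    """Replace every whitespace character in `path` with `fill`.

    Hypothetical stand-in, not the library's implementation.
    """
    filled = re.sub(r"\s", fill, str(path))
    # Return a Path if a Path was given, otherwise a str.
    return Path(filled) if isinstance(path, Path) else filled
```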

seismicrna.core.path.find_files(path: str | Path, segments: Sequence[PathSegment], pre_sanitize: bool = True)

Yield all files that match a sequence of path segments. The behavior depends on what path is:

If it is a file, then yield path if it matches the segments; otherwise, yield nothing.
If it is a directory, then search it recursively and yield every matching file in the directory and its subdirectories.

Parameters:

path (str | pathlib.Path) – Path of a file to check or a directory to search recursively.
segments (Sequence[PathSegment]) – Sequence(s) of Path segments to check if each file matches.
pre_sanitize (bool) – Whether to sanitize the path before searching it.

Returns:

Paths of files matching the segments.

Return type:

Generator[Path, Any, None]

seismicrna.core.path.find_files_chain(paths: Iterable[str | Path], segments: Sequence[PathSegment]): Yield from find_files called on every path in paths.

seismicrna.core.path.flatten_branches(branches: dict[str, str])

seismicrna.core.path.get_ancestors(branches: dict[str, str]): Get all but the last branch in a dict of branches.

seismicrna.core.path.get_fields_in_seg_types(segment_types: Iterable[PathSegment], include_top: bool = False) → dict[str, PathField]: Get all fields among the given segment types.

seismicrna.core.path.get_seismicrna_project_dir(): SEISMIC-RNA project directory, named seismic-rna, containing src, pyproject.toml, and all other project files. Will exist if the entire SEISMIC-RNA project has been downloaded, e.g. from GitHub, but not if SEISMIC-RNA was only installed using pip or conda.

seismicrna.core.path.get_seismicrna_source_dir(): SEISMIC-RNA source directory, named seismicrna, containing __init__.py and the top-level modules and subpackages.

seismicrna.core.path.mkdir_if_needed(path: Path | str): Create a directory and log that event if it does not exist.

seismicrna.core.path.parse(path: str | Path, segment_types: Iterable[PathSegment]): Return the fields of a path based on the segment types.

seismicrna.core.path.parse_top_separate(path: str | Path, segment_types: Iterable[PathSegment]): Return the fields of a path, and the top field separately.

seismicrna.core.path.path_matches(path: str | Path, segments: Iterable[PathSegment])

Check if a path matches a sequence of path segments.

Parameters:

path (str | pathlib.Path) – Path of the file/directory.
segments (Iterable[PathSegment]) – Sequence of path segments to check if the file matches.

Returns:

Whether the path matches any given sequence of path segments.

Return type:

bool

seismicrna.core.path.randdir(parent: str | Path | None = None, prefix: str = '', suffix: str = ''): Build a path of a new directory that does not exist and create it on the file system.

seismicrna.core.path.rmdir_if_needed(path: Path | str, rmtree: bool = False, rmtree_ignore_errors: bool = False, raise_on_rmtree_error: bool = True): Remove a directory and log that event if it exists.

seismicrna.core.path.sanitize(path: str | Path, strict: bool = False)

Sanitize a path-like object by ensuring it is an absolute path, eliminating redundant path separators/references, and returning a Path object.

Parameters:

path (str | pathlib.Path) – Path to sanitize.
strict (bool) – Require the path to exist and contain no symbolic link loops.

Returns:

Normalized absolute path.

Return type:

pathlib.Path
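The effect described for sanitize can be approximated with the standard library alone. This is a minimal sketch under the assumption that pathlib's resolve is an acceptable stand-in; sanitize_sketch is a hypothetical name, and unlike the description above, resolve also follows symbolic links.

```python
from pathlib import Path


def sanitize_sketch(path, strict=False):
    """Return an absolute, normalized Path (illustrative only)."""
    # resolve() makes the path absolute and collapses ".", ".." and
    # repeated separators; with strict=True it raises if the path
    # does not exist.
    return Path(path).resolve(strict=strict)
```

With strict=False the path does not need to exist, so "no_such_dir//sub/../file.txt" becomes the absolute path of no_such_dir/file.txt under the current working directory.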

seismicrna.core.path.symlink_if_needed(link_path: Path | str, target_path: Path | str): Make link_path a link pointing to target_path and log that event if it does not exist.

seismicrna.core.path.transpath(to_dir: str | Path, from_dir: str | Path, path: str | Path, strict: bool = False)

Return the path that would be produced by moving path from from_dir to to_dir (but do not actually move the path on the file system). This function does not require that any of the given paths exist, unless strict is True.

Parameters:

to_dir (str | pathlib.Path) – Directory to which to move path.
from_dir (str | pathlib.Path) – Directory from which to move path; must contain path but not necessarily be the direct parent directory of path.
path (str | pathlib.Path) – Path to move; can be a file or directory.
strict (bool = False) – Require that all paths exist and contain no symbolic link loops.

Returns:

Hypothetical path after moving path from from_dir to to_dir.

Return type:

pathlib.Path
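The path arithmetic described for transpath can be sketched with pathlib. The function name transpath_sketch is hypothetical, and the sketch skips the sanitizing and strict checking that the real function may perform.

```python
from pathlib import Path


def transpath_sketch(to_dir, from_dir, path):
    """Path that `path` would have after moving from `from_dir` to `to_dir`."""
    # relative_to raises ValueError if `path` is not inside `from_dir`.
    relative = Path(path).relative_to(from_dir)
    # Nothing is touched on the file system; this is pure path arithmetic.
    return Path(to_dir) / relative
```

For example, moving /in/sample/ref/file.txt from /in to /out gives /out/sample/ref/file.txt; from_dir need not be the direct parent of path.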

seismicrna.core.path.transpaths(to_dir: str | Path, paths: Iterable[str | Path], strict: bool = False)

Return all paths that would be produced by moving all paths in paths from their longest common sub-path to to_dir (but do not actually move the paths on the file system). This function does not require that any of the given paths exist, unless strict is True.

Parameters:

to_dir (str | pathlib.Path) – Directory to which to move every path in paths.
paths (Iterable[str | pathlib.Path]) – Paths to move; can be files or directories. A common sub-path must exist among all of these paths.
strict (bool = False) – Require that all paths exist and contain no symbolic link loops.

Returns:

Hypothetical paths after moving all paths in paths to to_dir.

Return type:

tuple[pathlib.Path, ...]
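A minimal sketch of the "longest common sub-path" idea, using os.path.commonpath from the standard library. The name transpaths_sketch is hypothetical, and how the library treats edge cases (such as a single input path) is not shown here.

```python
import os
from pathlib import Path


def transpaths_sketch(to_dir, paths):
    """Move every path from the longest common sub-path to `to_dir`."""
    paths = [Path(p) for p in paths]
    # commonpath raises ValueError if the paths share no common sub-path
    # (for example, a mix of absolute and relative paths).
    common = os.path.commonpath(paths)
    return tuple(Path(to_dir) / p.relative_to(common) for p in paths)
```

Given /in/a/x.txt and /in/b/y.txt, the common sub-path is /in, so moving to /out yields /out/a/x.txt and /out/b/y.txt.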

seismicrna.core.path.validate_branch(branch: str)

seismicrna.core.path.validate_branches(branches: dict[str, str])

seismicrna.core.path.validate_branches_flat(branches_flat: list[str])

seismicrna.core.path.validate_int(num: int)

seismicrna.core.path.validate_str(txt: str)

seismicrna.core.path.validate_top(top: Path)

seismicrna.core.random.stochastic_round(values: ndarray | list | float | int, preserve_sum: bool = False)

Round values to integers stochastically, so that the probability of rounding up equals the fractional part of the original value.

Parameters:

values (np.ndarray | list | float | int) – Values to round; if scalar, a 0D integer array will be returned.
preserve_sum (bool) – Whether to ensure that the sum of the rounded values equals the sum of the original values.

Returns:

Values rounded to integers; the original sum is preserved if preserve_sum is True.

Return type:

np.ndarray
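The rounding rule can be illustrated in plain Python: take the floor, then add 1 with probability equal to the fractional part. This sketch uses a hypothetical name, works on lists rather than arrays, and does not implement preserve_sum.

```python
import math
import random


def stochastic_round_sketch(values, rng=random):
    """Round each value down or up at random (illustrative only)."""
    rounded = []
    for value in values:
        lower = math.floor(value)
        # Round up with probability equal to the fractional part, so the
        # expected value of the result equals the original value.
        rounded.append(lower + (rng.random() < value - lower))
    return rounded
```

A value of 0.25 becomes 1 about a quarter of the time and 0 otherwise; values that are already integers never change.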

class seismicrna.core.report.BatchedReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, ABC

Report with a number of data batches (one file per batch).

classmethod get_batch_type(btype: str | None = None) → type[ReadBatchIO]: Return a valid type of batch based on its name.

classmethod get_batch_types() → dict[str, type[ReadBatchIO]]: Type(s) of batch(es) for the report, keyed by name.

classmethod get_checksum_report_fields(): Checksum fields of the report.

exception seismicrna.core.report.InvalidReportFieldKeyError

Bases: ReportFieldKeyError

The key does not belong to an actual report field.

exception seismicrna.core.report.InvalidReportFieldTitleError

Bases: ReportFieldKeyError

The title does not belong to an actual report field.

exception seismicrna.core.report.MissingFieldWithNoDefaultError

Bases: ReportFieldValueError

The default value was requested for a field that has no default.

class seismicrna.core.report.OptionReportField(option: Option, **kwargs)

Bases: ReportField

Field based on a command line option.

default

dtype

iconv

key

oconv

title

class seismicrna.core.report.RefReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, RefFileIO, ABC

classmethod get_ident_report_fields(): Identification fields of the report.

class seismicrna.core.report.RegReport(**kwargs: Any | Callable[[Report], Any])

Bases: RefReport, RegFileIO, ABC

classmethod get_ident_report_fields(): Identification fields of the report.

class seismicrna.core.report.Report(**kwargs: Any | Callable[[Report], Any])

Bases: SampleFileIO, ABC

Abstract base class for a report from a step.

__setattr__(key: str, value: Any): Validate the attribute name and value before setting it.

classmethod from_dict(odata: dict[str, Any]): Convert a dict of raw values (keyed by the titles of their fields) into a dict of encoded values (keyed by the keys of their fields), from which a new Report is instantiated.

classmethod get_checksum_report_fields() → list[ReportField]: Checksum fields of the report.

get_field(field: ReportField, missing_ok: bool = False): Return the value of a field of the report using the field instance directly, not its key.

classmethod get_field_keys(): Keys of all fields of the report.

classmethod get_field_keys_set(): Same as get_field_keys but caches and returns a set for fast membership checking.

classmethod get_ident_report_fields() → list[ReportField]: Identification fields of the report.

classmethod get_meta_report_fields() → list[ReportField]: Metadata fields of the report.

classmethod get_param_report_fields() → list[ReportField]: Parameter fields of the report.

classmethod get_report_fields(): All fields of the report.

classmethod get_result_report_fields() → list[ReportField]: Result fields of the report.

classmethod load(file: str | Path) → Report: Load an object from a file.

save(top: Path, force: bool = False): Save the object to a file.

to_dict(): Return a dict of raw values of the fields, keyed by the titles of their fields.

exception seismicrna.core.report.ReportDoesNotHaveFieldError

Bases: ReportFieldAttributeError

A report does not contain this type of field.

class seismicrna.core.report.ReportField(key: str, title: str, dtype: type, default: Any | None = None, *, iconv: Callable[[Any], Any] | None = None, oconv: Callable[[Any], Any] | None = None)

Bases: object

Field of a report.

default

dtype

iconv

key

oconv

title

exception seismicrna.core.report.ReportFieldAttributeError: Bases: ReportFieldError, AttributeError

exception seismicrna.core.report.ReportFieldError

Bases: RuntimeError

Any error involving a field of a report.

exception seismicrna.core.report.ReportFieldKeyError: Bases: ReportFieldError, KeyError

exception seismicrna.core.report.ReportFieldTypeError: Bases: ReportFieldError, TypeError

exception seismicrna.core.report.ReportFieldValueError: Bases: ReportFieldError, ValueError

seismicrna.core.report.calc_dt_minutes(began: datetime, ended: datetime): Calculate the time taken in minutes.

seismicrna.core.report.calc_taken(report: Report): Calculate the time taken in minutes.

seismicrna.core.report.default_key(key: str): Get the default value of a field by its key.

seismicrna.core.report.field_keys() → dict[str, ReportField]

seismicrna.core.report.field_titles() → dict[str, ReportField]

seismicrna.core.report.fields()

seismicrna.core.report.get_oconv_dict(dtype: type, precision: int = 3)

seismicrna.core.report.get_oconv_dict_list(dtype: type, precision: int = 3)

seismicrna.core.report.get_oconv_float(precision: int = 3)

seismicrna.core.report.get_oconv_list(dtype: type, precision: int = 3)

seismicrna.core.report.iconv_array_int(nums: list[int])

seismicrna.core.report.iconv_datetime(text: str)

seismicrna.core.report.iconv_dict_str_dict_int_dict_int_int(mapping: dict[Any, dict[Any, dict[Any, Any]]]) → dict[str, dict[int, dict[int, int]]]

seismicrna.core.report.iconv_dict_str_int(mapping: dict[Any, Any]) → dict[str, int]

seismicrna.core.report.iconv_int_keys(mapping: dict[Any, Any])

seismicrna.core.report.key_to_title(key: str): Map a field’s key to its title.

seismicrna.core.report.lookup_key(key: str): Get a field by its key.

seismicrna.core.report.lookup_title(title: str): Get a field by its title.

seismicrna.core.report.oconv_datetime(dtime: datetime)

seismicrna.core.run.log_command(command: str): Log the name of the command.

seismicrna.core.run.run_func(command: str, default: Callable | None = list, with_tmp: bool = False, pass_keep_tmp: bool = False, *args, **kwargs): Decorator for a run function.

seismicrna.core.stats.calc_beta_mv(alpha: float, beta: float)

Find the mean and variance of a beta distribution from its alpha and beta parameters.

Parameters:

alpha (float) – Alpha parameter of the beta distribution.
beta (float) – Beta parameter of the beta distribution.

Returns:

Mean and variance of the beta distribution.

Return type:

tuple[float, float]

seismicrna.core.stats.calc_beta_params(mean: float, variance: float)

Find the alpha and beta parameters of a beta distribution from its mean and variance.

Parameters:

mean (float) – Mean of the beta distribution.
variance (float) – Variance of the beta distribution.

Returns:

Alpha and beta parameters of the beta distribution.

Return type:

tuple[float, float]
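The two beta-distribution conversions are inverses of each other, which the standard formulas make explicit. The sketch below uses hypothetical names and plain floats; it is not the library's code.

```python
def beta_mv_sketch(alpha, beta):
    """Mean and variance of a beta distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


def beta_params_sketch(mean, variance):
    """Alpha and beta from the mean and variance.

    Requires 0 < mean < 1 and 0 < variance < mean * (1 - mean).
    """
    # Since variance = mean * (1 - mean) / (alpha + beta + 1),
    # solve for the sum alpha + beta first.
    total = mean * (1 - mean) / variance - 1
    return mean * total, (1 - mean) * total
```

For alpha = 2 and beta = 6, the mean is 0.25 and the variance is 12 / 576; passing those two values back recovers (2, 6) up to floating-point error.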

seismicrna.core.stats.calc_dirichlet_mv(alpha: ndarray)

Find the means and variances of a Dirichlet distribution from its concentration parameters.

Parameters:

alpha (np.ndarray) – Concentration parameters of the Dirichlet distribution.

Returns:

Means and variances of the Dirichlet distribution.

Return type:

tuple[np.ndarray, np.ndarray]

seismicrna.core.stats.calc_dirichlet_params(mean: ndarray, variance: ndarray)

Find the concentration parameters of a Dirichlet distribution from its mean and variance.

Parameters:

mean (np.ndarray) – Means.
variance (np.ndarray) – Variances.

Returns:

Concentration parameters.

Return type:

np.ndarray
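The Dirichlet conversions follow the same pattern as the beta ones. This sketch uses hypothetical names and plain lists instead of arrays; taking the total concentration from the first component is an assumption that only holds when the means and variances are exactly consistent with one Dirichlet distribution.

```python
def dirichlet_mv_sketch(alpha):
    """Means and variances of a Dirichlet distribution."""
    total = sum(alpha)
    means = [a / total for a in alpha]
    variances = [m * (1 - m) / (total + 1) for m in means]
    return means, variances


def dirichlet_params_sketch(means, variances):
    """Concentration parameters from the means and variances."""
    # Every component implies the same total concentration when the
    # inputs are exact; this sketch simply uses the first component.
    total = means[0] * (1 - means[0]) / variances[0] - 1
    return [m * total for m in means]
```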

Double Kumaraswamy distribution probability density function (PDF).

Parameters:

x (np.ndarray) – Input values; must be in the interval [0, 1].
w (float | int) – Weight for distribution 1; must be in the interval [0, 1].
a1 (float | int) – Shape parameter a for distribution 1; must be > 0.
b1 (float | int) – Shape parameter b for distribution 1; must be > 0.
a2 (float | int) – Shape parameter a for distribution 2; must be > 0.
b2 (float | int) – Shape parameter b for distribution 2; must be > 0.

Returns:

Double Kumaraswamy distribution PDF at input values.

Return type:

np.ndarray

seismicrna.core.stats.kumaraswamy_pdf(x: ndarray, a: float | int, b: float | int)

Kumaraswamy distribution probability density function (PDF).

Parameters:

x (np.ndarray) – Input values; must be in the interval [0, 1].
a (float | int) – Shape parameter a; must be > 0.
b (float | int) – Shape parameter b; must be > 0.

Returns:

Kumaraswamy distribution PDF at input values.

Return type:

np.ndarray
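For reference, the Kumaraswamy density is f(x) = a b x^(a-1) (1 - x^a)^(b-1). The sketch below evaluates it for a scalar strictly inside (0, 1). The weighted two-component mixture is an assumption about what "double Kumaraswamy" means, based only on the description of w as the weight for distribution 1; both function names are hypothetical.

```python
def kumaraswamy_pdf_sketch(x, a, b):
    """Kumaraswamy PDF at a scalar x with 0 < x < 1."""
    return a * b * x ** (a - 1) * (1 - x ** a) ** (b - 1)


def double_kumaraswamy_pdf_sketch(x, w, a1, b1, a2, b2):
    """Assumed mixture: weight w on distribution 1, 1 - w on distribution 2."""
    return (w * kumaraswamy_pdf_sketch(x, a1, b1)
            + (1 - w) * kumaraswamy_pdf_sketch(x, a2, b2))
```

With a = b = 1 the density is uniform (equal to 1 everywhere), and with a = 2, b = 2 it equals 1.5 at x = 0.5.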

class seismicrna.core.task.Task(func: Callable)

Bases: object

Wrap a parallelizable task in a try-except block so that if it fails, it just returns None rather than crashing the other tasks being run in parallel.

__call__(*args, **kwargs): Call the task’s function in a try-except block, return the result if it succeeds, and return None otherwise.

property name
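The wrapping pattern described for Task can be sketched as follows. TaskSketch is a hypothetical class; logging the failure and using the function's __name__ for the name property are assumptions, not confirmed details of the library.

```python
import logging


class TaskSketch:
    """Call a function, returning None instead of raising on failure."""

    def __init__(self, func):
        self.func = func

    @property
    def name(self):
        # Assumption: identify the task by the wrapped function's name.
        return self.func.__name__

    def __call__(self, *args, **kwargs):
        try:
            return self.func(*args, **kwargs)
        except Exception as error:
            # Record the failure so it is not silently lost, then return
            # None so that sibling tasks can continue.
            logging.getLogger(__name__).error("%s failed: %s", self.name, error)
            return None
```

Wrapping int this way, TaskSketch(int)("12") returns 12, while TaskSketch(int)("x") returns None rather than raising ValueError.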

seismicrna.core.task.as_list_of_tuples(args: Iterable[Any]): Given an iterable of arguments, return a list of 1-item tuples, each containing one of the given arguments. This function is useful for creating a list of tuples to pass to the args parameter of dispatch.
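The transformation described for as_list_of_tuples is small enough to show in full; the function name below is a hypothetical stand-in.

```python
def as_list_of_tuples_sketch(args):
    """Wrap each argument in its own 1-item tuple."""
    return [(arg,) for arg in args]
```

For example, ["a.fq", "b.fq"] becomes [("a.fq",), ("b.fq",)], so that each tuple can serve as the positional arguments of one call.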

seismicrna.core.task.calc_pool_size(num_tasks: int, num_cpus: int)

Calculate the size of a process pool.

Parameters:

num_tasks (int) – Number of tasks to parallelize. Must be ≥ 1.
num_cpus (int) – Number of CPUs available. Must be ≥ 1.

Returns:

Size of the pool (number of concurrent tasks). Always ≥ 1.
Number of CPUs for each task in the pool. Always ≥ 1.

Return type:

tuple[int, int]

seismicrna.core.task.dispatch(funcs: Callable | list[Callable], *, num_cpus: int, pass_num_cpus: bool, as_list: bool, ordered: bool, raise_on_error: bool, args: tuple | Iterable[tuple] = (), kwargs: dict[str, Any] | None = None)

Run one or more tasks in series or in parallel, depending on the number of tasks and the maximum number of CPUs.

Parameters:

funcs (Callable | list[Callable]) – The function(s) to run. Can be a list of functions or a single function that is not in a list. If a single function, then if args is a tuple, it is called once with that tuple as its positional arguments; and if args is a list of tuples, it is called for each tuple of positional arguments in args.
num_cpus (int) – Number of CPUs available. Must be ≥ 1.
pass_num_cpus (bool) – Pass the number of processes to the function(s) in funcs as the keyword argument num_cpus.
as_list (bool) – Return results as a list (if True) or an iterator (if False).
ordered (bool) – Return results in the same order as they were given in funcs and/or args (if True) or in order of completion (if False).
raise_on_error (bool) – If any task fails, then raise the exception that it raises (if True) or log that exception as an error (if False).
args (tuple | Iterable[tuple]) – Positional arguments to pass to each function in funcs. Can be a list of tuples of positional arguments or a single tuple that is not in a list. If a single tuple, then each function receives args as positional arguments. If a list, then args must be the same length as funcs; each function funcs[i] receives args[i] as positional arguments.
kwargs (dict[str, Any] | None) – Keyword arguments to pass to every function call.
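The rules for combining funcs and args can be summarized in a short serial sketch. The helper pair_funcs_and_args is hypothetical and covers only the pairing logic described above; it does not run anything in parallel or reproduce the real error handling.

```python
def pair_funcs_and_args(funcs, args=()):
    """Return a list of (function, positional arguments) pairs."""
    if callable(funcs):
        if isinstance(args, tuple):
            # One function, one tuple: a single call.
            return [(funcs, args)]
        # One function, many tuples: one call per tuple.
        return [(funcs, call_args) for call_args in args]
    if isinstance(args, tuple):
        # Many functions, one tuple: every function gets the same arguments.
        return [(func, args) for func in funcs]
    args = list(args)
    if len(args) != len(funcs):
        raise ValueError("args must be the same length as funcs")
    # Many functions, many tuples: funcs[i] receives args[i].
    return list(zip(funcs, args))
```

Running the pairs in series is then a matter of calling func(*call_args, **kwargs) for each pair.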

seismicrna.core.tmp.get_release_working_dirs(tmp_dir: Path)

seismicrna.core.tmp.release_to_out(out_dir: Path, release_dir: Path, initial_path: Path): Move temporary path(s) to the output directory.

seismicrna.core.tmp.with_tmp_dir(pass_keep_tmp: bool): Make a temporary directory, and delete it after returning.

seismicrna.core.types.fit_uint_size(value: int): Smallest number of bytes that will fit the value.

seismicrna.core.types.fit_uint_type(value: int): Smallest unsigned int type that will fit the value.

seismicrna.core.types.get_byte_dtype(nchars: int): NumPy byte type with the given number of characters.

seismicrna.core.types.get_dtype(code: str, size: int): NumPy type with the given code and size.

seismicrna.core.types.get_max_uint(uint_type: type): Maximum value of a NumPy unsigned integer type.

seismicrna.core.types.get_max_value(nbytes: int): Get the maximum value of an unsigned integer of N bytes.

seismicrna.core.types.get_uint_dtype(nbytes: int): NumPy uint data type with the given number of bytes.

seismicrna.core.types.get_uint_size(uint_type: type): Size of a NumPy uint type in bytes.

seismicrna.core.types.get_uint_type(nbytes: int): NumPy uint type with the given number of bytes.
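The relationship between a byte count and the largest unsigned integer it can hold is 2^(8 × nbytes) − 1. The sketch below uses hypothetical names and assumes the candidate sizes are the standard NumPy unsigned integer widths of 1, 2, 4, and 8 bytes.

```python
UINT_NBYTES = (1, 2, 4, 8)  # assumed candidate sizes, in bytes


def max_uint_value_sketch(nbytes):
    """Largest value of an unsigned integer with `nbytes` bytes."""
    return 2 ** (8 * nbytes) - 1


def fit_uint_size_sketch(value):
    """Smallest candidate size (in bytes) that can hold `value`."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for nbytes in UINT_NBYTES:
        if value <= max_uint_value_sketch(nbytes):
            return nbytes
    raise ValueError("value does not fit in 8 bytes")
```

For instance, 255 fits in 1 byte, but 256 needs 2.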

seismicrna.core.unbias.calc_n_reads_per_pos(p_ends_observed: ndarray, n_reads_per_clust: ndarray)

seismicrna.core.unbias.calc_p_clust(p_clust_observed: ndarray, p_noclose_given_clust: ndarray)

Cluster proportion among all reads.

Parameters:

p_clust_observed (np.ndarray) – Proportion of each cluster among reads with no two mutations too close. 1D (clusters)
p_noclose_given_clust (np.ndarray) – Probability that a read from each cluster would have no two mutations too close. 1D (clusters)

Returns:

Proportion of each cluster among all reads. 1D (clusters)

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_clust_given_ends_noclose(p_ends_given_clust_noclose: ndarray, p_clust_given_noclose: ndarray)

Calculate the probability that a read with each pair of 5’/3’ ends and no two mutations too close came from each cluster.

Parameters:

p_ends_given_clust_noclose (np.ndarray) – 3D (positions x positions x clusters) array of the probability that a read from each cluster has each pair of 5’/3’ ends given that it has no two mutations too close.
p_clust_given_noclose (np.ndarray) – 1D (clusters) array of the probability that a read comes from each cluster given that it has no two mutations too close.

Returns:

3D (positions x positions x clusters) array of the probability that a read with each pair of 5’/3’ ends and no two mutations too close comes from each cluster.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_clust_given_noclose(p_clust: ndarray, p_noclose_given_clust: ndarray)

Cluster proportions among reads with no two mutations too close.

Parameters:

p_clust (np.ndarray) – Proportion of each cluster among all reads. 1D (clusters)
p_noclose_given_clust (np.ndarray) – Probability that a read from each cluster would have no two mutations too close. 1D (clusters)

Returns:

Proportion of each cluster among reads with no two mutations too close. 1D (clusters)

Return type:

np.ndarray
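The two cluster-proportion functions are related by Bayes' rule: P(cluster | no-close) is proportional to P(cluster) × P(no-close | cluster), and the reverse direction divides instead of multiplies. The sketch below uses hypothetical names and plain lists, and is not the library's implementation.

```python
def p_clust_given_noclose_sketch(p_clust, p_noclose_given_clust):
    """Cluster proportions among reads with no two mutations too close."""
    joint = [pc * pn for pc, pn in zip(p_clust, p_noclose_given_clust)]
    # The normalizer is the overall probability that a read has
    # no two mutations too close.
    total = sum(joint)
    return [j / total for j in joint]


def p_clust_sketch(p_clust_observed, p_noclose_given_clust):
    """Cluster proportions among all reads (the reverse direction)."""
    ratio = [po / pn for po, pn in zip(p_clust_observed, p_noclose_given_clust)]
    total = sum(ratio)
    return [r / total for r in ratio]
```

With two equally sized clusters and no-close probabilities of 0.9 and 0.6, the observed proportions are 0.6 and 0.4; applying the reverse function recovers 0.5 and 0.5.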

seismicrna.core.unbias.calc_p_ends(p_ends_observed: ndarray, p_noclose_given_ends: ndarray, p_mut_given_span: ndarray, p_clust: ndarray)

Calculate the proportion of total reads with each pair of 5’ and 3’ coordinates.

This function is meant to be called by another function that has validated the arguments; hence, this function makes assumptions:

Every value in the upper triangle of p_ends_observed is ≥ 0 and ≤ 1; no values below the main diagonal are used.
The upper triangle of p_ends_observed sums to 1.
Every value in p_mut_given_span is ≥ 0 and ≤ 1.

Parameters:

p_ends_observed (np.ndarray) – 3D (positions x positions x clusters) array of the proportion of observed reads in each cluster beginning at the row position and ending at the column position.
p_noclose_given_ends (np.ndarray) – 3D (positions x positions x clusters) array of the probabilities that a read with 5’ and 3’ coordinates corresponding to the row and column would have no two mutations too close.
p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the total mutation rate at each position in each cluster.
p_clust (np.ndarray) – 1D (clusters) array of the proportion of each cluster.

Returns:

2D (positions x positions) array of the proportion of reads beginning at the row position and ending at the column position. This array is assumed to be identical for all clusters.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_ends_given_clust_noclose(p_ends: ndarray, p_noclose_given_ends: ndarray)

Calculate the proportion of reads with no two mutations too close with each pair of 5’ and 3’ coordinates.

Assumptions

p_ends has 2 dimensions: (positions x positions)
Every value in the upper triangle of p_ends is ≥ 0 and ≤ 1; no values below the main diagonal are used.
The upper triangle of p_ends sums to 1.
min_gap is a non-negative integer.
p_mut_given_span has 2 dimensions: (positions x clusters)
Every value in p_mut_given_span is ≥ 0 and ≤ 1.
There is at least 1 cluster.

Parameters:

p_ends (np.ndarray) – 2D (positions x positions) array of the proportion of reads in each cluster beginning at the row position and ending at the column position.
p_noclose_given_ends (np.ndarray) – 3D (positions x positions x clusters) array of the probabilities that a read with 5’ and 3’ coordinates corresponding to the row and column would have no two mutations too close.

Returns:

3D (positions x positions x clusters) array of the proportion of reads without mutations too close, beginning at the row position and ending at the column position, in each cluster.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_ends_given_noclose(p_ends_given_clust_noclose: ndarray, p_clust_given_noclose: ndarray)

Calculate the probability that a read would have each pair of 5’/3’ ends and no two mutations too close.

Parameters:

p_ends_given_clust_noclose (np.ndarray) – 3D (positions x positions x clusters) array of the probability that a read from each cluster has each pair of 5’/3’ ends given that it has no two mutations too close.
p_clust_given_noclose (np.ndarray) – 1D (clusters) array of the probability that a read comes from each cluster given that it has no two mutations too close.

Returns:

2D (positions x positions) array of the probability that a read with no two mutations too close has each pair of 5’/3’ ends, regardless of the cluster to which it belongs.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_ends_observed(npos: int, end5s: ndarray, end3s: ndarray, weights: ndarray | None = None, check_values: bool = True)

Calculate the proportion of each pair of 5’/3’ end coordinates observed in end5s and end3s, optionally weighted by weights.

Parameters:

npos (int) – Number of positions.
end5s (np.ndarray) – 5’ ends (0-indexed) of the reads: 1D array (reads)
end3s (np.ndarray) – 3’ ends (0-indexed) of the reads: 1D array (reads)
weights (np.ndarray | None) – Number of times each read occurs in each cluster: 2D array (reads x clusters)
check_values (bool) – Check that end5s, end3s, and weights are all valid.

Returns:

Fraction of reads with each 5’ (row) and 3’ (column) coordinate: 3D array (positions x positions x clusters)

Return type:

np.ndarray
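A simplified, unweighted sketch of the tabulation: count how many reads have each (5’ end, 3’ end) pair, then divide by the number of reads. The function name is hypothetical; the result here is a 2D nested list, whereas the function described above returns a 3D array with an extra cluster axis and supports weights.

```python
def p_ends_observed_sketch(npos, end5s, end3s):
    """Fraction of reads with each 5' (row) and 3' (column) coordinate."""
    counts = [[0] * npos for _ in range(npos)]
    for end5, end3 in zip(end5s, end3s):
        # Coordinates are 0-indexed, so they index the table directly.
        counts[end5][end3] += 1
    n_reads = len(end5s)
    return [[count / n_reads for count in row] for row in counts]
```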

seismicrna.core.unbias.calc_p_mut_given_span(p_mut_given_span_observed: ndarray, min_gap: int, p_ends: ndarray, init_p_mut_given_span: ndarray, *, quick_unbias: bool = True, quick_unbias_thresh: float = 0.0, f_tol: float = 0.0001, x_rtol: float = 0.001): Calculate the underlying mutation rates, including those of reads with two mutations too close, from the observed mutation rates.

seismicrna.core.unbias.calc_p_mut_given_span_noclose(p_mut_given_span: ndarray, p_ends: ndarray, p_noclose_given_ends: ndarray, p_nomut_window: ndarray)

Calculate the mutation rates of only reads with no two mutations too close that span each position.

Parameters:

p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the underlying mutation rates (i.e. the probability that a read has a mutation at position (j) given that it contains that position).
p_ends (np.ndarray) – 2D (positions x positions) array of the proportion of reads in each cluster beginning at the row position and ending at the column position.
p_noclose_given_ends (np.ndarray) – 3D (positions x positions x clusters) array of the probabilities that a read with 5’ and 3’ coordinates corresponding to the row and column would have no two mutations too close.
p_nomut_window (np.ndarray) – 3D (window x positions x clusters) array of the probability that (window) consecutive bases, ending at position (position), would have no mutations.

Returns:

2D (positions x clusters) array of the mutation rate among reads with no two mutations too close per position per cluster.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_noclose(p_clust: ndarray, p_noclose_given_clust: ndarray)

Probability that any read would have no two mutations too close.

Parameters:

p_clust (np.ndarray) – Proportion of each cluster among all reads. 1D (clusters)
p_noclose_given_clust (np.ndarray) – Probability that a read from each cluster would have no two mutations too close. 1D (clusters)

Returns:

Probability that any read would have no two mutations too close.

Return type:

float

seismicrna.core.unbias.calc_p_noclose_given_clust(p_ends: ndarray, p_noclose_given_ends: ndarray): Calculate the probability that a read from each cluster would have no two mutations too close.

seismicrna.core.unbias.calc_p_noclose_given_ends(p_mut_given_span: ndarray, p_nomut_window: ndarray)

Given underlying mutation rates (p_mut_given_span), calculate the probability that a read starting at position (a) and ending at position (b) would have no two mutations too close, for each (a) and (b) where 1 ≤ a ≤ b ≤ L (biological coordinates) or 0 ≤ a ≤ b < L (Python coordinates).

Parameters:

p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the underlying mutation rates (i.e. the probability that a read has a mutation at position (j) given that it contains that position).
p_nomut_window (np.ndarray) – 3D (window x positions x clusters) array of the probability that (window) consecutive bases, ending at position (position), would have no mutations.

Returns:

3D (positions x positions x clusters) array of the probability that a random read starting at position (a) (row) and ending at position (b) (column) would have no two mutations too close.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_noclose_given_ends_auto(p_mut_given_span: ndarray, min_gap: int)

Given underlying mutation rates (p_mut_given_span), calculate the probability that a read starting at position (a) and ending at position (b) would have no two mutations too close (i.e. separated by fewer than min_gap non-mutated positions), for each combination of (a) and (b) such that 1 ≤ a ≤ b ≤ L (in biological coordinates) or 0 ≤ a ≤ b < L (in Python coordinates).

Parameters:

p_mut_given_span (np.ndarray) – A 2D (positions x clusters) array of the underlying mutation rates, i.e. the probability that a read has a mutation at position (j) given that it contains position (j).
min_gap (int) – Minimum number of non-mutated bases between two mutations; must be ≥ 0.

Returns:

3D (positions x positions x clusters) array of the probability that a random read starting at position (a) (row) and ending at position (b) (column) would have no two mutations too close.

Return type:

np.ndarray
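The quantity being computed can be checked by brute force on a very short read: enumerate every mutation pattern, discard those with two mutations separated by fewer than min_gap non-mutated positions, and add up the probabilities of the rest. This sketch (with a hypothetical name, one cluster, and one fixed pair of ends) is exponential in the read length and is only meant to illustrate the definition, not the algorithm the library uses.

```python
from itertools import product


def p_noclose_brute_force(p_mut, min_gap):
    """P(no two mutations too close) for a read covering all of p_mut."""
    total = 0.0
    for pattern in product((False, True), repeat=len(p_mut)):
        mutated = [i for i, is_mut in enumerate(pattern) if is_mut]
        # Consecutive mutations at i < j have j - i - 1 positions between
        # them, so they are too close when j - i <= min_gap.
        if any(j - i <= min_gap for i, j in zip(mutated, mutated[1:])):
            continue
        prob = 1.0
        for p, is_mut in zip(p_mut, pattern):
            prob *= p if is_mut else 1 - p
        total += prob
    return total
```

With min_gap = 0 no pattern is excluded, so the result is 1. With two positions mutated at rate 0.5 each and min_gap = 1, only the doubly mutated pattern is excluded, giving 0.75.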

seismicrna.core.unbias.calc_p_nomut_window(p_mut_given_span: ndarray, min_gap: int)

Given underlying mutation rates (p_mut_given_span), find the probability of no mutations in each window of size 0 to min_gap.

Parameters:

p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the underlying mutation rates (i.e. the probability that a read has a mutation at position (j) given that it contains that position).
min_gap (int) – Minimum number of non-mutated bases between two mutations.

Returns:

3D (window x positions + 1 x clusters) array of the probability that (window) consecutive bases, ending at position (position), would have no mutations.

Return type:

np.ndarray

seismicrna.core.unbias.calc_params(p_mut_given_span_observed: ndarray, p_ends_observed: ndarray, p_clust_observed: ndarray, min_gap: int, guess_p_mut_given_span: ndarray | None = None, guess_p_ends: ndarray | None = None, guess_p_clust: ndarray | None = None, *, prenormalize: bool = True, max_iter: int = 128, convergence_thresh: float = 0.0001, **kwargs)

Calculate the three sets of parameters based on observed data.

Parameters:

p_mut_given_span_observed (np.ndarray) – Observed probability that each position is mutated given that no two mutations are too close: 2D array (positions x clusters)
p_ends_observed (np.ndarray) – Observed proportion of reads aligned with each pair of 5’ and 3’ end coordinates given that no two mutations are too close: 3D array (positions x positions x clusters)
p_clust_observed (np.ndarray) – Observed proportion of reads in each cluster given that no two mutations are too close: 1D array (clusters)
min_gap (int) – Minimum number of non-mutated bases between two mutations. Must be a non-negative integer.
guess_p_mut_given_span (np.ndarray | None = None) – Initial guess for the probability that each position is mutated. If given, must be a 2D array (positions x clusters); defaults to p_mut_given_span_observed.
guess_p_ends (np.ndarray | None = None) – Initial guess for the proportion of total reads aligned to each pair of 5’ and 3’ end coordinates. If given, must be a 2D array (positions x positions); defaults to p_ends_observed.
guess_p_clust (np.ndarray | None = None) – Initial guess for the proportion of total reads in each cluster. If given, must be a 1D array (clusters); defaults to p_clust_observed.
prenormalize (bool = True) – Fill missing values in guess_p_mut_given_span, guess_p_ends, and guess_p_clust, and clip every value to be ≥ 0 and ≤ 1. Ensure the proportions in guess_p_clust and the upper triangle of guess_p_ends sum to 1.
max_iter (int = 128) – Maximum number of iterations in which to refine the parameters.
convergence_thresh (float = 1.e-4) – Convergence threshold based on the root-mean-square difference in mutation rates between consecutive iterations.
**kwargs – Additional keyword arguments for _calc_p_mut_given_span.
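
The roles of max_iter and convergence_thresh can be illustrated with a minimal sketch of an iterative refinement loop. It is not the implementation of calc_params: the helper refine_until_converged and the toy update function are hypothetical, and the sketch only assumes that each iteration produces new mutation rates and that iteration stops once their root-mean-square difference from the previous iteration falls to the threshold or below.

```python
import numpy as np


def refine_until_converged(update, guess, max_iter=128, convergence_thresh=1.e-4):
    """Hypothetical helper: apply `update` repeatedly until the
    root-mean-square change between iterations is small enough."""
    current = np.asarray(guess, dtype=float)
    for n_iter in range(1, max_iter + 1):
        refined = update(current)
        # Root-mean-square difference between consecutive iterations.
        rmsd = np.sqrt(np.mean(np.square(refined - current)))
        current = refined
        if rmsd <= convergence_thresh:
            return current, n_iter, True
    # The iteration limit was reached without converging.
    return current, max_iter, False


# Toy update (an assumption for illustration) whose fixed point is 0.2.
result, n_iter, converged = refine_until_converged(lambda p: 0.5 * p + 0.1,
                                                   np.array([0.9, 0.0]))
```

With the default threshold, this toy update converges well before the limit of 128 iterations; lowering max_iter far enough makes the loop stop early and report that it did not converge.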

seismicrna.core.unbias.calc_params_observed(n_pos_total: int, unmasked_pos: Iterable[int], muts_per_pos: Iterable[ndarray], end5s: ndarray, end3s: ndarray, counts_per_uniq: ndarray, resps: ndarray)

Calculate the observed estimates of the parameters.

Parameters:

n_pos_total (int) – Total number of positions in the region.
unmasked_pos (Iterable[int]) – Unmasked positions; must be 0-indexed with respect to the 5’ end of the region.
muts_per_pos (Iterable[np.ndarray]) – For each unmasked position, numbers of all reads with a mutation at that position.
end5s (np.ndarray) – 5’ end of every unique read; must be 0-indexed with respect to the 5’ end of the region.
end3s (np.ndarray) – 3’ end of every unique read; must be 0-indexed with respect to the 5’ end of the region.
counts_per_uniq (np.ndarray) – Number of times each unique read occurs.
resps (np.ndarray) – Cluster memberships of each read: 2D array (reads x clusters)

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]

seismicrna.core.unbias.calc_rectangular_sum(array: ndarray)

For each element of the main diagonal, calculate the sum over the rectangular array from that element to the upper right corner. This function is meant to be called by another function that has validated the arguments; hence, this function makes assumptions:

array has at least 2 dimensions.
The first and second dimensions of array have equal lengths.

Parameters:

array (np.ndarray) – Array of at least two dimensions for which to calculate the sum of each rectangular array from each element on the main diagonal to the upper right corner.

Returns:

Array with all but the first dimension of array indicating the sum of the array from each element on the main diagonal to the upper right corner of array.

Return type:

np.ndarray
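
A naive sketch of the rectangular sum may make the description concrete. It assumes that the "upper right corner" is the element in the first row and last column, so the rectangle for diagonal element i spans rows 0 to i and columns i to the end. The name rectangular_sum_sketch is hypothetical, and the actual function may compute the result differently (for example, with cumulative sums).

```python
import numpy as np


def rectangular_sum_sketch(array):
    """Hypothetical, loop-based illustration of the rectangular sum."""
    n = array.shape[0]
    # The result keeps all but the first dimension of the input.
    result = np.empty(array.shape[1:], dtype=float)
    for i in range(n):
        # Assumed rectangle: rows 0..i and columns i..n-1.
        result[i] = array[:i + 1, i:].sum(axis=(0, 1))
    return result


square = np.arange(9).reshape(3, 3)
sums = rectangular_sum_sketch(square)  # array([ 3., 12., 15.])
```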

seismicrna.core.unbias.require_same_square_atleast2d(a: ndarray, b: ndarray)

Require a and b each to be a NumPy ndarray with ≥ 2 dimensions, and the first and second dimensions of a and b all to be of equal length.

seismicrna.core.unbias.require_square_atleast2d(name: str, array: ndarray)

Require array to be a NumPy ndarray with ≥ 2 dimensions whose first and second dimensions are of equal length.

seismicrna.core.unbias.triu_allclose(a: ndarray | float, b: ndarray | float, rtol: float = 0.001, atol: float = 1e-06)

Whether the upper triangles of a and b are all close.

Parameters:

a (np.ndarray | float) – Array 1.
b (np.ndarray | float) – Array 2.
rtol (float = 1.0e-3) – Relative tolerance.
atol (float = 1.0e-6) – Absolute tolerance.

Returns:

Whether all elements of the upper triangles of a and b are close using the function np.allclose.

Return type:

bool
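
As a rough sketch of the comparison described above, the upper triangles can be selected with np.triu_indices and compared with np.allclose. The name triu_allclose_sketch is hypothetical, and the sketch assumes that at least one argument is an array whose first two dimensions are equal; a scalar argument is broadcast to the shape of the other.

```python
import numpy as np


def triu_allclose_sketch(a, b, rtol=1.e-3, atol=1.e-6):
    """Hypothetical illustration: compare only the upper triangles."""
    # Broadcast so that a scalar can be compared against an array.
    a, b = np.broadcast_arrays(np.asarray(a), np.asarray(b))
    rows, cols = np.triu_indices(a.shape[0])
    return bool(np.allclose(a[rows, cols], b[rows, cols],
                            rtol=rtol, atol=atol))


x = np.array([[1., 2.], [3., 4.]])
y = np.array([[1., 2.], [99., 4.]])
same_upper = triu_allclose_sketch(x, y)      # True: only the lower triangle differs
diff_upper = triu_allclose_sketch(x, x.T)    # False: element (0, 1) is 2 vs. 3
```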

seismicrna.core.unbias.triu_dot(a: ndarray, b: ndarray)

Dot product of a and b over their first 2 dimensions.

Parameters:

a (np.ndarray) – Array 1.
b (np.ndarray) – Array 2.

Returns:

Dot product of a and b over their first 2 dimensions.

Return type:

np.ndarray

seismicrna.core.unbias.triu_log(a: ndarray)

Calculate the logarithm of the upper triangle(s) of array a. In the result, elements below the main diagonal are undefined.

Parameters:

a (np.ndarray) – Array (≥ 2 dimensions) of whose upper triangle to compute the logarithm; the first 2 dimensions must have equal lengths.

Returns:

Logarithm of the upper triangle(s) of a.

Return type:

np.ndarray
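
One way to picture this behavior is the sketch below, which takes the logarithm only at upper-triangle positions. The name triu_log_sketch is hypothetical, and filling the elements below the main diagonal with NaN is a choice made here for illustration; the documentation only states that those elements are undefined.

```python
import numpy as np


def triu_log_sketch(a):
    """Hypothetical illustration: log of the upper triangle only."""
    a = np.asarray(a, dtype=float)
    # Elements below the diagonal are left as NaN in this sketch.
    result = np.full_like(a, np.nan)
    rows, cols = np.triu_indices(a.shape[0])
    result[rows, cols] = np.log(a[rows, cols])
    return result


logs = triu_log_sketch([[1.0, np.e], [0.0, 1.0]])
```

Because only the upper triangle is passed to np.log, the zero below the diagonal in this example does not trigger a divide-by-zero warning.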

seismicrna.core.unbias.triu_sum(a: ndarray)

Calculate the sum over the upper triangle(s) of array a.

Parameters:

a (np.ndarray) – Array whose upper triangle to sum.

Returns:

Sum of the upper triangle(s), with the same shape as the third and subsequent dimensions of a.

Return type:

np.ndarray
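
The shape of the result can be seen in a short sketch that sums over the upper-triangle positions of the first two dimensions. The name triu_sum_sketch is hypothetical and the code is only an illustration of the documented behavior, not the library's implementation.

```python
import numpy as np


def triu_sum_sketch(a):
    """Hypothetical illustration: sum over the upper triangle(s)."""
    a = np.asarray(a)
    rows, cols = np.triu_indices(a.shape[0])
    # Indexing flattens the first two dimensions into one; summing
    # over it leaves the third and subsequent dimensions.
    return a[rows, cols].sum(axis=0)


stack = np.arange(8).reshape(2, 2, 2)
upper_sums = triu_sum_sketch(stack)  # array([ 8, 11])
```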

seismicrna.core.validate.require_allclose(name: str, array: Any, other_array: Any, other_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that array ≈ other_array.

seismicrna.core.validate.require_array_equal(name: str, array: Any, other_array: Any, other_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that array = other_array.

seismicrna.core.validate.require_atleast(name: str, value: Any, minimum_value: Any, minimum_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that value ≥ minimum_value.

seismicrna.core.validate.require_atmost(name: str, value: Any, maximum_value: Any, maximum_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that value ≤ maximum_value.

seismicrna.core.validate.require_between(name: str, value: Any, minimum_value: Any | None, maximum_value: Any | None, minimum_name: str = '', maximum_name: str = '', inclusive: bool = True, classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that value is in [minimum_value, maximum_value] if inclusive is True, otherwise in (minimum_value, maximum_value).

seismicrna.core.validate.require_equal(name: str, value: Any, other_value: Any, other_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that value = other_value.

seismicrna.core.validate.require_fraction(name: str, value: Any, classes: type | tuple[type | tuple[Any, ...], ...] = (float, int), error_type: Type[ValueError] = ValueError)

Require that 0 ≤ value ≤ 1.

seismicrna.core.validate.require_greater(name: str, value: Any, other_value: Any, other_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that value > other_value.

seismicrna.core.validate.require_index_equals(name: str, index: Index, other_index: Index, other_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = Index, error_type: Type[ValueError] = ValueError)

Require that index = other_index.

seismicrna.core.validate.require_isin(name: str, value: Any, values: Container, values_name: str = '', error_type: Type[Exception] = ValueError)

Require value to be in values.

seismicrna.core.validate.require_isinstance(name: str, value: Any, classes: type | tuple[type | tuple[Any, ...], ...], error_type: Type[TypeError] = TypeError) → None

Raise an error if value is not an instance of classes.

seismicrna.core.validate.require_issubclass(name: str, value: type, classes: type | tuple[type | tuple[Any, ...], ...], error_type: Type[ValueError] = ValueError) → None

Raise an error if value is not a subclass of classes.

seismicrna.core.validate.require_less(name: str, value: Any, other_value: Any, other_name: str = '', classes: type | tuple[type | tuple[Any, ...], ...] = object, error_type: Type[ValueError] = ValueError)

Require that value < other_value.
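
The validators in this module share a common pattern: check a condition and raise error_type if it fails. As an illustration, the interval check described for require_between could be sketched as follows. The name require_between_sketch is hypothetical, the type check against classes is omitted, and treating a bound of None as "no bound" is an assumption suggested by the signature rather than documented behavior.

```python
def require_between_sketch(name, value, minimum_value, maximum_value,
                           inclusive=True, error_type=ValueError):
    """Hypothetical illustration of an interval check."""
    # Assumption: a bound of None means that side is unbounded.
    if minimum_value is not None:
        ok = value >= minimum_value if inclusive else value > minimum_value
        if not ok:
            raise error_type(f"{name} must be {'≥' if inclusive else '>'} "
                             f"{minimum_value}, but got {value}")
    if maximum_value is not None:
        ok = value <= maximum_value if inclusive else value < maximum_value
        if not ok:
            raise error_type(f"{name} must be {'≤' if inclusive else '<'} "
                             f"{maximum_value}, but got {value}")


require_between_sketch("fraction", 1.0, 0.0, 1.0)  # passes: closed interval
try:
    require_between_sketch("fraction", 1.0, 0.0, 1.0, inclusive=False)
    open_interval_rejected = False
except ValueError:
    open_interval_rejected = True  # 1.0 is outside the open interval (0, 1)
```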

seismicrna.core.version.format_version(major: int = 0, minor: int = 24, patch: int = 3, prtag: str = 'dev')

seismicrna.core.version.parse_version(version: str = '0.24.3dev')

Parse a version string into its major and minor versions, patch, and pre-release tag.
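
The relationship between the two version functions can be sketched as a round trip between a string such as '0.24.3dev' and its four components. The names parse_version_sketch and format_version_sketch, the regular expression, and the order of the returned values are assumptions for illustration; the actual functions may accept a different grammar for the pre-release tag.

```python
import re

# Assumed grammar: MAJOR.MINOR.PATCH followed by an optional tag
# made of lowercase letters and optional trailing digits.
VERSION_PATTERN = re.compile(r"^([0-9]+)\.([0-9]+)\.([0-9]+)([a-z]*[0-9]*)$")


def parse_version_sketch(version):
    """Hypothetical illustration: split a version string into parts."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"Malformatted version: {version!r}")
    major, minor, patch, prtag = match.groups()
    return int(major), int(minor), int(patch), prtag


def format_version_sketch(major, minor, patch, prtag=""):
    """Hypothetical illustration: join the parts back into a string."""
    return f"{major}.{minor}.{patch}{prtag}"


parts = parse_version_sketch("0.24.3dev")   # (0, 24, 3, 'dev')
text = format_version_sketch(*parts)        # '0.24.3dev'
```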

seismicrna.core.write.need_write(query: str | Path, force: bool = False, warn: bool = True)

Determine whether a file/directory must be written.

Parameters:

query (str | Path) – File or directory for which to check the need for writing.
force (bool = False) – Force the query to be written, even if it already exists.
warn (bool = True) – If the query does not need to be written, then log a warning.

Returns:

Whether the file or directory must be written.

Return type:

bool
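
A minimal sketch of the decision described above, assuming that a query needs to be written when force is True or when the path does not yet exist. The name need_write_sketch, the logger, and the wording of the warning are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def need_write_sketch(query, force=False, warn=True):
    """Hypothetical illustration of the need-to-write decision."""
    if force or not Path(query).exists():
        return True
    if warn:
        # The path exists and will not be overwritten.
        logger.warning("%s already exists and will not be rewritten", query)
    return False


# The current directory exists, so it needs writing only if forced.
skip_existing = need_write_sketch(".", warn=False)   # False
forced = need_write_sketch(".", force=True)          # True
```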

seismicrna.core.write.write_mode(force: bool = False, binary: bool = False)

Get the mode in which to open a file for writing.

Parameters:

force (bool = False) – Force the file to be written, truncating the file if it exists. If False and the file exists, a FileExistsError will be raised.
binary (bool = False) – Write the file in binary mode instead of text mode.

Returns:

The mode argument for the builtin function open().

Return type:

str
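
The documented behavior maps naturally onto the mode strings of the builtin open(): 'w' truncates an existing file, 'x' raises FileExistsError if the file exists, and a trailing 'b' selects binary mode. The sketch below shows one way to build such a string; write_mode_sketch is a hypothetical name, and the exact strings returned by write_mode are an assumption.

```python
def write_mode_sketch(force=False, binary=False):
    """Hypothetical illustration: build a mode string for open()."""
    # 'w' truncates an existing file; 'x' fails if the file exists.
    mode = "w" if force else "x"
    if binary:
        mode += "b"
    return mode


exclusive_text = write_mode_sketch()                        # 'x'
forced_binary = write_mode_sketch(force=True, binary=True)  # 'wb'
```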