seismicrna.core package

Subpackages

Submodules

seismicrna.core.array.calc_inverse(target: ndarray, require: int = -1, fill: bool = False, fill_rev: bool = False, fill_default: int | None = None, verify: bool = True, what: str = 'array')

Calculate the inverse of target, such that if element i of target has value x, then element x of the inverse has value i.

>>> list(calc_inverse(np.array([3, 2, 7, 5, 1])))
[-1, 4, 1, 0, -1, 3, -1, 2]
>>> list(calc_inverse(np.arange(5)))
[0, 1, 2, 3, 4]
Parameters:
  • target (np.ndarray) – Target values; must be a 1-dimensional array of non-negative integers with no duplicate values.

  • require (int = -1) – Require the inverse to contain all indexes up to and including require (i.e. that its length is at least require + 1); ignored if require is -1; must be ≥ -1.

  • fill (bool = False) – Fill missing indexes (that do not appear in target).

  • fill_rev (bool = False) – Fill missing indexes in reverse order instead of forward order; only used if fill is True.

  • fill_default (int | None = None) – Value with which to fill before the first non-missing value has been encountered; if fill_rev is True, defaults to the length of target, otherwise to -1.

  • verify (bool = True) – Verify that all target values are unique, non-negative integers. If they are not, then ValueError will be raised if verify is True, and the results of this function will be incorrect if verify is False. Set verify to False only if you have already verified that target contains unique, non-negative integers.

  • what (str = "array") – What to name the array (only used for error messages).

Returns:

Inverse of target.

Return type:

np.ndarray
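
The definition above can be sketched in pure Python as follows. This illustrates only the default behavior (no fill, no require) and is not the library's NumPy implementation; the name inverse_sketch is hypothetical.

```python
def inverse_sketch(target: list[int]) -> list[int]:
    """Hypothetical pure-Python sketch of calc_inverse without filling."""
    if any(x < 0 for x in target):
        raise ValueError("target must contain only non-negative integers")
    if len(set(target)) != len(target):
        raise ValueError("target must contain no duplicate values")
    # Indexes that never appear in target keep the placeholder -1.
    inverse = [-1] * (max(target) + 1 if target else 0)
    for i, x in enumerate(target):
        # Element i of target has value x, so element x of the inverse is i.
        inverse[x] = i
    return inverse


print(inverse_sketch([3, 2, 7, 5, 1]))  # [-1, 4, 1, 0, -1, 3, -1, 2]
```

The printed list matches the first doctest above: indexes 0, 4, and 6 do not occur in target, so they remain -1.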

seismicrna.core.array.check_naturals(values: ndarray, what: str = 'values')

Raise ValueError if the values are not monotonically increasing natural numbers.

seismicrna.core.array.ensure_order(array1: ndarray, array2: ndarray, what1: str = 'array1', what2: str = 'array2', gt_eq: bool = False)

Ensure that array1 is ≤ or ≥ array2, element-wise.

Parameters:
  • array1 (np.ndarray) – Array 1 (same length as array2).

  • array2 (np.ndarray) – Array 2 (same length as array1).

  • what1 (str = "array1") – What array1 contains (only used for error messages).

  • what2 (str = "array2") – What array2 contains (only used for error messages).

  • gt_eq (bool = False) – Ensure array1 ≥ array2 if True, otherwise array1 ≤ array2.

Returns:

Shared length of array1 and array2.

Return type:

int
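
A minimal sketch of the check described above, written for plain Python sequences. The name ensure_order_sketch is hypothetical, and raising ValueError on a violation is an assumption: the documentation above does not name the exception type.

```python
def ensure_order_sketch(array1, array2, gt_eq=False):
    """Hypothetical sketch: check element-wise order, return the shared length."""
    if len(array1) != len(array2):
        raise ValueError(f"Lengths differ: {len(array1)} != {len(array2)}")
    for a, b in zip(array1, array2):
        # gt_eq selects which inequality every pair of elements must satisfy.
        in_order = a >= b if gt_eq else a <= b
        if not in_order:
            sign = ">=" if gt_eq else "<="
            # Assumption: the real function may raise a different exception.
            raise ValueError(f"Expected {a} {sign} {b}")
    return len(array1)


print(ensure_order_sketch([1, 2, 3], [1, 3, 3]))  # 3
```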

seismicrna.core.array.ensure_same_length(arr1: ndarray, arr2: ndarray, what1: str = 'array1', what2: str = 'array2')

seismicrna.core.array.find_dims(dims: Sequence[Sequence[str | None]], arrays: Sequence[ndarray], names: Sequence[str] | None = None, nonzero: Iterable[str] | bool = False)

Check the dimensions of the arrays.

seismicrna.core.array.get_length(array: ndarray, what: str = 'array') → int

seismicrna.core.array.list_naturals(n: int)

List natural numbers up to and including n.

seismicrna.core.array.locate_elements(collection: ndarray, *elements: ndarray, what: str = 'collection', verify: bool = True)

Find the index at which each element of elements occurs in collection.

>>> list(locate_elements(np.array([4, 1, 2, 7, 5, 3]), np.array([5, 2, 5])))
[4, 2, 4]
Parameters:
  • collection (np.ndarray) – Collection in which to find each element in elements; must be a 1-dimensional array of non-negative integers with no duplicate values.

  • *elements (np.ndarray) – Elements to find; must be a 1-dimensional array that is a subset of collection, although duplicate values are permitted.

  • what (str = "collection") – What to name the collection (only used for error messages).

  • verify (bool = True) – Verify that all values in collection are unique, non-negative integers and that all items in elements are in collection.

Returns:

Index of each element of elements in collection.

Return type:

np.ndarray
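
The lookup can be sketched in pure Python with a dictionary that maps each value to its position. This is an illustration of the result, not the library's NumPy implementation; locate_sketch is a hypothetical name.

```python
def locate_sketch(collection, elements):
    """Hypothetical sketch: index in collection of each item of elements."""
    # Each value maps to one position because collection has no duplicates.
    position = {value: index for index, value in enumerate(collection)}
    # A KeyError here would mean that an element is not in collection.
    return [position[element] for element in elements]


print(locate_sketch([4, 1, 2, 7, 5, 3], [5, 2, 5]))  # [4, 2, 4]
```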

seismicrna.core.array.sanitize_values(values: Iterable[int], lower_limit: int, upper_limit: int, whats: str = 'values')

Validate and sort values, and return them as an array.

seismicrna.core.array.triangular(n: int)

The nth triangular number (n ≥ 0): number of items in an equilateral triangle with n items on each side.

Parameters:

n (int) – Index of the triangular number to return; equivalently, the side length of the equilateral triangle.

Returns:

The triangular number with index n; equivalently, the number of items in the equilateral triangle of side length n.

Return type:

int
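
For reference, the nth triangular number equals n(n + 1) / 2. A minimal sketch using that closed form (triangular_sketch is a hypothetical name; the library may compute the value differently):

```python
def triangular_sketch(n: int) -> int:
    """Hypothetical sketch: the nth triangular number via n(n + 1) / 2."""
    if n < 0:
        raise ValueError(f"n must be >= 0, but got {n}")
    # One of n and n + 1 is even, so integer division is exact.
    return n * (n + 1) // 2


print([triangular_sketch(n) for n in range(5)])  # [0, 1, 3, 6, 10]
```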

class seismicrna.core.dataset.Dataset(report_file: Path, verify_times: bool = True)

Bases: ABC

Dataset comprising batches of data.

property batch_nums

Numbers of the batches.

property best_k: int

Best number of clusters.

abstract property data_dirs: list[Path]

All directories containing data for the dataset.

property dir: Path

Directory containing the dataset.

abstract get_batch(batch_num: int) → ReadBatch

Get a specific batch of data.

abstract classmethod get_report_type() → type[Report]

Type of report.

iter_batches()

Yield each batch.

property ks: list[int]

Numbers of clusters.

Make links to a dataset in a temporary directory.

abstract property num_batches: int

Number of batches.

property num_reads

Number of reads in the dataset.

abstract property pattern: RelPattern | None

Pattern of mutations to count.

property ref: str

Name of the reference.

property sample: str

Name of the sample.

abstract property timestamp: datetime

Time at which the data were written.

property top: Path

Top-level directory of the dataset.

exception seismicrna.core.dataset.FailedToLoadDatasetError

Bases: RuntimeError

A batch failed to load.

class seismicrna.core.dataset.LoadFunction(data_type: type[Dataset], /, *more_types: type[Dataset])

Bases: object

Function to load a dataset.

__call__(report_file: Path, **kwargs)

Load a dataset from the report file.

is_dataset_type(dataset: Dataset)

Whether the dataset is one of the loadable types.

property report_path_auto_fields

Automatic field values of the report file path.

property report_path_seg_types

Segment types of the report file path.

class seismicrna.core.dataset.LoadedDataset(report_file: Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset created by loading directly from a Report.

property data_dirs

All directories containing data for the dataset.

get_batch(batch_num: int) → ReadBatchIO | MutsBatchIO

Get a specific batch of data.

get_batch_checksum(batch: int)

Get the checksum of a specific batch from the report.

get_batch_path(batch: int)

Get the path to a batch of a specific number.

abstract classmethod get_batch_type() → type[ReadBatchIO | MutsBatchIO]

Type of batch.

classmethod get_btype_name()

Name of the type of batch.

abstract classmethod get_report_type() → type[BatchedReport]

Type of report.

property num_batches

Number of batches.

property timestamp

Time at which the data were written.

class seismicrna.core.dataset.MergedDataset(report_file: Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset made by merging one or more constituent datasets.

property data_dirs

All directories containing data for the dataset.

property datasets: list[Dataset]

Constituent datasets that were merged.

abstract classmethod get_dataset_load_func() → LoadFunction

Function to load one constituent dataset.

property pattern

Pattern of mutations to count.

property timestamp

Time at which the data were written.

class seismicrna.core.dataset.MergedRegionDataset(report_file: Path, verify_times: bool = True)

Bases: MergedDataset, RegionDataset, ABC

property refseq

Sequence of the reference.

class seismicrna.core.dataset.MergedUnbiasDataset(*args, masked_read_nums: dict[int, list] | None = None, **kwargs)

Bases: MergedDataset, UnbiasDataset, ABC

MergedDataset with attributes for correcting observer bias.

property min_mut_gap

Minimum gap between two mutations.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

exception seismicrna.core.dataset.MissingBatchError

Bases: RuntimeError

A dataset does not have a batch of a given type and number.

exception seismicrna.core.dataset.MissingBatchTypeError

Bases: MissingBatchError

A dataset does not have a batch of a given type.

class seismicrna.core.dataset.MultistepDataset(dataset2_report_file: Path, **kwargs)

Bases: MutsDataset, ABC

Dataset made by integrating two datasets from different steps of the workflow.

property data_dirs

All directories containing data for the dataset.

get_batch(batch_num: int)

Get a specific batch of data.

abstract classmethod get_dataset1_load_func() → LoadFunction

Function to load Dataset 1.

classmethod get_dataset1_report_file(dataset2_report_file: Path)

Given the report file for Dataset 2, determine the report file for Dataset 1.

classmethod get_dataset2_load_func()

Function to load Dataset 2.

abstract classmethod get_dataset2_type() → type[RegionDataset]

Type of Dataset 2.

classmethod get_report_type()

Type of report.

classmethod load_dataset1(dataset2_report_file: Path, verify_times: bool)

Load Dataset 1.

classmethod load_dataset2(dataset2_report_file: Path, verify_times: bool)

Load Dataset 2.

property num_batches

Number of batches.

property refseq

Sequence of the reference.

property timestamp

Time at which the data were written.

class seismicrna.core.dataset.MutsDataset(report_file: Path, verify_times: bool = True)

Bases: RegionDataset, ABC

Dataset with a known region and explicit mutational data.

abstract get_batch(batch_num: int) → RegionMutsBatch

Get a specific batch of data.

get_batch_count_all(batch_num: int, **kwargs)

Calculate the counts for a specific batch of data.

iter_batches()

Yield each batch.

class seismicrna.core.dataset.RegionDataset(report_file: Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset with a known reference sequence and region.

property reflen

Length of the reference sequence.

abstract property refseq: DNA

Sequence of the reference.

property region: Region

Region of the dataset.

exception seismicrna.core.dataset.ReversedTimeStampError

Bases: RuntimeError

A dataset has a timestamp that is earlier than a dataset that should have been written before it.

class seismicrna.core.dataset.TallDataset(report_file: Path, verify_times: bool = True)

Bases: MergedDataset, ABC

Dataset made by vertically pooling other datasets from one or more samples aligned to the same reference sequence.

property datasets

Constituent datasets that were merged.

get_batch(batch_num: int)

Get a specific batch of data.

property num_batches

Number of batches.

property nums_batches: list[int]

Number of batches in each dataset in the pool.

property samples: list[str]

Names of all samples in the pool.

class seismicrna.core.dataset.UnbiasDataset(*args, masked_read_nums: dict[int, list] | None = None, **kwargs)

Bases: Dataset, ABC

Dataset with attributes for correcting observer bias.

abstract property min_mut_gap: int

Minimum gap between two mutations.

abstract property quick_unbias: bool

Use the quick heuristic for unbiasing.

abstract property quick_unbias_thresh: float

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

class seismicrna.core.dataset.WideDataset(report_file: Path, verify_times: bool = True)

Bases: MergedRegionDataset, ABC

Dataset made by horizontally joining other datasets from one or more regions of the same reference sequence.

property datasets

Constituent datasets that were merged.

get_batch(batch_num: int)

Get a specific batch of data.

property num_batches

Number of batches.

property region

Region of the dataset.

property region_names

Names of all joined regions.

class seismicrna.core.dataset.WideMutsDataset(report_file: Path, verify_times: bool = True)

Bases: WideDataset, MutsDataset, ABC

WideDataset with mutation data.

seismicrna.core.dataset.load_datasets(input_path: Iterable[str | Path], load_func: LoadFunction, **kwargs)

Yield a Dataset from each report file in input_path.

Parameters:
  • input_path (Iterable[str | Path]) – Input paths to be searched recursively for report files.

  • load_func (LoadFunction) – Function to load the dataset from each report file.

Generic Exceptions

exception seismicrna.core.error.IncompatibleValuesError

Bases: ValueError

Two or more values are individually valid, but their combination is not.

exception seismicrna.core.error.InconsistentValueError

Bases: ValueError

Two or more values differ when they should be equal.

exception seismicrna.core.error.OutOfBoundsError

Bases: ValueError

A numeric value is outside its proper bounds.

class seismicrna.core.header.ClustHeader(*, ks: Iterable[int], **kwargs)

Bases: Header

Header of clusters.

classmethod clustered()

Whether the header has clusters.

property clusts

Tracks of data: clusters for clustered data, otherwise one track of the average.

property index

Index of the header.

iter_clust_indexes()

For each cluster, yield an Index/MultiIndex of every column that is part of the cluster.

property ks

Numbers of clusters.

classmethod levels()

Levels of the index.

property signature

Signature of the header, which will generate an identical header if passed as keyword arguments to make_header.

class seismicrna.core.header.Header

Bases: ABC

Header for a table.

abstract classmethod clustered() → bool

Whether the header has clusters.

property clusts: list[tuple[int, int]]

Tracks of data: clusters for clustered data, otherwise one track of the average.

get_clust_header()

Corresponding ClustHeader.

get_rel_header()

Corresponding RelHeader.

property index: Index

Index of the header.

abstract iter_clust_indexes()

For each cluster, yield an Index/MultiIndex of every column that is part of the cluster.

abstract property ks: list[int]

Numbers of clusters.

classmethod level_keys()

Level keys of the index.

classmethod level_names()

Level names of the index.

abstract classmethod levels()

Levels of the index.

modified(**kwargs)

Return a new header with a possibly modified signature.

Parameters:

**kwargs – Keyword arguments for modifying the signature of the header. Each argument given here will be passed to make_header and override the attribute (if any) with the same name in this header’s signature. Attributes of this header’s signature that are not overridden will also be passed to make_header.

Returns:

New header with a possibly modified signature.

Return type:

Header

property names

Formatted name of each track.

classmethod num_levels()

Number of levels.

select(**kwargs) → Index

Select and return items from the header as an Index.

property signature

Signature of the header, which will generate an identical header if passed as keyword arguments to make_header.

property size

Number of items in the Header.

class seismicrna.core.header.RelClustHeader(*, ks: Iterable[int], **kwargs)

Bases: ClustHeader, RelHeader

Header of relationships and clusters.

property index

Index of the header.

class seismicrna.core.header.RelHeader(*, rels: Iterable[str], **kwargs)

Bases: Header

Header of relationships.

classmethod clustered()

Whether the header has clusters.

property clusts

Tracks of data: clusters for clustered data, otherwise one track of the average.

property index

Index of the header.

iter_clust_indexes()

For each cluster, yield an Index/MultiIndex of every column that is part of the cluster.

property ks

Numbers of clusters.

classmethod levels()

Levels of the index.

property rels

Relationships.

property signature

Signature of the header, which will generate an identical header if passed as keyword arguments to make_header.

seismicrna.core.header.deduplicate_rels(rels: Iterable)

Remove duplicate relationships while preserving their order.

Parameters:

rels (Iterable) – Relationships

Returns:

Relationships with duplicates removed, in the original order.

Return type:

list[str]
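
Order-preserving deduplication is commonly done in Python with dict.fromkeys, because dictionaries keep insertion order. The sketch below shows that idiom; it is not necessarily how deduplicate_rels is implemented, and the relationship names are placeholders.

```python
def deduplicate_sketch(rels):
    """Hypothetical sketch: drop repeats, keeping the first occurrence of each."""
    # dict keys are unique and retain the order in which they were first added.
    return list(dict.fromkeys(rels))


# The names "a", "b", and "c" are placeholders, not real relationship names.
print(deduplicate_sketch(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```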

seismicrna.core.header.format_clust_name(k: int, clust: int)

Format a pair of k and cluster numbers into a name.

Parameters:
  • k (int) – Number of clusters

  • clust (int) – Cluster number

Returns:

Name specifying k and clust, or “average” if k is 0.

Return type:

str

seismicrna.core.header.format_clust_names(clusts: Iterable[tuple[int, int]], allow_duplicates: bool = False)

Format pairs of k and clust into a list of names.

Parameters:
  • clusts (Iterable[tuple[int, int]]) – Zero or more pairs of k and cluster numbers.

  • allow_duplicates (bool = False) – Allow k and clust pairs to be duplicated.

Returns:

List of names of the pairs of k and clust.

Return type:

list[str]

Raises:

ValueError – If allow_duplicates is False and clusts has duplicates.

seismicrna.core.header.list_clusts(k: int)

List all cluster numbers for one k.

Parameters:

k (int) – Number of clusters (≥ 0)

Returns:

List of cluster numbers.

Return type:

list[int]

seismicrna.core.header.list_k_clusts(k: int)

List k and cluster numbers as 2-tuples for one k.

Parameters:

k (int) – Number of clusters (≥ 0)

Returns:

List wherein each item is a tuple of the number of clusters and the cluster number.

Return type:

list[tuple[int, int]]

seismicrna.core.header.list_ks_clusts(ks: Iterable[int])

List k and cluster numbers as 2-tuples.

Parameters:

ks (Iterable[int])

Returns:

List wherein each item is a tuple of the number of clusters and the cluster number.

Return type:

list[tuple[int, int]]

seismicrna.core.header.make_header(*, rels: Iterable[str] | None = None, ks: Iterable[int] | None = None)

Make a new Header of an appropriate type.

Parameters:
  • rels (Iterable[str] | None = None) – Relationships in the header

  • ks (Iterable[int] | None = None) – Numbers of clusters

Returns:

Header of the appropriate type.

Return type:

Header

seismicrna.core.header.parse_header(index: Index | MultiIndex)

Parse an Index into a Header of an appropriate type.

Parameters:

index (pd.Index | pd.MultiIndex) – Index to parse.

Returns:

New Header whose index is index.

Return type:

Header

seismicrna.core.header.validate_k_clust(k: int, clust: int)

Validate a pair of k and cluster numbers.

Parameters:
  • k (int) – Number of clusters

  • clust (int) – Cluster number

Returns:

If the k and cluster numbers form a valid pair.

Return type:

None

Raises:
  • TypeError – If k or clust is not an integer.

  • ValueError – If k and clust do not form a valid pair.

seismicrna.core.header.validate_ks(ks: Iterable)

Validate and sort numbers of clusters.

Parameters:

ks (Iterable) – Numbers of clusters

Returns:

Sorted numbers of clusters

Return type:

list[int]

Raises:

ValueError – If any k is not positive or is repeated.
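
The contract above (reject non-positive or repeated values, return the rest sorted) can be sketched as follows. The name validate_ks_sketch and the error messages are hypothetical; the real function may also check types.

```python
def validate_ks_sketch(ks):
    """Hypothetical sketch: validate and sort numbers of clusters."""
    seen = set()
    for k in ks:
        if k < 1:
            raise ValueError(f"k must be positive, but got {k}")
        if k in seen:
            raise ValueError(f"Duplicate k: {k}")
        seen.add(k)
    return sorted(seen)


print(validate_ks_sketch([3, 1, 2]))  # [1, 2, 3]
```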

class seismicrna.core.join.JoinMutsDataset(report_file: Path, verify_times: bool = True)

Bases: WideMutsDataset, ABC

classmethod check_batch_type(batch: MutsBatch)

Raise TypeError if the batch is the incorrect type.

abstract classmethod get_batch_type() type[MutsBatch]

Type of batch.

property min_mut_gap

abstract classmethod name_batch_attrs() → list[str]

Name the attributes of each batch.

class seismicrna.core.join.JoinReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, RegIO, ABC

Report for a joined dataset.

class seismicrna.core.logs.AnsiCode

Bases: object

Format text with ANSI codes.

BOLD = 1
END = 'm'
RESET = 0
START = '\x1b['
classmethod format(code: int)

Make a format string for one ANSI code.

classmethod format_color(color: int)

Make a format string for one 256-color code.

classmethod reset()

Convenience function to end formatting.
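
The constants above suggest that a format string is assembled as START + code + END. The sketch below shows that assembly with standalone functions. The 256-color form uses the standard "38;5;N" foreground sequence, which is an assumption about what format_color emits, and the function names are hypothetical.

```python
START = "\x1b["  # Same value as AnsiCode.START.
END = "m"        # Same value as AnsiCode.END.
BOLD = 1
RESET = 0


def format_code(code: int) -> str:
    """Hypothetical sketch: escape sequence for one ANSI code."""
    return f"{START}{code}{END}"


def format_color_code(color: int) -> str:
    """Hypothetical sketch: escape sequence for one 256-color code."""
    # Assumption: 38;5;N selects foreground color N of the 256-color palette.
    return f"{START}38;5;{color}{END}"


# Wrap text in bold, then reset so that later text is unformatted.
bold_text = f"{format_code(BOLD)}warning{format_code(RESET)}"
print(repr(bold_text))  # '\x1b[1mwarning\x1b[0m'
```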

class seismicrna.core.logs.ConsoleStream(filterer: Filterer, formatter: Formatter)

Bases: Stream

Log to the console’s stderr stream.

filterer
formatter
property stream

Text stream to which messages will be logged after filtering and formatting.

class seismicrna.core.logs.FileStream(file_path: str | Path, *args, **kwargs)

Bases: Stream

Log to a file.

close()

Close the file stream.

file_path
property stream

Text stream to which messages will be logged after filtering and formatting.

class seismicrna.core.logs.Filterer(verbosity: int)

Bases: object

Filter messages before logging.

verbosity
class seismicrna.core.logs.Formatter(formatter: Callable[[Message], str])

Bases: object

Format messages before logging.

formatter
class seismicrna.core.logs.Level(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: IntEnum

Level of a logging message.

ACTION = 2
DETAIL = 4
ERROR = -2
FATAL = -3
ROUTINE = 3
STATUS = 0
TASK = 1
WARNING = -1
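
Because Level is an IntEnum, its members compare like integers. The sketch below rebuilds the members listed above and shows one way a verbosity threshold could use that ordering. The rule in passes_filter (log a message when its level does not exceed the verbosity) is an assumption, not documented behavior of Filterer, and both names are hypothetical.

```python
from enum import IntEnum


class LevelSketch(IntEnum):
    """Copy of the member values listed for Level, in ascending order."""
    FATAL = -3
    ERROR = -2
    WARNING = -1
    STATUS = 0
    TASK = 1
    ACTION = 2
    ROUTINE = 3
    DETAIL = 4


def passes_filter(level: LevelSketch, verbosity: int) -> bool:
    # Assumption: higher verbosity admits messages of higher (finer) levels.
    return level <= verbosity


print(passes_filter(LevelSketch.WARNING, 0))  # True
print(passes_filter(LevelSketch.DETAIL, 0))   # False
```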
class seismicrna.core.logs.Logger(console_stream: ConsoleStream | None = None, file_stream: FileStream | None = None, raise_on_error: bool = False)

Bases: object

Log messages to the console and to files.

action(content: object)
console_stream
detail(content: object)
error(content: object)
fatal(content: object)
file_stream
raise_on_error
routine(content: object)
status(content: object)
task(content: object)
warning(content: object)
class seismicrna.core.logs.LoggerConfig(verbosity, log_file_path, log_color, raise_on_error)

Bases: tuple

log_color

Alias for field number 2

log_file_path

Alias for field number 1

raise_on_error

Alias for field number 3

verbosity

Alias for field number 0
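
The "Bases: tuple" and "Alias for field number" entries are what Sphinx generates for a named tuple. A minimal sketch of an equivalent type, using collections.namedtuple (LoggerConfigSketch is a hypothetical name; the values are the defaults shown in the set_config signature):

```python
from collections import namedtuple

# Field order matches the field numbers listed above (0 through 3).
LoggerConfigSketch = namedtuple(
    "LoggerConfigSketch",
    ["verbosity", "log_file_path", "log_color", "raise_on_error"],
)

config = LoggerConfigSketch(verbosity=0,
                            log_file_path=None,
                            log_color=True,
                            raise_on_error=False)
# Each attribute is an alias for a position in the tuple.
print(config.log_color is config[2])  # True
# A named tuple also unpacks like an ordinary tuple.
verbosity, log_file_path, log_color, raise_on_error = config
```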

class seismicrna.core.logs.Message(level: Level, content: object)

Bases: object

Message with a logging level.

content
level
class seismicrna.core.logs.Stream(filterer: Filterer, formatter: Formatter)

Bases: ABC

Log to a stream, such as to the console or to a file.

filterer
formatter
log(message: Message)

Log a message to the stream.

abstract property stream: TextIO

Text stream to which messages will be logged after filtering and formatting.

seismicrna.core.logs.erase_config()

Erase the existing logger configuration.

seismicrna.core.logs.exc_info()

Whether to log exception information.

seismicrna.core.logs.format_console_color(message: Message)

Format a message to log on the console with color.

seismicrna.core.logs.format_console_plain(message: Message)

Format a message to log on the console without color.

seismicrna.core.logs.format_logfile(message: Message)

Format a message to write into the log file.

seismicrna.core.logs.get_config()

Get the configuration parameters of a logger.

seismicrna.core.logs.log_exceptions(default: Callable | None)

If any exception occurs, catch it and return the default.

seismicrna.core.logs.restore_config(func: Callable)

After the function exits, restore the logging configuration that was in place before the function ran.

seismicrna.core.logs.set_config(verbosity: int = 0, log_file_path: str | Path | None = None, log_color: bool = True, raise_on_error: bool = False)

Configure the main logger with handlers and verbosity.

class seismicrna.core.path.Field(dtype: type[str | int | Path], options: Iterable = (), is_ext: bool = False)

Bases: object

property as_str
build(val: Any)

Validate a value and return it as a string.

parse(text: str) → Any

Parse a value from a string, validate it, and return it.

validate(val: Any)
class seismicrna.core.path.Path(*seg_types: Segment)

Bases: object

property as_str
build(**fields: Any)

Return a pathlib.Path instance by assembling the given fields into a full path.

parse(path: str | Path)

Return the field names and values from a given path.

exception seismicrna.core.path.PathError

Bases: Exception

Any error involving a path.

exception seismicrna.core.path.PathTypeError

Bases: PathError, TypeError

Use of the wrong type of path or segment.

exception seismicrna.core.path.PathValueError

Bases: PathError, ValueError

Invalid value of a path segment field.

class seismicrna.core.path.Segment(segment_name: str, field_types: dict[str, Field], *, order: int = 0, frmt: str | None = None)

Bases: object

property as_str
build(**vals: Any)
property ext_type

Type of the segment’s file extension, or None if it has no file extension.

property exts: list[str]

Valid file extensions of the segment.

match_longest_ext(text: str)

Find the longest extension of the given text that matches a valid file extension. If none match, return None.

parse(text: str)
exception seismicrna.core.path.WrongFileExtensionError

Bases: PathValueError

A file has the wrong extension.

seismicrna.core.path.build(*segment_types: Segment, **field_values: Any)

Return a pathlib.Path from the given segment types and field values.

seismicrna.core.path.builddir(*segment_types: Segment, **field_values: Any)

Build the path and create it on the file system as a directory if it does not already exist.

seismicrna.core.path.buildpar(*segment_types: Segment, **field_values: Any)

Build a path and create its parent directory if it does not already exist.

seismicrna.core.path.cast_path(input_path: Path, input_segments: Sequence[Segment], output_segments: Sequence[Segment], **override: Any)

Cast input_path made of input_segments to a new path made of output_segments.

Parameters:
  • input_path (pathlib.Path) – Input path from which to take the path fields.

  • input_segments (Sequence[Segment]) – Path segments to use to determine the fields in input_path.

  • output_segments (Sequence[Segment]) – Path segments to use to determine the fields in output_path.

  • **override (Any) – Override and supplement the fields in input_path.

Returns:

Path comprising output_segments made of fields in input_path (as determined by input_segments).

Return type:

pathlib.Path

seismicrna.core.path.check_file_extension(file: Path, extensions: Iterable[str] | Field)
seismicrna.core.path.create_path_type(*segment_types: Segment)

Create and cache a Path instance from the segment types.

seismicrna.core.path.deduplicate(paths: Iterable[str | Path], warn: bool = True)

Yield the non-redundant paths.

seismicrna.core.path.deduplicated(func: Callable)

Decorate a Path generator to yield non-redundant paths.

seismicrna.core.path.fill_whitespace(path: str | Path, fill: str = '_')

Replace all whitespace in path with fill.

seismicrna.core.path.find_files(path: str | Path, segments: Sequence[Segment], pre_sanitize: bool = True)

Yield all files that match a sequence of path segments. The behavior depends on what path is:

  • If it is a file, then yield path if it matches the segments; otherwise, yield nothing.

  • If it is a directory, then search it recursively and yield every matching file in the directory and its subdirectories.

Parameters:
  • path (str | pathlib.Path) – Path of a file to check or a directory to search recursively.

  • segments (Sequence[Segment]) – Sequence(s) of Path segments to check if each file matches.

  • pre_sanitize (bool) – Whether to sanitize the path before searching it.

Returns:

Paths of files matching the segments.

Return type:

Generator[Path, Any, None]

seismicrna.core.path.find_files_chain(paths: Iterable[str | Path], segments: Sequence[Segment])

Yield from find_files called on every path in paths.

seismicrna.core.path.get_fields_in_seg_types(*segment_types: Segment) → dict[str, Field]

Get all fields among the given segment types.

seismicrna.core.path.get_seismicrna_project_dir()

SEISMIC-RNA project directory, named seismic-rna, containing src, pyproject.toml, and all other project files. Will exist if the entire SEISMIC-RNA project has been downloaded, e.g. from GitHub, but not if SEISMIC-RNA was only installed using pip or conda.

seismicrna.core.path.get_seismicrna_source_dir()

SEISMIC-RNA source directory, named seismicrna, containing __init__.py and the top-level modules and subpackages.

seismicrna.core.path.mkdir_if_needed(path: Path | str)

Create a directory and log that event if it does not exist.

seismicrna.core.path.parse(path: str | Path, /, *segment_types: Segment)

Return the fields of a path based on the segment types.

seismicrna.core.path.parse_top_separate(path: str | Path, /, *segment_types: Segment)

Return the fields of a path, and the top field separately.

seismicrna.core.path.path_matches(path: str | Path, segments: Sequence[Segment])

Check if a path matches a sequence of path segments.

Parameters:
  • path (str | pathlib.Path) – Path of the file/directory.

  • segments (Sequence[Segment]) – Sequence of path segments to check if the file matches.

Returns:

Whether the path matches any given sequence of path segments.

Return type:

bool

seismicrna.core.path.randdir(parent: str | Path | None = None, prefix: str = '', suffix: str = '')

Build a path of a new directory that does not exist and create it on the file system.

seismicrna.core.path.rmdir_if_needed(path: Path | str, rmtree: bool = False, rmtree_ignore_errors: bool = False, raise_on_rmtree_error: bool = True)

Remove a directory and log that event if it exists.

seismicrna.core.path.sanitize(path: str | Path, strict: bool = False)

Sanitize a path-like object by ensuring it is an absolute path, eliminating symbolic links and redundant path separators/references, and returning a Path object.

Parameters:
  • path (str | pathlib.Path) – Path to sanitize.

  • strict (bool = False) – Require the path to exist and contain no symbolic link loops.

Returns:

Absolute, normalized, symlink-free path.

Return type:

pathlib.Path
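
The behavior described above closely matches pathlib.Path.resolve, which makes a path absolute, follows symbolic links, and collapses redundant separators and "." or ".." references. The sketch below assumes that equivalence; sanitize_sketch is a hypothetical name and the relative path is a placeholder.

```python
from pathlib import Path


def sanitize_sketch(path, strict: bool = False) -> Path:
    """Hypothetical sketch: absolute, normalized, symlink-free path."""
    # With strict=True, resolve() raises an error if the path does not exist.
    return Path(path).resolve(strict=strict)


# The path need not exist when strict is False.
clean = sanitize_sketch("some_dir//nested/../file.txt")
print(clean.is_absolute())  # True
```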

Make link_path a link pointing to target_path and log that event if it does not exist.

seismicrna.core.path.transpath(to_dir: str | Path, from_dir: str | Path, path: str | Path, strict: bool = False)

Return the path that would be produced by moving path from from_dir to to_dir (but do not actually move the path on the file system). This function does not require that any of the given paths exist, unless strict is True.

Parameters:
  • to_dir (str | pathlib.Path) – Directory to which to move path.

  • from_dir (str | pathlib.Path) – Directory from which to move path; must contain path but not necessarily be the direct parent directory of path.

  • path (str | pathlib.Path) – Path to move; can be a file or directory.

  • strict (bool = False) – Require that all paths exist and contain no symbolic link loops.

Returns:

Hypothetical path after moving path from from_dir to to_dir.

Return type:

pathlib.Path
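
The core of this operation is to take the part of path that lies below from_dir and append it to to_dir. A minimal sketch with pure paths follows; it skips sanitization and the strict option, transpath_sketch is a hypothetical name, and the directories are placeholders.

```python
from pathlib import PurePosixPath


def transpath_sketch(to_dir, from_dir, path) -> PurePosixPath:
    """Hypothetical sketch: path as it would be after moving it to to_dir."""
    # relative_to raises ValueError if path is not located inside from_dir.
    relative = PurePosixPath(path).relative_to(PurePosixPath(from_dir))
    return PurePosixPath(to_dir) / relative


# from_dir need not be the direct parent of path.
print(transpath_sketch("/out", "/in", "/in/sample/ref/file.txt"))
# /out/sample/ref/file.txt
```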

seismicrna.core.path.transpaths(to_dir: str | Path, *paths: str | Path, strict: bool = False)

Return all paths that would be produced by moving all paths in paths from their longest common sub-path to to_dir (but do not actually move the paths on the file system). This function does not require that any of the given paths exist, unless strict is True.

Parameters:
  • to_dir (str | pathlib.Path) – Directory to which to move every path in paths.

  • *paths (str | pathlib.Path) – Paths to move; can be files or directories. A common sub-path must exist among all of these paths.

  • strict (bool = False) – Require that all paths exist and contain no symbolic link loops.

Returns:

Hypothetical paths after moving all paths in paths to to_dir.

Return type:

tuple[pathlib.Path, ...]
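
A sketch of the same idea for several paths: find the longest common sub-path, then re-root every path from there onto to_dir. It uses posixpath.commonpath from the standard library and skips sanitization and the strict option; transpaths_sketch is a hypothetical name and the directories are placeholders.

```python
import posixpath
from pathlib import PurePosixPath


def transpaths_sketch(to_dir, *paths) -> tuple:
    """Hypothetical sketch: paths as they would be after moving to to_dir."""
    if not paths:
        return ()
    # commonpath raises ValueError if absolute and relative paths are mixed.
    common = PurePosixPath(posixpath.commonpath([str(p) for p in paths]))
    return tuple(PurePosixPath(to_dir) / PurePosixPath(p).relative_to(common)
                 for p in paths)


moved = transpaths_sketch("/out", "/in/x/a.txt", "/in/y/b.txt")
print([str(p) for p in moved])  # ['/out/x/a.txt', '/out/y/b.txt']
```

With a single path, the common sub-path is the path itself, so this sketch returns to_dir unchanged; the library may treat that case differently.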

seismicrna.core.path.validate_int(num: int)

seismicrna.core.path.validate_str(txt: str)

seismicrna.core.path.validate_top(top: Path)

seismicrna.core.random.stochastic_round(values: ndarray | list | float | int, preserve_sum: bool = False)

Round values to integers stochastically, so that the probability of rounding up equals the fractional part of the original value.

Parameters:
  • values (np.ndarray | list | float | int) – Values to round; if scalar, a 0D integer array will be returned.

  • preserve_sum (bool) – Whether to ensure that the sum of the rounded values equals the sum of the original values.

Returns:

Values rounded to integers, with the original sum preserved if preserve_sum is True.

Return type:

np.ndarray
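
The rounding rule can be sketched in pure Python: take the floor of each value, then add 1 with probability equal to the fractional part. This sketch does not implement preserve_sum and is not the library's NumPy implementation; stochastic_round_sketch and its rng parameter are hypothetical.

```python
import math
import random


def stochastic_round_sketch(values, rng=None):
    """Hypothetical sketch: round each value up with probability = fraction."""
    if rng is None:
        rng = random.Random()
    rounded = []
    for value in values:
        lower = math.floor(value)
        # value - lower is the fractional part, in [0, 1); rng.random() is
        # below it with exactly that probability, so whole numbers never move.
        rounded.append(lower + (rng.random() < value - lower))
    return rounded


# Each result is either the floor or the ceiling of the original value.
print(stochastic_round_sketch([0.25, 1.5, -0.75, 2.0], random.Random(0)))
```

Averaged over many calls, each rounded value equals the original value, which is the point of rounding stochastically instead of to the nearest integer.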

class seismicrna.core.report.BatchedRefseqReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, RefseqReport, ABC

Convenience class used as a base for several Report classes.

class seismicrna.core.report.BatchedReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, ABC

Report with a number of data batches (one file per batch).

classmethod batch_types() dict[str, type[ReadBatchIO]]

Type(s) of batch(es) for the report, keyed by name.

abstract classmethod fields()

All fields of the report.

classmethod get_batch_type(btype: str | None = None) type[ReadBatchIO]

Return a valid type of batch based on its name.

class seismicrna.core.report.Field(key: str, title: str, dtype: type, default: Any | None = None, *, iconv: Callable[[Any], Any] | None = None, oconv: Callable[[Any], Any] | None = None)

Bases: object

Field of a report.

default
dtype
iconv
key
oconv
title
exception seismicrna.core.report.InvalidReportFieldKeyError

Bases: ReportFieldKeyError

The key does not belong to an actual report field.

exception seismicrna.core.report.InvalidReportFieldTitleError

Bases: ReportFieldKeyError

The title does not belong to an actual report field.

exception seismicrna.core.report.MissingFieldWithNoDefaultError

Bases: ReportFieldValueError

The default value is requested of a field with no default.

class seismicrna.core.report.OptionField(option: Option, **kwargs)

Bases: Field

Field based on a command line option.

default
dtype
iconv
key
oconv
title
class seismicrna.core.report.RefseqReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, RefIO, ABC

Report associated with a reference sequence file.

abstract classmethod fields()

All fields of the report.

class seismicrna.core.report.Report(**kwargs: Any | Callable[[Report], Any])

Bases: FileIO, ABC

Abstract base class for a report from a step.

__setattr__(key: str, value: Any)

Validate the attribute name and value before setting it.

classmethod field_keys()

Keys of all fields of the report.

abstract classmethod fields()

All fields of the report.

classmethod from_dict(odata: dict[str, Any])

Convert a dict of raw values (keyed by the titles of their fields) into a dict of encoded values (keyed by the keys of their fields), from which a new Report is instantiated.

get_field(field: Field, missing_ok: bool = False)

Return the value of a field of the report using the field instance directly, not its key.

classmethod load(file: Path) Report

Load an object from a file.

save(top: Path, force: bool = False)

Save the report to a JSON file.

to_dict()

Return a dict of raw values of the fields, keyed by the titles of their fields.
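
The relationship between field keys, field titles, and the two converters can be pictured with a toy model. Everything below (ToyField, TOY_FIELDS, toy_to_dict, toy_from_dict, and the example titles) is hypothetical; it assumes that oconv is applied when writing raw values and iconv when reading them back, which the descriptions above suggest but do not state.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a report field."""
    key: str
    title: str
    iconv: Callable[[Any], Any]  # raw value -> in-memory value (assumed)
    oconv: Callable[[Any], Any]  # in-memory value -> raw value (assumed)


TOY_FIELDS = [ToyField("sample", "Name of Sample", str, str),
              ToyField("n_reads", "Number of Reads", int, str)]


def toy_to_dict(values):
    # Key the raw values by field title.
    return {f.title: f.oconv(values[f.key]) for f in TOY_FIELDS}


def toy_from_dict(odata):
    # Key the decoded values by field key.
    return {f.key: f.iconv(odata[f.title]) for f in TOY_FIELDS}


raw = toy_to_dict({"sample": "s1", "n_reads": 1200})
restored = toy_from_dict(raw)
```

In this toy model, a round trip through toy_to_dict and toy_from_dict returns the original values.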

exception seismicrna.core.report.ReportDoesNotHaveFieldError

Bases: ReportFieldAttributeError

A report does not contain this type of field.

exception seismicrna.core.report.ReportFieldAttributeError

Bases: ReportFieldError, AttributeError

exception seismicrna.core.report.ReportFieldError

Bases: RuntimeError

Any error involving a field of a report.

exception seismicrna.core.report.ReportFieldKeyError

Bases: ReportFieldError, KeyError

exception seismicrna.core.report.ReportFieldTypeError

Bases: ReportFieldError, TypeError

exception seismicrna.core.report.ReportFieldValueError

Bases: ReportFieldError, ValueError

seismicrna.core.report.calc_dt_minutes(began: datetime, ended: datetime)

Calculate the time taken in minutes.
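
A minimal sketch of such a calculation, assuming the result is simply the elapsed time in seconds divided by 60 (the function name sketch_dt_minutes is hypothetical, and any rounding applied by the real function is not reproduced):

```python
from datetime import datetime


def sketch_dt_minutes(began, ended):
    """Hypothetical illustration: elapsed time between two datetimes, in minutes."""
    return (ended - began).total_seconds() / 60.0


minutes = sketch_dt_minutes(datetime(2024, 1, 1, 12, 0, 0),
                            datetime(2024, 1, 1, 12, 1, 30))
```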

seismicrna.core.report.calc_taken(report: Report)

Calculate the time taken in minutes.

seismicrna.core.report.default_key(key: str)

Get the default value of a field by its key.

seismicrna.core.report.field_keys() dict[str, Field]
seismicrna.core.report.field_titles() dict[str, Field]
seismicrna.core.report.fields()
seismicrna.core.report.get_oconv_dict(dtype: type, precision: int = 3)
seismicrna.core.report.get_oconv_dict_list(dtype: type, precision: int = 3)
seismicrna.core.report.get_oconv_float(precision: int = 3)
seismicrna.core.report.get_oconv_list(dtype: type, precision: int = 3)
seismicrna.core.report.iconv_array_int(nums: list[int])
seismicrna.core.report.iconv_datetime(text: str)
seismicrna.core.report.iconv_dict_str_dict_int_dict_int_int(mapping: dict[Any, dict[Any, dict[Any, Any]]]) dict[str, dict[int, dict[int, int]]]
seismicrna.core.report.iconv_dict_str_int(mapping: dict[Any, Any]) dict[str, int]
seismicrna.core.report.iconv_int_keys(mapping: dict[Any, Any])
seismicrna.core.report.key_to_title(key: str)

Map a field’s key to its title.

seismicrna.core.report.lookup_key(key: str)

Get a field by its key.

seismicrna.core.report.lookup_title(title: str)

Get a field by its title.

seismicrna.core.report.oconv_datetime(dtime: datetime)
seismicrna.core.run.log_command(command: str)

Log the name of the command.

seismicrna.core.run.run_func(command: str, default: Callable | None = <class 'list'>, with_tmp: bool = False, pass_keep_tmp: bool = False, *args, **kwargs)

Decorator for a run function.

seismicrna.core.stats.calc_beta_mv(alpha: float, beta: float)

Find the mean and variance of a beta distribution from its alpha and beta parameters.

Parameters:
  • alpha (float) – Alpha parameter of the beta distribution.

  • beta (float) – Beta parameter of the beta distribution.

Returns:

Mean and variance of the beta distribution.

Return type:

tuple[float, float]

seismicrna.core.stats.calc_beta_params(mean: float, variance: float)

Find the alpha and beta parameters of a beta distribution from its mean and variance.

Parameters:
  • mean (float) – Mean of the beta distribution.

  • variance (float) – Variance of the beta distribution.

Returns:

Alpha and beta parameters of the beta distribution.

Return type:

tuple[float, float]
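
The two conversions follow from the standard formulas for the beta distribution: mean = α / (α + β) and variance = αβ / ((α + β)² (α + β + 1)). A minimal sketch, with hypothetical function names and no input validation, assuming 0 < mean < 1 and 0 < variance < mean (1 - mean):

```python
def sketch_beta_mv(alpha, beta):
    """Mean and variance of a beta distribution (standard formulas)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance


def sketch_beta_params(mean, variance):
    """Invert the formulas above to recover alpha and beta."""
    # Since variance = mean * (1 - mean) / (alpha + beta + 1):
    total = mean * (1.0 - mean) / variance - 1.0
    return mean * total, (1.0 - mean) * total


mean, variance = sketch_beta_mv(2.0, 3.0)
alpha, beta = sketch_beta_params(mean, variance)
```

For α = 2 and β = 3 the mean is 0.4 and the variance is 0.04, and inverting those values recovers the original parameters up to floating-point error.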

seismicrna.core.stats.calc_dirichlet_mv(alpha: ndarray)

Find the means and variances of a Dirichlet distribution from its concentration parameters.

Parameters:

alpha (np.ndarray) – Concentration parameters of the Dirichlet distribution.

Returns:

Means and variances of the Dirichlet distribution.

Return type:

tuple[np.ndarray, np.ndarray]

seismicrna.core.stats.calc_dirichlet_params(mean: ndarray, variance: ndarray)

Find the concentration parameters of a Dirichlet distribution from its mean and variance.

Parameters:
  • mean (np.ndarray) – Means.

  • variance (np.ndarray) – Variances.

Returns:

Concentration parameters.

Return type:

np.ndarray
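
For a Dirichlet distribution with concentration parameters α and α₀ = Σα, the standard formulas are mean_i = α_i / α₀ and variance_i = α_i (α₀ - α_i) / (α₀² (α₀ + 1)). The sketch below uses plain Python lists in place of NumPy arrays to stay dependency-free; the function names are hypothetical, and averaging the per-component estimates of α₀ is an assumption made here, not necessarily what the library does.

```python
def sketch_dirichlet_mv(alpha):
    """Means and variances of a Dirichlet distribution (standard formulas)."""
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    variance = [a * (a0 - a) / (a0 ** 2 * (a0 + 1.0)) for a in alpha]
    return mean, variance


def sketch_dirichlet_params(mean, variance):
    """Recover concentration parameters from means and variances."""
    # Each component implies a0 = mean * (1 - mean) / variance - 1;
    # average the estimates (an assumption of this sketch).
    estimates = [m * (1.0 - m) / v - 1.0 for m, v in zip(mean, variance)]
    a0 = sum(estimates) / len(estimates)
    return [m * a0 for m in mean]


d_mean, d_var = sketch_dirichlet_mv([1.0, 2.0, 3.0])
d_alpha = sketch_dirichlet_params(d_mean, d_var)
```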

class seismicrna.core.task.Task(func: Callable)

Bases: object

Wrap a parallelizable task in a try-except block so that if it fails, it just returns None rather than crashing the other tasks being run in parallel.

__call__(*args, **kwargs)

Call the task’s function in a try-except block, return the result if it succeeds, and return None otherwise.

property name
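
The wrapping pattern described for Task can be sketched as below. SketchTask is a hypothetical class, not the library's implementation; the logging call and the way the name is derived are assumptions.

```python
import logging


class SketchTask:
    """Hypothetical wrapper that turns an exception into a None result."""

    def __init__(self, func):
        self._func = func

    @property
    def name(self):
        # Fall back to repr() for callables without a __name__.
        return getattr(self._func, "__name__", repr(self._func))

    def __call__(self, *args, **kwargs):
        try:
            return self._func(*args, **kwargs)
        except Exception as error:
            # Log the failure instead of propagating it to sibling tasks.
            logging.getLogger(__name__).warning("Task %s failed: %s",
                                                self.name, error)
            return None


def divide(x, y):
    return x / y


ok = SketchTask(divide)(6, 3)      # normal result
failed = SketchTask(divide)(1, 0)  # ZeroDivisionError is caught
```

A caller that collects results from many such tasks can then treat None as "this task failed" without losing the results of the others.
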
seismicrna.core.task.as_list_of_tuples(args: Iterable[Any])

Given an iterable of arguments, return a list of 1-item tuples, each containing one of the given arguments. This function is useful for creating a list of tuples to pass to the args parameter of dispatch.

seismicrna.core.task.calc_pool_size(num_tasks: int, max_procs: int)

Calculate the size of a process pool.

Parameters:
  • num_tasks (int) – Number of tasks to parallelize. Must be ≥ 1.

  • max_procs (int) – Maximum number of processes to run at one time. Must be ≥ 1.

Returns:

  • Number of tasks to run in parallel. Always ≥ 1.

  • Number of processes to run for each task. Always ≥ 1.

Return type:

tuple[int, int]

seismicrna.core.task.dispatch(funcs: list[Callable] | Callable, max_procs: int, pass_n_procs: bool = True, raise_on_error: bool = False, args: list[tuple] | tuple = (), kwargs: dict[str, Any] | None = None)

Run one or more tasks in series or in parallel, depending on the number of tasks, the maximum number of processes, and whether tasks are allowed to be run in parallel.

Parameters:
  • funcs (list[Callable] | Callable) – The function(s) to run. Can be a list of functions or a single function that is not in a list. If a single function, then if args is a tuple, it is called once with that tuple as its positional arguments; and if args is a list of tuples, it is called for each tuple of positional arguments in args.

  • max_procs (int) – Maximum number of processes to run at one time. Must be ≥ 1.

  • pass_n_procs (bool) – Whether to pass the number of processes to the function as the keyword argument n_procs.

  • raise_on_error (bool) – Whether to raise an error if any tasks fail (if False, only log a warning message).

  • args (list[tuple] | tuple) – Positional arguments to pass to each function in funcs. Can be a list of tuples of positional arguments or a single tuple that is not in a list. If a single tuple, then each function receives args as positional arguments. If a list, then args must be the same length as funcs; each function funcs[i] receives args[i] as positional arguments.

  • kwargs (dict[str, Any] | None) – Keyword arguments to pass to every function call.

Returns:

List of the return value of each run.

Return type:

list
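
The rules for combining funcs and args can be sketched as a helper that only pairs each function with its positional arguments. The names sketch_pair_calls and sketch_as_tuples are hypothetical, and the sketch runs the calls serially; it does not model process pools, n_procs, or error handling.

```python
def sketch_as_tuples(items):
    """Wrap each item in a 1-item tuple, as described for as_list_of_tuples."""
    return [(item,) for item in items]


def sketch_pair_calls(funcs, args=()):
    """Hypothetical illustration of how funcs and args are paired."""
    if callable(funcs):
        if isinstance(args, tuple):
            return [(funcs, args)]            # one function, one call
        return [(funcs, a) for a in args]     # one function, one call per tuple
    if isinstance(args, tuple):
        return [(f, args) for f in funcs]     # every function gets the same args
    if len(args) != len(funcs):
        raise ValueError("args must be the same length as funcs")
    return list(zip(funcs, args))             # funcs[i] receives args[i]


one_func = [f(*a) for f, a in
            sketch_pair_calls(abs, sketch_as_tuples([-1, 2, -3]))]
many_funcs = [f(*a) for f, a in sketch_pair_calls([min, max], (4, 9))]
```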

seismicrna.core.tmp.get_release_working_dirs(tmp_dir: Path)
seismicrna.core.tmp.release_to_out(out_dir: Path, release_dir: Path, initial_path: Path)

Move temporary path(s) to the output directory.

seismicrna.core.tmp.with_tmp_dir(pass_keep_tmp: bool)

Make a temporary directory, and delete it after returning.

seismicrna.core.types.fit_uint_size(value: int)

Smallest number of bytes that will fit the value.

seismicrna.core.types.fit_uint_type(value: int)

Smallest unsigned int type that will fit the value.

seismicrna.core.types.get_byte_dtype(nchars: int)

NumPy byte type with the given number of characters.

seismicrna.core.types.get_dtype(code: str, size: int)

NumPy type with the given code and size.

seismicrna.core.types.get_max_uint(uint_type: type)

Maximum value of a NumPy unsigned integer type.

seismicrna.core.types.get_max_value(nbytes: int)

Get the maximum value of an unsigned integer of N bytes.

seismicrna.core.types.get_uint_dtype(nbytes: int)

NumPy uint data type with the given number of bytes.

seismicrna.core.types.get_uint_size(uint_type: type)

Size of a NumPy uint type in bytes.

seismicrna.core.types.get_uint_type(nbytes: int)

NumPy uint type with the given number of bytes.
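
The arithmetic behind these helpers is straightforward: an unsigned integer of N bytes can hold values up to 2^(8N) - 1. The sketch below uses hypothetical names and assumes that only sizes of 1, 2, 4, and 8 bytes are considered, since those are the widths of NumPy's unsigned integer types.

```python
def sketch_max_value(nbytes):
    """Maximum value of an unsigned integer with the given number of bytes."""
    return 2 ** (8 * nbytes) - 1


def sketch_fit_uint_size(value, sizes=(1, 2, 4, 8)):
    """Smallest size (in bytes) among the candidates that can hold value."""
    for nbytes in sizes:
        if value <= sketch_max_value(nbytes):
            return nbytes
    raise ValueError(f"{value} does not fit in {max(sizes)} bytes")


size_255 = sketch_fit_uint_size(255)
size_256 = sketch_fit_uint_size(256)
```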

seismicrna.core.unbias.calc_n_reads_per_pos(p_ends_observed: ndarray, n_reads_per_clust: ndarray)
seismicrna.core.unbias.calc_p_clust(p_clust_observed: ndarray, p_noclose_given_clust: ndarray)

Cluster proportion among all reads.

Parameters:
  • p_clust_observed (np.ndarray) – Proportion of each cluster among reads with no two mutations too close. 1D (clusters)

  • p_noclose_given_clust (np.ndarray) – Probability that a read from each cluster would have no two mutations too close. 1D (clusters)

Returns:

Proportion of each cluster among all reads. 1D (clusters)

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_clust_given_ends_noclose(p_ends_given_clust_noclose: ndarray, p_clust_given_noclose: ndarray)

Calculate the probability that a read with each pair of 5’/3’ ends and no two mutations too close came from each cluster.

Parameters:
  • p_ends_given_clust_noclose (np.ndarray) – 3D (positions x positions x clusters) array of the probability that a read from each cluster has each pair of 5’/3’ ends given that it has no two mutations too close.

  • p_clust_given_noclose (np.ndarray) – 1D (clusters) array of the probability that a read comes from each cluster given that it has no two mutations too close.

Returns:

3D (positions x positions x clusters) array of the probability that a read with each pair of 5’/3’ ends and no two mutations too close comes from each cluster.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_clust_given_noclose(p_clust: ndarray, p_noclose_given_clust: ndarray)

Cluster proportions among reads with no two mutations too close.

Parameters:
  • p_clust (np.ndarray) – Proportion of each cluster among all reads. 1D (clusters)

  • p_noclose_given_clust (np.ndarray) – Probability that a read from each cluster would have no two mutations too close. 1D (clusters)

Returns:

Proportion of each cluster among reads with no two mutations too close. 1D (clusters)

Return type:

np.ndarray
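
The two cluster-proportion conversions are inverses of each other under Bayes' rule: the proportion of cluster k among reads with no two mutations too close is proportional to p_clust[k] × p_noclose_given_clust[k]. A minimal sketch with plain Python lists in place of NumPy arrays (function names are hypothetical, and no validation is done):

```python
def sketch_p_clust_given_noclose(p_clust, p_noclose_given_clust):
    """Cluster proportions among reads with no two mutations too close."""
    joint = [pc * pn for pc, pn in zip(p_clust, p_noclose_given_clust)]
    total = sum(joint)  # probability that any read has no two mutations too close
    return [j / total for j in joint]


def sketch_p_clust(p_clust_observed, p_noclose_given_clust):
    """Cluster proportions among all reads, from the observed proportions."""
    unnormalized = [po / pn
                    for po, pn in zip(p_clust_observed, p_noclose_given_clust)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


observed = sketch_p_clust_given_noclose([0.6, 0.4], [0.9, 0.5])
recovered = sketch_p_clust(observed, [0.9, 0.5])
```

In this example the second cluster loses more reads to the filter, so its observed proportion (about 0.27) is lower than its true proportion (0.4); dividing by p_noclose_given_clust and renormalizing undoes that bias.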

seismicrna.core.unbias.calc_p_ends(p_ends_observed: ndarray, p_noclose_given_ends: ndarray, p_mut_given_span: ndarray, p_clust: ndarray)

Calculate the proportion of total reads with each pair of 5’ and 3’ coordinates.

This function is meant to be called by another function that has validated the arguments; hence, this function makes assumptions:

  • Every value in the upper triangle of p_ends_observed is ≥ 0 and ≤ 1; no values below the main diagonal are used.

  • The upper triangle of p_ends_observed sums to 1.

  • Every value in p_mut_given_span is ≥ 0 and ≤ 1.

Parameters:
  • p_ends_observed (np.ndarray) – 3D (positions x positions x clusters) array of the proportion of observed reads in each cluster beginning at the row position and ending at the column position.

  • p_noclose_given_ends (np.ndarray) – 3D (positions x positions x clusters) array of the probabilities that a read with 5’ and 3’ coordinates corresponding to the row and column would have no two mutations too close.

  • p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the total mutation rate at each position in each cluster.

  • p_clust (np.ndarray) – 1D (clusters) array of the proportion of each cluster.

Returns:

2D (positions x positions) array of the proportion of reads beginning at the row position and ending at the column position. This array is assumed to be identical for all clusters.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_ends_given_clust_noclose(p_ends: ndarray, p_noclose_given_ends: ndarray)

Calculate the proportion of reads in each cluster with each pair of 5’ and 3’ coordinates, among reads with no two mutations too close.

Assumptions

  • p_ends has 2 dimensions: (positions x positions)

  • Every value in the upper triangle of p_ends is ≥ 0 and ≤ 1; no values below the main diagonal are used.

  • The upper triangle of p_ends sums to 1.

  • min_gap is a non-negative integer.

  • p_mut_given_span has 2 dimensions: (positions x clusters)

  • Every value in p_mut_given_span is ≥ 0 and ≤ 1.

  • There is at least 1 cluster.

Parameters:
  • p_ends (np.ndarray) – 2D (positions x positions) array of the proportion of reads in each cluster beginning at the row position and ending at the column position.

  • p_noclose_given_ends (np.ndarray) – 3D (positions x positions x clusters) array of the probabilities that a read with 5’ and 3’ coordinates corresponding to the row and column would have no two mutations too close.

Returns:

3D (positions x positions x clusters) array of the proportion of reads without mutations too close, beginning at the row position and ending at the column position, in each cluster.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_ends_given_noclose(p_ends_given_clust_noclose: ndarray, p_clust_given_noclose: ndarray)

Calculate the probability that a read would have each pair of 5’/3’ ends and no two mutations too close.

Parameters:
  • p_ends_given_clust_noclose (np.ndarray) – 3D (positions x positions x clusters) array of the probability that a read from each cluster has each pair of 5’/3’ ends given that it has no two mutations too close.

  • p_clust_given_noclose (np.ndarray) – 1D (clusters) array of the probability that a read comes from each cluster given that it has no two mutations too close.

Returns:

2D (positions x positions) array of the probability that a read with no two mutations too close has each pair of 5’/3’ ends, regardless of the cluster to which it belongs.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_ends_observed(npos: int, end5s: ndarray, end3s: ndarray, weights: ndarray | None = None, check_values: bool = True)

Calculate the proportion of each pair of 5’/3’ end coordinates observed in end5s and end3s, optionally weighted by weights.

Parameters:
  • npos (int) – Number of positions.

  • end5s (np.ndarray) – 5’ ends (0-indexed) of the reads: 1D array (reads)

  • end3s (np.ndarray) – 3’ ends (0-indexed) of the reads: 1D array (reads)

  • weights (np.ndarray | None) – Number of times each read occurs in each cluster: 2D array (reads x clusters)

  • check_values (bool) – Check that end5s, end3s, and weights are all valid.

Returns:

Fraction of reads with each 5’ (row) and 3’ (column) coordinate: 3D array (positions x positions x clusters)

Return type:

np.ndarray
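
For the unweighted, single-cluster case, the calculation amounts to counting how often each (5’, 3’) pair occurs and dividing by the number of reads. The sketch below is a simplification under those assumptions: it uses nested lists instead of a NumPy array, omits the clusters dimension and the weights, and requires at least one read. The name sketch_p_ends_observed is hypothetical.

```python
def sketch_p_ends_observed(npos, end5s, end3s):
    """Fraction of reads with each 5' (row) and 3' (column) coordinate."""
    counts = [[0] * npos for _ in range(npos)]
    for end5, end3 in zip(end5s, end3s):
        counts[end5][end3] += 1  # coordinates are 0-indexed
    n_reads = len(end5s)
    return [[count / n_reads for count in row] for row in counts]


# Four reads over three positions: two span 0-2, one spans 1-2, one spans 0-1.
p_ends_obs = sketch_p_ends_observed(3, [0, 0, 1, 0], [2, 2, 2, 1])
```

Because every read has its 5’ end at or before its 3’ end, only the upper triangle of the result is nonzero.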

seismicrna.core.unbias.calc_p_mut_given_span(p_mut_given_span_observed: ndarray, min_gap: int, p_ends: ndarray, init_p_mut_given_span: ndarray, *, quick_unbias: bool = True, quick_unbias_thresh: float = 0.0, f_tol: float = 0.0001, x_rtol: float = 0.001)

Calculate the underlying mutation rates (among all reads, including those with two mutations too close) from the observed mutation rates.

seismicrna.core.unbias.calc_p_mut_given_span_noclose(p_mut_given_span: ndarray, p_ends: ndarray, p_noclose_given_ends: ndarray, p_nomut_window: ndarray)

Calculate the mutation rates of only reads with no two mutations too close that span each position.

Parameters:
  • p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the underlying mutation rates (i.e. the probability that a read has a mutation at position (j) given that it contains that position).

  • p_ends (np.ndarray) – 2D (positions x positions) array of the proportion of reads in each cluster beginning at the row position and ending at the column position.

  • p_noclose_given_ends (np.ndarray) – 3D (positions x positions x clusters) array of the probabilities that a read with 5’ and 3’ coordinates corresponding to the row and column would have no two mutations too close.

  • p_nomut_window (np.ndarray) – 3D (window x positions x clusters) array of the probability that (window) consecutive bases, ending at position (position), would have no mutations.

Returns:

2D (positions x clusters) array of the mutation rate among reads with no two mutations too close per position per cluster.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_noclose(p_clust: ndarray, p_noclose_given_clust: ndarray)

Probability that any read would have no two mutations too close.

Parameters:
  • p_clust (np.ndarray) – Proportion of each cluster among all reads. 1D (clusters)

  • p_noclose_given_clust (np.ndarray) – Probability that a read from each cluster would have no two mutations too close. 1D (clusters)

Returns:

Probability that any read would have no two mutations too close.

Return type:

float

seismicrna.core.unbias.calc_p_noclose_given_clust(p_ends: ndarray, p_noclose_given_ends: ndarray)

Calculate the probability that a read from each cluster would have no two mutations too close.

seismicrna.core.unbias.calc_p_noclose_given_ends(p_mut_given_span: ndarray, p_nomut_window: ndarray)

Given underlying mutation rates (p_mut_given_span), calculate the probability that a read starting at position (a) and ending at position (b) would have no two mutations too close, for each (a) and (b) where 1 ≤ a ≤ b ≤ L (biological coordinates) or 0 ≤ a ≤ b < L (Python coordinates).

Parameters:
  • p_mut_given_span (np.ndarray) – 2D (positions x clusters) array of the underlying mutation rates (i.e. the probability that a read has a mutation at position (j) given that it contains that position).

  • p_nomut_window (np.ndarray) – 3D (window x positions x clusters) array of the probability that (window) consecutive bases, ending at position (position), would have no mutations.

Returns:

3D (positions x positions x clusters) array of the probability that a random read starting at position (a) (row) and ending at position (b) (column) would have no two mutations too close.

Return type:

np.ndarray

seismicrna.core.unbias.calc_p_noclose_given_ends_auto(p_mut_given_span: ndarray, min_gap: int)

Given underlying mutation rates (p_mut_given_span), calculate the probability that a read starting at position (a) and ending at position (b) would have no two mutations too close (i.e. separated by fewer than min_gap non-mutated positions), for each combination of (a) and (b) such that 1 ≤ a ≤ b ≤ L (in biological coordinates) or 0 ≤ a ≤ b < L (in Python coordinates).

Parameters:
  • p_mut_given_span (ndarray) – A 2D (positions x clusters) array of the underlying mutation rates, i.e. the probability that a read has a mutation at position (j) given that it contains position (j).

  • min_gap (int) – Minimum number of non-mutated bases between two mutations; must be ≥ 0.

Returns:

3D (positions x positions x clusters) array of the probability that a random read starting at position (a) (row) and ending at position (b) (column) would have no two mutations too close.

Return type:

np.ndarray
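
The quantity being computed can be checked on tiny inputs by brute force: enumerate every mutation pattern for a read spanning positions a through b, and add up the probabilities of the patterns in which every pair of consecutive mutations is separated by at least min_gap non-mutated positions. The sketch below does exactly that for one cluster, assuming mutations at different positions are independent. It is exponential in the read length and is meant only to illustrate the definition, not the algorithm the library uses; the name sketch_p_noclose is hypothetical.

```python
from itertools import product


def sketch_p_noclose(p_mut, end5, end3, min_gap):
    """Brute-force probability that a read spanning end5..end3 (0-indexed,
    inclusive) has no two mutations too close."""
    positions = range(end5, end3 + 1)
    p_noclose = 0.0
    for pattern in product((False, True), repeat=len(positions)):
        muts = [pos for pos, is_mut in zip(positions, pattern) if is_mut]
        # Skip patterns with fewer than min_gap non-mutated positions
        # between two consecutive mutations.
        if any(b - a - 1 < min_gap for a, b in zip(muts, muts[1:])):
            continue
        prob = 1.0
        for pos, is_mut in zip(positions, pattern):
            prob *= p_mut[pos] if is_mut else 1.0 - p_mut[pos]
        p_noclose += prob
    return p_noclose


p_gap1 = sketch_p_noclose([0.1, 0.2], 0, 1, min_gap=1)
p_gap0 = sketch_p_noclose([0.1, 0.2], 0, 1, min_gap=0)
```

With mutation rates 0.1 and 0.2 and min_gap = 1, the only excluded pattern is the one with both positions mutated (probability 0.02), so the result is 0.98; with min_gap = 0 nothing is excluded and the result is 1.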

seismicrna.core.unbias.calc_p_nomut_window(p_mut_given_span: ndarray, min_gap: int)

Given underlying mutation rates (p_mut_given_span), find the probability of no mutations in each window of size 0 to min_gap.

Parameters:
  • p_mut_given_span (ndarray) – 2D (positions x clusters) array of the underlying mutation rates (i.e. the probability that a read has a mutation at position (j) given that it contains that position).

  • min_gap (int) – Minimum number of non-mutated bases between two mutations.

Returns:

3D (window x positions + 1 x clusters) array of the probability that (window) consecutive bases, ending at position (position), would have no mutations.

Return type:

np.ndarray

seismicrna.core.unbias.calc_params(p_mut_given_span_observed: ndarray, p_ends_observed: ndarray, p_clust_observed: ndarray, min_gap: int, guess_p_mut_given_span: ndarray | None = None, guess_p_ends: ndarray | None = None, guess_p_clust: ndarray | None = None, *, prenormalize: bool = True, max_iter: int = 128, convergence_thresh: float = 0.0001, **kwargs)

Calculate the three sets of parameters based on observed data.

Parameters:
  • p_mut_given_span_observed (np.ndarray) – Observed probability that each position is mutated given that no two mutations are too close: 2D array (positions x clusters)

  • p_ends_observed (np.ndarray) – Observed proportion of reads aligned with each pair of 5’ and 3’ end coordinates given that no two mutations are too close: 3D array (positions x positions x clusters)

  • p_clust_observed (np.ndarray) – Observed proportion of reads in each cluster given that no two mutations are too close: 1D array (clusters)

  • min_gap (int) – Minimum number of non-mutated bases between two mutations. Must be a non-negative integer.

  • guess_p_mut_given_span (np.ndarray | None = None) – Initial guess for the probability that each position is mutated. If given, must be a 2D array (positions x clusters); defaults to p_mut_given_span_observed.

  • guess_p_ends (np.ndarray | None = None) – Initial guess for the proportion of total reads aligned to each pair of 5’ and 3’ end coordinates. If given, must be a 2D array (positions x positions); defaults to p_ends_observed.

  • guess_p_clust (np.ndarray | None = None) – Initial guess for the proportion of total reads in each cluster. If given, must be a 1D array (clusters); defaults to p_clust_observed.

  • prenormalize (bool = True) – Fill missing values in guess_p_mut_given_span, guess_p_ends, and guess_p_clust, and clip every value to be ≥ 0 and ≤ 1. Ensure the proportions in guess_p_clust and the upper triangle of guess_p_ends sum to 1.

  • max_iter (int = 128) – Maximum number of iterations in which to refine the parameters.

  • convergence_thresh (float = 1.e-4) – Convergence threshold based on the root-mean-square difference in mutation rates between consecutive iterations.

  • **kwargs – Additional keyword arguments for _calc_p_mut_given_span.

seismicrna.core.unbias.calc_params_observed(n_pos_total: int, unmasked_pos: Iterable[int], muts_per_pos: Iterable[ndarray], end5s: ndarray, end3s: ndarray, counts_per_uniq: ndarray, resps: ndarray)

Calculate the observed estimates of the parameters.

Parameters:
  • n_pos_total (int) – Total number of positions in the region.

  • unmasked_pos (Iterable[int]) – Unmasked positions; must be zero-indexed with respect to the 5’ end of the region.

  • muts_per_pos (Iterable[np.ndarray]) – For each unmasked position, numbers of all reads with a mutation at that position.

  • end5s (np.ndarray) – 5’ end of every unique read; must be 0-indexed with respect to the 5’ end of the region.

  • end3s (np.ndarray) – 3’ end of every unique read; must be 0-indexed with respect to the 5’ end of the region.

  • counts_per_uniq (np.ndarray) – Number of times each unique read occurs.

  • resps (np.ndarray) – Cluster memberships of each read: 2D array (reads x clusters)

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]

seismicrna.core.unbias.calc_rectangular_sum(array: ndarray)

For each element of the main diagonal, calculate the sum over the rectangular array from that element to the upper right corner. This function is meant to be called by another function that has validated the arguments; hence, this function makes assumptions:

  • array has at least 2 dimensions.

  • The first and second dimensions of array have equal lengths.

Parameters:

array (np.ndarray) – Array of at least two dimensions for which to calculate the sum of each rectangular array from each element on the main diagonal to the upper right corner.

Returns:

Array with all but the first dimension of array indicating the sum of the array from each element on the main diagonal to the upper right corner of array.

Return type:

np.ndarray
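
For a 2D input, the description can be read as: element j of the result is the sum of rows 0 through j and columns j through the last column. A minimal sketch of that reading, using nested lists instead of a NumPy array and a hypothetical function name:

```python
def sketch_rectangular_sum(array):
    """For each diagonal element (j, j), sum the block from that element
    to the upper right corner (rows 0..j, columns j..n-1)."""
    n = len(array)
    return [sum(array[row][col]
                for row in range(j + 1)
                for col in range(j, n))
            for j in range(n)]


rect = sketch_rectangular_sum([[1, 2, 3],
                               [4, 5, 6],
                               [7, 8, 9]])
```

For this input the result is [6, 16, 18]. If rows are 5’ ends and columns are 3’ ends, the block for position j covers exactly the end pairs with 5’ end ≤ j ≤ 3’ end, which is one plausible reason such a sum is useful here.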

seismicrna.core.unbias.triu_allclose(a: ndarray | float, b: ndarray | float, rtol: float = 0.001, atol: float = 1e-06)

Whether the upper triangles of a and b are all close.

Parameters:
  • a (np.ndarray | float) – Array 1.

  • b (np.ndarray | float) – Array 2.

  • rtol (float = 1.0e-3) – Relative tolerance.

  • atol (float = 1.0e-6) – Absolute tolerance.

Returns:

Whether all elements of the upper triangles of a and b are close using the function np.allclose.

Return type:

bool

seismicrna.core.unbias.triu_dot(a: ndarray, b: ndarray)

Dot product of a and b over their first 2 dimensions.

Parameters:
  • a (np.ndarray) – Array 1.

  • b (np.ndarray) – Array 2.

Returns:

Dot product of a and b over their first 2 dimensions.

Return type:

np.ndarray

seismicrna.core.unbias.triu_log(a: ndarray)

Calculate the logarithm of the upper triangle(s) of array a. In the result, elements below the main diagonal are undefined.

Parameters:

a (np.ndarray) – Array (≥ 2 dimensions) for whose upper triangle(s) the logarithm is computed; the first 2 dimensions must have equal lengths.

Returns:

Logarithm of the upper triangle(s) of a.

Return type:

np.ndarray

seismicrna.core.unbias.triu_sum(a: ndarray)

Calculate the sum over the upper triangle(s) of array a.

Parameters:

a (np.ndarray) – Array whose upper triangle to sum.

Returns:

Sum of the upper triangle(s), with the same shape as the third and subsequent dimensions of a.

Return type:

np.ndarray
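
For a single square 2D array, the upper-triangle sum includes the main diagonal and everything above it. A minimal sketch with nested lists (the name sketch_triu_sum is hypothetical, and the handling of third and subsequent dimensions is omitted):

```python
def sketch_triu_sum(a):
    """Sum of the elements on and above the main diagonal of a square array."""
    n = len(a)
    return sum(a[i][j] for i in range(n) for j in range(i, n))


upper = sketch_triu_sum([[1, 2],
                         [3, 4]])
```

Here the element below the diagonal (3) is ignored, so the sum is 1 + 2 + 4 = 7.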

seismicrna.core.version.format_version(major: int = 0, minor: int = 23, patch: int = 0, prtag: str = '')
seismicrna.core.version.parse_version(version: str = '0.23.0')

Major and minor versions, patch, and pre-release tag.
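
A sketch of parsing and formatting such a version string is given below. The exact grammar accepted by the library is not stated here, so the pattern is an assumption: three dot-separated integers followed by an optional pre-release tag made of lowercase letters and digits (for example "rc1"). The names are hypothetical.

```python
import re

# Assumed grammar: MAJOR.MINOR.PATCH followed by an optional tag such as "rc1".
VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)([a-z]*\d*)$")


def sketch_parse_version(version):
    """Split a version string into major, minor, patch, and pre-release tag."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"Malformed version: {version!r}")
    major, minor, patch, prtag = match.groups()
    return int(major), int(minor), int(patch), prtag


def sketch_format_version(major, minor, patch, prtag=""):
    """Inverse of sketch_parse_version."""
    return f"{major}.{minor}.{patch}{prtag}"


parsed = sketch_parse_version("0.23.0")
```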

seismicrna.core.write.need_write(query: Path, force: bool = False, warn: bool = True)

Determine whether a file/directory must be written.

Parameters:
  • query (Path) – File or directory for which to check the need for writing.

  • force (bool = False) – Force the query to be written, even if it already exists.

  • warn (bool = True) – If the query does not need to be written, then log a warning.

Returns:

Whether the file must be written.

Return type:

bool

seismicrna.core.write.write_mode(force: bool = False, binary: bool = False)

Get the mode in which to open a file for writing.

Parameters:
  • force (bool = False) – Force the file to be written, truncating the file if it exists. If False and the file exists, a FileExistsError will be raised.

  • binary (bool = False) – Write the file in binary mode instead of text mode.

Returns:

The mode argument for the builtin function open().

Return type:

str
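
The behavior described for force matches the built-in open() modes "w" (truncate an existing file) and "x" (exclusive creation, raising FileExistsError if the file exists). A minimal sketch under that assumption, with a hypothetical function name:

```python
def sketch_write_mode(force=False, binary=False):
    """Build a mode string for open(): "w" truncates, "x" refuses to overwrite,
    and a trailing "b" selects binary mode."""
    return ("w" if force else "x") + ("b" if binary else "")


default_mode = sketch_write_mode()
forced_binary = sketch_write_mode(force=True, binary=True)
```

Passing the result as the mode argument of open() gives the overwrite protection without a separate existence check.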