seismicrna.core.batch package

Submodules

seismicrna.core.batch.accum.accumulate_batches(get_batch_count_all: Callable[[int], tuple[Any, Any, Any, Any]], num_batches: int, refseq: DNA, pos_nums: ndarray, patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True, validate: bool = True, num_cpus: int = 1)

Compute and accumulate counts from all batches, optionally in parallel.

Parameters:
  • get_batch_count_all (Callable[[int], tuple]) – Callable that takes a batch number and returns (num_reads, end_counts, count_per_pos, count_per_read).

  • num_batches (int) – Total number of batches to process.

  • refseq (DNA) – Reference sequence.

  • pos_nums (np.ndarray) – Position numbers to include in the per-position counts.

  • patterns (dict[str, RelPattern]) – Mapping from relationship name to relationship pattern.

  • ks (Iterable[int] | None) – Numbers of clusters; None for unclustered data.

  • count_ends (bool = True) – Whether to accumulate end coordinate counts.

  • count_pos (bool = True) – Whether to accumulate per-position counts.

  • count_read (bool = True) – Whether to accumulate per-read counts.

  • validate (bool = True) – Whether to validate the index and column labels of each batch.

  • num_cpus (int = 1) – Number of CPUs to use for parallel processing.

Returns:

Total (num_reads, end_counts, count_per_pos, count_per_read).

Return type:

tuple
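As a rough illustration of the calling pattern only (not SEISMIC-RNA's implementation), the sketch below maps a per-batch callable over the batch numbers with a thread pool and sums the results. The function get_batch_counts, the toy data, and the simplified two-element return value are hypothetical stand-ins for the four-element tuple described above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-batch data: the (5', 3') end pair of each read.
TOY_BATCHES = {
    0: [(1, 50), (1, 50), (10, 60)],
    1: [(1, 50), (20, 70)],
}


def get_batch_counts(batch_num: int) -> tuple[int, Counter]:
    """Return (num_reads, end_counts) for one batch (simplified stand-in)."""
    ends = TOY_BATCHES[batch_num]
    return len(ends), Counter(ends)


def accumulate_toy_batches(get_counts, num_batches: int, num_cpus: int = 1):
    """Compute counts for every batch, then sum them into totals."""
    with ThreadPoolExecutor(max_workers=num_cpus) as pool:
        # map() yields results in batch order, regardless of finish order.
        results = list(pool.map(get_counts, range(num_batches)))
    total_reads = sum(num_reads for num_reads, _ in results)
    total_ends = sum((ends for _, ends in results), Counter())
    return total_reads, total_ends


total_reads, total_ends = accumulate_toy_batches(get_batch_counts, 2,
                                                 num_cpus=2)
```

Threads are used here only to keep the sketch short and runnable; whether the library uses threads or processes is not stated in this reference.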

seismicrna.core.batch.accum.accumulate_confusion_matrices(get_batch: Callable[[int], RegionMutsBatch], num_batches: int, pattern: RelPattern, pos_index: Index, clusters: Index | None, min_gap: int = 0, num_cpus: int = 1)

Accumulate confusion matrices from all batches.

Parameters:
  • get_batch (Callable[[int], RegionMutsBatch]) – Callable that returns the batch for a given batch number.

  • num_batches (int) – Total number of batches to process.

  • pattern (RelPattern) – Relationship pattern defining which reads count as mutated.

  • pos_index (pd.Index) – Index of positions to include in the confusion matrix.

  • clusters (pd.Index | None) – Cluster index for the confusion matrix columns; None if not clustered.

  • min_gap (int = 0) – Minimum gap between positions to include in position pairs.

  • num_cpus (int = 1) – Number of CPUs to use for parallel processing.

Returns:

Total (n, a, b, ab) confusion matrix components.

Return type:

tuple

seismicrna.core.batch.accum.accumulate_counts(batch_counts: Iterable[tuple[Any, Any, Any, Any]], refseq: DNA, pos_nums: ndarray, patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True, validate: bool = True)

Accumulate counts from batches into total counts.

Parameters:
  • batch_counts (Iterable[tuple]) – Iterable of (num_reads, end_counts, count_per_pos, count_per_read) tuples, one per batch.

  • refseq (DNA) – Reference sequence.

  • pos_nums (np.ndarray) – Position numbers to include in the per-position counts.

  • patterns (dict[str, RelPattern]) – Mapping from relationship name to relationship pattern.

  • ks (Iterable[int] | None) – Numbers of clusters; None for unclustered data.

  • count_ends (bool = True) – Whether to accumulate end coordinate counts.

  • count_pos (bool = True) – Whether to accumulate per-position counts.

  • count_read (bool = True) – Whether to accumulate per-read counts.

  • validate (bool = True) – Whether to validate the index and column labels of each batch.

Returns:

Total (num_reads, end_counts, count_per_pos, count_per_read).

Return type:

tuple
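The accumulation step can be pictured as a running sum over batches with an optional label check. The sketch below is a simplified assumption: each batch is reduced to (num_reads, count_per_pos), where count_per_pos is a plain dict, and accumulate_toy_counts is a hypothetical name rather than part of the package.

```python
def accumulate_toy_counts(batch_counts, pos_nums, validate=True):
    """Sum (num_reads, count_per_pos) tuples from an iterable of batches."""
    total_reads = 0
    total_per_pos = {pos: 0 for pos in pos_nums}
    for num_reads, count_per_pos in batch_counts:
        # Optionally check that this batch uses the expected positions.
        if validate and list(count_per_pos) != list(pos_nums):
            raise ValueError(f"Expected positions {list(pos_nums)}, "
                             f"but got {list(count_per_pos)}")
        total_reads += num_reads
        for pos, count in count_per_pos.items():
            total_per_pos[pos] += count
    return total_reads, total_per_pos


batches = [
    (3, {1: 3, 2: 2, 3: 1}),
    (2, {1: 2, 2: 2, 3: 2}),
]
total_reads, total_per_pos = accumulate_toy_counts(iter(batches), [1, 2, 3])
```

Because batch_counts is consumed one item at a time, it can be a generator, so all batches need not be held in memory at once.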

seismicrna.core.batch.confusion.calc_bh_adjusted_pvals(pvals: ndarray | Series | DataFrame)

Calculate p-values adjusted with the Benjamini-Hochberg (BH) procedure.

Returns adjusted p-values (the minimum FDR at which each hypothesis would be rejected). Compare adjusted_pvals <= fdr to label significant pairs.
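For reference, the standard Benjamini-Hochberg adjustment can be written in a few lines of plain Python. This is a minimal sketch of the textbook procedure, not the package's code; bh_adjust is a hypothetical name, and NaN handling and pandas inputs are left out.

```python
def bh_adjust(pvals: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    # Indexes of the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # rank of this p-value in ascending order
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(pvals)
# Label the hypotheses that are significant at an FDR of 0.05.
significant = [p <= 0.05 for p in adjusted]
```

With these four p-values, the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02, so all four pass at an FDR of 0.05.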

seismicrna.core.batch.confusion.calc_confusion_matrix(pos_index: Index, covering_reads: dict[int, ndarray], mutated_reads: dict[int, ndarray], read_weights: DataFrame | None = None, min_gap: int = 0)

For every pair of positions, calculate the confusion matrix:

+----+----+
| AB | AO | A.
+----+----+
| OB | OO | O.
+----+----+
  .B   .O   ..

And return .., A., .B, AB in that order.
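To make the four returned quantities concrete, here is a small sketch that uses Python sets in place of the arrays in the signature. Two points are assumptions rather than documented behavior: that only reads covering both positions are counted, and that the gap is the number of positions strictly between the pair. The name toy_confusion_matrix is hypothetical.

```python
from itertools import combinations


def toy_confusion_matrix(positions, covering_reads, mutated_reads, min_gap=0):
    """Return {(pos_a, pos_b): (n, a, b, ab)} for every pair of positions."""
    result = {}
    for pos_a, pos_b in combinations(sorted(positions), 2):
        # Assumption: the gap is the number of positions between the pair.
        if pos_b - pos_a - 1 < min_gap:
            continue
        # Assumption: only reads covering both positions are counted.
        both = covering_reads[pos_a] & covering_reads[pos_b]
        mut_a = mutated_reads[pos_a] & both
        mut_b = mutated_reads[pos_b] & both
        result[pos_a, pos_b] = (len(both), len(mut_a), len(mut_b),
                                len(mut_a & mut_b))
    return result


covering = {1: {0, 1, 2, 3}, 2: {0, 1, 2, 3}, 5: {1, 2, 3, 4}}
mutated = {1: {0, 1}, 2: {1, 3}, 5: {1, 4}}
matrix = toy_confusion_matrix([1, 2, 5], covering, mutated, min_gap=1)
```

Here the pair (1, 2) is skipped because no positions lie between them, while (1, 5) and (2, 5) each yield one (.., A., .B, AB) tuple.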

seismicrna.core.batch.confusion.calc_confusion_phi(n: Series | DataFrame, a: Series | DataFrame, b: Series | DataFrame, ab: Series | DataFrame, *, min_cover: int | float = 1, tol: float = 1000000.0, validate: bool = True)

Calculate the phi correlation coefficient for a 2x2 matrix.

+----+----+
| AB | AO | A.
+----+----+
| OB | OO | O.
+----+----+
  .B   .O   ..
where

A. = AB + AO
.B = AB + OB
.. = A. + O. = .B + .O

Parameters:
  • n (pd.Series | pd.DataFrame) – Observations in total (..)

  • a (pd.Series | pd.DataFrame) – Observations for which A is true, regardless of B (A.)

  • b (pd.Series | pd.DataFrame) – Observations for which B is true, regardless of A (.B)

  • ab (pd.Series | pd.DataFrame) – Observations for which A and B are both true (AB)

  • min_cover (int | float) – Set phi values with coverage < min_cover to NaN

  • tol (float) – Tolerance for phi coefficients outside [-1, 1]

  • validate (bool) – Validate the confusion matrix first

Returns:

Phi correlation coefficient

Return type:

float | np.ndarray | pd.Series | pd.DataFrame
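In terms of the totals above, the standard phi coefficient is (n * ab - a * b) / sqrt(a * (n - a) * b * (n - b)). The scalar sketch below applies that textbook formula; it is not the package's vectorized implementation, phi_coefficient is a hypothetical name, and returning NaN for a zero marginal total is an assumption.

```python
import math


def phi_coefficient(n, a, b, ab, min_cover=1):
    """Phi coefficient from the totals n (..), a (A.), b (.B) and ab (AB)."""
    if n < min_cover:
        return math.nan
    # Product of the four marginal totals: A., O., .B and .O
    denom = a * (n - a) * b * (n - b)
    if denom <= 0:
        # Phi is undefined when any marginal total is zero.
        return math.nan
    return (n * ab - a * b) / math.sqrt(denom)


independent = phi_coefficient(100, 50, 50, 25)
always_together = phi_coefficient(10, 5, 5, 5)
never_together = phi_coefficient(10, 5, 5, 0)
```

The three calls illustrate the range of the coefficient: 0 when A and B are independent, 1 when they always co-occur, and -1 when they never do.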

seismicrna.core.batch.confusion.calc_confusion_pvals(n: Series, a: Series, b: Series, ab: Series, *, validate: bool = True)

Calculate the p-value of each element of the confusion matrix using a two-sided Fisher exact test.
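A two-sided Fisher exact test on one 2x2 table can be sketched with only the standard library. The code below follows the common convention of summing the hypergeometric probabilities of all tables that are no more likely than the observed one; whether the package uses exactly this convention is an assumption, and fisher_two_sided is a hypothetical name.

```python
from math import comb


def fisher_two_sided(n, a, b, ab):
    """Two-sided Fisher exact p-value from the totals n, a, b and ab."""
    def prob(k):
        # Hypergeometric probability that exactly k observations are AB.
        return comb(a, k) * comb(n - a, b - k) / comb(n, b)

    observed = prob(ab)
    k_min, k_max = max(0, a + b - n), min(a, b)
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small relative tolerance absorbs floating-point error.
    total = sum(p for p in map(prob, range(k_min, k_max + 1))
                if p <= observed * (1 + 1e-7))
    return min(total, 1.0)


p_strong = fisher_two_sided(10, 5, 5, 5)
p_none = fisher_two_sided(4, 2, 2, 1)
```

For the first table (all five A observations are also B), the p-value is 2/252; for the second, every table is at most as likely as the observed one, so the p-value is 1.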

seismicrna.core.batch.confusion.init_confusion_matrix(pos_index: Index, clusters: Index | None = None, min_gap: int = 0)

For every pair of positions, initialize the confusion matrix:

+----+----+
| AB | AO | A.
+----+----+
| OB | OO | O.
+----+----+
  .B   .O   ..

And return .., A., .B, AB in that order.

seismicrna.core.batch.confusion.validate_confusion_matrix(n: Series | DataFrame, a: Series | DataFrame, b: Series | DataFrame, ab: Series | DataFrame)

seismicrna.core.batch.count.calc_count_per_pos(pattern: RelPattern, cover_per_pos: Series | DataFrame, rels_per_pos: dict[int, Series | DataFrame])

Count the reads that fit a pattern at each position.

seismicrna.core.batch.count.calc_count_per_read(pattern: RelPattern, cover_per_read: DataFrame, rels_per_read: dict[int, DataFrame])

Count the positions that fit a pattern in each read.

seismicrna.core.batch.count.calc_coverage(pos_index: Index, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None, read_weights: DataFrame | None = None)

Calculate the coverage per position and per read.

seismicrna.core.batch.count.calc_covered_reads_per_pos(pos_index: Index, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)

For each position, find all reads covering it.
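Conceptually, a read covers a position when at least one of its segments spans that position. The sketch below shows this with nested lists (one row of segment ends per read) and ignores the segment mask and read weights; covered_reads_per_pos is a hypothetical helper, not the library function.

```python
def covered_reads_per_pos(positions, read_nums, seg_end5s, seg_end3s):
    """Map each position to the read numbers with a segment covering it."""
    covered = {pos: [] for pos in positions}
    for read, end5s, end3s in zip(read_nums, seg_end5s, seg_end3s):
        for pos in positions:
            # A read covers a position if any of its segments spans it.
            if any(e5 <= pos <= e3 for e5, e3 in zip(end5s, end3s)):
                covered[pos].append(read)
    return covered


covered = covered_reads_per_pos(
    positions=[2, 5, 7],
    read_nums=[0, 1, 2],
    seg_end5s=[[1, 6], [2, 2], [4, 8]],
    seg_end3s=[[3, 8], [5, 5], [6, 9]],
)
# The unweighted coverage per position is the number of covering reads.
cover_per_pos = {pos: len(reads) for pos, reads in covered.items()}
```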

seismicrna.core.batch.count.calc_reads_per_pos(pattern: RelPattern, mutations: dict[int, dict[int, ndarray]], pos_index: Index)

For each position, find all reads matching a pattern.

seismicrna.core.batch.count.calc_rels_per_pos(mutations: dict[int, dict[int, ndarray]], num_reads: int | Series, cover_per_pos: Series | DataFrame, read_indexes: ndarray | None = None, read_weights: DataFrame | None = None)

For each relationship, the number of reads at each position.

seismicrna.core.batch.count.calc_rels_per_read(mutations: dict[int, dict[int, ndarray]], pos_index: Index, cover_per_read: DataFrame, read_indexes: ndarray)

For each relationship, the number of positions in each read.

seismicrna.core.batch.count.count_end_coords(end5s: ndarray, end3s: ndarray, weights: DataFrame | None = None)

Count each pair of 5’ and 3’ end coordinates.
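Counting end pairs amounts to tallying how often each (5' end, 3' end) combination occurs, optionally weighting each read. In this sketch the weights are simplified to one number per read, which is an assumption (the signature takes a DataFrame), and count_end_pairs is a hypothetical name.

```python
from collections import defaultdict


def count_end_pairs(end5s, end3s, weights=None):
    """Return {(end5, end3): count}, weighting each read if weights given."""
    if weights is None:
        # Without weights, every read counts once.
        weights = [1] * len(end5s)
    counts = defaultdict(int)
    for end5, end3, weight in zip(end5s, end3s, weights):
        counts[end5, end3] += weight
    return dict(counts)


unweighted = count_end_pairs([1, 1, 10], [50, 50, 60])
weighted = count_end_pairs([1, 1, 10], [50, 50, 60], [0.5, 0.25, 1.0])
```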

exception seismicrna.core.batch.ends.BadSegmentEndsError

Bases: ValueError

Segment 5’ and 3’ ends are not valid.

class seismicrna.core.batch.ends.EndCoords(*, region: Region, seg_end5s: ndarray, seg_end3s: ndarray, sanitize: bool = True, **kwargs)

Bases: object

Collection of 5’ and 3’ segment end coordinates.

property contiguous

Whether the segments of each read are contiguous.

property num_contiguous

Number of contiguous reads.

property num_discontiguous

Number of discontiguous reads.

property num_reads

Number of reads.

property num_segments

Number of segments in each read.

property pos_dtype

Data type for positions.

property read_end3s

3’ end of each read.

property read_end5s

5’ end of each read.

property read_lengths

Length of each read.

property seg_ends_mask: ndarray | None

Whether each pair of 5’ and 3’ ends is masked out.

seismicrna.core.batch.ends.count_reads_segments(seg_ends: ndarray, what: str = 'seg_ends') → tuple[int, int]

seismicrna.core.batch.ends.find_contiguous_reads(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)

Whether the segments of each read are contiguous.

seismicrna.core.batch.ends.find_read_end3s(seg_end3s: ndarray, seg_ends_mask: ndarray | None)

Find the 3’ end of each read.

seismicrna.core.batch.ends.find_read_end5s(seg_end5s: ndarray, seg_ends_mask: ndarray | None)

Find the 5’ end of each read.
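The three read-level quantities above (contiguity, 3' end, 5' end) can be illustrated for a single read as follows. The sketch ignores seg_ends_mask, and it assumes that a read is contiguous when its segments overlap or abut, leaving no uncovered position between them; summarize_read is a hypothetical helper.

```python
def summarize_read(end5s, end3s):
    """Return (read 5' end, read 3' end, contiguous?) for one read."""
    segments = sorted(zip(end5s, end3s))
    read_end5 = segments[0][0]
    read_end3 = max(end3 for _, end3 in segments)
    contiguous = True
    reach = segments[0][1]  # 3'-most position covered so far
    for end5, end3 in segments[1:]:
        # Assumption: one or more uncovered positions between segments
        # make the read discontiguous.
        if end5 > reach + 1:
            contiguous = False
        reach = max(reach, end3)
    return read_end5, read_end3, contiguous


overlapping = summarize_read([1, 40], [50, 90])
abutting = summarize_read([1, 51], [50, 90])
separated = summarize_read([1, 60], [50, 90])
```

All three reads span positions 1 to 90, but only the last one has an uncovered stretch (51 to 59) between its segments.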

seismicrna.core.batch.ends.match_reads_segments(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)

Number of segments for the given end coordinates.

seismicrna.core.batch.ends.merge_read_ends(read_end5s: ndarray, read_end3s: ndarray)

Return the 5’ and 3’ ends as one 2D array.

seismicrna.core.batch.ends.sanitize_segment_ends(seg_end5s: ndarray, seg_end3s: ndarray, min_pos: int, max_pos: int, check_values: bool = True)

Sanitize end coordinates.

Parameters:
  • seg_end5s (np.ndarray) – 5’ end coordinate of each segment in each read.

  • seg_end3s (np.ndarray) – 3’ end coordinate of each segment in each read.

  • min_pos (int) – Minimum allowed value of a position.

  • max_pos (int) – Maximum allowed value of a position.

  • check_values (bool = True) – Whether to check the bounds of the values, which is the most expensive operation in this function. Can be set to False if the only desired effect is to ensure the output is a positive, even number of arrays in the proper data type.

Returns:

Sanitized end coordinates: encoded in the most efficient data type, and if check_values is True then all between min_pos and max_pos (inclusive).

Return type:

tuple[np.ndarray, np.ndarray]

seismicrna.core.batch.ends.simulate_segment_ends(uniq_end5s: ndarray, uniq_end3s: ndarray, p_ends: ndarray, num_reads: int, read_length: int = 0, p_rev: float = 0.5, seed: int | None = None)

Simulate segment end coordinates from their probabilities.

Parameters:
  • uniq_end5s (np.ndarray) – Unique read 5’ end coordinates.

  • uniq_end3s (np.ndarray) – Unique read 3’ end coordinates.

  • p_ends (np.ndarray) – Probability of each set of unique end coordinates.

  • num_reads (int) – Number of reads to simulate.

  • read_length (int = 0) – If == 0, then generate single-end reads (1 segment per read); if > 0, then generate paired-end reads (2 segments per read) with at most this number of base calls in each segment.

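  • p_rev (float = 0.5) – For paired-end reads, the probability that mate 1 aligns in the reverse orientation.

  • seed (int | None) – Random seed for reproducibility; None for a random seed.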

Returns:

5’ and 3’ segment end coordinates of the simulated reads.

Return type:

tuple[np.ndarray, np.ndarray]
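One way to picture this simulation is to sample whole-read end pairs by their probabilities and then derive the segments. The sketch below uses the standard random module; the placement of the two mates and the reading of p_rev (probability that mate 1 is the mate at the 3' end of the read) are assumptions, and simulate_toy_ends is a hypothetical name.

```python
import random


def simulate_toy_ends(uniq_end5s, uniq_end3s, p_ends, num_reads,
                      read_length=0, p_rev=0.5, seed=None):
    """Sample read ends by probability; return segment 5' and 3' ends."""
    rng = random.Random(seed)
    pairs = list(zip(uniq_end5s, uniq_end3s))
    sampled = rng.choices(pairs, weights=p_ends, k=num_reads)
    seg_end5s, seg_end3s = [], []
    for end5, end3 in sampled:
        if read_length == 0:
            # Single-end: one segment spanning the whole read.
            seg_end5s.append([end5])
            seg_end3s.append([end3])
            continue
        # Paired-end: one mate starts at each end of the read, and each
        # mate has at most read_length positions.
        fwd = (end5, min(end5 + read_length - 1, end3))
        rev = (max(end3 - read_length + 1, end5), end3)
        # Assumption: with probability p_rev, mate 1 is the reverse mate.
        mate1, mate2 = (rev, fwd) if rng.random() < p_rev else (fwd, rev)
        seg_end5s.append([mate1[0], mate2[0]])
        seg_end3s.append([mate1[1], mate2[1]])
    return seg_end5s, seg_end3s


end5s, end3s = simulate_toy_ends([1, 11], [100, 90], [0.75, 0.25],
                                 num_reads=4, read_length=30, seed=0)
```

Passing the same seed reproduces the same simulated reads, which is the usual reason for exposing a seed parameter.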

seismicrna.core.batch.ends.sort_segment_ends(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)

Sort the segment end coordinates and label the 3’ end of each contiguous set of segments.

Parameters:
  • seg_end5s (np.ndarray) – 5’ end of each segment in each read.

  • seg_end3s (np.ndarray) – 3’ end of each segment in each read.

  • seg_ends_mask (np.ndarray | None) – Whether each pair of 5’ and 3’ ends is masked.

Returns:

  • Sorted 5’ and 3’ coordinates of the segments in each read

  • Labels of whether each coordinate is a 5’ end of a segment

  • Labels of whether each coordinate is a 5’ end of a contiguous segment

  • Labels of whether each coordinate is a 3’ end of a contiguous segment

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]

seismicrna.core.batch.index.count_base_types(base_pos_index: Index)

Return the number of each type of base in the index of positions and bases.

seismicrna.core.batch.index.iter_base_types(base_pos_index: Index)

For each type of base in the index of positions and bases, yield the positions in the index with that type of base.
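The two index helpers above can be pictured with a plain list of (position, base) pairs standing in for the pandas index. The names below are hypothetical, and yielding (base, positions) tuples in alphabetical order is an assumption about the shape of the output.

```python
from collections import Counter, defaultdict

# Hypothetical index of (position, base) pairs.
base_pos_index = [(1, "G"), (2, "A"), (3, "C"), (5, "A"), (6, "T")]


def count_toy_base_types(index):
    """Count how many positions have each type of base."""
    return Counter(base for _, base in index)


def iter_toy_base_types(index):
    """Yield (base, positions) for each type of base in the index."""
    positions = defaultdict(list)
    for pos, base in index:
        positions[base].append(pos)
    for base in sorted(positions):
        yield base, positions[base]


base_counts = count_toy_base_types(base_pos_index)
positions_by_base = dict(iter_toy_base_types(base_pos_index))
```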

seismicrna.core.batch.index.list_batch_nums(num_batches: int)

List the batch numbers.

class seismicrna.core.batch.muts.MutsBatch(*, region: Region, sanitize: bool = True, muts: dict[int, dict[int, list[int] | ndarray]], masked_read_nums: ndarray | list[int] | None = None, **kwargs)

Bases: EndCoords, ReadBatch, ABC

Batch of mutational data.

property pos_nums

Positions in use.

property read_end_counts

Counts of read end coordinates.

abstract property read_weights: DataFrame | None

Weights for each read when computing counts.

class seismicrna.core.batch.muts.RegionMutsBatch(*, region: Region, **kwargs)

Bases: MutsBatch, ABC

Batch of mutational data that knows its region.

calc_confusion_matrix(pattern: RelPattern, min_gap: int = 0)

Calculate the confusion matrix of mutations.

calc_min_mut_dist(pattern: RelPattern)

For each read, calculate the smallest distance (i.e. the gap plus 1) between any two mutations.
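For a single read, the smallest distance between mutations is the smallest difference between neighboring mutated positions once they are sorted. The helper below is a hypothetical sketch; returning 0 for reads with fewer than two mutations is an assumption.

```python
def min_mut_dist(mut_positions):
    """Smallest distance between any two mutations, or 0 if fewer than two."""
    ordered = sorted(mut_positions)
    if len(ordered) < 2:
        # Assumption: 0 marks reads with fewer than two mutations.
        return 0
    # The closest pair is always adjacent once the positions are sorted.
    return min(b - a for a, b in zip(ordered, ordered[1:]))


closest = min_mut_dist([30, 12, 15])
```

In this example the closest mutations are at positions 12 and 15, so the distance is 3: a gap of 2 positions plus 1.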

count_all(patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True)

Calculate all counts.

count_per_pos(pattern: RelPattern)

Count the reads that fit a relationship pattern at each position in a region.

count_per_read(pattern: RelPattern)

Count the positions in a region that fit a relationship pattern in each read.

property cover_per_pos

Number of reads covering each position.

property cover_per_read

Number of positions covered by each read.

property covered_reads_per_pos

Reads covering each position.

inject_close_muts(pattern: RelPattern, mut_probs: Iterable[float] | ndarray, seed: int | None)

Return a new muts dictionary with extra mutations within min_gap positions 5’ of existing mutations.

iter_reads(pattern: RelPattern, only_read_ends: bool = False, require_contiguous: bool = False)

End coordinates and mutated positions in each read.

property matrix

Matrix of relationships at each position in each read.

merge_close_muts(pattern: RelPattern, min_gap: int)

Return a new muts dictionary in which mutations closer than min_gap are merged into a single mutation, keeping only the 3’-most mutation. This algorithm corrects for extra mutations occurring at non-modified positions shortly 5’ of modifications, and therefore distinguishes probe-modified positions from all mutated positions.
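For the mutated positions of one read, the merging rule can be sketched as a single pass from the 3' end toward the 5' end. Two details are assumptions: the gap is taken as the number of positions between two mutations, and each mutation is compared with the most recently kept mutation on its 3' side. The name merge_close is hypothetical, and the real method operates on the whole muts dictionary rather than one read.

```python
def merge_close(mut_positions, min_gap):
    """Drop mutations too close to a kept mutation on their 3' side."""
    kept = []
    # Walk from the 3' end (largest position) toward the 5' end.
    for pos in sorted(mut_positions, reverse=True):
        # Assumption: the gap is the number of positions between two
        # mutations, measured to the most recently kept mutation.
        if not kept or kept[-1] - pos - 1 >= min_gap:
            kept.append(pos)
    return kept[::-1]


merged = merge_close([10, 12, 13, 20], min_gap=3)
```

Here positions 10 and 12 are dropped because fewer than 3 positions separate each of them from the kept mutation at 13, while 13 and 20 are far enough apart to both remain.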

property pos_index

Index of unmasked positions and bases.

reads_noclose_muts(pattern: RelPattern, min_gap: int)

List the reads with no two mutations too close.

reads_per_pos(pattern: RelPattern)

For each position, find all reads matching a relationship pattern.

property rels_per_pos

For each relationship, the number of reads at each position with that relationship.

property rels_per_read

For each relationship, the number of positions in each read with that relationship.

seismicrna.core.batch.muts.calc_muts_matrix(region: Region, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray, muts: dict[int, dict[int, ndarray]])

Build a matrix of relationships at each position in each read.

Parameters:
  • region (Region) – Region providing unmasked positions.

  • read_nums (np.ndarray) – Read numbers: 1D array (reads).

  • seg_end5s (np.ndarray) – 5’ end coordinates of each segment: 2D array (reads x segments).

  • seg_end3s (np.ndarray) – 3’ end coordinates of each segment: 2D array (reads x segments).

  • seg_ends_mask (np.ndarray) – Boolean mask of segments to exclude: 2D array (reads x segments).

  • muts (dict[int, dict[int, np.ndarray]]) – Mapping from position to relationship code to read numbers.

Returns:

DataFrame of relationship codes, indexed by read number and columned by position index.

Return type:

pd.DataFrame
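The construction can be pictured in two steps: fill every covered position with a default "match" code, then overwrite the positions listed in muts. The sketch below uses nested dicts instead of a DataFrame and ignores the segment mask; the codes 0, 1, 64 and 128 are hypothetical placeholders, not the package's relationship codes.

```python
NOCOV, MATCH = 0, 1  # hypothetical codes, not the library's constants


def toy_muts_matrix(positions, read_nums, seg_end5s, seg_end3s, muts):
    """Return {read: {pos: code}} from segment ends and mutation data."""
    matrix = {}
    for read, end5s, end3s in zip(read_nums, seg_end5s, seg_end3s):
        row = {}
        for pos in positions:
            covered = any(e5 <= pos <= e3 for e5, e3 in zip(end5s, end3s))
            # Covered positions default to a match; others are uncovered.
            row[pos] = MATCH if covered else NOCOV
        matrix[read] = row
    # Overwrite the defaults with each recorded relationship.
    for pos, rels in muts.items():
        for code, reads in rels.items():
            for read in reads:
                matrix[read][pos] = code
    return matrix


matrix = toy_muts_matrix(
    positions=[1, 2, 3],
    read_nums=[0, 1],
    seg_end5s=[[1], [2]],
    seg_end3s=[[3], [3]],
    muts={2: {64: [1]}, 3: {128: [0]}},
)
```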

seismicrna.core.batch.muts.sanitize_muts(muts: dict[int, dict[int, list[int] | ndarray]], region: Region, data_type: type, sanitize: bool = True)

Keep only unmasked positions in the muts dictionary and convert the read lists to arrays.

Parameters:
  • muts (dict[int, dict[int, list[int] | np.ndarray]]) – Mapping from position to relationship code to read numbers.

  • region (Region) – Region whose unmasked positions define which positions to keep.

  • data_type (type) – NumPy dtype to use for the read number arrays.

  • sanitize (bool = True) – If True, filter to unmasked positions and convert to arrays; if False, return the muts values as-is.

Returns:

Sanitized mutation data.

Return type:

dict[int, dict[int, np.ndarray]]

seismicrna.core.batch.muts.simulate_muts(pmut: DataFrame, seg_end5s: ndarray, seg_end3s: ndarray, seed: int | None)

Simulate mutation data.

Parameters:
  • pmut (pd.DataFrame) – Rate of each type of mutation at each position.

  • seg_end5s (np.ndarray) – 5’ end coordinate of each segment.

  • seg_end3s (np.ndarray) – 3’ end coordinate of each segment.

  • seed (int | None) – Random seed for reproducibility; None for a random seed.

Returns:

Mutation data: mapping from position to relationship code to read numbers.

Return type:

dict[int, dict[int, np.ndarray]]

class seismicrna.core.batch.read.ReadBatch(*, batch: int, **kwargs)

Bases: ABC

Batch of reads.

property batch_read_index

MultiIndex of the batch number and read numbers.

property masked_reads_bool

property max_read: int

Maximum possible value for a read index.

property num_reads: int | Series

Number of reads.

property read_dtype

Data type for read numbers.

property read_indexes: ndarray

Map each read number to its index in self.read_nums.

property read_nums: ndarray

Read numbers.
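The relationship between read_nums and read_indexes can be shown with a small lookup table: the entry at each read number holds that read's position within read_nums. This is a sketch with plain lists; make_read_indexes is a hypothetical name, and using -1 for read numbers absent from the batch is an assumption.

```python
def make_read_indexes(read_nums, max_read):
    """Return a list mapping each read number to its index in read_nums."""
    # Assumption: -1 marks read numbers that are absent from the batch.
    indexes = [-1] * (max_read + 1)
    for index, read_num in enumerate(read_nums):
        indexes[read_num] = index
    return indexes


read_nums = [2, 5, 7]
read_indexes = make_read_indexes(read_nums, max_read=7)
# Look up where reads 5 and 7 sit within read_nums.
rows = [read_indexes[num] for num in (5, 7)]
```

Such a table turns "which row holds read number r" into a constant-time lookup, which helps when mutation data refer to reads by number.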