seismicrna.core.batch package

Subpackages

seismicrna.core.batch.tests package
- Submodules

Submodules

seismicrna.core.batch.accum.accumulate_batches(get_batch_count_all: Callable[[int], tuple[Any, Any, Any, Any]], num_batches: int, refseq: DNA, pos_nums: ndarray, patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True, validate: bool = True, num_cpus: int = 1)

seismicrna.core.batch.accum.accumulate_counts(batch_counts: Iterable[tuple[Any, Any, Any, Any]], refseq: DNA, pos_nums: ndarray, patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True, validate: bool = True)

seismicrna.core.batch.count.calc_count_per_pos(pattern: RelPattern, cover_per_pos: Series | DataFrame, rels_per_pos: dict[int, Series | DataFrame]): Count the reads that fit a pattern at each position.

seismicrna.core.batch.count.calc_count_per_read(pattern: RelPattern, cover_per_read: DataFrame, rels_per_read: dict[int, DataFrame]): Count the positions that fit a pattern in each read.

seismicrna.core.batch.count.calc_coverage(pos_index: Index, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None, read_weights: DataFrame | None = None): Calculate the coverage per position and per read.
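
To make the two kinds of coverage concrete, here is a minimal NumPy sketch for reads that have a single segment each. It is not the seismicrna implementation: the helper name `sketch_coverage` is hypothetical, and the 1-indexed, inclusive coordinate convention is an assumption made for the example. Masks and read weights are left out.

```python
import numpy as np


def sketch_coverage(positions, end5s, end3s):
    """Hypothetical helper: coverage for single-segment reads.

    Coordinates are assumed to be 1-indexed and inclusive.
    """
    positions = np.asarray(positions)
    end5s = np.asarray(end5s)
    end3s = np.asarray(end3s)
    # covered[i, j] is True if read i spans position j.
    covered = ((end5s[:, np.newaxis] <= positions)
               & (positions <= end3s[:, np.newaxis]))
    # Column sums: reads covering each position.
    # Row sums: positions covered by each read.
    return covered.sum(axis=0), covered.sum(axis=1)


# Three reads spanning 1-3, 2-5, and 4-4 over positions 1 to 5.
per_pos, per_read = sketch_coverage([1, 2, 3, 4, 5], [1, 2, 4], [3, 5, 4])
```

The boolean matrix costs memory proportional to reads times positions, so it is only meant to show the definition, not to scale to large batches.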

seismicrna.core.batch.count.calc_reads_per_pos(pattern: RelPattern, mutations: dict[int, dict[int, ndarray]], pos_index: Index): For each position, find all reads matching a pattern.

seismicrna.core.batch.count.calc_rels_per_pos(mutations: dict[int, dict[int, ndarray]], num_reads: int | Series, cover_per_pos: Series | DataFrame, read_indexes: ndarray | None = None, read_weights: DataFrame | None = None): For each relationship, count the reads at each position with that relationship.

seismicrna.core.batch.count.calc_rels_per_read(mutations: dict[int, dict[int, ndarray]], pos_index: Index, cover_per_read: DataFrame, read_indexes: ndarray): For each relationship, count the positions in each read with that relationship.

seismicrna.core.batch.count.count_end_coords(end5s: ndarray, end3s: ndarray, weights: DataFrame | None = None): Count each pair of 5’ and 3’ end coordinates.
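
Counting end-coordinate pairs amounts to tallying how often each (5’ end, 3’ end) combination occurs. The sketch below shows the unweighted case with the standard library only. The helper name `sketch_count_end_pairs` is hypothetical, and the return type (a `Counter`) is an assumption for the example, not what the library returns.

```python
from collections import Counter


def sketch_count_end_pairs(end5s, end3s):
    """Hypothetical helper: tally each distinct (5' end, 3' end) pair."""
    if len(end5s) != len(end3s):
        raise ValueError("end5s and end3s must have the same length")
    # Pair up the coordinates read by read and count the repeats.
    return Counter(zip(end5s, end3s))


pair_counts = sketch_count_end_pairs([1, 1, 2, 1], [10, 10, 12, 9])
```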

exception seismicrna.core.batch.ends.BadSegmentEndsError

Bases: ValueError

Segment 5’ and 3’ ends are not valid.

class seismicrna.core.batch.ends.EndCoords(*, region: Region, seg_end5s: ndarray, seg_end3s: ndarray, sanitize: bool = True, **kwargs)

Bases: object

Collection of 5’ and 3’ segment end coordinates.

property contiguous: Whether the segments of each read are contiguous.

property num_contiguous: Number of contiguous reads.

property num_discontiguous: Number of discontiguous reads.

property num_reads: Number of reads.

property num_segments: Number of segments in each read.

property pos_dtype: Data type for positions.

property read_end3s: 3’ end of each read.

property read_end5s: 5’ end of each read.

property read_lengths: Length of each read.

property seg_ends_mask: ndarray | None: Whether each pair of 5’ and 3’ ends is masked out.

seismicrna.core.batch.ends.count_reads_segments(seg_ends: ndarray, what: str = 'seg_ends') → tuple[int, int]

seismicrna.core.batch.ends.find_contiguous_reads(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None): Whether the segments of each read are contiguous.

seismicrna.core.batch.ends.find_read_end3s(seg_end3s: ndarray, seg_ends_mask: ndarray | None): Find the 3’ end of each read.

seismicrna.core.batch.ends.find_read_end5s(seg_end5s: ndarray, seg_ends_mask: ndarray | None): Find the 5’ end of each read.

seismicrna.core.batch.ends.match_reads_segments(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None): Number of segments for the given end coordinates.

seismicrna.core.batch.ends.merge_read_ends(read_end5s: ndarray, read_end3s: ndarray): Return the 5’ and 3’ ends as one 2D array.
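
One plausible reading of the read-end helpers above: the 5’ end of a read is the smallest unmasked segment 5’ end, the 3’ end is the largest unmasked segment 3’ end, and merging stacks the two as columns. The NumPy sketch below follows that reading. The helper name `sketch_read_ends` and the array layout (one row per read, one column per segment) are assumptions, not confirmed details of the library.

```python
import numpy as np


def sketch_read_ends(seg_end5s, seg_end3s, seg_ends_mask=None):
    """Hypothetical helper: collapse segment ends into read ends.

    Each array has one row per read and one column per segment (an
    assumed layout); True entries in the mask mark segments to ignore.
    """
    seg_end5s = np.asarray(seg_end5s)
    seg_end3s = np.asarray(seg_end3s)
    if seg_ends_mask is None:
        seg_ends_mask = np.zeros(seg_end5s.shape, dtype=bool)
    # Swap masked ends for extreme values so they never win min/max.
    # A read whose segments are all masked keeps those extreme values.
    big = np.iinfo(seg_end5s.dtype).max
    small = np.iinfo(seg_end3s.dtype).min
    read_end5s = np.where(seg_ends_mask, big, seg_end5s).min(axis=1)
    read_end3s = np.where(seg_ends_mask, small, seg_end3s).max(axis=1)
    # Column 0 holds the 5' ends and column 1 holds the 3' ends.
    return np.stack([read_end5s, read_end3s], axis=1)


read_ends = sketch_read_ends(
    [[1, 5], [3, 2]],
    [[4, 9], [6, 8]],
    [[False, False], [False, True]],
)
```

In the second read the masked segment (2 to 8) is ignored, so its ends come from the first segment alone.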

seismicrna.core.batch.ends.sanitize_segment_ends(seg_end5s: ndarray, seg_end3s: ndarray, min_pos: int, max_pos: int, check_values: bool = True)

Sanitize end coordinates.

Parameters:

seg_end5s (np.ndarray) – 5’ end coordinate of each segment in each read.
seg_end3s (np.ndarray) – 3’ end coordinate of each segment in each read.
min_pos (int) – Minimum allowed value of a position.
max_pos (int) – Maximum allowed value of a position.
check_values (bool = True) – Whether to check the bounds of the values, which is the most expensive operation in this function. Can be set to False if the only desired effect is to ensure the output is a positive, even number of arrays in the proper data type.

Returns:

Sanitized end coordinates: encoded in the most efficient data type, and if check_values is True then all between min_pos and max_pos (inclusive).

Return type:

tuple[np.ndarray, np.ndarray]
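
The two effects described above (an optional bounds check, then conversion to a compact data type) can be sketched with NumPy as follows. This is an illustration under assumptions, not the library's code: `sketch_sanitize_ends` is a hypothetical name, it raises a plain `ValueError` where the library may raise `BadSegmentEndsError`, and choosing the dtype with `np.min_scalar_type` is a guess at what "most efficient data type" means.

```python
import numpy as np


def sketch_sanitize_ends(seg_end5s, seg_end3s, min_pos, max_pos,
                         check_values=True):
    """Hypothetical helper: bounds-check ends and shrink their dtype."""
    seg_end5s = np.asarray(seg_end5s)
    seg_end3s = np.asarray(seg_end3s)
    if seg_end5s.shape != seg_end3s.shape:
        raise ValueError("5' and 3' end arrays must have the same shape")
    if check_values:
        # The expensive step: scan every coordinate against the bounds.
        for name, ends in (("5'", seg_end5s), ("3'", seg_end3s)):
            if ends.size > 0 and (ends.min() < min_pos
                                  or ends.max() > max_pos):
                raise ValueError(
                    f"{name} ends must lie in [{min_pos}, {max_pos}]"
                )
    # Smallest unsigned integer type able to hold max_pos (assumption).
    dtype = np.min_scalar_type(max_pos)
    return seg_end5s.astype(dtype), seg_end3s.astype(dtype)


clean_end5s, clean_end3s = sketch_sanitize_ends(
    [[1, 5]], [[4, 9]], min_pos=1, max_pos=300
)
```

With `check_values=False` the function skips the scan and only converts the data type, which mirrors the trade-off described for the `check_values` parameter.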

seismicrna.core.batch.ends.simulate_segment_ends(uniq_end5s: ndarray, uniq_end3s: ndarray, p_ends: ndarray, num_reads: int, read_length: int = 0, p_rev: float = 0.5)

Simulate segment end coordinates from their probabilities.

Parameters:

uniq_end5s (np.ndarray) – Unique read 5’ end coordinates.
uniq_end3s (np.ndarray) – Unique read 3’ end coordinates.
p_ends (np.ndarray) – Probability of each set of unique end coordinates.
num_reads (int) – Number of reads to simulate.
read_length (int = 0) – If 0, generate single-end reads (1 segment per read); if greater than 0, generate paired-end reads (2 segments per read) with at most this number of base calls in each segment.
p_rev (float = 0.5) – For paired-end reads, the probability that mate 1 aligns in the

Returns:

5’ and 3’ segment end coordinates of the simulated reads.

Return type:

tuple[np.ndarray, np.ndarray]
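
For the single-end case (`read_length` of 0), sampling end coordinates from their probabilities can be sketched as one weighted draw per read. The sketch below is an assumption-laden illustration, not the library's code: `sketch_simulate_ends` and its `seed` parameter are hypothetical, the output shape of (reads, 1 segment) is assumed, and paired-end simulation is not covered.

```python
import numpy as np


def sketch_simulate_ends(uniq_end5s, uniq_end3s, p_ends, num_reads, seed=0):
    """Hypothetical helper: sample single-end read coordinates."""
    rng = np.random.default_rng(seed)
    uniq_end5s = np.asarray(uniq_end5s)
    uniq_end3s = np.asarray(uniq_end3s)
    # Pick one index into the unique pairs for every simulated read,
    # weighted by the probability of each pair.
    indexes = rng.choice(uniq_end5s.size, size=num_reads, p=p_ends)
    # One segment per read, so add a trailing axis of length 1.
    end5s = uniq_end5s[indexes][:, np.newaxis]
    end3s = uniq_end3s[indexes][:, np.newaxis]
    return end5s, end3s


sim_end5s, sim_end3s = sketch_simulate_ends(
    [1, 3], [10, 12], [0.25, 0.75], num_reads=100
)
```

Drawing an index rather than drawing the 5’ and 3’ ends separately keeps each sampled pair intact.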

seismicrna.core.batch.ends.sort_segment_ends(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)

Sort the segment end coordinates and label the 3’ end of each contiguous set of segments.

Parameters:

seg_end5s (np.ndarray) – 5’ end of each segment in each read.
seg_end3s (np.ndarray) – 3’ end of each segment in each read.
seg_ends_mask (np.ndarray | None) – Whether each pair of 5’ and 3’ ends is masked.

Returns:

Sorted 5’ and 3’ coordinates of the segments in each read
Labels of whether each coordinate is a 5’ end of a segment
Labels of whether each coordinate is a 5’ end of a contiguous segment
Labels of whether each coordinate is a 3’ end of a contiguous segment

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray]

seismicrna.core.batch.index.count_base_types(base_pos_index: Index): Return the number of each type of base in the index of positions and bases.

seismicrna.core.batch.index.iter_base_types(base_pos_index: Index): For each type of base in the index of positions and bases, yield the positions in the index with that type of base.
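
A rough pandas sketch of these two helpers, assuming the index is a `MultiIndex` with one level for positions and one for bases. The level names `"Position"` and `"Base"` and the function names prefixed with `sketch_` are assumptions made for the example, not confirmed details of the library.

```python
import pandas as pd

# Assumed layout: a two-level index of (position, base).
example_index = pd.MultiIndex.from_arrays(
    [[1, 2, 3, 4, 5], list("GACAT")],
    names=["Position", "Base"],
)


def sketch_iter_base_types(base_pos_index):
    """Hypothetical helper: yield each base with its part of the index."""
    bases = base_pos_index.get_level_values("Base")
    for base in sorted(set(bases)):
        # Boolean selection keeps only the entries with this base.
        yield base, base_pos_index[bases == base]


def sketch_count_base_types(base_pos_index):
    """Hypothetical helper: number of entries for each type of base."""
    return {base: len(sub_index)
            for base, sub_index in sketch_iter_base_types(base_pos_index)}


base_counts = sketch_count_base_types(example_index)
a_positions = dict(sketch_iter_base_types(example_index))["A"]
```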

seismicrna.core.batch.index.list_batch_nums(num_batches: int): List the batch numbers.

class seismicrna.core.batch.muts.MutsBatch(*, region: Region, sanitize: bool = True, muts: dict[int, dict[int, list[int] | ndarray]], masked_read_nums: ndarray | list[int] | None = None, **kwargs)

Bases: EndCoords, ReadBatch, ABC

Batch of mutational data.

property pos_nums: Positions in use.

property read_end_counts: Counts of read end coordinates.

abstract property read_weights: DataFrame | None: Weights for each read when computing counts.

class seismicrna.core.batch.muts.RegionMutsBatch(*, region: Region, **kwargs)

Bases: MutsBatch, ABC

Batch of mutational data that knows its region.

calc_min_mut_dist(pattern: RelPattern): For each read, calculate the smallest distance (i.e. the gap plus 1) between any two mutations.
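
The "gap plus 1" definition is the difference between the positions of two adjacent mutations. A minimal pure-Python sketch for one read follows. The function name `sketch_min_mut_dist` is hypothetical, and returning 0 for a read with fewer than two mutations is an assumption made for the example.

```python
def sketch_min_mut_dist(mut_positions):
    """Hypothetical helper: smallest distance between two mutations.

    Returns 0 when the read has fewer than two distinct mutated
    positions (an assumption made for this example).
    """
    positions = sorted(set(mut_positions))
    if len(positions) < 2:
        return 0
    # The closest pair is always adjacent once positions are sorted.
    return min(b - a for a, b in zip(positions, positions[1:]))


# Mutations at 3, 7, and 12: the closest pair is 3 and 7, which has
# a gap of 3 unmutated positions and so a distance of 4.
min_dist = sketch_min_mut_dist([12, 3, 7])
```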

count_all(patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True): Calculate all counts.

count_per_pos(pattern: RelPattern): Count the reads that fit a relationship pattern at each position in a region.

count_per_read(pattern: RelPattern): Count the positions in a region that fit a relationship pattern in each read.

property cover_per_pos: Number of reads covering each position.

property cover_per_read: Number of positions covered by each read.

iter_reads(pattern: RelPattern, only_read_ends: bool = False, require_contiguous: bool = False): End coordinates and mutated positions in each read.

property matrix: Matrix of relationships at each position in each read.

property pos_index: Index of unmasked positions and bases.

reads_noclose_muts(pattern: RelPattern, min_gap: int): List the reads with no two mutations too close.

reads_per_pos(pattern: RelPattern): For each position, find all reads matching a relationship pattern.

property rels_per_pos: For each relationship, the number of reads at each position with that relationship.

property rels_per_read: For each relationship, the number of positions in each read with that relationship.

seismicrna.core.batch.muts.calc_muts_matrix(region: Region, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray, muts: dict[int, dict[int, ndarray]]): Matrix of relationships at each position in each read.

seismicrna.core.batch.muts.sanitize_muts(muts: dict[int, dict[int, list[int] | ndarray]], region: Region, data_type: type, sanitize: bool = True)

seismicrna.core.batch.muts.simulate_muts(pmut: DataFrame, seg_end5s: ndarray, seg_end3s: ndarray)

Simulate mutation data.

Parameters:

pmut (pd.DataFrame) – Rate of each type of mutation at each position.
seg_end5s (np.ndarray) – 5’ end coordinate of each segment.
seg_end3s (np.ndarray) – 3’ end coordinate of each segment.

class seismicrna.core.batch.read.ReadBatch(*, batch: int, **kwargs)

Bases: ABC

Batch of reads.

property batch_read_index: MultiIndex of the batch number and read numbers.

property masked_reads_bool

property max_read: int: Maximum possible value for a read index.

property num_reads: int | Series: Number of reads.

property read_dtype: Data type for read numbers.

property read_indexes: ndarray: Map each read number to its index in self.read_nums.
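
A mapping from read number to index can be stored as a lookup array with one slot per possible read number. The NumPy sketch below illustrates the idea. The helper name `sketch_read_indexes` is hypothetical, and using -1 for read numbers that are absent from the batch is an assumption, not a documented behavior.

```python
import numpy as np


def sketch_read_indexes(read_nums, max_read):
    """Hypothetical helper: lookup table from read number to index."""
    read_nums = np.asarray(read_nums)
    # Read numbers that are absent from the batch map to -1 (assumption).
    indexes = np.full(max_read + 1, -1, dtype=int)
    # Slot read_nums[i] stores i, the index of that read in read_nums.
    indexes[read_nums] = np.arange(read_nums.size)
    return indexes


lookup = sketch_read_indexes([4, 0, 2], max_read=5)
```

Looking up many read numbers at once is then a single indexing operation, for example `lookup[[2, 4]]`.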

property read_nums: ndarray: Read numbers.