seismicrna.core.batch package
Subpackages
- seismicrna.core.batch.tests package
- Submodules
- TestAccumulateBatches
- get_batch_count_all_func()
- TestCalcBhAdjustedPvals
  - TestCalcBhAdjustedPvals.test_2d_ndarray_per_column()
  - TestCalcBhAdjustedPvals.test_against_reference_implementation()
  - TestCalcBhAdjustedPvals.test_all_nan()
  - TestCalcBhAdjustedPvals.test_invalid_pvalues_raise()
  - TestCalcBhAdjustedPvals.test_monotonicity_enforcement()
  - TestCalcBhAdjustedPvals.test_nan_propagates_and_valid_entries_computed()
  - TestCalcBhAdjustedPvals.test_order_invariance()
  - TestCalcBhAdjustedPvals.test_pandas_dataframe()
  - TestCalcBhAdjustedPvals.test_pandas_series_preserves_index()
  - TestCalcBhAdjustedPvals.test_single_value()
  - TestCalcBhAdjustedPvals.test_sorted_input_hand_calc()
  - TestCalcBhAdjustedPvals.test_unsorted_input_hand_calc()
- TestCalcConfusionMatrix
- TestCalcConfusionPhi
- TestCountIntersection
- TestInitConfusionMatrix
- bh_ref_adjusted()
- TestCalcCountPerPos
- TestCalcCountPerRead
- TestCalcCoverage
  - TestCalcCoverage.test_0_positions()
  - TestCalcCoverage.test_0_positions_weighted()
  - TestCalcCoverage.test_1_position()
  - TestCalcCoverage.test_1_segment()
  - TestCalcCoverage.test_1_segment_mask()
  - TestCalcCoverage.test_1_segment_redundant()
  - TestCalcCoverage.test_1_segment_redundant_weighted()
  - TestCalcCoverage.test_1_segment_weighted()
  - TestCalcCoverage.test_2_segments_mask()
  - TestCalcCoverage.test_2_segments_nomask()
- TestCalcRelsPerPos
- TestCalcRelsPerRead
- TestCountEndCoords
- TestCountReadsSegments
- TestFindContiguousReads
- TestFindReadEnd3s
- TestFindReadEnd5s
- TestMatchReadsSegments
- TestMergeReadEnds
- TestSortSegmentEnds
  - TestSortSegmentEnds.test_0_segments()
  - TestSortSegmentEnds.test_1_segment_mask()
  - TestSortSegmentEnds.test_1_segment_nomask()
  - TestSortSegmentEnds.test_2_segments_mask()
  - TestSortSegmentEnds.test_2_segments_nomask()
  - TestSortSegmentEnds.test_n_segments_contig()
  - TestSortSegmentEnds.test_n_segments_discontig()
- TestCalcMutsMatrix
  - TestCalcMutsMatrix.test_full_reads_no_muts()
  - TestCalcMutsMatrix.test_full_reads_no_muts_some_masked()
  - TestCalcMutsMatrix.test_paired_reads_masked_segments()
  - TestCalcMutsMatrix.test_paired_reads_no_muts()
  - TestCalcMutsMatrix.test_partial_reads_muts()
  - TestCalcMutsMatrix.test_partial_reads_no_muts()
  - TestCalcMutsMatrix.test_partial_reads_no_muts_some_masked()
- TestInjectCloseMuts
  - TestInjectCloseMuts.region
  - TestInjectCloseMuts.test_all_zero_probs()
  - TestInjectCloseMuts.test_empty_mut_probs()
  - TestInjectCloseMuts.test_existing_mutation_not_duplicated()
  - TestInjectCloseMuts.test_invalid_above_one()
  - TestInjectCloseMuts.test_invalid_ndim()
  - TestInjectCloseMuts.test_invalid_negative()
  - TestInjectCloseMuts.test_masked_position_skipped()
  - TestInjectCloseMuts.test_only_mutated_reads_injected()
  - TestInjectCloseMuts.test_original_mutation_preserved()
  - TestInjectCloseMuts.test_prob_one_injects_all()
  - TestInjectCloseMuts.test_read_not_covering_skipped()
  - TestInjectCloseMuts.test_window_of_two()
- TestMergeCloseMuts
  - TestMergeCloseMuts.region
  - TestMergeCloseMuts.test_min_gap_zero_keeps_all()
  - TestMergeCloseMuts.test_no_mutations()
  - TestMergeCloseMuts.test_non_pattern_relationships_preserved()
  - TestMergeCloseMuts.test_reads_processed_independently()
  - TestMergeCloseMuts.test_single_mutation_survives()
  - TestMergeCloseMuts.test_three_muts_middle_is_artifact()
  - TestMergeCloseMuts.test_two_muts_gap_equals_min_gap()
  - TestMergeCloseMuts.test_two_muts_gap_exceeds_min_gap()
  - TestMergeCloseMuts.test_two_muts_gap_less_than_min_gap()
Submodules
- seismicrna.core.batch.accum.accumulate_batches(get_batch_count_all: Callable[[int], tuple[Any, Any, Any, Any]], num_batches: int, refseq: DNA, pos_nums: ndarray, patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True, validate: bool = True, num_cpus: int = 1)
Compute and accumulate counts from all batches, optionally in parallel.
- Parameters:
get_batch_count_all (Callable[[int], tuple]) – Callable that takes a batch number and returns (num_reads, end_counts, count_per_pos, count_per_read).
num_batches (int) – Total number of batches to process.
refseq (DNA) – Reference sequence.
pos_nums (np.ndarray) – Position numbers to include in the per-position counts.
patterns (dict[str, RelPattern]) – Mapping from relationship name to relationship pattern.
ks (Iterable[int] | None) – Numbers of clusters; None for unclustered data.
count_ends (bool = True) – Whether to accumulate end coordinate counts.
count_pos (bool = True) – Whether to accumulate per-position counts.
count_read (bool = True) – Whether to accumulate per-read counts.
validate (bool = True) – Whether to validate the index and column labels of each batch.
num_cpus (int = 1) – Number of CPUs to use for parallel processing.
- Returns:
Total (num_reads, end_counts, count_per_pos, count_per_read).
- Return type:
- seismicrna.core.batch.accum.accumulate_confusion_matrices(get_batch: Callable[[int], RegionMutsBatch], num_batches: int, pattern: RelPattern, pos_index: Index, clusters: Index | None, min_gap: int = 0, num_cpus: int = 1)
Accumulate confusion matrices from all batches.
- Parameters:
get_batch (Callable[[int], RegionMutsBatch]) – Callable that returns the batch for a given batch number.
num_batches (int) – Total number of batches to process.
pattern (RelPattern) – Relationship pattern defining which reads count as mutated.
pos_index (pd.Index) – Index of positions to include in the confusion matrix.
clusters (pd.Index | None) – Cluster index for the confusion matrix columns; None if not clustered.
min_gap (int = 0) – Minimum gap between positions to include in position pairs.
num_cpus (int = 1) – Number of CPUs to use for parallel processing.
- Returns:
Total (n, a, b, ab) confusion matrix components.
- Return type:
- seismicrna.core.batch.accum.accumulate_counts(batch_counts: Iterable[tuple[Any, Any, Any, Any]], refseq: DNA, pos_nums: ndarray, patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True, validate: bool = True)
Accumulate counts from batches into total counts.
- Parameters:
batch_counts (Iterable[tuple]) – Iterable of (num_reads, end_counts, count_per_pos, count_per_read) tuples, one per batch.
refseq (DNA) – Reference sequence.
pos_nums (np.ndarray) – Position numbers to include in the per-position counts.
patterns (dict[str, RelPattern]) – Mapping from relationship name to relationship pattern.
ks (Iterable[int] | None) – Numbers of clusters; None for unclustered data.
count_ends (bool = True) – Whether to accumulate end coordinate counts.
count_pos (bool = True) – Whether to accumulate per-position counts.
count_read (bool = True) – Whether to accumulate per-read counts.
validate (bool = True) – Whether to validate the index and column labels of each batch.
- Returns:
Total (num_reads, end_counts, count_per_pos, count_per_read).
- Return type:
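The accumulation pattern described for accumulate_counts can be sketched without the library. This is a simplified illustration only: plain dictionaries stand in for the real pandas objects, per-read counts are omitted, and `accumulate_simple` is a hypothetical name, not part of seismicrna.

```python
from collections import Counter


def accumulate_simple(batch_counts):
    """Sum per-batch counts into totals (hypothetical helper).

    Each item of batch_counts is (num_reads, end_counts, count_per_pos),
    where end_counts maps (end5, end3) -> count and count_per_pos maps
    position -> count. These plain-Python types are assumptions.
    """
    total_reads = 0
    total_ends = Counter()
    total_per_pos = Counter()
    for num_reads, end_counts, count_per_pos in batch_counts:
        total_reads += num_reads
        total_ends.update(end_counts)        # adds counts key by key
        total_per_pos.update(count_per_pos)  # adds counts key by key
    return total_reads, dict(total_ends), dict(total_per_pos)


batches = [
    (2, {(1, 5): 2}, {1: 0, 2: 1}),
    (3, {(1, 5): 1, (2, 6): 2}, {1: 1, 2: 2}),
]
totals = accumulate_simple(batches)
```

Because each batch contributes an independent tuple, the per-batch work can be computed in any order (or in parallel) before being summed.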
- seismicrna.core.batch.confusion.calc_bh_adjusted_pvals(pvals: ndarray | Series | DataFrame)
Calculate BH-adjusted p-values using the Benjamini-Hochberg correction.
Returns adjusted p-values (the minimum FDR at which each hypothesis would be rejected). Compare adjusted_pvals <= fdr to label significant pairs.
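For reference, the Benjamini-Hochberg adjustment itself can be written in a few lines of plain Python. The sketch below is not the library's implementation: `bh_adjust` is a hypothetical name, it handles only a flat list, and passing NaN through while excluding it from the number of tests is an assumption suggested by the test names in this package.

```python
import math


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a flat list (sketch)."""
    # Keep (p-value, original index) for every non-NaN entry.
    valid = [(p, i) for i, p in enumerate(pvals) if not math.isnan(p)]
    m = len(valid)
    adjusted = [math.nan] * len(pvals)
    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so that adjusted values never decrease with rank.
    running_min = 1.0
    ranked = list(enumerate(sorted(valid), start=1))
    for rank, (p, i) in reversed(ranked):
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


adj = bh_adjust([0.01, 0.04, 0.03, 0.20])
fdr = 0.05
significant = [q <= fdr for q in adj]
```

The final comparison mirrors the documented usage: a hypothesis is labeled significant when its adjusted p-value is at most the chosen FDR.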
- seismicrna.core.batch.confusion.calc_confusion_matrix(pos_index: Index, covering_reads: dict[int, ndarray], mutated_reads: dict[int, ndarray], read_weights: DataFrame | None = None, min_gap: int = 0)
For every pair of positions, calculate the confusion matrix:
+----+----+
| AB | AO | A.
+----+----+
| OB | OO | O.
+----+----+
  .B   .O   ..
And return .., A., .B, AB in that order.
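To make the four returned quantities concrete, here is a minimal sketch for a single pair of positions using Python sets. It assumes that only reads covering both positions are counted and ignores read weights and min_gap; `confusion_counts` is a hypothetical name.

```python
def confusion_counts(covering_reads, mutated_reads, pos_a, pos_b):
    """Return (n, a, b, ab) for one pair of positions (sketch).

    covering_reads and mutated_reads map position -> set of read numbers.
    Only reads covering both positions are counted (an assumption).
    """
    both = covering_reads[pos_a] & covering_reads[pos_b]
    mut_a = mutated_reads[pos_a] & both
    mut_b = mutated_reads[pos_b] & both
    # n = "..", a = "A.", b = ".B", ab = "AB" in the table notation.
    return len(both), len(mut_a), len(mut_b), len(mut_a & mut_b)


covering = {10: {0, 1, 2, 3}, 20: {1, 2, 3, 4}}
mutated = {10: {1, 2}, 20: {2, 3, 4}}
n, a, b, ab = confusion_counts(covering, mutated, 10, 20)
```

The remaining cells follow from the margins: AO = a - ab, OB = b - ab, and OO = n - a - b + ab.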
- seismicrna.core.batch.confusion.calc_confusion_phi(n: Series | DataFrame, a: Series | DataFrame, b: Series | DataFrame, ab: Series | DataFrame, *, min_cover: int | float = 1, tol: float = 1000000.0, validate: bool = True)
Calculate the phi correlation coefficient for a 2x2 matrix.
+----+----+
| AB | AO | A.
+----+----+
| OB | OO | O.
+----+----+
  .B   .O   ..
- where
A. = AB + AO
.B = AB + OB
.. = A. + O. = .B + .O
- Parameters:
n (pd.Series | pd.DataFrame) – Observations in total (..)
a (pd.Series | pd.DataFrame) – Observations for which A is true, regardless of B (A.)
b (pd.Series | pd.DataFrame) – Observations for which B is true, regardless of A (.B)
ab (pd.Series | pd.DataFrame) – Observations for which A and B are both true (AB)
min_cover (float) – Set phi values with coverage < min_cover to NaN
tol (float) – Tolerance for phi coefficients outside [-1, 1]
validate (bool) – Validate the confusion matrix first
- Returns:
Phi correlation coefficient
- Return type:
float | np.ndarray | pd.Series | pd.DataFrame
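In terms of the margins, the phi coefficient of a 2x2 table is (n·ab − a·b) / sqrt(a·(n − a)·b·(n − b)). The scalar sketch below illustrates that formula only; `phi_coefficient` is a hypothetical name, and returning NaN when a margin is empty or full is an assumption rather than documented behavior.

```python
import math


def phi_coefficient(n, a, b, ab, min_cover=1):
    """Phi coefficient from margins n, a, b and joint count ab (sketch)."""
    if n < min_cover:
        return math.nan  # too little coverage
    denom = a * (n - a) * b * (n - b)
    if denom <= 0:
        return math.nan  # undefined when any margin is empty or full
    return (n * ab - a * b) / math.sqrt(denom)


perfect = phi_coefficient(10, 5, 5, 5)      # A and B always co-occur
opposite = phi_coefficient(10, 5, 5, 0)     # A and B never co-occur
independent = phi_coefficient(4, 2, 2, 1)   # joint count matches chance
```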
- seismicrna.core.batch.confusion.calc_confusion_pvals(n: Series, a: Series, b: Series, ab: Series, *, validate: bool = True)
Calculate the p-value of each element of the confusion matrix using a two-sided Fisher exact test.
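A two-sided Fisher exact test on one 2x2 table can be sketched with only the standard library. This is an illustration of the test, not the library's code: `fisher_two_sided` is a hypothetical name, and the small relative tolerance used when comparing probabilities is an assumption to guard against rounding.

```python
from math import comb


def fisher_two_sided(n, a, b, ab):
    """Two-sided Fisher exact p-value for margins n, a, b and joint ab."""

    def prob(k):
        # Hypergeometric probability of k joint observations.
        return comb(a, k) * comb(n - a, b - k) / comb(n, b)

    lowest = max(0, a + b - n)   # smallest feasible joint count
    highest = min(a, b)          # largest feasible joint count
    p_obs = prob(ab)
    # Sum the probabilities of all tables no more likely than the
    # observed one.
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


p_extreme = fisher_two_sided(4, 2, 2, 2)
p_typical = fisher_two_sided(4, 2, 2, 1)
```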
- seismicrna.core.batch.confusion.init_confusion_matrix(pos_index: Index, clusters: Index | None = None, min_gap: int = 0)
For every pair of positions, initialize the confusion matrix:
+----+----+
| AB | AO | A.
+----+----+
| OB | OO | O.
+----+----+
  .B   .O   ..
And return .., A., .B, AB in that order.
- seismicrna.core.batch.confusion.validate_confusion_matrix(n: Series | DataFrame, a: Series | DataFrame, b: Series | DataFrame, ab: Series | DataFrame)
- seismicrna.core.batch.count.calc_count_per_pos(pattern: RelPattern, cover_per_pos: Series | DataFrame, rels_per_pos: dict[int, Series | DataFrame])
Count the reads that fit a pattern at each position.
- seismicrna.core.batch.count.calc_count_per_read(pattern: RelPattern, cover_per_read: DataFrame, rels_per_read: dict[int, DataFrame])
Count the positions that fit a pattern in each read.
- seismicrna.core.batch.count.calc_coverage(pos_index: Index, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None, read_weights: DataFrame | None = None)
Calculate the coverage per position and per read.
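As a rough picture of what per-position and per-read coverage mean, the sketch below counts both from a mapping of reads to segments. It assumes inclusive coordinates, counts a position once per read even when two segments overlap it, and omits masks and read weights; `coverage` is a hypothetical name.

```python
def coverage(positions, reads):
    """Count coverage per position and per read (sketch).

    reads maps read number -> list of inclusive (end5, end3) segments.
    """
    per_pos = {pos: 0 for pos in positions}
    per_read = {}
    for read, segments in reads.items():
        # A position covered by several segments of one read counts once.
        covered = {
            pos for pos in positions
            if any(end5 <= pos <= end3 for end5, end3 in segments)
        }
        per_read[read] = len(covered)
        for pos in covered:
            per_pos[pos] += 1
    return per_pos, per_read


per_pos, per_read = coverage(
    positions=[1, 2, 3, 4, 5],
    reads={0: [(1, 3)], 1: [(2, 3), (3, 5)]},
)
```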
- seismicrna.core.batch.count.calc_covered_reads_per_pos(pos_index: Index, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)
For each position, find all reads covering it.
- seismicrna.core.batch.count.calc_reads_per_pos(pattern: RelPattern, mutations: dict[int, dict[int, ndarray]], pos_index: Index)
For each position, find all reads matching a pattern.
- seismicrna.core.batch.count.calc_rels_per_pos(mutations: dict[int, dict[int, ndarray]], num_reads: int | Series, cover_per_pos: Series | DataFrame, read_indexes: ndarray | None = None, read_weights: DataFrame | None = None)
For each relationship, the number of reads at each position.
- seismicrna.core.batch.count.calc_rels_per_read(mutations: dict[int, dict[int, ndarray]], pos_index: Index, cover_per_read: DataFrame, read_indexes: ndarray)
For each relationship, the number of positions in each read.
- seismicrna.core.batch.count.count_end_coords(end5s: ndarray, end3s: ndarray, weights: DataFrame | None = None)
Count each pair of 5’ and 3’ end coordinates.
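Counting end-coordinate pairs amounts to tallying identical (5' end, 3' end) tuples. A minimal unweighted sketch, with `count_ends` as a hypothetical name:

```python
from collections import Counter


def count_ends(end5s, end3s):
    """Count how many reads share each (5' end, 3' end) pair."""
    if len(end5s) != len(end3s):
        raise ValueError("end5s and end3s must have the same length")
    return Counter(zip(end5s, end3s))


counts = count_ends([1, 1, 4, 1], [9, 9, 12, 8])
```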
- exception seismicrna.core.batch.ends.BadSegmentEndsError
Bases: ValueError
Segment 5’ and 3’ ends are not valid.
- class seismicrna.core.batch.ends.EndCoords(*, region: Region, seg_end5s: ndarray, seg_end3s: ndarray, sanitize: bool = True, **kwargs)
Bases: object
Collection of 5’ and 3’ segment end coordinates.
- property contiguous
Whether the segments of each read are contiguous.
- property num_contiguous
Number of contiguous reads.
- property num_discontiguous
Number of discontiguous reads.
- property num_reads
Number of reads.
- property num_segments
Number of segments in each read.
- property pos_dtype
Data type for positions.
- property read_end3s
3’ end of each read.
- property read_end5s
5’ end of each read.
- property read_lengths
Length of each read.
- seismicrna.core.batch.ends.count_reads_segments(seg_ends: ndarray, what: str = 'seg_ends') → tuple[int, int]
- seismicrna.core.batch.ends.find_contiguous_reads(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)
Whether the segments of each read are contiguous.
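One way to think about contiguity is that the segments of a read, taken together, leave no uncovered position between the read's outermost ends. The sketch below checks this for a single read; treating abutting segments as contiguous and an empty read as contiguous are assumptions, and `is_contiguous` is a hypothetical name.

```python
def is_contiguous(segments):
    """Whether inclusive (end5, end3) segments form one unbroken interval."""
    ordered = sorted(segments)
    if not ordered:
        return True  # assumption: no segments means nothing is broken
    reach = ordered[0][1]  # 3'-most position covered so far
    for end5, end3 in ordered[1:]:
        if end5 > reach + 1:
            return False  # at least one uncovered position in between
        reach = max(reach, end3)
    return True


overlapping = is_contiguous([(1, 5), (4, 9)])
abutting = is_contiguous([(1, 5), (6, 9)])
gapped = is_contiguous([(1, 5), (7, 9)])
```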
- seismicrna.core.batch.ends.find_read_end3s(seg_end3s: ndarray, seg_ends_mask: ndarray | None)
Find the 3’ end of each read.
- seismicrna.core.batch.ends.find_read_end5s(seg_end5s: ndarray, seg_ends_mask: ndarray | None)
Find the 5’ end of each read.
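Conceptually, the 5' end of a read is the smallest 5' end among its unmasked segments, and the 3' end is the largest 3' end. A per-read sketch follows; raising an error when every segment is masked is an assumption, and `read_ends` is a hypothetical name.

```python
def read_ends(segments, mask=None):
    """Return (5' end, 3' end) of a read from its unmasked segments."""
    if mask is None:
        mask = [False] * len(segments)
    kept = [seg for seg, masked in zip(segments, mask) if not masked]
    if not kept:
        raise ValueError("every segment of the read is masked")
    end5 = min(seg_end5 for seg_end5, _ in kept)
    end3 = max(seg_end3 for _, seg_end3 in kept)
    return end5, end3


unmasked = read_ends([(3, 8), (1, 5)])
one_masked = read_ends([(3, 8), (1, 5)], mask=[False, True])
```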
- seismicrna.core.batch.ends.match_reads_segments(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)
Number of segments for the given end coordinates.
- seismicrna.core.batch.ends.merge_read_ends(read_end5s: ndarray, read_end3s: ndarray)
Return the 5’ and 3’ ends as one 2D array.
- seismicrna.core.batch.ends.sanitize_segment_ends(seg_end5s: ndarray, seg_end3s: ndarray, min_pos: int, max_pos: int, check_values: bool = True)
Sanitize end coordinates.
- Parameters:
seg_end5s (np.ndarray) – 5’ end coordinate of each segment in each read.
seg_end3s (np.ndarray) – 3’ end coordinate of each segment in each read.
min_pos (int) – Minimum allowed value of a position.
max_pos (int) – Maximum allowed value of a position.
check_values (bool = True) – Whether to check the bounds of the values, which is the most expensive operation in this function. Can be set to False if the only desired effect is to ensure the output is a positive, even number of arrays in the proper data type.
- Returns:
Sanitized end coordinates: encoded in the most efficient data type, and if check_values is True then all between min_pos and max_pos (inclusive).
- Return type:
tuple[np.ndarray,np.ndarray]
- seismicrna.core.batch.ends.simulate_segment_ends(uniq_end5s: ndarray, uniq_end3s: ndarray, p_ends: ndarray, num_reads: int, read_length: int = 0, p_rev: float = 0.5, seed: int | None = None)
Simulate segment end coordinates from their probabilities.
- Parameters:
uniq_end5s (np.ndarray) – Unique read 5’ end coordinates.
uniq_end3s (np.ndarray) – Unique read 3’ end coordinates.
p_ends (np.ndarray) – Probability of each set of unique end coordinates.
num_reads (int) – Number of reads to simulate.
read_length (int = 0) – If == 0, then generate single-end reads (1 segment per read); if > 0, then generate paired-end reads (2 segments per read) with at most this number of base calls in each segment.
p_rev (float = 0.5) – For paired-end reads, the probability that mate 1 aligns in the
- Returns:
5’ and 3’ segment end coordinates of the simulated reads.
- Return type:
tuple[np.ndarray,np.ndarray]
- seismicrna.core.batch.ends.sort_segment_ends(seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray | None)
Sort the segment end coordinates and label the 3’ end of each contiguous set of segments.
- Parameters:
seg_end5s (np.ndarray) – 5’ end of each segment in each read.
seg_end3s (np.ndarray) – 3’ end of each segment in each read.
seg_ends_mask (np.ndarray | None) – Whether each pair of 5’ and 3’ ends is masked.
- Returns:
Sorted 5’ and 3’ coordinates of the segments in each read
Labels of whether each coordinate is a 5’ end of a segment
Labels of whether each coordinate is a 5’ end of a contiguous segment
Labels of whether each coordinate is a 3’ end of a contiguous segment
- Return type:
tuple[np.ndarray,np.ndarray,np.ndarray]
- seismicrna.core.batch.index.count_base_types(base_pos_index: Index)
Return the number of each type of base in the index of positions and bases.
- seismicrna.core.batch.index.iter_base_types(base_pos_index: Index)
For each type of base in the index of positions and bases, yield the positions in the index with that type of base.
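Grouping positions by base type can be sketched with a dictionary. Here the index is assumed to be a sequence of (position, base) pairs, which is a stand-in for the real pandas index; `iter_bases` and `count_bases` are hypothetical names.

```python
from collections import defaultdict


def iter_bases(base_pos):
    """Yield (base, positions) for each base type, in sorted base order."""
    by_base = defaultdict(list)
    for pos, base in base_pos:
        by_base[base].append(pos)
    for base in sorted(by_base):
        yield base, by_base[base]


def count_bases(base_pos):
    """Number of positions with each type of base."""
    return {base: len(positions) for base, positions in iter_bases(base_pos)}


index = [(1, "G"), (2, "A"), (3, "A"), (4, "C")]
groups = dict(iter_bases(index))
base_counts = count_bases(index)
```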
- class seismicrna.core.batch.muts.MutsBatch(*, region: Region, sanitize: bool = True, muts: dict[int, dict[int, list[int] | ndarray]], masked_read_nums: ndarray | list[int] | None = None, **kwargs)
Bases: EndCoords, ReadBatch, ABC
Batch of mutational data.
- property pos_nums
Positions in use.
- property read_end_counts
Counts of read end coordinates.
- class seismicrna.core.batch.muts.RegionMutsBatch(*, region: Region, **kwargs)
Batch of mutational data that knows its region.
- calc_confusion_matrix(pattern: RelPattern, min_gap: int = 0)
Calculate the confusion matrix of mutations.
- calc_min_mut_dist(pattern: RelPattern)
For each read, calculate the smallest distance (i.e. the gap plus 1) between any two mutations.
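The distance used here is the difference between two mutated positions, so mutations at positions 10 and 13 have a distance of 3 and a gap of 2. A per-read sketch follows; returning 0 for reads with fewer than two mutations is an assumption, and `min_mut_dist` is a hypothetical name.

```python
def min_mut_dist(mut_positions):
    """Smallest distance between any two mutated positions in one read."""
    ordered = sorted(mut_positions)
    if len(ordered) < 2:
        return 0  # assumption: no pair of mutations to measure
    # The closest pair is always adjacent once the positions are sorted.
    return min(right - left for left, right in zip(ordered, ordered[1:]))


closest = min_mut_dist([20, 10, 13])
single = min_mut_dist([5])
```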
- count_all(patterns: dict[str, RelPattern], ks: Iterable[int] | None = None, *, count_ends: bool = True, count_pos: bool = True, count_read: bool = True)
Calculate all counts.
- count_per_pos(pattern: RelPattern)
Count the reads that fit a relationship pattern at each position in a region.
- count_per_read(pattern: RelPattern)
Count the positions in a region that fit a relationship pattern in each read.
- property cover_per_pos
Number of reads covering each position.
- property cover_per_read
Number of positions covered by each read.
- property covered_reads_per_pos
Reads covering each position.
- inject_close_muts(pattern: RelPattern, mut_probs: Iterable[float] | ndarray, seed: int | None)
Return a new muts dictionary with extra mutations within min_gap positions 5’ of existing mutations.
- iter_reads(pattern: RelPattern, only_read_ends: bool = False, require_contiguous: bool = False)
End coordinates and mutated positions in each read.
- property matrix
Matrix of relationships at each position in each read.
- merge_close_muts(pattern: RelPattern, min_gap: int)
Return a new muts dictionary in which mutations closer than min_gap are merged into a single mutation, keeping only the 3’-most mutation. This algorithm corrects for extra mutations occurring at non-modified positions shortly 5’ of modifications, and therefore distinguishes probe-modified positions from all mutated positions.
- property pos_index
Index of unmasked positions and bases.
- reads_noclose_muts(pattern: RelPattern, min_gap: int)
List the reads with no two mutations too close.
- reads_per_pos(pattern: RelPattern)
For each position, find all reads matching a relationship pattern.
- property rels_per_pos
For each relationship, the number of reads at each position with that relationship.
- property rels_per_read
For each relationship, the number of positions in each read with that relationship.
- seismicrna.core.batch.muts.calc_muts_matrix(region: Region, read_nums: ndarray, seg_end5s: ndarray, seg_end3s: ndarray, seg_ends_mask: ndarray, muts: dict[int, dict[int, ndarray]])
Build a matrix of relationships at each position in each read.
- Parameters:
region (Region) – Region providing unmasked positions.
read_nums (np.ndarray) – Read numbers: 1D array (reads).
seg_end5s (np.ndarray) – 5’ end coordinates of each segment: 2D array (reads x segments).
seg_end3s (np.ndarray) – 3’ end coordinates of each segment: 2D array (reads x segments).
seg_ends_mask (np.ndarray) – Boolean mask of segments to exclude: 2D array (reads x segments).
muts (dict[int, dict[int, np.ndarray]]) – Mapping from position to relationship code to read numbers.
- Returns:
DataFrame of relationship codes, indexed by read number and columned by position index.
- Return type:
pd.DataFrame
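The idea of expanding the sparse muts mapping into a dense read-by-position matrix can be sketched with nested dictionaries. The relationship codes `MATCH` and `NOCOV` below are hypothetical placeholder values, segment masks are omitted, and `muts_matrix` is a hypothetical name; the real function returns a pandas DataFrame.

```python
NOCOV = 255  # hypothetical code: read does not cover the position
MATCH = 1    # hypothetical code: read matches the reference


def muts_matrix(positions, reads, muts):
    """Build {read: {position: code}} from segments and mutations (sketch).

    reads maps read number -> list of inclusive (end5, end3) segments;
    muts maps position -> relationship code -> read numbers.
    """
    matrix = {}
    for read, segments in reads.items():
        # Covered positions start as matches; the rest are uncovered.
        matrix[read] = {
            pos: MATCH
            if any(end5 <= pos <= end3 for end5, end3 in segments)
            else NOCOV
            for pos in positions
        }
    for pos, rels in muts.items():
        if pos not in positions:
            continue  # skip positions outside the matrix
        for code, read_nums in rels.items():
            for read in read_nums:
                matrix[read][pos] = code  # overwrite with the mutation
    return matrix


matrix = muts_matrix(
    positions=[1, 2, 3],
    reads={0: [(1, 2)], 1: [(2, 3)]},
    muts={2: {128: [1]}},
)
```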
- seismicrna.core.batch.muts.sanitize_muts(muts: dict[int, dict[int, list[int] | ndarray]], region: Region, data_type: type, sanitize: bool = True)
Keep only unmasked positions in the muts dictionary and convert the read lists to arrays.
- Parameters:
muts (dict[int, dict[int, list[int] | np.ndarray]]) – Mapping from position to relationship code to read numbers.
region (Region) – Region whose unmasked positions define which positions to keep.
data_type (type) – NumPy dtype to use for the read number arrays.
sanitize (bool = True) – If True, filter to unmasked positions and convert to arrays; if False, return the muts values as-is.
- Returns:
Sanitized mutation data.
- Return type:
dict[int,dict[int,np.ndarray]]
- seismicrna.core.batch.muts.simulate_muts(pmut: DataFrame, seg_end5s: ndarray, seg_end3s: ndarray, seed: int | None)
Simulate mutation data.
- Parameters:
pmut (pd.DataFrame) – Rate of each type of mutation at each position.
seg_end5s (np.ndarray) – 5’ end coordinate of each segment.
seg_end3s (np.ndarray) – 3’ end coordinate of each segment.
seed (int | None) – Random seed for reproducibility; None for a random seed.
- Returns:
Mutation data: mapping from position to relationship code to read numbers.
- Return type:
dict[int,dict[int,np.ndarray]]
- class seismicrna.core.batch.read.ReadBatch(*, batch: int, **kwargs)
Bases: ABC
Batch of reads.
- property batch_read_index
MultiIndex of the batch number and read numbers.
- property masked_reads_bool
- property read_dtype
Data type for read numbers.
- property read_indexes: ndarray
Map each read number to its index in self.read_nums.
- property read_nums: ndarray
Read numbers.
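The relationship between read_nums and read_indexes can be pictured as an inverse lookup: given a read number, find where it sits in read_nums. The list-based sketch below uses -1 for read numbers that are absent, which is an assumption; the standalone function `read_indexes` here is a hypothetical stand-in for the property.

```python
def read_indexes(read_nums):
    """Lookup list: result[read_num] is that read's index, or -1 (sketch)."""
    if not read_nums:
        return []
    lookup = [-1] * (max(read_nums) + 1)
    for index, read_num in enumerate(read_nums):
        lookup[read_num] = index
    return lookup


lookup = read_indexes([4, 0, 2])
```

With such a lookup, mapping many read numbers to their indexes is a direct index operation rather than a search.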