seismicrna.mask package

Submodules

class seismicrna.mask.batch.MaskMutsBatch(*, read_nums: ndarray, **kwargs)

Bases: MaskReadBatch, PartialRegionMutsBatch

property read_weights

Weights for each read when computing counts.

class seismicrna.mask.batch.MaskReadBatch(*, read_nums: ndarray, **kwargs)

Bases: PartialReadBatch

property num_reads

Number of reads.

property read_nums

Read numbers.

class seismicrna.mask.batch.PartialReadBatch(*, batch: int, **kwargs)

Bases: ReadBatch, ABC

property max_read

Maximum possible value for a read index.

property read_indexes

Map each read number to its index in self.read_nums.
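The relationship between read_nums, max_read, and read_indexes can be pictured with a small NumPy lookup table. This is a minimal sketch, not the library's implementation: the helper name map_read_indexes and the -1 sentinel for absent reads are assumptions made for illustration.

```python
import numpy as np


def map_read_indexes(read_nums: np.ndarray, max_read: int) -> np.ndarray:
    """Build a lookup array: element n holds the index of read number n
    within read_nums, or -1 (assumed sentinel) if read n is absent."""
    indexes = np.full(max_read + 1, -1, dtype=int)
    indexes[read_nums] = np.arange(read_nums.size)
    return indexes


# Three reads survive out of a batch whose largest possible read number is 9.
read_nums = np.array([2, 5, 7])
indexes = map_read_indexes(read_nums, max_read=9)
```

With such a table, looking up the row of a given read number is a single array access (indexes[5] gives the row of read 5), which is presumably why a maximum read index is exposed alongside the mapping.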

class seismicrna.mask.batch.PartialRegionMutsBatch(*, region: Region, **kwargs)

Bases: PartialReadBatch, RegionMutsBatch, ABC

seismicrna.mask.batch.apply_mask(batch: RegionMutsBatch, read_nums: ndarray | None = None, region: Region | None = None, sanitize: bool = False)

Apply a read/position mask to a batch, returning a MaskMutsBatch.

Parameters:
  • batch (RegionMutsBatch) – Source batch to mask.

  • read_nums (np.ndarray or None, optional) – Array of read numbers to keep; if None, all reads are kept.

  • region (Region or None, optional) – Region to clip reads to; if None, the batch’s existing region is used.

  • sanitize (bool, optional) – Whether to run extra validation checks on the new batch (default False).

Returns:

A new batch containing only the selected reads and positions.

Return type:

MaskMutsBatch

class seismicrna.mask.dataset.JoinMaskMutsDataset(*args, **kwargs)

Bases: MaskDataset, JoinMutsDataset, MergedUnbiasDataset

classmethod get_batch_type()

Type of batch.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

classmethod name_batch_attrs()

Name the attributes of each batch.

class seismicrna.mask.dataset.MaskDataset(report_file: str | Path, verify_times: bool = True)

Bases: AverageDataset, ABC

Dataset of masked data.

class seismicrna.mask.dataset.MaskMutsDataset(dataset2_report_file: Path, **kwargs)

Bases: MaskDataset, MultistepDataset, UnbiasDataset

Chain mutation data with masked reads.

MASK_NAME = 'mask'

classmethod get_dataset1_load_func()

Function to load Dataset 1.

classmethod get_dataset2_type()

Type of Dataset 2.

property min_mut_gap

Minimum gap between two mutations.

property mut_collisions

Method for handling mutations that are too close.

property pattern

Pattern of mutations to count.

property probe

Chemical probe used for the experiment.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

property region

Region of the dataset.

class seismicrna.mask.dataset.MaskReadDataset(*args, masked_read_nums: dict[int, list] | None = None, **kwargs)

Bases: MaskDataset, LoadedDataset, UnbiasDataset

Load batches of masked data.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property min_mut_gap

Minimum gap between two mutations.

property mut_collisions

Method for handling mutations that are too close.

property pattern

Pattern of mutations to count.

property pos_kept

Positions kept after masking.

property probe

Chemical probe used for the experiment.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.
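The effect of quick_unbias_thresh, as described above, can be sketched as a simple clamp on per-position mutation rates. The function name zero_low_rates and the example rates are hypothetical, and how the library applies the threshold internally is not shown in this reference.

```python
import numpy as np


def zero_low_rates(mut_rates: np.ndarray, thresh: float) -> np.ndarray:
    """Treat rates less than or equal to thresh as exactly 0."""
    return np.where(mut_rates <= thresh, 0.0, mut_rates)


# Illustrative mutation rates for four positions.
rates = np.array([0.0005, 0.001, 0.02, 0.15])
clipped = zero_low_rates(rates, thresh=0.001)
```

Here the first two rates fall at or below the threshold and become 0, while the other two pass through unchanged.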

class seismicrna.mask.io.MaskBatchIO(*, read_nums: ndarray, **kwargs)

Bases: MaskReadBatch, ReadBatchIO, RegBrickleIO, MaskIO

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.mask.io.MaskFile

Bases: HasRegFilePath, ABC

classmethod get_step()

Step of the workflow.

class seismicrna.mask.io.MaskIO

Bases: MaskFile, RegFileIO, ABC

class seismicrna.mask.lists.MaskList(*, reg: str, **kwargs)

Bases: List, MaskFile, ABC

class seismicrna.mask.lists.MaskPositionList(*, reg: str, **kwargs)

Bases: PositionList, MaskList

classmethod get_table_type()

Type of table that this type of list can process.

classmethod list_init_table_attrs()

List the table attribute names to pass to __init__().

seismicrna.mask.main.load_regions(input_path: Iterable[str | Path], coords: Iterable[tuple[str, int, int]], primers: Iterable[tuple[str, DNA, DNA]], primer_gap: int, regions_file: Path | None = None)

Load regions of relate reports.

seismicrna.mask.main.run(input_path: Iterable[str | Path] = Sentinel.UNSET, *, branch: str = '', tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), primer_gap: int = 0, mask_regions_file: str | None = None, count_del: bool = True, count_ins: bool = True, no_mut: Iterable[str] = (), only_mut: Iterable[str] = (), probe: str = 'DMS', mask_a: bool | None = None, mask_c: bool | None = None, mask_g: bool | None = None, mask_u: bool | None = None, mask_polya: int = 5, mask_pos: Iterable[tuple[str, int]] = (), mask_pos_file: Iterable[str | Path] = (), mask_read: Iterable[str] = (), mask_read_file: Iterable[str | Path] = (), mask_discontig: bool = True, min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, min_ncov_read: int = 1, min_fcov_read: float = 0.0, min_finfo_read: float = 0.95, max_fmut_read: float = 1.0, min_mut_gap: int | None = None, mut_collisions: str = 'auto', quick_unbias: bool = True, quick_unbias_thresh: float = 0.001, max_mask_iter: int = 0, mask_pos_table: bool = True, mask_read_table: bool = True, brotli_level: int = 10, num_cpus: int = 4, force: bool = False) → list[Path]

Define mutations and regions to filter reads and positions.

Parameters:
  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: '']

  • tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • primer_gap (int) – Leave a gap of this many bases between the primer and the region [keyword-only, default: 0]

  • mask_regions_file (str | None) – Select regions of references from coordinates/primers in a CSV file [keyword-only, default: None]

  • count_del (bool) – Count deletions as mutations [keyword-only, default: True]

  • count_ins (bool) – Count insertions as mutations [keyword-only, default: True]

  • no_mut (Iterable) – Do not count this type of mutation (overrides --count-del/ins) [keyword-only, default: ()]

  • only_mut (Iterable) – Count only this type of mutation (overrides other mutation settings) [keyword-only, default: ()]

  • probe (str) – Use default mask options for this chemical probe [keyword-only, default: 'DMS']

  • mask_a (bool | None) – Mask positions with base A [keyword-only, default: None]

  • mask_c (bool | None) – Mask positions with base C [keyword-only, default: None]

  • mask_g (bool | None) – Mask positions with base G [keyword-only, default: None]

  • mask_u (bool | None) – Mask positions with base U [keyword-only, default: None]

  • mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]

  • mask_pos (Iterable) – Mask this position in this reference [keyword-only, default: ()]

  • mask_pos_file (Iterable) – Mask positions in references from a file [keyword-only, default: ()]

  • mask_read (Iterable) – Mask the read with this name [keyword-only, default: ()]

  • mask_read_file (Iterable) – Mask the reads with names in this file [keyword-only, default: ()]

  • mask_discontig (bool) – Mask paired-end reads with discontiguous mates [keyword-only, default: True]

  • min_ninfo_pos (int) – Mask positions with fewer than this many informative base calls [keyword-only, default: 1000]

  • max_fmut_pos (float) – Mask positions with more than this fraction of mutated base calls [keyword-only, default: 1.0]

  • min_ncov_read (int) – Mask reads with fewer than this many bases covering the region [keyword-only, default: 1]

  • min_fcov_read (float) – Mask reads covering less than this fraction of the region [keyword-only, default: 0.0]

  • min_finfo_read (float) – Mask reads with less than this fraction of informative base calls [keyword-only, default: 0.95]

  • max_fmut_read (float) – Mask reads with more than this fraction of mutated base calls [keyword-only, default: 1.0]

  • min_mut_gap (int | None) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: None]

  • mut_collisions (str) – If two mutations are closer than --min-mut-gap positions, MERGE the mutations, DROP the read, or AUTO-select based on the probe [keyword-only, default: 'auto']

  • quick_unbias (bool) – Correct observer bias using a quick (typically linear time) heuristic [keyword-only, default: True]

  • quick_unbias_thresh (float) – Treat mutated fractions under this threshold as 0 with --quick-unbias [keyword-only, default: 0.001]

  • max_mask_iter (int) – Stop masking after this many iterations (0 for no limit) [keyword-only, default: 0]

  • mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]

  • mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
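A call to run() might be assembled as follows. The keyword names come from the signature above; the input path, the reference name "ref1", the coordinates, and the chosen values are purely illustrative, and the wrapper run_mask is a hypothetical helper rather than part of the package.

```python
from pathlib import Path

# Hypothetical input: a directory assumed to hold output of the relate step.
input_paths = [Path("out/sample1/relate")]

# Keyword arguments named as in the signature of run(); values are examples.
mask_options = dict(
    mask_coords=[("ref1", 20, 120)],  # (reference, 5' end, 3' end)
    probe="DMS",
    mask_polya=5,
    min_ninfo_pos=1000,
    min_finfo_read=0.95,
    max_fmut_read=0.1,
    num_cpus=2,
    force=False,
)


def run_mask(paths, options):
    """Call seismicrna.mask.main.run with the given options.

    The import is done lazily so that the options can be inspected
    even where SEISMIC-RNA is not installed.
    """
    from seismicrna.mask.main import run
    return run(paths, **options)
```

According to the signature, the call returns a list of Path objects; all options other than input_path must be passed by keyword.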

seismicrna.mask.main.set_mask_acgu(probe: str, mask_a: bool | None = None, mask_c: bool | None = None, mask_g: bool | None = None, mask_u: bool | None = None)

Resolve per-base masking flags based on the probe type.

Parameters:
  • probe (str) – Probe type (one of the values in PROBES), used to set defaults when a flag is None.

  • mask_a (bool or None, optional) – Whether to mask adenine positions; if None, inferred from probe.

  • mask_c (bool or None, optional) – Whether to mask cytosine positions; if None, inferred from probe.

  • mask_g (bool or None, optional) – Whether to mask guanine positions; if None, inferred from probe.

  • mask_u (bool or None, optional) – Whether to mask uracil/thymine positions; if None, inferred from probe.

Returns:

Resolved (mask_a, mask_c, mask_g, mask_u) flags.

Return type:

tuple[bool, bool, bool, bool]
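The None-means-default behavior described above can be sketched as below. This is not the library's code: the function resolve_mask_flags and the table ASSUMED_DEFAULTS are hypothetical, and the per-probe values in that table are assumptions chosen for illustration (the real defaults are defined inside SEISMIC-RNA and may differ).

```python
# Hypothetical per-probe defaults (assumed values, not taken from the library).
ASSUMED_DEFAULTS = {
    "DMS": dict(a=False, c=False, g=True, u=True),
    "SHAPE": dict(a=False, c=False, g=False, u=False),
}


def resolve_mask_flags(probe, mask_a=None, mask_c=None, mask_g=None,
                       mask_u=None, defaults=ASSUMED_DEFAULTS):
    """Return (mask_a, mask_c, mask_g, mask_u), replacing each None
    with the assumed default for the probe."""
    if probe not in defaults:
        raise ValueError(f"Unknown probe: {probe!r}")
    base = defaults[probe]
    given = dict(a=mask_a, c=mask_c, g=mask_g, u=mask_u)
    return tuple(base[k] if given[k] is None else bool(given[k])
                 for k in "acgu")


all_defaults = resolve_mask_flags("DMS")
override_g = resolve_mask_flags("DMS", mask_g=False)
```

An explicit True or False always wins over the probe default, so only flags left as None are inferred.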

seismicrna.mask.main.set_mut_gap_params(probe: str, min_mut_gap: int | None = None, mut_collisions: str = 'auto')

Resolve mutation-gap and collision parameters based on the probe type.

Parameters:
  • probe (str) – Probe type (one of the values in PROBES), used to set defaults when a parameter is None / MUT_COLLISIONS_AUTO.

  • min_mut_gap (int or None, optional) – Minimum gap (in nucleotides) between two mutations in the same read; if None, a probe-specific default is used.

  • mut_collisions (str, optional) – How to handle reads with mutations closer than min_mut_gap; if MUT_COLLISIONS_AUTO, a probe-specific default is used.

Returns:

Resolved (min_mut_gap, mut_collisions) values.

Return type:

tuple[int, str]
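What it means for two mutations to be closer than min_mut_gap can be illustrated with a small check. This sketch assumes the gap counts the bases strictly between two mutated positions; that interpretation, and the helper name has_close_mutations, are assumptions and not confirmed by this reference.

```python
import numpy as np


def has_close_mutations(mut_positions, min_mut_gap: int) -> bool:
    """Return True if any two mutations in one read are separated by
    fewer than min_mut_gap bases (assumed: bases strictly between them)."""
    positions = np.sort(np.asarray(mut_positions, dtype=int))
    if positions.size < 2 or min_mut_gap <= 0:
        return False
    # Number of bases lying between each pair of adjacent mutations.
    gaps = np.diff(positions) - 1
    return bool(np.any(gaps < min_mut_gap))


close = has_close_mutations([10, 12], min_mut_gap=3)      # 1 base between
far_enough = has_close_mutations([10, 14], min_mut_gap=3)  # 3 bases between
```

A read flagged by such a check would then be handled according to mut_collisions; the merge and drop behaviors themselves are not sketched here.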

class seismicrna.mask.report.BaseMaskReport(**kwargs: Any | Callable[[Report], Any])

Bases: RegReport, MaskIO, ABC

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.mask.report.JoinMaskReport(**kwargs: Any | Callable[[Report], Any])

Bases: JoinReport, BaseMaskReport

classmethod get_param_report_fields()

Parameter fields of the report.

class seismicrna.mask.report.MaskReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, BaseMaskReport

classmethod get_param_report_fields()

Parameter fields of the report.

classmethod get_result_report_fields()

Result fields of the report.

class seismicrna.mask.table.MaskBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, num_cpus: int = 1, **kwargs)

Bases: BatchTabulator, MaskTabulator

class seismicrna.mask.table.MaskCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)

Bases: CountTabulator, MaskTabulator

class seismicrna.mask.table.MaskDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: PartialDatasetTabulator, MaskTabulator

class seismicrna.mask.table.MaskPositionTable

Bases: MaskTable, PartialPositionTable, ABC

class seismicrna.mask.table.MaskPositionTableLoader(table_file: str | Path, **kwargs)

Bases: PositionTableLoader, MaskPositionTable

class seismicrna.mask.table.MaskPositionTableWriter(tabulator: Tabulator)

Bases: PositionTableWriter, MaskPositionTable

class seismicrna.mask.table.MaskReadTable

Bases: MaskTable, PartialReadTable, ABC

class seismicrna.mask.table.MaskReadTableLoader(table_file: str | Path, **kwargs)

Bases: ReadTableLoader, MaskReadTable

class seismicrna.mask.table.MaskReadTableWriter(tabulator: Tabulator)

Bases: ReadTableWriter, MaskReadTable

class seismicrna.mask.table.MaskTable

Bases: AverageTable, MaskFile, ABC

classmethod get_load_function()

LoadFunction for all Dataset types for this Table.

class seismicrna.mask.table.MaskTabulator(*, refseq: DNA, region: Region, pattern: RelPattern, min_mut_gap: int, mut_collisions: str, quick_unbias: bool, quick_unbias_thresh: float, count_ends: bool = True, **kwargs)

Bases: PartialTabulator, AverageTabulator, ABC

classmethod table_types()

Types of tables that this tabulator can write.

class seismicrna.mask.table.PartialDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: DatasetTabulator, PartialTabulator, ABC

classmethod init_kws()

Attributes of the dataset to use as keyword arguments in super().__init__().

class seismicrna.mask.table.PartialPositionTable

Bases: PartialTable, PositionTable, ABC

class seismicrna.mask.table.PartialReadTable

Bases: PartialTable, ReadTable, ABC

class seismicrna.mask.table.PartialTable

Bases: Table, HasRegFilePath, ABC

Table of filtered reads over a region of the sequence.

class seismicrna.mask.table.PartialTabulator(*, refseq: DNA, region: Region, pattern: RelPattern, min_mut_gap: int, mut_collisions: str, quick_unbias: bool, quick_unbias_thresh: float, count_ends: bool = True, **kwargs)

Bases: Tabulator, ABC

property correct_bias

Whether to correct for observer bias.

property data_per_pos

DataFrame of per-position data.

classmethod get_null_value()

The null value for a count: either 0 or NaN.

property p_ends_given_clust_noclose

Probability of each end coordinate.

seismicrna.mask.table.adjust_counts(table_per_pos: DataFrame, p_ends_given_clust_noclose: ndarray, n_reads_clust: Series | int, region: Region, min_mut_gap: int, mut_collisions: str, quick_unbias: bool, quick_unbias_thresh: float)

Adjust the given table of masked/clustered counts per position to correct for observer bias.

class seismicrna.mask.write.Masker(dataset: RelateMutsDataset | PoolMutsDataset, region: Region, pattern: RelPattern, *, max_mask_iter: int, mask_polya: int, mask_a: bool, mask_c: bool, mask_g: bool, mask_u: bool, mask_pos: list[tuple[str, int]], mask_pos_file: list[Path], mask_read: list[str], mask_read_file: list[Path], mask_discontig: bool, min_ncov_read: int, min_fcov_read: float, min_finfo_read: float, max_fmut_read: float, min_mut_gap: int, mut_collisions: str, probe: str, min_ninfo_pos: int, max_fmut_pos: float, quick_unbias: bool, quick_unbias_thresh: float, count_read: bool, brotli_level: int, top: Path, branch: str, num_cpus: int = 1)

Bases: object

Mask batches of relationships.

CHECKSUM_KEY = 'mask'
MASK_POS_FMUT = 'pos-fmut'
MASK_POS_NINFO = 'pos-ninfo'
MASK_READ_DISCONTIG = 'read-discontig'
MASK_READ_FCOV = 'read-fcov'
MASK_READ_FINFO = 'read-finfo'
MASK_READ_FMUT = 'read-fmut'
MASK_READ_GAP = 'read-gap'
MASK_READ_INIT = 'read-init'
MASK_READ_KEPT = 'read-kept'
MASK_READ_LIST = 'read-exclude'
MASK_READ_NCOV = 'read-ncov'
PATTERN_KEY = 'pattern'

create_report()

mask()

property n_reads_discontig

property n_reads_init

property n_reads_kept

Number of reads kept.

property n_reads_list

property n_reads_max_fmut

property n_reads_min_fcov

property n_reads_min_finfo

property n_reads_min_gap

property n_reads_min_ncov

property pos_a

Positions masked for having base A.

property pos_c

Positions masked for having base C.

property pos_g

Positions masked for having base G.

property pos_kept

Positions kept.

property pos_list

Positions masked arbitrarily from a list.

property pos_max_fmut

Positions masked for having too many mutations.

property pos_min_ninfo

Positions masked for having too few informative reads.

property pos_n

Positions masked for having base N.

property pos_polya

Positions masked for lying in a poly(A) sequence.

property pos_u

Positions masked for having base T or U.

property read_names_dataset

Dataset of the read names.
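The pos_polya property and the mask_polya option can be illustrated with a regular-expression scan for runs of A. This is a sketch under assumptions: the helper polya_positions is hypothetical, positions are taken to be 1-based, and every base of a qualifying run is assumed to be masked.

```python
import re


def polya_positions(refseq: str, min_length: int, first_pos: int = 1) -> list[int]:
    """List the positions lying in runs of at least min_length
    consecutive A bases; a min_length of 0 disables the search."""
    if min_length <= 0:
        return []
    positions = []
    # The pattern A{n,} matches n or more consecutive A characters.
    for match in re.finditer(rf"A{{{min_length},}}", refseq.upper()):
        positions.extend(range(match.start() + first_pos,
                               match.end() + first_pos))
    return positions


# The run of five A bases qualifies; the run of three does not.
masked = polya_positions("GCAAAAAGTAAAC", min_length=5)
```

In this example the masked positions are 3 through 7, and the shorter run near the 3' end is left untouched.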

seismicrna.mask.write.mask_region(dataset: RelateMutsDataset | PoolMutsDataset, region: Region, *, branch: str, count_del: bool, count_ins: bool, no_mut: Iterable[str], only_mut: Iterable[str], mask_pos_table: bool, mask_read_table: bool, force: bool, num_cpus: int, tmp_pfx, keep_tmp, **kwargs)

Mask out certain reads, positions, and relationships.
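The read-level thresholds (min_ncov_read, min_finfo_read, max_fmut_read) can be pictured on a toy reads-by-positions matrix. This is a simplified sketch, not the Masker's algorithm: the function keep_reads is hypothetical, and the denominators (covered bases for the informative fraction, informative bases for the mutated fraction) are assumptions.

```python
import numpy as np


def keep_reads(covered, informative, mutated, *, min_ncov_read=1,
               min_finfo_read=0.95, max_fmut_read=1.0):
    """Return a boolean array marking the reads that pass all filters.

    Each input is a reads-by-positions boolean matrix.
    """
    covered = np.asarray(covered, dtype=bool)
    informative = np.asarray(informative, dtype=bool) & covered
    mutated = np.asarray(mutated, dtype=bool) & informative
    ncov = covered.sum(axis=1)
    ninfo = informative.sum(axis=1)
    nmut = mutated.sum(axis=1)
    # Reads with a zero denominator get a fraction of 0 instead of NaN.
    finfo = np.divide(ninfo, ncov, out=np.zeros(len(ncov)), where=ncov > 0)
    fmut = np.divide(nmut, ninfo, out=np.zeros(len(ninfo)), where=ninfo > 0)
    return ((ncov >= min_ncov_read)
            & (finfo >= min_finfo_read)
            & (fmut <= max_fmut_read))


covered = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
informative = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
mutated = [[0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
kept = keep_reads(covered, informative, mutated,
                  min_finfo_read=0.75, max_fmut_read=0.5)
```

Only the first read is kept here: the second has too few informative calls, and the third covers no positions at all.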