seismicrna.mask package

Submodules

class seismicrna.mask.batch.MaskMutsBatch(*, read_nums: ndarray, **kwargs)

Bases: MaskReadBatch, PartialRegionMutsBatch

property read_weights

Weights for each read when computing counts.

class seismicrna.mask.batch.MaskReadBatch(*, read_nums: ndarray, **kwargs)

Bases: PartialReadBatch

property num_reads

Number of reads.

property read_nums

Read numbers.

class seismicrna.mask.batch.PartialReadBatch(*, batch: int)

Bases: ReadBatch, ABC

property max_read

Maximum possible value for a read index.

property read_indexes

Map each read number to its index in self.read_nums.
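
As a rough illustration of what max_read and read_indexes describe, the sketch below builds a lookup from read number to index in a list of read numbers. It uses plain Python lists rather than NumPy arrays, and the function name build_read_indexes and the -1 sentinel for absent reads are assumptions made for this example, not part of seismicrna.

```python
def build_read_indexes(read_nums: list[int], max_read: int) -> list[int]:
    """Map each read number to its index in read_nums (-1 if absent)."""
    # One slot per possible read number, 0 through max_read inclusive.
    indexes = [-1] * (max_read + 1)
    for index, read_num in enumerate(read_nums):
        indexes[read_num] = index
    return indexes


read_nums = [2, 5, 6]
read_indexes = build_read_indexes(read_nums, max_read=7)
print(read_indexes)  # [-1, -1, 0, -1, -1, 1, 2, -1]
```

With such a lookup, finding the data for read number 5 is a single index operation: read_nums[read_indexes[5]] gives back 5.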

class seismicrna.mask.batch.PartialRegionMutsBatch(*, region: Region, **kwargs)

Bases: PartialReadBatch, RegionMutsBatch, ABC

seismicrna.mask.batch.apply_mask(batch: RegionMutsBatch, read_nums: ndarray | None = None, region: Region | None = None, sanitize: bool = False)

class seismicrna.mask.dataset.JoinMaskMutsDataset(*args, **kwargs)

Bases: MaskDataset, JoinMutsDataset, MergedUnbiasDataset

classmethod get_batch_type()

Type of batch.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

classmethod name_batch_attrs()

Name the attributes of each batch.

class seismicrna.mask.dataset.MaskDataset(report_file: Path, verify_times: bool = True)

Bases: AverageDataset, ABC

Dataset of masked data.

class seismicrna.mask.dataset.MaskMutsDataset(dataset2_report_file: Path, **kwargs)

Bases: MaskDataset, MultistepDataset, UnbiasDataset

Chain mutation data with masked reads.

MASK_NAME = 'mask'

classmethod get_dataset1_load_func()

Function to load Dataset 1.

classmethod get_dataset2_type()

Type of Dataset 2.

property min_mut_gap

Minimum gap between two mutations.

property pattern

Pattern of mutations to count.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

property region

Region of the dataset.

class seismicrna.mask.dataset.MaskReadDataset(*args, masked_read_nums: dict[int, list] | None = None, **kwargs)

Bases: MaskDataset, LoadedDataset, UnbiasDataset

Load batches of masked data.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property min_mut_gap

Minimum gap between two mutations.

property pattern

Pattern of mutations to count.

property pos_kept

Positions kept after masking.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.
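
The effect of quick_unbias_thresh can be pictured with a small sketch. It follows the wording above (rates less than or equal to the threshold become 0) and shows only the thresholding step, not the unbiasing heuristic itself. The function name zero_small_rates is hypothetical.

```python
def zero_small_rates(mut_rates: list[float], thresh: float = 0.001) -> list[float]:
    """Treat mutation rates less than or equal to thresh as 0."""
    return [0.0 if rate <= thresh else rate for rate in mut_rates]


rates = [0.0005, 0.001, 0.02, 0.15]
clipped = zero_small_rates(rates)
print(clipped)  # [0.0, 0.0, 0.02, 0.15]
```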

class seismicrna.mask.io.MaskBatchIO(*, reg: str, **kwargs)

Bases: ReadBatchIO, MaskIO, MaskReadBatch

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.mask.io.MaskIO(*, reg: str, **kwargs)

Bases: RegIO, ABC

classmethod auto_fields()

Names and automatic values of selected fields.

seismicrna.mask.main.load_regions(input_path: Iterable[str | Path], coords: Iterable[tuple[str, int, int]], primers: Iterable[tuple[str, DNA, DNA]], primer_gap: int, regions_file: Path | None = None)

Load regions of relate reports.
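
One plausible reading of the primers and primer_gap arguments is sketched below: the region is the stretch of the reference between the forward primer and the binding site of the reverse primer, shrunk by primer_gap bases on each side. This is an assumption for illustration, not a description of how load_regions is implemented, and the helper names revcomp and region_between_primers are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def region_between_primers(refseq: str, fwd: str, rev: str,
                           primer_gap: int = 0) -> tuple[int, int]:
    """Return 1-based, inclusive (end5, end3) coordinates of the region."""
    fwd_start = refseq.find(fwd)
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement in the reference.
    rev_start = refseq.find(revcomp(rev))
    if fwd_start < 0 or rev_start < 0:
        raise ValueError("primer not found in reference")
    fwd_last = fwd_start + len(fwd)  # 1-based last base of the forward primer
    rev_first = rev_start + 1        # 1-based first base of the reverse site
    end5 = fwd_last + 1 + primer_gap
    end3 = rev_first - 1 - primer_gap
    if end5 > end3:
        raise ValueError("primers leave no region between them")
    return end5, end3


ref = "ACGTACGATTACAGATTACATTGGCC"
print(region_between_primers(ref, "ACGTAC", "GGCCAA"))                # (7, 20)
print(region_between_primers(ref, "ACGTAC", "GGCCAA", primer_gap=2))  # (9, 18)
```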

seismicrna.mask.main.run(input_path: Iterable[str | Path], *, tmp_pfx: str | Path = './tmp', keep_tmp: bool = False, mask_coords: Iterable[tuple[str, int, int]] = (), mask_primers: Iterable[tuple[str, DNA, DNA]] = (), primer_gap: int = 0, mask_regions_file: str | None = None, mask_del: bool = True, mask_ins: bool = True, mask_mut: Iterable[str] = (), mask_polya: int = 5, mask_gu: bool = True, mask_pos: Iterable[tuple[str, int]] = (), mask_pos_file: str | None = None, mask_read: Iterable[str] = (), mask_read_file: str | None = None, mask_discontig: bool = True, min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, min_ncov_read: int = 1, min_finfo_read: float = 0.95, max_fmut_read: float = 1.0, min_mut_gap: int = 3, quick_unbias: bool = True, quick_unbias_thresh: float = 0.001, max_mask_iter: int = 0, mask_pos_table: bool = True, mask_read_table: bool = True, brotli_level: int = 10, max_procs: int = 4, force: bool = False) → list[Path]

Define mutations and regions to filter reads and positions.

Parameters:
  • tmp_pfx (str | pathlib.Path) – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • mask_coords (Iterable) – Select a region of a reference given its 5’ and 3’ end coordinates [keyword-only, default: ()]

  • mask_primers (Iterable) – Select a region of a reference given its forward and reverse primers [keyword-only, default: ()]

  • primer_gap (int) – Leave a gap of this many bases between the primer and the region [keyword-only, default: 0]

  • mask_regions_file (str | None) – Select regions of references from coordinates/primers in a CSV file [keyword-only, default: None]

  • mask_del (bool) – Mask deletions [keyword-only, default: True]

  • mask_ins (bool) – Mask insertions [keyword-only, default: True]

  • mask_mut (Iterable) – Mask this type of mutation [keyword-only, default: ()]

  • mask_polya (int) – Mask stretches of at least this many consecutive A bases (0 disables) [keyword-only, default: 5]

  • mask_gu (bool) – Mask G and U bases [keyword-only, default: True]

  • mask_pos (Iterable) – Mask this position in this reference [keyword-only, default: ()]

  • mask_pos_file (str | None) – Mask positions in references from a file [keyword-only, default: None]

  • mask_read (Iterable) – Mask the read with this name [keyword-only, default: ()]

  • mask_read_file (str | None) – Mask the reads with names in this file [keyword-only, default: None]

  • mask_discontig (bool) – Mask paired-end reads with discontiguous mates [keyword-only, default: True]

  • min_ninfo_pos (int) – Mask positions with fewer than this many informative base calls [keyword-only, default: 1000]

  • max_fmut_pos (float) – Mask positions with more than this fraction of mutated base calls [keyword-only, default: 1.0]

  • min_ncov_read (int) – Mask reads with fewer than this many bases covering the region [keyword-only, default: 1]

  • min_finfo_read (float) – Mask reads with less than this fraction of informative base calls [keyword-only, default: 0.95]

  • max_fmut_read (float) – Mask reads with more than this fraction of mutated base calls [keyword-only, default: 1.0]

  • min_mut_gap (int) – Mask reads with two mutations separated by fewer than this many bases [keyword-only, default: 3]

  • quick_unbias (bool) – Correct observer bias using a quick (typically linear time) heuristic [keyword-only, default: True]

  • quick_unbias_thresh (float) – Treat mutated fractions under this threshold as 0 with --quick-unbias [keyword-only, default: 0.001]

  • max_mask_iter (int) – Stop masking after this many iterations (0 for no limit) [keyword-only, default: 0]

  • mask_pos_table (bool) – Tabulate relationships per position for mask data [keyword-only, default: True]

  • mask_read_table (bool) – Tabulate relationships per read for mask data [keyword-only, default: True]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
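
A call to run might look like the sketch below. The parameter names come from the signature above, but the input directory and the reference name my_ref are placeholders, and the wrapper run_mask is a hypothetical helper added only so that the import happens lazily.

```python
from pathlib import Path

# Hypothetical inputs: the directory and reference name are placeholders.
relate_dir = Path("out/my_sample/relate")
mask_options = dict(
    mask_coords=[("my_ref", 10, 80)],  # (reference, 5' end, 3' end)
    mask_polya=5,
    mask_gu=True,
    min_ninfo_pos=1000,
    min_finfo_read=0.95,
    min_mut_gap=3,
    max_procs=4,
)


def run_mask(input_dir: Path, **options) -> list[Path]:
    """Call seismicrna.mask.main.run with keyword-only options."""
    # Imported lazily so this sketch can be read without SEISMIC-RNA installed.
    from seismicrna.mask.main import run
    return run([input_dir], **options)


# Example call (requires SEISMIC-RNA and existing relate output):
# report_files = run_mask(relate_dir, **mask_options)
```

All options other than input_path are keyword-only, so they must be passed by name, as the dictionary unpacking does here.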

class seismicrna.mask.report.JoinMaskReport(**kwargs: Any | Callable[[Report], Any])

Bases: JoinReport

classmethod auto_fields()

Names and automatic values of selected fields.

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.mask.report.MaskReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, MaskIO

classmethod auto_fields()

Names and automatic values of selected fields.

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.mask.table.MaskBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, max_procs: int = 1, **kwargs)

Bases: BatchTabulator, MaskTabulator

class seismicrna.mask.table.MaskCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)

Bases: CountTabulator, MaskTabulator

class seismicrna.mask.table.MaskDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: PartialDatasetTabulator, MaskTabulator

classmethod load_function()

LoadFunction for all Dataset types for this Tabulator.

class seismicrna.mask.table.MaskPositionTable

Bases: MaskTable, PartialPositionTable, ABC

class seismicrna.mask.table.MaskPositionTableLoader(table_file: Path)

Bases: PositionTableLoader, MaskPositionTable

class seismicrna.mask.table.MaskPositionTableWriter(tabulator: Tabulator)

Bases: PositionTableWriter, MaskPositionTable

class seismicrna.mask.table.MaskReadTable

Bases: MaskTable, PartialReadTable, ABC

class seismicrna.mask.table.MaskReadTableLoader(table_file: Path)

Bases: ReadTableLoader, MaskReadTable

class seismicrna.mask.table.MaskReadTableWriter(tabulator: Tabulator)

Bases: ReadTableWriter, MaskReadTable

class seismicrna.mask.table.MaskTable

Bases: AverageTable, ABC

classmethod kind()

Kind of table.

class seismicrna.mask.table.MaskTabulator(*, refseq: DNA, region: Region, pattern: RelPattern, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, count_ends: bool = True, **kwargs)

Bases: PartialTabulator, AverageTabulator, ABC

classmethod table_types()

Types of tables that this tabulator can write.

class seismicrna.mask.table.PartialDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: DatasetTabulator, PartialTabulator, ABC

classmethod init_kws()

Attributes of the dataset to use as keyword arguments in super().__init__().

class seismicrna.mask.table.PartialPositionTable

Bases: PartialTable, PositionTable, ABC

classmethod path_segs()

Table’s path segments.

class seismicrna.mask.table.PartialReadTable

Bases: PartialTable, ReadTable, ABC

classmethod path_segs()

Table’s path segments.

class seismicrna.mask.table.PartialTable

Bases: Table, ABC

property path_fields

Table’s path fields.

class seismicrna.mask.table.PartialTabulator(*, refseq: DNA, region: Region, pattern: RelPattern, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, count_ends: bool = True, **kwargs)

Bases: Tabulator, ABC

property data_per_pos

DataFrame of per-position data.

classmethod get_null_value()

The null value for a count: either 0 or NaN.

property p_ends_given_clust_noclose

Probability of each end coordinate.

seismicrna.mask.table.adjust_counts(table_per_pos: DataFrame, p_ends_given_clust_noclose: ndarray, n_reads_clust: Series | int, region: Region, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float)

Adjust the given table of masked/clustered counts per position to correct for observer bias.

class seismicrna.mask.write.Masker(dataset: RelateMutsDataset | PoolDataset, region: Region, pattern: RelPattern, *, max_mask_iter: int = 0, mask_polya: int = 5, mask_gu: bool = True, mask_pos: list[tuple[str, int]] = (), mask_pos_file: Path | None, mask_read: list[str] = (), mask_read_file: Path | None, mask_discontig: bool = True, min_ncov_read: int = 1, min_finfo_read: float = 0.95, max_fmut_read: float = 1.0, min_mut_gap: int = 3, min_ninfo_pos: int = 1000, max_fmut_pos: float = 1.0, quick_unbias: bool = True, quick_unbias_thresh: float = 0.001, count_read: bool, brotli_level: int = 10, top: Path, max_procs: int = 4)

Bases: object

Mask batches of relation vectors.

CHECKSUM_KEY = 'mask'
MASK_POS_FMUT = 'pos-fmut'
MASK_POS_NINFO = 'pos-ninfo'
MASK_READ_DISCONTIG = 'read-discontig'
MASK_READ_FINFO = 'read-finfo'
MASK_READ_FMUT = 'read-fmut'
MASK_READ_GAP = 'read-gap'
MASK_READ_INIT = 'read-init'
MASK_READ_KEPT = 'read-kept'
MASK_READ_LIST = 'read-exclude'
MASK_READ_NCOV = 'read-ncov'
PATTERN_KEY = 'pattern'

create_report()

mask()

property n_reads_discontig

property n_reads_init

property n_reads_kept

Number of reads kept.

property n_reads_list

property n_reads_max_fmut

property n_reads_min_finfo

property n_reads_min_gap

property n_reads_min_ncov

property pos_gu

Positions masked for having a G or U base.

property pos_kept

Positions kept.

property pos_list

Positions masked arbitrarily from a list.

property pos_max_fmut

Positions masked for having too many mutations.

property pos_min_ninfo

Positions masked for having too few informative reads.

property pos_polya

Positions masked for lying in a poly(A) sequence.

property read_names_dataset

Dataset of the read names.
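
To make the pos_polya and pos_gu properties more concrete, here is a minimal sketch of the two sequence-based position masks, written independently of seismicrna. It assumes 1-based positions and a DNA reference in which T stands for U. The function names find_polya_positions and find_gu_positions are hypothetical.

```python
def find_polya_positions(seq: str, min_length: int = 5) -> list[int]:
    """1-based positions in runs of at least min_length consecutive A bases."""
    if min_length <= 0:
        return []  # a value of 0 disables poly(A) masking
    positions = []
    run_start = None
    # The trailing "N" is a sentinel that closes a run ending at the last base.
    for i, base in enumerate(seq + "N"):
        if base == "A":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                positions.extend(range(run_start + 1, i + 1))
            run_start = None
    return positions


def find_gu_positions(seq: str) -> list[int]:
    """1-based positions of G and U bases (T in a DNA reference)."""
    return [i for i, base in enumerate(seq, start=1) if base in "GTU"]


seq = "GCAAAAACTAAG"
polya = find_polya_positions(seq)
gu = find_gu_positions(seq)
masked = set(polya) | set(gu)
kept = [pos for pos in range(1, len(seq) + 1) if pos not in masked]
print(polya)  # [3, 4, 5, 6, 7]
print(gu)     # [1, 9, 12]
print(kept)   # [2, 8, 10, 11]
```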

seismicrna.mask.write.mask_region(dataset: RelateMutsDataset | PoolDataset, region: Region, *, mask_del: bool, mask_ins: bool, mask_mut: Iterable[str], mask_pos_table: bool, mask_read_table: bool, force: bool, n_procs: int, tmp_pfx, keep_tmp, **kwargs)

Mask out certain reads, positions, and relationships.
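
Two of the read-level filters, min_mut_gap and min_finfo_read, can be sketched as follows. The sketch makes two assumptions that the text above does not confirm: the gap counts the bases strictly between two mutated positions, and the informative fraction is the number of informative base calls divided by the number of bases covering the region. The names has_close_mutations and keep_read are hypothetical.

```python
def has_close_mutations(mut_positions: list[int], min_mut_gap: int = 3) -> bool:
    """True if any two mutations are separated by fewer than min_mut_gap bases."""
    ordered = sorted(mut_positions)
    # Assumption: the gap is the number of bases strictly between two mutations.
    return any(later - earlier - 1 < min_mut_gap
               for earlier, later in zip(ordered, ordered[1:]))


def keep_read(n_info: int, n_covered: int, mut_positions: list[int], *,
              min_finfo_read: float = 0.95, min_mut_gap: int = 3) -> bool:
    """Decide whether a read passes both filters."""
    if n_covered == 0:
        return False
    # Assumption: informative fraction = informative calls / covered bases.
    if n_info / n_covered < min_finfo_read:
        return False
    return not has_close_mutations(mut_positions, min_mut_gap)


print(has_close_mutations([10, 13]))  # True: only 2 bases lie between them
print(has_close_mutations([10, 14]))  # False: 3 bases lie between them
print(keep_read(96, 100, [10, 14]))   # True
print(keep_read(90, 100, []))         # False: too few informative calls
```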