seismicrna.relate package

Submodules

class seismicrna.relate.batch.FullReadBatch(*, batch: int, **kwargs)

Bases: ReadBatch, ABC

property max_read

Maximum possible value for a read index.

property read_indexes

Map each read number to its index in self.read_nums.
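
As an illustration of this kind of mapping, the sketch below builds a lookup from read number to position using a plain list and dict. The read numbers are made up, and the actual property may well return a NumPy array rather than a dict.

```python
# Hypothetical read numbers held by a batch (values are illustrative only).
read_nums = [3, 7, 8, 12]

# Map each read number to its position in read_nums.
read_indexes = {num: idx for idx, num in enumerate(read_nums)}

# Look up where read number 8 sits in read_nums.
index_of_8 = read_indexes[8]
```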

property read_nums

Read numbers.

abstractmethod classmethod simulate(*args, **kwargs) → Self

Simulate a batch.

class seismicrna.relate.batch.ReadNamesBatch(*, names: list[str] | ndarray, **kwargs)

Bases: FullReadBatch

property num_reads

Number of reads.

classmethod simulate(branches: dict[str, str], batch: int, num_reads: int, formatter: Callable[[int, int], str] = format_read_name, **kwargs)

Simulate a batch.

Parameters:
  • branches (dict[str, str]) – Branches of the workflow.

  • batch (int) – Batch number.

  • num_reads (int) – Number of reads in the batch.

  • formatter (Callable[[int, int], str]) – Function to generate the name of each read: must accept the batch number and the read number and return a string.
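
A formatter only needs to satisfy the contract described above: take the batch number and the read number, and return a string. The function below is a hypothetical example of such a callable; its name and output format are assumptions and do not reflect what format_read_name produces.

```python
def custom_read_name(batch_num: int, read_num: int) -> str:
    """Hypothetical formatter: build a read name from two integers."""
    return f"batch-{batch_num}_read-{read_num}"


# Names for the first three reads of batch 0.
names = [custom_read_name(0, read_num) for read_num in range(3)]
```

A callable like this could presumably be supplied as formatter=custom_read_name when calling simulate.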

class seismicrna.relate.batch.RelateMutsBatch(*, region: Region, sanitize: bool = True, muts: dict[int, dict[int, list[int] | ndarray]], masked_read_nums: ndarray | list[int] | None = None, **kwargs)

Bases: FullReadBatch, MutsBatch, ABC

property read_weights

Weights for each read when computing counts.

class seismicrna.relate.batch.RelateRegionMutsBatch(*, region: Region, **kwargs)

Bases: RelateMutsBatch, RegionMutsBatch

classmethod simulate(ref: str, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, paired: bool, read_length: int, p_rev: float, min_mut_gap: int, num_reads: int, **kwargs)

Simulate a batch.

Parameters:
  • ref (str) – Name of the reference.

  • pmut (pd.DataFrame) – Rate of each type of mutation at each position.

  • uniq_end5s (np.ndarray) – Unique read 5’ end coordinates.

  • uniq_end3s (np.ndarray) – Unique read 3’ end coordinates.

  • pends (np.ndarray) – Probability of each set of unique end coordinates.

  • paired (bool) – Whether to simulate paired-end or single-end reads.

  • read_length (int) – Length of each read segment (paired-end reads only).

  • p_rev (float) – Probability that mate 1 is reversed (paired-end reads only).

  • min_mut_gap (int) – Minimum number of positions between two mutations.

  • num_reads (int) – Number of reads in the batch.
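
To make the relationship between uniq_end5s, uniq_end3s, and pends concrete, the sketch below draws one pair of end coordinates per read, weighted by pends. It uses plain lists and the standard random module instead of NumPy arrays, the helper sample_ends is hypothetical, and the values are illustrative; it is not the sampling procedure used by simulate.

```python
import random

# Illustrative inputs: three unique (5', 3') end pairs and their probabilities.
uniq_end5s = [1, 1, 10]
uniq_end3s = [50, 60, 60]
pends = [0.5, 0.3, 0.2]


def sample_ends(num_reads: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw one (5', 3') coordinate pair per read, weighted by pends."""
    rng = random.Random(seed)
    pairs = list(zip(uniq_end5s, uniq_end3s))
    return rng.choices(pairs, weights=pends, k=num_reads)


ends = sample_ends(num_reads=5)
```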

seismicrna.relate.batch.format_read_name(batch_num: int, read_num: int)

Format a read name.

class seismicrna.relate.dataset.AverageDataset(report_file: str | Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset of population average data.

property best_k

Best number of clusters.

property ks

Numbers of clusters.

class seismicrna.relate.dataset.NamesDataset(report_file: str | Path, verify_times: bool = True)

Bases: AverageDataset, ABC

classmethod kind()

class seismicrna.relate.dataset.PoolDataset(*args, **kwargs)

Bases: RelateDataset, TallDataset, MutsDataset, MergedRegionDataset

Load pooled batches of relationships.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

property region

Region of the dataset.

class seismicrna.relate.dataset.PoolReadNamesDataset(*args, **kwargs)

Bases: NamesDataset, TallDataset

Pooled dataset of read names.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

class seismicrna.relate.dataset.ReadNamesDataset(report_file: str | Path, verify_times: bool = True)

Bases: NamesDataset, LoadedDataset

Dataset of read names from the Relate step.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property pattern

Pattern of mutations to count.

class seismicrna.relate.dataset.RelateDataset(report_file: str | Path, verify_times: bool = True)

Bases: AverageDataset, ABC

Dataset of relationships.

class seismicrna.relate.dataset.RelateMutsDataset(report_file: str | Path, verify_times: bool = True)

Bases: RelateDataset, LoadedDataset, MutsDataset

Dataset of mutations from the Relate step.

get_batch(batch: int)

Get a specific batch of data.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property paired

Whether the reads are paired-end.

property pattern

Pattern of mutations to count.

property refseq

Sequence of the reference.

property region

Region of the dataset.

class seismicrna.relate.io.ReadNamesBatchIO(*, names: list[str] | ndarray, **kwargs)

Bases: ReadNamesBatch, ReadBatchIO, RefBrickleIO, RelateIO

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.relate.io.RefseqIO(*args, refseq: DNA, **kwargs)

Bases: RefBrickleIO, RelateIO

classmethod get_file_seg_type()

Type of the last segment in the path.

property refseq

class seismicrna.relate.io.RelateBatchIO(*args, region: Region, **kwargs)

Bases: RelateMutsBatch, MutsBatchIO, RefBrickleIO, RelateIO

classmethod from_region_batch(batch: RelateRegionMutsBatch, *, sample: str, branches: dict[str, str])

Create an instance from a RelateRegionMutsBatch.

classmethod get_file_seg_type()

Type of the last segment in the path.

classmethod simulate(*args, sample: str, branches: dict[str, str], **kwargs)

Simulate a batch.

to_region_batch(region: Region)

Create a RelateRegionMutsBatch from this instance.

class seismicrna.relate.io.RelateFile

Bases: HasRefFilePath, ABC

classmethod get_step()

Step of the workflow.

class seismicrna.relate.io.RelateIO

Bases: RelateFile, RefFileIO, ABC

seismicrna.relate.io.from_reads(reads: Iterable[tuple[str, tuple[tuple[list[int], list[int]], dict[int, int]]]], *, sample: str, branches: dict[str, str], ref: str, refseq: DNA, batch: int, write_read_names: bool, drop_empty_reads: bool = True)

Gather reads into a batch of relationships.

class seismicrna.relate.lists.RelateList(*, sample: str, branches: Iterable[str], ref: str, data: DataFrame, **kwargs)

Bases: List, RelateFile, ABC

class seismicrna.relate.lists.RelatePositionList(*, sample: str, branches: Iterable[str], ref: str, data: DataFrame, **kwargs)

Bases: PositionList, RelateList

classmethod get_table_type()

Type of table that this type of list can process.

seismicrna.relate.main.check_duplicates(xam_files: list[Path])

Check if any combination of sample, reference, and branches occurs more than once.
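
The idea of such a check can be sketched with a counter over key tuples. The helper find_duplicates and the example keys are hypothetical; how the real function derives the sample, reference, and branches from each XAM path, and what it does on finding a duplicate, is not shown here.

```python
from collections import Counter


def find_duplicates(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]


# Hypothetical (sample, reference, branches) keys, one per XAM file.
keys = [
    ("sample1", "refA", ()),
    ("sample1", "refB", ()),
    ("sample1", "refA", ()),
]
dups = find_duplicates(keys)
```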

seismicrna.relate.main.run(fasta: str | Path, input_path: Iterable[str | Path], *, out_dir: str | Path = './out', branch: str = '', min_reads: int = 1000, min_mapq: int = 25, phred_enc: int = 33, min_phred: int = 25, batch_size: int = 65536, insert3: bool = True, ambindel: bool = True, overhangs: bool = True, clip_end5: int = 4, clip_end3: int = 4, sep_strands: bool = False, rev_label: str = '-rev', write_read_names: bool = False, relate_pos_table: bool = True, relate_read_table: bool = False, relate_cx: bool = True, num_cpus: int = 4, brotli_level: int = 10, force: bool = False, keep_tmp: bool = False, tmp_pfx='./tmp')

Compute relationships between references and aligned reads.

Parameters:
  • out_dir (str | pathlib.Path) – Write all output files to this directory [keyword-only, default: ‘./out’]

  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]

  • min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]

  • min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]

  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • min_phred (int) – Mark base calls with Phred scores lower than this threshold as ambiguous [keyword-only, default: 25]

  • batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]

  • insert3 (bool) – Mark each insertion on the base to its 3’ (True) or 5’ (False) side [keyword-only, default: True]

  • ambindel (bool) – Mark all ambiguous insertions and deletions (indels) [keyword-only, default: True]

  • overhangs (bool) – Retain the overhangs of paired-end mates that dovetail [keyword-only, default: True]

  • clip_end5 (int) – Clip this many bases from the 5’ end of each read [keyword-only, default: 4]

  • clip_end3 (int) – Clip this many bases from the 3’ end of each read [keyword-only, default: 4]

  • sep_strands (bool) – Separate each alignment map into forward- and reverse-strand reads [keyword-only, default: False]

  • rev_label (str) – With --sep-strands, add this label to each reverse-strand reference [keyword-only, default: ‘-rev’]

  • write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]

  • relate_pos_table (bool) – Tabulate relationships per position for relate data [keyword-only, default: True]

  • relate_read_table (bool) – Tabulate relationships per read for relate data [keyword-only, default: False]

  • relate_cx (bool) – Use a fast (C extension module) version of the relate algorithm; the slow (Python) version is still available as a fallback if the C extension cannot be loaded, and for debugging/benchmarking [keyword-only, default: True]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
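
The interaction between phred_enc and min_phred can be shown with a small sketch. With an encoding offset of 33, each quality character maps to a Phred score equal to its ASCII code minus 33, and scores below min_phred are flagged. The helper names are hypothetical and this is not the code path that run uses.

```python
def phred_scores(qual: str, phred_enc: int = 33) -> list[int]:
    """Decode a quality string into Phred scores."""
    return [ord(char) - phred_enc for char in qual]


def low_quality(qual: str, min_phred: int = 25, phred_enc: int = 33) -> list[bool]:
    """Flag base calls whose Phred score is below min_phred."""
    return [score < min_phred for score in phred_scores(qual, phred_enc)]


# 'I' -> 40, '5' -> 20, '!' -> 0, ':' -> 25 under Phred+33.
scores = phred_scores("I5!:")
flags = low_quality("I5!:")
```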

class seismicrna.relate.report.BaseRelateReport(**kwargs: Any | Callable[[Report], Any])

Bases: RefReport, RelateIO, ABC

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.relate.report.PoolReport(**kwargs: Any | Callable[[Report], Any])

Bases: BaseRelateReport

classmethod get_param_report_fields()

Parameter fields of the report.

class seismicrna.relate.report.RelateReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, BaseRelateReport

classmethod get_checksum_report_fields()

Checksum fields of the report.

classmethod get_param_report_fields()

Parameter fields of the report.

get_refseq_file(top: Path)

classmethod get_result_report_fields()

Result fields of the report.

class seismicrna.relate.sam.XamViewer(xam_input: Path, tmp_dir: Path, branch: str, batch_size: int, num_cpus: int = 1)

Bases: object

property ancestors

property branches

create_tmp_sam()

Create the temporary SAM file.

delete_tmp_sam()

Delete the temporary SAM file.

property flagstats

property indexes

iter_records(batch: int)

Iterate through the records of the batch.

property n_reads

Total number of reads.

property paired

Whether the reads are paired.

property ref

property sample

property tmp_sam_path

Get the path to the temporary SAM file.

seismicrna.relate.sam.get_line_attrs(line: str) → tuple[str, bool, bool]

Read attributes from a line in a SAM file.
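
The return type suggests a read name plus two flags. One plausible reading, sketched below, takes the name from the first SAM field and tests the paired (0x1) and properly paired (0x2) bits of the FLAG field, as defined by the SAM format. Which attributes the real function returns is an assumption here, and sketch_line_attrs is a hypothetical stand-in.

```python
FLAG_PAIRED = 0x1       # read is one segment of a multi-segment template
FLAG_PROPER_PAIR = 0x2  # each segment is properly aligned


def sketch_line_attrs(line: str) -> tuple[str, bool, bool]:
    """Return (read name, is paired, is properly paired) from a SAM line."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    flag = int(fields[1])
    return name, bool(flag & FLAG_PAIRED), bool(flag & FLAG_PROPER_PAIR)


# FLAG 99 = 1 + 2 + 32 + 64: paired, proper pair, mate reversed, first mate.
line = "read1\t99\trefA\t1\t60\t4M\t=\t10\t13\tACGT\tIIII\n"
attrs = sketch_line_attrs(line)
```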

seismicrna.relate.sam.tmp_xam_cmd(xam_in: Path, xam_out: Path, paired: bool, num_cpus: int = 1)

Collate and create a temporary XAM file.

seismicrna.relate.sim.simulate_batch(sample: str, branches: dict[str, str], ref: str, batch: int, write_read_names: bool, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, paired: bool, read_length: int, p_rev: float, min_mut_gap: int, num_reads: int, formatter: Callable[[int, int], str] = format_read_name)

Simulate a pair of RelateBatchIO and ReadNamesBatchIO.

seismicrna.relate.sim.simulate_batches(batch_size: int, pmut: DataFrame, pclust: Series, num_reads: int, **kwargs)

seismicrna.relate.sim.simulate_cluster(first_batch: int, batch_size: int, num_reads: int, **kwargs)

Simulate all batches for one cluster.
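
One way to think about the first_batch, batch_size, and num_reads arguments is as a plan that splits a cluster's reads into consecutively numbered batches. The generator below is a hypothetical sketch of that arithmetic under the assumption that every batch except possibly the last is full; it does not simulate any reads.

```python
def batch_sizes(first_batch: int, batch_size: int, num_reads: int):
    """Yield (batch number, reads in batch) until num_reads are assigned."""
    batch = first_batch
    remaining = num_reads
    while remaining > 0:
        size = min(batch_size, remaining)
        yield batch, size
        batch += 1
        remaining -= size


# Ten reads in batches of at most four, numbered from batch 2.
plan = list(batch_sizes(first_batch=2, batch_size=4, num_reads=10))
```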

seismicrna.relate.sim.simulate_relate(*, out_dir: Path, tmp_dir: Path, branch: str, sample: str, ref: str, refseq: DNA, batch_size: int, num_reads: int, write_read_names: bool, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, pclust: Series, brotli_level: int, force: bool, **kwargs)

Simulate an entire relate step.

seismicrna.relate.strands.generate_both_strands(ref: str, seq: DNA, rev_label: str)

Yield both the forward and reverse strand for each sequence.
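
A minimal sketch of this idea follows, using a plain str in place of the DNA class and handling only the bases A, C, G, and T. Naming the reverse strand by appending rev_label to the reference name is an assumption based on the description of rev_label, and both_strands is a hypothetical stand-in for the real generator.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def both_strands(ref: str, seq: str, rev_label: str):
    """Yield (name, sequence) for the forward strand, then the reverse."""
    yield ref, seq
    # Reverse complement: complement each base, then reverse the order.
    yield f"{ref}{rev_label}", seq.translate(COMPLEMENT)[::-1]


strands = list(both_strands("refA", "AACG", "-rev"))
```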

seismicrna.relate.strands.write_both_strands(fasta_in: Path, fasta_out: Path, rev_label: str)

Write a FASTA file of both forward and reverse strands.

class seismicrna.relate.table.AverageTable

Bases: RelTypeTable, ABC

Average over an ensemble of RNA structures.

classmethod get_header_type()

Type of the header for the table.

class seismicrna.relate.table.AverageTabulator(*, top: Path, branches: dict[str, str], sample: str, region: Region, count_ends: bool, count_pos: bool, count_read: bool, validate: bool = True)

Bases: Tabulator, ABC

property data_per_clust

Series of per-cluster data (or None if no clusters).

class seismicrna.relate.table.FullTabulator(*, ref: str, refseq: DNA, count_ends: bool = False, **kwargs)

Bases: Tabulator, ABC

classmethod get_null_value()

The null value for a count: either 0 or NaN.

class seismicrna.relate.table.RelateBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, num_cpus: int = 1, **kwargs)

Bases: BatchTabulator, RelateTabulator

class seismicrna.relate.table.RelateCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)

Bases: CountTabulator, RelateTabulator

class seismicrna.relate.table.RelateDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: DatasetTabulator, RelateTabulator

classmethod init_kws()

Attributes of the dataset to use as keyword arguments in super().__init__().

class seismicrna.relate.table.RelatePositionTable

Bases: RelateTable, PositionTable, ABC

class seismicrna.relate.table.RelatePositionTableLoader(table_file: str | Path, **kwargs)

Bases: PositionTableLoader, RelatePositionTable

Load relate data indexed by position.

class seismicrna.relate.table.RelatePositionTableWriter(tabulator: Tabulator)

Bases: PositionTableWriter, RelatePositionTable

class seismicrna.relate.table.RelateReadTable

Bases: RelateTable, ReadTable, ABC

class seismicrna.relate.table.RelateReadTableLoader(table_file: str | Path, **kwargs)

Bases: ReadTableLoader, RelateReadTable

Load relate data indexed by read.

class seismicrna.relate.table.RelateReadTableWriter(tabulator: Tabulator)

Bases: ReadTableWriter, RelateReadTable

class seismicrna.relate.table.RelateTable

Bases: AverageTable, RelateFile, ABC

classmethod get_load_function()

LoadFunction for all Dataset types for this Table.

class seismicrna.relate.table.RelateTabulator(*, ref: str, refseq: DNA, count_ends: bool = False, **kwargs)

Bases: FullTabulator, AverageTabulator, ABC

classmethod table_types()

Types of tables that this tabulator can write.

class seismicrna.relate.write.RelationWriter(xam_view: XamViewer, fasta_file: str | Path)

Bases: object

Compute and write relationships for all reads from one sample aligned to one reference sequence.

property branches

property num_reads

property ref

property refseq

property sample

write(*, out_dir: Path, release_dir: Path, min_mapq: int, min_reads: int, min_phred: int, phred_enc: int, insert3: bool, ambindel: bool, overhangs: bool, clip_end5: int, clip_end3: int, relate_pos_table: bool, relate_read_table: bool, brotli_level: int, force: bool, num_cpus: int, **kwargs)

Compute relationships for every record in a XAM file.

seismicrna.relate.write.generate_batch(batch: int, *, xam_view: XamViewer, top: Path, refseq: DNA, brotli_level: int, count_pos: bool, count_read: bool, write_read_names: bool, **kwargs)

Compute relationships for every SAM record in one batch.

seismicrna.relate.write.relate_records(records: Iterable[tuple[str, str, str]], ref: str, refseq: str, min_mapq: int, min_qual: int, insert3: bool, ambindel: bool, overhangs: bool, clip_end5: int, clip_end3: int, relate_cx: bool)

seismicrna.relate.write.relate_xam(xam_file: Path, *, fasta: Path, tmp_dir: Path, branch: str, batch_size: int, num_cpus: int, **kwargs)

Write the batches of relationships for one XAM file.