seismicrna.relate package

Submodules

class seismicrna.relate.batch.FullReadBatch(*, batch: int)

Bases: ReadBatch, ABC

property max_read

Maximum possible value for a read index.

property read_indexes

Map each read number to its index in self.read_nums.

property read_nums

Read numbers.
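The relationship between read_nums, max_read, and read_indexes can be sketched in plain Python. This is an illustration only, not the SEISMIC-RNA implementation: the helper name map_read_indexes and the use of -1 for read numbers that are absent are assumptions.

```python
def map_read_indexes(read_nums: list[int]) -> list[int]:
    """Map each read number to its index in read_nums (-1 if absent)."""
    if not read_nums:
        return []
    # The largest read number determines the length of the map.
    max_read = max(read_nums)
    indexes = [-1] * (max_read + 1)
    for index, read_num in enumerate(read_nums):
        indexes[read_num] = index
    return indexes


# Hypothetical read numbers for illustration.
read_nums = [0, 2, 5]
read_indexes = map_read_indexes(read_nums)
```

With such a map, finding the position of a read in the batch is a single lookup by read number rather than a search through read_nums.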

class seismicrna.relate.batch.FullRegionMutsBatch(*, region: Region, **kwargs)

Bases: FullReadBatch, RegionMutsBatch, ABC

class seismicrna.relate.batch.ReadNamesBatch(*, names: list[str] | ndarray, **kwargs)

Bases: FullReadBatch

property num_reads

Number of reads.

classmethod simulate(batch: int, num_reads: int, formatter: Callable[[int, int], str] = format_read_name, **kwargs)

Simulate a batch.

Parameters:
  • batch (int) – Batch number.

  • num_reads (int) – Number of reads in the batch.

  • formatter (Callable[[int, int], str]) – Function to generate the name of each read: must accept the batch number and the read number and return a string.
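A custom formatter only needs to match the signature described above: it accepts the batch number and the read number and returns a string. The sketch below shows one possibility; the function name format_with_prefix and the naming scheme are hypothetical, and the output of the default format_read_name is not documented here.

```python
def format_with_prefix(batch: int, read: int) -> str:
    """Hypothetical formatter: build a read name from batch and read numbers."""
    return f"sim-batch{batch}-read{read}"


# Names for the first three reads of batch 0.
names = [format_with_prefix(0, read) for read in range(3)]
```

A function like this could presumably be passed as formatter=format_with_prefix to ReadNamesBatch.simulate; that call is not executed here.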

class seismicrna.relate.batch.RelateBatch(*, region: Region, **kwargs)

Bases: FullRegionMutsBatch

property read_weights

Weights for each read when computing counts.

classmethod simulate(batch: int, ref: str, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, paired: bool, read_length: int, p_rev: float, min_mut_gap: int, num_reads: int, **kwargs)

Simulate a batch.

Parameters:
  • batch (int) – Batch number.

  • ref (str) – Name of the reference.

  • pmut (pd.DataFrame) – Rate of each type of mutation at each position.

  • uniq_end5s (np.ndarray) – Unique read 5’ end coordinates.

  • uniq_end3s (np.ndarray) – Unique read 3’ end coordinates.

  • pends (np.ndarray) – Probability of each set of unique end coordinates.

  • paired (bool) – Whether to simulate paired-end or single-end reads.

  • read_length (int) – Length of each read segment (paired-end reads only).

  • p_rev (float) – Probability that mate 1 is reversed (paired-end reads only).

  • min_mut_gap (int) – Minimum number of positions between two mutations.

  • num_reads (int) – Number of reads in the batch.
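The parameters uniq_end5s, uniq_end3s, and pends describe paired lists: element i of pends is the probability of the end coordinates (uniq_end5s[i], uniq_end3s[i]). The following sketch draws end coordinates under that reading using only the standard library. It is not the library's sampling code; plain lists stand in for the NumPy arrays, and the function name sample_ends and the example values are assumptions.

```python
import random


def sample_ends(uniq_end5s, uniq_end3s, pends, num_reads, seed=0):
    """Draw (end5, end3) pairs with the given probabilities."""
    if not len(uniq_end5s) == len(uniq_end3s) == len(pends):
        raise ValueError("End coordinates and probabilities must align")
    rng = random.Random(seed)
    pairs = list(zip(uniq_end5s, uniq_end3s))
    # Each pair is chosen with probability proportional to its weight.
    return rng.choices(pairs, weights=pends, k=num_reads)


# Hypothetical values: three unique pairs of end coordinates.
uniq_end5s = [1, 1, 11]
uniq_end3s = [20, 30, 30]
pends = [0.5, 0.3, 0.2]
ends = sample_ends(uniq_end5s, uniq_end3s, pends, num_reads=5)
```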

seismicrna.relate.batch.format_read_name(batch: int, read: int)

Format a read name.

class seismicrna.relate.dataset.AverageDataset(report_file: Path, verify_times: bool = True)

Bases: Dataset, ABC

Dataset of population average data.

property best_k

Best number of clusters.

property ks

Numbers of clusters.

class seismicrna.relate.dataset.NamesDataset(report_file: Path, verify_times: bool = True)

Bases: AverageDataset, ABC

classmethod kind()

class seismicrna.relate.dataset.PoolDataset(report_file: Path, verify_times: bool = True)

Bases: RelateDataset, TallDataset, MutsDataset, MergedRegionDataset

Load pooled batches of relationships.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

property region

Region of the dataset.

class seismicrna.relate.dataset.PoolReadNamesDataset(report_file: Path, verify_times: bool = True)

Bases: NamesDataset, TallDataset

Pooled dataset of read names.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

class seismicrna.relate.dataset.ReadNamesDataset(report_file: Path, verify_times: bool = True)

Bases: NamesDataset, LoadedDataset

Dataset of read names from the Relate step.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property pattern

Pattern of mutations to count.

class seismicrna.relate.dataset.RelateDataset(report_file: Path, verify_times: bool = True)

Bases: AverageDataset, ABC

Dataset of relationships.

class seismicrna.relate.dataset.RelateMutsDataset(report_file: Path, verify_times: bool = True)

Bases: RelateDataset, LoadedDataset, MutsDataset

Dataset of mutations from the Relate step.

get_batch(batch: int)

Get a specific batch of data.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property paired

Whether the reads are paired-end.

property pattern

Pattern of mutations to count.

property refseq

Sequence of the reference.

property region

Region of the dataset.

class seismicrna.relate.io.ReadNamesBatchIO(*, sample: str, ref: str, **kwargs)

Bases: ReadBatchIO, RelateIO, ReadNamesBatch

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.relate.io.RelateBatchIO(*args, region: Region, **kwargs)

Bases: MutsBatchIO, RelateIO, RelateBatch

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.relate.io.RelateIO(*, sample: str, ref: str, **kwargs)

Bases: RefIO, ABC

classmethod auto_fields()

Names and automatic values of selected fields.

seismicrna.relate.io.from_reads(reads: Iterable[tuple[str, tuple[tuple[list[int], list[int]], dict[int, int]]]], sample: str, ref: str, refseq: DNA, batch: int, write_read_names: bool)

Accumulate reads into relation vectors.

seismicrna.relate.main.check_duplicates(xam_files: list[Path])

Check if any sample-reference pair occurs more than once.
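The duplicate check can be illustrated with a counter over (sample, reference) pairs. This sketch is an assumption about the general approach, not the function's actual code: check_duplicates takes XAM file paths, and how it derives the sample and reference from each path is not described here, so the sketch starts from the pairs themselves. The name find_duplicate_pairs is hypothetical.

```python
from collections import Counter


def find_duplicate_pairs(pairs):
    """Return every (sample, ref) pair that occurs more than once."""
    counts = Counter(pairs)
    return sorted(pair for pair, count in counts.items() if count > 1)


# Hypothetical sample and reference names.
pairs = [("sample1", "refA"), ("sample1", "refB"), ("sample1", "refA")]
duplicates = find_duplicate_pairs(pairs)
```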

seismicrna.relate.main.run(fasta: str | Path, input_path: Iterable[str | Path], *, out_dir: str | Path = './out', min_reads: int = 1000, min_mapq: int = 25, phred_enc: int = 33, min_phred: int = 25, batch_size: int = 65536, insert3: bool = True, ambindel: bool = True, overhangs: bool = True, clip_end5: int = 4, clip_end3: int = 4, sep_strands: bool = False, rev_label: str = '-rev', write_read_names: bool = False, relate_pos_table: bool = True, relate_read_table: bool = False, relate_cx: bool = True, max_procs: int = 4, brotli_level: int = 10, force: bool = False, keep_tmp: bool = False, tmp_pfx='./tmp')

Compute relationships between references and aligned reads.

Parameters:
  • out_dir (str | pathlib.Path) – Write all output files to this directory [keyword-only, default: ‘./out’]

  • min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]

  • min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]

  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • min_phred (int) – Mark base calls with Phred scores lower than this threshold as ambiguous [keyword-only, default: 25]

  • batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]

  • insert3 (bool) – Mark each insertion on the base to its 3’ (True) or 5’ (False) side [keyword-only, default: True]

  • ambindel (bool) – Mark all ambiguous insertions and deletions (indels) [keyword-only, default: True]

  • overhangs (bool) – Retain the overhangs of paired-end mates that dovetail [keyword-only, default: True]

  • clip_end5 (int) – Clip this many bases from the 5’ end of each read [keyword-only, default: 4]

  • clip_end3 (int) – Clip this many bases from the 3’ end of each read [keyword-only, default: 4]

  • sep_strands (bool) – Separate each alignment map into forward- and reverse-strand reads [keyword-only, default: False]

  • rev_label (str) – With --sep-strands, add this label to each reverse-strand reference [keyword-only, default: ‘-rev’]

  • write_read_names (bool) – Write the name of each read in a second set of batches (necessary for the options --mask-read or --mask-read-file) [keyword-only, default: False]

  • relate_pos_table (bool) – Tabulate relationships per position for relate data [keyword-only, default: True]

  • relate_read_table (bool) – Tabulate relationships per read for relate data [keyword-only, default: False]

  • relate_cx (bool) – Use a fast (C extension module) version of the relate algorithm; the slow (Python) version is still available as a fallback if the C extension cannot be loaded, and for debugging/benchmarking [keyword-only, default: True]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 4]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]
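A call to run might be set up as in the sketch below, which uses only the parameter names and defaults listed above. The wrapper name relate and any file paths passed to it are hypothetical, write_read_names is switched on purely as an example, and the return value of run is not documented here. The import sits inside the wrapper so that the options can be inspected without SEISMIC-RNA installed.

```python
from pathlib import Path

# Keyword-only options; values match the documented defaults unless noted.
relate_options = {
    "out_dir": Path("out"),
    "min_reads": 1000,
    "min_mapq": 25,
    "min_phred": 25,
    "batch_size": 65536,
    "write_read_names": True,  # default is False; needed for --mask-read
    "max_procs": 4,
}


def relate(fasta: Path, xam_files: list[Path]):
    """Hypothetical wrapper: run the Relate step with the options above."""
    # Imported here because it requires SEISMIC-RNA to be installed.
    from seismicrna.relate.main import run
    return run(fasta, xam_files, **relate_options)
```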

class seismicrna.relate.report.PoolReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, RefIO

classmethod auto_fields()

Names and automatic values of selected fields.

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

classmethod path_segs()

class seismicrna.relate.report.RelateReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedRefseqReport, RelateIO

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

refseq_file(top: Path)

seismicrna.relate.report.refseq_file_auto_fields()

seismicrna.relate.report.refseq_file_path(top: Path, sample: str, ref: str)

seismicrna.relate.report.refseq_file_seg_types()

class seismicrna.relate.sam.XamViewer(xam_input: Path, tmp_dir: Path, batch_size: int, n_procs: int = 1)

Bases: object

create_tmp_sam()

Create the temporary SAM file.

delete_tmp_sam()

Delete the temporary SAM file.

property flagstats

property indexes

iter_records(batch: int)

Iterate through the records of the batch.

property n_reads

Total number of reads.

property paired

Whether the reads are paired.

property ref

property sample

property tmp_sam_path

Get the path to the temporary SAM file.

seismicrna.relate.sam.line_attrs(line: str) -> tuple[str, bool, bool]

Read attributes from a line in a SAM file.
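The meaning of the three returned values is not stated here. One plausible reading, shown in the sketch below, is the read name followed by two flags taken from the SAM FLAG field: whether the read is paired (bit 0x1) and whether it is properly paired (bit 0x2). That interpretation and the name sketch_line_attrs are assumptions; only the column layout and the flag bits come from the SAM format.

```python
FLAG_PAIRED = 0x1  # template has multiple segments
FLAG_PROPER = 0x2  # each segment is properly aligned


def sketch_line_attrs(line: str) -> tuple[str, bool, bool]:
    """Return the read name and two flag bits from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]  # column 1: QNAME
    flag = int(fields[1])  # column 2: FLAG
    return name, bool(flag & FLAG_PAIRED), bool(flag & FLAG_PROPER)


# Hypothetical SAM record with FLAG 99 (paired and properly paired).
line = "read1\t99\tref\t1\t60\t4M\t=\t10\t13\tACGT\tIIII\n"
attrs = sketch_line_attrs(line)
```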

seismicrna.relate.sam.tmp_xam_cmd(xam_in: Path, xam_out: Path, paired: bool, n_procs: int = 1)

Collate and create a temporary XAM file.

seismicrna.relate.sim.simulate_batch(sample: str, ref: str, batch: int, write_read_names: bool, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, paired: bool, read_length: int, p_rev: float, min_mut_gap: int, num_reads: int, formatter: Callable[[int, int], str] = format_read_name)

Simulate a pair of RelateBatchIO and ReadNamesBatchIO.

seismicrna.relate.sim.simulate_batches(batch_size: int, pmut: DataFrame, pclust: Series, num_reads: int, **kwargs)

seismicrna.relate.sim.simulate_cluster(first_batch: int, batch_size: int, num_reads: int, **kwargs)

Simulate all batches for one cluster.
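Given first_batch, batch_size, and num_reads, one natural way to divide a cluster's reads into batches is to fill consecutive batches up to batch_size and put the remainder in the last one. The sketch below shows that scheme; whether simulate_cluster divides reads exactly this way is an assumption, and plan_batches is a hypothetical name.

```python
def plan_batches(first_batch: int, batch_size: int, num_reads: int):
    """List (batch number, number of reads) for consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan = []
    batch = first_batch
    remaining = num_reads
    while remaining > 0:
        # The last batch holds whatever is left over.
        size = min(batch_size, remaining)
        plan.append((batch, size))
        batch += 1
        remaining -= size
    return plan


plan = plan_batches(first_batch=3, batch_size=4, num_reads=10)
```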

seismicrna.relate.sim.simulate_relate(*, out_dir: Path, tmp_dir: Path, sample: str, ref: str, refseq: DNA, batch_size: int, num_reads: int, write_read_names: bool, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, pclust: Series, brotli_level: int, force: bool, **kwargs)

Simulate an entire relate step.

seismicrna.relate.strands.generate_both_strands(ref: str, seq: DNA, rev_label: str)

Yield both the forward and reverse strand for each sequence.
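A generator of this shape can be sketched with plain strings. The sketch assumes that the reverse strand is the reverse complement of the sequence and that its name is the reference name followed by rev_label (consistent with the description of rev_label for run); a str stands in for the DNA type, and sketch_both_strands is a hypothetical name.

```python
from typing import Iterator

# Complement of each DNA base; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def sketch_both_strands(ref: str, seq: str, rev_label: str) -> Iterator[tuple[str, str]]:
    """Yield the forward strand, then the labeled reverse complement."""
    yield ref, seq
    yield f"{ref}{rev_label}", seq.translate(COMPLEMENT)[::-1]


strands = list(sketch_both_strands("ref1", "AACG", "-rev"))
```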

seismicrna.relate.strands.write_both_strands(fasta_in: Path, fasta_out: Path, rev_label: str)

Write a FASTA file of both forward and reverse strands.

class seismicrna.relate.table.AverageTable

Bases: RelTypeTable, ABC

Average over an ensemble of RNA structures.

classmethod header_type()

Type of the header for the table.

class seismicrna.relate.table.AverageTabulator(*, top: Path, sample: str, region: Region, count_ends: bool, count_pos: bool, count_read: bool, validate: bool = True)

Bases: Tabulator, ABC

property data_per_clust

Series of per-cluster data (or None if no clusters).

class seismicrna.relate.table.FullPositionTable

Bases: FullTable, PositionTable, ABC

classmethod path_segs()

Table’s path segments.

class seismicrna.relate.table.FullReadTable

Bases: FullTable, ReadTable, ABC

classmethod path_segs()

Table’s path segments.

class seismicrna.relate.table.FullTable

Bases: Table, ABC

property path_fields

Table’s path fields.

class seismicrna.relate.table.FullTabulator(*, ref: str, refseq: DNA, count_ends: bool = False, **kwargs)

Bases: Tabulator, ABC

classmethod get_null_value()

The null value for a count: either 0 or NaN.

class seismicrna.relate.table.PositionTableLoader(table_file: Path)

Bases: RelTypeTableLoader, PositionTable, ABC

Load data indexed by position.

class seismicrna.relate.table.ReadTableLoader(table_file: Path)

Bases: RelTypeTableLoader, ReadTable, ABC

Load data indexed by read.

class seismicrna.relate.table.RelTypeTableLoader(table_file: Path)

Bases: TableLoader, RelTypeTable, ABC

Load a table of relationship types.

property data: DataFrame

Table’s data.

class seismicrna.relate.table.RelateBatchTabulator(*, get_batch_count_all: Callable, num_batches: int, max_procs: int = 1, **kwargs)

Bases: BatchTabulator, RelateTabulator

class seismicrna.relate.table.RelateCountTabulator(*, batch_counts: Iterable[tuple[Any, Any, Any, Any]], **kwargs)

Bases: CountTabulator, RelateTabulator

class seismicrna.relate.table.RelateDatasetTabulator(*, dataset: MutsDataset, validate: bool = False, **kwargs)

Bases: DatasetTabulator, RelateTabulator

classmethod init_kws()

Attributes of the dataset to use as keyword arguments in super().__init__().

classmethod load_function()

LoadFunction for all Dataset types for this Tabulator.

class seismicrna.relate.table.RelatePositionTable

Bases: RelateTable, FullPositionTable, ABC

class seismicrna.relate.table.RelatePositionTableLoader(table_file: Path)

Bases: PositionTableLoader, RelatePositionTable

Load relate data indexed by position.

class seismicrna.relate.table.RelatePositionTableWriter(tabulator: Tabulator)

Bases: PositionTableWriter, RelatePositionTable

class seismicrna.relate.table.RelateReadTable

Bases: RelateTable, FullReadTable, ABC

class seismicrna.relate.table.RelateReadTableLoader(table_file: Path)

Bases: ReadTableLoader, RelateReadTable

Load relate data indexed by read.

class seismicrna.relate.table.RelateReadTableWriter(tabulator: Tabulator)

Bases: ReadTableWriter, RelateReadTable

class seismicrna.relate.table.RelateTable

Bases: AverageTable, ABC

classmethod kind()

Kind of table.

class seismicrna.relate.table.RelateTabulator(*, ref: str, refseq: DNA, count_ends: bool = False, **kwargs)

Bases: FullTabulator, AverageTabulator, ABC

classmethod table_types()

Types of tables that this tabulator can write.

class seismicrna.relate.table.TableLoader(table_file: Path)

Bases: Table, ABC

Load a table from a file.

classmethod find_tables(paths: Iterable[str | Path])

Yield files of the tables within the given paths.

classmethod load_tables(paths: Iterable[str | Path])

Yield tables within the given paths.

property ref: str

Name of the table’s reference.

property refseq

Reference sequence.

property reg: str

Name of the table’s region.

property sample: str

Name of the table’s sample.

property top: Path

Path of the table’s output directory.

class seismicrna.relate.write.RelationWriter(xam_view: XamViewer, refseq: DNA)

Bases: object

Compute and write relationships for all reads from one sample aligned to one reference sequence.

property num_reads

property ref

property sample

write(*, out_dir: Path, release_dir: Path, min_mapq: int, min_reads: int, min_phred: int, phred_enc: int, insert3: bool, ambindel: bool, overhangs: bool, clip_end5: int, clip_end3: int, relate_pos_table: bool, relate_read_table: bool, brotli_level: int, force: bool, n_procs: int, **kwargs)

Compute relationships for every record in a XAM file.

seismicrna.relate.write.generate_batch(batch: int, *, xam_view: XamViewer, top: Path, refseq: DNA, brotli_level: int, count_pos: bool, count_read: bool, write_read_names: bool, **kwargs)

Compute relationships for every SAM record in one batch.

seismicrna.relate.write.relate_records(records: Iterable[tuple[str, str, str]], ref: str, refseq: str, min_mapq: int, min_qual: int, insert3: bool, ambindel: bool, overhangs: bool, clip_end5: int, clip_end3: int, relate_cx: bool)

seismicrna.relate.write.relate_xam(xam_file: Path, *, fasta: Path, tmp_dir: Path, batch_size: int, n_procs: int, **kwargs)

Write the batches of relationships for one XAM file.