seismicrna.core.seq package

Submodules

exception seismicrna.core.seq.fasta.BadReferenceNameError

Bases: ReferenceNameError

A reference name is not valid.

exception seismicrna.core.seq.fasta.BadReferenceNameLineError

Bases: ReferenceNameError

A line that should contain a reference name is not valid.

exception seismicrna.core.seq.fasta.DuplicateReferenceNameError

Bases: ReferenceNameError

A reference name occurs more than once.

exception seismicrna.core.seq.fasta.MissingReferenceNameError

Bases: ReferenceNameError

A reference name was expected to appear but is absent.

exception seismicrna.core.seq.fasta.ReferenceNameError

Bases: ValueError

Error in the name of a reference sequence.

seismicrna.core.seq.fasta.extract_fasta_seqname(line: str)

Extract the name of a sequence from a line in FASTA format.

seismicrna.core.seq.fasta.format_fasta_name_line(name: str)
seismicrna.core.seq.fasta.format_fasta_record(name: str, seq: XNA, wrap: int = 0)
seismicrna.core.seq.fasta.format_fasta_seq_lines(seq: XNA, wrap: int = 0)

Format a sequence in a FASTA file so that each line has at most wrap characters, or no limit if wrap is ≤ 0.
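
The wrapping rule can be illustrated with a minimal sketch in plain Python. The helper name `wrap_seq_lines` is hypothetical and it operates on a plain `str` rather than an XNA object; it is an illustration of the rule, not the library's implementation.

```python
def wrap_seq_lines(seq: str, wrap: int = 0) -> list[str]:
    """Split seq into lines of at most `wrap` characters.

    Hypothetical helper for illustration: if wrap <= 0, the whole
    sequence is returned as a single line.
    """
    if wrap <= 0 or not seq:
        return [seq]
    return [seq[i:i + wrap] for i in range(0, len(seq), wrap)]


print(wrap_seq_lines("ACGTACGTAC", wrap=4))  # ['ACGT', 'ACGT', 'AC']
print(wrap_seq_lines("ACGTACGTAC"))          # ['ACGTACGTAC']
```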

seismicrna.core.seq.fasta.get_fasta_seq(fasta: Path, seq_type: type[XNA], name: str)

Get one sequence of a given name from a FASTA file.

seismicrna.core.seq.fasta.parse_fasta(fasta: Path, seq_type: type[XNA] | None, only: Iterable[str] | None = None)
seismicrna.core.seq.fasta.valid_fasta_seqname(line: str) → str

Get a valid sequence name from a line in FASTA format.

seismicrna.core.seq.fasta.write_fasta(fasta: Path, refs: Iterable[tuple[str, XNA]], wrap: int = 0, force: bool = False)

Write an iterable of reference names and DNA sequences to a FASTA file.

class seismicrna.core.seq.refs.RefSeqs(seqs: Iterable[tuple[str, XNA]] = ())

Bases: object

Store reference sequences.

add(name: str, seq: XNA)

Add a sequence to the collection via its name.

get(name: str)

Get a sequence from the collection via its name.

iter()

Yield every sequence and its name.

class seismicrna.core.seq.region.RefRegions(ref_seqs: Iterable[tuple[str, DNA]], *, regs_file: Path | None = None, coords: Iterable[tuple[str, int, int]] = (), primers: Iterable[tuple[str, DNA, DNA]] = (), primer_gap: int = 0, exclude_primers: bool = False, default_full: bool = True)

Bases: object

A collection of regions, grouped by reference.

property count

Total number of regions.

property dict

List the regions for every reference.

list(ref: str)

List the regions for a given reference.

property refs

Reference names.

property regions

List all regions.

class seismicrna.core.seq.region.Region(ref: str, seq: DNA, *, seq5: int = 1, reflen: int | None = None, end5: int | None = None, end3: int | None = None, name: str | None = None)

Bases: object

Region of a reference sequence between two coordinates.

MASK_GU = 'pos-gu'
MASK_LIST = 'pos-list'
MASK_POLYA = 'pos-polya'
add_mask(name: str, positions: Iterable[int], complement: bool = False)

Mask the integer positions in the array positions.

Parameters:
  • name (str) – Name of the mask.

  • positions (Iterable[int]) – Positions to mask (1-indexed).

  • complement (bool = False) – If True, then mask every position except those in positions, leaving only the given positions unmasked.

property coord

Tuple of the 5’ and 3’ coordinates.

copy(masks: bool = True)

Return an identical region.

get_mask(name: str)

Get the positions masked under the given name.

property hyphen
property length

Length of the entire region.

mask_gu()

Mask positions whose base is neither A nor C.

mask_list(pos: Iterable[int])

Mask a list of positions.

property mask_names

Names of the masks.

mask_polya(min_length: int)

Mask poly(A) stretches with length ≥ min_length.
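
As a rough illustration of which positions such a mask would cover, the sketch below finds every run of A of at least `min_length` bases in a plain string. The function name `polya_positions`, the `seq5` argument, and the choice to mask nothing when `min_length` is below 1 are assumptions made for this example, not behavior confirmed by the library.

```python
import re


def polya_positions(seq: str, min_length: int, seq5: int = 1) -> list[int]:
    """Return the positions that lie in runs of A of length >= min_length.

    Positions are numbered so that the first base of seq is seq5.
    Assumption for this sketch: min_length < 1 masks nothing.
    """
    if min_length < 1:
        return []
    positions = []
    # "A{n,}" matches a maximal run of n or more consecutive A bases.
    for match in re.finditer("A{%d,}" % min_length, seq):
        positions.extend(range(match.start() + seq5, match.end() + seq5))
    return positions


print(polya_positions("GAAAACTAAG", min_length=3))  # [2, 3, 4, 5]
print(polya_positions("GAAAACTAAG", min_length=2))  # [2, 3, 4, 5, 8, 9]
```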

property masked_bool: ndarray

Masked positions as a boolean array.

property masked_int: ndarray

Masked positions as integers.

property masked_zero: ndarray

Masked positions as integers (0-indexed with respect to the first position in the region).

property range

Index of all positions in the region.

property range_int

All positions in the region as integers.

property range_one

All 1-indexed positions in the region as integers.

property ref_reg
remove_mask(name: str, missing_ok: bool = False)

Remove the specified mask from the region.

renumber_from(seq5: int, name: str | None = None)

Return a new region renumbered starting from a position.

Parameters:
  • seq5 (int) – Position from which to start the new numbering system.

  • name (str | None = None) – Name of the renumbered region.

Returns:

Region with renumbered positions.

Return type:

Region

property size

Number of relevant positions in the region.

subregion(end5: int | None = None, end3: int | None = None, name: str | None = None)

Return a new region from part of this region.

to_dict()
property unmasked

Index of unmasked positions in the region.

property unmasked_bool: ndarray

Unmasked positions as a boolean array.

property unmasked_int: ndarray

Unmasked positions as integers (1-indexed).

property unmasked_zero: ndarray

Unmasked positions as integers (0-indexed with respect to the first position in the region).
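
The difference between the positional and the 0-indexed forms can be shown with a short sketch using plain lists instead of NumPy arrays. The helper names and the example coordinates are hypothetical.

```python
def unmasked_positions(end5: int, end3: int, masked: set[int]) -> list[int]:
    """Positions from end5 to end3 (inclusive) that are not masked."""
    return [pos for pos in range(end5, end3 + 1) if pos not in masked]


def to_zero_indexed(positions: list[int], end5: int) -> list[int]:
    """Express positions relative to the first position in the region."""
    return [pos - end5 for pos in positions]


# Hypothetical region spanning positions 11-16 with 12 and 15 masked.
unmasked = unmasked_positions(11, 16, masked={12, 15})
print(unmasked)                       # [11, 13, 14, 16]
print(to_zero_indexed(unmasked, 11))  # [0, 2, 3, 5]
```

The 0-indexed form is the one suited to indexing directly into an array that holds only the bases of the region.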

class seismicrna.core.seq.region.RegionFinder(ref: str, seq: DNA, *, seq5: int = 1, end5: int | None = None, end3: int | None = None, fwd: DNA | None = None, rev: DNA | None = None, primer_gap: int = 0, exclude_primers: bool = False, **kwargs)

Bases: Region

The 5’ and 3’ ends of a region can be given explicitly as integers. However, if the sample is an amplicon (i.e. generated by RT-PCR using site-specific primers), then it is often more convenient to enter the sequences of the PCR primers and have the software determine the coordinates. RegionFinder accepts 5’ and 3’ coordinates given as integers or primers, validates them, and stores the coordinates as integers, as follows:

end5 = end5 if end5 is given, else the 3’ end of the forward primer + (primer_gap + 1) if fwd is given, else 1

end3 = end3 if end3 is given, else the 5’ end of the reverse primer - (primer_gap + 1) if rev is given, else the length of refseq
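
The rule for resolving the two ends can be sketched as follows. This is a simplified illustration, not the class's implementation: the function `resolve_ends` and its `fwd_span` and `rev_span` arguments are hypothetical, the reference is assumed to be numbered from 1, and the sketch assumes that (primer_gap + 1) is added to the 3’ end of the forward primer and subtracted from the 5’ end of the reverse primer. The exclude_primers option is not modeled.

```python
def resolve_ends(reflen, end5=None, end3=None,
                 fwd_span=None, rev_span=None, primer_gap=0):
    """Resolve the 5' and 3' ends of a region (1-indexed, inclusive).

    fwd_span and rev_span are hypothetical (pos5, pos3) pairs giving
    where the forward primer and the reverse complement of the reverse
    primer lie in the reference.
    """
    if end5 is None:
        # Start just past the forward primer and the gap, else at 1.
        end5 = (fwd_span[1] + primer_gap + 1) if fwd_span is not None else 1
    if end3 is None:
        # Stop just before the gap and the reverse primer, else at the end.
        end3 = (rev_span[0] - (primer_gap + 1)) if rev_span is not None else reflen
    # Assumption for this sketch: empty regions are rejected.
    if not 1 <= end5 <= end3 <= reflen:
        raise ValueError(f"Invalid ends {end5}-{end3} for length {reflen}")
    return end5, end3


print(resolve_ends(100, fwd_span=(1, 20), rev_span=(81, 100), primer_gap=2))
# (23, 78)
print(resolve_ends(100, end5=30, rev_span=(81, 100)))  # (30, 80)
print(resolve_ends(100))                               # (1, 100)
```

Explicit coordinates take precedence over primers, so a caller can fix one end by number and let a primer define the other.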

static locate(seq: DNA, primer: DNA, seq5: int) → RegionTuple

Return the 5’ and 3’ positions (1-indexed) of a primer within a reference sequence. The primer must occur exactly once in the reference, otherwise an error is raised.

Parameters:
  • seq (DNA) – The full reference sequence or a part of it.

  • primer (DNA) – Sequence of the forward PCR primer or the reverse complement of the reverse PCR primer.

  • seq5 (int = 1) – Positional number to assign the 5’ end of the given part of the reference sequence. Must be ≥ 1.

Returns:

Named tuple of the first and last positions that the primer occupies in the reference sequence. Positions are 1-indexed and include the first and last coordinates.

Return type:

RegionTuple
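
A minimal sketch of this kind of lookup on plain strings is shown below. `PrimerSpan` and `locate_primer` are hypothetical stand-ins for RegionTuple and locate; the sketch only illustrates the 1-indexed, inclusive coordinates and the exactly-once requirement.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's RegionTuple.
PrimerSpan = namedtuple("PrimerSpan", ["pos5", "pos3"])


def locate_primer(seq: str, primer: str, seq5: int = 1) -> PrimerSpan:
    """Return the 1-indexed, inclusive span of primer within seq."""
    if seq5 < 1:
        raise ValueError(f"seq5 must be >= 1, but got {seq5}")
    if not primer:
        raise ValueError("primer is empty")
    # Collect the start of every match, including overlapping ones.
    starts = [i for i in range(len(seq) - len(primer) + 1)
              if seq.startswith(primer, i)]
    if len(starts) != 1:
        raise ValueError(f"primer occurs {len(starts)} times, expected 1")
    pos5 = starts[0] + seq5
    pos3 = pos5 + len(primer) - 1
    return PrimerSpan(pos5, pos3)


print(locate_primer("GGATCCTTTACGT", "TTTA"))
# PrimerSpan(pos5=7, pos3=10)
print(locate_primer("GGATCCTTTACGT", "TTTA", seq5=11))
# PrimerSpan(pos5=17, pos3=20)
```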

class seismicrna.core.seq.region.RegionTuple(pos5, pos3)

Bases: tuple

pos3

Alias for field number 1

pos5

Alias for field number 0

seismicrna.core.seq.region.get_coords_by_ref(coords: Iterable[tuple[str, int | DNA, int | DNA]])
seismicrna.core.seq.region.get_reg_coords_primers(regs_file: Path)

Parse a file defining each region by the name of its reference and either its 5’ and 3’ coordinates or its forward and reverse primer sequences. Return one map from each reference and 5’/3’ coordinate pair to the name of the corresponding region, and another from each reference and primer pair to the name of the corresponding region.

Parameters:

regs_file (Path) – CSV file of a table that defines the regions. The table must have columns labeled “Reference”, “Region”, “5’ End”, “3’ End”, “Forward Primer”, and “Reverse Primer”. Others are ignored.

Returns:

Two mappings, the first from (ref name, 5’ coord, 3’ coord) to each region, the second from (ref name, fwd primer, rev primer) to each region. If the region is named in the “Region” column of the table, then that name will be used as the region name. Otherwise, the region name will be an empty string.

Return type:

tuple[dict[tuple[str, int, int], str], dict[tuple[str, DNA, DNA], str]]

seismicrna.core.seq.region.get_shared_index(indexes: Iterable[MultiIndex], empty_ok: bool = False)

Get the shared index among all those given, as follows:

  • If indexes contains no elements and empty_ok is True, then return an empty MultiIndex with levels named ‘Positions’ and ‘Base’.

  • If indexes contains one element or multiple identical elements, and each has two levels named ‘Positions’ and ‘Base’, then return the first element.

  • Otherwise, raise an error.

Parameters:
  • indexes (Iterable[pandas.MultiIndex]) – Indexes to compare.

  • empty_ok (bool = False) – If given no indexes, then default to an empty index (if True) or raise a ValueError (if False).

Returns:

The shared index.

Return type:

pandas.MultiIndex

seismicrna.core.seq.region.hyphenate_ends(end5: int, end3: int)

Return the 5’ and 3’ ends as a hyphenated string.

Parameters:
  • end5 (int) – 5’ end (1-indexed)

  • end3 (int) – 3’ end (1-indexed)

Returns:

Hyphenated 5’ and 3’ ends

Return type:

str

seismicrna.core.seq.region.index_to_pos(index: MultiIndex)

Get the positions from a MultiIndex of (pos, base) pairs.

seismicrna.core.seq.region.index_to_seq(index: MultiIndex, allow_gaps: bool = False)

Get the DNA sequence from a MultiIndex of (pos, base) pairs.

seismicrna.core.seq.region.intersect(*regions: Region, name: str | None = None)

Intersect one or more regions.

Parameters:
  • *regions (Region) – Regions to intersect.

  • name (str | None = None) – Name for the region to return.

Returns:

Intersection of all given regions.

Return type:

Region
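
In terms of coordinates, intersecting regions amounts to taking the largest 5’ end and the smallest 3’ end. The sketch below works on plain (end5, end3) pairs; the name `intersect_coords` is hypothetical, and returning None for an empty intersection is a choice made for this example, not a statement about how the library represents that case.

```python
def intersect_coords(*coords: tuple[int, int]):
    """Intersect 1-indexed, inclusive (end5, end3) coordinate pairs."""
    if not coords:
        raise ValueError("Cannot intersect zero regions")
    end5 = max(c[0] for c in coords)
    end3 = min(c[1] for c in coords)
    # Assumption for this sketch: signal an empty intersection with None.
    if end5 > end3:
        return None
    return end5, end3


print(intersect_coords((1, 50), (30, 80), (20, 60)))  # (30, 50)
print(intersect_coords((1, 10), (20, 30)))            # None
```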

seismicrna.core.seq.region.iter_windows(*series: Series, size: int, min_count: int = 1, include_nan: bool = False)
seismicrna.core.seq.region.seq_pos_to_index(seq: DNA, positions: Sequence[int], start: int)

Convert a sequence and positions to indexes, where each index is a tuple of (position, base).

Parameters:
  • seq (DNA) – DNA sequence.

  • positions (Sequence[int]) – Positions of the sequence from which to build the index. Every position must be an integer ≥ start.

  • start (int) – Numerical position to assign to the first base in the sequence. Must be a positive integer.

Returns:

MultiIndex of the same length as positions where each index is a tuple of (position, base).

Return type:

pd.MultiIndex
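
The mapping from positions to (position, base) pairs can be sketched without pandas. The function name `seq_pos_to_pairs` is hypothetical, and it returns a list of tuples rather than a MultiIndex; such a list could be passed to `pandas.MultiIndex.from_tuples` to build an index.

```python
def seq_pos_to_pairs(seq: str, positions, start: int) -> list[tuple[int, str]]:
    """Pair each position with its base, numbering seq from start."""
    if start < 1:
        raise ValueError(f"start must be >= 1, but got {start}")
    pairs = []
    for pos in positions:
        offset = pos - start
        if not 0 <= offset < len(seq):
            raise ValueError(f"Position {pos} is outside the sequence")
        pairs.append((pos, seq[offset]))
    return pairs


# The sequence ACGT numbered from 10 occupies positions 10-13.
print(seq_pos_to_pairs("ACGT", [11, 13], start=10))  # [(11, 'C'), (13, 'T')]
```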

seismicrna.core.seq.region.unite(*regions: Region, name: str | None = None, refseq: DNA | None = None)

Unite one or more regions.

Parameters:
  • *regions (Region) – Regions to unite.

  • name (str | None = None) – Name for the region to return.

  • refseq (DNA | None = None) – Reference sequence (optional) for filling any gaps in the union of the regions. If given, then it must match every region at the corresponding positions. If omitted, then any positions not covered by at least one region will be filled with N.

Returns:

Union of all given regions.

Return type:

Region

seismicrna.core.seq.region.verify_index_names(index: MultiIndex)

Verify that the names of the index are correct.

seismicrna.core.seq.region.window_to_margins(window: int)

Compute the 5’ and 3’ margins from the size of the window.

Sequence Core Module.

Define alphabets and classes for nucleic acid sequences, and functions for reading them from and writing them to FASTA files.

class seismicrna.core.seq.xna.CompressedSeq(seq: XNA)

Bases: object

Compress a sequence into two bits per base.

decompress()

Restore the original sequence.
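
The idea of two bits per base can be sketched by packing a DNA string into a single integer. This is an illustration of the general technique only: the functions `pack_seq` and `unpack_seq` and the base-to-code assignment are hypothetical, and the sketch handles only A, C, G, and T, so it says nothing about how CompressedSeq stores other characters such as N.

```python
BASES = "ACGT"  # Assumed 2-bit codes: A=0, C=1, G=2, T=3.


def pack_seq(seq: str) -> tuple[int, int]:
    """Pack seq into an integer using two bits per base.

    Return the packed integer and the sequence length, which is needed
    to restore leading or trailing A bases (code 0).
    """
    packed = 0
    for i, base in enumerate(seq):
        # str.index raises ValueError for any base not in BASES.
        packed |= BASES.index(base) << (2 * i)
    return packed, len(seq)


def unpack_seq(packed: int, length: int) -> str:
    """Restore the sequence from its packed form."""
    return "".join(BASES[(packed >> (2 * i)) & 3] for i in range(length))


packed, length = pack_seq("GATTACA")
print(unpack_seq(packed, length))  # GATTACA
```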

property type
class seismicrna.core.seq.xna.DNA(seq: Any)

Bases: XNA

classmethod alph()

Sequence alphabet.

classmethod pict()

Sequence pictograms.

tr()

Transcribe DNA to RNA.

class seismicrna.core.seq.xna.RNA(seq: Any)

Bases: XNA

classmethod alph()

Sequence alphabet.

classmethod pict()

Sequence pictograms.

rt()

Reverse transcribe RNA to DNA.

class seismicrna.core.seq.xna.XNA(seq: Any)

Bases: ABC

__add__(other)

Allow addition (concatenation) of two sequences only if the sequences have the same class.

__bool__()

Empty sequences return False; all else, True.

__contains__(item)

Check if a sequence is contained in this sequence.

__eq__(other)

Return True if both the type of the sequence and the bases in the sequence match, otherwise False.

__getitem__(item)

If item is a slice, then return an instance of the class. Otherwise, return an instance of str.

__hash__()

Define __hash__ so that XNA subclasses can be used as keys for dict-like mappings. Use the hash of the plain string.

__mul__(other)

Multiply a sequence by an int like a str times an int.

__repr__()

Encapsulate the sequence string with the class name.

abstract classmethod alph() → tuple[str, str, str, str, str]

Sequence alphabet.

property array

NumPy array of Unicode characters for the sequence.

compress()

Compress the sequence.

classmethod four()

Get the four standard bases.

classmethod get_alphaset()

Get the alphabet as a set.

classmethod get_comp()

Get the complementary alphabet as a tuple.

classmethod get_comptrans()

Get the translation table for complementary bases.

classmethod get_nonalphaset()

Get the printable characters not in the alphabet.

classmethod get_other_iupac()

Get the IUPAC extended characters not in the alphabet.

classmethod get_pictrans()

Get the translation table for pictogram characters.

kmers(k: int)

Every subsequence of length k (k-mer).
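
For a plain string, enumerating k-mers is a sliding window of width k. The helper `list_kmers` is hypothetical, and rejecting k < 1 is an assumption made for this sketch.

```python
def list_kmers(seq: str, k: int) -> list[str]:
    """Return every length-k subsequence of seq, from 5' to 3'."""
    if k < 1:
        raise ValueError(f"k must be >= 1, but got {k}")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(list_kmers("ACGTA", 3))  # ['ACG', 'CGT', 'GTA']
print(list_kmers("ACGTA", 6))  # []
```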

abstract classmethod pict() → tuple[str, str, str, str, str]

Sequence pictograms.

property picto

Pictogram string.

classmethod random(nt: int, a: float = 0.25, c: float = 0.25, g: float = 0.25, t: float = 0.25)

Return a random sequence of the given length.

Parameters:
  • nt (int) – Number of nucleotides to simulate. Must be ≥ 0.

  • a (float = 0.25) – Expected proportion of A.

  • c (float = 0.25) – Expected proportion of C.

  • g (float = 0.25) – Expected proportion of G.

  • t (float = 0.25) – Expected proportion of T (if DNA) or U (if RNA).

Returns:

A random sequence.

Return type:

DNA | RNA

property rc

Reverse complement.
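
For a plain DNA string, the reverse complement can be computed with a translation table followed by a reversal. This sketch assumes the alphabet A, C, G, T, and N; the names `DNA_COMP` and `reverse_complement` are hypothetical.

```python
# Map each base to its complement; N stays N.
DNA_COMP = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.translate(DNA_COMP)[::-1]


print(reverse_complement("AACGT"))  # ACGTT
print(reverse_complement("ANG"))    # CNT
```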

classmethod t_or_u()

Get the base that is complementary to A.

seismicrna.core.seq.xna.decompress(seq: CompressedSeq)

Restore the original sequence from a CompressedSeq object.

seismicrna.core.seq.xna.expand_degenerate_seq(seq: DNA)

Given a (possibly degenerate) sequence, yield every definite sequence that could derive from it. Only the degenerate base N is supported by this function; other IUPAC codes (e.g. R) are not.
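
The expansion can be sketched with `itertools.product`, substituting every combination of the four bases at the N positions. The generator `expand_n` is a hypothetical illustration on plain strings, and the order in which it yields sequences is an assumption of this sketch.

```python
from itertools import product


def expand_n(seq: str, bases: str = "ACGT"):
    """Yield every definite sequence obtained by replacing each N."""
    n_positions = [i for i, base in enumerate(seq) if base == "N"]
    # With no N, product yields one empty tuple, so seq is yielded as-is.
    for combo in product(bases, repeat=len(n_positions)):
        chars = list(seq)
        for i, base in zip(n_positions, combo):
            chars[i] = base
        yield "".join(chars)


print(list(expand_n("ANG")))      # ['AAG', 'ACG', 'AGG', 'ATG']
print(list(expand_n("ACGT")))     # ['ACGT']
print(len(list(expand_n("NN"))))  # 16
```

The number of sequences grows as 4 to the power of the number of N bases, so a generator avoids holding all of them in memory at once.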