seismicrna.core.seq package

Subpackages

seismicrna.core.seq.tests package
- Submodules

Submodules

seismicrna.core.seq.fasta.extract_fasta_seqname(line: str): Extract the name of a sequence from a line in FASTA format.

seismicrna.core.seq.fasta.format_fasta_name_line(name: str)

seismicrna.core.seq.fasta.format_fasta_record(name: str, seq: XNA, wrap: int = 0)

seismicrna.core.seq.fasta.format_fasta_seq_lines(seq: XNA, wrap: int = 0): Format a sequence in a FASTA file so that each line has at most wrap characters, or no limit if wrap is 0.
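
The wrap rule can be illustrated with a minimal sketch. The helper name wrap_seq_lines is hypothetical and it operates on a plain str rather than an XNA object; it shows the behavior described above and is not the library's implementation.

```python
def wrap_seq_lines(seq: str, wrap: int = 0) -> list:
    """Split seq into lines of at most `wrap` characters.

    Hypothetical helper for illustration; a wrap of 0 means no limit.
    """
    if wrap < 0:
        raise ValueError(f"wrap must be >= 0, but got {wrap}")
    if wrap == 0 or len(seq) <= wrap:
        # No limit, or already short enough: keep everything on one line.
        return [seq]
    # Step through the sequence in chunks of `wrap` characters.
    return [seq[i: i + wrap] for i in range(0, len(seq), wrap)]
```

For example, wrapping a 10-base sequence at 4 characters gives two full lines and one partial line.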

seismicrna.core.seq.fasta.get_fasta_seq(fasta: Path, seq_type: type[XNA], name: str): Get one sequence of a given name from a FASTA file.

seismicrna.core.seq.fasta.parse_fasta(fasta: Path, seq_type: type[XNA] | None, only: Iterable[str] | None = None)

seismicrna.core.seq.fasta.valid_fasta_seqname(line: str) → str: Get a valid sequence name from a line in FASTA format.

seismicrna.core.seq.fasta.write_fasta(fasta: Path, refs: Iterable[tuple[str, XNA]], wrap: int = 0, force: bool = False): Write an iterable of reference names and DNA sequences to a FASTA file.

class seismicrna.core.seq.refs.RefSeqs(seqs: Iterable[tuple[str, XNA]] = ())

Bases: object

Store reference sequences.

add(name: str, seq: XNA): Add a sequence to the collection via its name.

get(name: str): Get a sequence from the collection via its name.

iter(): Yield every sequence and its name.

class seismicrna.core.seq.section.RefSections(ref_seqs: Iterable[tuple[str, DNA]], *, sects_file: Path | None = None, coords: Iterable[tuple[str, int, int]] = (), primers: Iterable[tuple[str, DNA, DNA]] = (), primer_gap: int = 0, exclude_primers: bool = False, default_full: bool = True)

Bases: object

A collection of sections, grouped by reference.

property count: Total number of sections.

property dict: List the sections for every reference.

list(ref: str): List the sections for a given reference.

property refs: Reference names.

property sections: List all sections.

class seismicrna.core.seq.section.Section(ref: str, seq: DNA, *, seq5: int = 1, reflen: int | None = None, end5: int | None = None, end3: int | None = None, name: str | None = None)

Bases: object

Section of a reference sequence between two coordinates.

MASK_GU = 'pos-gu'

MASK_LIST = 'pos-list'

MASK_POLYA = 'pos-polya'

add_mask(name: str, positions: Iterable[int], complement: bool = False)

Mask the integer positions in the array positions.

Parameters:

name (str) – Name of the mask.
positions (Iterable[int]) – Positions to mask (1-indexed).
complement (bool = False) – If True, then mask every position except those given in positions, so that only those positions remain unmasked.
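
The effect of complement can be sketched with plain Python sets. The helper positions_to_mask is hypothetical, and rejecting positions outside the section is an assumption of this sketch, not documented behavior.

```python
def positions_to_mask(section_positions: range,
                      positions: set,
                      complement: bool = False) -> list:
    """Return the sorted positions that would be masked.

    Hypothetical helper: with complement=True, every position in the
    section except those in `positions` is masked.
    """
    outside = set(positions) - set(section_positions)
    if outside:
        # Assumption of this sketch: positions must lie in the section.
        raise ValueError(f"Positions outside the section: {sorted(outside)}")
    if complement:
        return [pos for pos in section_positions if pos not in positions]
    return sorted(positions)
```

With a section spanning positions 11 to 15, masking {12, 15} with complement=True would mask 11, 13, and 14 instead.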

property coord: Tuple of the 5’ and 3’ coordinates.

copy(masks: bool = True): Return an identical section.

get_mask(name: str): Get the positions masked under the given name.

property hyphen

property length: Length of the entire section.

mask_gu(): Mask positions whose base is neither A nor C.

mask_list(pos: Iterable[int]): Mask a list of positions.

property mask_names: Names of the masks.

mask_polya(min_length: int): Mask poly(A) stretches with length ≥ min_length.
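
As a rough illustration of poly(A) masking, the sketch below finds every position that lies in a run of A with length ≥ min_length. The helper find_polya_positions and its end5 parameter are hypothetical, and requiring min_length ≥ 1 is an assumption of this sketch.

```python
import re


def find_polya_positions(seq: str, min_length: int, end5: int = 1) -> list:
    """Return positions in runs of A with length >= min_length.

    Hypothetical helper; `end5` is the number given to the first base.
    """
    if min_length < 1:
        raise ValueError(f"min_length must be >= 1, but got {min_length}")
    positions = []
    # The pattern A{n,} matches each maximal run of at least n As.
    for match in re.finditer(f"A{{{min_length},}}", seq):
        positions.extend(range(match.start() + end5, match.end() + end5))
    return positions
```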

property masked_bool: ndarray: Masked positions as a boolean array.

property masked_int: ndarray: Masked positions as integers.

property masked_zero: ndarray: Masked positions as integers (0-indexed with respect to the first position in the section).

property range: Index of all positions in the section.

property range_int: All positions in the section as integers.

property range_one: All 1-indexed positions in the section as integers.

property ref_sect

remove_mask(name: str, missing_ok: bool = False): Remove the specified mask from the section.

renumber_from(seq5: int, name: str | None = None)

Return a new Section renumbered starting from a position.

Parameters:

seq5 (int) – Position from which to start the new numbering system.
name (str | None = None) – Name of the renumbered section.

Returns:

Section with renumbered positions.

Return type:

Section

property size: Number of relevant positions in the section.

subsection(end5: int | None = None, end3: int | None = None, name: str | None = None): Return a new section from part of this section.

to_dict()

property unmasked: Index of unmasked positions in the section.

property unmasked_bool: ndarray: Unmasked positions as a boolean array.

property unmasked_int: ndarray: Unmasked positions as integers (1-indexed).

property unmasked_zero: ndarray: Unmasked positions as integers (0-indexed with respect to the first position in the section).

class seismicrna.core.seq.section.SectionFinder(ref: str, seq: DNA, *, seq5: int = 1, end5: int | None = None, end3: int | None = None, fwd: DNA | None = None, rev: DNA | None = None, primer_gap: int = 0, exclude_primers: bool = False, **kwargs)

Bases: Section

The 5’ and 3’ ends of a section can be given explicitly as integers, but if the sample is of an amplicon (i.e. generated by RT-PCR using site-specific primers), then it is often more convenient to enter the sequences of the PCR primers and have the software determine the coordinates. SectionFinder accepts 5’ and 3’ coordinates given as integers or primers, validates them, and stores the coordinates as integers, as follows:

end5 = end5 if end5 is given, else the 3’ end of the forward primer + (primer_gap + 1) if fwd is given, else 1

end3 = end3 if end3 is given, else the 5’ end of the reverse primer − (primer_gap + 1) if rev is given, else the length of refseq
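
A minimal sketch of this coordinate rule follows, assuming the section starts primer_gap + 1 positions after the 3’ end of the forward primer and ends primer_gap + 1 positions before the 5’ end of the reverse primer. The function resolve_ends and its parameters fwd_end3 and rev_end5 are hypothetical, and the sketch does not model exclude_primers.

```python
from __future__ import annotations


def resolve_ends(ref_length: int,
                 end5: int | None = None,
                 end3: int | None = None,
                 fwd_end3: int | None = None,
                 rev_end5: int | None = None,
                 primer_gap: int = 0) -> tuple[int, int]:
    """Resolve section ends from explicit values or primer positions.

    Hypothetical helper. fwd_end3 is the 3' end of the forward primer
    and rev_end5 is the 5' end of the reverse primer (both 1-indexed).
    """
    if end5 is None:
        # Explicit end5 wins; otherwise use the primer, otherwise 1.
        end5 = (fwd_end3 + (primer_gap + 1)) if fwd_end3 is not None else 1
    if end3 is None:
        # Explicit end3 wins; otherwise use the primer, otherwise the
        # last position of the reference.
        end3 = (rev_end5 - (primer_gap + 1)) if rev_end5 is not None else ref_length
    return end5, end3
```

Under these assumptions, a forward primer ending at position 20 and a reverse primer starting at position 81, with primer_gap = 2, would give a section from 23 to 78.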

static locate(seq: DNA, primer: DNA, seq5: int) → PrimerTuple

Return the 5’ and 3’ positions (1-indexed) of a primer within a reference sequence. The primer must occur exactly once in the reference, otherwise an error is raised.

Parameters:

seq (DNA) – The full reference sequence or a part of it.
primer (DNA) – Sequence of the forward PCR primer or the reverse complement of the reverse PCR primer.
seq5 (int = 1) – Positional number to assign the 5’ end of the given part of the reference sequence. Must be ≥ 1.

Returns:

Named tuple of the first and last positions that the primer occupies in the reference sequence. Positions are 1-indexed and include the first and last coordinates.

Return type:

SectionTuple
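
The lookup described above can be sketched on plain strings. The names PrimerSite and locate_primer are hypothetical stand-ins, and the field names pos5 and pos3 are assumptions; the sketch only illustrates the exactly-once rule and the 1-indexed, inclusive coordinates.

```python
from typing import NamedTuple


class PrimerSite(NamedTuple):
    """Hypothetical stand-in for the named tuple returned by locate."""
    pos5: int
    pos3: int


def locate_primer(seq: str, primer: str, seq5: int = 1) -> PrimerSite:
    """Return the inclusive, 1-indexed span of primer within seq."""
    if seq5 < 1:
        raise ValueError(f"seq5 must be >= 1, but got {seq5}")
    if not primer:
        raise ValueError("primer must not be empty")
    # Collect the start of every occurrence, including overlapping ones.
    starts = []
    start = seq.find(primer)
    while start != -1:
        starts.append(start)
        start = seq.find(primer, start + 1)
    if len(starts) != 1:
        raise ValueError(f"Expected exactly 1 occurrence of {primer!r}, "
                         f"but found {len(starts)}")
    pos5 = starts[0] + seq5
    pos3 = pos5 + len(primer) - 1
    return PrimerSite(pos5, pos3)
```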

seismicrna.core.seq.section.SectionTuple: alias of PrimerTuple

seismicrna.core.seq.section.get_coords_by_ref(coords: Iterable[tuple[str, int | DNA, int | DNA]])

seismicrna.core.seq.section.get_sect_coords_primers(sects_file: Path)

Parse a file defining each section by the name of its reference and either its 5’ and 3’ coordinates or its forward and reverse primer sequences. Return one map from each reference and 5’/3’ coordinate pair to the name of the corresponding section, and another from each reference and primer pair to the name of the corresponding section.

Parameters:

sects_file (Path) – CSV file of a table that defines the sections. The table must have columns labeled “Reference”, “Section”, “5’ End”, “3’ End”, “Forward Primer”, and “Reverse Primer”. Others are ignored.

Returns:

Two mappings, the first from (ref name, 5’ coord, 3’ coord) to each section, the second from (ref name, fwd primer, rev primer) to each section. If the section is named in the “Section” column of the table, then that name will be used as the section name. Otherwise, the section name will be an empty string.

Return type:

tuple[dict[tuple[str, int, int], str], dict[tuple[str, DNA, DNA], str]]
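
The shape of the two mappings can be illustrated with the standard csv module. This sketch makes several assumptions: the function read_section_table is hypothetical, it reads CSV text rather than a Path, primers are kept as plain strings, the column labels use an ASCII apostrophe, and a row contributes to a mapping only when both of the relevant fields are filled in.

```python
import csv
import io


def read_section_table(text: str):
    """Parse CSV text into (coords, primers) mappings to section names."""
    coords = {}
    primers = {}
    for row in csv.DictReader(io.StringIO(text)):
        ref = row["Reference"]
        # An unnamed section maps to an empty string.
        name = row.get("Section") or ""
        end5, end3 = row.get("5' End"), row.get("3' End")
        fwd, rev = row.get("Forward Primer"), row.get("Reverse Primer")
        if end5 and end3:
            coords[ref, int(end5), int(end3)] = name
        if fwd and rev:
            primers[ref, fwd, rev] = name
    return coords, primers
```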

seismicrna.core.seq.section.get_shared_index(indexes: Iterable[MultiIndex], empty_ok: bool = False)

Get the shared index among all those given, as follows:

If indexes contains no elements and empty_ok is True, then return an empty MultiIndex with levels named ‘Positions’ and ‘Base’.
If indexes contains one element or multiple identical elements, and each has two levels named ‘Positions’ and ‘Base’, then return the first element.
Otherwise, raise an error.

Parameters:

indexes (Iterable[pandas.MultiIndex]) – Indexes to compare.
empty_ok (bool = False) – If given no indexes, then default to an empty index (if True) or raise a ValueError (if False).

Returns:

The shared index.

Return type:

pandas.MultiIndex

seismicrna.core.seq.section.hyphenate_ends(end5: int, end3: int)

Return the 5’ and 3’ ends as a hyphenated string.

Parameters:

end5 (int) – 5’ end (1-indexed)
end3 (int) – 3’ end (1-indexed)

Returns:

Hyphenated 5’ and 3’ ends

Return type:

str

seismicrna.core.seq.section.index_to_pos(index: MultiIndex): Get the positions from a MultiIndex of (pos, base) pairs.

seismicrna.core.seq.section.index_to_seq(index: MultiIndex, allow_gaps: bool = False): Get the DNA sequence from a MultiIndex of (pos, base) pairs.

seismicrna.core.seq.section.intersect(*sections: Section, name: str | None = None)

Intersect one or more sections.

Parameters:

*sections (Section) – Sections to intersect.
name (str | None = None) – Name for the section to return.

Returns:

Intersection of all given sections.

Return type:

Section
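
On coordinates alone, an intersection can be sketched as the largest 5’ end paired with the smallest 3’ end. The helper intersect_coords is hypothetical; it ignores reference names, sequences, and masks, and representing an empty intersection as a zero-length span (end3 = end5 − 1) is an assumption of this sketch.

```python
def intersect_coords(*coords: tuple) -> tuple:
    """Intersect inclusive, 1-indexed (end5, end3) coordinate pairs."""
    if not coords:
        raise ValueError("Cannot intersect zero sections")
    end5 = max(c5 for c5, _ in coords)
    end3 = min(c3 for _, c3 in coords)
    # Assumption: an empty intersection becomes a zero-length span.
    return end5, max(end3, end5 - 1)
```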

seismicrna.core.seq.section.iter_windows(*series: Series, size: int, min_count: int = 1, include_nan: bool = False)

seismicrna.core.seq.section.seq_pos_to_index(seq: DNA, positions: Sequence[int], start: int)

Convert a sequence and positions to indexes, where each index is a tuple of (position, base).

Parameters:

seq (DNA) – DNA sequence.
positions (Sequence[int]) – Positions of the sequence from which to build the index. Every position must be an integer ≥ start.
start (int) – Numerical position to assign to the first base in the sequence. Must be a positive integer.

Returns:

MultiIndex of the same length as positions where each index is a tuple of (position, base).

Return type:

pd.MultiIndex
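
The pairing of positions with bases can be sketched without pandas. The helper seq_pos_to_pairs is hypothetical and returns a list of (position, base) tuples instead of a MultiIndex; rejecting positions beyond the end of the sequence is an assumption of this sketch.

```python
def seq_pos_to_pairs(seq: str, positions: list, start: int) -> list:
    """Pair each position with the base at that position in seq."""
    if start < 1:
        raise ValueError(f"start must be >= 1, but got {start}")
    last = start + len(seq) - 1
    pairs = []
    for pos in positions:
        if not start <= pos <= last:
            raise ValueError(f"Position {pos} is not in {start}-{last}")
        # Position `start` corresponds to index 0 of the sequence.
        pairs.append((pos, seq[pos - start]))
    return pairs
```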

seismicrna.core.seq.section.unite(*sections: Section, name: str | None = None, refseq: DNA | None = None)

Unite one or more sections.

Parameters:

*sections (Section) – Sections to unite.
name (str | None = None) – Name for the section to return.
refseq (DNA | None = None) – Reference sequence (optional) for filling any gaps in the union of the sections. If given, then it must match every section at the corresponding positions. If omitted, then any positions not covered by at least one section will be filled with N.

Returns:

Union of all given sections.

Return type:

Section
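
The gap-filling behavior can be sketched on plain strings. Here each section is a hypothetical (end5, seq) pair, refseq is assumed to start at position 1, and the helper unite_seqs is not part of the library; it ignores reference names and masks.

```python
from __future__ import annotations


def unite_seqs(*sections: tuple[int, str],
               refseq: str | None = None) -> tuple[int, str]:
    """Return (end5, seq) of the union of (end5, seq) sections."""
    if not sections:
        raise ValueError("Cannot unite zero sections")
    end5 = min(s5 for s5, _ in sections)
    end3 = max(s5 + len(seq) - 1 for s5, seq in sections)
    if refseq is not None:
        if end3 > len(refseq):
            raise ValueError("Sections extend past the reference")
        # Every section must match the reference where it lies.
        for s5, seq in sections:
            if refseq[s5 - 1: s5 - 1 + len(seq)] != seq:
                raise ValueError(f"Section at {s5} does not match refseq")
        return end5, refseq[end5 - 1: end3]
    # Without a reference, start from all N and overlay each section.
    bases = ["N"] * (end3 - end5 + 1)
    for s5, seq in sections:
        for offset, base in enumerate(seq):
            i = s5 - end5 + offset
            if bases[i] not in ("N", base):
                raise ValueError(f"Sections disagree at position {s5 + offset}")
            bases[i] = base
    return end5, "".join(bases)
```

For instance, uniting a section at positions 3-4 with one at positions 8-9 leaves positions 5-7 uncovered, so they are filled with N unless refseq is supplied.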

seismicrna.core.seq.section.verify_index_names(index: MultiIndex): Verify that the names of the index are correct.

seismicrna.core.seq.section.window_to_margins(window: int): Compute the 5’ and 3’ margins from the size of the window.

Sequence Core Module.

Define alphabets and classes for nucleic acid sequences, and functions for reading them from and writing them to FASTA files.

class seismicrna.core.seq.xna.CompressedSeq(seq: XNA)

Bases: object

Compress a sequence into two bits per base.

decompress(): Restore the original sequence.

property type

class seismicrna.core.seq.xna.DNA(seq: Any)

Bases: XNA

classmethod alph(): Sequence alphabet.

classmethod pict(): Sequence pictograms.

tr(): Transcribe DNA to RNA.

class seismicrna.core.seq.xna.RNA(seq: Any)

Bases: XNA

classmethod alph(): Sequence alphabet.

classmethod pict(): Sequence pictograms.

rt(): Reverse transcribe RNA to DNA.

class seismicrna.core.seq.xna.XNA(seq: Any)

Bases: ABC

__add__(other): Allow addition (concatenation) of two sequences only if the sequences have the same class.

__bool__(): Empty sequences return False; all else, True.

__contains__(item): Check if a sequence is contained in this sequence.

__eq__(other): Return True if both the type of the sequence and the bases in the sequence match, otherwise False.

__getitem__(item): If item is a slice, then return an instance of the class. Otherwise, return an instance of str.

__hash__(): Define __hash__ so that Seq subclasses can be used as keys for dict-like mappings. Use the hash of the plain string.

__mul__(other): Multiply a sequence by an int like a str times an int.

__repr__(): Encapsulate the sequence string with the class name.

abstract classmethod alph() → tuple[str, str, str, str, str]: Sequence alphabet.

property array: NumPy array of Unicode characters for the sequence.

compress(): Compress the sequence.

classmethod four(): Get the four standard bases.

classmethod get_alphaset(): Get the alphabet as a set.

classmethod get_comp(): Get the complementary alphabet as a tuple.

classmethod get_comptrans(): Get the translation table for complementary bases.

classmethod get_nonalphaset(): Get the printable characters not in the alphabet.

classmethod get_other_iupac(): Get the IUPAC extended characters not in the alphabet.

classmethod get_pictrans(): Get the translation table for pictogram characters.

kmers(k: int): Every subsequence of length k (k-mer).
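
A k-mer scan is a sliding window over the sequence. The generator iter_kmers below is a hypothetical sketch on plain strings, and requiring k ≥ 1 is an assumption of this example.

```python
from typing import Iterator


def iter_kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every subsequence of length k, from 5' to 3'."""
    if k < 1:
        raise ValueError(f"k must be >= 1, but got {k}")
    # A sequence of length n has n - k + 1 k-mers (none if k > n).
    for i in range(len(seq) - k + 1):
        yield seq[i: i + k]
```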

abstract classmethod pict() → tuple[str, str, str, str, str]: Sequence pictograms.

property picto: Pictogram string.

classmethod random(nt: int, a: float = 0.25, c: float = 0.25, g: float = 0.25, t: float = 0.25)

Return a random sequence of the given length.

Parameters:

nt (int) – Number of nucleotides to simulate. Must be ≥ 0.
a (float = 0.25) – Expected proportion of A.
c (float = 0.25) – Expected proportion of C.
g (float = 0.25) – Expected proportion of G.
t (float = 0.25) – Expected proportion of T (if DNA) or U (if RNA).

Returns:

A random sequence.

Return type:

DNA | RNA
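
Sampling with per-base proportions can be sketched with the standard random module. The function random_seq and its rna and seed parameters are hypothetical, and the sketch returns a plain str rather than a DNA or RNA object.

```python
from __future__ import annotations

import random


def random_seq(nt: int,
               a: float = 0.25,
               c: float = 0.25,
               g: float = 0.25,
               t: float = 0.25,
               rna: bool = False,
               seed: int | None = None) -> str:
    """Return a random sequence of nt bases with the given proportions."""
    if nt < 0:
        raise ValueError(f"nt must be >= 0, but got {nt}")
    rng = random.Random(seed)
    bases = "ACGU" if rna else "ACGT"
    # Each base is drawn independently, weighted by its proportion.
    return "".join(rng.choices(bases, weights=[a, c, g, t], k=nt))
```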

property rc: Reverse complement.
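
For reference, a reverse complement on a plain DNA string can be sketched with a translation table. The function reverse_complement is hypothetical and covers only A, C, G, T, and N.

```python
# Map each base to its complement; N stays N.
COMP = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.translate(COMP)[::-1]
```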

classmethod t_or_u(): Get the base that is complementary to A.

seismicrna.core.seq.xna.decompress(seq: CompressedSeq): Restore the original sequence from a CompressedSeq object.

seismicrna.core.seq.xna.expand_degenerate_seq(seq: DNA): Given a (possibly degenerate) sequence, yield every definite sequence that could derive from it. Only the degenerate base N is supported by this function; other IUPAC codes (e.g. R) are not.
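
The expansion of N can be sketched with itertools.product. The generator expand_n is hypothetical and works on plain strings; the order in which it yields sequences is a property of this sketch, not a documented guarantee.

```python
from itertools import product
from typing import Iterator


def expand_n(seq: str, bases: str = "ACGT") -> Iterator[str]:
    """Yield every definite sequence obtained by replacing each N."""
    # Splitting on N leaves the fixed segments between degenerate bases.
    segments = seq.split("N")
    num_n = len(segments) - 1
    for combo in product(bases, repeat=num_n):
        parts = [segments[0]]
        for base, segment in zip(combo, segments[1:]):
            parts.append(base)
            parts.append(segment)
        yield "".join(parts)
```

A sequence with n occurrences of N expands to 4^n definite sequences, so the output grows quickly with the number of degenerate positions.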