seismicrna.demult package

Submodules

class seismicrna.demult.barcode.RefBarcodes(ref_seqs: Iterable[tuple[str, DNA]], *, refs_meta_file: Path | None = None, coords: Iterable[tuple[str, int, int, int]] = (), bcs: Iterable[tuple[str, DNA, int]] = (), mismatches: int = 0, index_tolerance: int = 0, allow_n: bool = False)

Bases: object

A collection of barcodes, mapped to references/names.

property as_dict

Get a dict of name: barcode pairs.

property automaton
property barcodes

List all barcodes.

property by_pos
property count

Total number of barcodes.

get_automaton(barcodes: list[tuple[str, DNA]], start: int = 0, check: list[Automaton] = [])

Build an Aho-Corasick automaton for a set of barcodes.

Parameters:
  • barcodes (list[tuple[str, DNA]]) – Pairs of (name, barcode sequence) to add to the automaton.

  • start (int) – Integer index offset for the first barcode, used so that forward and reverse-complement automatons can share a single index space in name_map.

  • check (list[ahocorasick.Automaton]) – Existing automatons to check for collisions; any barcode that already appears in one of these will raise a ValueError.
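
The role of start and check can be illustrated with a minimal sketch. It assumes nothing about the internals of get_automaton: a plain dict stands in for the Aho-Corasick automaton so that the example needs no third-party package, and build_index and revcomp are hypothetical helpers, not part of seismicrna.

```python
def build_index(barcodes, start=0, check=()):
    """Assign consecutive integer indexes to (name, sequence) pairs.

    Hypothetical stand-in for get_automaton: a dict keyed by sequence
    replaces the automaton, and `check` holds earlier indexes.
    """
    index = {}
    for offset, (name, seq) in enumerate(barcodes):
        # Reject any sequence that already appears in an earlier index.
        if any(seq in existing for existing in check):
            raise ValueError(f"Barcode {seq!r} ({name}) collides")
        index[seq] = (start + offset, name)
    return index


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement a plain DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


pairs = [("ref1", "ACGTA"), ("ref2", "GGCAT")]
fwd = build_index(pairs)
rc_pairs = [(name, revcomp(seq)) for name, seq in pairs]
# Offsetting by len(fwd) keeps forward and reverse-complement indexes
# in one shared integer space.
rc = build_index(rc_pairs, start=len(fwd), check=[fwd])
name_map = {i: name for idx in (fwd, rc) for i, name in idx.values()}
```

In this sketch, indexes 0 and 1 belong to the forward barcodes and 2 and 3 to their reverse complements, so one name_map can resolve a hit from either collection.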

property max_barcode_len
property name_map
property names

Reference names.

property pairs
property rc_automaton
property rc_barcodes

Reverse complement of the barcodes.

property rc_by_pos
property rc_pairs
property rc_read_pos_range

List the range in which a reverse-complement barcode can fall in a read.

property rc_read_positions

List all reverse complement barcode positions.

property rc_slice_position
property read_pos_range

List the range in which a barcode can fall in a read.

property read_positions

List all read positions.

property slice_position
property uniq_names
property valid_positions
class seismicrna.demult.barcode.RegionTuple(pos5, pos3)

Bases: tuple

pos3

Alias for field number 1

pos5

Alias for field number 0

seismicrna.demult.barcode.coords_to_seq(seq: DNA, end5: int, end3: int)

Extract the subsequence between the inclusive, 1-indexed coordinates end5 and end3.
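
As a rough illustration of inclusive, 1-indexed coordinates, the sketch below slices a plain Python string. The helper slice_inclusive is hypothetical; the real function takes a DNA object, and its handling of out-of-range coordinates is not documented here, so the bounds check is an assumption.

```python
def slice_inclusive(seq: str, end5: int, end3: int) -> str:
    """Return seq[end5..end3], where both ends are 1-indexed and inclusive."""
    if not 1 <= end5 <= end3 <= len(seq):
        # Assumed behavior: reject coordinates outside the sequence.
        raise ValueError(f"Invalid coordinates {end5}-{end3}")
    # Subtract 1 from the 5' end only: Python slices are 0-indexed
    # and already exclude the stop position.
    return seq[end5 - 1: end3]


barcode = slice_inclusive("ACGTACGT", 3, 6)
```

Here positions 3 through 6 of "ACGTACGT" give the four bases "GTAC", so the length of the result is end3 - end5 + 1.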

seismicrna.demult.barcode.expand_by_tolerance(input: tuple[int, ...], tolerance: int) set[int]

Get the set of all values that lie within tolerance of any value in input.
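
One plausible reading of this expansion is sketched below. The helper expand is hypothetical, and whether the real function drops values that fall below the first read position is not stated, so this sketch keeps every value.

```python
def expand(values: tuple[int, ...], tolerance: int) -> set[int]:
    """Collect every integer within `tolerance` of any input value."""
    return {value + delta
            for value in values
            for delta in range(-tolerance, tolerance + 1)}


positions = expand((5, 10), 1)
```

With a tolerance of 1, the positions 5 and 10 expand to {4, 5, 6, 9, 10, 11}; overlapping windows collapse because the result is a set.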

seismicrna.demult.barcode.get_coords_by_name(coords: Iterable[tuple[str, int | DNA, int, int | None]])
seismicrna.demult.barcode.get_ref_barcodes(ref_meta_file: Path)

Parse a file defining each barcode by the name of its reference and either its 5’ and 3’ coordinates or its sequence and name. Return one map from each reference to a 5’/3’ coordinate pair and read position and another from each name to sequence and read position.

Parameters:

ref_meta_file (Path) – CSV file of a table that defines the barcodes. The table must have columns labeled “Reference”, “Barcode5”, “Barcode3”, “Name”, “Barcode”, and “Read Position”. Other columns are ignored.

Returns:

  • dict[str, tuple[int, int, int]] – A mapping from ref to coords

  • dict[str, tuple[DNA, int]] – A mapping from name to barcode
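
The shape of the table and of the two returned mappings can be sketched with the standard csv module. This is not the library's parser: parse_barcode_table is a hypothetical helper, sequences stay plain strings rather than DNA objects, and the rule that an empty cell means "not given" is an assumption.

```python
import csv
import io


def parse_barcode_table(text):
    """Split rows into coordinate-defined and sequence-defined barcodes."""
    coords = {}   # reference -> (5' coordinate, 3' coordinate, read position)
    by_name = {}  # name -> (barcode sequence, read position)
    for row in csv.DictReader(io.StringIO(text)):
        read_pos = int(row["Read Position"])
        if row["Barcode5"] and row["Barcode3"]:
            coords[row["Reference"]] = (int(row["Barcode5"]),
                                        int(row["Barcode3"]),
                                        read_pos)
        elif row["Barcode"]:
            by_name[row["Name"]] = (row["Barcode"], read_pos)
    return coords, by_name


TABLE = """Reference,Barcode5,Barcode3,Name,Barcode,Read Position
ref1,10,17,,,10
ref2,,,bc2,ACGTACGT,5
"""
coords, by_name = parse_barcode_table(TABLE)
```

The first row defines its barcode by coordinates within ref1, while the second gives a named sequence directly, so each row lands in a different mapping.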

seismicrna.demult.main.run(fasta: str | Path = Sentinel.UNSET, *, fastqz: Iterable[str | Path] = (), fastqy: Iterable[str | Path] = (), fastqx: Iterable[str | Path] = (), dmfastqz: Iterable[str | Path] = (), dmfastqy: Iterable[str | Path] = (), dmfastqx: Iterable[str | Path] = (), phred_enc: int = 33, refs_meta: str | Path = None, barcode_start: int = 0, barcode_end: int = 0, read_pos: int = None, barcode: tuple[tuple[str, DNA, int]] = (), mismatch_tolerance: int = 0, index_tolerance: int = 0, allow_n: bool = False, out_dir: str | Path = './out', keep_tmp: bool = False, branch: str = '', num_cpus: int = 4, force: bool = False, tmp_pfx='./tmp') list[Path]

Demultiplex FASTQ files.

Parameters:
  • fastqz (Iterable) – FASTQ file(s) of single-end reads [keyword-only, default: ()]

  • fastqy (Iterable) – FASTQ file(s) of paired-end reads with mates 1 and 2 interleaved [keyword-only, default: ()]

  • fastqx (Iterable) – FASTQ files of paired-end reads with mates 1 and 2 in separate files [keyword-only, default: ()]

  • dmfastqz (Iterable) – Demultiplexed FASTQ files of single-end reads [keyword-only, default: ()]

  • dmfastqy (Iterable) – Demultiplexed FASTQ files of paired-end reads interleaved in one file [keyword-only, default: ()]

  • dmfastqx (Iterable) – Demultiplexed FASTQ files of mate 1 and mate 2 reads [keyword-only, default: ()]

  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • refs_meta (str | pathlib.Path) – Add reference metadata from this CSV file to exported results [keyword-only, default: None]

  • barcode_start (int) – Index of start of barcode [keyword-only, default: 0]

  • barcode_end (int) – Index of end of barcode [keyword-only, default: 0]

  • read_pos (int) – Expected position of the barcode in the read (1-indexed). Defaults to --barcode-start [keyword-only, default: None]

  • barcode (tuple) – Barcodes to demultiplex, each given as a name, a sequence, and a position (1-indexed relative to the read start) [keyword-only, default: ()]

  • mismatch_tolerance (int) – Maximum number of mismatches allowed for a sequence to still count as a barcode match. Raising this value increases non-parallel computation at a factorial rate; use caution above 2 mismatches. Does not apply to clipped sequences. [keyword-only, default: 0]

  • index_tolerance (int) – Maximum distance from the reference index at which the barcode may still be found in a read [keyword-only, default: 0]

  • allow_n (bool) – Allow N as a valid mismatch when --mismatch-tolerance ≥ 1. Increases memory consumption. [keyword-only, default: False]

  • out_dir (str | pathlib.Path) – Write all output files to this directory [keyword-only, default: './out']

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: '']

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']

seismicrna.demult.neighbor.decode_barcode_2bit(barcode: int, length: int)

Decode the 2-bit encoded integer back into a DNA string.

seismicrna.demult.neighbor.decode_barcode_3bit(barcode: int, length: int)

Decode the 3-bit encoded integer back into a DNA string. Mapping: 0->A, 1->C, 2->G, 3->T, 4->N.

seismicrna.demult.neighbor.encode_barcode_2bit(barcode: DNA)

Encode a DNA barcode (string) into an integer using 2-bit encoding per base. Mapping: A=00, C=01, G=10, T=11.
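
The 2-bit mapping above and its inverse can be sketched as follows. The bit order (first base in the most significant bits) is an assumption, and encode_2bit and decode_2bit are hypothetical names that operate on plain strings rather than DNA objects.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode_2bit(barcode: str) -> int:
    """Pack a barcode into an integer, two bits per base."""
    value = 0
    for base in barcode:
        # Shift earlier bases left to make room for the next one.
        value = (value << 2) | CODES[base]
    return value


def decode_2bit(value: int, length: int) -> str:
    """Unpack `length` bases from a 2-bit encoded integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])  # read the lowest two bits
        value >>= 2
    # Bases were read from last to first, so reverse them.
    return "".join(reversed(bases))


encoded = encode_2bit("ACGT")
decoded = decode_2bit(encoded, 4)
```

Decoding needs the length as a separate argument because leading "A" bases encode as zero bits: "AAC" and "C" both pack to the integer 1.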

seismicrna.demult.neighbor.encode_barcode_3bit(barcode: DNA)

Encode a DNA barcode (string) into an integer using 3-bit encoding per base. Mapping: A=000, C=001, G=010, T=011. (Note: The barcode itself will never contain ‘N’.)

seismicrna.demult.neighbor.generate_neighbors_2bit(orig: int, length: int, max_mismatches: int)

Generate all neighbor integers using 2-bit encoding.

seismicrna.demult.neighbor.generate_neighbors_3bit(orig: int, length: int, max_mismatches: int)

Generate all neighbor integers using 3-bit encoding.

seismicrna.demult.neighbor.get_neighbors(barcode: DNA, max_mismatches: int, allow_n: bool)

Get all DNA barcodes within max_mismatches.

Parameters:
  • barcode – DNA barcode (of type DNA). It should not contain ‘N’.

  • max_mismatches – Maximum allowed mismatches.

  • allow_n – If True, substitutions to ‘N’ are allowed (using 3-bit encoding). If False, only A, C, G, and T are used (using 2-bit encoding).

Returns:

A set of barcode strings representing all neighbors.
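
To show what the returned set contains and how quickly it grows, here is a string-based sketch. It does not use the integer encodings of the real module; the helper neighbors is hypothetical, and including the original barcode in the result (zero mismatches) is an assumption.

```python
from itertools import combinations, product


def neighbors(barcode: str, max_mismatches: int, allow_n: bool = False) -> set[str]:
    """Enumerate every sequence within max_mismatches substitutions."""
    alphabet = "ACGTN" if allow_n else "ACGT"
    found = {barcode}  # assumed: the barcode counts as its own neighbor
    for k in range(1, max_mismatches + 1):
        # Choose which k positions to change ...
        for positions in combinations(range(len(barcode)), k):
            # ... and every combination of replacement bases at them.
            choices = [[b for b in alphabet if b != barcode[p]]
                       for p in positions]
            for subs in product(*choices):
                seq = list(barcode)
                for p, b in zip(positions, subs):
                    seq[p] = b
                found.add("".join(seq))
    return found


one_mismatch = neighbors("ACGTACGT", 1)
two_mismatches = neighbors("ACGTACGT", 2)
two_with_n = neighbors("ACGTACGT", 2, allow_n=True)
```

For this 8-base barcode the set grows from 25 sequences at one mismatch to 277 at two, and to 481 at two mismatches when N is allowed, which is consistent with the caution about memory use under allow_n.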

seismicrna.demult.neighbor.rec_neighbors_2bit(orig: int, length: int, max_mismatches: int, pos: int, mismatches: int, current: int, out: List)

Recursively generate neighbor integers using 2-bit encoding.

seismicrna.demult.neighbor.rec_neighbors_3bit(orig: int, length: int, max_mismatches: int, pos: int, mismatches: int, current: int, out: List)

Recursively generate neighbor integers using 3-bit encoding. This version allows substitutions to ‘N’ (encoded as 4).
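
A recursion with this parameter list might look like the sketch below. It is an assumption about the approach, not the library's code: the sketch_ prefix marks both functions as hypothetical, and the first base is assumed to occupy the most significant bits.

```python
def sketch_rec_neighbors_3bit(orig, length, max_mismatches,
                              pos, mismatches, current, out):
    """Append to `out` every encoding within max_mismatches of `orig`."""
    if pos == length:
        out.append(current)
        return
    shift = 3 * (length - 1 - pos)
    orig_code = (orig >> shift) & 0b111
    # Branch 1: keep the original base at this position.
    sketch_rec_neighbors_3bit(orig, length, max_mismatches, pos + 1,
                              mismatches, current | (orig_code << shift), out)
    # Branch 2: substitute, if the mismatch budget allows it.
    if mismatches < max_mismatches:
        for code in range(5):  # 0..3 are A, C, G, T; 4 is N
            if code != orig_code:
                sketch_rec_neighbors_3bit(orig, length, max_mismatches,
                                          pos + 1, mismatches + 1,
                                          current | (code << shift), out)


def sketch_decode_3bit(value, length):
    """Turn a 3-bit encoded integer back into a string."""
    return "".join("ACGTN"[(value >> (3 * (length - 1 - i))) & 0b111]
                   for i in range(length))


encoded_ac = (0 << 3) | 1  # "AC": A=000, C=001
out = []
sketch_rec_neighbors_3bit(encoded_ac, 2, 1, 0, 0, 0, out)
decoded = sorted(sketch_decode_3bit(value, 2) for value in out)
```

Each path through the recursion fixes one base per level, so every neighbor is produced exactly once and no deduplication step is needed in this sketch.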

seismicrna.demult.write.check_demult_fqs(demult_fqs: dict[tuple[str, str], FastqUnit], out_dir: Path, branches: dict[str, str])

Return every FASTQ unit on which demultiplexing must be run and every expected demultiplexed file that already exists.

seismicrna.demult.write.check_matches(matches: Iterable[tuple[tuple[int, str, set]]], barcodes)
seismicrna.demult.write.demult_ahocorasick(fq_unit: FastqUnit, out_fqs: dict[int, FastqUnit], barcodes: RefBarcodes, buffer_limit: int = 1000)

Demultiplex reads from a FASTQ unit using Aho-Corasick matching.

Parameters:
  • fq_unit (FastqUnit) – Input FASTQ unit whose reads are to be demultiplexed.

  • out_fqs (dict[int, FastqUnit]) – Mapping from barcode name to the output FastqUnit that receives reads assigned to that barcode.

  • barcodes (RefBarcodes) – Barcode definitions including the forward and reverse-complement automatons and valid position ranges.

  • buffer_limit (int) – Number of records to accumulate in memory before flushing to disk; higher values use more memory but reduce I/O calls.
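
The trade-off behind buffer_limit can be sketched with a small buffered writer. This is a guess at the general pattern only: BufferedWriter is a hypothetical class, in-memory io.StringIO objects stand in for output FASTQ files, and counting pending records across all barcodes (rather than per barcode) is an assumption.

```python
import io
from collections import defaultdict


class BufferedWriter:
    """Hold records per output key and write them in batches."""

    def __init__(self, sinks, buffer_limit=1000):
        self.sinks = sinks              # key -> writable text handle
        self.buffer_limit = buffer_limit
        self.buffers = defaultdict(list)
        self.pending = 0                # records held across all keys
        self.flushes = 0

    def add(self, key, record):
        self.buffers[key].append(record)
        self.pending += 1
        if self.pending >= self.buffer_limit:
            self.flush()

    def flush(self):
        # One write call per key instead of one per record.
        for key, records in self.buffers.items():
            self.sinks[key].write("".join(records))
        self.buffers.clear()
        self.pending = 0
        self.flushes += 1


sinks = {0: io.StringIO(), 1: io.StringIO()}
writer = BufferedWriter(sinks, buffer_limit=3)
for key, name in [(0, "r1"), (1, "r2"), (0, "r3"), (1, "r4")]:
    writer.add(key, f"@{name}\nACGT\n+\nIIII\n")
writer.flush()  # write whatever remains after the last read
```

With a limit of 3, the third record triggers the first flush and the final explicit flush writes the fourth, so four records cost two rounds of writes.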

seismicrna.demult.write.demult_fq_pipeline(fq_inp: FastqUnit, barcodes: RefBarcodes, *, out_dir: Path, tmp_dir: Path, num_cpus: int = 1, **kwargs)

Run all stages of the demult pipeline for one FASTQ file or one pair of mated FASTQ files.

seismicrna.demult.write.demult_fqs_pipeline(fq_units: list[FastqUnit], fasta: Path, barcodes: RefBarcodes, index_tolerance: int, mismatch_tolerance: int, *, num_cpus: int, tmp_dir: Path, **kwargs) list[Path]

Run all stages of demultiplexing for one or more FASTQ files or pairs of mated FASTQ files.

seismicrna.demult.write.demult_samples(fq_units: list[FastqUnit], fasta: Path, *, refs_meta: Path, barcode_start: int, barcode_end: int, mismatch_tolerance: int, index_tolerance: int, allow_n: bool, read_pos: int | None, barcode: tuple[tuple[str, DNA, int]], out_dir: Path, branch: str, force: bool, **kwargs) list[Path]

Run the demult pipeline and return a list of all FASTQ files from the pipeline.

seismicrna.demult.write.get_fq_suffix(path: Path)
seismicrna.demult.write.get_open_func(fq_path: Path)
seismicrna.demult.write.get_part(fq_path: Path)
seismicrna.demult.write.get_split_paths(split_dir, fq_inp, num_parts)

Return the expected paths of split FASTQ part files.

Parameters:
  • split_dir (Path) – Directory that contains the split files.

  • fq_inp (FastqUnit) – The original FASTQ unit from which the base name and suffix are derived.

  • num_parts (int) – Number of split parts expected.

seismicrna.demult.write.list_demult(fq_units: list[FastqUnit], refs: set[str])

List every expected demultiplexed FASTQ from a multiplexed FASTQ.

seismicrna.demult.write.list_demulted_fqs(demult_fqs: dict[tuple[str, str], FastqUnit], out_dir: Path, branches: dict[str, str])

List every FASTQ to demultiplex and every extant demultiplexed file.

seismicrna.demult.write.merge_fqs(fq_units: Iterable[FastqUnit])

For every FASTQ that is not demultiplexed, merge all the keys that map to the FASTQ into one key: (sample, None). Merging ensures that every non-demultiplexed FASTQ is aligned only once to the whole set of references, not once for every reference in the set. This function is essentially the inverse of figure_alignments.

seismicrna.demult.write.merge_parts(parts=list[pathlib.Path])
seismicrna.demult.write.process_fq_part(fq_inp: FastqUnit, fqs: tuple[Path, ...], barcodes: RefBarcodes, *, release_dir: Path, branches: dict[str, str], **kwargs)

Demultiplex one split part of a FASTQ file.

Parameters:
  • fq_inp (FastqUnit) – The original (unsplit) FASTQ unit, used for metadata such as sample name and Phred encoding.

  • fqs (tuple[Path, ...]) – Paths to the split FASTQ file(s) for this part.

  • barcodes (RefBarcodes) – Barcode definitions used to assign reads to references.

  • release_dir (Path) – Top-level directory where demultiplexed output files are staged before being moved to the final output directory.

  • branches (dict[str, str]) – Mapping of pipeline step names to branch names, used when constructing output file paths.

seismicrna.demult.write.remove_suffixes(path: Path)
seismicrna.demult.write.rename_fq_part(fq_path: Path) Path
seismicrna.demult.write.split_fq(fq_inp: FastqUnit, working_dir: Path, num_split: int)

Split a FASTQ file (or pair) into parts for parallel processing.

Parameters:
  • fq_inp (FastqUnit) – The input FASTQ unit (single-end or paired-end) to split.

  • working_dir (Path) – Temporary working directory under which split files are written.

  • num_split (int) – Number of parts to split into; if <= 1 the file is symlinked rather than copied.
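
The splitting step can be sketched for a single uncompressed FASTQ file. Everything here is an assumption except the symlink rule for num_split <= 1: split_fastq is a hypothetical helper, the round-robin distribution and the ".partN" file names are illustrative, and the real function works on FastqUnit objects and may handle compressed or paired files differently.

```python
import tempfile
from itertools import islice
from pathlib import Path


def split_fastq(fq_path: Path, working_dir: Path, num_split: int) -> list[Path]:
    """Distribute 4-line FASTQ records round-robin over num_split files."""
    working_dir.mkdir(parents=True, exist_ok=True)
    if num_split <= 1:
        # Nothing to split: link to the original instead of copying it.
        link = working_dir / fq_path.name
        link.symlink_to(fq_path.resolve())
        return [link]
    parts = [working_dir / f"{fq_path.stem}.part{i}{fq_path.suffix}"
             for i in range(num_split)]
    handles = [part.open("w") for part in parts]
    try:
        with fq_path.open() as fq:
            index = 0
            # Each FASTQ record is exactly four lines long.
            while record := list(islice(fq, 4)):
                handles[index % num_split].writelines(record)
                index += 1
    finally:
        for handle in handles:
            handle.close()
    return parts


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    fq = tmp_dir / "sample.fq"
    fq.write_text("".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(5)))
    parts = split_fastq(fq, tmp_dir / "split", 2)
    counts = [part.read_text().count("@r") for part in parts]
    single = split_fastq(fq, tmp_dir / "single", 1)
    single_is_link = single[0].is_symlink()
```

Five records split two ways give parts of three and two records, and asking for a single part produces only a symbolic link, which avoids duplicating a large input file.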

seismicrna.demult.write.strip_all_fq_suffixes(path: str | Path)
seismicrna.demult.write.to_range(pos: int, tolerance: int)