seismicrna.demult package
Submodules
- class seismicrna.demult.barcode.RefBarcodes(ref_seqs: Iterable[tuple[str, DNA]], *, refs_meta_file: Path | None = None, coords: Iterable[tuple[str, int, int, int]] = (), bcs: Iterable[tuple[str, DNA, int]] = (), mismatches: int = 0, index_tolerance: int = 0, allow_n: bool = False)
Bases: object
A collection of barcodes, mapped to references/names.
- property automaton
- property barcodes
List all barcodes.
- property by_pos
- property count
Total number of barcodes.
- get_automaton(barcodes: list[tuple[str, DNA]], start: int = 0, check: list[Automaton] = [])
Build an Aho-Corasick automaton for a set of barcodes.
- Parameters:
barcodes (list[tuple[str, DNA]]) – Pairs of (name, barcode sequence) to add to the automaton.
start (int) – Integer index offset for the first barcode, used so that forward and reverse-complement automatons can share a single index space in name_map.
check (list[ahocorasick.Automaton]) – Existing automatons to check for collisions; any barcode that already appears in one of these will raise a ValueError.
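The roles of start and check can be illustrated with a minimal sketch in which plain dictionaries stand in for ahocorasick.Automaton objects. The names build_index_sketch and revcomp are hypothetical and are not part of seismicrna; the real method may differ in how it stores values and reports collisions.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string (hypothetical helper)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_index_sketch(barcodes, start=0, check=()):
    """Map each barcode sequence to (index, name).

    A dict stands in for the automaton; this is an assumption made
    for illustration only.
    """
    index = {}
    for offset, (name, seq) in enumerate(barcodes):
        # Refuse any barcode already present in a previously built index.
        for existing in check:
            if seq in existing:
                raise ValueError(f"Barcode {seq} ({name}) already exists")
        index[seq] = (start + offset, name)
    return index


pairs = [("ref1", "ACGTA"), ("ref2", "GGATC")]
fwd = build_index_sketch(pairs)
# Offset the reverse-complement indexes so both sets share one index space.
rc = build_index_sketch([(name, revcomp(seq)) for name, seq in pairs],
                        start=len(pairs), check=[fwd])
name_map = {i: name for index in (fwd, rc) for i, name in index.values()}
print(name_map)
```

In this sketch a palindromic barcode such as ACGT equals its own reverse complement, so building the reverse-complement index with check=[fwd] would raise a ValueError.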
- property max_barcode_len
- property name_map
- property names
Reference names.
- property pairs
- property rc_automaton
- property rc_barcodes
Reverse complement of the barcodes.
- property rc_by_pos
- property rc_pairs
- property rc_read_pos_range
List the range in which an rc barcode can fall in a read.
- property rc_read_positions
List all reverse complement barcode positions.
- property rc_slice_position
- property read_pos_range
List the range in which a barcode can fall in a read.
- property read_positions
List all read positions.
- property slice_position
- property uniq_names
- property valid_positions
- class seismicrna.demult.barcode.RegionTuple(pos5, pos3)
Bases: tuple
- pos3
Alias for field number 1
- pos5
Alias for field number 0
- seismicrna.demult.barcode.coords_to_seq(seq: DNA, end5: int, end3: int)
Extract the sequence between inclusive, 1-indexed coordinates.
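As a minimal sketch of inclusive, 1-indexed extraction, the function below uses a plain str in place of DNA. The name coords_to_seq_sketch and the bounds check are assumptions for illustration; the real function may validate differently.

```python
def coords_to_seq_sketch(seq: str, end5: int, end3: int) -> str:
    """Return the bases from end5 to end3, both inclusive and 1-indexed."""
    if not 1 <= end5 <= end3 <= len(seq):
        raise ValueError(
            f"Invalid coordinates {end5}-{end3} for length {len(seq)}"
        )
    # Subtract 1 from the 5' end to get a 0-indexed start; the inclusive
    # 1-indexed 3' end already equals the exclusive 0-indexed stop.
    return seq[end5 - 1: end3]


print(coords_to_seq_sketch("ACGTACGT", 2, 4))  # CGT
```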
- seismicrna.demult.barcode.expand_by_tolerance(input: tuple[int, ...], tolerance: int) → set[int]
Get the full set of values that lie within tolerance of any value in input.
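One plausible reading of this behavior is sketched below. The name expand_by_tolerance_sketch is hypothetical, and whether the real function clips values (for example, to positive positions) is not stated here, so the sketch does no clipping.

```python
def expand_by_tolerance_sketch(values: tuple[int, ...],
                               tolerance: int) -> set[int]:
    """Collect every integer within `tolerance` of any value in `values`."""
    expanded: set[int] = set()
    for value in values:
        # range() excludes its stop, so add 1 to include value + tolerance.
        expanded.update(range(value - tolerance, value + tolerance + 1))
    return expanded


print(sorted(expand_by_tolerance_sketch((5, 10), 1)))  # [4, 5, 6, 9, 10, 11]
```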
- seismicrna.demult.barcode.get_coords_by_name(coords: Iterable[tuple[str, int | DNA, int, int | None]])
- seismicrna.demult.barcode.get_ref_barcodes(ref_meta_file: Path)
Parse a file defining each barcode by the name of its reference and either its 5’ and 3’ coordinates or its sequence and name. Return one map from each reference to a 5’/3’ coordinate pair and read position and another from each name to sequence and read position.
- Parameters:
ref_meta_file (Path) – CSV file of a table that defines the barcodes. The table must have columns labeled “Reference”, “Barcode5”, “Barcode3”, “Name”, “Barcode”, and “Read Position”. Others are ignored.
- Returns:
dict[str, tuple[int, int, int]] – A mapping from ref to coords
dict[str, tuple[DNA, int]] – A mapping from name to barcode
- seismicrna.demult.main.run(fasta: str | Path = Sentinel.UNSET, *, fastqz: Iterable[str | Path] = (), fastqy: Iterable[str | Path] = (), fastqx: Iterable[str | Path] = (), dmfastqz: Iterable[str | Path] = (), dmfastqy: Iterable[str | Path] = (), dmfastqx: Iterable[str | Path] = (), phred_enc: int = 33, refs_meta: str | Path = None, barcode_start: int = 0, barcode_end: int = 0, read_pos: int = None, barcode: tuple[tuple[str, DNA, int]] = (), mismatch_tolerance: int = 0, index_tolerance: int = 0, allow_n: bool = False, out_dir: str | Path = './out', keep_tmp: bool = False, branch: str = '', num_cpus: int = 4, force: bool = False, tmp_pfx='./tmp') → list[Path]
Demultiplex FASTQ files.
- Parameters:
fastqz (Iterable) – FASTQ file(s) of single-end reads [keyword-only, default: ()]
fastqy (Iterable) – FASTQ file(s) of paired-end reads with mates 1 and 2 interleaved [keyword-only, default: ()]
fastqx (Iterable) – FASTQ files of paired-end reads with mates 1 and 2 in separate files [keyword-only, default: ()]
dmfastqz (Iterable) – Demultiplexed FASTQ files of single-end reads [keyword-only, default: ()]
dmfastqy (Iterable) – Demultiplexed FASTQ files of paired-end reads interleaved in one file [keyword-only, default: ()]
dmfastqx (Iterable) – Demultiplexed FASTQ files of mate 1 and mate 2 reads [keyword-only, default: ()]
phred_enc (int) – Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]
refs_meta (str | pathlib.Path) – Add reference metadata from this CSV file to exported results [keyword-only, default: None]
barcode_start (int) – Index of start of barcode [keyword-only, default: 0]
barcode_end (int) – Index of end of barcode [keyword-only, default: 0]
read_pos (int) – Expected position of the barcode in the read (1-indexed). Defaults to --barcode-start [keyword-only, default: None]
barcode (tuple) – A list of barcode name, barcode sequence, and barcode position (1-indexed relative to read start) to demultiplex [keyword-only, default: ()]
mismatch_tolerance (int) – Maximum number of mismatches a barcode match may contain and still be considered valid. Increases non-parallel computation at a factorial rate; use caution above 2 mismatches. Does not apply to clipped sequences. [keyword-only, default: 0]
index_tolerance (int) – Maximum distance from the reference index at which the pattern may be found in a read [keyword-only, default: 0]
allow_n (bool) – Allow N as a valid mismatch when --mismatch-tolerance ≥ 1. Increases memory consumption. [keyword-only, default: False]
out_dir (str | pathlib.Path) – Write all output files to this directory [keyword-only, default: './out']
keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]
branch (str) – Create a new branch of the workflow with this name [keyword-only, default: '']
num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp']
- seismicrna.demult.neighbor.decode_barcode_2bit(barcode: int, length: int)
Decode the 2-bit encoded integer back into a DNA string.
- seismicrna.demult.neighbor.decode_barcode_3bit(barcode: int, length: int)
Decode the 3-bit encoded integer back into a DNA string. Mapping: 0->A, 1->C, 2->G, 3->T, 4->N.
- seismicrna.demult.neighbor.encode_barcode_2bit(barcode: DNA)
Encode a DNA barcode (string) into an integer using 2-bit encoding per base. Mapping: A=00, C=01, G=10, T=11.
- seismicrna.demult.neighbor.encode_barcode_3bit(barcode: DNA)
Encode a DNA barcode (string) into an integer using 3-bit encoding per base. Mapping: A=000, C=001, G=010, T=011. (Note: The barcode itself will never contain ‘N’.)
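The 2-bit and 3-bit encodings described above can be sketched with one pair of hypothetical helpers, encode_sketch and decode_sketch, that take the bit width as a parameter. The base-to-code mapping follows the documented one; placing the first base in the most significant bits is an assumption, since the bit order is not stated here.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
BASES = "ACGTN"


def encode_sketch(barcode: str, bits: int) -> int:
    """Pack a barcode into an integer, `bits` bits per base."""
    value = 0
    for base in barcode:
        code = CODES[base]
        if code >= (1 << bits):
            raise ValueError(f"Base {base!r} does not fit in {bits} bits")
        # Shift earlier bases left and append this base's code
        # (assumed order: first base ends up most significant).
        value = (value << bits) | code
    return value


def decode_sketch(value: int, length: int, bits: int) -> str:
    """Unpack an integer into a barcode of the given length."""
    mask = (1 << bits) - 1
    bases = []
    for _ in range(length):
        bases.append(BASES[value & mask])  # read the last base first
        value >>= bits
    return "".join(reversed(bases))


print(encode_sketch("ACGT", 2))  # 27, i.e. 0b00011011
print(decode_sketch(27, 4, 2))   # ACGT
```

The length argument is needed when decoding because leading A bases encode as zero bits: under this sketch, "AAC" and "C" both encode to 1 with 2 bits per base.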
- seismicrna.demult.neighbor.generate_neighbors_2bit(orig: int, length: int, max_mismatches: int)
Generate all neighbor integers using 2-bit encoding.
- seismicrna.demult.neighbor.generate_neighbors_3bit(orig: int, length: int, max_mismatches: int)
Generate all neighbor integers using 3-bit encoding.
- seismicrna.demult.neighbor.get_neighbors(barcode: DNA, max_mismatches: int, allow_n: bool)
Get all DNA barcodes within max_mismatches.
- Parameters:
barcode – DNA barcode (of type DNA). It should not contain ‘N’.
max_mismatches – Maximum allowed mismatches.
allow_n – If True, substitutions to ‘N’ are allowed (using 3-bit encoding). If False, only A, C, G, and T are used (using 2-bit encoding).
- Returns:
A set of barcode strings representing all neighbors.
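The set of neighbors can also be enumerated directly on strings, which may help when checking expectations about its size. The sketch below is not the bit-encoded method used by this module; neighbors_sketch is a hypothetical name, and including the original barcode (zero mismatches) in the result is an assumption.

```python
from itertools import combinations, product


def neighbors_sketch(barcode: str, max_mismatches: int,
                     allow_n: bool) -> set[str]:
    """Enumerate all strings within max_mismatches substitutions."""
    alphabet = "ACGTN" if allow_n else "ACGT"
    found = {barcode}  # zero mismatches: the barcode itself
    limit = min(max_mismatches, len(barcode))
    for num_mismatches in range(1, limit + 1):
        # Choose which positions to substitute.
        for positions in combinations(range(len(barcode)), num_mismatches):
            # At each chosen position, try every base except the original.
            choices = [[base for base in alphabet if base != barcode[pos]]
                       for pos in positions]
            for substitutions in product(*choices):
                bases = list(barcode)
                for pos, base in zip(positions, substitutions):
                    bases[pos] = base
                found.add("".join(bases))
    return found


print(len(neighbors_sketch("ACG", 1, allow_n=False)))  # 10
print(len(neighbors_sketch("ACG", 1, allow_n=True)))   # 13
```

For a barcode of length L over an alphabet of k letters, this sketch yields the sum over i from 0 to max_mismatches of C(L, i) * (k - 1)^i strings, so the set grows quickly with each additional mismatch and is larger when allow_n is True.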
- seismicrna.demult.neighbor.rec_neighbors_2bit(orig: int, length: int, max_mismatches: int, pos: int, mismatches: int, current: int, out: List)
Recursively generate neighbor integers using 2-bit encoding.
- seismicrna.demult.neighbor.rec_neighbors_3bit(orig: int, length: int, max_mismatches: int, pos: int, mismatches: int, current: int, out: List)
Recursively generate neighbor integers using 3-bit encoding. This version allows substitutions to ‘N’ (encoded as 4).
- seismicrna.demult.write.check_demult_fqs(demult_fqs: dict[tuple[str, str], FastqUnit], out_dir: Path, branches: dict[str, str])
Return every FASTQ unit on which demultiplexing must be run and every expected demultiplexed file that already exists.
- seismicrna.demult.write.demult_ahocorasick(fq_unit: FastqUnit, out_fqs: dict[int, FastqUnit], barcodes: RefBarcodes, buffer_limit: int = 1000)
Demultiplex reads from a FASTQ unit using Aho-Corasick matching.
- Parameters:
fq_unit (FastqUnit) – Input FASTQ unit whose reads are to be demultiplexed.
out_fqs (dict[int, FastqUnit]) – Mapping from barcode name to the output FastqUnit that receives reads assigned to that barcode.
barcodes (RefBarcodes) – Barcode definitions including the forward and reverse-complement automatons and valid position ranges.
buffer_limit (int) – Number of records to accumulate in memory before flushing to disk; higher values use more memory but reduce I/O calls.
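The effect of buffer_limit can be illustrated with a simplified sketch. Here a naive substring search stands in for Aho-Corasick matching, in-memory lists stand in for output files, position ranges and reverse complements are ignored, and the buffer is counted per barcode; all of these are assumptions, and demult_buffered_sketch is a hypothetical name.

```python
from collections import defaultdict
from collections.abc import Iterable

Record = tuple[str, str, str]  # (read name, sequence, quality string)


def demult_buffered_sketch(records: Iterable[Record],
                           barcodes: dict[str, str],
                           buffer_limit: int = 1000):
    """Assign reads to barcodes, flushing each buffer when it fills up."""
    buffers: dict[str, list[Record]] = defaultdict(list)
    outputs: dict[str, list[Record]] = defaultdict(list)  # stand-in for files
    num_flushes = 0

    def flush(name: str) -> None:
        nonlocal num_flushes
        outputs[name].extend(buffers[name])  # one "write" per flush
        buffers[name].clear()
        num_flushes += 1

    for record in records:
        seq = record[1]
        matches = [name for name, bc in barcodes.items() if bc in seq]
        if len(matches) != 1:
            continue  # assumption: skip unassigned and ambiguous reads
        name = matches[0]
        buffers[name].append(record)
        if len(buffers[name]) >= buffer_limit:
            flush(name)
    # Write whatever remains once the input is exhausted.
    for name in list(buffers):
        if buffers[name]:
            flush(name)
    return dict(outputs), num_flushes


reads = [("r1", "AAACCCGGG", "I" * 9), ("r2", "TTTGGGAAA", "I" * 9),
         ("r3", "AAACCCTTT", "I" * 9), ("r4", "TTTTTTTTT", "I" * 9)]
out, flushes = demult_buffered_sketch(reads, {"bc1": "ACCC", "bc2": "TGGG"},
                                      buffer_limit=2)
print({name: [r[0] for r in recs] for name, recs in out.items()}, flushes)
```

A larger buffer_limit means fewer flushes (fewer writes) at the cost of holding more records in memory, which is the trade-off the parameter description states.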
- seismicrna.demult.write.demult_fq_pipeline(fq_inp: FastqUnit, barcodes: RefBarcodes, *, out_dir: Path, tmp_dir: Path, num_cpus: int = 1, **kwargs)
Run all stages of the demult pipeline for one FASTQ file or one pair of mated FASTQ files.
- seismicrna.demult.write.demult_fqs_pipeline(fq_units: list[FastqUnit], fasta: Path, barcodes: RefBarcodes, index_tolerance: int, mismatch_tolerance: int, *, num_cpus: int, tmp_dir: Path, **kwargs) → list[Path]
Run all stages of demultiplexing for one or more FASTQ files or pairs of mated FASTQ files.
- seismicrna.demult.write.demult_samples(fq_units: list[FastqUnit], fasta: Path, *, refs_meta: Path, barcode_start: int, barcode_end: int, mismatch_tolerance: int, index_tolerance: int, allow_n: bool, read_pos: int | None, barcode: tuple[tuple[str, DNA, int]], out_dir: Path, branch: str, force: bool, **kwargs) → list[Path]
Run the demult pipeline and return a list of all FASTQ files from the pipeline.
- seismicrna.demult.write.get_split_paths(split_dir, fq_inp, num_parts)
Return the expected paths of split FASTQ part files.
- Parameters:
split_dir (Path) – Directory that contains the split files.
fq_inp (FastqUnit) – The original FASTQ unit from which the base name and suffix are derived.
num_parts (int) – Number of split parts expected.
- seismicrna.demult.write.list_demult(fq_units: list[FastqUnit], refs: set[str])
List every expected demultiplexed FASTQ from a multiplexed FASTQ.
- seismicrna.demult.write.list_demulted_fqs(demult_fqs: dict[tuple[str, str], FastqUnit], out_dir: Path, branches: dict[str, str])
List every FASTQ to demultiplex and every extant demultiplexed file.
- seismicrna.demult.write.merge_fqs(fq_units: Iterable[FastqUnit])
For every FASTQ that is not demultiplexed, merge all the keys that map to the FASTQ into one key: (sample, None). Merging ensures that every non-demultiplexed FASTQ is aligned only once to the whole set of references, not once for every reference in the set. This function is essentially the inverse of figure_alignments.
- seismicrna.demult.write.merge_parts(parts=list[pathlib._local.Path])
- seismicrna.demult.write.process_fq_part(fq_inp: FastqUnit, fqs: tuple[Path, ...], barcodes: RefBarcodes, *, release_dir: Path, branches: dict[str, str], **kwargs)
Demultiplex one split part of a FASTQ file.
- Parameters:
fq_inp (FastqUnit) – The original (unsplit) FASTQ unit, used for metadata such as sample name and Phred encoding.
fqs (tuple[Path, ...]) – Paths to the split FASTQ file(s) for this part.
barcodes (RefBarcodes) – Barcode definitions used to assign reads to references.
release_dir (Path) – Top-level directory where demultiplexed output files are staged before being moved to the final output directory.
branches (dict[str, str]) – Mapping of pipeline step names to branch names, used when constructing output file paths.
- seismicrna.demult.write.split_fq(fq_inp: FastqUnit, working_dir: Path, num_split: int)
Split a FASTQ file (or pair) into parts for parallel processing.
- Parameters:
fq_inp (FastqUnit) – The input FASTQ unit (single-end or paired-end) to split.
working_dir (Path) – Temporary working directory under which split files are written.
num_split (int) – Number of parts to split into; if <= 1 the file is symlinked rather than copied.
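Splitting must keep each four-line FASTQ record intact. The sketch below shows one way to do that on a list of lines held in memory; split_records_sketch is a hypothetical name, and the contiguous-chunk strategy is an assumption, since how split_fq actually divides records is not described here.

```python
def split_records_sketch(lines: list[str], num_split: int) -> list[list[str]]:
    """Split FASTQ lines into up to num_split parts of whole records."""
    if len(lines) % 4:
        raise ValueError("A FASTQ file must contain 4 lines per record")
    if num_split <= 1:
        return [lines]  # nothing to split; reuse the input as the only part
    num_records = len(lines) // 4
    # Ceiling division so that no more than num_split parts are produced.
    per_part = max(1, -(-num_records // num_split))
    return [lines[4 * start: 4 * (start + per_part)]
            for start in range(0, num_records, per_part)]


fastq = [line
         for i in range(5)
         for line in (f"@read{i}", "ACGT", "+", "IIII")]
parts = split_records_sketch(fastq, 3)
print([len(part) // 4 for part in parts])  # [2, 2, 1]
```

With this chunking, fewer than num_split parts can result when the records do not divide evenly (4 records split 3 ways gives 2 parts), so a caller should count the parts actually produced.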