seismicrna.align package

Subpackages

Submodules

class seismicrna.align.fqunit.FastqUnit(*, fastqz: Path | None = None, fastqy: Path | None = None, fastq1: Path | None = None, fastq2: Path | None = None, phred_enc: int, one_ref: bool)

Bases: object

Unified interface for the following sets of sequencing reads:

  • One FASTQ file of single-end reads from one sample

  • One FASTQ file of interleaved, paired-end reads from one sample

  • Two FASTQ files of mate 1 and 2 paired-end reads from one sample

  • One FASTQ file of single-end reads originating from one reference sequence in one sample

  • One FASTQ file of interleaved, paired-end reads originating from one reference sequence in one sample

  • Two FASTQ files of mate 1 and mate 2 paired-end reads originating from one reference sequence in one sample

BOWTIE2_FLAGS = {'fastq1': '-1', 'fastq2': '-2', 'fastqy': '--interleaved', 'fastqz': '-U'}
KEY_DINTER = 'dmfastqy'
KEY_DMATED = 'dmfastqx'
KEY_DSINGLE = 'dmfastqz'
KEY_INTER = 'fastqy'
KEY_MATE1 = 'fastq1'
KEY_MATE2 = 'fastq2'
KEY_MATED = 'fastqx'
KEY_SINGLE = 'fastqz'
MAX_PHRED_ENC = 127
property bowtie2_inputs

Return input file arguments for Bowtie2.

property checksums
exists()

Check if all FASTQ paths in the FastqUnit exist

fields(key: str)
classmethod from_paths(*, phred_enc: int, **fastq_args: Iterable[str | Path])

Yield a FastqUnit for each FASTQ file (or each pair of mate 1 and mate 2 FASTQ files) whose paths are given as strings.

Parameters:
  • phred_enc (int) – ASCII offset for encoding Phred scores.

  • fastq_args (Iterable[str | Path]) – FASTQ files, given as iterables of paths keyed by type: - fastqz: single-end - fastqy: interleaved paired-end - fastqx: mated paired-end - dmfastqz: demultiplexed single-end - dmfastqy: demultiplexed interleaved paired-end - dmfastqx: demultiplexed mated paired-end

Yields:

FastqUnit – FastqUnit representing the FASTQ or pair of FASTQ files. The order is determined primarily by the order of keyword arguments; within each keyword argument, by the order of file or directory paths; and for directories, by the order in which os.path.listdir returns file paths.

get_sample_ref_exts()

Return the sample and reference of the FASTQ file(s).

iter_records(segments: Iterable[tuple[int, int]] = None)

Yields processed sequences (or segments) along with the full FASTQ records from the FastqUnit.

Output structure: a tuple (processed_seqs, full_records) where:

  • processed_seqs:

    • If segments is provided:

      • For paired-end FASTQs, a tuple of lists; each list contains the sliced segments, one per (start, end) tuple.

      • For single-end FASTQs, a one-element tuple containing a list of sliced segments.

    • If segments is None:

      • For paired-end FASTQs, a tuple of two full sequence strings safe-sliced to the length of the longer sequence in the pair.

      • For single-end FASTQs, a one-element tuple with the full sequence safe-sliced to its length.

  • full_records: a tuple containing the full FASTQ records (each record is a list of 4 lines). In paired mode the tuple will have two elements; in single-end mode, one element.

Note: Only the sequence (second line of each record) is processed for segments. The header, plus, and quality lines are read to maintain proper file structure and are returned in the full record.

property kind
property n_reads: int

Number of reads in the FASTQ file(s).

property parent

Return the parent directory of the FASTQ file(s).

property phred_arg
property seg_types: dict[str, tuple[PathSegment, ...]]
to_new(*new_segments: PathSegment, **new_fields)

Return a new FASTQ unit with updated path fields.

exception seismicrna.align.fqunit.MissingFastqMate

Bases: FileNotFoundError

Missing a file in a pair of paired-end FASTQ files.

exception seismicrna.align.fqunit.MissingFastqMate1

Bases: MissingFastqMate

Missing mate 1 in a pair of paired-end FASTQ files.

exception seismicrna.align.fqunit.MissingFastqMate2

Bases: MissingFastqMate

Missing mate 2 in a pair of paired-end FASTQ files.

seismicrna.align.fqunit.count_fastq_reads(fastq_file: Path)

Count the reads in a FASTQ file.

seismicrna.align.fqunit.fastq_gz(fastq_file: Path)

Return whether a FASTQ file is compressed with gzip.

seismicrna.align.fqunit.format_phred_arg(phred_enc: int)

Format a Phred score argument for Bowtie2.

seismicrna.align.fqunit.get_args_count_fastq_reads(fastq_file: Path)

Count the reads in a FASTQ file.

seismicrna.align.fqunit.parse_stdout_count_fastq_reads(process: CompletedProcess)

Parse the output of word count to find the number of reads.

seismicrna.align.fqunit.safe_slice(s: str, start: int, end: int, pad_char: str = ' ') str

Slice a string with out-of-bounds safety, padding as needed.

Uses NumPy for ultra-fast slicing and padding. Ensures output length is exactly end - start, even if the slice goes out of bounds.

Parameters:
  • s (str) – Input string to slice.

  • start (int) – Start index of the slice (0-indexed).

  • end (int) – End index of the slice (exclusive, 0-indexed).

  • pad_char (str) – Character used to pad the result when the slice extends beyond the bounds of s; defaults to a space.

Returns:

Sliced (and possibly padded) string of length end - start, or an empty string if end <= start.

Return type:

str

seismicrna.align.main.run(fasta: str | Path = Sentinel.UNSET, *, fastqz: Iterable[str | Path] = (), fastqy: Iterable[str | Path] = (), fastqx: Iterable[str | Path] = (), dmfastqz: Iterable[str | Path] = (), dmfastqy: Iterable[str | Path] = (), dmfastqx: Iterable[str | Path] = (), phred_enc: int = 33, out_dir: str | Path = './out', keep_tmp: bool = False, branch: str = '', fastp: bool = True, fastp_5: bool = False, fastp_3: bool = True, fastp_w: int = 6, fastp_m: int = 25, fastp_poly_g: str = 'auto', fastp_poly_g_min_len: int = 10, fastp_poly_x: bool = False, fastp_poly_x_min_len: int = 10, fastp_adapter_trimming: bool = True, fastp_adapter_1: str = '', fastp_adapter_2: str = '', fastp_adapter_fasta: str | None = None, fastp_detect_adapter_for_pe: bool = True, fastp_min_length: int = 9, bt2_local: bool = True, bt2_discordant: bool = False, bt2_mixed: bool = False, bt2_dovetail: bool = False, bt2_contain: bool = True, bt2_score_min_e2e: str = 'L,-1,-0.8', bt2_score_min_loc: str = 'L,1,0.8', bt2_i: int = 0, bt2_x: int = 600, bt2_gbar: int = 4, bt2_l: int = 20, bt2_s: str = 'L,1,0.1', bt2_d: int = 4, bt2_r: int = 2, bt2_dpad: int = 2, bt2_orient: str = 'fr', bt2_un: bool = True, seed: int | None = None, min_mapq: int = 25, min_reads: int = 1000, sep_strands: bool = False, f1r2_fwd: bool = False, rev_label: str = '-rev', num_cpus: int = 4, force: bool = False, tmp_pfx='./tmp') list[Path]

Trim FASTQ files and align them to reference sequences.

Parameters:
  • fastqz (Iterable) – FASTQ file(s) of single-end reads [keyword-only, default: ()]

  • fastqy (Iterable) – FASTQ file(s) of paired-end reads with mates 1 and 2 interleaved [keyword-only, default: ()]

  • fastqx (Iterable) – FASTQ files of paired-end reads with mates 1 and 2 in separate files [keyword-only, default: ()]

  • dmfastqz (Iterable) – Demultiplexed FASTQ files of single-end reads [keyword-only, default: ()]

  • dmfastqy (Iterable) – Demultiplexed FASTQ files of paired-end reads interleaved in one file [keyword-only, default: ()]

  • dmfastqx (Iterable) – Demultiplexed FASTQ files of mate 1 and mate 2 reads [keyword-only, default: ()]

  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • out_dir (str | pathlib._local.Path) – Write all output files to this directory [keyword-only, default: ‘./out’]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]

  • fastp (bool) – Use fastp to QC, filter, and trim reads before alignment [keyword-only, default: True]

  • fastp_5 (bool) – Trim low-quality bases from the 5’ ends of reads [keyword-only, default: False]

  • fastp_3 (bool) – Trim low-quality bases from the 3’ ends of reads [keyword-only, default: True]

  • fastp_w (int) – Use this window size (nt) for –fastp-5 and –fastp-3 [keyword-only, default: 6]

  • fastp_m (int) – Use this mean quality threshold for –fastp-5 and –fastp-3 [keyword-only, default: 25]

  • fastp_poly_g (str) – Trim poly(G) tails (two-color sequencing artifacts) from the 3’ end [keyword-only, default: ‘auto’]

  • fastp_poly_g_min_len (int) – Minimum number of Gs to consider a poly(G) tail for –fastp-poly-g [keyword-only, default: 10]

  • fastp_poly_x (bool) – Trim poly(X) tails (i.e. of any nucleotide) from the 3’ end [keyword-only, default: False]

  • fastp_poly_x_min_len (int) – Minimum number of bases to consider a poly(X) tail for –fastp-poly-x [keyword-only, default: 10]

  • fastp_adapter_trimming (bool) – Trim adapter sequences from the 3’ ends of reads [keyword-only, default: True]

  • fastp_adapter_1 (str) – Trim this adapter sequence from the 3’ ends of read 1s [keyword-only, default: ‘’]

  • fastp_adapter_2 (str) – Trim this adapter sequence from the 3’ ends of read 2s [keyword-only, default: ‘’]

  • fastp_adapter_fasta (str | None) – Trim adapter sequences in this FASTA file from the 3’ ends of reads [keyword-only, default: None]

  • fastp_detect_adapter_for_pe (bool) – Automatically detect the adapter sequences for paired-end reads [keyword-only, default: True]

  • fastp_min_length (int) – Discard reads shorter than this length [keyword-only, default: 9]

  • bt2_local (bool) – Align reads in local mode rather than end-to-end mode [keyword-only, default: True]

  • bt2_discordant (bool) – Output paired-end reads whose mates align discordantly [keyword-only, default: False]

  • bt2_mixed (bool) – Attempt to align individual mates of pairs that fail to align [keyword-only, default: False]

  • bt2_dovetail (bool) – Consider dovetailed mate pairs to align concordantly [keyword-only, default: False]

  • bt2_contain (bool) – Consider nested mate pairs to align concordantly [keyword-only, default: True]

  • bt2_score_min_e2e (str) – Discard alignments that score below this threshold in end-to-end mode [keyword-only, default: ‘L,-1,-0.8’]

  • bt2_score_min_loc (str) – Discard alignments that score below this threshold in local mode [keyword-only, default: ‘L,1,0.8’]

  • bt2_i (int) – Discard paired-end alignments shorter than this many bases [keyword-only, default: 0]

  • bt2_x (int) – Discard paired-end alignments longer than this many bases [keyword-only, default: 600]

  • bt2_gbar (int) – Do not place gaps within this many bases from the end of a read [keyword-only, default: 4]

  • bt2_l (int) – Use this seed length for Bowtie2 [keyword-only, default: 20]

  • bt2_s (str) – Seed Bowtie2 alignments at this interval [keyword-only, default: ‘L,1,0.1’]

  • bt2_d (int) – Discard alignments if over this many consecutive seed extensions fail [keyword-only, default: 4]

  • bt2_r (int) – Re-seed reads with repetitive seeds up to this many times [keyword-only, default: 2]

  • bt2_dpad (int) – Pad the alignment matrix with this many bases (to allow gaps) [keyword-only, default: 2]

  • bt2_orient (str) – Require paired mates to have this orientation [keyword-only, default: ‘fr’]

  • bt2_un (bool) – Output unaligned reads to a FASTQ file [keyword-only, default: True]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]

  • min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]

  • min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]

  • sep_strands (bool) – Separate each alignment map into forward- and reverse-strand reads [keyword-only, default: False]

  • f1r2_fwd (bool) – With –sep-strands, consider forward mate 1s and reverse mate 2s to be forward-stranded [keyword-only, default: False]

  • rev_label (str) – With –sep-strands, add this label to each reverse-strand reference [keyword-only, default: ‘-rev’]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

class seismicrna.align.report.AlignRefReport(*, ref: str, demultiplexed: bool, **kwargs)

Bases: BaseAlignReport

classmethod get_file_seg_type()

Type of the last segment in the path.

classmethod get_ident_report_fields()

Identification fields of the report.

class seismicrna.align.report.AlignSampleReport(*, ref: str | None = None, demultiplexed: bool, **kwargs)

Bases: BaseAlignReport

classmethod get_file_seg_type()

Type of the last segment in the path.

class seismicrna.align.report.BaseAlignReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, ABC

classmethod get_checksum_report_fields()

Checksum fields of the report.

classmethod get_param_report_fields()

Parameter fields of the report.

classmethod get_result_report_fields()

Result fields of the report.

classmethod get_step()

Step of the workflow.

class seismicrna.align.report.SplitReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, ABC

classmethod get_checksum_report_fields()

Checksum fields of the report.

classmethod get_file_seg_type()

Type of the last segment in the path.

classmethod get_param_report_fields()

Parameter fields of the report.

classmethod get_result_report_fields()

Result fields of the report.

classmethod get_step()

Step of the workflow.

exception seismicrna.align.split.DuplicateSampleError

Bases: DuplicateValueError

A sample name occurs more than once.

seismicrna.align.split.check_duplicate_samples(xam_files: Iterable[str | Path])

Check if any XAM files have the same sample name.

seismicrna.align.split.run(fasta: str | Path = Sentinel.UNSET, *, input_path: Iterable[str | Path] = Sentinel.UNSET, phred_enc: int = 33, out_dir: str | Path = './out', keep_tmp: bool = False, branch: str = '', bt2_local: bool = True, bt2_discordant: bool = False, bt2_mixed: bool = False, bt2_dovetail: bool = False, bt2_contain: bool = True, bt2_score_min_e2e: str = 'L,-1,-0.8', bt2_score_min_loc: str = 'L,1,0.8', bt2_i: int = 0, bt2_x: int = 600, bt2_gbar: int = 4, bt2_l: int = 20, bt2_s: str = 'L,1,0.1', bt2_d: int = 4, bt2_r: int = 2, bt2_dpad: int = 2, bt2_orient: str = 'fr', seed: int | None = None, min_mapq: int = 25, min_reads: int = 1000, sep_strands: bool = False, f1r2_fwd: bool = False, rev_label: str = '-rev', num_cpus: int = 4, force: bool = False, tmp_pfx='./tmp') list[Path]

Split a BAM file into one file for each reference.

Parameters:
  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • out_dir (str | pathlib._local.Path) – Write all output files to this directory [keyword-only, default: ‘./out’]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • branch (str) – Create a new branch of the workflow with this name [keyword-only, default: ‘’]

  • bt2_local (bool) – Align reads in local mode rather than end-to-end mode [keyword-only, default: True]

  • bt2_discordant (bool) – Output paired-end reads whose mates align discordantly [keyword-only, default: False]

  • bt2_mixed (bool) – Attempt to align individual mates of pairs that fail to align [keyword-only, default: False]

  • bt2_dovetail (bool) – Consider dovetailed mate pairs to align concordantly [keyword-only, default: False]

  • bt2_contain (bool) – Consider nested mate pairs to align concordantly [keyword-only, default: True]

  • bt2_score_min_e2e (str) – Discard alignments that score below this threshold in end-to-end mode [keyword-only, default: ‘L,-1,-0.8’]

  • bt2_score_min_loc (str) – Discard alignments that score below this threshold in local mode [keyword-only, default: ‘L,1,0.8’]

  • bt2_i (int) – Discard paired-end alignments shorter than this many bases [keyword-only, default: 0]

  • bt2_x (int) – Discard paired-end alignments longer than this many bases [keyword-only, default: 600]

  • bt2_gbar (int) – Do not place gaps within this many bases from the end of a read [keyword-only, default: 4]

  • bt2_l (int) – Use this seed length for Bowtie2 [keyword-only, default: 20]

  • bt2_s (str) – Seed Bowtie2 alignments at this interval [keyword-only, default: ‘L,1,0.1’]

  • bt2_d (int) – Discard alignments if over this many consecutive seed extensions fail [keyword-only, default: 4]

  • bt2_r (int) – Re-seed reads with repetitive seeds up to this many times [keyword-only, default: 2]

  • bt2_dpad (int) – Pad the alignment matrix with this many bases (to allow gaps) [keyword-only, default: 2]

  • bt2_orient (str) – Require paired mates to have this orientation [keyword-only, default: ‘fr’]

  • seed (int | None) – Seed for the random number generator [keyword-only, default: None]

  • min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]

  • min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]

  • sep_strands (bool) – Separate each alignment map into forward- and reverse-strand reads [keyword-only, default: False]

  • f1r2_fwd (bool) – With –sep-strands, consider forward mate 1s and reverse mate 2s to be forward-stranded [keyword-only, default: False]

  • rev_label (str) – With –sep-strands, add this label to each reverse-strand reference [keyword-only, default: ‘-rev’]

  • num_cpus (int) – Use up to this many CPUs simultaneously [keyword-only, default: 4]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp’]

seismicrna.align.split.split_xam_file(xam_file: Path, out_dir: Path, tmp_dir: Path, branch: str, fasta: Path, phred_enc: int, force: bool, num_cpus: int, **kwargs)

Sort, index, and split a XAM file into one file per reference.

Parameters:
  • xam_file (Path) – Input XAM (BAM/SAM/CRAM) file to split; its stem is used as the sample name.

  • out_dir (Path) – Directory where the final output files are written.

  • tmp_dir (Path) – Directory for temporary intermediate files.

  • branch (str) – Branch label to embed in the output file paths.

  • fasta (Path) – FASTA file of reference sequences used during alignment.

  • phred_enc (int) – ASCII offset for Phred quality score encoding.

  • force (bool) – Whether to overwrite existing output files.

  • num_cpus (int) – Number of CPU cores to use.

  • **kwargs – Additional keyword arguments forwarded to split_references.

Returns:

Parent directory of the written report file.

Return type:

Path

seismicrna.align.write.align_samples(fq_units: list[FastqUnit], fasta: Path, *, out_dir: Path, branch: str, force: bool, **kwargs) list[Path]

Run the alignment pipeline and return a tuple of all XAM files from the pipeline.

seismicrna.align.write.calc_flags_sep_strands(f1r2_fwd: bool, paired: bool, bt2_mixed: bool)

Calculate flags for separating strands.

seismicrna.align.write.check_fqs_xams(alignments: dict[tuple[str, str], FastqUnit], out_dir: Path, branches: dict[str, str])

Return every FASTQ unit on which alignment must be run and every expected XAM file that already exists.

seismicrna.align.write.extract_reference(ref: str, header: str, *, branches: dict[str, str], xam_whole: Path, fasta: Path, sample: str, top: Path, min_reads: int, sep_strands: bool, num_cpus: int = 1, **kwargs)

Extract one reference from a XAM file.

seismicrna.align.write.format_ref_reverse(ref: str, rev_label: str)

Name the reverse strand of a reference.

seismicrna.align.write.fq_pipeline(fq_inp: FastqUnit, fasta: Path, bowtie2_index: Path, *, out_dir: Path, tmp_dir: Path, keep_tmp: bool, branches: dict[str, str], fastp: bool, fastp_5: bool, fastp_3: bool, fastp_w: int, fastp_m: int, fastp_poly_g: str, fastp_poly_g_min_len: int, fastp_poly_x: bool, fastp_poly_x_min_len: int, fastp_adapter_trimming: bool, fastp_adapter_1: str, fastp_adapter_2: str, fastp_adapter_fasta: Path | None, fastp_detect_adapter_for_pe: bool, fastp_min_length: int, bt2_local: bool, bt2_discordant: bool, bt2_mixed: bool, bt2_dovetail: bool, bt2_contain: bool, bt2_score_min_e2e: str, bt2_score_min_loc: str, bt2_i: int, bt2_x: int, bt2_gbar: int, bt2_l: int, bt2_s: str, bt2_d: int, bt2_r: int, bt2_dpad: int, bt2_orient: str, bt2_un: bool, seed: int | None, min_mapq: int, min_reads: int, sep_strands: bool, f1r2_fwd: bool, rev_label: str, num_cpus: int = 1) Path

Run all stages of the alignment pipeline for one FASTQ file or one pair of mated FASTQ files.

seismicrna.align.write.fqs_pipeline(fq_units: list[FastqUnit], main_fasta: Path, *, num_cpus: int, tmp_dir: Path, **kwargs) list[Path]

Run all stages of alignment for one or more FASTQ files or pairs of mated FASTQ files.

seismicrna.align.write.list_alignments(fq_units: list[FastqUnit], refs: set[str])

List every expected alignment of a sample to a reference.

seismicrna.align.write.list_fqs_xams(alignments: dict[tuple[str, str], FastqUnit], out_dir: Path, branches: dict[str, str])

List every FASTQ to align and every extant XAM file.

seismicrna.align.write.merge_nondemult_fqs(fq_units: Iterable[FastqUnit])

For every FASTQ that is not demultiplexed, merge all the keys that map to the FASTQ into one key: (sample, None). Merging ensures that every non-demultiplexed FASTQ is aligned only once to the whole set of references, not once for every reference in the set. This function is essentially the inverse of figure_alignments.

seismicrna.align.write.separate_strands(xam_file: Path, fasta: Path, *, paired: bool | None, bt2_mixed: bool, rev_label: str, f1r2_fwd: bool, min_mapq: int, keep_tmp: bool, num_cpus: int = 1, **kwargs)

Separate a XAM file into two XAM files of reads that aligned to the forward and reverse strands, respectively.

seismicrna.align.write.split_references(xam_whole: Path, *, fasta: Path, paired: bool | None, phred_arg: str, top: Path, keep_tmp: bool, branches: dict[str, str], bt2_local: bool, bt2_discordant: bool, bt2_mixed: bool, bt2_dovetail: bool, bt2_contain: bool, bt2_score_min_e2e: str, bt2_score_min_loc: str, bt2_i: int, bt2_x: int, bt2_gbar: int, bt2_l: int, bt2_s: str, bt2_d: int, bt2_r: int, bt2_dpad: int, bt2_orient: str, seed: int | None, min_mapq: int, min_reads: int, sep_strands: bool, f1r2_fwd: bool, rev_label: str, num_cpus: int = 1)

Split a XAM file into one file per reference.

seismicrna.align.write.write_tmp_ref_files(tmp_dir: Path, refset_path: Path, refs: set[str], num_cpus: int = 1)

Write temporary FASTA files, each containing one reference that corresponds to a FASTQ file from demultiplexing.

Alignment XAM Generation Module


Alignment Score Parameters for Bowtie2

Consider this example: Ref = ACGT, Read = AG

Assume that we want to minimize the number of edits needed to convert the reference into the read sequence. The smallest number of edits is two, specifically these two deletions (/) from the reference: [A/G/] which gets a score of (2 * match - 2 * gap_open - 2 * gap_extend).

But there are two alternative alignments, each with 3 edits: [Ag//] and [A//g] (substitutions marked in lowercase). Each gets the score (match - substitution - gap_open - 2 * gap_extend).

In order to favor the simpler alignment with two edits, (2 * match - 2 * gap_open - 2 * gap_extend) must be greater than (match - substitution - gap_open - 2 * gap_extend). This inequality simplifies to (substitution > gap_open - match).

Thus, the substitution penalty and match bonus must be relatively large, and the gap open penalty small. We want to avoid introducing too many gaps, especially to prevent the introduction of an insertion and a deletion from scoring better than one substitution.

Consider this example: Ref = ATAT, Read = ACTT

The simplest alignment (the smallest number of mutations) is ActT, which gets a score of (2 * match - 2 * substitution). Another alignment with indels is A{C}T/T, where {C} means a C was inserted into the read and the / denotes an A deleted from the read. This alignment scores (3 * match - 2 * gap_open - 2 * gap_extend).

Thus, (2 * match - 2 * substitution) must be greater than (3 * match - 2 * gap_open - 2 * gap_extend), which simplifies to (2 * gap_open + 2 * gap_extend > match + 2 * substitution).

There are two easy solutions to these inequalities: - Bowtie v2.5 defaults: 6 > 5 - 2 and 2*5 + 2*3 > 2 + 2*6 - Set every score to 1: 1 > 1 - 1 and 2*1 + 2*1 > 1 + 2*1

seismicrna.align.xamops.bowtie2_build_cmd(fasta: Path, prefix: Path, *, num_cpus: int = 1)

Build a Bowtie2 index of a FASTA file.

seismicrna.align.xamops.bowtie2_cmd(fq_inp: FastqUnit | None, sam_out: Path | None, *, paired: bool | None, phred_arg: str | None, index_pfx: Path, bt2_local: bool, bt2_discordant: bool, bt2_mixed: bool, bt2_dovetail: bool, bt2_contain: bool, bt2_score_min_e2e: str, bt2_score_min_loc: str, bt2_i: int, bt2_x: int, bt2_gbar: int, bt2_l: int, bt2_s: str, bt2_d: int, bt2_r: int, bt2_dpad: int, bt2_orient: str, fq_unal: Path | None, seed: int | None, num_cpus: int = 1)

Build the Bowtie2 alignment command.

Parameters:
  • fq_inp (FastqUnit | None) – Input FASTQ file(s); if None, reads come from standard input.

  • sam_out (Path | None) – Output SAM file path; if None, output goes to standard output.

  • paired (bool | None) – Whether reads are paired-end; inferred from fq_inp if None.

  • phred_arg (str | None) – Bowtie2 Phred encoding argument (e.g. “–phred33”); inferred from fq_inp if None.

  • index_pfx (Path) – Prefix path of the Bowtie2 index to align against.

  • bt2_local (bool) – Whether to use local (soft-clipping) alignment mode.

  • bt2_discordant (bool) – Whether to allow discordant paired-end alignments.

  • bt2_mixed (bool) – Whether to allow mixed (partially unpaired) alignments.

  • bt2_dovetail (bool) – Whether to allow dovetail paired-end alignments.

  • bt2_contain (bool) – Whether to allow one mate to contain the other.

  • bt2_score_min_e2e (str) – Minimum alignment score function for end-to-end mode.

  • bt2_score_min_loc (str) – Minimum alignment score function for local mode.

  • bt2_i (int) – Minimum fragment length for valid paired-end alignments.

  • bt2_x (int) – Maximum fragment length for valid paired-end alignments.

  • bt2_gbar (int) – Minimum distance from the end of a read to allow a gap.

  • bt2_l (int) – Length of the seed substring.

  • bt2_s (str) – Interval function between seed substrings.

  • bt2_d (int) – Maximum number of seed extension attempts.

  • bt2_r (int) – Maximum number of re-seeding attempts.

  • bt2_dpad (int) – Number of extra reference bases to pad on either side of the dynamic programming matrix.

  • bt2_orient (str) – Expected orientation of paired-end mates (e.g. “fr”, “rf”).

  • fq_unal (Path | None) – Path to write unaligned reads; if None, unaligned reads are discarded.

  • seed (int | None) – Random seed for reproducible alignment; None for no fixed seed.

  • num_cpus (int) – Number of CPU threads for Bowtie2.

Returns:

Assembled Bowtie2 command ready to run or pipe.

Return type:

ShellCommand

seismicrna.align.xamops.export_cmd(xam_in: Path, xam_out: Path, *, ref: str, header: str, ref_file: Path | None = None, num_cpus: int = 1)

Select and export one reference to its own XAM file.

seismicrna.align.xamops.fastp_cmd(fq_inp: FastqUnit, fq_out: FastqUnit | None, *, fastp_html: Path, fastp_json: Path, fastp_5: bool, fastp_3: bool, fastp_w: int, fastp_m: int, fastp_poly_g: str, fastp_poly_g_min_len: int, fastp_poly_x: bool, fastp_poly_x_min_len: int, fastp_adapter_trimming: bool, fastp_adapter_1: str, fastp_adapter_2: str, fastp_adapter_fasta: Path | None, fastp_detect_adapter_for_pe: bool, fastp_min_length: int, num_cpus: int)

Build the fastp command for quality trimming and adapter removal.

Parameters:
  • fq_inp (FastqUnit) – Input FASTQ file(s) to trim.

  • fq_out (FastqUnit | None) – Output FASTQ file(s); if None, fastp writes trimmed reads to standard output.

  • fastp_html (Path) – Path for the fastp HTML report.

  • fastp_json (Path) – Path for the fastp JSON report.

  • fastp_5 (bool) – Whether to enable 5’-end sliding-window quality trimming.

  • fastp_3 (bool) – Whether to enable 3’-end sliding-window quality trimming.

  • fastp_w (int) – Window size for sliding-window quality trimming.

  • fastp_m (int) – Minimum mean quality within the sliding window.

  • fastp_poly_g (str) – Poly-G tail trimming mode: “yes”, “no”, or “auto”.

  • fastp_poly_g_min_len (int) – Minimum length of poly-G tail to trim.

  • fastp_poly_x (bool) – Whether to trim poly-X tails.

  • fastp_poly_x_min_len (int) – Minimum length of poly-X tail to trim.

  • fastp_adapter_trimming (bool) – Whether to perform adapter trimming.

  • fastp_adapter_1 (str) – Adapter sequence for read 1.

  • fastp_adapter_2 (str) – Adapter sequence for read 2 (ignored for single-end reads).

  • fastp_adapter_fasta (Path | None) – Path to a FASTA file of adapter sequences; used if provided.

  • fastp_detect_adapter_for_pe (bool) – Whether to auto-detect adapters for paired-end reads.

  • fastp_min_length (int) – Minimum read length after trimming; reads shorter than this are discarded (0 disables length filtering).

  • num_cpus (int) – Number of CPU threads for fastp.

Returns:

Assembled fastp command ready to run or pipe.

Return type:

ShellCommand

seismicrna.align.xamops.flags_cmd(*args, **kwargs)

Filter a XAM file based on flags, then collate the output.

seismicrna.align.xamops.flags_cmds(xam_inp: Path, xam_out: Path | None, *, tmp_pfx: Path | None = None, flags_req: int | Iterable[int] = (), flags_exc: int | Iterable[int] = (), num_cpus: int = 1)

Filter a XAM file based on flags, then collate the output.

seismicrna.align.xamops.get_bowtie2_index_paths(prefix: Path)

Return the Bowtie 2 index paths for a FASTA file.

seismicrna.align.xamops.parse_bowtie2(process: CompletedProcess)

Get the number of reads input and aligned.

seismicrna.align.xamops.realign_cmd(xam_inp: Path, xam_out: Path, *, paired: bool | None = None, tmp_pfx: Path | None = None, flags_req: int | Iterable[int] = (), flags_exc: int | Iterable[int] = (), min_mapq: int | None = None, num_cpus: int = 1, **kwargs)

Re-align reads that are already in a XAM file.

seismicrna.align.xamops.xamgen_cmd(fq_inp: FastqUnit, bam_out: Path, *, fastp: bool, fastp_dir: Path, fastp_5: bool, fastp_3: bool, fastp_w: int, fastp_m: int, fastp_poly_g: str, fastp_poly_g_min_len: int, fastp_poly_x: bool, fastp_poly_x_min_len: int, fastp_adapter_trimming: bool, fastp_adapter_1: str, fastp_adapter_2: str, fastp_adapter_fasta: Path | None, fastp_detect_adapter_for_pe: bool, fastp_min_length: int, index_pfx: Path, bt2_local: bool, bt2_discordant: bool, bt2_mixed: bool, bt2_dovetail: bool, bt2_contain: bool, bt2_score_min_e2e: str, bt2_score_min_loc: str, bt2_i: int, bt2_x: int, bt2_gbar: int, bt2_l: int, bt2_s: str, bt2_d: int, bt2_r: int, bt2_dpad: int, bt2_orient: str, fq_unal: Path | None, min_mapq: int, seed: int | None, num_cpus: int = 1)

Wrap QC, alignment, and post-processing into one pipeline.