Demultiplexing

demultiplexing takes a set of paired-end fastqs and seperates them based on a reference fasta and a cordinates of a barcode DREEM output file.

Key features:

seperating a a set of paired-end fastqs and demultiplexes based on barcodes
can use a secondary signiture to help validate query
creates report on demultiplexing

Get started

Let’s aggregate sample_1.

You need:

the paths to fastqs
the reference genome under the fasta format reference.fasta.
either:
1. the library csv with a reference field, a barcode start index field, a barcode length, and optionally a secondary signiture start index and a secondary signiture length
1. give barcode start index and barcode length

You get:

sample_1.json under the DREEM output format.

CLI

Run the command:

dreem demultiplex —-fasta path/to/reference.fasta --fastq1 path/to/sample_1_R1.fastq --fastq2 path/to/sample_1_R2.fastq --barcode-start X --barcode-length Y

Python

Run the following code:

from ..demultiplex.demultiplex import demultiplex_run
demultiplex_run(
                mixed_fastq1 = path/to/sample_1_R1.fastq,
                mixed_fastq2 = path/to/sample_1_R1.fastq,
                barcode_start = X,
                barcode_length = Y,
                fasta = path/to/reference.fasta)

See the Command Reference for more details.

Examples

CLI

Run the command:

dreem demultiplex —-fasta path/to/reference.fasta --fastq1 path/to/sample_1_R1.fastq --fastq2 path/to/sample_1_R2.fastq --barcode-start X --barcode-length Y

Python

Run the following code:

from ..demultiplex.demultiplex import demultiplex_run
demultiplex_run(
                mixed_fastq1 = path/to/sample_1_R1.fastq,
                mixed_fastq2 = path/to/sample_1_R1.fastq,
                barcode_start = X,
                barcode_length = Y,
                fasta = path/to/reference.fasta)

API Reference

The arguments for the CLI and Python API are the same. The CLI is just a wrapper around the Python API.

CLI cmds

dreem demultiplex

dreem demultiplex [OPTIONS]

Options

--fasta <fasta>: FASTA file of all reference sequences in the project

--fastq1 <fastq1>: FASTQ files of mate 1 paired-end reads

--fastq2 <fastq2>: FASTQ files of mate 2 paired-end reads

--library <library>: Library CSV file

--barcode-start <barcode_start>: index of start of barcode

--barcode-length <barcode_length>: length of barcode

--out-dir <out_dir>: Where to output all finished files

--temp-dir <temp_dir>: Where to write all temporary files

--parallel-demultiplexing <parallel_demultiplexing>: Whether to run demultiplexing at maximum speed by submitting multithreaded grep functions

--clipped <clipped>: Designates the amount of clipped patterns to search for in the sample, will raise compution time

--mismatch-tolerence <mismatch_tolerence>: Designates the allowable amount of mismatches allowed in a string and still be considered a valid pattern find. will increase non-parallel computation at a factorial rate. use caution going above 2 mismatches. does not apply to clipped sequences.

--index-tolerance <index_tolerance>: Designates the allowable amount of distance you allow the pattern to be found in a read from the reference index

--demulti-overwrite <demulti_overwrite>: desiginates whether to overwrite the grepped fastq. should only be used if changing setting on the same sample

Python args

dreem.demultiplex.run(library: str, out_dir: str, temp_dir: str, fastq1: str, fastq2: str, fasta: str, barcode_start=0, barcode_length=0, clipped: int = 0, index_tolerance: int = 0, parallel_demultiplexing: bool = False, mismatch_tolerence: int = 0, demulti_overwrite: bool = False)

I/O files

This module needs the files to be sorted into a specific folder structure. The folder structure is described below.

Input

As input, demultiplex requires either a library.csv file with the barcode_start_index and barcode_length, but this can be alternatively be given as input in the CLI:

sample_1_R1              # <=> a fastq file <<< give this path to be demultiplexed
sample_1_R2              # <=> a fastq file <<< give this path to be demultiplexed
sample_ref               # <=> a fasta file <<< give this path of references

Output

As output, demultiplex provides a lot of temporary files but the primary output is the demultiplexed fastqs. The folder structure is as follows:

temp/              # <=> a fastq file <<< give this path to aggregate the entire sample
    |- sample_1/    # <=> holds all the demultiplex temporary files
        |- sample_1_R1_pickle_split/  # <=> holds paritioned fastqs stored as dictionaries, usually split into 10
            |- split_0.fastq   # <=> a split fastq
            |- split_0.p       # <=> a split fastq stored as dictionary as python pickle
            |- split_1.fastq
            |- split_1.p
        |- sample_1_R2_pickle_split/
            |- split_0.fastq
            |- split_0.p
            |- split_1.fastq
            |- split_1.p
        |- sample_fqs/    # <=> holds all
            |- reference_1_R1.fastq      # <=> fastq representing reads found from sample_1_R1
            |- reference_1_R2.fastq      # <=> fastq representing reads found from sample_1_R2
            |- reference_2_R1.fastq
            |- reference_2_R2.fastq
        |- sequence_data/    # <=> holds all the temporary data used in demultiplexing
            |- reference_1/     # <=> the references temporary data
                |- fq1/
                    |- specific_pattern_grepped.fastq # <=> demultiplexing requires several different types of grepping to isolate what reads belong to what reference
                    |- complete_set_of_reads.p   # <=> the total set of read ids found post
                    |- read_id_data.p  # <=> the read ids found by what type of grep
                    |- unfiltered.fastq # <=> the complete set of reads before a secondary signiture filtering process
                |- fq2/
                    |- specific_pattern_grepped.fastq
                    |- complete_set_of_reads.p
                    |- read_id_data.p
                    |- unfiltered.fastq
            |- reference_2/
                |- fq1/
                    |- specific_pattern_grepped.fastq
                    |- complete_set_of_reads.p
                    |- read_id_data.p
                    |- unfiltered.fastq
                |- fq2/
                    |- specific_pattern_grepped.fastq
                    |- complete_set_of_reads.p
                    |- read_id_data.p
                    |- unfiltered.fastq
            |- multigrepped_reads.p     # <=> dictionary mapping read ids to a list of the references to which it mapped
        |-demultiplex_info.csv          # <=> a csv showing the read counts found for each type of query

Key algorithm(s)

The main algorithm of this code involves using a grep function from seqkit (https://bioinf.shenwei.me/seqkit/) which is essentially just grep for fastq. For those that do not know how grep works, please read this article (https://en.wikipedia.org/wiki/Grep). The demultiplexing code here relies on this tool to query the fastq for specific patterns that connect it to a reference sequence. This usually means querying for a sequnece of nucleotides unique to each reference.

For example:

Edge case handling

Index Tolerence:

Giving a index tolerence on index too close to the start or end of the sequence will result in a error if not handled. This currently is handled by taking the value that would go below zero and instead adding it to the oppostie side of the index tolerence

Fastq cases:

Giving two Fastqs that do not have the same number of reads

this can be because they are corrupted
erroronoeus input
- fastqs from different samples
- incorrect file path
these issues are dealth with with in the object super_fastq

Mutliple Barcode:

If barcodes are not desgined well enough and they appear mutliple times in the sequence, it will cause issues

Contributors: Scott Grote