FASTQ: Sequencing Reads
FASTQ file: Content format
In a FASTQ file, each sequencing read record comprises four lines:
A header that begins with @ and contains the name of the read.
The sequence of the read.
A second header that begins with + and may repeat the read name.
The quality string, of the same length as the read, indicating the quality of each base in the read using Phred quality score encodings.
See FASTQ format for more information.
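The four-line layout can be sketched as a small reader. This is an illustrative sketch, not SEISMIC-RNA's own parser: the function name read_fastq is hypothetical, and it assumes every record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        header = header.rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"header does not begin with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"second header does not begin with '+': {plus!r}")
        if len(quality) != len(sequence):
            raise ValueError(f"quality and sequence lengths differ in {header!r}")
        # Drop the leading '@' so that only the read name remains.
        yield header[1:], sequence, quality


example = (
    "@Read_1_ID\n"
    "GCATGCTAGCCA\n"
    "+\n"
    "FFFFFFFFF:F:\n"
    "@Read_2_ID\n"
    "ATCGTCATGTGT\n"
    "+\n"
    "FFFFFFF:FFFF\n"
)
records = list(read_fastq(io.StringIO(example)))
for name, sequence, quality in records:
    print(name, sequence, quality)
```

Checking that the quality string and the sequence have equal lengths catches most truncated or corrupted records early.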
FASTQ character sets
FASTQ DNA alphabet
Read sequences may contain only A, C, G, T, and N.
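A read sequence could be checked against this alphabet as follows. This is only a sketch: the names FASTQ_DNA_ALPHABET and invalid_bases are hypothetical, and the check is assumed to be case-sensitive (lower-case letters are reported as invalid).

```python
# The five characters allowed in read sequences, per the rule above.
FASTQ_DNA_ALPHABET = frozenset("ACGTN")


def invalid_bases(sequence: str) -> list:
    """Return the sorted characters in the sequence that are not A, C, G, T, or N."""
    return sorted(set(sequence) - FASTQ_DNA_ALPHABET)


print(invalid_bases("GCATGCTAGCCA"))
print(invalid_bases("ACGUX"))
```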
Phred quality score encodings
Phred quality scores represent the confidence that a base in a FASTQ file was called correctly during sequencing. The probability p that a base was called incorrectly is 10 raised to the power of the quotient of the Phred score s and -10:
$p = 10^{-s/10}$
For example, if a base call has a Phred score of 30, the probability that the base call is incorrect is $10^{-30/10} = 10^{-3} = 0.001$.
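The relationship between Phred score and error probability can be checked numerically. The function names below are hypothetical and the block is a minimal sketch of the formula above, together with its inverse.

```python
import math


def phred_to_error_prob(score: float) -> float:
    """Probability that a base call with this Phred score is incorrect."""
    return 10.0 ** (-score / 10.0)


def error_prob_to_phred(prob: float) -> float:
    """Inverse of the formula: the Phred score for an error probability."""
    return -10.0 * math.log10(prob)


for score in (10, 20, 30, 40):
    print(score, phred_to_error_prob(score))
```

Each increase of 10 in the Phred score lowers the error probability by a factor of 10.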
In FASTQ files, each Phred quality score (a non-negative integer) is
encoded as one character of text by adding another integer N to the
Phred score (Phred+N) and then converting the sum to the character
with the corresponding ASCII code.
For example, if N is 33, then the Phred score 25 would be encoded by
adding 33 to 25 (obtaining 58), then writing the character whose ASCII
code is 58 (which is :).
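The Phred+N encoding can be sketched in a few lines. The function names and the default offset of 33 are assumptions made for illustration; the offset corresponds to the integer N described above.

```python
def encode_phred(scores, offset: int = 33) -> str:
    """Encode Phred scores as a quality string (Phred+offset)."""
    return "".join(chr(score + offset) for score in scores)


def decode_phred(quality: str, offset: int = 33) -> list:
    """Decode a quality string back into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


print(encode_phred([25]))           # the character with ASCII code 58
print(decode_phred("FFFFFFFFF:F:"))
```

With an offset of 33, the characters F and : that appear in the example files on this page decode to Phred scores 37 and 25, respectively.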
Endedness: single-end and paired-end reads
Illumina sequencers can be run in two main modes: single-end mode, in which each molecule is read starting from one end; and paired-end mode, in which each molecule is read from both ends. For more details on endedness, see this guide from Illumina.
Each paired-end read comprises a pair of so-called “mates”: mate 1 and mate 2. Each mate is similar to a single-end read, the difference being that it is paired with and shares its name with its mate. There are two options for paired-end reads:
A single “interleaved” file in which mates 1 and 2 alternate, each mate 2 coming directly after the mate 1 to which it is paired.
Two separate files containing all mate 1 reads and all mate 2 reads, respectively, in the same order in both files.
In SEISMIC-RNA (as well as many other pieces of software), each FASTQ file must contain only one type of reads:
single-end reads
paired-end reads, with interleaved 1st and 2nd mates
paired-end reads, with 1st mates only
paired-end reads, with 2nd mates only
Thus, we refer to an entire FASTQ file as “single-end” if it contains only single-end reads, “interleaved paired-end” if it contains only interleaved 1st and 2nd mates, and “separate paired-end” if it contains either 1st or 2nd mates of paired-end reads.
The following diagrams illustrate single-end, separate paired-end, and interleaved paired-end FASTQ files.
FASTQ file with single-end reads
1 file, 4 lines per read
name.fq
@Read_1_ID
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_2_ID
ATCGTCATGTGT
+
FFFFFFF:FFFF
FASTQ file with interleaved paired-end reads
1 file, 8 lines per paired read (4 for mate 1, then 4 for mate 2)
name.fq
@Read_1_ID/1
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_1_ID/2
TACGTCGTCGTC
+
FFFFF:FF:F::
@Read_2_ID/1
ATCGTCATGTGT
+
FFFFFFF:FFFF
@Read_2_ID/2
CACGAGCGATAG
+
FFFF:FF:::F:
Pair of FASTQ files with 1st and 2nd mates in separate files
2 files, 4 lines per mate in each file
name_R1.fq
@Read_1_ID/1
GCATGCTAGCCA
+
FFFFFFFFF:F:
@Read_2_ID/1
ATCGTCATGTGT
+
FFFFFFF:FFFF
name_R2.fq
@Read_1_ID/2
TACGTCGTCGTC
+
FFFFF:FF:F::
@Read_2_ID/2
CACGAGCGATAG
+
FFFF:FF:::F:
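The two paired-end layouts carry the same information, so an interleaved file can be split into separate mate 1 and mate 2 files. The sketch below is illustrative only (split_interleaved is a hypothetical helper, not a SEISMIC-RNA function) and assumes four lines per mate with every mate 1 followed directly by its mate 2.

```python
import io
from typing import TextIO


def split_interleaved(handle: TextIO, out1: TextIO, out2: TextIO) -> int:
    """Copy mate 1 records to out1 and mate 2 records to out2.

    Return the number of read pairs written.
    """
    n_pairs = 0
    while True:
        mate1 = [handle.readline() for _ in range(4)]
        if not mate1[0]:
            break  # end of file
        mate2 = [handle.readline() for _ in range(4)]
        if not mate2[3]:
            raise ValueError("mate 1 record without a complete mate 2 record")
        out1.writelines(mate1)
        out2.writelines(mate2)
        n_pairs += 1
    return n_pairs


interleaved = "\n".join([
    "@Read_1_ID/1", "GCATGCTAGCCA", "+", "FFFFFFFFF:F:",
    "@Read_1_ID/2", "TACGTCGTCGTC", "+", "FFFFF:FF:F::",
    "@Read_2_ID/1", "ATCGTCATGTGT", "+", "FFFFFFF:FFFF",
    "@Read_2_ID/2", "CACGAGCGATAG", "+", "FFFF:FF:::F:",
]) + "\n"
mate1_out, mate2_out = io.StringIO(), io.StringIO()
n_pairs = split_interleaved(io.StringIO(interleaved), mate1_out, mate2_out)
print(n_pairs)
print(mate1_out.getvalue(), end="")
```

Applied to the interleaved example above, the two outputs have the same contents as name_R1.fq and name_R2.fq.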
FASTQ file: Path format
FASTQ file extensions
SEISMIC-RNA accepts the following extensions for FASTQ files:
Compressed (with gzip): .fq.gz (default), .fastq.gz
Uncompressed: .fq (default), .fastq
Note
SEISMIC-RNA accepts FASTQ files that are compressed with gzip.
It is recommended to always use compressed FASTQ files because FASTQ
files are typically very large without compression, on the order of
100 MB to 10 GB.
The file extension will be preserved through the workflow, i.e. if
an input FASTQ file has the extension .fq.gz, then the trimmed
FASTQ file (if any) will also have that extension and be compressed.
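A script that reads FASTQ files directly can choose between gzip and plain-text reading based on the extensions listed above. This is a sketch using Python's standard gzip module; the helper names fastq_extension and open_fastq are hypothetical and do not describe how SEISMIC-RNA opens files internally.

```python
import gzip
from pathlib import Path

# Extensions listed above.
GZIP_EXTENSIONS = (".fq.gz", ".fastq.gz")
PLAIN_EXTENSIONS = (".fq", ".fastq")


def fastq_extension(path) -> str:
    """Return the FASTQ extension of the path, or raise ValueError."""
    name = Path(path).name
    for ext in GZIP_EXTENSIONS + PLAIN_EXTENSIONS:
        if name.endswith(ext):
            return ext
    raise ValueError(f"not a recognized FASTQ extension: {name!r}")


def open_fastq(path):
    """Open a FASTQ file for reading as text, decompressing if gzipped."""
    if fastq_extension(path) in GZIP_EXTENSIONS:
        return gzip.open(path, "rt")
    return open(path, "rt")


print(fastq_extension("project/sienna.fq.gz"))
```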
FASTQ mate 1 and 2 labels
For paired-end reads whose 1st and 2nd mates are in separate files, the file names must have one of the following labels before the extension:
Mate 1: _R1, _mate1, _1_sequence, _R1_001, _mate1_001, _1_sequence_001
Mate 2: _R2, _mate2, _2_sequence, _R2_001, _mate2_001, _2_sequence_001
For example, a sample named sample-26 consisting of paired-end reads
could have the FASTQ files sample-26_R1.fq and sample-26_R2.fq.
If you would like future versions to support additional file extensions, please request them by creating an issue (see Bugs and Requests).
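The mate labels above can be recognized with simple suffix matching. The sketch below uses only the labels and extensions listed on this page; the function name split_mate_label is hypothetical, and the matching order is an assumption rather than SEISMIC-RNA's actual logic.

```python
MATE1_LABELS = ("_R1", "_mate1", "_1_sequence",
                "_R1_001", "_mate1_001", "_1_sequence_001")
MATE2_LABELS = ("_R2", "_mate2", "_2_sequence",
                "_R2_001", "_mate2_001", "_2_sequence_001")
FASTQ_EXTENSIONS = (".fq.gz", ".fastq.gz", ".fq", ".fastq")


def split_mate_label(file_name: str):
    """Return (base name, mate number); the mate is None if there is no label."""
    for ext in FASTQ_EXTENSIONS:
        if file_name.endswith(ext):
            stem = file_name[:-len(ext)]
            break
    else:
        raise ValueError(f"not a FASTQ file name: {file_name!r}")
    for mate, labels in ((1, MATE1_LABELS), (2, MATE2_LABELS)):
        for label in labels:
            if stem.endswith(label):
                return stem[:-len(label)], mate
    return stem, None


print(split_mate_label("sample-26_R1.fq"))
print(split_mate_label("sample-26_R2.fq"))
```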
FASTQ path parsing
For FASTQ files from whole samples (-x, -y, -z), the sample
name is taken from the file name (dropping the mate number, if any).
For example, the single-end FASTQ -z project/sienna.fq
would be parsed to have sample name sienna.
And the separate paired-end FASTQ -x project/lavender_R1.fq
would be parsed to have sample name lavender.
For demultiplexed FASTQ files (-X, -Y, -Z), the reference
name is taken from the file name (dropping the mate number, if any),
and the sample name is taken from the directory of the FASTQ file.
For example, the single-end FASTQ -Z project/azure/ochre.fq
would be parsed to have sample azure and reference ochre.
And the separate paired-end FASTQ -X project/lilac/teal_R2.fq
would be parsed to have sample lilac and reference teal.
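The parsing rules above can be sketched with pathlib and a regular expression. This is an illustration of the rules as described here, not SEISMIC-RNA's implementation: the function names are hypothetical, and the pattern is assumed to cover only the mate labels and extensions listed on this page.

```python
import re
from pathlib import Path

# Optional mate label, then a FASTQ extension, at the end of the file name.
_LABEL_AND_EXT = re.compile(
    r"(?:(?:_(?:R|mate)[12]|_[12]_sequence)(?:_001)?)?"
    r"\.(?:fq|fastq)(?:\.gz)?$"
)


def base_name(path) -> str:
    """File name without its FASTQ extension and mate label (if any)."""
    name = Path(path).name
    stripped, n_subs = _LABEL_AND_EXT.subn("", name)
    if n_subs == 0:
        raise ValueError(f"not a FASTQ file name: {name!r}")
    return stripped


def parse_whole_sample(path) -> str:
    """Sample name for a FASTQ from a whole sample (-x, -y, -z)."""
    return base_name(path)


def parse_demultiplexed(path):
    """(sample, reference) for a demultiplexed FASTQ (-X, -Y, -Z)."""
    return Path(path).parent.name, base_name(path)


print(parse_whole_sample("project/lavender_R1.fq"))
print(parse_demultiplexed("project/lilac/teal_R2.fq"))
```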
FASTQ file: Uses
FASTQ as input file
Sequencing reads for these commands must be input as FASTQ files:
all
align
FASTQ as output file
The align command outputs a file in FASTQ format containing the unaligned reads from each input FASTQ (with option --bt2-un).
FASTQ as temporary file
The align command writes a temporary FASTQ file for each input FASTQ that it trims with cutadapt (with option --cut).