FASTA: Reference sequences
FASTA file: Content format
In a FASTA file, each sequence record comprises two parts:
A header that contains the name of the sequence.
A body that contains the sequence.
See FASTA format for more information.
FASTA header lines
Every header line must start with the character >.
The name of the sequence must follow this character, on the same line.
Optionally, metadata may follow the name of the sequence after a break
by non-alphanumeric characters such as whitespace or |.
SEISMIC-RNA requires that header lines contain no metadata – i.e. that
all characters after the initial > are part of the sequence name.
This restriction exists because the name of each reference sequence is
incorporated into file paths, and SEISMIC-RNA restricts the characters
allowed in file paths to avoid any potential problems caused by special
characters and whitespace in paths.
If SEISMIC-RNA were to simply ignore characters after the first non-path
character in the header of a FASTA, then the names would not necessarily
match those produced by other tools such as Bowtie 2 that read the FASTA
files directly; and these inconsistencies in names could cause errors.
Thus, to ensure consistent names, SEISMIC-RNA will raise errors if a
FASTA file has any illegal characters in its header lines.
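The header rule above can be sketched in Python as follows. This is a minimal illustration, not SEISMIC-RNA's own code: the function name parse_header is hypothetical, and the set of legal characters (letters, digits, underscores, periods, and hyphens) is an assumption, since this page does not list which characters are allowed in paths.

```python
import re

# Assumption: a legal name contains only letters, digits, underscores,
# periods, and hyphens. The real set is defined by SEISMIC-RNA and may differ.
LEGAL_NAME = re.compile(r"[A-Za-z0-9_.\-]+")


def parse_header(line: str) -> str:
    """Return the sequence name from a FASTA header line.

    Raise ValueError if the line does not start with '>' or if anything
    after the '>' is not a legal name character (so metadata is rejected).
    """
    line = line.rstrip("\n")
    if not line.startswith(">"):
        raise ValueError(f"Header does not start with '>': {line!r}")
    name = line[1:]
    # Every character after '>' must belong to the name; an empty name
    # also fails because the pattern requires at least one character.
    if not LEGAL_NAME.fullmatch(name):
        raise ValueError(f"Illegal characters in header: {line!r}")
    return name
```

Under these assumptions, a header such as >ref_1 yields the name ref_1, while >ref_1 some metadata raises an error rather than being silently truncated.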
FASTA body lines
The remaining lines in each record encode the sequence. SEISMIC-RNA can parse sequences that obey the following rules:
Alphabet: A, C, G, and N are valid characters for DNA and RNA; T and U are also valid for DNA and RNA, respectively. Lowercase equivalents are also valid but will be cast to uppercase. All other characters (including whitespace) are illegal.
Sequence lengths: Arbitrary lengths are supported, from zero to the maximum number of nucleotides that will fit in your system’s memory.
Line lengths: Arbitrary lengths are supported, up to the line length limit imposed by your system.
Blank lines: Blank lines (i.e. containing only a newline character) are simply ignored, but lines containing other whitespace characters are illegal.
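The body rules above can be illustrated with a short Python sketch. The function parse_body and its rna flag are hypothetical names chosen for this example, and the sketch assumes that lines end with a single newline character, as they do when a file is read in Python's text mode.

```python
def parse_body(lines, rna=False):
    """Join FASTA body lines into one uppercase sequence.

    Raise ValueError if any line contains a character outside the
    alphabet, including spaces and tabs.
    """
    # A, C, G, and N are shared; T is valid for DNA and U for RNA.
    alphabet = set("ACGN") | ({"U"} if rna else {"T"})
    parts = []
    for line in lines:
        # Remove only the newline, so that any other whitespace stays in
        # the text and is rejected by the alphabet check below.
        text = line.rstrip("\n")
        if not text:
            # A line containing only a newline is ignored.
            continue
        text = text.upper()  # lowercase is valid but cast to uppercase
        illegal = set(text) - alphabet
        if illegal:
            raise ValueError(f"Illegal characters {sorted(illegal)} in {line!r}")
        parts.append(text)
    return "".join(parts)
```

A body with no lines produces an empty sequence, consistent with the rule that sequence lengths may be zero.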
FASTA file: Path format
FASTA file extensions
SEISMIC-RNA accepts the following extensions for FASTA files:
.fa (default)
.fna
.fasta
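As a rough illustration, the accepted extensions can be checked with Python's pathlib. The helper is_fasta_path is a hypothetical name, and treating the comparison as case-sensitive is an assumption of this sketch.

```python
from pathlib import Path

# Extensions listed above; .fa is the default.
FASTA_EXTS = (".fa", ".fna", ".fasta")


def is_fasta_path(path) -> bool:
    """Return True if the final extension of the path is a FASTA extension."""
    # Assumption: the comparison is case-sensitive, so ".FA" is not accepted.
    return Path(path).suffix in FASTA_EXTS
```

Only the final extension is examined, so a compressed file such as refs.fa.gz would not pass this check.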
FASTA path parsing
The name of an input FASTA file of all reference sequences is used for the following purposes:
Determining if a Bowtie 2 index exists for the FASTA file.
Building a Bowtie 2 index for the FASTA file.
Linking a CRAM file to its reference sequence.
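The first of these purposes could look something like the sketch below. It assumes that the index prefix is the FASTA path with its extension removed; this page does not state how SEISMIC-RNA actually derives the prefix. The six suffixes are those that Bowtie 2 gives to the files of a standard (small) index; large indexes, which end in .bt2l, are not handled here. The function names are hypothetical.

```python
from pathlib import Path

# File suffixes of a standard (small) Bowtie 2 index.
BOWTIE2_SUFFIXES = (
    ".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2",
)


def index_prefix(fasta) -> Path:
    """Return the assumed index prefix: the FASTA path minus its extension."""
    return Path(fasta).with_suffix("")


def index_exists(fasta) -> bool:
    """Return True if all six index files exist next to the FASTA file."""
    prefix = index_prefix(fasta)
    return all(
        Path(f"{prefix}{suffix}").is_file() for suffix in BOWTIE2_SUFFIXES
    )
```

With this convention, refs/human.fa would map to the prefix refs/human, and the index would be considered present only if every one of the six files is found.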
FASTA file: Uses
FASTA as input file
Reference sequences for these commands must be input as FASTA files:
seismic wf
seismic align
seismic relate
seismic fold
FASTA as output file
The align command outputs a file in FASTA format alongside each file in CRAM format (with option --cram).
FASTA as temporary file
The align command writes a temporary FASTA file with a single reference sequence for each demultiplexed FASTQ file, which is used to build an index for Bowtie 2.
The fold command writes a temporary FASTA file, which is used by the program Fold.