FASTA: Reference sequences

FASTA file: Content format

In a FASTA file, each sequence record comprises two parts:

  • A header that contains the name of the sequence.

  • A body that contains the sequence.

See FASTA format for more information.

FASTA header lines

Every header line must start with the character >. The name of the sequence must follow this character, on the same line. In the general FASTA format, metadata may optionally follow the name, separated from it by a non-alphanumeric character such as whitespace or |.

SEISMIC-RNA requires that header lines contain no metadata, i.e. that all characters after the initial > are part of the sequence name. This restriction exists because the name of each reference sequence is incorporated into file paths, and SEISMIC-RNA restricts the characters allowed in file paths to avoid problems caused by special characters and whitespace. If SEISMIC-RNA simply ignored every character after the first illegal character in a header line, then its names would not necessarily match those produced by other tools that read the FASTA file directly, such as Bowtie 2, and these inconsistencies could cause errors. Thus, to ensure consistent names, SEISMIC-RNA raises an error if any header line in a FASTA file contains an illegal character.
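The header rule above can be sketched in a few lines of Python. This is only an illustration, not SEISMIC-RNA's own code: the function name parse_header and the set of permitted characters in NAME_PATTERN are assumptions made for the example, and the characters that SEISMIC-RNA actually allows may differ.

```python
import re

# Hypothetical set of characters allowed in a sequence name. This is an
# assumption for illustration; SEISMIC-RNA's real set may differ.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_.=+-]+")


def parse_header(line: str) -> str:
    """Return the sequence name from a FASTA header line.

    Raise ValueError if the line does not start with '>' or if anything
    after the '>' is not a permitted name character (e.g. metadata
    separated from the name by whitespace or '|').
    """
    line = line.rstrip("\n")
    if not line.startswith(">"):
        raise ValueError(f"Header line does not start with '>': {line!r}")
    name = line[1:]
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Header line has illegal characters: {line!r}")
    return name
```

With this sketch, parse_header(">my_ref") returns the name, whereas a header such as ">my_ref some metadata" raises an error instead of being silently truncated.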

FASTA body lines

The remaining lines in each record encode the sequence. SEISMIC-RNA can parse sequences that obey the following rules:

  • Alphabet: A, C, G, and N are valid characters for both DNA and RNA; T is also valid for DNA, and U is also valid for RNA. Lowercase equivalents are also valid but are converted to uppercase. All other characters (including whitespace) are illegal.

  • Sequence lengths: Arbitrary lengths are supported, from zero to the maximum number of nucleotides that will fit in your system’s memory.

  • Line lengths: Arbitrary lengths are supported, up to the line length limit imposed by your system.

  • Blank lines: Blank lines (i.e. lines containing only a newline character) are ignored, but lines containing any other whitespace characters are illegal.
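As a rough illustration of the body rules above, the following Python sketch reads records from an iterable of lines. It is a minimal example under stated assumptions, not SEISMIC-RNA's parser: the function parse_fasta and its rna parameter are hypothetical names, and header names are not validated here.

```python
def parse_fasta(lines, rna=False):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration. Set rna=True to accept U
    instead of T.
    """
    alphabet = set("ACGNU" if rna else "ACGNT")
    name = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header ends the previous record (which may be empty).
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif line == "":
            # Blank lines (only a newline character) are ignored.
            continue
        else:
            if name is None:
                raise ValueError("Sequence data before the first header")
            seq = line.upper()  # lowercase is valid but cast to uppercase
            illegal = set(seq) - alphabet
            if illegal:
                # Whitespace inside a line lands here too.
                raise ValueError(f"Illegal characters: {sorted(illegal)}")
            chunks.append(seq)
    if name is not None:
        yield name, "".join(chunks)
```

Because the function is a generator, errors surface only when the records are consumed, for example with list(parse_fasta(open_file)).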

FASTA file: Path format

FASTA file extensions

SEISMIC-RNA accepts the following extensions for FASTA files:

  • .fa (default)

  • .fna

  • .fasta
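A check against the extensions listed above might look like the following Python sketch. The names FASTA_EXTS and is_fasta_path are hypothetical, and the comparison is case-sensitive, which is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed above; .fa is the default.
FASTA_EXTS = (".fa", ".fna", ".fasta")


def is_fasta_path(path) -> bool:
    """Return True if the final extension of the path is a FASTA extension.

    Assumption: the match is case-sensitive and considers only the last
    suffix, so 'refs.fa.gz' is not treated as a FASTA file.
    """
    return Path(path).suffix in FASTA_EXTS
```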

FASTA path parsing

The name of the input FASTA file that contains all reference sequences is used for the following purposes:

  • Determining if a Bowtie 2 index exists for the FASTA file.

  • Building a Bowtie 2 index for the FASTA file.

  • Linking a CRAM file to its reference sequence.
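To make the first purpose concrete, here is a hedged Python sketch of how one could test whether a Bowtie 2 index accompanies a FASTA file. It assumes that the index prefix is the FASTA path with its extension removed; that convention, and the names index_prefix and index_exists, are assumptions for illustration and are not confirmed as SEISMIC-RNA's behavior. The six file suffixes are those that bowtie2-build writes for a small index.

```python
from pathlib import Path

# Files written by bowtie2-build for a small index. Large indexes use
# the extension .bt2l instead and are not covered by this sketch.
BOWTIE2_INDEX_SUFFIXES = (
    ".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2",
)


def index_prefix(fasta) -> Path:
    """Return the assumed index prefix: the FASTA path minus its extension."""
    return Path(fasta).with_suffix("")


def index_exists(fasta) -> bool:
    """Return True if every expected index file exists for the prefix."""
    prefix = index_prefix(fasta)
    return all(
        Path(f"{prefix}{suffix}").is_file()
        for suffix in BOWTIE2_INDEX_SUFFIXES
    )
```

Under this assumption, a FASTA file at refs/human.fa would be considered indexed only if refs/human.1.bt2 and the other five files are all present.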

FASTA file: Uses

FASTA as input file

Reference sequences for these commands must be input as FASTA files:

  • seismic wf

  • seismic align

  • seismic relate

  • seismic fold

FASTA as output file

  • With the option --cram, the align command outputs a file in FASTA format alongside each file in CRAM format.

FASTA as temporary file

  • The align command writes a temporary FASTA file with a single reference sequence for each demultiplexed FASTQ file, which is used to build an index for Bowtie 2.

  • The fold command writes a temporary FASTA file, which is used by the program Fold.