Algorithm for Sequence Compression/Decompression

Background on Sequence Compression/Decompression

In a nucleic acid sequence, each base is represented by one character in ASCII code and hence occupies 1 byte (8 bits). But because each base has only 4 possibilities, each base requires only log₂(4) = 2 bits, shown in the following table:

Base	2-Bit Binary Code
A	`00`
C	`01`
G	`10`
T/U	`11`

For efficient storage of (long) nucleic acid sequences, the following algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte, and then restores the original sequence when needed.

Algorithm for Sequence Compression

Algorithm for Sequence Compression: Procedure

First, attributes of the nucleic acid sequence are recorded:

rna (bool): whether the sequence is RNA
length (int): number of bases in the sequence
ns (tuple[int, ...]): positions of N bases, if any

The type and the length are scalar values with constant sizes regardless of the length of the sequence. The 0-indexed positions of ambiguous bases is an array that at worst (if all bases are N) scales linearly with the length of the sequence and at best (if no bases are N) is a constant (small) size. Because most reference sequences contain no or very few bases that are N, recording the N positions generally requires little space.

Then, each non-overlapping segment of four bases is encoded as one byte by concatenating the 2-bit codes (above) of the bases in the segment, in reverse order. Ambiguous (N) bases are arbitrarily encoded as 00. Because this step requires the length of the sequence to be a multiple of 4, the sequence is padded on its 3’ side with A (an arbitrary choice) until its length is a multiple of 4.

Algorithm for Sequence Compression: Example

Given the DNA sequence CAGNTTCGAN, the attributes are extracted:

rna: False
length: 10
ns: (3, 9) (note: these positions are 0-indexed)

The sequence is then padded with A at the 3’ end until its length becomes a multiple of 4 (in this case, length 12):

CAGNTTCGANAA

Then, each 4-base segment is transformed into one byte by encoding each base and concatenating the codes in order from code 4 to code 1:

Number	Sequence	Base 1	Base 2	Base 3	Base 4	Code 4	Code 3	Code 2	Code 1	Byte
1	CAGN	C	A	G	N	00	10	00	01	00100001
2	TTCG	T	T	C	G	10	01	11	11	10011111
3	ANAA	A	N	A	A	00	00	00	00	00000000

Thus, the compressed byte string is [00100001, 10011111, 00000000]. Note that this string is only 3 bytes, compared to 10 for the original. As a bytes object, the representation is b'!\x9f\x00'.

Algorithm for Sequence Decompression

Algorithm for Sequence Decompression: Procedure

Beginning with a compressed sequence of bytes, each byte is transformed into four bases by decoding each 2-bit chunk to a base using the above table, then reversing the order of the four bases. The code 11 is decoded to U if the rna attribute is True and to T if False. The sequence at this point must have a number of bases divisible by 4. It is cut to the correct number of bases using the length attribute. Finally, every position in the attribute ns is masked to N.

Algorithm for Sequence Decompression: Example

Suppose that a compressed sequence has the following attributes:

compressed byte string: [00100001, 10011111, 00000000]
rna: False
length: 10
ns: (3, 9)

To decompress the sequence, each byte is split into four 2-bit segments, decoded, reversed, and reassembled:

Number	Byte	Code 1	Code 2	Code 3	Code 4	Base 4	Base 3	Base 2	Base 1	Sequence
1	00100001	00	10	00	01	C	A	G	A	CAGA
2	10011111	10	01	11	11	T	T	C	G	TTCG
3	00000000	00	00	00	00	A	A	A	A	AAAA

The resulting sequence, CAGATTCGAAAA, is trimmed to 10 nt: CAGATTCGAA. Finally, (0-indexed) positions 3 and 9 are replaced with N: CAGNTTCGAN.