Algorithm for Sequence Compression/Decompression
Background on Sequence Compression/Decompression
In a nucleic acid sequence, each base is represented by one character in ASCII code and hence occupies 1 byte (8 bits). But because each base has only 4 possibilities, each base requires only log2(4) = 2 bits, shown in the following table:
Base |
2-Bit Binary Code |
---|---|
A |
|
C |
|
G |
|
T/U |
|
For efficient storage of (long) nucleic acid sequences, the following algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte, and then restores the original sequence when needed.
Algorithm for Sequence Compression
Algorithm for Sequence Compression: Procedure
First, attributes of the nucleic acid sequence are recorded:
rna (
bool
): whether the sequence is RNAlength (
int
): number of bases in the sequencens (
tuple[int, ...]
): positions ofN
bases, if any
The type and the length are scalar values with constant sizes regardless
of the length of the sequence.
The 0-indexed positions of ambiguous bases is an array that at worst (if
all bases are N
) scales linearly with the length of the sequence and
at best (if no bases are N
) is a constant (small) size.
Because most reference sequences contain no or very few bases that are
N
, recording the N
positions generally requires little space.
Then, each non-overlapping segment of four bases is encoded as one byte
by concatenating the 2-bit codes (above) of the bases in the segment, in
reverse order.
Ambiguous (N
) bases are arbitrarily encoded as 00
.
Because this step requires the length of the sequence to be a multiple
of 4, the sequence is padded on its 3’ side with A
(an arbitrary
choice) until its length is a multiple of 4.
Algorithm for Sequence Compression: Example
Given the DNA sequence CAGNTTCGAN
, the attributes are extracted:
rna:
False
length:
10
ns:
(3, 9)
(note: these positions are 0-indexed)
The sequence is then padded with A
at the 3’ end until its length
becomes a multiple of 4 (in this case, length 12):
CAGNTTCGANAA
Then, each 4-base segment is transformed into one byte by encoding each base and concatenating the codes in order from code 4 to code 1:
Number |
Sequence |
Base 1 |
Base 2 |
Base 3 |
Base 4 |
Code 4 |
Code 3 |
Code 2 |
Code 1 |
Byte |
---|---|---|---|---|---|---|---|---|---|---|
1 |
CAGN |
C |
A |
G |
N |
00 |
10 |
00 |
01 |
00100001 |
2 |
TTCG |
T |
T |
C |
G |
10 |
01 |
11 |
11 |
10011111 |
3 |
ANAA |
A |
N |
A |
A |
00 |
00 |
00 |
00 |
00000000 |
Thus, the compressed byte string is [00100001, 10011111, 00000000]
.
Note that this string is only 3 bytes, compared to 10 for the original.
As a bytes
object, the representation is b'!\x9f\x00'
.
Algorithm for Sequence Decompression
Algorithm for Sequence Decompression: Procedure
Beginning with a compressed sequence of bytes, each byte is transformed
into four bases by decoding each 2-bit chunk to a base using the above
table, then reversing the order of the four bases.
The code 11
is decoded to U
if the rna attribute is True
and to T
if False
.
The sequence at this point must have a number of bases divisible by 4.
It is cut to the correct number of bases using the length attribute.
Finally, every position in the attribute ns is masked to N
.
Algorithm for Sequence Decompression: Example
Suppose that a compressed sequence has the following attributes:
compressed byte string:
[00100001, 10011111, 00000000]
rna:
False
length:
10
ns:
(3, 9)
To decompress the sequence, each byte is split into four 2-bit segments, decoded, reversed, and reassembled:
Number |
Byte |
Code 1 |
Code 2 |
Code 3 |
Code 4 |
Base 4 |
Base 3 |
Base 2 |
Base 1 |
Sequence |
---|---|---|---|---|---|---|---|---|---|---|
1 |
00100001 |
00 |
10 |
00 |
01 |
C |
A |
G |
A |
CAGA |
2 |
10011111 |
10 |
01 |
11 |
11 |
T |
T |
C |
G |
TTCG |
3 |
00000000 |
00 |
00 |
00 |
00 |
A |
A |
A |
A |
AAAA |
The resulting sequence, CAGATTCGAAAA
, is trimmed to 10 nt: CAGATTCGAA
.
Finally, (0-indexed) positions 3 and 9 are replaced with N
: CAGNTTCGAN
.