Algorithm for Sequence Compression/Decompression
Background on Sequence Compression/Decompression
In a nucleic acid sequence, each base is represented by one character in ASCII code and hence occupies 1 byte (8 bits). But because each base has only 4 possibilities, each base requires only log2(4) = 2 bits, shown in the following table:
Base |
2-Bit Binary Code |
|---|---|
A |
|
C |
|
G |
|
T/U |
|
For efficient storage of (long) nucleic acid sequences, the following algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte, and then restores the original sequence when needed.
Algorithm for Sequence Compression
Algorithm for Sequence Compression: Procedure
First, attributes of the nucleic acid sequence are recorded:
rna (
bool): whether the sequence is RNAlength (
int): number of bases in the sequencens (
tuple[int, ...]): positions ofNbases, if any
The type and the length are scalar values with constant sizes regardless
of the length of the sequence.
The 0-indexed positions of ambiguous bases is an array that at worst (if
all bases are N) scales linearly with the length of the sequence and
at best (if no bases are N) is a constant (small) size.
Because most reference sequences contain no or very few bases that are
N, recording the N positions generally requires little space.
Then, each non-overlapping segment of four bases is encoded as one byte
by concatenating the 2-bit codes (above) of the bases in the segment, in
reverse order.
Ambiguous (N) bases are arbitrarily encoded as 00.
Because this step requires the length of the sequence to be a multiple
of 4, the sequence is padded on its 3’ side with A (an arbitrary
choice) until its length is a multiple of 4.
Algorithm for Sequence Compression: Example
Given the DNA sequence CAGNTTCGAN, the attributes are extracted:
rna:
Falselength:
10ns:
(3, 9)(note: these positions are 0-indexed)
The sequence is then padded with A at the 3’ end until its length
becomes a multiple of 4 (in this case, length 12):
CAGNTTCGANAA
Then, each 4-base segment is transformed into one byte by encoding each base and concatenating the codes in order from code 4 to code 1:
Number |
Sequence |
Base 1 |
Base 2 |
Base 3 |
Base 4 |
Code 4 |
Code 3 |
Code 2 |
Code 1 |
Byte |
|---|---|---|---|---|---|---|---|---|---|---|
1 |
CAGN |
C |
A |
G |
N |
00 |
10 |
00 |
01 |
00100001 |
2 |
TTCG |
T |
T |
C |
G |
10 |
01 |
11 |
11 |
10011111 |
3 |
ANAA |
A |
N |
A |
A |
00 |
00 |
00 |
00 |
00000000 |
Thus, the compressed byte string is [00100001, 10011111, 00000000].
Note that this string is only 3 bytes, compared to 10 for the original.
As a bytes object, the representation is b'!\x9f\x00'.
Algorithm for Sequence Decompression
Algorithm for Sequence Decompression: Procedure
Beginning with a compressed sequence of bytes, each byte is transformed
into four bases by decoding each 2-bit chunk to a base using the above
table, then reversing the order of the four bases.
The code 11 is decoded to U if the rna attribute is True
and to T if False.
The sequence at this point must have a number of bases divisible by 4.
It is cut to the correct number of bases using the length attribute.
Finally, every position in the attribute ns is masked to N.
Algorithm for Sequence Decompression: Example
Suppose that a compressed sequence has the following attributes:
compressed byte string:
[00100001, 10011111, 00000000]rna:
Falselength:
10ns:
(3, 9)
To decompress the sequence, each byte is split into four 2-bit segments, decoded, reversed, and reassembled:
Number |
Byte |
Code 1 |
Code 2 |
Code 3 |
Code 4 |
Base 4 |
Base 3 |
Base 2 |
Base 1 |
Sequence |
|---|---|---|---|---|---|---|---|---|---|---|
1 |
00100001 |
00 |
10 |
00 |
01 |
C |
A |
G |
A |
CAGA |
2 |
10011111 |
10 |
01 |
11 |
11 |
T |
T |
C |
G |
TTCG |
3 |
00000000 |
00 |
00 |
00 |
00 |
A |
A |
A |
A |
AAAA |
The resulting sequence, CAGATTCGAAAA, is trimmed to 10 nt: CAGATTCGAA.
Finally, (0-indexed) positions 3 and 9 are replaced with N: CAGNTTCGAN.