Algorithm for Sequence Compression/Decompression

Background on Sequence Compression/Decompression

In a nucleic acid sequence, each base is represented by one character in ASCII code and hence occupies 1 byte (8 bits). But because each base has only 4 possibilities, each base requires only log2(4) = 2 bits, shown in the following table:

Base

2-Bit Binary Code

A

00

C

01

G

10

T/U

11

For efficient storage of (long) nucleic acid sequences, the following algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte, and then restores the original sequence when needed.

Algorithm for Sequence Compression

Algorithm for Sequence Compression: Procedure

First, attributes of the nucleic acid sequence are recorded:

  • rna (bool): whether the sequence is RNA

  • length (int): number of bases in the sequence

  • ns (tuple[int, ...]): positions of N bases, if any

The type and the length are scalar values with constant sizes regardless of the length of the sequence. The 0-indexed positions of ambiguous bases is an array that at worst (if all bases are N) scales linearly with the length of the sequence and at best (if no bases are N) is a constant (small) size. Because most reference sequences contain no or very few bases that are N, recording the N positions generally requires little space.

Then, each non-overlapping segment of four bases is encoded as one byte by concatenating the 2-bit codes (above) of the bases in the segment, in reverse order. Ambiguous (N) bases are arbitrarily encoded as 00. Because this step requires the length of the sequence to be a multiple of 4, the sequence is padded on its 3’ side with A (an arbitrary choice) until its length is a multiple of 4.

Algorithm for Sequence Compression: Example

Given the DNA sequence CAGNTTCGAN, the attributes are extracted:

  • rna: False

  • length: 10

  • ns: (3, 9) (note: these positions are 0-indexed)

The sequence is then padded with A at the 3’ end until its length becomes a multiple of 4 (in this case, length 12):

CAGNTTCGANAA

Then, each 4-base segment is transformed into one byte by encoding each base and concatenating the codes in order from code 4 to code 1:

Number

Sequence

Base 1

Base 2

Base 3

Base 4

Code 4

Code 3

Code 2

Code 1

Byte

1

CAGN

C

A

G

N

00

10

00

01

00100001

2

TTCG

T

T

C

G

10

01

11

11

10011111

3

ANAA

A

N

A

A

00

00

00

00

00000000

Thus, the compressed byte string is [00100001, 10011111, 00000000]. Note that this string is only 3 bytes, compared to 10 for the original. As a bytes object, the representation is b'!\x9f\x00'.

Algorithm for Sequence Decompression

Algorithm for Sequence Decompression: Procedure

Beginning with a compressed sequence of bytes, each byte is transformed into four bases by decoding each 2-bit chunk to a base using the above table, then reversing the order of the four bases. The code 11 is decoded to U if the rna attribute is True and to T if False. The sequence at this point must have a number of bases divisible by 4. It is cut to the correct number of bases using the length attribute. Finally, every position in the attribute ns is masked to N.

Algorithm for Sequence Decompression: Example

Suppose that a compressed sequence has the following attributes:

  • compressed byte string: [00100001, 10011111, 00000000]

  • rna: False

  • length: 10

  • ns: (3, 9)

To decompress the sequence, each byte is split into four 2-bit segments, decoded, reversed, and reassembled:

Number

Byte

Code 1

Code 2

Code 3

Code 4

Base 4

Base 3

Base 2

Base 1

Sequence

1

00100001

00

10

00

01

C

A

G

A

CAGA

2

10011111

10

01

11

11

T

T

C

G

TTCG

3

00000000

00

00

00

00

A

A

A

A

AAAA

The resulting sequence, CAGATTCGAAAA, is trimmed to 10 nt: CAGATTCGAA. Finally, (0-indexed) positions 3 and 9 are replaced with N: CAGNTTCGAN.