Relation Vectors
Relationships between reads and references
A relation vector encodes the relationship between one sequencing read (or pair of mated reads) and each base in the reference sequence. SEISMIC-RNA defines eight primary relationships:
Match: The reference base aligned to a high-quality base of the same kind in the read.
Deletion: The reference base aligned between two bases in the read.
5’ of Insertion: The reference base aligned to any read base immediately 5’ of an extra base in the read.
3’ of Insertion: The reference base aligned to any read base immediately 3’ of an extra base in the read.
Substitution to A: The reference base is not A, and it aligned to a high-quality A in the read.
Substitution to C: The reference base is not C, and it aligned to a high-quality C in the read.
Substitution to G: The reference base is not G, and it aligned to a high-quality G in the read.
Substitution to T: The reference base is not T, and it aligned to a high-quality T in the read.
This figure illustrates six of these primary relationships, as well as the “blank” relationship for positions in the reference outside the span of the read.

Encoding primary relationships
Each position in a relation vector is physically represented by one byte (eight bits); each bit corresponds to one of the eight types of primary relationship. For a given position in the relation vector, the bit(s) corresponding to the relationship(s) at that position are turned on (set to 1); all other bits are set to 0. The following table indicates which relationship each bit represents. Each bit is shown as the sole 1 within an entire byte (8-digit binary number). The number’s decimal (Dec) and hexadecimal (Hex) forms are also shown.
Byte | Dec | Hex | Relationship |
---|---|---|---|
00000001 | 001 | 01 | Match |
00000010 | 002 | 02 | Deletion |
00000100 | 004 | 04 | 5’ of Insertion |
00001000 | 008 | 08 | 3’ of Insertion |
00010000 | 016 | 10 | Substitution to A |
00100000 | 032 | 20 | Substitution to C |
01000000 | 064 | 40 | Substitution to G |
10000000 | 128 | 80 | Substitution to T |
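
The bit assignments in the table above can be written down as a set of flags. The following is a minimal Python sketch for illustration only: the class name `Rel`, its member names, and the helper `show` are hypothetical and are not taken from the SEISMIC-RNA source code.

```python
from enum import IntFlag


class Rel(IntFlag):
    """Hypothetical names for the eight primary relationship bits."""
    MATCH = 0b00000001
    DELETION = 0b00000010
    INS_5 = 0b00000100  # 5' of insertion
    INS_3 = 0b00001000  # 3' of insertion
    SUB_A = 0b00010000
    SUB_C = 0b00100000
    SUB_G = 0b01000000
    SUB_T = 0b10000000


def show(byte: int) -> str:
    """Format a relationship byte as binary, decimal, and hexadecimal."""
    value = int(byte)
    return f"{value:08b} {value:03d} {value:02x}"


# Print one row per primary relationship, mirroring the table.
for rel in Rel:
    print(show(rel), rel.name)
```

Because each relationship occupies its own bit, any combination of relationships fits in a single byte, which is what the ambiguous encodings described next rely on.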
Encoding ambiguous relationships
Oh, if only encoding relation vectors were that straightforward! Though most bases in a read will have primary relationships with the bases in the reference to which they align, two phenomena can make the relationship of a base ambiguous:
low-quality base calls
ambiguous insertions and deletions
Encoding low-quality base calls
A low-quality base call is defined as having a Phred quality score below the user-specified threshold (default: 25). Low-quality base calls are treated as if they could be any of the four bases. The type of relationship that would occur if the read base were each of the four bases is determined. The bytes for those relationships are united via bitwise OR into a consensus byte for the ambiguous relationship.
For example, suppose a low-quality base call in the read aligns to a T in the
reference.
If the read base (which is unknown because of its low quality) were actually A,
then the relationship would be a substitution to A (00010000).
Likewise if the read base were C (00100000) or G (01000000).
If the read base were T, then the relationship would be a match (00000001).
The bytes for these four possible situations are merged by taking the bitwise OR
into a consensus byte (01110001) that shows ambiguity in the relationship.
This consensus byte is inserted into the relation vector.
If read were | Then relationship would be | Byte | Dec | Hex |
---|---|---|---|---|
A | Substitution to A | 00010000 | 016 | 10 |
C | Substitution to C | 00100000 | 032 | 20 |
G | Substitution to G | 01000000 | 064 | 40 |
T | Match | 00000001 | 001 | 01 |
Low-quality | Any of the above | 01110001 | 113 | 71 |
In the following table, each row repeats this calculation for one type of base
in the reference (column “Ref”).
Each column named “Read: A?” / “Read: C?” / “Read: G?” / “Read: T?” shows what
the relationship would be if the low-quality base in the read were actually the
base in the column header.
For example, in the first row, the reference base is A: if the read base were A,
then the relationship would be a match (00000001); and if it were C, then
the relationship would be a substitution to C (00100000).
The column “Byte” shows the resulting ambiguous relationship, the bitwise OR of
the four columns “Read: A/C/G/T?”.
Ref | Read: A? | Read: C? | Read: G? | Read: T? | Byte | Dec | Hex |
---|---|---|---|---|---|---|---|
A | 00000001 | 00100000 | 01000000 | 10000000 | 11100001 | 225 | e1 |
C | 00010000 | 00000001 | 01000000 | 10000000 | 11010001 | 209 | d1 |
G | 00010000 | 00100000 | 00000001 | 10000000 | 10110001 | 177 | b1 |
T | 00010000 | 00100000 | 01000000 | 00000001 | 01110001 | 113 | 71 |
Note
A byte that has more than one bit set to 1 does not count more than once towards the total number of matches or mutations. To learn how mutations in relation vectors are counted, see [REF].
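
The calculation for low-quality base calls can be reproduced in a few lines. This is a minimal sketch, assuming the bit assignments given earlier; the names `MATCH`, `SUB`, `encode_base`, and `encode_low_quality` are hypothetical and are not SEISMIC-RNA functions.

```python
MATCH = 0b00000001
# Substitution bit for each possible read base.
SUB = {"A": 0b00010000, "C": 0b00100000, "G": 0b01000000, "T": 0b10000000}


def encode_base(ref_base: str, read_base: str) -> int:
    """Relationship byte for a high-quality read base aligned to ref_base."""
    return MATCH if read_base == ref_base else SUB[read_base]


def encode_low_quality(ref_base: str) -> int:
    """Bitwise OR of the bytes for every base the read could have been."""
    byte = 0
    for read_base in "ACGT":
        byte |= encode_base(ref_base, read_base)
    return byte


# One row per reference base: binary, decimal, and hexadecimal forms.
for ref_base in "ACGT":
    byte = encode_low_quality(ref_base)
    print(ref_base, f"{byte:08b}", byte, f"{byte:02x}")
```

For each reference base, the result has the match bit and the three substitution bits for the other bases set, while the substitution bit for the reference base itself stays 0.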
Encoding ambiguous insertions and deletions
Insertions and deletions (collectively, “indels”) in the read cause ambiguities
that even the highest quality sequencing reads could not prevent.
When one or more bases are inserted or deleted in a repetitive sequence, the
exact base that mutated cannot be determined.
For example, if the reference is ATCCTG and the read is ATCTG, then one
C was clearly deleted from the read.
But determining whether it was the first or second C is impossible because the
alignments are equally good:
Deletion of the first C
AT-CTG
|| |||
ATCCTG
Deletion of the second C
ATC-TG
||| ||
ATCCTG
Ambiguities in the location of a relationship are encoded by turning on the bit
of every possible relationship at each position.
In the above example, there could be a deletion (00000010) or a match
(00000001) at position 3 of the reference, so the byte it receives is the
bitwise OR of the two relationships: 00000011.
Likewise for position 4.
Thus, the relationship byte at each position (Pos) in the alignment would be:
Pos | Byte | Hex |
---|---|---|
1 | 00000001 | 01 |
2 | 00000001 | 01 |
3 | 00000011 | 03 |
4 | 00000011 | 03 |
5 | 00000001 | 01 |
6 | 00000001 | 01 |
Note
A byte that has more than one bit set to 1 does not count more than once towards the total number of matches or mutations. To learn how mutations in relation vectors are counted, see [REF].
To learn how the algorithm that finds ambiguous indels works, see [REF].
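
To make the ATCCTG example concrete, the sketch below handles only the simplest case: a read that matches the reference everywhere except for one deleted base. It is not the algorithm that SEISMIC-RNA uses to find ambiguous indels; it is a simplified illustration, and the function name `mark_single_deletion` and its 0-based `del_index` argument are hypothetical.

```python
MATCH = 0b00000001
DELETION = 0b00000010


def mark_single_deletion(ref: str, del_index: int) -> list[int]:
    """Relation bytes for a read equal to ref minus one deleted base.

    del_index is the 0-based reference index that the aligner reported
    as deleted. A single deleted base can slide anywhere within a run
    of identical reference bases without changing the read.
    """
    base = ref[del_index]
    # Extend left and right over the run of identical reference bases.
    start = del_index
    while start > 0 and ref[start - 1] == base:
        start -= 1
    end = del_index
    while end + 1 < len(ref) and ref[end + 1] == base:
        end += 1
    vector = [MATCH] * len(ref)
    if start == end:
        # The run has length 1, so the deletion is unambiguous.
        vector[del_index] = DELETION
    else:
        # Every position in the run could be a match or the deletion.
        for i in range(start, end + 1):
            vector[i] = MATCH | DELETION
    return vector


print([f"{b:08b}" for b in mark_single_deletion("ATCCTG", 2)])
```

Whether the aligner reports the first or the second C as deleted (index 2 or 3), the sketch yields the same vector, which is the point of encoding both possibilities at both positions.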
Encoding positions not covered by the read
If a read is shorter than the reference, then some positions in the reference
will not be covered by the read.
The “blank” positions to which the read does not align provide no information
and are thus considered fully ambiguous and assigned the byte 11111111
(decimal 255, hexadecimal ff).
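
As an illustration, a relation vector spanning the whole reference can be built by starting from all blank bytes and overwriting the positions that the read covers. This is a sketch under assumptions: the function `pad_to_reference` and its 0-based `start` offset are hypothetical, not part of SEISMIC-RNA.

```python
BLANK = 0b11111111


def pad_to_reference(read_bytes: bytes, start: int, ref_length: int) -> bytes:
    """Place read_bytes at 0-based offset start in a vector of blank bytes."""
    if start < 0 or start + len(read_bytes) > ref_length:
        raise ValueError("read does not fit within the reference")
    # Begin with every position blank (fully ambiguous).
    vector = bytearray([BLANK]) * ref_length
    # Overwrite the covered positions with the read's relationship bytes.
    vector[start:start + len(read_bytes)] = read_bytes
    return bytes(vector)


# A 3-base read (match, match, substitution to G) covering positions 2-4
# of a 6-base reference.
covered = bytes([0b00000001, 0b00000001, 0b01000000])
print(pad_to_reference(covered, 1, 6).hex())
```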
Encoding paired-end reads
For paired-end reads, both mates produce a relation vector. They must be merged into one consensus relation vector to avoid double-counting any positions where the two mates overlap. Ideally, the mates would have identical relationships. However, they often differ because a position is covered in one mate but not in the other, one mate’s Phred score is above the threshold while the other’s is below, or (more rarely) the base calls themselves differ.
Encoding consensus relationships
When finding the consensus of two mates, information in one mate should fill in for a lack thereof in the other. Recall that each byte indicates all possible relationships at its position. The more bits that are set to 1, the more ambiguity (and the less knowledge) there is about the relationship. For one mate to add knowledge to the other, the consensus byte must thus have no more 1s than the byte of either mate. Specifically, a bit in the consensus should be 1 only if it is 1 in both mates. This result is achieved using the bitwise AND operation.
For example, consider the following mate 1 and mate 2, where the column “Result” indicates the consensus byte after taking the bitwise AND:
Pos | Mate 1 | Mate 2 | Result |
---|---|---|---|
1 | 00000001 | 00000001 | 00000001 |
2 | 00000001 | 11010001 | 00000001 |
3 | 11100001 | 01000000 | 01000000 |
4 | 11111111 | 00000001 | 00000001 |
5 | 11111111 | 01110001 | 01110001 |
6 | 11111111 | 11111111 | 11111111 |
At position 1, the mates agree on a match.
At position 2, mate 2 has low quality, but mate 1 has a high-quality match, so
the result has only the match bit set to 1.
Similarly, at position 3, a substitution to G in mate 2 compensates for the
low-quality base call in mate 1: substitution to G is the consensus.
Mate 1 does not cover positions 4-6 (hence the blank bytes 11111111).
Mate 2 shows that position 4 is a match, but it is low quality at position 5,
so even the consensus byte is ambiguous.
Neither mate covers position 6, so the consensus byte is blank.
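
The consensus in the table above can be reproduced with a position-wise bitwise AND. This is a minimal sketch, assuming both mates have already been expanded to the full length of the reference; the function name `merge_mates` is hypothetical.

```python
def merge_mates(mate1: bytes, mate2: bytes) -> bytes:
    """Consensus relation vector: bitwise AND of the mates at each position."""
    if len(mate1) != len(mate2):
        raise ValueError("mates must span the same number of positions")
    return bytes(b1 & b2 for b1, b2 in zip(mate1, mate2))


# The two mates from the table, positions 1 through 6.
mate1 = bytes([0b00000001, 0b00000001, 0b11100001,
               0b11111111, 0b11111111, 0b11111111])
mate2 = bytes([0b00000001, 0b11010001, 0b01000000,
               0b00000001, 0b01110001, 0b11111111])

for pos, byte in enumerate(merge_mates(mate1, mate2), start=1):
    print(pos, f"{byte:08b}")
```

Because a blank byte has all eight bits set, ANDing it with any other byte returns that byte unchanged, so a mate that does not cover a position never removes information supplied by the other mate.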
Encoding irreconcilable relationships
It is possible, although rare, for mates 1 and 2 to share no bits.
For example, if mate 1 were a high-quality match (00000001) and mate 2 were
a high-quality substitution to T (10000000), then the bitwise AND would be
all zeros (00000000).
The mates would be irreconcilable at this position.
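
Such positions are easy to detect, because every valid relationship byte has at least one bit set, so an all-zero AND can only mean that the mates share no bits. The sketch below only reports these positions; it makes no claim about how SEISMIC-RNA goes on to handle them, and the function name `irreconcilable_positions` is hypothetical.

```python
def irreconcilable_positions(mate1: bytes, mate2: bytes) -> list[int]:
    """1-based positions at which the two mates share no bits."""
    return [
        pos
        for pos, (b1, b2) in enumerate(zip(mate1, mate2), start=1)
        if (b1 & b2) == 0
    ]


# Position 1: both mates match. Position 2: a high-quality match in
# mate 1 versus a high-quality substitution to T in mate 2.
mate1 = bytes([0b00000001, 0b00000001])
mate2 = bytes([0b00000001, 0b10000000])
print(irreconcilable_positions(mate1, mate2))
```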