Relate Batch
Each batch of relation vectors is a RelateBatchIO
object.
Relate batch: Structure
The following attributes encode the relationships between each read and each position in the reference sequence:
Attribute |
Data Type |
Description |
---|---|---|
|
|
array of the first position of the most upstream mate in each read |
|
|
array of the first position of the most downstream mate in each read |
|
|
array of the last position of the most upstream mate in each read |
|
|
array of the last position of the most downstream mate in each read |
|
|
array of the reads with each type of mutation at each position |
Note
The positions of the first and last bases in the reference sequence are defined to be 1 and the length of the sequence, respectively.
Relate batch: Structure of read numbers
Each read, or pair of paired-end reads, is labeled with a non-negative integer: 0 for the first read in each batch, and incrementing by 1 for each subsequent read. Within one batch, all read numbers are unique. However, two different batches can have reads that share numbers.
Relate batch: Structure of 5’ and 3’ end positions
end5s
,mid5s
,mid3s
, andend3s
are all 1-dimensionalnumpy.ndarray
objects.For any relate batch,
end5s
,mid5s
,mid3s
, andend3s
all have the same length (which may be any integer ≥ 0).A read with index
i
corresponds to thei
th values ofend5s
,mid5s
,mid3s
, andend3s
; denoted (respectively)end5s[i]
,mid5s[i]
,mid3s[i]
, andend3s[i]
.For every read
i
:1 ≤
end5s[i]
≤end3s[i]
≤ length of reference sequenceIf paired-end and there is a gap of ≥ 1 nt between mates 1 and 2:
end5s[i]
≤mid3s[i]
<mid5s[i]
≤end3s[i]
Otherwise:
end5s[i]
=mid5s[i]
≤mid3s[i]
=end3s[i]
Relate batch: Structure of mutations
muts
is a dict
wherein
each key is a position in the reference sequence (
int
)each value is a
dict
whereineach key is a type of mutation (
int
, see Relation Vectors for more information)each value is an array of the numbers of the reads that have the given type of mutation at the given position (
numpy.ndarray
)
Relate batch: Example
For example, suppose that the reference sequence is TCAGAACC
and a
batch contains five paired-end reads, numbered 0 to 4:
Read |
Mate |
Alignment |
---|---|---|
0 |
1 |
|
0 |
2 |
|
1 |
1 |
|
1 |
2 |
|
2 |
1 |
|
2 |
2 |
|
3 |
1 |
|
3 |
2 |
|
4 |
1 |
|
4 |
2 |
|
Ref |
|
The positions, reads, and relationships can be shown explicitly as a matrix (see Relation Vectors for information on the relationship codes):
Read |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
---|---|---|---|---|---|---|---|---|
0 |
255 |
1 |
1 |
1 |
255 |
1 |
64 |
1 |
1 |
1 |
1 |
128 |
1 |
128 |
1 |
255 |
255 |
2 |
255 |
1 |
1 |
255 |
1 |
1 |
1 |
255 |
3 |
1 |
16 |
1 |
1 |
128 |
255 |
1 |
1 |
4 |
255 |
255 |
1 |
1 |
3 |
3 |
1 |
255 |
In a relate batch, they would be encoded as follows:
end5s
:[2, 1, 2, 1, 3]
mid5s
:[4, 1, 3, 5, 3]
mid3s
:[6, 6, 5, 7, 7]
end3s
:[8, 6, 7, 8, 7]
muts
:{1: {}, 2: {16: [3]}, 3: {128: [1]}, 4: {}, 5: {3: [4], 128: [1, 3]}, 6: {3: [4]}, 7: {64: [0]}, 8: {}}
Note that the numbers are shown here for visual simplicity as
list
objects, but would really benumpy.ndarray
objects.