Tutorial 1: Amplicon prepared with RT-PCR
This tutorial demonstrates how to analyze a dataset that was prepared using RT-PCR (with forward and reverse primers) of one specific region of an RNA.
TL;DR
Un-tar and enter the data directory:
tar xvf data.tar cd data
Process the no-DMS control:
seismic wf -x fq/nodms --keep-gu --mask-polya 0 --min-mut-gap 0 hiv-rre.fa
Process the DMS-treated replicates separately:
seismic wf -x fq/dms1 -x fq/dms2 --mask-pos rre 176 -p rre GGAGCTTTGTTCCTTGGGTTCTTGG GGAGCTGTTGATCCTTTAGGTATCTTTC hiv-rre.fa seismic graph scatter out/dms[12]/mask/rre/26-204/mask-position-table.csv
Pool the replicates and process them together:
seismic pool -P dms-pool out/dms[12] seismic wf --mask-pos rre 176 -p rre GGAGCTTTGTTCCTTGGGTTCTTGG GGAGCTGTTGATCCTTTAGGTATCTTTC --cluster --fold -q 0.95 hiv-rre.fa out/dms-pool/relate
Scientific premise
The RNA in this example is a segment of the the human immunodeficiency virus 1 (HIV-1) genome called the Rev response element (RRE), which binds to the protein Rev that mediates nuclear export (Sherpa et al.). The RRE RNA folds into two different secondary structures: a predominant (~75%) structure with 5 stems and a minor (~25%) structure with 4 stems, each of which is associated with a different rate of HIV-1 replication.
In this hypothetical experiment, you in vitro transcribe a 232 nt segment of the RRE, perform two DMS-MaPseq experiments on it (along with a no-DMS control), amplify a region using RT-PCR, and sequence the amplicons using paired-end 150 x 150 nt read lengths.
*The FASTQ files in this tutorial were actually generated using seismic sim
and don’t resemble the authentic DMS mutational patterns.
Download example files of an amplicon
Download the example FASTA and FASTQ files. First, open your terminal and navigate to a directory in which you want to run the tutorial.
If you have wget
, you can download the tutorial data simply by typing
wget https://raw.githubusercontent.com/rouskinlab/seismic-rna/main/src/userdocs/tutorials/amplicon/data.tar
Otherwise, click this link to download the tutorial data: https://raw.githubusercontent.com/rouskinlab/seismic-rna/main/src/userdocs/tutorials/amplicon/data.tar
To ensure the download is complete and not corrupted, verify that the SHA-256
checksum is 00c500bc9458048af04ee09b397a59e829fa25c7a53ba6c1b5cc09cbf80cd6ee
by typing this command:
shasum -a 256 data.tar
If this command prints a different checksum, then retry the download. If the problem persists, then raise an issue (see Bugs and Requests).
After downloading and verifying the data, untar the data by typing
tar xvf data.tar
and then navigate into the data
directory:
cd data
Process the no-DMS control
When you have a no-DMS control, it is a good idea to analyze it first to check if any bases are highly mutated: such bases would give falsely high mutation rates and should be ignored while analyzing DMS-modified samples.
Run the workflow on the no-DMS control
Process the no-DMS control through the whole workflow with this command:
seismic wf -x fq/nodms --keep-gu --mask-polya 0 --min-mut-gap 0 hiv-rre.fa
This is what each of the arguments does:
wf
means run the entire workflow.-x fq/nodms
means search insidefq/nodms
for pairs of FASTQ files of paired-end reads with mate 1 and mate 2 in separate files.--keep-gu
means keep G and U bases (which do not react with DMS and should typically be masked out in DMS-modified samples).--mask-polya 0
means do not mask out poly(A) sequences (which can produce artifacts in DMS-modified samples).--min-mut-gap 0
means disable observer bias correction (which only appies to DMS-modified samples).hiv-rre.fa
means use the sequence in this FASTA file as the reference (i.e. mutation-free) sequence for the RNA.
After it finishes running (which should take about one minute or less), all
output files will go into the directory out
.
Check the graphs of coverage and mutation rate for the no-DMS control
Next, check the unambiguous count (i.e. how many base calls at each position
could be labeled unambiguously as either a match or a mutation).
Open out/nodms/graph/rre/full/profile_filtered_n-count.html
in a web broser.
nodms
is the samplerre
is the reference (i.e. name of the RNA)full
is the region of the reference you are looking atprofile
is the type of graph (a bar graph with position on the x-axis)filtered
means graph the data come from the Mask stepn
is the shorthand for “unambiguous”count
means graph the number of reads
This graph shows that the number of unambiguous base calls at each position is fairly even – around 2,200 – across all positions amplified by the primers (56-222), which is expected for RT-PCR amplicons. This graph also shows that each position has enough unambigous base calls (>1,000) to obtain a reasonably accurate estimate of the mutation rate.
After confirming there are sufficient unambiguous base calls, view the mutation
rates by opening out/nodms/graph/rre/full/profile_filtered_m-ratio-q0.html
in a web browser.
nodms
,rre
,full
,profile
, andfiltered
have the same meanings as beforem
is the shorthand for “mutated”ratio
means graph the ratio ofm
(mutated) to unambiguously mutated or matching base calls (i.e. the mutation rate)q0
means do not normalize the mutation rates
This graph shows that the mutation rate is very low across all positions – as expected for a sample that is not DMS-modified – except for position 176, which has a mutation rate of nearly 50%. Because of this one outlier, it is hard to see just how low the mutation rates are at the other positions, but because this graph is interactive, you can click at the top of the y-axis and enter a new upper limit, such as 0.02. You can also mouse over a bar to see its mutation rate (G31 is shown here).
Now it is clear that every position except 176 has a mutation rate no greater than 1%, and most are below 0.5%, which is typical for non-DMS-modified RNA.
To figure out why position 176 has such a high mutation rate, you can check the
types of mutations that occur at each position, which are in another graph,
out/nodms/graph/rre/full/profile_filtered_acgtdi-ratio-q0.html
.
nodms
,rre
,full
,profile
,filtered
,ratio
, andq0
have the same meanings as beforeacgtdi
are the shorthands for substitutions to A, C, G, and T; deletions; and insertions; respectively
This graph shows that nearly all (~97%) of the mutations at position 176 are A-to-G substitutions. This finding suggests (given that this hypothetical experiment is in vitro) that the DNA template that was used to transcribe the RNA could actually be a mixture of about 50% the expected sequence and 50% that sequence with an A-to-G substitution at position 176.
Mask out the position that is highly mutated in the no-DMS sample
If this were a real experiment, it could be worth sequencing the DNA template to check if it actually was a mixture, and if so to fix it. For the purposes of this tutorial, you will learn how to mask out position 176 so that it does not skew the results.
Rerun the workflow with the option --mask-pos rre 176
:
seismic wf --force --keep-gu --mask-polya 0 --min-mut-gap 0 --mask-pos rre 176 hiv-rre.fa out/nodms/relate/rre
This is what each of the arguments does:
wf
means run the entire workflow.--force
means overwrite any output files that already exist.--keep-gu
means keep G and U bases (which do not react with DMS and should typically be masked out in DMS-modified samples).--mask-polya 0
means do not mask out poly(A) sequences (which can produce artifacts in DMS-modified samples).--min-mut-gap 0
means disable observer bias correction (which only appies to DMS-modified samples).--mask-pos rre 176
means mask position 176 in referencerre
.hiv-rre.fa
means use the sequence in this FASTA file as the reference (i.e. mutation-free) sequence for the RNA.out/nodms/relate/rre
means search this directory for data files: in this case, the data from the Relate step for samplenodms
, referencerre
.
After the command finishes running, you can see that position 176 was masked out
by opening out/nodms/graph/rre/full/profile_filtered_m-ratio-q0.html
(position
175 is highlighted to make the gap between it and position 177 more clear):
Process both DMS-modified replicates
Now you are ready to process the DMS-modified samples.
Run the workflow on both DMS-treated replicates
Process the DMS-treated samples through the whole workflow with this command:
seismic wf -x fq/dms1 -x fq/dms2 --mask-pos rre 176 -p rre GGAGCTTTGTTCCTTGGGTTCTTGG GGAGCTGTTGATCCTTTAGGTATCTTTC hiv-rre.fa
This is what each of the arguments does:
wf
means run the entire workflow.-x fq/dms1
means search insidefq/dms1
for pairs of FASTQ files of paired-end reads with mate 1 and mate 2 in separate files.-x fq/dms2
means search insidefq/dms2
for pairs of FASTQ files of paired-end reads with mate 1 and mate 2 in separate files.--mask-pos rre 176
means mask position 176 (because it had a high mutation rate in the no-DMS sample).-p rre GGAGCTTTGTTCCTTGGGTTCTTGG GGAGCTGTTGATCCTTTAGGTATCTTTC
defines a region of the referencerre
that corresponds to the amplicon flanked by primersGGAGCTTTGTTCCTTGGGTTCTTGG
andGGAGCTGTTGATCCTTTAGGTATCTTTC
.hiv-rre.fa
means use the sequence in this FASTA file as the reference (i.e. mutation-free) sequence for the RNA.
Check the correlation of mutation rates between DMS-treated replicates
Next, it is recommended to check the correlation of mutation rates between the replicates, to ensure they are reproducible. To create a scatter plot of the mutation rates and calculate the correlation, run the command
seismic graph scatter out/dms[12]/mask/rre/26-204/mask-position-table.csv
graph scatter
means graph a scatter plot.out/dms[12]/table/rre/26-204/mask-position-table.csv
means graph data from these tables, where[12]
is a glob pattern that is expanded by the shell into all files that match the pattern – which in this case isout/dms1/table/rre/26-204/mask-position-table.csv out/dms2/table/rre/26-204/mask-position-table.csv
. You could instead type this expression to list both table files explicitly, but the former requires fewer key strokes.
Open out/dms1__and__dms2/graph/rre/26-204/scatter_filtered_m-ratio-q0.html
in a web browser to view the scatter plot and correlation:
The Pearson correlation is 0.998, which is extremely high. (For a general amplicon, ≥0.98 would be ideal, and ≥0.95 would be decent).
Pool the DMS-treated replicates and process them together
Pool the two DMS-treated replicates
Because the correlation is so high, the data from the replicates can be combined so that they can be analyzed as a single sample, which is helpful whenever high coverage is necessary, such as during clustering. To combine the replicates, run this command:
seismic pool -P dms-pool out/dms[12]
pool
means combine samples into a new “pooled” sample.-P dms-pool
means name the pooled sampledms-pool
.out/dms[12]
means pool the samples in these directories, where[12]
is a glob pattern that is automatically expanded by the shell into all files that match the pattern – which in this case isout/dms1 out/dms2
. You could instead type this expression to list both table files explicitly, but the former requires fewer key strokes.
Process the pooled DMS-treated samples
Now that the replicates are pooled, the overall coverage will be higher, and so clustering is more likely to detect true alternative structures. Process the pooled sample, including with clustering, by running this command:
seismic -v wf --mask-pos rre 176 -p rre GGAGCTTTGTTCCTTGGGTTCTTGG GGAGCTGTTGATCCTTTAGGTATCTTTC --cluster --fold -q 0.95 hiv-rre.fa out/dms-pool/relate
This is what each of the arguments does:
-v
means use verbose mode (to print more messages to the console).wf
means run the entire workflow.--mask-pos rre 176
means mask position 176 (because it had a high mutation rate in the no-DMS sample).-p rre GGAGCTTTGTTCCTTGGGTTCTTGG GGAGCTGTTGATCCTTTAGGTATCTTTC
defines a region of the referencerre
that corresponds to the amplicon flanked by primersGGAGCTTTGTTCCTTGGGTTCTTGG
andGGAGCTGTTGATCCTTTAGGTATCTTTC
.--cluster
means enable clustering to find alternative structures.--fold
means enable secondary structure prediction.-q 0.95
sets the 95th percentile of the mutation rates to 1 and scales the rest of the data accordingly (required if using--fold
).hiv-rre.fa
means use the sequence in this FASTA file as the reference (i.e. mutation-free) sequence for the RNA.out/dms-pool/relate
means search insideout/dms-pool/relate
for data to feed into the workflow: in this case, a report from the Relate step.
Check whether the RNA forms alternative structures
To check whether alternative structures were detected, open the cluster report
in a web browser: out/dms-pool/cluster/rre/26-204/cluster-report.json
.
The contents of the report will resemble this (some fields may differ, such as
the time and version):
{
"Sample": "dms-pool",
"Reference": "rre",
"Region": "26-204",
"Number of unique reads": 12236,
"Start at this many clusters": 1,
"Stop at this many clusters (0 for no limit)": 0,
"Try all numbers of clusters (Ks), even after finding the best number": false,
"Write all numbers of clusters (Ks), rather than only the best number": false,
"Remove runs with two clusters more similar than this correlation": 0.9,
"Remove runs with two clusters different by less than this MARCD": 1.5,
"Calculate the jackpotting quotient to find over-represented reads": true,
"Confidence level for the jackpotting quotient confidence interval": 0.95,
"Remove runs whose jackpotting quotient exceeds this limit": 1.2,
"Remove Ks whose 1st/2nd log likelihood difference exceeds this gap": 250.0,
"Remove Ks where every run has less than this correlation vs. the best": 0.975,
"Remove Ks where every run has more than this MARCD vs. the best": 1.1,
"Run EM this many times for each number of clusters (K)": 12,
"Run EM for at least this many iterations (times number of clusters)": 10,
"Run EM for at most this many iterations (times number of clusters)": 500,
"Stop EM when the log likelihood increases by less than this threshold": 0.37,
"Whether each number of clusters (K) passed filters": {
"1": true,
"2": true,
"3": false
},
"Best number of clusters": 2,
"Numbers of clusters written to batches": [
2
],
"Number of batches": 2,
"MD5 checksums of batches": {
"cluster": [
"8f42feceefca3aba6dfbf2fba8072ecf",
"f3bb0d2bdd105ec91c3f84bf3a194bfe"
]
},
"Branches": [],
"Time began": "2024-12-08 at 23:10:31",
"Time ended": "2024-12-08 at 23:12:16",
"Time taken (minutes)": 1.75,
"Version of SEISMIC-RNA": "0.22.0"
}
The number of clusters detected is the field Best number of clusters
, which
is 2 in this case, suggesting that the RNA forms 2 alternative structures.
Draw the best-supported structure model for each cluster
For each cluster, the DMS reactivities are used to model the secondary structure
of the RNA (with the --fold
option).
However, the folding algorithm will produce a structure for any DMS reactivties
– even pure noise.
Thus, it is important to verify that the structure models agree with the DMS
reactivities – that is, paired bases tend to have low reactivities and unpaired
bases tend to have high reactivities.
The agreement is measured by computing the receiver operating characteristic (ROC) and calculating the area under the curve (AUC). If the structure agrees perfectly with the DMS reactivities – that is, every unpaired base has a higher DMS reactivity than every paired base – then the AUC will equal 1. If the DMS reactivities were completely random, then the AUC would be 0.5.
Open out/dms-pool/graph/rre/full/roc_26-204__clustered-2-x_m-ratio-q0.html
in a web browser to check the AUC-ROC for this sample.
The ROC curves for structures modeled using the DMS reactivities from cluster 1
are in the upper graph, and those from cluster 2 in the lower graph.
For each cluster, multiple structures were modeled: generally, the best model is
the one with the largest AUC, and if there is a tie for the largest AUC, then
the model among those with the smallest (most negative) free energy.
In this case, that is structure rre__full #0
for both clusters.
In a text editor, open the structure models for cluster 1:
out/dms-pool/fold/rre/full/26-204__cluster-2-1.db
Copy the sequence (second line) as well as the structure on the line below
rre__full #0
into an RNA structure visualization program, such as VARNA.
Adjust the angles to remove overlaps between different parts of the structure.
Color the bases by their DMS reactivities by right-clicking on the VARNA canvas,
choosing “Color map” then “Load values…”, and then typing the VARNA color file
out/dms-pool/fold/rre/full/26-204__cluster-2-1__varna-color.txt
in the box.
Specify a color palette by right-clicking on the VARNA canvas, choosing “Color map” and then “Style…”. Bases with no data are set to -1, and bases with data are set to a value between 0 and 1, so choose a color palette that is consistent with this scheme, such as light gray for -1 and a gradient from 0 to 1:
Save the VARNA session file by right-clicking on the VARNA canvas, choosing “Save…”, selecting “VARNA Session File” under “File Format”, and entering a location and name for the file.
To draw the structure for cluster 2, right-click the VARNA canvas, choose
“New…”, open out/dms-pool/fold/rre/full/26-204__cluster-2-2.db
in a text
editor, and copy-paste the structure below rre__full #0
into the text box
“Structure.”
Repeat the steps above to draw and color the structure for cluster 2, except
using out/dms-pool/fold/rre/full/26-204__cluster-2-2__varna-color.txt
as the
source of the DMS reactivities for the colors.