Step by Step Instruction
Define settings in the configure file
Important parameters of the
data.yaml
file.
Parameter | Â | Â | Default value (type) | Description |
Level 1 | Level 2 | Level 3 | ^^ | ^^ |
workdir | - | - | workspace (string) | (optional) The path to the working directory. |
tempdir | - | - | workspace/.tmp (string) | (optional) The path to the temp file directory. |
cores | - | - | 48 (int) | (optional) Max number of cores for all the running jobs at the same time. |
reference | contamination | fa | - (string) | (optional) Reference file for putative contamination in the samples |
^^ | ^^ | bt2 | Auto (string) | (optional) bowtie2 index for contamination reference |
^^ | genes | fa | - (string) | (required) reference file for selected rRNA, tRNA and snoRNA genes |
^^ | ^^ | bt2 | Auto (string) | (optional) bowtie2 index for selected genes |
^^ | genome | fa | - (string) | (required) reference genome for the species you study |
^^ | ^^ | star | - (string) | (required) STAR index for reference genome |
samples | (sample name) | data | - (list) | (required) list of files path for sequencing reads in fastq format. Multiple sequencing run for one sample is supported. You do not need to combine the input fastq files before passing the data into the pipeline. |
^^ | ^^ | group | - (string) | (required) group ID for this sample |
^^ | ^^ | treated | true (boolean) | (optional) true for BS treated sample, false for untreated control sample. |
^^ | (…) | (…) | (…) | (optional) other samples. Adding any number of entries is supported. |
workdir
tempdir
cores
…
reference
The ‘reference’ section specifies the path of reference sequences, which are in the fasta format, and the index for bowtie2 or STAR aligner.
contamination
: Put some putative contamination in your samples in to a fasta file. For cultured cells, contamination is most likely to be Mycoplasmsa and E. coli. For plant samples, it might be some fungi.💡 If contamination fasta file is not provided, the contamination filter step will be skipped.
💡 The bowtie2 index for contamination is optional, and if it is not provided, the pipeline will generate bowtie2 index automatically.
genes
: rRNA genes and non-coding small RNA genes (snoRNA, miRNA, tRNA, …) of the species you study can be downloaded from NCBI database and used as the reference in this step. An example dataset for mouse (Mus musculus) can be downloaded from this LINK.💡 Mapping to genes before mapping reads into genome is a strategy to get rid of false possitives and increase detection sensitivity on multiple-copies genes. More details are explained in the paper.
💡 The bowtie2 index for genes is optional, and if it is not provided, the pipeline will generate bowtie2 index automatically.
genome
: Use the latest version of the reference genome for the species you study.💡 The STAR index for the genome sequence is also required.
samples
The ‘samples’ section specifies the path (
data
), the classification (group
), the library type (treated
) and other information for each sample.
group
information is essential for sites prefiltering. Samples that might have different Ψ level should be labeled with different group. For example, different cell culture condition, different genotype, or different tissue type.
Read more on Advanced Customization.
Customized settings in the command line
- Customized configure file with
-c
argument. (default:data.yaml
) - Customized number of jobs/cores in parallel
-j
argument. (default:48
)
-j
arg in the command line has higher priority than thecores
setting in the yaml file. This one will mask the one in the configure file.
Post-filter Ψ sites
The reliability ($p$) of the Ψ-modified sites
A p-value from the One-sided Fisher’s Exact Test can be used for evaluating the reliability of each site.
As the table below,
- $\sum{d_{i}}$ is total number of coverage in input libraries
- $\sum{d_{t}}$ is total number of coverage in treated libraries
- $\sum{g_{i}}$ is total number of gaps in input libraries
- $\sum{g_{t}}$ is total number of gaps in treated libraries
 | Input | Treated |
gap | $\sum{g_{i}}$ | $\sum{g_{t}}$ |
T | $\sum{d_{i}} - \sum{g_{i}}$ | $\sum{d_{t}} - \sum{g_{t}}$ |
You can add p.value to the table generated by this pipeline with a R script.
Some example rows from the file:
chr pos strand mESCWT-rep1-input_depth mESCWT-rep1-input_gap mESCWT-rep2-input_depth mESCWT-rep2-input_gap mESCWT-rep3-input_depth mESCWT-rep3-input_gap mESCWT-rep1-treated_depth mESCWT-rep1-treated_gap mESCWT-rep2-treated_depth mESCWT-rep2-treated_gap mESCWT-rep3-treated_depth mESCWT-rep3-treated_gap rRNA_Rn45s 43 + 112 28 74 25 72 25 108 25 96 26 88 22 rRNA_Rn45s 4913 + 416 1 362 0 396 0 203 29 216 29 196 26 rRNA_Rn45s 10032 + 182 0 150 2 161 0 179 0 173 0 157 0 rRNA_Rn45s 10970 + 124 2 118 2 134 5 142 5 133 2 113 4 rRNA_Rn45s 11491 + 327 0 315 2 278 0 153 129 159 130 158 128 ...
Example code snippet:
# customized your file name and sample name in these 4 rows fn_in <- "sites.tsv.gz" # input file anme fn_out <- "sites_pvalue.tsv.gz" # output file name col_i <- c("mESCWT-rep1-input", "mESCWT-rep2-input", "mESCWT-rep3-input") # sample of input libs col_t <- c("mESCWT-rep1-treated", "mESCWT-rep2-treated", "mESCWT-rep3-treated") # sample of treated libs # calculate p.value get_p <- function(r) { fisher.test(as.table(matrix(as.numeric(r[c(1:4)]), ncol = 2)), alt = "greater")$p.value } df <- read.table(gzfile(fn_in), sep = "\t", header = T, check.names = F) di <- rowSums(df[paste0(col_i, "_depth")]) gi <- rowSums(df[paste0(col_i, "_gap")]) dt <- rowSums(df[paste0(col_t, "_depth")]) gt <- rowSums(df[paste0(col_t, "_gap")]) df$p.value <- apply(cbind(di - gi, gi, dt - gt, gt), 1, get_p) write.table(df, gzfile(fn_out), sep = "\t", row.names = F, quote = F)
The stoichiometry ($f$) of the Ψ-modified sites
Ψ stoichiometry ($f$) can be precisely calculated by applying the deletion ratio ($r$) to the calibration curves.
\[r = \displaystyle{\frac{g}{d}}\], where $g$ is the deletion number and $d$ is the sequencing coverage.
\[f=\frac{b - r}{c \times (b + s - s \times r - 1)}\], where $c$ is the conversion ratio, $s$ is the RT dropout proportion and $b$ is the background noise.