PARANOiD Outputs
Explanation of all output files generated by PARANOiD
Alignments
Directory containing deduplicated alignments in BAM format, together with an index file in BAM.BAI format. BAM files are compressed binary versions of SAM files. SAM files are tab-separated and show one alignment per line; BAM files hold the same columns in binary form.
The information provided in the columns is as follows:
1. Read header
2. Bitwise FLAG
3. Name of reference sequence
4. Position of alignment (1-based)
5. MAPQ-score
6. CIGAR string
7. Reference name of the mate read (shows * if information is not available)
8. Position of mate read (shows 0 if information is not available)
9. Observed template length (shows 0 if information is not available)
10. Read sequence (shows * if information is not available)
11. Quality of read sequence (shows * if information is not available)
One BAM file and one index file are generated per sample.
Alignments are included in the basic analysis.
Example:
NB501399:129:HLW7VAFX2:3:11409:5471:17963_AAGACACTG 272 1 14572 0 23M * 0 0 CCACACAGTGCTGGTTCCGTCAC EEEEEEEEEEEAEEEEEEEEEEE NH:i:7 HI:i:4 AS:i:22 nM:i:0
NB501399:129:HLW7VAFX2:3:11604:9407:1314_TCTGCCCAC 272 1 14747 0 36M * 0 0 CGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAA EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE NH:i:5 HI:i:4 AS:i:35 nM:i:0
NB501399:129:HLW7VAFX2:2:11201:6526:7382_TCCCCGACC 272 1 14847 0 40M * 0 0 AGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCC EEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:6 HI:i:3 AS:i:39 nM:i:0
NB501399:129:HLW7VAFX2:1:11204:3841:14476_GCGATCCCG 272 1 14992 0 37M * 0 0 GTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:8 HI:i:5 AS:i:36 nM:i:0
NB501399:129:HLW7VAFX2:2:11204:16119:17944_CACACCCCG 272 1 14992 0 37M * 0 0 GTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:8 HI:i:5 AS:i:36 nM:i:0
NB501399:129:HLW7VAFX2:1:21211:6880:4260_CCACAACTC 272 1 15923 0 1S25M659N10M * 0 0 GACCACTTCCCTGGGAGCTCCCTGGACTGAAGGAGA AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:7 HI:i:3 AS:i:35 nM:i:0
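The column layout described above can be explored with a few lines of Python. The following is a minimal sketch, not part of PARANOiD: the function names `parse_sam_line` and `decode_flag` are hypothetical, and the input is assumed to be a SAM text line (for example, a BAM file that has already been converted to SAM). The FLAG bit meanings are taken from the SAM specification.

```python
# Sketch: split one SAM record into its 11 mandatory columns and decode the FLAG.
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]

# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "last_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Return a dict with the 11 mandatory columns plus the optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = fields[11:]  # optional fields such as NH:i:7
    return record


def decode_flag(flag):
    """List the names of all bits that are set in a bitwise FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# First record of the example above, rebuilt with tab separators.
example = "\t".join([
    "NB501399:129:HLW7VAFX2:3:11409:5471:17963_AAGACACTG", "272", "1",
    "14572", "0", "23M", "*", "0", "0",
    "CCACACAGTGCTGGTTCCGTCAC", "EEEEEEEEEEEAEEEEEEEEEEE",
    "NH:i:7", "HI:i:4", "AS:i:22", "nM:i:0",
])
record = parse_sam_line(example)
print(record["pos"], record["cigar"], decode_flag(record["flag"]))
```

For the example records, the FLAG 272 is 256 + 16, which decodes to a secondary alignment on the reverse strand.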
Peak-called cross-link sites
Raw cross-link sites
Directory containing unmodified cross-link sites with all background noise retained. Cross-link sites are provided in three different formats, each in its own directory: WIG, BIGWIG and BEDGRAPH. Each format contains identical data.
These files are included in the basic analysis.
WIG (Wiggle)
Format to represent genome-wide coverage. It consists of one header line per reference chromosome, followed by the coverage values for that chromosome, one position per line in a tab-separated manner. Column 1 gives the position and column 2 the coverage at that position. For each sample, two WIG files are generated, one for cross-link events on the forward strand and one for the reverse strand; they can be distinguished by their filenames. The number of cross-link events on the reverse strand is shown as negative values.
variableStep chrom=reference_1 span=1
2815 1.0
3726 1.0
3895 1.0
6201 1.0
6367 1.0
variableStep chrom=reference_2 span=1
22 1.0
31 1.0
66 1.0
80 1.0
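As an illustration of the layout shown above, the following minimal sketch reads variableStep WIG text into a nested dictionary. It is not part of PARANOiD; the function name `parse_variable_step_wig` is hypothetical, and the sketch assumes that every data line holds exactly a position and a value separated by whitespace.

```python
def parse_variable_step_wig(text):
    """Map chromosome name -> {position: coverage} for variableStep WIG text."""
    coverage = {}
    chrom = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue  # skip blank lines, track definitions and comments
        if line.startswith("variableStep"):
            # Header line, e.g. "variableStep chrom=reference_1 span=1"
            settings = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = settings["chrom"]
            coverage.setdefault(chrom, {})
            continue
        if chrom is None:
            raise ValueError("data line found before any variableStep header")
        position, value = line.split()
        coverage[chrom][int(position)] = float(value)
    return coverage


wig_text = """variableStep chrom=reference_1 span=1
2815 1.0
3726 1.0
variableStep chrom=reference_2 span=1
22 1.0
31 1.0
"""
coverage = parse_variable_step_wig(wig_text)
for chrom_name, sites in coverage.items():
    print(chrom_name, len(sites), "cross-link positions")
```

In a reverse-strand file the parsed values would be negative, so the sign of a value identifies the strand.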
BIGWIG
An extension of the previously mentioned WIG format. While WIG uses plain text, BIGWIG stores the data in a binary format, which reduces file size. Because the data is binary, accessing it requires specialized software such as IGV.
BEDGRAPH
A format similar to WIG or BIGWIG. BEDGRAPH files consist of four columns:
1. Chromosome name
2. Start position of the described events
3. End position of the described events (in PARANOiD, this is the actual cross-link position)
4. Coverage of the described event (negative for reverse strand)
DQ375404.1 2814 2815 1
DQ375404.1 3725 3726 1
DQ375404.1 3894 3895 1
DQ375404.1 6200 6201 1
DQ375404.1 6366 6367 1
DQ380154.1 21 22 1
DQ380154.1 30 31 1
DQ380154.1 65 66 1
DQ380154.1 79 80 1
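The relationship between the WIG and BEDGRAPH examples (WIG position 2815 corresponds to the BEDGRAPH interval 2814 to 2815) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that every BEDGRAPH interval spans a single nucleotide, as in the examples above.

```python
def parse_bedgraph_line(line):
    """Split one BEDGRAPH line into (chromosome, start, end, coverage)."""
    chrom, start, end, value = line.split()
    return chrom, int(start), int(end), float(value)


def crosslink_site(line):
    """Return (chromosome, position, strand, count) for a one-nucleotide interval."""
    chrom, start, end, value = parse_bedgraph_line(line)
    if end - start != 1:
        raise ValueError("expected an interval covering exactly one nucleotide")
    strand = "-" if value < 0 else "+"  # reverse strand is stored as negative
    return chrom, end, strand, abs(value)


def wig_to_bedgraph(chrom, position, value):
    """Format a WIG position as a BEDGRAPH line whose end column is that position."""
    return f"{chrom}\t{position - 1}\t{position}\t{value:g}"


print(crosslink_site("DQ375404.1 2814 2815 1"))
print(wig_to_bedgraph("DQ375404.1", 2815, 1.0))
```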
Visualization with IGV
All provided file types can be easily visualized using the Integrative Genomics Viewer (IGV). To do so, the reference sequences must first be loaded into IGV. Click on the Genomes tab in the top-left corner and select the source of the reference genome.
The reference track can be used to zoom in, allowing users to see cross-link sites in more detail.
Merged cross-link sites
Execution metrics
Directory containing general execution metrics of the workflow such as:
- container_information.txt
Container system used to execute the processes, along with the containers used during the workflow execution
- execution_information.txt
- Contains information required to reproduce the results, such as:
Command used for the execution
Directory of PARANOiD
Config file used
Profiles used
Version of Nextflow and PARANOiD
Execution directory
- parameter_information.txt
Contains all parameters used
Execution metrics are included in the basic analysis.
IGV-session
An XML file that can be directly loaded into IGV.
This can be done by clicking on the Data tab in the top-left corner and then on Open Session. A file browser will open, allowing you to navigate to the PARANOiD output directory and select the igv-session.xml.
This will open a predefined IGV session that includes the reference genome, cross-link sites for all samples (forward and reverse) and the alignment files of all samples.
If the option --merge_replicates was chosen, only the merged cross-link sites are shown.
This is included in the basic analysis.
Peak height distribution
Peak height distribution is included in the basic analysis.
Reference
The reference sequence provided as input. The reference file is included in the basic analysis.
Statistics
Statistics are included in the basic analysis.
Strand distribution
Strand distribution is included in the basic analysis.
Optional analyses
Peak distance analysis
The peak distance analysis produces three output files:
- Distances table (TSV)
- Contains two columns:
Distance
Number of peaks with that distance
- Distance plot (linear scale)
A plot showing the distribution of distances with a linear y-axis
- Distance plot (logarithmic scale)
A plot showing the same data with a logarithmic y-axis
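The distances table can also be inspected outside of the generated plots. The sketch below is a hedged illustration, not PARANOiD code: it assumes a tab-separated file with two integer columns (distance, number of peaks) and tolerates an optional header line, which may not match the actual file exactly. The function names are hypothetical, and the values in `toy_table` are invented purely to show the mechanics.

```python
import csv
import io


def read_distance_table(handle):
    """Read (distance, count) pairs, skipping lines that are not two integers."""
    rows = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            rows.append((int(row[0]), int(row[1])))
        except ValueError:
            continue  # e.g. a header line
    return rows


def most_frequent_distances(rows, top=3):
    """Return the `top` distances with the highest number of peaks."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# Toy values for illustration only; they are not PARANOiD output.
toy_table = "distance\tcount\n10\t5\n20\t12\n30\t7\n40\t1\n"
rows = read_distance_table(io.StringIO(toy_table))
print(most_frequent_distances(rows, top=2))
```

With a real output file, `io.StringIO(toy_table)` would be replaced by an opened file handle.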
The example image below was generated using iCLIP data from König et al., which is available here. iCLIP was performed on HeLa cells with hnRNP C antibodies using a precursor version of the iCLIP protocol. The study showed that hnRNP C binds RNAs at characteristic distances of approximately 165 and 300 nucleotides. These distances appear as peaks in the distance analysis (see Fig. 3e of the study). By running PARANOiD's peak distance analysis on this dataset, we were able to reproduce these peaks.
Example of a peak distance analysis performed by PARANOiD, showing the distances between hnRNP C binding sites in HeLa cells.
To generate this figure, the following files were used:
- iCLIP reads:
Downloaded using fasterq-dump and merged into a single FASTQ file
fasterq-dump ERR018282 ERR018283 ERR018284
cat ERR018282.fastq ERR018283.fastq ERR018284.fastq > hnRNPC.fastq
- Barcodes:
hnRNPC_rep_1 TG
hnRNPC_rep_2 TC
hnRNPC_rep_3 CA
- Reference genome
hg18 from https://hgdownload.cse.ucsc.edu/goldenpath/hg18/bigZips/
- Annotation file
hg18 annotation from https://www.gencodegenes.org/human/release_18.html
The following command was used to run PARANOiD:
nextflow ~/git-projects/PARANOID/main.nf \
--reads data/hnRNPC.fastq \
--barcodes data/hnRNPC_barcodes.tsv \
--reference data/hg18.fa \
--annotation data/gencode.v18.annotation.gtf \
--domain eu \
--barcode_pattern NNXXX \
--mapq 3 \
--omit_peak_calling \
--peak_distance \
--distance 350 \
--merge_replicates \
--output PARANOiD_peak_distance \
-profile slurm,apptainer
Since the dataset was generated using a precursor iCLIP version, simple barcodes were used: 2 experimental nucleotides followed by 3 random nucleotides. To adapt to these barcodes, the parameter --barcode_pattern NNXXX was specified.
RNA subtype analysis
The RNA subtype analysis produces four output files per sample:
- Overview of ambiguous assignments (TSV)
Contains one column and one row per RNA subtype. Each cell shows how many multiple assignments involve the subtypes of its row and column.
                 intron  CDS  three_prime_UTR  five_prime_UTR
intron           3773    222  3359             237
CDS              222     371  151              36
three_prime_UTR  3359    151  3533             42
five_prime_UTR   237     36   42               282
- Barplot showing the percentage distribution of RNA subtypes
Example of an RNA subtype analysis performed by PARANOiD, showing the binding preference of the protein HuR on mouse genes.
- Distribution table of RNA subtypes
- Contains 3 columns:
RNA subtype being described
Total number of assignments per RNA subtype
Percentage distribution of assigned subtypes
RNA_subtypes     number_assignment  percentage
intron           86681              68.65
CDS              1847               1.46
three_prime_UTR  32723              25.92
five_prime_UTR   1062               0.84
ambiguous        3957               3.13
total            126270             100
- Logs
Contains information on whether the assigned RNA subtypes sum to a different number than the number of peaks. This currently happens when ambiguous assignments are split, because some reads are assigned to multiple RNA subtypes.
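The percentage column of the distribution table follows directly from the assignment counts. The short sketch below recomputes it from the example values shown in the table; the function name `percentage_distribution` is hypothetical, and rounding to two decimals is an assumption based on the values displayed.

```python
# Assignment counts copied from the example distribution table.
counts = {
    "intron": 86681,
    "CDS": 1847,
    "three_prime_UTR": 32723,
    "five_prime_UTR": 1062,
    "ambiguous": 3957,
}


def percentage_distribution(counts):
    """Return each subtype's share of all assignments, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * number / total, 2) for name, number in counts.items()}


total = sum(counts.values())
percentages = percentage_distribution(counts)
print(total)
for name, share in percentages.items():
    print(f"{name}\t{counts[name]}\t{share}")
```

The recomputed total and shares match the total row and the percentage column of the example table.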
The data is from the publication by Diaz-Muñoz et al., whose Fig. 5a shows an RNA subtype distribution comparable to these results. To generate the results, the following files were used:
- iCLIP reads
Downloaded using fasterq-dump. Since the replicates are already demultiplexed and the experimental barcode has been removed, they should not be merged into a single file. Note that only replicate 3 yields results, because the quality of almost all nucleotides in replicates 1 and 2 is extremely low (Phred score of 2, i.e. a 63.1% probability that a base was called incorrectly).
fasterq-dump SRR1694878 SRR1694879 SRR1694880
mv SRR1694878.fastq HuR_iCLIP_rep_1.fastq
mv SRR1694879.fastq HuR_iCLIP_rep_2.fastq
mv SRR1694880.fastq HuR_iCLIP_rep_3.fastq
- Reference genome
mm10 from https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/
- Annotation file
mm10 annotation from https://www.gencodegenes.org/mouse/release_M10.html
Additionally, the annotation must be extended with entries for introns. This is done with AGAT:
agat_sp_add_introns.pl --gff gencode.vM10.annotation.gff3 --out gencode.vM10.annotation.introns.gff3
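The Phred score quoted for replicates 1 and 2 in the read description above translates into an error probability through the standard relation P = 10^(-Q/10). The sketch below shows this conversion; the function names are hypothetical, and the Phred+33 quality encoding used in `ascii_to_phred` is an assumption that should be checked against the actual FASTQ files.

```python
def phred_to_error_probability(quality):
    """Probability that a base call is wrong for a given Phred score."""
    return 10 ** (-quality / 10)


def ascii_to_phred(char, offset=33):
    """Decode one FASTQ quality character (Phred+33 encoding is assumed)."""
    return ord(char) - offset


def mean_error_probability(quality_string, offset=33):
    """Average per-base error probability of one FASTQ quality line."""
    scores = [ascii_to_phred(char, offset) for char in quality_string]
    return sum(phred_to_error_probability(score) for score in scores) / len(scores)


print(round(phred_to_error_probability(2), 3))   # Phred score 2
print(round(phred_to_error_probability(36), 6))  # 'E' under Phred+33
```

A Phred score of 2 gives roughly 0.631, the 63.1% mentioned above, which is why reads from replicates 1 and 2 are of little use.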
The following command was used to run PARANOiD:
nextflow ~/git-projects/PARANOID/main.nf \
--reads 'data/HuR_iCLIP_rep_*.fastq' \
--reference data/mm10.fa \
--annotation gencode.vM10.annotation.introns.gff3 \
--domain eu \
--omit_demultiplexing \
--omit_peak_calling \
--barcode_pattern NNNN \
--run_rna_subtype \
--rna_subtypes intron,CDS,three_prime_UTR,five_prime_UTR \
--min_length 25 \
--output PARANOiD_RNA_subtype \
-profile slurm,apptainer

