PARANOiD Outputs

Explanation of all output files generated by PARANOiD

Alignments

Directory containing deduplicated alignments in BAM format, each accompanied by an index file in BAM.BAI format. BAM files are the compressed binary form of SAM files. SAM files are tab-separated and show one alignment per line. The information provided in the columns is as follows:
1. Read header
2. Bitwise FLAG
3. Name of reference sequence
4. Position of alignment (1-based)
5. MAPQ score
6. CIGAR string
7. Reference name of the mate read (shows * if information is not available)
8. Position of mate read (shows 0 if information is not available)
9. Observed template length (shows 0 if information is not available)
10. Read sequence (shows * if information is not available)
11. Quality of read sequence (shows * if information is not available)

One BAM file and one index file are generated per sample.

Alignments are included in the basic analysis.

Example:

NB501399:129:HLW7VAFX2:3:11409:5471:17963_AAGACACTG     272     1       14572   0       23M     *       0       0       CCACACAGTGCTGGTTCCGTCAC EEEEEEEEEEEAEEEEEEEEEEE NH:i:7  HI:i:4  AS:i:22 nM:i:0
NB501399:129:HLW7VAFX2:3:11604:9407:1314_TCTGCCCAC      272     1       14747   0       36M     *       0       0       CGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAA    EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE    NH:i:5  HI:i:4  AS:i:35 nM:i:0
NB501399:129:HLW7VAFX2:2:11201:6526:7382_TCCCCGACC      272     1       14847   0       40M     *       0       0       AGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCC        EEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NH:i:6  HI:i:3  AS:i:39 nM:i:0
NB501399:129:HLW7VAFX2:1:11204:3841:14476_GCGATCCCG     272     1       14992   0       37M     *       0       0       GTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTC   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NH:i:8  HI:i:5  AS:i:36 nM:i:0
NB501399:129:HLW7VAFX2:2:11204:16119:17944_CACACCCCG    272     1       14992   0       37M     *       0       0       GTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTC   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NH:i:8  HI:i:5  AS:i:36 nM:i:0
NB501399:129:HLW7VAFX2:1:21211:6880:4260_CCACAACTC      272     1       15923   0       1S25M659N10M    *       0       0       GACCACTTCCCTGGGAGCTCCCTGGACTGAAGGAGA    AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE    NH:i:7  HI:i:3  AS:i:35 nM:i:0
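
The column layout described above can be illustrated with a short Python sketch. It is not part of PARANOiD: the names SamRecord, parse_sam_line and describe_flag are hypothetical helpers, and the field names follow the SAM specification. The sketch reads plain SAM text, so a BAM file would first have to be converted (for example with samtools view). It splits the first example line into its 11 mandatory columns plus optional tags and decodes three bits of the FLAG value 272.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """Illustrative container for the 11 mandatory SAM columns plus optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated SAM alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for tag in fields[11:]:
        # Optional tags have the form NAME:TYPE:VALUE, e.g. NH:i:7
        name, _type, value = tag.split(":", 2)
        tags[name] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def describe_flag(flag: int) -> dict:
    """Decode three of the FLAG bits defined in the SAM specification."""
    return {
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
        "secondary": bool(flag & 0x100),
    }


# First alignment of the example above, rebuilt with explicit tabs.
example = "\t".join([
    "NB501399:129:HLW7VAFX2:3:11409:5471:17963_AAGACACTG", "272", "1", "14572",
    "0", "23M", "*", "0", "0", "CCACACAGTGCTGGTTCCGTCAC",
    "EEEEEEEEEEEAEEEEEEEEEEE", "NH:i:7", "HI:i:4", "AS:i:22", "nM:i:0",
])
record = parse_sam_line(example)
print(record.rname, record.pos, describe_flag(record.flag))
```

A FLAG of 272 is the sum of 256 (secondary alignment) and 16 (read maps to the reverse strand), which is what describe_flag reports for the example record.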

Peak-called cross-link sites

Raw cross-link sites

Directory containing unmodified cross-link sites with all background noise retained. Cross-link sites are provided in three different formats, each in its own directory: WIG, BIGWIG and BEDGRAPH. Each format contains identical data. These files are included in the basic analysis.

WIG (Wiggle)

A plain-text format for representing genome-wide coverage. Each reference chromosome is introduced by one declaration line, and the coverage values are listed below it, one position per line, in two tab-separated columns. Column 1 gives the position, while column 2 gives the coverage at that position. For each sample, two WIG files are generated: one representing cross-link events on the forward strand and one representing those on the reverse strand. The two can be distinguished by their filenames. The number of cross-link events on the reverse strand is shown as negative values.

variableStep chrom=reference_1 span=1
  1.0
  1.0
  1.0
  1.0
  1.0
variableStep chrom=reference_2 span=1
1.0
1.0
1.0
1.0
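
As a rough illustration of the structure described above, the following Python sketch reads variableStep WIG lines into a dictionary. It is an assumption-laden example rather than PARANOiD code: the function name parse_variable_step_wig is hypothetical, and the positions and values in wig_lines are invented purely for demonstration.

```python
def parse_variable_step_wig(lines):
    """Return {chromosome: {position: value}} from variableStep WIG lines."""
    coverage = {}
    chrom = None
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional track/browser/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=reference_1 span=1"
            params = dict(p.split("=", 1) for p in line.split()[1:])
            chrom = params["chrom"]
            coverage.setdefault(chrom, {})
            continue
        position, value = line.split()
        coverage[chrom][int(position)] = float(value)
    return coverage


# Invented demonstration data; negative values would come from a reverse-strand file.
wig_lines = [
    "variableStep chrom=reference_1 span=1",
    "12\t1.0",
    "57\t3.0",
    "variableStep chrom=reference_2 span=1",
    "8\t-2.0",
]
coverage = parse_variable_step_wig(wig_lines)
print(coverage)
```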

BIGWIG

An extension of the WIG format described above. While WIG uses plain text, BIGWIG stores the data in a binary format, which reduces file size. Because the files are binary, accessing the data requires specialized software such as IGV.

BEDGRAPH

A format similar to WIG or BIGWIG. BEDGRAPH files consist of four columns:
1. The chromosome name
2. The start position of the described events
3. The end position of the described events (in PARANOiD, this is the actual cross-link position)
4. Coverage of the described event (negative for reverse strand)

DQ375404.1  2814    2815    1
DQ375404.1  3725    3726    1
DQ375404.1  3894    3895    1
DQ375404.1  6200    6201    1
DQ375404.1  6366    6367    1
DQ380154.1  21      22      1
DQ380154.1  30      31      1
DQ380154.1  65      66      1
DQ380154.1  79      80      1
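
The four-column layout can be processed with a few lines of Python. The sketch below is only an illustration (the function name summarize_bedgraph is hypothetical): it sums the cross-link events per chromosome and uses the sign of column 4 to separate forward from reverse strand, as described above. The input lines are taken from the example.

```python
from collections import defaultdict


def summarize_bedgraph(lines):
    """Sum cross-link events per chromosome, split by strand via the sign of column 4."""
    totals = defaultdict(lambda: {"forward": 0.0, "reverse": 0.0})
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional track/browser/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        value = float(value)
        strand = "reverse" if value < 0 else "forward"
        # Each interval contributes its value once per covered position.
        totals[chrom][strand] += abs(value) * (int(end) - int(start))
    return dict(totals)


bedgraph_lines = [
    "DQ375404.1  2814    2815    1",
    "DQ375404.1  3725    3726    1",
    "DQ375404.1  3894    3895    1",
    "DQ375404.1  6200    6201    1",
    "DQ375404.1  6366    6367    1",
    "DQ380154.1  21      22      1",
    "DQ380154.1  30      31      1",
    "DQ380154.1  65      66      1",
    "DQ380154.1  79      80      1",
]
totals = summarize_bedgraph(bedgraph_lines)
for chrom, per_strand in totals.items():
    print(chrom, per_strand)
```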

Visualization with IGV

All provided file types can be easily visualized using the Integrative Genomics Viewer (IGV). To do so, the reference sequences must first be loaded into IGV. Click on the Genomes tab in the top-left corner and select the source of the reference genome.

The reference track can be used to zoom in, allowing users to see cross-link sites in more detail.

Cross-link sites merged

Execution metrics

Directory containing general execution metrics of the workflow such as:

container_information.txt
    Container system used to execute the processes, along with the containers used during the workflow execution
execution_information.txt
    Contains information required to reproduce the results, such as:
        Command used for the execution
        Directory of PARANOiD
        Config file used
        Profiles used
        Version of Nextflow and PARANOiD
        Execution directory
parameter_information.txt
    Contains all parameters used

Execution metrics are included in the basic analysis.

IGV-session

An XML file that can be loaded directly into IGV. This can be done by clicking on the Data tab in the top-left corner and then on Open Session. A file browser will open, allowing you to navigate to the PARANOiD output directory and select igv-session.xml. This opens a predefined IGV session that includes the reference genome, the cross-link sites of all samples (forward and reverse) and the alignment files of all samples. If the option --merge_replicates was chosen, only the merged cross-link sites are shown. The IGV session is included in the basic analysis.

Peak height distribution

Peak height distribution is included in the basic analysis.

Reference

The reference sequence provided as input. The reference file is included in the basic analysis.

Statistics

Statistics are included in the basic analysis.

Strand distribution

Strand distribution is included in the basic analysis.

Optional analyses

Peak distance analysis

The peak distance analysis produces three output files:

Distances table (TSV)
    Contains two columns:
        Distance
        Number of peaks with that distance
Distance plot (linear scale)
    A plot showing the distribution of distances with a linear y-axis
Distance plot (logarithmic scale)
    A plot showing the same data with a logarithmic y-axis
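
To make the content of the distances table more concrete, here is a minimal Python sketch that counts distances between peak positions. It is not PARANOiD's implementation: the function name peak_distance_counts and the peak positions are invented, and the sketch assumes that every pair of peaks within a maximum distance is counted, which the documentation does not state explicitly.

```python
from collections import Counter


def peak_distance_counts(positions, max_distance):
    """Count distances between every pair of peaks at most max_distance apart."""
    ordered = sorted(positions)
    counts = Counter()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            distance = right - left
            if distance > max_distance:
                # Positions are sorted, so all later peaks are even further away.
                break
            counts[distance] += 1
    return counts


# Invented peak positions on a single reference sequence.
peaks = [100, 265, 400, 430, 900]
counts = peak_distance_counts(peaks, max_distance=350)

# Print a two-column table: distance, number of peak pairs with that distance.
for distance in sorted(counts):
    print(f"{distance}\t{counts[distance]}")
```

The max_distance argument plays a role comparable to the --distance parameter used in the command further down, under the assumption that this parameter caps the distances considered.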

The example image below was generated using iCLIP data from König et al.; the accessions are listed further down. iCLIP was performed on HeLa cells with hnRNP C antibodies using a precursor version of the iCLIP protocol. The study showed that hnRNP C binds RNAs at characteristic distances of approximately 165 and 300 nucleotides. These distances appear as peaks in the distance analysis (see Fig. 3e). By running PARANOiD's peak distance analysis on this dataset, we were able to reproduce these peaks.

Example for peak distance analysis performed by PARANOiD. Shows the distances between hnRNP C binding sites in HeLa cells.

To generate this figure, the following files were used:

iCLIP reads:
Downloaded using fasterq-dump and merged into a single FASTQ file

fasterq-dump ERR018282 ERR018283 ERR018284
cat ERR018282.fastq ERR018283.fastq ERR018284.fastq > hnRNPC.fastq

barcodes:

hnRNPC_rep_1    TG
hnRNPC_rep_2    TC
hnRNPC_rep_3    CA

Reference genome
hg18 from https://hgdownload.cse.ucsc.edu/goldenpath/hg18/bigZips/
Annotation file
hg18 annotation from https://www.gencodegenes.org/human/release_18.html

The following command was used to run PARANOiD:

nextflow ~/git-projects/PARANOID/main.nf \
    --reads data/hnRNPC.fastq \
    --barcodes data/hnRNPC_barcodes.tsv \
    --reference data/hg18.fa \
    --annotation data/gencode.v18.annotation.gtf \
    --domain eu \
    --barcode_pattern NNXXX \
    --mapq 3 \
    --omit_peak_calling \
    --peak_distance \
    --distance 350 \
    --merge_replicates \
    --output PARANOiD_peak_distance \
    -profile slurm,apptainer

Since the dataset was generated using a precursor iCLIP version, simple barcodes were used: 2 experimental nucleotides followed by 3 random nucleotides. To adapt to these barcodes, the parameter --barcode_pattern NNXXX was specified.

RNA subtype analysis

The RNA subtype analysis produces four output files per sample:

Overview of ambiguous assignments (TSV)
Contains one column and one row per RNA subtype. Each cell shows how many ambiguous assignments are shared between the subtype of its row and the subtype of its column.

	intron	CDS	three_prime_UTR	five_prime_UTR
intron	3773	222	3359	237
CDS	222	371	151	36
three_prime_UTR	3359	151	3533	42
five_prime_UTR	237	36	42	282

Barplot showing the percentage RNA subtype distribution

Example of an RNA subtype analysis performed by PARANOiD. Shows the binding preference of the protein HuR across RNA subtypes in mouse.

Distribution table of RNA subtypes
Contains 3 columns:
    RNA subtype being described
    Total number of assignments per RNA subtype
    Percentage distribution of assigned subtypes

RNA_subtypes	number_assignment	percentage
intron	86681	68.65
CDS	1847	1.46
three_prime_UTR	32723	25.92
five_prime_UTR	1062	0.84
ambiguous	3957	3.13
total	126270	100

Logs
Reports whether the assigned RNA subtypes sum up to a different number than the number of peaks. Currently, this happens when ambiguous assignments are split, because some reads are assigned to multiple RNA subtypes.
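
The percentage column of the distribution table can be reproduced from the assignment counts. The following Python sketch is illustrative only (subtype_percentages is a hypothetical helper, not a PARANOiD function) and uses the counts from the example table, assuming that percentages are calculated against the total including ambiguous assignments.

```python
def subtype_percentages(counts):
    """Return each subtype's share of all assignments in percent, rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts taken from the example distribution table.
counts = {
    "intron": 86681,
    "CDS": 1847,
    "three_prime_UTR": 32723,
    "five_prime_UTR": 1062,
    "ambiguous": 3957,
}
total = sum(counts.values())
percentages = subtype_percentages(counts)

for name, value in percentages.items():
    print(f"{name}\t{counts[name]}\t{value}")
print(f"total\t{total}")
```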

The data is from the publication by Diaz-Muñoz et al.; Fig. 5a of that publication shows a comparable RNA subtype distribution. To generate these results, the following files were used:

iCLIP reads
Downloaded using fasterq-dump. Since the replicates are already demultiplexed and the experimental barcode has been removed, they should not be merged into a single file. Note that only replicate 3 yields usable results, because the quality of almost all nucleotides in replicates 1 and 2 is extremely low (Phred score of 2, which corresponds to a 63.1% probability that a base was called incorrectly).

fasterq-dump SRR1694878 SRR1694879 SRR1694880
mv SRR1694878.fastq HuR_iCLIP_rep_1.fastq
mv SRR1694879.fastq HuR_iCLIP_rep_2.fastq
mv SRR1694880.fastq HuR_iCLIP_rep_3.fastq

Reference genome
mm10 from https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/
Annotation file
mm10 annotation from https://www.gencodegenes.org/mouse/release_M10.html

Additionally, the annotation needs to be prepared in order to gain entries for introns. This is done via AGAT:

agat_sp_add_introns.pl --gff gencode.vM10.annotation.gff3 --out gencode.vM10.annotation.introns.gff3

The following command was used to run PARANOiD:

nextflow ~/git-projects/PARANOID/main.nf \
    --reads 'data/HuR_iCLIP_rep_*.fastq' \
    --reference data/mm10.fa \
    --annotation gencode.vM10.annotation.introns.gff3 \
    --domain eu \
    --omit_demultiplexing \
    --omit_peak_calling \
    --barcode_pattern NNNN \
    --run_rna_subtype \
    --rna_subtypes intron,CDS,three_prime_UTR,five_prime_UTR \
    --min_length 25 \
    --output PARANOiD_RNA_subtype \
    -profile slurm,apptainer
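
As a side note on the read quality mentioned for replicates 1 and 2: a Phred score Q corresponds to an error probability of 10^(-Q/10). The short Python sketch below shows this conversion. The helper names are hypothetical, and the Phred+33 offset used to decode quality characters is an assumption about the FASTQ encoding.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score into the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def mean_phred(quality_string, offset=33):
    """Average Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


# A Phred score of 2 means roughly 63.1% of base calls are expected to be wrong.
low_quality_error = phred_to_error_probability(2)
print(f"{low_quality_error:.3f}")

# For comparison: the quality character 'E' decodes to a Phred score of 36.
print(mean_phred("EEEE"))
```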
