PARANOiD Outputs

Explanation of all output files generated by PARANOiD

Alignments

Directory containing deduplicated alignments in BAM format, each with an index file in BAM.BAI format. BAM files are the compressed binary form of SAM files. SAM files are tab-separated and show one alignment per line. The mandatory columns are:

  1. Read header
  2. Bitwise FLAG
  3. Name of reference sequence
  4. Position of alignment (1-based)
  5. MAPQ score
  6. CIGAR string
  7. Reference name of the mate read (shows * if information is not available)
  8. Position of mate read (shows 0 if information is not available)
  9. Observed template length (shows 0 if information is not available)
  10. Read sequence (shows * if information is not available)
  11. Quality of read sequence (shows * if information is not available)

Any further columns contain optional tags in TAG:TYPE:VALUE format.

One BAM file and one index file are generated per sample.

Alignments are included in the basic analysis.

Example:

NB501399:129:HLW7VAFX2:3:11409:5471:17963_AAGACACTG     272     1       14572   0       23M     *       0       0       CCACACAGTGCTGGTTCCGTCAC EEEEEEEEEEEAEEEEEEEEEEE NH:i:7  HI:i:4  AS:i:22 nM:i:0
NB501399:129:HLW7VAFX2:3:11604:9407:1314_TCTGCCCAC      272     1       14747   0       36M     *       0       0       CGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAA    EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE    NH:i:5  HI:i:4  AS:i:35 nM:i:0
NB501399:129:HLW7VAFX2:2:11201:6526:7382_TCCCCGACC      272     1       14847   0       40M     *       0       0       AGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCC        EEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NH:i:6  HI:i:3  AS:i:39 nM:i:0
NB501399:129:HLW7VAFX2:1:11204:3841:14476_GCGATCCCG     272     1       14992   0       37M     *       0       0       GTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTC   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NH:i:8  HI:i:5  AS:i:36 nM:i:0
NB501399:129:HLW7VAFX2:2:11204:16119:17944_CACACCCCG    272     1       14992   0       37M     *       0       0       GTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTC   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NH:i:8  HI:i:5  AS:i:36 nM:i:0
NB501399:129:HLW7VAFX2:1:21211:6880:4260_CCACAACTC      272     1       15923   0       1S25M659N10M    *       0       0       GACCACTTCCCTGGGAGCTCCCTGGACTGAAGGAGA    AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE    NH:i:7  HI:i:3  AS:i:35 nM:i:0
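
The column layout described above can be read programmatically. The following is a minimal Python sketch, not part of PARANOiD: the field names and the helper functions `parse_sam_line` and `decode_flag` are illustrative assumptions, while the flag bit meanings follow the SAM specification. It parses the first example line and decodes its FLAG value of 272.

```python
# Illustrative field names for the 11 mandatory SAM columns (assumed naming).
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]

# Meaning of each FLAG bit according to the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Optional tags follow the mandatory columns as TAG:TYPE:VALUE.
    record["tags"] = {}
    for tag in fields[11:]:
        name, value_type, value = tag.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    return record


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# First alignment from the example above, rebuilt with explicit tabs.
example = "\t".join([
    "NB501399:129:HLW7VAFX2:3:11409:5471:17963_AAGACACTG", "272", "1",
    "14572", "0", "23M", "*", "0", "0",
    "CCACACAGTGCTGGTTCCGTCAC", "EEEEEEEEEEEAEEEEEEEEEEE",
    "NH:i:7", "HI:i:4", "AS:i:22", "nM:i:0",
])
record = parse_sam_line(example)
print(record["rname"], record["pos"], decode_flag(record["flag"]))
```

A FLAG of 272 is the sum of 256 and 16, so the read is reported as a secondary alignment on the reverse strand. To obtain such lines from a BAM file, it first has to be converted to SAM text, for example with a tool such as samtools.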

Execution metrics

Directory containing general execution metrics of the workflow such as:

  1. container_information.txt

    Container system used to execute the processes, along with the containers used during the workflow execution

  2. execution_information.txt
    Contains information required to reproduce the results, such as:
    1. Command used for the execution

    2. Directory of PARANOiD

    3. Config file used

    4. Profiles used

    5. Version of Nextflow and PARANOiD

    6. Execution directory

  3. parameter_information.txt

    Contains all parameters used

Execution metrics are included in the basic analysis.

IGV-session

An XML file that can be directly loaded into IGV. This can be done by clicking on the Data tab in the top-left corner and then on Open Session. A file browser will open, allowing you to navigate to the PARANOiD output directory and select the igv-session.xml. This will open a predefined IGV session that includes the reference genome, cross-link sites for all samples (forward and reverse) and the alignment files of all samples. If the option --merge_replicates was chosen, only the merged cross-link sites will be shown. The IGV session is included in the basic analysis.
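
Because the session is plain XML, the tracks it references can also be listed without opening IGV. The sketch below is a hedged illustration: it assumes the usual IGV session layout in which each track is declared as a `Resource` element with a `path` attribute, and the embedded example session with its file paths is hypothetical rather than copied from PARANOiD output.

```python
import xml.etree.ElementTree as ET


def list_session_resources(xml_text):
    """Return the 'path' attribute of every Resource element in a session."""
    root = ET.fromstring(xml_text)
    return [resource.get("path") for resource in root.iter("Resource")]


# Hypothetical, shortened session; real files contain more elements.
example_session = """
<Session genome="reference.fasta">
  <Resources>
    <Resource path="alignments/sample_1.bam"/>
    <Resource path="cross-link-sites/sample_1_forward.wig"/>
    <Resource path="cross-link-sites/sample_1_reverse.wig"/>
  </Resources>
</Session>
"""

for path in list_session_resources(example_session):
    print(path)
```

For a real session, the file content can be passed in with `open("igv-session.xml").read()`.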

Peak height distribution

Peak height distribution is included in the basic analysis.

Reference

The reference sequence provided as input. The reference file is included in the basic analysis.

Statistics

Statistics are included in the basic analysis.

Strand distribution

Strand distribution is included in the basic analysis.

Optional analyses

Peak distance analysis

The peak distance analysis produces three output files:

  1. Distances table (TSV)
    Contains two columns:
    • Distance

    • Number of peaks with that distance

  2. Distance plot (linear scale)

    A plot showing distribution of distances with normal y-axis

  3. Distance plot (logarithmic scale)

    A plot showing the same data with logarithmic y-axis
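
The distances table can be inspected directly when the plots are not sufficient. The following Python sketch assumes a tab-separated file with the distance in the first column and the number of peaks in the second; whether a header line is present is not stated here, so non-numeric lines are skipped. The function names and the example values are made up for illustration.

```python
import csv
import io


def read_distances(handle):
    """Read a two-column TSV (distance, count) into a dictionary."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            distance, count = int(row[0]), int(row[1])
        except ValueError:
            continue  # skip a header or any other non-numeric line
        counts[distance] = count
    return counts


def most_frequent_distance(counts, min_distance=0, max_distance=None):
    """Return the distance with the highest count inside the given range."""
    candidates = {
        d: c for d, c in counts.items()
        if d >= min_distance and (max_distance is None or d <= max_distance)
    }
    return max(candidates, key=candidates.get)


# Made-up example data, not taken from a real PARANOiD run.
example_tsv = "distance\tcount\n1\t50\n2\t30\n164\t8\n165\t12\n166\t9\n"
counts = read_distances(io.StringIO(example_tsv))
print(most_frequent_distance(counts, min_distance=100))
```

Restricting the search with `min_distance` is useful because very short distances typically dominate the counts and would otherwise hide peaks at larger distances.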

The example image below was generated using iCLIP data from König et al., which is available here. iCLIP was performed on HeLa cells with hnRNP C antibodies using a precursor version of the iCLIP protocol. The study showed that hnRNP C binds RNAs at characteristic distances of approximately 165 and 300 nucleotides. These distances appear as peaks in the distance analysis (see Fig. 3e of the publication). By performing PARANOiD's peak distance analysis on this dataset, we were able to recreate these peaks.

Example for peak distance analysis performed by PARANOiD. Shows the distances between hnRNP C binding sites in HeLa cells.

To generate this figure, the following files were used:

  1. iCLIP reads:

    Downloaded using fasterq-dump and merged into a single FASTQ file

fasterq-dump ERR018282 ERR018283 ERR018284
cat ERR018282.fastq ERR018283.fastq ERR018284.fastq > hnRNPC.fastq
  2. Barcodes:

hnRNPC_rep_1    TG
hnRNPC_rep_2    TC
hnRNPC_rep_3    CA
  3. Reference genome

    hg18 from https://hgdownload.cse.ucsc.edu/goldenpath/hg18/bigZips/

  4. Annotation file

    hg18 annotation from https://www.gencodegenes.org/human/release_18.html

The following command was used to run PARANOiD:

nextflow ~/git-projects/PARANOID/main.nf \
    --reads data/hnRNPC.fastq \
    --barcodes data/hnRNPC_barcodes.tsv \
    --reference data/hg18.fa \
    --annotation data/gencode.v18.annotation.gtf \
    --domain eu \
    --barcode_pattern NNXXX \
    --mapq 3 \
    --omit_peak_calling \
    --peak_distance \
    --distance 350 \
    --merge_replicates \
    --output PARANOiD_peak_distance \
    -profile slurm,apptainer

Since the dataset was generated using a precursor iCLIP version, simple barcodes were used: 2 experimental nucleotides followed by 3 random nucleotides. To adapt to these barcodes, the parameter --barcode_pattern NNXXX was specified.

RNA subtype analysis

The RNA subtype analysis produces four output files per sample:

  1. Overview of ambiguous assignments (TSV)

    Contains one column and one row per RNA subtype. Each cell shows how many ambiguous assignments involve both the row subtype and the column subtype.

                       intron    CDS    three_prime_UTR    five_prime_UTR
    intron             3773      222    3359               237
    CDS                222       371    151                36
    three_prime_UTR    3359      151    3533               42
    five_prime_UTR     237       36     42                 282

  2. Barplot showing the percentage RNA subtype distribution

Example for RNA subtype analysis performed by PARANOiD. Shows the binding preference of the protein HuR upon mouse genes.

  3. Distribution table of RNA subtypes
    Contains 3 columns:
    1. RNA subtype being described

    2. Total number of assignments per RNA subtype

    3. Percentage distribution of assigned subtypes

    RNA_subtypes       number_assignment    percentage
    intron             86681                68.65
    CDS                1847                 1.46
    three_prime_UTR    32723                25.92
    five_prime_UTR     1062                 0.84
    ambiguous          3957                 3.13
    total              126270               100

  4. Logs

    Reports whether the assigned RNA subtypes sum to a different number than the number of peaks. This is currently the case when ambiguous assignments are split, because some reads are assigned to multiple RNA subtypes.
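
The percentage column of the distribution table can be reproduced from the assignment counts. The sketch below assumes that each percentage is the subtype count divided by the total, multiplied by 100 and rounded to two decimals; this matches the example values shown in this section, but it is an inference from the table, not a description of PARANOiD's internal code. The function name is hypothetical.

```python
def subtype_percentages(counts):
    """Return the share of each RNA subtype in percent, rounded to 2 decimals."""
    total = sum(counts.values())
    return {subtype: round(100 * n / total, 2) for subtype, n in counts.items()}


# Assignment counts taken from the example distribution table.
counts = {
    "intron": 86681,
    "CDS": 1847,
    "three_prime_UTR": 32723,
    "five_prime_UTR": 1062,
    "ambiguous": 3957,
}

print(sum(counts.values()))
for subtype, percentage in subtype_percentages(counts).items():
    print(subtype, percentage)
```

Comparing the summed counts with the number of peaks is the same kind of check that the log file reports on.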

The data come from the publication by Diaz-Muñoz et al., whose Fig. 5a shows a comparable RNA subtype distribution. To generate these results, the following files were used:

  1. iCLIP reads

    Downloaded using fasterq-dump. Since the replicates are already demultiplexed and the experimental barcode has been removed, they should not be merged into a single file. Note that only replicate 3 yields results, because the quality of almost all nucleotides in replicates 1 and 2 is extremely low (Phred score of 2, which corresponds to a 63.1% probability that a base was called incorrectly).

fasterq-dump SRR1694878 SRR1694879 SRR1694880
mv SRR1694878.fastq HuR_iCLIP_rep_1.fastq
mv SRR1694879.fastq HuR_iCLIP_rep_2.fastq
mv SRR1694880.fastq HuR_iCLIP_rep_3.fastq
  2. Reference genome

    mm10 from https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/

  3. Annotation file

    mm10 annotation from https://www.gencodegenes.org/mouse/release_M10.html

    Additionally, the annotation needs to be prepared so that it contains entries for introns. This is done with AGAT:

agat_sp_add_introns.pl --gff gencode.vM10.annotation.gff3 --out gencode.vM10.annotation.introns.gff3
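
The Phred score quoted for replicates 1 and 2 in the file list above converts to an error probability through the standard relation P = 10^(-Q/10). The short Python sketch below shows that conversion, together with decoding a FASTQ quality character under the assumption of Phred+33 encoding (the common encoding for current Illumina data; it is not stated in this document). Both function names are illustrative.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def quality_char_to_phred(char, offset=33):
    """Decode one FASTQ quality character, assuming Phred+33 by default."""
    return ord(char) - offset


# A Phred score of 2 corresponds to an error probability of about 63.1%.
print(round(phred_to_error_probability(2), 3))

# Under Phred+33, the character '#' encodes a score of 2 and 'E' a score of 36.
print(quality_char_to_phred("#"), quality_char_to_phred("E"))
```

Checking the quality strings of a few reads this way is a quick means of spotting replicates that are unlikely to produce alignments.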

The following command was used to run PARANOiD:

nextflow ~/git-projects/PARANOID/main.nf \
    --reads 'data/HuR_iCLIP_rep_*.fastq' \
    --reference data/mm10.fa \
    --annotation gencode.vM10.annotation.introns.gff3 \
    --domain eu \
    --omit_demultiplexing \
    --omit_peak_calling \
    --barcode_pattern NNNN \
    --run_rna_subtype \
    --rna_subtypes intron,CDS,three_prime_UTR,five_prime_UTR \
    --min_length 25 \
    --output PARANOiD_RNA_subtype \
    -profile slurm,apptainer