Included Analyses

Short overview of all analyses implemented in PARANOiD

Basic analysis

The basic analysis of PARANOiD includes preprocessing of FASTQ files, demultiplexing, aligning reads to a reference and calculating the cross-linking position based on the alignment. These positions are then converted into WIG, BIGWIG and BEDGRAPH files and provided as output to the user. Additionally, the distribution of peak heights and the strandedness are reported, together with a statistics overview of important processes. Lastly, an XML file is generated that can be imported into IGV <https://software.broadinstitute.org/software/igv/> to automatically visualize the results generated by PARANOiD. The preprocessing involves adapter removal, quality filtering, splitting reads according to their experimental barcode and removing the whole barcode. The alignment can be done with one of two alignment tools (Bowtie2 or STAR) and is followed by a deduplication step in which PCR duplicates are removed. Finally, alignments are filtered by MAPQ score and a cross-linking position is calculated for each alignment. Two files of each format (WIG, BIGWIG and BEDGRAPH) are generated for each sample: one for forward and one for reverse alignments.

Associated parameters (preprocessing)
--barcode_pattern           States composition of random and experimental barcodes
--barcode_mismatches        Number of mismatches allowed within the experimental barcode to still assign a read to its sample
--min_length                Minimum length a read must have after adapter removal to be retained
--min_qual                  Minimum quality score a base must reach to pass the quality threshold
--min_percent_qual_filter   Percentage of bases above the quality threshold necessary to retain the read

Associated parameters (alignment & cross link site determination)
--domain                   States whether Bowtie2 or STAR is used as aligner
--mapq                     Minimum MAPQ score an alignment needs to be retained
--max_alignments           Maximum number of alignments provided by the mapping tool
--report_all_alignments    Reports all possible alignments (might be filtered out later on)
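
The last two steps of the basic analysis (MAPQ filtering and cross-link site determination) can be illustrated with a minimal sketch. It is not PARANOiD's implementation. The `Alignment` record and both function names are hypothetical, and the sketch assumes the common iCLIP convention that the cross-link site is the nucleotide directly upstream of the read's 5' end; the exact offset PARANOiD uses may differ.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (not a PARANOiD type)."""
    chrom: str
    start: int   # 0-based leftmost aligned position
    end: int     # 0-based, exclusive
    strand: str  # "+" for forward, "-" for reverse
    mapq: int


def cross_link_position(aln: Alignment) -> int:
    """Assumed convention: the nucleotide directly upstream of the read's 5' end."""
    if aln.strand == "+":
        return aln.start - 1
    # On the reverse strand the 5' end is the rightmost aligned base (end - 1),
    # so the upstream nucleotide sits at coordinate `end`.
    return aln.end


def count_cross_links(
    alignments: Iterable[Alignment], min_mapq: int
) -> Dict[Tuple[str, str, int], int]:
    """Drop alignments below min_mapq, then count cross-link sites per strand."""
    counts: Dict[Tuple[str, str, int], int] = {}
    for aln in alignments:
        if aln.mapq < min_mapq:
            continue
        key = (aln.chrom, aln.strand, cross_link_position(aln))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Keeping the strand in the key mirrors the separate forward and reverse output files described above.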

Merge replicates

Merges several replicates into a single representative version which can be used for publications, posters or presentations. This version shows the mean hit count for every position. Additionally, a correlation analysis is performed to give the user an evaluation of the sample similarity and therefore a rationale for this analysis. The correlation is calculated on raw cross-link sites (or on significant ones if peak calling is used) using the Pearson correlation. This analysis is deactivated by default.

Associated parameters:
--merge_replicates           Merges replicates according to the name in the barcode file
--correlation_analysis       Does a correlation analysis for merged replicates
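
The two calculations described above, the per-position mean and the Pearson correlation, can be sketched as follows. This is an illustration only: the function names and the representation of a replicate as a dictionary from position to hit count are assumptions, and positions missing from a replicate are assumed to count as zero hits.

```python
from math import sqrt
from typing import Dict, List, Sequence


def merge_replicates(replicates: List[Dict[int, float]]) -> Dict[int, float]:
    """Mean hit count per position; a position absent from a replicate counts as 0."""
    positions = set().union(*replicates)
    n = len(replicates)
    return {
        pos: sum(rep.get(pos, 0) for rep in replicates) / n
        for pos in sorted(positions)
    }


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / sqrt(var_x * var_y)
```

To correlate two replicates, their hit counts would first be aligned on the same list of positions before being passed to `pearson`.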

RNA subtypes

This analysis determines whether the protein of interest preferentially binds to specific RNA subtypes or regions. As this is determined via the annotation file, only subtypes included there (shown in column 3) can be used. A script was added to show which RNA subtypes are included in the annotation file. When choosing RNA subtypes, one has to be careful not to use subtypes that are hierarchically higher or lower than each other, as these will at least partially cover the same reference regions, making hits in these regions ambiguous. The Sequence Ontology (SO) can be used to get an overview of the official hierarchical structures of annotation files. This analysis is activated when an annotation file is provided.

Associated parameters
--gene_id               Tag for the gene ID used within the annotation file
--color_barplot         Color bars within the barplot generated by this analysis
--rna_subtypes          RNA subtypes/regions used for this analysis
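
To illustrate where the available subtypes come from, the following sketch counts the values in column 3 of a tab-separated GFF/GTF annotation. It is not the script shipped with PARANOiD; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(annotation_lines: Iterable[str]) -> Counter:
    """Count the feature types found in column 3 of GFF/GTF lines."""
    counts: Counter = Counter()
    for line in annotation_lines:
        # Skip empty lines and comment/directive lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

A file can be passed directly, for example `with open("input_annotation.gff3") as handle: print(count_feature_types(handle))`.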

Transcript analysis

This analysis shows whether specific RNAs are more prone to interact with the protein of interest. When choosing this analysis, a file containing all RNAs of interest should be used as input reference instead of the genome. All RNAs of interest (or artificial RNAs present in the sample) can be combined into a single FASTA file. If the general transcriptome of an organism is to be examined, it can often be obtained alongside the genome and annotation of the organism. If not, a FASTA file containing the transcripts can be generated as follows (needs the genome and an annotation file):

''' gffread -w output_transcripts.fa -g input_reference_genome.fa input_annotation.gff3 '''

Associated parameters
--map_to_transcripts             Activates transcript analysis
--number_top_transcripts         Number of transcripts with the most hits per sample that are reported as output
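
The selection controlled by --number_top_transcripts amounts to ranking transcripts by hit count. A minimal sketch, assuming hits are already summed per transcript (the function name and the tie-breaking by transcript name are assumptions):

```python
from typing import Dict, List, Tuple


def top_transcripts(
    hits_per_transcript: Dict[str, int], number_top_transcripts: int
) -> List[Tuple[str, int]]:
    """Return the transcripts with the most hits, highest first."""
    ranked = sorted(
        hits_per_transcript.items(),
        key=lambda item: (-item[1], item[0]),  # by hits descending, then by name
    )
    return ranked[:number_top_transcripts]
```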

Peak calling

Results obtained from analyzed iCLIP experiments typically contain a fair amount of background noise (signal not caused by the actual protein-RNA interaction). This can be caused by reverse transcription not terminating when it encounters an amino acid, or by the protein of interest binding covalently to an RNA merely because the two were in close proximity. Peak calling is supposed to filter out this background noise and thus reduce the amount of false positive signal. PARANOiD employs PureCLIP for its peak calling process. PureCLIP uses a hidden Markov model to divide the reference into 4 different states based on the peak distribution. Additionally, identified peaks in close proximity can be merged into binding regions.

Associated parameters:
--omit_peak_calling                           Omits peak calling analysis
--peak_calling_for_high_coverage              Adds parameters to PureCLIP which can allow its successful execution for high coverage samples
--peak_calling_regions                        Allows merging several cross link sites in close proximity to a cross link region
--peak_calling_regions_width                  Sets the width until which cross link sites in close proximity are allowed to be merged
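
Merging nearby cross-link sites into regions can be pictured with the sketch below. It does not reproduce PureCLIP's procedure. It assumes that two neighbouring sites are merged when they lie at most `width` nucleotides apart; the function name and this interpretation of the width are assumptions.

```python
from typing import Iterable, List, Tuple


def merge_sites_into_regions(
    sites: Iterable[int], width: int
) -> List[Tuple[int, int]]:
    """Merge sites on one strand into (start, end) regions, inclusive."""
    regions: List[List[int]] = []
    for pos in sorted(sites):
        if regions and pos - regions[-1][1] <= width:
            regions[-1][1] = pos        # extend the current region
        else:
            regions.append([pos, pos])  # start a new region
    return [(start, end) for start, end in regions]
```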

Motif detection

Protein binding sites are often determined by protein-specific RNA motifs. These motifs are typically found at or in close proximity to cross-linking sites. The motif detection was implemented to identify these motifs. Background noise is filtered out by using only the top percentiles of peaks (by default only the top 10% are used) in the same manner as in the peak distance analysis. Sequences around all peaks above the threshold are extracted and provided as output. All extracted sequences are then used for motif detection via STREME, which reports several enriched sequence motifs.

Associated parameters:
--omit_sequence_extraction          Omits the sequence extraction and motif detection
--percentile                        Sets threshold for peak values used for this analysis using percentiles
--seq_len                           Nucleotides extracted from each side of a cross link site
--omit_cl_nucleotide                Omits the nucleotide at the cross link position
--omit_cl_width                     Omits the nucleotides surrounding the cross link position
--remove_overlaps                   Removes overlapping sequences
--max_motif_num                     Maximum number of motifs generated
--min_motif_width                   Minimum width allowed for motifs
--max_motif_width                   Maximum width allowed for motifs
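
The percentile filter and the sequence extraction can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the threshold uses the nearest-rank percentile (PARANOiD may calculate it differently), and a `percentile` of 90 is assumed to correspond to keeping the top 10% of peaks. Running STREME is not part of the sketch.

```python
import math
from typing import Dict, List


def percentile_threshold(peak_heights: List[float], percentile: float) -> float:
    """Nearest-rank percentile of the peak heights."""
    ordered = sorted(peak_heights)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def extract_sequence(reference: str, position: int, seq_len: int) -> str:
    """seq_len nucleotides on each side of a 0-based position, clipped at the ends."""
    start = max(0, position - seq_len)
    end = min(len(reference), position + seq_len + 1)
    return reference[start:end]


def sequences_around_top_peaks(
    reference: str, peaks: Dict[int, float], percentile: float, seq_len: int
) -> List[str]:
    """Extract sequences around all peaks strictly above the threshold."""
    threshold = percentile_threshold(list(peaks.values()), percentile)
    return [
        extract_sequence(reference, pos, seq_len)
        for pos, height in sorted(peaks.items())
        if height > threshold
    ]
```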

Peak distance analysis

Some proteins bind to long stretches of RNA instead of certain motif-dependent RNA subregions. This is, for example, the case with the Nucleocapsid (N) protein of several virus species, which binds to a distinct number of nucleotides per N protein while packaging the viral RNA. The peak distance analysis was implemented to detect such periodical RNA-protein interactions by counting the occurrences of distances between peaks. Background noise is filtered out by using only the top percentiles of peaks (by default only the top 10% are used) in the same manner as in the motif detection. Then, for every peak above the threshold, the distances to all other peaks above this threshold that lie within a certain distance (by default 30 nt) are measured, summarized, provided as a TSV file and visualized as a plot.

Associated parameters:
--omit_peak_distance     Omits the peak distance analysis
--percentile             Sets threshold for peak values used for this analysis using percentiles
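
The distance counting can be sketched as follows, assuming the peaks have already been filtered by the percentile threshold and all lie on the same reference and strand. The function name is hypothetical, and each pair of peaks is counted once (downstream direction only), which may differ from how PARANOiD summarizes the distances.

```python
from collections import Counter
from typing import Iterable


def peak_distance_counts(
    peak_positions: Iterable[int], max_distance: int = 30
) -> Counter:
    """Count how often each distance (up to max_distance) occurs between peaks."""
    ordered = sorted(peak_positions)
    counts: Counter = Counter()
    for i, pos in enumerate(ordered):
        for other in ordered[i + 1:]:
            distance = other - pos
            if distance > max_distance:
                break  # positions are sorted, so all later peaks are even further away
            counts[distance] += 1
    return counts
```

A periodic binding pattern would show up as distances that occur far more often than their neighbours, for example recurring multiples of one spacing.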