Included Analyses
Short overview of all analyses implemented in PARANOiD
Basic analysis
PARANOiD's basic analysis preprocesses the FASTQ files, demultiplexes the reads, aligns them to a reference and calculates the cross-linking positions from the alignments. These positions are written to WIG, BIGWIG and BEDGRAPH files, which are provided as output. In addition, PARANOiD reports the distribution of peak heights, the strandedness and a statistics overview of important processing steps. Lastly, an XML file is generated that can be imported into IGV <https://software.broadinstitute.org/software/igv/> to automatically visualize the results generated by PARANOiD.
Preprocessing involves adapter removal, quality filtering, splitting reads according to their experimental barcode, and removing the whole barcode. The alignment can be performed with one of two alignment tools (Bowtie2 or STAR) and is followed by a deduplication step in which PCR duplicates are removed. Finally, alignments are filtered by their MAPQ score and a cross-linking position is calculated for each remaining alignment. Two WIG, BIGWIG and BEDGRAPH files are generated for each sample: one for forward and one for reverse alignments.
When analyzing peaks, keep in mind that iCLIP tends to be biased towards cross-linking at uridines (Sugimoto et al. 2012 <10.1186/gb-2012-13-8-r67>).
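To illustrate how a cross-linking position can be derived from an alignment, here is a minimal Python sketch. It assumes the common iCLIP convention that the cross-link site is the nucleotide immediately upstream of the cDNA start; whether PARANOiD applies exactly this rule is not stated here. The function names and the tuple layout of the alignments are hypothetical.

```python
from collections import Counter


def crosslink_site(start, end, strand):
    """Return the assumed cross-link position for one alignment.

    start/end are 1-based, inclusive reference coordinates.
    Assumption: the site is the nucleotide directly upstream of the read
    start (start - 1 on the forward strand, end + 1 on the reverse strand).
    A forward read starting at position 1 therefore yields 0, which lies
    outside the reference and would need separate handling.
    """
    if strand == "+":
        return start - 1
    if strand == "-":
        return end + 1
    raise ValueError(f"unknown strand: {strand!r}")


def count_crosslinks(alignments):
    """Count cross-link events per position, separately for each strand.

    alignments: iterable of (chromosome, start, end, strand) tuples.
    Returns two Counters (forward, reverse), mirroring the one-file-per-strand
    output described above.
    """
    forward, reverse = Counter(), Counter()
    for chrom, start, end, strand in alignments:
        site = crosslink_site(start, end, strand)
        target = forward if strand == "+" else reverse
        target[(chrom, site)] += 1
    return forward, reverse
```

Each counter corresponds to the content of one strand-specific track: the keys are positions and the values are the number of cross-link events observed there.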
Associated parameters (preprocessing)
--barcode_pattern            States composition of random and experimental barcodes
--barcode_mismatches         Number of allowed mismatches within the experimental barcode to still align it to its sample
--omit_demultiplexing        Omits demultiplexing step
--min_length                 Minimum length of reads required to be retained after adapter removal
--min_qual                   Minimum quality a base needs to be retained
--min_percent_qual_filter    Percentage of bases that must exceed the quality threshold to retain the read
Associated parameters (alignment & cross link site determination)
--domain                     Specifies whether Bowtie2 or STAR is used as the aligner
--mapq                       Minimum MAPQ score required to retain alignments
--max_alignments             Maximum number of alignments reported by the mapping tool
--report_all_alignments      Reports all possible alignments (may be filtered out later)
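The following Python sketch shows how two of the preprocessing parameters above could act on a single read: the quality filter (--min_qual, --min_percent_qual_filter) and barcode assignment with mismatches (--barcode_mismatches). It is an illustration only, not PARANOiD's implementation. The function names, the Phred+33 encoding, the use of ">=" at the threshold and the handling of ambiguous barcodes are all assumptions.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_quality_filter(phred_scores, min_qual, min_percent):
    """Keep a read if at least min_percent of its bases reach min_qual.

    Assumption: a base counts as good when its score is >= min_qual.
    """
    if not phred_scores:
        return False
    good = sum(1 for q in phred_scores if q >= min_qual)
    return 100.0 * good / len(phred_scores) >= min_percent


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(observed_barcode, sample_barcodes, max_mismatches):
    """Return the sample whose experimental barcode matches, or None.

    sample_barcodes: dict mapping sample name -> experimental barcode.
    Assumption: a read matching several samples equally well is discarded.
    """
    matches = [
        name
        for name, barcode in sample_barcodes.items()
        if len(barcode) == len(observed_barcode)
        and hamming(barcode, observed_barcode) <= max_mismatches
    ]
    return matches[0] if len(matches) == 1 else None
```

For example, a read with quality string "IIII!" has four of five bases at score 40, so it passes with min_percent set to 80 but is removed with min_percent set to 90.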
Merge replicates
Merges several replicates into a single representative dataset which can be used for publications, posters or presentations. This dataset shows the mean hit count at each position. Additionally, a correlation analysis is performed to evaluate how similar the samples are and thereby provide a rationale for merging them. The correlation is computed on raw cross-link sites (or on significant ones if peak calling is used) using the Pearson correlation coefficient. This option is deactivated by default.
Associated parameters:
--merge_replicates          Merges replicates based on the name specified in the barcode file
--correlation_analysis      Performs a correlation analysis on merged replicates
--minimum_peaks_to_merge    Minimum number of replicates with signal (peak height > 0) necessary to merge the position
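The merging and correlation steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: replicates are represented as dictionaries mapping a position to its hit count, the mean is taken over all replicates (missing positions count as 0), and the function names are hypothetical.

```python
from math import sqrt


def merge_replicates(replicates, minimum_peaks_to_merge=1):
    """Return the mean hit count per position across replicates.

    replicates: list of dicts mapping position -> hit count.
    A position is kept only if at least minimum_peaks_to_merge replicates
    show signal (count > 0) there.
    """
    positions = set().union(*replicates)
    merged = {}
    for pos in sorted(positions):
        counts = [rep.get(pos, 0) for rep in replicates]
        with_signal = sum(1 for c in counts if c > 0)
        if with_signal >= minimum_peaks_to_merge:
            merged[pos] = sum(counts) / len(counts)
    return merged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one sample has no variation
    return cov / sqrt(var_x * var_y)
```

To correlate two replicates, their counts would first be aligned over a shared list of positions, for example `[rep.get(p, 0) for p in positions]` for each replicate, and then passed to `pearson`.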
RNA subtypes
Analysis to determine whether the protein of interest tends to bind to specific RNA subtypes or regions. Because the assignment relies on the annotation file, only subtypes that appear in it (listed in column 3 of the file) can be detected. A script is provided to show which RNA subtypes are included in the annotation file. When choosing RNA subtypes, take care not to select subtypes that are hierarchically above or below each other, as they will at least partially cover the same reference regions, which makes hits in these regions ambiguous. The Sequence Ontology (SO) terms can be used to get an overview of the official hierarchical structure of annotation files. When the analysis is performed without peak calling, the number of cross-link events (number of cDNAs) is used instead of the number of peaks. This analysis can only be performed when an annotation file is provided.
Associated parameters
--run_rna_subtype         Performs RNA subtype analysis
--gene_id                 Specifies the gene ID tag used in the annotation file
--color_barplot           Specifies colors of the bars in the barplot generated by this analysis
--rna_subtypes            Specifies RNA subtypes/regions used for this analysis
--report_not_assigned     Reports not assigned peaks among RNA subtypes
--annotation_extension    Extension of annotation file. Accepts GFF (for GFF3) or GTF
--split_ambiguous         Splits ambiguous peaks into all appropriate subtypes
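As a rough idea of how the available subtypes can be listed, the Python sketch below counts the feature types in column 3 of a tab-separated GFF3 or GTF file. It is not the script shipped with PARANOiD; the function name and the example lines are made up for illustration.

```python
from collections import Counter


def count_feature_types(annotation_lines):
    """Count the values found in column 3 of GFF/GTF lines.

    Comment lines (starting with '#') and blank lines are skipped.
    """
    counts = Counter()
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Made-up example lines; with a real file, pass an open file handle instead.
example = [
    "##gff-version 3",
    "chr1\tsource\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\tsource\tmRNA\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsource\texon\t1\t400\t.\t+\t.\tParent=tx1",
    "chr1\tsource\texon\t600\t1000\t.\t+\t.\tParent=tx1",
]
for feature_type, number in sorted(count_feature_types(example).items()):
    print(f"{feature_type}\t{number}")
```

The listed types are the values that can be passed to --rna_subtypes, subject to the hierarchy caveat described above.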
Transcript analysis
Analysis to show whether specific RNAs are more prone to interact with the protein of interest. When choosing this analysis, a file containing all RNAs of interest should be used as the input reference instead of the genome. All RNAs of interest (or artificial RNAs present in the sample) can be combined into a single FASTA file. If the general transcriptome of an organism is to be examined, a transcript FASTA file is often available alongside the genome and annotation of the organism. If not, a FASTA file containing the transcripts can be generated as follows (requires the genome and an annotation file):
```
gffread -w output_transcripts.fa -g input_reference_genome.fa input_annotation.gff3
```
Associated parameters
--map_to_transcripts        Activates transcript analysis
--number_top_transcripts    Number of transcripts with most hits per sample that are offered as output
Peak calling
Results obtained from iCLIP experiments typically contain a considerable amount of background noise (signal not caused by the actual protein-RNA interaction). This noise can arise when reverse transcription does not terminate upon encountering a cross-linked amino acid, or when the protein of interest binds covalently to an RNA merely because the two were in close proximity. Peak calling aims to filter out this background noise and thus reduce the amount of false-positive signal. PARANOiD employs PureCLIP for its peak calling process. PureCLIP uses a hidden Markov model to divide the reference into four different states based on the peak distribution (0-based). Additionally, identified peaks in close proximity can be merged into binding regions. Please note that in order to run PureCLIP, all nucleotide letters in the reference other than A, C, G, T and N need to be replaced with Ns.
Associated parameters:
--omit_peak_calling                 Omits peak calling analysis
--peak_calling_for_high_coverage    Adds parameters to PureCLIP which can allow its successful execution for high coverage samples
--peak_calling_regions              Allows merging of several cross-link sites in close proximity to a cross link region
--peak_calling_regions_width        Sets the distance within which cross-link sites in close proximity are allowed to be merged
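The requirement that the reference contains only A, C, G, T and N can be met with a small preprocessing step. The Python sketch below is one possible way to do this, not part of PARANOiD: the function name is hypothetical, and keeping lowercase (soft-masked) letters unchanged is an assumption.

```python
import re

# Any character other than A, C, G, T, N (upper or lower case) is replaced.
_NON_ACGTN = re.compile(r"[^ACGTNacgtn]")


def mask_non_acgtn(fasta_lines):
    """Yield FASTA lines with non-ACGTN sequence letters replaced by N.

    Header lines (starting with '>') are passed through unchanged, and the
    original line endings are preserved.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            yield line
            continue
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        yield _NON_ACGTN.sub("N", body) + ending
```

Applied to a file, the generator can be fed an open input handle and its output written line by line to a new FASTA file, which is then used as the reference.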
Motif detection
Protein binding sites are often determined by protein-specific RNA motifs. These motifs are typically found at or in close proximity to cross-linking sites. To identify these motifs, motif detection was implemented. When peak calling is omitted (--omit_peak_calling), background noise is filtered out by using only the top percentiles of cross-link peaks (by default only the top 10% are used: --percentile 90) in the same manner as in the peak distance analysis. Sequences surrounding all peaks above the threshold are extracted and provided as output. All extracted sequences are then used for motif detection via STREME, which returns several enriched motifs. When analyzing motifs, keep in mind that iCLIP tends to be biased towards cross-linking at uridines (Sugimoto et al. 2012 <10.1186/gb-2012-13-8-r67>), which may influence the resulting motifs. This can be bypassed with the option --omit_cl_nucleotide, which replaces the cross-linked nucleotide with an N.
Associated parameters:
--sequence_extraction    Performs sequence extraction and motif detection
--percentile             Sets the threshold for peak values used in this analysis (in percentiles)
--seq_len                Number of nucleotides extracted from each side of a cross-link site
--omit_cl_nucleotide     Omits the nucleotide at the cross link position
--omit_cl_width          Omits the nucleotides surrounding the cross link position
--remove_overlaps        Removes overlapping sequences
--max_motif_num          Specifies the maximum number of motifs to generate
--min_motif_width        Specifies the minimum width allowed for motifs
--max_motif_width        Specifies the maximum width allowed for motifs
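The percentile filter and the sequence extraction described above can be sketched as follows. This is an illustration under assumptions, not PARANOiD's code: the percentile uses the nearest-rank definition, peaks must be strictly above the threshold, windows that run past the reference ends are skipped, and the function names and data layout (0-based positions in a dictionary) are hypothetical.

```python
import math


def percentile_threshold(values, percentile):
    """Nearest-rank percentile of the given values (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def extract_sequences(reference, peaks, seq_len, percentile=90,
                      omit_cl_nucleotide=False):
    """Extract sequences around peaks whose height exceeds the threshold.

    reference: reference sequence as a string.
    peaks: dict mapping 0-based position -> peak height.
    seq_len: nucleotides taken from each side of the cross-link site.
    omit_cl_nucleotide: replace the cross-linked nucleotide with 'N'.
    """
    if not peaks:
        return []
    threshold = percentile_threshold(peaks.values(), percentile)
    sequences = []
    for pos in sorted(peaks):
        if peaks[pos] <= threshold:
            continue
        start, end = pos - seq_len, pos + seq_len + 1
        if start < 0 or end > len(reference):
            continue  # assumption: incomplete windows are dropped
        window = reference[start:end]
        if omit_cl_nucleotide:
            window = window[:seq_len] + "N" + window[seq_len + 1:]
        sequences.append(window)
    return sequences
```

The returned sequences correspond to the input that a motif finder such as STREME would receive; with omit_cl_nucleotide enabled, the central position no longer contributes to the uridine bias.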
Peak distance analysis
Some proteins bind to long stretches of RNA instead of certain motif-dependent RNA subregions. This is, for example, the case with the Nucleocapsid (N) protein of several virus species, which binds to a distinct number of nucleotides per N protein while packaging the viral RNA. The peak distance analysis was implemented to detect such periodic RNA-protein interactions by determining how often each distance between peaks occurs. When peak calling is omitted (--omit_peak_calling), background noise is filtered out by using only the top percentiles of peaks (by default only the top 10% are used: --percentile 90) in the same manner as in the motif detection. Then, iterating over each peak, the distances to all other peaks within a specified distance (default 30 nt: --distance 30) are measured, summarized and provided as a TSV file and a barplot.
Associated parameters:
--peak_distance    Performs the peak distance analysis
--percentile       Sets the threshold for peak values used in this analysis (in percentiles)
--distance         Maximum reported distance between peaks
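The distance counting at the core of this analysis can be sketched in Python as shown below. The sketch assumes that peaks have already been filtered (by peak calling or by percentile) and that each pair of peaks is counted once; whether PARANOiD counts pairs once or in both directions is not stated here. The function names are hypothetical.

```python
from collections import Counter


def peak_distance_counts(peak_positions, max_distance=30):
    """Count how often each distance between two peaks occurs.

    peak_positions: positions of the retained peaks on one reference.
    Only distances up to max_distance are reported.
    Assumption: each pair of peaks is counted once.
    """
    positions = sorted(set(peak_positions))
    counts = Counter()
    for i, pos in enumerate(positions):
        for other in positions[i + 1:]:
            distance = other - pos
            if distance > max_distance:
                break  # positions are sorted, so later peaks are farther away
            counts[distance] += 1
    return counts


def to_tsv(counts):
    """Format the counts as tab-separated 'distance<TAB>occurrences' lines."""
    lines = ["distance\toccurrences"]
    lines += [f"{distance}\t{counts[distance]}" for distance in sorted(counts)]
    return "\n".join(lines)
```

A periodic binding pattern would show up as a pronounced count at one distance and its multiples, for example at 10 and 20 for the peaks 10, 20 and 30.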