PARANOiD Parameters

Explanation of all PARANOiD parameters

--reads

Essential parameter!
States file containing reads obtained by iCLIP experiments.
Expects a FASTQ file.

Usage:

--reads /path/to/input-file.fastq

--barcodes

Essential parameter!
States file containing barcode sequences and experiment names. Necessary to split reads and allocate them to their experiment.
Expects a TSV file.

Usage:

--barcodes /path/to/barcodes.tsv

--reference

Essential parameter!
States the reference genome that reads are aligned to in order to determine the location of cross-link sites.
Expects a FASTA file.

Usage:

--reference /path/to/reference.fasta

--annotation

States annotation file used for the RNA subtype analysis.

Expects a GFF or GTF file.

Usage:

--annotation /path/to/annotation.gff

--merge_replicates

Merges replicates into a single representative form. For this to work, experiment names need to be named in a particular manner, which is further explained in the barcodes section.

Default: false

Usage:

--merge_replicates

--correlation_analysis

Only applies when replicate merging is chosen. Does a correlation analysis of replicates to show their similarity (and thus if they should be merged at all). Can cause problems with large reference genomes due to excessive RAM usage.

Default: false

Usage:

--correlation_analysis

--barcode_pattern

Adapts the barcode pattern to different protocols. The default protocol is iCLIP2. Ns represent the random barcode and Xs the experimental barcode.

Usage (default):

--barcode_pattern NNNNNXXXXXXNNNN

Example for iCLIP1

--barcode_pattern NNNXXXXNN
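
To illustrate how such a pattern can be read, the following Python sketch splits the start of a read into its random and experimental barcode parts. It is only an illustration of the N/X notation; the function name `split_barcode` is hypothetical and this is not PARANOiD's own implementation.

```python
def split_barcode(read: str, pattern: str = "NNNNNXXXXXXNNNN") -> tuple[str, str]:
    """Return (random_barcode, experimental_barcode) taken from the read start."""
    if len(read) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    random_part = []
    experimental_part = []
    # zip stops at the end of the pattern, so only the barcode region is used
    for base, symbol in zip(read, pattern):
        if symbol == "N":
            random_part.append(base)
        elif symbol == "X":
            experimental_part.append(base)
        else:
            raise ValueError(f"unexpected pattern symbol: {symbol!r}")
    return "".join(random_part), "".join(experimental_part)


# iCLIP2 default pattern: 5 random, 6 experimental, 4 random nucleotides
print(split_barcode("ACGTAGGGGGGTTTTCCCC"))
# iCLIP1 example pattern: 3 random, 4 experimental, 2 random nucleotides
print(split_barcode("AAACCCCGGTTT", "NNNXXXXNN"))
```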

--domain

Choose between Bowtie2 and STAR for aligning reads to the reference sequence. Bowtie2 should be used for prokaryotic organisms or transcript sequences, while STAR should be used for eukaryotic organisms (or rather all splicing-capable organisms), as STAR is splicing-aware. If using STAR for splicing-capable organisms, it is highly recommended to provide an annotation file in addition to the reference.

Options:
pro -> Bowtie2 (default)
eu  -> STAR

Usage (default):

--domain pro

--max_alignments

Maximum number of alignments the mapping tool provides per read. It is not guaranteed that this many alignments are found per read. If you want to find as many alignments as possible, please use the parameter --report_all_alignments.

Usage (default):

--max_alignments 1

--report_all_alignments

If used, the mapping tools will report all alignments rather than a few. Overrides the option --max_alignments.

Usage:

--report_all_alignments

--output

Specify directory to which output generated by PARANOiD will be written.

Usage (default):

--output ./output

--min_length

Specify minimum length a read needs to have after adapter removal to persist. Reads that become shorter during adapter removal will be filtered out.

Usage (default):

--min_length 30

--min_qual

Minimum quality for bases. All bases below that quality are cut off. The quality score (also known as Phred quality score) describes the certainty that a base was called correctly and is typically calculated as follows, with e being the error probability: \(Q = -10 \log_{10}(e)\)

Phred Quality score	Error probability	Accuracy
10	10%	90%
20	1%	99%
30	0.1%	99.9%
40	0.01%	99.99%

Usage (default):

--min_qual 20
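
The relationship between Phred score and error probability shown in the table can be reproduced with a few lines of Python. This is a minimal sketch of the formula only; the function names are chosen for illustration.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability e for a Phred quality score Q (e = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_probability_to_phred(e: float) -> float:
    """Phred quality score Q for an error probability e (Q = -10 * log10(e))."""
    return -10 * math.log10(e)


for q in (10, 20, 30, 40):
    e = phred_to_error_probability(q)
    print(f"Q{q}: error probability {e:.4%}, accuracy {1 - e:.4%}")
```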

--min_percent_qual_filter

Percentage of nucleotides that need to have a quality score above the chosen minimum base quality. Reads with fewer nucleotides above the desired quality will be removed.

Usage (default):

--min_percent_qual_filter 90
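
The filter logic can be sketched as follows. This is an illustration under assumptions, not the code PARANOiD runs: the function name is hypothetical, qualities are given as a list of Phred scores, and a base exactly at the minimum quality is counted as passing (the exact boundary handling of the underlying tool may differ).

```python
def passes_quality_filter(qualities: list[int], min_qual: int = 20,
                          min_percent: float = 90) -> bool:
    """Return True if enough bases of a read reach the minimum quality."""
    if not qualities:
        return False
    # Assumption: a base with quality equal to min_qual counts as good
    good = sum(1 for q in qualities if q >= min_qual)
    return 100 * good / len(qualities) >= min_percent


print(passes_quality_filter([30] * 9 + [10]))      # 90% of bases are good
print(passes_quality_filter([30] * 8 + [10] * 2))  # only 80% of bases are good
```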

--barcode_mismatches

Number of mismatches allowed within the experimental barcode to still assign a read to an experiment. Typically, experimental barcodes should be designed with a Hamming distance of at least 3 to each other in order to allow one mismatch.

Usage (default):

--barcode_mismatches 1
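
Why a distance of 3 permits one mismatch: a read with a single error is still at distance 1 from its true barcode but at distance 2 or more from every other barcode, so the assignment stays unambiguous. The Python sketch below checks a barcode set in this way. It reads the distance requirement as Hamming distance (an assumption), and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equally long barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def max_safe_mismatches(barcodes: list[str]) -> int:
    """Largest mismatch count that still gives an unambiguous assignment."""
    min_dist = min(hamming(a, b) for a, b in combinations(barcodes, 2))
    return (min_dist - 1) // 2


print(max_safe_mismatches(["AAAAAA", "AAATTT", "TTTAAA"]))
```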

--mapq

Minimum alignment quality (MAPQ score) an alignment needs in order to be retained. The meaning of the scores depends on the aligner chosen via --domain. All alignments with a MAPQ score below this value are removed after the alignment step. Please note that the tables below give only a short overview of the meaning of MAPQ scores; the details can be more complex. The MAPQ score can be found in column 5 of alignment files (SAM/BAM/CRAM).

Usage (default):

--mapq 2

Score meanings for Bowtie2 (--domain pro)

In addition to the descriptions in the table, a higher MAPQ score means fewer allowed mismatches (depending on the base quality of the mismatched nucleotides).

MAPQ score	Description
0	All mappable reads
1	Multimapped reads that have the same alignment quality at different positions
2-39	Multimapped reads that have one specific alignment with a better score than the other potential positions
40	Reads mappable to only one position
42	Reads mappable to only one position with an almost perfect alignment. Best MAPQ score in Bowtie2 alignments

More information can be found here

Score meanings for STAR (--domain eu)

MAPQ score	Description
0	Maps to 10 or more positions
1	Maps to 4-9 positions
2	Maps to 3 positions
3	Maps to 2 positions
255	Reads mappable to only one position. Best MAPQ score in STAR alignments.

The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and \(\text{MAPQ score} = \text{int}(-10 \log_{10}(1 - 1/[\text{number of positions the read maps to}]))\) for multi-mapping reads. This scheme is same as the one used by TopHat [...]

Source: STAR manual
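
As a quick check of the formula quoted above, the following Python sketch evaluates it for a few multi-mapping counts. The function name is hypothetical and the code is not taken from STAR or PARANOiD. It returns both the truncated value (`int`, as written in the quoted formula) and the value rounded to the nearest integer, because the two differ for some counts.

```python
import math


def multimapper_mapq(n_positions: int) -> tuple[int, int]:
    """Return (truncated, rounded) value of -10 * log10(1 - 1/N) for N >= 2."""
    if n_positions < 2:
        raise ValueError("formula only applies to multi-mapping reads (N >= 2)")
    raw = -10 * math.log10(1 - 1 / n_positions)
    return int(raw), round(raw)


for n in (2, 3, 4, 9, 10):
    print(n, multimapper_mapq(n))
```

The rounded values reproduce the table above (3 for two positions, 2 for three, 1 for four to nine, 0 for ten or more), whereas truncation gives 1 for three positions. It is therefore worth inspecting the MAPQ values in your own alignment files before choosing a --mapq threshold.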

--map_to_transcripts

Should be used when transcripts are given as reference instead of a reference genome. Returns the transcripts with most hits from each sample. More information can be found here

Default: false

Usage:

--map_to_transcripts

--number_top_transcripts

The number of transcripts with the most hits that are selected from each sample if the parameter --map_to_transcripts is used. As this number is chosen from each sample, the total number of transcripts can exceed it.

Usage (default):

--number_top_transcripts 10

--omit_peak_calling

If specified, peak calling is skipped. Peak calling is performed by default.

Usage:

--omit_peak_calling

--peak_calling_for_high_coverage

Only has an effect if peak calling is performed. Proteins covering the whole reference genome can cause problems for PureCLIP, causing it to throw an error. In our experience, the parameters added by this argument can help PureCLIP perform its analysis. Adds the following arguments to the PureCLIP execution: -mtc 5000 -mtc2 5000 -ld

Usage:

--peak_calling_for_high_coverage

--peak_calling_regions

Only has an effect if peak calling is performed. If specified, PureCLIP returns peak regions instead of single peaks.

Usage:

--peak_calling_regions

--peak_calling_regions_width

Only has an effect if peak calling regions are stated. Changes the width of peak calling regions returned by PureCLIP.

Usage (default):

--peak_calling_regions_width 8

--gene_id

Only has an effect if an annotation file is provided and thus the RNA subtype analysis performed.
Wording of the tag that describes the gene ID. Is found in the last column of annotation files, typically as the first tag-value pair.
The column looks similar to this:
ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true
In this case the tag necessary is ID.

Usage (default):

--gene_id ID

--color_barplot

Color of barplots returned by PARANOiD. Affects graphs generated by the peak height distribution, the RNA subtype analysis and the experimental barcode distribution. The color is stated via a hexadecimal color code. If unsure which code translates to which color, several websites can help to pick the correct one.

Usage (default):

--color_barplot #69b3a2

--rna_subtypes

Only has an effect if an annotation file is provided and thus the RNA subtype analysis performed. RNA subtypes/regions that shall be included in the RNA subtype analysis. RNA subtypes need to be separated by commas and should appear in the annotation file within the feature type column (3rd column). If either requirement is not met, the analysis will either not be performed correctly or be aborted. If not sure which RNA subtypes are included within your annotation file, you can use the script featuretypes-from-gtfgff.awk. Additionally, users should take care not to choose subtypes/regions that are in a hierarchical relationship to each other, as they can cover the same regions and thus make affected peaks appear ambiguous. Information about the hierarchical structure of RNA subtypes/regions can be obtained here.

Usage (default):

--rna_subtypes 3_prime_UTR,transcript,5_prime_UTR

--omit_peak_distance

Omits the peak distance analysis

Usage:

--omit_peak_distance

--distance

Max distance used for the peak distance analysis.

Usage (default):

--distance 30

--percentile

Peak percentile for the peak distance analysis and the sequence extraction/motif analysis. Only peaks with a value above this threshold are considered, while all peaks below are omitted as background noise. A percentile of 90 means that only the top 10% of peaks are used.

Usage (default):

--percentile 90
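
The effect of the percentile threshold can be sketched in Python as follows. This is a simplified illustration, not PARANOiD's implementation: the function name is hypothetical and the nearest-rank percentile definition is an assumption (other percentile definitions interpolate between values and can give slightly different thresholds).

```python
import math


def peaks_above_percentile(peak_values: list[float],
                           percentile: float = 90) -> list[float]:
    """Keep only peaks whose value lies above the given percentile."""
    if not peak_values:
        return []
    ordered = sorted(peak_values)
    # Nearest-rank percentile (assumption): smallest value with at least
    # `percentile` percent of all values at or below it
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    threshold = ordered[rank - 1]
    return [value for value in peak_values if value > threshold]


# With ten peaks and a percentile of 90, only the highest peak remains
print(peaks_above_percentile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 90))
```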

--omit_sequence_extraction

Omits the motif detection

Usage:

--omit_sequence_extraction

--seq_len

Only applies when motif detection is performed. Length in nucleotides that is extracted from the reference on each side of a peak. A value of 20 will lead to sequences of 41 nucleotides being extracted (20 nt upstream, the cross-link nucleotide, and 20 nt downstream).

Usage (default):

--seq_len 20
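
The window arithmetic (2 x seq_len + 1 nucleotides) can be illustrated with a short Python sketch. The function name is hypothetical, positions are 0-based, and strand handling is left out; windows that would run past the ends of the reference are simply clipped here, which may differ from what PARANOiD does.

```python
def extract_window(reference: str, crosslink_pos: int, seq_len: int = 20) -> str:
    """Return the cross-link nucleotide plus seq_len nucleotides on each side."""
    start = max(0, crosslink_pos - seq_len)
    end = min(len(reference), crosslink_pos + seq_len + 1)
    return reference[start:end]


reference = "A" * 20 + "T" + "C" * 20
window = extract_window(reference, crosslink_pos=20, seq_len=20)
print(len(window), window[20])
```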

--omit_cl_nucleotide

Only applies when motif detection is performed. The nucleotide directly at the cross-linking position will be substituted with an N when extracting sequences. Can improve the motif detection since iCLIP tends to have a bias towards U when cross-linking which can influence the motif search.

Usage:

--omit_cl_nucleotide

--omit_cl_width

Only applies when motif detection is performed and the cross-link nucleotide is omitted. Replaces nucleotides on both sides of the cross-linking position with N to avoid potential uridine polymers, which can negatively influence the motif search. The number determines how many nucleotides on each side are replaced.

Usage (default):

--omit_cl_width 0
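
The combined effect of --omit_cl_nucleotide and --omit_cl_width on an extracted sequence can be sketched as follows. The function name is hypothetical, and the sketch assumes the cross-link nucleotide sits in the middle of the extracted sequence.

```python
def mask_crosslink(sequence: str, omit_width: int = 0) -> str:
    """Replace the central nucleotide and omit_width neighbors per side with N."""
    center = len(sequence) // 2  # assumed position of the cross-link nucleotide
    start = max(0, center - omit_width)
    end = min(len(sequence), center + omit_width + 1)
    return sequence[:start] + "N" * (end - start) + sequence[end:]


print(mask_crosslink("AAUUUAA", omit_width=0))
print(mask_crosslink("AAUUUAA", omit_width=1))
```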

--remove_overlaps

Only applies when motif detection is performed. Removes cross-link sites with lower peak values if their extracted sequence would overlap with the sequence from another cross-link site. This avoids duplicated sequences during motif detection.

Usage:

--remove_overlaps
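
One way to picture the overlap removal is a greedy pass from the highest to the lowest peak, as in the Python sketch below. The greedy strategy and the function name are assumptions made for illustration; PARANOiD's actual procedure may differ. All sites are assumed to lie on the same reference sequence and strand.

```python
def remove_overlapping_sites(sites: list[tuple[int, float]],
                             seq_len: int = 20) -> list[tuple[int, float]]:
    """Keep the strongest sites whose extraction windows do not overlap.

    sites holds (position, peak_value) pairs.
    """
    kept: list[tuple[int, float]] = []
    # Visit sites from highest to lowest peak value
    for pos, value in sorted(sites, key=lambda site: site[1], reverse=True):
        # Windows [pos - seq_len, pos + seq_len] overlap when the
        # positions are at most 2 * seq_len apart
        if all(abs(pos - kept_pos) > 2 * seq_len for kept_pos, _ in kept):
            kept.append((pos, value))
    return sorted(kept)


print(remove_overlapping_sites([(100, 5.0), (110, 9.0), (200, 3.0)]))
```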

--max_motif_num

Only applies when motif detection is performed. Maximum number of motifs that is reported by streme.

Usage (default):

--max_motif_num 50

--min_motif_width

Only applies when motif detection is performed. Minimum length of motifs reported by streme. Cannot be lower than 3.

Usage (default):

--min_motif_width 8

--max_motif_width

Only applies when motif detection is performed. Maximum length of motifs reported by streme. Cannot be higher than 30.

Usage (default):

--max_motif_width 15