PARANOiD Parameters

Explanation of all PARANOiD parameters

--reads

Essential parameter!
Specifies the file containing reads obtained from iCLIP experiments.
Expects a FASTQ file.

Usage:

--reads /path/to/input-file.fastq

--barcodes

Essential parameter! (as long as demultiplexing is not omitted)
Specifies the file containing barcode sequences and experiment names. Required to split reads and assign them to their corresponding experiment.
Expects a TSV file.

Usage:

--barcodes /path/to/barcodes.tsv

--reference

Essential parameter!
Specifies the reference genome used to align reads and determine the location of cross-link sites.
Expects a FASTA file.

Usage:

--reference /path/to/reference.fasta

--annotation

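Specifies the annotation file belonging to the reference. It is used for the RNA subtype analysis and is recommended when aligning with STAR (--domain eu).
Expects a GFF or GTF file.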

Usage:

--annotation /path/to/annotation.gff

--merge_replicates

Merges replicates into a single representative form. In order to do so, experiment names must follow a specific naming convention which is further explained in the barcodes section.

Default: false

Usage:

--merge_replicates

--correlation_analysis

Only applies when replicate merging is chosen. Performs a correlation analysis of replicates to show how similar they are (and thus whether they should be merged at all). May cause excessive memory usage for large reference genomes.

Default: false

Usage:

--correlation_analysis

--minimum_peaks_to_merge

Only applies when replicate merging is chosen. Sets the minimum number of replicates that must show a signal (peak height > 0) at a position for that position to be included in the merged version. To merge all peaks, choose a value of 0 (--minimum_peaks_to_merge 0). If no value is provided, only positions with a signal in more than half of the replicates are merged.

Default: false

Usage:

--minimum_peaks_to_merge 2
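The filter described above can be illustrated with a minimal Python sketch. It is not PARANOiD's implementation: the function name, the dictionary layout of the replicates, and the use of None for "no value provided" are assumptions made for illustration, and the sketch only decides which positions are kept, not how peak heights are combined.

```python
def positions_to_merge(replicates, minimum_peaks=None):
    """Return the positions that pass the replicate-support filter.

    replicates: list of dicts mapping position -> peak height (hypothetical layout).
    minimum_peaks: mirrors --minimum_peaks_to_merge; None models "no value provided".
    """
    all_positions = set()
    for rep in replicates:
        all_positions.update(rep)

    kept = []
    for pos in sorted(all_positions):
        # Count replicates with signal (peak height > 0) at this position.
        support = sum(1 for rep in replicates if rep.get(pos, 0) > 0)
        if minimum_peaks is None:
            # Default rule from the text: signal in over half of the replicates.
            passes = support > len(replicates) / 2
        else:
            passes = support >= minimum_peaks
        if passes:
            kept.append(pos)
    return kept


# Three hypothetical replicates of the same experiment.
replicates = [
    {10: 5, 11: 0, 12: 3},
    {10: 2, 12: 1},
    {10: 1, 13: 4},
]
print(positions_to_merge(replicates))                   # default rule
print(positions_to_merge(replicates, minimum_peaks=0))  # keep every position
print(positions_to_merge(replicates, minimum_peaks=3))  # signal in all replicates
```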

--barcode_pattern

Adjusts the barcode pattern to different protocols. The default protocol is iCLIP2. Ns represent the random barcode and Xs the experimental barcode.

Usage (default):

--barcode_pattern NNNNNXXXXXXNNNN

Example for iCLIP1

--barcode_pattern NNNXXXXNN
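To show how such a pattern can be read, here is a minimal Python sketch that splits the start of a read into its random and experimental barcode. It assumes the barcode sits at the 5' end of the read; the function name and the example reads are hypothetical and do not reflect how PARANOiD performs this step internally.

```python
def split_barcode(read_sequence, pattern="NNNNNXXXXXXNNNN"):
    """Split a read into (random barcode, experimental barcode, remaining insert)."""
    if len(read_sequence) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")

    random_barcode = []
    experimental_barcode = []
    for base, symbol in zip(read_sequence, pattern):
        if symbol == "N":
            random_barcode.append(base)        # N positions: random barcode
        elif symbol == "X":
            experimental_barcode.append(base)  # X positions: experimental barcode
        else:
            raise ValueError(f"unexpected pattern symbol: {symbol!r}")

    insert = read_sequence[len(pattern):]
    return "".join(random_barcode), "".join(experimental_barcode), insert


# Hypothetical iCLIP2-style read: 5 random + 6 experimental + 4 random + insert.
iclip2_read = "ACGTA" + "TTGGCC" + "GATC" + "AAACCCGGG"
print(split_barcode(iclip2_read))

# Hypothetical iCLIP1-style read: 3 random + 4 experimental + 2 random + insert.
iclip1_read = "ACG" + "TTGG" + "CA" + "AAAC"
print(split_barcode(iclip1_read, pattern="NNNXXXXNN"))
```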

--omit_demultiplexing

Skips the demultiplexing step. This is the only parameter that allows PARANOiD to run without a barcode file. Requires that each sample is provided in its own FASTQ file. Replicates should be named as described in the explanation of the barcode file, followed by the FASTQ extension.

Usage:

--omit_demultiplexing

--domain

Chooses between Bowtie2 and STAR for aligning reads to the reference sequence. Bowtie2 should be used for prokaryotic organisms or transcript sequences, while STAR should be used for eukaryotic organisms (or rather all splicing-capable organisms) because STAR is splicing-aware. When using STAR for splicing-capable organisms, it is highly recommended to provide an annotation file in addition to the reference.

Options:
pro -> Bowtie2 (default)
eu -> STAR

Usage (default):

--domain pro

--max_alignments

Maximum number of alignments the mapping tool reports per read. It is not guaranteed that this many alignments are found for every read. To find as many alignments as possible, use the parameter --report_all_alignments.

Usage (default):

--max_alignments 1

--report_all_alignments

If used, the mapping tools report all alignments rather than a limited number. Overrides the option --max_alignments.

Usage:

--report_all_alignments

--output

Specifies the output directory generated by PARANOiD.

Usage (default):

--output ./output

--min_length

Specifies the minimum length a read must have after adapter removal to be retained. Reads that become shorter during adapter removal will be filtered out.

Usage (default):

--min_length 30

--min_qual

Specifies the minimum base quality. All bases below that quality are cut off. The quality score (also known as Phred quality score) describes the certainty that a base was called correctly and is typically calculated as follows, with \(e\) being the error probability: \(Q = -10 \log_{10}(e)\)

===================  =================  ========
Phred quality score  Error probability  Accuracy
===================  =================  ========
10                   10%                90%
20                   1%                 99%
30                   0.1%               99.9%
40                   0.01%              99.99%
===================  =================  ========

Usage (default):

--min_qual 20
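The relation between Phred score, error probability and accuracy can be checked with a few lines of Python. The function names are chosen for illustration only.

```python
import math


def error_probability(q):
    """Error probability e for a Phred quality score Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)


def phred_score(e):
    """Phred quality score Q for an error probability e."""
    return -10 * math.log10(e)


# Recompute the error probability and accuracy for common Phred scores.
for q in (10, 20, 30, 40):
    e = error_probability(q)
    print(f"Q{q}: error probability {e:.4%}, accuracy {1 - e:.2%}")
```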

--min_percent_qual_filter

Percentage of nucleotides in a read that need to have a quality score above the chosen minimum base quality. Reads with fewer nucleotides above the desired quality will be removed.

Usage (default):

--min_percent_qual_filter 90
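As an illustration of this filter, the following Python sketch checks a single FASTQ quality string. It assumes Phred+33 encoding and counts bases whose score is at least the minimum; whether the boundary value itself counts depends on the filtering tool PARANOiD calls, so treat the comparison as an assumption. The function name is hypothetical.

```python
def passes_quality_filter(quality_string, min_qual=20, min_percent=90, offset=33):
    """Return True if enough bases of a read reach the minimum quality.

    quality_string: quality line of a FASTQ record (Phred+33 assumed).
    """
    if not quality_string:
        return False
    # Decode each quality character into its Phred score.
    scores = [ord(ch) - offset for ch in quality_string]
    good = sum(1 for q in scores if q >= min_qual)
    return 100 * good / len(scores) >= min_percent


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(passes_quality_filter("IIIIIIIII#"))  # 9 of 10 bases pass
print(passes_quality_filter("IIIIIIII##"))  # 8 of 10 bases pass
```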

--barcode_mismatches

Number of mismatches allowed within the experimental barcode for a read to still be assigned to an experiment. Typically, experimental barcodes should be designed to differ from each other by at least 3 nucleotides so that one mismatch can be tolerated without ambiguity.

Usage (default):

--barcode_mismatches 1
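The reasoning behind the 3-nucleotide rule can be sketched in Python: if all barcodes differ pairwise in at least 3 positions, a read with one mismatch is still closer to its true barcode than to any other. The experiment names, barcode sequences and function names below are hypothetical, and the assignment logic is a simplified stand-in for the demultiplexing tool.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two experimental barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes.values(), 2))


def assign_experiment(observed, barcodes, max_mismatches=1):
    """Return the experiment whose barcode matches, or None if none or several match."""
    hits = [name for name, seq in barcodes.items()
            if hamming(observed, seq) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical experiment names and barcodes.
barcodes = {"exp_A": "ACGTAC", "exp_B": "TGCAAC", "exp_C": "ACCATC"}
print(min_pairwise_distance(barcodes))        # design check
print(assign_experiment("ACGTAA", barcodes))  # one mismatch to exp_A
print(assign_experiment("GGGGGG", barcodes))  # matches nothing
```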

--mapq

Minimum alignment quality (MAPQ score) an alignment needs in order to be retained. The interpretation of score values depends on the aligner specified via --domain. All alignments with a MAPQ score below this value are removed after the alignment step. Please note that the tables below give only a brief overview of the meaning of MAPQ scores; the details can be more complex than shown here. The MAPQ score can be found in column 5 of alignment files (SAM/BAM/CRAM).

Usage (default):

--mapq 2
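As a minimal illustration of what this threshold does, the Python sketch below filters SAM-formatted text lines by the value in column 5. It is not the command PARANOiD runs; the function name and the example records are hypothetical.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=2):
    """Yield header lines and alignments whose MAPQ (column 5) is at least min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ and are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            yield line


# Hypothetical SAM content: one header line and two alignments.
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t42\t30M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t250\t1\t30M\t*\t0\t0\t*\t*",
]
for kept_line in filter_sam_by_mapq(sam_lines, min_mapq=2):
    print(kept_line)
```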

Score meanings for Bowtie2 (--domain pro)

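In addition to the descriptions in the table, a higher MAPQ score means fewer allowed mismatches, taking the base quality of each mismatched nucleotide into account.

==========  ===========================================================================================================
MAPQ score  Description
==========  ===========================================================================================================
0           All mappable reads
1           Multimapped reads that have the same alignment quality at different positions
2-39        Multimapped reads that have one specific alignment with a better score than the other potential positions
40          Reads mappable to only one position
42          Reads mappable to only one position with an almost perfect alignment (best possible MAPQ score in Bowtie2)
==========  ===========================================================================================================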

More information can be found here

Score meanings for STAR (--domain eu)

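==========  =========================================================================
MAPQ score  Description
==========  =========================================================================
0           Maps to 10 or more positions
1           Maps to 4-9 positions
2           Maps to 3 positions
3           Maps to 2 positions
255         Reads mappable to only one position. Best MAPQ score in STAR alignments.
==========  =========================================================================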

The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and \(\text{MAPQ} = \operatorname{int}\left(-10 \log_{10}\left(1 - 1/N\right)\right)\) for multi-mapping reads, where \(N\) is the number of positions the read maps to. This scheme is the same as the one used by TopHat [...]
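The formula can be evaluated with a short Python sketch. It assumes that "int" means rounding to the nearest integer, because that reading reproduces the STAR table above; plain truncation would give different values for some position counts, and the aligner's own implementation may differ. The function name is chosen for illustration.

```python
import math


def star_mapq(n_positions):
    """MAPQ as described for STAR: 255 if unique, else the rounded formula value."""
    if n_positions < 1:
        raise ValueError("a mapped read has at least one position")
    if n_positions == 1:
        return 255
    # Assumption: 'int' in the formula is read as rounding to the nearest integer.
    return round(-10 * math.log10(1 - 1 / n_positions))


for n in (1, 2, 3, 4, 9, 10, 50):
    print(n, star_mapq(n))
```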

--map_to_transcripts

Use this option when transcript sequences are provided instead of a reference genome. Returns the transcripts with most hits from each sample. More information can be found here

Default: false

Usage:

--map_to_transcripts

--number_top_transcripts

Specifies how many top-hit transcripts are selected per sample when the parameter --map_to_transcripts is used. Since selection is done per sample, the total number of reported transcripts may exceed this value.

Usage (default):

--number_top_transcripts 10
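Why the total can exceed the chosen value is easiest to see in a small Python sketch. The sample names, transcript names, hit counts and the function name are hypothetical; the sketch only illustrates per-sample selection followed by a union.

```python
from collections import Counter


def top_transcripts(hits_per_sample, number_top=10):
    """Select the top transcripts of every sample and return their union."""
    selected = set()
    for hits in hits_per_sample.values():
        # most_common returns the transcripts with the highest hit counts.
        selected.update(name for name, _ in Counter(hits).most_common(number_top))
    return selected


# Hypothetical hit counts per sample.
hits_per_sample = {
    "sample_1": {"tA": 50, "tB": 30, "tC": 5},
    "sample_2": {"tC": 40, "tD": 20, "tA": 1},
}
reported = top_transcripts(hits_per_sample, number_top=2)
print(sorted(reported))  # union of both per-sample selections
```

With two transcripts selected per sample, four transcripts are reported here because the samples favor different transcripts.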

--omit_peak_calling

If specified, peak calling will not be performed. By default, peak calling is performed.

Usage:

--omit_peak_calling

--peak_calling_for_high_coverage

Only has an effect if peak calling is performed. Proteins that bind across the whole reference genome can cause PureCLIP to fail with an error. In our experience, the parameters added by this option help PureCLIP complete its analysis in such cases. Adds the following parameters to the PureCLIP command: -mtc 5000 -mtc2 5000 -ld

Usage:

--peak_calling_for_high_coverage

--peak_calling_regions

Takes effect only if peak calling is enabled. If specified, PureCLIP returns peak regions instead of individual peak sites.

Usage:

--peak_calling_regions

--peak_calling_regions_width

Takes effect only if --peak_calling_regions is enabled. Specifies the width of peak regions reported by PureCLIP.

Usage (default):

--peak_calling_regions_width 8

--gene_id

Only has an effect if an annotation file is provided and the RNA subtype analysis is therefore performed.
Name of the tag used to identify gene IDs. It is found in the last column of annotation files, typically as the first tag-value pair.
This column typically looks like the following:
ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true

In this case, the required tag is ID.

Usage (default):

--gene_id ID
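To illustrate how a tag is looked up in the last column, here is a minimal Python sketch that handles the GFF3 style shown above (tag=value) and the usual GTF style (tag "value"). The function name and the GTF example string are hypothetical, and the parser ignores edge cases such as separators inside quoted values.

```python
def attribute_value(attributes, tag="ID"):
    """Return the value of a tag from the attribute column, or None if absent."""
    for field in attributes.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        if "=" in field:
            # GFF3 style: tag=value
            key, value = field.split("=", 1)
        else:
            # GTF style: tag "value"
            key, _, value = field.partition(" ")
        if key == tag:
            return value.strip().strip('"')
    return None


gff_attributes = (
    "ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;"
    "gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true"
)
print(attribute_value(gff_attributes, tag="ID"))

# Hypothetical GTF-style attribute column.
gtf_attributes = 'gene_id "geneA"; transcript_id "geneA.1";'
print(attribute_value(gtf_attributes, tag="gene_id"))
```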

--color_barplot

Specifies the color of bar plots generated by PARANOiD. Applies to graphs generated in the following analyses: peak height distribution, RNA subtype analysis and experimental barcode distribution. The color is specified using a hexadecimal color code. If unsure which code corresponds to which color, websites such as https://www.color-hex.com/ can help.

Usage (default):

--color_barplot #69b3a2

--run_rna_subtype

Enables the RNA subtype analysis.

Usage:

--run_rna_subtype

--rna_subtypes

Takes effect only if an annotation file is provided and --run_rna_subtype is enabled, triggering the RNA subtype analysis. Specifies which RNA subtypes (or regions) to include in the RNA subtype analysis. Subtypes must be comma-separated and must appear in the feature type column (3rd column) of the annotation file. If these conditions are not met, the analysis may fail or produce incorrect results. If you are not sure which RNA subtypes are included in your annotation file, you can use the script featuretypes-from-gtfgff.awk. Avoid selecting subtypes or regions that are hierarchically related, as they may overlap and cause peaks to appear as ambiguous. Information about the hierarchical structure of RNA subtypes/regions can be obtained here.

Usage (default):

--rna_subtypes 3_prime_UTR,transcript,5_prime_UTR
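Checking the requested subtypes against the annotation can be sketched in Python as follows. This is not the featuretypes-from-gtfgff.awk script, only an illustration of the same idea; the function names and the annotation lines are hypothetical.

```python
from collections import Counter


def feature_types(annotation_lines):
    """Count the feature types found in the 3rd column of GFF/GTF lines."""
    counts = Counter()
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments/directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts


def missing_subtypes(rna_subtypes, counts):
    """Return the requested comma-separated subtypes that the annotation lacks."""
    return [s for s in rna_subtypes.split(",") if s not in counts]


# Hypothetical annotation content.
annotation_lines = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\tsrc\ttranscript\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsrc\t5_prime_UTR\t1\t50\t.\t+\t.\tParent=tx1",
    "chr1\tsrc\t3_prime_UTR\t900\t1000\t.\t+\t.\tParent=tx1",
]
counts = feature_types(annotation_lines)
print(sorted(counts))
print(missing_subtypes("3_prime_UTR,transcript,CDS", counts))
```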

--report_not_assigned

Reports unassigned peaks in the RNA subtype analysis. These are peaks that could not be assigned to any of the named features.

Usage:

--report_not_assigned

--split_ambiguous

Reports ambiguous peaks as part of the regular distribution across the named features instead of in a separate ambiguous category during the RNA subtype analysis. Ambiguous peaks are those that were assigned to more than one of the named features.

Usage:

--split_ambiguous

--annotation_extension

Extension of the annotation file used for the RNA subtype analysis. Accepts the values GFF for GFF3 files and GTF for GTF files.

Usage (default):

--annotation_extension GFF

--peak_distance

Enables the peak distance analysis step.

Usage:

--peak_distance

--distance

Maximum allowed distance between peaks for the peak distance analysis.

Usage (default):

--distance 30

--percentile

Peak percentiles for peak distance analysis and sequence extraction/motif analysis. Only peaks with values above this threshold are considered; all others are treated as background noise and ignored. For example, a percentile value of 90 includes only the top 10% of peaks. Only applies when peak calling is omitted.

Usage (default):

--percentile 90
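A percentile cutoff of this kind can be sketched in Python. The document does not state how PARANOiD computes the percentile, so the nearest-rank method used here is an assumption, as are the function name and the dictionary layout of the peaks.

```python
import math


def peaks_above_percentile(peaks, percentile=90):
    """Keep peaks whose value lies above the given percentile (nearest-rank method).

    peaks: dict mapping position -> peak value (hypothetical layout).
    """
    if not peaks or percentile <= 0:
        return dict(peaks)
    values = sorted(peaks.values())
    # Nearest-rank percentile: the value at rank ceil(p / 100 * n).
    rank = math.ceil(percentile * len(values) / 100)
    threshold = values[rank - 1]
    return {pos: value for pos, value in peaks.items() if value > threshold}


# Ten hypothetical peaks with values 1..10 at positions 1..10.
peaks = {pos: pos for pos in range(1, 11)}
print(peaks_above_percentile(peaks, percentile=90))  # top 10% of peaks
print(peaks_above_percentile(peaks, percentile=50))  # upper half
```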

--sequence_extraction

Enables sequence extraction and the subsequent motif detection step.

Usage:

--sequence_extraction

--seq_len

Only applies if motif detection is enabled. Length in nucleotides that is extracted from the reference on each side of a peak. A value of 20 leads to sequences of 41 nucleotides being extracted (i.e. 20 nt upstream + 1 cross-link nucleotide + 20 nt downstream).

Usage (default):

--seq_len 20
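The window arithmetic can be illustrated with a minimal Python sketch. It uses a 0-based index for the cross-link site and returns None when the window would run past the ends of the reference; how PARANOiD treats such sites is not stated here, so that behavior is an assumption. The function name and the reference string are hypothetical.

```python
def extract_window(reference, cl_index, seq_len=20):
    """Extract seq_len nt upstream + cross-link nt + seq_len nt downstream.

    cl_index: 0-based index of the cross-link nucleotide in the reference.
    Returns None if the window does not fit inside the reference (assumption).
    """
    start = cl_index - seq_len
    end = cl_index + seq_len + 1  # slice end is exclusive
    if start < 0 or end > len(reference):
        return None
    return reference[start:end]


# Hypothetical 80 nt reference sequence.
reference = "ACGT" * 20
window = extract_window(reference, cl_index=40, seq_len=20)
print(len(window))                               # 2 * seq_len + 1
print(extract_window(reference, cl_index=5))     # too close to the start
```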

--omit_cl_nucleotide

Only applies when motif detection is performed. The nucleotide at the cross-link site will be replaced with an N during sequence extraction. This can improve motif detection, as iCLIP protocols often show a uridine (U) bias at cross-link sites.

Usage:

--omit_cl_nucleotide

--omit_cl_width

Only applies when motif detection is performed and the cross-link nucleotide is omitted (--omit_cl_nucleotide). Replaces nucleotides flanking the cross-link site with N to reduce artifacts from uridine-rich regions. The value defines how many nucleotides upstream and downstream of the cross-link site are masked.

Usage (default):

--omit_cl_width 0
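The combined effect of --omit_cl_nucleotide and --omit_cl_width on an extracted sequence can be sketched in Python. The sketch assumes the cross-link nucleotide sits in the center of the extracted sequence (index seq_len); the function name and the example sequence are hypothetical.

```python
def mask_cross_link(sequence, seq_len, omit_cl_width=0):
    """Replace the cross-link nucleotide and its flanks with N.

    sequence: extracted sequence of length 2 * seq_len + 1.
    omit_cl_width: number of additionally masked nucleotides on each side.
    """
    center = seq_len  # index of the cross-link nucleotide
    start = max(0, center - omit_cl_width)
    end = min(len(sequence), center + omit_cl_width + 1)
    return sequence[:start] + "N" * (end - start) + sequence[end:]


# Hypothetical extracted sequence with seq_len 4 and a T at the cross-link site.
extracted = "AAAATAAAA"
print(mask_cross_link(extracted, seq_len=4))                   # only the site
print(mask_cross_link(extracted, seq_len=4, omit_cl_width=2))  # site plus 2 nt each side
```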

--remove_overlaps

Only applies when motif detection is performed. Removes cross-link sites with lower peak values if their extracted sequences would overlap with those of neighboring sites. This avoids duplicated sequence stretches during motif detection.

Usage:

--remove_overlaps
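One way to picture this step is a greedy selection that visits sites from the highest to the lowest peak value and drops any site whose window would overlap an already kept one. The exact procedure PARANOiD uses is not described here, so the Python sketch below is an assumption; it also treats all sites as lying on a single reference sequence, and the function name is hypothetical.

```python
def remove_overlapping_sites(peaks, seq_len=20):
    """Greedily keep the highest peaks whose extraction windows do not overlap.

    peaks: dict mapping position -> peak value on one reference sequence.
    Two windows of half-width seq_len overlap if the sites are at most
    2 * seq_len nucleotides apart.
    """
    kept = []
    # Visit sites from highest to lowest peak value (position breaks ties).
    for pos, _ in sorted(peaks.items(), key=lambda item: (-item[1], item[0])):
        if all(abs(pos - other) > 2 * seq_len for other in kept):
            kept.append(pos)
    return sorted(kept)


# Hypothetical peaks: position -> peak value.
example_peaks = {100: 5, 110: 9, 141: 2, 200: 3}
print(remove_overlapping_sites(example_peaks, seq_len=20))
```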

--max_motif_num

Only applies when motif detection is performed. Maximum number of motifs reported by STREME.

Usage (default):

--max_motif_num 50

--min_motif_width

Only applies when motif detection is performed. Minimum length of motifs reported by STREME. Must be at least 3.

Usage (default):

--min_motif_width 8

--max_motif_width

Only applies when motif detection is performed. Maximum length of motifs reported by STREME. Must not exceed 30.

Usage (default):

--max_motif_width 15