PARANOiD Parameters

Explanation of all PARANOiD parameters

--reads

Essential parameter!
Specifies the file containing the reads obtained from iCLIP experiments.
Expects a FASTQ file.

Usage:

--reads /path/to/input-file.fastq

--barcodes

Essential parameter!
Specifies the file containing barcode sequences and experiment names. It is needed to split the reads and assign them to their experiments.
Expects a TSV file.

Usage:

--barcodes /path/to/barcodes.tsv

--reference

Essential parameter!
Specifies the reference genome to which reads are aligned in order to determine the location of cross-link sites.
Expects a FASTA file.

Usage:

--reference /path/to/reference.fasta

--annotation

Expects a GFF or GTF file.

Usage:

--annotation /path/to/annotation.gff

--merge_replicates

Merges replicates into a single representative form. For this to work, experiment names need to follow a particular naming scheme, which is explained in the barcodes section.

Default: false

Usage:

--merge_replicates

--correlation_analysis

Only applies when replicate merging is chosen. Does a correlation analysis of replicates to show their similarity (and thus if they should be merged at all). Can cause problems with large reference genomes due to excessive RAM usage.

Default: false

Usage:

--correlation_analysis

--barcode_pattern

Adapts the barcode pattern to different protocols. The default protocol is iCLIP2. Ns represent the random barcode and Xs the experimental barcode.

Usage (default):

--barcode_pattern NNNNNXXXXXXNNNN

Example for iCLIP1:

--barcode_pattern NNNXXXXNN
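How such a pattern splits a read can be sketched in a few lines of Python. This is only an illustration, not PARANOiD's actual implementation: the function name split_barcode is hypothetical, and the sketch assumes that the pattern is matched against the first bases of each read.

```python
def split_barcode(read_prefix, pattern="NNNNNXXXXXXNNNN"):
    """Split the leading bases of a read into random (N) and experimental (X) barcode.

    Assumption: the barcode pattern starts at the first base of the read.
    """
    if len(read_prefix) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    # Collect the bases that sit under an N or an X in the pattern.
    random_bc = "".join(b for b, p in zip(read_prefix, pattern) if p == "N")
    experimental_bc = "".join(b for b, p in zip(read_prefix, pattern) if p == "X")
    return random_bc, experimental_bc


# iCLIP2 default pattern: 5 random, 6 experimental, 4 random bases.
print(split_barcode("ACGTAGGTTCCAAGC"))
# iCLIP1 pattern: 3 random, 4 experimental, 2 random bases.
print(split_barcode("ACGTTTTGG", "NNNXXXXNN"))
```

The random bases on both sides of the experimental barcode are concatenated here for simplicity.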

--domain

Chooses between Bowtie2 and STAR for aligning reads to the reference sequence. Bowtie2 should be used for prokaryotic organisms or transcript sequences, while STAR should be used for eukaryotic organisms (or rather all organisms capable of splicing), as STAR is splice-aware. When using STAR for splicing-capable organisms, it is highly recommended to provide an annotation file in addition to the reference.

Options:
pro -> Bowtie2 (default)
eu -> STAR

Usage (default):

--domain pro

--max_alignments

Maximum number of alignments the mapping tool reports per read. It is not guaranteed that this many alignments are found for every read. To find as many alignments as possible, use the parameter --report_all_alignments.

Usage (default):

--max_alignments 1

--report_all_alignments

If used, the mapping tools report all alignments rather than a limited number. Overrides the option --max_alignments.

Usage:

--report_all_alignments

--output

Specify directory to which output generated by PARANOiD will be written.

Usage (default):

--output ./output

--min_length

Minimum length a read needs to have after adapter removal to be retained. Reads that become shorter than this during adapter removal are filtered out.

Usage (default):

--min_length 30

--min_qual

Minimum base quality. Bases below this quality are cut off. The quality score (also known as the Phred quality score) describes how certain the base call is and is typically calculated as follows, with \(e\) being the error probability: \(Q = -10 \log_{10}(e)\)

| Phred quality score | Error probability | Accuracy |
|---|---|---|
| 10 | 10% | 90% |
| 20 | 1% | 99% |
| 30 | 0.1% | 99.9% |
| 40 | 0.01% | 99.99% |

Usage (default):

--min_qual 20
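The relation between Phred score and error probability can be checked with a short Python sketch. The function names are chosen for illustration only and are not part of PARANOiD.

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(e) to get the error probability e."""
    return 10 ** (-q / 10)


def error_probability_to_phred(e):
    """Phred quality score for an error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


# Print error probability and accuracy for a few common scores.
for q in (10, 20, 30, 40):
    e = phred_to_error_probability(q)
    print(f"Q{q}: error probability {e:.4%}, accuracy {1 - e:.4%}")
```

With the default of 20, bases whose estimated error probability exceeds 1% are cut off.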

--min_percent_qual_filter

Percentage of nucleotides in a read that need to have a quality score above the chosen minimum base quality. Reads with fewer nucleotides above that quality are removed.

Usage (default):

--min_percent_qual_filter 90
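The filter can be pictured with the following Python sketch. It is not PARANOiD's own code: the function name is hypothetical, the quality string is assumed to be Phred+33 encoded (as in most current FASTQ files), and treating scores equal to the threshold as passing is an assumption as well.

```python
def passes_quality_filter(quality_string, min_qual=20, min_percent=90, offset=33):
    """Return True if enough bases of a read reach the minimum quality.

    Assumptions: Phred+33 encoding; thresholds are treated as inclusive.
    """
    scores = [ord(char) - offset for char in quality_string]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_qual)
    return 100 * good / len(scores) >= min_percent


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(passes_quality_filter("IIIIIIIII#"))  # 9 of 10 bases are good
print(passes_quality_filter("IIIIIIII##"))  # 8 of 10 bases are good
```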

--barcode_mismatches

Number of mismatches allowed within the experimental barcode to still assign a read to an experiment. Typically, experimental barcodes should be designed with a Hamming distance of at least 3 to each other in order to allow one mismatch.

Usage (default):

--barcode_mismatches 1
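Why a pairwise distance of 3 permits one mismatch: a read with a single error is at distance 1 from its true barcode and at distance 2 or more from every other barcode, so the assignment stays unambiguous. The Python sketch below illustrates this with made-up barcodes. The function names and barcode sequences are hypothetical, and the assignment logic is an illustration rather than PARANOiD's implementation.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes of a set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign(observed, barcodes, max_mismatches=1):
    """Return the single barcode within max_mismatches, or None if none or several match."""
    hits = [bc for bc in barcodes if hamming(observed, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


barcodes = ["AAAAAA", "CCCAAA", "GGGGGG"]  # hypothetical experimental barcodes
print(min_pairwise_distance(barcodes))
print(assign("AAAAAT", barcodes))  # one mismatch to AAAAAA
print(assign("ACCAAT", barcodes))  # too far from every barcode
```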

--mapq

Minimum mapping quality (MAPQ score) an alignment needs in order to be retained. The meaning of the scores depends on the aligner chosen via --domain. All alignments with a MAPQ score below this value are removed after the alignment step. Please note that the descriptions given here are only a short overview of the meaning of MAPQ scores, which can be more complex in detail. The MAPQ score can be found in column 5 of alignment files (SAM/BAM/CRAM).

Usage (default):

--mapq 2
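The effect of the threshold on a SAM file can be sketched in Python as follows. This is only an illustration of the filtering rule on column 5. The function name is hypothetical, and the pipeline itself may well use a dedicated tool for this step.

```python
def filter_sam_by_mapq(lines, min_mapq=2):
    """Yield header lines and alignments whose MAPQ (column 5) is at least min_mapq."""
    for line in lines:
        if line.startswith("@"):  # SAM header lines are passed through
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 holds the MAPQ score
            yield line


sam_lines = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
for kept in filter_sam_by_mapq(sam_lines, min_mapq=2):
    print(kept)
```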

Score meanings for Bowtie2 (--domain pro)

Apart from the description in the table, a higher MAPQ score means that fewer mismatches are allowed (how much a mismatch counts depends on the base quality of the mismatched nucleotide).

| MAPQ score | Description |
|---|---|
| 0 | All mappable reads |
| 1 | Multimapped reads that have the same alignment quality at different positions |
| 2-39 | Multimapped reads that have one specific alignment with a better score than the other potential positions |
| 40 | Reads mappable to only one position |
| 42 | Reads mappable to only one position with an almost perfect alignment. Best MAPQ score in Bowtie2 alignments |

More information can be found here

Score meanings for STAR (--domain eu)

| MAPQ score | Description |
|---|---|
| 0 | Maps to 10 or more positions |
| 1 | Maps to 4-9 positions |
| 2 | Maps to 3 positions |
| 3 | Maps to 2 positions |
| 255 | Reads mappable to only one position. Best MAPQ score in STAR alignments. |

The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and \(\text{MAPQ} = \operatorname{int}\left(-10 \log_{10}\left(1 - 1/[\text{number of positions the read maps to}]\right)\right)\) for multi-mapping reads. This scheme is same as the one used by TopHat [...]

--map_to_transcripts

Should be used when transcripts are given as reference instead of a reference genome. Returns the transcripts with the most hits from each sample. More information can be found here

Default: false

Usage:

--map_to_transcripts

--number_top_transcripts

The number of transcripts with the most hits that are selected from each sample if the parameter --map_to_transcripts is used. As this number is chosen from each sample separately, the total number of transcripts can exceed it.

Usage (default):

--number_top_transcripts 10
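The following Python sketch shows why the total can be larger than the chosen number: the top transcripts are picked per sample and then combined. The function name, the sample names and the hit counts are invented for illustration, and the sketch does not claim to mirror PARANOiD's internal selection code.

```python
def select_top_transcripts(hits_per_sample, number_top_transcripts=10):
    """Union of the top-N transcripts (by hit count) of every sample."""
    selected = set()
    for counts in hits_per_sample.values():
        # Rank the transcripts of this sample by their number of hits.
        ranked = sorted(counts, key=counts.get, reverse=True)
        selected.update(ranked[:number_top_transcripts])
    return selected


hits = {  # hypothetical hit counts per sample
    "sample_a": {"tx1": 50, "tx2": 30, "tx3": 5},
    "sample_b": {"tx3": 40, "tx4": 20, "tx1": 1},
}
# Top 2 per sample gives 4 transcripts in total, since the samples disagree.
print(sorted(select_top_transcripts(hits, number_top_transcripts=2)))
```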

--omit_peak_calling

If specified, peak calling is skipped. By default, peak calling is performed.

Usage:

--omit_peak_calling

--peak_calling_for_high_coverage

Only has an effect if peak calling is performed. Proteins covering the whole reference genome can cause PureCLIP to throw an error. In our experience, the parameters added by this argument can help PureCLIP to complete its analysis. Adds the following arguments to the PureCLIP execution: -mtc 5000 -mtc2 5000 -ld

Usage:

--peak_calling_for_high_coverage

--peak_calling_regions

Only has an effect if peak calling is performed. If specified, PureCLIP returns peak regions instead of single peaks.

Usage:

--peak_calling_regions

--peak_calling_regions_width

Only has an effect if --peak_calling_regions is specified. Changes the width of the peak regions returned by PureCLIP.

Usage (default):

--peak_calling_regions_width 8

--gene_id

Only has an effect if an annotation file is provided and the RNA subtype analysis is therefore performed.
Name of the tag that holds the gene ID. It is found in the last column of annotation files, typically as the first tag-value pair.
The column looks similar to this:
ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true
In this case the required tag is ID.

Usage (default):

--gene_id ID
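To check which tag to pass, the attribute column can be split into its tag-value pairs. The Python sketch below does this for the GFF3-style example shown above. The function name is hypothetical, and GTF files, which use a different attribute syntax, are not handled by this sketch.

```python
def get_attribute(attributes, tag="ID"):
    """Return the value of a tag from a GFF3-style attribute column, or None.

    Assumption: pairs look like tag=value and are separated by ';'.
    """
    for pair in attributes.strip().split(";"):
        key, separator, value = pair.strip().partition("=")
        if separator and key == tag:
            return value
    return None


column = (
    "ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;"
    "gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true"
)
print(get_attribute(column, "ID"))
print(get_attribute(column, "gene_biotype"))
```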

--color_barplot

Color of barplots returned by PARANOiD. Affects graphs generated by the peak height distribution, the RNA subtype analysis and the experimental barcode distribution. The color is stated as a hexadecimal color code. If unsure which code translates to which color, several websites can help to pick the correct one. Quote the value on the command line, since an unquoted # starts a comment in most shells.

Usage (default):

--color_barplot '#69b3a2'

--rna_subtypes

Only has an effect if an annotation file is provided and the RNA subtype analysis is therefore performed. RNA subtypes/regions that shall be included in the RNA subtype analysis. RNA subtypes need to be separated by a comma (,) and must appear in the feature type column (3rd column) of the annotation file. If either requirement is not met, the analysis will either not be performed correctly or be aborted. If you are not sure which RNA subtypes are included in your annotation file, you can use the script featuretypes-from-gtfgff.awk. Additionally, take care not to choose subtypes/regions that are in a hierarchical relationship to each other, as they can cover the same regions and thus make affected peaks appear ambiguous. Information about the hierarchical structure of RNA subtypes/regions can be obtained here.

Usage (default):

--rna_subtypes 3_prime_UTR,transcript,5_prime_UTR
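For readers who prefer Python over awk, the feature types in column 3 can also be listed with a few lines of code. This sketch is not the featuretypes-from-gtfgff.awk script and makes no claim about how that script works. The function name and the example lines are made up for illustration.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in the feature type column (3rd column) of GFF/GTF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):  # skip blank and comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


annotation = [  # hypothetical annotation lines
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\tRefSeq\ttranscript\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tRefSeq\t3_prime_UTR\t900\t1000\t.\t+\t.\tParent=tx1",
    "chr1\tRefSeq\ttranscript\t1\t800\t.\t+\t.\tID=tx2;Parent=gene1",
]
for feature_type, number in sorted(count_feature_types(annotation).items()):
    print(feature_type, number)
```

For a real file, the lines can be read with open("/path/to/annotation.gff") and passed to the same function.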

--omit_peak_distance

Omits the peak distance analysis.

Usage:

--omit_peak_distance

--distance

Max distance used for the peak distance analysis.

Usage (default):

--distance 30

--percentile

Peak percentile used for the peak distance analysis and the sequence extraction/motif analysis. Only peaks with a value above this threshold are considered, while all peaks below it are omitted as background noise. A percentile of 90 means that only the top 10% of peaks are used.

Usage (default):

--percentile 90
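A percentile cutoff of this kind can be sketched in Python as shown below. The exact percentile definition used by PARANOiD is not stated here, so the nearest-rank method in this sketch is an assumption, as are the function names and the example peak values.

```python
import math


def percentile_threshold(values, percentile=90):
    """Nearest-rank percentile of a list of peak values (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def filter_peaks(peaks, percentile=90):
    """Keep only peaks whose value lies above the percentile threshold."""
    threshold = percentile_threshold(list(peaks.values()), percentile)
    return {position: value for position, value in peaks.items() if value > threshold}


# Hypothetical peaks: positions 1..10 with peak values 1..10.
peaks = {position: position for position in range(1, 11)}
print(filter_peaks(peaks, percentile=90))  # keeps the top 10% of peaks
```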

--omit_sequence_extraction

Omits the motif detection.

Usage:

--omit_sequence_extraction

--seq_len

Only applies when motif detection is performed. Length in nucleotides that is extracted from the reference on each side of a peak. A value of 20 leads to sequences of 41 nucleotides being extracted (20 nt upstream, the cross-link nucleotide, 20 nt downstream).

Usage (default):

--seq_len 20
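The window arithmetic can be illustrated with a short Python sketch. The function name is hypothetical, positions are assumed to be 0-based, and skipping sites whose window would run past the end of the reference is an assumption about edge handling, not documented behaviour.

```python
def extract_window(reference, cl_position, seq_len=20):
    """Extract seq_len nt upstream, the cross-link nt and seq_len nt downstream.

    Assumptions: 0-based position; windows that leave the reference are skipped.
    """
    start = cl_position - seq_len
    end = cl_position + seq_len + 1  # +1 to include the cross-link nucleotide
    if start < 0 or end > len(reference):
        return None
    return reference[start:end]


reference = "A" * 20 + "T" + "C" * 20  # toy reference of 41 nt
window = extract_window(reference, cl_position=20, seq_len=20)
print(len(window), window[20])
```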

--omit_cl_nucleotide

Only applies when motif detection is performed. The nucleotide directly at the cross-linking position will be substituted with an N when extracting sequences. Can improve the motif detection since iCLIP tends to have a bias towards U when cross-linking which can influence the motif search.

Usage:

--omit_cl_nucleotide

--omit_cl_width

Only applies when motif detection is performed and the cross-link nucleotide is omitted. Replaces nucleotides on both sides of the cross-linking position with N to avoid potential uridine polymers, which can negatively influence the motif search. The value determines the number of nucleotides on each side that are replaced.

Usage (default):

--omit_cl_width 0
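The combined effect of --omit_cl_nucleotide and --omit_cl_width on an extracted sequence can be sketched as follows. The function name is hypothetical, and the sketch assumes that the cross-link nucleotide sits in the centre of the extracted sequence, as described for --seq_len.

```python
def mask_crosslink(sequence, omit_cl_width=0):
    """Replace the central nucleotide and omit_cl_width neighbours per side with N.

    Assumption: the cross-link nucleotide is the central base of the sequence.
    """
    centre = len(sequence) // 2
    start = max(0, centre - omit_cl_width)
    end = min(len(sequence), centre + omit_cl_width + 1)
    return sequence[:start] + "N" * (end - start) + sequence[end:]


print(mask_crosslink("ACGATTTGCAC", omit_cl_width=0))  # only the cross-link nt
print(mask_crosslink("ACGATTTGCAC", omit_cl_width=1))  # plus one nt on each side
```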

--remove_overlaps

Only applies when motif detection is performed. Removes cross-link sites with lower peak values if their extracted sequence would overlap with the sequence from another cross-link site. This avoids duplicated sequences during motif detection.

Usage:

--remove_overlaps
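One possible reading of this rule is a greedy selection: visit the sites from highest to lowest peak value and keep a site only if its window does not overlap the window of a site that was already kept. The Python sketch below implements that reading. It is an assumption, not PARANOiD's confirmed algorithm. The function name and the peak values are hypothetical, and all sites are assumed to lie on the same reference sequence and strand.

```python
def remove_overlaps(peaks, seq_len=20):
    """Greedily keep the strongest sites whose extracted windows do not overlap.

    Two windows of 2 * seq_len + 1 nt overlap if their centres are at most
    2 * seq_len positions apart.
    """
    kept = []
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    for position, _value in ranked:
        if all(abs(position - other) > 2 * seq_len for other in kept):
            kept.append(position)
    return sorted(kept)


peaks = {100: 5.0, 110: 9.0, 200: 3.0}  # hypothetical position -> peak value
print(remove_overlaps(peaks, seq_len=20))  # site 100 overlaps the stronger site 110
```

A stricter variant would drop every site that overlaps any stronger site, even one that was itself removed. The two variants can differ for chains of closely spaced peaks.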

--max_motif_num

Only applies when motif detection is performed. Maximum number of motifs reported by STREME.

Usage (default):

--max_motif_num 50

--min_motif_width

Only applies when motif detection is performed. Minimum length of motifs reported by STREME. Cannot be lower than 3.

Usage (default):

--min_motif_width 8

--max_motif_width

Only applies when motif detection is performed. Maximum length of motifs reported by STREME. Cannot be higher than 30.

Usage (default):

--max_motif_width 15