PARANOiD Parameters
Explanation of all PARANOiD parameters
--reads
FASTQ file.
Usage:
--reads /path/to/input-file.fastq
--barcodes
TSV file.
Usage:
--barcodes /path/to/barcodes.tsv
--reference
FASTA file.
Usage:
--reference /path/to/reference.fasta
--annotation
GFF or GTF file.
Usage:
--annotation /path/to/annotation.gff
--merge_replicates
Merges replicates into a single representative form. For this to work, experiment names need to follow a particular naming scheme, which is explained further in the barcodes section.
Default: false
Usage:
--merge_replicates
--correlation_analysis
Only applies when replicate merging is chosen. Performs a correlation analysis of the replicates to show how similar they are (and thus whether they should be merged at all). Can cause problems with large reference genomes due to excessive RAM usage.
Default: false
Usage:
--correlation_analysis
--barcode_pattern
Adapt barcode patterns to different protocols. Default protocol is iCLIP2.
Ns represent the random barcode and Xs the experimental barcode.
Usage (default):
--barcode_pattern NNNNNXXXXXXNNNN
Example for iCLIP1:
--barcode_pattern NNNXXXXNN
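To illustrate how such a pattern can be read, the following minimal sketch splits the start of a read into the random and the experimental barcode. It is not PARANOiD's own implementation; `split_barcode` and the example read are hypothetical, and the sketch assumes the barcode sits at the very beginning of the read.

```python
def split_barcode(read, pattern):
    """Split a read into (random barcode, experimental barcode, remaining sequence).

    Assumption: the barcode occupies the first len(pattern) bases of the read.
    'N' marks random barcode positions, 'X' experimental barcode positions.
    """
    if len(read) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    random_bc = []
    experimental_bc = []
    for base, symbol in zip(read, pattern):
        if symbol == "N":
            random_bc.append(base)
        elif symbol == "X":
            experimental_bc.append(base)
        else:
            raise ValueError(f"unexpected pattern symbol: {symbol!r}")
    return "".join(random_bc), "".join(experimental_bc), read[len(pattern):]


# iCLIP2 default pattern: 5 random, 6 experimental, 4 random positions.
# The read is made up for illustration.
random_bc, experimental_bc, remainder = split_barcode(
    "ACGTAGGATCCTTGCAAAACCCGGG", "NNNNNXXXXXXNNNN"
)
print(random_bc, experimental_bc, remainder)
```

With the iCLIP1 pattern `NNNXXXXNN`, the same function would take 3 random, 4 experimental and 2 random bases from the read start.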
--domain
Chooses whether Bowtie2 (pro) or STAR (eu) is used to align reads to the reference sequence. Bowtie2 should be used for prokaryotic organisms or transcript sequences, while STAR should be used for eukaryotic organisms (or rather all organisms capable of splicing), as STAR is splicing-aware. When using STAR for splicing-capable organisms, it is highly recommended to provide an annotation file in addition to the reference.
Usage (default):
--domain pro
--max_alignments
Maximum number of alignments the mapping tool reports per read. It is not guaranteed that this many alignments are found for each read. To find as many alignments as possible, use the parameter --report_all_alignments.
Usage (default):
--max_alignments 1
--report_all_alignments
If used, the mapping tool reports all alignments rather than only a few. Overrides the option --max_alignments.
Usage:
--report_all_alignments
--output
Specify directory to which output generated by PARANOiD will be written.
Usage (default):
--output ./output
--min_length
Minimum length a read needs to have after adapter removal in order to be retained. Reads that become shorter than this during adapter removal are filtered out.
Usage (default):
--min_length 30
--min_qual
Minimum quality for bases. All bases below that quality are cut off. The quality score (also known as Phred quality score) describes the certainty that a base was called correctly and is typically calculated as follows, with e being the error probability: \(Q = -10 \log_{10}(e)\)
Phred quality score | Error probability | Accuracy |
---|---|---|
10 | 10% | 90% |
20 | 1% | 99% |
30 | 0.1% | 99.9% |
40 | 0.01% | 99.99% |
Usage (default):
--min_qual 20
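The relationship between Phred score and error probability can be checked with a few lines of Python. This is only a worked example of the formula above; the function names are chosen for illustration.

```python
import math


def phred_to_error_probability(q):
    """Error probability e for a Phred quality score Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(e):
    """Phred quality score Q for an error probability e."""
    return -10 * math.log10(e)


# Reproduce the rows of the table: Q, error probability, accuracy
for q in (10, 20, 30, 40):
    e = phred_to_error_probability(q)
    print(f"Q={q}: error probability {e:.4%}, accuracy {1 - e:.2%}")
```

The default of 20 therefore corresponds to an expected error rate of 1 in 100 base calls.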
--min_percent_qual_filter
Percentage of nucleotides in a read that need to have a quality score above the chosen minimum base quality. Reads with fewer nucleotides above the desired quality are removed.
Usage (default):
--min_percent_qual_filter 90
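The filter can be pictured roughly as follows. This is a sketch, not the tool PARANOiD actually calls: `passes_quality_filter` is a hypothetical helper, the quality string is assumed to use the common Phred+33 encoding, and bases exactly at the minimum quality are counted as passing, which is an assumption about the boundary case.

```python
def passes_quality_filter(quality_string, min_qual=20, min_percent=90, offset=33):
    """Return True if enough bases of a read reach the minimum quality.

    Assumptions: Phred+33 encoded FASTQ quality string; a base counts as
    good when its score is at least min_qual.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return False
    good = sum(1 for q in scores if q >= min_qual)
    return 100 * good / len(scores) >= min_percent


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33
print(passes_quality_filter("IIIIIIIII#"))  # 9 of 10 bases are good
print(passes_quality_filter("IIIIIIII##"))  # 8 of 10 bases are good
```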
--barcode_mismatches
Number of mismatches allowed within the experimental barcode while still assigning a read to an experiment. Typically, experimental barcodes should be designed with a Hamming distance of at least 3 to each other in order to allow one mismatch.
Usage (default):
--barcode_mismatches 1
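Why a pairwise distance of 3 is needed for one mismatch can be seen in a small sketch: a barcode with a single error is at distance 1 from its true barcode and still at distance 2 or more from every other one, so the assignment stays unambiguous. The helper names and the two barcodes below are made up for illustration and do not reflect PARANOiD's internal demultiplexing.

```python
def hamming_distance(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_experiment(observed, barcodes, max_mismatches=1):
    """Return the experiment whose barcode matches within max_mismatches.

    barcodes maps experiment name -> experimental barcode (hypothetical data).
    Returns None if no barcode or more than one barcode matches.
    """
    hits = [
        name
        for name, barcode in barcodes.items()
        if hamming_distance(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


barcodes = {"exp_A": "GGATCC", "exp_B": "TTCGAA"}
print(assign_experiment("GGATCA", barcodes))  # one mismatch to exp_A
print(assign_experiment("GGCTAA", barcodes))  # too far from both barcodes
```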
--mapq
Minimum alignment quality (MAPQ score) an alignment needs in order to be retained. The meaning of the scores depends on the aligner chosen via --domain. All alignments with a MAPQ score below this value are removed after the alignment step. Please note that the tables below give only a short overview of the meaning of MAPQ scores, which can be more complex in detail. The MAPQ score is found in column 5 of alignment files (SAM/BAM/CRAM).
Usage (default):
--mapq 2
Score meanings for Bowtie2 (--domain pro)
In addition to the descriptions in the table, a higher MAPQ score means fewer allowed mismatches (depending on the base quality of the mismatched nucleotides).
MAPQ score | Description |
---|---|
0 | All mappable reads |
1 | Multimapped reads that have the same alignment quality at different positions |
2-39 | Multimapped reads that have one specific alignment with a better score than the other potential positions |
40 | Reads mappable to only one position |
42 | Reads mappable to only one position with an almost perfect alignment. Best MAPQ score in Bowtie2 alignments |
More information can be found here
Score meanings for STAR (--domain eu)
MAPQ score | Description |
---|---|
0 | Maps to 10 or more positions |
1 | Maps to 4-9 positions |
2 | Maps to 3 positions |
3 | Maps to 2 positions |
255 | Reads mappable to only one position. Best MAPQ score in STAR alignments. |
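As a rough picture of what the --mapq threshold does, the sketch below keeps only those records of a text SAM file whose MAPQ value in column 5 is at least the threshold. It is an illustration on plain SAM lines, not the filtering step PARANOiD runs; `filter_sam_by_mapq` and the example records are hypothetical.

```python
def filter_sam_by_mapq(lines, min_mapq=2):
    """Yield SAM header lines and alignments with MAPQ >= min_mapq.

    Works on text SAM lines only; BAM/CRAM would need a dedicated library.
    """
    for line in lines:
        if line.startswith("@"):  # header lines are passed through
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 holds the MAPQ score
            yield line


# Made-up records: one header line and two alignments
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
for kept in filter_sam_by_mapq(sam_lines, min_mapq=2):
    print(kept.split("\t")[0])
```

With the default of 2, the record with MAPQ 1 is dropped while the header and the record with MAPQ 42 are kept.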
--map_to_transcripts
Should be used when transcripts are given as reference instead of a reference genome. Returns the transcripts with most hits from each sample. More information can be found here
Default: false
Usage:
--map_to_transcripts
--number_top_transcripts
The number of transcripts with the most hits that are selected from each sample if the parameter --map_to_transcripts is used. As this number is selected from each sample, the total number of transcripts can exceed it.
Usage (default):
--number_top_transcripts 10
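The reason the total can exceed the chosen number becomes clear when the per-sample selections are combined. The following sketch uses made-up hit counts and a hypothetical `select_top_transcripts` helper; how PARANOiD counts hits and breaks ties is not covered here.

```python
from collections import Counter


def select_top_transcripts(hits_per_sample, number_top_transcripts=10):
    """Return the union of the top transcripts of every sample.

    hits_per_sample maps sample name -> {transcript name: hit count}.
    """
    selected = set()
    for hits in hits_per_sample.values():
        top = Counter(hits).most_common(number_top_transcripts)
        selected.update(name for name, _ in top)
    return selected


# Hypothetical counts for two samples
hits_per_sample = {
    "sample_1": {"t1": 50, "t2": 30, "t3": 5},
    "sample_2": {"t3": 40, "t4": 20, "t1": 1},
}
# Two transcripts per sample, but four distinct transcripts overall
print(sorted(select_top_transcripts(hits_per_sample, number_top_transcripts=2)))
```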
--omit_peak_calling
If specified, peak calling is not performed. Peak calling is performed by default.
Usage:
--omit_peak_calling
--peak_calling_for_high_coverage
Only has an effect if peak calling is performed.
Proteins covering the whole reference genome can cause problems for PureCLIP, causing it to throw an error.
In our experience, the parameters added by this argument can help PureCLIP perform its analysis.
Adds the following arguments to the PureCLIP execution: -mtc 5000 -mtc2 5000 -ld
Usage:
--peak_calling_for_high_coverage
--peak_calling_regions
Only has an effect if peak calling is performed. If specified, PureCLIP returns peak regions instead of single peaks.
Usage:
--peak_calling_regions
--peak_calling_regions_width
Only has an effect if --peak_calling_regions is specified. Changes the width of the peak regions returned by PureCLIP.
Usage (default):
--peak_calling_regions_width 8
--gene_id
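Example of an attribute column (9th column) of a GFF annotation entry, in which ID is one of the available tags:
ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true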
Usage (default):
--gene_id ID
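The attribute string shown above can be split into tag/value pairs as sketched below. The sketch assumes that --gene_id names the tag whose value is used as the gene identifier, which is an interpretation of the default value ID rather than something stated here. `parse_gff_attributes` is a hypothetical helper and handles only the GFF3 `key=value` form, not the GTF `key "value"` form.

```python
def parse_gff_attributes(attributes):
    """Split a GFF3 attribute column into a dict of tag -> value."""
    result = {}
    for entry in attributes.strip().split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key.strip()] = value.strip()
    return result


attributes = (
    "ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;"
    "gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true"
)
parsed = parse_gff_attributes(attributes)
gene_id_tag = "ID"  # assumed meaning of: --gene_id ID
print(parsed[gene_id_tag])
```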
--color_barplot
Color of the barplots returned by PARANOiD. Affects graphs generated by the peak height distribution, the RNA subtype analysis and the experimental barcode distribution. The color is stated as a hexadecimal color code. If unsure which code translates to which color, several websites can help to pick the correct one.
Usage (default):
--color_barplot #69b3a2
--rna_subtypes
Only has an effect if an annotation file is provided and the RNA subtype analysis is therefore performed. RNA subtypes/regions that shall be included in the RNA subtype analysis. RNA subtypes need to be separated by a comma (,) and must appear in the feature type column (3rd column) of the annotation file. If either requirement is not met, the analysis will either not be performed correctly or be aborted. If you are not sure which RNA subtypes are included in your annotation file, you can use the script featuretypes-from-gtfgff.awk. Additionally, users should take care not to choose subtypes/regions that are in a hierarchical relationship to each other, as they can cover the same regions and thus make affected peaks appear ambiguous. Information about the hierarchical structure of RNA subtypes/regions can be obtained here.
Usage (default):
--rna_subtypes 3_prime_UTR,transcript,5_prime_UTR
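To see which feature types an annotation file offers, the third column can be tallied. The sketch below is a Python illustration of that idea and is not the featuretypes-from-gtfgff.awk script mentioned above; `count_feature_types` and the example lines are hypothetical.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values of the feature type column (3rd column) of GFF/GTF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):  # skip blanks and comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Made-up annotation lines; for a real file, pass an open file handle instead
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "chr1\tsrc\ttranscript\t1\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsrc\t5_prime_UTR\t1\t100\t.\t+\t.\tParent=tx1",
    "chr1\tsrc\t3_prime_UTR\t800\t900\t.\t+\t.\tParent=tx1",
]
for feature_type, count in sorted(count_feature_types(example).items()):
    print(feature_type, count)
```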
--omit_peak_distance
Omits the peak distance analysis.
Usage:
--omit_peak_distance
--distance
Max distance used for the peak distance analysis.
Usage (default):
--distance 30
--percentile
Peak percentile for the peak distance analysis and the sequence extraction/motif analysis. Only peaks with a value above this threshold are considered, while all peaks below it are omitted as background noise. A percentile of 90 means that only the top 10% of peaks are used.
Usage (default):
--percentile 90
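A percentile cutoff of this kind can be sketched as follows. The nearest-rank percentile used here is an assumption made for the example, as are the helper names and peak values; PARANOiD may compute the threshold differently, for example with interpolation.

```python
import math


def percentile_threshold(values, percentile=90):
    """Nearest-rank percentile of a non-empty list of peak values (assumed method)."""
    if not values:
        raise ValueError("no peak values given")
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def filter_peaks(peaks, percentile=90):
    """Keep only (position, value) peaks whose value lies above the threshold."""
    threshold = percentile_threshold([value for _, value in peaks], percentile)
    return [(pos, value) for pos, value in peaks if value > threshold]


# Ten made-up peaks with values 1..10 at positions 100, 110, ...
peaks = [(100 + 10 * i, i + 1) for i in range(10)]
print(filter_peaks(peaks, percentile=90))  # only the strongest 10% remain
```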
--omit_sequence_extraction
Omits the motif detection.
Usage:
--omit_sequence_extraction
--seq_len
Only applies when motif detection is performed. Length in nucleotides that is extracted from the reference on each side of a peak. A value of 20 leads to sequences of 41 nucleotides being extracted (20 nt upstream; cross-link nt; 20 nt downstream).
Usage (default):
--seq_len 20
--omit_cl_nucleotide
Only applies when motif detection is performed. The nucleotide directly at the cross-linking position is substituted with an N when extracting sequences. This can improve the motif detection, since iCLIP tends to have a bias towards U when cross-linking, which can influence the motif search.
Usage:
--omit_cl_nucleotide
--omit_cl_width
Only applies when motif detection is performed and the cross-link nucleotide is omitted. Replaces nucleotides on both sides of the cross-linking position with an N to avoid potential uridine polymers, which can negatively influence the motif search. The number determines how many nucleotides on each side are replaced.
Usage (default):
--omit_cl_width 0
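How --seq_len, --omit_cl_nucleotide and --omit_cl_width interact can be sketched on a plain string. `extract_sequence` is a hypothetical helper; the 0-based position and the choice to skip windows that run past the reference ends are assumptions of this example, not documented PARANOiD behaviour.

```python
def extract_sequence(reference, cl_pos, seq_len=20,
                     omit_cl_nucleotide=False, omit_cl_width=0):
    """Extract seq_len nt on each side of a cross-link site (0-based cl_pos).

    Returns None if the window would exceed the reference (assumed handling).
    With omit_cl_nucleotide, the cross-link nt and omit_cl_width nt on each
    side of it are replaced by 'N'.
    """
    if omit_cl_width > seq_len:
        raise ValueError("omit_cl_width cannot exceed seq_len")
    start = cl_pos - seq_len
    end = cl_pos + seq_len + 1
    if start < 0 or end > len(reference):
        return None
    window = list(reference[start:end])
    if omit_cl_nucleotide:
        # the cross-link nucleotide sits at index seq_len within the window
        for i in range(seq_len - omit_cl_width, seq_len + omit_cl_width + 1):
            window[i] = "N"
    return "".join(window)


reference = "ACGT" * 5  # made-up 20 nt reference
print(extract_sequence(reference, 10, seq_len=3))
print(extract_sequence(reference, 10, seq_len=3, omit_cl_nucleotide=True))
print(extract_sequence(reference, 10, seq_len=3, omit_cl_nucleotide=True,
                       omit_cl_width=1))
```

Each extracted sequence has a length of 2 × seq_len + 1, which gives the 41 nucleotides for the default of 20.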
--remove_overlaps
Only applies when motif detection is performed. Removes cross-link sites with lower peak values if their extracted sequence would overlap with the sequence from another cross-link site. This can be done to avoid doubled sequences during motif detection.
Usage:
--remove_overlaps
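One way to picture the overlap removal is a greedy pass from the strongest to the weakest peak. This is a sketch under stated assumptions: all peaks lie on the same reference sequence and strand, windows span seq_len nucleotides to each side, and `remove_overlapping_peaks` is a hypothetical helper rather than PARANOiD's implementation.

```python
def remove_overlapping_peaks(peaks, seq_len=20):
    """Drop lower-valued peaks whose extraction window overlaps a kept peak.

    peaks is a list of (position, value) on one reference sequence.
    Two windows [pos - seq_len, pos + seq_len] overlap when the positions
    are at most 2 * seq_len apart.
    """
    kept = []
    for pos, value in sorted(peaks, key=lambda p: p[1], reverse=True):
        if all(abs(pos - kept_pos) > 2 * seq_len for kept_pos, _ in kept):
            kept.append((pos, value))
    return sorted(kept)


# Made-up peaks: the first two lie within 2 * seq_len of each other
peaks = [(100, 5.0), (110, 9.0), (200, 3.0)]
print(remove_overlapping_peaks(peaks, seq_len=20))
```

In this example the peak at position 100 is dropped because the stronger peak at position 110 already covers its window.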
--max_motif_num
Only applies when motif detection is performed. Maximum number of motifs reported by STREME.
Usage (default):
--max_motif_num 50
--min_motif_width
Only applies when motif detection is performed. Minimum length of motifs reported by STREME. Cannot be lower than 3.
Usage (default):
--min_motif_width 8
--max_motif_width
Only applies when motif detection is performed. Maximum length of motifs reported by STREME. Cannot be higher than 30.
Usage (default):
--max_motif_width 15