PARANOiD Inputs
Detailed description of all input files
Reads
FASTQ file containing all reads. Each read is represented by 4 lines:
1. Sequence identifier and optional description. Starts with a @
2. Actual nucleotide sequence of the read
3. Delimiter line. Starts with a +
4. Quality values of the nucleotide sequence (line 2). Must contain the same number of symbols as line 2
Example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
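The four-line layout described above can be checked with a few lines of Python. This is a minimal sketch and not part of PARANOiD: the function names `check_fastq_record` and `read_fastq` are hypothetical, and the code assumes an uncompressed FASTQ file in which every record occupies exactly 4 lines.

```python
def check_fastq_record(lines):
    """Validate one FASTQ record given as a list of 4 strings (no newlines).

    Returns (header text without '@', sequence, quality) or raises ValueError.
    """
    if len(lines) != 4:
        raise ValueError("a FASTQ record needs exactly 4 lines")
    header, sequence, delimiter, quality = lines
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not delimiter.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("lines 2 and 4 must have the same length")
    return header[1:], sequence, quality


def read_fastq(handle):
    """Yield validated records from an open text file, 4 lines at a time."""
    record = []
    for line in handle:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield check_fastq_record(record)
            record = []
    if record:
        raise ValueError("file ends with an incomplete record")
```

Running `read_fastq` over a file before starting the pipeline would surface truncated records or mismatched quality strings early.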
Barcodes
TSV file containing experiment names and the corresponding barcode sequences. Reads from the input FASTQ file are split according to the detected barcode sequence and assigned to the appropriate experiment. This results in one FASTQ file per experiment. The suffix _rep_<number> has to be added to the experiment names, replacing <number> with the replicate number.
The file consists of two columns separated by a tab:
1. experiment name
2. barcode sequence present in reads
Experiment names may contain only letters {a-zA-Z}, numbers {1-9} and underscores _. Any whitespace (e.g. space, tab) results in errors and thus terminates the pipeline execution. The length of the barcode sequence depends on the protocol used and can be adapted via --barcode_pattern.
Example:
knockdown_N_rep_1 TGATAG
knockdown_N_rep_2 AGTGGA
knockdown_N_rep_3 GCTCGA
mock_N_rep_1 TAAGTA
mock_N_rep_2 GCAGTC
mock_N_rep_3 CCTAGG
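The naming rules for the barcode file can be sketched as a small validator. This is an illustration only, not PARANOiD's own check: `parse_barcode_line` is a hypothetical name, the name pattern mirrors the character set exactly as listed above, and restricting barcodes to the alphabet ACGTN is an assumption.

```python
import re

# Character set exactly as listed in the text: letters, digits 1-9 and
# underscores, followed by the _rep_<number> suffix.
NAME_PATTERN = re.compile(r"[A-Za-z1-9_]+_rep_[1-9]+")
# Assumption: barcodes are written with the nucleotide alphabet ACGTN.
BARCODE_PATTERN = re.compile(r"[ACGTN]+")


def parse_barcode_line(line):
    """Split one line of the barcode TSV into (experiment, replicate, barcode)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError("expected exactly 2 tab-separated columns")
    name, barcode = fields
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid experiment name: {name!r}")
    if not BARCODE_PATTERN.fullmatch(barcode):
        raise ValueError(f"invalid barcode sequence: {barcode!r}")
    # Separate the experiment name from its replicate number.
    experiment, replicate = name.rsplit("_rep_", 1)
    return experiment, int(replicate), barcode
```

Names containing a space, names without the replicate suffix, and lines that use a space instead of a tab as separator are all rejected with a `ValueError`.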
Reference
FASTA file containing nucleotide data of interest. It is used to align reads and thus find the location of cross-link sites. The file may contain genomic or transcriptomic sequences of an organism or completely artificial sequences.
Each sequence consists of at least 2 lines:
1. Header
2–n. Nucleotide sequence
The header starts with a > and is followed by a description of the sequence.
The sequence consists of nucleotides {ACGTN} and can span an arbitrary number of lines.
Example:
>NW_024429180.1 Mesocricetus auratus isolate SY011 unplaced genomic scaffold
AACTCTGTTGtaaaaaggctttcccacattcattcCATTCATAAGGTTTCTGTACATTATGGATTCTTTCATGCCTTTTA
AGATGATTATGATATACATAGACTTTAACACCTCAAGAATAttcaggtttctctccagtatgacaATTTGGTCTAATTAT
AAAGAAGAATCAGATATTAAGGTTTTATCACTGTTTACACTCATGCTGTTCCCCTTCATTAAGGTTGGTTTGGATCTTTG
AATATACCTGGGTTCCTATAGTCTCCACCATCACATCTTTATGGAGATTCTTCTGGGAGGGATCCAGCAAATCCCACTCT
...
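The header-plus-sequence structure of a FASTA file can be read with a short generator. The following is a minimal sketch under stated assumptions, not the parser PARANOiD uses: `read_fasta` and `is_nucleotide` are hypothetical names, and lowercase letters are accepted because the example above contains lowercase stretches.

```python
def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def is_nucleotide(sequence):
    """Check that a sequence uses only ACGTN (case-insensitive, an assumption)."""
    return set(sequence.upper()) <= set("ACGTN")
```

Joining the chunks per header yields one string per sequence regardless of how many lines the sequence spans in the file.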
Annotation
GFF or GTF file. Contains annotation information belonging to the reference used in the input. Describes features and their positions. PARANOiD does not rely on the annotation for its analysis; however, it is highly recommended to provide it when working with splicing-capable organisms (--domain eu), as annotation files typically contain information about intron-exon structures, which significantly improves the mapping capability.
Furthermore, providing an annotation file enables the RNA subtype analysis.
Consists of several header lines followed by feature lines.
Header lines start with a # and contain general information about the annotation.
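Feature lines consist of 9 tab-separated columns: sequence name, source, feature type, start, end, score, strand (+ for forward; - for reverse), phase and attributes. Missing values are represented by a dot (.).
Example: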
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build BCM_Maur_2.0
#!genome-build-accession NCBI_Assembly:GCF_017639785.1
#!annotation-source NCBI Mesocricetus auratus Annotation Release 103
##sequence-region NW_024429180.1 1 52462669
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10036
NW_024429180.1	RefSeq	region	1	52462669	.	+	.	ID=NW_024429180.1:1..52462669;Dbxref=taxon:10036;Name=Unknown;chromosome=Unknown;dev-stage=adult;gbkey=Src;genome=genomic;isolate=SY011;mol_type=genomic DNA;sex=female;tissue-type=liver
NW_024429180.1	Gnomon	pseudogene	37366	38359	.	+	.	ID=gene-LOC101842720;Dbxref=GeneID:101842720;Name=LOC101842720;gbkey=Gene;gene=LOC101842720;gene_biotype=pseudogene;pseudo=true
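To illustrate how header and feature lines differ, the sketch below splits a GFF3 feature line into its nine tab-separated columns as defined by the GFF3 specification. It is an assumption-laden example rather than PARANOiD code: `Feature` and `parse_gff_line` are hypothetical names, and the attribute handling covers only the GFF3 `key=value` style, not the GTF style.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff_line(line):
    """Return a Feature for a GFF3 feature line, or None for header/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # header lines start with '#'
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    # GFF3 attributes are ';'-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
    )
    return Feature(fields[0], fields[1], fields[2], int(fields[3]),
                   int(fields[4]), fields[5], fields[6], fields[7], attributes)
```

Splitting on tabs only matters here because attribute values such as `mol_type=genomic DNA` may contain spaces.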