Nesting config files
Introduction
Sometimes you may want to perform further operations on sequences that splitcode has already extracted, modified, or corrected (i.e. run them through another round of splitcode). For example, you might want to extract sequences and then error-correct them.
This is possible with a nested config file: place @nest at the bottom of your config file and add the config options for the next round after it.
A simple example
Say we have the following FASTQ file (input.fastq):
@read
AAATTTTGGGGG
+
KKKKKKKKKKKK
Let’s say we want to replace the AAA at the beginning with GGGGG and then, in a second step, replace every GGGGG with CCC. In other words, the sequence goes from AAATTTTGGGGG to GGGGGTTTTGGGGG to CCCTTTTCCC. We would specify the following config file (config.txt):
ids tags subs
X AAA GGGGG
@nest
ids tags subs
Y GGGGG CCC
When we run the following command:
splitcode -c config.txt --pipe input.fastq
The following will be printed out:
@read
CCCTTTTCCC
+
KKKKKKKKKK
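To make the two-level flow concrete, here is a minimal Python sketch of the string edits described above. It is only a model of the sequence changes, not how splitcode is implemented: the function name run_levels is hypothetical, plain str.replace stands in for splitcode's tag matching, and quality strings are ignored.

```python
def run_levels(seq, levels):
    """Apply each level's tag -> substitution pairs in order.

    Each later level operates on the output of the previous one,
    which is the behaviour @nest is described as providing.
    """
    for subs in levels:
        for tag, replacement in subs.items():
            seq = seq.replace(tag, replacement)
    return seq


levels = [
    {"AAA": "GGGGG"},  # first level (tag X)
    {"GGGGG": "CCC"},  # second level, after @nest (tag Y)
]

intermediate = run_levels("AAATTTTGGGGG", levels[:1])
result = run_levels("AAATTTTGGGGG", levels)
print(intermediate)  # GGGGGTTTTGGGGG
print(result)        # CCCTTTTCCC
```

Note that the second level replaces both GGGGG stretches: the one created by the first level and the one already present at the end of the read.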
An extraction example
OK, we’ll use the same input.fastq as the previous example, where the sample read sequence is AAATTTTGGGGG.
@read
AAATTTTGGGGG
+
KKKKKKKKKKKK
Let’s say we want to do the following operations: 1) Extract the 4 bp that follow AAA, 2) Error-correct the extracted sequence according to the following scheme (AAAA becomes TTT; TTTT becomes GGG; CCCC becomes AAA; GGGG becomes CCC). We set up the following config.txt file:
@extract {X}<extracted_seq[4]>
ids tags
X AAA
@nest
ids tags subs locations
Y1 AAAA TTT 0
Y2 TTTT GGG 0
Y3 CCCC AAA 0
Y4 GGGG CCC 0
OK, we set locations to be 0. Why? Because of @nest, at the next level, the extracted sequence will become file #0 and the input read will become file #1.
So, when we run:
splitcode -c config.txt -o out_R1.fq,out_R2.fq input.fastq
We’ll get the following outputs (out_R1.fq first, then out_R2.fq):
@read
GGG
+
KKK
@read
AAATTTTGGGGG
+
KKKKKKKKKKKK
Note that we specified two output files because, again, due to @nest, at the next level, the extracted sequence (from the first level) became file #0 and the input read became file #1.
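The file ordering at the nested level can be sketched in Python as follows. This is an illustrative model under stated assumptions, not splitcode's implementation: extract_after and SCHEME are hypothetical names, and a plain list stands in for the numbered files.

```python
def extract_after(read, anchor, length):
    """Return the `length` bases right after the first `anchor`, or None."""
    start = read.find(anchor)
    if start == -1:
        return None
    begin = start + len(anchor)
    piece = read[begin:begin + length]
    return piece if len(piece) == length else None


# Substitution scheme from the nested part of the config (tags Y1..Y4)
SCHEME = {"AAAA": "TTT", "TTTT": "GGG", "CCCC": "AAA", "GGGG": "CCC"}

read = "AAATTTTGGGGG"
extracted = extract_after(read, "AAA", 4)  # "TTTT"

# At the nested level the extracted sequence is file #0 and the read is file #1.
files = [extracted, read]

# The tags are restricted to file #0, so only the extracted sequence is edited.
files[0] = SCHEME.get(files[0], files[0])
print(files)  # ['GGG', 'AAATTTTGGGGG']
```

Because the substitutions only touch index 0, the original read at index 1 is left unchanged, matching the second output record.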
Error-correcting extracted sequences to a list of barcodes
OK, building off the above, let’s reuse the input.fastq read sequence file:
@read
AAATTTTGGGGG
+
KKKKKKKKKKKK
Let’s say we have the following barcodes list (b.txt):
ATAT
TCGA
GAGG
TATT
Let’s set up the following config file, which allows one mismatch via the distances column and corrects a matched sequence back to the original barcode via the . value in subs. We provide b.txt$ (the $ after the file name specifies that each sequence in that file should be its own unique tag).
@extract {X}<extracted_seq[4]>
ids tags subs
X AAA GGGGG
@nest
ids tags subs locations distances
Y b.txt$ . 0 1
So, when we run:
splitcode -c config.txt -o out_R1.fq,out_R2.fq input.fastq
We’ll get the following outputs (out_R1.fq first, then out_R2.fq):
@read
TATT
+
KKKK
@read
AAATTTTGGGGG
+
KKKKKKKKKKKK
As you can see, the TTTT that was extracted from input.fastq was corrected to TATT, the only barcode in b.txt within a Hamming distance of one.
You can do other things too, e.g. if you set minFinds to 1, extracted sequences that do not match anything in b.txt will not be output.
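For readers who want to check the barcode correction by hand, here is a minimal Python sketch of one-mismatch matching against the barcodes in b.txt. The function names hamming and correct are hypothetical, and returning None when zero or several barcodes match is an assumption made for this sketch; it is not a statement about how splitcode resolves such cases.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct(seq, barcodes, max_dist=1):
    """Return the single barcode within max_dist of seq, else None.

    Assumption for this sketch: ambiguous or missing matches yield None.
    """
    hits = [b for b in barcodes
            if len(b) == len(seq) and hamming(seq, b) <= max_dist]
    return hits[0] if len(hits) == 1 else None


barcodes = ["ATAT", "TCGA", "GAGG", "TATT"]  # contents of b.txt

corrected = correct("TTTT", barcodes)
print(corrected)  # TATT
```

Here TTTT differs from ATAT, TCGA, and GAGG at two or more positions, so TATT is the only candidate within one mismatch.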