Nesting config files

Introduction

Sometimes, you might want to perform operations on sequences that have been extracted, modified, or corrected with splitcode (i.e. go through another round of splitcoding). For example, one might want to extract sequences then correct them.

This is possible by using a nested config file. Essentially, simply place @nest at the bottom of your config file and add config options after it.

A simple example

Say we have the following FASTQ file (input.fastq):

@read
AAATTTTGGGGG
+
KKKKKKKKKKKK

Let’s say we want to replace the AAA at the beginning with GGGGG and then, in a second step, replace GGGGG with CCC. In other words the sequence goes from AAATTTTGGGGG to GGGGGTTTTGGGGG to CCCTTTTCCC. We would specify the following config file (config.txt):

ids  tags    subs
X    AAA     GGGGG

@nest

ids  tags    subs
Y    GGGGG   CCC

When we run the following command:

splitcode -c config.txt --pipe input.fastq

THe following will be printed out:

@read
CCCTTTTCCC
+
KKKKKKKKKK

An extraction example

OK, we’ll use the same input.fastq as the previous example, where the sample read sequence is AAATTTTGGGGG.

input.fastq

 @read
 AAATTTTGGGGG
 +
 KKKKKKKKKKKK

Let’s say we want to do the following operations: 1) Extract 4 bp’s after encountering AAA, 2) Error-correct the extracted sequence to the following scheme (AAAA becomes TTT; TTTT becomes GGG; CCCC becomes AAA; GGGG becomes CCC). We set up the following config.txt file:

@extract {X}<extracted_seq[4]>

ids  tags
X    AAA

@nest

ids  tags    subs    locations
Y1   AAAA    TTT     0
Y2   TTTT    GGG     0
Y3   CCCC    AAA     0
Y4   GGGG    CCC     0

OK, we set locations to be 0. Why? Because of @nest, at the next level, the extracted sequence will become file #0 and the input read will become file #1.

So, when we run:

splitcode -c config.txt -o out_R1.fq,out_R2.fq input.fastq

We’ll get the following outputs:

out_R1.fq

 @read
 GGG
 +
 KKK

out_R2.fq

 @read
 AAATTTTGGGGG
 +
 KKKKKKKKKKKK

Note that we specified two output files because, again, due to @nest, at the next level, the extracted sequence (from the first level) became file #0 and the input read became file #1.

Error-correcting extracted sequences to a list of barcodes

OK, building off the above, let’s reuse the input.fastq read sequence file:

input.fastq

 @read
 AAATTTTGGGGG
 +
 KKKKKKKKKKKK

Let’s say we have the following barcodes list (b.txt):

b.txt

 ATAT
 TCGA
 GAGG
 TATT

And let’s set the following config file, allowing for one mismatch via the distances column, and correcting it to its original sequence via the . value in subs. We provide b.txt$ (with the $ after the file name to specify that each sequence in that file should be its own unique tag).

@extract {X}<extracted_seq[4]>

ids  tags    subs
X    AAA     GGGGG

@nest

ids  tags    subs    locations       distances
Y    b.txt$  .       0       1

So, when we run:

splitcode -c config.txt -o out_R1.fq,out_R2.fq input.fastq

We’ll get the following outputs:

out_R1.fq

 @read
 TATT
 +
 KKKK

out_R2.fq

 @read
 AAATTTTGGGGG
 +
 KKKKKKKKKKKK

As you can see the TTTT that was extracted from input.fastq was corrected (via one hamming distance) to TATT.

You can do other stuff too, e.g. if you set minFinds to 1, the extracted sequences that did not match anything in b.txt will not be outputted.