Put technical sequences into separate files

Separating based on location

Let’s say we have barcodes, UMIs, and the biological sequence at specified locations. For example, 10X (version 3) chemistry has the following format:

  • Barcode (16 bp): File 0 (R1.fastq.gz), position 0-16

  • UMI (12 bp): File 0 (R1.fastq.gz), position 16-28

  • cDNA: File 1 (R2.fastq.gz), entire read

We can extract into separate files as follows:

splitcode -x "0:0<barcode>0:16,0:16<umi>0:28,1:0<cdna>1:-1" --x-only --nFastqs=2 --gzip R1.fastq.gz R2.fastq.gz

Three files will be generated: barcode.fastq.gz, umi.fastq.gz, and cdna.fastq.gz.
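To make the extraction pattern easier to read, here is a minimal Python sketch of what the location-based elements describe. It is an illustration only, not splitcode's parser: the helper names parse_pattern and extract are hypothetical, the example reads are made up, and it assumes that each element has the form file:start<name>file:end, that the end position is exclusive, and that an end of -1 means "to the end of the read".

```python
import re

# One element looks like "0:0<barcode>0:16":
# file:start, then <name>, then file:end (-1 is read here as "end of read").
ELEMENT = re.compile(r"(\d+):(\d+)<(\w+)>(\d+):(-?\d+)")


def parse_pattern(pattern):
    """Turn the extraction string into (name, file_index, start, end) tuples."""
    specs = []
    for element in pattern.split(","):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        file_start, start, name, file_end, end = m.groups()
        if file_start != file_end:
            raise ValueError(f"element spans two files: {element!r}")
        specs.append((name, int(file_start), int(start), int(end)))
    return specs


def extract(reads, specs):
    """Slice each named segment out of the read from the matching file."""
    out = {}
    for name, file_index, start, end in specs:
        seq = reads[file_index]
        stop = len(seq) if end == -1 else end
        out[name] = seq[start:stop]
    return out


pattern = "0:0<barcode>0:16,0:16<umi>0:28,1:0<cdna>1:-1"
r1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"  # made-up 16-bp barcode + 12-bp UMI
r2 = "GATTACAGATTACAGATTACA"               # made-up cDNA read
parts = extract([r1, r2], parse_pattern(pattern))
for name, seq in parts.items():
    print(name, seq)
```

Each key of parts corresponds to one of the output files named above (barcode, umi, cdna).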

Note

Note that quality scores will NOT be preserved (all quality scores will be replaced with K).
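As a rough illustration of what a constant quality string looks like, the Python sketch below formats a single FASTQ record whose quality line is K repeated once per base. The function name fastq_record and the read name read1 are hypothetical, and the exact header line that splitcode writes is not shown in this section.

```python
def fastq_record(name, seq, qual_char="K"):
    """Format one four-line FASTQ record with a constant quality character."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


record = fastq_record("read1", "ACGTACGTACGTACGT")
print(record, end="")

# In the common Phred+33 encoding, the character K corresponds to quality 42.
print(ord("K") - 33)
```

If the original quality values matter downstream, keep the input FASTQ files alongside the extracted ones.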

Also note that, in lieu of a config.txt file, we supplied the extraction pattern on the command line using -x. Alternatively, we could use a config file containing a single line: the extraction string preceded by @extract.

Separating based on tag identification

Let’s revisit the example where, for R1.fastq, we have a 5-bp barcode A, followed by a variable-length region (region 1), followed by a 5-bp or 6-bp barcode B, followed 3 bp later by an 8-bp UMI, followed by a variable-length region (region 2) that proceeds until the end of the read. For R2.fastq, the entire sequence is the cDNA sequence (except that the last 4 bp are trimmed).

We want to extract the barcodes, the UMI, regions 1 and 2, and the cDNA into separate files.

Here, we’ll create the following config.txt file:

@extract <barcode_A{{@grp_A}}>,<barcode_B{{@grp_B}}>,{{grp_B}}3<umi[8]>,{{grp_A}}<region_1>{{grp_B}},{{grp_B}}3<region_2>0:-1,1:0<cdna>1:-1
@trim-3 0,4
groups ids             tags      distances   next      maxFindsG locations
grp_A  Barcode_A1      AAGGA     1           {{grp_B}} 1         0:0:5
grp_A  Barcode_A2      GTGTG     1           {{grp_B}} 1         0:0:5
grp_A  Barcode_A3      CGTAT     1           {{grp_B}} 1         0:0:5
grp_B  Barcode_B1      GCGCAA    0           -         1         0:5:100
grp_B  Barcode_B2      CCCGT     0           -         1         0:5:100
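The tags, distances, and locations columns can be pictured with the following Python sketch. It is not splitcode's matching algorithm: the function find_tag is hypothetical, the read is made up, and the sketch assumes that the distance is a Hamming distance (mismatches only), that a location file:start:end means the tag must lie entirely within [start, end) of that file's read, and that the first hit wins. The next and maxFindsG columns are not modeled.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_tag(read, tags, start, end):
    """Return (tag_id, position, length) of the first tag found, else None.

    tags is a list of (tag_id, sequence, max_distance) tuples.
    """
    end = min(end, len(read))
    for pos in range(start, end):
        for tag_id, seq, max_dist in tags:
            stop = pos + len(seq)
            if stop > end:
                continue  # tag would run past the allowed window
            if hamming(read[pos:stop], seq) <= max_dist:
                return tag_id, pos, len(seq)
    return None


# Tags copied from the config above: (id, sequence, allowed distance).
GRP_A = [("Barcode_A1", "AAGGA", 1), ("Barcode_A2", "GTGTG", 1), ("Barcode_A3", "CGTAT", 1)]
GRP_B = [("Barcode_B1", "GCGCAA", 0), ("Barcode_B2", "CCCGT", 0)]

# Made-up R1 read: barcode A with one mismatch, region 1, barcode B,
# a 3-bp spacer, an 8-bp UMI, then region 2.
read = "AAGCA" + "TTTTTTT" + "CCCGT" + "ACG" + "GGGGAAAA" + "TCTCTCTC"

hit_a = find_tag(read, GRP_A, 0, 5)    # location 0:0:5
hit_b = find_tag(read, GRP_B, 5, 100)  # location 0:5:100
print(hit_a)
print(hit_b)
```

Here the first five bases differ from AAGGA at one position, so they still count as Barcode_A1 under a distance of 1, whereas the group B tags must match exactly.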

We can extract into separate files as follows:

splitcode -c config.txt --x-only --nFastqs=2 --gzip R1.fastq R2.fastq

Six files will be generated:

  • barcode_A.fastq.gz

  • barcode_B.fastq.gz

  • umi.fastq.gz

  • region_1.fastq.gz

  • region_2.fastq.gz

  • cdna.fastq.gz
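To show how the six outputs relate to one read pair, here is a hedged Python sketch that cuts the segments once the positions of barcode A and barcode B are known. The function split_reads and the example reads are hypothetical. The sketch follows the prose description: the UMI starts 3 bp after barcode B and is 8 bp long, region 2 is assumed to begin right after the UMI, and the last 4 bp of R2 are trimmed. How splitcode itself resolves each boundary from the @extract line is not modeled.

```python
def split_reads(r1, r2, a_pos, a_len, b_pos, b_len,
                umi_offset=3, umi_len=8, trim3_r2=4):
    """Cut one read pair into the six named segments."""
    a_end = a_pos + a_len
    b_end = b_pos + b_len
    umi_start = b_end + umi_offset
    umi_end = umi_start + umi_len
    return {
        "barcode_A": r1[a_pos:a_end],
        "barcode_B": r1[b_pos:b_end],
        "umi": r1[umi_start:umi_end],
        "region_1": r1[a_end:b_pos],   # between barcode A and barcode B
        "region_2": r1[umi_end:],      # assumed: from the end of the UMI onward
        "cdna": r2[:max(0, len(r2) - trim3_r2)],  # R2 minus its last 4 bp
    }


# Made-up reads; barcode A at position 0 (5 bp), barcode B at position 12 (5 bp).
r1 = "AAGCA" + "TTTTTTT" + "CCCGT" + "ACG" + "GGGGAAAA" + "TCTCTCTC"
r2 = "GATTACAGATTACA"
segments = split_reads(r1, r2, a_pos=0, a_len=5, b_pos=12, b_len=5)
for name, seq in segments.items():
    print(name, seq)
```

Each key of segments corresponds to one of the six files listed above.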