SPLiT-seq processing

Introduction

SPLiT-seq (Parse Biosciences) utilizes three rounds of combinatorial barcoding. A protocol might look as follows (and is what will be used in this tutorial):

Round 1 barcode (8 bp): Position 78-86 in R2.fastq.gz
Round 2 barcode (8 bp): Position 48-56 in R2.fastq.gz
Round 3 barcode (8 bp): Position 10-18 in R2.fastq.gz
Biological read: The R1.fastq.gz file.

The round 1 and round 2 barcodes are separated by an ATCCACGTGCTTGAGACTGTGG linker, and the round 2 and round 3 barcodes are separated by a GTGGCCGATGTTTCGCATCGGCGTACGACT linker. The first 10 bp of R2.fastq.gz are the UMI.
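To make the layout concrete, here is a minimal Python sketch that slices a single R2 sequence into its UMI and three barcodes. It assumes the positions listed above are 0-based, end-exclusive offsets (which is consistent with the 10-bp UMI and the 30-bp and 22-bp linker lengths). The helper name split_r2 and the example sequences are made up for illustration and are not part of splitcode.

```python
# Linker sequences as given in the text.
LINKER_R3_R2 = "GTGGCCGATGTTTCGCATCGGCGTACGACT"  # between round 3 and round 2 barcodes
LINKER_R2_R1 = "ATCCACGTGCTTGAGACTGTGG"          # between round 2 and round 1 barcodes

# Assumed 0-based, end-exclusive coordinates within R2.
LAYOUT = {
    "umi": (0, 10),
    "round3": (10, 18),
    "round2": (48, 56),
    "round1": (78, 86),
}


def split_r2(seq):
    """Return the UMI and the three barcodes found at fixed positions in an R2 sequence."""
    if len(seq) < 86:
        raise ValueError("R2 read is shorter than 86 bp")
    return {name: seq[start:end] for name, (start, end) in LAYOUT.items()}


# Synthetic read assembled from made-up UMI and barcode sequences.
example_r2 = (
    "ACGTACGTAC" + "AAAAAAAA" + LINKER_R3_R2 + "CCCCCCCC" + LINKER_R2_R1 + "GGGGGGGG"
)
parts = split_r2(example_r2)
print(parts)
```

The two linkers fill exactly the gaps between the barcode positions (18-48 and 56-78), which is a quick sanity check on the coordinates.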

Note: The round 1 barcode (position 78-86 in R2.fastq.gz) marks two types of reads: 1) random oligo primed reads (R), and 2) polyT primed reads (T). These are distinguished by the round 1 barcodes (there are 96 R barcodes and 96 T barcodes). Therefore, two possible barcodes can belong to a single cell. It can be desirable to analyze the two read types separately (because of technical biases). Alternatively, one may want to convert R barcodes to their corresponding T barcodes (as is done in many pipelines) so that each cell gets a single barcode; this is what we’ll do in the following section.

Example: Convert R to T

We create a config file, config_RT.txt, where each R barcode is specified to be replaced with its corresponding T barcode.

We then run splitcode on the R2.fastq.gz file as follows:

splitcode -c config_RT.txt -o modified_R2.fq.gz R2.fastq.gz

Tip

Instead of generating an output file via -o, one can use -p to write the output to standard output (and then pipe it directly into a downstream read processing/alignment program).
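Conceptually, the R-to-T conversion is a lookup and replace at the round 1 barcode position. The Python sketch below illustrates the idea only. The barcode pairs in R_TO_T are invented placeholders (real pairs come from the kit's barcode list), the function name convert_r_to_t is hypothetical, and the sketch does exact matching with 0-based, end-exclusive positions, which may differ from what a given config_RT.txt specifies.

```python
# Made-up R -> T barcode pairs, for illustration only.
R_TO_T = {
    "AAACCCGG": "TTTGGGCC",
    "ACGTACGT": "TGCATGCA",
}


def convert_r_to_t(r2_seq, r_to_t=R_TO_T, start=78, end=86):
    """Swap an R round 1 barcode for its paired T barcode; return other reads unchanged."""
    barcode = r2_seq[start:end]
    replacement = r_to_t.get(barcode)
    if replacement is None:
        # Already a T barcode, or not a recognized barcode: leave the read as is.
        return r2_seq
    return r2_seq[:start] + replacement + r2_seq[end:]


upstream = "N" * 78  # stand-in for the UMI, barcodes and linkers before position 78
converted = convert_r_to_t(upstream + "AAACCCGG")
untouched = convert_r_to_t(upstream + "TTTGGGCC")
print(converted[78:86], untouched[78:86])
```

After conversion, both read types from the same cell carry the same round 1 barcode, so downstream tools see one barcode per cell.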

Example: Barcode reformatting

Here, we’ll produce a final “corrected” barcode with all three barcoding rounds stitched together and error-corrected (reads that don’t have all three rounds matching the barcode onlist are discarded).

With r1_R.txt and r1_T.txt containing the round 1 barcodes (for R and T, respectively), and r2_r3.txt containing the round 2 and 3 barcodes (note that round 2 and round 3 use the same set of 96 barcodes), we specify the following config.txt file:

config.txt

 @extract <output_R2{*}>,1:0<umi[10]>
 tags         distances  ids     groups    minFindsG  locations
 r1_R.txt$    1          r1_R    round1    1          1,78,86
 r1_T.txt$    1          r1_T    round1    1          1,78,86
 r2_r3.txt$   1          r2      round2_3  2          1,48,56
 r2_r3.txt$   1          r3      round2_3  2          1,10,18

Note that the $ means to treat each barcode sequence as its own individual tag and we use minFindsG to specify that the round 1 barcodes must be found once and the round 2+3 barcodes must be found twice. We then run the following:

splitcode -c config.txt --nFastqs=2 --select=0 \
--gzip -o output_R1.fastq.gz R1.fastq.gz R2.fastq.gz

The --select=0 option means that we’re only outputting the zeroth (i.e. R1) file, which is named output_R1.fastq.gz. We don’t want to output the R2 file because we’re already outputting the corrected, stitched-together barcodes (24 bp in length), which will be stored in output_R2.fastq.gz. The UMI will be stored in umi.fastq.gz.
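The logic expressed by this config (match each round against its onlist with at most one mismatch, require all three rounds, then stitch) can be sketched in Python as follows. This is an illustration under stated assumptions, not splitcode's implementation: the onlists are tiny made-up sets, positions are treated as 0-based and end-exclusive, the stitched barcode is written in read order (round 3, round 2, round 1), and a barcode that is within one mismatch of more than one onlist entry is rejected.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct(barcode, onlist, max_dist=1):
    """Return the unique onlist entry within max_dist mismatches, else None."""
    if barcode in onlist:
        return barcode
    hits = [c for c in onlist if len(c) == len(barcode) and hamming(barcode, c) <= max_dist]
    return hits[0] if len(hits) == 1 else None


def stitch(r2_seq, round1_onlist, round23_onlist):
    """Return (umi, 24-bp corrected barcode), or None if any round fails to match."""
    umi = r2_seq[0:10]
    corrected = [
        correct(r2_seq[10:18], round23_onlist),  # round 3
        correct(r2_seq[48:56], round23_onlist),  # round 2
        correct(r2_seq[78:86], round1_onlist),   # round 1
    ]
    if any(bc is None for bc in corrected):
        return None  # read is discarded
    return umi, "".join(corrected)


# Made-up onlists and reads for illustration.
ROUND1 = {"GGGGGGGG", "TTTTAAAA"}
ROUND23 = {"AAAAAAAA", "CCCCCCCC"}
L1 = "GTGGCCGATGTTTCGCATCGGCGTACGACT"
L2 = "ATCCACGTGCTTGAGACTGTGG"

good = "ACGTACGTAC" + "AAAAAAAA" + L1 + "CCCCCCCA" + L2 + "GGGGGGGG"  # 1 mismatch in round 2
bad = "ACGTACGTAC" + "AAAAAAAA" + L1 + "CCCCCCCC" + L2 + "GGGGGGAA"   # 2 mismatches in round 1
print(stitch(good, ROUND1, ROUND23))
print(stitch(bad, ROUND1, ROUND23))
```

In this sketch the first read is rescued by correcting its round 2 barcode, while the second is dropped because its round 1 barcode is two mismatches away from the nearest onlist entry.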

Example: Demultiplexing wells

The 96-well plate contains 8 rows (A-H) and 12 columns (1-12). Each well can be identified by the first round of split-pool barcoding. For example purposes, let’s say wells A1-A8 were used for one experiment, wells B1-B8 were used for a second experiment, wells C1-C8 were used for a third experiment, and wells D1-D8 were used for a fourth experiment. We want to separate those 4 experiments into their own FASTQ files.

We can do so by creating a config file (config_separate.txt), four files containing the barcodes for each experiment (A1_A8.txt, B1_B8.txt, C1_C8.txt, D1_D8.txt), and a select_wells.txt file specifying the demultiplexing strategy (i.e. barcodes go into certain files based on their experiment/group). We then run the following:

splitcode -c config_separate.txt --nFastqs=2 \
--gzip --keep-grp=select_wells.txt \
--no-output --no-outb \
R1.fastq.gz R2.fastq.gz

The output will consist of a pair of files for each experiment:

A1_A8_0.fastq.gz and A1_A8_1.fastq.gz
B1_B8_0.fastq.gz and B1_B8_1.fastq.gz
C1_C8_0.fastq.gz and C1_C8_1.fastq.gz
D1_D8_0.fastq.gz and D1_D8_1.fastq.gz

Here, _0.fastq.gz corresponds to R1 and _1.fastq.gz corresponds to R2 (because splitcode uses zero-indexing).
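As a rough mental model of this demultiplexing step, the Python sketch below maps a read's round 1 barcode to its experiment group and derives the pair of output file names shown above. The barcode sequences in GROUPS are invented placeholders (the real ones would come from A1_A8.txt, B1_B8.txt, and so on), the helper names are hypothetical, and only exact matches at 0-based positions 78-86 are considered.

```python
# Made-up round 1 barcodes per experiment group, for illustration only.
GROUPS = {
    "A1_A8": {"AAAAAAAA", "AAAACCCC"},
    "B1_B8": {"CCCCCCCC", "CCCCGGGG"},
}

# Invert the mapping so each barcode points at its group.
BARCODE_TO_GROUP = {bc: group for group, bcs in GROUPS.items() for bc in bcs}


def assign_group(r2_seq, start=78, end=86):
    """Return the experiment group for a read pair, or None if the barcode is unknown."""
    return BARCODE_TO_GROUP.get(r2_seq[start:end])


def output_names(group, n_fastqs=2):
    """File names for one group, zero-indexed by input file (0 = R1, 1 = R2)."""
    return [f"{group}_{i}.fastq.gz" for i in range(n_fastqs)]


r2 = "N" * 78 + "CCCCGGGG"
group = assign_group(r2)
print(group, output_names(group))
```

Read pairs whose round 1 barcode is in none of the groups get None here, which corresponds to not writing them to any per-experiment file.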

Example: Extracting barcodes based on linkers

In certain cases (e.g. long-read SPLiT-seq), the barcodes (and UMI) may not be in a fixed position within a read, and we’ll need to extract them relative to the linker sequences. Let’s further assume that we have single-end reads (input.fastq.gz) where the biological read occurs after the final 8-bp barcode (the round 1 barcode). Following the “Barcode reformatting” example above, we can do the following:

config.txt

 @extract <umi[10]>8{linker1},<bc[8]>{linker1},{linker1}<bc[8]>{linker2},{linker2}<bc[8]>,{linker2}8<read>0:-1
 tags                            ids      distances  minFinds  maxFinds  locations  next
 GTGGCCGATGTTTCGCATCGGCGTACGACT  linker1  2          1         1         0:18       {linker2}8-8
 ATCCACGTGCTTGAGACTGTGG*          linker2  2          1         1         0:-30      -

splitcode -c config.txt --x-only --gzip input.fastq.gz

We’ll get a umi.fastq.gz file, a bc.fastq.gz (with the 3 barcodes stitched together), and a read.fastq.gz (containing the biological read sequence).

Here, we set the Hamming distance tolerance to 2, we stop searching after the second linker, and we enforce extraction of barcodes of exactly 8 bp in length. For applications such as long-read sequencing, one may want to adjust some of these settings to pull out potential barcodes more sensitively (these can be refined further downstream) to account for higher error rates.
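The linker-anchored extraction can be pictured with the Python sketch below. It is a simplified illustration, not a reimplementation of splitcode: it allows substitutions only (no insertions or deletions), it searches for the first linker from position 18 onward, it requires the second linker exactly 8 bp after the first, and the function names and example read are made up.

```python
LINKER1 = "GTGGCCGATGTTTCGCATCGGCGTACGACT"
LINKER2 = "ATCCACGTGCTTGAGACTGTGG"


def hamming(a, b):
    """Mismatch count for equal-length strings; unequal lengths never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def extract(seq, max_dist=2):
    """Return umi, stitched barcode and biological read, or None if the linkers are not found."""
    n1, n2 = len(LINKER1), len(LINKER2)
    # The first linker needs 18 bp upstream of it: 10-bp UMI plus 8-bp barcode.
    for p1 in range(18, len(seq) - n1 + 1):
        if hamming(seq[p1:p1 + n1], LINKER1) > max_dist:
            continue
        p2 = p1 + n1 + 8  # second linker must follow an 8-bp barcode
        if hamming(seq[p2:p2 + n2], LINKER2) > max_dist:
            continue
        end2 = p2 + n2
        if len(seq) < end2 + 8:
            continue  # no room for the final 8-bp barcode
        barcode = seq[p1 - 8:p1] + seq[p1 + n1:p2] + seq[end2:end2 + 8]
        return {"umi": seq[p1 - 18:p1 - 8], "bc": barcode, "read": seq[end2 + 8:]}
    return None


# Synthetic read: 2-bp offset, one substitution in the first linker.
linker1_mut = LINKER1[:15] + "A" + LINKER1[16:]
seq = ("TT" + "ACGTACGTAC" + "AAAAAAAA" + linker1_mut + "CCCCCCCC"
       + LINKER2 + "GGGGGGGG" + "ACGTTGCAACGT")
result = extract(seq)
print(result)
```

Because the coordinates are computed relative to where the linkers are found, the 2-bp offset at the start of the read does not affect what is extracted. Raising max_dist would make the search more permissive, at the cost of more spurious matches.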

References

The following references, which describe the method, predate this tutorial, or contributed to its development, are acknowledged and credited:

Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Graybuck LT, Peeler DJ, Mukherjee S, Chen W, Pun SH. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018 Apr 13;360(6385):176-82. https://doi.org/10.1126/science.aam8999
Rebboah E, Reese F, Williams K, Balderrama-Gutierrez G, McGill C, Trout D, Rodriguez I, Liang H, Wold BJ, Mortazavi A. Mapping and modeling the genomic basis of differential RNA isoform expression at single-cell resolution with LR-Split-seq. Genome Biology. 2021 Dec;22(1):1-28. https://doi.org/10.1186/s13059-021-02505-w
Preprocess_SPLITseq_collapse_bcSharing.pl (a perl script to convert R barcodes to T barcodes)
splitp (a rust implementation of the previous perl script)
LR-splitpipe (used for processing long read SPLiT-seq data)