SPLiT-seq processing

Introduction

SPLiT-seq (Parse Biosciences) uses three rounds of combinatorial barcoding. This tutorial assumes the following read layout, which is one possible protocol:

  • Round 1 barcode (8 bp): Position 78-86 in R2.fastq.gz

  • Round 2 barcode (8 bp): Position 48-56 in R2.fastq.gz

  • Round 3 barcode (8 bp): Position 10-18 in R2.fastq.gz

  • Biological read: The R1.fastq.gz file.

The round 1 and round 2 barcodes are separated by the linker ATCCACGTGCTTGAGACTGTGG, and the round 2 and round 3 barcodes are separated by the linker GTGGCCGATGTTTCGCATCGGCGTACGACT. The first 10 bp of each read in R2.fastq.gz are the UMI.
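The layout above can be checked with a short Python sketch. It reads the stated positions as 0-based, half-open intervals, which is an assumption, although it agrees with the barcode and linker lengths given above (10 + 8 + 30 + 8 + 22 + 8 = 86). The function name parse_r2 and the example read are made up for illustration.

```python
# Layout of one R2 read, treating the tutorial's positions as 0-based,
# half-open intervals (an assumption that matches the stated lengths).
LINKER_R3_R2 = "GTGGCCGATGTTTCGCATCGGCGTACGACT"  # between round 3 and round 2
LINKER_R2_R1 = "ATCCACGTGCTTGAGACTGTGG"          # between round 2 and round 1


def parse_r2(seq: str) -> dict:
    """Slice an R2 sequence into UMI, barcodes, and linkers by fixed position."""
    return {
        "umi": seq[0:10],
        "round3": seq[10:18],
        "linker_r3_r2": seq[18:48],
        "round2": seq[48:56],
        "linker_r2_r1": seq[56:78],
        "round1": seq[78:86],
    }


# Made-up read, for illustration only.
example = ("ACGTACGTAC" + "AAAAAAAA" + LINKER_R3_R2
           + "CCCCCCCC" + LINKER_R2_R1 + "GGGGGGGG")
parts = parse_r2(example)
print(parts["round1"], parts["round2"], parts["round3"], parts["umi"])
```

If the two linker slices do not match the known linker sequences for most reads, the assumed positions probably do not fit the library.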

Note: Round 1 barcodes (position 78-86 in R2.fastq.gz) come from two types of reads: 1) random-oligo-primed reads (R) and 2) polyT-primed reads (T). The two types are distinguished by their round 1 barcodes (there are 96 R barcodes and 96 T barcodes), so two possible barcodes can belong to a single cell. Analyzing them separately can be desirable because of technical biases. Alternatively, one may want to convert R barcodes to their corresponding T barcodes (as is done in many pipelines) so that each cell gets a single barcode; this is what we’ll do in the following section.

Example: Convert R to T

We create a config file, config_RT.txt, where each R barcode is specified to be replaced with its corresponding T barcode.

We then run splitcode on the R2.fastq.gz file as follows:

splitcode -c config_RT.txt -o modified_R2.fq.gz R2.fastq.gz
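To make the effect of the conversion concrete, here is a minimal Python sketch of the idea, not of splitcode's implementation. The R-to-T pairs below are invented placeholders (the real pairs come from the kit's barcode tables), the function name convert_round1 is hypothetical, and the sketch uses exact matching at 0-based positions 78-86.

```python
# Hypothetical R -> T barcode pairs; real pairs come from the kit's barcode tables.
R_TO_T = {
    "AAACATCG": "ACGTTGCA",
    "CCGTGAGA": "TTGGAACC",
}


def convert_round1(r2_seq: str, start: int = 78, end: int = 86) -> str:
    """Replace an R barcode at [start, end) with its T counterpart.

    Sequences whose round 1 barcode is not a known R barcode are returned
    unchanged (exact matching only, for simplicity).
    """
    barcode = r2_seq[start:end]
    return r2_seq[:start] + R_TO_T.get(barcode, barcode) + r2_seq[end:]


# Made-up reads: 78 placeholder bases followed by a round 1 barcode.
prefix = "N" * 78
converted = convert_round1(prefix + "AAACATCG")
untouched = convert_round1(prefix + "GGGGGGGG")
print(converted[78:86], untouched[78:86])
```

Because both barcodes are 8 bp long, the read length and all other positions are unaffected by the substitution.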

Tip

Instead of writing an output file via -o, one can use -p to pipe the output to standard output (and then feed it directly into a downstream read processing/alignment program).

Example: Barcode reformatting

Here, we’ll produce a final “corrected” barcode in which all three barcoding rounds are stitched together and error-corrected. Reads in which any of the three rounds fails to match the barcode onlist are discarded.

With r1_R.txt and r1_T.txt containing the round 1 barcodes (for R and T, respectively), and r2_r3.txt containing the round 2 and 3 barcodes (note that round 2 and round 3 use the same set of 96 barcodes), we specify the following config.txt file:

config.txt
 @extract <output_R2{*}>,1:0<umi[10]>
 tags         distances  ids     groups    minFindsG  locations
 r1_R.txt$    1          r1_R    round1    1          1,78,86
 r1_T.txt$    1          r1_T    round1    1          1,78,86
 r2_r3.txt$   1          r2      round2_3  2          1,48,56
 r2_r3.txt$   1          r3      round2_3  2          1,10,18

Note that the $ means to treat each barcode sequence as its own individual tag and we use minFindsG to specify that the round 1 barcodes must be found once and the round 2+3 barcodes must be found twice. We then run the following:

splitcode -c config.txt --nFastqs=2 --select=0 \
--gzip -o output_R1.fastq.gz R1.fastq.gz R2.fastq.gz

The --select=0 option means that we’re only outputting the zeroth (i.e. R1) file, which is named output_R1.fastq.gz. We don’t want to output the R2 file because we’re already outputting the corrected, stitched-together barcodes (24 bp in length), which will be stored in output_R2.fastq.gz; the UMI will be stored in umi.fastq.gz.
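The correct-and-stitch logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not splitcode's algorithm: the onlists are made-up placeholders, positions are read as 0-based half-open intervals, ambiguous matches are simply discarded, and the concatenation order (round 1, round 2, round 3) is an arbitrary choice for the example.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct(observed: str, onlist: Iterable[str], max_dist: int = 1) -> Optional[str]:
    """Return the unique onlist barcode within max_dist of observed, else None."""
    candidates = list(onlist)
    if observed in candidates:
        return observed
    hits = [bc for bc in candidates
            if len(bc) == len(observed) and hamming(observed, bc) <= max_dist]
    return hits[0] if len(hits) == 1 else None  # ambiguous or no match -> discard


def stitch(r2_seq: str, round1: Iterable[str], round2_3: Iterable[str]) -> Optional[str]:
    """Build a 24-bp corrected barcode, or None if any round fails to match."""
    r1 = correct(r2_seq[78:86], round1)
    r2 = correct(r2_seq[48:56], round2_3)
    r3 = correct(r2_seq[10:18], round2_3)
    if r1 is None or r2 is None or r3 is None:
        return None
    return r1 + r2 + r3


# Hypothetical onlists and a made-up read with one error in the round 3 barcode.
ROUND1 = ["GGGGGGGG", "ACACACAC"]
ROUND2_3 = ["CCCCCCCC", "AAAAAAAA"]
read = "ACGTACGTAC" + "AAAAAAAT" + "N" * 30 + "CCCCCCCC" + "N" * 22 + "GGGGGGGG"
stitched = stitch(read, ROUND1, ROUND2_3)
rejected = stitch("ACGTACGTAC" + "TTTTTTTT" + read[18:], ROUND1, ROUND2_3)
print(stitched, rejected)
```

The second call shows the discard case: a round 3 sequence that is more than one mismatch away from every onlist barcode causes the whole read to be rejected.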

Example: Demultiplexing wells

The 96-well plate contains 8 rows (A-H) and 12 columns (1-12). Each well can be identified by the first round of split-pool barcoding. For example purposes, let’s say wells A1-A8 were used for one experiment, wells B1-B8 were used for a second experiment, wells C1-C8 were used for a third experiment, and wells D1-D8 were used for a fourth experiment. We want to separate those 4 experiments into their own FASTQ files.

We can do so by creating a config file, config_separate.txt; four files containing the barcodes for each experiment (A1_A8.txt, B1_B8.txt, C1_C8.txt, and D1_D8.txt); and a select_wells.txt file specifying the demultiplexing strategy (i.e. barcodes go into certain files based on their experiment/group). We then run the following:

splitcode -c config_separate.txt --nFastqs=2 \
--gzip --keep-grp=select_wells.txt \
--no-output --no-outb \
R1.fastq.gz R2.fastq.gz

The output will consist of a pair of files for each experiment:

  • A1_A8_0.fastq.gz and A1_A8_1.fastq.gz

  • B1_B8_0.fastq.gz and B1_B8_1.fastq.gz

  • C1_C8_0.fastq.gz and C1_C8_1.fastq.gz

  • D1_D8_0.fastq.gz and D1_D8_1.fastq.gz

Here, _0.fastq.gz corresponds to R1 and _1.fastq.gz corresponds to R2 (because splitcode uses zero-indexing).
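The grouping logic can be illustrated with a small Python sketch. The barcode sequences below are invented placeholders standing in for the contents of A1_A8.txt and the other barcode files, and the helper names are hypothetical; only the output naming pattern (group name, zero-based file index, .fastq.gz) is taken from the file list above.

```python
from typing import Dict, List, Optional, Set

# Placeholder barcodes standing in for the contents of A1_A8.txt, B1_B8.txt, etc.
GROUPS: Dict[str, Set[str]] = {
    "A1_A8": {"AAAAAAAA", "AAAACCCC"},
    "B1_B8": {"CCCCCCCC", "CCCCGGGG"},
    "C1_C8": {"GGGGGGGG", "GGGGTTTT"},
    "D1_D8": {"TTTTTTTT", "TTTTAAAA"},
}

# Reverse lookup: round 1 barcode -> experiment group.
BARCODE_TO_GROUP = {bc: group for group, bcs in GROUPS.items() for bc in bcs}


def assign_group(round1_barcode: str) -> Optional[str]:
    """Return the experiment group for a round 1 barcode, or None if unassigned."""
    return BARCODE_TO_GROUP.get(round1_barcode)


def output_names(group: str, n_fastqs: int = 2) -> List[str]:
    """Zero-indexed output file names for one group (index 0 = R1, 1 = R2)."""
    return [f"{group}_{i}.fastq.gz" for i in range(n_fastqs)]


group = assign_group("CCCCGGGG")
print(group, output_names(group))
```

Reads whose round 1 barcode belongs to none of the four groups get no group in this sketch, which corresponds to wells that were not part of any of the four experiments.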

Example: Extracting barcodes based on linkers

In certain cases (e.g. long-read SPLiT-seq), the barcodes (and UMI) may not be at fixed positions within a read, so we need to extract them relative to the linker sequences. Let’s further assume that we have single-end reads (input.fastq.gz) in which the biological read follows the final 8-bp barcode (the round 1 barcode). Building on the “Barcode reformatting” example above, we can do the following:

config.txt
 @extract <umi[10]>8{linker1},<bc[8]>{linker1},{linker1}<bc[8]>{linker2},{linker2}<bc[8]>,{linker2}8<read>0:-1
 tags                            ids      distances  minFinds  maxFinds  locations  next
 GTGGCCGATGTTTCGCATCGGCGTACGACT  linker1  2          1         1         0:18       {linker2}8-8
 ATCCACGTGCTTGAGACTGTGG*         linker2  2          1         1         0:-30      -

We then run the following:

splitcode -c config.txt --x-only --gzip input.fastq.gz

We’ll get a umi.fastq.gz file, a bc.fastq.gz (with the 3 barcodes stitched together), and a read.fastq.gz (containing the biological read sequence).

Here, we set the Hamming distance tolerance to 2, stop searching after the second linker, and enforce extraction of barcodes exactly 8 bp in length. For applications such as long-read sequencing, where error rates are higher, one may want to relax some of these settings to pull out potential barcodes more sensitively (these can then be refined downstream).
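As a rough illustration of linker-anchored extraction, the Python sketch below locates both linkers with up to 2 mismatches and then takes the flanking sequences. It is a simplified stand-in, not splitcode's implementation: the function names and the example read are made up, the search is a plain left-to-right scan, and the barcode pieces are concatenated in read order (round 3, round 2, round 1).

```python
from typing import Optional

LINKER1 = "GTGGCCGATGTTTCGCATCGGCGTACGACT"  # between round 3 and round 2
LINKER2 = "ATCCACGTGCTTGAGACTGTGG"          # between round 2 and round 1


def find_linker(seq: str, linker: str, start: int = 0,
                max_mismatches: int = 2) -> Optional[int]:
    """Leftmost position >= start where linker matches within max_mismatches."""
    n = len(linker)
    for i in range(start, len(seq) - n + 1):
        if sum(a != b for a, b in zip(seq[i:i + n], linker)) <= max_mismatches:
            return i
    return None


def extract(seq: str) -> Optional[dict]:
    """Extract UMI, stitched barcode, and biological read relative to the linkers."""
    p1 = find_linker(seq, LINKER1)
    if p1 is None or p1 < 18:           # need 10-bp UMI + 8-bp barcode upstream
        return None
    e1 = p1 + len(LINKER1)
    p2 = find_linker(seq, LINKER2, start=e1)
    if p2 is None or p2 - e1 != 8:      # enforce exactly 8 bp between the linkers
        return None
    e2 = p2 + len(LINKER2)
    if len(seq) < e2 + 8:               # need a full 8-bp barcode after linker 2
        return None
    return {
        "umi": seq[p1 - 18:p1 - 8],
        "bc": seq[p1 - 8:p1] + seq[e1:p2] + seq[e2:e2 + 8],
        "read": seq[e2 + 8:],
    }


# Made-up read: 2 extra leading bases and one sequencing error in linker 1.
noisy_linker1 = LINKER1[:5] + "A" + LINKER1[6:]
example = ("TT" + "ACGTACGTAC" + "AAAAAAAA" + noisy_linker1
           + "CCCCCCCC" + LINKER2 + "GGGGGGGG" + "TTTTGGGGCCCCAAAA")
result = extract(example)
print(result)
```

Relaxing the `p2 - e1 != 8` check or raising `max_mismatches` would be one way to trade specificity for sensitivity in this sketch when error rates are high.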

References

The following references, which describe the method, predate this tutorial, or contributed to its development, are acknowledged and credited:

  1. Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Graybuck LT, Peeler DJ, Mukherjee S, Chen W, Pun SH. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018 Apr 13;360(6385):176-82. https://doi.org/10.1126/science.aam8999

  2. Rebboah E, Reese F, Williams K, Balderrama-Gutierrez G, McGill C, Trout D, Rodriguez I, Liang H, Wold BJ, Mortazavi A. Mapping and modeling the genomic basis of differential RNA isoform expression at single-cell resolution with LR-Split-seq. Genome biology. 2021 Dec;22(1):1-28. https://doi.org/10.1186/s13059-021-02505-w

  3. Preprocess_SPLITseq_collapse_bcSharing.pl (a perl script to convert R barcodes to T barcodes)

  4. splitp (a rust implementation of the previous perl script)

  5. LR-splitpipe (used for processing long read SPLiT-seq data)