.. _SPLITSEQ guide: SPLiT-seq processing ==================== Introduction ^^^^^^^^^^^^ SPLiT-seq (Parse Biosciences) utilizes three rounds of combinatorial barcoding. A protocol might look as follows (and is what will be used in this tutorial): * **Round 1** barcode (8 bps): Position **78-86** in R2.fastq.gz * **Round 2** barcode (8 bps): Position **48-56** in R2.fastq.gz * **Round 3** barcode (8 bps): Position **10-18** in R2.fastq.gz * Biological read: The R1.fastq.gz file. The round 1 barcode and round 2 barcode are separated by a ATCCACGTGCTTGAGACTGTGG linker and the round 2 barcode and the round 3 barcode are separated by a GTGGCCGATGTTTCGCATCGGCGTACGACT linker. The first 10 bp's in R2.fastq.gz is the UMI. Note: The round 1 barcode (position 78-86 in R2.fastq.gz) has two types of reads: 1) random oligo primed reads (**R**), and 2) polyT primed reads (**T**). These are distinguished by the round 1 barcodes (there are 96 R barcodes and 96 T barcodes. Therefore, two possible barcodes can belong to a single cell. It would be desirable to analyze them separately (because of technical biases). However, one may alternatively want to convert R barcodes to their corresponding T barcodes (as is done in many pipelines) so that each cell gets a single barcode; this is what we'll do in the following section. Example: Convert R to T ^^^^^^^^^^^^^^^^^^^^^^^ We create a config file, `config_RT.txt `_, where each R barcode is specified to be replaced with its corresponding T barcode. We then run splitcode on the R2.fastq.gz file as follows: .. code-block:: text splitcode -c config_RT.txt -o modified_R2.fq.gz R2.fastq.gz .. tip:: Instead of generating an output file via ``-o``, one can use ``-p`` instead to pipe output to standard output (and then direct the standard output directly into a downstream read processing/alignment program). Example: Barcode reformatting ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Here, we'll address producing a final "corrected" barcode with all three barcoding rounds stitched together and error-corrected (with the reads which don't have all three rounds matching the barcode onlist being discarded). With `r1_R.txt `_ and `r1_T.txt `_ containing the round 1 barcodes (for R and T, respectively), and `r2_3.txt `_ containing the round 2 and 3 barcodes (note that round 2 and round 3 use the same set of 96 barcodes), we specify the following config.txt file: .. code-block:: text :caption: config.txt @extract ,1:0 tags distances ids groups minFindsG locations r1_R.txt$ 1 r1_R round1 1 1,78,86 r1_T.txt$ 1 r1_T round1 1 1,78,86 r2_r3.txt$ 1 r2 round2_3 2 1,48,56 r2_r3.txt$ 1 r3 round2_3 2 1,10,18 Note that the ``$`` means to treat each barcode sequence as its own individual tag and we use ``minFindsG`` to specify that the round 1 barcodes must be found once and the round 2+3 barcodes must be found twice. We then run the following: .. code-block:: shell splitcode -c config.txt --nFastqs=2 --select=0 \ --gzip -o output_R1.fastq.gz R1.fastq.gz R2.fastq.gz The ``--select=0`` option means that we're only outputting the zeroth (i.e. R1) file which is named ``output_R1.fastq.gz`` -- we don't want to output the R2 file because we're already outputting the corrected, stitched-together barcodes (24-bp in length), which will be stored in ``output_R2.fastq.gz``; the UMI will be stored in ``umi.fastq.gz``. Example: Demultiplexing wells ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The 96-well plate contains 8 rows (A-H) and 12 columns (1-12). Each well can be identified by the first round of split-pool barcoding. For example purposes, let's say wells A1-A8 were used for one experiment, wells B1-B8 were used for a second experiment, wells C1-C8 were used for a third experiment, and wells D1-D8 were used for a fourth experiment. We want to separate those 4 experiments into their own FASTQ files. We can so by creating a config file, `config_separate.txt `_, four files containing the barcodes for each experiment: `A1_A8.txt `_, `B1_B8.txt `_, `C1_C8.txt `_, `D1_D8.txt `_, and then a `select_wells.txt `_ file specifying the demultiplexing strategy (i.e. barcodes go into certain files based on their experiment/group). We then run the following: .. code-block:: shell splitcode -c config_separate.txt --nFastqs=2 \ --gzip --keep-grp=select_wells.txt \ --no-output --no-outb \ R1.fastq.gz R2.fastq.gz The output will consist of a pair of files for each experiment: * ``A1_A8_0.fastq.gz`` and ``A1_A8_1.fastq.gz`` * ``B1_B8_0.fastq.gz`` and ``B1_B8_1.fastq.gz`` * ``C1_C8_0.fastq.gz`` and ``C1_C8_1.fastq.gz`` * ``D1_D8_0.fastq.gz`` and ``D1_D8_1.fastq.gz`` Where ``_0.fastq.gz`` corresponds to ``R1`` and ``_1.fastq.gz`` corresponds to ``R2`` (because splitcode uses zero-indexing). .. seealso:: See the following pages for further assistance on demultiplexing: * :ref:`demultiplexing page` * :ref:`DemuxCells guide` Example: Extracting barcodes based on linkers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In certain cases (e.g. long-read SPLiT-seq), the barcodes (and UMI) may not be in fixed position within a read and we'll need to extract them relative to the linker sequences. Let's further assume that we have single-end reads (input.fastq.gz) where the biological read occurs after the final 8-bp barcode (the round 1 barcode). Following the "Barcode reformatting" example above, we can do the following: .. code-block:: text :caption: config.txt @extract 8{linker1},{linker1},{linker1}{linker2},{linker2},{linker2}80:-1 tags ids distances minFinds maxFinds locations next GTGGCCGATGTTTCGCATCGGCGTACGACT linker1 2 1 1 0:18 {linker2}8-8 ATCCACGTGCTTGAGACTGTGG* linker2 2 1 1 0 - .. code-block:: shell splitcode -c config.txt --x-only --gzip input.fastq.gz We'll get a ``umi.fastq.gz`` file, a ``bc.fastq.gz`` (with the 3 barcodes stitched together), and a ``read.fastq.gz`` (containing the biological read sequence). Here, we set the hamming distance tolerance to 2, we terminate looking after the second linker, and we enforce extraction of barcodes of exactly 8-bp in length, but for applications such as long-read sequencing, one may want to adjust some of these to more sensitively pull out potential barcodes (that can further be refined downstream) to account for higher error rates. References ^^^^^^^^^^ The following references, which either describe the method, were posted prior to, or contributed to the development of this tutorial, are acknowledged and credited: 1. Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Graybuck LT, Peeler DJ, Mukherjee S, Chen W, Pun SH. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018 Apr 13;360(6385):176-82. `https://doi.org/10.1126/science.aam8999 `_ 2. Rebboah E, Reese F, Williams K, Balderrama-Gutierrez G, McGill C, Trout D, Rodriguez I, Liang H, Wold BJ, Mortazavi A. Mapping and modeling the genomic basis of differential RNA isoform expression at single-cell resolution with LR-Split-seq. Genome biology. 2021 Dec;22(1):1-28. `https://doi.org/10.1186/s13059-021-02505-w `_ 3. `Preprocess_SPLITseq_collapse_bcSharing.pl `_ (a perl script to convert R barcodes to T barcodes) 4. `splitp `_ (a rust implementation of the previous perl script) 5. `LR-splitpipe `_ (used for processing long read SPLiT-seq data)