SPRITE barcodes

Introduction

SPRITE is a technique in which interacting genomic regions are identified as clusters of sequences that share the same barcode. The barcodes are produced by split-pool barcoding, so the first step is to identify those cluster barcodes.

Read structure

The read structure is as follows:

[Figure: SPRITE read structure]

The reads, in accordance with the original SPRITE manuscript, are processed as follows:

  1. The first 8 bp of read 1 are searched for an exact match to a DPM tag. No mismatches are allowed.

  2. The beginning of read 2 is scanned for a Y tag, which varies from 9-12 bases. The tag must be anchored to the beginning (5’ end) of the read and cannot have mismatches.

  3. Starting 15 bp after the start of read 2, a 15-bp ODD tag is searched for, allowing up to 2 mismatches (Hamming distance 2). If the Y tag was found in the previous step, the next tag must be an ODD tag, and 6-12 bases are allowed between that ODD tag and the preceding Y tag (because the tags are separated by spacer regions).

  4. Starting 30 bp after the start of read 2, a 15-bp EVEN tag is searched for, allowing up to 2 mismatches (Hamming distance 2). If an ODD tag was found in the previous step, the next tag must be an EVEN tag, and 6-12 bases are allowed between that EVEN tag and the preceding ODD tag (because the tags are separated by spacer regions).

  5. Finally, an ODD tag is searched for again, allowing up to 2 mismatches (Hamming distance 2). If an EVEN tag was found in the previous step, the next tag must be an ODD tag, and 6-12 bases are allowed between that ODD tag and the preceding EVEN tag (because the tags are separated by spacer regions).
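
To make the mismatch tolerance in steps 3-5 concrete, here is a minimal sketch of a Hamming-distance check between a tag and a read fragment of the same length. It is only an illustration and not how splitcode is implemented; the function names hamming and tag_matches and the example sequences are hypothetical.

```shell
# Hamming distance: number of positions at which two equal-length strings differ.
# Prints the distance, or fails if the lengths differ.
hamming() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) exit 1
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d
  }'
}

# Succeeds when the fragment is within max_dist mismatches of the tag (default 2).
tag_matches() {
  d=$(hamming "$1" "$2") || return 1
  [ "$d" -le "${3:-2}" ]
}

# Hypothetical 15-bp tag and a read fragment carrying two substitutions.
tag="ACGTACGTACGTACG"
fragment="ACGTTCGTACGAACG"
if tag_matches "$tag" "$fragment" 2; then
  echo "match (distance $(hamming "$tag" "$fragment"))"
else
  echo "no match"
fi
```

With a limit of 2, the fragment above is accepted; a DPM or Y tag, which must match exactly, would correspond to a limit of 0.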

The config file is located at sprite_config.txt

Processing

We will use the SRR7216015 FASTQ files as an example. We run the following:

splitcode --assign -N 2 -o SRR7216015_o1.fastq.gz,SRR7216015_o2.fastq.gz \
--unassigned=SRR7216015_u1.fastq.gz,SRR7216015_u2.fastq.gz --outb=barcodeids.fq.gz \
-c sprite_config.txt --keep-grp=<(echo "DPM,Y,ODD,EVEN,ODD") --mod-names \
--gzip --mapping=mapping.txt SRR7216015_1.fastq.gz SRR7216015_2.fastq.gz

Note that for --keep-grp, we specified <(echo "DPM,Y,ODD,EVEN,ODD"). We could have put DPM,Y,ODD,EVEN,ODD into a separate file and supplied that separate file name, but, for simplicity, we just used a process substitution to create an anonymous pipe.
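
For shells without process substitution (it is available in bash, zsh and ksh, but not in plain POSIX sh), the separate-file variant can be sketched as follows. The file name keep_grp.txt is an arbitrary choice made here for illustration, and the guard around the splitcode call is only there so the snippet does nothing when splitcode or the input files are absent.

```shell
# Write the required tag-group order to a regular file (the file name is arbitrary).
printf '%s\n' "DPM,Y,ODD,EVEN,ODD" > keep_grp.txt

# Same command as above, with the file supplied in place of the process substitution.
# Guarded so the snippet is a no-op where splitcode or the inputs are missing.
if command -v splitcode >/dev/null 2>&1 && [ -f SRR7216015_1.fastq.gz ] && [ -f SRR7216015_2.fastq.gz ]; then
  splitcode --assign -N 2 -o SRR7216015_o1.fastq.gz,SRR7216015_o2.fastq.gz \
    --unassigned=SRR7216015_u1.fastq.gz,SRR7216015_u2.fastq.gz --outb=barcodeids.fq.gz \
    -c sprite_config.txt --keep-grp=keep_grp.txt --mod-names \
    --gzip --mapping=mapping.txt SRR7216015_1.fastq.gz SRR7216015_2.fastq.gz
fi
```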

Hint

You can add left and/or right columns to the SPRITE config file to trim tags (e.g. to remove the DPM tags). You can additionally apply quality trimming via the qtrim option; quality trimming is applied after tag trimming.

Output

The output contains 6 files:

  • SRR7216015_o1.fastq.gz and SRR7216015_o2.fastq.gz: The assigned reads files, i.e. the R1 and R2 files with the five tags (DPM,Y,ODD,EVEN,ODD) identified in order.

  • SRR7216015_u1.fastq.gz and SRR7216015_u2.fastq.gz: The unassigned reads files, i.e. the R1 and R2 files containing reads for which the five tags were not all identified (e.g. only a few tags or no tags were identified). Hint: You can omit the --unassigned option if you don’t care about these unassigned reads files.

  • barcodeids.fq.gz: The final barcodes (each SPRITE cluster gets a unique final barcode) associated with the reads in the assigned reads files. The tags associated with each barcode are written to mapping.txt.

Because we used --mod-names, the tag names will be outputted in the FASTQ header, e.g. as follows:

@SRR7216015.2::[DPM6A2][NYBot86_Stg][Odd2Bo12][Even2Bo90][Odd2Bo62]
TGACATGTTTGGCTCTCTGTTTGTCTGTTATTGGTGTAAAAGAATGCTTGTGATTTTTGCACATTGATTTTGTATCCTGAGACTTTGCTGAAGTTGCTTCTGGATGGATTAAATT
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIFHHIIIIIIIIIII
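
Because the tag names sit in square brackets in the header, they are easy to pull out with standard tools. The sketch below is one possible way to do so, assuming headers of the form shown above; the function names header_tags and tag_combo_counts are hypothetical and not part of splitcode.

```shell
# Print the bracketed tag names of one FASTQ header as a comma-separated list.
header_tags() {
  printf '%s\n' "$1" | grep -o '\[[^]]*\]' | tr -d '[]' | paste -sd, -
}

# Read FASTQ on stdin and count how many reads carry each tag combination,
# most frequent first. Only header lines (every 4th line, starting at 1) are used.
tag_combo_counts() {
  awk 'NR % 4 == 1' | grep -o '::\[.*\]' | sort | uniq -c | sort -rn
}

header_tags '@SRR7216015.2::[DPM6A2][NYBot86_Stg][Odd2Bo12][Even2Bo90][Odd2Bo62]'
# prints: DPM6A2,NYBot86_Stg,Odd2Bo12,Even2Bo90,Odd2Bo62
```

On the real output, a quick look at the most common tag combinations could then be obtained with gzip -dc SRR7216015_o1.fastq.gz | tag_combo_counts | head.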

Hint

Add --com-names option if you want a numerical identifier for the SPRITE clusters placed into the FASTQ header comments rather than (or in addition to) the “final barcodes”.

If you don’t care about “final barcodes”, you can omit --outb and --mapping; this will make things faster and more memory efficient.

Ligation Efficiency

To assess ligation efficiency, use the script at ligeff.sh

./ligeff.sh SRR7216015_o1.fastq.gz SRR7216015_u1.fastq.gz
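
As a coarse companion check, one can also compute the share of reads in which all five tags were found, using the assigned and unassigned R1 files. The sketch below is not a reproduction of ligeff.sh, whose contents are not shown here; the function names count_reads and assigned_pct are hypothetical.

```shell
# Count FASTQ records (4 lines each) in a gzip-compressed file.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Percentage of reads that were assigned, given assigned and unassigned counts.
assigned_pct() {
  awk -v a="$1" -v u="$2" 'BEGIN {
    if (a + u > 0) printf "%.2f\n", 100 * a / (a + u)
    else print "NA"
  }'
}

# Run on the tutorial outputs only if they are present.
if [ -f SRR7216015_o1.fastq.gz ] && [ -f SRR7216015_u1.fastq.gz ]; then
  assigned_pct "$(count_reads SRR7216015_o1.fastq.gz)" "$(count_reads SRR7216015_u1.fastq.gz)"
fi
```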

RD-SPRITE

Processing RD-SPRITE (RNA-DNA SPRITE) is also possible; see rdsprite_config.txt for an example.

References

The following references, which describe the method, were posted prior to this tutorial, or contributed to its development, are acknowledged and credited:

  1. Quinodoz SA, Ollikainen N, Tabak B, Palla A, Schmidt JM, Detmar E, Lai MM, Shishkin AA, Bhat P, Takei Y, Trinh V. Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell. 2018 Jul 26;174(3):744-57. https://doi.org/10.1016/j.cell.2018.05.024

  2. Quinodoz SA, Bhat P, Chovanec P, Jachowicz JW, Ollikainen N, Detmar E, Soehalim E, Guttman M. SPRITE: a genome-wide method for mapping higher-order 3D interactions in the nucleus using combinatorial split-and-pool barcoding. Nature protocols. 2022 Jan;17(1):36-75. https://doi.org/10.1038/s41596-021-00633-y

  3. SPRITE pipeline wiki