inDrops guide

Introduction

inDrops is a droplet microfluidics-based single-cell RNA-seq method. This tutorial will describe processing the original version 1 of the technology, which involves two rounds of barcoding (each using 384 barcodes). For our example, we will use the dataset at GSM1599501 (SRR accession: SRR1784317).

Read structure

The R1 reads contain the first (8-11 bp) barcode, followed by a constant 22-bp GAGTGATTGCTTGTGACGCCTT sequence, followed by the second (8-bp) barcode, followed by a 6-bp UMI.

The R2 reads contain the sequence to be aligned to the genome/transcriptome.

Barcode list preprocessing

Here are the two lists of barcodes (one for each round of barcoding):

  1. inDrop_barcode1_list.txt

  2. inDrop_barcode2_list.txt

The inDrop_barcode1_list.txt file is the file that contains variable length barcodes.

First, note that some preprocessing will need to happen with the first 384 barcodes list, because the barcodes are of variable length (and also, in the list above, the barcodes are reverse complemented). Thus, for each barcode in the list, we will 1) reverse complement it, then 2) pad it if necessary so that it reaches 11-bp in length. Since we know that the constant region starts with GAG, if a barcode is only 8-bp long, we pad GAG, if it is only 9-bp long, we pad GA, and if it is only 10-bp long, we pad G.

cat inDrop_barcode1_list.txt|tr ACGTacgt TGCAtgca | rev | awk 'length($0)==8{$0=$0 "GAG"} length($0)==9{$0=$0 "GA"} length($0)==10{$0=$0 "G"} {print}' > list1.txt

We will use the newly created list1.txt for further processing.

Config file

Next, we’ll use a config.txt file as follows, which first matches a barcode in the modified first barcodes list (allowing one substitution error) in the first 11-bp’s, then matches the remaining 19-bp of the 22-bp constant region (remember, the first 3 nucleotides, GAG, are the end-padding of the shorter barcodes, so, since those could appear as part of the barcode region, we omit them when searching for the constant region). The 19-bp’s are matched with two-substitution error tolerance, then the second barcodes list is matched with a one-substitution error tolerance.

The @extract pattern will extract the first barcode (11-bp), followed by the second barcode (8-bp), followed by the 6-bp UMI sequence into a new modified_R1.fastq.gz file. Barcode error correction is already technically performed since the barcodes are extracted in the original form that they exist in the config file.

config.txt
@extract <modified_R1{{barcode1}}>,<modified_R1{{barcode2}}>,{{barcode2}}<modified_R1[6]>
tag                        location   distance  group     maxFindsG  minFindsG   next
TGATTGCTTGTGACGCCTT        0:11:33    2         adapter   1          1           {{barcode2}}0-0
list1.txt$                 0:0:11     1         barcode1  1          1           -
inDrop_barcode2_list.txt$  0:30       1         barcode2  1          1           -

Running splitcode

Now with our config.txt file and our list1.txt (i.e. our modified first barcodes list) in place, we can run splitcode as follows to create a new set of reads:

splitcode --nFastqs=2 --config=config.txt --gzip --select=1 --output=modified_R2.fastq.gz SRR1784317_R1.fastq.gz SRR1784317_R2.fastq.gz

The output files will be modified_R1.fastq.gz and modified_R2.fastq.gz where R1 contains the merged 19-bp barcode (11-bp + 8-bp) followed by the 6-bp UMI.

Hint

Want to save space and time? You can use --pipe instead of --output to produce interleaved paired-end reads that could be streamed directly into tools that accept interleaved reads.

Hint

Want to view (perhaps for debugging purposes) exactly what barcodes are matched in what reads? You can use --mod-names to modify the read header of the output files to identify the barcode sequences that are present in a given read.

Downstream processing

Now, the sequences have a structure that can be readily interpreted by a single-cell RNA-seq mapping program (e.g. STARsolo, alevin-fry, or kallisto | bustools). Here, we will use kallisto | bustools (via kb-python) as an example of how to map these newly modified reads. kallisto | bustools can interpret a user-specified read structure via the -x option (see page 7 of this manual). Assuming we have created a transcriptome index index.idx and its corresponding t2g.txt file from running kb ref, we can then run the read mapping step as follows to produce output count matrices:

kb count -x 0,0,19:0,19,25:1,0,0 -w None -i index.idx -g t2g.txt modified_R1.fastq.gz modified_R2.fastq.gz

References

The following references, which either describe the method, were posted prior to, or contributed to the development of this tutorial, are acknowledged and credited:

  1. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015 May 21;161(5):1187-201. https://doi.org/10.1016/j.cell.2015.04.044

  2. Zilionis R, Nainys J, Veres A, Savova V, Zemmour D, Klein AM, Mazutis L. Single-cell barcoding and sequencing using droplet microfluidics. Nature protocols. 2017 Jan;12(1):44-73. https://doi.org/10.1038/nprot.2016.154

  3. Teichmann Lab inDrops read structure description