Demultiplex single cells into individual files
Suppose you have single-cell sequencing reads, such as those produced by 10x Genomics, where the 16 bp barcodes that label individual cells are at the beginning of the first read file (R1.fastq), and you want to write the reads from each cell to its own file. This can be done using splitcode’s demultiplexing capabilities (see the Demultiplexing page for a more detailed description).
Here, we’ll create the following config.txt file:
tags locations distances
GACGTAGCGCCCCCCA 0:0:16 1
TAGCTACCTTGTAACA 0:0:16 1
CCACCGAATAGGAACC 0:0:16 1
TTGTAGCTGAGTAGTA 0:0:16 1
CCTTCCTTAAACTTAC 0:0:16 1
CCTTCCTTAAACTTAG 0:0:16 1
We list a series of 16 bp barcodes and specify, via 0:0:16, that each one occurs at the beginning (i.e. the first 16 bp) of the first sequencing read, and that a Hamming distance of 1 (one substitution) is allowed. Of course, single-cell technologies can have thousands of barcodes, all of which would be put into this file in this format; we show only a few here for demonstration purposes.
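To make the location and distance columns concrete, here is a minimal sketch of what "first 16 bp of R1, at most one substitution" means. The hamming function and the read sequence are made up for illustration and are not part of splitcode; the sketch only counts mismatches and says nothing about how splitcode itself does its matching.

```shell
# hamming: hypothetical helper that counts substitutions between two
# equal-length sequences (the Hamming distance). Not part of splitcode.
hamming() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) { print "length mismatch" > "/dev/stderr"; exit 1 }
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d
  }'
}

# A made-up R1 read. The location 0:0:16 covers its first 16 bases.
read_seq="GACGTAGCGCCCCCCTACGGTTCA"
prefix=$(printf '%s' "$read_seq" | cut -c1-16)

# The prefix differs from the first barcode only at its last base: prints 1.
hamming "$prefix" GACGTAGCGCCCCCCA

# The last two barcodes in config.txt are one substitution apart: prints 1.
hamming CCTTCCTTAAACTTAC CCTTCCTTAAACTTAG
```

The second call shows that two barcodes in the example list are only one substitution apart, which is worth keeping in mind when choosing the allowed distance for a larger barcode list.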
Next, we’ll create the following keep.txt file for our demultiplexing strategy (the first column is the tag ID, which by default is the tag sequence itself, and the second column is the prefix of the output file name):
GACGTAGCGCCCCCCA 1
TAGCTACCTTGTAACA 2
CCACCGAATAGGAACC 3
TTGTAGCTGAGTAGTA 4
CCTTCCTTAAACTTAC 5
CCTTCCTTAAACTTAG 6
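With thousands of barcodes, both files would normally be generated rather than typed by hand. The following is a minimal sketch, assuming a barcode list with one barcode per line; the file name barcodes.txt is hypothetical, and the running number used as the output prefix simply mirrors the example above.

```shell
# Hypothetical barcode list, one barcode per line (barcodes.txt is an assumed name).
cat > barcodes.txt <<'EOF'
GACGTAGCGCCCCCCA
TAGCTACCTTGTAACA
CCACCGAATAGGAACC
TTGTAGCTGAGTAGTA
CCTTCCTTAAACTTAC
CCTTCCTTAAACTTAG
EOF

# config.txt: header row, then each barcode at location 0:0:16 with distance 1.
{
  echo "tags locations distances"
  awk 'NF { print $1, "0:0:16", 1 }' barcodes.txt
} > config.txt

# keep.txt: the barcode (the default tag ID) followed by a running number
# that becomes the output file prefix.
awk 'NF { print $1, ++n }' barcodes.txt > keep.txt
```

The `NF` guard skips blank lines, so a trailing empty line in the barcode list does not produce an empty tag.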
We then run the following:
splitcode --gzip --keep-r1-r2 --keep=keep.txt -c config.txt --nFastqs=2 --no-output --no-outb R1.fastq R2.fastq
We use --no-output because we don’t have our “typical output” (i.e. the output specified by --pipe or -o; reads matching any entry in keep.txt whose second column is blank would go into the “typical output”). We use --no-outb because we don’t have any “final barcodes” (since --assign is not specified), so we shouldn’t allocate files for them.
The following six pairs of files will be generated (the _R1.fastq.gz files are the R1 files and the _R2.fastq.gz files are the R2 files, since we specified --keep-r1-r2):
1_R1.fastq.gz and 1_R2.fastq.gz
2_R1.fastq.gz and 2_R2.fastq.gz
3_R1.fastq.gz and 3_R2.fastq.gz
4_R1.fastq.gz and 4_R2.fastq.gz
5_R1.fastq.gz and 5_R2.fastq.gz
6_R1.fastq.gz and 6_R2.fastq.gz
Each pair of files will contain the reads carrying the corresponding barcode (e.g. reads with the barcode TTGTAGCTGAGTAGTA will be written to the files prefixed with 4, i.e. 4_R1.fastq.gz and 4_R2.fastq.gz).
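One way to sanity-check the result is to tally the first 16 bases of every read in an output R1 file. This is a rough sketch: barcode_counts is a hypothetical helper, and it assumes the barcode is still present at the start of the R1 reads in the output (no trimming options were given in the command above) and that each FASTQ record spans exactly four lines.

```shell
# barcode_counts: hypothetical helper that tallies the first 16 bases of each
# read in a gzipped FASTQ file. Sequence lines are lines 2, 6, 10, ... of the
# file, assuming four lines per record.
barcode_counts() {
  gzip -dc "$1" |
    awk 'NR % 4 == 2 { n[substr($0, 1, 16)]++ }
         END { for (b in n) print n[b], b }' |
    sort -k1,1nr
}
```

Running `barcode_counts 4_R1.fastq.gz` after the splitcode command should then list TTGTAGCTGAGTAGTA, plus any prefixes that are one substitution away from it, with the most frequent prefix first.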