Getting started

Installation

To use splitcode, first install it from source:

git clone https://github.com/pachterlab/splitcode
cd splitcode
mkdir build
cd build
cmake ..
make
make install

Alternatively, you can download prebuilt binaries for Mac and Linux here: https://github.com/pachterlab/splitcode/releases

Note

make install will fail unless you have permission to write to the system folders. In that case, after running the make step in the build directory, you can find the splitcode binary at src/splitcode and run it directly.

Graphical User Interface (GUI)

To use splitcode’s GUI, please visit https://pachterlab.github.io/splitcode/

Note

This GUI simply serves as a sandbox to try out and test certain features.

Command-line structure

The command-line structure for running splitcode is as follows:

splitcode [arguments] fastq-files

A list of options can be viewed by running splitcode -h.

The arguments you supply tell splitcode what to do with your FASTQ files. Most often, you will supply a config file that specifies how your reads should be processed, along with an output option.

Overview

Tags

splitcode is organized around tags: the technical sequences that can be identified in reads. The user supplies tags in a tab-separated value config file. Each row of the config file describes one user-supplied sequence, and each column describes a property of that sequence (e.g. tag ID, tag sequence, error tolerance, tag group). Each sequence is associated with a tag via a tag ID, and multiple sequences can share the same tag ID (e.g. to treat AAAA and TTTT as the same tag, give both rows the same tag ID). splitcode identifies tags within sequencing reads by scanning each read from beginning to end.
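The row/column layout and the left-to-right scan can be illustrated with a minimal Python sketch. This is not splitcode's implementation: it handles exact matches only and ignores error tolerance, tag groups, and every other property. The column names id and tag follow the basic usage example on this page; the tag IDs A and B and the helper names load_tags and find_tags are hypothetical.

```python
import csv
import io

def load_tags(config_text):
    """Parse a tab-separated config into a {sequence: tag_id} dict."""
    reader = csv.DictReader(io.StringIO(config_text), delimiter="\t")
    return {row["tag"]: row["id"] for row in reader}

def find_tags(read, tags):
    """Scan the read left to right; return (position, tag_id) for exact matches."""
    hits = []
    pos = 0
    while pos < len(read):
        for seq, tag_id in tags.items():
            if read.startswith(seq, pos):
                hits.append((pos, tag_id))
                pos += len(seq)  # resume scanning after the matched tag
                break
        else:
            pos += 1  # no tag starts here; advance by one base
    return hits

# Hypothetical config: AAAA and TTTT share tag ID "A"; GGCC has tag ID "B".
config = "id\ttag\nA\tAAAA\nA\tTTTT\nB\tGGCC\n"
tags = load_tags(config)
print(find_tags("CAAAAGGCCTTTTC", tags))  # [(1, 'A'), (5, 'B'), (9, 'A')]
```

Because AAAA and TTTT share a tag ID, both occurrences are reported as the same tag, A.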

Barcodes

Each permutation of tags identified within a read forms a unique barcode, so the generated barcode can be used to demultiplex reads based on the identified tags. The barcode is 16 base pairs in length. Supplying --mapping=mapfile.txt writes a file named mapfile.txt that maps each generated barcode to its tags (and their order).
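To make the idea concrete, here is a hedged Python sketch of one way an ordered list of tags could be turned into a 16-base barcode. The numbering scheme (sequential integers written in base 4 over A, C, G, T) and the two-column mapping layout are assumptions made for illustration; they are not taken from splitcode's documented behavior or file format. The names encode_barcode and BarcodeMap are hypothetical.

```python
BASES = "ACGT"

def encode_barcode(index, length=16):
    """Encode a non-negative integer as a fixed-length base-4 string over A, C, G, T."""
    if not 0 <= index < 4 ** length:
        raise ValueError("index does not fit in the barcode length")
    digits = []
    for _ in range(length):
        index, rem = divmod(index, 4)
        digits.append(BASES[rem])
    return "".join(reversed(digits))

class BarcodeMap:
    """Assign sequential barcodes to ordered tag tuples as they are first seen (assumed scheme)."""

    def __init__(self):
        self.ids = {}

    def barcode_for(self, tag_ids):
        key = tuple(tag_ids)  # order matters: (A, B) and (B, A) get different barcodes
        if key not in self.ids:
            self.ids[key] = len(self.ids)
        return encode_barcode(self.ids[key])

    def mapping_lines(self):
        """Return one 'barcode<TAB>tags' line per tuple (hypothetical layout)."""
        return [f"{encode_barcode(i)}\t{','.join(key)}" for key, i in self.ids.items()]

bmap = BarcodeMap()
print(bmap.barcode_for(["A", "B"]))  # AAAAAAAAAAAAAAAA
print(bmap.barcode_for(["B", "A"]))  # AAAAAAAAAAAAAAAC
print(bmap.barcode_for(["A", "B"]))  # AAAAAAAAAAAAAAAA (same tags, same order)
```

A 16-base string over four letters can represent 4^16 (about 4.3 billion) distinct tag permutations, which is why a fixed-length barcode is enough for demultiplexing in this sketch.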

Extraction

Sometimes important technical sequences are not known in advance (as is the case with UMIs), and they must be picked out of reads based on their absolute location within the read or their location relative to a tag. Such sequences can be isolated by using an extraction expression.
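The two kinds of location can be illustrated in plain Python. This sketch does not use splitcode's extraction expression syntax, which is not shown in this section. It only demonstrates the difference between an absolute position and a position relative to a tag. The function names, the example read, and the 8-base UMI length are hypothetical.

```python
def extract_absolute(read, start, length):
    """Return `length` bases from 0-based position `start`, or None if out of range."""
    end = start + length
    if start < 0 or end > len(read):
        return None
    return read[start:end]

def extract_after_tag(read, tag_seq, length):
    """Return the `length` bases right after the first exact occurrence of `tag_seq`."""
    pos = read.find(tag_seq)
    if pos == -1:
        return None  # tag not found, so there is nothing to anchor the extraction to
    return extract_absolute(read, pos + len(tag_seq), length)

read = "GGGATCGACGTACGTCC"
print(extract_absolute(read, 0, 3))        # GGG
print(extract_after_tag(read, "ATCG", 8))  # ACGTACGT
```

The relative form keeps working when the tag shifts position between reads, whereas the absolute form assumes a fixed read layout.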

Output

Basic usage

Here, we demonstrate basic usage of splitcode by searching for the sequence ATCG and replacing it with TTTT.

First, create a config file named config.txt with the following contents:

id  tag  sub
id1 ATCG TTTT

Next, let’s create a sample FASTQ file called intro.fastq with the following contents:

@read1
GGGATCGCCC
+
!!!!!!!!!!
@read2
ATCGTTTTTT
+
!!!!!!!!!!

Then, run the following:

splitcode -c config.txt --nFastqs=1 --pipe intro.fastq

The resulting output will be as follows:

@read1
GGGTTTTCCC
+
!!!KKKK!!!
@read2
TTTTTTTTTT
+
KKKK!!!!!!

As the output shows, the sequence ATCG has been replaced with TTTT. Note that the quality scores of the replaced bases are set to K: every new nucleotide that splitcode inserts always has this quality score.

The --nFastqs=1 argument means that each set of reads consists of only one FASTQ file. If each set consisted of two FASTQ files (as is the case with paired-end reads), we would set that value to 2. The --pipe argument means that the results are written directly to standard output. To write to a file called output.fastq instead, omit --pipe and supply -o output.fastq.
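For readers who want to check the expected output by hand, the following Python sketch reproduces the substitution shown in this example. It is an independent re-implementation for illustration only, not splitcode's code, and it handles exact matches of a single tag. The function name substitute is hypothetical.

```python
def substitute(seq, qual, tag, sub, new_qual="K"):
    """Replace each exact occurrence of `tag` in `seq` with `sub`, scanning left to right."""
    out_seq, out_qual = [], []
    pos = 0
    while pos < len(seq):
        if seq.startswith(tag, pos):
            out_seq.append(sub)
            out_qual.append(new_qual * len(sub))  # inserted bases get quality K
            pos += len(tag)
        else:
            out_seq.append(seq[pos])
            out_qual.append(qual[pos])  # untouched bases keep their original quality
            pos += 1
    return "".join(out_seq), "".join(out_qual)

print(substitute("GGGATCGCCC", "!!!!!!!!!!", "ATCG", "TTTT"))
# ('GGGTTTTCCC', '!!!KKKK!!!')
print(substitute("ATCGTTTTTT", "!!!!!!!!!!", "ATCG", "TTTT"))
# ('TTTTTTTTTT', 'KKKK!!!!!!')
```

Both results match the sequence and quality lines that splitcode prints for read1 and read2 above.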