Reference guide
Config file options
Table Options
These options are supplied as a tab-delimited table in the config file, with each option being a column. They are used for tag sequence identification. (Note: the tags, ids, groups, distances, locations, and subs options can each be written in either singular or plural form, e.g. you can write tag instead of tags.)
Option | Description | Additional info | Example |
---|---|---|---|
tags | Tag sequence | String of ATCG bases. Alternatively, a file containing multiple tag sequences can be supplied. | GGATC |
ids | Tag name/ID | | tag_A |
groups | Tag group name/ID | Tags can be grouped together under a group name. | grp_A |
distances | Allowable error tolerance | Supports setting Hamming distance, indel, and total error (Hamming + indel) allowances. | 2 |
locations | Where a tag should be searched for in a read | Can specify file, start position, and end position. | 0:5:13 |
minFinds | Minimum number of times a tag must be found in a read | If this isn’t met, the read is discarded. | 3 |
minFindsG | Minimum number of times a tag group must be found in a read | If this isn’t met, the read is discarded. | 3 |
maxFinds | Maximum number of times a tag can be found in a read | Once this is reached, the program simply stops looking for that tag. | 5 |
maxFindsG | Maximum number of times a tag group can be found in a read | Once this is reached, the program simply stops looking for any tag belonging to that group. | 5 |
left | Whether tag should be a left trimming point (0 = no; 1 = yes) | At the location the tag is found, that tag and all bases to the left of the tag in the read are removed. | 1 |
right | Whether tag should be a right trimming point (0 = no; 1 = yes) | At the location the tag is found, that tag and all bases to the right of the tag in the read are removed. | 0 |
next | What tag ID or group ID must come after the tag | When the tag is found, only the tag ID or group ID specified as “next” will be searched for. | {tag_A} |
previous | What tag ID or group ID must come before the tag | The tag will not be searched for unless the tag ID or group ID specified as “previous” was found right before. | {{grp_A}} |
subs | Sequence to substitute tag with when tag is found in read | Useful for error correction: the original tag sequence can be substituted in when a mismatched version is found. | NNNN |
partial5 | Identifies sequences that may be truncated at the 5′ end | Specify the minimum number of bases that must match and the allowable substitution mismatch frequency (min_match:mismatch_freq). | 3:0.1 |
partial3 | Identifies sequences that may be truncated at the 3′ end | Specify the minimum number of bases that must match and the allowable substitution mismatch frequency (min_match:mismatch_freq). | 3:0.1 |
revcomp | Whether sequence’s reverse complement should be identified (0 = no; 1 = yes) | | 1 |
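As a rough illustration of the tab-delimited layout described above, the following Python sketch renders a small table of tag definitions. The helper `render_table`, the second tag row, and the assumption that the first line of the table lists the option names as column headers are illustrative choices, not taken from splitcode itself.

```python
# Hypothetical helper: render tag definitions as a tab-delimited table.
# Assumption: the table starts with a header line naming the option columns.
COLUMNS = ["tags", "ids", "groups", "distances", "locations"]

rows = [
    # First row uses the example values from the options table.
    {"tags": "GGATC", "ids": "tag_A", "groups": "grp_A",
     "distances": "2", "locations": "0:5:13"},
    # Second row is made up for illustration.
    {"tags": "AAGGA", "ids": "tag_B", "groups": "grp_A",
     "distances": "1", "locations": "0:0:5"},
]


def render_table(rows, columns=COLUMNS):
    """Join the header and each row with tabs, one tag per line."""
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(row[col] for col in columns))
    return "\n".join(lines) + "\n"


config_text = render_table(rows)
print(config_text, end="")
```

Writing `config_text` to a file would give a candidate config table; whether splitcode accepts it as-is should be checked against a real run.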
Header Options
These options are supplied at the very beginning of the config file, with each option being a line that begins with @. These options are used for read modification and extraction.
Option | Example | Description |
---|---|---|
@extract | {tag_A}<umi[8]> | Extracts UMI-like sequences |
@no-chain | | Disable stitching multiple extracted sequences together |
@trim-5 | 4 | Number of bases to trim from 5′ end (done first, before any tag operations or other trimming operations) |
@trim-3 | 6 | Number of bases to trim from 3′ end (done first, before any tag operations or other trimming operations) |
@filter-len | 10:100 | Filter reads based on length (min_length:max_length) |
@qtrim | 30 | Threshold for quality trimming (uses cutadapt algorithm) |
@qtrim-naive | | Switch quality trimming algorithm to the naive one (trim until a base that meets the quality threshold is found) |
@qtrim-5 | | Enable quality trimming from 5′ end of each read |
@qtrim-3 | | Enable quality trimming from 3′ end of each read |
@qtrim-pre | | Do quality trimming first (i.e. before all the operations involving tags) |
@phred64 | | Use the old phred+64 quality scores instead of the newer phred+33 scores |
@prefix | CG | Bases that will prefix each 16-bp final barcode sequence (useful for merging separate experiments) |
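To make the naive quality trimming option more concrete, here is a minimal Python sketch of 3′-end trimming as described above: bases are removed from the end until one meets the threshold. The function name `naive_qtrim_3`, the use of `>=` for "meets the threshold", and the `offset` parameter (33 for phred+33, 64 for phred+64) are assumptions for illustration; this is not splitcode's implementation.

```python
def naive_qtrim_3(seq, qual, threshold, offset=33):
    """Trim from the 3' end until a base with quality >= threshold is found.

    Illustrative only: `offset` is 33 for phred+33 and 64 for phred+64.
    """
    end = len(seq)
    # Walk backwards while the last remaining base is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes quality 40 and '#' encodes quality 2 under phred+33.
trimmed_seq, trimmed_qual = naive_qtrim_3("ACGTACGT", "IIIII###", 30)
print(trimmed_seq, trimmed_qual)
```

With a threshold of 30, the three trailing low-quality bases are dropped and the remaining five bases are kept.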
Command-line options
The splitcode help menu, which can be accessed via splitcode -h, is shown below.
Note
Since this manual uses the splitcode config file for setting config options, the command-line arguments that set the config options are not necessary and therefore are not shown. We discuss those arguments in the following section.
Usage: splitcode [arguments] fastq-files
Options (configurations supplied in a file):
-c, --config Configuration file
Output Options:
-m, --mapping Output file where the mapping between final barcode sequences and names will be written
-o, --output FASTQ file(s) where output will be written (comma-separated)
Number of output FASTQ files should equal --nFastqs (unless --select is provided)
-O, --outb FASTQ file where final barcodes will be written
If not supplied, final barcodes are prepended to reads of first FASTQ file (or as the first read for --pipe)
-u, --unassigned FASTQ file(s) where output of unassigned reads will be written (comma-separated)
Number of FASTQ files should equal --nFastqs (unless --select is provided)
-E, --empty Sequence to fill in empty reads in output FASTQ files (default: no sequence is used to fill in those reads)
--empty-remove Empty reads are stripped in output FASTQ files (don't even output an empty sequence)
-p, --pipe Write to standard output (instead of output FASTQ files)
-S, --select Select which FASTQ files to output (comma-separated) (e.g. 0,1,3 = Output files #0, #1, #3)
--gzip Output compressed gzip'ed FASTQ files
--out-fasta Output in FASTA format rather than FASTQ format
--out-bam Output a BAM file rather than FASTQ files (enter the output BAM file name to -o or --output)
--keep-com Preserve the comments of the read names of the input FASTQ file(s)
--no-output Don't output any sequences
--no-outb Don't output final barcode sequences
--no-x-out Don't output extracted UMI-like sequences (should be used with --x-names)
--mod-names Modify names of outputted sequences to include identified tag names
--com-names Modify names of outputted sequences to include final barcode sequence ID
--seq-names Modify names of outputted sequences to include the sequences of identified tags
--loc-names Modify names of outputted sequences to include found tag names and locations
--x-names Modify names of outputted sequences to include extracted UMI-like sequences
--x-only Only output extracted UMI-like sequences
--bc-names Modify names of outputted sequences to include final barcode sequence string
-X, --sub-assign Assign reads to a secondary sequence ID based on a subset of tags present (must be used with --assign)
(e.g. 0,2 = Generate unique ID based on the tags present by subsetting those tags to tag #0 and tag #2 only)
The names of the outputted sequences will be modified to include this secondary sequence ID
-C, --compress Set the gzip compression level (default: 1) (range: 1-9)
-M, --sam-tags Modify the default SAM tags (default: CB:Z:,RX:Z:,BI:i:,SI:i:,BC:Z:,LX:Z:,YM:Z:)
Other Options:
-N, --nFastqs Number of FASTQ file(s) per run
(default: 1) (specify 2 for paired-end)
-n, --numReads Maximum number of reads to process from supplied input
-A, --append An existing mapping file that will be added on to
-k, --keep File containing a list of arrangements of tag names to keep
-r, --remove File containing a list of arrangements of tag names to remove/discard
-y, --keep-grp File containing a list of arrangements of tag groups to keep
-Y, --remove-grp File containing a list of arrangements of tag groups to remove/discard
-t, --threads Number of threads to use
-s, --summary File where summary statistics will be written to
-h, --help Displays usage information
--assign Assign reads to a final barcode sequence identifier based on tags present
--bclen The length of the final barcode sequence identifier (default: 16)
--inleaved Specifies that input is an interleaved FASTQ file
--keep-r1-r2 Use R1.fastq, R2.fastq, etc. file name formats when demultiplexing using --keep or --keep-grp
--remultiplex Turn on remultiplexing mode
--unmask Turn on unmasking mode (extract differences from a masked vs. unmasked FASTA)
--version Prints version number
--cite Prints citation information
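As a sketch of how these options might be combined when driving splitcode from a script, the Python snippet below assembles an argument list for a run that uses a config file. The helper `build_command` and the file names are hypothetical; only flags listed in the help menu above are used, and the check on the number of output files mirrors the note under --output (it does not account for --select). The command is only printed here, not executed.

```python
import shlex


def build_command(config, fastqs, outputs, mapping):
    """Assemble a splitcode argument list (hypothetical helper)."""
    # The help text states the number of output files should equal --nFastqs.
    if len(outputs) != len(fastqs):
        raise ValueError("expected one output FASTQ per input FASTQ")
    args = [
        "splitcode",
        "-c", config,
        f"--nFastqs={len(fastqs)}",
        "--assign",
        "-m", mapping,
        "-o", ",".join(outputs),
    ]
    return args + list(fastqs)


cmd = build_command(
    "config.txt",
    ["R1.fastq", "R2.fastq"],
    ["out_R1.fastq", "out_R2.fastq"],
    "mapping.txt",
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where splitcode is installed.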
Command-line config (optional)
This section is optional, and configuring splitcode on the command line is not the recommended way to run it. It is nonetheless possible to run splitcode without supplying a config file by specifying the config options on the command line. Let’s take a look at some of those options below:
Usage: splitcode [arguments] fastq-files
Sequence identification options (for configuring on the command-line):
-b, --tags List of tag sequences (comma-separated)
-d, --distances List of error distance (mismatch:indel:total) thresholds (comma-separated)
-l, --locations List of locations (file:pos1:pos2) (comma-separated)
-i, --ids List of tag names/identifiers (comma-separated)
-g, --groups List of tag group names (comma-separated)
-f, --minFinds List of minimum times a tag must be found in a read (comma-separated)
-F, --maxFinds List of maximum times a tag can be found in a read (comma-separated)
-j, --minFindsG List of minimum times tags in a group must be found in a read (comma-separated group_name:min_times)
-J, --maxFindsG List of maximum times tags in a group can be found in a read (comma-separated group_name:max_times)
-e, --exclude List of what to exclude from final barcode (comma-separated; 1 = exclude, 0 = include)
-L, --left List of what tags to include when trimming from the left (comma-separated; 1 = include, 0 = exclude)
-R, --right List of what tags to include when trimming from the right (comma-separated; 1 = include, 0 = exclude)
(Note: for --left/--right, can specify an included tag as 1:x where x = number of extra bp's to trim
from left/right side if that included tag is at the leftmost/rightmost position)
-a, --next List of what tag names must come immediately after each tag (comma-separated)
-v, --previous List of what tag names must come immediately before each tag (comma-separated)
(Note: for --next/--previous, specify tag names as {name} and specify tag group names as {{group}}
Can also specify the number of base pairs that must appear between the current tag and the next/previous tag.
E.g. {bc}4-12 means the next/previous tag is 4-12 bases away and has name 'bc')
-U, --subs Specifies sequence to substitute tag with when found in read (. = original sequence) (comma-separated)
-z, --partial5 Specifies tag may be truncated at the 5′ end (comma-separated min_match:mismatch_freq)
-Z, --partial3 Specifies tag may be truncated at the 3′ end (comma-separated min_match:mismatch_freq)
Read modification and extraction options (for configuring on the command-line):
-x, --extract Pattern(s) describing how to extract UMI and UMI-like sequences from reads
(E.g. {bc}2<umi_1[5]> means extract a 5-bp UMI sequence, called umi_1, 2 base pairs following the tag named 'bc')
--no-chain If an extraction pattern for a UMI/UMI-like sequence is matched multiple times, only extract based on the first match
-5, --trim-5 Number of base pairs to trim from the 5′-end of reads (comma-separated; one number per each FASTQ file in a run)
-3, --trim-3 Number of base pairs to trim from the 3′-end of reads (comma-separated; one number per each FASTQ file in a run)
-w, --filter-len Filter reads based on length (min_length:max_length)
-q, --qtrim Quality trimming threshold
--qtrim-5 Perform quality trimming from the 5′-end of reads of each FASTQ file
--qtrim-3 Perform quality trimming from the 3′-end of reads of each FASTQ file
--qtrim-pre Perform quality trimming before sequence identification operations
--qtrim-naive Perform quality trimming using a naive algorithm (i.e. trim until a base that meets the quality threshold is encountered)
--phred64 Use phred+64 encoded quality scores
-P, --prefix Bases that will prefix each final barcode sequence (useful for merging separate experiments)
-D, --min-delta When matching tags error-tolerantly, specifies how much worse the next best match must be than the best match
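The compact notations in the help text above (file:pos1:pos2 for locations, and {name} or {{group}} with an optional spacing range for --next/--previous) can be unpacked as in the following Python sketch. The function names, the returned dictionaries, and the assumption that all three location fields are always present are illustrative; splitcode's own parser may accept more forms than this.

```python
import re

# {name} = tag, {{group}} = tag group, optionally followed by min-max spacing.
_NEXT_RE = re.compile(
    r"(?:\{\{(?P<group>[^{}]+)\}\}|\{(?P<tag>[^{}]+)\})"
    r"(?:(?P<min>\d+)-(?P<max>\d+))?"
)


def parse_location(spec):
    """Split 'file:pos1:pos2' into integers (assumes all three are given)."""
    file_index, pos1, pos2 = (int(part) for part in spec.split(":"))
    return {"file": file_index, "start": pos1, "end": pos2}


def parse_next(spec):
    """Parse a --next/--previous entry such as '{bc}4-12' or '{{grp_B}}'."""
    match = _NEXT_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized specification: {spec!r}")
    is_group = match.group("group") is not None
    return {
        "name": match.group("group") if is_group else match.group("tag"),
        "is_group": is_group,
        "min_gap": int(match.group("min")) if match.group("min") else None,
        "max_gap": int(match.group("max")) if match.group("max") else None,
    }


print(parse_location("0:5:13"))
print(parse_next("{bc}4-12"))
print(parse_next("{{grp_B}}"))
```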
Returning to the Example, we can specify our configuration on the command line rather than in the config.txt file, writing the output to standard output with --pipe:
splitcode --nFastqs=2 --assign --pipe --mapping=mapping.txt \
-g "grp_A,grp_A,grp_A,grp_B,grp_B" \
-i "Barcode_A1,Barcode_A2,Barcode_A3,Barcode_B1,Barcode_B2" \
-b "AAGGA,GTGTG,CGTAT,GCGCAA,CCCGT" \
-d "1,1,1,0,0" -a "{{grp_B}},{{grp_B}},{{grp_B}},," \
-J "grp_A:1,grp_B:1" -l "0:0:5,0:0:5,0:0:5,0:5:100,0:5:100" \
-3 "0,4" -x "{{grp_B}}3<umi[8]>" \
R1.fastq R2.fastq
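In the command above, each per-tag option is a comma-separated list with one entry per tag, and empty entries (as in the last two positions of -a) stand for "no value". A small Python sketch, using only the values from that command, shows how the lists line up into one row per tag. The dictionary keys are labels chosen here for readability, and the per-group (-J), per-file (-3), and extraction (-x) options are left out because they are not per-tag lists.

```python
# Per-tag option values copied from the command above.
options = {
    "groups": "grp_A,grp_A,grp_A,grp_B,grp_B",
    "ids": "Barcode_A1,Barcode_A2,Barcode_A3,Barcode_B1,Barcode_B2",
    "tags": "AAGGA,GTGTG,CGTAT,GCGCAA,CCCGT",
    "distances": "1,1,1,0,0",
    "next": "{{grp_B}},{{grp_B}},{{grp_B}},,",
    "locations": "0:0:5,0:0:5,0:0:5,0:5:100,0:5:100",
}

# Splitting on commas keeps empty entries, so every list has one slot per tag.
columns = {name: value.split(",") for name, value in options.items()}
lengths = {name: len(values) for name, values in columns.items()}

if len(set(lengths.values())) != 1:
    raise ValueError(f"per-tag lists differ in length: {lengths}")

# Print one tab-separated row per tag, mirroring the config-file table layout.
print("\t".join(columns))
for row in zip(*columns.values()):
    print("\t".join(row))
```

A check like this can catch a missing comma before running splitcode, since a list that is one entry short would otherwise shift values onto the wrong tags.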