Reference guide =============== Config file options ^^^^^^^^^^^^^^^^^^^ Table Options ~~~~~~~~~~~~~ These options are supplied as a tab-delimited table in the config file, with each option being a column. These options are used for tag sequence identification. (Note: the tags, ids, groups, distances, locations, and subs option can be specified either singular or plural, i.e. you can write *tag* instead of *tags*) .. list-table:: :widths: 15 35 35 15 :header-rows: 1 * - Option - Description - Additional info - Example * - tags - Tag sequence - String of ATCG bases. Alternately, can supply a file containing multiple tag sequences. - GGATC * - ids - Tag name/ID - - tag_A * - groups - Tag group name/ID - Tags can be grouped together under a group name. - grp_A * - distances - Allowable error tolerance - Supports setting hamming distance, indel, and total error (hamming+indel) allowance. - 2 * - locations - Where a tag should be searched for in a read - Can specify file, start position, and end position. - 0:5:13 * - minFinds - Minimum number of times a tag must be found in a read - If this isn’t met, the read is discarded - 3 * - minFindsG - Minimum number of times a tag *group* must be found in a read - If this isn’t met, the read is discarded - 3 * - maxFinds - Maximum number of times a tag must be found in a read - Once this is reached, the program simply stops looking for that tag - 5 * - maxFindsG - Maximum number of times a tag *group* must be found in a read - Once this is reached, the program simply stops looking for any tag belonging to that group - 5 * - left - Whether tag should be a left trimming point (0 = no; 1 = yes) - At the location the tag is found, that tag and all bases to the left of the tag in the read are removed - 1 * - right - Whether tag should be a right trimming point (0 = no; 1 = yes) - At the location the tag is found, that tag and all bases to the right of the tag in the read are removed - 0 * - next - What tag ID or group ID must come after the tag - When the tag is found, only the tag ID or group ID specified as "next" will be searched for - {tag_A} * - previous - What tag ID or group ID must come before the tag - The tag will not be searched for unless the tag ID or group ID specified as "previous" was found right before - {{grp_A}} * - subs - Sequence to substitute tag with when tag is found in read - Note: Useful for error correction: one can specify substituting the original tag sequence in if a mismatched version was found - NNNN * - partial5 - Identifies sequences that may be truncated at the 5′ end - Specify the minimum bp's that must match and the allowable substitution mismatch frequency (min_match:mismatch_freq) - 3:0.1 * - partial3 - Identifies sequences that may be truncated at the 3′ end - Specify the minimum bp's that must match and the allowable substitution mismatch frequency (min_match:mismatch_freq) - 3:0.1 * - revcomp - Whether sequence's reverse complement should be identified (0 = no; 1 = yes) - - 1 Header Options ~~~~~~~~~~~~~~ These options are supplied at the very beginning of the config file, with each option being a line that begins with ``@``. These options are used for read modification and extraction. .. list-table:: :widths: 20 20 60 :header-rows: 1 * - Option - Example - Description * - @extract - {tag_A} - Extracts UMI-like sequences * - @no-chain - - Disable stitching multiple extracted sequences together * - @trim-5 - 4 - Number of bases to trim from 5′ end (done first before any tag operations or other trimming operations) * - @trim-3 - 6 - Number of bases to trim from 3′ end (done first before any tag operations or other trimming operations) * - @filter-len - 10-100 - Filter reads based on length (min_length:max_length) * - @qtrim - 30 - Threshold for quality trimming (uses `cutadapt algorithm `_) * - @qtrim-naive - - Switch quality trimming algorithm to naive one (trim until a base that meets quality threshold is found) * - @qtrim-5 - - Enable quality trimming from 5′ end of each read * - @qtrim-3 - - Enable quality trimming from 3′ end of each read * - @qtrim-pre - - Do quality trimming first (i.e. before all the operations involving tags) * - @phred64 - - Use the old phred+64 quality scores instead of the newer phred+33 scores * - @prefix - CG - Bases that will prefix each 16-bp final barcode sequence (useful for merging separate experiments) Command-line options ^^^^^^^^^^^^^^^^^^^^ splitcode help menu which can be accessed via ``splitcode -h`` .. note:: Since this manual uses the splitcode config file for setting config options, the command-line arguments that set the config options are not necessary and therefore are not shown. We discuss those arguments in the following section. .. code-block :: text :caption: splitcode help menu Usage: splitcode [arguments] fastq-files Options (configurations supplied in a file): -c, --config Configuration file Output Options: -m, --mapping Output file where the mapping between final barcode sequences and names will be written -o, --output FASTQ file(s) where output will be written (comma-separated) Number of output FASTQ files should equal --nFastqs (unless --select is provided) -O, --outb FASTQ file where final barcodes will be written If not supplied, final barcodes are prepended to reads of first FASTQ file (or as the first read for --pipe) -u, --unassigned FASTQ file(s) where output of unassigned reads will be written (comma-separated) Number of FASTQ files should equal --nFastqs (unless --select is provided) -E, --empty Sequence to fill in empty reads in output FASTQ files (default: no sequence is used to fill in those reads) --empty-remove Empty reads are stripped in output FASTQ files (don't even output an empty sequence) -p, --pipe Write to standard output (instead of output FASTQ files) -S, --select Select which FASTQ files to output (comma-separated) (e.g. 0,1,3 = Output files #0, #1, #3) --gzip Output compressed gzip'ed FASTQ files --out-fasta Output in FASTA format rather than FASTQ format --out-bam Output a BAM file rather than FASTQ files (enter the output BAM file name to -o or --output) --keep-com Preserve the comments of the read names of the input FASTQ file(s) --no-output Don't output any sequences --no-outb Don't output final barcode sequences --no-x-out Don't output extracted UMI-like sequences (should be used with --x-names) --mod-names Modify names of outputted sequences to include identified tag names --com-names Modify names of outputted sequences to include final barcode sequence ID --seq-names Modify names of outputted sequences to include the sequences of identified tags --loc-names Modify names of outputted sequences to include found tag names and locations --x-names Modify names of outputted sequences to include extracted UMI-like sequences --x-only Only output extracted UMI-like sequences --bc-names Modify names of outputted sequences to include final barcode sequence string -X, --sub-assign Assign reads to a secondary sequence ID based on a subset of tags present (must be used with --assign) (e.g. 0,2 = Generate unique ID based the tags present by subsetting those tags to tag #0 and tag #2 only) The names of the outputted sequences will be modified to include this secondary sequence ID -C --compress Set the gzip compression level (default: 1) (range: 1-9) -M --sam-tags Modify the default SAM tags (default: CB:Z:,RX:Z:,BI:i:,SI:i:,BC:Z:,LX:Z:,YM:Z:) Other Options: -N, --nFastqs Number of FASTQ file(s) per run (default: 1) (specify 2 for paired-end) -n, --numReads Maximum number of reads to process from supplied input -A, --append An existing mapping file that will be added on to -k, --keep File containing a list of arrangements of tag names to keep -r, --remove File containing a list of arrangements of tag names to remove/discard -y, --keep-grp File containing a list of arrangements of tag groups to keep -Y, --remove-grp File containing a list of arrangements of tag groups to remove/discard -t, --threads Number of threads to use -s, --summary File where summary statistics will be written to -h, --help Displays usage information --assign Assign reads to a final barcode sequence identifier based on tags present --bclen The length of the final barcode sequence identifier (default: 16) --inleaved Specifies that input is an interleaved FASTQ file --keep-r1-r2 Use R1.fastq, R2.fastq, etc. file name formats when demultiplexing using --keep or --keep-grp --remultiplex Turn on remultiplexing mode --unmask Turn on unmasking mode (extract differences from a masked vs. unmasked FASTA) --version Prints version number --cite Prints citation information Command-line config (optional) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section is highly optional and is not the recommended way to run splitcode. It is possible to run splitcode without supplying a config file. To do this, we can specify the config options on the command-line. Let's take a look at some of those options below: .. code-block :: text :caption: splitcode help menu Usage: splitcode [arguments] fastq-files Sequence identification options (for configuring on the command-line): -b, --tags List of tag sequences (comma-separated) -d, --distances List of error distance (mismatch:indel:total) thresholds (comma-separated) -l, --locations List of locations (file:pos1:pos2) (comma-separated) -i, --ids List of tag names/identifiers (comma-separated) -g, --groups List of tag group names (comma-separated) -f, --minFinds List of minimum times a tag must be found in a read (comma-separated) -F, --maxFinds List of maximum times a tag can be found in a read (comma-separated) -j, --minFindsG List of minimum times tags in a group must be found in a read (comma-separated group_name:min_times) -J, --maxFindsG List of maximum times tags in a group can be found in a read (comma-separated group_name:max_times) -e, --exclude List of what to exclude from final barcode (comma-separated; 1 = exclude, 0 = include) -L, --left List of what tags to include when trimming from the left (comma-separated; 1 = include, 0 = exclude) -R, --right List of what tags to include when trimming from the right (comma-separated; 1 = include, 0 = exclude) (Note: for --left/--right, can specify an included tag as 1:x where x = number of extra bp's to trim from left/right side if that included tag is at the leftmost/rightmost position) -a, --next List of what tag names must come immediately after each tag (comma-separated) -v, --previous List of what tag names must come immediately before each tag (comma-separated) (Note: for --next/--previous, specify tag names as {name} and specify tag group names as {{group}} Can also specify the number of base pairs that must appear between the current tag and the next/previous tag. E.g. {bc}4-12 means the next/previous tag is 4-12 bases away and has name 'bc') -U, --subs Specifies sequence to substitute tag with when found in read (. = original sequence) (comma-separated) -z, --partial5 Specifies tag may be truncated at the 5′ end (comma-separated min_match:mismatch_freq) -Z, --partial3 Specifies tag may be truncated at the 3′ end (comma-separated min_match:mismatch_freq) Read modification and extraction options (for configuring on the command-line): -x, --extract Pattern(s) describing how to extract UMI and UMI-like sequences from reads (E.g. {bc}2 means extract a 5-bp UMI sequence, called umi_1, 2 base pairs following the tag named 'bc') --no-chain If an extraction pattern for a UMI/UMI-like sequence is matched multiple times, only extract based on the first match -5, --trim-5 Number of base pairs to trim from the 5′-end of reads (comma-separated; one number per each FASTQ file in a run) -3, --trim-3 Number of base pairs to trim from the 3′-end of reads (comma-separated; one number per each FASTQ file in a run) -w, --filter-len Filter reads based on length (min_length:max_length) -q, --qtrim Quality trimming threshold --qtrim-5 Perform quality trimming from the 5′-end of reads of each FASTQ file --qtrim-3 Perform quality trimming from the 3′-end of reads of each FASTQ file --qtrim-pre Perform quality trimming before sequence identification operations --qtrim-naive Perform quality trimming using a naive algorithm (i.e. trim until a base that meets the quality threshold is encountered) --phred64 Use phred+64 encoded quality scores -P, --prefix Bases that will prefix each final barcode sequence (useful for merging separate experiments) -D, --min-delta When matching tags error-tolerantly, specifies how much worse the next best match must be than the best match From the :ref:`example page`, we can use the command-line rather than the `config.txt `_ file to specify our configurations we which then write as output: .. code-block:: shell splitcode --nFastqs=2 --assign --pipe --mapping=mapping.txt \ -g "grp_A,grp_A,grp_A,grp_B,grp_B" \ -i "Barcode_A1,Barcode_A2,Barcode_A3,Barcode_B1,Barcode_B2" \ -b "AAGGA,GTGTG,CGTAT,GCGCAA,CCCGT" \ -d "1,1,1,0,0" -a "{{grp_B}},{{grp_B}},{{grp_B}},," \ -J "grp_A:1,grp_B:1" -l "0:0:5,0:0:5,0:0:5,0:5:100,0:5:100" \ -3 "0,4" -x "{{grp_B}}3" \ R1.fastq R2.fastq