User guide: Extraction
Extraction
Sequences that you want to extract can be specified in the config file header using the @extract
directive followed by an expression of what you want to extract and how. The extraction options are listed below:
Extracting relative to a tag or tag group
To extract relative to a tag or tag group, specify the following four things:
The tag or group name (tag IDs should be enclosed in single curly braces, i.e. {tag_id}, whereas group names should be enclosed in double curly braces, i.e. {{group_name}}).
An optional spacer denoting how many bp’s away from the tag the extraction should begin.
The name you want to give the sequence you want to extract.
The length of the sequence you want to extract.
For example, to extract a 6-bp sequence, which you decide to name xxx, immediately following identification of the tag with tag ID: BC, you’d write in the config file header:
@extract {BC}<xxx[6]>
Now let’s say you want to extract the 6-bp sequence 2 bp’s following identification of BC. You’d then write instead:
@extract {BC}2<xxx[6]>
You could also extract the 6-bp sequence 2 bp’s before BC via:
@extract <xxx[6]>2{BC}
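To make the spacer and length semantics concrete, here is a minimal Python sketch that mimics the three patterns above on a single read. It is an illustration only, not splitcode's implementation: the helper names, the tag sequence ACGT standing in for BC, exact (mismatch-free) matching, and returning an empty string when too few bases are available are all assumptions made for this example.

```python
def extract_after(read, tag, spacer, length):
    """Mimic {TAG}spacer<name[length]>: take `length` bases that start
    `spacer` bases after the first exact occurrence of `tag`."""
    i = read.find(tag)
    if i == -1:
        return ""  # tag not found (assumed behavior: empty extraction)
    start = i + len(tag) + spacer
    seq = read[start:start + length]
    return seq if len(seq) == length else ""


def extract_before(read, tag, spacer, length):
    """Mimic <name[length]>spacer{TAG}: take `length` bases that end
    `spacer` bases before the first exact occurrence of `tag`."""
    i = read.find(tag)
    if i == -1:
        return ""
    end = i - spacer
    start = end - length
    return read[start:end] if start >= 0 else ""


# Hypothetical read; "ACGT" stands in for the sequence of tag BC.
read = "TTGGCCAAACGTAATTCCGGTT"
print(extract_after(read, "ACGT", 0, 6))   # like {BC}<xxx[6]>   -> AATTCC
print(extract_after(read, "ACGT", 2, 6))   # like {BC}2<xxx[6]>  -> TTCCGG
print(extract_before(read, "ACGT", 2, 6))  # like <xxx[6]>2{BC}  -> TTGGCC
```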
What happens because we named it xxx? The output file will be named xxx.fastq, or xxx.fastq.gz if we’re working with gzip-compressed files. If using --pipe, the output gets interleaved into the standard output stream and the extracted sequence will appear in the output right before the read sequences.
See also
- Example
The example in this documentation provides a sample usage of the @extract directive.
Extracting relative to a location
In addition to extracting sequences relative to a tag, you can also extract sequences relative to a location (i.e. a specific file at a specific read position). The location is specified as file:position where file is the zero-indexed file number (i.e. file #0, file #1, etc.) and position is the position within the read (again, zero-indexed, such that 0 means you’re starting at the beginning of the read).
For example, given two files: R1.fastq and R2.fastq, to extract an 8-bp sequence (named xxx) following the first 10 bp’s of R2.fastq, you’d write:
@extract 1:10<xxx[8]>
Additionally, you can use -1 as the position if you want to extract a sequence at the end of the read; for example, you can extract the last 8 bp of reads in R2.fastq by writing:
@extract <xxx[8]>1:-1
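The zero-indexed file:position addressing maps naturally onto list and string indexing. The following Python sketch shows the two location-based patterns above under that reading. The function names and the read sequences are hypothetical, and this is not how splitcode itself is implemented.

```python
def extract_from_position(reads, file_idx, pos, length):
    """Mimic file:pos<name[length]>: take `length` bases starting at the
    zero-indexed position `pos` of the read from file number `file_idx`."""
    read = reads[file_idx]
    return read[pos:pos + length]


def extract_from_end(reads, file_idx, length):
    """Mimic <name[length]>file:-1: take the last `length` bases of the read."""
    read = reads[file_idx]
    return read[-length:] if len(read) >= length else ""


# One hypothetical read pair: index 0 is from R1.fastq, index 1 from R2.fastq.
reads = ["AAAACCCCGGGGTTTT", "ACGTACGTACTTGGCCAATTT"]
print(extract_from_position(reads, 1, 10, 8))  # like 1:10<xxx[8]> -> TTGGCCAA
print(extract_from_end(reads, 1, 8))           # like <xxx[8]>1:-1 -> GCCAATTT
```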
Extracting between two things
splitcode allows you to extract sequences between two tags or between a location and a tag (in effect, sandwiching a sequence to be extracted). In this configuration, you don’t need to specify a length for the sequence you want to extract. Here are some examples:
Extracting between two tags: If you want to extract a sequence between a tag with tag id tag_A and a tag in the group group_1, you can write:
@extract {tag_A}<xxx>{{group_1}}
Extracting between a tag and a location: If you want to extract a sequence between the tag tag_A and position 30 of the reads in the FASTQ file #0, you can write the following:
@extract {tag_A}<xxx>0:30
Tip
The extraction can sometimes fail. For example, suppose you enter the following:
@extract {tag_A}<xxx>{tag_B}
If no instance of tag_A followed by tag_B is encountered, the extracted sequence will be empty. You can also put more constraints on the extraction: say you want to extract between tag_A and the end of the read in FASTQ file #0, but only if the extracted sequence is between 2 and 4 bp’s in length. You can specify this as:
@extract {tag_A}<xxx[2-4]>0:-1
If this criterion is not met, the extracted sequence will be empty.
Tip
You can still use spacers when extracting between two tags. For example, if you want the extraction to begin 1 bp after tag_A and end 2 bp’s before tag_B, you’d write:
@extract {tag_A}1<xxx>2{tag_B}
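The "sandwich" patterns in this section, including the optional length range and spacers, can be sketched in Python as follows. This is a simplified model rather than splitcode's actual algorithm: the function name, the tag sequences AAA and TTT, exact matching, and using right_tag=None to stand for "end of the read" are assumptions made for illustration.

```python
def extract_between(read, left_tag, right_tag=None, left_spacer=0,
                    right_spacer=0, min_len=None, max_len=None):
    """Return the bases between `left_tag` and `right_tag`.

    right_tag=None stands for the end of the read (like 0:-1).
    An empty string is returned whenever the extraction fails.
    """
    i = read.find(left_tag)
    if i == -1:
        return ""  # left tag never encountered
    start = i + len(left_tag)
    j = len(read) if right_tag is None else read.find(right_tag, start)
    if j == -1:
        return ""  # left tag is not followed by the right tag
    lo, hi = start + left_spacer, j - right_spacer
    if hi < lo:
        return ""  # spacers leave nothing to extract
    seq = read[lo:hi]
    if min_len is not None and not (min_len <= len(seq) <= max_len):
        return ""  # length constraint such as [2-4] not met
    return seq


# Hypothetical read; "AAA" and "TTT" stand in for tag_A and tag_B.
read = "GGAAACCCTGCATTTGG"
print(extract_between(read, "AAA", "TTT"))                       # CCCTGCA
print(extract_between(read, "AAA", "TTT", 1, 2))                 # CCTG
print(extract_between(read, "TTT", None, min_len=2, max_len=4))  # GG
```

Returning an empty string for every failure case mirrors the tip above: a missing tag, a tag in the wrong order, or an out-of-range length all leave the extracted sequence empty.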
Reverse complementing extracted sequence
You can extract the reverse complement of a sequence by putting a ~ in front of the extracted sequence name. For example, to extract the reverse complement of the 8-bp sequence immediately following the tag tag_A, do the following:
@extract {tag_A}<~xxx[8]>
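As a reminder of what reverse complementing does to the extracted bases, here is a small Python sketch. The helper name is hypothetical, and passing N through unchanged is an assumption of this example rather than documented splitcode behavior.

```python
# Translation table mapping each base to its complement (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(_COMPLEMENT)[::-1]


# If the 8 bp following tag_A were AACCGGTA, <~xxx[8]> would yield:
print(reverse_complement("AACCGGTA"))  # TACCGGTT
```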
Appending/Prepending to extracted sequence
Use ^...^ to prepend a sequence to an extraction pattern or ^^...^^ to append a sequence. For example, <^AG^xxx> means prepending AG to xxx when it is extracted.
Multiple extraction sequences
Multiple instances of a single extraction within a read: They get stitched together. For example, if the following is encountered multiple times in a read:
@extract {tag_A}<xxx[8]>
splitcode will extract all instances of the 8-bp sequence following tag_A whenever tag_A is identified within the read. All those 8-bp extracted sequences will be stitched together into a single sequence in the final xxx.fastq file (or in the interleaved output when using --pipe).
Multiple extractions specified (comma-separated) using the same name: Each extraction gets stitched together. For example:
@extract {tag_A}<xxx[8]>,{{group_2}}<xxx[3]>
With each encounter of tag_A or group_2, splitcode will keep adding on the extracted sequence (the next 8 bp’s for tag_A or the next 3 bp’s for group_2) to form a single final sequence that gets placed in the resulting xxx.fastq file (or in the interleaved output when using --pipe).
Multiple extractions specified (comma-separated) using different names: Each extraction gets put in a different file. For example:
@extract {tag_A}<xxx[8]>,{{group_2}}<rrr[3]>
For each encounter of tag_A, splitcode will extract the 8 bp’s following it and put them into xxx.fastq. For each encounter of group_2, splitcode will extract the 3 bp’s following it and put them into rrr.fastq. If, instead, --pipe is used as output, the interleaved output will consist of two separate read sequences: the xxx read sequence, then the rrr read sequence, right before the rest of the output read sequences.
Tip
@no-chain specified in the config file header can disable the “stitching” behavior such that only the first encounter of a name (e.g. xxx) is extracted.
@no-chain xxx,yy means that the stitching behavior is disabled only for the names xxx and yy.
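The stitching rules, and the effect of disabling them for selected names, can be modeled with the Python sketch below. It is a simplified illustration, not splitcode's implementation. The assumptions are: the function name and the rule format, the tag sequences ACGT (for tag_A) and TTGG (for a tag in group_2), exact non-overlapping matching, and stitching in left-to-right read order.

```python
def extract_all(read, rules, no_chain=()):
    """Apply several (tag_sequence, name, length) rules to one read.

    Extractions sharing a name are concatenated in read order, except
    for names listed in `no_chain`, which keep only their first hit.
    """
    hits = []
    for tag, name, length in rules:
        pos = read.find(tag)
        while pos != -1:
            start = pos + len(tag)
            hits.append((start, name, read[start:start + length]))
            pos = read.find(tag, start)  # continue after this occurrence
    hits.sort()  # order hits by where they occur in the read

    out = {}
    for _, name, seq in hits:
        if name in out and name in no_chain:
            continue  # chaining disabled: ignore later encounters
        out[name] = out.get(name, "") + seq
    return out


# Hypothetical read containing ACGT twice and TTGG once.
read = "ACGTAAAAAAAATTACGTCCCCCCCCGGTTGGAAA"

same_name = [("ACGT", "xxx", 8), ("TTGG", "xxx", 3)]
print(extract_all(read, same_name))
# {'xxx': 'AAAAAAAACCCCCCCCAAA'}

two_names = [("ACGT", "xxx", 8), ("TTGG", "rrr", 3)]
print(extract_all(read, two_names))
# {'xxx': 'AAAAAAAACCCCCCCC', 'rrr': 'AAA'}

print(extract_all(read, two_names, no_chain=("xxx",)))
# {'xxx': 'AAAAAAAA', 'rrr': 'AAA'}
```

Each key of the returned dictionary corresponds to one output name, i.e. one file such as xxx.fastq or rrr.fastq.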
Tip
In the config file, you can perform multiple extractions by putting @extract multiple times over separate lines (i.e. one pattern per line), rather than comma-separating the multiple extraction patterns.
Output options
In addition to the default output options (named FASTQ files or interleaved output via pipe), there are additional output options you can specify for extracted sequences:
--empty: Use this to enter a sequence to put in place of empty sequences in case the final extracted read sequence has nothing in it.
--empty-remove: Use this to completely remove empty sequences from output (Warning: This will end up breaking the proper pairing of reads with one another).
Note: The two options above apply to all read sequences that might be empty, not just extracted read sequences.
--x-only: Use this to output only the extracted sequences (and the final barcodes if --assign is supplied) but do NOT output the other read sequences.
--x-names: Use this to put the extracted sequences into the header of the FASTQ file. By default, the extracted sequences will be prepended with the SAM tag RX:Z:, such as RX:Z:GATGATGG or RX:Z:GATGATGG-ATCC (in the case that two different names are given) in the read name header. This is so that downstream tools can make use of the SAM tags.
--no-x-out: Use this to not output extracted sequences; this should be used with --x-names because if that option is supplied, the extracted sequences will still appear in the read name header. In other words, use these two options together if you prefer your extracted sequences to be in the read header (e.g. as SAM tags) rather than outputted in FASTQ format.
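To illustrate the RX:Z: header format described for --x-names, the following Python sketch builds such a tag from a list of extracted sequences and appends it to a read name. The function name, the read name, and the single space separating the read name from the tag are assumptions made for this example. Only the RX:Z: prefix and the hyphen-joined form come from the description above.

```python
def header_with_rx(read_name, extracted):
    """Append an RX:Z: tag holding the extracted sequences to a read name.

    Sequences from different extraction names are joined with '-'.
    The space before the tag is an assumption of this sketch.
    """
    if not extracted:
        return read_name
    return f"{read_name} RX:Z:{'-'.join(extracted)}"


print(header_with_rx("@read1", ["GATGATGG"]))          # @read1 RX:Z:GATGATGG
print(header_with_rx("@read1", ["GATGATGG", "ATCC"]))  # @read1 RX:Z:GATGATGG-ATCC
```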