As mentioned in the initial introductory blog posts about bcftools, which you can read here, my goal is to cover all the individual bcftools commands in the form of blog post tutorials.
In this blog post, I will cover the bcftools query command. If you are not entirely familiar with bcftools, I recommend you check out my short introduction post about bcftools here; otherwise, continue reading.
Like in my other tutorial posts about bcftools, I will use my example of a VCF file and the corresponding BCF file generated from the VCF file for this and all consequent tutorials about bcftools. These files are named test.vcf and test.bcf file. The short description you can check in the blog post about the bcftools index command here.
This blog post is composed of three sections:
Short explanations for the most used bcftools query command parameters
Tutorial: The most common examples of how to use bcftools query
How to list samples / IDs from the VCF file using the bcftools query command?
How to select specific VCF columns and INFO and FORMAT fields?
How to quickly count the number of genetic variants in a VCF file?
How to select specific VCF columns and filter out rows based on the specific genomic region?
How to use the –allow-undef-tags option of the bcftools query command?
How to apply the filtering (include/exclude) parameter of the bcftools query command?
What is the bcftools query command?
As bcftools documentation states, the bcftools query command extracts specific fields from VCF or BCF files by applying specific filtering criteria, which finally outputs those fields in a user-defined format. This user-defined format is a text file containing specific columns.
The bcftools query command can be used to extract a wide range of information from a VCF file, basically, everything that is included in specific columns within the VCF file, such as variant and genotype information, quality scores, and annotations other information.
The two crucial aspects of the bcftools query command are that it allows:
Flexible selection of columns from within the VCF file and
Apply user-specified filtering criteria on rows.
Here is just a simple example, and for more examples, check the last section of the post here:
bcftools query -f '%CHROM\t%POS\t%INFO\n' -r chr2:1-1500 -i "GT==AA" file.vcf > output.txt
The above command does the following:
It selects chromosome (CHROM), base pair position (POS), and INFO column (parameter -f ) for the genetic variants located on chromosome 2 from position 1 to position 1500 (parameter -r) and where genetic variants have AA genotype (parameter -i). But, as I wrote, I provided more examples in the last section of this blog post. You can jump to that section by clicking here.
What are the command options for the bcftools query?
The bcftools query command has multiple parameters. I will here give you an overview of all the available parameters for the bcftools query command and ordered them by how frequently I used them:
Probably the most used parameter in the case of bcftools query commands, at least in my case, is -f, --format FORMAT. This parameter allows you to define specific columns from the VCF file which will be selected.
The following two most used parameters are: -e, --exclude EXPRESSION, which is used to exclude sites for which EXPRESSION is true, and -i, --include EXPRESSION to include only sites for which EXPRESSION is true. I will provide you with examples of valid expressions in the examples.
The following two parameters that I at least most often are: -r, --regions chr|chr:pos|chr:from-to|chr:from-[,…] to select for specific genomic regions on which filtering or selection of fields should be applied and -R, --regions-file file which is a parameter that does the same like the previous one, but where you provide a file with the list of regions.
Another bcftools query parameter that I often use is -l, --list-samples which list sample names within provided VCF file. This often comes as a handy parameter.
Of course, this parameter is used by default -o, --output FILE as it defines the output file.
Now, the remaining parameters are ones that I rarely use or, in some cases, rarely use except for the purpose of learning them, and those are:
-s, --samples LIST: see Common Options
-S, --samples-file FILE: see Common Options
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,…]: see Common Options
-T, --targets-file file: see Common Options
--targets-overlap 0|1|2: see Common Options
--force-samples, if provided, it continues with selecting the specific list of samples even when some samples requested via -s/-S do not exist.
-H, --print-header prints the header of the VCF file
--regions-overlap 0|1|2: see Common Options
-u, --allow-undef-tags do not throw an error if there are undefined tags in the format string, print "." instead
-v, --vcf-list FILE: process multiple VCFs listed in the file
Tutorial: The most common examples of how to use bcftools query
In this final section of this blog post, I will do a short tutorial by going through the most common examples of how I use the bcftools query command:
How to list samples / IDs from the VCF file using the bcftools query command?
A handy use of the bcftools query command is to list all samples/IDs for which your VCF file contains genotype data:
bcftools query -l input_file.vcf
You can also stream output from this command directly to an output file using following command:
bcftools query -l input_file.vcf > list_of_samples.txt
How to create a BED file using the bcftools query command?
Another thing for which the bcftools query command can be used is to create a BED file from your VCF file in the following way:
# Make a BED file: chr, pos (0-based), end pos (1-based), id
bcftools query -f ‘%CHROM\t%POS0\t%END\t%ID\n’ input_file.vcf -o output.bed
For this purpose, we used the -f parameter to define the fields we want to extract and those being chromosome (CHROM), starting position (POS0), end position (END), and variant id (ID), and saved everything into output.bed file. As you can see, you can select POS0 for 0-based coordinate indexing (a straightforward tutorial to 0-based and 1-based indexing check following BioStars post.
How to select specific VCF columns and INFO and FORMAT fields?
As I illustrated in the previous example, the bcftools query command can be used to select specific VCF columns/fields. Here are a couple of examples.
If you want to extract chromosome, position, reference allele, and alternative allele and save those into a separate file, you can use the following command:
bcftools query -f '%CHROM %POS %REF %ALT\n' input_file.vcf -o output.txt
And if the input_file.vcf is large, you can stream output to a compression command such as bgzip:
bcftools query -f '%CHROM %POS %REF %ALT\n' input_file.vcf | bgzip > -o output.txt
Of course, this command can be combined with many different fields available in standard VCF files like for example:
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%ID]\n' input_file.vcf -o output.txt
In the above example, besides the above-listed fields, I also included the ID field, which you can see enclosed in brackets which is necessary for FORMAT fields if provided in -f parameter string.
How to quickly count the number of genetic variants in a VCF file?
You can also use bcftools to quickly count how many variants or rows there are in your file using the following combination of bcftools and the wc (word count) command:
bcftools query -f '%POS\n' input_file.vcf.gz | wc
You can also simply use the bcftools stats command for this:
bcftools stats input_file.vcf.gz -o input_file.stats
How to select specific VCF columns and filter out rows based on the specific genomic region?
This is rather simple by combining the -f parameter we already introduced and -r parameter in the following way:
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' input_file.vcf.gz -r 1:10000-50000 -o output.txt
The above command selects chromosome, position, reference, and alternative allele fields for the genetic variants located on chromosome 1 in positions between 10000 and 50000.
How to use the header option of the bcftools query command?
If you want to include column/field names as the header in the output file, you can do this by using the -H parameter. So, if we recycle the previous example, it would be like this:
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\n' input_file.vcf.gz -r 1:10000-50000 -o output.txt
So, the only difference between this and the previous command is that the output of this command will include chromosome, position, reference, and alternative allele IDs in the output.txt file.
How to use the –allow-undef-tags option of the bcftools query command?
Another handy parameter is -u, --allow-undef-tags which, when used, does not raise errors if in the -f parameter string are some undefined tags, and if there are some undefined tags, then the bcftools query command will replace missing values of those tags with “.”.
An example of the bcftools command would be:
bcftools query -H --allow-undef-tags -f '%CHROM\t%POS\t%REF\t%ALT\n' input_file.vcf.gz -r 1:10000-50000 -o output.txt
Or
bcftools query -H -u -f '%CHROM\t%POS\t%REF\t%ALT\n' input_file.vcf.gz -r 1:10000-50000 -o output.txt
How to apply the filtering (include/exclude) parameter of the bcftools query command?
Other often used parameter of the bcftools query command is the -i parameter for including rows meeting specific criteria or the -e parameter for excluding rows meeting specific criteria. Here are a couple of examples.
If you want to select genetic variants/rows with the call rate above 90:
bcftools query -i 'INFO/CR>90' -f '%CHROM\t%POS\t%REF\t%ALT\t%FORMAT\t%CR\n' input_file.vcf -o output_include_cr_above90.txt
Or you can use exclude option:
bcftools query -H -e 'INFO/CR<90' -f '%CHROM\t%POS\t%REF\t%ALT\t%FORMAT\t%CR\n' input_file.vcf -o output_exclude_cr_below90.txt
Also, you can use logical statements within the -i (include) and the -e (exclude) parameter strings. For example, in the next example, to include only row/genetic variants for which you have a call rate above 80 and below 90. This concrete example does not make much sense in reality, but I just want to give you an illustration of what can be done:
bcftools query -H -i 'INFO/CR>80 & INFO/CR<90' -f '%CHROM\t%POS\t%REF\t%ALT\t%FORMAT\t%CR\n' input_file.vcf -o output_exclude_cr_below90.txt
Now, a more meaningful example would be the following one, where you select genetic variants with read depth (DP) greater than 10 and quality score (QUAL) above 20. Just a note, I do not have DP and QUAL scores in my example file, but as I said, this is for illustration purposes:
bcftools query -H -i 'QUAL>20 && DP>10' -f '%CHROM\t%POS\t%REF\t%ALT\t%FORMAT\t%CR\n' input_file.vcf -o output_exclude_cr_below90.txt
I will not present all possible combinations of how to use the bcftool query but will leave it to you to play around. You can use different fields available within your VCF file to make specific filtering.
How to make bcftools query for a specific set of samples?
All the different types of filtering options introduced in previous examples such as selecting specific fields using the -f option, selecting for the specific genomic regions using the -r, and applying inclusion or exclusion criteria using the -i or -e parameter, can also be combined with -s samples parameter where the list of specific samples is provided for which bcftools query needs to be applied or providing the list of specific samples in the form of a file using -S sample file parameter. Here is an example below for one of the previous examples:
bcftools query -H -i 'QUAL>20 && DP>10' -f '%CHROM\t%POS\t%REF\t%ALT\t%FORMAT\t%CR\n' -s 'ID001,ID002,ID003' input_file.vcf -o output_samples.txt
The alternative for the above example would be the use of the -S parameter in the following way:
bcftools query -H -i 'QUAL>20 && DP>10' -f '%CHROM\t%POS\t%REF\t%ALT\t%FORMAT\t%CR\n' -S samples.txt input_file.vcf -o output_samples.txt
I am finishing this tutorial section and the whole post with these two examples. I hope you will find it useful for your everyday work.
Comments