Dr Edin Hamzić

bcftools index: How to create index for VCF files?

Updated: Jan 29

As mentioned in the introductory blog posts about bcftools, I will cover all the individual bcftools commands. In those posts, I explained how bcftools commands are divided into three groups:

  • Group 1: The group of indexing tools, which contains only one command: bcftools index

  • Group 2: The group of VCF/BCF manipulation commands

  • Group 3: The group of VCF/BCF analysis commands


The first command I will cover, as you can guess from the post title, is bcftools index. That makes sense because, generally, the first step before doing any analysis is to build the index file for the VCF file(s) you are working on.


Still, in this post, I will also refer to two other bcftools commands that you might need before you apply bcftools index, namely bcftools view and bcftools sort, as well as some other command-line tools like tabix, bgzip, and sort, but more about that later.


Before I jump to concrete examples of how bcftools index is used, let me first explain that I will use one example VCF file for this and all subsequent blog posts about bcftools commands. This file is named input_file.vcf.


Btw, if you are interested in tutorials focused on other bcftools commands check my other blog posts:



What is the bcftools index command? Why do we create index files? Why are those important?


Before I start with practical examples, let me briefly explain what the bcftools index command is all about. This command (bcftools index) is a command-line tool for indexing VCF (variant call format) and BCF (binary variant call format) files, which are used to store genomic variation data.


The VCF index allows fast access to specific regions or positions in the file, which is helpful for tasks such as filtering or extracting data. This is especially important when VCF files are huge and complex. In that case, an index lets genomic processing tools such as bcftools access and process the data in the VCF file more efficiently.
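To make the benefit more concrete, here is a minimal sketch of a region query, which is the kind of lookup that relies on an index. The helper name extract_region and the contig name chr1 are my own assumptions for illustration, and the input is assumed to be already bgzip-compressed and indexed as described later in this post.

```shell
# Hypothetical helper: print only the records that overlap one region.
# Usage: extract_region input_file.vcf.gz chr1:10000-20000
extract_region() {
    vcf=$1      # bgzip-compressed VCF with a .csi or .tbi index next to it
    region=$2   # e.g. chr1:10000-20000 (chr1 is an assumed contig name)
    # -r restricts the output to the region, -H drops the header lines
    bcftools view -H -r "$region" "$vcf"
}
```

Without an index next to the file, bcftools view refuses the -r option, which is why building the index usually comes first.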


I will present an example of how to create the index file. Still, I will probably expand this blog post to cover other bcftools index options and capabilities. The bcftools index command has several options, and those are:


  1. Option -c, --csi generates a CSI (coordinate-sorted index) file. This is the default format, and it supports indexing chromosomes up to a length of 2^31.

  2. Option -f, --force will overwrite the index file if it already exists

  3. Option -m, --min-shift INT sets minimal interval size for CSI indices to 2^INT. The default value is 14.

  4. Option -o, --output FILE is used to define the name of the output index file. If not provided, the index is named after the input file with a .csi or .tbi extension added

  5. Option -t, --tbi is used to generate a TBI-format index for VCF files. This format supports chromosomes up to a length of 2^29.

  6. Option --threads INT sets the number of worker threads used for multithreading. The default value is 0. According to the bcftools documentation, this option is currently used only for the compression of the output stream, when --output-type is b or z.

  7. Option -n, --nrecords is used if you want to print the number of records based on the CSI or TBI index files

  8. Option -s, --stats is used to generate statistics per contig based on the CSI or TBI index files. Output format is three tab-delimited columns listing the contig name, contig length (. if unknown) and number of records for the contig. Contigs with zero records are not printed.
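A short sketch of how a few of the options above can be combined. The helper name index_report is hypothetical, and the input is assumed to be a sorted, bgzip-compressed VCF file.

```shell
# Hypothetical helper: (re)build a CSI index and summarise what it contains.
# Usage: index_report input_file.vcf.gz
index_report() {
    vcf=$1
    # -f overwrites an existing index, -c asks for CSI (the default format)
    bcftools index -f -c "$vcf" || return 1
    # -n prints the number of records, read from the index
    echo "records: $(bcftools index -n "$vcf")"
    # -s prints contig name, contig length (. if unknown) and record count
    bcftools index -s "$vcf"
}
```

Because -n and -s read the index rather than the whole file, they are a cheap way to check that the index you just built matches your expectations.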



How to generate an index (CSI and TBI) file for the VCF file?

First, we need a VCF or BCF file, because the index is built from the content of that file and is stored in a separate file next to it (with a .csi or .tbi extension). For this purpose, as I explained, we will use the input_file.vcf file.


Before we can create either a CSI or a TBI index file, we have to compress the VCF file, and it has to be compressed with BGZIP specifically.


How to create BGZIP compressed VCF file?

A VCF file can be compressed using bgzip command like this:



bgzip -c input_file.vcf > input_file.vcf.gz 


Alternatively, you can generate a compressed VCF file with bcftools view. I will write a separate blog post about the bcftools view command, which has many handy functionalities and options. For now, let’s just use this command without going into many details.


bcftools view input_file.vcf -Oz -o input_file.vcf.gz

In essence, the above command takes input_file.vcf as input, the -Oz option tells it to write a compressed VCF file, and the -o option defines the name of the compressed output VCF file.
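A file with a .gz extension is not necessarily BGZIP compressed, since plain gzip produces the same extension. The sketch below is a rough heuristic of my own, not a bcftools feature: it assumes the BGZF layout in which the first block starts with the gzip magic bytes plus the extra-field flag and carries a "BC" subfield at bytes 13 and 14. The helper name is_bgzf is hypothetical.

```shell
# Hypothetical helper: succeed if the file looks BGZIP (BGZF) compressed.
# Usage: is_bgzf input_file.vcf.gz && echo "ready for indexing"
is_bgzf() {
    # Dump the first 14 bytes as one hex string (28 characters)
    hex=$(head -c 14 "$1" | od -An -tx1 | tr -d ' \n')
    case "$hex" in
        # 1f8b0804 = gzip magic with the extra-field flag set,
        # 4243 = "BC", the subfield id that bgzip writes
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}
```

Only the start of the file is inspected, so this is a quick sanity check rather than a full validation.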


Let’s try to create an index file using bcftools index

Once we have the compressed version of the VCF file, we can use bcftools index to create a CSI or TBI index file. For TBI indexing, the command looks like this:



bcftools index -t input_file.vcf.gz

Or by using this command for CSI indexing:

bcftools index -c input_file.vcf.gz

However, if your VCF file is not sorted correctly, you can end up getting the following error messages for the above commands:


[E::hts_idx_push] Chromosome blocks not continuous
index: failed to create index for "input_file.vcf.gz"
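If you want to check the ordering before indexing, something like the following sketch can help. It is my own approximation of what "sorted" means here (each chromosome forms one contiguous block and positions do not decrease within a block), written for an uncompressed VCF file. The helper name vcf_is_sorted is hypothetical.

```shell
# Hypothetical helper: succeed if records are grouped by chromosome
# and ordered by position within each chromosome.
# Usage: vcf_is_sorted input_file.vcf || echo "needs sorting"
vcf_is_sorted() {
    awk -F'\t' '
        /^#/ { next }                            # skip header lines
        $1 != chrom {
            if ($1 in seen) { bad = 1; exit }    # chromosome block is split
            seen[$1] = 1; chrom = $1; pos = 0
        }
        $2 + 0 < pos { bad = 1; exit }           # position went backwards
        { pos = $2 + 0 }
        END { exit bad + 0 }
    ' "$1"
}
```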

How to sort the VCF file and finally generate an index file? How to index a VCF file with tabix?



If you get the above error messages, you first have to sort your VCF file. One way is to sort it with the sort command, compress it with bgzip, and finally index it with either tabix or bcftools index:



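# Keep the header lines (starting with #) on top and sort only the records
(grep '^#' input_file.vcf; grep -v '^#' input_file.vcf | sort -k1,1 -k2,2n) > input_file_sorted.vcf
bgzip -c input_file_sorted.vcf > input_file_sorted.vcf.gz
tabix -p vcf input_file_sorted.vcf.gz
# OR
(grep '^#' input_file.vcf; grep -v '^#' input_file.vcf | sort -k1,1 -k2,2n) > input_file_sorted.vcf
bgzip -c input_file_sorted.vcf > input_file_sorted.vcf.gz
bcftools index -t input_file_sorted.vcf.gz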

However, I personally prefer using bcftools for all the above steps, which is faster and shorter. For this purpose, we use the bcftools sort command, which can both sort the VCF file and create a compressed version of it in one step. Finally, we generate the index file using bcftools index:


bcftools sort input_file.vcf -Oz -o input_file.vcf.gz
bcftools index -t input_file.vcf.gz
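The two commands above can be wrapped into one small function, sketched here under a few assumptions of my own: the helper name sort_and_index and the default output name sorted.vcf.gz are hypothetical, and the last step simply prints the record count from the new index as a sanity check.

```shell
# Hypothetical helper: sort, compress and index a VCF file in one go.
# Usage: sort_and_index input_file.vcf input_file_sorted.vcf.gz
sort_and_index() {
    src=$1                    # plain or compressed VCF, possibly unsorted
    dst=${2:-sorted.vcf.gz}   # assumed default output name
    # Sort and write a bgzip-compressed VCF in one step
    bcftools sort "$src" -Oz -o "$dst" || return 1
    # Build a TBI index, overwriting any stale index from an earlier run
    bcftools index -f -t "$dst" || return 1
    # Print the number of records stored in the index
    bcftools index -n "$dst"
}
```

Writing the sorted output to a separate file keeps the original input untouched, which makes it easier to rerun the step if something goes wrong.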

This is my initial version of the bcftools index blog post. As you can see, it mainly focuses on how to and how not to generate bcftools index files. I will try to expand this blog post by covering other options within the bcftools index command. For now, this is it :) I hope you enjoyed it.





