Dr Edin Hamzić

bcftools sort: How to sort VCF/BCF files?

Updated: Oct 21, 2023

I am continuing with my bcftools series of blog posts. This time I will cover the bcftools sort command. Unfortunately, there is not much to cover in the case of this command as it effectively does only one thing, sorting VCF/BCF files.

For those jumping straight to this blog post: before we get to the bcftools sort command examples, let me tell you that this blog post is part of a more extensive series of blog posts about bcftools.

If you are interested in bcftools, you can check my introductory blog posts here [insert], where you can check links to all other bcftools-related blog posts and the new ones I will be writing in the near future covering other bcftools commands.

For this blog post, like for the others, I will be using the input_file.vcf file and its compressed version (input_file.vcf.gz).

What is the bcftools sort command used for?

The bcftools sort command is used to sort the variants in a VCF or BCF file based on their chromosomal positions, and the basic and only syntax of the bcftools sort command is the following one:

bcftools sort input_file.vcf -o input_file_sorted.vcf

If the output (-o) option, which is the option for defining the output file name, is not specified, then the output is written to standard output (the terminal).
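To see this default behaviour in action, here is a minimal sketch. It builds a tiny, deliberately unsorted VCF (the file name tiny_unsorted.vcf and its three records are made up for illustration) and sorts it without the -o option, so the sorted records are printed to the terminal.

```shell
# Build a tiny, deliberately unsorted VCF (hypothetical file name and records).
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=chr1,length=1000>\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t300\t.\tA\tG\t.\t.\t.\n'
    printf 'chr1\t100\t.\tC\tT\t.\t.\t.\n'
    printf 'chr1\t200\t.\tG\tA\t.\t.\t.\n'
} > tiny_unsorted.vcf

# No -o option is given, so the sorted VCF goes to standard output.
if command -v bcftools >/dev/null 2>&1; then
    bcftools sort tiny_unsorted.vcf
else
    echo "bcftools is not installed or not on PATH" >&2
fi
```

Because the result goes to standard output, you can also redirect it (> some_file.vcf) or pipe it into another command instead of using -o.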

In addition, the output option (-o) can be combined with common options like output type (-O). Output type (-O), as its name suggests, lets you define the type of the output file, and the options are:

  • Compressed BCF (b)

  • Uncompressed BCF (u)

  • Compressed VCF (z)

  • Uncompressed VCF (v)

It is advised to use the -Ou option when piping between bcftools subcommands, as it improves performance by removing unnecessary compression/decompression and conversion between VCF and BCF formats. For the compressed formats (b and z), the compression level can also be set by appending a number between 0 and 9 directly to the output type, for example -Oz9.
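The piping advice can be sketched as a small shell function. The function name sort_via_pipe is hypothetical, and bcftools view stands in here for any upstream bcftools subcommand. The trailing compression level (-Oz9) assumes a reasonably recent bcftools release, so treat it as an assumption and check the usage text of your version.

```shell
# Hypothetical helper: run an upstream bcftools step, then sort its output.
#   $1 = input VCF/BCF file, $2 = sorted, compressed VCF to create
sort_via_pipe() {
    # -Ou keeps the intermediate stream as uncompressed BCF, so nothing is
    # compressed or converted to text just to be decoded again by sort.
    # The final "-" tells bcftools sort to read from standard input.
    bcftools view -Ou "$1" |
        bcftools sort -Oz9 -o "$2" -
}

# Example call, using the files from this post:
#   sort_via_pipe input_file.vcf.gz input_file_sorted.vcf.gz
```

Only the last command in the pipeline uses a compressed output type, because that is the only output that ends up on disk.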

What is the compression level?

I will assume that you already know that compression is the process of reducing the size of a file by encoding it in a more compact format. There are many compression algorithms like GZIP, BZIP2, and others, and I will not go into details about those as it is not my domain of expertise. However, I want to briefly explain what the compression level is.

In essence, the compression level describes the degree of compression applied to a specific file, and this level can be adjusted. The higher the compression level, the more aggressive the compression, and the smaller the resulting file will be. However, higher compression comes at the cost of more computation, mostly at compression time. In the case of the bcftools sort command, the compression level can be adjusted in the range between 0 and 9, where 0 means that no compression is applied and 9 means that the highest level of compression is applied.
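To make the trade-off concrete, here is a small sketch that uses plain gzip instead of bcftools, purely so that it runs on any machine. This is a substitution on my part: bcftools writes its compressed output in BGZF, a gzip-compatible format, but the sizes you get from bcftools will differ. Also note that the gzip command line accepts levels 1 to 9 rather than 0 to 9. The file names and records are made up for illustration.

```shell
# Generate a repetitive, VCF-like text file (made-up records, illustration only).
awk 'BEGIN { for (i = 1; i <= 20000; i++) printf "chr1\t%d\t.\tA\tG\t.\tPASS\t.\n", i }' > demo_records.txt

# Compress the same file at the lowest and the highest gzip level.
gzip -1 -c demo_records.txt > demo_level1.gz
gzip -9 -c demo_records.txt > demo_level9.gz

# Compare the sizes in bytes: original, level 1, level 9.
wc -c demo_records.txt demo_level1.gz demo_level9.gz
```

On data like this, the level 9 file should come out smaller than the level 1 file, while taking longer to produce. You can prefix the gzip commands with time to see the difference in run time.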


Here are a couple of examples of sorting and using output options:

  • How to sort a VCF file and create a sorted compressed VCF file as an output?

bcftools sort input_file.vcf -Oz -o input_file_sorted.vcf.gz

  • How to sort a VCF file and create a sorted VCF file as an output?

bcftools sort input_file.vcf -o input_file_sorted.vcf

  • How to sort a VCF file and create a sorted compressed BCF file as an output?

bcftools sort input_file.vcf -Ob -o input_file_sorted.bcf

Well, by now, I think you have an idea of how to use bcftools sort in combination with the output type (-O) and output (-o) options.


Besides output (-o) and output type (-O) options, there are also two additional options:

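  • Option -m, --max-mem, which is used to define the maximum memory to use, for example -m 2G. The value is approximate and affects how many temporary files are written to disk during sorting: the less memory is allowed, the more temporary files are needed. If sorting fails because too many files are open, the limit you are hitting is your system's limit on the number of open files (see ulimit -n), not the memory setting itself. I have been playing around with this option, and I would advise generally using the default value (shown in the command's usage text, 768M in recent versions).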

  • The other option is -T, --temp-dir DIR, which sets the directory used to store temporary files during sorting. Again, unless you have a specific reason, for example too little free space in the default temporary location, you should use the default value for this option.
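If you do need to change these two options, a minimal sketch could look like the following. The function name sort_with_limits, the 1G memory cap and the ./sort_tmp directory are all example values I picked for illustration, not recommendations.

```shell
# Example values only; adjust to your own machine.
MAX_MEM=1G            # passed to -m (a number followed by k, M or G)
TMP_DIR=./sort_tmp    # passed to -T

# Hypothetical helper: sort with an explicit memory cap and temp directory.
#   $1 = input VCF/BCF file, $2 = sorted, compressed VCF to create
sort_with_limits() {
    # Create the directory up front so it exists whichever bcftools version is used.
    mkdir -p "$TMP_DIR"
    bcftools sort -m "$MAX_MEM" -T "$TMP_DIR" "$1" -Oz -o "$2"
}

# Example call, using the files from this post:
#   sort_with_limits input_file.vcf input_file_sorted.vcf.gz
```

Pointing -T at a disk with plenty of free space is the main reason I can think of to change it, since the temporary files for a large input can take up a lot of room.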


That is all regarding the bcftools sort command. If you have any questions or use cases you would like me to cover, feel free to contact me.

