Extracting information from VCFs

The versatile bcftools query command can be used to extract any VCF field. Combined with standard UNIX commands, this gives a powerful tool for quick querying of VCFs.

Below is a list of some of the most common tasks, with an explanation of how each one works. For a full list of options, see the manual page.

List of samples

bcftools query -l file.bcf

Number of samples

bcftools query -l file.bcf | wc -l

List of positions

bcftools query -f '%POS\n' file.bcf

In this example, the -f option defines the output format. The %POS string indicates that the POS column should be printed for each VCF line. The \n stands for a newline character, a notation common in programming. Characters without a special meaning are passed through as is, as the following command and its output show:

$ bcftools query -f 'pos=%POS\n' file.bcf | head -3
pos=13380
pos=16071
pos=16141

Here the $ character marks the command typed on the command line, and the lines below it are the output that was printed. The | head -3 part limits the output to the first three lines.

List of positions and alleles

$ bcftools query -f '%CHROM %POS %REF %ALT\n' file.bcf | head -3
1 13380 C G
1 16071 G A
1 16141 C T
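
As one illustration of combining this output with standard UNIX commands, the sketch below tallies REF>ALT substitution types. The function name count_substitutions and the demo input are hypothetical; the sketch assumes space-separated CHROM POS REF ALT lines on standard input, as printed by the command above, and it counts a multi-allelic ALT such as A,T as its own combined type rather than splitting it.

```shell
# Hypothetical helper: tally REF>ALT pairs from "CHROM POS REF ALT" lines on stdin.
count_substitutions() {
    # Count each REF>ALT pair in an awk array, then sort for stable output.
    awk '{ n[$3 ">" $4]++ } END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}

# Demo on the three lines shown above. With a real file the input would be:
#   bcftools query -f '%CHROM %POS %REF %ALT\n' file.bcf | count_substitutions
printf '1 13380 C G\n1 16071 G A\n1 16141 C T\n' | count_substitutions
```

Doing the tally inside awk rather than with sort | uniq -c avoids the padded count column, which makes the result easier to feed into further commands.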

Extract allele frequency at each position

Assuming the INFO/AF tag is present, we can write:

$ bcftools query -f '%CHROM %POS %AF\n' file.bcf | head -3
1 13380 7.69515e-05
1 16071 0.000123122
1 16141 0.000138513

If the AF annotation is not present but AN and AC are, we can compute the frequencies on the fly:

$ bcftools query -f '%CHROM %POS %AN %AC{0}\n' file.bcf | awk '{printf "%s %s %f\n",$1,$2,$4/$3}' | head -3
1 13380 0.000077
1 16071 0.000123
1 16141 0.000139

Because the AC tag can have multiple comma-separated values, we select the first one using the subscript {0}. The awk command prints the first two fields unchanged and computes the frequency by dividing the fourth field (AC) by the third (AN).
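
The division above fails or gives a meaningless result when AN is zero or a value is missing. The following is a minimal sketch of a more defensive variant, assuming that missing values are printed as "." and that the input has the same four space-separated fields (CHROM POS AN AC). The function name af_from_counts and the demo input are hypothetical.

```shell
# Hypothetical helper: read "CHROM POS AN AC" lines and print "CHROM POS AF".
# Prints "." instead of a frequency when AN is zero or missing, or AC is missing.
af_from_counts() {
    awk '{
        # $3 + 0 forces a numeric comparison, so a missing AN (".") becomes 0.
        if ($3 + 0 > 0 && $4 != ".")
            printf "%s %s %f\n", $1, $2, $4 / $3
        else
            printf "%s %s .\n", $1, $2
    }'
}

# Demo on made-up counts. With a real file the input would be:
#   bcftools query -f '%CHROM %POS %AN %AC{0}\n' file.bcf | af_from_counts
printf '1 100 10 1\n1 200 0 0\n1 300 . .\n' | af_from_counts
```

Printing "." for sites that cannot be computed keeps one output line per input line, so the result stays aligned with the original positions.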

Extracting per-sample tags

FORMAT tags can be extracted using the square brackets [] operator, which loops over all samples. For example, to print the GT field followed by the PL field for each sample, we can write:

$ bcftools query -f '%CHROM %POS[\t%GT\t%PL]\n' file.bcf | head -3
1 10234 1/1 28,3,0  1/1 29,3,0
1 10291 ./. 0,0,0   1/1 28,3,0
1 14907 0/1 8,0,17  0/1 26,0,48

Here we used the tab character \t instead of a space as the separator. If we wanted to print the GTs for all samples first, followed by the PLs for all samples, rather than interleaving the two as above, we could write two bracket operators instead:

$ bcftools query -f '%CHROM %POS  GTs:[ %GT]\t PLs:[ %PL]\n' file.bcf | head -3
1 10234  GTs: 1/1 1/1    PLs: 28,3,0 29,3,0
1 10291  GTs: ./. 1/1    PLs: 0,0,0 28,3,0
1 14907  GTs: 0/1 0/1    PLs: 8,0,17 26,0,48
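
Per-sample output of this kind can also be summarised with awk. The sketch below counts, for each site, how many samples have a missing genotype. It assumes tab-separated input with CHROM, POS and then one GT value per sample, and it treats "./.", ".|." and "." as missing. The function name count_missing_gt and the demo input are hypothetical.

```shell
# Hypothetical helper: input is tab-separated CHROM, POS, then one GT per sample.
# Prints "CHROM POS N", where N is the number of samples with a missing genotype.
count_missing_gt() {
    awk -F'\t' '{
        n = 0
        # Per-sample fields start at column 3.
        for (i = 3; i <= NF; i++)
            if ($i == "./." || $i == ".|." || $i == ".") n++
        print $1, $2, n
    }'
}

# Demo on made-up genotypes. With a real file the input would be:
#   bcftools query -f '%CHROM\t%POS[\t%GT]\n' file.bcf | count_missing_gt
printf '1\t10234\t1/1\t1/1\n1\t10291\t./.\t1/1\n' | count_missing_gt
```

Using a tab as the only field separator keeps each genotype in its own awk field, which is why the format string here uses \t throughout rather than spaces.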

Feedback

We welcome your feedback. Please help us improve this page by either opening an issue on GitHub or editing it directly and sending a pull request.