Extracting information from VCFs
The versatile bcftools query command can be used to extract any VCF field. Combined with standard UNIX commands, this gives a powerful tool for quick querying of VCFs.
Below are some of the most common tasks, each with an explanation of how it works. For a full list of options, see the manual page.
Print the list of samples:
bcftools query -l file.bcf
Print the number of samples:
bcftools query -l file.bcf | wc -l
Print the position of every record:
bcftools query -f '%POS\n' file.bcf
In this example, the -f option defines the output format. The %POS string indicates that for each VCF line we want the POS column printed. The \n stands for a newline character, a notation commonly used in programming. Any characters without a special meaning are passed through as is, as the following command and its output show:
$ bcftools query -f 'pos=%POS\n' file.bcf | head -3
pos=13380
pos=16071
pos=16141
Here the $ character precedes the command we typed on the command line, and below it is the actual output that was printed. The | head -3 part limits the output to the first three lines.
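Several columns can be combined in one format string, for example chromosome, position, reference allele and alternate allele:
$ bcftools query -f '%CHROM %POS %REF %ALT\n' file.bcf | head -3
1 13380 C G
1 16071 G A
1 16141 C T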
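As a small illustration of combining the query output with standard UNIX commands, the sketch below counts records per chromosome. The helper name count_per_chrom is hypothetical, and the input is made-up sample text in the 'CHROM POS' layout rather than real bcftools output, so the snippet runs without a BCF file.

```shell
# Hypothetical helper: reads "CHROM POS" lines on stdin and prints
# one "CHROM COUNT" line per chromosome, sorted by chromosome name.
count_per_chrom() {
    awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' | sort
}

# With bcftools installed, the helper would be fed like this:
#   bcftools query -f '%CHROM %POS\n' file.bcf | count_per_chrom

# Made-up sample input, used here only for illustration:
printf '1 13380\n1 16071\n2 500\n' | count_per_chrom
# prints:
# 1 2
# 2 1
```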
Assuming the INFO/AF tag is present, we can write:
$ bcftools query -f '%CHROM %POS %AF\n' file.bcf | head -3
1 13380 7.69515e-05
1 16071 0.000123122
1 16141 0.000138513
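Once the frequencies are printed as a column, they can be filtered with awk. The following is a minimal sketch, assuming the three-column 'CHROM POS AF' layout shown above. The helper name rare_sites and the 0.0001 threshold are illustrative choices, not part of bcftools. Records whose AF is printed as a missing value (.) are skipped rather than treated as zero.

```shell
# Hypothetical helper: keep "CHROM POS AF" lines whose AF is below a threshold.
# $1 = maximum allele frequency to keep
rare_sites() {
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    # $3 + 0 forces a numeric comparison, including for values like 7.69515e-05.
    LC_ALL=C awk -v max="$1" '$3 != "." && ($3 + 0) < max'
}

# With bcftools installed, the helper would be fed like this:
#   bcftools query -f '%CHROM %POS %AF\n' file.bcf | rare_sites 0.0001

# The three records from the example output above:
printf '1 13380 7.69515e-05\n1 16071 0.000123122\n1 16141 0.000138513\n' | rare_sites 0.0001
# prints:
# 1 13380 7.69515e-05
```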
If the AF annotation is not present but AN and AC are, we can compute the frequencies on the fly:
$ bcftools query -f '%CHROM %POS %AN %AC{0}\n' file.bcf | awk '{printf "%s %s %f\n",$1,$2,$4/$3}' | head -3
1 13380 0.000077
1 16071 0.000123
1 16141 0.000139
Because the AC tag can have multiple comma-separated values, we select the first one using the subscript {0}. The awk command prints the first two fields unchanged and computes the frequency by dividing the fourth field (AC) by the third (AN).
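The division in the awk command fails when AN is zero or is printed as a missing value (.). A slightly more defensive variant might look like the sketch below. The helper name af_from_an_ac is hypothetical, and the AN and AC values in the sample input are made up for illustration, not taken from a real file.

```shell
# Hypothetical helper: reads "CHROM POS AN AC" lines on stdin and prints
# "CHROM POS AF", skipping records where AN is zero or missing (".").
af_from_an_ac() {
    # $3 + 0 evaluates to 0 for a missing value ".", so such lines are skipped
    # and the division by $3 is never attempted with a zero denominator.
    LC_ALL=C awk '($3 + 0) > 0 { printf "%s %s %f\n", $1, $2, $4 / $3 }'
}

# With bcftools installed, the helper would be fed like this:
#   bcftools query -f '%CHROM %POS %AN %AC{0}\n' file.bcf | af_from_an_ac

# Made-up AN/AC values, used here only for illustration:
printf '1 13380 1000 5\n1 16071 0 0\n1 16141 200 3\n' | af_from_an_ac
# prints:
# 1 13380 0.005000
# 1 16141 0.015000
```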
FORMAT tags can be extracted using the square brackets [] operator, which loops over all samples. For example, to print the GT field followed by the PL field for each sample, we can write:
$ bcftools query -f '%CHROM %POS[\t%GT\t%PL]\n' file.bcf | head -3
1 10234 1/1 28,3,0 1/1 29,3,0
1 10291 ./. 0,0,0 1/1 28,3,0
1 14907 0/1 8,0,17 0/1 26,0,48
Here we used the tab character \t instead of a space for a change. If we wanted to print the GTs for all samples first, followed by the PLs for all samples, rather than mixing the two types as above, we could write two bracket operators instead:
$ bcftools query -f '%CHROM %POS GTs:[ %GT]\t PLs:[ %PL]\n' file.bcf | head -3
1 10234 GTs: 1/1 1/1 PLs: 28,3,0 29,3,0
1 10291 GTs: ./. 1/1 PLs: 0,0,0 28,3,0
1 14907 GTs: 0/1 0/1 PLs: 8,0,17 26,0,48
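Per-sample output produced with the [] operator is also easy to summarise with awk. The sketch below counts missing genotypes per site, assuming the tab-separated layout produced by the format '%CHROM %POS[\t%GT]\n'. The helper name count_missing_gt is hypothetical, and only unphased missing genotypes (./. or a single .) are counted. The sample input reuses the genotypes from the example output above.

```shell
# Hypothetical helper: reads lines of the form "CHROM POS<TAB>GT<TAB>GT..."
# and prints "CHROM POS<TAB>number of missing genotypes".
count_missing_gt() {
    awk -F'\t' '{
        m = 0
        # Field 1 is "CHROM POS"; fields 2..NF hold one genotype per sample.
        for (i = 2; i <= NF; i++)
            if ($i == "./." || $i == ".") m++
        print $1 "\t" m
    }'
}

# With bcftools installed, the helper would be fed like this:
#   bcftools query -f '%CHROM %POS[\t%GT]\n' file.bcf | count_missing_gt

# Genotypes taken from the example output above:
printf '1 10234\t1/1\t1/1\n1 10291\t./.\t1/1\n1 14907\t0/1\t0/1\n' | count_missing_gt
# prints (columns separated by a tab):
# 1 10234	0
# 1 10291	1
# 1 14907	0
```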