Generating a mutation profile
Quickstart
Make sure to go through the Quickstart guide first where the basics of VARGRAM syntax are explained.
Mutation Profile
During an outbreak or a period of genomic surveillance, batches of samples are taken from the field or testing centers that are then sequenced. Often, the diversity of these sequences need to be quickly identified to understand the evolution of a pathogen.
A mutation profile such as the one above is an intuitive figure that shows you what mutations are present in which genes by how many times. Differences among batches can be quickly seen and also be compared against reference sets of mutations. VARGRAM makes it easy to generate these figures.
Input
Providing sequence files
There are two main sets of input that you can provide. The first is the sequence files which include (1) the FASTA file of sequences or a directory of FASTA files for multiple batches, (2) the reference FASTA, and (3) a genome annotation file following the GFF3 format.
Is VARGRAM only applicable to viral data?
VARGRAM relies on Nextclade to perform mutation calling when sequence files are provided, and Nextclade currently supports viral data only. However, if you can perform mutation calling through another tool, you can still use VARGRAM to create a mutation profile of other organisms provided that you generate a CSV file of the mutations. See Other features.
Instead of providing the reference FASTA, you may also specify the name or shortcut of a Nextclade dataset. In this case, the annotation file does not need to be provided. VARGRAM will download the reference and the annotation (these are deleted after use).
When sequence files are provided, VARGRAM will run Nextclade CLI and capture the analysis file so make sure that Nextclade is installed. These files can be provided to VARGRAM through the vargram
class:
vg = vargram(seq='path/to/covid_samples/', # Provide FASTA of sequences or directory containing FASTA files
ref='path/to/covid_reference.fa', # Provide reference sequence
gene='path/to/covid_annotation.gff') # Provide genome annotation
vg = vargram(seq='path/to/covid_samples/', # Provide sample sequences
ref='sars-cov-2') # Provide reference name/shortcut
Nextclade dataset name
To view available datasets, run nextclade dataset list
. The official names and shortcuts are on the first name
column—not the attribute
column. You may alternatively run nextclade dataset list --only-names
which only lists the recognized dataset names.
FASTA directory
The directory for the sequences need not contain only FASTA files. VARGRAM will ignore all other files in the directory that do not have a .fasta
or .fa
extension. But make sure that only the FASTA files of interest are in the directory.
Removal of rows with errors and warnings
VARGRAM removes sample rows in the captured analysis file that Nextclade flags with an error or warning
Providing analysis files
You may already have a Nextclade analysis file obtained through Nextclade CLI or Nextclade Web. In this case, simply provide its path:
vg = vargram(data='path/to/nextclade_analysis.csv')
gene
.
The VARGRAM data output can also be provided as an input but you must specify its format:
vg = vargram(data='path/to/vargram_analysis.csv',
format='vargram')
Nextclade CLI vs. Nextclade Web
Nextclade CLI and Nextclade Web yield identical results when the datasets they use are also the same. If you get different mutation profiles, try updating your dataset. See this discussion on Nexstrain for more info.
Output
Terminal methods are independent
VARGRAM outputs are produced by calling terminal methods, which are independent of each other and can therefore be called in any order.
Showing the plot
To show the mutation profile, simply call the show()
method, e.g.
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.show()
Getting summary data
The summary data for the mutation profile contains the total counts of mutations and the counts per batch. It can be accessed using the stat()
method which returns a Pandas DataFrame:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.stat()
gene | mutation | position | type | my_batch | sum |
---|---|---|---|---|---|
E | T9I | 9 | sub | 80 | 80 |
M | D3G | 3 | sub | 70 | 70 |
M | Q19E | 19 | sub | 78 | 78 |
M | A63T | 63 | sub | 80 | 80 |
N | P13L | 13 | sub | 80 | 80 |
N | E31- | 31 | del | 80 | 80 |
N | R32- | 32 | del | 80 | 80 |
N | S33- | 33 | del | 80 | 80 |
N | R203K | 203 | sub | 80 | 80 |
N | G204R | 204 | sub | 80 | 80 |
ORF1a | K856R | 856 | sub | 80 | 80 |
ORF1a | S2083I | 2083 | sub | 80 | 80 |
ORF1a | L2084- | 2084 | del | 80 | 80 |
ORF1a | A2710T | 2710 | sub | 80 | 80 |
ORF1a | T3255I | 3255 | sub | 80 | 80 |
ORF1a | P3395H | 3395 | sub | 80 | 80 |
ORF1a | L3674- | 3674 | del | 80 | 80 |
ORF1a | S3675- | 3675 | del | 80 | 80 |
ORF1a | G3676- | 3676 | del | 80 | 80 |
ORF1a | I3758V | 3758 | sub | 80 | 80 |
ORF1b | P314L | 314 | sub | 80 | 80 |
ORF1b | I1566V | 1566 | sub | 80 | 80 |
ORF9b | P10S | 10 | sub | 80 | 80 |
ORF9b | E27- | 27 | del | 80 | 80 |
ORF9b | N28- | 28 | del | 80 | 80 |
ORF9b | A29- | 29 | del | 80 | 80 |
S | A67V | 67 | sub | 80 | 80 |
S | H69- | 69 | del | 80 | 80 |
S | V70- | 70 | del | 80 | 80 |
S | T95I | 95 | sub | 80 | 80 |
S | G142D | 142 | sub | 80 | 80 |
S | V143- | 143 | del | 80 | 80 |
S | Y144- | 144 | del | 80 | 80 |
S | Y145- | 145 | del | 80 | 80 |
S | N211I | 211 | sub | 58 | 58 |
S | L212- | 212 | del | 58 | 58 |
S | 214:EPE | 214 | in | 54 | 54 |
S | G339D | 339 | sub | 72 | 72 |
S | S371L | 371 | sub | 63 | 63 |
S | S373P | 373 | sub | 68 | 68 |
S | S375F | 375 | sub | 68 | 68 |
S | K417N | 417 | sub | 67 | 67 |
S | G446S | 446 | sub | 56 | 56 |
S | S477N | 477 | sub | 61 | 61 |
S | T478K | 478 | sub | 61 | 61 |
S | E484A | 484 | sub | 59 | 59 |
S | Q493R | 493 | sub | 59 | 59 |
S | G496S | 496 | sub | 57 | 57 |
S | Q498R | 498 | sub | 61 | 61 |
S | N501Y | 501 | sub | 61 | 61 |
S | Y505H | 505 | sub | 58 | 58 |
S | T547K | 547 | sub | 80 | 80 |
S | D614G | 614 | sub | 80 | 80 |
S | H655Y | 655 | sub | 80 | 80 |
S | N679K | 679 | sub | 80 | 80 |
S | P681H | 681 | sub | 80 | 80 |
S | N764K | 764 | sub | 80 | 80 |
S | D796Y | 796 | sub | 80 | 80 |
S | N856K | 856 | sub | 80 | 80 |
S | Q954H | 954 | sub | 80 | 80 |
S | N969K | 969 | sub | 80 | 80 |
S | L981F | 981 | sub | 80 | 80 |
At the minimum, the columns include the gene
, mutation
, position
, mutation type
as well as the name of the batch (here, my_batch
) and the total count of the mutation (sum
). The data above is the summary data for BA1_analysis_cli.csv
. When there is only one batch, the default batch name is my_batch
.
gene | mutation | position | type | BA1_analysis_cli | BA2_analysis_cli | sum |
---|---|---|---|---|---|---|
E | T9I | 9 | sub | 80 | 80 | 160 |
M | D3G | 3 | sub | 70 | 0 | 70 |
M | Q19E | 19 | sub | 78 | 71 | 149 |
M | A63T | 63 | sub | 80 | 80 | 160 |
N | P13L | 13 | sub | 80 | 78 | 158 |
N | E31- | 31 | del | 80 | 79 | 159 |
N | R32- | 32 | del | 80 | 79 | 159 |
N | S33- | 33 | del | 80 | 79 | 159 |
N | R203K | 203 | sub | 80 | 79 | 159 |
N | G204R | 204 | sub | 80 | 79 | 159 |
N | S413R | 413 | sub | 0 | 78 | 78 |
ORF1a | S135R | 135 | sub | 0 | 80 | 80 |
ORF1a | I694T | 694 | sub | 13 | 0 | 13 |
ORF1a | T842I | 842 | sub | 0 | 79 | 79 |
ORF1a | K856R | 856 | sub | 80 | 0 | 80 |
ORF1a | G1307S | 1307 | sub | 0 | 79 | 79 |
ORF1a | T1543I | 1543 | sub | 0 | 7 | 7 |
ORF1a | S2083I | 2083 | sub | 80 | 0 | 80 |
ORF1a | L2084- | 2084 | del | 80 | 0 | 80 |
ORF1a | I2138V | 2138 | sub | 8 | 0 | 8 |
ORF1a | A2710T | 2710 | sub | 80 | 0 | 80 |
ORF1a | L3027F | 3027 | sub | 0 | 76 | 76 |
ORF1a | T3090I | 3090 | sub | 0 | 79 | 79 |
ORF1a | L3201F | 3201 | sub | 0 | 78 | 78 |
ORF1a | T3255I | 3255 | sub | 80 | 79 | 159 |
ORF1a | P3395H | 3395 | sub | 80 | 80 | 160 |
ORF1a | L3674- | 3674 | del | 80 | 0 | 80 |
ORF1a | S3675- | 3675 | del | 80 | 80 | 160 |
ORF1a | G3676- | 3676 | del | 80 | 80 | 160 |
ORF1a | F3677- | 3677 | del | 0 | 80 | 80 |
ORF1a | I3758V | 3758 | sub | 80 | 0 | 80 |
ORF1b | P314L | 314 | sub | 80 | 80 | 160 |
ORF1b | R1315C | 1315 | sub | 0 | 79 | 79 |
ORF1b | I1566V | 1566 | sub | 80 | 80 | 160 |
ORF1b | T2163I | 2163 | sub | 0 | 78 | 78 |
ORF3a | T223I | 223 | sub | 0 | 79 | 79 |
ORF6 | D61L | 61 | sub | 0 | 77 | 77 |
ORF9b | P10S | 10 | sub | 80 | 76 | 156 |
ORF9b | E27- | 27 | del | 80 | 79 | 159 |
ORF9b | N28- | 28 | del | 80 | 79 | 159 |
ORF9b | A29- | 29 | del | 80 | 79 | 159 |
S | T19I | 19 | sub | 0 | 79 | 79 |
S | L24S | 24 | sub | 0 | 79 | 79 |
S | P25- | 25 | del | 0 | 79 | 79 |
S | P26- | 26 | del | 0 | 79 | 79 |
S | A27- | 27 | del | 0 | 79 | 79 |
S | A67V | 67 | sub | 80 | 0 | 80 |
S | H69- | 69 | del | 80 | 0 | 80 |
S | V70- | 70 | del | 80 | 0 | 80 |
S | T95I | 95 | sub | 80 | 0 | 80 |
S | G142D | 142 | sub | 80 | 72 | 152 |
S | V143- | 143 | del | 80 | 0 | 80 |
S | Y144- | 144 | del | 80 | 0 | 80 |
S | Y145- | 145 | del | 80 | 0 | 80 |
S | N211I | 211 | sub | 58 | 0 | 58 |
S | L212- | 212 | del | 58 | 0 | 58 |
S | V213G | 213 | sub | 0 | 77 | 77 |
S | 214:EPE | 214 | in | 54 | 0 | 54 |
S | G339D | 339 | sub | 72 | 79 | 151 |
S | S371F | 371 | sub | 5 | 79 | 84 |
S | S371L | 371 | sub | 63 | 0 | 63 |
S | S373P | 373 | sub | 68 | 79 | 147 |
S | S375F | 375 | sub | 68 | 79 | 147 |
S | T376A | 376 | sub | 5 | 79 | 84 |
S | D405N | 405 | sub | 5 | 79 | 84 |
S | R408S | 408 | sub | 5 | 79 | 84 |
S | K417N | 417 | sub | 67 | 79 | 146 |
S | N440K | 440 | sub | 37 | 17 | 54 |
S | G446S | 446 | sub | 56 | 0 | 56 |
S | S477N | 477 | sub | 61 | 72 | 133 |
S | T478K | 478 | sub | 61 | 72 | 133 |
S | E484A | 484 | sub | 59 | 71 | 130 |
S | Q493R | 493 | sub | 59 | 71 | 130 |
S | G496S | 496 | sub | 57 | 0 | 57 |
S | Q498R | 498 | sub | 61 | 72 | 133 |
S | N501Y | 501 | sub | 61 | 72 | 133 |
S | Y505H | 505 | sub | 58 | 72 | 130 |
S | T547K | 547 | sub | 80 | 0 | 80 |
S | D614G | 614 | sub | 80 | 79 | 159 |
S | H655Y | 655 | sub | 80 | 79 | 159 |
S | N679K | 679 | sub | 80 | 79 | 159 |
S | P681H | 681 | sub | 80 | 79 | 159 |
S | N764K | 764 | sub | 80 | 78 | 158 |
S | D796Y | 796 | sub | 80 | 79 | 159 |
S | Q853R | 853 | sub | 6 | 0 | 6 |
S | N856K | 856 | sub | 80 | 0 | 80 |
S | Q954H | 954 | sub | 80 | 79 | 159 |
S | N969K | 969 | sub | 80 | 78 | 158 |
S | L981F | 981 | sub | 80 | 0 | 80 |
When there are multiple batches, each batch gets their own column containing the mutation counts in that batch. The data above is the summary data for omicron_analysis_cli.csv
.
gene | mutation | position | type | BA1_analysis_cli | BA2_analysis_cli | sum | BA.1 | BA.2 |
---|---|---|---|---|---|---|---|---|
E | T9I | 9 | sub | 80 | 80 | 160 | 1 | 1 |
M | D3G | 3 | sub | 70 | 0 | 70 | 1 | 0 |
M | Q19E | 19 | sub | 78 | 71 | 149 | 1 | 1 |
M | A63T | 63 | sub | 80 | 80 | 160 | 1 | 1 |
N | P13L | 13 | sub | 80 | 78 | 158 | 1 | 1 |
N | E31- | 31 | del | 80 | 79 | 159 | 1 | 1 |
N | R32- | 32 | del | 80 | 79 | 159 | 1 | 1 |
N | S33- | 33 | del | 80 | 79 | 159 | 1 | 1 |
N | R203K | 203 | sub | 80 | 79 | 159 | 1 | 1 |
N | G204R | 204 | sub | 80 | 79 | 159 | 1 | 1 |
N | S413R | 413 | sub | 0 | 78 | 78 | 0 | 1 |
ORF1a | S135R | 135 | sub | 0 | 80 | 80 | 0 | 1 |
ORF1a | I694T | 694 | sub | 13 | 0 | 13 | 0 | 0 |
ORF1a | T842I | 842 | sub | 0 | 79 | 79 | 0 | 1 |
ORF1a | K856R | 856 | sub | 80 | 0 | 80 | 1 | 0 |
ORF1a | G1307S | 1307 | sub | 0 | 79 | 79 | 0 | 1 |
ORF1a | T1543I | 1543 | sub | 0 | 7 | 7 | 0 | 0 |
ORF1a | S2083I | 2083 | sub | 80 | 0 | 80 | 1 | 0 |
ORF1a | L2084- | 2084 | del | 80 | 0 | 80 | 1 | 0 |
ORF1a | I2138V | 2138 | sub | 8 | 0 | 8 | 0 | 0 |
ORF1a | A2710T | 2710 | sub | 80 | 0 | 80 | 1 | 0 |
ORF1a | L3027F | 3027 | sub | 0 | 76 | 76 | 0 | 1 |
ORF1a | T3090I | 3090 | sub | 0 | 79 | 79 | 0 | 1 |
ORF1a | L3201F | 3201 | sub | 0 | 78 | 78 | 0 | 1 |
ORF1a | T3255I | 3255 | sub | 80 | 79 | 159 | 1 | 1 |
ORF1a | P3395H | 3395 | sub | 80 | 80 | 160 | 1 | 1 |
ORF1a | L3674- | 3674 | del | 80 | 0 | 80 | 1 | 0 |
ORF1a | S3675- | 3675 | del | 80 | 80 | 160 | 1 | 1 |
ORF1a | G3676- | 3676 | del | 80 | 80 | 160 | 1 | 1 |
ORF1a | F3677- | 3677 | del | 0 | 80 | 80 | 0 | 1 |
ORF1a | I3758V | 3758 | sub | 80 | 0 | 80 | 1 | 0 |
ORF1b | P314L | 314 | sub | 80 | 80 | 160 | 1 | 1 |
ORF1b | R1315C | 1315 | sub | 0 | 79 | 79 | 0 | 1 |
ORF1b | I1566V | 1566 | sub | 80 | 80 | 160 | 1 | 1 |
ORF1b | T2163I | 2163 | sub | 0 | 78 | 78 | 0 | 1 |
ORF3a | T223I | 223 | sub | 0 | 79 | 79 | 0 | 1 |
ORF6 | D61L | 61 | sub | 0 | 77 | 77 | 0 | 1 |
ORF9b | P10S | 10 | sub | 80 | 76 | 156 | 1 | 1 |
ORF9b | E27- | 27 | del | 80 | 79 | 159 | 0 | 1 |
ORF9b | N28- | 28 | del | 80 | 79 | 159 | 1 | 1 |
ORF9b | A29- | 29 | del | 80 | 79 | 159 | 1 | 1 |
S | T19I | 19 | sub | 0 | 79 | 79 | 0 | 1 |
S | L24S | 24 | sub | 0 | 79 | 79 | 0 | 1 |
S | P25- | 25 | del | 0 | 79 | 79 | 0 | 1 |
S | P26- | 26 | del | 0 | 79 | 79 | 0 | 1 |
S | A27- | 27 | del | 0 | 79 | 79 | 0 | 1 |
S | A67V | 67 | sub | 80 | 0 | 80 | 1 | 0 |
S | H69- | 69 | del | 80 | 0 | 80 | 1 | 0 |
S | V70- | 70 | del | 80 | 0 | 80 | 1 | 0 |
S | T95I | 95 | sub | 80 | 0 | 80 | 1 | 0 |
S | G142D | 142 | sub | 80 | 72 | 152 | 1 | 1 |
S | V143- | 143 | del | 80 | 0 | 80 | 1 | 0 |
S | Y144- | 144 | del | 80 | 0 | 80 | 1 | 0 |
S | Y145- | 145 | del | 80 | 0 | 80 | 1 | 0 |
S | N211I | 211 | sub | 58 | 0 | 58 | 1 | 0 |
S | L212- | 212 | del | 58 | 0 | 58 | 1 | 0 |
S | V213G | 213 | sub | 0 | 77 | 77 | 0 | 1 |
S | 214:EPE | 214 | in | 54 | 0 | 54 | 1 | 0 |
S | G339D | 339 | sub | 72 | 79 | 151 | 1 | 1 |
S | S371F | 371 | sub | 5 | 79 | 84 | 0 | 1 |
S | S371L | 371 | sub | 63 | 0 | 63 | 1 | 0 |
S | S373P | 373 | sub | 68 | 79 | 147 | 1 | 1 |
S | S375F | 375 | sub | 68 | 79 | 147 | 1 | 1 |
S | T376A | 376 | sub | 5 | 79 | 84 | 0 | 1 |
S | D405N | 405 | sub | 5 | 79 | 84 | 0 | 1 |
S | R408S | 408 | sub | 5 | 79 | 84 | 0 | 1 |
S | K417N | 417 | sub | 67 | 79 | 146 | 1 | 1 |
S | N440K | 440 | sub | 37 | 17 | 54 | 1 | 1 |
S | G446S | 446 | sub | 56 | 0 | 56 | 1 | 0 |
S | S477N | 477 | sub | 61 | 72 | 133 | 1 | 1 |
S | T478K | 478 | sub | 61 | 72 | 133 | 1 | 1 |
S | E484A | 484 | sub | 59 | 71 | 130 | 1 | 1 |
S | Q493R | 493 | sub | 59 | 71 | 130 | 1 | 1 |
S | G496S | 496 | sub | 57 | 0 | 57 | 1 | 0 |
S | Q498R | 498 | sub | 61 | 72 | 133 | 1 | 1 |
S | N501Y | 501 | sub | 61 | 72 | 133 | 1 | 1 |
S | Y505H | 505 | sub | 58 | 72 | 130 | 1 | 1 |
S | T547K | 547 | sub | 80 | 0 | 80 | 1 | 0 |
S | D614G | 614 | sub | 80 | 79 | 159 | 1 | 1 |
S | H655Y | 655 | sub | 80 | 79 | 159 | 1 | 1 |
S | N679K | 679 | sub | 80 | 79 | 159 | 1 | 1 |
S | P681H | 681 | sub | 80 | 79 | 159 | 1 | 1 |
S | N764K | 764 | sub | 80 | 78 | 158 | 1 | 1 |
S | D796Y | 796 | sub | 80 | 79 | 159 | 1 | 1 |
S | Q853R | 853 | sub | 6 | 0 | 6 | 0 | 0 |
S | N856K | 856 | sub | 80 | 0 | 80 | 1 | 0 |
S | Q954H | 954 | sub | 80 | 79 | 159 | 1 | 1 |
S | N969K | 969 | sub | 80 | 78 | 158 | 1 | 1 |
S | L981F | 981 | sub | 80 | 0 | 80 | 1 | 0 |
Keys (here, BA.1
and BA.2
) get their own columns. A 1
indicates that the mutation is part of that key and a 0
indicates that it's not. The data above is the summary data for omicron_analysis_cli.csv
with the keys BA1_key.csv
and BA2_key.csv
.
Saving the plot or data
Use the save()
method to save either the mutation profile figure or the accompanying data:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_figure.png')
When the figure is saved (i.e. when the extension is .png
, .pdf
or .jpg
), save()
acts like Matplotlib's matplotlib.pyplot.savefig()
with the bbox_inches
argument set to tight
. Thus, save()
can take other savefig()
arguments like dpi
or transparent
, e.g.
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_figure.png', dpi=300, transparent=True)
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_data.csv')
When the summary data is saved (i.e. when the extension is not .png
, .pdf
or .jpg
), save()
acts like Pandas' pandas.DataFrame.to_csv()
with the sep
argument automatically set for .csv
(sep=','
), .tsv
(sep='\t'
) and .txt
(sep=' '
) extensions, and index
set to False
. Thus, save()
can take other to_csv()
arguments like columns
, e.g.
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_data.csv', index=True, columns=['gene','mutation','batch_1'])
Customization
Setting the y-axis type and the count threshold
The y-axis of the profile can show either the raw count of a mutation (i.e. no. of sequences that has the mutation) or its weight. The weight of a mutation is simply the count of a mutation in a batch divided by the sum of all mutation counts in that batch. Whether the count or weight is shown can be changed through the profile()
method:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(ytype='counts')
vg.show()
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(ytype='weights')
vg.show()
By default, when multiple batches are provided, the y-axis shows the weights.
On the other hand, the threshold (default 50) is the minimum count or no. of occurences of a mutation in a batch that is needed for a mutation to be included in the figure. The threshold is applied per batch and a mutation will only be shown in the batch that it meets the threshold but it will have a count of zero in every other batch.
Threshold example
To make this more concrete, suppose we have two batches, each with 200 sequences. Suppose a mutation D3G occurs in 150 out of 200 samples in the first batch, but it occurs only 30 times in the second batch. With a default threshold of 50, D3G will show up in the profile as a mutation in the first batch with a count of 150. But D3G will not register as a mutation in the second batch--its count will be zero even if 30 sequences actually have the mutation in that batch.
The threshold can also be changed through profile()
:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(threshold=10)
vg.show()
Changing aesthetic attributes
Aesthetic attributes like colors, labels, and font sizes can be changed through the aes()
method:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.aes(stack_title='Region', # Change batch legend title
stack_label=['Foreign', 'Local'], # Change batch names
stack_color=['#E33E84', '#009193'], # Change batch bar colors
group_title='ORFs', # Change gene legend title
legtitle_fontsize=15, # Change legend title font size
legentry_fontsize=10, # Change legend entry font size
ylabel='Normalized Count', # Change y-axis label
ylabel_fontsize=30, # Change y-axis label fontsize
xticks_rotation=45, # Change x-axis ticks rotation
xticks_fontsize=5) # Change x-axis ticks label fontsize
vg.show()
Why is a batch referred to as a stack and a gene referred to as a group?
A profile figure (essentially a grid of barplots) can be generated without using Nextclade data or even sequence-specific data. For this reason, the variable names were chosen to be agnostic. See Other features.
Ordering genes
VARGRAM automatically determines which gene barplot goes to which row and in what order in an attempt to create a compact figure.
However, you may want to rearrange the rows and/or show only certain genes.
You can order genes according to their start position and/or fix the profile to have a horizontal layout through the profile()
method:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(order=True, # Order the genes based on the annotation file
flat=True) # Force a horizontal layout
vg.show()
struct()
method, which takes one required positional argument:
a formatted string where gene or group names are separated by a comma and rows are separated by a forward slash.
Suppose we have the following genes: 'ORF1a', 'ORF1b', 'M', 'S', and 'N'. Say we only want to keep 'ORF1a', 'ORF1b' and 'M' such that on the first row 'M' is followed by 'ORF1a' and on the second row is 'ORF1b' alone. This new structure can be specified as follows:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.struct('M,ORF1a/ORF1b') # Put M & ORF1a in first row in that order
# and put ORF1b in second row
vg.show()
Adding and creating keys
It is often helpful not just to compare batches with each other but also against reference lineages.
This can be done by adding a key, which is a CSV or TSV file withgene
and mutation
columns.
The mutations can be representative ("key") mutations of a lineage or they could just be mutations of interest.
VARGRAM will mark these mutations in the profile through a heatmap below the barplots.
The keys can be added with the key()
method. Each key is added individually:
vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.key('path/to/my_key1.csv') # Provide key file 1
vg.key('path/to/my_key2.csv',
label='Key 2') # Add an optional label
vg.show()
You can manually create these key files yourself, but note that the format of mutation must follow Nextclade's notation.
See the aaSubstitutions
, aaDeletions
, and aaInsertions
definitions.
Alternatively, the VARGRAM summary data can serve as a key file. To create a key with VARGRAM, simply provide the batch of sequences from which key mutations will be identified. In this case, those key mutations can be identified as those that met a high threshold.
Other features
Although VARGRAM was made in the context of viral genomic surveillance, data from any CSV or TSV file or a Pandas DataFrame can be extracted to generate a mutation profile-type figure. The input tabular data needs to have three columns that will play the role of the gene, the mutation, and the batch.
These columns can be specified in the profile()
method through the arguments x
, group
, and stack
:
- The
x
(defaultmutation
) argument specifies the column from which to take the x-axis values for the bar plots. - The
group
(defaultgene
) argument specifies the column whose unique values will determine the bar plots (with their correspondingx
values) to be created. - The
stack
(defaultbatch
) argument specifies the column for the batches or the sets of (x
value,y
value) pairs.
For example, consider the file XBB_analysis_web.tsv
from the test data. This is a Nextclade analysis file, but we can still use it to show the flexibility of VARGRAM. Suppose that we want the clade
column to play the role of the genes, the unique values of the qc.overallStatus
to be the batches, and the x-axis values to be given by the Nextclade_pango
column. The code should look as follows:
vg = vargram(data='test_data/analysis/XBB_analysis_web.tsv',
format='delimited') # (1)
vg.profile(group='clade',
x='Nextclade_pango',
stack='qc.overallStatus',
threshold=1
ytype='counts')
vg.aes(stack_title="Quality Control",
group_title="Clade")
vg.show()
- When the
data
argument is used and the optionalformat
argument is not, VARGRAM assumes that the file is a Nextclade analysis file which would result to default, fixed settings. By specifyingformat=delimited
, VARGRAM would not run the file through the internal Nextclade wrangler functions.
This will result to the following profile: