Generating a mutation profile

Quickstart

Make sure to go through the Quickstart guide first where the basics of VARGRAM syntax are explained.

Mutation Profile

plot

During an outbreak or a period of genomic surveillance, batches of samples are taken from the field or testing centers that are then sequenced. Often, the diversity of these sequences need to be quickly identified to understand the evolution of a pathogen.

A mutation profile such as the one above is an intuitive figure that shows you what mutations are present in which genes by how many times. Differences among batches can be quickly seen and also be compared against reference sets of mutations. VARGRAM makes it easy to generate these figures.

Input

Providing sequence files

There are two main sets of input that you can provide. The first is the sequence files which include (1) the FASTA file of sequences or a directory of FASTA files for multiple batches, (2) the reference FASTA, and (3) a genome annotation file following the GFF3 format. An example annotation file is shown below:

# Gene map (genome annotation) of SARS-CoV-2 in GFF format.
# For gene map purpses we only need some of the columns. We substitute unused values with "." as per GFF spec.
# See GFF format reference at https://www.ensembl.org/info/website/upload/gff.html
# seqname   source  feature start   end score   strand  frame   attribute
.   .   gene    26245   26472   .   +   .    gene_name=E
.   .   gene    26523   27191   .   +   .    gene_name=M
.   .   gene    28274   29533   .   +   .    gene_name=N
.   .   gene    266 13468   .   +   .    gene_name=ORF1a
.   .   gene    13468   21555   .   +   .    gene_name=ORF1b
.   .   gene    25393   26220   .   +   .    gene_name=ORF3a
.   .   gene    27202   27387   .   +   .    gene_name=ORF6
.   .   gene    27394   27759   .   +   .    gene_name=ORF7a
.   .   gene    27756   27887   .   +   .    gene_name=ORF7b
.   .   gene    27894   28259   .   +   .    gene_name=ORF8
.   .   gene    28284   28577   .   +   .    gene_name=ORF9b
.   .   gene    21563   25384   .   +   .    gene_name=S

Is VARGRAM only applicable to viral data?

VARGRAM relies on Nextclade to perform mutation calling when sequence files are provided, and Nextclade currently supports viral data only. However, if you can perform mutation calling through another tool, you can still use VARGRAM to create a mutation profile of other organisms provided that you generate a CSV file of the mutations. See Other features.

Instead of providing the reference FASTA, you may also specify the name or shortcut of a Nextclade dataset. In this case, the annotation file does not need to be provided. VARGRAM will download the reference and the annotation (these are deleted after use).

When sequence files are provided, VARGRAM will run Nextclade CLI and capture the analysis file so make sure that Nextclade is installed. These files can be provided to VARGRAM through the vargram class:

Local referenceOnline Nextclade reference

vg = vargram(seq='path/to/covid_samples/', # Provide FASTA of sequences or directory containing FASTA files
            ref='path/to/covid_reference.fa', # Provide reference sequence
            gene='path/to/covid_annotation.gff') # Provide genome annotation

vg = vargram(seq='path/to/covid_samples/', # Provide sample sequences
            ref='sars-cov-2') # Provide reference name/shortcut

Nextclade dataset name

To view available datasets, run nextclade dataset list. The official names and shortcuts are on the first name column—not the attribute column. You may alternatively run nextclade dataset list --only-names which only lists the recognized dataset names.

FASTA directory

The directory for the sequences need not contain only FASTA files. VARGRAM will ignore all other files in the directory that do not have a .fasta or .fa extension. But make sure that only the FASTA files of interest are in the directory.

Removal of rows with errors and warnings

VARGRAM removes sample rows in the captured analysis file that Nextclade flags with an error or warning

Providing analysis files

You may already have a Nextclade analysis file obtained through Nextclade CLI or Nextclade Web. In this case, simply provide its path:

vg = vargram(data='path/to/nextclade_analysis.csv')

If you want to order the genes in the profile, you must also provide the annotation file using the same keyword argument gene.

The VARGRAM data output can also be provided as an input but you must specify its format:

vg = vargram(data='path/to/vargram_analysis.csv',
             format='vargram')

Nextclade CLI vs. Nextclade Web

Nextclade CLI and Nextclade Web yield identical results when the datasets they use are also the same. If you get different mutation profiles, try updating your dataset. See this discussion on Nexstrain for more info.

Output

Terminal methods are independent

VARGRAM outputs are produced by calling terminal methods, which are independent of each other and can therefore be called in any order.

Showing the plot

To show the mutation profile, simply call the show() method, e.g.

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.show()

Getting summary data

The summary data for the mutation profile contains the total counts of mutations and the counts per batch. It can be accessed using the stat() method which returns a Pandas DataFrame:

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.stat()

The summary data contains the following columns:

Single batchMultiple batchesMultiple batches with keys

gene	mutation	position	type	my_batch	sum
E	T9I	9	sub	80	80
M	D3G	3	sub	70	70
M	Q19E	19	sub	78	78
M	A63T	63	sub	80	80
N	P13L	13	sub	80	80
N	E31-	31	del	80	80
N	R32-	32	del	80	80
N	S33-	33	del	80	80
N	R203K	203	sub	80	80
N	G204R	204	sub	80	80
ORF1a	K856R	856	sub	80	80
ORF1a	S2083I	2083	sub	80	80
ORF1a	L2084-	2084	del	80	80
ORF1a	A2710T	2710	sub	80	80
ORF1a	T3255I	3255	sub	80	80
ORF1a	P3395H	3395	sub	80	80
ORF1a	L3674-	3674	del	80	80
ORF1a	S3675-	3675	del	80	80
ORF1a	G3676-	3676	del	80	80
ORF1a	I3758V	3758	sub	80	80
ORF1b	P314L	314	sub	80	80
ORF1b	I1566V	1566	sub	80	80
ORF9b	P10S	10	sub	80	80
ORF9b	E27-	27	del	80	80
ORF9b	N28-	28	del	80	80
ORF9b	A29-	29	del	80	80
S	A67V	67	sub	80	80
S	H69-	69	del	80	80
S	V70-	70	del	80	80
S	T95I	95	sub	80	80
S	G142D	142	sub	80	80
S	V143-	143	del	80	80
S	Y144-	144	del	80	80
S	Y145-	145	del	80	80
S	N211I	211	sub	58	58
S	L212-	212	del	58	58
S	214:EPE	214	in	54	54
S	G339D	339	sub	72	72
S	S371L	371	sub	63	63
S	S373P	373	sub	68	68
S	S375F	375	sub	68	68
S	K417N	417	sub	67	67
S	G446S	446	sub	56	56
S	S477N	477	sub	61	61
S	T478K	478	sub	61	61
S	E484A	484	sub	59	59
S	Q493R	493	sub	59	59
S	G496S	496	sub	57	57
S	Q498R	498	sub	61	61
S	N501Y	501	sub	61	61
S	Y505H	505	sub	58	58
S	T547K	547	sub	80	80
S	D614G	614	sub	80	80
S	H655Y	655	sub	80	80
S	N679K	679	sub	80	80
S	P681H	681	sub	80	80
S	N764K	764	sub	80	80
S	D796Y	796	sub	80	80
S	N856K	856	sub	80	80
S	Q954H	954	sub	80	80
S	N969K	969	sub	80	80
S	L981F	981	sub	80	80

At the minimum, the columns include the gene, mutation, position, mutation type as well as the name of the batch (here, my_batch) and the total count of the mutation (sum). The data above is the summary data for BA1_analysis_cli.csv. When there is only one batch, the default batch name is my_batch.

gene	mutation	position	type	BA1_analysis_cli	BA2_analysis_cli	sum
E	T9I	9	sub	80	80	160
M	D3G	3	sub	70	0	70
M	Q19E	19	sub	78	71	149
M	A63T	63	sub	80	80	160
N	P13L	13	sub	80	78	158
N	E31-	31	del	80	79	159
N	R32-	32	del	80	79	159
N	S33-	33	del	80	79	159
N	R203K	203	sub	80	79	159
N	G204R	204	sub	80	79	159
N	S413R	413	sub	0	78	78
ORF1a	S135R	135	sub	0	80	80
ORF1a	I694T	694	sub	13	0	13
ORF1a	T842I	842	sub	0	79	79
ORF1a	K856R	856	sub	80	0	80
ORF1a	G1307S	1307	sub	0	79	79
ORF1a	T1543I	1543	sub	0	7	7
ORF1a	S2083I	2083	sub	80	0	80
ORF1a	L2084-	2084	del	80	0	80
ORF1a	I2138V	2138	sub	8	0	8
ORF1a	A2710T	2710	sub	80	0	80
ORF1a	L3027F	3027	sub	0	76	76
ORF1a	T3090I	3090	sub	0	79	79
ORF1a	L3201F	3201	sub	0	78	78
ORF1a	T3255I	3255	sub	80	79	159
ORF1a	P3395H	3395	sub	80	80	160
ORF1a	L3674-	3674	del	80	0	80
ORF1a	S3675-	3675	del	80	80	160
ORF1a	G3676-	3676	del	80	80	160
ORF1a	F3677-	3677	del	0	80	80
ORF1a	I3758V	3758	sub	80	0	80
ORF1b	P314L	314	sub	80	80	160
ORF1b	R1315C	1315	sub	0	79	79
ORF1b	I1566V	1566	sub	80	80	160
ORF1b	T2163I	2163	sub	0	78	78
ORF3a	T223I	223	sub	0	79	79
ORF6	D61L	61	sub	0	77	77
ORF9b	P10S	10	sub	80	76	156
ORF9b	E27-	27	del	80	79	159
ORF9b	N28-	28	del	80	79	159
ORF9b	A29-	29	del	80	79	159
S	T19I	19	sub	0	79	79
S	L24S	24	sub	0	79	79
S	P25-	25	del	0	79	79
S	P26-	26	del	0	79	79
S	A27-	27	del	0	79	79
S	A67V	67	sub	80	0	80
S	H69-	69	del	80	0	80
S	V70-	70	del	80	0	80
S	T95I	95	sub	80	0	80
S	G142D	142	sub	80	72	152
S	V143-	143	del	80	0	80
S	Y144-	144	del	80	0	80
S	Y145-	145	del	80	0	80
S	N211I	211	sub	58	0	58
S	L212-	212	del	58	0	58
S	V213G	213	sub	0	77	77
S	214:EPE	214	in	54	0	54
S	G339D	339	sub	72	79	151
S	S371F	371	sub	5	79	84
S	S371L	371	sub	63	0	63
S	S373P	373	sub	68	79	147
S	S375F	375	sub	68	79	147
S	T376A	376	sub	5	79	84
S	D405N	405	sub	5	79	84
S	R408S	408	sub	5	79	84
S	K417N	417	sub	67	79	146
S	N440K	440	sub	37	17	54
S	G446S	446	sub	56	0	56
S	S477N	477	sub	61	72	133
S	T478K	478	sub	61	72	133
S	E484A	484	sub	59	71	130
S	Q493R	493	sub	59	71	130
S	G496S	496	sub	57	0	57
S	Q498R	498	sub	61	72	133
S	N501Y	501	sub	61	72	133
S	Y505H	505	sub	58	72	130
S	T547K	547	sub	80	0	80
S	D614G	614	sub	80	79	159
S	H655Y	655	sub	80	79	159
S	N679K	679	sub	80	79	159
S	P681H	681	sub	80	79	159
S	N764K	764	sub	80	78	158
S	D796Y	796	sub	80	79	159
S	Q853R	853	sub	6	0	6
S	N856K	856	sub	80	0	80
S	Q954H	954	sub	80	79	159
S	N969K	969	sub	80	78	158
S	L981F	981	sub	80	0	80

When there are multiple batches, each batch gets their own column containing the mutation counts in that batch. The data above is the summary data for omicron_analysis_cli.csv.

gene	mutation	position	type	BA1_analysis_cli	BA2_analysis_cli	sum	BA.1	BA.2
E	T9I	9	sub	80	80	160	1	1
M	D3G	3	sub	70	0	70	1	0
M	Q19E	19	sub	78	71	149	1	1
M	A63T	63	sub	80	80	160	1	1
N	P13L	13	sub	80	78	158	1	1
N	E31-	31	del	80	79	159	1	1
N	R32-	32	del	80	79	159	1	1
N	S33-	33	del	80	79	159	1	1
N	R203K	203	sub	80	79	159	1	1
N	G204R	204	sub	80	79	159	1	1
N	S413R	413	sub	0	78	78	0	1
ORF1a	S135R	135	sub	0	80	80	0	1
ORF1a	I694T	694	sub	13	0	13	0	0
ORF1a	T842I	842	sub	0	79	79	0	1
ORF1a	K856R	856	sub	80	0	80	1	0
ORF1a	G1307S	1307	sub	0	79	79	0	1
ORF1a	T1543I	1543	sub	0	7	7	0	0
ORF1a	S2083I	2083	sub	80	0	80	1	0
ORF1a	L2084-	2084	del	80	0	80	1	0
ORF1a	I2138V	2138	sub	8	0	8	0	0
ORF1a	A2710T	2710	sub	80	0	80	1	0
ORF1a	L3027F	3027	sub	0	76	76	0	1
ORF1a	T3090I	3090	sub	0	79	79	0	1
ORF1a	L3201F	3201	sub	0	78	78	0	1
ORF1a	T3255I	3255	sub	80	79	159	1	1
ORF1a	P3395H	3395	sub	80	80	160	1	1
ORF1a	L3674-	3674	del	80	0	80	1	0
ORF1a	S3675-	3675	del	80	80	160	1	1
ORF1a	G3676-	3676	del	80	80	160	1	1
ORF1a	F3677-	3677	del	0	80	80	0	1
ORF1a	I3758V	3758	sub	80	0	80	1	0
ORF1b	P314L	314	sub	80	80	160	1	1
ORF1b	R1315C	1315	sub	0	79	79	0	1
ORF1b	I1566V	1566	sub	80	80	160	1	1
ORF1b	T2163I	2163	sub	0	78	78	0	1
ORF3a	T223I	223	sub	0	79	79	0	1
ORF6	D61L	61	sub	0	77	77	0	1
ORF9b	P10S	10	sub	80	76	156	1	1
ORF9b	E27-	27	del	80	79	159	0	1
ORF9b	N28-	28	del	80	79	159	1	1
ORF9b	A29-	29	del	80	79	159	1	1
S	T19I	19	sub	0	79	79	0	1
S	L24S	24	sub	0	79	79	0	1
S	P25-	25	del	0	79	79	0	1
S	P26-	26	del	0	79	79	0	1
S	A27-	27	del	0	79	79	0	1
S	A67V	67	sub	80	0	80	1	0
S	H69-	69	del	80	0	80	1	0
S	V70-	70	del	80	0	80	1	0
S	T95I	95	sub	80	0	80	1	0
S	G142D	142	sub	80	72	152	1	1
S	V143-	143	del	80	0	80	1	0
S	Y144-	144	del	80	0	80	1	0
S	Y145-	145	del	80	0	80	1	0
S	N211I	211	sub	58	0	58	1	0
S	L212-	212	del	58	0	58	1	0
S	V213G	213	sub	0	77	77	0	1
S	214:EPE	214	in	54	0	54	1	0
S	G339D	339	sub	72	79	151	1	1
S	S371F	371	sub	5	79	84	0	1
S	S371L	371	sub	63	0	63	1	0
S	S373P	373	sub	68	79	147	1	1
S	S375F	375	sub	68	79	147	1	1
S	T376A	376	sub	5	79	84	0	1
S	D405N	405	sub	5	79	84	0	1
S	R408S	408	sub	5	79	84	0	1
S	K417N	417	sub	67	79	146	1	1
S	N440K	440	sub	37	17	54	1	1
S	G446S	446	sub	56	0	56	1	0
S	S477N	477	sub	61	72	133	1	1
S	T478K	478	sub	61	72	133	1	1
S	E484A	484	sub	59	71	130	1	1
S	Q493R	493	sub	59	71	130	1	1
S	G496S	496	sub	57	0	57	1	0
S	Q498R	498	sub	61	72	133	1	1
S	N501Y	501	sub	61	72	133	1	1
S	Y505H	505	sub	58	72	130	1	1
S	T547K	547	sub	80	0	80	1	0
S	D614G	614	sub	80	79	159	1	1
S	H655Y	655	sub	80	79	159	1	1
S	N679K	679	sub	80	79	159	1	1
S	P681H	681	sub	80	79	159	1	1
S	N764K	764	sub	80	78	158	1	1
S	D796Y	796	sub	80	79	159	1	1
S	Q853R	853	sub	6	0	6	0	0
S	N856K	856	sub	80	0	80	1	0
S	Q954H	954	sub	80	79	159	1	1
S	N969K	969	sub	80	78	158	1	1
S	L981F	981	sub	80	0	80	1	0

Keys (here, BA.1 and BA.2) get their own columns. A 1 indicates that the mutation is part of that key and a 0 indicates that it's not. The data above is the summary data for omicron_analysis_cli.csv with the keys BA1_key.csv and BA2_key.csv.

Saving the plot or data

Use the save() method to save either the mutation profile figure or the accompanying data:

Save figureSave data

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_figure.png')

When the figure is saved (i.e. when the extension is .png, .pdf or .jpg), save() acts like Matplotlib's matplotlib.pyplot.savefig() with the bbox_inches argument set to tight. Thus, save() can take other savefig() arguments like dpi or transparent, e.g.

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_figure.png', dpi=300, transparent=True)

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_data.csv')

When the summary data is saved (i.e. when the extension is not .png, .pdf or .jpg), save() acts like Pandas' pandas.DataFrame.to_csv() with the sep argument automatically set for .csv (sep=','), .tsv (sep='\t') and .txt (sep=' ') extensions, and index set to False. Thus, save() can take other to_csv() arguments like columns, e.g.

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.save('my_data.csv', index=True, columns=['gene','mutation','batch_1'])

Customization

Setting the y-axis type and the count threshold

The y-axis of the profile can show either the raw count of a mutation (i.e. no. of sequences that has the mutation) or its weight. The weight of a mutation is simply the count of a mutation in a batch divided by the sum of all mutation counts in that batch. Whether the count or weight is shown can be changed through the profile() method:

Show countsShow weights

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(ytype='counts')
vg.show()

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(ytype='weights')
vg.show()

By default, when multiple batches are provided, the y-axis shows the weights.

On the other hand, the threshold (default 50) is the minimum count or no. of occurences of a mutation in a batch that is needed for a mutation to be included in the figure. The threshold is applied per batch and a mutation will only be shown in the batch that it meets the threshold but it will have a count of zero in every other batch.

Threshold example

To make this more concrete, suppose we have two batches, each with 200 sequences. Suppose a mutation D3G occurs in 150 out of 200 samples in the first batch, but it occurs only 30 times in the second batch. With a default threshold of 50, D3G will show up in the profile as a mutation in the first batch with a count of 150. But D3G will not register as a mutation in the second batch--its count will be zero even if 30 sequences actually have the mutation in that batch.

The threshold can also be changed through profile():

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(threshold=10)
vg.show()

Changing aesthetic attributes

Aesthetic attributes like colors, labels, and font sizes can be changed through the aes() method:

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.aes(stack_title='Region', # Change batch legend title
        stack_label=['Foreign', 'Local'], # Change batch names
        stack_color=['#E33E84', '#009193'], # Change batch bar colors
        group_title='ORFs', # Change gene legend title
        legtitle_fontsize=15, # Change legend title font size
        legentry_fontsize=10, # Change legend entry font size
        ylabel='Normalized Count', # Change y-axis label
        ylabel_fontsize=30, # Change y-axis label fontsize
        xticks_rotation=45, # Change x-axis ticks rotation
        xticks_fontsize=5) # Change x-axis ticks label fontsize
vg.show()

Why is a batch referred to as a stack and a gene referred to as a group?

A profile figure (essentially a grid of barplots) can be generated without using Nextclade data or even sequence-specific data. For this reason, the variable names were chosen to be agnostic. See Other features.

Ordering genes

VARGRAM automatically determines which gene barplot goes to which row and in what order in an attempt to create a compact figure. However, you may want to rearrange the rows and/or show only certain genes. You can order genes according to their start position and/or fix the profile to have a horizontal layout through the profile() method:

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile(order=True, # Order the genes based on the annotation file
           flat=True) # Force a horizontal layout
vg.show()

If you want to keep only certain genes and specify a particular order, then you can use the struct() method, which takes one required positional argument: a formatted string where gene or group names are separated by a comma and rows are separated by a forward slash.

Suppose we have the following genes: 'ORF1a', 'ORF1b', 'M', 'S', and 'N'. Say we only want to keep 'ORF1a', 'ORF1b' and 'M' such that on the first row 'M' is followed by 'ORF1a' and on the second row is 'ORF1b' alone. This new structure can be specified as follows:

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.struct('M,ORF1a/ORF1b') # Put M & ORF1a in first row in that order 
                           # and put ORF1b in second row
vg.show()

Adding and creating keys

It is often helpful not just to compare batches with each other but also against reference lineages. This can be done by adding a key, which is a CSV or TSV file withgene and mutation columns. The mutations can be representative ("key") mutations of a lineage or they could just be mutations of interest for whatever reason. VARGRAM will mark these mutations in the profile through a heatmap below the barplots. The keys can be added with the key() method. Each key is added individually:

vg = vargram(data='path/to/nextclade_analysis.csv')
vg.profile()
vg.key('path/to/my_key1.csv') # Provide key file 1
vg.key('path/to/my_key2.csv',
        label='Key 2') # Add an optional label
vg.show()

Key mutations will always be shown on the x-axis even if they are not found in any of the batches. Note however that if those key mutations do not meet the threshold, there will be an empty column for it in the barplot but no bar will be shown. On the other hand, any mutation that meets the threshold in any of the batches will always be shown even if it is not a key mutation.

You can manually create these key files yourself, but note that the format of mutation must follow Nextclade's notation. See the aaSubstitutions, aaDeletions, and aaInsertions definitions.

Alternatively, the VARGRAM summary data can serve as a key file. To create a key with VARGRAM, simply provide the batch of sequences from which key mutations will be identified. In this case, those key mutations can be identified as those that met a high threshold.

Other features

Although VARGRAM was made in the context of viral genomic surveillance, data from any CSV or TSV file or a Pandas DataFrame can be extracted to generate a mutation profile-type figure. The input tabular data needs to have three columns that will play the role of the gene, the mutation, and the batch.

These columns can be specified in the profile() method through the arguments x, group, and stack:

The x (default mutation) argument specifies the column from which to take the x-axis values for the bar plots.
The group (default gene) argument specifies the column whose unique values will determine the bar plots (with their corresponding x values) to be created.
The stack (default batch) argument specifies the column for the batches or the sets of (x value, y value) pairs.

For example, consider the file XBB_analysis_web.tsv from the test data. This is a Nextclade analysis file, but we can still use it to show the flexibility of VARGRAM. Suppose that we want the clade column to play the role of the genes, the unique values of the qc.overallStatus to be the batches, and the x-axis values to be given by the Nextclade_pango column. The code should look as follows:

vg = vargram(data='test_data/analysis/XBB_analysis_web.tsv', 
            format='delimited') # (1)
vg.profile(group='clade',
           x='Nextclade_pango', 
           stack='qc.overallStatus', 
           threshold=1
           ytype='counts')
vg.aes(stack_title="Quality Control",
       group_title="Clade")
vg.show()

When the data argument is used and the optional format argument is not, VARGRAM assumes that the file is a Nextclade analysis file which would result to default, fixed settings. By specifying format=delimited, VARGRAM would not run the file through the internal Nextclade wrangler functions.

This will result to the following profile: plot