Interactive Analysis

Statistical Analysis Using COVID Cloud and Azure

This notebook will demonstrate how to perform statistical analysis with Azure Notebooks using SARS-CoV-2 variant data from COVID Cloud.

The statistical analysis in this notebook will include:

  1. Basic summary statistics using pandas
  2. Hypothesis testing using scipy
  3. Phylogenetic trees using scipy

Before beginning, you should familiarize yourself with the provided Python library for COVID Cloud. Make sure you've installed the library and restarted the notebook kernel if necessary.

The notebook and requirements file can be found here.

To install the necessary libraries, make sure the requirements.txt file is included in the working directory, then run:

pip install --no-cache-dir -r requirements.txt
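As a quick sanity check that the install succeeded, you can verify that the packages used later in this notebook are importable (a minimal sketch; adjust the package list to match your requirements.txt):

```python
import importlib.util

# Packages this notebook relies on; edit to match your requirements.txt.
for pkg in ("pandas", "scipy", "matplotlib", "seaborn"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'ok' if found else 'MISSING'}")
```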

Next, use the Python Library to download COVID variant data.

There are two main tables we want to use for the analysis: the sequences table (sample metadata) and the variants table (per-sample variant calls).

The following downloads should complete in less than a minute.

import pandas as pd
import json

from covid_cloud import COVIDCloud

#### Create the client
search_url = "https://search.international.covidcloud.ca/"
covid_client = COVIDCloud(search_url=search_url)

#### Download metadata
print('Fetching metadata from COVID Cloud…')
query = 'SELECT * FROM covid.cloud.sequences'
meta_data = json.loads(covid_client.query(query))
meta_df = pd.DataFrame(meta_data)
print("Metadata:")
print(meta_df)

#### Download variant data
print('Fetching variants from COVID Cloud…')
query = 'SELECT * FROM covid.cloud.variants'
variant_data = json.loads(covid_client.query(query))
variant_df = pd.DataFrame(variant_data)
print("Variant Data:")
print(variant_df)
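The pattern above (parse the JSON response, then build a DataFrame from the records) can be illustrated with a small synthetic payload; the column names below are hypothetical stand-ins for the real table schema:

```python
import json

import pandas as pd

# A miniature stand-in for a COVID Cloud query response.
payload = ('[{"accession": "S1", "collection_date": "2020-03-10"},'
           ' {"accession": "S2", "collection_date": "2020-08-02"}]')

records = json.loads(payload)  # a list of dicts, one per row
df = pd.DataFrame(records)     # one row per record, one column per key
print(df.shape)                # (2, 2)
```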

1. Basic Summary Statistics Using Pandas

In this example, we will be investigating the number of samples by variant.

import pandas as pd

####  Plot the number of samples at each of the 20 most frequent variant positions
_ = (
    variant_df.start_position
              .value_counts()
              .head(20)
              .plot.bar(figsize=(10, 5),
                        xlabel='Position',
                        ylabel='# Samples')
)
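To see what the chained call computes before plotting, here is the same value_counts().head() pattern on a tiny synthetic Series of positions (made-up values):

```python
import pandas as pd

# Each entry is the start position of one observed variant call.
positions = pd.Series([23402, 23402, 23402, 241, 241, 3037])

# Count occurrences per position, sorted descending, keep the top 2.
counts = positions.value_counts().head(2)
print(counts)  # 23402 appears 3 times, 241 appears twice
```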

2. Hypothesis Testing Using SciPy

In this example, we will be investigating whether the proportion of samples carrying the D614G variant differs between March and August 2020.

from scipy.stats import chi2_contingency

#### Get sample accessions for samples that have the D614G mutation
accessions = variant_df[variant_df.start_position == 23402].sequence_accession


#### Filter metadata by collection month (March and August 2020)
march_meta_df = meta_df[meta_df.collection_date.str.contains('2020-03', na=False)] 
august_meta_df = meta_df[meta_df.collection_date.str.contains('2020-08', na=False)]

#### Filter meta table by sample accessions to obtain contingency data
h0_n = march_meta_df[~(march_meta_df.accession.isin(accessions))].shape[0]
h0_p = march_meta_df[(march_meta_df.accession.isin(accessions))].shape[0]
h1_n = august_meta_df[~(august_meta_df.accession.isin(accessions))].shape[0]
h1_p = august_meta_df[(august_meta_df.accession.isin(accessions))].shape[0]

#### Chi2 test
chi_results = chi2_contingency([[h0_n, h0_p],
                                [h1_n, h1_p]])
print('χ2:', chi_results[0])
print('p:', chi_results[1])
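For intuition, the 2x2 chi-square statistic (without continuity correction) can be computed directly from the shortcut formula chi2 = n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The counts below are made up for illustration; note that scipy's chi2_contingency applies Yates' continuity correction to 2x2 tables by default, so its result will differ slightly:

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic (no correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows = months, columns = variant-negative / variant-positive.
stat = chi2_2x2(10, 20, 30, 40)
print(round(stat, 4))  # 0.7937
```

A large statistic (and a small p-value) would indicate that the variant's prevalence changed between the two months.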

3. Phylogenetic Trees Using SciPy

In this example, we will be creating a phylogenetic tree for the 20 most common variants in order to investigate which variants most commonly occur together.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform

#### Filter to the 20 most frequent variant positions
v_df = variant_df[variant_df.start_position.isin(variant_df.start_position.value_counts().head(20).index)]

#### Distance calculations
ct           = pd.crosstab(v_df.sequence_accession, v_df.start_position)
corr         = squareform(pdist(ct.T, metric='hamming'))
corr_linkage = hierarchy.ward(corr)

#### Plot misc
sns.set_style("white")
plt.rcParams["font.family"] = "serif"

#### Figure
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))
fig.set_facecolor('white')
ax1, ax2 = ax

#### Plot
_ = hierarchy.dendrogram(corr_linkage, labels=ct.columns, leaf_rotation=90, ax=ax1)
sns.heatmap(pd.DataFrame(corr, columns=ct.columns, index=ct.columns), ax=ax2)

#### Labels
_ = fig.suptitle('Phylogenetic Analysis on SARS-CoV-2 VOC SNPs', fontsize=28)
_ = ax1.set_xlabel('SNP', fontsize=16)
_ = ax1.set_ylabel('Distance', fontsize=16)
_ = ax2.set_xlabel('SNP', fontsize=16)
_ = ax2.set_ylabel('SNP', fontsize=16)

#### Ticks
_ = ax1.tick_params(axis='both', labelsize=14)
_ = ax2.tick_params(axis='both', labelsize=14)
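The distance metric above is worth unpacking: after the crosstab, each variant position becomes a 0/1 presence vector over samples, and the Hamming distance between two positions is the fraction of samples on which they disagree. A tiny worked example with made-up vectors:

```python
def hamming(u, v):
    """Fraction of coordinates where two equal-length vectors differ."""
    assert len(u) == len(v)
    return sum(x != y for x, y in zip(u, v)) / len(u)

# Presence/absence of two variant positions across four samples.
pos_a = [1, 0, 1, 1]
pos_b = [1, 1, 0, 1]
print(hamming(pos_a, pos_b))  # 0.5: they disagree on 2 of 4 samples
```

Positions that co-occur in the same samples have a small Hamming distance, so Ward clustering groups them together in the dendrogram.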

Running your notebooks on Azure ML

Download the notebooks

Download the .zip here

The zip folder should contain the Jupyter notebooks and the requirements.txt file.

Create a workspace
  1. Navigate to the Azure Portal.
  2. Select Create a new Resource.
  3. Select AI + Machine Learning > Machine Learning.
  4. Set up the workspace by entering your subscription, resource group, region, and other required information.
  5. Navigate to Machine Learning Studio by clicking Launch Studio.

Further reference for creating an Azure Machine Learning workspace is available here.

Add the notebooks to your workspace

Navigate to the Notebooks tab, then upload the Jupyter Notebooks and requirements.txt from your local PC.

Once uploaded, open the notebook.

Configure your environment
  1. Select the Compute instance that will execute your notebook.
  2. Once the virtual machine is set up, set your kernel to Python 3.
Run the notebook

You can run the entire Jupyter notebook by clicking the run-all button at the upper left of the notebook.

Alternatively, you can run individual cells by clicking the run button on each cell.