
Machine Learning Using COVID Cloud Variant Data and Azure

This guide will cover the following topics:

  1. Performing machine learning with SARS-CoV-2 variant data from COVID Cloud using the covid_cloud Python library
  2. Storing and running your models in the cloud using Azure Notebooks

The machine learning in this tutorial will include:

  1. Linear Regression using statsmodels
  2. K-nearest-neighbor using scikit-learn
  3. Deep Learning Compression using TensorFlow / Keras

Before beginning, you should get up to speed with the provided Python library for COVID Cloud. Make sure you have installed the library and restarted the notebook if necessary. To install the remaining dependencies, make sure the requirements.txt file is in the same directory as the notebook.

The Jupyter Notebook and requirements.txt can be found in this zip package.

pip install --no-cache-dir -r requirements.txt

Next, use the Python Library to download COVID variant data.

There are two main tables we want to use for the analysis:

import pandas as pd
import json

from covid_cloud import COVIDCloud

# Create the client
search_url = 'https://search.international.covidcloud.ca/'
covid_client = COVIDCloud(search_url=search_url)

# Download metadata
print('Fetching metadata from COVID Cloud…')
query = 'SELECT * FROM covid.cloud.sequences'
meta_df = pd.DataFrame(covid_client.query(query))
print("Metadata:")
print(meta_df)

# Download variant data
print('Fetching variants from COVID Cloud…')
query = 'SELECT * FROM covid.cloud.variants'
variant_df = pd.DataFrame(covid_client.query(query))
print("Variant Data:")
print(variant_df)
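
As an optional convenience, the snippet below caches the two downloaded tables to local pickle files so that later runs can reload them without re-querying COVID Cloud; the file names are arbitrary choices.

# Cache the downloaded tables locally (optional)
meta_df.to_pickle('sequences.pkl')
variant_df.to_pickle('variants.pkl')

# On a later run, reload the cached copies instead of querying again:
# meta_df = pd.read_pickle('sequences.pkl')
# variant_df = pd.read_pickle('variants.pkl')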

1. Linear Regression Using Statsmodels

In this example, the statsmodels library is used to estimate the weekly rate of change in the prevalence of a particular variant by fitting a linear regression.

import numpy as np
import statsmodels.api as sm

from datetime import timedelta

# Map each release date to the start of its week (Monday)
meta_df.release_date = pd.to_datetime(meta_df.release_date).apply(lambda x: x - timedelta(days=x.dayofweek))

# Weekly counts of sequences with and without the SNP of interest (start position 23402)
accessions = variant_df[variant_df.start_position == 23402].sequence_accession
pos_count = meta_df[(meta_df.accession.isin(accessions))].groupby('release_date').count().host
neg_count = meta_df[~(meta_df.accession.isin(accessions))].groupby('release_date').count().host

# Merge the positive and negative weekly counts
merge_df = pd.merge(neg_count, pos_count, left_index=True, right_index=True, how='inner').fillna(0)

# Linear regression on the weekly proportion of sequences with the SNP
y = merge_df.host_y / merge_df.sum(axis=1)
x = sm.add_constant(np.arange(len(y)))
model = sm.OLS(y, x).fit()
b1 = model.params[1].round(3)
p = model.pvalues[1].round(3)
print(f'The weekly rate of change is {b1} with a p-value of {p}')
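
As a quick sanity check on the fit, the weekly proportions can be plotted against the fitted regression line. The sketch below reuses y, x, and model from the cell above.

import matplotlib.pyplot as plt

# Plot the observed weekly proportion of the SNP against the OLS fit
plt.figure(figsize=(8, 4))
plt.plot(np.arange(len(y)), y, 'o', label='Observed weekly proportion')
plt.plot(np.arange(len(y)), model.predict(x), label='OLS fit')
plt.xlabel('Week index')
plt.ylabel('Proportion of sequences with the SNP')
plt.legend()
plt.show()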

2. K-nearest-neighbor Using Scikit-learn

In this example, scikit-learn is used to classify samples with the K-nearest-neighbors algorithm. The goal of this analysis is to see whether there is a difference in the mutation profiles of American and Australian samples.

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Get the metadata for the USA and Australian samples
meta_df_oi = meta_df[(meta_df.location == 'USA') |
                     (meta_df.location == 'Australia: Victoria')]

# Filter variant_df by meta_df accessions
v_df = variant_df[variant_df.sequence_accession.isin(meta_df_oi.accession)]

# Keep the top 100 most prevalent mutations
v_df = v_df[v_df.start_position.isin(v_df.start_position.value_counts().head(100).index)]

# Cross-tabulate samples (rows) against mutation positions (columns)
ct = pd.crosstab(v_df.sequence_accession, v_df.start_position)

# Get locations (y-variable)
locations = pd.merge(ct,
                     meta_df_oi[['accession', 'location']],
                     left_index=True,
                     right_on='accession').location

# Reduce the cross-tabulated data to two dimensions with PCA
X_pca = PCA(n_components=2).fit_transform(ct.values)

# KNN classification
knn = KNeighborsClassifier().fit(X_pca, locations)

# Predictions
predictions = knn.predict(X_pca)

# Figure
fig = plt.figure(figsize=(8, 8))

# Plot
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=(predictions == 'USA').astype(int))

# Labels
plt.xlabel('PCA_1')
plt.ylabel('PCA_2')

# Legend
plt.legend(handles=scatter.legend_elements()[0], labels=['Australia', 'USA'])

# Accuracy
print('KNN model with training accuracy of:', knn.score(X_pca, locations))
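
The accuracy above is measured on the same samples the model was fit on, so it is an optimistic estimate. As an optional extension, the sketch below holds out a test set for a less biased estimate, reusing X_pca and locations from the cell above; the 25% split and random seed are arbitrary choices.

from sklearn.model_selection import train_test_split

# Evaluate on held-out data instead of the training set (optional)
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, locations, test_size=0.25, random_state=0, stratify=locations)
knn_holdout = KNeighborsClassifier().fit(X_train, y_train)
print('Held-out accuracy:', knn_holdout.score(X_test, y_test))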

3. Deep Learning Compression Using TensorFlow / Keras

In this example, TensorFlow is used to compress the variant matrix into a low-dimensional representation, reducing its size while preserving meaningful structure.

TensorFlow is a large library, so its installation is kept separate from requirements.txt. Run the cell below to install TensorFlow; the download should take less than 5 minutes.

pip install --no-cache-dir tensorflow==2.4.1

After the installation is complete, run the following:

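The block below is a minimal autoencoder sketch for this step, assuming the goal is to compress each sample's mutation profile from the ct crosstab built in the previous example into a two-dimensional encoding; the layer sizes, epoch count, and batch size are illustrative choices rather than tuned values.

from tensorflow.keras import layers, Model

# Binarise the crosstab from the previous example to presence/absence of
# each of the top 100 mutations, one row per sample
X = (ct.values > 0).astype('float32')
n_features = X.shape[1]

# Encoder: n_features -> 32 -> 2; decoder mirrors it back to n_features
inputs = layers.Input(shape=(n_features,))
encoded = layers.Dense(32, activation='relu')(inputs)
encoded = layers.Dense(2, activation='relu')(encoded)
decoded = layers.Dense(32, activation='relu')(encoded)
decoded = layers.Dense(n_features, activation='sigmoid')(decoded)

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

# Train the autoencoder to reconstruct its own input
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# The two-dimensional encodings take the place of the full variant matrix
X_encoded = encoder.predict(X)
print('Original matrix shape:  ', X.shape)
print('Compressed matrix shape:', X_encoded.shape)

Because the encoder and decoder are trained jointly to reconstruct the input, X_encoded plays a similar role to the PCA projection in the previous example and can be plotted or passed to a classifier.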

Running Your Notebooks With Azure ML

Download the Notebooks

Download the .zip here.

The zip folder contains the Jupyter Notebook and requirements.txt.

Create a Workspace
  1. Navigate to the Azure Portal.
  2. Select Create a resource.
  3. Select AI + Machine Learning > Machine Learning.
  4. Set up the workspace by entering your subscription, resource group, region, and other required information.
  5. Navigate to Machine Learning Studio by clicking Launch Studio.

Further reference for creating an Azure Machine Learning workspace is available here.

Add the notebooks to your workspace

Navigate to the Notebooks tab, then upload the Jupyter Notebooks and requirements.txt from your local PC.

Once uploaded, open the notebook.

Configure your environment
  1. Select the Compute instance that will execute your notebook.
  2. Once the virtual machine is set up, set your kernel to Python 3. If any of the tutorial's dependencies are missing from the kernel, see the cell sketch below.
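
A cell like the following can install missing dependencies from the requirements.txt uploaded alongside the notebook (a minimal sketch, assuming the file sits in the same directory as the notebook):

%pip install --no-cache-dir -r requirements.txt
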
Run the notebook

You can run the entire Jupyter notebook by clicking the run-all button at the upper left of the notebook.

Alternatively, you can run individual cells by clicking the run button on each cell.