mhcflurry
Open source neural network models for peptide-MHC binding affinity prediction
The adaptive immune system depends on the presentation of protein fragments by MHC molecules. Machine learning models of this interaction are used in studies of infectious diseases, autoimmune diseases, vaccine development, and cancer immunotherapy.
MHCflurry currently supports peptide / MHC class I affinity prediction using one model per MHC allele. The predictors may be trained on data augmented with values imputed from other alleles (see Rubinsteyn 2016). We anticipate adding more models, including pan-allele and class II predictors.
You can fit MHCflurry models to your own data or download trained models that we provide. Our models are trained on data from IEDB and Kim 2014. See here for details on the training data preparation. The steps we use to train predictors on this data, including hyperparameter selection using cross validation, are here.
The MHCflurry predictors are implemented in Python using keras.
Setup
To configure keras, the neural network library used by MHCflurry, you'll need to set an environment variable in your shell:
export KERAS_BACKEND=theano
If you're familiar with keras, you may also try using the tensorflow backend. MHCflurry is currently tested using theano, however.
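If you launch MHCflurry from a Python script or notebook rather than a shell, the same setting can be applied from Python. The sketch below assumes that keras reads KERAS_BACKEND when it is first imported, so the variable must be set before importing keras or mhcflurry. The helper name use_keras_backend is ours and is not part of either library.

```python
import os


def use_keras_backend(name="theano", override=False):
    """Set KERAS_BACKEND for this process and return the effective value.

    Hypothetical helper (not part of keras or mhcflurry). Unless override
    is True, a value already exported in the shell is left untouched.
    """
    if override or "KERAS_BACKEND" not in os.environ:
        os.environ["KERAS_BACKEND"] = name
    return os.environ["KERAS_BACKEND"]


# Call this before the first `import keras` or `import mhcflurry`.
backend = use_keras_backend("theano")
print("keras backend:", backend)
```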
Now install the package:
pip install mhcflurry
Then download our datasets and trained models:
mhcflurry-downloads fetch
From a checkout you can run the unit tests with:
nosetests .
Making predictions from the command line
$ mhcflurry-predict --alleles HLA-A0201 HLA-A0301 --peptides SIINFEKL SIINFEKD SIINFEKQ
Predicting for 2 alleles and 3 peptides = 6 predictions
allele,peptide,mhcflurry_prediction
HLA-A0201,SIINFEKL,10672.34765625
HLA-A0201,SIINFEKD,26042.716796875
HLA-A0201,SIINFEKQ,26375.794921875
HLA-A0301,SIINFEKL,25532.703125
HLA-A0301,SIINFEKD,24997.876953125
HLA-A0301,SIINFEKQ,28262.828125
You can also specify the input and output as CSV files. Run mhcflurry-predict -h for details.
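The CSV output is easy to post-process. As a minimal sketch, the snippet below parses the rows printed in the example above with Python's standard csv module and reports the strongest predicted binder (lowest affinity value) for each allele. It assumes that CSV output written to a file has the same three columns as the output shown here; the function name is illustrative.

```python
import csv
import io

# Rows copied from the mhcflurry-predict example above.
OUTPUT = """\
allele,peptide,mhcflurry_prediction
HLA-A0201,SIINFEKL,10672.34765625
HLA-A0201,SIINFEKD,26042.716796875
HLA-A0201,SIINFEKQ,26375.794921875
HLA-A0301,SIINFEKL,25532.703125
HLA-A0301,SIINFEKD,24997.876953125
HLA-A0301,SIINFEKQ,28262.828125
"""


def strongest_binders(csv_text):
    """Return {allele: (peptide, affinity_nM)}, keeping the lowest affinity."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        allele = row["allele"]
        affinity = float(row["mhcflurry_prediction"])
        if allele not in best or affinity < best[allele][1]:
            best[allele] = (row["peptide"], affinity)
    return best


for allele, (peptide, affinity) in sorted(strongest_binders(OUTPUT).items()):
    print(f"{allele}: {peptide} ({affinity:.1f} nM)")
```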
Making predictions from Python
from mhcflurry import predict
predict(alleles=['A0201'], peptides=['SIINFEKL'])
   Allele   Peptide    Prediction
0   A0201  SIINFEKL  10672.347656
The predictions returned by predict are affinities (KD) in nM.
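Lower values mean tighter binding. As a rough illustration of working with these numbers, the sketch below applies two conventions that are common in the MHC binding literature but are not stated in this README: a 500 nM cutoff for calling a peptide a binder, and the transform 1 - log(x) / log(50000), which maps affinities onto a 0 to 1 scale. Both constants and both function names are assumptions made for the example.

```python
import math

BINDER_THRESHOLD_NM = 500.0  # assumed cutoff; a common convention, not from this README
MAX_AFFINITY_NM = 50000.0    # assumed upper bound used by the log transform


def is_binder(affinity_nm, threshold_nm=BINDER_THRESHOLD_NM):
    """True if the affinity is at or below the chosen nM cutoff."""
    return affinity_nm <= threshold_nm


def to_unit_scale(affinity_nm, max_nm=MAX_AFFINITY_NM):
    """Map an affinity in nM onto [0, 1], where 1 is the tightest binding."""
    clipped = min(max(affinity_nm, 1.0), max_nm)
    return 1.0 - math.log(clipped) / math.log(max_nm)


affinity = 10672.347656  # SIINFEKL / A0201 prediction shown above
print(is_binder(affinity), round(to_unit_scale(affinity), 3))
```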
Training your own models
See the class1_allele_specific_models.ipynb notebook for an overview of the Python API, including predicting, fitting, and scoring models.
There is also a script called mhcflurry-class1-allele-specific-cv-and-train that will perform cross validation and model selection given a CSV file of training data. Try mhcflurry-class1-allele-specific-cv-and-train --help for details.
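For readers unfamiliar with the procedure, here is a generic sketch of k-fold cross validation used for model selection. It is not the implementation behind mhcflurry-class1-allele-specific-cv-and-train. The function names, the toy "model" (predict the training mean scaled by a hyperparameter), and the squared-error score are all made up for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


def cv_error(values, shrink, k=3):
    """Cross-validated mean squared error of a toy model.

    The toy model predicts shrink * mean(training values) for every sample.
    """
    errors = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        prediction = shrink * sum(values[j] for j in train_idx) / len(train_idx)
        errors.extend((values[j] - prediction) ** 2 for j in test_idx)
    return sum(errors) / len(errors)


def select_hyperparameter(values, candidates, k=3):
    """Pick the candidate with the lowest cross-validated error."""
    return min(candidates, key=lambda shrink: cv_error(values, shrink, k))


# Toy targets on a 0 to 1 scale; not real binding data.
toy_targets = [0.2, 0.4, 0.3, 0.5, 0.1, 0.3]
print(select_hyperparameter(toy_targets, candidates=[0.5, 1.0, 1.5]))
```

The important property is that every hyperparameter setting is scored only on samples that were held out while fitting, so the selected setting is not simply the one that best memorizes the training data.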
Details on the downloaded class I allele-specific models
Besides the actual model weights, the data downloaded with mhcflurry-downloads fetch also includes a CSV file giving the hyperparameters used for each predictor. Another CSV gives the cross validation results used to select these hyperparameters.
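To see the hyperparameters for the production models, run the following (open is the macOS command for opening a file; on Linux, xdg-open is the usual equivalent):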
open "$(mhcflurry-downloads path models_class1_allele_specific_single)/production.csv"
To see the cross validation results:
open "$(mhcflurry-downloads path models_class1_allele_specific_single)/cv.csv"
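To inspect the same files from Python, the sketch below resolves the directory with the mhcflurry-downloads path command used above and reads a CSV with the standard library. It assumes mhcflurry-downloads is on your PATH, that the models have been fetched, and that the files start with a header row. The helper names are ours.

```python
import csv
import os
import shutil
import subprocess

DOWNLOAD_NAME = "models_class1_allele_specific_single"


def downloads_path(name):
    """Ask the mhcflurry-downloads CLI where a named download is stored."""
    out = subprocess.check_output(["mhcflurry-downloads", "path", name])
    return out.decode().strip()


def read_rows(csv_path):
    """Load a CSV file with a header row as a list of dicts."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


if shutil.which("mhcflurry-downloads") is None:
    print("mhcflurry-downloads was not found on PATH")
else:
    production_csv = os.path.join(downloads_path(DOWNLOAD_NAME), "production.csv")
    # Show the first few rows; swap in cv.csv for the cross validation results.
    for row in read_rows(production_csv)[:5]:
        print(row)
```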
Problems and Solutions
undefined symbol
If you get an error like:
ImportError: _CVXcanon.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEED1Ev
try installing cvxpy using conda instead of pip.
Environment variables
The path where MHCflurry looks for model weights and data can be set with the MHCFLURRY_DOWNLOADS_DIR environment variable. This directory should contain subdirectories like "models_class1_allele_specific_single". Setting this variable overrides the other environment variables described below.
If you only want to change the version of the released data used, you can set MHCFLURRY_DOWNLOADS_CURRENT_RELEASE. If you want to change the base directory used for all releases, set MHCFLURRY_DATA_DIR.
By default, MHCFLURRY_DATA_DIR is a platform-specific application storage directory, MHCFLURRY_DOWNLOADS_CURRENT_RELEASE is the latest release, and MHCFLURRY_DOWNLOADS_DIR is set to $MHCFLURRY_DATA_DIR/$MHCFLURRY_DOWNLOADS_CURRENT_RELEASE.
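The precedence among these three variables can be summarized in a few lines of Python. This is a sketch of the behavior described in this section, not MHCflurry's actual code. The function name and the two default arguments, which stand in for the platform-specific directory and the name of the latest release, are placeholders.

```python
import os


def resolve_downloads_dir(environ, default_data_dir, latest_release):
    """Return the directory searched for models and data."""
    # An explicit MHCFLURRY_DOWNLOADS_DIR overrides the other two variables.
    if "MHCFLURRY_DOWNLOADS_DIR" in environ:
        return environ["MHCFLURRY_DOWNLOADS_DIR"]
    # Otherwise combine the base directory with the release name,
    # falling back to the defaults for whichever variable is unset.
    data_dir = environ.get("MHCFLURRY_DATA_DIR", default_data_dir)
    release = environ.get("MHCFLURRY_DOWNLOADS_CURRENT_RELEASE", latest_release)
    return os.path.join(data_dir, release)


# Placeholder defaults; the real values are chosen by MHCflurry.
print(resolve_downloads_dir(os.environ, "/path/to/app-data/mhcflurry", "latest"))
```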