[![Build Status](https://travis-ci.org/hammerlab/mhcflurry.svg?branch=master)](https://travis-ci.org/hammerlab/mhcflurry) [![Coverage Status](https://coveralls.io/repos/github/hammerlab/mhcflurry/badge.svg?branch=master)](https://coveralls.io/github/hammerlab/mhcflurry?branch=master)
# mhcflurry
Open source neural network models for peptide-MHC binding affinity prediction
The [adaptive immune system](https://en.wikipedia.org/wiki/Adaptive_immune_system)
depends on the presentation of protein fragments by [MHC](https://en.wikipedia.org/wiki/Major_histocompatibility_complex)
molecules. Machine learning models of this interaction are used in studies of
infectious diseases, autoimmune diseases, vaccine development, and cancer
immunotherapy.
MHCflurry supports Class I peptide/MHC binding affinity prediction using
ensembles of allele-specific models (i.e. one model per allele). Pan-allelic
prediction is supported in the software but does not yet perform accurately
and should not be used.

MHCflurry supports allele-specific peptide / [MHC class I](https://en.wikipedia.org/wiki/MHC_class_I) affinity prediction using two approaches:
 * Ensembles of predictors trained on random halves of the training data (the default)
 * Single-model predictors for each allele trained on all data

For both kinds of predictors, you can fit models to your own data or download
trained models that we provide.
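
The idea behind the first approach (ensembles trained on random halves of the data) can be sketched in plain Python. This is only an illustration, not MHCflurry's implementation: the toy "model", the helper names, the made-up training rows, and the choice to average member predictions with a plain mean are all assumptions.

```python
import random
import statistics


def random_half(rows, rng):
    """Return a random half of rows, sampled without replacement."""
    return rng.sample(rows, len(rows) // 2)


def fit_mean_model(rows):
    """Toy 'model' (hypothetical): predicts the mean affinity it was fit on."""
    mean_affinity = statistics.mean(affinity for _, affinity in rows)
    return lambda peptide: mean_affinity


def fit_ensemble(rows, n_models=8, seed=0):
    """Fit n_models toy models, each on its own random half of rows."""
    rng = random.Random(seed)
    return [fit_mean_model(random_half(rows, rng)) for _ in range(n_models)]


def ensemble_predict(models, peptide):
    """Combine member predictions; using the plain mean is an assumption."""
    return statistics.mean(model(peptide) for model in models)


# Made-up (peptide, affinity in nM) pairs, for illustration only.
training_rows = [("AAAAAAAAA", 100.0), ("CCCCCCCCC", 300.0),
                 ("DDDDDDDDD", 500.0), ("EEEEEEEEE", 700.0)]
models = fit_ensemble(training_rows)
prediction = ensemble_predict(models, "SIINFEKL")
```

Because each member sees a different half of the data, the spread of the member predictions also gives a rough sense of how sensitive a prediction is to the training set.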

The downloadable models were trained on data from
[IEDB](http://www.iedb.org/home_v3.php) and [Kim 2014](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241).
The training dataset is available [here]().

In validation experiments using presented peptides identified by mass
spectrometry, the ensemble models perform best. We are working on a performance
comparison of these models with other predictors such as NetMHCpan, which we
hope to make available soon.

We anticipate adding additional models, including pan-allele and class II predictors.
The MHCflurry predictors are implemented in Python using [keras](https://keras.io).
To configure keras you'll need to set an environment variable in your shell:

```
export KERAS_BACKEND=theano
```

If you're familiar with keras, you may also try the tensorflow backend, but note that MHCflurry is currently tested using theano.
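
If you would rather not depend on the shell configuration, the same variable can be set from Python. This is a minimal sketch, assuming that keras reads `KERAS_BACKEND` when it is first imported and that importing `mhcflurry` triggers that import; both are assumptions about library behavior rather than something this README states.

```python
import os

# This must run before the first `import keras` (or `import mhcflurry`) in
# the process, since we assume the backend is chosen at import time.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("KERAS_BACKEND", "theano")

backend = os.environ["KERAS_BACKEND"]
```

Using `setdefault` rather than plain assignment means an explicit `export KERAS_BACKEND=...` in the shell still wins.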
 

Now install the package with `pip install mhcflurry`.
Then download our datasets and trained models with `mhcflurry-downloads fetch`.
From a checkout you can run the unit tests with:

## Making predictions from the command-line

```shell
$ mhcflurry-predict --alleles HLA-A0201 HLA-A0301 --peptides SIINFEKL SIINFEKD SIINFEKQ
Predicting for 2 alleles and 3 peptides = 6 predictions
allele,peptide,mhcflurry_prediction
HLA-A0201,SIINFEKL,10672.34765625
HLA-A0201,SIINFEKD,26042.716796875
HLA-A0201,SIINFEKQ,26375.794921875
HLA-A0301,SIINFEKL,25532.703125
HLA-A0301,SIINFEKD,24997.876953125
HLA-A0301,SIINFEKQ,28262.828125
```

You can also specify the input and output as CSV files. Run `mhcflurry-predict -h` for details.
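
The CSV that `mhcflurry-predict` prints is easy to post-process with the standard library. The sketch below parses a subset of the rows from the example output above and picks the strongest predicted binder. It assumes you have captured only the header and data rows (not the "Predicting for ..." status line), and it relies on the predictions being affinities in nM, so that lower values mean tighter binding.

```python
import csv
import io

# A subset of the example output shown above: header plus one row per pair.
output = """allele,peptide,mhcflurry_prediction
HLA-A0201,SIINFEKL,10672.34765625
HLA-A0201,SIINFEKD,26042.716796875
HLA-A0301,SIINFEKL,25532.703125
"""

rows = [
    (row["allele"], row["peptide"], float(row["mhcflurry_prediction"]))
    for row in csv.DictReader(io.StringIO(output))
]

# Lower predicted affinity (nM) means tighter binding, so sort ascending.
rows.sort(key=lambda row: row[2])
strongest = rows[0]
```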


## Making predictions from Python

```python
from mhcflurry import predict
predict(alleles=['A0201'], peptides=['SIINFEKL'])
```

```
  Allele   Peptide  Prediction
0  A0201  SIINFEKL  10672.347656
```

The predictions returned by `predict` are affinities (KD) in nM.

## Training your own models

See the [class1_allele_specific_models.ipynb](https://github.com/hammerlab/mhcflurry/blob/master/examples/class1_allele_specific_models.ipynb) notebook for an overview of the Python API, including predicting, fitting, and scoring single-model predictors. There is also a script called `mhcflurry-class1-allele-specific-cv-and-train` that will perform cross validation and model selection given a CSV file of training data. Try `mhcflurry-class1-allele-specific-cv-and-train --help` for details.
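
For readers new to the idea, cross validation with model selection can be sketched as follows. This is a generic illustration, not what `mhcflurry-class1-allele-specific-cv-and-train` does internally: the toy predictor, the `shrinkage` hyperparameter, and the made-up measurements are all hypothetical.

```python
import statistics


def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_bounds = [round(i * n_rows / k) for i in range(k + 1)]
    for start, stop in zip(fold_bounds, fold_bounds[1:]):
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if not start <= i < stop]
        yield train, test


def cv_error(values, shrinkage, k=3):
    """Mean absolute held-out error of a toy shrunken-mean predictor."""
    errors = []
    for train, test in k_fold_splits(len(values), k):
        prediction = shrinkage * statistics.mean(values[i] for i in train)
        errors.extend(abs(values[i] - prediction) for i in test)
    return statistics.mean(errors)


# Made-up measurements; `shrinkage` stands in for a real hyperparameter.
values = [90.0, 110.0, 100.0, 95.0, 105.0, 100.0]
best_shrinkage = min([0.5, 1.0, 1.5], key=lambda s: cv_error(values, s))
```

Each candidate hyperparameter is scored only on rows it was not fit on, and the candidate with the lowest held-out error is selected.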

The ensemble predictors are trained similarly using the `mhcflurry-class1-allele-specific-ensemble-train` command.
## Details on the downloadable models
The scripts we use to train predictors, including hyperparameter selection
using cross validation, are
[here](downloads-generation/models_class1_allele_specific_ensemble)
for the ensemble predictors and [here](downloads-generation/models_class1_allele_specific_single)
for the single-model predictors.

For the ensemble predictors, we also generate a [report](http://htmlpreview.github.io/?https://github.com/hammerlab/mhcflurry/blob/master/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.html)
that describes the hyperparameters selected and the test performance of each
model.

Besides the model weights, the data downloaded when you run
`mhcflurry-downloads fetch` also includes a CSV file giving the
hyperparameters used for each predictor. Run `mhcflurry-downloads path
models_class1_allele_specific_ensemble` or `mhcflurry-downloads path
models_class1_allele_specific_single` to get the directory where these files are stored.
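
If you need that directory inside a Python script, one option is to call the command through `subprocess`. This is a minimal sketch, assuming the command prints the directory as a single line on standard output; the `downloads_path` helper and its `run` parameter are hypothetical names introduced here, not part of MHCflurry.

```python
import subprocess


def downloads_path(name, run=subprocess.check_output):
    """Return the directory printed by `mhcflurry-downloads path <name>`.

    `run` is injectable so the helper can be exercised without the CLI.
    """
    output = run(["mhcflurry-downloads", "path", name])
    return output.decode().strip()
```

With MHCflurry installed and the models fetched, `downloads_path("models_class1_allele_specific_ensemble")` should return the same path the command prints in your shell.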
## Problems and Solutions

### Undefined symbol
If you get an error like:

```
ImportError: _CVXcanon.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEED1Ev
```

Try installing cvxpy using conda instead of pip.


## Environment variables

The path where MHCflurry looks for model weights and data can be set with the `MHCFLURRY_DOWNLOADS_DIR` environment variable. This directory should contain subdirectories like "models_class1_allele_specific_single". Setting this variable overrides the other environment variables described below.

If you only want to change the version of the released data used, you can set `MHCFLURRY_DOWNLOADS_CURRENT_RELEASE`. If you want to change the base directory used for all releases, set `MHCFLURRY_DATA_DIR`.

By default, `MHCFLURRY_DATA_DIR` is a platform-specific application storage directory, `MHCFLURRY_DOWNLOADS_CURRENT_RELEASE` is the latest release, and `MHCFLURRY_DOWNLOADS_DIR` is set to `$MHCFLURRY_DATA_DIR/$MHCFLURRY_DOWNLOADS_CURRENT_RELEASE`.
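
The precedence among these variables can be summarized in a few lines of Python. This is a sketch of the rules as described in this section, not MHCflurry's own code: `resolve_downloads_dir` is a hypothetical helper, and the two defaults are passed in as arguments because their actual values are not given here.

```python
import os


def resolve_downloads_dir(env, default_data_dir, latest_release):
    """Apply the precedence described above to a mapping of variables."""
    if env.get("MHCFLURRY_DOWNLOADS_DIR"):
        # An explicit downloads directory overrides everything else.
        return env["MHCFLURRY_DOWNLOADS_DIR"]
    # Otherwise combine the base directory and release, each with a default.
    data_dir = env.get("MHCFLURRY_DATA_DIR") or default_data_dir
    release = env.get("MHCFLURRY_DOWNLOADS_CURRENT_RELEASE") or latest_release
    return os.path.join(data_dir, release)
```

For example, calling it with `os.environ` and placeholder defaults shows which directory a given shell configuration would select.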