Commit cb88dac8 authored by Tim O'Donnell, committed by GitHub

Merge pull request #84 from hammerlab/add-class1-ensemble-rebased

Add class1 ensemble models, with model selection using imputation
parents e93299b6 741d873e
Showing 20036 additions and 237 deletions
@@ -45,6 +45,8 @@ env:
 script:
   # download data and models, then run tests
   - mhcflurry-downloads fetch
+  - mhcflurry-downloads fetch models_class1_allele_specific_ensemble
+  - mhcflurry-downloads fetch models_class1_allele_specific_single
   - mhcflurry-downloads info # just to test this command works
   - nosetests test --with-coverage --cover-package=mhcflurry && ./lint.sh
 after_success:
...
@@ -42,14 +42,11 @@ WORKDIR /home/user
 # Setup virtual envs and install convenience packages. Note: installing
 # cherrypy as part of the mhcflurry installation weirdly fails on a unicode
 # issue in python2, but installing it separately seems to work.
-# We also install bokeh so that dask distributed will have an admin web interface.
 RUN virtualenv venv-py3 --python=python3 && \
     venv-py3/bin/pip install --upgrade pip && \
     venv-py3/bin/pip install --upgrade \
         numpy \
-        bokeh \
         cherrypy \
-        git+https://github.com/dask/distributed.git \
         jupyter \
         lxml \
         scipy \
@@ -59,9 +56,10 @@ RUN virtualenv venv-py3 --python=python3 && \
 ENV KERAS_BACKEND theano
 # RUN venv-py3/bin/pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0-cp35-cp35m-linux_x86_64.whl
 
-# Install mhcflurry and download data and models.
+# Install mhcflurry and latest kubeface and download data and models.
 COPY . ./mhcflurry
-RUN venv-py3/bin/pip install ./mhcflurry && venv-py3/bin/mhcflurry-downloads fetch
+RUN venv-py3/bin/pip install --upgrade ./mhcflurry git+https://github.com/hammerlab/kubeface.git \
+    && venv-py3/bin/mhcflurry-downloads fetch
 
 EXPOSE 8888
 CMD venv-py3/bin/jupyter notebook --no-browser
@@ -5,15 +5,32 @@ Open source neural network models for peptide-MHC binding affinity prediction
 The [adaptive immune system](https://en.wikipedia.org/wiki/Adaptive_immune_system) depends on the presentation of protein fragments by [MHC](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) molecules. Machine learning models of this interaction are used in studies of infectious diseases, autoimmune diseases, vaccine development, and cancer immunotherapy.
 
-MHCflurry currently supports peptide / [MHC class I](https://en.wikipedia.org/wiki/MHC_class_I) affinity prediction using one model per MHC allele. The predictors may be trained on data that has been augmented with data imputed based on other alleles (see [Rubinsteyn 2016](http://biorxiv.org/content/early/2016/06/07/054775)). We anticipate adding additional models, including pan-allele and class II predictors.
-
-You can fit MHCflurry models to your own data or download trained models that we provide. Our models are trained on data from [IEDB](http://www.iedb.org/home_v3.php) and [Kim 2014](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241). See [here](https://github.com/hammerlab/mhcflurry/tree/master/downloads-generation/data_combined_iedb_kim2014) for details on the training data preparation. The steps we use to train predictors on this data, including hyperparameter selection using cross validation, are [here](https://github.com/hammerlab/mhcflurry/tree/master/downloads-generation/models_class1_allele_specific_single).
+MHCflurry currently supports allele-specific peptide / [MHC class I](https://en.wikipedia.org/wiki/MHC_class_I) affinity prediction using two approaches:
+
+ * Ensembles of predictors trained on random halves of the training data (the default)
+ * Single-model predictors for each allele trained on all data
+
+For both kinds of predictors, you can fit models to your own data or download
+trained models that we provide.
+
+The downloadable models were trained on data from
+[IEDB](http://www.iedb.org/home_v3.php) and [Kim 2014](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241).
+The ensemble predictors include models trained on data that has been
+augmented with values imputed from other alleles (see
+[Rubinsteyn 2016](http://biorxiv.org/content/early/2016/06/07/054775)).
+
+In validation experiments using presented peptides identified by mass-spec,
+the ensemble models perform best. We are working on a performance comparison of
+these models with other predictors such as netMHCpan, which we hope to make
+available soon.
+
+We anticipate adding additional models, including pan-allele and class II predictors.
+
+The MHCflurry predictors are implemented in Python using [keras](https://keras.io).
 
 ## Setup
 
-To configure keras, the neural network library used by MHCflurry, you'll need to set an environment variable in your shell:
+To configure keras you'll need to set an environment variable in your shell:
 
 ```
 export KERAS_BACKEND=theano
 ```
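The header of the next hunk notes that `predict` returns affinities (KD) in nM. As a minimal usage sketch, assuming the 0.2-era top-level API exported by this package and locally downloaded models (allele and peptide values are illustrative):

```
from mhcflurry import predict

# Returns a DataFrame with one row per (allele, peptide) pair;
# the Prediction column is the predicted affinity (KD) in nM.
df = predict(alleles=["HLA-A0201"], peptides=["SIINFEKL", "SIINFEQL"])
print(df)
```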
@@ -73,25 +90,28 @@ The predictions returned by `predict` are affinities (KD) in nM.
 ## Training your own models
 
-See the [class1_allele_specific_models.ipynb](https://github.com/hammerlab/mhcflurry/blob/master/examples/class1_allele_specific_models.ipynb) notebook for an overview of the Python API, including predicting, fitting, and scoring models.
-
-There is also a script called `mhcflurry-class1-allele-specific-cv-and-train` that will perform cross validation and model selection given a CSV file of training data. Try `mhcflurry-class1-allele-specific-cv-and-train --help` for details.
+See the [class1_allele_specific_models.ipynb](https://github.com/hammerlab/mhcflurry/blob/master/examples/class1_allele_specific_models.ipynb) notebook for an overview of the Python API, including predicting, fitting, and scoring single-model predictors. There is also a script called `mhcflurry-class1-allele-specific-cv-and-train` that will perform cross validation and model selection given a CSV file of training data. Try `mhcflurry-class1-allele-specific-cv-and-train --help` for details.
+
+The ensemble predictors are trained similarly using the `mhcflurry-class1-allele-specific-ensemble-train` command.
 
-## Details on the downloaded class I allele-specific models
+## Details on the downloadable models
 
-Besides the actual model weights, the data downloaded with `mhcflurry-downloads fetch` also includes a CSV file giving the hyperparameters used for each predictor. Another CSV gives the cross validation results used to select these hyperparameters.
+The scripts we use to train predictors, including hyperparameter selection
+using cross validation, are
+[here](downloads-generation/models_class1_allele_specific_ensemble)
+for the ensemble predictors and [here](downloads-generation/models_class1_allele_specific_single)
+for the single-model predictors.
 
-To see the hyperparameters for the production models, run:
+For the ensemble predictors, we also generate a [report](http://htmlpreview.github.io/?https://github.com/hammerlab/mhcflurry/blob/master/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.html)
+that describes the hyperparameters selected and the test performance of each
+model.
 
-```
-open "$(mhcflurry-downloads path models_class1_allele_specific_single)/production.csv"
-```
-
-To see the cross validation results:
-
-```
-open "$(mhcflurry-downloads path models_class1_allele_specific_single)/cv.csv"
-```
+Besides the model weights, the data downloaded when you run
+`mhcflurry-downloads fetch` also includes a CSV file giving the
+hyperparameters used for each predictor. Run `mhcflurry-downloads path
+models_class1_allele_specific_ensemble` or `mhcflurry-downloads path
+models_class1_allele_specific_single` to get the directory where these files are stored.
 
 ## Problems and Solutions
 
 ### undefined symbol
...
@@ -17,7 +17,7 @@
 """
 Combine 2013 Kim/Peters NetMHCpan dataset[*] with more recent IEDB entries
 
-* = "Dataset size and composition impact the reliability..."
+* = "AffinityMeasurementDataset size and composition impact the reliability..."
 """
 from __future__ import (
...
#!/bin/bash
if [[ $# -eq 0 ]] ; then
echo 'WARNING: This script is intended to be called with additional arguments to pass to mhcflurry-class1-allele-specific-cv-and-train'
echo 'See README.md'
fi
set -e
set -x
DOWNLOAD_NAME=models_class1_allele_specific_ensemble
SCRATCH_DIR=/tmp/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
export PYTHONUNBUFFERED=1
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
date
pip freeze
git rev-parse HEAD
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
mkdir models
cp $SCRIPT_DIR/models.py .
python models.py > models.json
time mhcflurry-class1-allele-specific-ensemble-train \
--ensemble-size 16 \
--model-architectures models.json \
--train-data "$(mhcflurry-downloads path data_combined_iedb_kim2014)/combined_human_class1_dataset.csv" \
--min-samples-per-allele 20 \
--out-manifest selected_models.csv \
--out-model-selection-manifest all_models.csv \
--out-models models \
--verbose \
"$@"
bzip2 all_models.csv
cp $SCRIPT_ABSOLUTE_PATH .
tar -cjf "../${DOWNLOAD_NAME}.tar.bz2" *
echo "Created archive: $SCRATCH_DIR/$DOWNLOAD_NAME.tar.bz2"
# Class I allele-specific models (ensemble)
This download contains trained MHC Class I allele-specific MHCflurry models. For each allele, an ensemble of predictors is trained on random halves of the training data. Model architectures are selected based on performance on the other half of the dataset, so in general each ensemble contains predictors of different architectures. At prediction time the geometric mean IC50 is taken over the trained models. The training data used is in the [data_combined_iedb_kim2014](../data_combined_iedb_kim2014) MHCflurry download.
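As a minimal sketch of the combination rule just described (a hypothetical standalone function; the actual logic lives in `Class1EnsembleMultiAllelePredictor`):

```
import numpy as np

def ensemble_ic50(per_model_ic50s):
    """Geometric mean of per-model IC50 predictions.

    Rows are models, columns are peptides.
    """
    arr = np.asarray(per_model_ic50s, dtype=float)
    return np.exp(np.log(arr).mean(axis=0))

# Three models' predictions (nM) for two peptides:
print(ensemble_ic50([[120.0, 3500.0], [95.0, 5100.0], [150.0, 2800.0]]))
```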
The training script supports multi-node parallel execution using the [kubeface](https://github.com/hammerlab/kubeface) library. To use kubeface, create a Google Storage bucket and pass it with the `--kubeface-storage` argument, as in the invocation below.
To generate this download we run:
```
./GENERATE.sh \
--parallel-backend kubeface \
--target-tasks 200 \
--kubeface-backend kubernetes \
--kubeface-storage gs://kubeface-tim \
--kubeface-worker-image hammerlab/mhcflurry-misc:latest \
--kubeface-kubernetes-task-resources-memory-mb 10000 \
--kubeface-worker-path-prefix venv-py3/bin \
--kubeface-max-simultaneous-tasks 200 \
    --kubeface-speculation-max-reruns 3
```
To debug locally:
```
./GENERATE.sh \
--parallel-backend local-threads \
--target-tasks 1
```
# Class1 allele-specific ensemble models
To generate the report, run:
```
time jupyter-nbconvert report.ipynb \
--execute \
--ExecutePreprocessor.kernel_name=python \
--ExecutePreprocessor.timeout=60 \
--to html \
--stdout > report.html
```
(source diff too large to display)
import sys
import json

from mhcflurry.class1_allele_specific_ensemble import HYPERPARAMETER_DEFAULTS
models = HYPERPARAMETER_DEFAULTS.models_grid(
impute=[False, True],
activation=["tanh"],
layer_sizes=[[12], [64], [128]],
embedding_output_dim=[8, 32, 64],
dropout_probability=[0, .1, .25],
fraction_negative=[0, .1, .2],
n_training_epochs=[250],
# Imputation arguments
impute_method=["mice"],
imputer_args=[
# Arguments specific to imputation method (mice)
{"n_burn_in": 5, "n_imputations": 50, "n_nearest_columns": 25}
],
impute_min_observations_per_peptide=[3],
impute_min_observations_per_allele=[3])
sys.stderr.write("Models: %d\n" % len(models))
print(json.dumps(models, indent=4))
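Assuming `models_grid` expands the full Cartesian product of the option lists (the multi-valued ones above are `impute`, `layer_sizes`, `embedding_output_dim`, `dropout_probability`, and `fraction_negative`), the grid size can be sanity-checked by appending a line like this to the script:

```
# Hypothetical check, under the Cartesian-product assumption:
# 2 * 3 * 3 * 3 * 3 = 162 candidate architectures.
assert len(models) == 2 * 3 * 3 * 3 * 3
```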
(collapsed diff not shown)
@@ -14,7 +14,11 @@
 from .class1_allele_specific.class1_binding_predictor import (
     Class1BindingPredictor)
-from .predict import predict
+from .prediction import predict
+from .affinity_measurement_dataset import AffinityMeasurementDataset
+from .class1_allele_specific_ensemble import Class1EnsembleMultiAllelePredictor
+from .class1_allele_specific import Class1SingleModelMultiAllelePredictor
+from .measurement_collection import MeasurementCollection
 from . import parallelism
 
 __version__ = "0.2.0"
@@ -23,5 +27,9 @@ __all__ = [
     "Class1BindingPredictor",
     "predict",
     "parallelism",
+    "AffinityMeasurementDataset",
+    "Class1EnsembleMultiAllelePredictor",
+    "Class1SingleModelMultiAllelePredictor",
+    "MeasurementCollection",
     "__version__",
 ]
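After this hunk, the new classes are importable from the package top level. A quick smoke test (a sketch; the imports come straight from the `__all__` list above):

```
from mhcflurry import (
    AffinityMeasurementDataset,
    Class1EnsembleMultiAllelePredictor,
    Class1SingleModelMultiAllelePredictor,
    MeasurementCollection,
    predict,
)

print(Class1EnsembleMultiAllelePredictor)  # imports resolve without error
```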
@@ -37,12 +37,15 @@ from .imputation_helpers import (
 )
 
 
-class Dataset(object):
+class AffinityMeasurementDataset(object):
     """
     Peptide-MHC binding dataset with helper methods for constructing
     different representations (arrays, DataFrames, dictionaries, &c).
 
+    This class is specific to affinity measurements (IC50s), whereas the
+    MeasurementCollection class supports both affinities and other
+    measurement types like mass spec hits.
+
     Design considerations:
         - want to allow multiple measurements for each pMHC pair (which can
           be dynamically combined)
@@ -50,7 +53,7 @@ class Dataset(object):
     """
     def __init__(self, df):
         """
-        Constructs a Dataset from a pandas DataFrame with the following
+        Constructs an AffinityMeasurementDataset from a pandas DataFrame with the following
        columns:
            - allele
            - peptide
@@ -89,7 +92,7 @@ class Dataset(object):
     def to_dataframe(self):
         """
-        Returns DataFrame representation of data contained in Dataset
+        Returns DataFrame representation of data contained in AffinityMeasurementDataset
         """
         return self._df
@@ -125,7 +128,7 @@ class Dataset(object):
         return len(self.to_dataframe())
 
     def __str__(self):
-        return "Dataset(n=%d, alleles=%s)" % (
+        return "AffinityMeasurementDataset(n=%d, alleles=%s)" % (
             len(self), list(sorted(self.unique_alleles())))
 
     def __repr__(self):
@@ -136,7 +139,7 @@ class Dataset(object):
         Two datasets are equal if they contain the same number of samples
         with the same properties and values.
         """
-        if type(other) is not Dataset:
+        if type(other) is not AffinityMeasurementDataset:
             return False
         elif len(self) != len(other):
             return False
@@ -180,13 +183,13 @@ class Dataset(object):
     def unique_alleles(self):
         """
-        Returns the set of allele names contained in this Dataset.
+        Returns the set of allele names contained in this AffinityMeasurementDataset.
         """
         return set(self.alleles)
 
     def unique_peptides(self):
         """
-        Returns the set of peptide sequences contained in this Dataset.
+        Returns the set of peptide sequences contained in this AffinityMeasurementDataset.
         """
         return set(self.peptides)
@@ -202,11 +205,11 @@ class Dataset(object):
         entries just for that allele.
         """
         for (allele_name, group_df) in self.to_dataframe().groupby("allele"):
-            yield (allele_name, Dataset(group_df))
+            yield (allele_name, AffinityMeasurementDataset(group_df))
 
     def groupby_allele_dictionary(self):
         """
-        Returns dictionary mapping each allele name to a Dataset containing
+        Returns dictionary mapping each allele name to an AffinityMeasurementDataset containing
         only entries from that allele.
         """
         return dict(self.groupby_allele())
@@ -314,7 +317,7 @@ class Dataset(object):
     @classmethod
     def from_single_allele_dataframe(cls, allele_name, single_allele_df):
         """
-        Construct a Dataset from a single MHC allele's DataFrame
+        Construct an AffinityMeasurementDataset from a single MHC allele's DataFrame
         """
         df = single_allele_df.copy()
         df["allele"] = allele_name
@@ -326,7 +329,7 @@ class Dataset(object):
             allele_to_peptide_to_affinity_dict):
         """
         Given nested dictionaries mapping allele -> peptide -> affinity,
-        construct a Dataset with uniform sample weights.
+        construct an AffinityMeasurementDataset with uniform sample weights.
         """
         alleles = []
         peptides = []
@@ -344,7 +347,7 @@ class Dataset(object):
     @classmethod
     def create_empty(cls):
         """
-        Returns an empty Dataset containing no pMHC entries.
+        Returns an empty AffinityMeasurementDataset containing no pMHC entries.
         """
         return cls.from_nested_dictionary({})
@@ -355,7 +358,7 @@ class Dataset(object):
             peptide_to_affinity_dict):
         """
         Given a peptide->affinity dictionary for a single allele,
-        create a Dataset.
+        create an AffinityMeasurementDataset.
         """
         return cls.from_nested_dictionary({allele_name: peptide_to_affinity_dict})
@@ -382,7 +385,7 @@ class Dataset(object):
     def get_allele(self, allele_name):
         """
-        Get Dataset for a single allele
+        Get AffinityMeasurementDataset for a single allele
         """
         if allele_name not in self.unique_alleles():
             raise KeyError("Allele '%s' not found, available alleles: %s" % (
@@ -393,7 +396,7 @@ class Dataset(object):
     def get_alleles(self, allele_names):
         """
-        Restrict Dataset to several allele names.
+        Restrict AffinityMeasurementDataset to several allele names.
         """
         datasets = []
         for allele_name in allele_names:
@@ -446,7 +449,7 @@ class Dataset(object):
             weight = row["sample_weight"]
             # we're either going to create a fresh original peptide column
             # or extend the existing original peptide tuple that tracks
-            # the provenance of entries in the new Dataset
+            # the provenance of entries in the new AffinityMeasurementDataset
             original_peptide = row.get("original_peptide")
             if original_peptide is None:
                 original_peptide = ()
@@ -569,14 +572,14 @@ class Dataset(object):
     def slice(self, indices):
         """
-        Create a new Dataset by slicing through all columns of this dataset
+        Create a new AffinityMeasurementDataset by slicing through all columns of this dataset
         with the given indices.
         """
         indices = np.asarray(indices)
         max_index = indices.max()
         n_total = len(self)
         if max_index >= len(self):
-            raise ValueError("Invalid index %d for Dataset of size %d" % (
+            raise ValueError("Invalid index %d for AffinityMeasurementDataset of size %d" % (
                 max_index, n_total))
 
         df = self.to_dataframe()
@@ -587,7 +590,7 @@ class Dataset(object):
     def random_split(self, n=None, stratify_fn=None):
         """
-        Randomly split the Dataset into smaller Dataset objects.
+        Randomly split the AffinityMeasurementDataset into smaller AffinityMeasurementDataset objects.
 
         Parameters
         ----------
@@ -598,7 +601,7 @@ class Dataset(object):
             Function that takes a row and returns bool, stratifying sampling
             into two groups.
 
-        Returns a pair of Dataset objects.
+        Returns a pair of AffinityMeasurementDataset objects.
         """
         n_total = len(self)
         if n is None:
@@ -646,7 +649,7 @@ class Dataset(object):
             test_samples = self
             test_sample_indices = np.arange(n_total)
         elif test_allele not in self.unique_alleles():
-            raise ValueError("Allele '%s' not in Dataset" % test_allele)
+            raise ValueError("Allele '%s' not in AffinityMeasurementDataset" % test_allele)
         else:
             test_sample_indices = np.where(self.alleles == test_allele)[0]
@@ -685,7 +688,7 @@ class Dataset(object):
         """
         Split an allele into training and test sets, and then impute values
         for peptides missing from the training set using data from other alleles
-        in this Dataset.
+        in this AffinityMeasurementDataset.
 
         (apologies for the wordy name, this turns out to be a common operation)
@@ -698,13 +701,13 @@ class Dataset(object):
             Size of the training set to return for this allele.
 
         stratify_fn : function
-            Function mapping from rows of the Dataset to booleans for stratifying
+            Function mapping from rows of the AffinityMeasurementDataset to booleans for stratifying
             by two groups.
 
         **kwargs : dict
-            Extra keyword arguments passed to Dataset.impute_missing_values
+            Extra keyword arguments passed to AffinityMeasurementDataset.impute_missing_values
 
-        Returns three Dataset objects:
+        Returns three AffinityMeasurementDataset objects:
             - training set with original pMHC affinities for given allele
             - larger imputed training set for given allele
             - test set
@@ -729,7 +732,7 @@ class Dataset(object):
         The two arguments are assumed to be the same length.
 
-        Returns Dataset of equal or smaller size.
+        Returns AffinityMeasurementDataset of equal or smaller size.
         """
         if len(alleles) != len(peptides):
             raise ValueError(
@@ -746,7 +749,7 @@ class Dataset(object):
         allele_peptide_pairs : list of (str, str) tuples
 
         The two arguments are assumed to be the same length.
 
-        Returns Dataset of equal or smaller size.
+        Returns AffinityMeasurementDataset of equal or smaller size.
         """
         require_iterable_of(allele_peptide_pairs, tuple)
         keys_to_remove_set = set(allele_peptide_pairs)
@@ -763,7 +766,7 @@ class Dataset(object):
         Parameters
         ----------
-        other_dataset : Dataset
+        other_dataset : AffinityMeasurementDataset
 
         Returns a new Dataset object of equal or lesser size.
         """
@@ -792,7 +795,7 @@ class Dataset(object):
         of floats and replaces some or all NaN values with synthetic
         affinities.
 
-        log_transform : function
+        log_transform : bool
             Transform affinities to log10 values before imputation
             (and then transform back afterward).
@@ -802,7 +805,7 @@ class Dataset(object):
         min_observations_per_allele : int
             Drop allele columns with fewer than this number of observed values.
 
-        Returns Dataset with original pMHC affinities and additional
+        Returns AffinityMeasurementDataset with original pMHC affinities and additional
             synthetic samples.
         """
         if isinstance(imputation_method, string_types):
...
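Taken together, a short sketch of the renamed class in use, based only on the constructors and the `impute_missing_values` docstring shown in this diff (affinity values are illustrative; real imputation needs far more data than this):

```
from mhcflurry import AffinityMeasurementDataset

dataset = AffinityMeasurementDataset.from_nested_dictionary({
    "HLA-A0201": {"SIINFEKL": 500.0, "SIINFEQL": 15000.0},
    "HLA-B0702": {"SIINFEKL": 9000.0},
})
print(dataset)  # AffinityMeasurementDataset(n=3, alleles=['HLA-A0201', 'HLA-B0702'])

# Imputation as documented above; the method is resolved by name.
imputed = dataset.impute_missing_values(
    "mice",
    log_transform=True,
    min_observations_per_peptide=1,
    min_observations_per_allele=1)
```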
-from .presentation_model import PresentationModel
+from .presentation_model import PresentationModel, build_presentation_models
 from .percent_rank_transform import PercentRankTransform
 from . import presentation_component_models, decoy_strategies
 
 __all__ = [
     "PresentationModel",
+    "build_presentation_models",
     "PercentRankTransform",
     "presentation_component_models",
     "decoy_strategies",
 ]
...
@@ -25,7 +25,7 @@ class Expression(PresentationComponentModel):
             group in expression_values.columns
             for group in experiment_to_expression_group.values())
 
-        assert_no_null(experiment_to_expression_group)
+        assert_no_null(expression_values)
 
         self.experiment_to_expression_group = experiment_to_expression_group
         self.expression_values = expression_values
...
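The fix above points the null check at the expression matrix itself rather than at the experiment-to-group mapping. A sketch of the invariant being enforced (a hypothetical reimplementation of `assert_no_null`, whose real definition is not part of this diff):

```
import pandas as pd

def assert_no_null(df, message=""):
    # The expression values themselves must contain no NaNs.
    assert not pd.DataFrame(df).isnull().any().any(), message
```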
@@ -16,6 +16,9 @@ class FixedAffinityPredictions(PresentationComponentModel):
             - "value", "percentile_rank" (IC50 and percent rank)
             - peptide (string)
             - allele (string)
+
+    name : string
+        Used to name output columns and in debug messages
     """
 
     def __init__(
...
@@ -47,7 +47,7 @@ class MHCBindingComponentModelBase(PresentationComponentModel):
     fallback_predictor : function: (allele, peptides) -> predictions
         Used when missing an allele.
 
-    iedb_dataset : mhcflurry.Dataset
+    iedb_dataset : mhcflurry.AffinityMeasurementDataset
         IEDB data for this allele. If not specified no iedb data is used.
 
     decoys_per_hit : int
...
@@ -4,8 +4,9 @@ import numpy
 import pandas
 
 from ...common import normalize_allele_name
-from ...predict import predict
+from ...prediction import predict
 from ..percent_rank_transform import PercentRankTransform
+from ...peptide_encoding import encode_peptides
 
 from .presentation_component_model import PresentationComponentModel
@@ -21,15 +22,22 @@ class MHCflurryReleased(PresentationComponentModel):
     random_peptides_for_percent_rank : list of string
         If specified, then percentile rank will be calibrated and emitted
         using the given peptides.
+
+    predictor : Class1EnsembleMultiAllelePredictor-like object
+        Predictor to use.
     """
     def __init__(
             self,
             experiment_to_alleles,
             random_peptides_for_percent_rank=None,
+            predictor=None,
+            predictor_name="mhcflurry_released",
             **kwargs):
         PresentationComponentModel.__init__(self, **kwargs)
         self.experiment_to_alleles = experiment_to_alleles
+        self.predictor = predictor
+        self.predictor_name = predictor_name
         if random_peptides_for_percent_rank is None:
             self.percent_rank_transforms = None
             self.random_peptides_for_percent_rank = None
@@ -39,9 +47,9 @@ class MHCflurryReleased(PresentationComponentModel):
                 random_peptides_for_percent_rank)
 
     def column_names(self):
-        columns = ['mhcflurry_released_affinity']
+        columns = [self.predictor_name + '_affinity']
         if self.percent_rank_transforms is not None:
-            columns.append('mhcflurry_released_percentile_rank')
+            columns.append(self.predictor_name + '_percentile_rank')
         return columns
 
     def requires_fitting(self):
@@ -59,15 +67,18 @@ class MHCflurryReleased(PresentationComponentModel):
             .Prediction.values)
 
     def predict_min_across_alleles(self, alleles, peptides):
-        alleles = [
+        alleles = list(set([
             normalize_allele_name(allele)
             for allele in alleles
-        ]
-        df = predict(alleles, numpy.unique(numpy.array(peptides)))
+        ]))
+        df = predict(
+            alleles,
+            peptides,
+            predictor=self.predictor)
         pivoted = df.pivot(index='Peptide', columns='Allele')
         pivoted.columns = pivoted.columns.droplevel()
         result = {
-            'mhcflurry_released_affinity': (
+            self.predictor_name + '_affinity': (
                 pivoted.min(axis=1).ix[peptides].values)
         }
         if self.percent_rank_transforms is not None:
@@ -77,8 +88,11 @@ class MHCflurryReleased(PresentationComponentModel):
                 percentile_ranks[allele] = (
                     self.percent_rank_transforms[allele]
                     .transform(pivoted[allele].values))
-            result['mhcflurry_released_percentile_rank'] = (
+            result[self.predictor_name + '_percentile_rank'] = (
                 percentile_ranks.min(axis=1).ix[peptides].values)
+
+        for (key, value) in result.items():
+            assert len(value) == len(peptides), (len(peptides), result)
         return result
 
     def predict_for_experiment(self, experiment_name, peptides):
...
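A standalone sketch of the min-across-alleles reduction used in `predict_min_across_alleles` above, with made-up predictions in the (Allele, Peptide, Prediction) layout this module expects:

```
import pandas as pd

df = pd.DataFrame({
    "Allele": ["HLA-A0201", "HLA-A0201", "HLA-B0702", "HLA-B0702"],
    "Peptide": ["SIINFEKL", "KLGGALQAK", "SIINFEKL", "KLGGALQAK"],
    "Prediction": [120.0, 4500.0, 9000.0, 30.0],
})
pivoted = df.pivot(index="Peptide", columns="Allele")
pivoted.columns = pivoted.columns.droplevel()
# Tightest (lowest) predicted IC50 across the experiment's alleles:
print(pivoted.min(axis=1))
```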
@@ -3,7 +3,7 @@ from copy import copy
 
 import pandas
 from numpy import log, exp, nanmean
 
-from ...dataset import Dataset
+from ...affinity_measurement_dataset import AffinityMeasurementDataset
 from ...class1_allele_specific import Class1BindingPredictor
 from ...common import normalize_allele_name
@@ -22,7 +22,7 @@ class MHCflurryTrainedOnHits(MHCBindingComponentModelBase):
     Parameters
     ------------
 
-    iedb_dataset : mhcflurry.Dataset
+    iedb_dataset : mhcflurry.AffinityMeasurementDataset
         IEDB data for this allele. If not specified no iedb data is used.
 
     mhcflurry_hyperparameters : dict
@@ -96,7 +96,7 @@ class MHCflurryTrainedOnHits(MHCBindingComponentModelBase):
         if self.iedb_dataset is not None:
             df = df.append(iedb_dataset_df, ignore_index=True)
-        dataset = Dataset(
+        dataset = AffinityMeasurementDataset(
             df.sample(frac=1))  # shuffle dataframe
         print("Train data: ", dataset)
         model = Class1BindingPredictor(
...
@@ -187,7 +187,12 @@ class PresentationComponentModel(object):
                 self.predict_for_experiment(
                     str(peptides_df.iloc[0].experiment_name),
                     peptides_df.peptide.values))
-            assert len(return_value) == len(peptides_df), str(self)
+            assert len(return_value) == len(peptides_df), (
+                "%d != %d" % (len(return_value), len(peptides_df)),
+                str(self),
+                peptides_df.peptide.nunique(),
+                return_value,
+                peptides_df)
             assert_no_null(return_value, str(self))
         else:
             peptides_df = (
@@ -202,7 +207,9 @@ class PresentationComponentModel(object):
                     result_df.loc[sub_df.index, "experiment_name"] ==
                     experiment_name).all()
 
-                unique_peptides = numpy.unique(sub_df.peptide.values)
+                unique_peptides = list(set(sub_df.peptide))
+                if len(unique_peptides) == 0:
+                    continue
 
                 result_dict = self.predict_for_experiment(
                     experiment_name, unique_peptides)
...