Merge pull request #84 from hammerlab/add-class1-ensemble-rebased

Add class1 ensemble models, with model selection using imputation

Merge pull request #84 from hammerlab/add-class1-ensemble-rebased
Add class1 ensemble models, with model selection using imputation
cb88dac8 · Tim O'Donnell · GitHub · e93299b6 · 741d873e · cb88dac8
Commit cb88dac8 authored 8 years ago by Tim O'Donnell Committed by GitHub 8 years ago
--- a/.travis.yml
+++ b/.travis.yml
@@ -45,6 +45,8 @@ env:
 script:
  # download data and models, then run tests
  - mhcflurry-downloads fetch
+  - mhcflurry-downloads fetch models_class1_allele_specific_ensemble
+  - mhcflurry-downloads fetch models_class1_allele_specific_single
  - mhcflurry-downloads info  # just to test this command works
  - nosetests test --with-coverage --cover-package=mhcflurry  && ./lint.sh
 after_success:

--- a/Dockerfile
+++ b/Dockerfile
@@ -42,14 +42,11 @@ WORKDIR /home/user
 # Setup virtual envs and install convenience packages.  Note: installing
 # cherrypy as part of the mhcflurry installation weirdly fails on a unicode
 # issue in python2, but installing it separately seems to work.
-# We also install bokeh so that dask distributed will have an admin web interface.
 RUN virtualenv venv-py3 --python=python3 && \
    venv-py3/bin/pip install --upgrade pip && \
    venv-py3/bin/pip install --upgrade \
        numpy \
-        bokeh \
        cherrypy \
-        git+https://github.com/dask/distributed.git \
        jupyter \
        lxml \
        scipy \
@@ -59,9 +56,10 @@ RUN virtualenv venv-py3 --python=python3 && \
 ENV KERAS_BACKEND theano
 # RUN venv-py3/bin/pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0-cp35-cp35m-linux_x86_64.whl

-# Install mhcflurry and download data and models.
+# Install mhcflurry and latest kubeface and download data and models.
 COPY . ./mhcflurry
-RUN venv-py3/bin/pip install ./mhcflurry && venv-py3/bin/mhcflurry-downloads fetch
+RUN venv-py3/bin/pip install --upgrade ./mhcflurry git+https://github.com/hammerlab/kubeface.git \
+    && venv-py3/bin/mhcflurry-downloads fetch
 
 EXPOSE 8888
 CMD venv-py3/bin/jupyter notebook --no-browser
--- a/README.md
+++ b/README.md
@@ -5,15 +5,32 @@ Open source neural network models for peptide-MHC binding affinity prediction

 The [adaptive immune system](https://en.wikipedia.org/wiki/Adaptive_immune_system) depends on the presentation of protein fragments by [MHC](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) molecules. Machine learning models of this interaction are used in studies of infectious diseases, autoimmune diseases, vaccine development, and cancer immunotherapy.

-MHCflurry currently supports peptide / [MHC class I](https://en.wikipedia.org/wiki/MHC_class_I) affinity prediction using one model per MHC allele. The predictors may be trained on data that has been augmented with data imputed based on other alleles (see [Rubinsteyn 2016](http://biorxiv.org/content/early/2016/06/07/054775)). We anticipate adding additional models, including pan-allele and class II predictors.
+MHCflurry currently supports allele-specific peptide / [MHC class I](https://en.wikipedia.org/wiki/MHC_class_I) affinity prediction using two approaches:

-You can fit MHCflurry models to your own data or download trained models that we provide. Our models are trained on data from [IEDB](http://www.iedb.org/home_v3.php) and [Kim 2014](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241). See [here](https://github.com/hammerlab/mhcflurry/tree/master/downloads-generation/data_combined_iedb_kim2014) for details on the training data preparation. The steps we use to train predictors on this data, including hyperparameter selection using cross validation, are [here](https://github.com/hammerlab/mhcflurry/tree/master/downloads-generation/models_class1_allele_specific_single).
+ * Ensembles of predictors trained on random halves of the training data (the default)
+ * Single-model predictors for each allele trained on all data
+
+For both kinds of predictors, you can fit models to your own data or download
+trained models that we provide.
+
+The downloadable models were trained on data from
+[IEDB](http://www.iedb.org/home_v3.php) and [Kim 2014](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241).
+The ensemble predictors include models trained on data that has been
+augmented with values imputed from other alleles (see
+[Rubinsteyn 2016](http://biorxiv.org/content/early/2016/06/07/054775)).
+
+In validation experiments using presented peptides identified by mass-spec,
+the ensemble models perform best. We are working on a performance comparison of
+these models with other predictors such as netMHCpan, which we hope to make
+available soon.
+
+We anticipate adding additional models, including pan-allele and class II predictors.

-The MHCflurry predictors are implemented in Python using [keras](https://keras.io).

 ## Setup

-To configure keras, the neural network library used by MHCflurry, you'll need to set an environment variable in your shell:
+The MHCflurry predictors are implemented in Python using [keras](https://keras.io).
+To configure keras you'll need to set an environment variable in your shell:

 ```
 export KERAS_BACKEND=theano
@@ -73,25 +90,28 @@ The predictions returned by `predict` are affinities (KD) in nM.

 ## Training your own models

-See the [class1_allele_specific_models.ipynb](https://github.com/hammerlab/mhcflurry/blob/master/examples/class1_allele_specific_models.ipynb) notebook for an overview of the Python API, including predicting, fitting, and scoring models.
+See the [class1_allele_specific_models.ipynb](https://github.com/hammerlab/mhcflurry/blob/master/examples/class1_allele_specific_models.ipynb) notebook for an overview of the Python API, including predicting, fitting, and scoring single-model predictors. There is also a script called `mhcflurry-class1-allele-specific-cv-and-train` that will perform cross validation and model selection given a CSV file of training data. Try `mhcflurry-class1-allele-specific-cv-and-train --help` for details.

-There is also a script called `mhcflurry-class1-allele-specific-cv-and-train` that will perform cross validation and model selection given a CSV file of training data. Try `mhcflurry-class1-allele-specific-cv-and-train --help` for details.
+The ensemble predictors are trained similarly using the `mhcflurry-class1-allele-specific-ensemble-train` command.

-## Details on the downloaded class I allele-specific models
+## Details on the downloadable models

-Besides the actual model weights, the data downloaded with `mhcflurry-downloads fetch` also includes a CSV file giving the hyperparameters used for each predictor. Another CSV gives the cross validation results used to select these hyperparameters.
+The scripts we use to train predictors, including hyperparameter selection
+using cross validation, are
+[here](downloads-generation/models_class1_allele_specific_ensemble)
+for the ensemble predictors and [here](downloads-generation/models_class1_allele_specific_single)
+for the single-model predictors.

-To see the hyperparameters for the production models, run:
+For the ensemble predictors, we also generate a [report](http://htmlpreview.github.io/?https://github.com/hammerlab/mhcflurry/blob/master/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.html)
+that describes the hyperparameters selected and the test performance of each
+model.

-```
-open "$(mhcflurry-downloads path models_class1_allele_specific_single)/production.csv"
-```
-
-To see the cross validation results:
+Besides the model weights, the data downloaded when you run
+`mhcflurry-downloads  fetch` also includes a CSV file giving the
+hyperparameters used for each predictor. Run `mhcflurry-downloads path
+models_class1_allele_specific_ensemble` or `mhcflurry-downloads path
+models_class1_allele_specific_single` to get the directory where these files are stored.

-```
-open "$(mhcflurry-downloads path models_class1_allele_specific_single)/cv.csv"
-```
 ## Problems and Solutions

 ###  undefined symbol

--- a/downloads-generation/data_combined_iedb_kim2014/create-combined-class1-dataset.py
+++ b/downloads-generation/data_combined_iedb_kim2014/create-combined-class1-dataset.py
@@ -17,7 +17,7 @@
 """
 Combine 2013 Kim/Peters NetMHCpan dataset[*] with more recent IEDB entries

-* = "Dataset size and composition impact the reliability..."
+* = "AffinityMeasurementDataset size and composition impact the reliability..."
 """
 
 from __future__ import (

--- a/downloads-generation/models_class1_allele_specific_ensemble/GENERATE.sh
+++ b/downloads-generation/models_class1_allele_specific_ensemble/GENERATE.sh
+#!/bin/bash
+
+if [[ $# -eq 0 ]] ; then
+    echo 'WARNING: This script is intended to be called with additional arguments to pass to mhcflurry-class1-allele-specific-cv-and-train'
+    echo 'See README.md'
+fi
+
+set -e
+set -x
+
+DOWNLOAD_NAME=models_class1_allele_specific_ensemble
+SCRATCH_DIR=/tmp/mhcflurry-downloads-generation
+SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
+SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
+export PYTHONUNBUFFERED=1
+
+mkdir -p "$SCRATCH_DIR"
+rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
+mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
+
+# Send stdout and stderr to a logfile included with the archive.
+exec >  >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
+exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
+
+# Log some environment info
+date
+pip freeze
+git rev-parse HEAD
+git status
+
+cd $SCRATCH_DIR/$DOWNLOAD_NAME
+
+mkdir models
+
+cp $SCRIPT_DIR/models.py .
+python models.py > models.json
+
+time mhcflurry-class1-allele-specific-ensemble-train \
+    --ensemble-size 16 \
+    --model-architectures models.json \
+    --train-data "$(mhcflurry-downloads path data_combined_iedb_kim2014)/combined_human_class1_dataset.csv" \
+    --min-samples-per-allele 20 \
+    --out-manifest selected_models.csv \
+    --out-model-selection-manifest all_models.csv \
+    --out-models models \
+    --verbose \
+    "$@"
+
+bzip2 all_models.csv
+cp $SCRIPT_ABSOLUTE_PATH .
+tar -cjf "../${DOWNLOAD_NAME}.tar.bz2" *
+
+echo "Created archive: $SCRATCH_DIR/$DOWNLOAD_NAME.tar.bz2"
--- a/downloads-generation/models_class1_allele_specific_ensemble/README.md
+++ b/downloads-generation/models_class1_allele_specific_ensemble/README.md
+# Class I allele-specific models (ensemble)
+
+This download contains trained MHC Class I allele-specific MHCflurry models. For each allele, an ensemble of predictors is trained on random halves of the training data. Model architectures are selected based on performance on the other half of the dataset, so in general each ensemble contains predictors of different architectures. At prediction time the geometric mean IC50 is taken over the trained models. The training data used is in the [data_combined_iedb_kim2014](../data_combined_iedb_kim2014) MHCflurry download.
+
+The training script supports multi-node parallel execution using the [kubeface](https://github.com/hammerlab/kubeface) library.
+
+To use kubeface, you should make a google storage bucket and pass it below with the --storage-prefix argument. 
+
+To generate this download we run:
+
+```
+./GENERATE.sh \
+    --parallel-backend kubeface \
+    --target-tasks 200 \
+    --kubeface-backend kubernetes \
+    --kubeface-storage gs://kubeface-tim \
+    --kubeface-worker-image hammerlab/mhcflurry-misc:latest \
+    --kubeface-kubernetes-task-resources-memory-mb 10000 \
+    --kubeface-worker-path-prefix venv-py3/bin \
+    --kubeface-max-simultaneous-tasks 200 \
+    --kubeface-speculation-max-reruns 3 \
+```
+
+To debug locally:
+```
+./GENERATE.sh \
+    --parallel-backend local-threads \
+    --target-tasks 1
+```
--- a/downloads-generation/models_class1_allele_specific_ensemble/models-summary/README.md
+++ b/downloads-generation/models_class1_allele_specific_ensemble/models-summary/README.md
+# Class1 allele-specific ensemble models
+
+To generate the report, run:
+
+```
+time jupyter-nbconvert report.ipynb \
+    --execute \
+    --ExecutePreprocessor.kernel_name=python \
+    --ExecutePreprocessor.timeout=60 \
+    --to html \
+    --stdout > report.html
+```
--- a/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.html
+++ b/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.html
--- a/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.ipynb
+++ b/downloads-generation/models_class1_allele_specific_ensemble/models-summary/report.ipynb
--- a/downloads-generation/models_class1_allele_specific_ensemble/models.py
+++ b/downloads-generation/models_class1_allele_specific_ensemble/models.py
+import sys
+from mhcflurry.class1_allele_specific_ensemble import HYPERPARAMETER_DEFAULTS
+import json
+
+models = HYPERPARAMETER_DEFAULTS.models_grid(
+    impute=[False, True],
+    activation=["tanh"],
+    layer_sizes=[[12], [64], [128]],
+    embedding_output_dim=[8, 32, 64],
+    dropout_probability=[0, .1, .25],
+    fraction_negative=[0, .1, .2],
+    n_training_epochs=[250],
+
+    # Imputation arguments
+    impute_method=["mice"],
+    imputer_args=[
+        # Arguments specific to imputation method (mice)
+        {"n_burn_in": 5, "n_imputations": 50, "n_nearest_columns": 25}
+    ],
+    impute_min_observations_per_peptide=[3],
+    impute_min_observations_per_allele=[3])
+
+sys.stderr.write("Models: %d\n" % len(models))
+print(json.dumps(models, indent=4))
--- a/examples/class1_allele_specific_models.ipynb
+++ b/examples/class1_allele_specific_models.ipynb
--- a/mhcflurry/__init__.py
+++ b/mhcflurry/__init__.py
@@ -14,7 +14,11 @@

 from .class1_allele_specific.class1_binding_predictor import (
    Class1BindingPredictor)
-from .predict import predict
+from .prediction import predict
+from .affinity_measurement_dataset import AffinityMeasurementDataset
+from .class1_allele_specific_ensemble import Class1EnsembleMultiAllelePredictor
+from .class1_allele_specific import Class1SingleModelMultiAllelePredictor
+from .measurement_collection import MeasurementCollection
 from . import parallelism

 __version__ = "0.2.0"
@@ -23,5 +27,9 @@ __all__ = [
    "Class1BindingPredictor",
    "predict",
    "parallelism",
+    "AffinityMeasurementDataset",
+    "Class1EnsembleMultiAllelePredictor",
+    "Class1SingleModelMultiAllelePredictor",
+    "MeasurementCollection",
    "__version__",
 ]
--- a/mhcflurry/dataset.py
+++ b/mhcflurry/dataset.py
@@ -37,12 +37,15 @@ from .imputation_helpers import (
 )


-class Dataset(object):
-
+class AffinityMeasurementDataset(object):
    """
    Peptide-MHC binding dataset with helper methods for constructing
    different representations (arrays, DataFrames, dictionaries, &c).

+    This class is specific for affinity measurements (IC50s), whereas the
+    MeasurementCollection class supports both affinitites and other
+    measurement types like mass spec hits.
+
    Design considerations:
        - want to allow multiple measurements for each pMHC pair (which can
          be dynamically combined)
@@ -50,7 +53,7 @@ class Dataset(object):
    """
    def __init__(self, df):
        """
-        Constructs a DataSet from a pandas DataFrame with the following
+        Constructs a AffinityMeasurementDataset from a pandas DataFrame with the following
        columns:
            - allele
            - peptide
@@ -89,7 +92,7 @@ class Dataset(object):

    def to_dataframe(self):
        """
-        Returns DataFrame representation of data contained in Dataset
+        Returns DataFrame representation of data contained in AffinityMeasurementDataset
        """
        return self._df

@@ -125,7 +128,7 @@ class Dataset(object):
        return len(self.to_dataframe())

    def __str__(self):
-        return "Dataset(n=%d, alleles=%s)" % (
+        return "AffinityMeasurementDataset(n=%d, alleles=%s)" % (
            len(self), list(sorted(self.unique_alleles())))

    def __repr__(self):
@@ -136,7 +139,7 @@ class Dataset(object):
        Two datasets are equal if they contain the same number of samples
        with the same properties and values.
        """
-        if type(other) is not Dataset:
+        if type(other) is not AffinityMeasurementDataset:
            return False
        elif len(self) != len(other):
            return False
@@ -180,13 +183,13 @@ class Dataset(object):

    def unique_alleles(self):
        """
-        Returns the set of allele names contained in this Dataset.
+        Returns the set of allele names contained in this AffinityMeasurementDataset.
        """
        return set(self.alleles)

    def unique_peptides(self):
        """
-        Returns the set of peptide sequences contained in this Dataset.
+        Returns the set of peptide sequences contained in this AffinityMeasurementDataset.
        """
        return set(self.peptides)

@@ -202,11 +205,11 @@ class Dataset(object):
        entries just for that allele.
        """
        for (allele_name, group_df) in self.to_dataframe().groupby("allele"):
-            yield (allele_name, Dataset(group_df))
+            yield (allele_name, AffinityMeasurementDataset(group_df))

    def groupby_allele_dictionary(self):
        """
-        Returns dictionary mapping each allele name to a Dataset containing
+        Returns dictionary mapping each allele name to a AffinityMeasurementDataset containing
        only entries from that allele.
        """
        return dict(self.groupby_allele())
@@ -314,7 +317,7 @@ class Dataset(object):
    @classmethod
    def from_single_allele_dataframe(cls, allele_name, single_allele_df):
        """
-        Construct a Dataset from a single MHC allele's DataFrame
+        Construct a AffinityMeasurementDataset from a single MHC allele's DataFrame
        """
        df = single_allele_df.copy()
        df["allele"] = allele_name
@@ -326,7 +329,7 @@ class Dataset(object):
            allele_to_peptide_to_affinity_dict):
        """
        Given nested dictionaries mapping allele -> peptide -> affinity,
-        construct a Dataset with uniform sample weights.
+        construct a AffinityMeasurementDataset with uniform sample weights.
        """
        alleles = []
        peptides = []
@@ -344,7 +347,7 @@ class Dataset(object):
    @classmethod
    def create_empty(cls):
        """
-        Returns an empty Dataset containing no pMHC entries.
+        Returns an empty AffinityMeasurementDataset containing no pMHC entries.
        """
        return cls.from_nested_dictionary({})

@@ -355,7 +358,7 @@ class Dataset(object):
            peptide_to_affinity_dict):
        """
        Given a peptide->affinity dictionary for a single allele,
-        create a Dataset.
+        create a AffinityMeasurementDataset.
        """
        return cls.from_nested_dictionary({allele_name: peptide_to_affinity_dict})

@@ -382,7 +385,7 @@ class Dataset(object):

    def get_allele(self, allele_name):
        """
-        Get Dataset for a single allele
+        Get AffinityMeasurementDataset for a single allele
        """
        if allele_name not in self.unique_alleles():
            raise KeyError("Allele '%s' not found, available alleles: %s" % (
@@ -393,7 +396,7 @@ class Dataset(object):

    def get_alleles(self, allele_names):
        """
-        Restrict Dataset to several allele names.
+        Restrict AffinityMeasurementDataset to several allele names.
        """
        datasets = []
        for allele_name in allele_names:
@@ -446,7 +449,7 @@ class Dataset(object):
            weight = row["sample_weight"]
            # we're either going to create a fresh original peptide column
            # or extend the existing original peptide tuple that tracks
-            # the provenance of entries in the new Dataset
+            # the provenance of entries in the new AffinityMeasurementDataset
            original_peptide = row.get("original_peptide")
            if original_peptide is None:
                original_peptide = ()
@@ -569,14 +572,14 @@ class Dataset(object):

    def slice(self, indices):
        """
-        Create a new Dataset by slicing through all columns of this dataset
+        Create a new AffinityMeasurementDataset by slicing through all columns of this dataset
        with the given indices.
        """
        indices = np.asarray(indices)
        max_index = indices.max()
        n_total = len(self)
        if max_index >= len(self):
-            raise ValueError("Invalid index %d for Dataset of size %d" % (
+            raise ValueError("Invalid index %d for AffinityMeasurementDataset of size %d" % (
                max_index, n_total))

        df = self.to_dataframe()
@@ -587,7 +590,7 @@ class Dataset(object):

    def random_split(self, n=None, stratify_fn=None):
        """
-        Randomly split the Dataset into smaller Dataset objects.
+        Randomly split the AffinityMeasurementDataset into smaller AffinityMeasurementDataset objects.

        Parameters
        ----------
@@ -598,7 +601,7 @@ class Dataset(object):
            Function that takes a row and returns bool, stratifying sampling
            into two groups.

-        Returns a pair of Dataset objects.
+        Returns a pair of AffinityMeasurementDataset objects.
        """
        n_total = len(self)
        if n is None:
@@ -646,7 +649,7 @@ class Dataset(object):
            test_samples = self
            test_sample_indices = np.arange(n_total)
        elif test_allele not in self.unique_alleles():
-            raise ValueError("Allele '%s' not in Dataset" % test_allele)
+            raise ValueError("Allele '%s' not in AffinityMeasurementDataset" % test_allele)

        else:
            test_sample_indices = np.where(self.alleles == test_allele)[0]
@@ -685,7 +688,7 @@ class Dataset(object):
        """
        Split an allele into training and test sets, and then impute values
        for peptides missing from the training set using data from other alleles
-        in this Dataset.
+        in this AffinityMeasurementDataset.

        (apologies for the wordy name, this turns out to be a common operation)

@@ -698,13 +701,13 @@ class Dataset(object):
            Size of the training set to return for this allele.

        stratify_fn : function
-            Function mapping from rows of the Dataset to booleans for stratifying
+            Function mapping from rows of the AffinityMeasurementDataset to booleans for stratifying
            by two groups.

        **kwargs : dict
-            Extra keyword arguments passed to Dataset.impute_missing_values
+            Extra keyword arguments passed to AffinityMeasurementDataset.impute_missing_values

-        Returns three Dataset objects:
+        Returns three AffinityMeasurementDataset objects:
            - training set with original pMHC affinities for given allele
            - larger imputed training set for given allele
            - test set
@@ -729,7 +732,7 @@ class Dataset(object):

        The two arguments are assumed to be the same length.

-        Returns Dataset of equal or smaller size.
+        Returns AffinityMeasurementDataset of equal or smaller size.
        """
        if len(alleles) != len(peptides):
            raise ValueError(
@@ -746,7 +749,7 @@ class Dataset(object):
        allele_peptide_pairs : list of (str, str) tuples
        The two arguments are assumed to be the same length.

-        Returns Dataset of equal or smaller size.
+        Returns AffinityMeasurementDataset of equal or smaller size.
        """
        require_iterable_of(allele_peptide_pairs, tuple)
        keys_to_remove_set = set(allele_peptide_pairs)
@@ -763,7 +766,7 @@ class Dataset(object):

        Parameters
        ----------
-        other_dataset : Dataset
+        other_dataset : AffinityMeasurementDataset

        Returns a new Dataset object of equal or lesser size.
        """
@@ -792,7 +795,7 @@ class Dataset(object):
            of floats and replaces some or all NaN values with synthetic
            affinities.

-        log_transform : function
+        log_transform : bool
            Transform affinities with to log10 values before imputation
            (and then transform back afterward).

@@ -802,7 +805,7 @@ class Dataset(object):
        min_observations_per_allele : int
            Drop allele columns with fewer than this number of observed values.

-        Returns Dataset with original pMHC affinities and additional
+        Returns AffinityMeasurementDataset with original pMHC affinities and additional
        synthetic samples.
        """
        if isinstance(imputation_method, string_types):

--- a/mhcflurry/antigen_presentation/__init__.py
+++ b/mhcflurry/antigen_presentation/__init__.py
-from .presentation_model import PresentationModel
+from .presentation_model import PresentationModel, build_presentation_models
 from .percent_rank_transform import PercentRankTransform
 from . import presentation_component_models, decoy_strategies

 __all__ = [
    "PresentationModel",
+    "build_presentation_models",
    "PercentRankTransform",
    "presentation_component_models",
    "decoy_strategies",

--- a/mhcflurry/antigen_presentation/presentation_component_models/expression.py
+++ b/mhcflurry/antigen_presentation/presentation_component_models/expression.py
@@ -25,7 +25,7 @@ class Expression(PresentationComponentModel):
            group in expression_values.columns
            for group in experiment_to_expression_group.values())

-        assert_no_null(experiment_to_expression_group)
+        assert_no_null(expression_values)

        self.experiment_to_expression_group = experiment_to_expression_group
        self.expression_values = expression_values

--- a/mhcflurry/antigen_presentation/presentation_component_models/fixed_affinity_predictions.py
+++ b/mhcflurry/antigen_presentation/presentation_component_models/fixed_affinity_predictions.py
@@ -16,6 +16,9 @@ class FixedAffinityPredictions(PresentationComponentModel):
            - "value", "percentile_rank" (IC50 and percent rank)
            - peptide (string)
            - allele (string)
+
+    name : string
+        Used to name output columns and in debug messages
    """

    def __init__(

--- a/mhcflurry/antigen_presentation/presentation_component_models/mhc_binding_component_model_base.py
+++ b/mhcflurry/antigen_presentation/presentation_component_models/mhc_binding_component_model_base.py
@@ -47,7 +47,7 @@ class MHCBindingComponentModelBase(PresentationComponentModel):
    fallback_predictor : function: (allele, peptides) -> predictions
        Used when missing an allele.

-    iedb_dataset : mhcflurry.Dataset
+    iedb_dataset : mhcflurry.AffinityMeasurementDataset
        IEDB data for this allele. If not specified no iedb data is used.

    decoys_per_hit : int

--- a/mhcflurry/antigen_presentation/presentation_component_models/mhcflurry_released.py
+++ b/mhcflurry/antigen_presentation/presentation_component_models/mhcflurry_released.py
@@ -4,8 +4,9 @@ import numpy
 import pandas

 from ...common import normalize_allele_name
-from ...predict import predict
+from ...prediction import predict
 from ..percent_rank_transform import PercentRankTransform
+from ...peptide_encoding import encode_peptides
 from .presentation_component_model import PresentationComponentModel


@@ -21,15 +22,22 @@ class MHCflurryReleased(PresentationComponentModel):
    random_peptides_for_percent_rank : list of string
        If specified, then percentile rank will be calibrated and emitted
        using the given peptides.
+
+    predictor : Class1EnsembleMultiAllelePredictor-like object
+        Predictor to use.
    """

    def __init__(
            self,
            experiment_to_alleles,
            random_peptides_for_percent_rank=None,
+            predictor=None,
+            predictor_name="mhcflurry_released",
            **kwargs):
        PresentationComponentModel.__init__(self, **kwargs)
        self.experiment_to_alleles = experiment_to_alleles
+        self.predictor = predictor
+        self.predictor_name = predictor_name
        if random_peptides_for_percent_rank is None:
            self.percent_rank_transforms = None
            self.random_peptides_for_percent_rank = None
@@ -39,9 +47,9 @@ class MHCflurryReleased(PresentationComponentModel):
                random_peptides_for_percent_rank)

    def column_names(self):
-        columns = ['mhcflurry_released_affinity']
+        columns = [self.predictor_name + '_affinity']
        if self.percent_rank_transforms is not None:
-            columns.append('mhcflurry_released_percentile_rank')
+            columns.append(self.predictor_name + '_percentile_rank')
        return columns

    def requires_fitting(self):
@@ -59,15 +67,18 @@ class MHCflurryReleased(PresentationComponentModel):
                    .Prediction.values)

    def predict_min_across_alleles(self, alleles, peptides):
-        alleles = [
+        alleles = list(set([
            normalize_allele_name(allele)
            for allele in alleles
-        ]
-        df = predict(alleles, numpy.unique(numpy.array(peptides)))
+        ]))
+        df = predict(
+            alleles,
+            peptides,
+            predictor=self.predictor)
        pivoted = df.pivot(index='Peptide', columns='Allele')
        pivoted.columns = pivoted.columns.droplevel()
        result = {
-            'mhcflurry_released_affinity': (
+            self.predictor_name + '_affinity': (
                pivoted.min(axis=1).ix[peptides].values)
        }
        if self.percent_rank_transforms is not None:
@@ -77,8 +88,11 @@ class MHCflurryReleased(PresentationComponentModel):
                percentile_ranks[allele] = (
                    self.percent_rank_transforms[allele]
                    .transform(pivoted[allele].values))
-            result['mhcflurry_released_percentile_rank'] = (
+            result[self.predictor_name + '_percentile_rank'] = (
                percentile_ranks.min(axis=1).ix[peptides].values)
+
+        for (key, value) in result.items():
+            assert len(value) == len(peptides), (len(peptides), result)
        return result

    def predict_for_experiment(self, experiment_name, peptides):

--- a/mhcflurry/antigen_presentation/presentation_component_models/mhcflurry_trained_on_hits.py
+++ b/mhcflurry/antigen_presentation/presentation_component_models/mhcflurry_trained_on_hits.py
@@ -3,7 +3,7 @@ from copy import copy
 import pandas
 from numpy import log, exp, nanmean

-from ...dataset import Dataset
+from ...affinity_measurement_dataset import AffinityMeasurementDataset
 from ...class1_allele_specific import Class1BindingPredictor
 from ...common import normalize_allele_name

@@ -22,7 +22,7 @@ class MHCflurryTrainedOnHits(MHCBindingComponentModelBase):

    Parameters
    ------------
-    iedb_dataset : mhcflurry.Dataset
+    iedb_dataset : mhcflurry.AffinityMeasurementDataset
        IEDB data for this allele. If not specified no iedb data is used.

    mhcflurry_hyperparameters : dict
@@ -96,7 +96,7 @@ class MHCflurryTrainedOnHits(MHCBindingComponentModelBase):
        if self.iedb_dataset is not None:
            df = df.append(iedb_dataset_df, ignore_index=True)

-        dataset = Dataset(
+        dataset = AffinityMeasurementDataset(
            df.sample(frac=1))  # shuffle dataframe
        print("Train data: ", dataset)
        model = Class1BindingPredictor(

--- a/mhcflurry/antigen_presentation/presentation_component_models/presentation_component_model.py
+++ b/mhcflurry/antigen_presentation/presentation_component_models/presentation_component_model.py
@@ -187,7 +187,12 @@ class PresentationComponentModel(object):
                self.predict_for_experiment(
                    str(peptides_df.iloc[0].experiment_name),
                    peptides_df.peptide.values))
-            assert len(return_value) == len(peptides_df), str(self)
+            assert len(return_value) == len(peptides_df), (
+                "%d != %d" % (len(return_value), len(peptides_df)),
+                str(self),
+                peptides_df.peptide.nunique(),
+                return_value,
+                peptides_df)
            assert_no_null(return_value, str(self))
        else:
            peptides_df = (
@@ -202,7 +207,9 @@ class PresentationComponentModel(object):
                    result_df.loc[sub_df.index, "experiment_name"] ==
                    experiment_name).all()

-                unique_peptides = numpy.unique(sub_df.peptide.values)
+                unique_peptides = list(set(sub_df.peptide))
+                if len(unique_peptides) == 0:
+                    continue

                result_dict = self.predict_for_experiment(
                    experiment_name, unique_peptides)