
Python library tutorial
=======================

Predicting
----------

The MHCflurry Python API exposes additional options and features beyond those
supported by the command-line tools. This tutorial gives a basic overview
of the most important functionality. See the :ref:`API-documentation` for further details.

The `~mhcflurry.Class1AffinityPredictor` class is the primary user-facing interface.
Use the `~mhcflurry.Class1AffinityPredictor.load` static method to load a
trained predictor from disk. With no arguments this method will load the predictor
released with MHCflurry (see :ref:`downloading`\ ). If you pass a path to a
models directory, then it will load that predictor instead.

.. runblock:: pycon

    >>> from mhcflurry import Class1AffinityPredictor
    >>> predictor = Class1AffinityPredictor.load()
    >>> predictor.supported_alleles[:10]

With a predictor loaded we can now generate some binding predictions:

.. runblock:: pycon

    >>> predictor.predict(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])

.. note::

    MHCflurry normalizes allele names using the `mhcnames <https://github.com/hammerlab/mhcnames>`__
    package. Names like ``HLA-A0201`` or ``A*02:01`` will be
    normalized to ``HLA-A*02:01``, so most naming conventions can be used
    with methods such as `~mhcflurry.Class1AffinityPredictor.predict`.

For more detailed results, we can use
`~mhcflurry.Class1AffinityPredictor.predict_to_dataframe`.

.. runblock:: pycon

    >>> predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])

Instead of a single allele and multiple peptides, we may need predictions for
allele/peptide pairs. We can predict across pairs by specifying
the `alleles` argument instead of `allele`. The list of alleles
must be the same length as the list of peptides (i.e. it is predicting over pairs,
*not* taking the cross product).

.. runblock:: pycon

    >>> predictor.predict(alleles=["HLA-A0201", "HLA-B*57:01"], peptides=["SIINFEKL", "SIINFEQL"])
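
If the cross product is what you actually need, one option is to expand it
yourself before calling the predictor. The following is a minimal sketch using
only the standard library; the variable names ``pairs``, ``paired_alleles`` and
``paired_peptides`` are illustrative and not part of MHCflurry.

.. code-block:: python

    from itertools import product

    alleles = ["HLA-A0201", "HLA-B*57:01"]
    peptides = ["SIINFEKL", "SIINFEQL"]

    # Enumerate every (allele, peptide) combination, then split the pairs
    # back into two lists of equal length, as the paired form requires.
    pairs = list(product(alleles, peptides))
    paired_alleles = [allele for allele, _ in pairs]
    paired_peptides = [peptide for _, peptide in pairs]

The two lists can then be passed as
``predictor.predict(alleles=paired_alleles, peptides=paired_peptides)``.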

Training
--------

Let's fit our own MHCflurry predictor. First we need some training data. If you
haven't already, run this in a shell to download the MHCflurry training data:

.. code-block:: shell

    $ mhcflurry-downloads fetch data_curated

We can get the path to this data from Python using `mhcflurry.downloads.get_path`:

.. runblock:: pycon

    >>> from mhcflurry.downloads import get_path
    >>> data_path = get_path("data_curated", "curated_training_data.csv.bz2")
    >>> data_path

Now let's load it with pandas and filter to reasonably-sized peptides:

.. runblock:: pycon

    >>> import pandas
    >>> df = pandas.read_csv(data_path)
    >>> df = df.loc[(df.peptide.str.len() >= 8) & (df.peptide.str.len() <= 15)]
    >>> df.head(5)

We'll make an untrained `~mhcflurry.Class1AffinityPredictor` and then call
`~mhcflurry.Class1AffinityPredictor.fit_allele_specific_predictors` to fit
some models.

.. runblock:: pycon

    >>> new_predictor = Class1AffinityPredictor()
    >>> single_allele_train_data = df.loc[df.allele == "HLA-B*57:01"].sample(100)
    >>> new_predictor.fit_allele_specific_predictors(
    ...    n_models=1,
    ...    architecture_hyperparameters_list=[{
    ...         "layer_sizes": [16],
    ...         "max_epochs": 5,
    ...         "random_negative_constant": 5,
    ...    }],
    ...    peptides=single_allele_train_data.peptide.values,
    ...    affinities=single_allele_train_data.measurement_value.values,
    ...    allele="HLA-B*57:01")


The `~mhcflurry.Class1AffinityPredictor.fit_allele_specific_predictors` method
can be called any number of times on the same instance to build up ensembles
of models across alleles. The `architecture_hyperparameters_list` values we
specified are for demonstration purposes only; to fit real models you would
usually train for more epochs.

Now we can generate predictions:

.. runblock:: pycon

    >>> new_predictor.predict(["SYNPEPII"], allele="HLA-B*57:01")

We can save our predictor to the specified directory on disk by running:

.. runblock:: pycon

    >>> new_predictor.save("/tmp/new-predictor")

and restore it:

.. runblock:: pycon

    >>> new_predictor2 = Class1AffinityPredictor.load("/tmp/new-predictor")
    >>> new_predictor2.supported_alleles


Lower level interface
---------------------

The high-level `~mhcflurry.Class1AffinityPredictor` delegates to low-level
`~mhcflurry.Class1NeuralNetwork` objects, each of which represents
a single neural network. `~mhcflurry.Class1AffinityPredictor` adds several
important features on top of the individual networks:

ensembles
    More than one neural network can be used to generate each prediction. The
    predictions returned to the user are the geometric mean of the individual
    model predictions. This gives higher accuracy in most situations.

multiple alleles
    A `~mhcflurry.Class1NeuralNetwork` generates predictions for only a single
    allele. The `~mhcflurry.Class1AffinityPredictor` maps alleles to the
    relevant `~mhcflurry.Class1NeuralNetwork` instances.

serialization
    Loading and saving predictors is implemented in `~mhcflurry.Class1AffinityPredictor`.
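
To make the ensemble averaging described above concrete, here is a small
sketch of a geometric mean over per-model predictions. The numbers are made up
for illustration, and the code uses the standard library rather than
MHCflurry's internal implementation.

.. code-block:: python

    from statistics import geometric_mean

    # Hypothetical affinity predictions from three models for one peptide.
    model_predictions = [100.0, 200.0, 400.0]

    # The geometric mean averages in log space, so a single very weak
    # (large) prediction pulls the result up less than it would with an
    # arithmetic mean.
    ensemble_prediction = geometric_mean(model_predictions)
    arithmetic_prediction = sum(model_predictions) / len(model_predictions)

    print(ensemble_prediction, arithmetic_prediction)

For these values the geometric mean is close to 200, while the arithmetic mean
is noticeably higher.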

Sometimes it's easiest to work directly with `~mhcflurry.Class1NeuralNetwork`.
Here is a simple example of doing so:

.. runblock:: pycon

    >>> from mhcflurry import Class1NeuralNetwork
    >>> network = Class1NeuralNetwork()
    >>> network.fit(
    ...    single_allele_train_data.peptide.values,
    ...    single_allele_train_data.measurement_value.values,
    ...    verbose=0)
    >>> network.predict(["SIINFEKLL"])