Python library tutorial
=======================
Predicting
----------

The MHCflurry Python API exposes additional options and features beyond those
supported by the command-line tools. This tutorial gives a basic overview
of the most important functionality. See the :ref:`API-documentation` for further details.

The `~mhcflurry.Class1AffinityPredictor` class is the primary user-facing interface.
Use the `~mhcflurry.Class1AffinityPredictor.load` static method to load a
trained predictor from disk. With no arguments, this method loads the predictor
released with MHCflurry (see :ref:`downloading`\ ). If you pass a path to a
models directory, it loads that predictor instead.

.. runblock:: pycon

    >>> from mhcflurry import Class1AffinityPredictor
    >>> predictor = Class1AffinityPredictor.load()
    >>> predictor.supported_alleles[:10]

With a predictor loaded, we can now generate some binding predictions:

.. runblock:: pycon

    >>> predictor.predict(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])

.. note::

    MHCflurry normalizes allele names using the `mhcnames <https://github.com/hammerlab/mhcnames>`__
    package. Names like ``HLA-A0201`` or ``A*02:01`` will be
    normalized to ``HLA-A*02:01``, so most naming conventions can be used
    with methods such as `~mhcflurry.Class1AffinityPredictor.predict`.

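To make the idea of name normalization concrete, here is a minimal sketch that
handles only the spellings mentioned in the note above. It is *not* how
``mhcnames`` is implemented; the function ``normalize_allele_sketch`` and the
pattern ``_ALLELE_PATTERN`` are hypothetical names introduced for illustration.

.. code-block:: python

    import re

    # Hypothetical toy pattern: optional "HLA-" prefix, one-letter gene,
    # optional "*", two-digit group, optional ":", two-digit protein field.
    _ALLELE_PATTERN = re.compile(r"^(?:HLA-)?([A-Z])\*?(\d{2}):?(\d{2})$")

    def normalize_allele_sketch(name):
        """Map spellings like 'HLA-A0201' or 'A*02:01' to 'HLA-A*02:01'."""
        match = _ALLELE_PATTERN.match(name.strip().upper())
        if match is None:
            raise ValueError("Unrecognized allele name: %r" % name)
        gene, group, protein = match.groups()
        return "HLA-%s*%s:%s" % (gene, group, protein)

    normalized = [normalize_allele_sketch(n) for n in ("HLA-A0201", "A*02:01")]

Both inputs in this sketch end up as the same string, which is the property
that lets different naming conventions refer to the same allele.
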
For more detailed results, we can use
`~mhcflurry.Class1AffinityPredictor.predict_to_dataframe`.

.. runblock:: pycon

    >>> predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])

Instead of a single allele and multiple peptides, we may need predictions for
allele/peptide pairs. We can predict across pairs by specifying
the `alleles` argument instead of `allele`. The list of alleles
must be the same length as the list of peptides: predictions are made over
pairs, *not* over the cross product.

.. runblock:: pycon

    >>> predictor.predict(alleles=["HLA-A0201", "HLA-B*57:01"], peptides=["SIINFEKL", "SIINFEQL"])

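If we do want every allele scored against every peptide, one option is to
expand the cross product into two equal-length lists ourselves before calling
``predict``. The sketch below uses only the standard library; the helper
``expand_to_pairs`` is a hypothetical name introduced for illustration, not
part of MHCflurry.

.. code-block:: python

    from itertools import product

    def expand_to_pairs(alleles, peptides):
        """Return two equal-length lists covering every allele/peptide combination."""
        pairs = list(product(alleles, peptides))
        paired_alleles = [allele for allele, _ in pairs]
        paired_peptides = [peptide for _, peptide in pairs]
        return paired_alleles, paired_peptides

    paired_alleles, paired_peptides = expand_to_pairs(
        ["HLA-A0201", "HLA-B*57:01"], ["SIINFEKL", "SIINFEQL"])

    # The two lists now line up element by element, so they could be passed as
    # predictor.predict(alleles=paired_alleles, peptides=paired_peptides).
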
Training
--------

Let's fit our own MHCflurry predictor. First we need some training data. If you
haven't already, run this in a shell to download the MHCflurry training data:

.. code-block:: shell

    $ mhcflurry-downloads fetch data_curated

We can get the path to this data from Python using `mhcflurry.downloads.get_path`:

.. runblock:: pycon

    >>> from mhcflurry.downloads import get_path
    >>> data_path = get_path("data_curated", "curated_training_data.csv.bz2")
    >>> data_path

Now let's load it with pandas and keep only peptides of 8 to 15 residues:

.. runblock:: pycon

    >>> import pandas
    >>> df = pandas.read_csv(data_path)
    >>> df = df.loc[(df.peptide.str.len() >= 8) & (df.peptide.str.len() <= 15)]
    >>> df.head(5)

We'll make an untrained `~mhcflurry.Class1AffinityPredictor` and then call
`~mhcflurry.Class1AffinityPredictor.fit_allele_specific_predictors` to fit
some models.

.. runblock:: pycon

    >>> new_predictor = Class1AffinityPredictor()
    >>> single_allele_train_data = df.loc[df.allele == "HLA-B*57:01"].sample(100)
    >>> new_predictor.fit_allele_specific_predictors(
    ...    n_models=1,
    ...    architecture_hyperparameters_list=[{
    ...         "layer_sizes": [16],
    ...         "max_epochs": 5,
    ...         "random_negative_constant": 5,
    ...    }],
    ...    peptides=single_allele_train_data.peptide.values,
    ...    affinities=single_allele_train_data.measurement_value.values,
    ...    allele="HLA-B*57:01")

The `~mhcflurry.Class1AffinityPredictor.fit_allele_specific_predictors` method
can be called any number of times on the same instance to build up ensembles
of models across alleles. The hyperparameters we specified in
`architecture_hyperparameters_list` are for demonstration purposes; to fit
real models you would usually train for more epochs.

Now we can generate predictions:

.. runblock:: pycon

    >>> new_predictor.predict(["SYNPEPII"], allele="HLA-B*57:01")

We can save our predictor to a directory on disk by running:

.. runblock:: pycon

    >>> new_predictor.save("/tmp/new-predictor")

and restore it:

.. runblock:: pycon

    >>> new_predictor2 = Class1AffinityPredictor.load("/tmp/new-predictor")
    >>> new_predictor2.supported_alleles

Lower level interface
---------------------

The high-level `~mhcflurry.Class1AffinityPredictor` delegates to low-level
`~mhcflurry.Class1NeuralNetwork` objects, each of which represents
a single neural network. The purpose of `~mhcflurry.Class1AffinityPredictor`
is to implement several important features:

ensembles
    More than one neural network can be used to generate each prediction. The
    predictions returned to the user are the geometric mean of the individual
    model predictions. This gives higher accuracy in most situations.

multiple alleles
    A `~mhcflurry.Class1NeuralNetwork` generates predictions for only a single
    allele. The `~mhcflurry.Class1AffinityPredictor` maps alleles to the
    relevant `~mhcflurry.Class1NeuralNetwork` instances.

serialization
    Loading and saving predictors is implemented in `~mhcflurry.Class1AffinityPredictor`.

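The ensemble and allele-mapping ideas above can be sketched in a few lines of
plain Python. This is a toy illustration under stated assumptions, not
MHCflurry's implementation: ``ToyEnsemblePredictor`` and ``geometric_mean`` are
hypothetical names, and the "models" are constant functions returning made-up
numbers.

.. code-block:: python

    import math

    def geometric_mean(values):
        """Geometric mean of positive numbers, computed in log space."""
        return math.exp(sum(math.log(v) for v in values) / len(values))

    class ToyEnsemblePredictor:
        """Hypothetical stand-in that maps each allele to a list of models."""

        def __init__(self):
            self.allele_to_models = {}

        def add_model(self, allele, model):
            # A "model" here is any callable taking a peptide and returning a number.
            self.allele_to_models.setdefault(allele, []).append(model)

        def predict(self, peptide, allele):
            # Look up the models for this allele, then combine their outputs.
            models = self.allele_to_models[allele]
            return geometric_mean([model(peptide) for model in models])

    toy = ToyEnsemblePredictor()
    toy.add_model("HLA-B*57:01", lambda peptide: 100.0)  # made-up output
    toy.add_model("HLA-B*57:01", lambda peptide: 400.0)  # made-up output
    combined = toy.predict("SYNPEPII", allele="HLA-B*57:01")

With these two made-up outputs the combined value is about 200, the square
root of 100 times 400, whereas an arithmetic mean would give 250.
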
Sometimes it's easiest to work directly with `~mhcflurry.Class1NeuralNetwork`.
Here is a simple example of doing so:

.. runblock:: pycon

    >>> from mhcflurry import Class1NeuralNetwork
    >>> network = Class1NeuralNetwork()
    >>> network.fit(
    ...    single_allele_train_data.peptide.values,
    ...    single_allele_train_data.measurement_value.values,
    ...    verbose=0)
    >>> network.predict(["SIINFEKLL"])