.. _commandline_tutorial:

Command-line tutorial
=====================

.. _downloading:

Downloading models
------------------

Most users will use pre-trained MHCflurry models that we release. These models
are distributed separately from the pip package and may be downloaded with the
:ref:`mhcflurry-downloads` tool:

.. code-block:: shell

    $ mhcflurry-downloads fetch models_class1_presentation

Files downloaded with :ref:`mhcflurry-downloads` are stored in a platform-specific
directory. To get the path to downloaded data, you can use:

.. command-output:: mhcflurry-downloads path models_class1_presentation
    :nostderr:

We also release a number of other "downloads," such as curated training data and some
experimental models. To see what's available and what you have downloaded, run
``mhcflurry-downloads info``.

Most users will only need ``models_class1_presentation``, however, as the
presentation predictor includes a peptide / MHC I binding affinity (BA) predictor
as well as an antigen processing (AP) predictor.

.. note::
    The code we use for *generating* the downloads is in the
    ``downloads-generation`` directory in the repository
    (https://github.com/openvax/mhcflurry/tree/master/downloads-generation).

Generating predictions
----------------------

The :ref:`mhcflurry-predict` command generates predictions for individual peptides
(see the next section for how to scan protein sequences for epitopes). By
default it will use the pre-trained models you downloaded above. Other
models can be used by specifying the ``--models`` argument.

Running:

.. command-output::
    mhcflurry-predict
        --alleles HLA-A0201 HLA-A0301
        --peptides SIINFEKL SIINFEKD SIINFEKQ
        --out /tmp/predictions.csv

results in a file like this:

.. command-output::
    cat /tmp/predictions.csv

The binding affinity predictions are given as affinities (KD) in nM in the
``mhcflurry_affinity`` column. Lower values indicate stronger binders. A commonly-used
threshold for peptides with a reasonable chance of being immunogenic is 500 nM.

The ``mhcflurry_affinity_percentile`` gives the quantile of the affinity
prediction among a large number of random peptides tested on that allele. Lower
is stronger. Two percent is a commonly-used threshold.
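
The two thresholds above can be applied to the output file with a few lines of
Python. The following is a minimal sketch, not part of MHCflurry: the rows are
made-up stand-ins for ``/tmp/predictions.csv``, the ``allele`` and ``peptide``
column names are assumptions, and requiring *both* thresholds is just one
possible choice.

.. code-block:: python

    import csv
    import io

    # Hypothetical rows standing in for /tmp/predictions.csv. The numbers are
    # invented for illustration and are not real MHCflurry output.
    EXAMPLE_CSV = """\
    allele,peptide,mhcflurry_affinity,mhcflurry_affinity_percentile
    HLA-A0201,SIINFEKL,12000.0,6.5
    HLA-A0201,SIINFEKD,45.0,0.3
    HLA-A0301,SIINFEKQ,480.0,2.5
    """.replace("    ", "")


    def likely_binders(rows, max_affinity_nm=500.0, max_percentile=2.0):
        """Keep rows that pass both the affinity and the percentile threshold."""
        return [
            row for row in rows
            if float(row["mhcflurry_affinity"]) <= max_affinity_nm
            and float(row["mhcflurry_affinity_percentile"]) <= max_percentile
        ]


    rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
    for row in likely_binders(rows):
        print(row["allele"], row["peptide"], row["mhcflurry_affinity"])

To use it on real output, replace ``io.StringIO(EXAMPLE_CSV)`` with an open
file handle for the CSV written by ``mhcflurry-predict``.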

The last two columns give the antigen processing and presentation scores,
respectively. These range from 0 to 1 with higher values indicating more
favorable processing or presentation.

.. note::

    The processing predictor is experimental and under
    development. It models allele-independent effects that influence whether a
    peptide will be detected in a mass spec experiment. The presentation score is
    a simple logistic regression model that combines the (log) binding affinity
    prediction with the processing score to give a composite prediction. The resulting
    prediction is appropriate for prioritizing potential epitopes to test, but no
    thresholds have yet been established for what constitutes a "high enough"
    presentation score.

In most cases you'll want to specify the input as a CSV file instead of passing
peptides and alleles as command-line arguments. If you're relying on the
processing or presentation scores, you may also want to pass the upstream and
downstream sequences of the peptides from their source proteins for potentially more
accurate cleavage prediction. See the :ref:`mhcflurry-predict` docs.
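
As a rough illustration, an input CSV with flanking sequences could be written
like this. This is a sketch under assumptions: the column names ``allele``,
``peptide``, ``n_flank`` and ``c_flank`` are what we believe the defaults to
be, and the flank sequences are placeholders. Check the
:ref:`mhcflurry-predict` docs for the names your version expects.

.. code-block:: python

    import csv
    import os
    import tempfile

    # Assumed column names; verify them against the mhcflurry-predict docs.
    FIELDS = ["allele", "peptide", "n_flank", "c_flank"]

    # Placeholder rows. The flanks are illustrative, not curated data.
    ROWS = [
        {"allele": "HLA-A0201", "peptide": "SIINFEKL",
         "n_flank": "LEQLE", "c_flank": "TEWTS"},
        {"allele": "HLA-A0301", "peptide": "SIINFEKD",
         "n_flank": "LEQLE", "c_flank": "TEWTS"},
    ]


    def write_input_csv(path, rows):
        """Write one row per (allele, peptide) pair, with a header line."""
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows)


    input_path = os.path.join(tempfile.mkdtemp(), "input.csv")
    write_input_csv(input_path, ROWS)
    with open(input_path) as handle:
        print(handle.read())

The resulting file would then be passed to ``mhcflurry-predict`` in place of
the ``--alleles`` and ``--peptides`` arguments, with ``--out`` naming the
output file as before.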

Scanning protein sequences for predicted MHC I ligands
------------------------------------------------------

Starting in version 1.6.0, MHCflurry supports scanning proteins for MHC-binding
peptides using the ``mhcflurry-predict-scan`` command.

We'll generate predictions across ``example.fasta``, a FASTA file with two short
sequences:

.. literalinclude:: /example.fasta

Here's the ``mhcflurry-predict-scan`` invocation to scan the proteins for
binders to either of two MHC I genotypes (using a 100 nM threshold):

.. command-output::
    mhcflurry-predict-scan
        example.fasta
        --alleles
            HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02
            HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01
        --results-filtered affinity
        --threshold-affinity 100
    :nostderr:

See the :ref:`mhcflurry-predict-scan` docs for more options.
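
Conceptually, scanning a protein means enumerating every candidate peptide in
it and predicting each one. The sketch below shows only the enumeration step
as a plain sliding window. It is not MHCflurry's implementation; the peptide
lengths 8 to 11 and the example sequence are assumptions chosen for
illustration.

.. code-block:: python

    def enumerate_peptides(sequence, lengths=(8, 9, 10, 11)):
        """Yield (offset, peptide) for every substring of the given lengths."""
        for length in lengths:
            # range() is empty when the sequence is shorter than ``length``.
            for offset in range(len(sequence) - length + 1):
                yield offset, sequence[offset:offset + length]


    protein = "MSIINFEKLTEWTSS"  # hypothetical 15-residue sequence
    peptides = list(enumerate_peptides(protein))
    print(len(peptides), "candidate peptides")
    print(peptides[:3])

Each candidate would then be scored against every allele in a genotype, and
only those passing the chosen filter (here, the affinity threshold) are
reported.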


Fitting your own models
-----------------------

If you have your own data and want to fit your own MHCflurry models, you have
a few options. If you have data for only one or a few MHC I alleles, the best
approach is to use the
:ref:`mhcflurry-class1-train-allele-specific-models` command to fit an
"allele-specific" predictor, in which separate neural networks are used for
each allele. Here's an example:

.. code-block:: shell

    $ mhcflurry-class1-train-allele-specific-models \
        --data TRAINING_DATA.csv \
        --hyperparameters hyperparameters.yaml \
        --min-measurements-per-allele 75 \
        --out-models-dir models

.. note::

    MHCflurry predictors are serialized to disk as many files in a directory. The
    command above will write the models to the output directory specified by the
    ``--out-models-dir`` argument. This directory has files like:

    .. program-output::
        ls "$(mhcflurry-downloads path models_class1)/models"
        :shell:
        :nostderr:
        :ellipsis: 4,-4

    The ``manifest.csv`` file gives metadata for all the models used in the predictor.
    There will be a ``weights_...`` file for each model giving its weights
    (the parameters for the neural network). The ``percent_ranks.csv`` file stores a
    histogram of model predictions for each allele over a large number of random
    peptides. It is used for generating the percent ranks at prediction time.

To call :ref:`mhcflurry-class1-train-allele-specific-models` you'll need some
training data. The data we use for our released predictors can be downloaded with
:ref:`mhcflurry-downloads`:

.. code-block:: shell

    $ mhcflurry-downloads fetch data_curated

It looks like this:

.. command-output::
    bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.csv.bz2" | head -n 3
    :shell:
    :nostderr:

To fit pan-allele models like the ones released with MHCflurry, you can use
a similar tool, ``mhcflurry-class1-train-pan-allele-models``. You'll probably
also want to take a look at the scripts used to generate the production models,
which are available in the *downloads-generation* directory in the MHCflurry
repository. The production MHCflurry models were fit using a cluster with several
dozen GPUs over a period of about two days.


Environment variables
-------------------------------------------------

MHCflurry behavior can be modified using these environment variables:

``MHCFLURRY_DEFAULT_CLASS1_MODELS``
    Path to models directory. If you call ``Class1AffinityPredictor.load()``
    with no arguments, the models specified in this environment variable will be
    used. If this environment variable is undefined, the downloaded models for
    the current MHCflurry release are used.

``MHCFLURRY_OPTIMIZATION_LEVEL``
    The pan-allele models can be somewhat slow. As an optimization, when this
    variable is greater than 0 (default is 1), we "stitch" the pan-allele models in
    the ensemble into one large TensorFlow graph. In our experiments
    it gives about a 30% speed improvement. It has no effect on allele-specific
    models. Set this variable to 0 to disable this behavior. This may be helpful
    if you are running out of memory using the pan-allele models.


``MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE``
    For large prediction tasks, it can be helpful to increase the prediction batch
    size, which is set by this environment variable (default is 4096). This
    affects both allele-specific and pan-allele predictors. It can have large
    effects on performance. Alternatively, if you are running out of memory,
    you can try decreasing the batch size.
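
When launching MHCflurry commands from a script, one way to apply these
settings is to build the environment explicitly. The helper below is a
hypothetical convenience function, not part of MHCflurry; it only assembles a
dictionary that can be passed as ``env=`` to ``subprocess.run``.

.. code-block:: python

    import os


    def mhcflurry_env(batch_size=None, optimization_level=None, base_env=None):
        """Return a copy of the environment with MHCflurry overrides applied."""
        env = dict(os.environ if base_env is None else base_env)
        if batch_size is not None:
            env["MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE"] = str(batch_size)
        if optimization_level is not None:
            env["MHCFLURRY_OPTIMIZATION_LEVEL"] = str(optimization_level)
        return env


    # Example: a smaller batch size and no graph stitching, to save memory.
    env = mhcflurry_env(batch_size=1024, optimization_level=0)
    print(env["MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE"],
          env["MHCFLURRY_OPTIMIZATION_LEVEL"])

Environment variable values are always strings, which is why the helper
converts the numbers with ``str()``. If you use MHCflurry as a library instead,
setting the variables in ``os.environ`` before importing ``mhcflurry`` is the
cautious order, since we are assuming here that they may be read at import
time.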