.. _commandline_tutorial:

Command-line tutorial
=====================

.. _downloading:

Downloading models
------------------

Most users will use pre-trained MHCflurry models that we release. These models
are distributed separately from the pip package and may be downloaded with the
:ref:`mhcflurry-downloads` tool:

.. code-block:: shell

    $ mhcflurry-downloads fetch models_class1_presentation

Files downloaded with :ref:`mhcflurry-downloads` are stored in a platform-specific
directory. To get the path to downloaded data, you can use:

.. command-output:: mhcflurry-downloads path models_class1_presentation
    :nostderr:

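If you want to use the downloaded files from a script, one convenient pattern (a
minimal sketch, assuming a POSIX shell) is to capture the path in a variable:

.. code-block:: shell

    $ MODELS_DIR="$(mhcflurry-downloads path models_class1_presentation)"
    $ ls "$MODELS_DIR"
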
We also release a few other "downloads," such as curated training data and some
experimental models. To see what's available and what you have downloaded, run:

.. command-output:: mhcflurry-downloads info
    :nostderr:

Most users will only need ``models_class1_presentation``, however, since the
presentation predictor includes both a peptide / MHC I binding affinity (BA)
predictor and an antigen processing (AP) predictor.

.. note::
    The code we use for *generating* the downloads is in the
    ``downloads_generation`` directory in the repository.

Generating predictions
----------------------

The :ref:`mhcflurry-predict` command generates predictions for individual peptides
(as opposed to scanning protein sequences for epitopes).
By default it will use the pre-trained models you downloaded above. Other
models can be used by specifying the ``--models`` argument.

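For example, to point ``mhcflurry-predict`` at an explicit models directory (a
sketch; the path assumes the presentation predictor lives in a ``models``
subdirectory of the download):

.. code-block:: shell

    $ mhcflurry-predict \
        --models "$(mhcflurry-downloads path models_class1_presentation)/models" \
        --alleles HLA-A0201 \
        --peptides SIINFEKL \
        --out /tmp/predictions_custom_models.csv
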
Running:

.. command-output::
    mhcflurry-predict
        --alleles HLA-A0201 HLA-A0301
        --peptides SIINFEKL SIINFEKD SIINFEKQ
        --out /tmp/predictions.csv

results in a file like this:

.. command-output::
    cat /tmp/predictions.csv

The predictions are given as affinities (KD) in nM in the ``mhcflurry_prediction``
column. The other fields give the 5-95 percentile predictions across
the models in the ensemble and the quantile (percent rank) of the affinity
prediction among a large number of random peptides scored against that allele.

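For a quick look at the output in aligned columns, one option (a sketch assuming a
Unix-like environment with the standard ``column`` utility) is:

.. code-block:: shell

    $ column -t -s, /tmp/predictions.csv
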
The predictions shown above were generated with MHCflurry |version|. Different versions of
MHCflurry can give considerably different results. Even
on the same version, exact predictions may vary (up to about 1 nM) depending
on the Keras backend and other details.

In most cases you'll want to specify the input as a CSV file instead of passing
peptides and alleles as command-line arguments. See the :ref:`mhcflurry-predict`
docs.

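A minimal sketch of that CSV workflow (assuming the default expected column names
``allele`` and ``peptide`` and that the input file is passed as a positional
argument; see the :ref:`mhcflurry-predict` docs for the authoritative format):

.. code-block:: shell

    $ cat > /tmp/input.csv <<EOF
    allele,peptide
    HLA-A0201,SIINFEKL
    HLA-A0301,SIINFEKD
    EOF
    $ mhcflurry-predict /tmp/input.csv --out /tmp/predictions_from_csv.csv
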
Scanning protein sequences for predicted MHC I ligands
------------------------------------------------------

Starting in version 1.6.0, MHCflurry supports scanning proteins for MHC I binding
peptides using the ``mhcflurry-predict-scan`` command.

We'll generate predictions across ``example.fasta``, a FASTA file with two short
sequences:

.. literalinclude:: /example.fasta

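A minimal ``mhcflurry-predict-scan`` run over this file might look like the
following (a sketch; the ``--alleles`` and ``--out`` options are assumed to
parallel ``mhcflurry-predict``, so check the :ref:`mhcflurry-predict-scan` docs
for the authoritative options):

.. code-block:: shell

    $ mhcflurry-predict-scan example.fasta \
        --alleles HLA-A0201 HLA-A0301 \
        --out /tmp/scan_predictions.csv
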
The same kind of scan can also be run through ``mhctools``, a separate package
that can use MHCflurry as its predictor. Here's the ``mhctools`` invocation:

.. command-output::
    mhctools
        --mhc-predictor mhcflurry
        --input-fasta-file example.fasta
        --mhc-alleles A02:01,A03:01
        --mhc-peptide-lengths 8,9,10,11
        --extract-subsequences
        --output-csv /tmp/subsequence_predictions.csv
    :ellipsis: 2,-2
    :nostderr:

This will write a file giving predictions for all subsequences of the specified lengths:

.. command-output::
    head -n 3 /tmp/subsequence_predictions.csv

See the :ref:`mhcflurry-predict-scan` docs for more options.


Fitting your own models
-----------------------

The :ref:`mhcflurry-class1-train-allele-specific-models` command is used to
fit models to training data. The models we release with MHCflurry are trained
with a command like:

.. code-block:: shell

    $ mhcflurry-class1-train-allele-specific-models \
        --data TRAINING_DATA.csv \
        --hyperparameters hyperparameters.yaml \
        --min-measurements-per-allele 75 \
        --out-models-dir models

MHCflurry predictors are serialized to disk as many files in a directory. The
command above will write the models to the output directory specified by the
``--out-models-dir`` argument. This directory has files like:

.. program-output::
    ls "$(mhcflurry-downloads path models_class1)/models"
    :shell:
    :nostderr:
    :ellipsis: 4,-4

The ``manifest.csv`` file gives metadata for all the models used in the predictor.
There will be a ``weights_...`` file for each model giving its weights
(the parameters for the neural network). The ``percent_ranks.csv`` stores a
histogram of model predictions for each allele over a large number of random
peptides. It is used for generating the percent ranks at prediction time.

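For example, to peek at the manifest's column names in a downloaded predictor
(assuming the ``models_class1`` download referenced above is present):

.. code-block:: shell

    $ head -n 1 "$(mhcflurry-downloads path models_class1)/models/manifest.csv"
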
To call :ref:`mhcflurry-class1-train-allele-specific-models` you'll need some
training data. The data we use for our released predictors can be downloaded with
:ref:`mhcflurry-downloads`:

.. code-block:: shell

    $ mhcflurry-downloads fetch data_curated

It looks like this:

.. command-output::
    bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.no_mass_spec.csv.bz2" | head -n 3
    :shell:
    :nostderr:

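Putting these pieces together, a sketch of a training run on the curated data (this
assumes the training command reads the bzip2-compressed CSV directly and that you
have written a ``hyperparameters.yaml``; adjust paths as needed):

.. code-block:: shell

    $ mhcflurry-class1-train-allele-specific-models \
        --data "$(mhcflurry-downloads path data_curated)/curated_training_data.no_mass_spec.csv.bz2" \
        --hyperparameters hyperparameters.yaml \
        --min-measurements-per-allele 75 \
        --out-models-dir /tmp/my-models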



Environment variables
---------------------

MHCflurry behavior can be modified using the following environment variables; a
combined usage sketch follows the list:

``MHCFLURRY_DEFAULT_CLASS1_MODELS``
    Path to models directory. If you call ``Class1AffinityPredictor.load()``
    with no arguments, the models specified in this environment variable will be
    used. If this environment variable is undefined, the downloaded models for
    the current MHCflurry release are used.

``MHCFLURRY_OPTIMIZATION_LEVEL``
    The pan-allele models can be somewhat slow. As an optimization, when this
    variable is greater than 0 (default is 1), we "stitch" the pan-allele models in
    the ensemble into one large TensorFlow graph. In our experiments
    it gives about a 30% speed improvement. It has no effect on allele-specific
    models. Set this variable to 0 to disable this behavior. This may be helpful
    if you are running out of memory using the pan-allele models.


``MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE``
    For large prediction tasks, it can be helpful to increase the prediction batch
    size, which is set by this environment variable (default is 4096). This
    affects both allele-specific and pan-allele predictors. It can have large
    effects on performance. Alternatively, if you are running out of memory,
    you can try decreasing the batch size.
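
A combined sketch of setting these variables for a large batch prediction run (the
models path assumes the ``models`` subdirectory layout shown above):

.. code-block:: shell

    $ # Explicit models directory for predictors loaded with no arguments
    $ export MHCFLURRY_DEFAULT_CLASS1_MODELS="$(mhcflurry-downloads path models_class1)/models"
    $ # Disable pan-allele graph stitching if memory is tight
    $ export MHCFLURRY_OPTIMIZATION_LEVEL=0
    $ # Larger batches can speed up big prediction jobs
    $ export MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE=65536
    $ mhcflurry-predict --alleles HLA-A0201 --peptides SIINFEKL --out /tmp/predictions.csv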