
.. _commandline_tutorial:

Command-line tutorial
=====================

.. _downloading:

Downloading models
------------------

Most users will use pre-trained MHCflurry models that we release. These models
are distributed separately from the pip package and may be downloaded with the
:ref:`mhcflurry-downloads` tool:

.. code-block:: shell

    $ mhcflurry-downloads fetch models_class1

Files downloaded with :ref:`mhcflurry-downloads` are stored in a platform-specific
directory. To get the path to downloaded data, you can use:

.. command-output:: mhcflurry-downloads path models_class1
    :nostderr:

We also release a few other "downloads," such as curated training data and some
experimental models. To see what's available and what you have downloaded, run:

.. command-output:: mhcflurry-downloads info
    :nostderr:

.. note::

    The code we use for *generating* the downloads is in the
    ``downloads_generation`` directory in the repository.


Generating predictions
----------------------

The :ref:`mhcflurry-predict` command generates predictions from the command line.
By default it uses the pre-trained models you downloaded above; to use other
models, pass the ``--models`` argument.

Running:

.. command-output::
    mhcflurry-predict
        --alleles HLA-A0201 HLA-A0301
        --peptides SIINFEKL SIINFEKD SIINFEKQ
        --out /tmp/predictions.csv
    :nostderr:

results in a file like this:

.. command-output::
    cat /tmp/predictions.csv

The predictions are given as affinities (KD) in nM in the ``mhcflurry_prediction``
column. The other fields give the 5th and 95th percentile predictions across
the models in the ensemble, and the quantile of the affinity prediction among
a large number of random peptides tested on that allele.
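
As a rough illustration of working with this output file, the sketch below reads
a predictions CSV with Python's standard ``csv`` module and keeps the rows whose
predicted affinity is at or below a cutoff. Only the ``mhcflurry_prediction``
column name comes from the file described above. The function name
``select_binders`` is hypothetical, and the 500 nM default is a commonly used
convention for calling binders, not something MHCflurry prescribes.

.. code-block:: python

    import csv


    def select_binders(csv_path, max_affinity_nm=500.0):
        """Return rows whose predicted affinity (nM) is at or below the cutoff.

        Lower nM values mean tighter predicted binding.
        """
        with open(csv_path, newline="") as handle:
            rows = list(csv.DictReader(handle))
        # Values are read as strings, so convert before comparing.
        binders = [
            row for row in rows
            if float(row["mhcflurry_prediction"]) <= max_affinity_nm
        ]
        # Strongest predicted binders (smallest nM) first.
        binders.sort(key=lambda row: float(row["mhcflurry_prediction"]))
        return binders

For example, ``select_binders("/tmp/predictions.csv")`` would return the rows of
the file above that meet the assumed cutoff, each as a dictionary keyed by
column name.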

The predictions shown above were generated with MHCflurry |version|. Different
versions of MHCflurry can give considerably different results. Even with the
same version, exact predictions may vary (by up to about 1 nM) depending on the
Keras backend and other details.

In most cases you'll want to specify the input as a CSV file instead of passing
peptides and alleles as command-line arguments. See the :ref:`mhcflurry-predict`
documentation for details.
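
A minimal sketch of preparing such an input file in Python follows. It assumes
the input CSV uses columns named ``allele`` and ``peptide``, which we believe are
the defaults for :ref:`mhcflurry-predict`; check the command's documentation for
your version. The helper name ``write_prediction_input`` and the file path are
hypothetical.

.. code-block:: python

    import csv
    import itertools


    def write_prediction_input(csv_path, alleles, peptides):
        """Write one row per (allele, peptide) pair; return the row count."""
        pairs = list(itertools.product(alleles, peptides))
        with open(csv_path, "w", newline="") as handle:
            writer = csv.writer(handle)
            # Assumed default column names; see the mhcflurry-predict docs.
            writer.writerow(["allele", "peptide"])
            writer.writerows(pairs)
        return len(pairs)


    # Same alleles and peptides as the command-line example above.
    n_rows = write_prediction_input(
        "/tmp/prediction_input.csv",
        alleles=["HLA-A0201", "HLA-A0301"],
        peptides=["SIINFEKL", "SIINFEKD", "SIINFEKQ"],
    )

Under these assumptions, the resulting file could then be passed to
:ref:`mhcflurry-predict` as its input CSV in place of the ``--alleles`` and
``--peptides`` arguments.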

Fitting your own models
-----------------------

The :ref:`mhcflurry-class1-train-allele-specific-models` command is used to
fit models to training data. The models we release with MHCflurry are trained
with a command like:

.. code-block:: shell

    $ mhcflurry-class1-train-allele-specific-models \
        --data TRAINING_DATA.csv \
        --hyperparameters hyperparameters.yaml \
        --min-measurements-per-allele 75 \
        --out-models-dir models

MHCflurry predictors are serialized to disk as many files in a directory. The
command above will write the models to the output directory specified by the
``--out-models-dir`` argument. This directory has files like:

.. program-output::
    ls "$(mhcflurry-downloads path models_class1)/models"
    :shell:
    :nostderr:
    :ellipsis: 4,-4

The ``manifest.csv`` file gives metadata for all the models used in the predictor.
There will be a ``weights_...`` file for each model giving its weights
(the parameters for the neural network). The ``percent_ranks.csv`` file stores a
histogram of model predictions for each allele over a large number of random
peptides. It is used for generating the percent ranks at prediction time.
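
To make the directory layout concrete, here is a small sketch that checks a
models directory for the files described above. It relies only on the file names
mentioned in this section (``manifest.csv``, ``percent_ranks.csv`` and the
``weights_`` prefix); the function name ``summarize_models_dir`` is hypothetical,
and the sketch does not load or validate the models themselves.

.. code-block:: python

    import os


    def summarize_models_dir(models_dir):
        """Report which of the expected predictor files are present."""
        names = sorted(os.listdir(models_dir))
        return {
            "has_manifest": "manifest.csv" in names,
            "has_percent_ranks": "percent_ranks.csv" in names,
            # Only the "weights_" prefix is relied on here; the rest of
            # each file name is not interpreted.
            "num_weights_files": sum(
                1 for name in names if name.startswith("weights_")),
        }

A quick check like this can help confirm that a training run wrote a complete
predictor before you point ``--models`` at the directory.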

To call :ref:`mhcflurry-class1-train-allele-specific-models` you'll need some
training data. The data we use for our released predictors can be downloaded with
:ref:`mhcflurry-downloads`:

.. code-block:: shell

    $ mhcflurry-downloads fetch data_curated

It looks like this:

.. command-output::
    bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.no_mass_spec.csv.bz2" | head -n 3
    :shell:
    :nostderr:
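
Before training, it can be useful to see which alleles have enough data. The
sketch below counts rows per allele in a bzip2-compressed CSV and keeps those
with at least a minimum number of measurements, loosely mirroring the idea behind
``--min-measurements-per-allele``. It assumes the file has an ``allele`` column,
and the function name ``alleles_with_enough_data`` is hypothetical. The training
command may apply additional filtering, so treat the counts as an approximation.

.. code-block:: python

    import bz2
    import collections
    import csv


    def alleles_with_enough_data(csv_bz2_path, min_measurements=75):
        """Map allele -> row count, for alleles meeting the minimum."""
        counts = collections.Counter()
        # bz2.open in text mode decompresses the file on the fly.
        with bz2.open(csv_bz2_path, "rt", newline="") as handle:
            for row in csv.DictReader(handle):
                counts[row["allele"]] += 1  # "allele" column name is assumed
        return {
            allele: count
            for allele, count in counts.items()
            if count >= min_measurements
        }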


Scanning protein sequences for predicted epitopes
-------------------------------------------------

The `mhctools <https://github.com/hammerlab/mhctools>`__ package
provides support for scanning protein sequences to find predicted
epitopes. It supports MHCflurry as well as other binding predictors.
Here is an example.

First, install ``mhctools`` if it is not already installed:

.. code-block:: shell

    $ pip install mhctools

We'll generate predictions across ``example.fasta``, a FASTA file with two short
sequences:

.. literalinclude:: /example.fasta

Here's the ``mhctools`` invocation. See ``mhctools -h`` for more information.

.. command-output::
    mhctools
        --mhc-predictor mhcflurry
        --input-fasta-file example.fasta
        --mhc-alleles A02:01,A03:01
        --mhc-peptide-lengths 8,9,10,11
        --extract-subsequences
        --output-csv /tmp/subsequence_predictions.csv
    :ellipsis: 2,-2
    :nostderr:

This will write a file giving predictions for all subsequences of the specified lengths:

.. command-output::
    head -n 3 /tmp/subsequence_predictions.csv
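
To illustrate what scanning all subsequences of the specified lengths means, the
sketch below enumerates every window of each length along a sequence. This is an
illustration of the idea only, not the ``mhctools`` implementation. The function
name ``extract_subsequences``, the 0-based offsets, and the toy sequence are
assumptions made for this example.

.. code-block:: python

    def extract_subsequences(sequence, lengths=(8, 9, 10, 11)):
        """Yield (offset, peptide) for every window of the given lengths."""
        for length in lengths:
            # range() is empty when the sequence is shorter than the window.
            for offset in range(len(sequence) - length + 1):
                yield offset, sequence[offset:offset + length]


    # A made-up 10-residue sequence: three 8-mers plus two 9-mers.
    peptides = list(extract_subsequences("ACDEFGHIKL", lengths=(8, 9)))

A sequence of length ``n`` contributes ``n - k + 1`` peptides for each length
``k``, so the number of predictions grows quickly with protein length and with
the number of lengths and alleles requested.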