Downloading models
------------------
Most users will use pre-trained MHCflurry models that we release. These models
are distributed separately from the pip package and may be downloaded with the
:ref:`mhcflurry-downloads` tool:
.. code-block:: shell
Files downloaded with :ref:`mhcflurry-downloads` are stored in a platform-specific
directory. To get the path to downloaded data, you can use:
.. command-output:: mhcflurry-downloads path models_class1_presentation
We also release a few other "downloads," such as curated training data and some
experimental models. To see what's available and what you have downloaded, run:
.. command-output:: mhcflurry-downloads info
    :nostderr:
Most users, however, will only need ``models_class1_presentation``, because the
presentation predictor includes a peptide / MHC I binding affinity (BA) predictor
as well as an antigen processing (AP) predictor.
The code we use for *generating* the downloads is in the
``downloads_generation`` directory in the repository.
The :ref:`mhcflurry-predict` command generates predictions for individual peptides
(as opposed to scanning protein sequences for epitopes).
By default it will use the pre-trained models you downloaded above. Other
models can be used by specifying the ``--models`` argument.
Running:
.. command-output::
    mhcflurry-predict
        --alleles HLA-A0201 HLA-A0301
        --peptides SIINFEKL SIINFEKD SIINFEKQ
        --out /tmp/predictions.csv
The predictions are given as affinities (KD) in nM in the ``mhcflurry_prediction``
column. The other fields give the 5th and 95th percentile predictions across
the models in the ensemble, and the quantile of the affinity prediction among
a large number of random peptides tested on that allele.
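As a rough illustration of working with this output, the sketch below reads a
predictions CSV and keeps the rows at or below an affinity cutoff. Only the
``mhcflurry_prediction`` column name comes from the text above; the ``allele``
and ``peptide`` column names, the 500 nM default cutoff, and the sample values
are assumptions made for the example, not real MHCflurry output.

```python
import csv
import io

# Placeholder rows for illustration only: these numbers are NOT real
# MHCflurry predictions. The "allele" and "peptide" columns are assumed.
SAMPLE_CSV = """allele,peptide,mhcflurry_prediction
HLA-A0201,SIINFEKL,1000.0
HLA-A0201,SIINFEKD,20000.0
HLA-A0301,SIINFEKQ,250.0
"""


def binders_below(csv_text, max_affinity_nm=500.0):
    """Return (allele, peptide, affinity) rows at or below the cutoff.

    Affinities are in nM, so lower values mean tighter predicted binding.
    Results are sorted from tightest to weakest.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        affinity = float(row["mhcflurry_prediction"])
        if affinity <= max_affinity_nm:
            hits.append((row["allele"], row["peptide"], affinity))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    for allele, peptide, affinity in binders_below(SAMPLE_CSV):
        print(f"{allele}\t{peptide}\t{affinity:.1f} nM")
```

To use it on a real run, pass the contents of the file written by ``--out``
instead of ``SAMPLE_CSV`` and adjust the column names to match its header.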
The predictions shown above were generated with MHCflurry |version|. Different versions of
MHCflurry can give considerably different results. Even
on the same version, exact predictions may vary (up to about 1 nM) depending
In most cases you'll want to specify the input as a CSV file instead of passing
peptides and alleles as command-line arguments. See the :ref:`mhcflurry-predict` docs.
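A minimal sketch of preparing such an input file is shown below. It writes one
row per allele/peptide pair using Python's standard ``csv`` module. The column
names ``allele`` and ``peptide`` and the helper name ``write_prediction_input``
are assumptions for illustration; check the :ref:`mhcflurry-predict` docs for
the column names your version expects.

```python
import csv
import itertools
import os
import tempfile


def write_prediction_input(path, alleles, peptides):
    """Write one row per (allele, peptide) pair and return the row count."""
    pairs = list(itertools.product(alleles, peptides))
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["allele", "peptide"])  # assumed column names
        writer.writerows(pairs)
    return len(pairs)


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.csv")
        count = write_prediction_input(
            path,
            alleles=["HLA-A0201", "HLA-A0301"],
            peptides=["SIINFEKL", "SIINFEKD", "SIINFEKQ"],
        )
        print(f"wrote {count} rows to {path}")
```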
Scanning protein sequences for predicted MHC I ligands
------------------------------------------------------
Starting in version 1.6.0, MHCflurry supports scanning proteins for MHC I binding
peptides using the ``mhcflurry-predict-scan`` command.
We'll generate predictions across ``example.fasta``, a FASTA file with two short
sequences:
.. literalinclude:: /example.fasta
Here's the ``mhctools`` invocation, which selects MHCflurry as the predictor:
.. command-output::
    mhctools
        --mhc-predictor mhcflurry
        --input-fasta-file example.fasta
        --mhc-alleles A02:01,A03:01
        --mhc-peptide-lengths 8,9,10,11
        --extract-subsequences
        --output-csv /tmp/subsequence_predictions.csv
    :ellipsis: 2,-2
    :nostderr:
This will write a file giving predictions for all subsequences of the specified lengths:
.. command-output::
    head -n 3 /tmp/subsequence_predictions.csv
See the :ref:`mhcflurry-predict-scan` docs for more options.
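To make "all subsequences of the specified lengths" concrete, the sketch below
enumerates every window of length 8 to 11 in a sequence. It only illustrates
the sliding-window idea and is not the code used by either tool; the function
name and the sample sequence are hypothetical.

```python
def extract_subsequences(sequence, lengths=(8, 9, 10, 11)):
    """Yield (offset, peptide) for every window of the given lengths.

    Offsets are 0-based. Lengths longer than the sequence yield nothing.
    """
    for length in lengths:
        for offset in range(len(sequence) - length + 1):
            yield offset, sequence[offset:offset + length]


if __name__ == "__main__":
    # Hypothetical 12-residue sequence, not taken from example.fasta.
    protein = "SIINFEKLAAAY"
    for offset, peptide in extract_subsequences(protein):
        print(offset, peptide)
```

A sequence of length ``n`` contributes ``n - k + 1`` peptides for each length
``k``, so the number of predictions grows roughly linearly with protein length
for a fixed set of peptide lengths.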
Fitting your own models
-----------------------
The :ref:`mhcflurry-class1-train-allele-specific-models` command is used to
fit models to training data. The models we release with MHCflurry are trained
with a command like:
.. code-block:: shell

    $ mhcflurry-class1-train-allele-specific-models \
        --data TRAINING_DATA.csv \
        --hyperparameters hyperparameters.yaml \
        --min-measurements-per-allele 75 \
        --out-models-dir models
MHCflurry predictors are serialized to disk as many files in a directory. The
command above will write the models to the output directory specified by the
``--out-models-dir`` argument. This directory has files like:
.. program-output::
    ls "$(mhcflurry-downloads path models_class1)/models"
    :shell:
    :nostderr:
The ``manifest.csv`` file gives metadata for all the models used in the predictor.
There will be a ``weights_...`` file for each model giving its weights
(the parameters for the neural network). The ``percent_ranks.csv`` file stores a
histogram of model predictions for each allele over a large number of random
peptides. It is used for generating the percent ranks at prediction time.
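The idea behind a percent rank can be sketched as follows: given background
predictions for random peptides on one allele, report the percentage of them
that bind at least as tightly as the peptide of interest. This is a simplified
illustration that uses a sorted list of raw values rather than the stored
histogram, so it is not MHCflurry's implementation; the function name and the
background values are hypothetical.

```python
import bisect


def percent_rank(affinity_nm, random_peptide_affinities):
    """Percent of background peptides with affinity <= affinity_nm.

    Affinities are in nM, so a lower percent rank means the peptide is
    predicted to bind more tightly than most random peptides.
    """
    background = sorted(random_peptide_affinities)
    # bisect_right counts how many background values are <= affinity_nm.
    count = bisect.bisect_right(background, affinity_nm)
    return 100.0 * count / len(background)


if __name__ == "__main__":
    # Hypothetical background values; a real background is far larger.
    background = [50.0, 500.0, 5000.0, 50000.0]
    print(percent_rank(500.0, background))
```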
To call :ref:`mhcflurry-class1-train-allele-specific-models` you'll need some
training data. The data we use for our released predictors can be downloaded with
:ref:`mhcflurry-downloads`:
.. code-block:: shell

    $ mhcflurry-downloads fetch data_curated
It looks like this:
.. command-output::
    bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.no_mass_spec.csv.bz2" | head -n 3
Environment variables
-------------------------------------------------
MHCflurry behavior can be modified using these environment variables:
``MHCFLURRY_DEFAULT_CLASS1_MODELS``
    Path to models directory. If you call ``Class1AffinityPredictor.load()``
    with no arguments, the models specified in this environment variable will be
    used. If this environment variable is undefined, the downloaded models for
    the current MHCflurry release are used.

``MHCFLURRY_OPTIMIZATION_LEVEL``
    The pan-allele models can be somewhat slow. As an optimization, when this
    variable is greater than 0 (default is 1), we "stitch" the pan-allele models in
    the ensemble into one large TensorFlow graph. In our experiments
    this gives about a 30% speed improvement. It has no effect on allele-specific
    models. Set this variable to 0 to disable this behavior. This may be helpful
    if you are running out of memory using the pan-allele models.

``MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE``
    For large prediction tasks, it can be helpful to increase the prediction batch
    size, which is set by this environment variable (default is 4096). This
    affects both allele-specific and pan-allele predictors. It can have large
    effects on performance. Alternatively, if you are running out of memory,
    you can try decreasing the batch size.
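The fallback behavior described above can be sketched as a small helper that
reads the three variables and applies the documented defaults (1 for the
optimization level, 4096 for the batch size, and no path when the models
variable is unset). The helper name and the returned dictionary keys are
hypothetical; this shows how the settings resolve, not how MHCflurry reads
them internally.

```python
import os


def read_mhcflurry_settings(environ=None):
    """Resolve the MHCflurry environment variables with their defaults."""
    if environ is None:
        environ = os.environ
    return {
        # None means "use the downloaded models for the current release".
        "models_dir": environ.get("MHCFLURRY_DEFAULT_CLASS1_MODELS"),
        # Values greater than 0 enable the pan-allele graph optimization.
        "optimization_level": int(
            environ.get("MHCFLURRY_OPTIMIZATION_LEVEL", "1")),
        # Larger batches may be faster; smaller ones use less memory.
        "predict_batch_size": int(
            environ.get("MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE", "4096")),
    }


if __name__ == "__main__":
    print(read_mhcflurry_settings())
    # Example override for a memory-constrained machine (values are arbitrary).
    print(read_mhcflurry_settings({
        "MHCFLURRY_OPTIMIZATION_LEVEL": "0",
        "MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE": "1024",
    }))
```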