diff --git a/docs/commandline_tutorial.rst b/docs/commandline_tutorial.rst index a2a6d22e505c09c0bc39e569bfb4904af22c9c1c..e9118abde664fc2e71488adbccdd548e31f2d561 100644 --- a/docs/commandline_tutorial.rst +++ b/docs/commandline_tutorial.rst @@ -62,9 +62,9 @@ The binding affinity predictions are given as affinities (KD) in nM in the ``mhcflurry_affinity`` column. Lower values indicate stronger binders. A commonly-used threshold for peptides with a reasonable chance of being immunogenic is 500 nM. -The ``mhcflurry_affinity_percentile`` gives the quantile of the affinity -prediction among a large number of random peptides tested on that allele. Lower -is stronger. Two percent is a commonly-used threshold. +The ``mhcflurry_affinity_percentile`` gives the percentile of the affinity +prediction among a large number of random peptides tested on that allele (range +0 - 100). Lower is stronger. Two percent is a commonly-used threshold. The last two columns give the antigen processing and presentation scores, respectively. These range from 0 to 1 with higher values indicating more @@ -72,13 +72,13 @@ favorable processing or presentation. .. note:: - The processing predictor is experimental and under - development. It models allele-independent effects that influence whether a + The processing predictor is experimental. It models allele-independent + effects that influence whether a peptide will be detected in a mass spec experiment. The presentation score is a simple logistic regression model that combines the (log) binding affinity prediction with the processing score to give a composite prediction. The resulting - prediction is appropriate for prioritizing potential epitopes to test, but no - thresholds have yet been established for what constitutes a "high enough" + prediction may be useful for prioritizing potential epitopes, but no + thresholds have been established for what constitutes a "high enough" presentation score. 
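As a rough illustration of the composite described in the note above, the following sketch combines a log-transformed affinity with a processing score through a logistic function. The function name ``presentation_score_sketch``, the use of the natural log, and all three coefficients are assumptions made up for this example; they are not the fitted values or the exact transform used by MHCflurry.

```python
import math


def presentation_score_sketch(affinity_nm, processing_score,
                              intercept=3.0, w_affinity=-1.0, w_processing=2.0):
    """Toy logistic combination of log affinity and processing score.

    The coefficients are hypothetical placeholders, not MHCflurry's
    fitted parameters.
    """
    # Lower nM means stronger binding, so the affinity weight is negative.
    log_affinity = math.log(affinity_nm)
    z = intercept + w_affinity * log_affinity + w_processing * processing_score
    # Logistic function maps the linear combination into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


strong = presentation_score_sketch(affinity_nm=20.0, processing_score=0.9)
weak = presentation_score_sketch(affinity_nm=20000.0, processing_score=0.1)
print(strong > weak)
```

With these placeholder weights, a tight binder with a favorable processing score ranks above a weak binder with a poor one, which is the only property the sketch is meant to show.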
In most cases you'll want to specify the input as a CSV file instead of passing @@ -122,20 +122,65 @@ a few options. If you have data for only one or a few MHC I alleles, the best approach is to use the :ref:`mhcflurry-class1-train-allele-specific-models` command to fit an "allele-specific" predictor, in which separate neural networks are used for -each allele. Here's an example: +each allele. + +To call :ref:`mhcflurry-class1-train-allele-specific-models` you'll need some +training data. The data we use for our released predictors can be downloaded with +:ref:`mhcflurry-downloads`: + +.. code-block:: shell + + $ mhcflurry-downloads fetch data_curated + +It looks like this: + +.. command-output:: + bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.csv.bz2" | head -n 3 + :shell: + :nostderr: + +Here's an example invocation to fit a predictor: .. code-block:: shell $ mhcflurry-class1-train-allele-specific-models \ - --data TRAINING_DATA.csv \ + --data curated_training_data.csv.bz2 \ --hyperparameters hyperparameters.yaml \ --min-measurements-per-allele 75 \ --out-models-dir models +The ``hyperparameters.yaml`` file gives the list of neural network architectures +to train models for. Here's an example specifying a single architecture: + +.. code-block:: yaml + + - activation: tanh + dense_layer_l1_regularization: 0.0 + dropout_probability: 0.0 + early_stopping: true + layer_sizes: [8] + locally_connected_layers: [] + loss: custom:mse_with_inequalities + max_epochs: 500 + minibatch_size: 128 + n_models: 4 + output_activation: sigmoid + patience: 20 + peptide_amino_acid_encoding: BLOSUM62 + random_negative_affinity_max: 50000.0 + random_negative_affinity_min: 20000.0 + random_negative_constant: 25 + random_negative_rate: 0.0 + validation_split: 0.1 + +The available hyperparameters for binding predictors are defined in +`~mhcflurry.Class1NeuralNetwork`. To see exactly how +these are used you will need to read the source code. + .. 
note:: MHCflurry predictors are serialized to disk as many files in a directory. The - command above will write the models to the output directory specified by the + model training command above will write the models to the output directory specified by the ``--out-models-dir`` argument. This directory has files like: .. program-output:: @@ -150,27 +195,19 @@ each allele. Here's an example: histogram of model predictions for each allele over a large number of random peptides. It is used for generating the percent ranks at prediction time. -To call :ref:`mhcflurry-class1-train-allele-specific-models` you'll need some -training data. The data we use for our released predictors can be downloaded with -:ref:`mhcflurry-downloads`: - -.. code-block:: shell - - $ mhcflurry-downloads fetch data_curated - -It looks like this: - -.. command-output:: - bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.csv.bz2" | head -n 3 - :shell: - :nostderr: - To fit pan-allele models like the ones released with MHCflurry, you can use -a similar tool, ``mhcflurry-class1-train-pan-allele-models``. You'll probably +a similar tool, :ref:`mhcflurry-class1-train-pan-allele-models`. You'll probably also want to take a look at the scripts used to generate the production models, which are available in the *downloads-generation* directory in the MHCflurry -repository. The production MHCflurry models were fit using a cluster with several -dozen GPUs over a period of about two days. +repository. See the scripts in the *models_class1_pan* subdirectory to see how the +fitting and model selection were done for the models currently distributed with MHCflurry. + +.. note:: + + The production MHCflurry models were fit using a cluster with several + dozen GPUs over a period of about two days. If you run model selection over fewer + architectures, however, it should be possible to fit a predictor using fewer + resources. 
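The percent ranks mentioned earlier compare a prediction against predictions for a large number of random peptides on the same allele. A minimal sketch of that idea follows, assuming a plain sorted list of background predictions rather than the calibration histogram that MHCflurry stores. The helper ``percent_rank_sketch`` and the synthetic background values are hypothetical and exist only for illustration.

```python
from bisect import bisect_right


def percent_rank_sketch(affinity_nm, background_nm):
    """Percent of background predictions at least as strong as the query.

    Lower nM means stronger binding, so a low percent rank marks a
    peptide that is stronger than most random peptides.
    """
    ordered = sorted(background_nm)
    # Count background values less than or equal to the query affinity.
    n_as_strong = bisect_right(ordered, affinity_nm)
    return 100.0 * n_as_strong / len(ordered)


# Synthetic stand-in for predictions on random peptides: 50, 100, ..., 50000 nM.
background = [float(x) for x in range(50, 50001, 50)]

rank = percent_rank_sketch(1000.0, background)
print(rank <= 2.0)  # whether the query meets the two percent threshold
```

A real calibration would use model predictions for random peptides on the allele of interest instead of an evenly spaced grid.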
Environment variables diff --git a/docs/python_tutorial.rst b/docs/python_tutorial.rst index b906c29fe57c534ef3afdd63f8871e8ee0c948ff..9e74e77b0a4354d97199a9260dcc67d7cacff8e8 100644 --- a/docs/python_tutorial.rst +++ b/docs/python_tutorial.rst @@ -151,7 +151,7 @@ useful methods. Lower level interfaces ---------------------------------- -The `~mhcflurry.Class1PresentationPredictor` predictor delegates to a +The `~mhcflurry.Class1PresentationPredictor` delegates to a `~mhcflurry.Class1AffinityPredictor` instance for binding affinity predictions. If all you need are binding affinities, you can use this instance directly.