diff --git a/docs/Makefile b/docs/Makefile index 83ec2f208799c51ae7761baa32a418c988a353a8..e6118afe4c56ccaa0c30f8d14f84218605857e52 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -71,7 +71,11 @@ readme: text .PHONY: clean clean: + # Added by tim: preserve html/.git + rm -rf $(BUILDDIR)/html/* + mv $(BUILDDIR)/html /tmp/html-bk rm -rf $(BUILDDIR)/* + mv /tmp/html-bk $(BUILDDIR)/html .PHONY: html html: diff --git a/docs/commandline_tutorial.rst b/docs/commandline_tutorial.rst index 253f417226b938ddf3c205f90896c168e29e6373..6dc803c81b40657c2be83f8d5106a46654fb1c73 100644 --- a/docs/commandline_tutorial.rst +++ b/docs/commandline_tutorial.rst @@ -23,7 +23,7 @@ directory. To get the path to downloaded data, you can use: :nostderr: We also release a few other "downloads," such as curated training data and some -experimental models. To see what you have downloaded, run: +experimental models. To see what's available and what you have downloaded, run: .. command-output:: mhcflurry-downloads info :nostderr: @@ -38,7 +38,7 @@ Generating predictions ---------------------- The :ref:`mhcflurry-predict` command generates predictions from the command-line. -By default it will use the pre-trained models you downloaded above but other +By default it will use the pre-trained models you downloaded above; other models can be used by specifying the ``--models`` argument. Running: @@ -53,7 +53,7 @@ Running: results in a file like this: .. command-output:: - head -n 3 /tmp/predictions.csv + cat /tmp/predictions.csv The predictions are given as affinities (KD) in nM in the ``mhcflurry_prediction`` column. The other fields give the 5-95 percentile predictions across @@ -62,7 +62,7 @@ a large number of random peptides tested on that allele. The predictions shown above were generated with MHCflurry |version|. Different versions of MHCflurry can give considerably different results. Even -on the same version, your exact predictions may vary (up to about 1 nM) depending +on the same version, exact predictions may vary (up to about 1 nM) depending on the Keras backend and other details. In most cases you'll want to specify the input as a CSV file instead of passing diff --git a/docs/conf.py b/docs/conf.py index 84b8ec5e17e03d873ce05e8cb45e2db5c65e0da4..be2e10f48ca71e6332c6ae97636321d6fa7eca7a 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -20,8 +20,10 @@ import textwrap import logging # Hack added by tim for bug in autoprogram extension under Python 2. -from sphinx.util.pycompat import indent +from sphinx.util.pycompat import indent # pylint: disable=import-error textwrap.indent = indent + +# Disable logging (added by tim) logging.disable(logging.ERROR) # If extensions (or modules to document with autodoc) are in another directory, @@ -45,10 +47,10 @@ extensions = [ 'sphinx.ext.viewcode', 'sphinx.ext.githubpages', 'numpydoc', - #'sphinx_autorun', 'sphinxcontrib.autorun2', 'sphinxcontrib.programoutput', 'sphinxcontrib.autoprogram', + 'sphinx.ext.githubpages', ] # Add any paths that contain templates here, relative to this directory. diff --git a/docs/package_readme/readme.generated.txt b/docs/package_readme/readme.generated.txt index 6f16f412cd2e08362c03ad4d1db0374e9a327e1f..2ba05d584482f4c52885d60d4bd1a5f67b72741e 100644 --- a/docs/package_readme/readme.generated.txt +++ b/docs/package_readme/readme.generated.txt @@ -25,7 +25,7 @@ therefore support a fixed set of common class I alleles for which sufficient published training data is available (see Supported alleles and peptide lengths). -MHCflurry supports Python versions 2.7 and 3.4+. It uses the Keras +MHCflurry supports Python versions 2.7 and 3.4+. It uses the keras neural network library via either the Tensorflow or Theano backends. GPUs may optionally be used for a generally modest speed improvement. @@ -82,8 +82,15 @@ be downloaded with the mhcflurry-downloads tool: $ mhcflurry-downloads fetch models_class1 +Files downloaded with mhcflurry-downloads are stored in a platform- +specific directory. To get the path to downloaded data, you can use: + + $ mhcflurry-downloads path models_class1 + /Users/tim/Library/Application Support/mhcflurry/4/1.0.0/models_class1/ + We also release a few other “downloads,” such as curated training data -and some experimental models. To see what you have downloaded, run: +and some experimental models. To see what’s available and what you +have downloaded, run: $ mhcflurry-downloads info Environment variables @@ -103,13 +110,7 @@ and some experimental models. To see what you have downloaded, run: data_kim2014 NO NO http://github.com/hammerlab/mhcflurry/releases/download/0.9.1/data_kim2014.tar.bz2 data_curated YES YES https://github.com/hammerlab/mhcflurry/releases/download/pre-1.0/data_curated.tar.bz2 -Files downloaded with mhcflurry-downloads are stored in a platform- -specific directory. To get the path to downloaded data, you can use: - - $ mhcflurry-downloads path models_class1 - /Users/tim/Library/Application Support/mhcflurry/4/1.0.0/models_class1/ - -Note: The code we use for generating the downloads is in the +Note: The code we use for *generating* the downloads is in the "downloads_generation" directory in the repository. @@ -118,8 +119,7 @@ Generating predictions The mhcflurry-predict command generates predictions from the command- line. By default it will use the pre-trained models you downloaded -above but other models can be used by specifying the "--models" -argument. +above; other models can be used by specifying the "--models" argument. Running: @@ -131,10 +131,14 @@ Running: results in a file like this: - $ head -n 3 /tmp/predictions.csv + $ cat /tmp/predictions.csv allele,peptide,mhcflurry_prediction,mhcflurry_prediction_low,mhcflurry_prediction_high,mhcflurry_prediction_percentile - HLA-A0201,SIINFEKL,4899.047843425702,2767.7636539507857,7269.683642935029,6.509787499999997 - HLA-A0201,SIINFEKD,21050.420242970613,16834.65859138968,24129.046091695887,34.297175 + HLA-A0201,SIINFEKL,4899.04784343,2767.76365395,7269.68364294,6.5097875 + HLA-A0201,SIINFEKD,21050.420243,16834.6585914,24129.0460917,34.297175 + HLA-A0201,SIINFEKQ,21048.4726578,16736.5612549,24111.0131144,34.297175 + HLA-A0301,SIINFEKL,28227.2989092,24826.3079098,32714.285974,33.9512125 + HLA-A0301,SIINFEKD,30816.7212184,27685.5084708,36037.3259046,41.225775 + HLA-A0301,SIINFEKQ,24183.0210465,19346.154182,32263.7124753,24.8109625 The predictions are given as affinities (KD) in nM in the "mhcflurry_prediction" column. The other fields give the 5-95 @@ -144,8 +148,8 @@ peptides tested on that allele. The predictions shown above were generated with MHCflurry 1.0.0. Different versions of MHCflurry can give considerably different -results. Even on the same version, your exact predictions may vary (up -to about 1 nM) depending on the Keras backend and other details. +results. Even on the same version, exact predictions may vary (up to +about 1 nM) depending on the Keras backend and other details. In most cases you’ll want to specify the input as a CSV file instead of passing peptides and alleles as commandline arguments. See @@ -231,188 +235,216 @@ information. --mhc-peptide-lengths 8,9,10,11 --extract-subsequences --output-csv /tmp/subsequence_predictions.csv - 2017-12-21 16:29:58,003 - mhctools.cli.args - INFO - Building MHC binding prediction type for alleles ['HLA-A*02:01', 'HLA-A*03:01'] and epitope lengths [8, 9, 10, 11] - 2017-12-21 16:30:03,062 - mhctools.cli.script - INFO - + 2017-12-22 01:12:44,974 - mhctools.cli.args - INFO - Building MHC binding prediction type for alleles ['HLA-A*02:01', 'HLA-A*03:01'] and epitope lengths [8, 9, 10, 11] + 2017-12-22 01:12:48,868 - mhctools.mhcflurry - INFO - BindingPrediction(peptide='AARYSAFY', allele='HLA-A*03:01', affinity=5744.3443, percentile_rank=None, source_sequence_name=None, offset=0, prediction_method_name='mhcflurry') ... + [1192 rows x 8 columns] - Wrote: /tmp/subsequence_predictions.csv This will write a file giving predictions for all subsequences of the specified lengths: $ head -n 3 /tmp/subsequence_predictions.csv - source_sequence_name,offset,peptide,allele,affinity,percentile_rank,prediction_method_name,length - protein2,42,AARYSAFY,HLA-A*03:01,5744.344274398671,4.739962499999998,mhcflurry,8 - protein2,42,AARYSAFYN,HLA-A*03:01,10576.536440802967,8.399187499999996,mhcflurry,9 + ,source_sequence_name,offset,peptide,allele,affinity,percentile_rank,prediction_method_name,length + 0,protein2,42,AARYSAFY,HLA-A*03:01,5744.3442744,,mhcflurry,8 + 1,protein2,42,AARYSAFYN,HLA-A*03:01,10576.5364408,,mhcflurry,9 Python library tutorial ======================= + +Predicting +********** + The MHCflurry Python API exposes additional options and features beyond those supported by the commandline tools. This tutorial gives a basic overview of the most important functionality. See the API Documentation for further details. The "Class1AffinityPredictor" class is the primary user-facing -interface. - - - /Users/tim/miniconda3/envs/py3k/lib/python3.5/site-packages/matplotlib/__init__.py:913: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. - warnings.warn(self.msg_depr % (key, alt_key)) - - # coding: utf-8 - - # In[22]: - - import pandas - import numpy - import seaborn - import logging - from matplotlib import pyplot - - import mhcflurry - - - - # # Download data and models - - # In[2]: - - get_ipython().system('mhcflurry-downloads fetch') - - - # # Making predictions with `Class1AffinityPredictor` - - # In[3]: - - help(mhcflurry.Class1AffinityPredictor) - - - # In[4]: - - downloaded_predictor = mhcflurry.Class1AffinityPredictor.load() - - - # In[5]: - - downloaded_predictor.predict(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"]) - - - # In[6]: - - downloaded_predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"]) - - - # In[7]: - - downloaded_predictor.predict_to_dataframe(alleles=["HLA-A0201", "HLA-B*57:01"], peptides=["SIINFEKL", "SIINFEQL"]) - - - # In[8]: - - downloaded_predictor.predict_to_dataframe( - allele="HLA-A0201", - peptides=["SIINFEKL", "SIINFEQL"], - include_individual_model_predictions=True) - - - # In[9]: - - downloaded_predictor.predict_to_dataframe( - allele="HLA-A0201", - peptides=["SIINFEKL", "SIINFEQL", "TAAAALANGGGGGGGG"], - throw=False) # Without throw=False, you'll get a ValueError for invalid peptides or alleles - - - # # Instantiating a `Class1AffinityPredictor` from a saved model on disk - - # In[10]: - - models_dir = mhcflurry.downloads.get_path("models_class1", "models") - models_dir - - - # In[11]: - - # This will be the same predictor we instantiated above. We're just being explicit about what models to load. - downloaded_predictor = mhcflurry.Class1AffinityPredictor.load(models_dir) - downloaded_predictor.predict(["SIINFEKL", "SIQNPEKP", "SYNFPEPI"], allele="HLA-A0301") - - - # # Fit a model: first load some data - - # In[12]: - - # This is the data the downloaded models were trained on - data_path = mhcflurry.downloads.get_path("data_curated", "curated_training_data.csv.bz2") - data_path - - - # In[13]: - - data_df = pandas.read_csv(data_path) - data_df - - - # # Fit a model: Low level `Class1NeuralNetwork` interface - - # In[14]: - - # We'll use mostly the default hyperparameters here. Could also specify them as kwargs. - new_model = mhcflurry.Class1NeuralNetwork(layer_sizes=[16]) - new_model.hyperparameters - - - # In[16]: - - train_data = data_df.loc[ - (data_df.allele == "HLA-B*57:01") & - (data_df.peptide.str.len() >= 8) & - (data_df.peptide.str.len() <= 15) - ] - get_ipython().magic('time new_model.fit(train_data.peptide.values, train_data.measurement_value.values)') - - - # In[17]: - - new_model.predict(["SYNPEPII"]) +interface. Use the "load" static method to load a trained predictor +from disk. With no arguments this method will load the predictor +released with MHCflurry (see Downloading models). If you pass a path +to a models directory, then it will load that predictor instead. + >>> from mhcflurry import Class1AffinityPredictor + >>> predictor = Class1AffinityPredictor.load() + >>> predictor.supported_alleles[:10] + ['BoLA-6*13:01', 'Eqca-1*01:01', 'H-2-Db', 'H-2-Dd', 'H-2-Kb', 'H-2-Kd', 'H-2-Kk', 'H-2-Ld', 'HLA-A*01:01', 'HLA-A*02:01'] - # # Fit a model: high level `Class1AffinityPredictor` interface +With a predictor loaded we can now generate some binding predictions: - # In[18]: + >>> predictor.predict(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"]) + /Users/tim/miniconda3/envs/py2k/lib/python2.7/site-packages/h5py/__init__.py:34: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 80 + from ._conv import register_converters as _register_converters + /Users/tim/miniconda3/envs/py2k/lib/python2.7/site-packages/h5py/__init__.py:43: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 80 + from . import h5a, h5d, h5ds, h5f, h5fd, h5g, h5r, h5s, h5t, h5p, h5z + /Users/tim/miniconda3/envs/py2k/lib/python2.7/site-packages/h5py/_hl/group.py:21: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 80 + from .. import h5g, h5i, h5o, h5r, h5t, h5l, h5p + Using TensorFlow backend. + array([ 4899.04784343, 5685.25682682]) - affinity_predictor = mhcflurry.Class1AffinityPredictor() +Note: MHCflurry normalizes allele names using the mhcnames package. + Names like "HLA-A0201" or "A*02:01" will be normalized to + "HLA-A*02:01", so most naming conventions can be used with methods + such as "predict". - # This can be called any number of times, for example on different alleles, to build up the ensembles. - affinity_predictor.fit_allele_specific_predictors( - n_models=1, - architecture_hyperparameters={"layer_sizes": [16], "max_epochs": 10}, - peptides=train_data.peptide.values, - affinities=train_data.measurement_value.values, - allele="HLA-B*57:01", - ) +For more detailed results, we can use "predict_to_dataframe". + >>> predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"]) + allele peptide prediction prediction_low prediction_high \ + 0 HLA-A0201 SIINFEKL 4899.047843 2767.763654 7269.683643 + 1 HLA-A0201 SIINFEQL 5685.256827 3815.923563 7476.714466 - # In[19]: + prediction_percentile + 0 6.509787 + 1 7.436687 - affinity_predictor.predict(["SYNPEPII"], allele="HLA-B*57:01") +Instead of a single allele and multiple peptides, we may need +predictions for allele/peptide pairs. We can predict across pairs by +specifying the "alleles" argument instead of "allele". The list of +alleles must be the same length as the list of peptides (i.e. it is +predicting over pairs, *not* taking the cross product). + >>> predictor.predict(alleles=["HLA-A0201", "HLA-B*57:01"], peptides=["SIINFEKL", "SIINFEQL"]) + array([ 4899.04794216, 26704.22011499]) - # # Save and restore the fit model - # In[20]: +Training +******** - get_ipython().system('mkdir /tmp/saved-affinity-predictor') - affinity_predictor.save("/tmp/saved-affinity-predictor") - get_ipython().system('ls /tmp/saved-affinity-predictor') +Let’s fit our own MHCflurry predictor. First we need some training +data. If you haven’t already, run this in a shell to download the +MHCflurry training data: + $ mhcflurry-downloads fetch data_curated - # In[21]: +We can get the path to this data from Python using +"mhcflurry.downloads.get_path": + + >>> from mhcflurry.downloads import get_path + >>> data_path = get_path("data_curated", "curated_training_data.csv.bz2") + >>> data_path + '/Users/tim/Library/Application Support/mhcflurry/4/1.0.0/data_curated/curated_training_data.csv.bz2' + +Now let’s load it with pandas and filter to reasonably-sized peptides: + + >>> import pandas + >>> df = pandas.read_csv(data_path) + >>> df = df.loc[(df.peptide.str.len() >= 8) & (df.peptide.str.len() <= 15)] + >>> df.head(5) + allele peptide measurement_value measurement_type \ + 0 BoLA-1*21:01 AENDTLVVSV 7817.0 quantitative + 1 BoLA-1*21:01 NQFNGGCLLV 1086.0 quantitative + 2 BoLA-2*08:01 AAHCIHAEW 21.0 quantitative + 3 BoLA-2*08:01 AAKHMSNTY 1299.0 quantitative + 4 BoLA-2*08:01 DSYAYMRNGW 2.0 quantitative + + measurement_source original_allele + 0 Barlow - purified MHC/competitive/fluorescence BoLA-1*02101 + 1 Barlow - purified MHC/direct/fluorescence BoLA-1*02101 + 2 Barlow - purified MHC/direct/fluorescence BoLA-2*00801 + 3 Barlow - purified MHC/direct/fluorescence BoLA-2*00801 + 4 Barlow - purified MHC/direct/fluorescence BoLA-2*00801 + +We’ll make an untrained "Class1AffinityPredictor" and then call +"fit_allele_specific_predictors" to fit some models. + + >>> new_predictor = Class1AffinityPredictor() + >>> single_allele_train_data = df.loc[df.allele == "HLA-B*57:01"].sample(100) + >>> new_predictor.fit_allele_specific_predictors( + ... n_models=1, + ... architecture_hyperparameters={ + ... "layer_sizes": [16], + ... "max_epochs": 5, + ... "random_negative_constant": 5, + ... }, + ... peptides=single_allele_train_data.peptide.values, + ... affinities=single_allele_train_data.measurement_value.values, + ... allele="HLA-B*57:01") + Train on 112 samples, validate on 28 samples + Epoch 1/1 + + 112/112 [==============================] - 0s 3ms/step - loss: 0.3730 - val_loss: 0.3472 + Epoch 0 / 5: loss=0.373015. Min val loss (None) at epoch None + Train on 112 samples, validate on 28 samples + Epoch 1/1 + + 112/112 [==============================] - 0s 38us/step - loss: 0.3508 - val_loss: 0.3345 + Train on 112 samples, validate on 28 samples + Epoch 1/1 + + 112/112 [==============================] - 0s 37us/step - loss: 0.3375 - val_loss: 0.3218 + Train on 112 samples, validate on 28 samples + Epoch 1/1 + + 112/112 [==============================] - 0s 36us/step - loss: 0.3227 - val_loss: 0.3092 + Train on 112 samples, validate on 28 samples + Epoch 1/1 + + 112/112 [==============================] - 0s 37us/step - loss: 0.3104 - val_loss: 0.2970 + [<mhcflurry.class1_neural_network.Class1NeuralNetwork object at 0x11e28ad10>] + +The "fit_allele_specific_predictors" method can be called any number +of times on the same instance to build up ensembles of models across +alleles. The "architecture_hyperparameters" we specified are for +demonstration purposes; to fit real models you would usually train for +more epochs. + +Now we can generate predictions: + + >>> new_predictor.predict(["SYNPEPII"], allele="HLA-B*57:01") + array([ 610.30706541]) + +We can save our predictor to the specified directory on disk by +running: + + >>> new_predictor.save("/tmp/new-predictor") + +and restore it: + + >>> new_predictor2 = Class1AffinityPredictor.load("/tmp/new-predictor") + >>> new_predictor2.supported_alleles + ['HLA-B*57:01'] + + +Lower level interface +********************* - affinity_predictor2 = mhcflurry.Class1AffinityPredictor.load("/tmp/saved-affinity-predictor") - affinity_predictor2.predict(["SYNPEPII"], allele="HLA-B*57:01") +The high-level "Class1AffinityPredictor" delegates to low-level +"Class1NeuralNetwork" objects, each of which represents a single +neural network. The purpose of "Class1AffinityPredictor" is to +implement several important features: + +ensembles + More than one neural network can be used to generate each + prediction. The predictions returned to the user are the geometric + mean of the individual model predictions. This gives higher + accuracy in most situations + +multiple alleles + A "Class1NeuralNetwork" generates predictions for only a single + allele. The "Class1AffinityPredictor" maps alleles to the relevant + "Class1NeuralNetwork" instances + +serialization + Loading and saving predictors is implemented in + "Class1AffinityPredictor". + +Sometimes it’s easiest to work directly with "Class1NeuralNetwork". +Here is a simple example of doing so: + + >>> from mhcflurry import Class1NeuralNetwork + >>> network = Class1NeuralNetwork() + >>> network.fit( + ... single_allele_train_data.peptide.values, + ... single_allele_train_data.measurement_value.values, + ... verbose=0) + Epoch 0 / 500: loss=0.533378. Min val loss (None) at epoch None + Early stopping at epoch 124 / 500: loss=0.0115427. Min val loss (0.0719302743673) at epoch 113 + >>> network.predict(["SIINFEKLL"]) + array([ 23004.58985458]) Supported alleles and peptide lengths diff --git a/mhcflurry/predict_command.py b/mhcflurry/predict_command.py index d7fdc0a2c5708a722493aa89f9b1a24f30f52114..3eb9f16fe2b6fb7d40e37001ee306bcfd3b5fa11 100644 --- a/mhcflurry/predict_command.py +++ b/mhcflurry/predict_command.py @@ -8,12 +8,12 @@ additional column giving MHCflurry binding affinity predictions: $ mhcflurry-predict INPUT.csv --out RESULT.csv -The input CSV file is expected to contain columns 'allele' and 'peptide'. -The predictions are written to a column called 'mhcflurry_prediction'. -These default column names may be changed with the --allele-column, ---peptide-column, and --prediction-column options. +The input CSV file is expected to contain columns ``allele`` and ``peptide``. +The predictions are written to a column called ``mhcflurry_prediction``. +These default column names may be changed with the `--allele-column`, +`--peptide-column`, and `--prediction-column` options. -If --out is not specified, results are writtent to standard out. +If `--out` is not specified, results are written to standard out. You can also run on alleles and peptides specified on the commandline, in which case predictions are written for all combinations of alleles and