Class I allele-specific models (single)
This download contains trained MHC Class I allele-specific MHCflurry models. The training data used is in the data_combined_iedb_kim2014 MHCflurry download. We first select network hyperparameters for each allele individually using cross validation over the models enumerated in models.py. The best hyperparameter settings are selected by averaging AUC (at 500 nM), F1, and Kendall's tau across the cross-validation folds. We then train the production models on the full training set using the selected hyperparameters.
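If you want to work with the same training data locally, it can be retrieved with the mhcflurry-downloads tool. The exact subcommands shown below (fetch and path) are a sketch and may differ between mhcflurry versions:

# Fetch the training data used by this download.
mhcflurry-downloads fetch data_combined_iedb_kim2014

# Print the local directory where the download was unpacked.
mhcflurry-downloads path data_combined_iedb_kim2014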
The training script supports multi-node parallel execution using the dask.distributed library. To enable this, pass the IP and port of the dask scheduler to the training script with the '--dask-scheduler' option. GENERATE.sh forwards all of its arguments to the training script, so you can pass this option directly to GENERATE.sh.
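If you do not have a cluster available, a minimal single-machine setup is enough for testing. This is a sketch assuming the dask distributed package is installed and its dask-scheduler and dask-worker commands are on your PATH:

# Start a scheduler in one terminal (listens on port 8786 by default).
dask-scheduler

# Start one or more workers in other terminals, pointing them at the scheduler.
dask-worker 127.0.0.1:8786

# Then pass the scheduler address through GENERATE.sh:
./GENERATE.sh --dask-scheduler 127.0.0.1:8786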
We run dask.distributed on Google Container Engine using Kubernetes as described here.
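As a rough sketch, once the cluster is up you can check on and scale the dask workers with standard kubectl commands; the deployment names used here (daskd-scheduler, daskd-worker) are assumptions based on our config:

# Confirm the scheduler and workers are running.
kubectl get pods

# Scale the number of dask workers up or down (deployment name is an assumption).
kubectl scale deployment daskd-worker --replicas=100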
To generate this download we run:
# If you are running dask.distributed using our Kubernetes config, you can use the DASK_IP one-liner below.
# Otherwise, set it to the IP of the dask scheduler.
DASK_IP=$(kubectl get service | grep daskd-scheduler | tr -s ' ' | cut -d ' ' -f 3)
./GENERATE.sh \
--joblib-num-jobs 100 \
--joblib-pre-dispatch all \
--cv-folds-per-task 10 \
--dask-scheduler $DASK_IP:8786
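Once the resulting models are published as an MHCflurry download, they can be retrieved the same way as the training data. The download name below is an assumption based on the name of this directory:

# Fetch the published trained models produced by this pipeline.
mhcflurry-downloads fetch models_class1_allele_specific_single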