Commit 18d0bd99 authored by Tim O'Donnell, committed by GitHub
Merge pull request #145 from openvax/v1.3

pan allele prediction (MHCflurry 1.3.0)
parents 74b751e6 3cbdfd69
Showing 6228 additions and 84 deletions
@@ -4,9 +4,6 @@ python:
   - "2.7"
   - "3.6"
 before_install:
-  # Commands below copied from: http://conda.pydata.org/docs/travis.html
-  # We do this conditionally because it saves us some downloading if the
-  # version is the same.
   - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
       wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh;
     else
@@ -20,6 +17,7 @@ before_install:
   - conda update -q conda
   # Useful for debugging any issues with conda
   - conda info -a
+  - free -m
 addons:
   apt:
     packages:
@@ -29,23 +27,21 @@ addons:
 install:
   - >
     conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
-    numpy scipy nose pandas matplotlib mkl-service
+    numpy scipy nose pandas matplotlib mkl-service tensorflow pypandoc
   - source activate test-environment
-  - pip install tensorflow pypandoc pylint 'theano>=1.0.4'
+  - pip install nose-timer
   - pip install -r requirements.txt
   - pip install .
   - pip freeze
 env:
   global:
     - PYTHONHASHSEED=0
-    - MKL_THREADING_LAYER=GNU # for theano
-    - CUDA_VISIBLE_DEVICES="" # for tensorflow
-  matrix:
-    - KERAS_BACKEND=theano
     - KERAS_BACKEND=tensorflow
+    - KMP_SETTINGS=TRUE
+    - OMP_NUM_THREADS=1
 script:
   # download data and models, then run tests
-  - mhcflurry-downloads fetch
+  - mhcflurry-downloads fetch data_curated models_class1 models_class1_pan allele_sequences
   - mhcflurry-downloads info # just to test this command works
-  - nosetests test -sv
-  - ./lint.sh
+  - nosetests --with-timer -sv test
@@ -5,10 +5,14 @@
 prediction package with competitive accuracy and a fast and
 [documented](http://openvax.github.io/mhcflurry/) implementation.
-MHCflurry supports Class I peptide/MHC binding affinity prediction using
-ensembles of allele-specific models. It runs on Python 2.7 and 3.4+ using
-the [keras](https://keras.io) neural network library. It exposes [command-line](http://openvax.github.io/mhcflurry/commandline_tutorial.html)
-and [Python library](http://openvax.github.io/mhcflurry/python_tutorial.html) interfaces.
+MHCflurry implements class I peptide/MHC binding affinity prediction. By default
+it supports 112 MHC alleles using ensembles of allele-specific models.
+Pan-allele predictors supporting virtually any MHC allele of known sequence
+are available for testing (see below). MHCflurry runs on Python 2.7 and 3.4+ using the
+[keras](https://keras.io) neural network library.
+It exposes [command-line](http://openvax.github.io/mhcflurry/commandline_tutorial.html)
+and [Python library](http://openvax.github.io/mhcflurry/python_tutorial.html)
+interfaces.
 If you find MHCflurry useful in your research please cite:
@@ -43,12 +47,41 @@ Wrote: /tmp/predictions.csv
 See the [documentation](http://openvax.github.io/mhcflurry/) for more details.
-## MHCflurry model variants and mass spec
-The default MHCflurry models are trained
-on affinity measurements. Mass spec datasets are incorporated only in
-the model selection step. We also release experimental predictors whose training data directly
-includes mass spec. To download these predictors, run:
+### Pan-allele models (experimental)
+We are testing new models that support prediction for any MHC I allele of known
+sequence (as opposed to the 112 alleles supported by the allele-specific
+predictors). These models are trained on both affinity measurements and mass spec.
+To try the pan-allele models, first download them:
+```
+$ mhcflurry-downloads fetch models_class1_pan
+```
+then set this environment variable to use them by default:
+```
+$ export MHCFLURRY_DEFAULT_CLASS1_MODELS="$(mhcflurry-downloads path models_class1_pan)/models.with_mass_spec"
+```
+You can now generate predictions for about 14,000 MHC I alleles. For example:
+```
+$ mhcflurry-predict --alleles HLA-A*02:04 --peptides SIINFEKL
+```
+If you use these models please let us know how it goes.
+## Other allele-specific models
+The default MHCflurry models are trained on affinity measurements, one allele
+per model (i.e. allele-specific). Mass spec datasets are incorporated in the
+model selection step.
+We also release experimental allele-specific predictors whose training data
+directly includes mass spec. To download these predictors, run:
 ```
 $ mhcflurry-downloads fetch models_class1_trained_with_mass_spec
@@ -66,4 +99,4 @@ these predictors, run:
 ```
 $ mhcflurry-downloads fetch models_class1_selected_no_mass_spec
 export MHCFLURRY_DEFAULT_CLASS1_MODELS="$(mhcflurry-downloads path models_class1_selected_no_mass_spec)/models"
 ```
\ No newline at end of file
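The README instructions above use the command line; a minimal Python sketch of the same pan-allele workflow follows. It assumes the models_class1_pan download has already been fetched; the models path shown is hypothetical (use `mhcflurry-downloads path models_class1_pan` to find the real one), and the calls reflect the MHCflurry Python API as documented in the Python tutorial.

```python
# Minimal sketch (not part of this commit): load the downloaded pan-allele
# models explicitly and predict binding affinities from Python.
from mhcflurry import Class1AffinityPredictor

# Hypothetical path; locate the real one with `mhcflurry-downloads path models_class1_pan`.
models_dir = "/path/to/downloads/models_class1_pan/models.with_mass_spec"
predictor = Class1AffinityPredictor.load(models_dir)

# Predicted affinities are reported in nM; lower values mean stronger predicted binding.
predictions = predictor.predict_to_dataframe(
    peptides=["SIINFEKL", "SIINFEKD"],
    allele="HLA-A*02:04")
print(predictions)
```

Alternatively, exporting MHCFLURRY_DEFAULT_CLASS1_MODELS as shown in the README makes `Class1AffinityPredictor.load()` with no arguments pick up the same models.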
@@ -151,3 +151,33 @@ This will write a file giving predictions for all subsequences of the specified
.. command-output::
    head -n 3 /tmp/subsequence_predictions.csv
Environment variables
-------------------------------------------------
MHCflurry behavior can be modified using these environment variables:
``MHCFLURRY_DEFAULT_CLASS1_MODELS``
    Path to models directory. If you call ``Class1AffinityPredictor.load()``
    with no arguments, the models specified in this environment variable will be
    used. If this environment variable is undefined, the downloaded models for
    the current MHCflurry release are used.

``MHCFLURRY_OPTIMIZATION_LEVEL``
    The pan-allele models can be somewhat slow. As an optimization, when this
    variable is greater than 0 (default is 1), we "stitch" the pan-allele models in
    the ensemble into one large tensorflow graph. In our experiments
    it gives about a 30% speed improvement. It has no effect on allele-specific
    models. Set this variable to 0 to disable this behavior. This may be helpful
    if you are running out of memory using the pan-allele models.

``MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE``
    For large prediction tasks, it can be helpful to increase the prediction batch
    size, which is set by this environment variable (default is 4096). This
    affects both allele-specific and pan-allele predictors. It can have large
    effects on performance. Alternatively, if you are running out of memory,
    you can try decreasing the batch size.
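To make these variables concrete, here is an illustrative Python sketch, not a recommendation: the values are examples only, and the variables are set before mhcflurry is imported so they are visible regardless of when the library reads them.

```python
# Illustrative only: example values for the environment variables documented above.
import os

os.environ["MHCFLURRY_DEFAULT_CLASS1_MODELS"] = \
    "/path/to/downloads/models_class1_pan/models.with_mass_spec"  # hypothetical path
os.environ["MHCFLURRY_OPTIMIZATION_LEVEL"] = "0"              # disable graph stitching (e.g. to save memory)
os.environ["MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE"] = "65536"  # larger batches for big prediction jobs

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()  # picks up MHCFLURRY_DEFAULT_CLASS1_MODELS
print(predictor.predict(peptides=["SIINFEKL"], allele="HLA-A*02:01"))
```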
#!/bin/bash
#
# Create allele sequences (sometimes referred to as pseudosequences) by
# performing a global alignment across all MHC amino acid sequences we can get
# our hands on.
#
# Requires: clustalo, wget
#
set -e
set -x
DOWNLOAD_NAME=allele_sequences
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
export PYTHONUNBUFFERED=1
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
date
pip freeze
git status
which clustalo
clustalo --version
cd $SCRATCH_DIR/$DOWNLOAD_NAME
cp $SCRIPT_DIR/make_allele_sequences.py .
cp $SCRIPT_DIR/filter_sequences.py .
cp $SCRIPT_DIR/class1_pseudosequences.csv .
cp $SCRIPT_ABSOLUTE_PATH .
# Generate sequences
# Training data is used to decide which additional positions to include in the
# allele sequences to differentiate alleles that have identical traditional
# pseudosequences but have associated training data
TRAINING_DATA="$(mhcflurry-downloads path data_curated)/curated_training_data.with_mass_spec.csv.bz2"
bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.with_mass_spec.csv.bz2" \
| cut -f 1 -d , | uniq | sort | uniq | grep -v allele > training_data.alleles.txt
# Human
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/A_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/B_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/C_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/E_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/F_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/G_prot.fasta
# Mouse
wget -q https://www.uniprot.org/uniprot/P01899.fasta # H-2 Db
wget -q https://www.uniprot.org/uniprot/P01900.fasta # H-2 Dd
wget -q https://www.uniprot.org/uniprot/P14427.fasta # H-2 Dp
wget -q https://www.uniprot.org/uniprot/P14426.fasta # H-2 Dk
wget -q https://www.uniprot.org/uniprot/Q31145.fasta # H-2 Dq
wget -q https://www.uniprot.org/uniprot/P01901.fasta # H-2 Kb
wget -q https://www.uniprot.org/uniprot/P01902.fasta # H-2 Kd
wget -q https://www.uniprot.org/uniprot/P04223.fasta # H-2 Kk
wget -q https://www.uniprot.org/uniprot/P14428.fasta # H-2 Kq
wget -q https://www.uniprot.org/uniprot/P01897.fasta # H-2 Ld
wget -q https://www.uniprot.org/uniprot/Q31151.fasta # H-2 Lq
# Various
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/mhc/MHC_prot.fasta
python filter_sequences.py *.fasta --out class1.fasta
time clustalo -i class1.fasta -o class1.aligned.fasta
time python make_allele_sequences.py \
class1.aligned.fasta \
--recapitulate-sequences class1_pseudosequences.csv \
--differentiate-alleles training_data.alleles.txt \
--out-csv allele_sequences.csv
# Cleanup
gzip -f class1.fasta
gzip -f class1.aligned.fasta
rm *.fasta
cp $SCRIPT_ABSOLUTE_PATH .
bzip2 LOG.txt
tar -cjf "../${DOWNLOAD_NAME}.tar.bz2" *
echo "Created archive: $SCRATCH_DIR/$DOWNLOAD_NAME.tar.bz2"
"""
Filter and combine class I sequence fastas.
"""
from __future__ import print_function
import sys
import argparse
import mhcnames
import Bio.SeqIO # pylint: disable=import-error
def normalize(s, disallowed=["MIC", "HFE"]):
    # Skip loci we do not want here (e.g. MIC, HFE).
    if any(item in s for item in disallowed):
        return None
    try:
        return mhcnames.normalize_allele_name(s)
    except:
        # Fallback: drop trailing ":"-separated fields one at a time and retry
        # until the name parses or nothing is left.
        while s:
            s = ":".join(s.split(":")[:-1])
            try:
                return mhcnames.normalize_allele_name(s)
            except:
                pass
    return None
parser = argparse.ArgumentParser(usage=__doc__)
parser.add_argument(
"fastas",
nargs="+",
help="Unaligned fastas")
parser.add_argument(
"--out",
required=True,
help="Fasta output")
class_ii_names = {
"DRA",
"DRB",
"DPA",
"DPB",
"DQA",
"DQB",
"DMA",
"DMB",
"DOA",
"DOB",
}
def run():
args = parser.parse_args(sys.argv[1:])
print(args)
records = []
total = 0
seen = set()
for fasta in args.fastas:
reader = Bio.SeqIO.parse(fasta, "fasta")
for record in reader:
total += 1
name = record.description.split()[1]
normalized = normalize(name)
if not normalized and "," in record.description:
# Try parsing uniprot-style sequence description
if "MOUSE MHC class I L-q alpha-chain" in record.description:
# Special case.
name = "H2-Lq"
else:
name = (
record.description.split()[1].replace("-", "") +
"-" +
record.description.split(",")[-1].split()[0].replace("-",""))
normalized = normalize(name)
if not normalized:
print("Couldn't parse: ", name)
continue
if normalized in seen:
continue
if any(n in name for n in class_ii_names):
print("Dropping", name)
continue
seen.add(normalized)
record.description = normalized + " " + record.description
records.append(record)
with open(args.out, "w") as fd:
Bio.SeqIO.write(records, fd, "fasta")
print("Wrote %d / %d sequences: %s" % (len(records), total, args.out))
if __name__ == '__main__':
run()
"""
Generate allele sequences for pan-class I models.
Additional dependency: biopython
"""
from __future__ import print_function
import sys
import argparse
import numpy
import pandas
import mhcnames
import Bio.SeqIO # pylint: disable=import-error
def normalize_simple(s):
return mhcnames.normalize_allele_name(s)
def normalize_complex(s, disallowed=["MIC", "HFE"]):
if any(item in s for item in disallowed):
return None
try:
return normalize_simple(s)
except:
while s:
s = ":".join(s.split(":")[:-1])
try:
return normalize_simple(s)
except:
pass
return None
parser = argparse.ArgumentParser(usage=__doc__)
parser.add_argument(
"aligned_fasta",
help="Aligned sequences")
parser.add_argument(
"--recapitulate-sequences",
required=True,
help="CSV giving sequences to recapitulate")
parser.add_argument(
"--differentiate-alleles",
help="File listing alleles to differentiate using additional positions")
parser.add_argument(
"--out-csv",
help="Result file")
def run():
args = parser.parse_args(sys.argv[1:])
print(args)
allele_to_sequence = {}
reader = Bio.SeqIO.parse(args.aligned_fasta, "fasta")
for record in reader:
name = record.description.split()[1]
print(record.name, record.description)
allele_to_sequence[name] = str(record.seq)
print("Read %d aligned sequences" % len(allele_to_sequence))
allele_sequences = pandas.Series(allele_to_sequence).to_frame()
allele_sequences.columns = ['aligned']
allele_sequences['aligned'] = allele_sequences['aligned'].str.replace(
"-", "X")
allele_sequences['normalized_allele'] = allele_sequences.index.map(normalize_complex)
allele_sequences = allele_sequences.set_index("normalized_allele", drop=True)
selected_positions = []
recapitulate_df = pandas.read_csv(args.recapitulate_sequences)
recapitulate_df["normalized_allele"] = recapitulate_df.allele.map(
normalize_complex)
recapitulate_df = (
recapitulate_df
.dropna()
.drop_duplicates("normalized_allele")
.set_index("normalized_allele", drop=True))
allele_sequences["recapitulate_target"] = recapitulate_df.iloc[:,-1]
print("Sequences in recapitulate CSV that are not in aligned fasta:")
print(recapitulate_df.index[
~recapitulate_df.index.isin(allele_sequences.index)
].tolist())
allele_sequences_with_target = allele_sequences.loc[
~allele_sequences.recapitulate_target.isnull()
]
position_identities = []
target_length = int(
allele_sequences_with_target.recapitulate_target.str.len().max())
for i in range(target_length):
series_i = allele_sequences_with_target.recapitulate_target.str.get(i)
row = []
full_length_sequence_length = int(
allele_sequences_with_target.aligned.str.len().max())
for k in range(full_length_sequence_length):
series_k = allele_sequences_with_target.aligned.str.get(k)
row.append((series_i == series_k).mean())
position_identities.append(row)
position_identities = pandas.DataFrame(numpy.array(position_identities))
selected_positions = position_identities.idxmax(1).tolist()
fractions = position_identities.max(1)
print("Selected positions: ", *selected_positions)
print("Lowest concordance fraction: %0.5f" % fractions.min())
assert fractions.min() > 0.99
allele_sequences["recapitulated"] = allele_sequences.aligned.map(
lambda s: "".join(s[p] for p in selected_positions))
allele_sequences_with_target = allele_sequences.loc[
~allele_sequences.recapitulate_target.isnull()
]
agreement = (
allele_sequences_with_target.recapitulated ==
allele_sequences_with_target.recapitulate_target).mean()
print("Overall agreement: %0.5f" % agreement)
assert agreement > 0.9
# Add additional positions
if args.differentiate_alleles:
differentiate_alleles = pandas.read_csv(
args.differentiate_alleles).iloc[:,0].values
print(
"Read %d alleles to differentiate:" % len(differentiate_alleles),
differentiate_alleles)
allele_sequences_to_differentiate = allele_sequences.loc[
allele_sequences.index.isin(differentiate_alleles)
]
print(allele_sequences_to_differentiate.shape)
additional_positions = []
for (_, sub_df) in allele_sequences_to_differentiate.groupby("recapitulated"):
if sub_df.aligned.nunique() > 1:
differing = pandas.DataFrame(
dict([(pos, chars) for (pos, chars) in
enumerate(zip(*sub_df.aligned.values)) if
any(c != chars[0] for c in chars) and "X" not in chars])).T
print(sub_df)
print(differing)
print()
additional_positions.extend(differing.index)
additional_positions = sorted(set(additional_positions))
print(
"Selected %d additional positions: " % len(additional_positions),
additional_positions)
extended_selected_positions = sorted(
set(selected_positions).union(set(additional_positions)))
print(
"Extended selected positions (%d)" % len(extended_selected_positions),
*extended_selected_positions)
allele_sequences["sequence"] = allele_sequences.aligned.map(
lambda s: "".join(s[p] for p in extended_selected_positions))
allele_sequences[["sequence"]].to_csv(args.out_csv, index=True)
print("Wrote: %s" % args.out_csv)
if __name__ == '__main__':
run()
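To make the position-differentiation step above easier to follow, here is a toy sketch with invented aligned fragments (hypothetical data, pandas only): for alleles whose recapitulated pseudosequences would otherwise be identical, it keeps aligned positions where the full sequences differ and neither has a gap (encoded as "X"), mirroring the groupby logic in the script.

```python
# Toy illustration of the "differentiate alleles" idea; the sequences are made up.
import pandas as pd

aligned = pd.Series({
    "ALLELE-A": "AYKSQAXT",  # hypothetical aligned fragment
    "ALLELE-B": "AYKSQTXT",  # differs from ALLELE-A only at position 5
})

# One row per allele, one column per aligned position.
chars = pd.DataFrame([list(s) for s in aligned], index=aligned.index)

# Keep positions that vary within the group and contain no gap character.
differing = [
    pos for pos in chars.columns
    if chars[pos].nunique() > 1 and not (chars[pos] == "X").any()
]
print(differing)  # -> [5]
```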
@@ -46,8 +46,6 @@ time python curate.py \
     "$(mhcflurry-downloads path data_published)/bdata.20130222.mhci.public.1.txt" \
     --data-systemhc-atlas \
     "$(mhcflurry-downloads path data_systemhcatlas)/data.csv.bz2" \
-    --data-abelin-mass-spec \
-    "$(mhcflurry-downloads path data_published)/abelin2017.hits.csv.bz2" \
     --include-iedb-mass-spec \
     --out-csv curated_training_data.with_mass_spec.csv
@@ -34,11 +34,6 @@ parser.add_argument(
     action="append",
     default=[],
     help="Path to systemhc-atlas-style mass-spec data")
-parser.add_argument(
-    "--data-abelin-mass-spec",
-    action="append",
-    default=[],
-    help="Path to Abelin Immunity 2017 mass-spec hits")
 parser.add_argument(
     "--include-iedb-mass-spec",
     action="store_true",
@@ -85,8 +80,8 @@ def load_data_kim2014(filename):
     df["peptide"] = df.sequence
     df["allele"] = df.mhc.map(normalize_allele_name)
     print("Dropping un-parseable alleles: %s" % ", ".join(
-        df.ix[df.allele == "UNKNOWN"]["mhc"].unique()))
-    df = df.ix[df.allele != "UNKNOWN"]
+        df.loc[df.allele == "UNKNOWN"]["mhc"].unique()))
+    df = df.loc[df.allele != "UNKNOWN"]
     print("Loaded kim2014 data: %s" % str(df.shape))
     return df
@@ -105,7 +100,7 @@ def load_data_systemhc_atlas(filename, min_probability=0.99):
     df["allele"] = df.top_allele.map(normalize_allele_name)
     print("Dropping un-parseable alleles: %s" % ", ".join(
-        str(x) for x in df.ix[df.allele == "UNKNOWN"]["top_allele"].unique()))
+        str(x) for x in df.loc[df.allele == "UNKNOWN"]["top_allele"].unique()))
     df = df.loc[df.allele != "UNKNOWN"]
     print("Systemhc atlas data now: %s" % str(df.shape))
@@ -120,65 +115,44 @@ def load_data_systemhc_atlas(filename, min_probability=0.99):
     return df
-def load_data_abelin_mass_spec(filename):
-    df = pandas.read_csv(filename)
-    print("Loaded Abelin mass-spec data: %s" % str(df.shape))
-    df["measurement_source"] = "abelin-mass-spec"
-    df["measurement_value"] = QUALITATIVE_TO_AFFINITY["Positive"]
-    df["measurement_inequality"] = "<"
-    df["measurement_type"] = "qualitative"
-    df["original_allele"] = df.allele
-    df["allele"] = df.original_allele.map(normalize_allele_name)
-    print("Dropping un-parseable alleles: %s" % ", ".join(
-        str(x) for x in df.ix[df.allele == "UNKNOWN"]["allele"].unique()))
-    df = df.loc[df.allele != "UNKNOWN"]
-    print("Abelin mass-spec data now: %s" % str(df.shape))
-    print("Removing duplicates")
-    df = df.drop_duplicates(["allele", "peptide"])
-    print("Abelin mass-spec data now: %s" % str(df.shape))
-    return df
 def load_data_iedb(iedb_csv, include_qualitative=True, include_mass_spec=False):
     iedb_df = pandas.read_csv(iedb_csv, skiprows=1, low_memory=False)
     print("Loaded iedb data: %s" % str(iedb_df.shape))
     print("Selecting only class I")
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         iedb_df["MHC allele class"].str.strip().str.upper() == "I"
     ]
     print("New shape: %s" % str(iedb_df.shape))
     print("Dropping known unusable alleles")
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         ~iedb_df["Allele Name"].isin(EXCLUDE_IEDB_ALLELES)
     ]
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         (~iedb_df["Allele Name"].str.contains("mutant")) &
         (~iedb_df["Allele Name"].str.contains("CD1"))
     ]
     iedb_df["allele"] = iedb_df["Allele Name"].map(normalize_allele_name)
     print("Dropping un-parseable alleles: %s" % ", ".join(
-        iedb_df.ix[iedb_df.allele == "UNKNOWN"]["Allele Name"].unique()))
-    iedb_df = iedb_df.ix[iedb_df.allele != "UNKNOWN"]
+        iedb_df.loc[iedb_df.allele == "UNKNOWN"]["Allele Name"].unique()))
+    iedb_df = iedb_df.loc[iedb_df.allele != "UNKNOWN"]
     print("IEDB measurements per allele:\n%s" % iedb_df.allele.value_counts())
-    quantitative = iedb_df.ix[iedb_df["Units"] == "nM"].copy()
+    quantitative = iedb_df.loc[iedb_df["Units"] == "nM"].copy()
     quantitative["measurement_type"] = "quantitative"
-    quantitative["measurement_inequality"] = "="
+    quantitative["measurement_inequality"] = quantitative[
+        "Measurement Inequality"
+    ].fillna("=").map(lambda s: {">=": ">", "<=": "<"}.get(s, s))
     print("Quantitative measurements: %d" % len(quantitative))
-    qualitative = iedb_df.ix[iedb_df["Units"] != "nM"].copy()
+    qualitative = iedb_df.loc[iedb_df["Units"].isnull()].copy()
     qualitative["measurement_type"] = "qualitative"
     print("Qualitative measurements: %d" % len(qualitative))
     if not include_mass_spec:
-        qualitative = qualitative.ix[
+        qualitative = qualitative.loc[
             (~qualitative["Method/Technique"].str.contains("mass spec"))
         ].copy()
@@ -200,7 +174,7 @@ def load_data_iedb(iedb_csv, include_qualitative=True, include_mass_spec=False):
     print("Subselecting to valid peptides. Starting with: %d" % len(iedb_df))
     iedb_df["Description"] = iedb_df.Description.str.strip()
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         iedb_df.Description.str.match("^[ACDEFGHIKLMNPQRSTVWY]+$")
     ]
     print("Now: %d" % len(iedb_df))
@@ -248,7 +222,7 @@ def run():
     iedb_df = dfs[0]
     iedb_df["allele_peptide"] = iedb_df.allele + "_" + iedb_df.peptide
     print("Dropping kim2014 data present in IEDB.")
-    df = df.ix[
+    df = df.loc[
         ~df.allele_peptide.isin(iedb_df.allele_peptide)
     ]
     print("Kim2014 data now: %s" % str(df.shape))
@@ -256,9 +230,6 @@ def run():
     for filename in args.data_systemhc_atlas:
         df = load_data_systemhc_atlas(filename)
         dfs.append(df)
-    for filename in args.data_abelin_mass_spec:
-        df = load_data_abelin_mass_spec(filename)
-        dfs.append(df)
     df = pandas.concat(dfs, ignore_index=True)
     print("Combined df: %s" % (str(df.shape)))
@@ -22,11 +22,17 @@ date
 cd $SCRATCH_DIR/$DOWNLOAD_NAME
-wget --quiet http://www.iedb.org/doc/mhc_ligand_full.zip
+wget -q http://www.iedb.org/doc/mhc_ligand_full.zip
+wget -q http://www.iedb.org/downloader.php?file_name=doc/tcell_full_v3.zip -O tcell_full_v3.zip
 unzip mhc_ligand_full.zip
 rm mhc_ligand_full.zip
 bzip2 mhc_ligand_full.csv
+unzip tcell_full_v3.zip
+rm tcell_full_v3.zip
+bzip2 tcell_full_v3.csv
 cp $SCRIPT_ABSOLUTE_PATH .
 bzip2 LOG.txt
 tar -cjf "../${DOWNLOAD_NAME}.tar.bz2" *
 #!/bin/bash
 #
-# Download some published MHC I ligand data
+# Download published non-IEDB MHC I ligand data. Most data has made its way into
+# IEDB but not all. Here we gather up the rest.
 #
 #
 set -e
@@ -9,6 +10,7 @@ set -x
 DOWNLOAD_NAME=data_published
 SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
 SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
+SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
 mkdir -p "$SCRATCH_DIR"
 rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
@@ -26,13 +28,10 @@ git status
 cd $SCRATCH_DIR/$DOWNLOAD_NAME
-# Download kim2014 data
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2009.mhci.public.1.txt
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.20130222.mhci.public.1.txt
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2013.mhci.public.blind.1.txt
-# Download abelin et al 2017 data
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/abelin2017.hits.csv.bz2
+# Kim et al 2014 [PMID 25017736]
+wget -q https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2009.mhci.public.1.txt
+wget -q https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.20130222.mhci.public.1.txt
+wget -q https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2013.mhci.public.blind.1.txt
 cp $SCRIPT_ABSOLUTE_PATH .
 bzip2 LOG.txt
@@ -26,9 +26,22 @@ git status
 cd $SCRATCH_DIR/$DOWNLOAD_NAME
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/systemhc.20171121.combined.csv.bz2
-mv systemhc.20171121.combined.csv.bz2 data.csv.bz2
+wget -q https://systemhcatlas.org/Builds_for_download/180409_master_final.tgz
+mkdir extracted
+tar -xvzf *.tgz -C extracted
+wc -l extracted/*/*.csv
+# Write header line
+cat extracted/*/*.csv | head -n 1 > data.csv
+# Write concatenated data
+grep -v SysteMHC_ID extracted/*/*.csv >> data.csv
+# Cleanup
+rm -rf extracted *.tgz
+ls -lh data.csv
+wc -l data.csv
+bzip2 data.csv
 cp $SCRIPT_ABSOLUTE_PATH .
 bzip2 LOG.txt
 # SysteMHC database dump
-This is a data dump of the [SysteMHC Atlas](https://systemhcatlas.org/) provided
-by personal communication. It is distributed under the ODC Open Database License.
+This is a database export of the [SysteMHC Atlas](https://systemhcatlas.org/)
+downloaded from [here](https://systemhcatlas.org/Builds_for_download/). It is
+distributed under the ODC Open Database License.
 To generate this download run:
 ```
 ./GENERATE.sh
 ```
\ No newline at end of file
#!/bin/bash
#
# Model select pan-allele MHCflurry Class I models and calibrate percentile ranks.
#
# Uses an HPC cluster (Mount Sinai chimera cluster, which uses lsf job
# scheduler). This would need to be modified for other sites.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
echo "Invocation: $0 $@"
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
cp $SCRIPT_ABSOLUTE_PATH .
GPUS=$(nvidia-smi -L 2> /dev/null | wc -l) || GPUS=0
echo "Detected GPUS: $GPUS"
PROCESSORS=$(getconf _NPROCESSORS_ONLN)
echo "Detected processors: $PROCESSORS"
if [ "$GPUS" -eq "0" ]; then
NUM_JOBS=${NUM_JOBS-1}
else
NUM_JOBS=${NUM_JOBS-$GPUS}
fi
echo "Num local jobs for model selection: $NUM_JOBS"
UNSELECTED_PATH="$(mhcflurry-downloads path models_class1_pan_unselected)"
for kind in with_mass_spec no_mass_spec
do
# Model selection is always done locally. It's fast enough that it
# doesn't make sense to put it on the cluster.
MODELS_DIR="$UNSELECTED_PATH/models.${kind}"
time mhcflurry-class1-select-pan-allele-models \
--data "$MODELS_DIR/train_data.csv.bz2" \
--models-dir "$MODELS_DIR" \
--out-models-dir models.${kind} \
--min-models 8 \
--max-models 32 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1
# Percentile rank calibration is run on the cluster.
# For now we calibrate percentile ranks only for alleles for which there
# is training data. Calibrating all alleles would be too slow.
# This could be improved though.
time mhcflurry-calibrate-percentile-ranks \
--models-dir models.${kind} \
--match-amino-acid-distribution-data "$MODELS_DIR/train_data.csv.bz2" \
--motif-summary \
--num-peptides-per-length 1000000 \
--allele $(bzcat "$MODELS_DIR/train_data.csv.bz2" | cut -f 1 -d , | grep -v allele | uniq | sort | uniq) \
--verbosity 1 \
--worker-log-dir "$SCRATCH_DIR/$DOWNLOAD_NAME" \
--prediction-batch-size 524288 \
--cluster-parallelism \
--cluster-submit-command bsub \
--cluster-results-workdir ~/mhcflurry-scratch \
--cluster-script-prefix-path $SCRIPT_DIR/cluster_submit_script_header.mssm_hpc.lsf
done
bzip2 LOG.txt
for i in $(ls LOG-worker.*.txt) ; do bzip2 $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
#!/bin/bash
# Model select pan-allele MHCflurry Class I models and calibrate percentile ranks.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
cp $SCRIPT_ABSOLUTE_PATH .
GPUS=$(nvidia-smi -L 2> /dev/null | wc -l) || GPUS=0
echo "Detected GPUS: $GPUS"
PROCESSORS=$(getconf _NPROCESSORS_ONLN)
echo "Detected processors: $PROCESSORS"
if [ "$GPUS" -eq "0" ]; then
NUM_JOBS=${NUM_JOBS-1}
else
NUM_JOBS=${NUM_JOBS-$GPUS}
fi
echo "Num jobs: $NUM_JOBS"
export PYTHONUNBUFFERED=1
UNSELECTED_PATH="$(mhcflurry-downloads path models_class1_pan_unselected)"
for kind in with_mass_spec no_mass_spec
do
MODELS_DIR="$UNSELECTED_PATH/models.${kind}"
time mhcflurry-class1-select-pan-allele-models \
--data "$MODELS_DIR/train_data.csv.bz2" \
--models-dir "$MODELS_DIR" \
--out-models-dir models.${kind} \
--min-models 8 \
--max-models 32 \
--num-jobs 0 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1
# For now we calibrate percentile ranks only for alleles for which there
# is training data. Calibrating all alleles would be too slow.
# This could be improved though.
time mhcflurry-calibrate-percentile-ranks \
--models-dir models.${kind} \
--match-amino-acid-distribution-data "$MODELS_DIR/train_data.csv.bz2" \
--motif-summary \
--num-peptides-per-length 1000000 \
--allele $(bzcat "$MODELS_DIR/train_data.csv.bz2" | cut -f 1 -d , | grep -v allele | uniq | sort | uniq) \
--verbosity 1 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1
done
bzip2 LOG.txt
for i in $(ls LOG-worker.*.txt) ; do bzip2 $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
../models_class1_pan_unselected/cluster_submit_script_header.mssm_hpc.lsf
\ No newline at end of file
#!/bin/bash
#
# Train pan-allele MHCflurry Class I models. Supports re-starting a failed run.
#
# Uses an HPC cluster (Mount Sinai chimera cluster, which uses lsf job
# scheduler). This would need to be modified for other sites.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan_unselected
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
if [ "$1" != "continue-incomplete" ]
then
echo "Fresh run"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
else
echo "Continuing incomplete run"
fi
# Send stdout and stderr to a logfile included with the archive.
LOG="$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.$(date +%s).txt"
exec > >(tee -ia "$LOG")
exec 2> >(tee -ia "$LOG" >&2)
# Log some environment info
echo "Invocation: $0 $@"
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
if [ "$1" != "continue-incomplete" ]
then
cp $SCRIPT_DIR/generate_hyperparameters.py .
python generate_hyperparameters.py > hyperparameters.yaml
fi
for kind in with_mass_spec no_mass_spec
do
EXTRA_TRAIN_ARGS=""
if [ "$1" == "continue-incomplete" ] && [ -d "models.${kind}" ]
then
echo "Will continue existing run: $kind"
EXTRA_TRAIN_ARGS="--continue-incomplete"
fi
mhcflurry-class1-train-pan-allele-models \
--data "$(mhcflurry-downloads path data_curated)/curated_training_data.${kind}.csv.bz2" \
--allele-sequences "$(mhcflurry-downloads path allele_sequences)/allele_sequences.csv" \
--pretrain-data "$(mhcflurry-downloads path random_peptide_predictions)/predictions.csv.bz2" \
--held-out-measurements-per-allele-fraction-and-max 0.25 100 \
--ensemble-size 4 \
--hyperparameters hyperparameters.yaml \
--out-models-dir $(pwd)/models.${kind} \
--worker-log-dir "$SCRATCH_DIR/$DOWNLOAD_NAME" \
--verbosity 0 \
--cluster-parallelism \
--cluster-submit-command bsub \
--cluster-results-workdir ~/mhcflurry-scratch \
--cluster-script-prefix-path $SCRIPT_DIR/cluster_submit_script_header.mssm_hpc.lsf \
$EXTRA_TRAIN_ARGS
done
cp $SCRIPT_ABSOLUTE_PATH .
bzip2 -f "$LOG"
for i in $(ls LOG-worker.*.txt) ; do bzip2 -f $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
# Split into <2GB chunks for GitHub
PARTS="${RESULT}.part."
# Check for pre-existing part files and rename them.
for i in $(ls "${PARTS}"* )
do
DEST="${i}.OLD.$(date +%s)"
echo "WARNING: already exists: $i . Moving to $DEST"
mv $i $DEST
done
split -b 2000M "$RESULT" "$PARTS"
echo "Split into parts:"
ls -lh "${PARTS}"*
#!/bin/bash
#
# Train pan-allele MHCflurry Class I models. Supports re-starting a failed run.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan_unselected
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
if [ "$1" != "continue-incomplete" ]
then
echo "Fresh run"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
else
echo "Continuing incomplete run"
fi
# Send stdout and stderr to a logfile included with the archive.
LOG="$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.$(date +%s).txt"
exec > >(tee -ia "$LOG")
exec 2> >(tee -ia "$LOG" >&2)
# Log some environment info
echo "Invocation: $0 $@"
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
cp $SCRIPT_DIR/generate_hyperparameters.py .
python generate_hyperparameters.py > hyperparameters.yaml
GPUS=$(nvidia-smi -L 2> /dev/null | wc -l) || GPUS=0
echo "Detected GPUS: $GPUS"
PROCESSORS=$(getconf _NPROCESSORS_ONLN)
echo "Detected processors: $PROCESSORS"
if [ "$GPUS" -eq "0" ]; then
NUM_JOBS=${NUM_JOBS-1}
else
NUM_JOBS=${NUM_JOBS-$GPUS}
fi
echo "Num jobs: $NUM_JOBS"
export PYTHONUNBUFFERED=1
if [ "$1" != "continue-incomplete" ]
then
cp $SCRIPT_DIR/generate_hyperparameters.py .
python generate_hyperparameters.py > hyperparameters.yaml
fi
for kind in with_mass_spec no_mass_spec
do
EXTRA_TRAIN_ARGS=""
if [ "$1" == "continue-incomplete" ] && [ -d "models.${kind}" ]
then
echo "Will continue existing run: $kind"
EXTRA_TRAIN_ARGS="--continue-incomplete"
fi
mhcflurry-class1-train-pan-allele-models \
--data "$(mhcflurry-downloads path data_curated)/curated_training_data.${kind}.csv.bz2" \
--allele-sequences "$(mhcflurry-downloads path allele_sequences)/allele_sequences.csv" \
--pretrain-data "$(mhcflurry-downloads path random_peptide_predictions)/predictions.csv.bz2" \
--held-out-measurements-per-allele-fraction-and-max 0.25 100 \
--ensemble-size 4 \
--hyperparameters hyperparameters.yaml \
--out-models-dir models.${kind} \
--worker-log-dir "$SCRATCH_DIR/$DOWNLOAD_NAME" \
--verbosity 0 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1 \
$EXTRA_TRAIN_ARGS
done
cp $SCRIPT_ABSOLUTE_PATH .
bzip2 -f "$LOG"
for i in $(ls LOG-worker.*.txt) ; do bzip2 -f $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
# Split into <2GB chunks for GitHub
PARTS="${RESULT}.part."
# Check for pre-existing part files and rename them.
for i in $(ls "${PARTS}"* )
do
DEST="${i}.OLD.$(date +%s)"
echo "WARNING: already exists: $i . Moving to $DEST"
mv $i $DEST
done
split -b 2000M "$RESULT" "$PARTS"
echo "Split into parts:"
ls -lh "${PARTS}"*
# Class I pan-allele models (ensemble)
This download contains trained MHCflurry class I pan-allele models prior to model selection.
To generate this download run:
```
./GENERATE.sh
```
#!/bin/bash
#BSUB -J MHCf-{work_item_num} # Job name
#BSUB -P acc_nkcancer # allocation account or Unix group
#BSUB -q gpu # queue
#BSUB -R rusage[ngpus_excl_p=1] # 1 exclusive GPU
#BSUB -R span[hosts=1] # one node
#BSUB -n 1 # number of compute cores
#BSUB -W 46:00 # walltime in HH:MM
#BSUB -R rusage[mem=30000] # mb memory requested
#BSUB -o {work_dir}/%J.stdout # output log (%J : JobID)
#BSUB -eo {work_dir}/%J.stderr # error log
#BSUB -L /bin/bash # Initialize the execution environment
#
export TMPDIR=/local/JOBS/mhcflurry-{work_item_num}
export PATH=$HOME/.conda/envs/py36b/bin/:$PATH
export PYTHONUNBUFFERED=1
export KMP_SETTINGS=1
set -e
set -x
free -m
module add cuda/10.0.130 cudnn/7.1.1
module list
python -c 'import tensorflow as tf ; print("GPU AVAILABLE" if tf.test.is_gpu_available() else "GPU NOT AVAILABLE")'
env
cd {work_dir}