Commit 18d0bd99 authored by Tim O'Donnell, committed by GitHub
Merge pull request #145 from openvax/v1.3

pan allele prediction (MHCflurry 1.3.0)
parents 74b751e6 3cbdfd69
Showing 6228 additions and 84 deletions
@@ -4,9 +4,6 @@ python:
   - "2.7"
   - "3.6"
 before_install:
-  # Commands below copied from: http://conda.pydata.org/docs/travis.html
-  # We do this conditionally because it saves us some downloading if the
-  # version is the same.
   - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
       wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh;
     else
@@ -20,6 +17,7 @@ before_install:
   - conda update -q conda
   # Useful for debugging any issues with conda
   - conda info -a
+  - free -m
 addons:
   apt:
     packages:
@@ -29,23 +27,21 @@ addons:
 install:
   - >
     conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
-    numpy scipy nose pandas matplotlib mkl-service
+    numpy scipy nose pandas matplotlib mkl-service tensorflow pypandoc
   - source activate test-environment
-  - pip install tensorflow pypandoc pylint 'theano>=1.0.4'
+  - pip install nose-timer
   - pip install -r requirements.txt
   - pip install .
   - pip freeze
 env:
   global:
     - PYTHONHASHSEED=0
-    - MKL_THREADING_LAYER=GNU # for theano
-    - CUDA_VISIBLE_DEVICES="" # for tensorflow
-  matrix:
-    - KERAS_BACKEND=theano
     - KERAS_BACKEND=tensorflow
+    - KMP_SETTINGS=TRUE
+    - OMP_NUM_THREADS=1
 script:
   # download data and models, then run tests
-  - mhcflurry-downloads fetch
+  - mhcflurry-downloads fetch data_curated models_class1 models_class1_pan allele_sequences
   - mhcflurry-downloads info # just to test this command works
-  - nosetests test -sv
-  - ./lint.sh
+  - nosetests --with-timer -sv test
@@ -5,10 +5,14 @@
 prediction package with competitive accuracy and a fast and
 [documented](http://openvax.github.io/mhcflurry/) implementation.
-MHCflurry supports Class I peptide/MHC binding affinity prediction using
-ensembles of allele-specific models. It runs on Python 2.7 and 3.4+ using
-the [keras](https://keras.io) neural network library. It exposes [command-line](http://openvax.github.io/mhcflurry/commandline_tutorial.html)
-and [Python library](http://openvax.github.io/mhcflurry/python_tutorial.html) interfaces.
+MHCflurry implements class I peptide/MHC binding affinity prediction. By default
+it supports 112 MHC alleles using ensembles of allele-specific models.
+Pan-allele predictors supporting virtually any MHC allele of known sequence
+are available for testing (see below). MHCflurry runs on Python 2.7 and 3.4+ using the
+[keras](https://keras.io) neural network library.
+It exposes [command-line](http://openvax.github.io/mhcflurry/commandline_tutorial.html)
+and [Python library](http://openvax.github.io/mhcflurry/python_tutorial.html)
+interfaces.
 If you find MHCflurry useful in your research please cite:
@@ -43,12 +47,41 @@ Wrote: /tmp/predictions.csv
 See the [documentation](http://openvax.github.io/mhcflurry/) for more details.
-## MHCflurry model variants and mass spec
-The default MHCflurry models are trained
-on affinity measurements. Mass spec datasets are incorporated only in
-the model selection step. We also release experimental predictors whose training data directly
-includes mass spec. To download these predictors, run:
+### Pan-allele models (experimental)
+We are testing new models that support prediction for any MHC I allele of known
+sequence (as opposed to the 112 alleles supported by the allele-specific
+predictors). These models are trained on both affinity measurements and mass spec.
+To try the pan-allele models, first download them:
+```
+$ mhcflurry-downloads fetch models_class1_pan
+```
+then set this environment variable to use them by default:
+```
+$ export MHCFLURRY_DEFAULT_CLASS1_MODELS="$(mhcflurry-downloads path models_class1_pan)/models.with_mass_spec"
+```
+You can now generate predictions for about 14,000 MHC I alleles. For example:
+```
+$ mhcflurry-predict --alleles HLA-A*02:04 --peptides SIINFEKL
+```
+If you use these models please let us know how it goes.
+## Other allele-specific models
+The default MHCflurry models are trained on affinity measurements, one allele
+per model (i.e. allele-specific). Mass spec datasets are incorporated in the
+model selection step.
+We also release experimental allele-specific predictors whose training data
+directly includes mass spec. To download these predictors, run:
 ```
 $ mhcflurry-downloads fetch models_class1_trained_with_mass_spec
@@ -66,4 +99,4 @@ these predictors, run:
 ```
 $ mhcflurry-downloads fetch models_class1_selected_no_mass_spec
 export MHCFLURRY_DEFAULT_CLASS1_MODELS="$(mhcflurry-downloads path models_class1_selected_no_mass_spec)/models"
 ```
\ No newline at end of file
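The README instructions above use the command line; a minimal Python sketch of the same pan-allele workflow follows. It assumes the models_class1_pan download has already been fetched; the models path shown is hypothetical (use `mhcflurry-downloads path models_class1_pan` to find the real one), and the calls reflect the MHCflurry Python API as documented in the Python tutorial.

```python
# Minimal sketch (not part of this commit): load the downloaded pan-allele
# models explicitly and predict binding affinities from Python.
from mhcflurry import Class1AffinityPredictor

# Hypothetical path; locate the real one with `mhcflurry-downloads path models_class1_pan`.
models_dir = "/path/to/downloads/models_class1_pan/models.with_mass_spec"
predictor = Class1AffinityPredictor.load(models_dir)

# Predicted affinities are reported in nM; lower values mean stronger predicted binding.
predictions = predictor.predict_to_dataframe(
    peptides=["SIINFEKL", "SIINFEKD"],
    allele="HLA-A*02:04")
print(predictions)
```

Alternatively, exporting MHCFLURRY_DEFAULT_CLASS1_MODELS as shown in the README makes `Class1AffinityPredictor.load()` with no arguments pick up the same models.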
@@ -151,3 +151,33 @@ This will write a file giving predictions for all subsequences of the specified
.. command-output::
    head -n 3 /tmp/subsequence_predictions.csv
Environment variables
-------------------------------------------------
MHCflurry behavior can be modified using these environment variables:
``MHCFLURRY_DEFAULT_CLASS1_MODELS``
    Path to models directory. If you call ``Class1AffinityPredictor.load()``
    with no arguments, the models specified in this environment variable will be
    used. If this environment variable is undefined, the downloaded models for
    the current MHCflurry release are used.

``MHCFLURRY_OPTIMIZATION_LEVEL``
    The pan-allele models can be somewhat slow. As an optimization, when this
    variable is greater than 0 (default is 1), we "stitch" the pan-allele models in
    the ensemble into one large tensorflow graph. In our experiments
    it gives about a 30% speed improvement. It has no effect on allele-specific
    models. Set this variable to 0 to disable this behavior. This may be helpful
    if you are running out of memory using the pan-allele models.

``MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE``
    For large prediction tasks, it can be helpful to increase the prediction batch
    size, which is set by this environment variable (default is 4096). This
    affects both allele-specific and pan-allele predictors. It can have large
    effects on performance. Alternatively, if you are running out of memory,
    you can try decreasing the batch size.
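To make these variables concrete, here is an illustrative Python sketch, not a recommendation: the values are examples only, and the variables are set before mhcflurry is imported so they are visible regardless of when the library reads them.

```python
# Illustrative only: example values for the environment variables documented above.
import os

os.environ["MHCFLURRY_DEFAULT_CLASS1_MODELS"] = \
    "/path/to/downloads/models_class1_pan/models.with_mass_spec"  # hypothetical path
os.environ["MHCFLURRY_OPTIMIZATION_LEVEL"] = "0"              # disable graph stitching (e.g. to save memory)
os.environ["MHCFLURRY_DEFAULT_PREDICT_BATCH_SIZE"] = "65536"  # larger batches for big prediction jobs

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()  # picks up MHCFLURRY_DEFAULT_CLASS1_MODELS
print(predictor.predict(peptides=["SIINFEKL"], allele="HLA-A*02:01"))
```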
#!/bin/bash
#
# Create allele sequences (sometimes referred to as pseudosequences) by
# performing a global alignment across all MHC amino acid sequences we can get
# our hands on.
#
# Requires: clustalo, wget
#
set -e
set -x
DOWNLOAD_NAME=allele_sequences
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
export PYTHONUNBUFFERED=1
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
date
pip freeze
git status
which clustalo
clustalo --version
cd $SCRATCH_DIR/$DOWNLOAD_NAME
cp $SCRIPT_DIR/make_allele_sequences.py .
cp $SCRIPT_DIR/filter_sequences.py .
cp $SCRIPT_DIR/class1_pseudosequences.csv .
cp $SCRIPT_ABSOLUTE_PATH .
# Generate sequences
# Training data is used to decide which additional positions to include in the
# allele sequences to differentiate alleles that have identical traditional
# pseudosequences but have associated training data
TRAINING_DATA="$(mhcflurry-downloads path data_curated)/curated_training_data.with_mass_spec.csv.bz2"
bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.with_mass_spec.csv.bz2" \
| cut -f 1 -d , | uniq | sort | uniq | grep -v allele > training_data.alleles.txt
# Human
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/A_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/B_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/C_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/E_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/F_prot.fasta
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/G_prot.fasta
# Mouse
wget -q https://www.uniprot.org/uniprot/P01899.fasta # H-2 Db
wget -q https://www.uniprot.org/uniprot/P01900.fasta # H-2 Dd
wget -q https://www.uniprot.org/uniprot/P14427.fasta # H-2 Dp
wget -q https://www.uniprot.org/uniprot/P14426.fasta # H-2 Dk
wget -q https://www.uniprot.org/uniprot/Q31145.fasta # H-2 Dq
wget -q https://www.uniprot.org/uniprot/P01901.fasta # H-2 Kb
wget -q https://www.uniprot.org/uniprot/P01902.fasta # H-2 Kd
wget -q https://www.uniprot.org/uniprot/P04223.fasta # H-2 Kk
wget -q https://www.uniprot.org/uniprot/P14428.fasta # H-2 Kq
wget -q https://www.uniprot.org/uniprot/P01897.fasta # H-2 Ld
wget -q https://www.uniprot.org/uniprot/Q31151.fasta # H-2 Lq
# Various
wget -q ftp://ftp.ebi.ac.uk/pub/databases/ipd/mhc/MHC_prot.fasta
python filter_sequences.py *.fasta --out class1.fasta
time clustalo -i class1.fasta -o class1.aligned.fasta
time python make_allele_sequences.py \
class1.aligned.fasta \
--recapitulate-sequences class1_pseudosequences.csv \
--differentiate-alleles training_data.alleles.txt \
--out-csv allele_sequences.csv
# Cleanup
gzip -f class1.fasta
gzip -f class1.aligned.fasta
rm *.fasta
cp $SCRIPT_ABSOLUTE_PATH .
bzip2 LOG.txt
tar -cjf "../${DOWNLOAD_NAME}.tar.bz2" *
echo "Created archive: $SCRATCH_DIR/$DOWNLOAD_NAME.tar.bz2"
"""
Filter and combine class I sequence fastas.
"""
from __future__ import print_function
import sys
import argparse
import mhcnames
import Bio.SeqIO # pylint: disable=import-error
def normalize(s, disallowed=["MIC", "HFE"]):
    # Skip loci we do not want here (e.g. MIC, HFE).
    if any(item in s for item in disallowed):
        return None
    try:
        return mhcnames.normalize_allele_name(s)
    except:
        # Fallback: drop trailing ":"-separated fields one at a time and retry
        # until the name parses or nothing is left.
        while s:
            s = ":".join(s.split(":")[:-1])
            try:
                return mhcnames.normalize_allele_name(s)
            except:
                pass
    return None
parser = argparse.ArgumentParser(usage=__doc__)
parser.add_argument(
"fastas",
nargs="+",
help="Unaligned fastas")
parser.add_argument(
"--out",
required=True,
help="Fasta output")
class_ii_names = {
"DRA",
"DRB",
"DPA",
"DPB",
"DQA",
"DQB",
"DMA",
"DMB",
"DOA",
"DOB",
}
def run():
args = parser.parse_args(sys.argv[1:])
print(args)
records = []
total = 0
seen = set()
for fasta in args.fastas:
reader = Bio.SeqIO.parse(fasta, "fasta")
for record in reader:
total += 1
name = record.description.split()[1]
normalized = normalize(name)
if not normalized and "," in record.description:
# Try parsing uniprot-style sequence description
if "MOUSE MHC class I L-q alpha-chain" in record.description:
# Special case.
name = "H2-Lq"
else:
name = (
record.description.split()[1].replace("-", "") +
"-" +
record.description.split(",")[-1].split()[0].replace("-",""))
normalized = normalize(name)
if not normalized:
print("Couldn't parse: ", name)
continue
if normalized in seen:
continue
if any(n in name for n in class_ii_names):
print("Dropping", name)
continue
seen.add(normalized)
record.description = normalized + " " + record.description
records.append(record)
with open(args.out, "w") as fd:
Bio.SeqIO.write(records, fd, "fasta")
print("Wrote %d / %d sequences: %s" % (len(records), total, args.out))
if __name__ == '__main__':
run()
"""
Generate allele sequences for pan-class I models.
Additional dependency: biopython
"""
from __future__ import print_function
import sys
import argparse
import numpy
import pandas
import mhcnames
import Bio.SeqIO # pylint: disable=import-error
def normalize_simple(s):
return mhcnames.normalize_allele_name(s)
def normalize_complex(s, disallowed=["MIC", "HFE"]):
if any(item in s for item in disallowed):
return None
try:
return normalize_simple(s)
except:
while s:
s = ":".join(s.split(":")[:-1])
try:
return normalize_simple(s)
except:
pass
return None
parser = argparse.ArgumentParser(usage=__doc__)
parser.add_argument(
"aligned_fasta",
help="Aligned sequences")
parser.add_argument(
"--recapitulate-sequences",
required=True,
help="CSV giving sequences to recapitulate")
parser.add_argument(
"--differentiate-alleles",
help="File listing alleles to differentiate using additional positions")
parser.add_argument(
"--out-csv",
help="Result file")
def run():
args = parser.parse_args(sys.argv[1:])
print(args)
allele_to_sequence = {}
reader = Bio.SeqIO.parse(args.aligned_fasta, "fasta")
for record in reader:
name = record.description.split()[1]
print(record.name, record.description)
allele_to_sequence[name] = str(record.seq)
print("Read %d aligned sequences" % len(allele_to_sequence))
allele_sequences = pandas.Series(allele_to_sequence).to_frame()
allele_sequences.columns = ['aligned']
allele_sequences['aligned'] = allele_sequences['aligned'].str.replace(
"-", "X")
allele_sequences['normalized_allele'] = allele_sequences.index.map(normalize_complex)
allele_sequences = allele_sequences.set_index("normalized_allele", drop=True)
selected_positions = []
recapitulate_df = pandas.read_csv(args.recapitulate_sequences)
recapitulate_df["normalized_allele"] = recapitulate_df.allele.map(
normalize_complex)
recapitulate_df = (
recapitulate_df
.dropna()
.drop_duplicates("normalized_allele")
.set_index("normalized_allele", drop=True))
allele_sequences["recapitulate_target"] = recapitulate_df.iloc[:,-1]
print("Sequences in recapitulate CSV that are not in aligned fasta:")
print(recapitulate_df.index[
~recapitulate_df.index.isin(allele_sequences.index)
].tolist())
allele_sequences_with_target = allele_sequences.loc[
~allele_sequences.recapitulate_target.isnull()
]
position_identities = []
target_length = int(
allele_sequences_with_target.recapitulate_target.str.len().max())
for i in range(target_length):
series_i = allele_sequences_with_target.recapitulate_target.str.get(i)
row = []
full_length_sequence_length = int(
allele_sequences_with_target.aligned.str.len().max())
for k in range(full_length_sequence_length):
series_k = allele_sequences_with_target.aligned.str.get(k)
row.append((series_i == series_k).mean())
position_identities.append(row)
position_identities = pandas.DataFrame(numpy.array(position_identities))
selected_positions = position_identities.idxmax(1).tolist()
fractions = position_identities.max(1)
print("Selected positions: ", *selected_positions)
print("Lowest concordance fraction: %0.5f" % fractions.min())
assert fractions.min() > 0.99
allele_sequences["recapitulated"] = allele_sequences.aligned.map(
lambda s: "".join(s[p] for p in selected_positions))
allele_sequences_with_target = allele_sequences.loc[
~allele_sequences.recapitulate_target.isnull()
]
agreement = (
allele_sequences_with_target.recapitulated ==
allele_sequences_with_target.recapitulate_target).mean()
print("Overall agreement: %0.5f" % agreement)
assert agreement > 0.9
# Add additional positions
if args.differentiate_alleles:
differentiate_alleles = pandas.read_csv(
args.differentiate_alleles).iloc[:,0].values
print(
"Read %d alleles to differentiate:" % len(differentiate_alleles),
differentiate_alleles)
allele_sequences_to_differentiate = allele_sequences.loc[
allele_sequences.index.isin(differentiate_alleles)
]
print(allele_sequences_to_differentiate.shape)
additional_positions = []
for (_, sub_df) in allele_sequences_to_differentiate.groupby("recapitulated"):
if sub_df.aligned.nunique() > 1:
differing = pandas.DataFrame(
dict([(pos, chars) for (pos, chars) in
enumerate(zip(*sub_df.aligned.values)) if
any(c != chars[0] for c in chars) and "X" not in chars])).T
print(sub_df)
print(differing)
print()
additional_positions.extend(differing.index)
additional_positions = sorted(set(additional_positions))
print(
"Selected %d additional positions: " % len(additional_positions),
additional_positions)
extended_selected_positions = sorted(
set(selected_positions).union(set(additional_positions)))
print(
"Extended selected positions (%d)" % len(extended_selected_positions),
*extended_selected_positions)
allele_sequences["sequence"] = allele_sequences.aligned.map(
lambda s: "".join(s[p] for p in extended_selected_positions))
allele_sequences[["sequence"]].to_csv(args.out_csv, index=True)
print("Wrote: %s" % args.out_csv)
if __name__ == '__main__':
run()
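To make the position-differentiation step above easier to follow, here is a toy sketch with invented aligned fragments (hypothetical data, pandas only): for alleles whose recapitulated pseudosequences would otherwise be identical, it keeps aligned positions where the full sequences differ and neither has a gap (encoded as "X"), mirroring the groupby logic in the script.

```python
# Toy illustration of the "differentiate alleles" idea; the sequences are made up.
import pandas as pd

aligned = pd.Series({
    "ALLELE-A": "AYKSQAXT",  # hypothetical aligned fragment
    "ALLELE-B": "AYKSQTXT",  # differs from ALLELE-A only at position 5
})

# One row per allele, one column per aligned position.
chars = pd.DataFrame([list(s) for s in aligned], index=aligned.index)

# Keep positions that vary within the group and contain no gap character.
differing = [
    pos for pos in chars.columns
    if chars[pos].nunique() > 1 and not (chars[pos] == "X").any()
]
print(differing)  # -> [5]
```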
@@ -46,8 +46,6 @@ time python curate.py \
     "$(mhcflurry-downloads path data_published)/bdata.20130222.mhci.public.1.txt" \
     --data-systemhc-atlas \
     "$(mhcflurry-downloads path data_systemhcatlas)/data.csv.bz2" \
-    --data-abelin-mass-spec \
-    "$(mhcflurry-downloads path data_published)/abelin2017.hits.csv.bz2" \
     --include-iedb-mass-spec \
     --out-csv curated_training_data.with_mass_spec.csv
@@ -34,11 +34,6 @@ parser.add_argument(
     action="append",
     default=[],
     help="Path to systemhc-atlas-style mass-spec data")
-parser.add_argument(
-    "--data-abelin-mass-spec",
-    action="append",
-    default=[],
-    help="Path to Abelin Immunity 2017 mass-spec hits")
 parser.add_argument(
     "--include-iedb-mass-spec",
     action="store_true",
@@ -85,8 +80,8 @@ def load_data_kim2014(filename):
     df["peptide"] = df.sequence
     df["allele"] = df.mhc.map(normalize_allele_name)
     print("Dropping un-parseable alleles: %s" % ", ".join(
-        df.ix[df.allele == "UNKNOWN"]["mhc"].unique()))
-    df = df.ix[df.allele != "UNKNOWN"]
+        df.loc[df.allele == "UNKNOWN"]["mhc"].unique()))
+    df = df.loc[df.allele != "UNKNOWN"]
     print("Loaded kim2014 data: %s" % str(df.shape))
     return df
@@ -105,7 +100,7 @@ def load_data_systemhc_atlas(filename, min_probability=0.99):
     df["allele"] = df.top_allele.map(normalize_allele_name)
     print("Dropping un-parseable alleles: %s" % ", ".join(
-        str(x) for x in df.ix[df.allele == "UNKNOWN"]["top_allele"].unique()))
+        str(x) for x in df.loc[df.allele == "UNKNOWN"]["top_allele"].unique()))
     df = df.loc[df.allele != "UNKNOWN"]
     print("Systemhc atlas data now: %s" % str(df.shape))
@@ -120,65 +115,44 @@ def load_data_systemhc_atlas(filename, min_probability=0.99):
     return df
-def load_data_abelin_mass_spec(filename):
-    df = pandas.read_csv(filename)
-    print("Loaded Abelin mass-spec data: %s" % str(df.shape))
-    df["measurement_source"] = "abelin-mass-spec"
-    df["measurement_value"] = QUALITATIVE_TO_AFFINITY["Positive"]
-    df["measurement_inequality"] = "<"
-    df["measurement_type"] = "qualitative"
-    df["original_allele"] = df.allele
-    df["allele"] = df.original_allele.map(normalize_allele_name)
-    print("Dropping un-parseable alleles: %s" % ", ".join(
-        str(x) for x in df.ix[df.allele == "UNKNOWN"]["allele"].unique()))
-    df = df.loc[df.allele != "UNKNOWN"]
-    print("Abelin mass-spec data now: %s" % str(df.shape))
-    print("Removing duplicates")
-    df = df.drop_duplicates(["allele", "peptide"])
-    print("Abelin mass-spec data now: %s" % str(df.shape))
-    return df
 def load_data_iedb(iedb_csv, include_qualitative=True, include_mass_spec=False):
     iedb_df = pandas.read_csv(iedb_csv, skiprows=1, low_memory=False)
     print("Loaded iedb data: %s" % str(iedb_df.shape))
     print("Selecting only class I")
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         iedb_df["MHC allele class"].str.strip().str.upper() == "I"
     ]
     print("New shape: %s" % str(iedb_df.shape))
     print("Dropping known unusable alleles")
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         ~iedb_df["Allele Name"].isin(EXCLUDE_IEDB_ALLELES)
     ]
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         (~iedb_df["Allele Name"].str.contains("mutant")) &
         (~iedb_df["Allele Name"].str.contains("CD1"))
     ]
     iedb_df["allele"] = iedb_df["Allele Name"].map(normalize_allele_name)
     print("Dropping un-parseable alleles: %s" % ", ".join(
-        iedb_df.ix[iedb_df.allele == "UNKNOWN"]["Allele Name"].unique()))
-    iedb_df = iedb_df.ix[iedb_df.allele != "UNKNOWN"]
+        iedb_df.loc[iedb_df.allele == "UNKNOWN"]["Allele Name"].unique()))
+    iedb_df = iedb_df.loc[iedb_df.allele != "UNKNOWN"]
     print("IEDB measurements per allele:\n%s" % iedb_df.allele.value_counts())
-    quantitative = iedb_df.ix[iedb_df["Units"] == "nM"].copy()
+    quantitative = iedb_df.loc[iedb_df["Units"] == "nM"].copy()
     quantitative["measurement_type"] = "quantitative"
-    quantitative["measurement_inequality"] = "="
+    quantitative["measurement_inequality"] = quantitative[
+        "Measurement Inequality"
+    ].fillna("=").map(lambda s: {">=": ">", "<=": "<"}.get(s, s))
     print("Quantitative measurements: %d" % len(quantitative))
-    qualitative = iedb_df.ix[iedb_df["Units"] != "nM"].copy()
+    qualitative = iedb_df.loc[iedb_df["Units"].isnull()].copy()
     qualitative["measurement_type"] = "qualitative"
     print("Qualitative measurements: %d" % len(qualitative))
     if not include_mass_spec:
-        qualitative = qualitative.ix[
+        qualitative = qualitative.loc[
             (~qualitative["Method/Technique"].str.contains("mass spec"))
         ].copy()
@@ -200,7 +174,7 @@ def load_data_iedb(iedb_csv, include_qualitative=True, include_mass_spec=False):
     print("Subselecting to valid peptides. Starting with: %d" % len(iedb_df))
     iedb_df["Description"] = iedb_df.Description.str.strip()
-    iedb_df = iedb_df.ix[
+    iedb_df = iedb_df.loc[
         iedb_df.Description.str.match("^[ACDEFGHIKLMNPQRSTVWY]+$")
     ]
     print("Now: %d" % len(iedb_df))
@@ -248,7 +222,7 @@ def run():
     iedb_df = dfs[0]
     iedb_df["allele_peptide"] = iedb_df.allele + "_" + iedb_df.peptide
     print("Dropping kim2014 data present in IEDB.")
-    df = df.ix[
+    df = df.loc[
         ~df.allele_peptide.isin(iedb_df.allele_peptide)
     ]
     print("Kim2014 data now: %s" % str(df.shape))
@@ -256,9 +230,6 @@ def run():
     for filename in args.data_systemhc_atlas:
         df = load_data_systemhc_atlas(filename)
         dfs.append(df)
-    for filename in args.data_abelin_mass_spec:
-        df = load_data_abelin_mass_spec(filename)
-        dfs.append(df)
     df = pandas.concat(dfs, ignore_index=True)
     print("Combined df: %s" % (str(df.shape)))
@@ -22,11 +22,17 @@ date
 cd $SCRATCH_DIR/$DOWNLOAD_NAME
-wget --quiet http://www.iedb.org/doc/mhc_ligand_full.zip
+wget -q http://www.iedb.org/doc/mhc_ligand_full.zip
+wget -q http://www.iedb.org/downloader.php?file_name=doc/tcell_full_v3.zip -O tcell_full_v3.zip
 unzip mhc_ligand_full.zip
 rm mhc_ligand_full.zip
 bzip2 mhc_ligand_full.csv
+unzip tcell_full_v3.zip
+rm tcell_full_v3.zip
+bzip2 tcell_full_v3.csv
 cp $SCRIPT_ABSOLUTE_PATH .
 bzip2 LOG.txt
 tar -cjf "../${DOWNLOAD_NAME}.tar.bz2" *
 #!/bin/bash
 #
-# Download some published MHC I ligand data
+# Download published non-IEDB MHC I ligand data. Most data has made its way into
+# IEDB but not all. Here we gather up the rest.
 #
 #
 set -e
@@ -9,6 +10,7 @@ set -x
 DOWNLOAD_NAME=data_published
 SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
 SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
+SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
 mkdir -p "$SCRATCH_DIR"
 rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
@@ -26,13 +28,10 @@ git status
 cd $SCRATCH_DIR/$DOWNLOAD_NAME
-# Download kim2014 data
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2009.mhci.public.1.txt
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.20130222.mhci.public.1.txt
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2013.mhci.public.blind.1.txt
-# Download abelin et al 2017 data
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/abelin2017.hits.csv.bz2
+# Kim et al 2014 [PMID 25017736]
+wget -q https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2009.mhci.public.1.txt
+wget -q https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.20130222.mhci.public.1.txt
+wget -q https://github.com/openvax/mhcflurry/releases/download/pre-1.1/bdata.2013.mhci.public.blind.1.txt
 cp $SCRIPT_ABSOLUTE_PATH .
 bzip2 LOG.txt
@@ -26,9 +26,22 @@ git status
 cd $SCRATCH_DIR/$DOWNLOAD_NAME
-wget --quiet https://github.com/openvax/mhcflurry/releases/download/pre-1.1/systemhc.20171121.combined.csv.bz2
-mv systemhc.20171121.combined.csv.bz2 data.csv.bz2
+wget -q https://systemhcatlas.org/Builds_for_download/180409_master_final.tgz
+mkdir extracted
+tar -xvzf *.tgz -C extracted
+wc -l extracted/*/*.csv
+# Write header line
+cat extracted/*/*.csv | head -n 1 > data.csv
+# Write concatenated data
+grep -v SysteMHC_ID extracted/*/*.csv >> data.csv
+# Cleanup
+rm -rf extracted *.tgz
+ls -lh data.csv
+wc -l data.csv
+bzip2 data.csv
 cp $SCRIPT_ABSOLUTE_PATH .
 bzip2 LOG.txt
 # SysteMHC database dump
-This is a data dump of the [SysteMHC Atlas](https://systemhcatlas.org/) provided
-by personal communication. It is distributed under the ODC Open Database License.
+This is a database export of the [SysteMHC Atlas](https://systemhcatlas.org/)
+downloaded from [here](https://systemhcatlas.org/Builds_for_download/). It is
+distributed under the ODC Open Database License.
 To generate this download run:
 ```
 ./GENERATE.sh
 ```
\ No newline at end of file
#!/bin/bash
#
# Model select pan-allele MHCflurry Class I models and calibrate percentile ranks.
#
# Uses an HPC cluster (Mount Sinai chimera cluster, which uses lsf job
# scheduler). This would need to be modified for other sites.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
echo "Invocation: $0 $@"
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
cp $SCRIPT_ABSOLUTE_PATH .
GPUS=$(nvidia-smi -L 2> /dev/null | wc -l) || GPUS=0
echo "Detected GPUS: $GPUS"
PROCESSORS=$(getconf _NPROCESSORS_ONLN)
echo "Detected processors: $PROCESSORS"
if [ "$GPUS" -eq "0" ]; then
NUM_JOBS=${NUM_JOBS-1}
else
NUM_JOBS=${NUM_JOBS-$GPUS}
fi
echo "Num local jobs for model selection: $NUM_JOBS"
UNSELECTED_PATH="$(mhcflurry-downloads path models_class1_pan_unselected)"
for kind in with_mass_spec no_mass_spec
do
# Model selection is always done locally. It's fast enough that it
# doesn't make sense to put it on the cluster.
MODELS_DIR="$UNSELECTED_PATH/models.${kind}"
time mhcflurry-class1-select-pan-allele-models \
--data "$MODELS_DIR/train_data.csv.bz2" \
--models-dir "$MODELS_DIR" \
--out-models-dir models.${kind} \
--min-models 8 \
--max-models 32 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1
# Percentile rank calibration is run on the cluster.
# For now we calibrate percentile ranks only for alleles for which there
# is training data. Calibrating all alleles would be too slow.
# This could be improved though.
time mhcflurry-calibrate-percentile-ranks \
--models-dir models.${kind} \
--match-amino-acid-distribution-data "$MODELS_DIR/train_data.csv.bz2" \
--motif-summary \
--num-peptides-per-length 1000000 \
--allele $(bzcat "$MODELS_DIR/train_data.csv.bz2" | cut -f 1 -d , | grep -v allele | uniq | sort | uniq) \
--verbosity 1 \
--worker-log-dir "$SCRATCH_DIR/$DOWNLOAD_NAME" \
--prediction-batch-size 524288 \
--cluster-parallelism \
--cluster-submit-command bsub \
--cluster-results-workdir ~/mhcflurry-scratch \
--cluster-script-prefix-path $SCRIPT_DIR/cluster_submit_script_header.mssm_hpc.lsf
done
bzip2 LOG.txt
for i in $(ls LOG-worker.*.txt) ; do bzip2 $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
#!/bin/bash
# Model select pan-allele MHCflurry Class I models and calibrate percentile ranks.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
# Send stdout and stderr to a logfile included with the archive.
exec > >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt")
exec 2> >(tee -ia "$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.txt" >&2)
# Log some environment info
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
cp $SCRIPT_ABSOLUTE_PATH .
GPUS=$(nvidia-smi -L 2> /dev/null | wc -l) || GPUS=0
echo "Detected GPUS: $GPUS"
PROCESSORS=$(getconf _NPROCESSORS_ONLN)
echo "Detected processors: $PROCESSORS"
if [ "$GPUS" -eq "0" ]; then
NUM_JOBS=${NUM_JOBS-1}
else
NUM_JOBS=${NUM_JOBS-$GPUS}
fi
echo "Num jobs: $NUM_JOBS"
export PYTHONUNBUFFERED=1
UNSELECTED_PATH="$(mhcflurry-downloads path models_class1_pan_unselected)"
for kind in with_mass_spec no_mass_spec
do
MODELS_DIR="$UNSELECTED_PATH/models.${kind}"
time mhcflurry-class1-select-pan-allele-models \
--data "$MODELS_DIR/train_data.csv.bz2" \
--models-dir "$MODELS_DIR" \
--out-models-dir models.${kind} \
--min-models 8 \
--max-models 32 \
--num-jobs 0 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1
# For now we calibrate percentile ranks only for alleles for which there
# is training data. Calibrating all alleles would be too slow.
# This could be improved though.
time mhcflurry-calibrate-percentile-ranks \
--models-dir models.${kind} \
--match-amino-acid-distribution-data "$MODELS_DIR/train_data.csv.bz2" \
--motif-summary \
--num-peptides-per-length 1000000 \
--allele $(bzcat "$MODELS_DIR/train_data.csv.bz2" | cut -f 1 -d , | grep -v allele | uniq | sort | uniq) \
--verbosity 1 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1
done
bzip2 LOG.txt
for i in $(ls LOG-worker.*.txt) ; do bzip2 $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
../models_class1_pan_unselected/cluster_submit_script_header.mssm_hpc.lsf
\ No newline at end of file
#!/bin/bash
#
# Train pan-allele MHCflurry Class I models. Supports re-starting a failed run.
#
# Uses an HPC cluster (Mount Sinai chimera cluster, which uses lsf job
# scheduler). This would need to be modified for other sites.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan_unselected
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
if [ "$1" != "continue-incomplete" ]
then
echo "Fresh run"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
else
echo "Continuing incomplete run"
fi
# Send stdout and stderr to a logfile included with the archive.
LOG="$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.$(date +%s).txt"
exec > >(tee -ia "$LOG")
exec 2> >(tee -ia "$LOG" >&2)
# Log some environment info
echo "Invocation: $0 $@"
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
if [ "$1" != "continue-incomplete" ]
then
cp $SCRIPT_DIR/generate_hyperparameters.py .
python generate_hyperparameters.py > hyperparameters.yaml
fi
for kind in with_mass_spec no_mass_spec
do
EXTRA_TRAIN_ARGS=""
if [ "$1" == "continue-incomplete" ] && [ -d "models.${kind}" ]
then
echo "Will continue existing run: $kind"
EXTRA_TRAIN_ARGS="--continue-incomplete"
fi
mhcflurry-class1-train-pan-allele-models \
--data "$(mhcflurry-downloads path data_curated)/curated_training_data.${kind}.csv.bz2" \
--allele-sequences "$(mhcflurry-downloads path allele_sequences)/allele_sequences.csv" \
--pretrain-data "$(mhcflurry-downloads path random_peptide_predictions)/predictions.csv.bz2" \
--held-out-measurements-per-allele-fraction-and-max 0.25 100 \
--ensemble-size 4 \
--hyperparameters hyperparameters.yaml \
--out-models-dir $(pwd)/models.${kind} \
--worker-log-dir "$SCRATCH_DIR/$DOWNLOAD_NAME" \
--verbosity 0 \
--cluster-parallelism \
--cluster-submit-command bsub \
--cluster-results-workdir ~/mhcflurry-scratch \
--cluster-script-prefix-path $SCRIPT_DIR/cluster_submit_script_header.mssm_hpc.lsf \
$EXTRA_TRAIN_ARGS
done
cp $SCRIPT_ABSOLUTE_PATH .
bzip2 -f "$LOG"
for i in $(ls LOG-worker.*.txt) ; do bzip2 -f $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
# Split into <2GB chunks for GitHub
PARTS="${RESULT}.part."
# Check for pre-existing part files and rename them.
for i in $(ls "${PARTS}"* )
do
DEST="${i}.OLD.$(date +%s)"
echo "WARNING: already exists: $i . Moving to $DEST"
mv $i $DEST
done
split -b 2000M "$RESULT" "$PARTS"
echo "Split into parts:"
ls -lh "${PARTS}"*
#!/bin/bash
#
# Train pan-allele MHCflurry Class I models. Supports re-starting a failed run.
#
set -e
set -x
DOWNLOAD_NAME=models_class1_pan_unselected
SCRATCH_DIR=${TMPDIR-/tmp}/mhcflurry-downloads-generation
SCRIPT_ABSOLUTE_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR=$(dirname "$SCRIPT_ABSOLUTE_PATH")
mkdir -p "$SCRATCH_DIR"
if [ "$1" != "continue-incomplete" ]
then
echo "Fresh run"
rm -rf "$SCRATCH_DIR/$DOWNLOAD_NAME"
mkdir "$SCRATCH_DIR/$DOWNLOAD_NAME"
else
echo "Continuing incomplete run"
fi
# Send stdout and stderr to a logfile included with the archive.
LOG="$SCRATCH_DIR/$DOWNLOAD_NAME/LOG.$(date +%s).txt"
exec > >(tee -ia "$LOG")
exec 2> >(tee -ia "$LOG" >&2)
# Log some environment info
echo "Invocation: $0 $@"
date
pip freeze
git status
cd $SCRATCH_DIR/$DOWNLOAD_NAME
cp $SCRIPT_DIR/generate_hyperparameters.py .
python generate_hyperparameters.py > hyperparameters.yaml
GPUS=$(nvidia-smi -L 2> /dev/null | wc -l) || GPUS=0
echo "Detected GPUS: $GPUS"
PROCESSORS=$(getconf _NPROCESSORS_ONLN)
echo "Detected processors: $PROCESSORS"
if [ "$GPUS" -eq "0" ]; then
NUM_JOBS=${NUM_JOBS-1}
else
NUM_JOBS=${NUM_JOBS-$GPUS}
fi
echo "Num jobs: $NUM_JOBS"
export PYTHONUNBUFFERED=1
if [ "$1" != "continue-incomplete" ]
then
cp $SCRIPT_DIR/generate_hyperparameters.py .
python generate_hyperparameters.py > hyperparameters.yaml
fi
for kind in with_mass_spec no_mass_spec
do
EXTRA_TRAIN_ARGS=""
if [ "$1" == "continue-incomplete" ] && [ -d "models.${kind}" ]
then
echo "Will continue existing run: $kind"
EXTRA_TRAIN_ARGS="--continue-incomplete"
fi
mhcflurry-class1-train-pan-allele-models \
--data "$(mhcflurry-downloads path data_curated)/curated_training_data.${kind}.csv.bz2" \
--allele-sequences "$(mhcflurry-downloads path allele_sequences)/allele_sequences.csv" \
--pretrain-data "$(mhcflurry-downloads path random_peptide_predictions)/predictions.csv.bz2" \
--held-out-measurements-per-allele-fraction-and-max 0.25 100 \
--ensemble-size 4 \
--hyperparameters hyperparameters.yaml \
--out-models-dir models.${kind} \
--worker-log-dir "$SCRATCH_DIR/$DOWNLOAD_NAME" \
--verbosity 0 \
--num-jobs $NUM_JOBS --max-tasks-per-worker 1 --gpus $GPUS --max-workers-per-gpu 1 \
$EXTRA_TRAIN_ARGS
done
cp $SCRIPT_ABSOLUTE_PATH .
bzip2 -f "$LOG"
for i in $(ls LOG-worker.*.txt) ; do bzip2 -f $i ; done
RESULT="$SCRATCH_DIR/${DOWNLOAD_NAME}.$(date +%Y%m%d).tar.bz2"
tar -cjf "$RESULT" *
echo "Created archive: $RESULT"
# Split into <2GB chunks for GitHub
PARTS="${RESULT}.part."
# Check for pre-existing part files and rename them.
for i in $(ls "${PARTS}"* )
do
DEST="${i}.OLD.$(date +%s)"
echo "WARNING: already exists: $i . Moving to $DEST"
mv $i $DEST
done
split -b 2000M "$RESULT" "$PARTS"
echo "Split into parts:"
ls -lh "${PARTS}"*
# Class I pan-allele models (ensemble)
This download contains trained MHCflurry class I pan-allele models prior to model selection.
To generate this download run:
```
./GENERATE.sh
```
#!/bin/bash
#BSUB -J MHCf-{work_item_num} # Job name
#BSUB -P acc_nkcancer # allocation account or Unix group
#BSUB -q gpu # queue
#BSUB -R rusage[ngpus_excl_p=1] # 1 exclusive GPU
#BSUB -R span[hosts=1] # one node
#BSUB -n 1 # number of compute cores
#BSUB -W 46:00 # walltime in HH:MM
#BSUB -R rusage[mem=30000] # mb memory requested
#BSUB -o {work_dir}/%J.stdout # output log (%J : JobID)
#BSUB -eo {work_dir}/%J.stderr # error log
#BSUB -L /bin/bash # Initialize the execution environment
#
export TMPDIR=/local/JOBS/mhcflurry-{work_item_num}
export PATH=$HOME/.conda/envs/py36b/bin/:$PATH
export PYTHONUNBUFFERED=1
export KMP_SETTINGS=1
set -e
set -x
free -m
module add cuda/10.0.130 cudnn/7.1.1
module list
python -c 'import tensorflow as tf ; print("GPU AVAILABLE" if tf.test.is_gpu_available() else "GPU NOT AVAILABLE")'
env
cd {work_dir}