
OpenVax Neoantigen Prediction Pipeline: A Tutorial

We've previously written about the OpenVax neoantigen prediction pipeline, which is the computational basis of 3 clinical trials at Mount Sinai. The purpose of this post is to provide a quick-start tutorial for using our pipeline on your own data to predict cancer neoantigens and select peptide vaccine contents, aiming to elicit a T cell response against those neoantigens.

In a nutshell, the OpenVax pipeline is a Dockerized end-to-end workflow using Snakemake that starts with raw tumor/normal sequencing data and does all the necessary processing to generate neoantigen predictions. It is easy to set up and run, contains all needed dependencies, and does not require a cluster. All you need is a Docker installation and a relatively beefy machine - we run it on a 24-core server, but 16 cores should be enough as well.


These are the steps performed by the OpenVax pipeline:

  • Tumor and normal whole exome sequencing FASTQ files are aligned to a particular reference genome using bwa mem. RNA-seq of the tumor sample is aligned by STAR to the same reference genome. (We generally use GRCh37+decoy sequences or GRCh38.)
  • Aligned tumor and normal exome reads pass through several steps using GATK 3.7: MarkDuplicates, IndelRealigner, BQSR.
  • Aligned RNA-seq reads are grouped into two sets: those spanning introns and those aligning entirely within an exon (determined based on the CIGAR string). The latter group is passed through IndelRealigner, and the two groups of reads are merged.
  • Somatic variant calling is performed by running Mutect 1.1.7, Mutect 2, and/or Strelka version 1. The pipeline configuration, which we go into below, specifies which of those 3 variant callers the pipeline should run.
  • Once both somatic variants and aligned RNA are ready, a custom tool called Vaxrank coordinates variant effect annotation, expression estimation, and MHC binding prediction. Vaxrank prioritizes somatic variants based on expression and predicted MHC class I binding affinity.

Getting Started

The OpenVax pipeline assumes you have the following datasets:

Sample data

You will need to provide tumor and normal whole exome sequencing data, as well as tumor RNA-seq data. These files need to be in gzip-compressed FASTQ format. The pipeline also expects as input a list of MHC class I alleles for the individual.
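Before starting a long run, it can be worth confirming that every input file really is gzip-compressed. The snippet below is a minimal sketch of such a check; the function name `check_fastq_gz` is our own invention for illustration and is not part of the pipeline.

```shell
# Hypothetical helper (not part of the OpenVax pipeline): test each file with
# `gzip -t`, which verifies the compressed stream without writing any output.
# Returns non-zero if at least one file fails the check.
check_fastq_gz() {
  status=0
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "ok: $f"
    else
      echo "NOT valid gzip: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call; adjust the path and glob to match your own files.
# check_fastq_gz /your/path/to/fastq/inputs/*.fastq.gz
```

This only checks the compression layer. It does not validate the FASTQ records themselves.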


You will need to provide a reference genome and associated files:

  • Reference sequence file (FASTA)
  • Known transcripts file (GTF)
  • dbSNP known mutation file for your reference genome (VCF)

Optionally, you can also include:

  • COSMIC mutation file: will be used in somatic variant calling if present
  • Exome capture kit coverage file (BED): will be used to compute sequencing metrics if present

For a quick start, we provide a processed version of these reference files for the b37decoy, GRCh38, and mm10 genomes in Google Cloud. You can download them from this Google Cloud console link, or if you have installed gsutil, you can download the files faster. If you want to instead run the neoantigen prediction pipeline with your own reference genome, please contact us and we’ll help you out.


To install the latest version of the OpenVax pipeline from Dockerhub, run this one-liner:

docker pull openvax/neoantigen-vaccine-pipeline:latest

Verify everything installed correctly and see all available pipeline options (e.g. ability to execute a dry run listing all commands, specify memory/CPU resources for the pipeline, and others):

docker run openvax/neoantigen-vaccine-pipeline:latest -h


You run the pipeline by invoking a Docker entrypoint in the image, giving it three directories as mounted Docker volumes:

  • /inputs: FASTQ files and a configuration YAML (see example YAML here)
  • /outputs: directory to write results to
  • /reference-genome: reference genome data, may be shared across pipeline runs

Make sure that all 3 directories are world-writable - the Docker pipeline runs as an unprivileged user, and the pipeline will need to write data to one or more of these directories.
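As a rough sketch, the three host directories can be created and opened up in one go. The helper name `prepare_dirs` is hypothetical, and the paths in the example call are the same placeholders used elsewhere in this post.

```shell
# Hypothetical helper: create each directory if it is missing, then make it
# (and anything already inside it) writable by all users, so the unprivileged
# user inside the container can write there.
prepare_dirs() {
  for d in "$@"; do
    mkdir -p "$d" && chmod -R a+w "$d"
  done
}

# Example call with the placeholder paths from this post:
# prepare_dirs /your/path/to/fastq/inputs \
#              /your/path/to/pipeline/outputs \
#              /your/path/to/reference/genome
```

World-writable directories are a convenience for a single-user machine. On a shared server you may prefer a tighter permission scheme that still lets the container user write.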

Let’s say you want to run the pipeline using the GRCh38 reference genome. If you’re using our provided processed files, download and uncompress them (and make the reference genome directory world-writable):

gsutil -m cp gs://reference-genomes/grch38.tar.gz /your/path/to/reference/genome/
cd /your/path/to/reference/genome/ && tar -zxvf grch38.tar.gz

chmod -R a+w grch38



An OpenVax pipeline YAML config file contains sample-specific settings and tool configurations that may be common across samples and shared in multiple pipeline runs. This config file needs to live in the directory /your/path/to/fastq/inputs, the same directory as your input FASTQ files. Some notes about this:

  • All paths in the config are expected to be relative to the inputs, outputs, and reference genome directories you're mounting as Docker volumes.
  • In most cases, you should only need to edit the sample-specific configuration (first half of the config YAML file).
  • You can list up to 6 MHC class I alleles for your sample. Make sure that each allele is supported by NetMHCpan; otherwise no vaccine peptide results will be returned.
  • The sample ID will determine the subdirectory where the results get written, so you should use a unique sample name for each pipeline run.
  • If you have data in paired-end FASTQ files, the two files must be specified as "r1" and "r2" entries instead of the singular entry in the config template. You’ll also need to change the "type" to "paired-end".
  • If you have data for a single sample that's spread across multiple sequencing runs or lanes, the config can accommodate that too: the "tumor", "normal", and "rna" sections in the configuration file each contain a list of fragments, and a section may hold multiple list elements as long as each entry has a distinct "fragment_id" value. For example, if the tumor RNA data comes from multiple sequencing runs, you can add another list element to the "rna" block with "fragment_id: L002".
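To make the last two notes concrete, here is a small sketch that prints a multi-lane, paired-end fragment list. The key names ("fragment_id", "type", "r1", "r2") are the ones mentioned above, but the indentation and nesting shown are an assumption, the file names are made up, and the helper `emit_fragments` is hypothetical. Compare the output against the example YAML before pasting it into a real config.

```shell
# Hypothetical helper: print one config section as a YAML list of paired-end
# fragments. Usage: emit_fragments SECTION ID:R1:R2 [ID:R1:R2 ...]
# The exact layout is an assumption; check it against the example config.
emit_fragments() {
  section="$1"
  shift
  printf '%s:\n' "$section"
  for spec in "$@"; do
    id=${spec%%:*}      # text before the first colon
    rest=${spec#*:}     # everything after the first colon
    r1=${rest%%:*}
    r2=${rest#*:}
    printf '  - fragment_id: %s\n' "$id"
    printf '    type: paired-end\n'
    printf '    r1: %s\n' "$r1"
    printf '    r2: %s\n' "$r2"
  done
}

# Two RNA-seq lanes with made-up file names under the mounted /inputs volume.
emit_fragments rna \
  "L001:/inputs/rna_L001_R1.fastq.gz:/inputs/rna_L001_R2.fastq.gz" \
  "L002:/inputs/rna_L002_R1.fastq.gz:/inputs/rna_L002_R2.fastq.gz"
```

The point to take away is simply that each lane becomes its own list element with a distinct "fragment_id".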

Test run

For an example run, try starting with test data from our GitHub repo, consisting of a YAML config file and two small FASTQ files of reads overlapping a single somatic mutation, set up to run using the GRCh38 reference genome. First, download this reference genome as described above. For this simple test, we will re-use the tumor DNA sequencing as our RNA reads. Download the test data files to the directory you'll have mounted as your /inputs volume:

cd /your/path/to/fastq/inputs


After you create your /outputs directory, you can run the pipeline:

docker run -it \
-v /your/path/to/fastq/inputs:/inputs \
-v /your/path/to/pipeline/outputs:/outputs \
-v /your/path/to/reference/genome:/reference-genome \
openvax/neoantigen-vaccine-pipeline:latest \

The first time you run this, it may take several minutes as necessary processing files are being downloaded and cached. The output will be a set of ranked variants and proposed vaccine peptide results in several file formats, including basic text (ASCII) and PDF. If everything works correctly, you should see a single IDH1 R132H variant in the final output.
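If you launch the pipeline often, wrapping the volume mounts in a small function saves retyping them. The sketch below is one way to do that; `run_openvax` and the `DOCKER` override are our own hypothetical names. Anything after the three directory arguments is passed straight through to the pipeline entrypoint.

```shell
# Hypothetical wrapper: mount the three host directories at the locations the
# pipeline expects and forward all remaining arguments to the entrypoint.
# Setting DOCKER=echo prints the arguments instead of starting a container.
run_openvax() {
  inputs="$1"
  outputs="$2"
  reference="$3"
  shift 3
  "${DOCKER:-docker}" run -it \
    -v "$inputs":/inputs \
    -v "$outputs":/outputs \
    -v "$reference":/reference-genome \
    openvax/neoantigen-vaccine-pipeline:latest "$@"
}

# Preview the command line for the help invocation without starting Docker.
DOCKER=echo run_openvax \
  /your/path/to/fastq/inputs \
  /your/path/to/pipeline/outputs \
  /your/path/to/reference/genome \
  -h
```

Drop the `DOCKER=echo` prefix to actually run the container with the same arguments.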

What if I want to use OpenVax to just call somatic variants?

You can also run the OpenVax pipeline just to call somatic variants, for example if you don't have tumor expression data or MHC alleles for your sample. Simply use the same pipeline configuration, but omit the tumor RNA part of the sample as well as the HLA alleles. The pipeline will then call variants and write the Mutect and Strelka calls into their own respective VCFs! See an example of a variant-calling-only pipeline config here.

There’s all sorts of other intermediate output in the pipeline as well. For a description of that, or for any other details about the pipeline, check out our README on GitHub. And if you have any other questions, please contact us.

Post by Julia Kodysh