ELASPIC is a meta-predictor that combines sequence-based features (the most important being PROVEAN) with structural features (the most important being FoldX). It uses the Stochastic Gradient Boosting algorithm for machine learning.

  • ELASPIC is designed to work on the genome-wide scale by using homology models.
  • It predicts mutation \(\Delta \Delta G\) for protein folding and protein interactions.
  • It is open source and can be installed and run locally.


Introduction

_images/elaspic_flowchart.png

Flowchart describing the ELASPIC pipeline.

ELASPIC can be run using two different pipelines: the Local pipeline and the Database pipeline.

Database pipeline

The database pipeline allows mutations to be performed on a proteome-wide scale, without having to specify a structural template for each protein. This pipeline requires a local copy of ELASPIC domain definitions and templates, as well as a local copy of the BLAST and PDB databases.

The general overview of the database pipeline is presented in the figure to the right. A user runs the ELASPIC pipeline specifying the UniProt ID of the protein being mutated, and one or more mutations affecting that protein. At each decision node, the pipeline queries the database to check whether or not the required information has been previously calculated. If the required data has not been calculated, the pipeline calculates it on the fly and stores the results in the database for later retrieval. The pipeline proceeds until homology models of all domains in the protein, and all domain-domain interactions involving the protein, have been calculated, and the \(\Delta \Delta G\) has been predicted for every specified mutation.
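
The check-then-calculate behaviour at each decision node can be pictured as a small caching routine. The sketch below is only an illustration and is not taken from the ELASPIC source: the `results` table, the `get_or_compute` and `make_store` functions, and the in-memory SQLite database are all hypothetical stand-ins.

```python
import sqlite3


def make_store():
    """Create an in-memory database with a hypothetical `results` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (key TEXT PRIMARY KEY, value TEXT)")
    return conn


def get_or_compute(conn, key, compute):
    """Return the stored value for `key`, calculating and storing it if absent."""
    row = conn.execute(
        "SELECT value FROM results WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # previously calculated: reuse the stored result
    value = compute(key)  # not in the database yet: calculate on the fly
    conn.execute("INSERT INTO results (key, value) VALUES (?, ?)", (key, value))
    conn.commit()  # store the result for later retrieval
    return value
```

Calling `get_or_compute` twice with the same key runs the `compute` callback only once; the second call is served from the table.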

Local pipeline

The local pipeline works without downloading and installing a local copy of the ELASPIC and PDB databases, but requires a PDB structure or template to be provided for every protein. Pipeline output is saved as JSON files inside the working directory, rather than being uploaded to the database as in the database pipeline. The general overview of the local pipeline is presented in the figure to the right.

The local pipeline still requires a local copy of the BLAST nr database.
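
Since the local pipeline writes its output as JSON files in the working directory, the results can be collected with a few lines of Python. This is a minimal sketch: the function name `load_json_outputs` is hypothetical, and because the names and layout of the output files are not described here, the code simply loads every `*.json` file it finds.

```python
import json
from pathlib import Path


def load_json_outputs(working_dir):
    """Map each *.json file name in `working_dir` to its parsed contents."""
    results = {}
    for path in sorted(Path(working_dir).glob("*.json")):
        with path.open() as handle:
            results[path.name] = json.load(handle)
    return results
```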

Installation Guide

In order to use the ELASPIC Local pipeline on your computer:
  1. Install Python and ELASPIC (Installing Python and ELASPIC).
  2. Download the BLAST database and preferably also the PDB database to a local folder (Downloading external datasets).
In order to use the ELASPIC Database pipeline, in addition to the steps above:
  1. Create a local database and modify the configuration file to match your system and database settings (Updating the configuration file).
  2. Download Profs domain definitions for your organism of interest, and upload the data to a local database (Importing precalculated data).

Installing Python and ELASPIC

  1. Download and install the Anaconda Python Distribution (Python 3) for Linux.

  2. Add bioconda, salilab, and ostrokach channels to your ~/.condarc file:

    conda config --add channels ostrokach
    conda config --add channels salilab
    conda config --add channels bioconda
    
  3. Obtain a Modeller license, and export the license as KEY_MODELLER in your ~/.bashrc file:

    # ~/.bashrc
    export KEY_MODELLER=XXXXXXX
    
  4. Install ELASPIC and all its dependencies into a new conda environment:

    conda create -n elaspic elaspic
    
  5. Activate the new environment and use elaspic:

    source activate elaspic
    elaspic --help
    

Downloading external datasets

Blast

Download and extract the nr and pdbaa databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/, and change the blast_db_dir variable in your configuration file to point to the directory containing the uncompressed files.

PDB

Download the contents of the ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ folder, and change the pdb_dir variable in your configuration file to point to the directory containing the downloaded data.
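
The "divided" PDB layout stores each entry in a subfolder named after the two middle characters of its four-character ID, as a gzipped file called `pdb<id>.ent.gz`. The helper below sketches how a path inside `pdb_dir` can be derived from a PDB ID; the function name `pdb_file_path` is hypothetical and is not part of ELASPIC.

```python
from pathlib import Path


def pdb_file_path(pdb_dir, pdb_id):
    """Return the expected location of `pdb_id` inside a divided PDB mirror."""
    pdb_id = pdb_id.lower()
    if len(pdb_id) != 4:
        raise ValueError("expected a four-character PDB ID, got %r" % pdb_id)
    # Entries are grouped by the middle two characters, e.g. 4hhb -> hh/
    return Path(pdb_dir) / pdb_id[1:3] / f"pdb{pdb_id}.ent.gz"
```

For example, `pdb_file_path("/data/pdb", "4HHB")` points to `/data/pdb/hh/pdb4hhb.ent.gz`.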

Updating the configuration file

Edit the ELASPIC configuration file ./config/config_file.ini to match your system:
  1. Settings in the [SEQUENCE] section should be modified to match the location of your local BLAST and PDB databases.
  2. Settings in the [DATABASE] section should be modified to match the local MySQL, PostgreSQL, or SQLite database.
  3. Settings in the [DEFAULT] and [MODEL] sections may be left unchanged, since the default values are good enough in most cases.
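
The configuration file uses the INI format, so it can also be generated or inspected with Python's standard `configparser` module. The sketch below builds a minimal configuration for an SQLite setup using option names from the list of configuration options; the helper names `build_config` and `to_ini_text` are hypothetical, and the exact spelling of the `db_type` value accepted by ELASPIC is an assumption.

```python
import configparser
import io


def build_config(blast_db_dir, archive_dir, sqlite_db_dir):
    """Assemble a minimal configuration for an SQLite-backed setup."""
    config = configparser.ConfigParser()
    # Values in [DEFAULT] are visible from every other section.
    config["DEFAULT"] = {"n_cores": "1", "debug": "True"}
    config["SEQUENCE"] = {"blast_db_dir": blast_db_dir}
    config["DATABASE"] = {
        "db_type": "sqlite",  # assumed spelling of the SQLite option
        "sqlite_db_dir": sqlite_db_dir,
        "archive_dir": archive_dir,
    }
    return config


def to_ini_text(config):
    """Serialise the configuration to INI text."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

Writing the result of `to_ini_text` to a file gives a starting point that can then be edited by hand.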

Configuration options

[DEFAULT]
global_temp_dir
Location for storing temporary files. It will be used only if the TMPDIR environmental variable is not set. Default = ‘/tmp/’.
temp_dir
A folder in the global_temp_dir that will contain all the files that are relevant to ELASPIC. Inside this folder, every job will create its own unique subfolder. Default = ‘elaspic/’.
debug
Whether or not to show detailed debugging information. If True, the logging level will be set to logging.DEBUG. If False, the logging level will be set to logging.INFO. Default = True.
look_for_interactions
Whether or not to compute models of protein-protein interactions. Default = True.
remake_provean_supset
Whether or not to remake the Provean supporting set if one or more sequences cannot be found in the BLAST database. Default = False.
n_cores
Number of cores to use by programs that support multithreading. Default = 1.
web_server
Whether or not the ELASPIC pipeline is being run as part of a webserver. Default = False.
provean_temp_dir
Location to store Provean temporary files if working on any node other than beagle or banting. For internal use only. Default = ‘’.
copy_data
Whether or not to copy calculated data back to the archive. Set to ‘False’ if you are planning to copy the data yourself (e.g. from inside a PBS or SGE script). Default = True.
[SEQUENCE]
blast_db_dir
Location of the blast nr and pdbaa databases.
blast_db_dir_fallback
Place to look for blast nr and pdbaa databases if blast_db_dir does not exist.
matrix_type
Substitution matrix for calculating the mutation conservation score. Default = ‘blosum80’.
gap_start
Penalty for starting a gap when calculating the mutation conservation score. Default = -16.
gap_extend
Penalty for extending a gap when calculating the mutation conservation score. Default = -4.
[MODEL]
modeller_runs
Number of models that MODELLER should make before choosing the best one. Not implemented! Default = 1.
foldx_water
How FoldX should handle water molecules. One of:
  • -CRYSTAL: use water molecules in the crystal structure to bridge two protein atoms.
  • -PREDICT: predict water molecules that make 2 or more hydrogen bonds to the protein.
  • -COMPARE: compare predicted water bridges with bridges observed in the crystal structure.
  • -IGNORE: don’t predict water molecules. Default.

Source: http://foldx.crg.es/manual3.jsp.

foldx_num_of_runs
Number of times that FoldX should evaluate a given mutation. Default = 1.
[DATABASE]
db_type
The database that you are using. Supported databases are MySQL, PostgreSQL, and SQLite.
sqlite_db_dir
Location of the SQLite database. Required only if db_type is SQLite.
db_schema
The name of the schema that holds all elaspic data.
db_schema_uniprot
The name of the database schema that holds uniprot sequences. Defaults to db_schema.
db_database
The name of the database that contains db_schema and db_schema_uniprot. Required only if db_type is PostgreSQL. Defaults to db_schema.
db_username
The username for the database. Required only if db_type is MySQL or PostgreSQL.
db_password
The password for the database. Required only if db_type is MySQL or PostgreSQL.
db_url
The IP address of the database. Required only if db_type is MySQL or PostgreSQL.
db_port
The listening port of the database. Required only if db_type is MySQL or PostgreSQL.
db_socket
Path to the socket file, if it is not in the default location. Used only if db_url is localhost. For example: /usr/local/mysql5/mysqld.sock for MySQL and /var/lib/postgresql for PostgreSQL.
schema_version
Database schema to use for storing and retrieving data. Default = ‘elaspic’.
archive_type
Format of the archive that holds precalculated data. One of:
  • extracted: all archive files are contained in an extracted directory tree.
  • 7zip: archive is made of three compressed 7zip files (provean/provean.7z, uniprot_domain/uniprot_domain.7z, uniprot_domain_pair/uniprot_domain_pair.7z), provided on the elaspic downloads page.
archive_dir
Location for storing and retrieving precalculated data.
pdb_dir
Location of all PDB structures, equivalent to the “pub/pdb/data/structures/divided/pdb/” folder on the PDB FTP site. Optional.
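
To make the relationship between the [DATABASE] options concrete, the sketch below combines them into a connection URL of the general form `dialect://username:password@host:port/database`. This is an assumption about how the options fit together, not a description of ELASPIC internals: the function name `database_url`, the lowercase `db_type` values, and the `<db_schema>.db` file name used for SQLite are all hypothetical.

```python
from urllib.parse import quote


def database_url(opts):
    """Build a connection URL from a dict of [DATABASE] options."""
    db_type = opts["db_type"].lower()
    if db_type == "sqlite":
        # Hypothetical file name: one database file named after the schema.
        return f"sqlite:///{opts['sqlite_db_dir']}/{opts['db_schema']}.db"
    if db_type not in ("mysql", "postgresql"):
        raise ValueError("unsupported db_type: %r" % opts["db_type"])
    # db_database defaults to db_schema when it is not given.
    database = opts.get("db_database") or opts["db_schema"]
    # Percent-encode credentials so that characters such as '@' are safe.
    username = quote(opts["db_username"], safe="")
    password = quote(opts["db_password"], safe="")
    return (
        f"{db_type}://{username}:{password}"
        f"@{opts['db_url']}:{opts['db_port']}/{database}"
    )
```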

Environmental variables

PATH

A colon-separated list of paths where ELASPIC should look for required programs, such as BLAST, T-Coffee, Modeller, and cd-hit.
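
Whether the required programs can be found on PATH can be checked from Python with `shutil.which`. The helper name `missing_programs` is hypothetical, and the executable names in the usage note are assumptions, since the exact names depend on how each tool was installed.

```python
import shutil


def missing_programs(programs, path=None):
    """Return the names in `programs` that cannot be found on `path`.

    When `path` is None, the PATH environment variable is searched.
    """
    return [name for name in programs if shutil.which(name, path=path) is None]
```

For example, `missing_programs(["blastp", "t_coffee", "cd-hit"])` lists whichever of those (assumed) executable names are not reachable from the current environment.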

TMPDIR

Location to store all temporary files and folders.
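
The interplay between TMPDIR and the global_temp_dir and temp_dir options can be sketched as follows: TMPDIR takes precedence when it is set, temp_dir is created inside the chosen location, and every job gets its own unique subfolder. This illustrates the documented behaviour and is not ELASPIC's own code; the function name `job_temp_dir` is hypothetical.

```python
import os
import tempfile


def job_temp_dir(global_temp_dir="/tmp/", temp_dir="elaspic/", environ=None):
    """Create and return a unique temporary folder for one job."""
    environ = os.environ if environ is None else environ
    # global_temp_dir is used only if TMPDIR is not set.
    base = environ.get("TMPDIR") or global_temp_dir
    elaspic_dir = os.path.join(base, temp_dir)
    os.makedirs(elaspic_dir, exist_ok=True)
    # mkdtemp guarantees a fresh, uniquely named subfolder.
    return tempfile.mkdtemp(dir=elaspic_dir)
```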

Importing precalculated data

ELASPIC downloads page

The ELASPIC downloads page contains all precalculated data that is required to run the ELASPIC pipeline on a local machine.

The *.tsv.gz files correspond to different tables of the ELASPIC database:

  • The domain.tsv.gz file in the root folder contains Profs domain definitions for files in the PDB, and corresponds to the domain table.
  • The domain_contact.tsv.gz file in the root folder contains a list of interactions between those domains, and corresponds to the domain_contact table.
  • All other tables are split into separate folders according to the organism of origin. The files are named using the {table_name}.tsv.gz convention, where table_name is the name of the table in the database.
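
The table files are gzip-compressed, tab-separated text, so they can be previewed without loading them into a database. The sketch below uses only the standard library; the function names `read_table` and `table_name` are hypothetical, and because it is not stated here whether the files start with a header row, rows are returned exactly as they appear.

```python
import csv
import gzip
import os


def table_name(path):
    """Derive the table name from a {table_name}.tsv.gz file name."""
    name = os.path.basename(path)
    suffix = ".tsv.gz"
    if not name.endswith(suffix):
        raise ValueError("not a .tsv.gz file: %r" % name)
    return name[: -len(suffix)]


def read_table(path, limit=None):
    """Return up to `limit` rows of a .tsv.gz file as lists of strings."""
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        for index, row in enumerate(csv.reader(handle, delimiter="\t")):
            if limit is not None and index >= limit:
                break
            rows.append(row)
    return rows
```

Passing a small `limit` is useful for peeking at large tables before importing them.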

The *.7z files contain precalculated data:

  • The provean, uniprot_domain, and uniprot_domain_pair subfolders contain precalculated provean supporting sets, and homology models of protein domains and domain-domain interactions, respectively.

Precalculated mutations:

  • The Homo_sapiens folder contains an additional subfolder precalculated_mutations, which contains \(\Delta \Delta G\) scores for mutations in various datasets.

Note

The configure_test.sh and run_test.sh scripts in the ./scripts folder contain examples of how to download and set up a local copy of the database.

Downloading data

In order to run ELASPIC on a local computer, you need to download precalculated data for your organism of interest. If your goal is to only test the pipeline, you can download a test dataset from the folder current_release/Homo_sapiens_test.

To download all precalculated data for a given organism, use the wget command:

# Download external files
wget -P "${TEST_DIR}/elaspic.kimlab.org" \
    http://elaspic.kimlab.org/static/download/current_release/domain.tsv.gz
wget -P "${TEST_DIR}/elaspic.kimlab.org" \
    http://elaspic.kimlab.org/static/download/current_release/domain_contact.tsv.gz
wget -P "${TEST_DIR}" \
    -r --no-parent --reject "index.html*" --cut-dirs=4  \
    http://elaspic.kimlab.org/static/download/current_release/Homo_sapiens_test/

You need to extract the Provean supporting sets and domain homology models into the folder specified by the archive_dir variable in your configuration file:

mkdir archive  # Set 'archive_dir' variable in the config file to this folder

7z x "${TEST_DIR}/elaspic.kimlab.org/provean/provean.7z" -o"archive"
7z x "${TEST_DIR}/elaspic.kimlab.org/uniprot_domain/uniprot_domain.7z" -o"archive"
7z x "${TEST_DIR}/elaspic.kimlab.org/uniprot_domain_pair/uniprot_domain_pair.7z" -o"archive"

Importing data into a database

You also need to create a local SQL database and fill it with precalculated data.

Modify the database variables in the ELASPIC configuration file to match your local MySQL, PostgreSQL, or SQLite database, and use the elaspic database CLI to create a new database and fill it with precalculated data.

First, you need to create an empty database:

elaspic database -c {your_configuration_file}.ini create

Next, you need to load all precalculated data for the organism in question to your database:

elaspic database -c {your_configuration_file}.ini load_data

To delete the database that you just created, run:

elaspic database -c {your_configuration_file}.ini delete
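
When the database has to be set up repeatedly (for example in a test script), the commands above can be driven from Python. The sketch below only assembles and runs the command lines shown in this section; the function names `database_command` and `run_database_command` are hypothetical.

```python
import subprocess

# The three actions documented above.
ACTIONS = ("create", "load_data", "delete")


def database_command(config_file, action):
    """Return the argument list for `elaspic database -c <config> <action>`."""
    if action not in ACTIONS:
        raise ValueError("unknown action: %r" % action)
    return ["elaspic", "database", "-c", config_file, action]


def run_database_command(config_file, action):
    """Run the command and raise CalledProcessError if it fails."""
    subprocess.run(database_command(config_file, action), check=True)
```

Running `create` followed by `load_data` reproduces the two setup steps in order.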

Command Line Interface

After following instructions in the Installation Guide, you should be able to run ELASPIC from the command line using the elaspic command:

$ elaspic --help
usage: elaspic [-h] {run,database,train} ...

optional arguments:
  -h, --help            show this help message and exit.

command:
  {run,database,train}
    run                 run ELASPIC
    database            perform database maintenance tasks
    train               train the ELASPIC classifiers

Type --help to see the options available for each subcommand:

  • elaspic run --help
  • elaspic database --help
  • elaspic database load_data --help
  • etc...

elaspic run

Run the ELASPIC pipeline.

If you wish to mutate an existing PDB, you should specify the name of the PDB file to be mutated, and the mutation(s):

elaspic run \
    --structure_file {structure_file} \
    --mutations {mutations}

If you wish to first create a homology model of a protein, you should provide a FASTA file containing the sequence of the protein to be modelled, a PDB file containing the structural template, and the mutation(s):

elaspic run \
    --sequence_file {sequence_file} \
    --structure_file {structure_file} \
    --mutations {mutations}

If you wish to perform mutagenesis on a proteome-wide scale, you need to download protein domain definitions from the ELASPIC downloads page, and optionally a local copy of the PDB database. After saving your database information to a configuration file, you can run the pipeline by specifying the UniProt ID and the mutation(s):

elaspic run \
  --config_file {config_file} \
  --uniprot_id {uniprot_id} \
  --mutations {mutations}
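
The three ways of calling elaspic run differ only in which inputs are supplied. The sketch below makes that choice explicit by assembling the argument list for each case; the function name `run_command` is hypothetical, only the options shown above are used, and the mutation string is passed through untouched because its format is not specified in this section.

```python
def run_command(mutations, structure_file=None, sequence_file=None,
                config_file=None, uniprot_id=None):
    """Return the argument list for one of the three `elaspic run` modes."""
    args = ["elaspic", "run"]
    if uniprot_id is not None:
        # Database pipeline: needs a configuration file and a UniProt ID.
        if config_file is None:
            raise ValueError("config_file is required with uniprot_id")
        args += ["--config_file", config_file, "--uniprot_id", uniprot_id]
    else:
        # Local pipeline: a structure is always required; a sequence file
        # is added when a homology model should be built first.
        if structure_file is None:
            raise ValueError("structure_file is required without uniprot_id")
        if sequence_file is not None:
            args += ["--sequence_file", sequence_file]
        args += ["--structure_file", structure_file]
    return args + ["--mutations", mutations]
```

The resulting list can be passed to `subprocess.run(..., check=True)` to start the pipeline.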

elaspic train

Train the machine learning predictor for the ELASPIC pipeline.

This is automatically done at install time, and you do not need to do this again unless you update your scikit-learn version.

elaspic database

Perform maintenance tasks on the ELASPIC database.

You must provide a configuration file containing the details of your database installation for any of these commands to work. For more information about configuration files, see Updating the configuration file.

elaspic database create

Create a new database schema.

elaspic database load_data

Load data to the database.

elaspic database delete

Delete the database schema.

Benchmarks

Rosetta benchmarks

Existing approaches

Sequence only

intogen

mCSM: predicting the effects of mutations in proteins using graph-based signatures.

  • http://www.ncbi.nlm.nih.gov/pubmed/24281696
  • “To understand the roles of mutations in disease, we have evaluated their impacts not only on protein stability but also on protein-protein and protein-nucleic acid interactions”.
  • cite{pires_mcsm_2014}

Sequence and structure

Predicting Binding Free Energy Change Caused by Point Mutations with Knowledge-Modified MM/PBSA Method

MAESTRO cite{laimer_maestro_2015}

  • https://biwww.che.sbg.ac.at/?page_id=477
  • MAESTRO implements a multi-agent machine learning system.
  • Structure based tools AUTO-MUTE [7], CUPSAT [8], Dmutant [9], FoldX [10], Eris [11], PoPMuSiC [12], SDM [13] or mCSM [14] usually perform better than the sequence based counterparts. Recently, SDM and mCSM have been integrated into a new method called DUET [15].

INPS: predicting the impact of non-synonymous variations on protein stability from sequence

FireProt: Energy- and Evolution-Based Computational Design of Thermostable Multiple-Point Mutants

  • http://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1004556
  • Predict the structural effect of multiple mutations.
  • “Stability effects of all possible single-point mutations were estimated using the <BuildModel> module of FoldX”.
  • We demonstrate that thermostability of the model enzymes haloalkane dehalogenase DhaA and γ-hexachlorocyclohexane dehydrochlorinase LinA can be substantially increased.
  • cite{bednar_fireprot_2015}

Statistics

Work in progress...

Database

Database schema

_images/elaspic_schema.png

Download: pdf png mwb

Database tables

domain

domain_contact

uniprot_sequence

provean

uniprot_domain

uniprot_domain_template

uniprot_domain_model

uniprot_domain_mutation

uniprot_domain_pair

uniprot_domain_pair_template

uniprot_domain_pair_model

uniprot_domain_pair_mutation

Modules

elaspic package

Submodules

elaspic.call_foldx module

elaspic.call_modeller module

elaspic.call_tcoffee module

elaspic.conf module

elaspic.database_pipeline module

elaspic.elaspic_database module

elaspic.elaspic_database_tables module

elaspic.elaspic_model module

elaspic.elaspic_predictor module

elaspic.elaspic_sequence module

elaspic.errors module

elaspic.helper module

elaspic.local_pipeline module

elaspic.machine_learning module

elaspic.pipeline module

elaspic.structure_analysis module

elaspic.structure_tools module

Module contents
