elaspic package

Submodules

elaspic.call_foldx module

elaspic.call_modeller module

elaspic.call_tcoffee module

Aligns sequences using t_coffee in expresso mode.

TCoffee.align(GAPOPEN=-0.0, GAPEXTEND=-0.0)[source]

Calls t_coffee (make sure BLAST is installed locally!).

Parameters:
  • alignment_fasta_file (string) – A file containing the fasta sequences to be aligned
  • alignment_template_file (string) – A file containing the structural templates for the fasta sequences described above
  • GAPOPEN (int or str) – See t_coffee manual
  • GAPEXTEND (int or str) – See t_coffee manual
Returns:

alignment_output_file – Name of the file which contains the alignment in fasta format.

Return type:

str

elaspic.conf module

A singleton class that keeps track of ELASPIC configuration settings.

Configs.clear()[source]
Configs.copy()[source]
Configs.get(key, fallback=None)[source]
Configs.items()[source]
Configs.keys()[source]
Configs.update(**kwargs)[source]
Configs.values()[source]
Singleton.instance = None
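
As an illustration of the pattern, the following is a minimal sketch of a singleton settings store with dict-style methods. The names `_Singleton` and `SettingsSketch` are hypothetical and this is not the elaspic implementation; only the method names and the `instance = None` attribute are taken from the listing above.

```python
class _Singleton(type):
    """Hypothetical metaclass: the first call creates the object, later calls reuse it."""

    instance = None

    def __call__(cls, *args, **kwargs):
        if cls.instance is None:
            cls.instance = super().__call__(*args, **kwargs)
        return cls.instance


class SettingsSketch(metaclass=_Singleton):
    """Dict-backed settings store exposing a subset of the methods listed above."""

    def __init__(self):
        self._data = {}

    def get(self, key, fallback=None):
        return self._data.get(key, fallback)

    def update(self, **kwargs):
        self._data.update(kwargs)

    def keys(self):
        return self._data.keys()

    def clear(self):
        self._data.clear()


first = SettingsSketch()
first.update(n_cores=4)  # `n_cores` is a placeholder setting name
second = SettingsSketch()  # returns the same object as `first`
```

Because both names refer to one object, a value stored through `first` is visible through `second`.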
elaspic.conf.get_temp_dir(global_temp_dir='/tmp', elaspic_foldername='')[source]

If TMPDIR is set as an environment variable, the temporary directory is created relative to it. This is useful when running on banting (the cluster in the ccbr) and also on Scinet. Make sure that TMPDIR points to ‘/dev/shm/’ on Scinet.
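
The lookup described above might look roughly like the sketch below. The function `resolve_temp_dir` and its `environ` argument are hypothetical; the sketch only computes the path and does not create the directory, which the real function may do differently.

```python
import os


def resolve_temp_dir(environ, global_temp_dir="/tmp", elaspic_foldername=""):
    """Return the temp directory path, preferring TMPDIR when it is set.

    `environ` is any mapping of environment variables (for example os.environ).
    """
    base_dir = environ.get("TMPDIR") or global_temp_dir
    return os.path.join(base_dir, elaspic_foldername)


# "elaspic" is a placeholder folder name.
default_path = resolve_temp_dir({}, elaspic_foldername="elaspic")
ramdisk_path = resolve_temp_dir({"TMPDIR": "/dev/shm"}, elaspic_foldername="elaspic")
```

Passing the environment as an argument keeps the sketch easy to test; in practice one would pass `os.environ` and follow up with `os.makedirs(path, exist_ok=True)`.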

elaspic.conf.read_configuration_file(config_file, unique_temp_dir=None)[source]
elaspic.conf.read_database_configs(configParser)[source]

[DATABASE]

elaspic.conf.read_model_configs(configParser)[source]

[MODEL]

elaspic.conf.read_sequence_configs(configParser)[source]

[SEQUENCE]
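
A configuration file with these three sections can be read with the standard-library `configparser` module. The sketch below assumes an INI-style file; the option names (`db_type`, `n_cores`, `blast_db_dir`) and the helper `read_section` are placeholders and are not taken from elaspic.

```python
import configparser

# Hypothetical file contents; the option names are placeholders.
EXAMPLE_CONFIG = """
[DATABASE]
db_type = sqlite

[MODEL]
n_cores = 2

[SEQUENCE]
blast_db_dir = /path/to/blast/db
"""

config_parser = configparser.ConfigParser()
config_parser.read_string(EXAMPLE_CONFIG)


def read_section(parser, section):
    """Return the options of one section as a plain dict (all values are strings)."""
    return dict(parser.items(section))


database_options = read_section(config_parser, "DATABASE")
n_cores = config_parser.getint("MODEL", "n_cores")  # typed access to one option
```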

elaspic.database_pipeline module

elaspic.elaspic_database module

MyDatabase.add_domain(d)[source]
MyDatabase.add_domain_errors(t, error_string)[source]
MyDatabase.add_uniprot_sequence(uniprot_sequence)[source]

Add new sequences to the database.

Parameters:
  • uniprot_sequence (UniprotSequence) – UniprotSequence object
Return type:

None

MyDatabase.configure_session()[source]

Configure the Session class to use the current engine.

autocommit and autoflush are enabled for the sqlite database in order to improve performance.

MyDatabase.copy_table_to_db(table_name, table_folder)[source]

Copy data from a .tsv file to a table in the database.

MyDatabase.create_database_tables(clear_schema=False, keep_uniprot_sequence=True)[source]

Create a new database in the schema specified by the schema_version global variable. If clear_schema == True, remove all the tables in the schema first.

Warning

Using this function with an existing database can lead to loss of data. Make sure that you know what you are doing!

Parameters:
  • clear_schema (bool) – Whether or not to delete all tables in the database schema before creating new tables.
  • keep_uniprot_sequence (bool) – Whether or not to keep the uniprot_sequence table. Only relevant if clear_schema is True.
MyDatabase.delete_database_tables(drop_schema=False, keep_uniprot_sequence=True)[source]
Parameters:
  • drop_schema (bool) – Whether or not to drop the schema after dropping the tables.
  • keep_uniprot_sequence (bool) – Whether or not to keep the table (and schema) containing uniprot sequences.
MyDatabase.get_alignment(model, path_to_data)[source]
MyDatabase.get_domain(pfam_names, subdomains=False)[source]

Returns pdbfam-based definitions of all pfam domains in the pdb.

MyDatabase.get_domain_contact(pfam_names_1, pfam_names_2, subdomains=False)[source]

Returns domain-domain interaction information from pdbfam. Note that the produced dataframe may not have the same order as the keys.

MyDatabase.get_engine(echo=False)[source]

Get an SQLAlchemy engine that can be used to connect to the database.

MyDatabase.get_rows_by_ids(row_object, row_object_identifiers, row_object_identifier_values)[source]

Get the rows from the table row_object identified by keys row_object_identifiers with values row_object_identifier_values.

MyDatabase.get_uniprot_domain(uniprot_id, copy_data=False)[source]
MyDatabase.get_uniprot_domain_pair(uniprot_id, copy_data=False, uniprot_domain_pair_ids=[])[source]
MyDatabase.get_uniprot_mutation(d, mutation, uniprot_id=None, copy_data=False)[source]
MyDatabase.get_uniprot_sequence(uniprot_id, check_external=True)[source]
Parameters:
  • uniprot_id (str) – Uniprot ID of the protein
  • check_external (bool) – Whether or not to look online if the protein sequence is not found in the local database.
Returns:

Sequence of the protein with the specified Uniprot ID.

Return type:

SeqRecord

MyDatabase.load_db_from_archive()[source]

TODO: In the future I should move back to using json...

MyDatabase.merge_model(d, files_dict={})[source]

Adds MODELLER models to the database.

MyDatabase.merge_mutation(mut, path_to_data=False)[source]
MyDatabase.merge_provean(provean, provean_supset_file, path_to_data)[source]

Adds provean score to the database.

MyDatabase.merge_row(row_instance)[source]

Adds a list of rows (row_instances) to the database.

MyDatabase.mysql_command_template = "load data local infile '{table_folder}/{table_name}.tsv' into table {table_db_schema}.{table_name} fields terminated by '\\t' escaped by '\\\\\\\\' lines terminated by '\\n'; "
MyDatabase.mysql_load_table_template = 'mysql --local-infile --host={db_url} --user={db_username} --password={db_password} {table_db_schema} -e "{sql_command}" '
MyDatabase.psql_command_template = "\\\\copy {table_db_schema}.{table_name} from '{table_folder}/{table_name}.tsv' with csv delimiter E'\\t' null '\\N' escape '\\\\'; "
MyDatabase.psql_load_table_template = 'PGPASSWORD={db_password} psql -h {db_url} -p {db_port} -U {db_username} -d {db_database} -c "{sql_command}" '
MyDatabase.remove_model(d)[source]

Remove a model from the database.

Do this if you realized that the model you built is incorrect or that some of the data is missing.

Raises: errors.ModelHasMutationsError – The model you are trying to delete has precalculated mutations, so it can’t be that bad. Delete those mutations and try again.
MyDatabase.session_scope()[source]

Provide a transactional scope around a series of operations. Enables the following construct: with self.session_scope() as session:.
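
A transactional scope of this kind is commonly written as a context manager that commits on success, rolls back on error, and always closes the session. The sketch below is an assumption about the general pattern, not the elaspic code: `FakeSession` is a hypothetical stand-in for a real database session, and the function takes a session factory as an argument instead of being a method.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in for a database session; records which calls were made."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


@contextmanager
def session_scope(session_factory):
    """Commit on success, roll back on error, and always close the session."""
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


with session_scope(FakeSession) as ok_session:
    pass  # database work would go here

try:
    with session_scope(FakeSession) as failed_session:
        raise ValueError("simulated failure")
except ValueError:
    pass
```

The exception is re-raised after the rollback, so callers still see the original error.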

MyDatabase.sqlite_table_filename = '{table_folder}/{table_name}.tsv'
elaspic.elaspic_database.check_exception(exc, valid_exc)[source]
elaspic.elaspic_database.decorate_all_methods(decorator)[source]

Decorate all methods of a class with decorator.
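
One way such a class decorator could be written is sketched below. The names `decorate_all_methods_sketch`, `log_calls`, and `Example` are hypothetical, and the choice to wrap only plain functions defined directly on the class is an assumption.

```python
import functools
import inspect


def decorate_all_methods_sketch(decorator):
    """Return a class decorator that wraps every plain function defined on the class."""

    def decorate(cls):
        for name, attribute in list(vars(cls).items()):
            if inspect.isfunction(attribute):
                setattr(cls, name, decorator(attribute))
        return cls

    return decorate


call_log = []


def log_calls(fn):
    """Example decorator: record the name of each method that is called."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_log.append(fn.__name__)
        return fn(*args, **kwargs)

    return wrapper


@decorate_all_methods_sketch(log_calls)
class Example:
    def first(self):
        return 1

    def second(self):
        return 2


example = Example()
results = (example.first(), example.second())
```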

elaspic.elaspic_database.enable_sqlite_foreign_key_checks(engine)[source]
elaspic.elaspic_database.get_uniprot_base_path(d)[source]

The uniprot id is cut into several chunks to create folders that will hold a manageable number of pdbs.
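
The idea of sharding files by identifier can be sketched as follows. The chunk size, the number of chunks, and the use of the full identifier as the final folder are illustrative assumptions; the layout used by elaspic is not specified here, and `chunked_base_path` is a hypothetical name.

```python
import os


def chunked_base_path(identifier, chunk_size=3, n_chunks=2):
    """Build a nested folder path from the leading chunks of an identifier.

    The chunk size and depth are illustrative choices only.
    """
    chunks = [
        identifier[i * chunk_size:(i + 1) * chunk_size]
        for i in range(n_chunks)
    ]
    chunks = [chunk for chunk in chunks if chunk]  # drop empty trailing chunks
    return os.path.join(*chunks, identifier)


# "P12345" is a placeholder identifier.
example_path = chunked_base_path("P12345")
```

Splitting on a prefix keeps the number of entries per directory bounded, which matters on filesystems that slow down with very large directories.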

elaspic.elaspic_database.get_uniprot_domain_path(d)[source]

Return the path to individual domains or domain pairs.

elaspic.elaspic_database.pickle_dump(obj, filename)[source]
elaspic.elaspic_database.retry_archive(fn)[source]

Decorator to keep probing the database until you succeed.

elaspic.elaspic_database.retry_database(fn)[source]

Decorator to keep probing the database until you succeed.
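
A retry decorator in this spirit is sketched below. Unlike the behaviour described above, the sketch gives up after a fixed number of attempts; that bound, the name `retry_sketch`, its parameters, and the use of `ConnectionError` are all assumptions made for illustration.

```python
import functools
import time


def retry_sketch(max_attempts=3, delay_seconds=0.0, retry_on=(Exception,)):
    """Return a decorator that re-runs the function when it raises `retry_on`."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    time.sleep(delay_seconds)

        return wrapper

    return decorator


attempts = []


@retry_sketch(max_attempts=5, retry_on=(ConnectionError,))
def flaky_query():
    """Fail twice, then succeed, to imitate a temporarily busy database."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("database is busy")
    return "ok"


result = flaky_query()
```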

elaspic.elaspic_database.scinet_cleanup(folder, destination, name=None)[source]

Zip and copy the results from the ramdisk to /scratch.

elaspic.elaspic_database_tables module

Created on Thu Jun 11 16:52:31 2015

@author: ostrokach

Profs domain definitions for all proteins in the PDB.

Columns:
cath_id
Unique id identifying each domain in the PDB. Constructed by concatenating the pdb_id, pdb_chain, and an index specifying the order of the domain in the chain.
pdb_id
The PDB id in which the domain is found.
pdb_chain
The PDB chain in which the domain is found.
pdb_domain_def
Domain definitions of the domain, in PDB RESNUM coordinates.
pdb_pdbfam_name
The Profs name of the domain.
pdb_pdbfam_idx
An integer specifying the number of times a domain with domain name pdb_pdbfam_name has occurred in this chain up to this point. It is used to make every (pdb_id, pdb_chain, pdb_pdbfam_name, pdb_pdbfam_idx) tuple unique.
domain_errors
List of errors that occurred when annotating this domain, or when using this domain to make structural homology models.
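
The running index described for pdb_pdbfam_idx can be sketched as a per-chain counter. The function `assign_pdbfam_idx` is hypothetical, the PDB ids and domain names are placeholders, and starting the count at 1 is an assumption.

```python
from collections import defaultdict


def assign_pdbfam_idx(domains):
    """Number repeated domain names within each chain, in the order given.

    `domains` is a list of (pdb_id, pdb_chain, pdb_pdbfam_name) tuples.
    Numbering starts at 1 here, which is an assumption.
    """
    seen = defaultdict(int)
    rows = []
    for pdb_id, pdb_chain, name in domains:
        seen[(pdb_id, pdb_chain, name)] += 1
        rows.append((pdb_id, pdb_chain, name, seen[(pdb_id, pdb_chain, name)]))
    return rows


# Placeholder PDB id, chains, and domain names, for illustration only.
rows = assign_pdbfam_idx([
    ("1abc", "A", "SH3_1"),
    ("1abc", "A", "SH2"),
    ("1abc", "A", "SH3_1"),
    ("1abc", "B", "SH3_1"),
])
```

Every resulting (pdb_id, pdb_chain, pdb_pdbfam_name, pdb_pdbfam_idx) tuple is distinct, even when the same domain name repeats within a chain.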
Domain.cath_id
Domain.domain_errors
Domain.pdb_chain
Domain.pdb_domain_def
Domain.pdb_id
Domain.pdb_pdbfam_idx
Domain.pdb_pdbfam_name

Interactions between Profs domains in the PDB. Only interactions that were predicted to be biologically relevant by NOXclass are included in this table.

Columns:
domain_contact_id
A unique integer identifying each domain pair.
cath_id_1
Unique id identifying the first interacting domain in the domain table.
cath_id_2
Unique id identifying the second interacting domain in the domain table.
min_interchain_distance
The closest that any residue in domain one comes to any residue in domain two.
contact_volume
The volume covered by contacting residues.
contact_surface_area
The surface area of the contacting regions of the first and second domains.
atom_count_1
The number of atoms in the first domain.
atom_count_2
The number of atoms in the second domain.
number_of_contact_residues_1
The number of residues in the first domain that come within 5 Å of the second domain.
number_of_contact_residues_2
The number of residues in the second domain that come within 5 Å of the first domain.
contact_residues_1
A list of all residues in the first domain that come within 5 Å of the second domain. The residue number corresponds to the position of the residue in the domain.
contact_residues_2
A list of all residues in the second domain that come within 5 Å of the first domain. The residue number corresponds to the position of the residue in the domain.
crystal_packing
The probability that the interaction is a crystallization artifact, as defined by NOXclass.
domain_contact_errors
List of errors that occurred when annotating this domain pair, or when using this domain as a template for making structural homology models.
DomainContact.atom_count_1
DomainContact.atom_count_2
DomainContact.cath_id_1
DomainContact.cath_id_2
DomainContact.contact_residues_1
DomainContact.contact_residues_2
DomainContact.contact_surface_area
DomainContact.contact_volume
DomainContact.crystal_packing
DomainContact.domain_1
DomainContact.domain_2
DomainContact.domain_contact_errors
DomainContact.domain_contact_id
DomainContact.min_interchain_distance
DomainContact.number_of_contact_residues_1
DomainContact.number_of_contact_residues_2

Description of the Provean supporting set calculated for a protein sequence. The construction of a supporting set is the most lengthy step in running Provean. Therefore, the supporting set is precalculated and stored for every protein sequence.

Columns:
uniprot_id
The uniprot id of the protein.
provean_supset_filename
The filename of the Provean supporting set. The supporting set contains the ids and sequences of all proteins in the NCBI nr database that are used by Provean to construct a multiple sequence alignment for the given protein.
provean_supset_length
The number of sequences in Provean supporting set.
provean_errors
List of errors that occurred while the Provean supporting set was being calculated.
provean_date_modified
Date and time that this row was last modified.
Provean.provean_date_modified
Provean.provean_errors
Provean.provean_supset_filename
Provean.provean_supset_length
Provean.uniprot_id
Provean.uniprot_sequence

Pfam domain definitions for proteins in the uniprot_sequence table. This table was obtained by downloading Pfam domain definitions for all known proteins from the SIMAP website, and mapping the protein sequence to uniprot using the MD5 hash of each sequence.

Columns:
uniprot_domain_id
Unique id identifying each domain.
uniprot_id
The uniprot id of the protein containing the domain.
pdbfam_name
The Profs name of the domain. In most cases this will be equivalent to the Pfam name of the domain.
pdbfam_idx
The index of the Profs domain. pdbfam_idx ranges from 1 to the number of domains with the name pdbfam_name in the given protein. The (pdbfam_name, pdbfam_idx) tuple uniquely identifies each domain.
pfam_clan
The Pfam clan to which this Profs domain belongs.
alignment_def
Alignment domain definitions of the Profs domain. This field is obtained by removing gaps in the alignment_subdefs column.
pfam_names
Pfam names of all Pfam domains that were combined to create the given Profs domain.
alignment_subdefs
Comma-separated list of domain definitions for all Pfam domains that were merged to create the given Profs domain.
path_to_data
Location for storing homology models, mutation results, and all other data that are relevant to this domain. This path is prefixed by archive_dir.
UniprotDomain.IS_TRAINING_SCHEMA = False
UniprotDomain.alignment_def
UniprotDomain.alignment_subdefs
UniprotDomain.path_to_data
UniprotDomain.pdbfam_idx
UniprotDomain.pdbfam_name
UniprotDomain.pfam_clan
UniprotDomain.pfam_names
UniprotDomain.uniprot_domain_id
UniprotDomain.uniprot_id
UniprotDomain.uniprot_sequence

Homology models for templates in the uniprot_domain_template table.

Columns:
uniprot_domain_id
An integer which uniquely identifies each uniprot domain in the uniprot_domain table.
model_errors
List of errors that occurred when making the homology model.
alignment_filename
The name of the alignment that was given to Modeller when making the homology model.
model_filename
The name of the homology model that was produced by Modeller.
chain
The chain that contains the domain in question in the homology model (this is now set to ‘A’ in all models).
norm_dope
Normalized DOPE score of the model (lower is better).
sasa_score
Comma-separated list of the percent solvent-accessible surface area for each residue.
m_date_modified
The date and time when this row was last modified.
model_domain_def

Domain definitions for the region of the domain that is covered by the structural template.

In most cases, this field is identical to the domain_def field in the uniprot_domain_template table. However, it sometimes happens that the best Profs structural template only covers a fraction of the Pfam domain. In that case, the alignment_def column in the uniprot_domain table, and the domain_def column in the uniprot_domain_template table, will contain the original Pfam domain definitions, and the model_domain_def column will contain domain definitions for only the region that is covered by the structural template.

UniprotDomainModel.alignment_filename
UniprotDomainModel.chain
UniprotDomainModel.m_date_modified
UniprotDomainModel.model_domain_def
UniprotDomainModel.model_errors
UniprotDomainModel.model_filename
UniprotDomainModel.norm_dope
UniprotDomainModel.sasa_score
UniprotDomainModel.template
UniprotDomainModel.uniprot_domain_id

Characterization of mutations introduced into structures in the uniprot_domain_model table.

Columns:
uniprot_id
Uniprot ID of the protein that was mutated.
uniprot_domain_id
Unique id which identifies the Profs domain that was mutated in the uniprot_domain table.
mutation
Mutation that was introduced into the protein, in Uniprot coordinates.
mutation_errors
List of errors that occurred while evaluating the mutation.
model_filename_wt
The name of the file which contains the homology model of the domain after the model was relaxed with FoldX but before the mutation was introduced.
model_filename_mut
The name of the file which contains the homology model of the domain after the model was relaxed with FoldX and after the mutation was introduced.
chain_modeller
The chain which contains the domain that was mutated in the model_filename_wt and the model_filename_mut structures.
mutation_modeller
The mutation that was introduced into the protein, in PDB RESNUM coordinates. This identifies the mutated residue in the model_filename_wt and the model_filename_mut structures.
stability_energy_wt

Comma-separated list of scores returned by FoldX for the wildtype protein. The comma-separated list can be converted into a DataFrame with each column clearly labelled using elaspic.predictor.format_mutation_features(). The FoldX energy terms are:

  • dg
  • backbone_hbond
  • sidechain_hbond
  • van_der_waals
  • electrostatics
  • solvation_polar
  • solvation_hydrophobic
  • van_der_waals_clashes
  • entropy_sidechain
  • entropy_mainchain
  • sloop_entropy
  • mloop_entropy
  • cis_bond
  • torsional_clash
  • backbone_clash
  • helix_dipole
  • water_bridge
  • disulfide
  • electrostatic_kon
  • partial_covalent_bonds
  • energy_ionisation
  • entropy_complex
  • number_of_residues
stability_energy_mut
Comma-separated list of scores returned by FoldX for the mutant protein. FoldX energy terms are the same as in stability_energy_wt, but for the mutated amino acid rather than the wildtype.
physchem_wt

Physicochemical properties describing the interaction of the wildtype residue with residues on the opposite chain. The terms are:

  • number of atoms in interacting residues that have the same charge.
  • number of atoms in interacting residues that have an opposite charge.
  • number of hydrogen bonds (very rough calculation).
  • number of carbons in interacting residues within 4 Å of the mutated residue (rough measure of the van der Waals force).
physchem_wt_ownchain
Physicochemical properties describing the interaction of the wildtype residue with residues on the same chain. The terms are the same as in physchem_wt.
physchem_mut
Physicochemical properties describing the interaction of the mutant residue with residues on the opposite chain. The terms are the same as in physchem_wt.
physchem_mut_ownchain
Physicochemical properties describing the interaction of the mutant residue with residues on the same chain. The terms are the same as in physchem_wt.
matrix_score
Score assigned to the wt -> mut transition by the BLOSUM substitution matrix.
secondary_structure_wt
Secondary structure of the wildtype residue predicted by stride.
solvent_accessibility_wt
Percent solvent accessible surface area of the wildtype residue, predicted by msms.
secondary_structure_mut
Secondary structure of the mutated residue predicted by stride.
solvent_accessibility_mut
Percent solvent accessible surface area of the mutated residue, predicted by msms.
provean_score
Score produced by Provean for this mutation.
ddg
Change in the Gibbs free energy of folding that our classifier predicts for this mutation.
mut_date_modified
Date and time that this row was last modified.
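
As an illustration of how the comma-separated stability_energy_wt and stability_energy_mut strings could be labelled, the sketch below pairs each value with a FoldX term name. It assumes that the values are stored in the same order as the terms listed for stability_energy_wt; it is not the implementation of elaspic.predictor.format_mutation_features(), and `label_foldx_scores` is a hypothetical helper.

```python
FOLDX_TERMS = [
    "dg", "backbone_hbond", "sidechain_hbond", "van_der_waals",
    "electrostatics", "solvation_polar", "solvation_hydrophobic",
    "van_der_waals_clashes", "entropy_sidechain", "entropy_mainchain",
    "sloop_entropy", "mloop_entropy", "cis_bond", "torsional_clash",
    "backbone_clash", "helix_dipole", "water_bridge", "disulfide",
    "electrostatic_kon", "partial_covalent_bonds", "energy_ionisation",
    "entropy_complex", "number_of_residues",
]


def label_foldx_scores(score_string):
    """Split a comma-separated score string and pair each value with its term."""
    values = [float(value) for value in score_string.split(",")]
    if len(values) != len(FOLDX_TERMS):
        raise ValueError(
            "expected {} values, got {}".format(len(FOLDX_TERMS), len(values))
        )
    return dict(zip(FOLDX_TERMS, values))


# Made-up numbers, used only to show the shape of the data.
example_string = ",".join(str(float(i)) for i in range(len(FOLDX_TERMS)))
labelled = label_foldx_scores(example_string)
```

The length check guards against silently mislabelling values when a string has a different number of terms than expected.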
UniprotDomainMutation.chain_modeller
UniprotDomainMutation.ddg
UniprotDomainMutation.matrix_score
UniprotDomainMutation.model
UniprotDomainMutation.model_filename_mut
UniprotDomainMutation.model_filename_wt
UniprotDomainMutation.mut_date_modified
UniprotDomainMutation.mutation
UniprotDomainMutation.mutation_errors
UniprotDomainMutation.mutation_modeller
UniprotDomainMutation.physchem_mut
UniprotDomainMutation.physchem_mut_ownchain
UniprotDomainMutation.physchem_wt
UniprotDomainMutation.physchem_wt_ownchain
UniprotDomainMutation.provean_score
UniprotDomainMutation.secondary_structure_mut
UniprotDomainMutation.secondary_structure_wt
UniprotDomainMutation.solvent_accessibility_mut
UniprotDomainMutation.solvent_accessibility_wt
UniprotDomainMutation.stability_energy_mut
UniprotDomainMutation.stability_energy_wt
UniprotDomainMutation.uniprot_domain_id
UniprotDomainMutation.uniprot_id

Potentially-interacting pairs of domains for proteins that are known to interact, according to Hippie, IRefIndex, and Rolland et al. 2014.

Columns:
uniprot_domain_pair_id
Unique id identifying each domain-domain interaction.
uniprot_domain_id_1
Unique id of the first domain.
uniprot_domain_id_2
Unique id of the second domain.
rigids
Phased out.
domain_contact_ids
List of unique ids identifying all domain-domain pairs in the PDB, where one domain belongs to the protein containing uniprot_domain_id_1 and the other domain belongs to the protein containing uniprot_domain_id_2. This was used as crystallographic evidence that the two proteins interact.
path_to_data
Location for storing homology models, mutation results, and all other data that is relevant to this domain pair. This path is prefixed by archive_dir.
UniprotDomainPair.domain_contact_ids
UniprotDomainPair.path_to_data
UniprotDomainPair.rigids
UniprotDomainPair.uniprot_domain_1
UniprotDomainPair.uniprot_domain_2
UniprotDomainPair.uniprot_domain_id_1
UniprotDomainPair.uniprot_domain_id_2
UniprotDomainPair.uniprot_domain_pair_id
UniprotDomainPair.uniprot_id_1
UniprotDomainPair.uniprot_id_2

Structural models of interactions between pairs of domains in the uniprot_domain_pair table.

Columns:
uniprot_domain_pair_id
Unique id identifying each domain-domain interaction.
model_errors
List of errors that occurred while making the homology model.
alignment_filename_1
Name of the file containing the alignment of the first domain with its structural template.
alignment_filename_2
Name of the file containing the alignment of the second domain with its structural template.
model_filename
Name of the file containing the homology model of the domain-domain interaction created by Modeller.
chain_1
Chain containing the first domain in the model specified by model_filename.
chain_2
Chain containing the second domain in the model specified by model_filename.
norm_dope
The normalized DOPE score of the model.
interface_area_hydrophobic
Hydrophobic surface area of the interface, calculated using POPS.
interface_area_hydrophilic
Hydrophilic surface area of the interface, calculated using POPS.
interface_area_total
Total surface area of the interface, calculated using POPS.
interface_dg
Gibbs free energy of binding for this domain-domain interaction, predicted using FoldX. Not implemented yet!
interacting_aa_1
List of amino acid positions in the first domain that are within 5 Å of the second domain. Positions are specified using uniprot coordinates.
interacting_aa_2
List of amino acid positions in the second domain that are within 5 Å of the first domain. Positions are specified using uniprot coordinates.
m_date_modified
Date and time that this row was last modified.
model_domain_def_1
Domain boundaries of the first domain that are covered by the Profs structural template.
model_domain_def_2
Domain boundaries of the second domain that are covered by the Profs structural template.
UniprotDomainPairModel.alignment_filename_1
UniprotDomainPairModel.alignment_filename_2
UniprotDomainPairModel.chain_1
UniprotDomainPairModel.chain_2
UniprotDomainPairModel.interacting_aa_1
UniprotDomainPairModel.interacting_aa_2
UniprotDomainPairModel.interface_area_hydrophilic
UniprotDomainPairModel.interface_area_hydrophobic
UniprotDomainPairModel.interface_area_total
UniprotDomainPairModel.interface_dg
UniprotDomainPairModel.m_date_modified
UniprotDomainPairModel.model_domain_def_1
UniprotDomainPairModel.model_domain_def_2
UniprotDomainPairModel.model_errors
UniprotDomainPairModel.model_filename
UniprotDomainPairModel.norm_dope
UniprotDomainPairModel.template
UniprotDomainPairModel.uniprot_domain_pair_id

Characterization of interface mutations introduced into structures in the uniprot_domain_pair_model table.

Columns:
uniprot_id
Uniprot ID of the protein that is being mutated.
uniprot_domain_pair_id
Unique id identifying each domain-domain interaction.
mutation
Mutation for which the \(\Delta \Delta G\) score is being predicted, specified in Uniprot coordinates.
mutation_errors
List of errors obtained when evaluating the impact of the mutation.
model_filename_wt
Filename of the homology model relaxed by FoldX but containing the wildtype residue.
model_filename_mut
Filename of the homology model relaxed by FoldX and containing the mutated residue.
chain_modeller
Chain containing the domain that was mutated, in homology models specified by model_filename_wt and model_filename_mut.
mutation_modeller
Mutation for which the \(\Delta \Delta G\) score is being predicted, specified in PDB RESNUM coordinates.
analyse_complex_energy_wt
Comma-separated list of FoldX scores describing the effect of the wildtype residue on the protein-protein interaction interface.
stability_energy_wt
Comma-separated list of FoldX scores describing the effect of the wildtype residue on the stability of the protein domain.
analyse_complex_energy_mut
Comma-separated list of FoldX scores describing the effect of the mutated residue on the protein-protein interaction interface.
stability_energy_mut
Comma-separated list of FoldX scores describing the effect of the mutated residue on the stability of the protein domain.
physchem_wt
Comma-separated list of physicochemical properties describing the interaction between the wildtype residue and other residues on the opposite chain.
physchem_wt_ownchain
Comma-separated list of physicochemical properties describing the interaction between the wildtype residue and other residues on the same chain.
physchem_mut
Comma-separated list of physicochemical properties describing the interaction between the mutated residue and other residues on the opposite chain.
physchem_mut_ownchain
Comma-separated list of physicochemical properties describing the interaction between the mutated residue and other residues on the same chain.
matrix_score
Score assigned to the wt -> mut transition by the BLOSUM substitution matrix.
secondary_structure_wt
Secondary structure of the wildtype residue, predicted by stride.
solvent_accessibility_wt
Percent solvent accessible surface area of the wildtype residue, predicted by msms.
secondary_structure_mut
Secondary structure of the mutated residue, predicted by stride.
solvent_accessibility_mut
Percent solvent accessible surface area of the mutated residue, predicted by msms.
contact_distance_wt
Shortest distance between the wildtype residue and a residue on the opposite chain.
contact_distance_mut
Shortest distance between the mutated residue and a residue on the opposite chain.
provean_score
Provean score for this mutation.
ddg
Predicted change in Gibbs free energy of binding caused by this mutation.
mut_date_modified
Date and time when this row was last modified.
UniprotDomainPairMutation.analyse_complex_energy_mut
UniprotDomainPairMutation.analyse_complex_energy_wt
UniprotDomainPairMutation.chain_modeller
UniprotDomainPairMutation.contact_distance_mut
UniprotDomainPairMutation.contact_distance_wt
UniprotDomainPairMutation.ddg
UniprotDomainPairMutation.matrix_score
UniprotDomainPairMutation.model
UniprotDomainPairMutation.model_filename_mut
UniprotDomainPairMutation.model_filename_wt
UniprotDomainPairMutation.mut_date_modified
UniprotDomainPairMutation.mutation
UniprotDomainPairMutation.mutation_errors
UniprotDomainPairMutation.mutation_modeller
UniprotDomainPairMutation.physchem_mut
UniprotDomainPairMutation.physchem_mut_ownchain
UniprotDomainPairMutation.physchem_wt
UniprotDomainPairMutation.physchem_wt_ownchain
UniprotDomainPairMutation.provean_score
UniprotDomainPairMutation.secondary_structure_mut
UniprotDomainPairMutation.secondary_structure_wt
UniprotDomainPairMutation.solvent_accessibility_mut
UniprotDomainPairMutation.solvent_accessibility_wt
UniprotDomainPairMutation.stability_energy_mut
UniprotDomainPairMutation.stability_energy_wt
UniprotDomainPairMutation.uniprot_domain_pair_id
UniprotDomainPairMutation.uniprot_id

Structural templates for pairs of domains in the uniprot_domain_pair table.

Columns:
uniprot_domain_pair_id
Unique id identifying each domain-domain interaction.
domain_contact_id
Unique id of the domain pair in the domain_contact table that was used as a template for the modelled domain pair.
cath_id_1
Unique id of the structural template for the first domain.
cath_id_2
Unique id of the structural template for the second domain.
identical_1
Fraction of residues in the Blast alignment of the first domain to its template that are identical.
conserved_1
Fraction of residues in the Blast alignment of the first domain to its template that are conserved.
coverage_1
Fraction of the first domain that is covered by the Blast alignment.
score_1
Score obtained by multiplying identical_1 by coverage_1.
identical_if_1
Fraction of interface residues [1] that are identical in the Blast alignment of the first domain.
conserved_if_1
Fraction of interface residues [1] that are conserved in the Blast alignment of the first domain.
coverage_if_1
Fraction of interface residues [1] that are covered by the Blast alignment of the first domain.
score_if_1
Score obtained by combining identical_if_1 and coverage_if_1 using (1).
identical_2
Fraction of residues in the Blast alignment of the second domain to its template that are identical.
conserved_2
Fraction of residues in the Blast alignment of the second domain to its template that are conserved.
coverage_2
Fraction of the second domain that is covered by the Blast alignment.
score_2
Score obtained by multiplying identical_2 by coverage_2.
identical_if_2
Fraction of interface residues [1] that are identical in the Blast alignment of the second domain.
conserved_if_2
Fraction of interface residues [1] that are conserved in the Blast alignment of the second domain.
coverage_if_2
Fraction of interface residues [1] that are covered by the Blast alignment of the second domain.
score_if_2
Score obtained by combining identical_if_2 and coverage_if_2 using (1).
score_total
The product of score_1 and score_2.
score_if_total
The product of score_if_1 and score_if_2.
score_overall
The product of score_total and score_if_total. This is the score that was used to select the best Profs domain pair to be used as a template.
t_date_modified
The date and time when this row was last updated.
template_errors
List of errors that occurred while looking for the structural template.
[1] Interface residues are defined as residues that are within 5 Å of the partner domain.
UniprotDomainPairTemplate.cath_id_1
UniprotDomainPairTemplate.cath_id_2
UniprotDomainPairTemplate.conserved_1
UniprotDomainPairTemplate.conserved_2
UniprotDomainPairTemplate.conserved_if_1
UniprotDomainPairTemplate.conserved_if_2
UniprotDomainPairTemplate.coverage_1
UniprotDomainPairTemplate.coverage_2
UniprotDomainPairTemplate.coverage_if_1
UniprotDomainPairTemplate.coverage_if_2
UniprotDomainPairTemplate.domain_1
UniprotDomainPairTemplate.domain_2
UniprotDomainPairTemplate.domain_contact
UniprotDomainPairTemplate.domain_contact_id
UniprotDomainPairTemplate.domain_pair
UniprotDomainPairTemplate.identical_1
UniprotDomainPairTemplate.identical_2
UniprotDomainPairTemplate.identical_if_1
UniprotDomainPairTemplate.identical_if_2
UniprotDomainPairTemplate.score_1
UniprotDomainPairTemplate.score_2
UniprotDomainPairTemplate.score_if_1
UniprotDomainPairTemplate.score_if_2
UniprotDomainPairTemplate.score_if_total
UniprotDomainPairTemplate.score_overall
UniprotDomainPairTemplate.score_total
UniprotDomainPairTemplate.t_date_modified
UniprotDomainPairTemplate.template_errors
UniprotDomainPairTemplate.uniprot_domain_pair_id

Structural templates for domains in the uniprot_domain table. Lists PDB crystal structures that will be used for making homology models.

Columns:
uniprot_domain_id
An integer which uniquely identifies each uniprot domain in the uniprot_domain table.
template_errors
List of errors that occurred during the process for finding the template.
cath_id
The unique id identifying the structural template of the domain.
domain_start
The Uniprot position of the first amino acid of the Profs domain.
domain_end
The Uniprot position of the last amino acid of the Profs domain.
domain_def
Profs domain definitions for domains with structural templates. Domain definitions in this column are different from domain definitions in the alignment_def column of the uniprot_domain table in that they have been expanded to match domain boundaries of the Profs structural template, identified by the cath_id.
alignment_identity
Percent identity of the domain to its structural template.
alignment_coverage
Percent coverage of the domain to its structural template.
alignment_score

A score obtained by combining alignment_identity (\(SeqId\)) and alignment_coverage (\(Cov\)) using the following equation, as described by Mosca et al.:

\[Score = 0.95 \cdot \frac{SeqId}{100} \cdot \frac{Cov}{100} + 0.05 \cdot \frac{Cov}{100} \tag{1}\]
t_date_modified
The date and time when this row was last modified.
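
The alignment_score formula in equation (1) can be sketched in a few lines. This is only an illustration of the arithmetic; the function name and the example values are assumptions and are not taken from the ELASPIC source.

```python
def alignment_score(seq_id: float, cov: float) -> float:
    """Combine percent identity and percent coverage as in equation (1).

    Both arguments are percentages in the range 0-100.
    """
    identity = seq_id / 100.0
    coverage = cov / 100.0
    return 0.95 * identity * coverage + 0.05 * coverage


# Hypothetical example values: 40% identity over 80% coverage.
print(alignment_score(40.0, 80.0))
```

Because coverage multiplies both terms, a template with zero coverage always scores zero, regardless of its sequence identity.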
UniprotDomainTemplate.alignment_coverage
UniprotDomainTemplate.alignment_identity
UniprotDomainTemplate.alignment_score
UniprotDomainTemplate.cath_id
UniprotDomainTemplate.domain
UniprotDomainTemplate.domain_def
UniprotDomainTemplate.domain_end
UniprotDomainTemplate.domain_start
UniprotDomainTemplate.t_date_modified
UniprotDomainTemplate.template_errors
UniprotDomainTemplate.uniprot_domain
UniprotDomainTemplate.uniprot_domain_id

Protein sequences from the Uniprot KB, obtained by parsing uniprot_sprot_fasta.gz, uniprot_trembl_fasta.gz, and homo_sapiens_variation.txt files from the Uniprot ftp site.

Columns:
db
The database to which the protein sequence belongs. Possible values are sp for SwissProt and tr for TrEMBL.
uniprot_id
The uniprot id of the protein.
uniprot_name
The uniprot name of the protein.
protein_name
The protein name.
organism_name
Name of the organism in which this protein is found.
gene_name
Name of the gene that codes for this protein sequence.
protein_existence

Evidence for the existence of the protein:

  1. Experimental evidence at protein level
  2. Experimental evidence at transcript level
  3. Protein inferred from homology
  4. Protein predicted
  5. Protein uncertain
sequence_version
Version of the protein amino acid sequence.
uniprot_sequence
Amino acid sequence of the protein.
UniprotSequence.db
UniprotSequence.gene_name
UniprotSequence.organism_name
UniprotSequence.protein_existence
UniprotSequence.protein_name
UniprotSequence.sequence_version
UniprotSequence.uniprot_id
UniprotSequence.uniprot_name
UniprotSequence.uniprot_sequence
elaspic.elaspic_database_tables.get_db_specific_param(key)[source]
elaspic.elaspic_database_tables.get_table_args(table_name, index_columns=[], db_specific_params=[])[source]

Returns a tuple of additional table arguments.

elaspic.elaspic_model module

elaspic.elaspic_predictor module

Created on Wed Sep 30 16:54:21 2015

@author: strokach

Predictor.feature_name_conversion = {'normDOPE': 'norm_dope', 'seq_id_avg': 'alignment_identity'}
Predictor.score(df, core_or_interface)[source]
Parameters:df (DataFrame) – One or more rows with all the data required to predict the \(\Delta \Delta G\) score, such as the result of joining the appropriate rows in the database.
Returns:df – Same as the input DataFrame, but with one additional column: ddg.
Return type:DataFrame
elaspic.elaspic_predictor.convert_features_to_differences(df, keep_mut=False)[source]

Creates a new set of features (ending in _change) that describe the difference between values of the wildtype (features ending in _wt) and mutant (features ending in _mut) features. If keep_mut is False, removes all mutant features (features ending in _mut).
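
The behaviour described above can be sketched with a plain dict standing in for one DataFrame row. This is not the library's implementation: the helper name, the feature names, and the sign convention (mutant minus wildtype) are all assumptions made for illustration.

```python
def features_to_differences(row: dict, keep_mut: bool = False) -> dict:
    """Add a `<name>_change` entry for every `<name>_wt` / `<name>_mut` pair."""
    out = dict(row)
    for name, wt_value in row.items():
        if not name.endswith("_wt"):
            continue
        base = name[: -len("_wt")]
        mut_name = base + "_mut"
        if mut_name in row:
            # Assumed sign convention: mutant minus wildtype.
            out[base + "_change"] = row[mut_name] - wt_value
            if not keep_mut:
                del out[mut_name]
    return out


# Hypothetical feature names.
row = {"dg_wt": 1.5, "dg_mut": 2.0, "norm_dope": -1.0}
print(features_to_differences(row))
print(features_to_differences(row, keep_mut=True))
```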

elaspic.elaspic_predictor.format_mutation_features(feature_df, core_or_interface)[source]

Converts columns containing comma-separated lists of FoldX features and physicochemical features into a DataFrame where each feature has its own column.

Parameters:
Returns:Contains the same data as feature_df, but with columns containing comma-separated lists of features converted to columns containing a single feature each.
Return type:DataFrame
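
As a rough illustration of the conversion described above, the sketch below expands one comma-separated column of a single row (represented as a dict) into one entry per feature. The helper name, the column name, and the feature names are hypothetical; the real function operates on a DataFrame and its column layout is not shown in this reference.

```python
def split_feature_column(row: dict, column: str, names: list) -> dict:
    """Replace one comma-separated column with one entry per feature."""
    values = [float(v) for v in row[column].split(",")]
    if len(values) != len(names):
        raise ValueError(
            "expected %d values, got %d" % (len(names), len(values))
        )
    # Keep every other column unchanged.
    out = {key: value for key, value in row.items() if key != column}
    out.update(zip(names, values))
    return out


# Hypothetical column and feature names.
row = {"mutation": "A16G", "stability_energy_wt": "1.5,-0.2,3.0"}
names = ["dg", "backbone_hbond", "sidechain_hbond"]
wide = split_feature_column(row, "stability_energy_wt", names)
print(wide)
```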

elaspic.elaspic_sequence module

Class for calculating sequence level features.

Sequence.mutate(mutation)[source]
Sequence.provean_supset_exists
Sequence.provean_supset_file
Sequence.result
Sequence.score_pairwise(seq1, seq2, matrix=None, gap_s=None, gap_e=None)[source]

Get the BLOSUM (or whatever matrix is given) score.
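
A minimal sketch of how two pre-aligned sequences might be scored with a substitution matrix and gap penalties follows. It assumes that gap_s is the penalty for opening a gap and gap_e the penalty for extending one; the helper name and the three-entry toy matrix are illustrative assumptions, not the method's actual code or defaults.

```python
def score_aligned_pair(seq1, seq2, matrix, gap_s, gap_e):
    """Sum substitution scores over two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            # First gap position pays the opening penalty, later ones extend.
            score += gap_e if in_gap else gap_s
            in_gap = True
        else:
            # The toy matrix stores each unordered pair only once.
            score += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
            in_gap = False
    return score


# Toy matrix for illustration only.
toy_matrix = {("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6}
print(score_aligned_pair("AG-A", "AGGA", toy_matrix, gap_s=-5, gap_e=-1))
```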

elaspic.elaspic_sequence.convert_basestring_to_seqrecord(sequence, sequence_id='id')[source]
elaspic.elaspic_sequence.download_uniport_sequence(uniprot_id, output_dir)[source]

elaspic.errors module

exception elaspic.errors.AlignmentNotFoundError(save_path, alignment_filename)[source]

Bases: Exception

exception elaspic.errors.Archive7zipError(result, error_message, return_code)[source]

Bases: Exception

exception elaspic.errors.Archive7zipFileNotFoundError(result, error_message, return_code)[source]

Bases: elaspic.errors.Archive7zipError
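
Because Archive7zipFileNotFoundError derives from Archive7zipError, a handler for the base class also catches the subclass. The sketch below uses stand-in classes that mirror the documented constructor signature; the stored attributes, messages, and return codes are assumptions made for illustration.

```python
class Archive7zipError(Exception):
    """Stand-in with the documented (result, error_message, return_code) signature."""

    def __init__(self, result, error_message, return_code):
        super().__init__(error_message)
        self.result = result
        self.error_message = error_message
        self.return_code = return_code


class Archive7zipFileNotFoundError(Archive7zipError):
    """Stand-in subclass, as in the documented hierarchy."""


def extract(missing: bool):
    # Hypothetical failure modes with made-up return codes.
    if missing:
        raise Archive7zipFileNotFoundError("", "file not found in archive", 2)
    raise Archive7zipError("", "archive could not be read", 1)


def handle(missing: bool) -> str:
    try:
        extract(missing)
    except Archive7zipFileNotFoundError:
        return "missing"  # the more specific handler must come first
    except Archive7zipError:
        return "other"
    return "ok"


print(handle(True), handle(False))
```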

exception elaspic.errors.ChainsNotInteractingError[source]

Bases: Exception

exception elaspic.errors.DataError[source]

Bases: Exception

exception elaspic.errors.FoldXAAMismatchError[source]

Bases: Exception

exception elaspic.errors.FoldxError[source]

Bases: Exception

exception elaspic.errors.InterfaceMismatchError[source]

Bases: Exception

exception elaspic.errors.LowIdentity[source]

Bases: Exception

exception elaspic.errors.MSMSError[source]

Bases: Exception

exception elaspic.errors.ModelHasMutationsError[source]

Bases: Exception

Don’t delete a model that has precalculated mutations!

exception elaspic.errors.ModellerError[source]

Bases: Exception

exception elaspic.errors.MutationMismatchError[source]

Bases: Exception

exception elaspic.errors.MutationOutsideDomainError[source]

Bases: Exception

exception elaspic.errors.MutationOutsideInterfaceError[source]

Bases: Exception

exception elaspic.errors.NoModelFoundError[source]

Bases: Exception

exception elaspic.errors.NoSequenceFound[source]

Bases: Exception

exception elaspic.errors.NoTemplatesFoundError[source]

Bases: Exception

exception elaspic.errors.PDBChainError[source]

Bases: Exception

exception elaspic.errors.PDBDomainDefsError[source]

Bases: Exception

PDB domain definitions not found in the pdb file

exception elaspic.errors.PDBEmptySequenceError[source]

Bases: Exception

One of the sequences is missing from the alignment. The most likely cause is that the alignment domain definitions were incorrect.

exception elaspic.errors.PDBError[source]

Bases: Exception

exception elaspic.errors.PDBNotFoundError[source]

Bases: Exception

exception elaspic.errors.PopsError(message, pdb, chains)[source]

Bases: Exception

exception elaspic.errors.ProteinDefinitionError[source]

Bases: Exception

exception elaspic.errors.ProveanError[source]

Bases: Exception

exception elaspic.errors.ProveanResourceError(message, child_process_group_id)[source]

Bases: Exception

exception elaspic.errors.ResourceError[source]

Bases: Exception

exception elaspic.errors.TcoffeeBlastError(result, error, alignInFile, system_command)[source]

Bases: Exception

exception elaspic.errors.TcoffeeError(result, error, alignInFile, system_command)[source]

Bases: Exception

exception elaspic.errors.TcoffeePDBidError(result, error, alignInFile, system_command)[source]

Bases: Exception

exception elaspic.errors.TemplateCoreError[source]

Bases: Exception

exception elaspic.errors.TemplateInterfaceError[source]

Bases: Exception

exception elaspic.errors.WrongConfigKeyError[source]

Bases: Exception

elaspic.helper module

A class for collecting all the print statements from modeller so that they can be redirected to the logger later on.

WritableObject.write(string)[source]
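
The idea of collecting print output in a file-like object can be sketched as follows. This stand-in class is an assumption about how such a collector might look (the `content` attribute in particular is hypothetical); only the write(string) method is documented above.

```python
import contextlib


class WritableObject:
    """Minimal file-like object that stores everything written to it."""

    def __init__(self):
        self.content = []

    def write(self, string):
        self.content.append(string)

    def flush(self):
        # Nothing is buffered, so there is nothing to flush.
        pass


buffer = WritableObject()
with contextlib.redirect_stdout(buffer):
    print("modeller output line")
    print("another line")

# The collected text can later be passed to a logger line by line.
lines = "".join(buffer.content).splitlines()
```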
color.BLUE = '\x1b[94m'
color.BOLD = '\x1b[1m'
color.CYAN = '\x1b[96m'
color.DARKCYAN = '\x1b[36m'
color.END = '\x1b[0m'
color.FAIL = '\x1b[91m'
color.GREEN = '\x1b[92m'
color.HEADER = '\x1b[95m'
color.OKBLUE = '\x1b[94m'
color.OKGREEN = '\x1b[92m'
color.PURPLE = '\x1b[95m'
color.RED = '\x1b[91m'
color.UNDERLINE = '\x1b[4m'
color.WARNING = '\x1b[93m'
color.YELLOW = '\x1b[93m'
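
The constants above are ANSI escape sequences: a start code switches a style on, and END resets the terminal. A minimal usage sketch follows, with the values copied from the list above and a hypothetical helper named colorize.

```python
# Values copied from the color constants listed above.
BOLD = "\x1b[1m"
RED = "\x1b[91m"
END = "\x1b[0m"


def colorize(text: str, *codes: str) -> str:
    """Wrap text in the given ANSI codes and reset the style afterwards."""
    return "".join(codes) + text + END


print(colorize("warning: low sequence identity", BOLD, RED))
```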
elaspic.helper.decode_aa_list(interface_aa)[source]
elaspic.helper.decode_domain_def(domains, merge=True, return_string=False)[source]

Unlike split_domain(), this function returns a tuple of tuples of strings, preserving letter numbering (e.g. 10B)

elaspic.helper.decode_text_as_list(list_string)[source]

Uses the database convention to decode a string, describing domain boundaries of multiple domains, as a list of lists.

elaspic.helper.encode_domain(domains, merged=True)[source]
elaspic.helper.encode_list_as_text(list_of_lists)[source]

Uses the database convention to encode a list of lists, describing domain boundaries of multiple domains, as a string.

elaspic.helper.get_echo(system_constant)[source]
elaspic.helper.get_hostname()[source]
elaspic.helper.get_lock(name)[source]
elaspic.helper.get_logger(do_debug=True, logger_filename=None)[source]
elaspic.helper.get_path_to_current_file()[source]

Find the location of the file that is being executed

elaspic.helper.get_username()[source]
elaspic.helper.get_which(bin_name)[source]
elaspic.helper.kill_child_process(child_process)[source]
elaspic.helper.lock(fn)[source]

Allow only a single instance of function fn, and save results to a lock file.

elaspic.helper.log_print_statements(logger)[source]

Channel print statements to the debug logger

elaspic.helper.make_tarfile(output_filename, source_dir)[source]
elaspic.helper.open_exclusively(filename, mode='a')[source]
elaspic.helper.print_heartbeets()[source]
elaspic.helper.row2dict(row)[source]
elaspic.helper.run_subprocess(system_command, **popen_argvars)[source]
elaspic.helper.run_subprocess_locally(working_path, system_command, **popen_argvars)[source]
elaspic.helper.slugify(filename_string)[source]
elaspic.helper.subprocess_check_output(system_command, **popen_argvars)[source]
elaspic.helper.subprocess_check_output_locally(working_path, system_command, **popen_argvars)[source]
elaspic.helper.subprocess_communicate(child_process)[source]
elaspic.helper.switch_paths(working_path)[source]
elaspic.helper.underline(print_string)[source]

elaspic.local_pipeline module

elaspic.machine_learning module

elaspic.machine_learning.cross_validate_predictor(data, features, clf_options, output_filename=None)[source]
elaspic.machine_learning.get_final_predictor(data, features, options)[source]

Train a predictor using the entire dataset.

elaspic.machine_learning.write_row_to_file(results, output_filename)[source]

TODO: Add a datetime column to each written row.

elaspic.pipeline module

Foo.run()[source]
elaspic.pipeline.execute_and_remember(f)[source]

Some basic memoizer.
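
For readers unfamiliar with the pattern, a basic memoizer can be sketched like this. It is a generic illustration under the assumption that arguments are hashable and positional, and is not the code of execute_and_remember itself; the names remember and square are hypothetical.

```python
import functools


def remember(f):
    """Cache results of f, keyed by its positional arguments."""
    cache = {}

    @functools.wraps(f)
    def wrapper(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]

    return wrapper


calls = []


@remember
def square(x):
    calls.append(x)  # record every real execution
    return x * x


square(3)
square(3)  # served from the cache, so `calls` is not extended
```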

elaspic.pipeline.lock(fn)[source]

Allow only a single instance of function fn, and save results to a lock file.

elaspic.structure_analysis module

Runs the program POPS to calculate the interface size of the complexes. This is done by calculating the surface area of the complex and of the separated parts; the interface is then given by the difference between the two.
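
The subtraction described above can be written out as a short sketch. The function name and the surface-area values are hypothetical, and whether the result is additionally halved (a convention used by some tools) is not stated in this reference, so the sketch does not do so.

```python
def interface_area(sasa_chain_1: float, sasa_chain_2: float, sasa_complex: float) -> float:
    """Surface area buried on complex formation.

    The separated chains expose more surface than the complex; the
    difference is the area hidden in the interface.
    """
    return sasa_chain_1 + sasa_chain_2 - sasa_complex


# Hypothetical surface areas in square Angstroms.
print(interface_area(5000.0, 4000.0, 8000.0))
```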

AnalyzeStructure.get_dssp()[source]

Not used because crashes on server.

AnalyzeStructure.get_interchain_distances(pdb_chain=None, pdb_mutation=None, cutoff=None)[source]
AnalyzeStructure.get_interface_area(chain_ids)[source]
AnalyzeStructure.get_physi_chem(chain_id, mutation)[source]

Return the atomic contact vector, that is, counting how many interactions between charged, polar or “carbon” residues there are. The “carbon” interactions give you information about the Van der Waals packing of the residues. Comparing the wildtype vs. the mutant values is used in the machine learning algorithm.

‘mutation’ is of the form ‘A16’, where A is the chain identifier and 16 is the residue number (in PDB numbering) of the mutation. chainIDs is a list of strings with the chain identifiers to be used. If more than two chains are given, the chains not containing the mutation are considered as the “opposing” chains.

AnalyzeStructure.get_sasa(program_to_use='pops')[source]

Get Solvent Accessible Surface Area scores.

Note

deprecated

Use get_seasa() instead.

AnalyzeStructure.get_seasa()[source]
AnalyzeStructure.get_secondary_structure()[source]
AnalyzeStructure.get_stride()[source]
AnalyzeStructure.get_structure_file(chains, ext='.pdb')[source]
AnalyzeStructure.working_dir = None

Folder with all the binaries (i.e. ./analyze_structure)

elaspic.structure_tools module

MMCIFParserMod.get_structure(structure_id, gzip_fh)[source]

Altered get_structure method which accepts gzip file handles as input.

Only accept the specified chains when saving.

SelectChains.accept_residue(residue)[source]
elaspic.structure_tools.pdb_id

___

elaspic.structure_tools.domain_boundaries

list of lists of lists

Elements in the outer list correspond to domains in each chain of the pdb. Elements of the inner list contain the start and end of each fragment of each domain. For example, if there is only one chain with pdb domain boundaries 1-10:20-45, this would correspond to domain_boundaries [[[1,10],[20,45]]].
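
The mapping from a domain definition string such as 1-10:20-45 to the nested list format can be sketched as follows. The helper is hypothetical and deliberately simple: it assumes purely numeric, non-negative residue numbers and does not handle insertion codes such as 10B.

```python
def parse_domain_def(domain_def: str) -> list:
    """Turn 'start-end:start-end' into [[start, end], [start, end]]."""
    fragments = []
    for fragment in domain_def.split(":"):
        start, end = fragment.split("-")
        fragments.append([int(start), int(end)])
    return fragments


# One entry per chain; here there is a single chain with two fragments.
domain_boundaries = [parse_domain_def(d) for d in ["1-10:20-45"]]
print(domain_boundaries)
```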

StructureParser.extract()[source]

Extract the wanted chains out of the PDB file. Removes water atoms and selects the domain regions (i.e. selects only those parts of the domain that are within the domain boundaries specified).

StructureParser.get_chain_seqres_sequence(chain_id, *args, **varargs)[source]

Call get_chain_seqres_sequence using chain with id chain_id

StructureParser.get_chain_sequence_and_numbering(chain_id, *args, **varargs)[source]

Call get_chain_sequence_and_numbering using chain with id chain_id

StructureParser.save_sequences(output_dir='')[source]
StructureParser.save_structure(output_dir='', remove_disordered=False)[source]
elaspic.structure_tools.calculate_distance(atom_1, atom_2, cutoff=None)[source]

Calculate the distance between two points in 3D space.

Parameters:cutoff (float, optional) – The maximum distance allowable between two points.
elaspic.structure_tools.chain_is_hetatm(chain)[source]

Return True if the chain is made up entirely of HETATMs.

elaspic.structure_tools.convert_aa(aa, quiet=False)[source]

Convert amino acids from three letter code to one letter code or vice versa

Note

Deprecated!

Use ''.join(AAA_DICT[aaa] for aaa in aa) and ''.join(A_DICT[a] for a in aa).
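
The two replacement expressions can be tried out with small stand-in dictionaries. The real AAA_DICT and A_DICT are not reproduced in this reference; the three-entry mappings below are assumptions that only illustrate the direction of each lookup.

```python
# Stand-ins for the module's dictionaries (only three residues shown).
AAA_DICT = {"MET": "M", "LYS": "K", "GLY": "G"}  # three-letter -> one-letter
A_DICT = {one: three for three, one in AAA_DICT.items()}  # one-letter -> three-letter

three_letter = ["MET", "LYS", "GLY"]
one_letter = "".join(AAA_DICT[aaa] for aaa in three_letter)
back_again = "".join(A_DICT[a] for a in one_letter)

print(one_letter, back_again)
```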

elaspic.structure_tools.convert_position_to_resid(chain, positions, domain_def_tuple=None)[source]

Convert mutation_domain to mutation_modeller. In mutation_modeller, the first amino acid in a chain may start with something other than 1.

elaspic.structure_tools.convert_resnum_alphanumeric_to_numeric(resnum)[source]

Convert residue numbering that has letters (e.g. 1A, 1B, 1C...) to residue numbering without letters (e.g. 1, 2, 3...).

Note

Deprecated!

Use get_chain_sequence_and_numbering().

elaspic.structure_tools.download_pdb_file(pdb_id, output_dir)[source]

Move PDB structure to the local working directory.

elaspic.structure_tools.euclidean_distance(a, b)[source]

Calculate the Euclidean distance between two lists or tuples of arbitrary length.
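
A generic sketch of the calculation for sequences of arbitrary (but equal) length is shown below. It illustrates the formula only and is not copied from the module; the length check is an added assumption.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between a and b."""
    if len(a) != len(b):
        raise ValueError("both points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```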

elaspic.structure_tools.get_aa_residues(chain)[source]
elaspic.structure_tools.get_chain_seqres_sequence(chain, aa_only=False)[source]

Get the amino acid sequence for the construct coding for the given chain.

Extracts a sequence from a PDB file. Useful when interested in the sequence that was used for crystallization and not the ATOM sequence.

Parameters:aa_only (bool) – If aa_only is set to False, selenomethionines will be included in the sequence. See: http://biopython.org/DIST/docs/api/Bio.PDB.Polypeptide-module.html.
elaspic.structure_tools.get_chain_sequence_and_numbering(chain, domain_def_tuple=None, include_hetatms=False)[source]

Get the amino acid sequence and a list of residue ids for the given chain.

Parameters:chain (Bio.PDB.Chain.Chain) – The chain for which to get the amino acid sequence and numbering.
elaspic.structure_tools.get_interacting_residues(model, r_cutoff=5, skip_hetatm_chains=True)[source]

Returns all interactions between residues on different chains in model.

Returns:A dictionary of interactions between chains i (0..n-1) and j (i+1..n). Keys are (chain_idx, chain_id, residue_idx, residue_resnum, residue_amino_acid) tuples. (e.g. (0, ‘A’, 0, ‘0’, ‘M’), (0, 1, ‘2’, ‘K’), ...) Values are a list of tuples having the same format as the keys.
Return type:dict

You can reverse the order of keys and values like this:

complement = dict()
for key, values in get_interacting_residues(model).items():
    for value in values:
        complement.setdefault(value, set()).add(key)

You can get a list of all interacting chains using this command:

{(key[0], value[0])
 for (key, values) in get_interacting_residues(model).items()
 for value in values}
elaspic.structure_tools.get_interactions(model, chain_id, r_cutoff=6)[source]
elaspic.structure_tools.get_interactions_between_chains(model, chain_id_1, chain_id_2, r_cutoff=6)[source]

Calculate interactions between residues in chain_id_1 and chain_id_2. An interaction is defined as a pair of residues where at least one pair of atoms is closer than r_cutoff. The default value for r_cutoff is 6 Angstroms.

Deprecated since version 1.0: Use get_interacting_residues() instead. It gives you both the residue index and the resnum.

Returns:Keys are (residue_number, residue_amino_acid) tuples (e.g. (‘0’, ‘M’), (‘1’, ‘Q’), ...). Values are lists of (residue_number, residue_amino_acid) tuples. (e.g. [(‘0’, ‘M’), (‘1’, ‘Q’), ...]).
Return type:OrderedDict
elaspic.structure_tools.get_interactions_between_chains_slow(model, pdb_chain_1, pdb_chain_2, r_cutoff=5)[source]

Calculate interactions between residues in pdb_chain_1 and pdb_chain_2. An interaction is defined as a pair of residues where at least one pair of atoms is closer than r_cutoff. The default value for r_cutoff is 5 Angstroms.

Deprecated since version 1.0: Use get_interacting_residues() instead. It gives you both the residue index and the resnum.
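
The interaction criterion used by these functions (at least one pair of atoms closer than r_cutoff) can be sketched with bare coordinate tuples. The helper name and the coordinates are hypothetical; the real functions work on Biopython structure objects rather than tuples.

```python
import math


def residues_interact(atoms_1, atoms_2, r_cutoff=5.0):
    """Return True if any atom of one residue is closer than r_cutoff
    (in Angstroms) to any atom of the other residue."""
    return any(
        math.dist(a, b) < r_cutoff
        for a in atoms_1
        for b in atoms_2
    )


# Hypothetical atom coordinates for two residues.
residue_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residue_b = [(0.0, 0.0, 4.0), (0.0, 0.0, 9.0)]
print(residues_interact(residue_a, residue_b))
```

A brute-force comparison like this is quadratic in the number of atoms; it is meant to show the definition, not an efficient search.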

elaspic.structure_tools.get_pdb[source]

Parse a pdb file with Biopython's PDBParser() and return the structure.

Parameters:
  • pdb_code (str) – Four letter code of the PDB file
  • pdb_path (str) – Biopython pdb structure
  • temp_dir (str, optional, default=’/tmp/’) – Path to the folder for storing temporary files
  • pdb_type (‘ent’/’pdb’/’cif’, optional, default=’ent’) – The extension of the pdb to use
Raises:

PDBNotFoundError – If the pdb file could not be retrieved from the local (and remote) databases

elaspic.structure_tools.get_pdb_file(pdb_id, pdb_database_dir, pdb_type='ent')[source]

Get PDB file from a local mirror of the PDB database.

elaspic.structure_tools.get_pdb_id(pdb_file)[source]
elaspic.structure_tools.get_pdb_parser(pdb_type, temp_dir='/tmp')[source]

Get PDB parser that can work with structures of the specified type.

elaspic.structure_tools.get_pdb_structure(pdb_file, pdb_id=None)[source]

Set QUIET to False to output warnings like incomplete chains etc.

elaspic.structure_tools.get_structure_sequences(file_or_structure, seqres_sequence=False)[source]

Convenience function returning a dictionary of sequences for a given file or a Biopython Structure, Model or Chain.

elaspic.structure_tools.suppress_logger(fn)[source]

Module contents