Store.py: A database interface for storing MD evaluation data
A database interface to store evaluations of MD simulations.
Usage
The package provides interfaces to query records from the database (the get() function) and to put new or updated records into the database (the update() function). Both are explained briefly below.
Automated Evaluation: eval.yaml
The package provides a standardized way of running MD evaluations. These evaluations are defined in an eval.yaml file located in the respective simulation directory. An example of such a yaml file:
```yaml
namespace: analyse.biwater
simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt
trajectory-open: open
evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```
The first line defines a namespace, which is used to locate evaluation functions (see below). The key simulation_params defines parameters of the simulation, which are used when storing the results in the database. Parameters like the directory and the user are determined automatically from the file path and the current user, respectively. With the key trajectory-open, a function can be specified that is used to open the trajectory. If this is omitted, the function store.eval.open will be used.
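For illustration, a custom opener might look like the following sketch. That it receives the path of the yaml file follows from the description of the example below; the use of MDAnalysis and the file names are assumptions made for this sketch.

```python
import os

import MDAnalysis as mda  # assumption: MDAnalysis as the underlying MD library

def open(yaml_path):
    """Open the trajectory belonging to an eval.yaml file."""
    directory = os.path.dirname(os.path.abspath(yaml_path))
    # Hypothetical file names; use whatever your simulation produced.
    return mda.Universe(
        os.path.join(directory, 'topology.tpr'),
        os.path.join(directory, 'trajectory.xtc'),
    )
```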
The last key, evaluations, defines a list of evaluations to run. Each item of the list is a dictionary with the two keys subset and functions. The parameters defined for the subset are used to get the subset of the trajectory, except for the special key selection, which is passed on to the store.update function. The optional key other defines a second subset of atoms, which is passed to the function as the keyword argument other. The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair; in the latter case the key names the function and the value should be another dictionary of keyword arguments for the function. These keyword arguments are also stored in the database as evaluation parameters. A function is located by its name, first in the specified namespace and, if not found there, in the store.analyse module, which defines some standard MD evaluation functions. The namespace can thus be used to run user-defined functions.
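A minimal sketch of that lookup order (not the actual implementation):

```python
import importlib

def locate_function(name, namespace=None):
    """Resolve an evaluation function by name, as described above."""
    if namespace is not None:
        module = importlib.import_module(namespace)
        if hasattr(module, name):
            return getattr(module, name)
    # Fall back to the standard analysis functions.
    from store import analyse
    return getattr(analyse, name)
```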
The above example does the following evaluations:
- All results will be stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
- The trajectory will be opened with the function analyse.biwater.open, called with the path of the yaml file.
- The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
- The first function will either be analyse.biwater.isf (if it exists) or store.analyse.isf, called with the keyword argument q=22.7.
- The second function, msd, has no arguments.
- The second subset again selects the OW atoms of the W100 residues and computes the rdf, with the OW atoms of the W075 residues passed as other.
- The third subset maps the W100 coordinates to dipole vectors via water_dipole and computes the orientational autocorrelation function of first order; the selection parameter will be dipole.
Included analysis functions
The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses (see the sketch after the following list). All analysis functions are defined in store.analyse; this is also the fallback namespace for eval.yaml evaluations when no namespace is specified or a function is not defined in the custom namespace. Some of the functions are documented, use help(store.analyse.function) to get more info. The following list gives the names and required parameters (in parentheses) of the available functions:
- isf(q): Incoherent intermediate scattering function
- csf(q): Coherent intermediate scattering function
- msd: Mean squared displacement
- oaf(order): Orientational autocorrelation function, use with an appropriate vector map (see below)
- rdf: Radial pair distribution function
- tetrahedral_order: Tetrahedral order parameter
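As an illustration of such a template, a customized analysis function might look like the following sketch. The calling convention is an assumption here: the trajectory subset as first argument, keyword arguments taken from eval.yaml, and a dataframe as return value.

```python
import numpy as np
import pandas as pd

def isf(trajectory, q=22.7):
    """Sketch of an incoherent intermediate scattering function.

    Assumes each frame exposes `positions` as an (n, 3) array and a `time`.
    """
    start = trajectory[0].positions
    times, values = [], []
    for frame in trajectory:
        # Isotropic average of exp(iq·r): sin(qr)/(qr) per atom,
        # averaged over all atoms of the subset.
        r = np.linalg.norm(frame.positions - start, axis=1)
        times.append(frame.time)
        values.append(np.sinc(q * r / np.pi).mean())
    return pd.DataFrame({'time': times, 'isf': values})
```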
Additionally, some typical vector maps are defined (one is sketched after the list):
- vector: Generic vector map between two atom types
- water_dipole: Dipole vector of each water molecule
- water_OH_bonds: OH bond vectors of each water molecule
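A vector map in the spirit of water_dipole could look like this sketch; the signature and array layout are assumptions, not the actual store.analyse API.

```python
import numpy as np

def water_dipole(oxygens, hydrogens):
    """Unit dipole vector per water molecule.

    oxygens: (n, 3) array of oxygen coordinates.
    hydrogens: (n, 2, 3) array with both hydrogen coordinates per molecule.
    """
    # The dipole points from the oxygen towards the midpoint of the hydrogens.
    dipole = hydrogens.mean(axis=1) - oxygens
    return dipole / np.linalg.norm(dipole, axis=1, keepdims=True)
```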
Updating
The function update handles the creation of new records as well as updating existing ones. It looks for a simulation in the database according to the specified arguments; a new simulation record is created only if no matching record is found.
```python
import store

store.update(
    'isf', df, directory='/path/to/simulation', user='niels', T=120, selection='OW',
    simulation_params={'mixingratio': 0.5}, evaluation_params={'q': 22.7}
)
```
Here, the variable df should be a dataframe with the evaluated data.
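For instance, for the isf call above, df might look like this; the column layout is illustrative and up to the evaluation that produced the data.

```python
import pandas as pd

# Illustrative evaluation result; the column names are not prescribed by store.
df = pd.DataFrame({
    'time': [0.0, 0.1, 0.2, 0.4],
    'isf': [1.0, 0.93, 0.81, 0.55],
})
```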
Querying
Users should use the function get
to retrieve data from the database. For example:
```python
import store

store.get(
    'isf', user='niels', T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```
Note that parameters queried via simulation_params (or evaluation_params) must have been defined when the data was put into the database. As the example shows, number-valued parameters like the temperature can be queried as a range using a string such as '100-150'.
Database organization
The database is organized in two main tables:
- Evaluation: Stores the evaluated data, linked to a simulation
- Simulation: Stores the metadata of a simulation
Table schemas
```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```
The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs with float or string values, respectively. They are used to store arbitrary attributes of the evaluation and simulation records.
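For illustration, such a schema could be declared with SQLAlchemy roughly as follows; the table and attribute names are taken from the schemas above, everything else is an assumption about the actual models.

```python
# Hypothetical SQLAlchemy declaration of the schema above; illustrative only.
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Simulation(Base):
    __tablename__ = 'simulations'
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    evaluations = relationship('Evaluation', back_populates='simulation')

class Evaluation(Base):
    __tablename__ = 'evaluations'
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)  # the evaluated data, stored as a serialized object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
    parameters = relationship('Parameter')

class Parameter(Base):
    # FloatAttribute and StringAttribute look alike, with float or string values.
    __tablename__ = 'parameters'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(Float)
    evaluation_id = Column(Integer, ForeignKey('evaluations.id'))
```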
Notes for the future
Cleaning up the SQL database
To delete orphaned evaluations, run the following on the PostgreSQL shell (psql -h db.cluster -U store):

```sql
-- Since parameters reference their respective evaluation, delete them first.
DELETE FROM parameters
WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);

DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```
Similarly, one can delete simulations that have no evaluations assigned.
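A sketch of such a statement (assuming no other tables reference the simulations):

```sql
-- Remove simulations that no longer have any evaluations assigned.
DELETE FROM simulations
WHERE simulations.id NOT IN
    (SELECT evaluations.simulation_id FROM evaluations
     WHERE evaluations.simulation_id IS NOT NULL);
```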
Database usage
General size info
```sql
SELECT nspname || '.' || relname AS "relation",
       pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  AND C.relkind <> 'i'
  AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```
Number of simulations per user

```sql
SELECT simulations.user, COUNT(DISTINCT simulations),
       pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
       pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```
Average size of data per observable
```sql
SELECT observable,
       pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
       pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
       AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```
SSH tunnel connection
To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels. This allows using SSH keys for authentication, so no passwords are required. Example code fragments:
```python
import os

import sshtunnel

ssh_server = None
SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
SSH_BIND_PORT = 58222
PG_PORT = 5432  # assumed: the default PostgreSQL port on the remote host
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)

def open_sshtunnel():
    """Forward the remote PostgreSQL port to localhost via SSH."""
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT),
    )
    ssh_server.start()
```
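With the tunnel open, the database is reachable through the forwarded local port, for example with SQLAlchemy (assumed here as the client library):

```python
import sqlalchemy

open_sshtunnel()
engine = sqlalchemy.create_engine(DB_FILE)
with engine.connect() as connection:
    print(connection.execute(sqlalchemy.text('SELECT version()')).scalar())
ssh_server.stop()  # close the tunnel when done
```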