Store.py: A database interface for storing MD evaluation data
A database interface to store evaluations of MD simulations.
Usage
The package provides interfaces to query records from the database (the get() function) and to put new or updated records into the database (the update() function). Both are explained briefly below.
Automated Evaluation: eval.yaml
The package provides a standardized way of running MD evaluations. These evaluations are defined in an eval.yaml file located in the respective simulation directory. An example of such a yaml file:
```yaml
namespace: analyse.biwater
simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt
trajectory-open: open
evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```
The first line defines a namespace, which is used to locate evaluation functions (see below). The key simulation_params defines parameters of the simulation, which are used when storing the results in the database. Parameters like the directory and the user are determined automatically from the file path and the current user, respectively. With the key trajectory-open, a function can be specified that is used to open the trajectory. If this is omitted, the function store.eval.open will be used.
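For illustration, a custom opener might look like the following sketch. That it receives the path of the yaml file follows from the description of the example below; the use of MDAnalysis and the file names are assumptions made for this sketch.

```python
import os

import MDAnalysis as mda  # assumption: MDAnalysis as the underlying MD library

def open(yaml_path):
    """Open the trajectory belonging to an eval.yaml file."""
    directory = os.path.dirname(os.path.abspath(yaml_path))
    # Hypothetical file names; use whatever your simulation produced.
    return mda.Universe(
        os.path.join(directory, 'topology.tpr'),
        os.path.join(directory, 'trajectory.xtc'),
    )
```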
The last key, evaluations, defines a list of evaluations to run. Each item of the list is a dictionary with the two keys subset and functions. The parameters defined for the subset are used to get the subset of the trajectory, except for the special key selection, which is passed on to the store.update function. The optional key other defines a second subset of atoms, which is passed to the function as the keyword argument other. The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair; in the latter case the key names the function and the value should be another dictionary of keyword arguments for the function. These keyword arguments are also stored in the database as evaluation parameters. A function is located by its name, first in the specified namespace and, if not found there, in the store.analyse module, which defines some standard MD evaluation functions. The namespace can thus be used to run user-defined functions.
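A minimal sketch of that lookup order (not the actual implementation):

```python
import importlib

def locate_function(name, namespace=None):
    """Resolve an evaluation function by name, as described above."""
    if namespace is not None:
        module = importlib.import_module(namespace)
        if hasattr(module, name):
            return getattr(module, name)
    # Fall back to the standard analysis functions.
    from store import analyse
    return getattr(analyse, name)
```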
The above example does the following evaluations:
- All results will be stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
- The trajectory will be opened with the function analyse.biwater.open, called with the path of the yaml file.
- The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
- The first function will either be analyse.biwater.isf (if it exists) or store.analyse.isf, called with the keyword argument q=22.7.
- The second function, msd, has no arguments.
- The second subset again selects the OW atoms of the W100 residues and computes the rdf, with the OW atoms of the W075 residues passed as other.
- The third subset maps the W100 coordinates to dipole vectors via water_dipole and computes the orientational autocorrelation function of first order; the selection parameter will be dipole.
Included analysis functions
The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses (see the sketch after the following list). All analysis functions are defined in store.analyse; this is also the fallback namespace for eval.yaml evaluations when no namespace is specified or a function is not defined in the custom namespace. Some of the functions are documented, use help(store.analyse.function) to get more info. The following list gives the names and required parameters (in parentheses) of the available functions:
- isf(q): Incoherent intermediate scattering function
- csf(q): Coherent intermediate scattering function
- msd: Mean squared displacement
- oaf(order): Orientational autocorrelation function, use with an appropriate vector map (see below)
- rdf: Radial pair distribution function
- tetrahedral_order: Tetrahedral order parameter
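As an illustration of such a template, a customized analysis function might look like the following sketch. The calling convention is an assumption here: the trajectory subset as first argument, keyword arguments taken from eval.yaml, and a dataframe as return value.

```python
import numpy as np
import pandas as pd

def isf(trajectory, q=22.7):
    """Sketch of an incoherent intermediate scattering function.

    Assumes each frame exposes `positions` as an (n, 3) array and a `time`.
    """
    start = trajectory[0].positions
    times, values = [], []
    for frame in trajectory:
        # Isotropic average of exp(iq·r): sin(qr)/(qr) per atom,
        # averaged over all atoms of the subset.
        r = np.linalg.norm(frame.positions - start, axis=1)
        times.append(frame.time)
        values.append(np.sinc(q * r / np.pi).mean())
    return pd.DataFrame({'time': times, 'isf': values})
```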
Additionally, some typical vector maps are defined (one is sketched after the list):
- vector: Generic vector map between two atom types
- water_dipole: Dipole vector of each water molecule
- water_OH_bonds: OH bond vectors of each water molecule
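A vector map in the spirit of water_dipole could look like this sketch; the signature and array layout are assumptions, not the actual store.analyse API.

```python
import numpy as np

def water_dipole(oxygens, hydrogens):
    """Unit dipole vector per water molecule.

    oxygens: (n, 3) array of oxygen coordinates.
    hydrogens: (n, 2, 3) array with both hydrogen coordinates per molecule.
    """
    # The dipole points from the oxygen towards the midpoint of the hydrogens.
    dipole = hydrogens.mean(axis=1) - oxygens
    return dipole / np.linalg.norm(dipole, axis=1, keepdims=True)
```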
Updating
The function update handles the creation of new records as well as updating existing ones. It looks for a simulation in the database according to the specified arguments; a new simulation record is created only if no matching record is found.
```python
import store

store.update(
    'isf', df, directory='/path/to/simulation', user='niels', T=120, selection='OW',
    simulation_params={'mixingratio': 0.5}, evaluation_params={'q': 22.7}
)
```
Here, the variable df should be a dataframe with the evaluated data.
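For instance, for the isf call above, df might look like this; the column layout is illustrative and up to the evaluation that produced the data.

```python
import pandas as pd

# Illustrative evaluation result; the column names are not prescribed by store.
df = pd.DataFrame({
    'time': [0.0, 0.1, 0.2, 0.4],
    'isf': [1.0, 0.93, 0.81, 0.55],
})
```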
Querying
Users should use the function get
to retrieve data from the database. For example:
```python
import store

store.get(
    'isf', user='niels', T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```
Note that parameters queried via simulation_params (or evaluation_params) must have been defined when the data was put into the database. As the example shows, number-valued parameters like the temperature can be queried as a range using a string such as '100-150'.
Database organization
The database is organized in two main tables:
- Evaluation: Stores the evaluated data, linked to a simulation
- Simulation: Stores the metadata of a simulation
Table schemas
```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```
The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs with float or string values, respectively. They are used to store arbitrary attributes of the evaluation and simulation records.
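For illustration, such a schema could be declared with SQLAlchemy roughly as follows; the table and attribute names are taken from the schemas above, everything else is an assumption about the actual models.

```python
# Hypothetical SQLAlchemy declaration of the schema above; illustrative only.
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Simulation(Base):
    __tablename__ = 'simulations'
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    evaluations = relationship('Evaluation', back_populates='simulation')

class Evaluation(Base):
    __tablename__ = 'evaluations'
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)  # the evaluated data, stored as a serialized object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
    parameters = relationship('Parameter')

class Parameter(Base):
    # FloatAttribute and StringAttribute look alike, with float or string values.
    __tablename__ = 'parameters'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(Float)
    evaluation_id = Column(Integer, ForeignKey('evaluations.id'))
```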
Notes for the future
Cleaning up the SQL database
To delete orphaned evaluations, run the following on the PostgreSQL shell (psql -h db.cluster -U store):

```sql
-- Since parameters reference their respective evaluation, delete them first.
DELETE FROM parameters
WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);

DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```
Similarly, one can delete simulations that have no evaluations assigned.
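A sketch of such a statement (assuming no other tables reference the simulations):

```sql
-- Remove simulations that no longer have any evaluations assigned.
DELETE FROM simulations
WHERE simulations.id NOT IN
    (SELECT evaluations.simulation_id FROM evaluations
     WHERE evaluations.simulation_id IS NOT NULL);
```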
Database usage
General size info
```sql
SELECT nspname || '.' || relname AS "relation",
       pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  AND C.relkind <> 'i'
  AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```
Number of simulations per user

```sql
SELECT simulations.user, COUNT(DISTINCT simulations),
       pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
       pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```
Average size of data per observable
```sql
SELECT observable,
       pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
       pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
       AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```
SSH tunnel connection
To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels. This allows using SSH keys for authentication, so no passwords are required. Example code fragments:
```python
import os

import sshtunnel

ssh_server = None
SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
SSH_BIND_PORT = 58222
PG_PORT = 5432  # assumed: the default PostgreSQL port on the remote host
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)

def open_sshtunnel():
    """Forward the remote PostgreSQL port to localhost via SSH."""
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT),
    )
    ssh_server.start()
```
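With the tunnel open, the database is reachable through the forwarded local port, for example with SQLAlchemy (assumed here as the client library):

```python
import sqlalchemy

open_sshtunnel()
engine = sqlalchemy.create_engine(DB_FILE)
with engine.connect() as connection:
    print(connection.execute(sqlalchemy.text('SELECT version()')).scalar())
ssh_server.stop()  # close the tunnel when done
```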