# Store.py: A database interface for storing MD evaluation data

A database interface to store evaluations of MD simulations.

## Usage

The package provides interfaces to query records from the database (the `get()` function) and to put new or updated records into the database (the `update()` function). Both are explained briefly below.

### Automated Evaluation: `eval.yaml`

The package provides a standardized way of running MD evaluations. These evaluations are defined in an `eval.yaml` file located in the respective simulation directory. An example of such a yaml file looks as follows:

```yaml
namespace: analyse.biwater
simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt
trajectory-open: open
evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```

The first line defines a namespace, which is used to locate evaluation functions (see below). The key `simulation_params` defines parameters of the simulation, which are used when storing the results in the database. Parameters like the directory and the user are determined automatically from the file path and the current user, respectively. With the key `trajectory-open` a function can be specified that is used to open the trajectory; if it is omitted, the function `store.eval.open` is used. The last key `evaluations` defines a list of evaluations to be run. Each item of the list is a dictionary with the two keys `subset` and `functions`. The parameters defined for the subset are used to select the subset of the trajectory, except for the special key `selection`, which is passed to the `store.update` function. The optional key `other` defines a second subset of atoms, which is passed to the evaluation function as the keyword argument `other`.

The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair. In the latter case the key names the function and the value should be another dictionary of keyword arguments for the function. These keyword arguments are also stored in the database as evaluation parameters. The function is located by its name, first in the specified namespace and, if not found there, in the `store.analyse` module, which defines some standard MD evaluation functions. The namespace may thus be used to run user-defined functions; a sketch of such a function is shown at the end of this section.

The above example does the following evaluations:

1. All results are stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
2. The trajectory is opened with the function `analyse.biwater.open`, called with the path of the yaml file.
3. The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
4. The first function will be either `analyse.biwater.isf` (if it exists) or `store.analyse.isf`, called with the keyword argument q=22.7.
5. The second function, `msd`, has no arguments.
6. The second subset again selects the OW atoms of the W100 residues, defines the OW atoms of the W075 residues as the `other` subset, and runs `rdf` with them.
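As indicated above, the namespace can provide user-defined evaluation functions. The following is a minimal sketch of what such a function in the namespace module (`analyse.biwater` in the example) might look like. The exact calling convention expected by the evaluation runner is not documented here, so the sketch assumes that the function receives the selected trajectory subset as its first argument together with the keyword arguments from `eval.yaml`, and that it returns a pandas DataFrame which ends up as the evaluation data.

```python
# analyse/biwater.py -- hypothetical namespace module referenced by eval.yaml
import numpy as np
import pandas as pd


def isf(trajectory, q=22.7, **kwargs):
    """Sketch of a user-defined incoherent scattering function.

    Defining `isf` here would shadow `store.analyse.isf` for evaluations
    that use this namespace. The body below is only a placeholder.
    """
    # Placeholder: replace with the actual time-correlation analysis.
    times = np.arange(len(trajectory))
    values = np.exp(-times / max(len(trajectory), 1))
    return pd.DataFrame({'time': times, 'isf': values})
```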
### Included analysis functions

The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses. All analysis functions are defined in `store.analyse`; this is also the fallback namespace for `eval.yaml` evaluations when no namespace is specified or when a function is not defined in the custom namespace. Some of the functions are documented, use `help(store.analyse.<function>)` to get more information. The following list gives the names and required parameters (in parentheses) of the available functions:

- `isf(q)`: Incoherent intermediate scattering function
- `csf(q)`: Coherent intermediate scattering function
- `msd`: Mean squared displacement
- `oaf(order)`: Orientational autocorrelation function, use with an appropriate vector map (see below)
- `rdf`: Radial pair distribution function
- `tetrahedral_order`: Tetrahedral order parameter

Additionally, some typical vector maps are defined:

- `vector`: Generic vector map between two atom types
- `water_dipole`
- `water_OH_bonds`

### Updating

The function `update` handles both the creation of new records and the updating of existing ones. It looks for a simulation in the database according to the specified arguments; only if no matching record is found, a new simulation is created.

```python
import store

store.update(
    'isf', df,
    directory='/path/to/simulation',
    user='niels',
    T=120,
    selection='OW',
    simulation_params={'mixingratio': 0.5},
    evaluation_params={'q': 22.7}
)
```

Here the variable `df` should be a dataframe with the evaluated data.

### Querying

Users should use the function `get` to retrieve data from the database. For example:

```python
import store

store.get(
    'isf',
    user='niels',
    T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```

Note that parameters queried via `simulation_params` (or `evaluation_params`) have to have been defined when the data was put into the database.

## Database organization

The database is organized in two main tables:

1. Evaluation: Stores the evaluated data, linked to a simulation
2. Simulation: Stores the metadata of a simulation

### Table schemas

```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```

The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs, allowing float or string values, respectively. They are used to store arbitrary attributes of the evaluation and simulation records.
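To make these relationships more concrete, the sketch below expresses the schema as SQLAlchemy-style models. It is only an illustration under assumptions: the README does not state which ORM store.py actually uses, only the `simulations`, `evaluations` and `parameters` table names are confirmed by the maintenance queries further down, and the attribute table names and column types are guesses.

```python
# Hypothetical sketch of the schema as SQLAlchemy models; the actual
# implementation in store.py may differ in ORM, names and types.
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Simulation(Base):
    __tablename__ = 'simulations'           # confirmed by the queries below
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    float_params = relationship('FloatAttribute')
    string_params = relationship('StringAttribute')
    evaluations = relationship('Evaluation', back_populates='simulation')


class Evaluation(Base):
    __tablename__ = 'evaluations'           # confirmed by the queries below
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)               # arbitrary evaluated data object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
    parameters = relationship('Parameter')


class Parameter(Base):
    __tablename__ = 'parameters'            # confirmed by the queries below
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(Float)
    evaluation_id = Column(Integer, ForeignKey('evaluations.id'))


class FloatAttribute(Base):
    __tablename__ = 'float_attributes'      # assumed table name
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(Float)
    simulation_id = Column(Integer, ForeignKey('simulations.id'))


class StringAttribute(Base):
    __tablename__ = 'string_attributes'     # assumed table name
    id = Column(Integer, primary_key=True)
    name = Column(String)
    value = Column(String)
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
```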
## Notes for the future

### Clean-Up SQL Database

To delete orphaned evaluations, on the postgresql shell (`psql -h db.cluster -U store`):

```sql
-- Since parameters reference their respective evaluation, we have to delete them first.
DELETE FROM parameters WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);
DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```

Similarly, one can delete simulations without any assigned evaluations.

### Database usage

General size info:

```sql
SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```

Number of simulations per user:

```sql
SELECT simulations.user,
    COUNT(DISTINCT simulations),
    pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
    pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```

Average size of data per observable:

```sql
SELECT observable,
    pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
    pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
    AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```

### SSH tunnel connection

To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels. This allows using SSH keys for authentication, so no passwords are required. Example code fragments:

```python
import os

import sshtunnel

ssh_server = None

SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
SSH_BIND_PORT = 58222
# Port the PostgreSQL server listens on at the remote host
# (5432 is the default; adjust if the server uses another port).
PG_PORT = 5432
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)


def open_sshtunnel():
    """Open an SSH tunnel that forwards the remote PostgreSQL port locally."""
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT),
    )
    ssh_server.start()
```
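The fragments above only open the tunnel. Below is a minimal sketch of how a connection through the forwarded port could then be used; it assumes SQLAlchemy is installed and reuses `DB_FILE`, `ssh_server` and `open_sshtunnel()` from the fragment above. How store.py itself picks up the connection URL is not covered here.

```python
import sqlalchemy


def close_sshtunnel():
    """Stop the SSH tunnel again once the database work is done."""
    global ssh_server
    if ssh_server is not None:
        ssh_server.stop()
        ssh_server = None


if __name__ == '__main__':
    open_sshtunnel()
    try:
        # Connect through the locally forwarded port defined in DB_FILE.
        engine = sqlalchemy.create_engine(DB_FILE)
        with engine.connect() as connection:
            result = connection.execute(sqlalchemy.text('SELECT COUNT(*) FROM simulations'))
            print('simulations in database:', result.scalar())
    finally:
        close_sshtunnel()
```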