# Store.py: A database interface for storing MD evaluation data

A database interface to store evaluations of MD simulations.

## Usage

The package provides interfaces to query records from the database (the `get()` function)
and put new or updated records into the database (the `update()` function).
Both are explained briefly below.

### Automated Evaluation: `eval.yaml`

The package provides a standardized way of running MD evaluations.
These evaluations are defined in an `eval.yaml` file, located in the respective simulation directory.
An example of such a YAML file:

```yaml
namespace: analyse.biwater

simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt

trajectory-open: open

evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```

The first line defines a namespace, which is used to locate the evaluation functions (see below).
The key `simulation_params` defines parameters of the simulation, which are used when storing the results in the database.
Parameters like the directory and the user are determined automatically from the file path and the current user, respectively.
The key `trajectory-open` specifies a function that is used to open the trajectory.
If it is omitted, the function `store.eval.open` is used.

The last key, `evaluations`, defines a list of evaluations to run.
Each item of the list is a dictionary with the two keys `subset` and `functions`.
The parameters defined for the subset are used to get the subset of the trajectory, except for the special key `selection`, which is passed on to the `store.update` function.
The optional key `other` defines a second subset of atoms, which is passed to the function as the keyword argument `other`.
The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair.
In the latter case, the key names the function and the value should be another dictionary of keyword arguments for the function.
These keyword arguments are also stored in the database as evaluation parameters.

The function is located by its name, first in the specified namespace and, if not found there, in the `store.analyse` module, which defines some standard MD evaluation functions.
The namespace may thus be used to provide user-defined evaluation functions, as sketched below.
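
For illustration, a user-defined evaluation function in the `analyse.biwater` namespace could look like the following sketch. The call convention is an assumption here: we assume each function receives the selected trajectory subset as its first argument plus the keyword arguments given in `eval.yaml`, and returns a pandas DataFrame that can be stored via `store.update`. The name `mean_height` and the structure of the subset (iterable of per-frame coordinate arrays) are hypothetical.

```python
# analyse/biwater.py -- hypothetical user namespace for eval.yaml evaluations.
# Assumed convention: the function gets the trajectory subset as its first
# argument and the keyword arguments from eval.yaml, and returns a DataFrame.
import numpy as np
import pandas as pd


def mean_height(subset, axis=2):
    """Average coordinate along one axis for every frame (hypothetical example)."""
    values = [frame[:, axis].mean() for frame in subset]
    return pd.DataFrame({'frame': np.arange(len(values)), 'mean_height': values})
```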

The above example results in the following evaluations:

1. All results will be stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
2. The trajectory will be opened with the function analyse.biwater.open and the path of the yaml file.
3. The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
4. The first function will either be analyse.biwater.isf (if it exists) or store.analyse.isf, called with the keyword argument q=22.7.
5. The second function, msd, takes no arguments.
6. The second subset again selects the OW atoms of the W100 residues, defines the OW atoms of the W075 residues as the `other` subset, and runs the rdf function with both.
7. The third subset maps the W100 residues to their dipole vectors via the `water_dipole` coordinates map and computes the first-order orientational autocorrelation function oaf.

### Included analysis functions

The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses.
All analysis functions are defined in `store.analyse`; this is also the fallback namespace for `eval.yaml` evaluations
when no namespace is specified or when a function is not defined in the custom namespace.

Some of the functions are documented; use `help(store.analyse.function)` to get more info.
The following list gives the names and required parameters (in parentheses) of the available functions:
- `isf(q)`: Incoherent intermediate scattering function
- `csf(q)`: Coherent intermediate scattering function
- `msd`: Mean squared displacement
- `oaf(order)`: Orientational autocorrelation function; use with an appropriate vector map (see below)
- `rdf`: Radial pair distribution function
- `tetrahedral_order`: Tetrahedral order parameter

Additionally, some typical vector maps are defined:

- `vector`: Generic vector map between two atom types
- `water_dipole`
- `water_OH_bonds`

### Updating

The function `update` handles the creation of new records as well as updating existing ones.
It looks for a simulation in the database according to the specified arguments;
a new simulation record is created only if no matching one is found.

```python
import store

store.update(
    'isf', df, directory='/path/to/simulation', user='niels', T=120, selection='OW',
    simulation_params={'mixingratio': 0.5}, evaluation_params={'q': 22.7}
)
```

Here the variable `df` should be a DataFrame with the evaluated data.
### Querying

Users should use the function `get` to retrieve data from the database. For example:

```python
import store

store.get(
    'isf', user='niels', T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```

Note that the parameters passed as `simulation_params` (or `evaluation_params`) must already have been defined when the data was put into the database.

## Database organization

The database is organized in two main tables:

1. Evaluation: stores the evaluated data, linked to a simulation
2. Simulation: stores the metadata of a simulation
### Table schemas

```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```
The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs
holding float or string values, respectively. They are used to store arbitrary
attributes of the evaluation and simulation records.
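
For orientation, the schema corresponds roughly to an ORM mapping like the following sketch. This is an assumption in SQLAlchemy style (the table and column names are taken from the SQL snippets further down); the actual model definitions, column types, and constraints in store.py may differ.

```python
# A minimal SQLAlchemy sketch of the two main tables; the real models in
# store.py may differ in names, types, and constraints.
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Simulation(Base):
    __tablename__ = 'simulations'
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    evaluations = relationship('Evaluation', back_populates='simulation')


class Evaluation(Base):
    __tablename__ = 'evaluations'
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)  # the evaluated data, stored as a binary object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
```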

## Notes for the future

### Cleaning up the SQL database

To delete orphaned evaluations, run the following on the PostgreSQL shell (`psql -h db.cluster -U store`):

```sql
-- Since parameters reference their respective evaluation, we have to delete them first.
DELETE FROM parameters
WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);

DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```

Similarly, one can delete simulations without any assigned evaluations, as sketched below.
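
A minimal sketch of the corresponding statement, assuming simulations are referenced only through `evaluations.simulation_id`:

```sql
-- Delete simulations that no evaluation references.
DELETE FROM simulations
WHERE simulations.id NOT IN
    (SELECT evaluations.simulation_id FROM evaluations
     WHERE evaluations.simulation_id IS NOT NULL);
```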

### Database usage

General size info:
```sql
SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```

Number of simulations per user:
```sql
SELECT simulations.user, COUNT(DISTINCT simulations),
    pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
    pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```

Average size of data per observable:
```sql
SELECT observable,
    pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
    pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
    AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```

### SSH tunnel connection

To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels.
This allows using SSH keys for authentication, so no passwords are required.
Example code fragments:

```python
import os
import sshtunnel

ssh_server = None

SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
PG_PORT = 5432         # PostgreSQL port on the remote host (default 5432 assumed)
SSH_BIND_PORT = 58222  # local port the tunnel binds to
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)


def open_sshtunnel():
    """Open an SSH tunnel to the database host and keep a global handle to it."""
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT),
    )
    ssh_server.start()
```