# Store.py: A database interface for storing MD evaluation data

A database interface to store evaluations of MD simulations.

## Usage

The package provides interfaces to query records from the database (the `get()` function)
and to put new or updated records into the database (the `update()` function).
Both are explained briefly below.

### Automated Evaluation: `eval.yaml`

The package provides a standardized way of running MD evaluations.
These evaluations are defined in an `eval.yaml` file, located in the respective simulation directory.
An example of such a yaml file looks as follows:

```yaml
namespace: analyse.biwater

simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt

trajectory-open: open

evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```

The first line defines a namespace, which is used to locate evaluation functions (see below).
The key `simulation_params` defines parameters of the simulation, which are used when storing the results in the database.
Parameters like the directory and the user are determined automatically from the file path and the current user, respectively.
With the key `trajectory-open`, a function can be specified that is used to open the trajectory.
If it is omitted, the function `store.eval.open` is used.

The last key, `evaluations`, defines a list of evaluations to be run.
Each item of the list is a dictionary with the two keys `subset` and `functions`.
The parameters defined for the subset are used to get the subset of the trajectory, except for the special key `selection`, which is used for the `store.update` function.
The optional key `other` defines a second subset of atoms, which is passed to the function as the keyword argument `other`.
The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair.
In the latter case, the key names the function and the value should be another dictionary of keyword arguments for the function.
These keyword arguments are also stored in the database as evaluation parameters.

The function is located by its name: first in the specified namespace and, if not found there, in the `store.analyse` module, which defines some standard MD evaluation functions.
The namespace may be used to provide user-defined functions, as sketched below.
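
For illustration, a user-defined evaluation function in such a namespace could look like the following minimal sketch. The trajectory interface (a sequence of coordinate frames) and the dataframe return type are assumptions made for this example, not guarantees of the package:

```python
# analyse/biwater.py -- hypothetical module providing the custom namespace.
# Assumptions: the trajectory subset behaves like a sequence of (n_atoms, 3)
# coordinate frames, and results are returned as a dataframe for store.update.
import numpy as np
import pandas as pd


def msd(trajectory):
    """Mean squared displacement relative to the first frame."""
    frames = np.asarray(trajectory)      # shape: (n_frames, n_atoms, 3)
    displacement = frames - frames[0]    # displacement from the first frame
    msd_t = (displacement ** 2).sum(axis=-1).mean(axis=-1)
    return pd.DataFrame({'frame': np.arange(len(msd_t)), 'msd': msd_t})
```

Functions that take keyword arguments in `eval.yaml` (like `isf` with `q`) simply declare them as keyword parameters.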

The above example does the following evaluations:

1. All results are stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
2. The trajectory is opened with the function `analyse.biwater.open` and the path of the yaml file.
3. The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
4. The first function will be either `analyse.biwater.isf` (if it exists) or `store.analyse.isf`, called with the keyword argument q=22.7.
5. The second function, `msd`, has no arguments.
6. The second subset again selects the OW atoms of the W100 residues, passes the OW atoms of the W075 residues as the keyword `other`, and computes the `rdf` between the two subsets.
7. The third subset maps the coordinates of the W100 residues to their dipole vectors via `water_dipole` and computes the first-order orientational autocorrelation function `oaf`.

### Included analysis functions

The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses.
All analysis functions are defined in `store.analyse`; this is also the fallback namespace for `eval.yaml` evaluations
when no namespace is specified or a function is not defined in the custom namespace.

Currently, the following functions are defined. Some of them are documented; use `help(store.analyse.function)` to get more info.
The following list gives the names and required parameters (in parentheses) of the available functions:

- isf(q): Incoherent intermediate scattering function
- csf(q): Coherent intermediate scattering function
- msd: Mean squared displacement
- oaf(order): Orientational autocorrelation function, use with an appropriate vector map (see below)
- rdf: Radial pair distribution function
- tetrahedral_order: Tetrahedral order parameter

Additionally, some typical vector maps are defined:

- vector: Generic vector map between two atom types
- water_dipole
- water_OH_bonds

### Updating

The function `update` handles the creation of new records as well as the updating of existing ones.
It looks for a simulation in the database according to the specified arguments,
and only if no matching record is found, a new simulation is created.

```python
import store

store.update(
    'isf', df, directory='/path/to/simulation', user='niels', T=120, selection='OW',
    simulation_params={'mixingratio': 0.5}, evaluation_params={'q': 22.7}
)
```

Here the variable `df` should be a dataframe with the evaluated data.
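
A minimal illustration of building such a dataframe; the column names and the decay curve are placeholders, not a format prescribed by the package:

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for a computed ISF curve.
time = np.logspace(-1, 3, 50)
df = pd.DataFrame({'time': time, 'isf': np.exp(-time / 100.0)})
```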

### Querying

Users should use the function `get` to retrieve data from the database. For example:

```python
import store

store.get(
    'isf', user='niels', T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```

Note that the parameters queried via `simulation_params` (or `evaluation_params`) must
have been defined when the data was put into the database.

## Database organization

The database is organized in two main tables:

1. Evaluation: Stores the evaluated data, linked to a simulation
2. Simulation: Stores the metadata of a simulation

### Table schemas

```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```

The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs,
holding float or string values, respectively. They are used to store arbitrary
attributes of the evaluation and simulation records.
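
As a rough orientation, the two main tables could be declared along these lines with SQLAlchemy. This is a sketch under the assumption of an SQLAlchemy ORM layer (the attribute tables are omitted for brevity); the package's actual model code may differ:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Simulation(Base):
    __tablename__ = 'simulations'
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    evaluations = relationship('Evaluation', back_populates='simulation')


class Evaluation(Base):
    __tablename__ = 'evaluations'
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)  # the evaluated data object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
```

The table names `simulations` and `evaluations` match those used in the SQL snippets below.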

## Notes for the future

### Clean-Up SQL Database

To delete orphaned evaluations, run the following on the PostgreSQL shell (`psql -h db.cluster -U store`):

```sql
-- Since parameters reference their respective evaluation, we have to delete them first.
DELETE FROM parameters
WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);

DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```

Similarly, one can delete simulations without any assigned evaluations.
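
A possible statement for that, assuming the `evaluations.simulation_id` foreign key used in the queries below:

```sql
-- Sketch: delete simulations that no evaluation references.
DELETE FROM simulations
WHERE simulations.id NOT IN
    (SELECT evaluations.simulation_id FROM evaluations
     WHERE evaluations.simulation_id IS NOT NULL);
```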

### Database usage

General size info:

```sql
SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```

Number of simulations per user:

```sql
SELECT simulations.user, COUNT(DISTINCT simulations),
    pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
    pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations
    ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```

Average size of data per observable:

```sql
SELECT observable,
    pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
    pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
    AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```

### SSH tunnel connection

To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels.
This allows SSH keys to be used for authentication, so no passwords are required.
Example code fragments:

```python
import os
import sshtunnel

ssh_server = None

SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
SSH_BIND_PORT = 58222
PG_PORT = 5432  # assumption: the PostgreSQL default port on the remote host
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)


def open_sshtunnel():
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT)
    )
    ssh_server.start()
```
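
A short usage sketch: once the tunnel is up, the server is reachable through the forwarded local port (the database connection layer itself is not part of the fragment above):

```python
open_sshtunnel()
print('PostgreSQL is now reachable at', DB_FILE)
# ... connect to the database through the tunnel ...
ssh_server.stop()  # close the tunnel when done
```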