# Store.py: A database interface for storing MD evaluation data

A database interface to store evaluations of MD simulations.

## Usage

The package provides interfaces to query records from the database (the `get()` function)
and to put new or updated records into the database (the `update()` function).
Both are explained briefly below.

### Automated Evaluation: `eval.yaml`

The package provides a standardized way of running MD evaluations.
These evaluations are defined in an `eval.yaml` file, located in the respective simulation directory.
An example of such a yaml file looks as follows:

```yaml
namespace: analyse.biwater

simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt

trajectory-open: open

evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```

The first line defines a namespace, which is used to locate evaluation functions (see below).
The key `simulation_params` defines parameters of the simulation, which are used when storing the results in the database.
Parameters like the directory and the user are determined automatically from the file path and the current user, respectively.
With the key `trajectory-open`, a function can be specified that is used to open the trajectory.
If it is omitted, the function `store.eval.open` is used.

The last key, `evaluations`, defines a list of evaluations to be run.
Each item of the list is a dictionary with the two keys `subset` and `functions`.
The parameters defined for the subset are used to get the subset of the trajectory, except for the special key `selection`, which is used for the `store.update` function.
The optional key `other` defines a second subset of atoms, which is passed to the function as the keyword argument `other`.
The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair.
In the latter case, the key names the function and the value should be another dictionary of keyword arguments for the function.
These keyword arguments are also stored in the database as evaluation parameters.

The function is located by its name: first in the specified namespace and, if not found there, in the `store.analyse` module, which defines some standard MD evaluation functions.
The namespace may be used to provide user-defined functions, as sketched below.
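
For illustration, a user-defined evaluation function in such a namespace could look like the following minimal sketch. The trajectory interface (a sequence of coordinate frames) and the dataframe return type are assumptions made for this example, not guarantees of the package:

```python
# analyse/biwater.py -- hypothetical module providing the custom namespace.
# Assumptions: the trajectory subset behaves like a sequence of (n_atoms, 3)
# coordinate frames, and results are returned as a dataframe for store.update.
import numpy as np
import pandas as pd


def msd(trajectory):
    """Mean squared displacement relative to the first frame."""
    frames = np.asarray(trajectory)      # shape: (n_frames, n_atoms, 3)
    displacement = frames - frames[0]    # displacement from the first frame
    msd_t = (displacement ** 2).sum(axis=-1).mean(axis=-1)
    return pd.DataFrame({'frame': np.arange(len(msd_t)), 'msd': msd_t})
```

Functions that take keyword arguments in `eval.yaml` (like `isf` with `q`) simply declare them as keyword parameters.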

The above example does the following evaluations:

1. All results are stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
2. The trajectory is opened with the function `analyse.biwater.open` and the path of the yaml file.
3. The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
4. The first function will be either `analyse.biwater.isf` (if it exists) or `store.analyse.isf`, called with the keyword argument q=22.7.
5. The second function, `msd`, has no arguments.
6. The second subset again selects the OW atoms of the W100 residues, passes the OW atoms of the W075 residues as the keyword `other`, and computes the `rdf` between the two subsets.
7. The third subset maps the coordinates of the W100 residues to their dipole vectors via `water_dipole` and computes the first-order orientational autocorrelation function `oaf`.

### Included analysis functions

The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses.
All analysis functions are defined in `store.analyse`; this is also the fallback namespace for `eval.yaml` evaluations
when no namespace is specified or a function is not defined in the custom namespace.

Currently, the following functions are defined. Some of them are documented; use `help(store.analyse.function)` to get more info.
The following list gives the names and required parameters (in parentheses) of the available functions:

- isf(q): Incoherent intermediate scattering function
- csf(q): Coherent intermediate scattering function
- msd: Mean squared displacement
- oaf(order): Orientational autocorrelation function, use with an appropriate vector map (see below)
- rdf: Radial pair distribution function
- tetrahedral_order: Tetrahedral order parameter

Additionally, some typical vector maps are defined:

- vector: Generic vector map between two atom types
- water_dipole
- water_OH_bonds

### Updating

The function `update` handles the creation of new records as well as the updating of existing ones.
It looks for a simulation in the database according to the specified arguments,
and only if no matching record is found, a new simulation is created.

```python
import store

store.update(
    'isf', df, directory='/path/to/simulation', user='niels', T=120, selection='OW',
    simulation_params={'mixingratio': 0.5}, evaluation_params={'q': 22.7}
)
```

Here the variable `df` should be a dataframe with the evaluated data.
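
A minimal illustration of building such a dataframe; the column names and the decay curve are placeholders, not a format prescribed by the package:

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for a computed ISF curve.
time = np.logspace(-1, 3, 50)
df = pd.DataFrame({'time': time, 'isf': np.exp(-time / 100.0)})
```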

### Querying

Users should use the function `get` to retrieve data from the database. For example:

```python
import store

store.get(
    'isf', user='niels', T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```

Note that the parameters queried via `simulation_params` (or `evaluation_params`) must
have been defined when the data was put into the database.

## Database organization

The database is organized in two main tables:

1. Evaluation: Stores the evaluated data, linked to a simulation
2. Simulation: Stores the metadata of a simulation

### Table schemas

```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```

The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs,
holding float or string values, respectively. They are used to store arbitrary
attributes of the evaluation and simulation records.
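
As a rough orientation, the two main tables could be declared along these lines with SQLAlchemy. This is a sketch under the assumption of an SQLAlchemy ORM layer (the attribute tables are omitted for brevity); the package's actual model code may differ:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Simulation(Base):
    __tablename__ = 'simulations'
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    evaluations = relationship('Evaluation', back_populates='simulation')


class Evaluation(Base):
    __tablename__ = 'evaluations'
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)  # the evaluated data object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
```

The table names `simulations` and `evaluations` match those used in the SQL snippets below.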

## Notes for the future

### Clean-Up SQL Database

To delete orphaned evaluations, run the following on the PostgreSQL shell (`psql -h db.cluster -U store`):

```sql
-- Since parameters reference their respective evaluation, we have to delete them first.
DELETE FROM parameters
WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);

DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```

Similarly, one can delete simulations without any assigned evaluations.
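
A possible statement for that, assuming the `evaluations.simulation_id` foreign key used in the queries below:

```sql
-- Sketch: delete simulations that no evaluation references.
DELETE FROM simulations
WHERE simulations.id NOT IN
    (SELECT evaluations.simulation_id FROM evaluations
     WHERE evaluations.simulation_id IS NOT NULL);
```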

### Database usage

General size info:

```sql
SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```

Number of simulations per user:

```sql
SELECT simulations.user, COUNT(DISTINCT simulations),
    pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
    pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations
    ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```

Average size of data per observable:

```sql
SELECT observable,
    pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
    pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
    AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```

### SSH tunnel connection

To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels.
This allows SSH keys to be used for authentication, so no passwords are required.
Example code fragments:

```python
import os
import sshtunnel

ssh_server = None

SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
SSH_BIND_PORT = 58222
PG_PORT = 5432  # assumption: the PostgreSQL default port on the remote host
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)


def open_sshtunnel():
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT)
    )
    ssh_server.start()
```
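
A short usage sketch: once the tunnel is up, the server is reachable through the forwarded local port (the database connection layer itself is not part of the fragment above):

```python
open_sshtunnel()
print('PostgreSQL is now reachable at', DB_FILE)
# ... connect to the database through the tunnel ...
ssh_server.stop()  # close the tunnel when done
```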