# Store.py: A database interface for storing MD evaluation data

A database interface to store evaluations of MD simulations.

## Usage

The package provides interfaces to query records from the database (the `get()` function)
and put new or updated records into the database (the `update()` function).
Both are explained briefly below.

### Automated Evaluation: `eval.yaml`

The package provides a standardized way of running MD evaluations.
These evaluations are defined in an `eval.yaml` file, located in the respective simulation directory.
An example of such a YAML file:

```yaml
namespace: analyse.biwater

simulation_params:
  mixingratio: 0.5
  system: bulk
  ensemble: nvt

trajectory-open: open

evaluations:
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    functions:
      - isf:
          q: 22.7
      - msd
  - subset:
      residue_name: W100
      atom_name: OW
      selection: W100
    other:
      residue_name: W075
      atom_name: OW
      selection: W075
    functions:
      - rdf
  - subset:
      residue_name: W100
      coordinates-map: water_dipole
      selection: dipole
    functions:
      - oaf:
          order: 1
```

The first line defines a namespace, which is used to locate the evaluation functions (see below).
The key `simulation_params` defines parameters of the simulation, which are used when storing the results in the database.
Parameters like the directory and the user are determined automatically from the file path and the current user, respectively.
The key `trajectory-open` specifies a function that is used to open the trajectory.
If it is omitted, the function `store.eval.open` is used.

The last key, `evaluations`, defines a list of evaluations to run.
Each item of the list is a dictionary with the two keys `subset` and `functions`.
The parameters defined for the subset are used to get the subset of the trajectory, except for the special key `selection`, which is passed on to the `store.update` function.
The optional key `other` defines a second subset of atoms, which is passed to the function as the keyword argument `other`.
The functions are again defined as a list. Each item can be either a string or a dictionary with one key-value pair.
In the latter case, the key names the function and the value should be another dictionary of keyword arguments for the function.
These keyword arguments are also stored in the database as evaluation parameters.

The function is located by its name, first in the specified namespace and, if not found there, in the `store.analyse` module, which defines some standard MD evaluation functions.
The namespace may thus be used to provide user-defined evaluation functions, as sketched below.
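
For illustration, a user-defined evaluation function in the `analyse.biwater` namespace could look like the following sketch. The call convention is an assumption here: we assume each function receives the selected trajectory subset as its first argument plus the keyword arguments given in `eval.yaml`, and returns a pandas DataFrame that can be stored via `store.update`. The name `mean_height` and the structure of the subset (iterable of per-frame coordinate arrays) are hypothetical.

```python
# analyse/biwater.py -- hypothetical user namespace for eval.yaml evaluations.
# Assumed convention: the function gets the trajectory subset as its first
# argument and the keyword arguments from eval.yaml, and returns a DataFrame.
import numpy as np
import pandas as pd


def mean_height(subset, axis=2):
    """Average coordinate along one axis for every frame (hypothetical example)."""
    values = [frame[:, axis].mean() for frame in subset]
    return pd.DataFrame({'frame': np.arange(len(values)), 'mean_height': values})
```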

The above example results in the following evaluations:

1. All results will be stored in the database with the parameters system=bulk, ensemble=nvt, mixingratio=0.5.
2. The trajectory will be opened with the function analyse.biwater.open and the path of the yaml file.
3. The first subset selects all OW atoms of the W100 residues; the selection parameter in the database will be W100.
4. The first function will either be analyse.biwater.isf (if it exists) or store.analyse.isf, called with the keyword argument q=22.7.
5. The second function, msd, takes no arguments.
6. The second subset again selects the OW atoms of the W100 residues, defines the OW atoms of the W075 residues as the `other` subset, and runs the rdf function with both.
7. The third subset maps the W100 residues to their dipole vectors via the `water_dipole` coordinates map and computes the first-order orientational autocorrelation function oaf.

### Included analysis functions

The store package comes with some basic analysis functions, which can also serve as templates for writing customized analyses.
All analysis functions are defined in `store.analyse`; this is also the fallback namespace for `eval.yaml` evaluations
when no namespace is specified or when a function is not defined in the custom namespace.

Some of the functions are documented; use `help(store.analyse.function)` to get more info.
The following list gives the names and required parameters (in parentheses) of the available functions:
- `isf(q)`: Incoherent intermediate scattering function
- `csf(q)`: Coherent intermediate scattering function
- `msd`: Mean squared displacement
- `oaf(order)`: Orientational autocorrelation function; use with an appropriate vector map (see below)
- `rdf`: Radial pair distribution function
- `tetrahedral_order`: Tetrahedral order parameter

Additionally, some typical vector maps are defined:

- `vector`: Generic vector map between two atom types
- `water_dipole`
- `water_OH_bonds`

### Updating

The function `update` handles the creation of new records as well as updating existing ones.
It looks for a simulation in the database according to the specified arguments;
a new simulation record is created only if no matching one is found.

```python
import store

store.update(
    'isf', df, directory='/path/to/simulation', user='niels', T=120, selection='OW',
    simulation_params={'mixingratio': 0.5}, evaluation_params={'q': 22.7}
)
```

Here the variable `df` should be a DataFrame with the evaluated data.
### Querying

Users should use the function `get` to retrieve data from the database. For example:

```python
import store

store.get(
    'isf', user='niels', T='100-150',
    simulation_params={'mixingratio': 0.5}
)
```

Note that the parameters passed as `simulation_params` (or `evaluation_params`) must already have been defined when the data was put into the database.

## Database organization

The database is organized in two main tables:

1. Evaluation: stores the evaluated data, linked to a simulation
2. Simulation: stores the metadata of a simulation
### Table schemas

```
Evaluation:
    observable: str
    selection: str
    parameters: list of Parameter
    simulation: Simulation
    data: object

Simulation:
    directory: str
    user: str
    temperature: number
    float_params: list of FloatAttribute
    string_params: list of StringAttribute
    evaluations: list of Evaluation

Parameter:
    name: str
    value: float

FloatAttribute:
    name: str
    value: float

StringAttribute:
    name: str
    value: str
```
The tables Parameter, FloatAttribute and StringAttribute are simple key-value pairs
holding float or string values, respectively. They are used to store arbitrary
attributes of the evaluation and simulation records.
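
For orientation, the schema corresponds roughly to an ORM mapping like the following sketch. This is an assumption in SQLAlchemy style (the table and column names are taken from the SQL snippets further down); the actual model definitions, column types, and constraints in store.py may differ.

```python
# A minimal SQLAlchemy sketch of the two main tables; the real models in
# store.py may differ in names, types, and constraints.
from sqlalchemy import Column, Float, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Simulation(Base):
    __tablename__ = 'simulations'
    id = Column(Integer, primary_key=True)
    directory = Column(String)
    user = Column(String)
    temperature = Column(Float)
    evaluations = relationship('Evaluation', back_populates='simulation')


class Evaluation(Base):
    __tablename__ = 'evaluations'
    id = Column(Integer, primary_key=True)
    observable = Column(String)
    selection = Column(String)
    data = Column(PickleType)  # the evaluated data, stored as a binary object
    simulation_id = Column(Integer, ForeignKey('simulations.id'))
    simulation = relationship('Simulation', back_populates='evaluations')
```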

## Notes for the future

### Cleaning up the SQL database

To delete orphaned evaluations, run the following on the PostgreSQL shell (`psql -h db.cluster -U store`):

```sql
-- Since parameters reference their respective evaluation, we have to delete them first.
DELETE FROM parameters
WHERE parameters.evaluation_id IN
    (SELECT id FROM evaluations WHERE evaluations.simulation_id IS NULL);

DELETE FROM evaluations WHERE evaluations.simulation_id IS NULL;
```

Similarly, one can delete simulations without any assigned evaluations, as sketched below.
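
A minimal sketch of the corresponding statement, assuming simulations are referenced only through `evaluations.simulation_id`:

```sql
-- Delete simulations that no evaluation references.
DELETE FROM simulations
WHERE simulations.id NOT IN
    (SELECT evaluations.simulation_id FROM evaluations
     WHERE evaluations.simulation_id IS NOT NULL);
```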

### Database usage

General size info:
```sql
SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
```

Number of simulations per user:
```sql
SELECT simulations.user, COUNT(DISTINCT simulations),
    pg_size_pretty(SUM(pg_column_size(data))) AS "data-size",
    pg_size_pretty(SUM(pg_column_size(data)) / COUNT(DISTINCT simulations)) AS "size / sim"
FROM evaluations
JOIN simulations ON (simulations.id = evaluations.simulation_id)
GROUP BY simulations.user
ORDER BY count DESC;
```

Average size of data per observable:
```sql
SELECT observable,
    pg_size_pretty(ROUND(AVG(pg_column_size(data)), 0)) AS "size-avg",
    pg_size_pretty(ROUND(SUM(pg_column_size(data)), 0)) AS "size-total",
    AVG(pg_column_size(data)) AS "size_bytes"
FROM evaluations
GROUP BY observable
ORDER BY size_bytes DESC;
```

### SSH tunnel connection

To get a secure connection to the PostgreSQL server on nas2, one can use SSH tunnels.
This allows using SSH keys for authentication, so no passwords are required.
Example code fragments:

```python
import os
import sshtunnel

ssh_server = None

SSH_HOST = 'nas2'
SSH_USER = os.getlogin()
SSH_KEY = os.environ['HOME'] + '/.ssh/id_rsa'
PG_PORT = 5432         # PostgreSQL port on the remote host (default 5432 assumed)
SSH_BIND_PORT = 58222  # local port the tunnel binds to
DB_FILE = 'postgresql://localhost:{port}/test'.format(port=SSH_BIND_PORT)


def open_sshtunnel():
    """Open an SSH tunnel to the database host and keep a global handle to it."""
    global ssh_server
    ssh_server = sshtunnel.SSHTunnelForwarder(
        ssh_address_or_host=SSH_HOST,
        ssh_username=SSH_USER,
        ssh_pkey=SSH_KEY,
        remote_bind_address=('localhost', PG_PORT),
        local_bind_address=('localhost', SSH_BIND_PORT),
    )
    ssh_server.start()
```