Data Persistence

Remember that containers have a volatile state: when a container is removed, any data modified inside its (virtual) filesystem is deleted. Clearly, we need to work out a setup where data and settings remain safe -- persist -- across container shutdowns and upgrades.

Docker volumes are the central concept here: volumes are essentially mount points between the host and a container, or among containers. Through volumes we can

  • share data between host and container,
  • share data between containers (with a storage-dedicated container).

The best strategy depends very much on each data provider's workflow and the amount of data involved. In any case, here (together with the Workflow document) we will walk through a couple of examples on the subject.

Generally, when we think about data to persist across Dachs instances, we think of the contents of:

  • /var/gavo/inputs/*: by default, the services' directories;
  • /var/gavo: almost completely defines the site;
  • /etc/gavo.rc: the site's metadata.

The site's metadata, though, is rather stable -- actually, static -- for a given site. For that component, it may be reasonable to bake it into a custom container image inheriting from chbrandt/dachs:server, as shown in the README's 'FROM dachs:server' section.
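As a minimal sketch of that approach, assuming your site's gavo.rc sits next to the Dockerfile (the file layout and image tag are illustrative):

```dockerfile
# Custom image baking the (static) site metadata into the image itself.
# Assumes ./gavo.rc lives next to this Dockerfile.
FROM chbrandt/dachs:server
COPY gavo.rc /etc/gavo.rc
```

You would build it with something like 'docker build -t mydachs:server .' and then run mydachs:server in place of chbrandt/dachs:server.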

Host-Container volumes

Suppose we have a set of services under /dachs/sets on the host:

$ tree /dachs/sets
/dachs/sets/
├── arihip
│   ├── data
│   │   └── data.txt.gz
│   └── q.rd
└── datasetx
    ├── data.csv
    └── q.rd

We can run our Dachs container/site as follows:

# Run the 'postgres' container and then..
#
(host)$ docker run -dt --name dachs -p 80:80                         \
                   -v /dachs/sets/arihip:/var/gavo/inputs/arihip     \
                   -v /dachs/sets/datasetx:/var/gavo/inputs/datasetx \
                   chbrandt/dachs:server

And then, from another terminal window, manage (i.e., publish) the service:

(host)$ docker exec -it dachs bash
(cont)$ gavo imp arihip/q && gavo pub arihip/q
(cont)$ gavo imp datasetx/q && gavo pub datasetx/q
(cont)$ gavo serve reload

Obviously, you can adapt this process to whatever best fits your workflow. For example, I keep a "utils" directory with scripts I use on a daily basis for administrative tasks, among them DaCHS data/service management. I usually bring my "utils" tools with me inside the container:

# Run the 'postgres' container and then..
#
(host)$ docker run -dt --name dachs -p 80:80                         \
                   -v /dachs/sets/arihip:/var/gavo/inputs/arihip     \
                   -v /dachs/sets/datasetx:/var/gavo/inputs/datasetx \
                   -v /dachs/utils:/usr/host/utils                   \
                   -v /dachs/etc/gavo.rc:/etc/gavo.rc:ro             \
                   chbrandt/dachs:server

Notice that in this example I also mounted the site's metadata (/etc/gavo.rc) with an extra parameter, ":ro" -- read-only mode. By default, volumes are mounted in read-write mode, which means files/directories can be modified either from inside the container or from the host. The "read-only" flag blocks modifications from inside the container.
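To convince yourself, you can try to write to the read-only mount from inside the running container (the container name 'dachs' is taken from the example above; this sketch requires the docker daemon and containers to be up):

```shell
# Appending to the read-only mounted file should be refused,
# with an error along the lines of "Read-only file system".
docker exec dachs sh -c 'echo "test" >> /etc/gavo.rc'
```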

Volume Containers

Another way to persist data in a docker setup is through volume containers: containers dedicated to serving as a storage hub, exporting volumes to other, running containers.

There are two ways to have a volume container:

  1. (traditional) build a container that exposes a specific VOLUME path;
  2. (recommended) create a docker volume to store different paths.

Traditional

This is the traditional way of creating a volume container: a Dockerfile is defined to export certain volumes. For example, the following Dockerfile could be used to pool everything under /var/gavo:

FROM debian
RUN mkdir -p /var/gavo
VOLUME /var/gavo

And if we built it with the following command line:

$ docker build -t mydachs:volume ./

We would then use it as:

(host)$ docker run -dt --name dachs_vargavo mydachs:volume
(host)$
(host)$ docker run -dt --name dachs -p 80:80      \
                   --volumes-from dachs_vargavo   \
                   chbrandt/dachs:server

Everything you do under /var/gavo (create, move, delete) will be saved in dachs_vargavo. You could also version the dachs_vargavo container (image mydachs:volume) each time a new service/resource comes in, for example -- keeping in mind that docker commit does not include data stored in volumes, only changes to the container's own filesystem. Again, it is up to the data publisher to decide whether that is a reasonable workflow; you probably won't do it if your services carry a lot of data.
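Since docker commit skips volume contents, one way to actually snapshot the data is to archive it to the host through a throwaway container. This is a sketch assuming the dachs_vargavo container from above; the archive name is illustrative:

```shell
# Archive everything under /var/gavo from the volume container
# into the host's current directory.
docker run --rm --volumes-from dachs_vargavo \
           -v "$(pwd)":/backup debian \
           tar czf /backup/vargavo-snapshot.tar.gz -C /var/gavo .
```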

Recommended

Nowadays docker provides the volume interface: named volumes, independent of any running container, dedicated to data persistence. The first thing to know about docker volumes is that they are only deleted from your host's filesystem when explicitly removed -- a nice, very safe feature (though notice that if you play with volumes and forget to clean up afterwards, data may accumulate under the hood).

Creating a docker volume is rather simple:

$ docker volume create dachs_store
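Once created, the volume can be listed, inspected (to find where docker stores it on the host), and, when no longer needed, explicitly removed (these commands talk to the docker daemon):

```shell
docker volume ls                   # list volumes known to this host
docker volume inspect dachs_store  # details, including the host "Mountpoint"
docker volume rm dachs_store       # only now is the data actually deleted
```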

And then, we can mount it at whichever path we want during the companion container's initialization. Note that a named volume mounts a directory, so it cannot stand in for a single file like /etc/gavo.rc -- for the metadata file, use a bind mount or a custom image as discussed above:

$ docker run -dt --name dachs -p 80:80 \
             -v dachs_store:/var/gavo  \
             chbrandt/dachs:server

If the volume is empty when mounted at a path (e.g., /var/gavo), docker copies the content already present at that path in the companion container (e.g., dachs) into the volume; otherwise, the volume is simply mounted at the corresponding location, exposing its existing content.
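To check what actually ended up in dachs_store, you can list its content from a throwaway container (a sketch; requires the docker daemon and the volume created above):

```shell
# After the first run of the dachs container, the volume should hold
# whatever the image shipped under /var/gavo.
docker run --rm -v dachs_store:/mnt debian ls -l /mnt
```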