Woken: Workflow for Analytics

An orchestration platform for Docker containers running data mining algorithms.

This project exposes a web interface to execute on demand data mining algorithms defined in Docker containers and implemented using any tool or language (R, Python, Java and more are supported).

It relies on a runtime environment containing Mesos and [Chronos](https://mesos.github.io/chronos/) to control and execute the Docker containers over a cluster.

Usage

 docker run --rm --env [list of environment variables] --link woken hbpmip/woken:2.8.2

where the environment variables are:

  • CLUSTER_IP: Name of this server advertised in the Akka cluster
  • CLUSTER_PORT: Port of this server advertised in the Akka cluster
  • CLUSTER_NAME: Name of the Woken cluster. Defaults to 'woken'
  • WOKEN_PORT_8088_TCP_ADDR: Address of the Woken master server
  • WOKEN_PORT_8088_TCP_PORT: Port of the Woken master server. Defaults to 8088
  • DOCKER_BRIDGE_NETWORK: Name of the Docker bridge network. Defaults to 'bridge'
  • NETWORK_INTERFACE: IP address for listening to incoming HTTP connections. Defaults to '0.0.0.0'
  • WEB_SERVICES_PORT: Port for the HTTP server in the Docker container. Defaults to 8087
  • WEB_SERVICES_SECURE: If yes, HTTPS with a custom certificate will be used. Defaults to no
  • WEB_SERVICES_USER: User name for the web services protected with HTTP basic authentication. Defaults to 'admin'
  • WEB_SERVICES_PASSWORD: Password for the web services protected with HTTP basic authentication
  • LOG_LEVEL: Level for logs on standard output. Defaults to WARNING
  • LOG_CONFIG: on/off. Logs the configuration on start. Defaults to off
  • VALIDATION_MIN_SERVERS: Minimum number of servers with the 'validation' functionality in the cluster. Defaults to 0
  • SCORING_MIN_SERVERS: Minimum number of servers with the 'scoring' functionality in the cluster. Defaults to 0
  • KAMON_ENABLED: Enables monitoring with Kamon. Defaults to no
  • ZIPKIN_ENABLED: Enables reporting traces to Zipkin. Defaults to no. Requires Kamon enabled.
  • ZIPKIN_IP: IP address of the Zipkin server. Requires Kamon and Zipkin enabled.
  • ZIPKIN_PORT: Port of the Zipkin server. Requires Kamon and Zipkin enabled.
  • PROMETHEUS_ENABLED: Enables reporting metrics to Prometheus. Defaults to no. Requires Kamon enabled.
  • PROMETHEUS_IP: IP address of the Prometheus server. Requires Kamon and Prometheus enabled.
  • PROMETHEUS_PORT: Port of the Prometheus server. Requires Kamon and Prometheus enabled.
  • SIGAR_SYSTEM_METRICS: Enables collection of system metrics using the Sigar native library. Defaults to no. Requires Kamon enabled.
  • JVM_SYSTEM_METRICS: Enables collection of JVM metrics using JMX. Defaults to no. Requires Kamon enabled.
  • MINING_LIMIT: Maximum number of concurrent mining operations. Defaults to 100
  • EXPERIMENT_LIMIT: Maximum number of concurrent experiments. Defaults to 100
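
As an illustration of how these variables fit into the docker run command, here is a minimal Python sketch. It assembles the argument list from a dictionary and flags monitoring settings that are switched on without Kamon. The helper names (check_monitoring, docker_run_command) and the sample values are hypothetical, and the sketch assumes boolean settings are written as "yes" and "no".

```python
import shlex

# Settings that the list above marks as "Requires Kamon enabled".
REQUIRES_KAMON = (
    "ZIPKIN_ENABLED",
    "PROMETHEUS_ENABLED",
    "SIGAR_SYSTEM_METRICS",
    "JVM_SYSTEM_METRICS",
)


def check_monitoring(env):
    """Return enabled settings whose Kamon prerequisite is missing."""
    # Assumption: boolean settings are spelled "yes" / "no".
    if env.get("KAMON_ENABLED", "no") == "yes":
        return []
    return [name for name in REQUIRES_KAMON if env.get(name, "no") == "yes"]


def docker_run_command(env, image="hbpmip/woken:2.8.2"):
    """Build the `docker run` argument list from a dict of variables."""
    args = ["docker", "run", "--rm"]
    for name in sorted(env):
        args += ["--env", f"{name}={env[name]}"]
    args += ["--link", "woken", image]
    return args


# Hypothetical values, for illustration only.
env = {
    "CLUSTER_IP": "10.0.0.5",
    "CLUSTER_PORT": "8088",
    "LOG_LEVEL": "INFO",
    "ZIPKIN_ENABLED": "yes",
}

print(check_monitoring(env))  # ZIPKIN_ENABLED is reported: Kamon is off
print(shlex.join(docker_run_command(env)))
```

Passing the argument list to subprocess.run would start the container; printing it first makes it easy to review what will be executed.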

Getting started

Follow these steps to get started:

  1. Git-clone this repository:
  git clone https://github.com/LREN-CHUV/woken.git
  2. Change directory into your clone:
  cd woken
  3. Build the application.

You need the following software installed:

  • Docker 17.06 or better with docker-compose
  • Captain 1.1.0 or better

  ./build.sh
  4. Run the application.

You need the following software installed to execute some tests:

  cd tests
  ./run.sh

tests/run.sh uses docker-compose to start a full environment with Mesos, Zookeeper and Chronos, all of which are required for the proper execution of Woken.

  5. Create a DNS alias in /etc/hosts:
  127.0.0.1       localhost frontend

  6. Browse to http://frontend:8087 or run one of the query* scripts located in the 'tests' folder.
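
Before browsing to http://frontend:8087, it can help to confirm that the alias is in place. The following Python sketch parses hosts-file text and checks that a name is mapped to a given address. The function name has_host_alias is hypothetical and not part of Woken.

```python
def has_host_alias(hosts_text, alias, address="127.0.0.1"):
    """Return True if `alias` is mapped to `address` in hosts-file text."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split the entry on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == address and alias in fields[1:]:
            return True
    return False


# The entry from the steps above.
sample = "127.0.0.1       localhost frontend\n"
print(has_host_alias(sample, "frontend"))
```

To check the real file, pass the result of open("/etc/hosts").read() as hosts_text.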

Available Docker containers

The Docker containers that can be executed on this platform require a few specific features.

TODO: define those features - parameters passed as environment variables, in and out directories, entrypoint with a 'compute command', ...

The project algorithm-repository contains the Docker images that can be used with woken.

Available commands

Mining query

Performs a data mining task.

Path: /mining/job
Verb: POST

Takes a JSON document in the body, returns a JSON document.

The JSON input should be of the form:

  {
    "user": {"code": "user1"},
    "variables": [{"code": "var1"}],
    "covariables": [{"code": "var2"},{"code": "var3"}],
    "grouping": [{"code": "var4"}],
    "filters": [],
    "algorithm": "",
    "datasets": [{"code": "dataset1"},{"code": "dataset2"}]
  }

where:

  • variables is the list of variables
  • covariables is the list of covariables
  • grouping is the list of variables to group together
  • filters is the list of filters. The format comes from jQuery QueryBuilder filters, for example {"condition":"AND","rules":[{"id":"FULLNAME","field":"FULLNAME","type":"string","input":"text","operator":"equal","value":"Isaac Fulmer"}],"valid":true}
  • datasets is an optional list of datasets. In distributed mode it can be used to select the nodes to query, and in all cases it adds a filter rule of the form {"condition":"OR","rules":[{"field":"dataset","operator":"equals","value":"dataset1"},{"field":"dataset","operator":"equals","value":"dataset2"}]}
  • algorithm is the algorithm to use.
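
To make the shape of the dataset filter rule concrete, this small Python sketch builds the same OR rule from a list of dataset codes. It only mirrors the structure described above; Woken adds the rule on its own, and the function name dataset_filter is hypothetical.

```python
import json


def dataset_filter(datasets):
    """Build an OR rule over the 'dataset' field, one rule per dataset code."""
    return {
        "condition": "OR",
        "rules": [
            {"field": "dataset", "operator": "equals", "value": code}
            for code in datasets
        ],
    }


print(json.dumps(dataset_filter(["dataset1", "dataset2"])))
```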

Currently, the following algorithms are supported:

  • data: returns the raw data matching the query
  • linearRegression: performs a linear regression
  • summaryStatistics: computes summary statistics that can be used to draw box plots
  • knn
  • naiveBayes
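
A mining query could be prepared from Python along these lines. This is a sketch under several assumptions, not a reference client: the algorithm is passed as a plain string as the sample document suggests, the body is sent with a JSON content type, the service uses the HTTP basic authentication configured through WEB_SERVICES_USER and WEB_SERVICES_PASSWORD, and the base URL is the one from the getting started steps. The helper names and the password are hypothetical.

```python
import base64
import json
import urllib.request


def mining_query(user, variables, covariables, grouping, algorithm,
                 datasets=(), filters=None):
    """Assemble the JSON document expected by /mining/job."""
    def codes(names):
        return [{"code": name} for name in names]

    return {
        "user": {"code": user},
        "variables": codes(variables),
        "covariables": codes(covariables),
        "grouping": codes(grouping),
        "filters": filters if filters is not None else [],
        "algorithm": algorithm,  # assumption: a plain algorithm name
        "datasets": codes(datasets),
    }


def build_request(base_url, query, username, password):
    """Prepare (but do not send) a POST request with basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/mining/job",
        data=json.dumps(query).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


query = mining_query("user1", ["var1"], ["var2", "var3"], ["var4"],
                     "linearRegression", ["dataset1", "dataset2"])
# Hypothetical credentials, for illustration only.
request = build_request("http://frontend:8087", query, "admin", "secret")
print(request.get_method(), request.full_url)
```

Calling urllib.request.urlopen(request) would send the query to a running instance; it is left out here so that the sketch runs without a server.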

Experiment query

Performs an experiment comprised of several data mining tasks and an optional cross-validation step used to compute the fitness of each algorithm and select the best result.

TODO: document API

Release

You need the following software installed:

Execute the following command to distribute Woken as a Docker container:

  ./publish.sh

Installation

For production, woken requires Mesos and Chronos. To install them, you can use either:

  • mip-microservices-infrastructure, a collection of Ansible scripts deploying a full Mesos stack on Ubuntu servers.
  • mantl.io, a microservice infrastructure by Cisco, based on Mesos.
  • Mesosphere DC/OS (the datacenter operating system), an open-source, distributed operating system based on the Apache Mesos distributed systems kernel.

What's in a name?

Woken:

  • the Woken river in China - we were looking for rivers in China
  • passive form of awake - it launches Docker containers and computations
  • workflow - the previous name, not too different

Acknowledgements

This work has been funded by the European Union Seventh Framework Program (FP7/2007-2013) under grant agreement no. 604102 (HBP).

This work is part of SP8 of the Human Brain Project (SGA1).