Orchestrate a cluster of preemptible virtual machines on Google Compute Engine.
- Node.js
- Installing Node.js and npm
- Running as a command line tool
- Docker
- Setting up Docker to run on a local machine
- Setting up a Docker host on a local subnet
- Google Cloud CLI
- Get started with Google Cloud
- Slave Docker/VM image
npm i <tbc> -g
Linking allows changes made to the source code to be immediately reflected in the tool.
git clone https://github.com/conorturner/bach.git && \
cd bach && \
npm link
Applications are defined using a 'bachfile', which specifies the location of the binary to run in the computation. It also defines the hardware requirements for each slave node.
This use case supports a basic map and collect phase, reading from any HTTP storage that supports the 'Range' header. Documentation is available here.
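To illustrate why the 'Range' header matters for the map phase, here is a minimal sketch of how a file on HTTP storage could be divided into byte ranges, one per slave node. The names `splitIntoRanges` and `fetchRange` are hypothetical and are not part of bach; the sketch assumes Node.js 18 or later for the global `fetch`, and it does not claim to reflect how bach itself partitions input.

```javascript
// Hypothetical helper (not part of bach): split an object of `totalBytes`
// into at most `parts` contiguous, non-overlapping byte ranges.
function splitIntoRanges(totalBytes, parts) {
  if (!Number.isInteger(totalBytes) || totalBytes <= 0) {
    throw new RangeError("totalBytes must be a positive integer");
  }
  if (!Number.isInteger(parts) || parts <= 0) {
    throw new RangeError("parts must be a positive integer");
  }
  const count = Math.min(parts, totalBytes); // never create empty ranges
  const base = Math.floor(totalBytes / count);
  const remainder = totalBytes % count;
  const ranges = [];
  let start = 0;
  for (let i = 0; i < count; i++) {
    // The first `remainder` ranges get one extra byte each.
    const length = base + (i < remainder ? 1 : 0);
    const end = start + length - 1; // HTTP byte ranges are inclusive
    ranges.push({ start, end, header: `bytes=${start}-${end}` });
    start = end + 1;
  }
  return ranges;
}

// Hypothetical helper: download one range. Assumes Node.js 18+ (global fetch)
// and a server that answers range requests with 206 Partial Content.
async function fetchRange(url, range) {
  const res = await fetch(url, { headers: { Range: range.header } });
  if (res.status !== 206) {
    throw new Error(`Expected 206 Partial Content, got ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}
```

Note that a record may straddle the boundary between two ranges, so a real map phase would need some rule for deciding which node handles the split record.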
Documentation is available here.
Good source of datasets: https://registry.opendata.aws/
US IRS filings https://registry.opendata.aws/irs990/ https://s3.amazonaws.com/irs-form-990/index_20xx.json
Massive web crawl database https://registry.opendata.aws/commoncrawl/
NEXRAD weather radar data https://docs.opendata.aws/noaa-nexrad/readme.html Data can be searched by prefix, for example https://noaa-nexrad-level2.s3.amazonaws.com/?prefix=2019/01/19
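A small sketch of building such a prefix query for a given day. The function name `nexradListUrl` is hypothetical, and the `YYYY/MM/DD` prefix layout is an assumption inferred from the example URL above.

```javascript
// Hypothetical helper: build the bucket listing URL for one UTC day.
// Assumes keys are prefixed with YYYY/MM/DD, as in the example URL.
function nexradListUrl(date) {
  const pad = (n) => String(n).padStart(2, "0");
  const year = date.getUTCFullYear();
  const month = pad(date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const day = pad(date.getUTCDate());
  return `https://noaa-nexrad-level2.s3.amazonaws.com/?prefix=${year}/${month}/${day}`;
}
```

For example, `nexradListUrl(new Date(Date.UTC(2019, 0, 19)))` yields the URL shown above.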
Database of a subset of all 'events' that occur on this earth. Scraped from the internet, I assume. https://www.gdeltproject.org/#intro Smaller 1.1 GB version of the dataset: http://data.gdeltproject.org/events/GDELT.MASTERREDUCEDV2.1979-2013.zip
Headers for the 30 GB taxi dataset http://www.debs2015.org/call-grand-challenge.html