This is part of the Slipo project for mining Locations of Interest. It provides distributed implementations in Apache Spark for the following operations:
- Find hotspots in a collection of points in 2D or 3D (2D + time, see HotSpots_3D.scala) space using the Getis-Ord Gi* statistic.
- Find clusters using a distributed implementation of DBSCAN.
- Perform LDA (latent Dirichlet allocation) on a collection of documents.
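The Getis-Ord Gi* test above can be sketched in plain Scala for a single cell. This is only an illustrative, single-machine sketch, not the project's distributed implementation: the names (GiStarSketch, giStar, neighbourhood) are made up for this example, and spatial weights are assumed binary (1 for the cell and its neighbours, 0 otherwise).

```scala
// Illustrative sketch of the Getis-Ord Gi* statistic, assuming binary weights.
// A high positive Gi* marks a statistically significant hot spot.
object GiStarSketch {
  /** Gi* for a focal cell, given the values of its neighbourhood
    * (including the cell itself) and the values of all cells in the grid. */
  def giStar(neighbourhood: Seq[Double], allValues: Seq[Double]): Double = {
    val n    = allValues.size.toDouble
    val mean = allValues.sum / n
    // Global standard deviation S = sqrt(sum(x^2)/n - mean^2)
    val s    = math.sqrt(allValues.map(x => x * x).sum / n - mean * mean)
    // With binary weights, sum(w) and sum(w^2) both equal the neighbourhood size.
    val wSum = neighbourhood.size.toDouble
    val num  = neighbourhood.sum - mean * wSum
    val den  = s * math.sqrt((n * wSum - wSum * wSum) / (n - 1))
    num / den
  }
}
```

In the distributed setting, each cell's neighbourhood sum is computed per partition and the global mean and standard deviation are shared across workers; the per-cell arithmetic stays as above.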
Input coordinates can be transformed on the fly from a source to a destination EPSG code (and back) if specified in config.properties. For better accuracy, set these variables and express cell-eps (Hotspots) and eps (DBSCAN) in meters instead of the default decimal degrees: https://en.wikipedia.org/wiki/Decimal_degrees
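A config.properties fragment for this might look as follows. Note that the key names below are hypothetical placeholders, not the project's actual keys; check the shipped config.properties for the real names.

```properties
# Hypothetical sketch -- actual key names in config.properties may differ.
# Reproject from WGS84 (EPSG:4326, degrees) to Web Mercator (EPSG:3857, meters)
# so that eps / cell-eps can be given in meters.
sourceEpsg=4326
targetEpsg=3857
# DBSCAN radius and hotspot cell size, in meters thanks to the projection above
eps=500
cell-eps=200
```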
Requirements:
- sbt (interactive build tool): https://www.scala-sbt.org/download.html
How to run Hotspots-Distributed:
- Download or clone the project.
- Open a terminal inside the root folder.
- Run: sbt package
- Run the spark-submit script as follows:
  ./spark-submit --class runnables.(runnable) --master yarn --driver-memory 4g --executor-memory 4g path-to-generated-jar-file-from-Step-3.jar path-to-config.properties-File path-to-resources/EPSG_proj.csv
  where --class refers to the main runnable class, e.g. hotspots, dbscan, or lda.
If you want to submit through curl:
- Upload the following files to HDFS:
  - the generated jar, e.g. aoi-spark-2_2.11-0.1.jar
  - config.properties
  - resources/EPSG_proj.csv
  - the lib directory
- Fill config.properties and app.json with the appropriate variables (app.json takes 2 arguments in args: path_to_config, path_to_EPSG_proj).
- Submit the job to the Spark cluster (e.g. YARN) with:
  curl -d @app.json -H 'Content-Type: application/json' -X POST ..Cluster_Path/batches
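Assuming the /batches endpoint above is an Apache Livy-style batch API, app.json might look like the sketch below. The HDFS paths and class name are placeholders; adjust them to where you uploaded the files in the previous step.

```json
{
  "file": "hdfs:///user/slipo/aoi-spark-2_2.11-0.1.jar",
  "className": "runnables.hotspots",
  "args": [
    "hdfs:///user/slipo/config.properties",
    "hdfs:///user/slipo/EPSG_proj.csv"
  ],
  "driverMemory": "4g",
  "executorMemory": "4g"
}
```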
-
Note: aoi-spark is built against Spark version 2.2.3. It is recommended that the Spark cluster run the same version.
The contents of this project are licensed under the Apache License 2.0.