Trajectory is a software platform for automatically extracting topics from university course descriptions. It includes modules for data ingestion, learning, and visualization.
The basic requirements are Java JDK 7 or higher, Python 3.0 or higher, virtualenv, and Maven. Support for the database layer requires a system copy of MySQL, PostgreSQL, SQLite, or similar software. Support for the visualization layer requires a proxy web server (e.g. Apache, Nginx).
Note that this project contains an unholy combination of Bash scripts, Python tools, and Java code. Proceed with setup carefully.
Begin by exporting the $TRJ_HOME environment variable, which points at the repository root.
$ git clone http://github.com/jrouly/trajectory
$ cd trajectory
$ export TRJ_HOME=$(pwd)
Install Python dependencies by calling the bin/util/pysetup script. Java code will be compiled on demand.
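For example, from the repository root (this assumes pysetup creates a Python virtual environment inside the repository; the exact path is a guess):

$ cd $TRJ_HOME
$ bin/util/pysetup
$ source venv/bin/activate    # activate the virtualenv created by pysetup (path assumed)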
To specify or change the database URI and scheme, modify the config.py file. Specifically, look for DATABASE_URI. It defaults to a SQLite file named data.db.
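For example, assuming config.py holds an SQLAlchemy-style connection URI, switching from the default SQLite file to a PostgreSQL instance might look like the following (variable layout, host, and credentials are illustrative):

# config.py (sketch)
# Default: a local SQLite database file.
DATABASE_URI = "sqlite:///data.db"

# Alternative: a PostgreSQL database (hypothetical host and credentials).
# DATABASE_URI = "postgresql://trajectory:secret@localhost:5432/trajectory"

To serve the visualization layer behind a proxy web server, an Nginx server block along the following lines can be used: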
server {
    listen 80;

    location ^~ /static/ {
        root /TRJ_HOME/src/main/resources/web/static;
    }

    location / {
        proxy_pass http://localhost;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $server_name;
    }
}
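Note that proxy_pass should point at the address and port where the visualization server started by bin/web actually listens (see gunicorn.py), and /TRJ_HOME should be replaced with the absolute path of the repository on your system.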
$ bin/scrape download [-h] {targets}
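The download subcommand scrapes raw course data for the selected targets. For example (the target name below is hypothetical; pass -h to see the real choices):

$ bin/scrape download -h
$ bin/scrape download gmu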
$ bin/scrape export [-h] [--data-directory <directory>]
[--departments <departments>] [--cs]
This exports data in a format that can be read by the Learn module. The data directory defaults to data/. You can selectively filter which subjects are exported using the --departments flag.
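For example, to export only a couple of departments from the default data directory (department codes and argument syntax are illustrative):

$ bin/scrape export --departments CS MATH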
$ bin/learn -in <path> -out <path> [-iterations <n>] [-debug]
[-threads <n>] [-topics <n>] [-words <n>]
[-alpha <alpha>] [-beta <beta>]
The -in parameter must be an export location from the Scrape module. Results will be stored within a timestamped subdirectory of the -out directory. All other parameters are optional.
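A typical run over a previous export might look like the following (paths and parameter values are illustrative):

$ bin/learn -in data/ -out results/ -topics 50 -iterations 1000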
$ bin/scrape import-results [-h] --topic-file <file> --course-file <file>
[--alpha <alpha>] [--beta <beta>] [--iterations <iterations>]
Read the results of the Learn module (inferred topics) back into the database and pair them with existing course data. Multiple imports will simply add ResultSets to the existing database.
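For example, importing the output files of a previous Learn run (file names are purely illustrative; substitute the actual files from the timestamped results subdirectory):

$ bin/scrape import-results --topic-file results/20150101-1200/topics.dat --course-file results/20150101-1200/courses.dat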
$ bin/web
Activate the visualization server. See gunicorn.py for configuration settings. Note that the PID and log files are stored under $TRJ_HOME.
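The exact contents of gunicorn.py will differ per deployment; a minimal sketch of the kinds of settings it holds (all values below are assumptions, not project defaults) might be:

# gunicorn.py (sketch; values are illustrative)
import os

TRJ_HOME = os.environ["TRJ_HOME"]

bind = "127.0.0.1:8000"                           # address the Nginx proxy_pass should target
pidfile = os.path.join(TRJ_HOME, "gunicorn.pid")  # PID file stored under $TRJ_HOME
errorlog = os.path.join(TRJ_HOME, "gunicorn.log") # log file stored under $TRJ_HOME
workers = 2                                       # number of worker processes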
- Create common engine frameworks for catalog installs to be more DRY.
- Refactor configuration objects as a module.