Merge pull request #14 from dgraph-io/matthewmcneely/graph-loading

Showing 26 changed files with 19,114 additions and 25 deletions.
@@ -1,4 +1,6 @@
.DS_Store
.vscode/settings.json
tools/data
rdf-subset/*
rdf/*
notebook/spark-*
out/*
@@ -0,0 +1,31 @@
version: "3.8"

#
# A simple compose file for running single zero and alpha
#
services:

  # Dgraph Zero controls the cluster
  zero:
    image: dgraph/dgraph:latest
    container_name: vlg_zero
    volumes:
      - ~/dgraph/vlg:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    command: dgraph zero --my=zero:5080 --logtostderr -v=2 --telemetry sentry=false
  # Dgraph Alpha hosts the graph and indexes
  alpha:
    image: dgraph/dgraph:latest
    container_name: vlg_alpha
    volumes:
      - ~/dgraph/vlg:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    command: >
      dgraph alpha --my=alpha:7080 --zero=zero:5080
      --security whitelist=0.0.0.0/0
      --logtostderr -v=2
      --telemetry sentry=false
@@ -0,0 +1,153 @@
# Data Importing

Following the [schema design](2.%20Schema%20Design.md), we turn our attention to cleaning the data, converting it to a format suitable for importing into Dgraph, and finally importing it into a Dgraph cluster.

## Data Cleaning/Prepping

As mentioned in the [earlier note](2.%20Schema%20Design.md), we processed the raw Offshore Leaks CSV files and stored the data in Badger[^1], an open source key-value store maintained by Dgraph.
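
For orientation, reading those records back out of Badger is a straightforward key scan. The sketch below is a minimal, hypothetical example: it assumes the store lives in `tools/data` (per the repo's .gitignore) and that Entity records share an `entity:` key prefix, neither of which is guaranteed by the actual tools code.

```golang
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Assumption: the CSV-parsed records live in tools/data (ignored by git).
	db, err := badger.Open(badger.DefaultOptions("tools/data").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical key prefix; the real key scheme is defined by the tools package.
	prefix := []byte("entity:")
	err = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			item := it.Item()
			if err := item.Value(func(v []byte) error {
				fmt.Printf("%s -> %d bytes\n", item.Key(), len(v))
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```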

## Conversion of Data to Dgraph-importable Format

We extend the models already created in the [/tools/model](/tools/model) folder to export RDF representations of the data. Take, for instance, the RDF encoding function for the Entity type:

```golang
func (e *Entity) ToRDF(w io.Writer) {
	e.Normalize()
	id := e.RDFID()
	fmt.Fprintf(w, "%s <dgraph.type> \"Entity\" .\n", id)
	e.Record.ToRDF(w)
	RDFEncodeTriples(w, tstore.TriplesFromStruct(id, e))
}
```

This function first _normalizes_ the record (`e.Normalize()`) by cleaning up "boilerplate" terms, such as 'None' for the entity name, and by converting jurisdiction codes of XX (a data entry mistake) to XXX ("Unknown Jurisdiction"). The function then RDF-encodes all predicates of the record to the stream passed in.
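
The `Normalize` method itself isn't shown in this note. As a rough illustration only, the cleanup it performs might look something like the sketch below (the field names are assumptions, not the actual model definition in /tools/model).

```golang
// Illustrative sketch only; field names are assumptions, not the real Entity model.
type Entity struct {
	Name             string
	Jurisdiction     string
	JurisdictionCode string
}

func (e *Entity) Normalize() {
	// Drop boilerplate placeholder names such as "None".
	if e.Name == "None" {
		e.Name = ""
	}
	// "XX" is a data entry mistake; map it to "XXX" ("Unknown Jurisdiction").
	if e.JurisdictionCode == "XX" {
		e.JurisdictionCode = "XXX"
		e.Jurisdiction = "Unknown Jurisdiction"
	}
}
```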

Programs in [/tools/export](/tools/export) and [/tools/export-subset](/tools/export-subset/) load all (or a random subset of) Entity, Other, Officer, Intermediary and Address records and run the `ToRDF()` function, which writes the results to a configurable output folder (/rdf or /rdf-subset). Finally, these programs create relationships between the records by iterating over the [Relationship](/tools/model/relationship.go) records in the Badger store. Tests in the model package, such as the [entity test](/tools/model/entity_test.go), are a good example of what we expect from the conversion of CSV-parsed data to Dgraph-compatible, RDF-encoded data.

Note that one aspect of converting Address records to RDF involves consulting a pre-loaded CSV of geo-encoded data. See the section on geo-encoding of Address records below.

Exporting all records to Dgraph-compatible RDF output:
```bash
cd tools
go run export/main.go
```

Exporting a subset (100,000 random relationships):
```bash
cd tools
go run export-subset/main.go
```

Both programs write Dgraph-compatible RDF files to the `/rdf` or `/rdf-subset` folder, respectively. We'll use these RDF files in a subsequent section to populate a Dgraph cluster.

### Geo-encoding of Address Records

One stated goal for the VLG is the ability to query Address records by geo coordinates. The address entries in the Offshore Leaks dataset are non-standardized and often contain formatting errors, which makes it difficult to extract normalized address information from them.

We decided to attempt to geo-encode the US-based addresses in the dataset. Of the 24,366 US-based addresses, we were able to successfully geo-encode 17,893 (~73%).

In [/tools/geo/main.go](/tools/geo/main.go), we use libpostal[^2], a library for parsing/normalizing street addresses using statistical NLP and open geo data, together with the US Census Bureau's Geocoder API[^3], to convert addresses to geo-coordinates.

First, we launch a convenient Docker container for libpostal:

```bash
docker run -d -p 8899:8080 clicksend/libpostal-rest
```

The geo-encoding [program](/tools/geo/main.go) asks libpostal to normalize the raw address line into number, street, city and state components. If successful, it requests the latitude and longitude of the address from the US Census Bureau Geocoder API. Only if that step also succeeds does it write the result (lat/long) to a CSV file, [/tools/addresses-geoencoded.csv](/tools/addresses-geoencoded.csv), which is used during RDF conversion (see the Address.location field in the GraphQL SDL schema).
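
Condensed into a sketch, the per-address flow looks roughly like the program below. This is illustrative, not the actual /tools/geo/main.go: it assumes the container started above exposes libpostal-rest's `/parser` endpoint on the mapped port 8899, uses the Census Bureau's public `locations/address` endpoint, and omits the CSV writing and retry handling.

```golang
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// parsedComponent mirrors the /parser response of the libpostal-rest container
// (an assumption about the clicksend/libpostal-rest image used above).
type parsedComponent struct {
	Label string `json:"label"`
	Value string `json:"value"`
}

func parseAddress(raw string) (map[string]string, error) {
	body, _ := json.Marshal(map[string]string{"query": raw})
	resp, err := http.Post("http://localhost:8899/parser", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var parts []parsedComponent
	if err := json.NewDecoder(resp.Body).Decode(&parts); err != nil {
		return nil, err
	}
	out := map[string]string{}
	for _, p := range parts {
		out[p.Label] = p.Value
	}
	return out, nil
}

// geocode asks the US Census Bureau geocoder for a lat/long match.
func geocode(street, city, state string) (lat, lng float64, ok bool, err error) {
	q := url.Values{}
	q.Set("street", street)
	q.Set("city", city)
	q.Set("state", state)
	q.Set("benchmark", "Public_AR_Current")
	q.Set("format", "json")
	resp, err := http.Get("https://geocoding.geo.census.gov/geocoder/locations/address?" + q.Encode())
	if err != nil {
		return 0, 0, false, err
	}
	defer resp.Body.Close()
	var result struct {
		Result struct {
			AddressMatches []struct {
				Coordinates struct {
					X float64 `json:"x"` // longitude
					Y float64 `json:"y"` // latitude
				} `json:"coordinates"`
			} `json:"addressMatches"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return 0, 0, false, err
	}
	if len(result.Result.AddressMatches) == 0 {
		return 0, 0, false, nil
	}
	c := result.Result.AddressMatches[0].Coordinates
	return c.Y, c.X, true, nil
}

func main() {
	parts, err := parseAddress("1209 Orange Street, Wilmington, Delaware")
	if err != nil {
		log.Fatal(err)
	}
	street := parts["house_number"] + " " + parts["road"]
	lat, lng, ok, err := geocode(street, parts["city"], parts["state"])
	if err != nil || !ok {
		log.Fatalf("no geocoder match: %v", err)
	}
	fmt.Printf("%f,%f\n", lat, lng)
}
```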

Note that we could have written the results of the geo-encoding directly into the Badger store for each Address record that we successfully processed. However, the process described in this section takes several hours to complete, so we chose to write the results to a persistent CSV [file](/tools/addresses-geoencoded.csv) in the repo so those following along step-by-step would not be impeded by this lengthy process.

## Importing into Dgraph

At this point, we're ready to start a Dgraph cluster, apply the schema and load the RDF data that we just converted. For small datasets (such as the subset described above), the [Live Loader](https://dgraph.io/docs/howto/importdata/live-loader/) is recommended.

For initial loading of a new graph, or for import sizes larger than a million edges, the [Bulk Loader](https://dgraph.io/docs/howto/importdata/bulk-loader/) is the appropriate mechanism.

We'll describe both procedures below.

### Bulk Loading

The Dgraph Bulk Loader is used to create a new cluster. It takes RDF or JSON-formatted files and sends them through a map/shuffle/reduce process that can achieve a write rate close to 1 million edges per second[^4].

The first step is to start with a clean directory where the Dgraph data files will be stored. For this example, let's use a folder named `dgraph/vlg` under our HOME directory.

```bash
rm -rf ~/dgraph/vlg
```

We'll use Docker Compose to run and control our cluster; see the [docker-compose.yml](/docker-compose.yml) file in the top level of this repo.

The second step is to start just the Dgraph Zero service. A [Zero](https://dgraph.io/docs/deploy/dgraph-zero/) service manages a Dgraph cluster. It's responsible for balancing storage, assigning UIDs, and orchestrating the consensus protocol, among other things. For bulk loading, it's the only service we need active.

```bash
docker-compose up zero
```

Although it can be built and run on pretty much any platform, Dgraph is only officially supported on Linux. So it's easiest to run the Bulk Loader using the official Docker image. Note below that we mount the present working directory to the /home folder in the container in order to access the RDF and schema files.

In a new terminal:
```bash
docker run -v $(pwd):/home dgraph/dgraph:latest dgraph bulk -s /dev/null -g /home/schema/schema.graphql -f /home/rdf --zero host.docker.internal:5080 --out /home/out
```

After this completes, the `out` folder will contain the imported data. On my Mac M1 with 16GB of RAM, it takes less than two minutes to create over 28M edges.

Stop the Zero server with `docker-compose stop zero` or Ctrl-C in its running terminal.

Copy the `p` folder from the `./out/0` folder to the `~/dgraph/vlg` folder:

```bash
cp -r out/0/p ~/dgraph/vlg/
```

You're now ready to start the cluster normally.

```bash
docker-compose up
```

After a minute or so, your cluster should be healthy and ready to take requests.
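
As a quick sanity check (not part of the original walkthrough), you can confirm the bulk-loaded data is visible by querying the Alpha's gRPC endpoint with the dgo client. A minimal sketch, assuming the dgo v210 client and the ports published in docker-compose.yml:

```golang
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Alpha's gRPC port as published by docker-compose.yml; insecure is fine for local dev.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Fetch a handful of Entity nodes to confirm the imported data is queryable.
	const q = `{ entities(func: type(Entity), first: 5) { uid dgraph.type } }`
	resp, err := dg.NewReadOnlyTxn().Query(context.Background(), q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}
```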

### Live Loading a Subset

The Dgraph Live Loader can be used to load data into an existing graph, or, if the import set is smallish (~1M edges), it can also be used to create a new graph.

Let's create a graph from a subset of the relationships in the Offshore Leaks dataset (100,000 randomly selected relationships).

First, stop your existing cluster if one is running and remove the data directory:
```bash
docker-compose stop
rm -rf ~/dgraph/vlg
```

Start up a fresh cluster:
```bash
docker-compose up
```

Apply the GraphQL schema:
```bash
curl -Ss --data-binary '@./schema/schema.graphql' http://localhost:8080/admin/schema
```

If you haven't yet, use the export-subset program in the `tools` folder to export a subset of 100,000 random relationships:
```bash
cd tools
go run export-subset/main.go
cd ..
```

Invoke the Live Loader to load the newly exported subset:
```bash
docker run -v $(pwd):/home dgraph/dgraph:latest dgraph live -f /home/rdf-subset --alpha host.docker.internal:9080 --zero host.docker.internal:5080
```

Within a few minutes, the cluster will be fully loaded and ready to use.

[^1]: [dgraph/badger](https://github.com/dgraph-io/badger)
[^2]: [libpostal](https://github.com/openvenues/libpostal)
[^3]: [US Census Geo-encoding REST API](https://geocoding.geo.census.gov/geocoder/)
[^4]: [Dgraph Bulk Loader Article](https://dgraph.io/blog/post/bulkloader/)
Binary file not shown.