
Commit

Merge pull request #14 from dgraph-io/matthewmcneely/graph-loading
matthewmcneely authored Apr 26, 2023
2 parents 613d7f4 + dee4a0c commit ba577e1
Showing 26 changed files with 19,114 additions and 25 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,4 +1,6 @@
.DS_Store
.vscode/settings.json
tools/data
rdf-subset/*
rdf/*
notebook/spark-*
out/*
31 changes: 31 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,31 @@
version: "3.8"

#
# A simple compose file for running single zero and alpha
#
services:

  # Dgraph Zero controls the cluster
  zero:
    image: dgraph/dgraph:latest
    container_name: vlg_zero
    volumes:
      - ~/dgraph/vlg:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    command: dgraph zero --my=zero:5080 --logtostderr -v=2 --telemetry sentry=false
  # Dgraph Alpha hosts the graph and indexes
  alpha:
    image: dgraph/dgraph:latest
    container_name: vlg_alpha
    volumes:
      - ~/dgraph/vlg:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    command: >
      dgraph alpha --my=alpha:7080 --zero=zero:5080
      --security whitelist=0.0.0.0/0
      --logtostderr -v=2
      --telemetry sentry=false
2 changes: 1 addition & 1 deletion notes/2. Schema Design.md
@@ -1,6 +1,6 @@
# Schema Design

Following the [raw data analysis](1. Raw Data Analysis.md), we can now start the process of designing a schema for the graph to store the Offshore Leaks Dataset. But a good task to accomplish prior to that is an analysis of the ways users might want to query the graph.
Following the [raw data analysis](1.%20Raw%20Data%20Analysis.md), we can now start the process of designing a schema for the graph to store the Offshore Leaks Dataset. But a good task to accomplish prior to that is an analysis of the ways users might want to query the graph.

## Queries (currently available on the ICIJ website)

153 changes: 153 additions & 0 deletions notes/3. Data Importing.md
@@ -0,0 +1,153 @@
# Data Importing

Following the [schema design](2.%20Schema%20Design.md), we turn our attention to cleaning the data, converting it to a format suitable for importing into Dgraph, and finally importing it into a Dgraph cluster.

## Data Cleaning/Prepping

As mentioned in the [earlier note](2.%20Schema%20Design.md), we processed the raw Offshore Leaks CSV files and stored the data in Badger[^1], an open source key-value store maintained by Dgraph.

## Conversion of Data to Dgraph-importable Format

We extend the models already created in the [/tools/model](/tools/model) folder to export RDF representations of the data. Take, for instance, the RDF encoding function for the Entity type:

```golang
func (e *Entity) ToRDF(w io.Writer) {
	e.Normalize()
	id := e.RDFID()
	fmt.Fprintf(w, "%s <dgraph.type> \"Entity\" .\n", id)
	e.Record.ToRDF(w)
	RDFEncodeTriples(w, tstore.TriplesFromStruct(id, e))
}
```

This function first _normalizes_ the record (`e.Normalize()`) by cleaning up boilerplate values, such as 'None' for the entity name, and by converting jurisdiction codes of XX (a data-entry mistake) to XXX ("Unknown Jurisdiction"). It then RDF-encodes all of the record's predicates to the supplied stream.
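
For a rough idea of what such a normalization pass can look like, here is a minimal sketch. The field names, the set of boilerplate terms, and the rules below are illustrative assumptions, not the repo's actual code (which lives in [/tools/model](/tools/model)).

```golang
package model

import "strings"

// Entity fields shown here are illustrative; the real struct lives in /tools/model.
type Entity struct {
	Name         string
	Jurisdiction string
}

// Normalize cleans up boilerplate values before RDF encoding (illustrative sketch).
func (e *Entity) Normalize() {
	// Placeholder names such as 'None' carry no information; blank them out.
	if strings.EqualFold(strings.TrimSpace(e.Name), "none") {
		e.Name = ""
	}
	// A jurisdiction code of XX is a data-entry mistake; XXX means "Unknown Jurisdiction".
	if e.Jurisdiction == "XX" {
		e.Jurisdiction = "XXX"
	}
}
```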

Programs in [/tools/export](/tools/export) and [/tools/export-subset](/tools/export-subset/) load all (or a random subset of) Entity, Other, Officer, Intermediary, and Address records and run the `ToRDF()` function, which writes the results to a configurable output folder (/rdf or /rdf-subset). Finally, these programs create relationships between the records by iterating over the [Relationship](/tools/model/relationship.go) records in the Badger store. Tests in the model package, such as the [entity test](/tools/model/entity_test.go), are a good example of what we expect in the conversion from CSV-parsed data to Dgraph-compatible, RDF-encoded data.
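
The core of such an export pass looks roughly like the sketch below. This is not the repo's actual code: the `entity/` key prefix, the JSON value encoding, the `tools/data` path, and the Badger v3 import path are all assumptions, and the `Entity` type with its `ToRDF()` method is assumed to be in scope from the model package.

```golang
package main

import (
	"encoding/json"
	"log"
	"os"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Open the Badger store produced by the earlier CSV-processing step.
	db, err := badger.Open(badger.DefaultOptions("tools/data"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	out, err := os.Create("rdf/entities.rdf")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Iterate every Entity record and stream its RDF triples to the output file.
	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.Prefix = []byte("entity/") // assumed key prefix
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			if err := it.Item().Value(func(val []byte) error {
				var e Entity
				if err := json.Unmarshal(val, &e); err != nil {
					return err
				}
				e.ToRDF(out) // normalizes, then writes the record's predicates
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```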

Note that one aspect of converting Address records to RDF involves consulting a pre-loaded CSV of geo-encoded data; see the section on geo-encoding of Address records below.
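
A minimal sketch of that lookup might look like the following. The CSV column layout (internal ID, latitude, longitude) and the `Address.location` predicate name are assumptions; the `^^<geo:geojson>` literal form is the one Dgraph documents for geo values in RDF.

```golang
package model

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
)

// loadGeo reads the pre-built CSV of geo-encoded addresses into a map keyed by
// the address's internal ID. The column layout (internalID, lat, long) is an assumption.
func loadGeo(path string) (map[string][2]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		return nil, err
	}
	geo := make(map[string][2]string)
	for _, row := range rows {
		geo[row[0]] = [2]string{row[1], row[2]} // internalID -> {lat, long}
	}
	return geo, nil
}

// writeLocation emits the address's coordinates as a GeoJSON literal, the form
// Dgraph accepts for geo predicates in RDF. The predicate name is an assumption
// based on how Dgraph maps the GraphQL field Address.location.
func writeLocation(w io.Writer, id, lat, long string) {
	fmt.Fprintf(w,
		"%s <Address.location> \"{'type':'Point','coordinates':[%s,%s]}\"^^<geo:geojson> .\n",
		id, long, lat) // GeoJSON order is [longitude, latitude]
}
```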

Exporting all Records to Dgraph-compatible RDF output:
```bash
cd tools
go run export/main.go
```

Exporting a subset (100,000 random relationships):
```bash
cd tools
go run export-subset/main.go
```

Both programs write Dgraph-compatible RDF files to the `/rdf` or `/rdf-subset` folder, respectively. We'll use these RDF files in a subsequent section to populate a Dgraph cluster.

### Geo-encoding of Address Records

One stated goal for the VLG is the ability to query Address records by geo coordinates. The address entries in the Offshore Leaks dataset are non-standardized and often contain formatting errors that make it difficult to extract normalized address information from them.

We decided to attempt to geo-encode the US-based addresses in the dataset. Of the 24,366 US-based addresses, we were able to successfully geo-encode only 17,893 (~73%).

In [/geo/main.go](/tools/geo/main.go), we use two services to convert addresses to geo-coordinates: libpostal[^2], a library for parsing/normalizing street addresses using statistical NLP and open geo data, and the US Census Bureau's Geocoder API[^3].

First, we launch a convenient Docker container that exposes libpostal over REST:

```bash
docker run -d -p 8899:8080 clicksend/libpostal-rest
```

The geo-encoding [program](/tools/geo/main.go) asks libpostal to parse the raw address line into normalized number, street, city, and state components. If that succeeds, it queries the US Census Bureau Geocoder API for the latitude and longitude of the address. Only if that step also succeeds does it write the result (lat/long) to the CSV file [/tools/addresses-geoencoded.csv](/tools/addresses-geoencoded.csv); this file will be used during RDF conversion (see the `Address.location` field in the GraphQL SDL schema).
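
The round trip looks roughly like the sketch below. The libpostal-rest `/parser` request shape and the Census `onelineaddress` endpoint parameters follow those projects' public docs, but treat the details here as illustrative assumptions rather than the repo's exact code.

```golang
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// parseAddress asks the local libpostal-rest container (port 8899 per the docker
// run command above) to split a raw address line into labeled components.
func parseAddress(raw string) (map[string]string, error) {
	body, _ := json.Marshal(map[string]string{"query": raw})
	resp, err := http.Post("http://localhost:8899/parser", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var parts []struct {
		Label string `json:"label"`
		Value string `json:"value"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&parts); err != nil {
		return nil, err
	}
	out := make(map[string]string)
	for _, p := range parts {
		out[p.Label] = p.Value // e.g. house_number, road, city, state
	}
	return out, nil
}

// geocode asks the US Census Bureau geocoder for the coordinates of a one-line
// address. It returns ok=false when the service finds no match.
func geocode(oneLine string) (lat, lng float64, ok bool, err error) {
	q := url.Values{
		"address":   {oneLine},
		"benchmark": {"Public_AR_Current"},
		"format":    {"json"},
	}
	resp, err := http.Get("https://geocoding.geo.census.gov/geocoder/locations/onelineaddress?" + q.Encode())
	if err != nil {
		return 0, 0, false, err
	}
	defer resp.Body.Close()
	var res struct {
		Result struct {
			AddressMatches []struct {
				Coordinates struct {
					X float64 `json:"x"` // longitude
					Y float64 `json:"y"` // latitude
				} `json:"coordinates"`
			} `json:"addressMatches"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return 0, 0, false, err
	}
	if len(res.Result.AddressMatches) == 0 {
		return 0, 0, false, nil
	}
	c := res.Result.AddressMatches[0].Coordinates
	return c.Y, c.X, true, nil
}

func main() {
	// Example round trip for a single raw address line.
	raw := "1600 Pennsylvania Ave NW Washington DC"
	if parts, err := parseAddress(raw); err == nil {
		fmt.Println("parsed:", parts)
	}
	if lat, lng, ok, err := geocode(raw); err == nil && ok {
		fmt.Printf("lat=%f lng=%f\n", lat, lng)
	}
}
```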

Note that we could have written the results of the geo-encoding directly into the Badger store for each Address record we successfully processed. However, the process described in this section takes several hours to complete, so we chose to write the results to a persistent CSV [file](/tools/addresses-geoencoded.csv) in the repo so that those following along step-by-step would not be held up by this lengthy process.

## Importing into Dgraph

At this point, we're ready to start a Dgraph cluster, apply the schema and load the RDF data that we just converted. For small datasets (such as the subset described above), the [Live Loader](https://dgraph.io/docs/howto/importdata/live-loader/) is recommended.

For initial loading of a new graph, or for import sizes larger than a million edges, the [Bulk Loader](https://dgraph.io/docs/howto/importdata/bulk-loader/) is the appropriate mechanism.

We'll describe both procedures below.

### Bulk Loading

The Dgraph Bulk Loader is used to create a new cluster. It takes RDF- or JSON-formatted files and sends them through a map/shuffle/reduce process that can achieve a write rate close to 1 million edges per second[^4].

The first step is to start with a clean directory where the Dgraph graph data files will be stored. For this example, let's use a folder named `dgraph/vlg` under our `HOME` directory.

```bash
rm -rf ~/dgraph/vlg
```

We'll use Docker Compose to run and control our cluster; see the [docker-compose.yml](/docker-compose.yml) file at the top level of this repo.

The second step is to start just the Dgraph Zero service. A [Zero](https://dgraph.io/docs/deploy/dgraph-zero/) service manages a Dgraph cluster. It's responsible for balancing storage, assigning UIDs, and orchestrating the consensus protocol, among other things. For bulk loading, it's the only service we need active.

```bash
docker-compose up zero
```

Although it can be built and run on pretty much any platform, Dgraph is officially supported only on Linux, so it's easiest to run the Bulk Loader using the official Docker image. Note below that we mount the present working directory to the `/home` folder in the container so the loader can access the RDF and schema files.

In a new terminal...
```bash
docker run -v $(pwd):/home dgraph/dgraph:latest \
  dgraph bulk -s /dev/null -g /home/schema/schema.graphql -f /home/rdf \
  --zero host.docker.internal:5080 --out /home/out
```

After this completes, the `out` folder will contain the imported data. On my Mac M1 with 16GB of RAM, it takes less than two minutes to create over 28M edges.

Stop the Zero server with `docker-compose stop zero` or Ctrl-C in its running terminal.

Copy the `p` folder from the `./out/0` folder to the `~/dgraph/vlg` folder.

```bash
cp -r out/0/p ~/dgraph/vlg/
```

You're now ready to start the cluster normally.

```bash
docker-compose up
```

After a minute or so, your cluster should be healthy and ready to take requests.
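
To confirm the Alpha is up, you can hit its health endpoint (Dgraph Alpha serves `/health` on its HTTP port, 8080 in this compose file); it should report a `healthy` status:
```bash
curl -Ss http://localhost:8080/health
```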


### Live Loading a Subset

The Dgraph Live Loader can be used to load data into an existing graph, or, if the import set is smallish (roughly a million edges or fewer), it can also be used to create a new graph.

Let's create a graph of a subset of the relationships from the Offshore Leaks Dataset (100,000 randomly selected relationships).

First, stop your existing cluster if one is running and remove the data directory:
```bash
docker-compose stop
rm -rf ~/dgraph/vlg
```

Start up a fresh cluster:
```bash
docker-compose up
```

Apply the GraphQL schema:
```bash
curl -Ss --data-binary '@./schema/schema.graphql' http://localhost:8080/admin/schema
```
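
To verify the schema was accepted, you can ask the admin endpoint to echo it back (the `getGQLSchema` query is part of Dgraph's `/admin` GraphQL API):
```bash
curl -Ss -X POST -H "Content-Type: application/json" \
  --data '{"query":"{ getGQLSchema { schema } }"}' \
  http://localhost:8080/admin
```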

If you haven't yet, use the export-subset program in the `tools` folder to export a subset of 100,000 random relationships:
```bash
cd tools
go run export-subset/main.go
cd ..
```

Invoke the Live Loader to load the newly exported subset:
```bash
docker run -v $(pwd):/home dgraph/dgraph:latest \
  dgraph live -f /home/rdf-subset \
  --alpha host.docker.internal:9080 --zero host.docker.internal:5080
```

Within a few minutes, the cluster will be fully loaded and ready to use.
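
As a quick smoke test, you can run a GraphQL query against the loaded graph. This example assumes the generated API exposes Dgraph's aggregate queries (available in recent Dgraph versions) and uses the Entity type from the schema:
```bash
curl -Ss -X POST -H "Content-Type: application/json" \
  --data '{"query":"{ aggregateEntity { count } }"}' \
  http://localhost:8080/graphql
```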

[^1]: [dgraph/badger](https://github.com/dgraph-io/badger)
[^2]: [libpostal](https://github.com/openvenues/libpostal)
[^3]: [US Census Geocoder REST API](https://geocoding.geo.census.gov/geocoder/)
[^4]: [Dgraph Bulk Loader Article](https://dgraph.io/blog/post/bulkloader/)

Binary file added rdf-subset/data.rdf.gz
10 changes: 7 additions & 3 deletions schema/schema.graphql
@@ -1,8 +1,12 @@
"""
Source is the source of the data.
"""
enum Source {
  PandoraPapers
  ParadisePapers
  OffshoreLeaks
  BahamasLeak
  BahamasLeaks
  PanamaPapers
}

"""
@@ -16,7 +20,7 @@ interface Record {
"The ICIJ internal ID"
internalID: String @search(by: [exact])
"The record source"
sourceID: String @search(by: [exact])
sourceID: Source! @search(by: [exact])
"Notes about the record"
notes: String @search(by: [fulltext])

@@ -96,7 +100,7 @@ enum OtherType {
}

"""
Other represents other companies, trusts and foundations.
Other (an ICIJ term) represents other companies, trusts and foundations.
"""
type Other implements Record {
"The former name of the entity"