Keeping Addressbase up to date

Setting up the database to write import scripts

This is only necessary if you're writing import scripts, as opposed to working on the API or front end. There are various licensing requirements around Addressbase, so we can only make it available to those working on importing this data.

These commands delete the old tables first, so when a new dataset comes out you can generally just run them again.

Download addressbase

Get a copy of Addressbase that has been 'cleaned' and call it addressbase_cleaned.csv. You can use an S3 browser, or:

$ AWS_PROFILE=<your-dem-club-profile> aws s3 cp s3://<bucket-name>/addressbase/<most-recently-cleaned>.csv path/on/your/machine/addressbase_cleaned.csv

Append --dryrun to see what will happen.
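For example (the bucket name stays a placeholder as above, and the file name is only illustrative - list the bucket first to find the most recent cleaned file):

$ AWS_PROFILE=<your-dem-club-profile> aws s3 ls s3://<bucket-name>/addressbase/
$ AWS_PROFILE=<your-dem-club-profile> aws s3 cp s3://<bucket-name>/addressbase/AddressBasePlus_FULL_2023-12-15_cleaned.csv ~/data/addressbase/addressbase_cleaned.csv --dryrun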

Import addressbase

This assumes you've completed the setup described in the README (tl;dr: set up a venv, install the Python dependencies, create a PostGIS database, import ONSPD and councils).

To import the addresses run:

$ python manage.py import_cleaned_addresses path/to/dir/containing/the/csv/

NB: the path should be to a directory containing a CSV called addressbase_cleaned.csv.
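For example, with a hypothetical ~/data/addressbase/ directory (the command only cares that it contains addressbase_cleaned.csv):

$ ls ~/data/addressbase/
addressbase_cleaned.csv
$ python manage.py import_cleaned_addresses ~/data/addressbase/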

Create a uprn to council lookup table

This is a two-step process: we create a lookup based on a spatial join between addressbase and councils, dump it to a CSV, and then import that CSV as a table. We use a new lookup table because updating every row in addressbase with a GSS code is really slow. A combined sketch follows the list below.

  • Create a csv: $ python manage.py create_uprn_council_lookup
    • You can add a path to where the csv will be created but you don't need to. It will create a file uprn-to-councils.csv by default.
  • Import the csv: $ python manage.py import_uprn_council_lookup uprn-to-councils.csv
    • If you wrote the csv somewhere else (e.g. /tmp) then amend the path.
  • Delete uprn-to-councils.csv once it has been imported.
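Put together, a sketch of the whole lookup refresh run from the project root (assuming the default output path mentioned above):

$ python manage.py create_uprn_council_lookup
$ python manage.py import_uprn_council_lookup uprn-to-councils.csv
$ rm uprn-to-councils.csv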

Updating from a new Addressbase release

  • Download a new version of Addressbase Plus from the Ordnance Survey website.
  • Inside your UK Polling Stations virtual env run python manage.py clean_addressbase_plus folder/with/addressbase/csvs/. This will create a file called addressbase_cleaned.csv in the same folder.
  • You can now follow the steps Import addressbase and Create a uprn to council lookup table above to update your dev copy.

To make this available for building new images:

  • Rename addressbase_cleaned.csv to something of the form ABFLGB_CSV_2020-01-22_cleaned.csv and upload it to s3://pollingstations-packer-assets/addressbase/ (see the sketch below).
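A sketch of that step (the date in the file name is hypothetical - use the release date of the data you downloaded):

$ mv addressbase_cleaned.csv ABFLGB_CSV_2024-01-22_cleaned.csv
$ aws s3 cp ABFLGB_CSV_2024-01-22_cleaned.csv s3://pollingstations-packer-assets/addressbase/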

Updating Addressbase in Production

Preprocess

  • Download the new Addressbase Plus - get the latest version from the OS Data Hub. This is only possible for DC staff.
  • Create addressbase_cleaned.csv and uprn-to-councils.csv (see the sections above).
  • Upload them to s3://pollingstations.private.data, namespaced with the Addressbase release, e.g. s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_cleaned.csv (see the sketch below).
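For example, for the 2023-12-15 release used in the examples below (adjust the date to match the release you're processing):

$ aws s3 cp addressbase_cleaned.csv s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_cleaned.csv
$ aws s3 cp uprn-to-councils.csv s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_uprn-to-councils.csv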

Set up worker instance

worker instance

  • From the AWS console, launch an instance from the launch template in the account - there should only be one. Use the latest template version (i.e. 25 rather than 1).
  • Use something like an m5.large. Pick a different instance type from the existing instance so it's easy to tell them apart.
  • You don't need a key pair. Use EC2 Instance Connect / mssh (see the sketch after this list).
  • This instance will be referred to as the 'worker instance'.
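A sketch of connecting with mssh (assumes the EC2 Instance Connect CLI is installed locally; the instance ID is a placeholder, and region/profile come from your usual AWS environment variables):

$ mssh ubuntu@<worker-instance-id>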

Switch off replication

application instance

  • The existing instance(s) (i.e. not the worker instance) will be referred to as the 'application instance' - it's the one serving the website. Do this when there's only one running.
  • SSH into application instance.
  • Stop replication:
    sudo systemctl stop polling_stations_db_replication.service
  • Delete the replication file:
    sudo rm /var/www/polling_stations/home/db_replication_complete
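  • Optionally, double-check the state before moving on (a small sanity check, not part of the original steps):
    sudo systemctl status polling_stations_db_replication.service   # expect: inactive (dead)
    ls /var/www/polling_stations/home/db_replication_complete   # expect: No such file or directory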

Copy CSV to DB

worker instance

  • SSH into worker instance
  • Copy the CSVs to the working directory (probably /home/ubuntu/). NB: whatever the paths of the files are here, they need to match the paths in the \copy lines of the copy-db script below.
    aws s3 cp s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_cleaned.csv addressbase_cleaned.csv
    aws s3 cp s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_uprn-to-councils.csv uprn-to-councils.csv
  • create an executable file called copy-db:
    touch copy-db && chmod u+x copy-db
  • Copy the following script into the file:
#!/bin/bash

# Exit on first error
set -e

# Check if a connection string is provided
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <connection-string>"
    exit 1
fi

# Database connection string
CONNECTION="$1"

psql "$CONNECTION" -c "TRUNCATE TABLE addressbase_address CASCADE;"
psql "$CONNECTION" -c "ALTER TABLE addressbase_address DROP CONSTRAINT IF EXISTS addressbase_address_pkey CASCADE;"
psql "$CONNECTION" -c 'ALTER TABLE addressbase_uprntocouncil DROP CONSTRAINT IF EXISTS addressbase_uprntocouncil_pkey CASCADE;'

psql "$CONNECTION" -c "DROP INDEX IF EXISTS address_location_gist, address_postcode_idx, address_postcode_like_idx, address_uprn_like_idx;"
psql "$CONNECTION" -c "DROP INDEX  IF EXISTS uprntocouncil_adv_v_station_idx, uprntocouncil_uprn_like_idx, uprntocouncil_lad_idx;"

psql "$CONNECTION" -c "\copy addressbase_address (UPRN, address, postcode, location, addressbase_postal) FROM '/home/ubuntu/addressbase_cleaned.csv' WITH (FORMAT CSV, DELIMITER ',', QUOTE '\"');"
psql "$CONNECTION" -c "\copy addressbase_uprntocouncil (uprn,lad,polling_station_id,advance_voting_station_id) FROM '/home/ubuntu/uprn-to-councils.csv' WITH (FORMAT CSV, DELIMITER ',', null '\N');"

psql "$CONNECTION" -c "ALTER TABLE addressbase_address ADD PRIMARY KEY (uprn);"
psql "$CONNECTION" -c "ALTER TABLE addressbase_uprntocouncil ADD PRIMARY KEY (uprn);"
psql "$CONNECTION" -c "ALTER TABLE addressbase_uprntocouncil ADD CONSTRAINT uprntocouncil_uprn_fk_address_uprn FOREIGN KEY (uprn) REFERENCES addressbase_address (uprn) MATCH FULL;"

psql "$CONNECTION" -c "CREATE INDEX address_location_gist ON public.addressbase_address USING gist (location);"
psql "$CONNECTION" -c "CREATE INDEX address_postcode_idx ON public.addressbase_address USING btree (postcode);"
psql "$CONNECTION" -c "CREATE INDEX address_postcode_like_idx ON public.addressbase_address USING btree (postcode varchar_pattern_ops);"
psql "$CONNECTION" -c "CREATE INDEX address_uprn_like_idx ON public.addressbase_address USING btree (uprn varchar_pattern_ops);"
psql "$CONNECTION" -c "CREATE INDEX uprntocouncil_adv_v_station_idx ON public.addressbase_uprntocouncil USING btree (advance_voting_station_id);"
psql "$CONNECTION" -c "CREATE INDEX uprntocouncil_uprn_like_idx ON public.addressbase_uprntocouncil USING btree (uprn varchar_pattern_ops);"
psql "$CONNECTION" -c "CREATE INDEX uprntocouncil_lad_idx ON public.addressbase_uprntocouncil USING btree (lad);"

  • The script takes ~50 minutes to run, so it's best to do it inside tmux:
    tmux new -s db

  • Run the script passing the RDS connection string as the first parameter:
    ./copy-db postgresql://user:password@host/db-name
    Make sure the connection string is for the correct environment!
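  • A sketch of doing this inside tmux (the session name is just an example; detach with Ctrl-b d and reattach later to check progress):
    tmux new -s db
    ./copy-db postgresql://user:password@host/db-name
    # detach with Ctrl-b d, then later:
    tmux attach -t db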

Create records missing from uprn-to-councils.csv

application instance

  • SSH into application instance
  • Become polling_stations and start a Python app shell:
    # switching user should also cd to /var/www/polling_stations/code and activate the venv
    sudo su polling_stations
  • Enter console
    # Run shell plus. Optionally pip install ipython beforehand.
    python manage.py shell_plus
  • Run the following code:
    from addressbase.models import Address, UprnToCouncil
    Address.objects.using('principal').count()
    UprnToCouncil.objects.using('principal').count()
    Address.objects.using('local').count() # This will be 'default' instead of local if you haven't deleted /var/www/polling_stations/home/db_replication_complete
    
    UprnToCouncil.objects.using('local').count()
    
    for address in Address.objects.using('principal').filter(uprntocouncil__isnull=True):
        council = address.get_council_from_others_in_postcode()
        if council:
            print(
                f"Creating UprnToCouncil record for {address.uprn} with gss {council.geography.gss}"
            )
            UprnToCouncil.objects.using('principal').create(
                uprn=address, lad=council.geography.gss, polling_station_id=""
            )
        else:
            print(f"Council ambiguous for {address.uprn}, deleting")
            address.delete()
    Address.objects.using('principal').count()
    UprnToCouncil.objects.using('principal').count()
    Address.objects.using('local').count()
    UprnToCouncil.objects.using('local').count()
    N.B. The counts are there to check you hit the right DB!

Redeploy instance

  • The easiest way to pick up the changes is to do a redeploy, so merge a PR that has some code changes in it.

Terminate worker instance

  • In the AWS console, terminate the worker instance.