-
Notifications
You must be signed in to change notification settings - Fork 30
Keeping addressbase up to date
This is necessary for anyone writing import scripts, as opposed to working on the API or front end. There are various licencing requirements to use addressbase so we can only make it available to those working on importing this data.
These commands will delete the old tables, so if you need to update things when a new dataset comes out you can generally just run the command.
Get a copy of addressbase that has been 'cleaned' and call it 'addressbase_cleaned'. You can use an s3 browser or:
$ AWS_PROFILE=<your-dem-club-profile> aws s3 cp s3 s3://<bucket-name>/addressbase/<most-recently-cleaned>.csv path/on/your/machine/addressbase_cleaned.csv
Append --dryrun
to see what will happen.
This assumes you've completed the set up described in the readme (tl;dr: setup a venv, install python dependencies, make a postgis db, import ONSPD and councils).
To import the addresses run:
$ python manage.py import_cleaned_addresses path/to/dir/containing/the/csv/
NB path should be to a directory which contains a csv called addressbase_cleaned.csv
This is a two step process where we create a lookup based on a spatial join between addressbase and councils, dump it to a csv, and then import that csv as a table. We create a new lookup table because updating every row in addressbase with a gss code is really slow.
- Create a csv:
$ python manage.py create_uprn_council_lookup
- You can add a path to where the csv will be created but you don't need to. It will create a file
uprn-to-councils.csv
by default.
- You can add a path to where the csv will be created but you don't need to. It will create a file
- Import the csv:
$ python manage.py import_uprn_council_lookup uprn-to-councils.csv
- If you wrote the csv somewhere else (eg
/tmp
) then amend the path.
- If you wrote the csv somewhere else (eg
- Delete the uprn-to-councils.csv
- Download a new version of Addressbase Plus from the Ordnance Survey website.
- Inside your UK Polling Stations virtual env run
python manage.py clean_addressbase_plus folder/with/addressbase/csvs/
. This will create a file calledaddressbase_cleaned.csv
in the same folder. - You can now follow the steps Import addressbase and Create a uprn to council lookup table above to update your dev copy.
To make this available for building new images:
- Rename this to something of the form
ABFLGB_CSV_2020-01-22_cleaned.csv
and upload it to s3://pollingstations-packer-assets/addressbase/.
- Download new Addressbase Plus - Get latest version from OS Data Hub. This is only possible for DC staff.
- Create
cleaned_addressbase.csv
anduprn-to-councils.csv
- upload to
s3://pollingstations.private.data
- name space with addressbase release. eg:s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_cleaned.csv
worker instance
- From the AWS console launch an instance from the launch template in the account - there should only be one. Use the latest template version (i.e. 25 rather than 1)
- Use something like an mx5.large. Make it different from the existing instance to make you're life easy when distinguishing them.
- You don't need a key pair. Use instance connect/mssh.
- This instance will be referred to as the 'worker instance'.
application instance
- The existing instance(s) (i.e. not the worker instance) will be reffered to as the 'application instance' - it's the one serving the website. Do this when there's only one running.
- SSH into application instance.
- Stop replication:
sudo systemctl stop polling_stations_db_replication.service
- Delete the replication file:
sudo rm /var/www/polling_stations/home/db_replication_complete
worker instance
- SSH into worker instance
- Copy the CSVs to the working directory (probably
/home/ubuntu/
) NB. whatever the paths of the files is here, needs to match the paths in the\copy
lines of thecopy-db
script below.aws s3 cp s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_cleaned.csv addressbase_cleaned.csv aws s3 cp s3://pollingstations.private.data/addressbase/AddressBasePlus_FULL_2023-12-15_uprn-to-councils.csv uprn-to-councils.csv
- create an executable file called copy-db:
touch copy-db && chmod u+x copy-db
- Copy the following script into the file:
#!/bin/bash
# Exit on first error
set -e
# Check if a connection string is provided
if [ "$#" -ne 1 ]; then
echo "Usage: $0 <connection-string>"
exit 1
fi
# AWS CLI profile name
CONNECTION="$1"
psql "$CONNECTION" -c "TRUNCATE TABLE addressbase_address CASCADE;"
psql "$CONNECTION" -c "ALTER TABLE addressbase_address DROP CONSTRAINT IF EXISTS addressbase_address_pkey CASCADE;"
psql "$CONNECTION" -c 'ALTER TABLE addressbase_uprntocouncil DROP CONSTRAINT IF EXISTS addressbase_uprntocouncil_pkey CASCADE;'
psql "$CONNECTION" -c "DROP INDEX IF EXISTS address_location_gist, address_postcode_idx, address_postcode_like_idx, address_uprn_like_idx;"
psql "$CONNECTION" -c "DROP INDEX IF EXISTS uprntocouncil_adv_v_station_idx, uprntocouncil_uprn_like_idx, uprntocouncil_lad_idx;"
psql "$CONNECTION" -c "\copy addressbase_address (UPRN, address, postcode, location, addressbase_postal) FROM '/home/ubuntu/addressbase_cleaned.csv' WITH (FORMAT CSV, DELIMITER ',', QUOTE '\"');"
psql "$CONNECTION" -c "\copy addressbase_uprntocouncil (uprn,lad,polling_station_id,advance_voting_station_id) FROM '/home/ubuntu/uprn-to-councils.csv' WITH (FORMAT CSV, DELIMITER ',', null '\N');"
psql "$CONNECTION" -c "ALTER TABLE addressbase_address ADD PRIMARY KEY (uprn);"
psql "$CONNECTION" -c "ALTER TABLE addressbase_uprntocouncil ADD PRIMARY KEY (uprn);"
psql "$CONNECTION" -c "ALTER TABLE addressbase_uprntocouncil ADD CONSTRAINT uprntocouncil_uprn_fk_address_uprn FOREIGN KEY (uprn) REFERENCES addressbase_address (uprn) MATCH FULL;"
psql "$CONNECTION" -c "CREATE INDEX address_location_gist ON public.addressbase_address USING gist (location);"
psql "$CONNECTION" -c "CREATE INDEX address_postcode_idx ON public.addressbase_address USING btree (postcode);"
psql "$CONNECTION" -c "CREATE INDEX address_postcode_like_idx ON public.addressbase_address USING btree (postcode varchar_pattern_ops);"
psql "$CONNECTION" -c "CREATE INDEX address_uprn_like_idx ON public.addressbase_address USING btree (uprn varchar_pattern_ops);"
psql "$CONNECTION" -c "CREATE INDEX uprntocouncil_adv_v_station_idx ON public.addressbase_uprntocouncil USING btree (advance_voting_station_id);"
psql "$CONNECTION" -c "CREATE INDEX uprntocouncil_uprn_like_idx ON public.addressbase_uprntocouncil USING btree (uprn varchar_pattern_ops);"
psql "$CONNECTION" -c "CREATE INDEX uprntocouncil_lad_idx ON public.addressbase_uprntocouncil USING btree (lad);"
-The script takes ~50minutes to run. So best to do it inside tmux:
shell tmux new -s db
- Run the script passing the RDS connection string as the first parameter:
Make sure the connection string is for the correct environment!
./copy-db postgresql://user:password@host/db-name
application instance
- SSH into application instance
- Become
polling_stations
and start python app shell# should change user and cd to /var/www/polling_stations/code and activate venv sudo su polling_stations
- Enter console
# Run shell plus. Optionally pip install ipython beforehand. python manage.py shell_plus
- Run the following code:
N.B. The counts are there to check you hit the right DB!
from addressbase.models import Address Address.objects.using('principal').count() UprnToCouncil.objects.using('principal').count() Address.objects.using('local').count() # This will be 'default' instead of local if you haven't deleted /var/www/polling_stations/home/db_replication_complete UprnToCouncil.objects.using('local').count() for address in Address.objects.using('principal').filter(uprntocouncil__isnull=True): council = address.get_council_from_others_in_postcode() if council: print( f"Creating UprnToCouncil record for {address.uprn} with gss {council.geography.gss}" ) UprnToCouncil.objects.using('principal').create( uprn=address, lad=council.geography.gss, polling_station_id="" ) else: print(f"Council ambiguous for {address.uprn}, deleting") address.delete() Address.objects.using('principal').count() UprnToCouncil.objects.using('principal').count() Address.objects.using('local').count() UprnToCouncil.objects.using('local').count()
- The easiest way to pick up the changes is just to do a redeploy. So merge a pr that has some code changes in it.
- In AWS console terminate worker instance.