After experimenting for a while, we ultimately decided not to use Intake catalogs to distribute our data, so this repository is no longer maintained.
See PUDL Data Access for more information on how to access the data we publish.
This package provides a (very) thin wrapper around the more general intake-sql driver, which can be used to generate Intake data catalogs from SQL databases.
The intake-sql driver takes an SQLAlchemy database URL and uses it to connect to and extract data from the database. This works just fine with SQLite databases, but only when the database file is stored locally and can be referenced with a simple path.
For example, this path:
/home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite
would correspond to this SQLAlchemy database URL:
sqlite:///home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite
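That mapping can be sketched in a few lines (a toy illustration with a hypothetical helper name, not code from this package):

```python
from pathlib import Path

def local_sqlite_url(path: str) -> str:
    """Map a local SQLite file path to a database URL like the one above."""
    # Resolving the path makes it absolute, so "sqlite://" plus a path
    # beginning with "/" yields a URL of the form "sqlite:///home/...".
    return "sqlite://" + str(Path(path).resolve())
```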
But you can't access a remote SQLite DB this way. So why would you want to?
- SQLite databases are great standalone, standardized containers for relational data that can be accessed using a huge variety of tools on almost any computer platform. They are even accepted as an archival format by the US Library of Congress!
- Data evolves over time, and it's often useful to have easy access to several different versions of it, and to know exactly which version you're working with.
- Cloud object storage is extremely cheap and convenient, and makes it easy to publish and store historical file-based data releases.
- Managing your own bespoke local filesystem hierarchy filled with data -- and coordinating with colleagues so that everyone is using the same filesystem organizational scheme -- is a pain.
- Intake catalogs can provide easy access to metadata and let you manage data versions just like software versions. Installing a new version of the data catalog points you at the new version of the data.
- The overhead and cost associated with setting up and maintaining a database that uses a client-server model is relatively large compared to distributing a few files that change infrequently, are essentially read-only resources, and only take up a few gigabytes of space.
Rather than using an SQLAlchemy database URL to reference the SQLite DB, this intake driver takes a local path or a remote URL, like:
../pudl-work/sqlite/pudl.sqlite
https://global-power-plants.datasettes.com/global-power-plants.db
s3://cloudy-mc-cloudface-databucket/v1.2.3/mydata.db
For local paths, it resolves the path and prepends sqlite:// before handing it off to intake-sql to do all the hard work. For remote URLs it uses fsspec to cache a local copy of the database, and then gives intake-sql a database URL that points to the cached copy.
import intake_sqlite
gpp_cat = intake_sqlite.SQLiteCatalog(
urlpath="https://global-power-plants.datasettes.com/global-power-plants.db",
storage_options={"simplecache": {"cache_storage": "/home/zane/.cache/intake"}},
)
list(gpp_cat)
# ['global-power-plants',
# 'global-power-plants_fts',
# 'global-power-plants_fts_config',
# 'global-power-plants_fts_data',
# 'global-power-plants_fts_docsize',
# 'global-power-plants_fts_idx']
Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
- For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions
- If you'd like to get occasional updates about our projects, sign up for our email list.
- Want to schedule a time to chat with us one-on-one? Join us for Office Hours
- Follow us on Twitter: @CatalystCoop
- More info on our website: https://catalyst.coop
- For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: [email protected]