
FAIR dataset deletion #970

Open · wants to merge 20 commits into main

Conversation

glombiewski (Contributor):

  • I added an entry to CHANGELOG.md if knowledge of this change could be valuable to users.

Here is a brief summary of what I did:

  • Associated user uploads with user datasets.
  • Allow new combinations of metadata and data deletion.
  • Allow for expiration of datasets (deletion after a specified time); see the sketch below.
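To make the expiration option concrete, here is a minimal, self-contained mirror of the `Expiration` struct quoted later in this review. This is a sketch only: `chrono` stands in for the crate's own `DateTime` type, and the `main` function is purely illustrative.

```rust
use chrono::{DateTime, Duration, Utc};

// Minimal mirror of the PR's `Expiration` struct (fields copied from the
// diff excerpt below; the derives are omitted here).
pub struct Expiration {
    // Optional in the diff; presumably an immediate expiration when absent.
    pub deletion_timestamp: Option<DateTime<Utc>>,
    // Whether the dataset record (metadata) is deleted as well.
    pub delete_record: bool,
}

fn main() {
    // Schedule a dataset for deletion 30 days from now, including its record.
    let expiration = Expiration {
        deletion_timestamp: Some(Utc::now() + Duration::days(30)),
        delete_record: true,
    };
    println!(
        "expires: {:?}, record deleted: {}",
        expiration.deletion_timestamp, expiration.delete_record
    );
}
```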

glombiewski (Author):

Deletion of datasets

coveralls commented Jul 3, 2024

Pull Request Test Coverage Report for Build 9778142714

Details

  • 1369 of 1424 (96.14%) changed or added relevant lines in 9 files are covered.
  • 22 unchanged lines in 12 files lost coverage.
  • Overall coverage increased (+0.05%) to 90.741%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| services/src/datasets/upload.rs | 35 | 36 | 97.22% |
| services/src/pro/datasets/storage.rs | 94 | 95 | 98.95% |
| services/src/pro/api/handlers/datasets.rs | 140 | 160 | 87.5% |
| services/src/pro/datasets/postgres.rs | 1058 | 1091 | 96.98% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| services/src/api/handlers/layers.rs | 1 | 88.87% |
| services/src/datasets/external/netcdfcf/mod.rs | 1 | 94.74% |
| services/src/datasets/external/edr.rs | 1 | 85.72% |
| services/src/datasets/external/netcdfcf/ebvportal_api.rs | 1 | 88.35% |
| services/src/datasets/postgres.rs | 1 | 88.8% |
| services/src/pro/datasets/postgres.rs | 1 | 87.61% |
| services/src/api/handlers/datasets.rs | 1 | 95.52% |
| services/src/datasets/create_from_workflow.rs | 1 | 89.62% |
| services/src/api/handlers/workflows.rs | 1 | 90.22% |
| services/src/datasets/external/pangaea/mod.rs | 2 | 83.03% |

Totals:
  • Change from base Build 9548821675: +0.05%
  • Covered Lines: 134486
  • Relevant Lines: 148208

💛 - Coveralls

coveralls commented Jul 3, 2024

Pull Request Test Coverage Report for Build 9778355898

Details

  • 1374 of 1429 (96.15%) changed or added relevant lines in 10 files are covered.
  • 28 unchanged lines in 16 files lost coverage.
  • Overall coverage increased (+0.05%) to 90.738%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| services/src/datasets/upload.rs | 35 | 36 | 97.22% |
| services/src/pro/datasets/storage.rs | 94 | 95 | 98.95% |
| services/src/pro/api/handlers/datasets.rs | 140 | 160 | 87.5% |
| services/src/pro/datasets/postgres.rs | 1058 | 1091 | 96.98% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| services/src/datasets/external/netcdfcf/mod.rs | 1 | 94.79% |
| services/src/datasets/external/netcdfcf/ebvportal_api.rs | 1 | 88.35% |
| services/src/datasets/postgres.rs | 1 | 88.8% |
| services/src/api/handlers/datasets.rs | 1 | 95.52% |
| services/src/api/handlers/wcs.rs | 1 | 88.71% |
| services/src/api/handlers/wfs.rs | 1 | 96.29% |
| services/src/api/handlers/workflows.rs | 1 | 90.22% |
| services/src/api/handlers/projects.rs | 1 | 98.59% |
| services/src/api/handlers/layers.rs | 2 | 88.8% |
| services/src/contexts/postgres.rs | 2 | 97.76% |

Totals:
  • Change from base Build 9548821675: +0.05%
  • Covered Lines: 134479
  • Relevant Lines: 148206

💛 - Coveralls

```rust
#[derive(Deserialize, Serialize, Debug, Clone, ToSchema)]
pub struct Expiration {
    pub deletion_timestamp: Option<DateTime>,
    pub delete_record: bool,
```
Contributor:

Does it make sense to delete the dataset record without deleting the data? Maybe make it an enum (DeleteRecordAndData, DeleteRecord)? Or throw an error if delete_record is true but delete_data is false?

glombiewski (Author):

I don't think it does, but that's how GeoEngine works, so I originally kept the option for it. I changed it now to the suggested enum (7b4fa9c, c56db30, 3fc317a). An odd part now is delete_dataset from the DatasetStore; I added a new comment there, feel free to comment or resolve if this is not a problem: https://github.com/geo-engine/geoengine/pull/970/files#r1672610833
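For readers following the thread, the enum-based version could look roughly like this. This is a sketch reconstructed from the comments above, not the literal contents of those commits; the enum name, the field name, and the use of `chrono` are assumptions.

```rust
use chrono::{DateTime, Utc};

// Variant names as proposed in the review comment above.
pub enum DeletionType {
    DeleteRecordAndData,
    DeleteRecord,
}

// Sketch of the reworked struct: the two booleans collapse into one enum,
// so invalid flag combinations can no longer be expressed.
pub struct Expiration {
    pub deletion_timestamp: Option<DateTime<Utc>>,
    pub deletion_type: DeletionType, // field name assumed
}

fn main() {
    // "Delete everything now": no timestamp, full deletion.
    let _expiration = Expiration {
        deletion_timestamp: None,
        deletion_type: DeletionType::DeleteRecordAndData,
    };
}
```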

```rust
    Available,
    Expires,
    Expired,
    UpdateExpired,
```
Contributor:

What does this state mean?

glombiewski (Author):
  • Available: the dataset is available.
  • Expires: the dataset was set to expire at some point in the future, or it was "deleted", which translates to "expire now".
  • Expired: the dataset has reached its expiration timestamp, and the deletion action can be triggered now.
  • UpdateExpired: the dataset was previously Expired or Deleted, and the user has "upgraded" this expiration/deletion. After incorporating the other review comments, the only remaining upgrade is that the metadata should also be deleted; previously this covered other cases. It will be marked as Expired again once done.
  • Deleted: the action scheduled for Expired was taken.

Contributor:

Maybe add it as a comment to the code.
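Picking that suggestion up, the explanations from the reply above could be attached to the variants as doc comments more or less verbatim. A sketch; the enum name is assumed, and `Deleted` is included because the reply mentions it:

```rust
/// Lifecycle of an uploaded dataset (enum name assumed; variant
/// descriptions taken from the discussion above).
pub enum UploadedDatasetStatus {
    /// The dataset is available.
    Available,
    /// The dataset was set to expire in the future, or it was "deleted",
    /// which translates to "expire now".
    Expires,
    /// The expiration timestamp has been reached; the deletion action
    /// can be triggered now.
    Expired,
    /// A previously Expired or Deleted dataset whose expiration was
    /// "upgraded" (the metadata must also be deleted). Marked as
    /// Expired again once that is done.
    UpdateExpired,
    /// The action scheduled for Expired has been taken.
    Deleted,
}
```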

```sql
    deleted timestamp with time zone,
    delete_data boolean NOT NULL,
    delete_record boolean NOT NULL,
    PRIMARY KEY (user_id, dataset_id, upload_id)
```
Contributor:

Why is the user_id part of the primary key? Every dataset should only ever point to one upload, right?

glombiewski (Author) commented Jul 9, 2024:

There is a table "user_uploads" with PRIMARY KEY (user_id, upload_id), so every upload has a user. This table additionally ensures that every upload has at most one dataset. So I guess we could limit the primary key to (dataset_id, upload_id), but user_id should also be unique, because every upload has exactly one user and at most one dataset. I'll do whatever you prefer.

michaelmattig (Contributor) commented Jul 30, 2024:

OK, so leave it in.

Another point: should the expiration be a separate table? I'd expect it to be NULL for almost every dataset, because deletion is rare. It might also speed up garbage collection if we only need to look at expirations and not "all" uploads.
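As a sketch of that suggestion, reusing the columns from the schema quoted above (the table name and the migration wiring are assumptions, not part of the PR):

```rust
use tokio_postgres::Transaction;

// Hypothetical migration step: keep expirations in their own table, so a
// dataset without an expiration stores no row at all, and garbage
// collection only has to scan this table instead of all uploads.
async fn create_expiration_table(tx: &Transaction<'_>) -> Result<(), tokio_postgres::Error> {
    tx.batch_execute(
        "CREATE TABLE uploaded_dataset_expirations ( -- table name assumed
            dataset_id uuid PRIMARY KEY,
            deletion_timestamp timestamp with time zone,
            delete_data boolean NOT NULL,
            delete_record boolean NOT NULL
        );",
    )
    .await
}
```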

coveralls commented Jul 9, 2024

Pull Request Test Coverage Report for Build 10182401222

Details

  • 1798 of 1863 (96.51%) changed or added relevant lines in 13 files are covered.
  • 16 unchanged lines in 9 files lost coverage.
  • Overall coverage increased (+0.07%) to 91.14%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| services/src/datasets/storage.rs | 5 | 12 | 41.67% |
| services/src/pro/datasets/storage.rs | 80 | 93 | 86.02% |
| services/src/pro/api/handlers/datasets.rs | 203 | 223 | 91.03% |
| services/src/pro/datasets/postgres.rs | 1369 | 1394 | 98.21% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| services/src/datasets/external/netcdfcf/mod.rs | 1 | 95.38% |
| services/src/datasets/external/netcdfcf/ebvportal_api.rs | 1 | 95.96% |
| operators/src/util/stream_zip/vec_zip.rs | 1 | 91.22% |
| services/src/datasets/external/netcdfcf/ebvportal_provider.rs | 1 | 93.21% |
| services/src/api/handlers/wfs.rs | 1 | 96.21% |
| services/src/datasets/external/pangaea/mod.rs | 1 | 86.66% |
| services/src/api/handlers/workflows.rs | 2 | 90.21% |
| services/src/api/handlers/datasets.rs | 3 | 94.7% |
| services/src/pro/contexts/postgres.rs | 5 | 96.75% |

Totals:
  • Change from base Build 10178074589: +0.07%
  • Covered Lines: 132740
  • Relevant Lines: 145644

💛 - Coveralls

Comment on lines 768 to 802
```rust
let _uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
if let Err(error) = _uploaded {
    if matches!(error, UnknownDatasetId) {
        self.ensure_permission_in_tx(dataset_id.into(), Permission::Owner, &tx)
            .await
            .boxed_context(crate::error::PermissionDb)?;

        let stmt = tx
            .prepare(
                "
                SELECT
                    TRUE
                FROM
                    user_permitted_datasets p JOIN datasets d
                        ON (p.dataset_id = d.id)
                WHERE
                    d.id = $1 AND p.user_id = $2 AND p.permission = 'Owner';",
            )
            .await?;

        let rows = tx
            .query(&stmt, &[&dataset_id, &self.session.user.id])
            .await?;

        if rows.is_empty() {
            return Err(Error::OperationRequiresOwnerPermission);
        }

        let stmt = tx.prepare("DELETE FROM datasets WHERE id = $1;").await?;

        tx.execute(&stmt, &[&dataset_id]).await?;

        tx.commit().await?;

        return Ok(());
```
glombiewski (Author):

This is a bit odd without the option to delete data but keep metadata. If the dataset is not a user upload, the engine would do that; otherwise the default now would be a full deletion.

michaelmattig (Contributor) commented Jul 30, 2024:

What exactly is weird about it? Do you mean that there should be no generic deletion, but that one should always specify what should be deleted (metadata/data)?

glombiewski (Author):

The previous way to delete a dataset was to delete the metadata but keep the data. That still exists for non-user uploads, but user uploads do not have that option.

```rust
use super::database_migration::{DatabaseVersion, Migration};

/// This migration adds new delete options for uploaded user datasets
pub struct Migration0011DeleteUploadedDatasets;
```
Contributor:

Maybe a better name would be fair_data_deletion? Otherwise one might think the migration would delete datasets...


```rust
if rows.is_empty() {
    return Err(Error::OperationRequiresOwnerPermission);
let _uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
```
michaelmattig (Contributor) commented Jul 30, 2024:

Why the underscore? You do not ignore the result.
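In suggestion form, the fix would simply drop the underscore, since the binding is in fact used (a sketch against the quoted lines):

```suggestion
let uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
if let Err(error) = uploaded {
```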

```rust
return Err(Error::OperationRequiresOwnerPermission);
let _uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
if let Err(error) = _uploaded {
    if matches!(error, UnknownDatasetId) {
```
Contributor:

Is there a better way than matching on the error type? Why is there even anything to do if the dataset id is unknown?
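One way to avoid matching on the error type (a toy sketch, not code from the PR; all names are hypothetical) is to let the lookup return an `Option`, so that "this dataset is not a user upload" is ordinary data rather than an error variant:

```rust
// Toy illustration: `None` means "not a user upload", which is a normal
// outcome, while real failures would stay inside a Result.
#[derive(Debug)]
enum UploadedDatasetStatus {
    Available,
}

fn uploaded_status(is_user_upload: bool) -> Option<UploadedDatasetStatus> {
    // In the real code this would query the uploads table in the
    // transaction; here a boolean stands in for that lookup.
    is_user_upload.then_some(UploadedDatasetStatus::Available)
}

fn main() {
    match uploaded_status(false) {
        Some(status) => println!("uploaded dataset, status {status:?}"),
        None => println!("plain dataset: fall back to the generic deletion"),
    }
}
```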

```rust
tx.execute(&stmt, &[&dataset_id]).await?;

tx.commit().await?;
self.expire_uploaded_dataset(ChangeDatasetExpiration::delete_full(dataset_id))
```
michaelmattig (Contributor) commented Jul 30, 2024:

Is every delete an expiration now? Shouldn't it be the other way around, with an expiration being a scheduled delete?

glombiewski (Author):
If it is a naming thing, I'm...open either way.

From a technical standpoint, the current implementation interprets delete as "expire now", which means there is only one internal logic (i.e., one set of queries) for this process. I could also change that.
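As a sketch of that interpretation, modeled on the `ChangeDatasetExpiration::delete_full` call quoted above (the struct layout and field types are assumptions):

```rust
use chrono::{DateTime, Utc};

// Stand-in for the crate's DatasetId type, just for this sketch.
type DatasetId = u32;

// Sketch: an immediate delete is expressed as an expiration whose
// timestamp is "now", so scheduled and immediate deletion share one
// internal code path (one set of queries).
struct ChangeDatasetExpiration {
    dataset_id: DatasetId,
    deletion_timestamp: DateTime<Utc>,
}

impl ChangeDatasetExpiration {
    fn delete_full(dataset_id: DatasetId) -> Self {
        Self {
            dataset_id,
            deletion_timestamp: Utc::now(), // "delete" == "expire now"
        }
    }
}

fn main() {
    let change = ChangeDatasetExpiration::delete_full(42);
    println!(
        "dataset {} expires at {}",
        change.dataset_id, change.deletion_timestamp
    );
}
```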

Labels: none yet · Projects: none yet · 3 participants