FAIR dataset deletion #970
base: main
Conversation
Pull Request Test Coverage Report for Build 9778355898 (Coveralls)
services/src/pro/datasets/storage.rs (Outdated)
#[derive(Deserialize, Serialize, Debug, Clone, ToSchema)]
pub struct Expiration {
    pub deletion_timestamp: Option<DateTime>,
    pub delete_record: bool,
does it make sense to delete the dataset record without deleting the data? Maybe make it an enum: DeleteRecordAndData, DeleteRecord?
Or throw an error if delete_record is true but delete_data is false?
I don't think it does, but that's how GeoEngine works, so I originally kept the option for it. I changed it now to the suggested enum (7b4fa9c, c56db30, 3fc317a). An odd part now is delete_dataset from the DatasetStore; I added a new comment there, feel free to comment or resolve if this is not a problem: https://github.com/geo-engine/geoengine/pull/970/files#r1672610833
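A minimal sketch of the suggested shape, using the variant names from the comment above; the enum name and derives are assumptions, not the PR's actual code:

```rust
// Sketch only: the enum name is assumed, the variants are the ones
// proposed in review. Compared with two independent booleans, only
// the combinations listed as variants are representable.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeletionType {
    /// Remove both the dataset record (metadata) and the uploaded data.
    DeleteRecordAndData,
    /// Remove only the dataset record.
    DeleteRecord,
}
```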
Available,
Expires,
Expired,
UpdateExpired,
what does this state mean?
Available: Dataset is available.
Expires: Dataset was set to expire at some point in the future or dataset was "deleted", which translates to expire now.
Expired: Dataset has reached the expiration timestamp, and the deletion action can be triggered now.
UpdateExpired: Dataset was previously Expired or Deleted, and the user has "upgraded" this expiration/deletion. After incorporating the other review comments, the only remaining case is that the metadata should also be deleted; previously this covered other cases. Will be marked as Expired once the action is done.
Deleted: Action after Expired was taken.
maybe add it as a comment to the code
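Following that suggestion, a sketch of the states with the explanations above turned into doc comments; the enum name and derives here are assumptions, not the PR's actual code:

```rust
// Sketch only: the enum name is assumed; the doc comments restate the
// state descriptions from the review thread above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum UploadedDatasetStatus {
    /// Dataset is available.
    Available,
    /// Dataset is set to expire at some point in the future, or was
    /// "deleted", which translates to expiring now.
    Expires,
    /// The expiration timestamp was reached; the deletion action can
    /// be triggered now.
    Expired,
    /// A previously Expired or Deleted dataset whose expiration was
    /// "upgraded" (the metadata should also be deleted); marked as
    /// Expired again once the action is done.
    UpdateExpired,
    /// The action after Expired was taken.
    Deleted,
}
```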
deleted timestamp with time zone,
delete_data boolean NOT NULL,
delete_record boolean NOT NULL,
PRIMARY KEY (user_id, dataset_id, upload_id)
why is the user_id part of the primary key? Every dataset should only ever point to one upload, right?
There is a table "user_uploads" with PRIMARY KEY (user_id, upload_id), so every upload has a user. This table additionally ensures that every upload has at most one dataset. So I guess we could limit the primary key to (dataset_id, upload_id), but user_id should also be unique, because every upload has exactly one user and at most one dataset? I'll do what you want.
ok so leave it in.
Another point: should the expiration be a separate table? I'd expect it to be NULL for almost every dataset, because deletion is rare. It might also speed up garbage collection if we only need to look at deletions and not "all" uploads.
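The separate-table idea could look roughly like this; all table and column names below are assumptions for illustration, not the PR's actual schema:

```rust
// Sketch only: expiration rows live in their own table, so datasets
// without an expiration store no row at all, and garbage collection
// scans only this (small) table instead of all uploads.
const CREATE_EXPIRATION_TABLE: &str = "\
CREATE TABLE uploaded_dataset_expirations (
    dataset_id uuid PRIMARY KEY,
    expiration_timestamp timestamp with time zone,
    delete_data boolean NOT NULL,
    delete_record boolean NOT NULL
);";
```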
Pull Request Test Coverage Report for Build 10182401222 (Coveralls)
let _uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
if let Err(error) = _uploaded {
    if matches!(error, UnknownDatasetId) {
        self.ensure_permission_in_tx(dataset_id.into(), Permission::Owner, &tx)
            .await
            .boxed_context(crate::error::PermissionDb)?;

        let stmt = tx
            .prepare(
                "
            SELECT
                TRUE
            FROM
                user_permitted_datasets p JOIN datasets d
                    ON (p.dataset_id = d.id)
            WHERE
                d.id = $1 AND p.user_id = $2 AND p.permission = 'Owner';",
            )
            .await?;

        let rows = tx
            .query(&stmt, &[&dataset_id, &self.session.user.id])
            .await?;

        if rows.is_empty() {
            return Err(Error::OperationRequiresOwnerPermission);
        }

        let stmt = tx.prepare("DELETE FROM datasets WHERE id = $1;").await?;

        tx.execute(&stmt, &[&dataset_id]).await?;

        tx.commit().await?;

        return Ok(());
This is a bit odd without the option to delete data but keep metadata. If the dataset is not a user upload, the engine would do that; otherwise, the default now would be a full deletion.
what exactly is weird about it? Do you mean that there should be no generic deletion but always specify what should be deleted (metadata/data)?
The previous way to delete a dataset was to delete the metadata but keep the data. That still exists for non-user uploads, but user uploads no longer have that option.
use super::database_migration::{DatabaseVersion, Migration};

/// This migration adds new delete options for uploaded user datasets
pub struct Migration0011DeleteUploadedDatasets;
maybe a better name would be fair_data_deletion? Otherwise one might think the migration would delete datasets...
if rows.is_empty() {
    return Err(Error::OperationRequiresOwnerPermission);
let _uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
why the underscore? You do not ignore the result.
return Err(Error::OperationRequiresOwnerPermission);
let _uploaded = self.uploaded_dataset_status_in_tx(&dataset_id, &tx).await;
if let Err(error) = _uploaded {
    if matches!(error, UnknownDatasetId) {
is there a better way than matching on the error type? Why is there even anything to do if the dataset id is unknown?
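One possible alternative to matching on the error type: let the status lookup report "not an uploaded dataset" as a regular value instead of a dedicated error variant. All names below are illustrative, not the PR's actual API:

```rust
// Sketch only: the status lookup returns Ok(None) for "not an uploaded
// dataset", so the caller can fall back to the regular deletion path
// without inspecting specific error variants.
#[derive(Debug, PartialEq)]
enum Status {
    Available,
}

fn uploaded_dataset_status(is_uploaded: bool) -> Result<Option<Status>, String> {
    if is_uploaded {
        Ok(Some(Status::Available))
    } else {
        // No status, but also no error: "unknown id" is an expected case.
        Ok(None)
    }
}
```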
tx.execute(&stmt, &[&dataset_id]).await?;

tx.commit().await?;
self.expire_uploaded_dataset(ChangeDatasetExpiration::delete_full(dataset_id))
is every delete an expiration now? shouldn't it be the other way around and an expiration be a scheduled delete?
If it is a naming thing, I'm...open either way.
From a technical standpoint, the current implementation interprets delete as expire now, which means there is only one internal logic (i.e., set of queries) for this process. I could also change that.
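The "delete is expire now" interpretation can be sketched as follows; the types and names are placeholders, not the PR's actual API:

```rust
// Sketch only: deletion is modeled as an expiration whose timestamp is
// the current instant, so a single internal code path (one set of
// queries) handles both scheduled expiration and immediate deletion.
#[derive(Debug, PartialEq)]
struct ChangeDatasetExpiration {
    dataset_id: u32,       // placeholder id type
    expires_at_secs: u64,  // placeholder timestamp (seconds)
}

impl ChangeDatasetExpiration {
    /// Schedule expiration at a future timestamp.
    fn expire_at(dataset_id: u32, expires_at_secs: u64) -> Self {
        Self { dataset_id, expires_at_secs }
    }

    /// A full delete is just an expiration at the current instant.
    fn delete_full(dataset_id: u32, now_secs: u64) -> Self {
        Self::expire_at(dataset_id, now_secs)
    }
}
```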
CHANGELOG.md
if knowledge of this change could be valuable to users. Here is a brief summary of what I did: