feat: new job to backup pins to s3 #1844
base: main
Conversation
This is coming together 🎉
I did a really quick review pass and left some high-level comments for you.
Also noticed some types still need updating across the board.
this.MAX_DAG_SIZE = 1024 * 1024 * 1024 * 32 // don't try to transfer a DAG that's bigger than 32GiB
this.log = debug('backup:pins')
What happens to these DAGs?
Is there a different approach we have in mind for those?
We don't... but it's a pattern found in other areas of the system, so it might be worth flagging.
The comment is not pointing to the right line anymore; it should be the line above: https://github.com/web3-storage/web3.storage/pull/1844/files#diff-4df8cd6f6a1194703527832b6e009d1515b1d8f708afe9a5af7c31f446f9aadeR25
@alanshaw you might know more about this.
My thinking was along the lines of that being the max in our ToS, and wanting to have a cap at some point on the size of data we're willing to have uploaded/pinned. If you uploaded more than what we've said is the max allowed, then we shouldn't be obliged to store that.
But we don't have that limit for pins, do we? I can't find the ToS for pins, but I might just be missing it.
Assuming there isn't one, there's a chance content bigger than that is stored there, and users should expect it to be migrated?
I've commented out the size check part because, unless we have strong reasons not to, we should move all the files to EIPFS.
Add logging to keep track of enormous DAGs.
I added 2 pieces of logging:
- every time a DAG export fails, for whatever reason, we log the number of bytes read up to that point
- if a successful export is greater than 32TiB, we log it.
32 TiB or 32 GiB? I think we want to know about pins >= 32 GiB.
yeah, I actually added 32 GiB in the code 👍 (see here)
Ok, so before this is merged it needs the migration running which adds the new column.
Force-pushed 2d4715a to 6017c94.
@alanshaw, while there are still a few tweaks required (i.e. some types are missing/need fixing), I wonder if you could review this PR to see if the approach is what you expected it to be.
You should check with @olizilla that the DB change is similar to what he's expecting to do.
on:
  schedule:
    - cron: '*/30 * * * *'
Why not just schedule it for the max amount of time a job can run, i.e. 6h?
If I understand this correctly the job has 2 goals:
- move historical pins to EIPFS
- keep moving new PSA requests to EIPFS until we move to pickup.
For 2, I guess it's ideal to keep moving stuff as promptly as possible (30 min makes sense, maybe even less than that?), while we know the first runs of the job will be super slow (since they will have to go through all the historical data).
Isn't a solution that satisfies both worlds to keep the schedule as is and set concurrency on the job?
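To illustrate the suggestion, a workflow-level concurrency group might look something like the sketch below; the group name and the cancel-in-progress choice are illustrative, not taken from this PR:

```yaml
# Illustrative sketch: keep the 30-minute schedule, but allow only one
# backup run at a time. A newly scheduled run waits instead of starting
# in parallel while an earlier run is still chewing through historical data.
on:
  schedule:
    - cron: '*/30 * * * *'

concurrency:
  group: backup-pins
  cancel-in-progress: false
```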
packages/db/postgres/tables.sql
@@ -307,11 +307,13 @@ CREATE TABLE IF NOT EXISTS psa_pin_request
     meta jsonb,
     deleted_at TIMESTAMP WITH TIME ZONE,
     inserted_at TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL,
-    updated_at TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL
+    updated_at TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL,
+    backup_urls TEXT[]
Might be nice if this was NOT NULL DEFAULT [] so you don't have to distinguish between null and empty.
Changed :) FWIW, I used the same definition as backup_urls in uploads. Should I change that one too?
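For reference, the kind of definition being discussed could look roughly like this in Postgres; this is a sketch against the psa_pin_request table shown in the diff above, not the exact migration in this PR:

```sql
-- Sketch: an array column that is never NULL and defaults to an empty array,
-- so callers don't have to distinguish NULL from "no backups yet".
ALTER TABLE psa_pin_request
  ADD COLUMN IF NOT EXISTS backup_urls TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];
```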
Force-pushed 4edf2bf to b1fd1a9, e6d131b to df32027, df32027 to 52984d6, and 2fe2914 to 9dec63f.
@olizilla can you please give this a thorough review 🙏
let reportInterval
const libp2p = await getLibp2p()
try {
  const dagula = await Dagula.fromNetwork(libp2p, { peer })
Cache instances for the same peer location.
It should be fine to keep a single libp2p instance.
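The kind of caching being suggested might look roughly like the sketch below. The Map-based cache and the getDagula helper are hypothetical; getLibp2p() and Dagula.fromNetwork() are the calls already used in this PR, though the exact import paths are assumed:

```js
import debug from 'debug'
import { Dagula } from 'dagula'
import { getLibp2p } from 'dagula/p2p' // assumed path; the PR's code already calls getLibp2p()

const log = debug('backupPins:log')

// One libp2p node for the whole job, plus one Dagula instance per peer
// multiaddr, created lazily and reused across CIDs.
const dagulaCache = new Map()
let libp2p

async function getDagula (peer) {
  libp2p = libp2p || await getLibp2p()
  if (!dagulaCache.has(peer)) {
    log(`creating Dagula instance for peer ${peer}`)
    dagulaCache.set(peer, await Dagula.fromNetwork(libp2p, { peer }))
  }
  return dagulaCache.get(peer)
}
```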
Done.
  throw (err)
} finally {
  if (bytesReceived > this.MAX_UPLOAD_DAG_SIZE) {
    this.log(`⚠️ CID: ${cid} dag is greater than ${this.fmt(this.MAX_UPLOAD_DAG_SIZE)}`)
Add a per-batch summary:
- failed CIDs (and their size)
- successful CIDs bigger than MAX_UPLOAD_DAG_SIZE
Remove all unnecessary per-CID logging and log just errors by default (leave the rest at a higher debug level).
Done!
Default logging (DEBUG=backupPins:log) is quite succinct now, while the job can be run manually passing a more verbose DEBUG=backupPins:* through workflow inputs.
Example of default logging:
❯ DEBUG=backupPins:log npm test --workspace=packages/cron
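A rough illustration of that split, assuming the debug package; the function and counter names here are hypothetical, not the PR's exact code:

```js
import debug from 'debug'

// Succinct output under DEBUG=backupPins:log; chatty per-CID output
// only when running with the more verbose DEBUG=backupPins:*.
const log = debug('backupPins:log')
const verbose = debug('backupPins:debug')

// Per-batch summary: failed CIDs and successful-but-oversized CIDs.
function logBatchSummary ({ failed, oversized }) {
  log(`batch done: ${failed.length} failed, ${oversized.length} bigger than MAX_UPLOAD_DAG_SIZE`)
  for (const { cid, bytes } of failed) {
    log(`❌ ${cid} failed after ${bytes} bytes read`)
  }
  for (const { cid, bytes } of oversized) {
    log(`⚠️ ${cid} exported ${bytes} bytes`)
  }
}

// Per-CID success logging stays at the verbose level.
function logBackedUp (cid) {
  verbose(`backed up ${cid}`)
}
```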
 * @returns {Promise<number | undefined>}
 */

// Given for PIN requests we never limited files size we shouldn't check this. ie.
Delete stale code
Done!
Bucket: bucketName,
Key: key,
Body: bak.content,
Metadata: { structure: 'Complete' }
Look into sending a checksum for the file, and reject the upload if the bytes of the CAR don't match the CID.
I wonder if this is required.
Looking at the headers sent by the client:
'content-type': 'application/xml',
'content-length': '11985',
Expect: '100-continue',
host: '127.0.0.1',
'x-amz-user-agent': 'aws-sdk-js/3.53.1',
'user-agent': 'aws-sdk-js/3.53.1 os/darwin/21.6.0 lang/js md/nodejs/16.14.0 md/crt-avail api/s3/3.53.1',
'amz-sdk-invocation-id': '25110079-acaf-425f-8933-3527fd8366c7',
'amz-sdk-request': 'attempt=1; max=3',
'x-amz-date': '20221031T122520Z',
'x-amz-content-sha256': 'ebd8a0f42b66a7756aaee73e6275d918143525f125137b584e6b079b364a6b5f',
authorization: 'AWS4-HMAC-SHA256 Credential=minioadmin/20221031/us-east-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-request;content-length;content-type;host;x-amz-content-sha256;x-amz-date;x-amz-user-agent, Signature=90453c633c07234480f3319eb3c1b058d25b39a077eb3653063165c5bc137722'
}
you can see it sends x-amz-content-sha256, which is the sha256 of the payload and implies the payload is signed (see the docs).
If every chunk is hashed and verified I don't think we need the overall one? Or am I missing something?
cc @olizilla
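If an explicit end-to-end check were still wanted, one option with the AWS SDK v3 client is to pass a precomputed SHA-256 so the store can reject a mismatched body. This is only a sketch: it assumes the CAR bytes are already in memory and that the target S3-compatible store honours ChecksumSHA256.

```js
import { createHash } from 'node:crypto'
import { PutObjectCommand } from '@aws-sdk/client-s3'

// Sketch: compute the SHA-256 of the CAR bytes up front and send it as an
// explicit checksum, so the upload fails if the body is corrupted in transit.
// `s3` is an S3Client instance; `carBytes` is a Buffer/Uint8Array.
async function putCarWithChecksum (s3, bucketName, key, carBytes) {
  const sha256 = createHash('sha256').update(carBytes).digest('base64')
  return s3.send(new PutObjectCommand({
    Bucket: bucketName,
    Key: key,
    Body: carBytes,
    ChecksumSHA256: sha256,
    Metadata: { structure: 'Complete' }
  }))
}
```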
This needs a migration on the psa_pins_request table to add the new backup_urls column.

About this job
This new cron job, which runs every 4 hours, grabs 10,000 pin requests, gets the CAR file and uploads it to S3.
It behaves in a similar way to nftstorage/backup.
It uses Dagula to grab the CAR file from the IPFS peer.
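Putting the description together, the overall shape of one run might be sketched as follows. The helpers getPinRequestsWithoutBackup, exportCar and uploadCar are hypothetical stand-ins, and the pg-style db.query call is illustrative rather than the project's actual data access layer:

```js
// Hypothetical outline of one cron run, based on the PR description:
// fetch pin requests that have no backup yet, export each DAG as a CAR
// with Dagula, upload it to S3, then record the backup URL.
async function backupPins ({ db, s3, batchSize = 10000 }) {
  const pinRequests = await getPinRequestsWithoutBackup(db, batchSize)
  for (const { id, cid, peer } of pinRequests) {
    try {
      const car = await exportCar(cid, peer)    // CAR export via Dagula
      const url = await uploadCar(s3, cid, car) // PutObject to S3
      await db.query(
        'UPDATE psa_pin_request SET backup_urls = array_append(backup_urls, $1) WHERE id = $2',
        [url, id]
      )
    } catch (err) {
      console.error(`backup failed for ${cid}`, err)
    }
  }
}
```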