Initial Parquet File/Dataset abstraction #501

Merged 23 commits on Mar 20, 2024

Conversation

Member

@kylebarron kylebarron commented Feb 7, 2024

Prep work for fetching metadata from a set of Parquet files in a folder and loading them with a spatial filter.

Change list

  • New ParquetFile and ParquetDataset Rust structs to read from one or multiple Parquet files. Both are generic over AsyncFileReader, which primarily works with object store.
  • Added initial metadata handling, to e.g. read the bounding box of the file.
  • Added Python bindings to each class. This uses ObjectStore.
  • Added initial JS bindings to each class. This uses a custom implementation of AsyncFileReader vendored from parquet-wasm. I couldn't get the ObjectStore integration working just yet, though object_store_wasm_s3 was updated and might be worth checking out.
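
To make the file/dataset split and the bounding-box metadata concrete, here is a minimal sketch. It is not the code from this PR: the `Reader` trait is a stand-in for a reader abstraction such as AsyncFileReader, and `BBox`, `MemoryReader`, `total_bounds`, `files_intersecting`, and `example_dataset` are hypothetical names. It assumes each file's bounding box has already been parsed from its metadata, and shows how a dataset could combine them and skip files that cannot match a spatial filter.

```rust
// Hypothetical sketch: all names here are illustrative, not this PR's API.

/// Axis-aligned bounding box, as might be read from a file's metadata.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub minx: f64,
    pub miny: f64,
    pub maxx: f64,
    pub maxy: f64,
}

impl BBox {
    /// Smallest box covering both `self` and `other`.
    pub fn union(&self, other: &BBox) -> BBox {
        BBox {
            minx: self.minx.min(other.minx),
            miny: self.miny.min(other.miny),
            maxx: self.maxx.max(other.maxx),
            maxy: self.maxy.max(other.maxy),
        }
    }

    /// True if the boxes overlap (touching edges count as overlap).
    pub fn intersects(&self, other: &BBox) -> bool {
        self.minx <= other.maxx
            && other.minx <= self.maxx
            && self.miny <= other.maxy
            && other.miny <= self.maxy
    }
}

/// Stand-in for a reader abstraction; only exposes a path here.
pub trait Reader {
    fn path(&self) -> &str;
}

/// Toy reader so the sketch runs without any I/O.
pub struct MemoryReader {
    pub path: String,
}

impl Reader for MemoryReader {
    fn path(&self) -> &str {
        &self.path
    }
}

/// One file: a reader plus metadata that was parsed up front.
pub struct ParquetFile<R: Reader> {
    pub reader: R,
    pub bbox: BBox,
}

/// Several files that share the same reader type.
pub struct ParquetDataset<R: Reader> {
    pub files: Vec<ParquetFile<R>>,
}

impl<R: Reader> ParquetDataset<R> {
    /// Union of all file bounding boxes, or `None` for an empty dataset.
    pub fn total_bounds(&self) -> Option<BBox> {
        self.files.iter().map(|f| f.bbox).reduce(|a, b| a.union(&b))
    }

    /// Paths of the files whose bounding box overlaps `query`.
    pub fn files_intersecting(&self, query: &BBox) -> Vec<&str> {
        self.files
            .iter()
            .filter(|f| f.bbox.intersects(query))
            .map(|f| f.reader.path())
            .collect()
    }
}

/// Two made-up files with non-overlapping extents.
pub fn example_dataset() -> ParquetDataset<MemoryReader> {
    let file = |path: &str, bbox: BBox| ParquetFile {
        reader: MemoryReader { path: path.to_string() },
        bbox,
    };
    ParquetDataset {
        files: vec![
            file("a.parquet", BBox { minx: 0.0, miny: 0.0, maxx: 10.0, maxy: 10.0 }),
            file("b.parquet", BBox { minx: 15.0, miny: 0.0, maxx: 20.0, maxy: 10.0 }),
        ],
    }
}

fn main() {
    let ds = example_dataset();
    let query = BBox { minx: 12.0, miny: 1.0, maxx: 16.0, maxy: 2.0 };
    println!("total bounds: {:?}", ds.total_bounds());
    println!("files to read: {:?}", ds.files_intersecting(&query));
}
```

Keeping the dataset generic over the reader type is what would let the same filtering logic sit behind the Rust, Python, and JS entry points, each supplying its own reader.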

@kylebarron
Member Author

@H-Plus-Time I'm starting to explore an object-store based async JS Parquet reader in this branch. I'm coming to think it's the best solution because I can reuse so much of the same Rust and Python code.

I'm thinking of trying to implement a minimal impl based on ehttp for the web, but I'd very much welcome advice/thoughts.

@kylebarron
Member Author

I failed at trying to implement a wasm ObjectStore. In particular, I ended up hitting Send/Sync issues, and since ObjectStore is defined as #[async_trait] and not #[async_trait(?Send)], I'm not sure it's even possible to implement ObjectStore in wasm32. Instead, for now I vendored the AsyncFileReader impl into this repo.
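
The Send problem described above can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the real ObjectStore trait: `SendStore` and `LocalStore` are hypothetical traits that approximate what `#[async_trait]` and `#[async_trait(?Send)]` expand to (a boxed future with and without a `Send` bound), and `Rc<String>` stands in for a browser handle, which is typically not `Send` on wasm32. `block_on` is a toy executor that only suits futures that complete without waiting.

```rust
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Approximates a `#[async_trait]` method: the boxed future must be `Send`.
pub trait SendStore {
    fn get_len(&self, key: String) -> Pin<Box<dyn Future<Output = usize> + Send>>;
}

/// Approximates a `#[async_trait(?Send)]` method: no `Send` bound.
pub trait LocalStore {
    fn get_len(&self, key: String) -> Pin<Box<dyn Future<Output = usize>>>;
}

/// Thread-safe state (`Arc`) makes the future `Send`.
pub struct NativeStore {
    pub data: Arc<String>,
}

impl SendStore for NativeStore {
    fn get_len(&self, key: String) -> Pin<Box<dyn Future<Output = usize> + Send>> {
        let data = Arc::clone(&self.data);
        Box::pin(async move { data.len() + key.len() })
    }
}

/// `Rc` is not `Send`; it stands in for a browser-side handle (assumption).
pub struct BrowserStore {
    pub handle: Rc<String>,
}

impl LocalStore for BrowserStore {
    fn get_len(&self, key: String) -> Pin<Box<dyn Future<Output = usize>>> {
        let handle = Rc::clone(&self.handle);
        Box::pin(async move { handle.len() + key.len() })
    }
}

// The same body under `SendStore` is rejected by the compiler, because the
// future captures an `Rc` and therefore is not `Send`:
//
// impl SendStore for BrowserStore { /* error: `Rc<String>` cannot be sent */ }

/// Compile-time check that a value is `Send`.
pub fn assert_send<T: Send>(_: &T) {}

/// Waker that does nothing; enough for futures that are ready immediately.
fn noop_waker() -> Waker {
    unsafe fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Toy executor: polls in a loop until the future completes.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let native = NativeStore { data: Arc::new("abc".to_string()) };
    let fut = native.get_len("key".to_string());
    assert_send(&fut); // compiles: the future is `Send`
    println!("native: {}", block_on(fut));

    let browser = BrowserStore { handle: Rc::new("abcd".to_string()) };
    // `assert_send(&browser.get_len(..))` would not compile here.
    println!("browser: {}", block_on(browser.get_len("key".to_string())));
}
```

Because the `Send` bound is part of the trait's method signature, an implementor cannot opt out of it, which is consistent with the difficulty reported above.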

@kylebarron kylebarron marked this pull request as ready for review March 20, 2024 16:59
@kylebarron kylebarron changed the title Parquet File/Dataset abstraction Initial Parquet File/Dataset abstraction Mar 20, 2024
@kylebarron kylebarron merged commit 3a2007b into main Mar 20, 2024
6 checks passed
@kylebarron kylebarron deleted the kyle/parquet-dataset branch March 20, 2024 17:32