Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Spatialite and MySQL WKB dialects #153

Merged
merged 19 commits into from
Aug 12, 2023

Conversation

Oreilles
Copy link
Contributor

@Oreilles Oreilles commented Aug 9, 2023

This PR adds support for Spatialite and MySQL WKB dialects.

I'd happily discuss the implementation. Quirks of the spatialite dialect that required some refactoring:

  • A geometry blob ends with 0xFE. In order to know wether the polygon_end (or alike) is the last one, we keep track of the collection entity nesting level. If the nesting level is 0 when a geometry ends, it means it is the last one of the blob and the terminating byte 0xFE must be appended. Even though MULTI* geometries cannot be part of a GEOMETRYCOLLECTION in Spatialite at the moment, this was my easiest go at the problem - and it also makes it more future-proof in case that feature is added at some point.
  • In contrary to EWKB, the byte order and SRID info are not repeated in the nested entities headers, which means they have to be inherited from the parent entity info. In order to do that, I had to change the signature of the read_header parameter of process_wkb_geom_n so that the parent geometry info can be passed down to its children.
  • The "compressed" polygon and linestring geometries stores their X,Y,Z coordinates as f32 values, relative to the previous one (with the exception of the first and the last one), which required some refactoring to process_coord to avoid code duplication between the two implementations.

Side notes:

  • Shouldn't WkbWriter implement GeomProcessor::srid() ?
  • Only read support has been implemented for Compressed and TinyPoint geometries, write isn't implemented.
  • Spatialite header stores the envelope as [minx, miny, maxx, maxy] (like GeoJSON) whereas gpkg stores it as [minx, maxx, miny, maxy]

I also refactored the roundtrip tests so they can be shared by the 5 dialects.

@nyurik nyurik requested review from pka and nyurik August 9, 2023 23:49
Copy link
Member

@pka pka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition, thanks a lot @Oreilles !

Please do a cargo fmt and we're good to merge.

@Oreilles
Copy link
Contributor Author

Thanks for the review, the latest commit fixes the formatting issues!

@pka
Copy link
Member

pka commented Aug 10, 2023

Thanks! Now we just have to make clippy happy as well. To test locally, you just have to run cargo clippy or to be sure with all options as in the CI: cargo clippy --workspace --all-features --bins --examples --tests --benches.

@Oreilles
Copy link
Contributor Author

I think it's good now!

geozero/src/wkb/wkb_common.rs Outdated Show resolved Hide resolved
@@ -293,17 +424,59 @@ pub(crate) fn process_wkb_geom_n<R: Read, P: GeomProcessor>(
}
}

fn emit_coord<P: GeomProcessor>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good approach here - you first create coordinate tuple on line 462, and then you discover you don't even need it if multi_dim is false. I would inline this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with latest commit

geozero/src/wkb/wkb_reader.rs Show resolved Hide resolved
@@ -198,3 +200,12 @@ pub(crate) enum WKBByteOrder {
Xdr = 0, // Big Endian
Ndr = 1, // Little Endian
}

impl From<scroll::Endian> for WKBByteOrder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good addition, thanks! I wonder if we need a reverse as well?

} else {
scroll::LE
};
let endian = Endian::from(byte_order != 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh, Endian::from(bool) is not the most readable interface. I had to look up what true means :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. this one is not exactly actionable - just a note in case you can think of a cleaner API usage here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I agree that Endian::from(bool) is not the very explicit at first glance... If we implement the reverse of From<scroll::Endian> for WKBByteOrder as you suggested, then maybe we could also implement From<u8> for WKBByteOrder and chain the two. That's a lot of code for basically a bool check though so...

Copy link
Member

@nyurik nyurik Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, probably not worth the complexity - maybe just add an extra line to make it more readable, like this?

let is_little_endian = byte_order != 0;
let endian = Endian::from(is_little_endian);

P.S. you can actually inline the second line down below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, I'll do that 👍

geozero/src/wkb/wkb_reader.rs Outdated Show resolved Hide resolved
@nyurik
Copy link
Member

nyurik commented Aug 10, 2023

lol, seems like @pka and I were reviewing it at the same time...

nyurik
nyurik previously requested changes Aug 10, 2023
Copy link
Member

@nyurik nyurik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see inline

} else {
scroll::LE
};
let is_little_endian = byte_order != 0;
Copy link
Member

@nyurik nyurik Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out there is a much cleaner solution, but it requires a bit of research. https://docs.rs/scroll/latest/scroll/ has a lot of docs for that, but essentially instead of using raw.ioread::<u8>() you can use raw.ioread::<WKBByteOrder>. In order for that to work, you would have to implement impl IntoCtx<scroll::Endian> for WKBByteOrder { or something similar (I never used this crate, just saw some docs about cread vs gread vs pread - not sure which is which). At the end of the day, it should offer a very easy way to just do all the on-the-fly conversions cleanly

P.S. Note that you cannot do raw.ioread::<Encoding> because you can only implement a trait for the type in the same crate, and Encoding enum is not part of the geozero

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.P.S. note that we don't need to fix ALL such issues at the same time, can be done in another PR. For more docs on this - https://github.com/m4b/scroll#stdio-api

Copy link
Member

@nyurik nyurik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor nits, but looks almost ready to merge. Thanks for all the patience!! :)

geozero/src/wkb/wkb_reader.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_reader.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Outdated Show resolved Hide resolved
geozero/src/wkb/wkb_writer.rs Show resolved Hide resolved
@Oreilles
Copy link
Contributor Author

Thanks for all the patience!! :)

Thanks for your time, it's very pleasing contributing to open source projects with responsive maintainers 👍

@nyurik nyurik self-requested a review August 11, 2023 19:50
@nyurik
Copy link
Member

nyurik commented Aug 11, 2023

please fix the conflict and lets merge

@Oreilles
Copy link
Contributor Author

Solved

@nyurik
Copy link
Member

nyurik commented Aug 11, 2023

At first i made a few minor comments here, but then just created a PR for your PR - please merge if agree -- Oreilles#1

@Oreilles
Copy link
Contributor Author

Oreilles commented Aug 12, 2023

I merge the two PR!
On an unrelated side, I have a question about the processor functions xy and coordinate: since all parameter of coordinate but x and y are of type Option, would it be possible to only ever use coordinate and let the writer decide what to do with them ?
My concern is that if you parsed for example a Wkb blob with XY coordinates, and process it with a WkbWriter for which you set dims = CoordDimensions::xyzm(), then the generated WKB would be invalid: its header would indicate the presence of XYZM values, but since only processor.xy would be called by the reader (since the input doesn't have ZM), the ZM coordinates bytes wouldn't be written as they should.
Same goes if your processed a Wkb blob that has XYZM, with processor with dims = CoordDimensions::xy(). The header would indicate only XY coordinates, but XYZM would actually end up included.
Having the readers always call processor.coordinate and let the processors decide what to do with the unnecessary (or absent) coordinate values would mitigate this issue.
I picked a Wkb to Wkb example for the sake of simplicity but that behavior seems to be shared by other readers and writers as well.

@pka
Copy link
Member

pka commented Aug 12, 2023

On an unrelated side, I have a question about the processor functions xy and coordinate: since all parameter of coordinate but x and y are of type Option, would it be possible to only ever use coordinate and let the writer decide what to do with them ?

The reason for having two methods was performance. This method is the innermost loop of any geometry operation (except for points) and passing additional optional params costs performance for the most common xy case.

My concern is that if you parsed for example a Wkb blob with XY coordinates, and process it with a WkbWriter for which you set dims = CoordDimensions::xyzm(), then the generated WKB would be invalid

Your concerns are valid, if the reader implementing the GezeroGeometry trait doesn't look at the requested dimensions.
Code example for dimension handling:

if multi_dim {
processor.coordinate(
point_type[0],
point_type[1],
point_type.get(2).copied(),
None,
None,
None,
idx,
)
} else {
processor.xy(point_type[0], point_type[1], idx)
}
}

WkbReader looks also fine, but there are maybe readers which don't respect GeomProcessor::dimensions resp. GeomProcessor::multi_dim, which would be a bug.

In your WKB xy to WKB xyzm example, I don't think that the result is invalid, you just loose the additional dimensions.

@Oreilles
Copy link
Contributor Author

Oreilles commented Aug 12, 2023

In your WKB xy to WKB xyzm example, I don't think that the result is invalid, you just loose the additional dimensions.

If we process a WKB that doesn't have ZM coordinates, it will only call processor.xy:

if multi_dim {
processor.coordinate(x, y, z, m, None, None, idx)
} else {
processor.xy(x, y, idx)
}

And then only X and Y will be written:

fn xy(&mut self, x: f64, y: f64, _idx: usize) -> Result<()> {
if self.geom_state == GeomState::MultiPointGeom {
self.write_header(WKBGeometryType::Point)?;
}
self.out.iowrite_with(x, self.endian)?;
self.out.iowrite_with(y, self.endian)?;
Ok(())

When a parser will try to read the WKB emitted, il will try to read four f64 values for each point (the header says so), but only two would have been written.

The reason for having two methods was performance. This method is the innermost loop of any geometry operation (except for points) and passing additional optional params costs performance for the most common xy case.

You might have tested it already (I haven't) but in this case, maybe the branching caused by having to handle four different outcomes outweight the performance gain of only calling the method with the least amout of parameters:

  • The reader emitted XY, and the writer have to write XY
  • The reader emitted XY, but the writer have to write XYZM
  • The reader emitted XYZM, but the writer have to write XY
  • The reader emitted XYZM, and the writer have to write XYZM

In any case, I'm more concerned about the bug opportunities here than performance 😅

@nyurik
Copy link
Member

nyurik commented Aug 12, 2023

@Oreilles i think this one is ready to go once conflict is fixed. One thing I forgot to ask about the other PR (which affect this PR as well) -- why do you need to call processor.srid(info.srid)?; everywhere? Shouldn't the processor be created with a given srid once?

@Oreilles
Copy link
Contributor Author

Oreilles commented Aug 12, 2023

@nyurik calling processor.srid(info.srid) allowed for inferring the EWKT geometry srid from the WKB blob without having to manually specify it. The trait ToWkt has a to_ewkt function that have an optional srid argument for when the source Wkb doesn't have one, but you still want one in the output. If the input Wkb has an srid defined, then this parameter can be omited and the one provided by the reader will be used (which is why I had to call processor.srid(info.srid). I wrote this so that the srid provided in the to_ewkt function parameter takes predominance over the one provided by the reader:

fn srid(&mut self, srid: Option<i32>) -> Result<()> {
self.srid = self.srid.or(srid);
Ok(())
}

@nyurik nyurik merged commit a92ef80 into georust:main Aug 12, 2023
1 check passed
@nyurik
Copy link
Member

nyurik commented Aug 12, 2023

Awesome work, thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants