Contiguous storage for path segments #369
Conversation
This is part of the larger multisampled path rendering work, under stroke rework (#303). It refactors the GPU pipeline so that the path segments available to fine rasterization are stored as a contiguous slice rather than a linked list as before.

Numerous parts of the pipeline are refactored. In the old pipeline, path segment decoding generated cubic segments and also estimated a (somewhat imprecise) bounding box, while the combination of flattening those cubics and tiling happened in a separate stage (path_coarse) quite a bit later in the pipeline. In the new pipeline, path decoding is fused with flattening, generating a `LineSoup` structure (line segments associated with paths, but otherwise unordered) with the bounding box computed as a side effect, and tiling is spread over multiple stages later in the pipeline.

The first tiling stage (path_count) counts the number of tiles that will be generated. Then coarse rasterization allocates contiguous slices based on those counts. The second stage (path_tiling) does a scattered write of the resulting tiles. Both of these stages rely on indirect dispatch, as the number of lines and the number of segments (respectively) are not known at encode time.

These changes only make sense for filled paths, so they rely on stroke expansion being done earlier, currently on the CPU.
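For readers skimming the diff, here is a rough CPU-side sketch of that count/allocate/scatter flow, written in the style of a Rust port. The struct layout and helper names are illustrative assumptions, not the actual shader interface:

```rust
// Unordered line segments tagged with their owning path ("line soup"),
// as produced by the fused decode + flatten stage.
struct LineSoup {
    path_ix: u32,
    p0: [f32; 2],
    p1: [f32; 2],
}

// After path_count has produced a per-tile segment count, coarse
// rasterization turns the counts into contiguous slice allocations
// from a single bump counter (an atomicAdd on the GPU). path_tiling
// then scatters each segment into its tile's slice.
fn allocate_slices(counts: &[u32]) -> (Vec<u32>, u32) {
    let mut starts = Vec::with_capacity(counts.len());
    let mut bump = 0u32;
    for &n_segs in counts {
        starts.push(bump);
        bump += n_segs;
    }
    // The total is only known after counting, which is why the
    // scatter stage is launched with indirect dispatch.
    (starts, bump)
}
```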
I left a number of minor comments and questions for clarification. The new pipeline structure and the changes make sense to me per all of our prior discussions and your posts around the topic.
I'm OK with merging the changes once you've addressed some of the comments.
let m = self.b.min(other.a);
ClipBic::new(self.a + other.a - m, self.b + other.b - m)
Do you mind documenting the logic here? Perhaps simply reference section 3 in https://browse.arxiv.org/pdf/2205.11659.pdf and mention the term "bicyclic semigroup".
Oops, left this out of the commit, but will get to it.
Update: this was fixed.
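For context, a sketch of the combine in question, mirroring the two lines quoted above; the comments are a gloss on section 3 of the linked paper, and the surrounding scaffolding is assumed:

```rust
/// Bicyclic semigroup element for the clip stack monoid: `a` counts
/// unmatched pops (clip ends) and `b` unmatched pushes (clip begins).
#[derive(Clone, Copy)]
struct ClipBic {
    a: u32,
    b: u32,
}

impl ClipBic {
    fn new(a: u32, b: u32) -> Self {
        ClipBic { a, b }
    }

    /// The associative combine: m of self's pushes are cancelled by
    /// m of other's pops; the remainder carries through.
    fn combine(self, other: ClipBic) -> ClipBic {
        let m = self.b.min(other.a);
        ClipBic::new(self.a + other.a - m, self.b + other.b - m)
    }
}
```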
shader/coarse.wgsl (outdated)
// We overload the "segments" field to store both count (written by
// path_count stage) and segment allocation (used by path_tiling and
// fine).
let n_segs = tile.segments;
This is up to you, but consider naming this `tile.segment_count_or_ix` or something similar. It may seem verbose, but it can help a reader who is new to the code base to know that this field has two interpretations.
Agreed, renamed. Remind me to apply the rename on the CPU port as well.
shader/coarse.wgsl (outdated)
let n_segs = tile.segments;
if n_segs != 0u {
    var seg_ix = atomicAdd(&bump.segments, n_segs);
    tiles[tile_ix].segments = ~seg_ix;
What's the reason for negating `seg_ix`?
There's a comment in `write_path` in coarse, but I added another in shared. I agree the renaming would be useful. This was a subtle error: originally both were stored as positive values, but I missed the fact that a tile could fail to be allocated because of clipping.
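To make the trick concrete, a hedged sketch of how a later stage can tell the two interpretations apart (this assumes counts and indices both stay below 2^31; the helper is illustrative, not the actual shader code):

```rust
// After coarse rasterization, the field is either a raw count (the
// tile was never allocated, e.g. it was clipped out) or the bitwise
// NOT of a slice start. Since both raw values are small, the high bit
// distinguishes them.
fn decode(segment_count_or_ix: u32) -> Option<u32> {
    if segment_count_or_ix >> 31 != 0 {
        // Allocated: recover the slice start index.
        Some(!segment_count_or_ix)
    } else {
        // Still a count; no slice exists for this tile.
        None
    }
}
```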
if n_segs != 0u {
    var seg_ix = atomicAdd(&bump.segments, n_segs);
    tiles[tile_ix].segments = ~seg_ix;
    alloc_cmd(4u);
Suggestion for a future improvement: it would be nice to have constants like `CMD_FILL_PAYLOAD_SIZE` declared in ptcl.wgsl, in place of the hard-coded `4u`.
Agreed, but I'll defer that from this PR. Even better would be having a way to share those constants between GPU and CPU, but that's probably too much to ask in 2023.
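As a purely illustrative sketch of one way to share such constants (not how this repository is set up): define the value once in Rust and splice it into the WGSL source when the pipeline is built.

```rust
// Hypothetical shared constant: the number of u32 words a PTCL fill
// command occupies.
pub const CMD_FILL_PAYLOAD_SIZE: u32 = 4;

// Prepend a WGSL declaration so the shader and host can never drift.
pub fn preprocess(wgsl_src: &str) -> String {
    format!(
        "const CMD_FILL_PAYLOAD_SIZE = {}u;\n{}",
        CMD_FILL_PAYLOAD_SIZE, wgsl_src
    )
}
```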
shader/fine.wgsl (outdated)
let n_segs = fill.size_and_rule >> 1u;
let even_odd = (fill.size_and_rule & 1u) != 0u;
area = fill_path(fill.seg_data, n_segs, fill.backdrop, xy, even_odd);
nit: I would have `fill_path` take a `CmdFill` and move the `n_segs` and `even_odd` computation to `fill_path`.
Agreed, this is much nicer. There's a reason for it: originally the segments were inline in the ptcl, so the interpreter loop needed to know the size. But now that they're in a separate buffer, it's possible to clean this up.
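In the style of the CPU port, the refactored shape might look like this (field names and types are assumptions based on the snippet above, and the coverage accumulation is elided):

```rust
struct CmdFill {
    size_and_rule: u32, // segment count << 1, even-odd rule in bit 0
    seg_data: u32,      // start of this fill's slice in the segment buffer
    backdrop: i32,
}

// With the segments in a separate buffer, the interpreter loop no
// longer needs the size, so the unpacking moves into fill_path.
fn fill_path(fill: &CmdFill, xy: [f32; 2]) -> f32 {
    let n_segs = fill.size_and_rule >> 1;
    let even_odd = (fill.size_and_rule & 1) != 0;
    let area = fill.backdrop as f32;
    // ... accumulate coverage at xy from the n_segs segments starting
    // at fill.seg_data, then apply the nonzero or even-odd rule ...
    let _ = (n_segs, even_odd, xy);
    area
}
```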
shader/flatten.wgsl (outdated)
@@ -85,6 +173,11 @@ fn read_i16_point(ix: u32) -> vec2<f32> {
    return vec2(x, y);
}

struct Transform {
    matrx: vec4<f32>,
nit: Maybe just spell out `matrix`, or if this must be shortened, go with `mat`? Just omitting the one letter doesn't seem that beneficial to me :)
BTW, will it make sense to store a 3x3 matrix here in the future when we want to support real perspective? Or is it beneficial to keep the translation component decomposed in that world too?
`matrix` is a keyword in WGSL. And I don't know yet the answer to that question; it depends on how the math works out. An affine could be stored as a `mat3x2f`, but I chose not to do that so that the transform could be written without having to multiply by 1. I'll go with `mat` as I do like conciseness.
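For reference, a sketch of how the decomposed affine applies to a point, mirroring the WGSL struct above (the Rust field and function names are assumed):

```rust
struct Transform {
    mat: [f32; 4],       // 2x2 linear part, column-major: [a, b, c, d]
    translate: [f32; 2], // translation kept separate from the matrix
}

// Applying the decomposed form avoids the multiply-by-1 that a
// homogeneous (3x3 or mat3x2) representation would entail.
fn apply(t: &Transform, p: [f32; 2]) -> [f32; 2] {
    [
        t.mat[0] * p[0] + t.mat[2] * p[1] + t.translate[0],
        t.mat[1] * p[0] + t.mat[3] * p[1] + t.translate[1],
    ]
}
```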
let MAX_QUADS = 16u;

fn flatten_cubic(cubic: Cubic) {
I like the new organization of the curve eval and flattening code that used to be in path_coarse.
let dx = abs(s1.x - s0.x);
let dy = s1.y - s0.y;
if dx + dy == 0.0 {
Curious: should this check be against an epsilon? Would you mind pointing out in which earlier stage a zero-length segment would have been culled (so I can better understand this exact comparison)?
No, it can't be a check against an epsilon, as the line might still cross a tile boundary, in which case it would affect winding numbers at higher levels of the subdivision hierarchy. This stuff is subtle.
It could be eliminated in `LineSoup` generation, but that becomes an additional global invariant, while having it here is more local reasoning; letting it slip through would be pretty bad.
Also, I took out the TODO on line 61, as that's been resolved.
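Restated as a Rust-port-style sketch of the guard (assuming, as the snippet suggests, that lines have already been oriented so that `s1.y >= s0.y`):

```rust
// Exact zero test, deliberately not an epsilon: a tiny but nonzero
// line can still cross a tile boundary and must contribute to winding
// numbers at higher levels of the subdivision hierarchy. Only a truly
// zero-length line may be dropped.
fn is_degenerate(s0: [f32; 2], s1: [f32; 2]) -> bool {
    let dx = (s1[0] - s0[0]).abs();
    let dy = s1[1] - s0[1]; // nonnegative for downward-oriented lines
    dx + dy == 0.0
}
```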
let count = span(s0.x, s1.x) + span(s0.y, s1.y) - 1u;
let dx = abs(s1.x - s0.x);
let dy = s1.y - s0.y;
let idxdy = 1.0 / (dx + dy);
Maybe document that `dx + dy` cannot be zero and the division is valid because a zero-length line should have been discarded by path_count?
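For context, a sketch of the `span` helper used in the count above: the number of tile-grid cells an interval touches along one axis, clamped to at least 1 (this is the usual formulation; treat it as an assumption about the actual shader code):

```rust
// Number of tile-grid cells the closed interval [min(a,b), max(a,b)]
// touches along one axis, never less than 1. A line touching n_x
// cells horizontally and n_y vertically crosses n_x + n_y - 1 tiles.
fn span(a: f32, b: f32) -> u32 {
    (a.max(b).ceil() - a.min(b).floor()).max(1.0) as u32
}
```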
Mostly comments, but also a little cleanup of function signatures in fine.
Refer to Arxiv paper and mention bicyclic semigroups.