Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoexposure example restoration #137

Open
wants to merge 48 commits into
base: master
Choose a base branch
from

Conversation

devshgraphicsprogramming
Copy link
Member

No description provided.

nipunG314 and others added 28 commits July 19, 2024 22:07
Comment on lines 12 to 17
float meteringWindowScaleX, meteringWindowScaleY;
float meteringWindowOffsetX, meteringWindowOffsetY;
float lumaMin, lumaMax;
uint32_t sampleCountX, sampleCountY;
uint32_t viewportSizeX, viewportSizeY;
uint64_t lumaMeterBDA;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should pack these up intro structs in nbl::hlsl::luma_meter and friends

P.S. I see you already have, why not use things like nbl::hlsl::luma_meter::LumaMeteringWindow directly here?

params.shader.entryPoint = "main";
params.shader.entries = nullptr;
params.shader.requireFullSubgroups = true;
params.shader.requiredSubgroupSize = static_cast<IGPUShader::SSpecInfo::SUBGROUP_SIZE>(5);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be set from m_physicalDevice->getLimits().maxSubgroupSize, not just set to 32

Comment on lines 120 to 122
m_lumaPresentDS[0] = lumaPresentPool->createDescriptorSet(core::smart_refctd_ptr(lumaPresentDSLayout));
m_lumaPresentDS[1] = lumaPresentPool->createDescriptorSet(core::smart_refctd_ptr(lumaPresentDSLayout));
if (!m_lumaPresentDS[0] || !m_lumaPresentDS[1])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you double buffering descriptor sets ? the only reason old example did it was because of round robining the luma meter output SSBOs.

Now we have BDA in Push Cosntants

assert(allocation->memory.get() == buffer->getBoundMemory().memory);
};

build_buffer(m_device, &m_lumaGatherAllocation, buffer, m_physicalDevice->getLimits().maxSubgroupSize, "Luma Gather Buffer");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you seem to have forgotten that there's a sizeof(uint32_t)* needed on your maxsubgroup size.

Comment on lines +331 to +338
// Allocate and Leave 1/4 for image uploads, to test image copy with small memory remaining
{
uint32_t localOffset = video::StreamingTransientDataBufferMT<>::invalid_value;
uint32_t maxFreeBlock = m_utils->getDefaultUpStreamingBuffer()->max_size();
const uint32_t allocationAlignment = 64u;
const uint32_t allocationSize = (maxFreeBlock / 4) * 3;
m_utils->getDefaultUpStreamingBuffer()->multi_allocate(std::chrono::steady_clock::now() + std::chrono::microseconds(500u), 1u, &localOffset, &allocationSize, &allocationAlignment);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this, this is test code from ex 24

Comment on lines 457 to 482
IGPUDescriptorSet::SDescriptorInfo info1 = {};
info1.info.image.imageLayout = IImage::LAYOUT::READ_ONLY_OPTIMAL;
info1.desc = m_gpuImgView;

IGPUDescriptorSet::SDescriptorInfo info2 = {};
info2.info.image.imageLayout = IImage::LAYOUT::READ_ONLY_OPTIMAL;
info2.desc = m_gpuImgView; // FIXME: temporarily pass in input image

IGPUDescriptorSet::SWriteDescriptorSet writeDescriptors[] = {
{
.dstSet = m_lumaPresentDS[0].get(),
.binding = 0,
.arrayElement = 0,
.count = 1,
.info = &info1
},
{
.dstSet = m_lumaPresentDS[1].get(),
.binding = 0,
.arrayElement = 0,
.count = 1,
.info = &info2
}
};

m_device->updateDescriptorSets(2, writeDescriptors, 0, nullptr);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// We do a very simple thing, display an image and wait `DisplayImageMs` to show it
inline void workLoopBody() override
{
const uint32_t SubgroupSize = m_physicalDevice->getLimits().maxSubgroupSize;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a member, its painful to see it being initialized multiple times

Comment on lines +498 to +499
auto cmdbuf = m_computeCmdBufs[0].get();
cmdbuf->reset(IGPUCommandBuffer::RESET_FLAGS::NONE);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd start the example same as the others with the frames in flight (ex24 doesn't do FIF because it displays the same image for a long time), so proper round robining of multiple commandbuffers, blocks and waits before resets

Comment on lines 504 to 507
.meteringWindowScaleX = MeteringWindowScale[0] * m_gpuImg->getCreationParameters().extent.width,
.meteringWindowScaleY = MeteringWindowScale[1] * m_gpuImg->getCreationParameters().extent.height,
.meteringWindowOffsetX = MeteringWindowOffset[0] * m_gpuImg->getCreationParameters().extent.width,
.meteringWindowOffsetY = MeteringWindowOffset[1] * m_gpuImg->getCreationParameters().extent.height,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you still calling it a scale if its a literal dimension!?

Also if you're using UV normalized coordinates via Sample() with float coords and explicit lod level, they should stay as they are and you shouldn't care about the input image extents!

.sampleCountY = SampleCount[1],
.viewportSizeX = m_gpuImg->getCreationParameters().extent.width,
.viewportSizeY = m_gpuImg->getCreationParameters().extent.height,
.lumaMeterBDA = m_lumaGatherBDA
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need two addresses to ping-pong between, one that you write current luma data to, and one you clear for a "future" dispatch

First frame's luma dispatch writes to A, clears B
Second frame's luma dispatch writes to B, clears A

Note that its not possible to use the tonemapping shader to clear the same buffer, as one workgroup may still be running while another quits and without an extra atomic (which would have to be cleared) you can't know which workgroup is the "last to quit"

cmdbuf->bindComputePipeline(m_lumaMeterPipeline.get());
cmdbuf->bindDescriptorSets(nbl::asset::EPBP_COMPUTE, m_lumaMeterPipeline->getLayout(), 0, 1, &ds); // also if you created DS Set with 3th index you need to respect it here - firstSet tells you the index of set and count tells you what range from this index it should update, useful if you had 2 DS with lets say set index 2,3, then you can bind both with single call setting firstSet to 2, count to 2 and last argument would be pointet to your DS pointers
cmdbuf->pushConstants(m_lumaMeterPipeline->getLayout(), IShader::E_SHADER_STAGE::ESS_COMPUTE, 0, sizeof(pc), &pc);
cmdbuf->dispatch(1 + (SampleCount[0] - 1) / SubgroupSize, 1 + (SampleCount[1] - 1) / SubgroupSize);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is backwards, you should get your sample count from workgroup dispatch count, not the other way round

Comment on lines 572 to 576
m_EV = 0.0f;
for (int index = 0; index < SubgroupSize; index++) {
m_EV += static_cast<float>(buffData[index]) / (log2(LumaMinMax[1]) - log2(LumaMinMax[0])) + log2(LumaMinMax[0]);
}
m_EV /= (SampleCount[0] * SampleCount[1]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the discord conversation, the per-workgroup-averages need to be in the [0,leftoverBits) range already

then your decode step is

for ()
   ev += static_cast<float>(buffData[index]) * (log2(LumaMinMax[1]) - log2(LumaMinMax[0])) / ((1<<fixedPointBitsLeft)-1) + log2(LumaMinMax[0]);
ev /= float(workgroupCount[0]*workgroupCount[1]);

Comment on lines 591 to 604
auto pc = AutoexposurePushData
{
.meteringWindowScaleX = MeteringWindowScale[0] * m_gpuImg->getCreationParameters().extent.width,
.meteringWindowScaleY = MeteringWindowScale[1] * m_gpuImg->getCreationParameters().extent.height,
.meteringWindowOffsetX = MeteringWindowOffset[0] * m_gpuImg->getCreationParameters().extent.width,
.meteringWindowOffsetY = MeteringWindowOffset[1] * m_gpuImg->getCreationParameters().extent.height,
.lumaMin = LumaMinMax[0],
.lumaMax = LumaMinMax[1],
.EV = m_EV,
.sampleCountX = SampleCount[0],
.sampleCountY = SampleCount[1],
.viewportSizeX = m_gpuImg->getCreationParameters().extent.width,
.viewportSizeY = m_gpuImg->getCreationParameters().extent.height,
.lumaMeterBDA = m_lumaGatherBDA
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're overdeclaring push constants to share the layout between two pipelines, then only build them once

Comment on lines +586 to +588
auto queue = getGraphicsQueue();
auto cmdbuf = m_graphicsCmdBufs[0].get();
cmdbuf->reset(IGPUCommandBuffer::RESET_FLAGS::NONE);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +658 to +663
// we don't need to wait for the transfer semaphore, because we submit everything to the same queue
const IQueue::SSubmitInfo::SSemaphoreInfo acquired[1] = { {
.semaphore = acquire.semaphore,
.value = acquire.acquireCount,
.stageMask = PIPELINE_STAGE_FLAGS::NONE
} };
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to wait for the compute queue semaphore (from the luma metering) as well as the acquire

because eventually you'll compute the EV in the tonemapping pipeline and stop doing any host wait on the compute queue.

Comment on lines 677 to 687
// Wait for completion
{
const ISemaphore::SWaitInfo cmdbufDonePending[] = {
{
.semaphore = m_presentSemaphore.get(),
.value = m_submitIx
}
};
if (m_device->blockForSemaphores(cmdbufDonePending) != ISemaphore::WAIT_RESULT::SUCCESS)
return;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time to remove this and do Frames in Flight

constexpr float Key = 0.18;
auto params = ToneMapperClass::Params_t<TMO>(Exposure, Key, 0.85f);
{
params.setAdaptationFactorFromFrameDelta(0.f);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we used this basically
https://en.wikipedia.org/wiki/Exponential_smoothing

to blend the EV value over time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 was a special value to "reset" the exposure (allow it to use the instantaneous value)

@devshgraphicsprogramming
Copy link
Member Author

Whats the status on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants