High bitrate, small devices #34
The FLAC format is based on subframes, and to decode the samples from two channels you really have to decode the entire first subframe before the second subframe (and its audio samples) even becomes visible in the FLAC bitstream. That being said, it may be possible to achieve better latency if the buffer used for decoding and sending samples to the speakers is made smaller: the decoder will then decode a single frame and refill the now smaller output buffer from it several times before having to decode the next frame.

I've updated blip (see mewspring/blip@db40fe5) so you can experiment with different buffer sizes; the buffer size can now be specified when invoking blip. Hopefully you can experiment with this value and find a sweet spot for the RPi.
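To make that concrete, a minimal sketch (assuming a caller-supplied write function for the sound card; this is not blip's actual code):

```go
import "github.com/mewkiz/flac"

// play drains each decoded FLAC frame in chunks of bufSize samples; a
// smaller bufSize means the sound card is fed fresh samples more often,
// at the cost of more frequent refills. write is a stand-in for the real
// audio output (e.g. portaudio).
func play(stream *flac.Stream, bufSize int, write func([]int32)) error {
	for {
		f, err := stream.ParseNext()
		if err != nil {
			return err // io.EOF at end of stream
		}
		samples := f.Subframes[0].Samples // first channel only, for brevity
		for i := 0; i < len(samples); i += bufSize {
			end := i + bufSize
			if end > len(samples) {
				end = len(samples)
			}
			write(samples[i:end])
		}
	}
}
```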
Can you not fill all even words of a unified outgoing buffer with data from the 1st subframe and then fill all odd words with data from the 2nd subframe? The data needs to be zipped together no matter what. But it is super inefficient to do it in a separate step using an extra buffer. I would like to eliminate this code:
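Simplified, the copy loop in question has this shape (a sketch, not the verbatim client code):

```go
// Zip the two decoded subframes into a single interleaved buffer for the
// sound card. (Illustrative names, not the verbatim client code.)
left := frame.Subframes[0].Samples
right := frame.Subframes[1].Samples
out := make([]int32, 2*len(left))
for i := range left {
	out[2*i] = left[i]    // even words: 1st subframe
	out[2*i+1] = right[i] // odd words: 2nd subframe
}
```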
I definitely see your point regarding performance. The update to unify samples from subframes into a single slice would be rather straightforward. However, not all applications and users of the API expect the data to be zipped. For instance, sound.Source of the zikichombo audio library expects samples from each channel to be placed directly after each other; e.g. n samples of the first channel followed by n samples of the second channel (see the sketch below). Consolidating the API to handle both cases is possible of course with a conditional check, and perhaps that is the direction to take going forward. As this change would update the API, we'd have to do it in version 2.x, which is right now in the planning stage. So any feedback and input is welcome :) Feel free to join the discussion in the 2.x roadmap issue: #33

P.S. I've added a bullet to track the issue of having to duplicate the audio sample slices.
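For contrast with the interleaving loop above, a sketch of the channel-sequential layout that sound.Source expects:

```go
// Channel-sequential layout, as expected by e.g. zikichombo's sound.Source:
// n samples of channel 0 followed by n samples of channel 1.
func sequential(left, right []int32) []int32 {
	out := make([]int32, 0, 2*len(left))
	out = append(out, left...)  // L L L ... L
	return append(out, right...) // R R R ... R
}
```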
I made a couple of measurements on the RPi.
Cases 3-5 represent different strategies in my code to hand over 24-bit data to portaudio. The different values (50-63%) demonstrate how the slicing and copying of data in my code has a serious impact on the overall load. Case 3 represents the copy loop shown two messages up (it is the most efficient of the three). Anyway, if this loop could be eliminated, I think CPU load could drop from 50% to below 45%. If you compare cases 1 and 2 (both 16/44.1) you see that playmp3 appears to be more efficient. I know it is apples and oranges, but the amount of decoded data handed over to the audio subsystem is the same in both cases. Do you think that mewkiz/flac (disregarding the copy loop in the client app) could be made more efficient still?

(In my first message I mentioned "50-70%" CPU load. I was referring to the combined load of playflac + pulseaudio.)
I created a mewkiz/flac "unified buffer" test implementation in which interleaved audio samples from all channels are combined in one buffer. I allocated a new frame.BlockSamples[] slice in frame.Parse(), and then applied the following change to parseSubframe(), where I zip the decoded data:
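```go
// Sketch of the change to parseSubframe (simplified, not the exact diff):
// after a subframe's samples are decoded, scatter them into the shared
// interleaved frame.BlockSamples buffer allocated in frame.Parse().
// ch is this subframe's channel index, nchannels the channel count.
for i, sample := range subframe.Samples {
	frame.BlockSamples[i*nchannels+ch] = sample
}
```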
I can hand over the resulting frame.BlockSamples to the audio subsystem without having to slice or copy anything in my client app. I took a shortcut and commented out correlate(), because it expects the data to still be in subframe.Samples[]. The same is true for frame.Hash. Different functions making assumptions about subframe.Samples[] and the order of data within appear to be the biggest work item when trying to fully implement this. One would also need to enable the client to somehow specify (or select) the desired output format.

I also wonder how beneficial it may be to use separate threads for decoding and outputting data to the sound system. The client app could do this on its own, but there would need to be a way to hand over alternating frame buffers. Any thoughts? (Some Go channels have yet to be utilized.)
Oh, most definitely! There is a benchmark you could run (and feel free to add to and extend the benchmark with samples from your own collection).
From https://github.com/mewkiz/flac/blob/master/frame/frame_test.go#L61:

```go
// The file 151185.flac is a 119.5 MB public domain FLAC file used to
// benchmark the flac library. Because of its size, it has not been included
// in the repository, but is available for download at
//
//    http://freesound.org/people/jarfil/sounds/151185/
```

Profiling of the mewkiz/flac library shows that there is some low-hanging fruit we could target to optimize performance. Specifically, the bit reader: it takes 65% of the cumulative time.
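For reference, the usual Go tooling invocation for this kind of measurement (standard flags, nothing specific to this repository):

```
go test -run=NONE -bench=. -cpuprofile=cpu.out ./frame
go tool pprof -top cpu.out
```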
Do you know how to make pprof show the number of function calls? I'd like to know this in relation to "time spent".

Can you please share your thoughts regarding the dual-thread idea? This could be a big win, especially if the interleaved-samples idea cannot be easily implemented. The client should be able to hand over a fresh subframe buffer, say, to ParseNext(). It could then juggle two buffers, handing one to the decoder while feeding the other to the audio sink. It wouldn't matter anymore if the client needs to spend some time reformatting data. The client could also use more than two frame buffers, should this ever be useful.
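In sketch form (parseNextInto and writeToSink are hypothetical; the current ParseNext does not take a caller-supplied buffer):

```go
// Dual-thread idea with recycled buffers. parseNextInto is a hypothetical
// ParseNext variant that decodes into a caller-supplied buffer; writeToSink
// stands in for the audio output.
const bufSize = 4096 // samples per buffer (illustrative)

free := make(chan []int32, 2)  // buffers free for decoding
ready := make(chan []int32, 2) // decoded buffers awaiting playback
free <- make([]int32, bufSize)
free <- make([]int32, bufSize)

go func() { // decoder goroutine
	defer close(ready)
	for buf := range free {
		if err := parseNextInto(stream, buf); err != nil {
			return // e.g. io.EOF
		}
		ready <- buf
	}
}()

for buf := range ready { // playback loop
	writeToSink(buf) // reformatting/copying here no longer stalls the decoder
	free <- buf      // hand the buffer back to the decoder
}
```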
This mpeg/mp3 package lets clients hand over the decode buffer. With this I can call the decoder multiple times without having to copy any data, and while I use the data, the decoder can continue crunching future data. I can provide any number of buffers, but in my console log below I am only using 4. This may be enough already to prevent all 24/96 underruns, even on the RPi. Thoughts?
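Generically, the hand-over pattern is something like this (illustrative signature, not the actual API of the package meant above):

```go
// A decoder API where the client owns the buffers: Decode fills dst with
// up to len(dst) decoded samples and reports how many it wrote.
// (Illustrative signature only.)
type Decoder interface {
	Decode(dst []int32) (n int, err error)
}
```

With a signature like this, the client can rotate four (or any number of) buffers between the decoder and the audio sink without copying.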
Hi. Playing hi-res audio on a RPi can easily drive 50-70% CPU load. You live by the "Output underflowed" message. It only takes another high-priority task (sshd) to move a little finger and... bang. But CPU load could be brought down easily. All it would take is for the decoder to interleave the audio channels: LLLLRRRRLLLLRRRRLLLLRRRR... Audio subsystems prefer it this way. Other decoders do this too. Can we have frame.Samples[i] without subframes? Would it be possible to pick the outgoing format when opening a file? Thank you for considering.