
Stream support #60

Closed · drewlio opened this issue Sep 12, 2017 · 13 comments

drewlio commented Sep 12, 2017

@bhilburn Great talk at GRCon17. That is what has brought me here.

I was very excited to hear about SigMF because it's very close to solving a problem I have. I also have experience with other standards (Vita49, Midas Blue, custom) that don't quite solve my problem. However, I was very sad to see it's limited to flat files and doesn't seem to have a provision for stream transport. My problem is that I would like to "plug in" a stream and be able to figure out how to process it via embedded metadata. I believe SigMF could do this with some very small allowances in the spec:

  1. Add an optional field length in the core namespace, representing length of the dataset.
  2. Allow a format (alternative to files) of concatenating the metadata and dataset files into a frame, to form a stream of the same data. And of course for sending frames back-to-back, like this:
    [METADATA][DATASET][METADATA][DATASET][METADATA][DATASET]

EDIT--removed my point number 3 because the spec already has all JSON in a top-level object. (I originally missed this)

There are a lot of nice things about this. It's agnostic of your transport, and the overhead is application dependent. Some people might want infinite dataset length with one metadata header at the front. Some might want to tune length to repeat the metadata periodically, maybe every 10ms, so when a stream is attached it can discover the metadata and start processing in a timeframe that is appropriate for the application.
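To make the proposed framing concrete, here is a minimal Python sketch of a writer for it. This is only an illustration of the idea, not anything in the current spec: the core:length field is the proposed addition from point 1, and the metadata values are placeholders.

import json
import sys

def write_frame(stream, metadata, dataset):
    # Emit one [METADATA][DATASET] frame onto a byte stream.
    # "core:length" is the *proposed* optional field, not current SigMF.
    metadata["global"]["core:length"] = len(dataset)
    stream.write(json.dumps(metadata).encode("utf-8"))
    stream.write(dataset)

meta = {"global": {"core:datatype": "cf32_le"}, "captures": [], "annotations": []}
chunk = bytes(8 * 10000)  # placeholder: 10 ms of cf32 samples at 1 MSPS
write_frame(sys.stdout.buffer, meta, chunk)  # repeat back-to-back for a stream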

bhilburn (Contributor) commented

Hey @drewlio! Thanks so much for jumping in and getting involved! Sorry for the delayed response - as you might imagine, I really didn't get much done last week other than GRCon, and am catching up, now =)

Okay, this is a really interesting comment. So, one of the top questions we get about SigMF is "why didn't you just use VITA49?". I'm really interested to know why VITA49 doesn't work for you, especially since the format you suggested ([metadata][data][metadata][data]...) is an inline "packet-like" format, which is what VITA49 looks like.

Is there a way we could do this differently that would make it useful to you? I want to make sure I understand the difference =)

drewlio (Author) commented Sep 20, 2017

Quite honestly I don't have any powerful reasons why VITA49 couldn't be made to work in my situation. (Similarly, you have probably considered the question 'Can't I just stream VITA49 protocol to a file and call it SigMF?' Maybe it's possible, but that doesn't mean we want to do that.)

But here are the not-powerful reasons: I want a lightweight streaming/block protocol whose full-spec encoder/decoder I can freely and easily recreate in native languages (ECMAScript, Python, C, etc.). I want a format that feels right when passed via file, pipe, or, most importantly, through a microservices architecture (I believe what Jonathan Corgan was calling "client/server"). For instance, what if a short recording needs to pass through a cache (like Redis or Memcached)? Combining the metadata and data into a single blob is good here. Another microservices example is using Nginx as a load balancer and sending snapshots via POST as the application/octet-stream MIME type. Also, websockets.

Currently, I would not use VITA49 for these examples because of design decisions. I would come up with my own custom protocol that would look very much like SigMF. In fact, I could use SigMF as-is and put length in my own namespace. Then the only non-compliance would be that I concatenate the [metadata][data] in whatever container it resides in (file, socket, cache, POST, etc).
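As a rough sketch of that workaround (the drewlio:length field name and the URL are made up for illustration, and the HTTP call uses the third-party requests library):

import json
import requests  # third-party: pip install requests

meta = {"global": {"core:datatype": "cf32_le"}, "captures": [], "annotations": []}
data = bytes(8 * 100000)  # placeholder: 100 ms of cf32 samples at 1 MSPS

meta["global"]["drewlio:length"] = len(data)  # length in a custom namespace
blob = json.dumps(meta).encode("utf-8") + data  # the non-compliant part: concatenation

# The same blob drops straight into Redis, a websocket frame, or an HTTP POST:
requests.post("http://balancer.example/snapshots", data=blob,
              headers={"Content-Type": "application/octet-stream"})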

So my two suggested changes above were to:

  1. Allow optional length in core
  2. Allow optional concatenation of the [metadata][data]

I think these two items (number 2 is the most valuable) open up a lot of possible use cases. An interest area of mine is how to leverage the horizontal-scaling tech that the DevOps crowd has developed over the last ~5 years in response to cheap cloud VMs. I think this meshes well with what I heard at the conf about a possible microservices ("client/server") architecture and could be an enabling technology for horizontally scaling GNU Radio based systems.

mbr0wn (Contributor) commented Oct 3, 2017

Why do we need length in core? We already have a start-sample index. The length is implied in the number of samples in the capture, right?

drewlio (Author) commented Oct 3, 2017

@mbr0wn I could be overlooking something, but in the streaming case (...[metadata][data][metadata][data]...) there is nothing to imply the number of samples in the capture. There isn't an EOF. One way is to continually scan for SigMF JSON objects in the stream and take all the data in between them as [data]. That's a hard way to go, though. So length gives the streaming case the same information that you'd get from the file size.

Currently in the SigMF spec there is offset. Once you parse the stream and find two adjacent SigMF JSON objects you could compute the difference in offset to determine the length. But then you've already done the hard work of finding SigMF objects in the stream and you could just take the data in between as [data] and not worry about offset.

This has different levels of importance with different implementations. For instance if you're using something raw like pipes or UDP, or some framer like E1/T1, or a one-way optical link, this is very helpful. If you have some wrapper like DDS, ZeroMQ messages, or FileMQ, the impact of not having length is reduced. But even for packing in ZeroMQ messages, the option to concatenate the [metadata][data] into a single file seems helpful vs tracking a pair of files.
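For illustration, here is a rough Python sketch of reading one frame from a non-seekable stream, again assuming the proposed core:length field. Note that the metadata boundary itself still has to be found by scanning for the balanced closing brace, which is exactly the hard work described above:

import json

def read_frame(stream):
    # Scan byte-by-byte for the balanced top-level JSON object.
    buf, depth, in_string, escape = bytearray(), 0, False, False
    while True:
        c = stream.read(1)
        if not c:
            return None  # stream ended
        buf += c
        ch = chr(c[0])
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break  # top-level object closed: metadata is complete
    meta = json.loads(bytes(buf).decode("utf-8"))
    data = stream.read(meta["global"]["core:length"])  # proposed field
    return meta, data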

bhilburn (Contributor) commented

I think @drewlio is getting to a really fundamental point about SigMF. Do we want to support streaming formats, or not?

When we first started the SigMF effort, it was specifically to support the portability of datasets and metadata. It needed to be able to support applications (i.e., Readers and Writers) streaming data to disk, but that's quite different from supporting streaming data between applications. In short, SigMF was not originally designed to support streaming metadata (like VITA49 was), and that's apparent in its design, as @drewlio points out.

At GRCon, I heard from a lot of people who want this functionality. Indeed, here at DeepSig, we are talking about doing the same thing. One of the main reasons we haven't pursued streaming support in the past is that VITA49 exists and "already does that"; since the goal of SigMF is not to be just another standard for the sake of having one, and we hadn't identified streaming support as a primary goal, we weren't focused on enabling it.

I think it's time to circle back around to this question and re-debate it. Specifically, is a goal of SigMF to create a standard that can be used for streaming data? If we want the answer to this to be yes, then I think we need clear answers for why existing standards (e.g., VITA49) don't do what we need, and how we will do it better.

@drewlio provided some interesting insight as to why he wouldn't use VITA49. I'm copy/pasting it here, for reference:

But here are the not-powerful reasons: I want a lightweight streaming/block protocol whose full-spec encoder/decoder I can freely and easily recreate in native languages (ECMAScript, Python, C, etc.). I want a format that feels right when passed via file, pipe, or, most importantly, through a microservices architecture (I believe what Jonathan Corgan was calling "client/server"). For instance, what if a short recording needs to pass through a cache (like Redis or Memcached)? Combining the metadata and data into a single blob is good here. Another microservices example is using Nginx as a load balancer and sending snapshots via POST as the application/octet-stream MIME type. Also, websockets.

Currently, I would not use VITA49 for these examples because of design decisions. I would come up with my own custom protocol that would look very much like SigMF. In fact, I could use SigMF as-is and put length in my own namespace. Then the only non-compliance would be that I concatenate the [metadata][data] in whatever container it resides in (file, socket, cache, POST, etc).

I've heard from a lot of people that don't like VITA49, especially at GRCon, but that doesn't necessarily mean that we should try to re-invent it.

I would really like to hear opinions on this topic. What are your thoughts? Suggestions? Is there something VITA49 can't do or does poorly that you would like to see in SigMF? Why should this be something we address?

drewlio (Author) commented Oct 26, 2017

Just some thoughts, not really for or against either one, but these might be the pain points behind the grumblings you're hearing from GR users.

The people who said they want an alternative to Vita 49 probably don't want to replicate all of its functionality, but probably do want these:

  • Open. SigMF is an open standard. Here's where you can buy the ANSI Vita 49.2 spec for $100.
  • Extensible. The SigMF custom namespace is flexible. I actually don't know if Vita49 Context Packets can hold custom fields, maybe so. I don't know because Vita 49 is so...
  • [In]Convenient for high-level processing. JSON parsing is easy and JSON can hold all information necessary to convey information about the data. Vita 49 uses fixed-width bit fields with identifiers which may require a priori knowledge of the Vita 49 spec or the identifiers.
  • Portable. This is the big one. I used the word stream in the issue name but probably should have said blob because the important point is looking beyond files to be medium/transport agnostic. It can still be a file. And a concatenated [metadata][data] blob can also be passed around by all the high-level tools/platforms that we talked about earlier in a way that is more flexible than passing a pair of files and more extensible and convenient than passing Vita 49 blobs. And concatenated blobs can be a stream if someone wants that. It's a free world, man.

I would encourage you to look at SigMF as an easy and open standard for storing and passing a finite amount of signal data and metadata in a medium/transport agnostic way. It's good for files, blobs, streams--it's just payload in your medium/transport.

That leaves a clear distinction between SigMF and Vita 49. Vita 49 is better for low-level (ie FPGA) integration due to the fixed packet structure and natively handles the streaming protocol including acknowledgements as well as a fixed device control lexicon. The stable ANSI spec is better for long development timelines.

I see them as serving two distinct purposes, each with their place. In the extreme case, fully streaming SigMF could have a partial capability overlap with Vita 49. If the prospect of fragmenting users in that case is a concern enough to limit the use of SigMF to file pairs only, then that's an ok decision to make.

djanderson (Contributor) commented Oct 27, 2017

We have used something similar to the following in practice and it was effective:

[(uint64) metadata length in bytes] [JSON metadata] [data]

One thing that's nice about prepending the metadata with its length is that you don't have to read the stream char by char and try to watch for balanced object braces on the fly. In our case, we did have the length of the following data in the JSON, but since SigMF doesn't explicitly have that, we could use:

[(uint64) metadata length in bytes] [JSON metadata] [(uint64) data length in bytes] [data]

That way you don't have to parse the JSON on-the-fly, you can just stream both directly into a file. It would be an extremely minimalist extension to SigMF that would facilitate efficient streaming.
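A reader for that framing is only a few lines. A minimal sketch follows; the byte order of the uint64 fields isn't specified above, so little-endian is assumed here:

import json
import struct

def read_frame(stream):
    hdr = stream.read(8)
    if len(hdr) < 8:
        return None  # end of stream
    (meta_len,) = struct.unpack("<Q", hdr)             # uint64 metadata length
    meta = json.loads(stream.read(meta_len).decode("utf-8"))
    (data_len,) = struct.unpack("<Q", stream.read(8))  # uint64 data length
    data = stream.read(data_len)  # or stream straight to a .sigmf-data file
    return meta, data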

drewlio (Author) commented Oct 27, 2017

@djanderson Yeah, I get that. It seems like that would have its advantages. It's one step toward having all fixed-width fields like Vita 49. There is a trade-off for the designers between high-level structure for convenience and low-level structure for performance.

djanderson (Contributor) commented Oct 27, 2017

The assumption I'm making is that if someone is choosing streaming, they've already identified a performance or latency bottleneck that they're trying to address. Most of the use cases you specified fit into that category: caches, load balancers, sockets. In our case, we POSTed files via TCP if we could, but streamed the data over a UDP socket if low latency was more important than data integrity.

I don't know if forcing the metadata/data stream to be less than 18 quintillion bytes before another metadata message could be considered moving toward "fixed-width fields" :) but it does mean you can't just have a single metadata file followed by an infinite stream of data. I don't think we should strive to provide that.

I was just throwing it out there because it would allow us to provide for the streaming scenario (which I support) without actually baking that into the metadata format itself, because I agree with you: SigMF's simplicity is one of its biggest strengths.

drewlio (Author) commented Oct 27, 2017

For "fixed field" I meant the leading uint64 preceding the JSON. But that's a minor point.

You know, actually, when talking about all the caches and stuff, I was thinking more of API options than performance. If the data is in a single blob, it's an easy fit for things like Mongo, Redis, POST, and ZeroMQ. These are technologies that work well with horizontal scaling, so it's closely tied to the performance conversation.

djanderson (Contributor) commented

@drewlio, for that use case, have you seen our "archive" format, which packages metadata and data together in a single blob-like archive? We're also working on compression support for them in #68.

That doesn't address streaming them back-to-back, but you can break an acquisition into arbitrarily many captures and archive them up as a single blob.

drewlio (Author) commented Oct 30, 2017

The uncompressed archive format could work. It covers the two suggestions, which were:

  1. length is encoded
  2. single blob

Tar is a bit bloated, but the data portion will always be so big that I think it's reasonable. (Even for a small snapshot like 100 ms at 1 MSPS, the total tar overhead is <1%.)

There is a usability consideration and one important caveat:

Usability
For use cases where you are processing stream-wise, the algorithm goes:

  1. Wait for offset 124 and read 12 bytes; this is the length of the first file, encoded as an ASCII octal string or base-256 (selected by a flag bit).
  2. Check the flag and convert the length field to an integer.
  3. Wait for offset 512 and then read length bytes; this is the metadata JSON.
  4. Repeat steps 1-3 for the data portion (each header starts at the next 512-byte boundary); see the sketch below.
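For reference, here is roughly what those steps look like in Python (ustar layout: the 12-byte size field occupies bytes 124-135 of each 512-byte header):

def read_tar_member(stream):
    hdr = stream.read(512)
    if len(hdr) < 512 or hdr == b"\0" * 512:
        return None  # end-of-archive marker
    size_field = hdr[124:136]  # 12-byte size field
    if size_field[0] & 0x80:   # base-256 encoding, flagged by the high bit
        size = int.from_bytes(bytes([size_field[0] & 0x7F]) + size_field[1:], "big")
    else:                      # NUL/space-terminated ASCII octal
        size = int(size_field.split(b"\0")[0].strip() or b"0", 8)
    body = stream.read(size)
    stream.read((512 - size % 512) % 512)  # skip padding to the 512 boundary
    return hdr[:100].rstrip(b"\0").decode(), body  # (member name, contents)

# The first member must be the .sigmf-meta file for this to work stream-wise.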

IMO this loses some of the simplicity. In fact, I might rather just use what you suggested earlier:
[(uint64) metadata length in bytes] [JSON metadata] [data]

Caveat
Stream-wise processing works when the metadata is the first file in the archive. So whoever creates the archive blob would have to ensure that the metadata comes first, then data.

If you were to put this requirement (that metadata comes first) in the archive spec, that would be a pain for a lot of people who might be creating multi-signal archives on the command line like:
tar cf archive.sigmf *.sigmf-meta *.sigmf-data

It's even a pain for single signal archives because "sigmf-data" is alphabetically before "sigmf-meta". So this command would create the archive in the wrong order:
tar cf archive.sigmf mysignal.*
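One way to dodge the glob-ordering problem is to name the files explicitly (tar cf mysignal.sigmf mysignal.sigmf-meta mysignal.sigmf-data) or to build the archive programmatically; a minimal sketch with Python's tarfile:

import tarfile

with tarfile.open("mysignal.sigmf", "w") as tar:  # "w" = uncompressed tar
    tar.add("mysignal.sigmf-meta")  # added first, so stream-wise readers
    tar.add("mysignal.sigmf-data")  # see the metadata before the data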

So in summary, for several reasons it's hard (but not impossible) to make the archive format work well for the stream-wise processing case, but it is reasonable for the blob-wise processing case.

EDIT: And I think blob-wise processing is where the big wins are to be had in leveraging horizontal-scaling technologies, so I would be agreeable to adopting the uncompressed tar-based archive format as the suggested blob-wise solution.

bhilburn (Contributor) commented

This discussion was really finalized by @drewlio back in 2017, and has since been addressed in several other Issues & PRs. Cleaning up this issue.
