Support JSON lines files #120

Open
ivbeg opened this issue Apr 18, 2022 · 8 comments

@ivbeg

ivbeg commented Apr 18, 2022

Please add support for JSON lines files (https://jsonlines.org/).
Many such files are published and in use. Some of them are huge and hard to convert to JSON.

@evinism
Owner

evinism commented Apr 19, 2022

Fantastic idea! No timeline yet on implementation, but definitely a very useful feature. I've run into this myself :)

@evinism
Owner

evinism commented Apr 19, 2022

Actually @ivbeg, would you be able to describe your ideal interface for such a feature? Would the program run the query over each JSON line individually, or treat the whole file as one large array?
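For readers following along, the two options in this question can be sketched as below. This is a minimal illustration, not MistQL code: the function names `query_per_line` and `query_as_array` are hypothetical, and plain Python callables stand in for a real query.

```python
import io
import json
from typing import Any, Callable, Iterable, Iterator


def query_per_line(lines: Iterable[str], query: Callable[[Any], Any]) -> Iterator[Any]:
    """Streaming mode: parse and query one record at a time."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield query(json.loads(line))


def query_as_array(lines: Iterable[str], query: Callable[[Any], Any]) -> Any:
    """Array mode: load every record into one list, then query that list once."""
    records = [json.loads(line) for line in lines if line.strip()]
    return query(records)


sample = '{"n": 1}\n{"n": 2}\n{"n": 3}\n'

# The per-line query sees a single record; the array query sees all of them.
per_line = list(query_per_line(io.StringIO(sample), lambda rec: rec["n"] * 10))
as_array = query_as_array(io.StringIO(sample), lambda recs: sum(r["n"] for r in recs))

print(per_line)  # [10, 20, 30]
print(as_array)  # 6
```

The streaming version holds one record in memory at a time, while the array version needs the whole file loaded before the query can run. In exchange, only the array version can answer questions that span records, such as a sum or a count.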

@ivbeg
Author

ivbeg commented Apr 19, 2022

@evinism It would be great to support both ways of processing JSON lines files, but the streaming feature is more important, since there are huge JSON lines files, up to 100GB+ compressed. I could provide several examples from public datasets if needed. It's nearly impossible to process such files as one large array.

I've developed a command-line tool, undatum (https://github.com/datacoon/undatum), that supports data processing and conversion of JSON lines and BSON files. BSON is a binary format used by the MongoDB NoSQL database, very similar to JSON lines. So I would like to integrate a query language into undatum to use it with data processing/conversion operations. I've already used dictquery (https://github.com/cyberlis/dictquery), but it's good for filtering only.

@evinism
Owner

evinism commented Apr 19, 2022

Streaming mode for processing jsonl sounds right to me too. Not sure when I'll get to this, but it's definitely something I want to tackle.

@ivbeg
Author

ivbeg commented Apr 20, 2022

@evinism I've added experimental support for mistql to undatum; it's available in main (https://github.com/datacoon/undatum), version 1.0.13.
The command is "undatum query -q <yourquery> <filename>", where the file can be csv, jsonl or bson.

I hope it could help.

evinism added the python label Jun 2, 2022
@evinism
Owner

evinism commented Jun 2, 2022

Adding @ilan-pinto to this thread. For now, let's work on getting this up and running in Python.

@ilan-pinto
Contributor

Hi, please assign it to me.

@evinism
Owner

evinism commented Jun 3, 2022

For reference, a possible interface for this feature could look like this:

tail file.log | python -m mistql.cli foo.bar --lines > processed.jsonl

Note that the query runs in a streaming manner: for each JSON line in file.log, the CLI writes the query result for that line to processed.jsonl.
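A rough sketch of the streaming loop such an interface implies is below. It uses only the standard library and is not the mistql CLI. `process_lines` and `get_path` are hypothetical names, and `get_path` is a simplified stand-in for a query like foo.bar. Returning null for a missing key is an assumption of this sketch.

```python
import io
import json
from typing import Any, TextIO


def get_path(record: Any, path: str) -> Any:
    """Hypothetical stand-in for a query such as foo.bar: follow dotted keys."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # assumption: a missing key yields null
        current = current[key]
    return current


def process_lines(source: TextIO, sink: TextIO, path: str) -> int:
    """Read JSON lines from source and write one result line per input record."""
    count = 0
    for line in source:  # iterating a text stream reads one line at a time
        if not line.strip():
            continue  # skip blank lines
        result = get_path(json.loads(line), path)
        sink.write(json.dumps(result) + "\n")
        count += 1
    return count


source = io.StringIO('{"foo": {"bar": 1}}\n{"foo": {"bar": [2, 3]}}\n{"foo": {}}\n')
sink = io.StringIO()
written = process_lines(source, sink, "foo.bar")
print(sink.getvalue(), end="")
```

In a real pipe, `sys.stdin` and `sys.stdout` would take the place of the two `StringIO` objects. The input is then never held in memory as a whole, which is what makes very large files workable.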


3 participants