Hi guys, hope you don't mind if I spam you with a product placement.
I've been working on a complete rewrite of the tokenization and corpus management parts of bookworm into a standalone package called nonconsumptive: https://github.com/bmschmidt/nonconsumptive.
The most important part is a tokenization rewrite that probably doesn't matter for this project. But it also has a reasonably decent corpus ingest function. I thought of it for this project while looking at the ingest files @pleonard212 put up for the Yale speeches. The goal there is to pull any of a variety of input formats (mallet-like, ndjson, keyed filenames) into a single compressed, mem-mappable file suitable for all sorts of downstream uses. I'm using it for scatterplots and bookworms, but I think it would also suit your use case here.
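For concreteness, here's a minimal sketch of that kind of conversion written directly against pyarrow rather than nonconsumptive's own API (which may well differ); the filename and record shape are hypothetical:

```python
# Rough sketch of the ndjson -> single-file path, using pyarrow directly.
# (Illustrative only: this is not nonconsumptive's API, and "corpus.ndjson"
# and the record shape are made up.)
import json
import pyarrow as pa
import pyarrow.feather as feather

# One JSON record per line, e.g. {"id": "...", "text": "...", ...}
with open("corpus.ndjson") as f:
    records = [json.loads(line) for line in f]

table = pa.Table.from_pylist(records)

# Feather v2 supports compression; reading back with memory_map=True maps
# the file rather than copying it (fully zero-copy when uncompressed).
feather.write_feather(table, "corpus.feather", compression="zstd")
mapped = feather.read_table("corpus.feather", memory_map=True)
```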
Can't remember how much I've already waxed lyrical to @duhaime about the parquet/feather ecosystem, but it's a hugely useful paradigm.
In the immediate term, the prime benefit for you would be that you'd be able to support ingest from a CSV file as well as the JSON format you currently have defined. Great, I guess. But on your side and mine, the real benefit would be that corpora built this way would be transportable to other systems at no cost, and that I could easily drop some intertext hooks into any previously built bookworms. Since the parquet/feather formats are quite portable, at the scale you're working at it might even be possible to bundle this into some kind of upload service.
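By way of illustration, the dual-ingest part could be as small as this (a sketch, not intertext's or nonconsumptive's actual code; the function name and extension-based dispatch are made up):

```python
# Sketch of "CSV or JSON in, one internal representation out" via pyarrow.
# Everything downstream sees the same pa.Table regardless of input format.
import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.json as pa_json

def load_corpus(path: str) -> pa.Table:
    """Normalize either input format into a single pyarrow Table."""
    if path.endswith(".csv"):
        return pa_csv.read_csv(path)
    if path.endswith((".json", ".ndjson")):
        # pyarrow.json expects newline-delimited JSON records.
        return pa_json.read_json(path)
    raise ValueError(f"unsupported input format: {path}")
```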
Let me know if you want to talk more.
I think this would be a great idea, as I'm often running a variety of tools (mallet, Intertext, philoline, etc.) on the same set of texts. I've resorted to keeping metadata in SQL tables (or, nowadays, CSVs -> pandas dataframes) and writing export logic for slightly-varying contexts and use cases...
Yeah, this is what I'm trying to define and what it would be great to do in concert. CSVs are too lossy on datatypes and don't support list items, both of which are indispensable. No one is doing this yet AFAICT, but parquet and feather have the infrastructure to support full LOD interop on metadata, too, which means it might even be possible to get some librarians on board.
And parquet/feather is nice because it's not platform- or language-dependent, reads into pandas or R far faster than CSVs, etc.
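To make the contrast concrete, here's a tiny round trip showing exactly what CSV loses and parquet keeps (column names are made up; assumes pyarrow is installed as the pandas parquet engine):

```python
# Demonstration of the two CSV pain points named above:
# typed columns and list-valued cells both survive a parquet round trip.
import pandas as pd

df = pd.DataFrame({
    "id": ["a1", "a2"],
    "year": pd.array([1855, None], dtype="Int64"),  # nullable typed ints
    "subjects": [["speech", "war"], ["speech"]],    # list column: no CSV equivalent
})

df.to_parquet("meta.parquet")          # uses pyarrow under the hood
back = pd.read_parquet("meta.parquet")

print(back.dtypes)                     # "year" comes back as Int64, not object/string
print(list(back["subjects"][0]))       # ['speech', 'war'] -- the list survives
```

The same `meta.parquet` opens unchanged in R via `arrow::read_parquet`, which is the cross-language point being made here.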