Web labelling interface - LogPrecis 🏷️

This repo contains the visual interface to visualise/expand the labelled dataset used for LogPrecis.

How it works

Broadly, it accepts as input parquet, json or csv files. Those files must contain a column named ‘session‘.

The script:

Initialise a Token Classification Pipeline loading LogPrecis.
Load 1 session at the time and obtain LogPrecis's predictions for it
- If the model is not confident on the prediction, the background color will be transparent.
In the Flask Web-Interface, it shows the session under analysis + the model's predictions for it:
- The session's text is highlighted according to the model's predictions
- A summary of the currently labelled chunks of text is also present

The labeller can decide whether to:

Trust the model's prediction and move to another session.
Correct the model's prediction:
- Some ‘x‘ buttons are present for all labelled chunks in the summary
- Remember: since the order of the chunks is important, once a chunk is deleted all the chunks after it will also be deleted.
Skip the session, without correcting or trusting the predictions.
Exit the program.

ATTENTION:

you cannot press the continue button if you do not label all characters (even spaces) in the session.
Separators will not be relevant for the labelling process
The script will obtain 1 label per statement (e.g., entity separated by Shell separators such as &, |, ||, ;). If, at the end, the number of labels does not match the number of statements, the script returns an error.

Once the exit button is clicked, the script will save the trusted/corrected labels. Also, the next time the script will start from the first non-labelled session in the input dataframe (so, order is important: do not shuffle the input dataset).

The output of the process is a parquet file with the same columns as the original one + a ‘fingerprint‘ column, output of the process. N.b. The user can specify the output path and output filename in the configuration file. The output filename is also used by the script to determine whether there were previous interrupted labelling attempts.

How to run

Install the conda environment:

conda env create -f environment.yml

Set your parameters in the config.json.
Start the web interface running:

python web_labeller.py config.json

Warnings ⚠️

LogPrecis can (and probably will) make mistakes
- Most common mistake: on turning points between labels, it will output noise.
LogPrecis has a limited output length: it can process at most 512 tokens
- If the session is longer, it will stop predicting.
At the moment, no mechanism is still implemented to enforce the labeller to label based on statements. In other words, labelling:
- "cat proc/cpuinfo | grep Model" as ["cat proc/" - Discovery, "cpuinfo | grep Model" - Persistence] Is possible, but will trigger an error when clicking the continue button.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
data_sample		data_sample
static		static
sub		sub
templates		templates
.gitignore		.gitignore
README.md		README.md
config.json		config.json
environment.yml		environment.yml
web_interface_sample.png		web_interface_sample.png
web_labeller.py		web_labeller.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Web labelling interface - LogPrecis 🏷️

How it works

How to run

Warnings ⚠️

About

Releases

Packages

Languages

SmartData-Polito/LogPrecis_Web_Labeller

Folders and files

Latest commit

History

Repository files navigation

Web labelling interface - LogPrecis 🏷️

How it works

How to run

Warnings ⚠️

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages