This repo contains the visual interface to visualise/expand the labelled dataset used for LogPrecis.
It accepts parquet, JSON or CSV files as input. Each file must contain a column named `session`.
The script:
- Initialises a Token Classification Pipeline loading LogPrecis.
- Loads one session at a time and obtains LogPrecis's predictions for it.
- If the model is not confident in a prediction, the background color will be transparent.
- In the Flask web interface, it shows the session under analysis together with the model's predictions for it:
  - The session's text is highlighted according to the model's predictions.
  - A summary of the currently labelled chunks of text is also shown.
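As a rough illustration of the confidence-based highlighting described above, the sketch below maps a prediction score to an inline CSS background. The function name `highlight_style`, the `threshold` value and the use of `rgba` are assumptions made for this example, not the repository's actual implementation.

```python
def highlight_style(rgb, score, threshold=0.5):
    """Return an inline CSS style for a highlighted chunk of text.

    rgb:       (r, g, b) colour assigned to the predicted label.
    score:     model confidence for the prediction, in [0, 1].
    threshold: hypothetical cut-off below which the background is transparent.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    # Low-confidence predictions get a fully transparent background.
    alpha = 0.0 if score < threshold else round(score, 2)
    r, g, b = rgb
    return f"background-color: rgba({r}, {g}, {b}, {alpha});"


confident = highlight_style((255, 0, 0), 0.9)
uncertain = highlight_style((255, 0, 0), 0.3)
```

With this choice, a labeller can spot uncertain spans at a glance because they lose their colour entirely.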
The labeller can decide whether to:
- Trust the model's prediction and move to another session.
- Correct the model's prediction:
  - An `x` button is present for every labelled chunk in the summary.
  - Remember: since the order of the chunks matters, deleting a chunk also deletes all the chunks after it.
- Skip the session, without correcting or trusting the predictions.
- Exit the program.
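The cascading deletion of chunks can be sketched as follows. This is a minimal illustration that assumes chunks are stored as an ordered list of `(text, label)` pairs; the helper names `delete_chunk` and `unlabelled_tail` are hypothetical and do not come from the repository.

```python
def delete_chunk(chunks, index):
    """Delete the chunk at `index` together with every chunk after it."""
    if not 0 <= index < len(chunks):
        raise IndexError(f"no chunk at position {index}")
    return chunks[:index]


def unlabelled_tail(session, chunks):
    """Return the part of the session that still has to be labelled."""
    labelled_chars = sum(len(text) for text, _ in chunks)
    return session[labelled_chars:]


chunks = [
    ("uname -a;", "Discovery"),
    (" cat /proc/cpuinfo | grep Model;", "Discovery"),
    (" echo key >> ~/.ssh/authorized_keys", "Persistence"),
]
session = "".join(text for text, _ in chunks)

# Deleting the second chunk also drops the third one.
remaining = delete_chunk(chunks, 1)
to_relabel = unlabelled_tail(session, remaining)
```

Because chunks are consecutive spans of the session, keeping only the prefix guarantees that the remaining labels still line up with the text.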
ATTENTION:
- You cannot press the `continue` button unless every character in the session (even spaces) is labelled.
  - Separators are not relevant for the labelling process.
- The script derives one label per statement (i.e., each entity delimited by shell separators such as `&`, `|`, `||`, `;`). If, at the end, the number of labels does not match the number of statements, the script returns an error.
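The statement/label consistency rule can be sketched as below. This is a naive illustration, not the script's actual code: the regular expression only covers the separators listed above and ignores quoting and redirections such as `2>&1`, and the names `split_statements` and `pair_labels` are assumptions.

```python
import re

# Separators listed above: `||` must be tried before the single `|`.
SEPARATORS = re.compile(r"\|\||[&|;]")


def split_statements(session):
    """Split a session into statements, dropping empty pieces."""
    return [part.strip() for part in SEPARATORS.split(session) if part.strip()]


def pair_labels(session, labels):
    """Pair each statement with its label, or fail if the counts differ."""
    statements = split_statements(session)
    if len(labels) != len(statements):
        raise ValueError(
            f"{len(labels)} labels for {len(statements)} statements"
        )
    return list(zip(statements, labels))


statements = split_statements("cat /proc/cpuinfo | grep Model; uname -a")
```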
Once the exit button is clicked, the script saves the trusted/corrected labels. The next time it runs, the script starts from the first non-labelled session in the input dataframe (so order is important: do not shuffle the input dataset).
The output of the process is a parquet file with the same columns as the original one plus a `fingerprint` column produced by the labelling process. N.B. The user can specify the output path and output filename in the configuration file. The output filename is also used by the script to determine whether there were previous interrupted labelling attempts.
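To show why the input order matters when resuming, here is a minimal sketch. It assumes that a previous output holds the already processed sessions in the same order as the input; the function name `first_unlabelled_index` and this resume strategy are hypothetical and may differ from what the script actually does.

```python
def first_unlabelled_index(input_sessions, labelled_sessions):
    """Return the position of the first session that still needs a label.

    Both arguments are lists of session strings. The previously labelled
    sessions are expected to be a prefix of the input, in the same order.
    """
    for position, labelled in enumerate(labelled_sessions):
        if (
            position >= len(input_sessions)
            or input_sessions[position] != labelled
        ):
            raise ValueError(
                "previous output does not match the input order; "
                "was the input dataset shuffled?"
            )
    return len(labelled_sessions)


resume_at = first_unlabelled_index(["s1", "s2", "s3"], ["s1", "s2"])
```

Under this assumption, shuffling the input between runs makes the saved labels point at the wrong sessions, which the check above turns into an explicit error.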
- Install the conda environment: `conda env create -f environment.yml`
- Set your parameters in `config.json`.
- Start the web interface by running: `python web_labeller.py config.json`
- LogPrecis can (and probably will) make mistakes.
  - Most common mistake: at turning points between labels, it outputs noise.
- LogPrecis has a limited output length: it can process at most 512 tokens.
  - If the session is longer, it will stop predicting.
- At the moment, no mechanism is implemented to force the labeller to label by statement. In other words, labelling `"cat proc/cpuinfo | grep Model"` as `["cat proc/" - Discovery, "cpuinfo | grep Model" - Persistence]` is possible, but will trigger an error when clicking the `continue` button.
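The failure in the example above can be reproduced with a small check that requires every statement to carry a single label. This is a sketch under assumptions: chunks are taken to be ordered `(text, label)` pairs covering the whole session, the separator pattern is a naive one, and `statement_spans` and `labels_per_statement` are hypothetical names.

```python
import re

SEPARATORS = re.compile(r"\|\||[&|;]")


def statement_spans(session):
    """Return (start, end) character offsets of each non-empty statement."""
    spans, start = [], 0
    for match in SEPARATORS.finditer(session):
        spans.append((start, match.start()))
        start = match.end()
    spans.append((start, len(session)))
    return [(s, e) for s, e in spans if session[s:e].strip()]


def labels_per_statement(session, chunks):
    """Return one label per statement, or fail if a statement mixes labels."""
    # Expand the ordered chunks into one label per character.
    char_labels = [label for text, label in chunks for _ in text]
    if len(char_labels) != len(session):
        raise ValueError("chunks do not cover the whole session")
    result = []
    for start, end in statement_spans(session):
        found = set(char_labels[start:end])
        if len(found) != 1:
            raise ValueError(
                f"statement {session[start:end].strip()!r} "
                f"has mixed labels: {sorted(found)}"
            )
        result.append(found.pop())
    return result


session = "cat proc/cpuinfo | grep Model"
valid = [("cat proc/cpuinfo |", "Discovery"), (" grep Model", "Persistence")]
invalid = [("cat proc/", "Discovery"), ("cpuinfo | grep Model", "Persistence")]

valid_labels = labels_per_statement(session, valid)
```

Calling `labels_per_statement(session, invalid)` raises a `ValueError`, because the first statement is split between `Discovery` and `Persistence`. The label attached to the separator itself never affects the result.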