
# Model Training and Evaluation for Higgs Dataset

## Overview

This repository demonstrates training and evaluating a Keras model using the Higgs dataset available from the UCI ML Repository. The pipeline is built with Pandas, NumPy, Matplotlib, Keras, scikit-learn, and Dask.

## Higgs Dataset

The dataset has been studied in this publication:

> Baldi, P., P. Sadowski, and D. Whiteson. "Searching for Exotic Particles in High-Energy Physics with Deep Learning." Nature Communications 5, 4308 (2014).

The ML pipeline includes downloading the dataset, data preparation, model training, evaluation, feature importance analysis, and visualization of results. Dask is utilised to handle this large dataset with parallel processing.

## Installation

1. Create and activate a virtual environment:

   ```bash
   python -m venv env
   source env/bin/activate  # On Windows use `env\Scripts\activate`
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```
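For reference, `requirements.txt` pins the project's dependencies; an illustrative, unpinned version based on the libraries used throughout this pipeline:

```text
pandas
numpy
matplotlib
tensorflow       # provides the Keras API
scikit-learn
dask[dataframe]
```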

## Data

The Higgs dataset can be downloaded and prepared using the provided scripts, run as separate steps (approximate disk usage in parentheses):

- `download_data.py` (~2.6 GB)
- `data_extraction.py` (~7 GB)
- `data_preparation.py` (test dataset: ~240 MB, train dataset: ~5 GB)

Alternatively, you can run the main script directly:

```bash
python data/src/main.py
```

### Downloading Data

Downloads the dataset file from the specified URL, showing a progress bar.

#### Script

```bash
python data/download_data.py
```

#### Example Usage

```python
zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip'  # Higgs dataset URL
zipPath = '../higgs/higgs.zip'

downloadDataset(zipDataUrl, zipPath)
cleanUp(zipPath)  # Clean up downloaded zip file (~2.6 GB)
```
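For reference, a minimal sketch of what such a download helper might look like, assuming `requests` and `tqdm` are available (the actual implementation in `data/download_data.py` may differ):

```python
import os
import requests
from tqdm import tqdm

def downloadDataset(url: str, destPath: str, chunkSize: int = 1024 * 1024) -> None:
    """Stream a file from `url` to `destPath`, showing a progress bar."""
    os.makedirs(os.path.dirname(destPath) or '.', exist_ok=True)
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    totalBytes = int(response.headers.get('content-length', 0))
    with open(destPath, 'wb') as outFile, tqdm(total=totalBytes, unit='B', unit_scale=True) as bar:
        for chunk in response.iter_content(chunk_size=chunkSize):
            outFile.write(chunk)
            bar.update(len(chunk))
```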

### Data Extraction

Extracts the contents of the zipped dataset and decompresses the `.gz` dataset file to a specified output path.

#### Script

```bash
python data/data_extraction.py
```

#### Example Usage

```python
import os

zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip'  # Higgs dataset URL
extractTo = '../higgs'
zipPath = os.path.join(extractTo, 'higgs.zip')
gzCsvPath = os.path.join(extractTo, 'higgs.csv.gz')
finalCsvPath = os.path.join(extractTo, 'higgs.csv')

extractZippedData(zipPath, extractTo)
decompressGzFile(gzCsvPath, finalCsvPath)
cleanUp(gzCsvPath)  # Clean up gzipped file (~2.6 GB)
```
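Both extraction steps can be done with the standard library alone; a minimal sketch (the repository's `data_extraction.py` may differ in details):

```python
import gzip
import shutil
import zipfile

def extractZippedData(zipPath: str, extractTo: str) -> None:
    """Extract all members of the zip archive into `extractTo`."""
    with zipfile.ZipFile(zipPath, 'r') as archive:
        archive.extractall(extractTo)

def decompressGzFile(gzPath: str, outputPath: str) -> None:
    """Decompress a .gz file to `outputPath` without loading it all into memory."""
    with gzip.open(gzPath, 'rb') as gzFile, open(outputPath, 'wb') as outFile:
        shutil.copyfileobj(gzFile, outFile)
```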

### Data Preparation

Sets the column names and separates the test set from the training data based on the dataset description (the last 500,000 rows form the test set).

**Dataset Description:** The first column is the class label (1 for signal, 0 for background), followed by 28 features (21 low-level features, then 7 high-level features). The first 21 features (columns 2–22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features.

#### Script

```bash
python data/data_preparation.py
```

#### Example Usage

```python
import os

prepareFrom = '../higgs'
csvPath = os.path.join(prepareFrom, 'higgs.csv')
preparedCsvPath = os.path.join(prepareFrom, 'prepared-higgs.csv')

prepareData(csvPath, preparedCsvPath)
cleanUp(csvPath)  # Clean up the raw CSV file (~7.5 GB)
```
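Based on the dataset description, preparation amounts to naming the 29 columns and peeling off the last 500,000 rows as the test set. A hedged sketch (the column name `label` and the in-memory Pandas approach are assumptions; the repository's `prepareData` may differ):

```python
import pandas as pd

def prepareData(csvPath: str, preparedCsvPath: str, testRows: int = 500_000) -> None:
    """Name the columns and split the raw CSV into train and test files."""
    columns = ['label'] + [f'feature_{i}' for i in range(1, 29)]  # 1 label + 28 features
    dataFrame = pd.read_csv(csvPath, header=None, names=columns)  # ~11M rows; needs ample RAM
    trainPath = preparedCsvPath.replace('.csv', '_train.csv')
    testPath = preparedCsvPath.replace('.csv', '_test.csv')
    dataFrame.iloc[:-testRows].to_csv(trainPath, index=False)    # training split
    dataFrame.iloc[-testRows:].to_csv(testPath, index=False)     # last 500,000 rows as test set
```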

## Loading Data

### Using Pandas

Use the `dataLoader/data_loader.py` script to load the prepared dataset into a Pandas DataFrame.

#### Script

```bash
python data/src/data_loader.py
```

#### Example Usage

```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # or prepared-higgs_test.csv

dataLoader = DataLoader(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```

### Using Dask

Use the `dataLoader/data_loader_dask.py` script to load the prepared dataset into a Dask DataFrame, which is beneficial for this large dataset.

#### Script

```bash
python data/src/data_loader_dask.py
```

#### Example Usage

```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # or prepared-higgs_test.csv

dataLoader = DataLoaderDask(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```
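Under the hood, a Dask loader typically wraps `dask.dataframe.read_csv`, which splits the file into partitions and reads them lazily, so the ~5 GB training CSV never has to fit in memory at once. A minimal sketch (the `blocksize` value is an illustrative choice):

```python
import dask.dataframe as dd

# Read the CSV in ~64 MB partitions; nothing is loaded until a result is computed
dataFrame = dd.read_csv('../data/higgs/prepared-higgs_train.csv', blocksize='64MB')

print(dataFrame.npartitions)  # number of lazy partitions
print(dataFrame.head())       # materialises only the first partition
```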

## Exploratory Data Analysis (EDA)

The EDA script provides various functions for performing EDA, including visualising correlations, checking missing values, and plotting feature distributions. The data analysis plots are saved under `eda/plots`.

#### Script

```bash
python exploration/eda.py
```

#### Example Usage

```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # or prepared-higgs_test.csv

# Load the data into a Dask DataFrame
dataLoaderDask = DataLoaderDask(filepath)
dataFrame = dataLoaderDask.loadData()

eda = EDA(dataFrame)
eda.describeData()
eda.checkMissingValues()
eda.visualiseFeatureCorrelation()
eda.visualizeTargetDistribution()
eda.visualizeFeatureDistribution('feature_1')
eda.visualizeAllFeatureDistributions()
eda.visualizeFeatureScatter('feature_1', 'feature_2')
eda.visualizeFeatureBoxplot('feature_2')
```
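For intuition, a correlation plot like the one `visualiseFeatureCorrelation` produces boils down to computing a correlation matrix and rendering it; a hedged Matplotlib sketch (the output filename is illustrative, and `eda/plots` is assumed to exist):

```python
import matplotlib.pyplot as plt

# Dask computes pairwise correlations in parallel; .compute() returns a Pandas DataFrame
correlationMatrix = dataFrame.corr().compute()

fig, ax = plt.subplots(figsize=(12, 10))
image = ax.imshow(correlationMatrix, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(correlationMatrix.columns)))
ax.set_xticklabels(correlationMatrix.columns, rotation=90)
ax.set_yticks(range(len(correlationMatrix.columns)))
ax.set_yticklabels(correlationMatrix.columns)
fig.colorbar(image, ax=ax)
fig.savefig('eda/plots/feature_correlation.png', bbox_inches='tight')
```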

## Usage

### Training the Model

The model is defined using Keras with the following default architecture for binary classification (sketched in code below):

- Dense input layer with 128 neurons
- Dense hidden layer with 64 neurons
- Output layer with 1 neuron (sigmoid activation)
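In code, that default corresponds roughly to the builder below (a sketch; the ReLU activations on the hidden layers are an assumption, mirroring the custom example further down):

```python
from tensorflow import keras
from tensorflow.keras import layers, Model

def defaultModel(inputShape: int) -> Model:
    """Default builder: Dense(128) -> Dense(64) -> Dense(1, sigmoid)."""
    return keras.Sequential([
        layers.Input(shape=(inputShape,)),      # 28 features for this dataset
        layers.Dense(128, activation='relu'),   # assumed activation
        layers.Dense(64, activation='relu'),    # assumed activation
        layers.Dense(1, activation='sigmoid'),  # binary signal/background output
    ])
```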

You can customise the model architecture by providing a different `modelBuilder` callable in the `ModelTrainer` class.

The trained models and training loss plots are saved under `kerasModel/trainer/trainedModels`.

#### Script

```bash
python kerasModel/trainer/model_trainer.py
```

#### Example Usage

```python
from tensorflow import keras
from tensorflow.keras import layers, Model

filePath = '../../data/higgs/prepared-higgs_train.csv'

def customModel(inputShape: int) -> Model:
    """Example of a custom model builder function for classification"""
    model = keras.Sequential([
        layers.Input(shape=(inputShape,)),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
    ])
    return model

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

# Optional: define model training/compiling parameters as a dictionary
# and pass it to the class constructor
params = {
    "epochs": 10,
    "batchSize": 32,
    "minSampleSize": 100000,
    "learningRate": 0.001,
    "modelBuilder": customModel,      # callable
    "loss": 'binary_crossentropy',
    "metrics": ['accuracy']
}

trainer = ModelTrainer(dataFrame, params)
trainer.trainKerasModel()  # optional: train with sampling, e.g. trainKerasModel(sample=True, frac=0.1)
trainer.plotTrainingHistory()
```

### Evaluating the Model

The evaluation script computes metrics such as:

- Accuracy
- Precision
- Recall (Sensitivity)
- F1 Score
- Classification Report

The evaluation includes visualizations such as:

- Confusion Matrix
- ROC Curve

The evaluation results are logged and saved to a file under `kerasModel/evaluator/evaluationPlots`.
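These metrics follow standard scikit-learn usage; a hedged sketch of the computation, where `model`, `xTest`, and `yTest` are placeholders for a trained Keras model and held-out data:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, roc_curve, auc)

# Probabilities from the sigmoid output, thresholded at 0.5 for class labels
yProb = model.predict(xTest).ravel()
yPred = (yProb >= 0.5).astype(int)

print('Accuracy :', accuracy_score(yTest, yPred))
print('Precision:', precision_score(yTest, yPred))
print('Recall   :', recall_score(yTest, yPred))
print('F1 Score :', f1_score(yTest, yPred))
print(classification_report(yTest, yPred))

confusionMatrix = confusion_matrix(yTest, yPred)  # input to the confusion matrix plot
fpr, tpr, _ = roc_curve(yTest, yProb)             # input to the ROC curve plot
print('AUC:', auc(fpr, tpr))
```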

#### Script

```bash
python kerasModel/evaluator/model_evaluator.py
```

#### Example Usage

```python
modelPath = '../trainer/trainedModels/keras_model_trained_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_train.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = ModelEvaluator(modelPath, dataFrame)
evaluator.evaluate()
```

### Feature Importance Analysis

The feature importance is computed using permutation importance and visualised as a bar chart. It is implemented twice: once using the Pandas approach (with scikit-learn) and once using Dask for parallel processing.

The chart and the resulting CSV file are saved under `kerasModel/featureImportance/featureImportancePlots`.
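Permutation importance shuffles one feature at a time and measures how much the model's score drops relative to a baseline; a minimal NumPy/Pandas sketch of the idea, with `model`, `xTest` (a DataFrame), and `yTest` as placeholders:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutationImportance(model, xTest, yTest, nRepeats: int = 5, seed: int = 42) -> dict:
    """Mean accuracy drop per feature when that feature's values are shuffled."""
    rng = np.random.default_rng(seed)

    def score(features):
        predictions = (model.predict(features.to_numpy()).ravel() >= 0.5).astype(int)
        return accuracy_score(yTest, predictions)

    baseline = score(xTest)
    importances = {}
    for column in xTest.columns:
        drops = []
        for _ in range(nRepeats):
            shuffled = xTest.copy()
            shuffled[column] = rng.permutation(shuffled[column].to_numpy())
            drops.append(baseline - score(shuffled))
        importances[column] = float(np.mean(drops))  # larger drop => more important
    return importances
```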

#### Script

```bash
python kerasModel/featureImportance/feature_importance.py
```

#### Example Usage

```python
modelPath = '../trainer/trainedModels/keras_model_test_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_test.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = FeatureImportanceEvaluator(modelPath, dataFrame)
evaluator.evaluate()

# Alternatively, with sampling and repeated permutations:
evaluator = FeatureImportanceEvaluator(modelPath, dataFrame, sampleFraction=0.1, nRepeats=32)
evaluator.evaluate(withDask=False)  # use Pandas instead of Dask
```