This repository implements the 1st place solution for the single-cell perturbations problem (open-problems-single-cell-perturbations).
- Input Features
- Use one-hot encoding of cell_type and sm_name
- Add the mean, standard deviation, and (25%, 50%, 75%) percentiles of target values (differential expressions) per cell_type and sm_name.
- Model Architectures
- Use LSTM, GRU, and 1d-CNN architectures (see models.py).
- Loss Functions and Optimizer
- Use MSE, MAE, BCE, and LogCosh (see helper_classes.py)
- Use Adam optimizer to train the models in a 5-fold cross-validation setting.
- Hyperparameters
- 250 epochs; learning rate 0.001 for LSTM and 1d-CNN, and 0.0003 for GRU
- Clip the gradient norm at 1.0 during training
- Batch size 16
- Predictions
- Use weighted ensemble prediction; fold-wise, use the coefficients [0.25, 0.15, 0.2, 0.15, 0.25], and model-wise use [0.29, 0.33, 0.38]
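The weighted ensemble described in the Predictions item can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in predict.py: the array layout (models x folds x samples x targets), the function name weighted_ensemble, and the order in which the three model weights map to the architectures are all hypothetical.

```python
import numpy as np

# Coefficients quoted in the summary above.
FOLD_WEIGHTS = np.array([0.25, 0.15, 0.2, 0.15, 0.25])
MODEL_WEIGHTS = np.array([0.29, 0.33, 0.38])


def weighted_ensemble(preds, fold_weights=FOLD_WEIGHTS, model_weights=MODEL_WEIGHTS):
    """Combine predictions fold-wise, then model-wise.

    preds is assumed to have shape (n_models, n_folds, n_samples, n_targets);
    this layout is an assumption made for the sketch.
    """
    preds = np.asarray(preds, dtype=float)
    fold_weights = np.asarray(fold_weights, dtype=float)
    model_weights = np.asarray(model_weights, dtype=float)
    if preds.ndim != 4 or preds.shape[:2] != (len(model_weights), len(fold_weights)):
        raise ValueError("expected preds of shape (n_models, n_folds, n_samples, n_targets)")
    # Weighted sum over the fold axis: one prediction matrix per model.
    per_model = np.einsum("f,mfnt->mnt", fold_weights, preds)
    # Weighted sum over the model axis: the final prediction matrix.
    return np.einsum("m,mnt->nt", model_weights, per_model)
```

Both weight vectors sum to 1, so the result stays on the same scale as the individual predictions.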
Make sure Anaconda3 is installed and execute the following:
- Clone this repository:
  git clone https://github.com/Jean-KOUAGOU/1st-place-solution-single-cell-pbs.git
- Create and activate a conda environment:
  conda create -n single_cell_env python==3.9.0 --y && conda activate single_cell_env
- Install all required packages in the environment:
  pip install -r requirements.txt
- python 3.9.0
- pandas 2.1.3
- pyarrow 14.0.1
- tqdm 4.66.1
- scikit-learn 1.3.2
- torch 2.1.1
- transformers 4.35.2
- matplotlib 3.8.2
- Ubuntu 20.04.6 LTS (Kaggle), AMD EPYC 7B12 CPU @ 2.25GHz (4 CPUs), 30 GB RAM, 1x Tesla P100 GPU 16 GB (Kaggle), 73 GB disk
- Also tested on Debian GNU/Linux 11, AMD EPYC 7282 16-Core Processor @ 3.2GHz (32 CPUs), 1x Nvidia RTX 3090 GPU 24 GB, 252 GB RAM, 500 GB disk
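To confirm that an environment matches the pinned package versions listed above, a small check along these lines may help. It is a sketch and not part of the repository: the helper name check_versions is hypothetical, and it assumes the packages were installed under the distribution names shown in the list.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the requirements list above.
PINNED = {
    "pandas": "2.1.3",
    "pyarrow": "14.0.1",
    "tqdm": "4.66.1",
    "scikit-learn": "1.3.2",
    "torch": "2.1.1",
    "transformers": "4.35.2",
    "matplotlib": "3.8.2",
}


def check_versions(pinned=PINNED, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Some wheels carry a local suffix (e.g. "+cu121"); compare the public part only.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed}")
```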
- Create a folder called data/ in the main directory
- Add the training data in parquet format, e.g., de_train.parquet as in the competition, and check that its path is correct in SETTINGS.json
- Also add the test data and a sample submission file (both should be csv files) to the same directory data/ and check SETTINGS.json for path correctness
- Run python prepare_data.py to complete all required preprocessing steps
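The input features described in the summary (one-hot encodings plus per-group statistics of the targets) could be built roughly as shown here. This is a minimal pandas sketch of the idea, not the contents of prepare_data.py; the function name build_features and the column-naming scheme are assumptions made for illustration.

```python
import pandas as pd


def build_features(df, target_cols, keys=("cell_type", "sm_name")):
    """One-hot encode the key columns and append per-group target statistics.

    Hypothetical helper: for each key, the mean, standard deviation and
    25/50/75% percentiles of every target column are computed per group
    and attached to each row belonging to that group.
    """
    parts = [pd.get_dummies(df[list(keys)], dtype=float)]
    for key in keys:
        grouped = df.groupby(key)[target_cols]
        stats = {
            "mean": grouped.mean(),
            "std": grouped.std(),  # NaN for groups with a single row
            "q25": grouped.quantile(0.25),
            "q50": grouped.quantile(0.50),
            "q75": grouped.quantile(0.75),
        }
        for stat_name, table in stats.items():
            # Look up each row's group statistics, keeping the original row order.
            feats = table.loc[df[key].to_numpy()]
            feats.index = df.index
            feats.columns = [f"{key}_{stat_name}_{c}" for c in target_cols]
            parts.append(feats)
    return pd.concat(parts, axis=1)
```

With two key columns and five statistics, each target column contributes ten statistic features on top of the one-hot columns.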
Make sure you are at the top level of this GitHub repository.
- Run python train.py to train models. This will automatically create a directory called trained_models and store the trained models there.
- Pretrained models can also be downloaded to avoid training; see the link on Kaggle.
Check that there is a non-empty directory named trained_models and that its path is specified in SETTINGS.json under MODEL_DIR.
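This check can also be scripted. The sketch below assumes that SETTINGS.json is a flat JSON object and that the MODEL_DIR path is resolved relative to the current working directory; the helper name check_model_dir is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def check_model_dir(settings_path="SETTINGS.json"):
    """Return the MODEL_DIR path if it exists and is non-empty, else raise."""
    settings = json.loads(Path(settings_path).read_text())
    if "MODEL_DIR" not in settings:
        raise KeyError("MODEL_DIR is not set in the settings file")
    model_dir = Path(settings["MODEL_DIR"])
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir} is not a directory")
    if not any(model_dir.iterdir()):
        raise RuntimeError(f"{model_dir} is empty; train or download models first")
    return model_dir
```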
- Run python predict.py to predict on the test data whose path is specified in SETTINGS.json. This will automatically create the output directory specified in SETTINGS.json and store predictions in a file named submission.csv.
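Before uploading, it can be worth comparing submission.csv against the sample submission file. The following is a hedged sketch of such a check, assuming both files share the same column layout; check_submission is a hypothetical helper, not something predict.py provides.

```python
import pandas as pd


def check_submission(submission, sample):
    """Return a list of problems; an empty list means the layouts match."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample submission")
    if len(submission) != len(sample):
        problems.append(f"expected {len(sample)} rows, found {len(submission)}")
    n_missing = int(submission.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")
    return problems
```

To use it, load both files with pd.read_csv (submission.csv from the output directory set in SETTINGS.json, and the sample submission from data/) and pass the two data frames to the function.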
1. Create a directory data in this GitHub repository
2. If there is no directory named trained_models at the top level of this repository, create an empty directory with this name
3. Add de_train.parquet, id_map.csv, and sample_submission.csv to the directory data
4. If necessary, edit SETTINGS.json by specifying the correct paths
5. Make sure your machine has at least 16GB RAM
6. Execute ./build.sh to build a docker image
7. If you would like to predict with pretrained models:
   - Download the trained models from Kaggle at https://www.kaggle.com/datasets/jeannkouagou/best-models-single-cell/data, and place them under a folder named trained_models at the top level of this GitHub repository
   - Execute ./run.sh predict to run the container and directly predict using the trained models. The output will be a csv file named submission.csv in the main directory.
8. Execute ./run.sh train_and_predict to train new models and predict.
   - I recommend training on a GPU as it might take too long on CPU.
   - Training on GPU can take between 6 hours (e.g. on Nvidia GPU RTX 3090) and 10 hours (e.g. on Tesla GPU P100) depending on the GPU used.
   - If the objective is not to reproduce the results, you can also change configurations in config such as learning rate, epochs, etc., before building the container image.
Note: ./run.sh should always be run with an argument, and there are two possibilities: ./run.sh predict or ./run.sh train_and_predict. If you encounter an error in steps 7 and 8, there is probably a conflicting container name, e.g., because you have executed ./run.sh several times. The error might look like: The container name "single_cell_container" is already in use by container container_id. In that case, delete container_id by using sudo docker rm <container_id>, and retry.