
How to solve the "ModuleNotFoundError: No module named 'experiments'"? #19

Open
Josh00-Lu opened this issue Nov 11, 2023 · 4 comments

@Josh00-Lu

I'm using 6 GPUs on a single machine. These are my commands:

python -m lamorel_launcher.launch --config-path Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/configs  --config-name local_gpu_config rl_script_args.path=Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py lamorel_args.accelerate_args.machine_rank=1

and

python -m lamorel_launcher.launch --config-path Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/configs  --config-name local_gpu_config rl_script_args.path=Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py lamorel_args.accelerate_args.machine_rank=0

It returns:
ModuleNotFoundError: No module named 'experiments'
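A common cause of this kind of error is that the repository root (which contains the experiments package) is not on the Python path when the launcher spawns the script. One minimal workaround, assuming that is the cause here (not necessarily how the author fixed it), is to launch from the repository root with it added to PYTHONPATH:

cd Absolute/Path/To/Grounding_LLMs_with_online_RL
export PYTHONPATH=$PWD:$PYTHONPATH
python -m lamorel_launcher.launch --config-path $PWD/experiments/configs --config-name local_gpu_config rl_script_args.path=$PWD/experiments/train_language_agent.py lamorel_args.accelerate_args.machine_rank=0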

@Josh00-Lu
Author

Josh00-Lu commented Nov 11, 2023

I have solved that problem, but I've encountered another one :(
Now I only run:

python -m lamorel_launcher.launch --config-path Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/configs  --config-name local_gpu_config rl_script_args.path=Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py lamorel_args.accelerate_args.machine_rank=0

as you mentioned in flowersteam/lamorel#23 (comment).
My local_gpu_config.yaml is:

lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 4
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  seed: 1
  number_envs: 2
  num_steps: 1000
  max_episode_steps: 3
  frames_per_proc: 40
  reward_shaping_beta: 0
  discount: 0.99
  lr: 1e-6
  beta1: 0.9
  beta2: 0.999
  gae_lambda: 0.99
  entropy_coef: 0.01
  value_loss_coef: 0.5
  max_grad_norm: 0.5
  adam_eps: 1e-5
  clip_eps: 0.2
  epochs: 4
  batch_size: 16
  action_space: ["turn_left","turn_right","go_forward","pick_up","drop","toggle"]
  saving_path_logs: Desktop/workspace2/Grounding_LLMs_with_online_RL/logs
  name_experiment: 'llm_mtrl'
  name_model: 'T5small'
  saving_path_model: Desktop/workspace2/Grounding_LLMs_with_online_RL/model
  name_environment: 'BabyAI-KeyCorridorS3R3-v0'
  number_episodes: 10
  language: 'english'
  load_embedding: true
  use_action_heads: false
  template_test: 1
  zero_shot: true
  modified_action_space: false
  new_action_space: #["rotate_left","rotate_right","move_ahead","take","release","switch"]
  spm_path: "YOUR_PATH_TO_PROJECT/experiments/agents/drrn/spm_models/unigram_8k.model"
  random_agent: true
  get_example_trajectories: false
  nbr_obs: 3
  im_learning: false
  im_path: ""
  bot: false

It fails with:

[2023-11-11 22:26:56,396][lamorel_logger][INFO] - Init rl-llm group for process 1
[2023-11-11 22:26:56,396][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-11 22:26:56,396][lamorel_logger][INFO] - Init rl-llm group for process 0
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 1
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 0
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-11 22:26:56,408][lamorel_logger][INFO] - 6 gpus available for current LLM but using only model_parallelism_size = 1
[2023-11-11 22:26:56,409][lamorel_logger][INFO] - Devices on process 1 (index 0): [0]
Parallelising HF LLM on 1 devices
Loading model t5-small
Error executing job with overrides: ['rl_script_args.path=~/Desktop/workspace2/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py', 'lamorel_args.accelerate_args.machine_rank=0']
Traceback (most recent call last):
  File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 393, in main
    lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
  File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 53, in __init__
    Server(
  File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 40, in __init__
    self._model = HF_LLM(config.llm_args, devices, use_cpu)
  File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 38, in __init__
    device_map = infer_auto_device_map(
  File "~/miniconda3/envs/dlp/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 923, in infer_auto_device_map
    max_memory = get_max_memory(max_memory)
  File "~/miniconda3/envs/dlp/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 674, in get_max_memory
    raise ValueError(
ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'
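One quick diagnostic, using the same get_max_memory function that appears in the traceback, is to check which devices this Accelerate installation recognizes on its own:

python -c "from accelerate.utils.modeling import get_max_memory; print(get_max_memory())"

If that prints integer GPU indices plus 'cpu', the environment itself is fine, which would suggest the device key built upstream (in lamorel's hf_llm.py) is not a plain integer (an assumption based on the error message above, not a confirmed diagnosis).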

@ClementRomac
Contributor

Hi,

What is your version of Accelerate? The passed device isn't recognized, which is odd.

@ClementRomac
Contributor

Please see flowersteam/lamorel#24, as it seems to be due to the PyTorch version.
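For reference, a quick, generic way to print the installed PyTorch and Accelerate versions is:

python -c "import torch, accelerate; print(torch.__version__, accelerate.__version__)"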

@Ziyu0118

Hello, where is the "lamorel_launcher.launch" file located? I checked the "lamorel_launcher" folder in lamorel, but I can't find it. Thanks!
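If lamorel is installed in the active environment, a generic way to locate the module that python -m lamorel_launcher.launch resolves to is:

python -c "import lamorel_launcher.launch as m; print(m.__file__)"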
