It is very difficult to reproduce the result shown in the paper by following the steps in the tutorial.
Issue 1: unable to run jobs on Azure using LaunchTrainingJob.ipynb
Running the LaunchTrainingJob notebook fails with: TaskSchedulingConstraintFailed. Reason: The user used to run the task is not found
We specified batch_job_user_name in the JSON, but that produces this error.
I had to change the user identity to Task user (Admin); then this problem goes away.
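For anyone who edits the task definition by hand instead of through Batch Explorer, the change can be sketched as below. This is only an illustration under assumptions: the task is treated as a plain dict, the `userIdentity` / `autoUser` / `scope` / `elevationLevel` field names follow my understanding of the Batch REST task schema, and I am assuming that "Task user (Admin)" maps to scope `task` with elevation `admin`. The helper name `use_admin_auto_user` and the sample task are hypothetical.

```python
import copy


def use_admin_auto_user(task):
    """Return a copy of a Batch task dict that runs as an admin auto-user."""
    patched = copy.deepcopy(task)
    # Replace any named user with an auto-user. The scope and elevation values
    # are assumed to correspond to "Task user (Admin)" in Batch Explorer.
    patched["userIdentity"] = {
        "autoUser": {"scope": "task", "elevationLevel": "admin"}
    }
    return patched


# Hypothetical task definition, for illustration only.
task = {
    "id": "agent-0",
    "commandLine": "cmd /c echo hello",
    "userIdentity": {"username": "some_batch_user"},
}

patched = use_admin_auto_user(task)
print(patched["userIdentity"])
```

The original dict is left untouched, so the two variants can be compared side by side.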
After fixing that issue, I end up with a different error: The specified command program is not found
CommandLine: call C:\\prereq\\mount.bat && C:\\ProgramData\\Anaconda3\\Scripts\\activate.bat py36 && python -u Z:\\scripts_downpour\\app\\distributed_agent.py data_dir=Z: role=agent max_epoch_runtime_sec=30 per_iter_epsilon_reduction=0.003000 min_epsilon=0.100000 batch_size=32 replay_memory_size=2000 experiment_name=distributed_rl_75726dee-3f90-41e4-8657-3f7ae8dc924d weights_path=Z:\data\pretrain_model_weights.h5 train_conv_layers=false
Message: The system cannot find the file specified.
Notice that weights_path=Z:\data\pretrain_model_weights.h5 (generated by the code) does not have the extra escape character '\'. I tried adding it, but I still get the same error.
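A note on the backslashes, since they are easy to misread: in Python source, an escaped literal with doubled backslashes and a raw literal with single backslashes are the same string, and the doubled form usually appears only when the string is shown via repr or serialized to JSON. The sketch below uses the path from the command line above and does not touch Azure. It suggests the single-backslash weights_path may be a display difference and not the cause of the error, though that is not confirmed here.

```python
import json

escaped = "Z:\\data\\pretrain_model_weights.h5"  # escaped literal
raw = r"Z:\data\pretrain_model_weights.h5"       # raw literal

# Both literals describe the same characters: one backslash per separator.
same = escaped == raw
backslashes = raw.count("\\")

# JSON serialization doubles each backslash in the text form only;
# parsing the JSON gives back the original single-backslash string.
as_json = json.dumps(raw)
round_trip = json.loads(as_json)

print(same, backslashes)
print(as_json)
print(round_trip == raw)
```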
I honestly don't think anyone who starred this repo has actually run the code themselves.
Issue 1 is the most critical because I cannot run the training job.
Issue 2: SetupCluster.ipynb
This one is merely for bug reporting.
with open('Template\\pool.json.template', 'r') as f:
    pool_config = f.read()
pool_config = pool_config\
    .replace('{batch_pool_name}', NOTEBOOK_CONFIG['batch_pool_name'])\
    .replace('{subscription_id}', NOTEBOOK_CONFIG['subscription_id'])\
    .replace('{resource_group_name}', NOTEBOOK_CONFIG['resource_group_name'])\
    .replace('{storage_account_name}', NOTEBOOK_CONFIG['storage_account_name'])\
    .replace('{batch_job_user_name}', NOTEBOOK_CONFIG['batch_job_user_name'])\
    .replace('{batch_job_user_password}', NOTEBOOK_CONFIG['batch_job_user_password'])\
    .replace('{batch_pool_size}', str(NOTEBOOK_CONFIG['batch_pool_size']))
with open('pool.json', 'w') as f:
    f.write(pool_config)
create_cmd = 'powershell.exe ".\ProvisionCluster.ps1 -subscriptionId {0} -resourceGroupName {1} -batchAccountName {2}"'\
    .format(NOTEBOOK_CONFIG['subscription_id'], NOTEBOOK_CONFIG['resource_group_name'], NOTEBOOK_CONFIG['batch_account_name'])
print('Executing command. Check the terminal output for authentication instructions.')
os.system(create_cmd)
This code no longer works because the JSON file it creates no longer contains sufficient information to create a pool on the latest Azure cloud.
I created a pool manually using Batch Explorer and noticed that the pool has to be created without any 'Start Task'; the Start Task is then set separately after the pool has been created. Otherwise, you end up with the error:
InvalidPropertyValue
The value provided for one of the properties in the request body is invalid.
PropertyName: dataDisks
Reason: Only one of dataDisks and virtualMachineImageId can be specified
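The error message names two properties that cannot be combined, so a generated pool.json can be checked for that combination before it is submitted. The sketch below is a rough illustration only: the key names are taken from the error text, the exact nesting inside pool.json is not assumed (hence the recursive search), and the helper names and the sample config are hypothetical.

```python
def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def has_disk_image_conflict(pool_config):
    """True if the config sets both a non-empty dataDisks and virtualMachineImageId."""
    has_disks = any(v for v in find_key(pool_config, "dataDisks"))
    has_image = any(v for v in find_key(pool_config, "virtualMachineImageId"))
    return has_disks and has_image


# Hypothetical pool config; the real template's layout may differ.
sample = {
    "properties": {
        "virtualMachineConfiguration": {
            "imageReference": {"virtualMachineImageId": "/some/image/id"},
            "dataDisks": [{"lun": 0, "diskSizeGB": 128}],
        }
    }
}

conflict_before = has_disk_image_conflict(sample)
sample["properties"]["virtualMachineConfiguration"]["dataDisks"] = []
conflict_after = has_disk_image_conflict(sample)
print(conflict_before, conflict_after)
```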
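LaunchTrainingJob.ipynb
Error in the code (wrong keyword argument):
batch_client = batch.BatchServiceClient(batch_credentials, base_url=NOTEBOOK_CONFIG['batch_account_url'])
Should be:
batch_client = batch.BatchServiceClient(credentials=batch_credentials, batch_url=NOTEBOOK_CONFIG['batch_account_url'])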
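The call that works here passes batch_url where the notebook passes base_url, which looks like a keyword rename between azure-batch versions (an assumption on my part; I have not checked the changelog). If the notebook has to run against either version, one possible approach is to pick the keyword from the constructor's signature. In this sketch, `make_batch_client` is a hypothetical helper, and the two small classes are stand-ins so the example runs without the Azure SDK installed.

```python
import inspect


def make_batch_client(client_cls, credentials, url):
    """Construct a client using whichever URL keyword its constructor accepts."""
    params = inspect.signature(client_cls.__init__).parameters
    # Prefer the newer name; fall back to the older one (assumed rename).
    keyword = "batch_url" if "batch_url" in params else "base_url"
    return client_cls(credentials, **{keyword: url})


class NewStyleClient:
    """Stand-in for a client whose constructor takes batch_url."""
    def __init__(self, credentials, batch_url):
        self.url = batch_url


class OldStyleClient:
    """Stand-in for a client whose constructor takes base_url."""
    def __init__(self, credentials, base_url=None):
        self.url = base_url


new_client = make_batch_client(NewStyleClient, "creds", "https://example.invalid")
old_client = make_batch_client(OldStyleClient, "creds", "https://example.invalid")
print(new_client.url, old_client.url)
```

In the notebook this would presumably be called as make_batch_client(batch.BatchServiceClient, batch_credentials, NOTEBOOK_CONFIG['batch_account_url']).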
Miscellaneous
Be careful when choosing the Azure region. Not many regions offer NV6 VMs, so trying to create a pool in a region without them causes an error. (I am currently using US East.)
Make sure to upgrade your free-trial subscription to pay-as-you-go and request a higher Batch quota via a support ticket. The free-trial subscription doesn't offer NV6.
Thanks for the report. This worked a year ago when we initially wrote the tutorial; it looks like the API has changed a bit from under us. We'll look at updating it.