Install Failure on GCP Deep Learning VM #259
Comments
I'm not sure if this issue is specific to apex. I think you need to make sure your instance has python-dev: Also, I don't think this issue is related to cpp extension building in particular. I think if the suggested fix resolves your issue for the Python-only build, the cpp and cuda extension build is definitely worth another try. |
@mcarilli ah, thanks for the reply! I tried what you said, but they seem to be already installed. For completeness, I have included below all the header information the VM prints when it starts up. These VMs come with PyTorch (and almost everything else) preinstalled. We use them in our GCP Quickstart Guide on our YOLOv3 repo:
Version: m23
Based on: Debian GNU/Linux 9.8 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64\n)
Resources:
* Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
* Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
* Google Group: https://groups.google.com/forum/#!forum/google-dl-platform
To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh
This image uses python 3.7 from the Anaconda. Anaconda is installed to:
/opt/anaconda3/
Linux instance-2 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
ultralytics@instance-2:~$ sudo apt-get install python3-pip python3-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
python3-pip is already the newest version (9.0.1-2).
python3-dev is already the newest version (3.5.3-1).
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded. |
Following https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894, maybe the solution is as simple as using |
@mcarilli thanks, the change worked. The line I used to successfully install is:
pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
Unfortunately, after the install the apex module can be found, but not amp:
...
running install_egg_info
running egg_info
creating apex.egg-info
writing apex.egg-info/PKG-INFO
writing top-level names to apex.egg-info/top_level.txt
writing dependency_links to apex.egg-info/dependency_links.txt
writing manifest file 'apex.egg-info/SOURCES.txt'
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info
running install_scripts
writing list of installed files to '/tmp/pip-ln69wwvt-record/install-record.txt'
done
Removing source in /tmp/pip-5vfngf45-build
Successfully installed apex-0.1
Cleaning up...
ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python3 -c "import apex"
ultralytics@instance-2:~$ python3 -c "from apex import amp"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: cannot import name 'amp' from 'apex' (unknown location)
ultralytics@instance-2:~$ python3 -c "import apex; a=apex.amp"
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'apex' has no attribute 'amp' |
This may be an artifact of where you tried to run the import. Try this, starting in the apex repo directory:
should show where the files are being imported from, which should be some system install path, e.g. on my system
After installing, you can also try running the L0 tests:
They should all pass if you installed with cpp/cuda extensions. |
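One way to check which copy of a package Python would pick up is sketched below, using only the standard library. The helper name `describe_import` is hypothetical and is not part of apex. If the reported location is the cloned repo directory rather than a site-packages path, the working directory is shadowing the installed package. A bare directory without `__init__.py` is treated as a namespace package with no defining file, which would be consistent with the "(unknown location)" in the traceback above.

```python
import importlib.util


def describe_import(name):
    """Return where `name` would be imported from, or None if it is not found.

    Hypothetical helper for illustration; it is not part of apex.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return {
        # File that defines the package. For a namespace package (a plain
        # directory without __init__.py) this is None on recent Python
        # versions, or the string "namespace" on older ones.
        "origin": spec.origin,
        # Directories searched for submodules such as `apex.amp`.
        "search_locations": list(spec.submodule_search_locations or []),
    }


if __name__ == "__main__":
    print(describe_import("apex"))
```

Running this once from the home directory and once from an unrelated directory such as `/tmp` should make any difference in the reported locations visible.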
@mcarilli ah yes, you are right! It was importing from the cloned repo. After I removed the I'm starting to think this is a conda install issue (the GCP Deep Learning VMs use Python 3.7 from Anaconda). Following these directions on installing non-conda packages, I activated the conda environment first before trying the install. The install was successful, but then the package is missing from
ultralytics@instance-2:~$ conda info --envs
WARNING: The conda.compat module is deprecated and will be removed in a future release.
# conda environments:
#
base * /opt/anaconda3
ultralytics@instance-2:~$ source activate base
(base) ultralytics@instance-2:~$ git clone https://github.com/NVIDIA/apex
(base) ultralytics@instance-2:~$ cd apex
(base) ultralytics@instance-2:~/apex$ pip3 install -v --no-cache-dir .
...
Successfully installed apex-0.1
Cleaning up...
(base) ultralytics@instance-2:~/apex$ cd .. && rm -rf apex
(base) ultralytics@instance-2:~$ python3
Python 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from apex import amp
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'apex'
>>>
|
Hmm, if I try this on my local machine, it appears to install to the correct location. I'm not sure what's different/lacking about the conda environment on the GCP instance...
|
Did you ever figure out why conda on GCP was installing to the wrong directory? I'm not a conda expert so if you managed to resolve this issue it will be helpful for future users. |
No, no luck. I created a blank PyTorch Deep Learning VM and tried again from scratch, but it installs to a different Python 3.5 rather than to Anaconda. It seems to be an Anaconda issue, and unfortunately I'm no conda expert either. I think pip installs into conda environments are not always problem-free; I've seen other repos with conda-specific install instructions. In your above example, you see apex in your |
Yes:
When I've had issues using pip installs in conda environments in the past, I've sometimes resolved them by explicitly running |
Is it possible to use a Docker container on the GCP instance as a potential workaround? There are several options for Docker containers in which we test the Apex install regularly: https://github.com/NVIDIA/apex/tree/master/examples/docker Even if Docker containers succeed, this does not diminish the importance of having the bare-metal Apex install also work. I'll consult some people who have more experience with conda. |
My guess is that it's installing to Python 3.5 because it's using the OS's pip3 (Python 3.5) rather than conda's Python 3.7. You can confirm by running |
@ngimel yes, you are correct!
glenn@instance-1:~$ pip3 --version
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.5)
glenn@instance-1:~$ python --version
Python 3.7.3
glenn@instance-1:~$ pip --version
pip 19.0.3 from /opt/anaconda3/lib/python3.7/site-packages/pip (python 3.7)
@mcarilli so I understand the situation now
Yes, if you could get someone to spin up a PyTorch 1.1 VM in GCP and work through the apex install, that would help tremendously. Docker might be a fallback, but I think it might also be a bridge too far for many users. |
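The pip3/pip mismatch shown above (pip3 bound to the system Python 3.5, pip bound to Anaconda's Python 3.7) can be sidestepped by invoking pip through the interpreter you intend to use. The sketch below only builds such a command line. The helper name `pip_command` is hypothetical, and this is an illustration of the general `python -m pip` pattern, not an install procedure confirmed in this thread.

```python
import sys


def pip_command(*args):
    """Build a pip command line bound to the interpreter running this script.

    Hypothetical helper for illustration only.
    """
    # `python -m pip` runs the pip that belongs to that exact interpreter,
    # so packages are installed where this interpreter will look for them.
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("command:", " ".join(pip_command("--version")))
```

Running `python -m pip --version` directly in the shell reports the same information as the `pip --version` output above, but for whichever `python` is first on the PATH.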
I can't repro on the latest pytorch vm (
|
@ngimel I just checked on a new PyTorch 1.1 vm. This time I got a permission denied error: so I tried to use
|
Using
|
@see-- this works! I was able to successfully install on a GCP VM with the following commands:
source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir . --user
UPDATE 1: On running a mixed precision model with the above install I get the following warning: Installing instead with the following line removed the warning:
source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user |
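Since the working commands rely on `--user`, it can help to see where a user-level install lands for a given interpreter. The following is a minimal sketch using the standard `site` module. The helper name `user_install_info` is hypothetical. The user site directory is specific to each Python version, which matches the earlier log where the system pip3 copied apex into a `python3.5` directory under `~/.local`.

```python
import site
import sys


def user_install_info():
    """Report where `pip install --user` places packages for this interpreter.

    Hypothetical helper for illustration only.
    """
    user_site = site.getusersitepackages()
    return {
        "interpreter": sys.executable,
        "user_site": user_site,
        # False when user site-packages are disabled, for example with
        # `python -s` or the PYTHONNOUSERSITE environment variable.
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        # The directory is only added to sys.path if it exists at startup.
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_install_info().items():
        print(key, "=", value)
```

Running it with both `python` and `python3` on the VM would show whether the two interpreters use different user site directories.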
Excellent, thanks guys. Sorry I haven't had time to do a deep dive myself, but I'm pinning this issue for others. |
For posterity, I was only able to get this to work (after trying many other things) with:
(note the sudo.) |
I had to use conda-forge to get this working within my conda environment
|
@see-- @glenn-jocher @sleepinyourhat My environment |
I created a simple GCP Deep Learning VM:
https://cloud.google.com/deep-learning-vm/
I followed the install directions, and the install failed with errors:
The Python-only option also failed:
It would seem that installation on a GCP Deep Learning VM would be one of the tested use cases here, no? If it doesn't work there, of all places, where is it intended to work?