Changes required to run SIMX on HPCAC #71
base: master
Signed-off-by: Zach Tiffany <[email protected]>
vim \
iperf \
crash \
zstd \
This is a development convenience.
It might be good to let users supply their own Dockerfile that incrementally adds what they need on top of the existing image.
It has always been in the back of my mind, but I never investigated how to do it without rebuilding all the images.
Can you just apply something on top of the existing image?
For example, run one additional Dockerfile over the existing image to extend it, and make the new image the "current" one?
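A minimal sketch of that idea, assuming the user supplies a fragment of Dockerfile instructions and that the existing image is addressable by name. The helper names, the image names, and the fragment are hypothetical and not part of MKT; the only Docker behavior relied on is that `docker build -` reads the Dockerfile from stdin.

```python
import subprocess


def extension_dockerfile(base_image, user_fragment):
    """Return Dockerfile text that layers the user's instructions on top of base_image."""
    lines = ["FROM {}".format(base_image)]
    lines.extend(line.rstrip() for line in user_fragment.strip().splitlines())
    return "\n".join(lines) + "\n"


def build_command(new_tag):
    """argv for `docker build` reading the Dockerfile from stdin (no build context)."""
    return ["docker", "build", "-t", new_tag, "-"]


def extend_image(base_image, user_fragment, new_tag):
    """Build new_tag from base_image plus the fragment; the base layers are reused."""
    dockerfile = extension_dockerfile(base_image, user_fragment)
    subprocess.run(build_command(new_tag), input=dockerfile.encode(), check=True)


if __name__ == "__main__":
    # Hypothetical image name and fragment, printed only to show the generated text.
    print(extension_dockerfile("example/mkt-kvm:latest", "RUN echo extra-tools"), end="")
```

Because the generated Dockerfile starts from the existing image, only the added instructions are built. Passing the same name as both `base_image` and `new_tag` would move the tag to the extended image, which is one possible reading of making the new image the "current" one; whether that fits MKT's image naming is not established here.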
RUN /root/basic-setup.sh && /root/kvm-setup.sh
RUN /root/basic-setup.sh

RUN /root/kvm-setup.sh
can be ignored
@@ -42,6 +42,12 @@ def make_simx(args):

    subprocess.call(cmd + ['-j%d' %(args.num_jobs)])

def make_rdmo_app(args):
I started throwing in some stuff to make MKT build my rdmo app. I abandoned that, though. Ignore references to rdmo-app and the packages added to support.Dockerfile.
I added packages to the VM image to build rdmo-app inside my VM instead.
Building inside the VM looks simpler, but it misses the MKT concept. We wanted to separate the build environment from the run environment. That lets us benefit from specific optimizations and makes runs fast.
# git_url: http://l-gerrit.mtl.labs.mlnx:8080/simx
# git_commit: 41f602dc05b3c115b176ac3f7869e8bd390cbd92
# git_url: /global/home/users/ztiffany/test/simx
# git_commit: 3f3c2c9338f3bbb73cf3bd298152e020e394086f
can be ignored
@@ -18,7 +18,7 @@ From simx.git
%build
./mlnx_infra/config.status.mlnx --target=x86 --prefix=/opt/simx
make %{?_smp_mflags}
make %{?_smp_mflags} -C mellanox/
make %{?_smp_mflags} -C mellanox/ SIMX_PROJECT=mlx5
This tells SimX to only build the NIC part. I think it makes sense unless the switch part is planned to be used.
Yeah, it makes sense for now. A long time ago I pitched this project to the switch team. They even tried it, but decided to stick with VMs because of the difference in technical expertise between the development team and the verification team.
# git_url: git://repo.or.cz/smatch.git
# git_commit: 9bb66fa2d7c73b3338a27fd6b38d7d509b2a1c1b
# git_url: /global/home/users/artemp/scratch/.cache/mellanox/mkt/smatch.git
# git_commit: 72c21a144a812cadbe349801da1b24bc331af256
For some reason, the site where we were building this can't access the original URL.
This is specific to that site and probably shouldn't be considered,
especially given that "mkt images" is not a must.
IIRC, if you preload the normal cache directory, it doesn't require network access as long as the commit_id is already present. So these odd disconnected cases are solved by transferring the cache directory from a network-connected machine.
What is the normal cache directory?
➜ kernel git:(master) ls ~/.cache/mellanox/mkt
iproute2-next.git rdma-core.git simx.git smatch.git sparse.git tc-build.git
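A small sketch of how one might check, before going offline, that a preloaded cache already holds the pinned commit. It assumes the entries listed above are git repositories (bare or not, as long as the path is usable as a git directory) and uses only `git cat-file -e`; the helper names are hypothetical, and this is not a description of how mkt performs the check.

```python
import os
import subprocess


def cat_file_command(git_dir, commit):
    """argv for a check that exits 0 only if `commit` exists in git_dir as a commit."""
    return ["git", "--git-dir=" + git_dir, "cat-file", "-e", commit + "^{commit}"]


def cache_has_commit(cache_dir, repo_name, commit):
    """True if the cached repository already holds the commit, so no fetch is needed."""
    git_dir = os.path.join(os.path.expanduser(cache_dir), repo_name)
    if not os.path.isdir(git_dir):
        return False
    result = subprocess.run(cat_file_command(git_dir, commit),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    # Commit id taken from the smatch script header quoted in this review.
    print(cache_has_commit("~/.cache/mellanox/mkt", "smatch.git",
                           "9bb66fa2d7c73b3338a27fd6b38d7d509b2a1c1b"))
```

Running this on the network-connected machine before copying the directory, and again on the disconnected node afterwards, would show whether the transfer carried the required commits.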
That’s the issue: it fails to download there from git, saying connection refused.
Again, I don’t think we should consider it in MKT; this is obviously an issue at that site.
We just haven’t cleaned up the version we ended up with, to save time.
@@ -1,7 +1,7 @@
#!/bin/bash
# ---
# git_url: git://git.kernel.org/pub/scm/devel/sparse/sparse.git
# git_commit: 8af2432923486c753ab52cae70b94ee684121080
# git_url: /global/home/users/artemp/scratch/.cache/mellanox/mkt/sparse.git
same as above
@@ -79,9 +85,13 @@ def setup_from_pickle(args, pickle_params):
    subprocess.check_output(['make', 'headers_install',
                             'INSTALL_HDR_PATH=/usr'], cwd=args.kernel)

    if not os.path.isdir('/images/ztiffany/ccache'):
This is likely not needed based on my later experience
@@ -64,11 +64,17 @@ def remove_mounts():


def is_passable_mount(v):
    print ("Checking mount: {}".format(v))
    if v[2] == "nfs" or v[2] == "nfs4":
qemu-system-x86_64: -fw_cfg etc/sercon-port,string=2: warning: externally provided fw_cfg item names should be prefixed with "opt/"
qemu-system-x86_64: -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0: cannot initialize fsdev 'host_bind_fs0': failed to open '<snip>': Permission denied
"Permission denied": let's debug, it shouldn't happen.
If I am root on the node, I cannot ls my user's home directory.
ls fails with Permission denied as well.
    if v[1].startswith("/images/"):
        print ("YES!!!")
        return True
    if v[1].startswith("/plugins"):
HPCAC nodes are diskless. Here is how plugins are mounted:
Evaluating: /plugins
v is: ['tmpfs', '/plugins', 'tmpfs', 'ro,relatime,mode=555', '0', '0']
Passing: /plugins
Here is the same entry from a working system:
['/dev/sda5', '/plugins', 'ext3', 'ro,relatime', '0', '0']
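To make the difference between the two entries concrete, here is a hedged sketch that splits a `/proc/mounts`-style line into the same six fields and applies a filter modeled on the hunk above. The function `is_passable_mount_sketch` is hypothetical: it assumes each condition visible in the diff returns True, which the diff does not fully show, and it ignores whatever else the real `is_passable_mount` checks.

```python
def parse_mount_line(line):
    """Split one mounts line into [device, mount point, fs type, options, dump, pass]."""
    return line.split()


def is_passable_mount_sketch(v):
    """Hypothetical filter: allow NFS plus the two path prefixes seen in the diff."""
    if v[2] in ("nfs", "nfs4"):
        return True
    if v[1].startswith("/images/"):
        return True
    if v[1].startswith("/plugins"):
        return True
    return False


# The two /plugins entries quoted in the comment above.
hpcac = parse_mount_line("tmpfs /plugins tmpfs ro,relatime,mode=555 0 0")
working = parse_mount_line("/dev/sda5 /plugins ext3 ro,relatime 0 0")

for name, entry in (("hpcac", hpcac), ("working", working)):
    print(name, entry[2], is_passable_mount_sketch(entry))
```

The mount point is identical on both systems and only the device and filesystem type fields differ (`tmpfs` versus `ext3`), so a check keyed on the path prefix treats the two entries the same way, while any check keyed on the filesystem type would not.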
AFAIK, Docker can't mount tmpfs; we need to think about a workaround.
It does work if we add the above
So I think it should work as is
@@ -64,11 +64,17 @@ def remove_mounts():


def is_passable_mount(v):
On HPCAC, this was needed to get the rdma-core directory passed through:
mkt run --dir /images/ztiffany/src/rdma-core/
Is this expected?
No, it means your config file is incomplete or there is another bug; we mount the whole src directory.
It could be because /images is on tmpfs, but I don’t think we saw even an attempt to mount it
@@ -97,3 +107,5 @@ def setup_from_pickle(args, pickle_params):
        make_rdma_core(args)
    if args.project == "simx":
        make_simx(args)
    if args.project == "rdmo-app":
Ignore this.
@@ -67,6 +67,7 @@ def run_ci_cmd(self, supos):
"rdma": "iproute2",
"kernel": "kernel",
"mlnx_infra": "simx",
"rdmo-app": "rdmo-app",
Ignore.
@@ -27,7 +27,7 @@ def get_cache_fn(fn):
    an impact on the operation of mkt - at worst it will run slower."""
    global cache_dir
    if cache_dir is None:
        cache_dir = os.path.expanduser("~/.cache/mellanox/mkt/")
        cache_dir = '/images/ztiffany/.cache/mellanox/mkt/'
The home dir is too small to hold these caches.
Is there a way to point it somewhere else?
.cache is a general mechanism; it is worth making a symlink.
Makes sense
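A hedged sketch of the two ways to avoid hard-coding a per-user path: the symlink suggested above, and resolving the cache root from `XDG_CACHE_HOME`, which the XDG Base Directory convention defines with `~/.cache` as its default. Both helpers are hypothetical; mkt is not known to read `XDG_CACHE_HOME`, and the example path only mirrors the one in the hunk.

```python
import os


def default_cache_dir(environ=None):
    """Resolve <cache home>/mellanox/mkt/, honoring XDG_CACHE_HOME when it is set."""
    environ = os.environ if environ is None else environ
    cache_home = environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    return os.path.join(cache_home, "mellanox", "mkt", "")


def link_cache_to(big_dir, link_path):
    """Make link_path a symlink to big_dir; an already existing link_path is left alone."""
    os.makedirs(big_dir, exist_ok=True)
    if not os.path.lexists(link_path):
        parent = os.path.dirname(link_path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        os.symlink(big_dir, link_path)
    return os.path.realpath(link_path)


if __name__ == "__main__":
    # With the variable set, the cache lands on the larger filesystem.
    print(default_cache_dir({"XDG_CACHE_HOME": "/images/ztiffany/.cache"}))
```

With the symlink approach nothing in mkt needs to change, since `os.path.expanduser("~/.cache/mellanox/mkt/")` simply follows the link; the environment-variable approach would need a code change but keeps the home directory untouched.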
@ztiffany please fix your git config to use your nvidia address, thanks
@jgunthorpe this is not intended for merge. It’s an FYI to show what we had to hack to make it work on this particular system. We agreed with @rleon that we would open this one.
@@ -69,4 +69,5 @@ cat <<EOF > /etc/sysctl.d/hugepages.conf
vm.nr_hugepages=2
EOF

rpm -U /opt/rpms/*.rpm
#rpm -U /opt/rpms/*.rpm
rpm -U --force /opt/rpms/*.rpm
Fixed in c2f86ca
Signed-off-by: Zach Tiffany [email protected]
This is a dirty set of changes that were made to set up MKT to run on HPCAI. Do not merge.