[Bug] Bug in gpinitsystem when creating multiple segments per segment host #625

Open
1 of 2 tasks
lmugnano4537 opened this issue Sep 11, 2024 · 4 comments
Labels: type: Bug (Something isn't working), type: Enhancement (New feature or request, ideas)

Comments

@lmugnano4537

Cloudberry Database version

1.6.0

What happened

See attached documents. We are creating a cluster on 12 segment hosts that is supposed to have 4 primaries per segment host spread across 4 mounted disks:

/data1
/data2
/data3
/data4

Instead, we ended up with 2 primaries per segment host, and which disks they were created on appeared fairly random.

As a workaround, I had to double up the disk list in the config file. gpinitsystem seems to be cutting the list in half for some reason.

What you think should happen instead

It should create as many primary segments on each segment host as there are data directories in the DATA_DIRECTORY list.

How to reproduce

See attached
gpinit_bug_1.txt
gpinit_bug_workaround.txt

Operating System

rocky8

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

lmugnano4537 added the type: Bug label Sep 11, 2024

Hey, @lmugnano4537 welcome!🎊 Thanks for taking the time to point this out.🙌

@tuhaihe
Member

tuhaihe commented Sep 25, 2024

Hey @RyanWei could you help assign engineers to have a look? Thanks.

@liang8283

Hi @lmugnano4537, looking at the log gpinit_bug_1.txt, you specified multiple hostnames, sdw1 and sdw12, that both map to the same host, ip-10-9-1-139.us-west-2.compute.internal. As a result, gpinitsystem spread the segments evenly across sdw1 and sdw12. Could you please correct it and try again?

20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal 	4000 	sdw12 	/data1/primary/gpseg0 	2
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal 	4001 	sdw12 	/data2/primary/gpseg1 	3
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal 	4002 	sdw1 	/data3/primary/gpseg2 	4
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal 	4003 	sdw1 	/data4/primary/gpseg3 	5
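One way to spot this pattern in a longer log is to group the segment rows by the host in the first column and list the distinct names in the third column. The snippet below is only a rough sketch: the column names (`host`, `port`, `address`, `datadir`, `seg_id`) are my own labels inferred from the four rows above, and `addresses_by_host` is a hypothetical helper, not part of gpinitsystem.

```python
from collections import defaultdict

# The four rows quoted above, copied from gpinit_bug_1.txt.
LOG_LINES = """\
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal \t4000 \tsdw12 \t/data1/primary/gpseg0 \t2
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal \t4001 \tsdw12 \t/data2/primary/gpseg1 \t3
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal \t4002 \tsdw1 \t/data3/primary/gpseg2 \t4
20240911:09:08:11:237749 gpinitsystem:ip-10-9-1-175:gpadmin-[INFO]:-ip-10-9-1-139.us-west-2.compute.internal \t4003 \tsdw1 \t/data4/primary/gpseg3 \t5
"""


def addresses_by_host(log_text):
    """Map each host (first column) to the set of names in the third column."""
    hosts = defaultdict(set)
    for line in log_text.splitlines():
        # The segment row follows the "[INFO]:-" marker.
        _, marker, payload = line.partition("[INFO]:-")
        if not marker:
            continue
        fields = payload.split()
        if len(fields) != 5:
            continue  # not a segment row in the assumed five-column layout
        host, port, address, datadir, seg_id = fields
        hosts[host].add(address)
    return hosts


for host, addresses in addresses_by_host(LOG_LINES).items():
    if len(addresses) > 1:
        print(f"{host} is listed under several names: {sorted(addresses)}")
```

Any host that shows up under more than one name is a candidate for the problem described above.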

@antoniopetrole
Member

Hey @liang8283. Here is some more context, and I added some files to look at. It seems like the issue is the same thing that happens with gpssh. You pass these utilities a host file with a list of hosts for the cluster, then the utilities seem to ssh into each host in that file in the background, grab the hostname using something like hostname -f, and then use that newly generated list of hostnames to continue performing their operations. I don't fully understand why these utilities do this instead of just trusting that the hostname we pass them is correct, and I'd be interested in hearing if anyone knows more about that. I vaguely remember some mailing list discussion for GPDB years ago about this, but I don't know if those conversations are available anymore.

At the very least, I would hope that gpssh and gpinitsystem (and any other utilities that behave this way) do a duplicate hostname check when they grab the hostnames off the machines they ssh into and, if there is a duplicate, throw some sort of exception and fail explicitly. I'd be happy to implement this as a bug fix if you think there is good reason to do so. The real challenge is that it's not at all obvious that duplicate hostnames are the issue when you see the side effects. For instance, gpinitsystem will not use all of the specified data directories and will create a weird configuration (I've attached some files showing an example of this, with comments in the files; I manually set the hostname to be the same at the OS level on both sdw1 and sdw2). This also caused all kinds of weird rsync error messages when I was running gpcheckperf, which led me down a rabbit hole trying to find out whether there was a bug in rsync. I should be able to replicate this too if needed.
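To make the proposed check concrete, here is a minimal sketch in Python. It is not taken from the gpssh or gpinitsystem sources: `ssh_hostname`, `check_unique_hostnames`, and `DuplicateHostnameError` are hypothetical names, and the sketch assumes passwordless ssh and that the utilities collect names with something like `hostname -f`, as guessed above.

```python
import subprocess
from collections import defaultdict


class DuplicateHostnameError(RuntimeError):
    """Raised when two host file entries report the same hostname."""


def ssh_hostname(entry):
    """Ask the machine behind a host file entry for its own name.

    Assumption: passwordless ssh works and `hostname -f` is the command used.
    """
    result = subprocess.run(
        ["ssh", entry, "hostname", "-f"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def check_unique_hostnames(hostfile_entries, get_hostname=ssh_hostname):
    """Return {entry: reported hostname}, failing loudly on duplicates."""
    reported = defaultdict(list)
    for entry in hostfile_entries:
        reported[get_hostname(entry)].append(entry)

    duplicates = {name: entries for name, entries in reported.items()
                  if len(entries) > 1}
    if duplicates:
        details = "; ".join(
            f"{name!r} reported by {', '.join(entries)}"
            for name, entries in sorted(duplicates.items())
        )
        raise DuplicateHostnameError(f"duplicate hostnames detected: {details}")

    # No duplicates, so every reported name maps back to exactly one entry.
    return {entries[0]: name for name, entries in reported.items()}
```

The `get_hostname` parameter exists only so the check can be exercised without a cluster, for example by passing a dictionary lookup in place of the ssh call. Failing early with the offending entries in the message would point directly at the misconfiguration instead of surfacing later as missing segments or rsync errors.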

troubleshooting-gpinitsystem.zip

my-ship-it added the type: Enhancement label Oct 9, 2024

5 participants