Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding the ability to create a non-distributed table (can be used for postgres extensions) #645

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

visill
Copy link

@visill visill commented Sep 26, 2024

Change logs

Add: create table table_name DISTRIBUTED BY LOCAL
Creates a table that is queried only on the coordinator

Why are the changes needed?

Some postgres extensions (sr_plan or yezzey) needs to create local table. Before we do this by C-functions. Now we can do using SQL syntax

Does this PR introduce any user-facing change?

No

How was this patch tested?

It has regress test

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution(One-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • List your communication in the GitHub Issues or Discussions (if has or needed).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

@CLAassistant
Copy link

CLAassistant commented Sep 26, 2024

CLA assistant check
All committers have signed the CLA.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hiiii, @visill welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

@wenchaozhang-123
Copy link
Contributor

wenchaozhang-123 commented Sep 26, 2024

Should add test such as explain and execution for join with other distribution table.

| DISTRIBUTED LOCAL
{
DistributedBy *distributedBy = makeNode(DistributedBy);
distributedBy->ptype = POLICYTYPE_LOCAL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we did choose POLICYTYPE_LOCAL Instead of POLICYTYPE_ENRTY, because getPolicyForDistributedBy explicitly forbits creating relation with POLICYTYPE_ENRTY policy. We are unaware of reason however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible we keep distributeBy as NULL like pg_class? So that planner could work properly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will think about that

@reshke
Copy link
Contributor

reshke commented Sep 26, 2024

Should add test for join with other distribution table.

Maybe, but main reason for having LOCAL relation is that they act exactly like pg_catalog relations (so, used only on coordinator node).

@reshke
Copy link
Contributor

reshke commented Sep 26, 2024

Should add test for join with other distribution table.

Maybe, but main reason for having LOCAL relation is that they act exactly like pg_catalog relations (so, used only on coordinator node).

for example, save-restore plan (sr_plan) extension uses relation to store query plans. Designed originally for postgres, this extension does not work well in Greenplum (or its forks), because it tries to perform distributed query each time we access sr_plans table.

https://github.com/yezzey-gp/ygp/blob/YGP_6.27_STABLE/gpcontrib/sr_plan/sr_plan--1.1--1.2.sql#L5-L15

@wenchaozhang-123
Copy link
Contributor

Squash commit message?

@gfphoenix78
Copy link
Contributor

Hi, @visill , thank you for your interesting about cloudberry. Could you give more details about why we should change the current data model? Coordinator in cloudberry inherits from greenplum. It doesn't process data scan, and tries to move the workload to the segments, so the database has the ability to scale to a large data processing system. Local table seems violates our design, we need more discussion about why it's necessary.

Cloudberry has many extensions that were backported from PostgreSQL. Generally, our principle is to adapt the extensions to our distributed database, not the other way around.

@avamingli
Copy link
Contributor

Cloudberry has many extensions that were backported from PostgreSQL. Generally, our principle is to adapt the extensions to our distributed database, not the other way around.

Absolutely right.

This is wrong way, you need to make your extensions compatible with CBDB kernel codes.
Greenplum/Cloudberry store catalog on master, other tables should be on segments in MPP mode.
Could you explain more why need create table on master?

Why your extensions need to store data on master, if it must be , it's a catalog.
If it needs to work only on master, you need to change your extensions with Gp_role to do that.
A lot of extensions are modified to be compatible in MPP like that.

@@ -86,7 +86,8 @@ typedef enum GpPolicyType
{
POLICYTYPE_PARTITIONED, /* Tuples partitioned onto segment database. */
POLICYTYPE_ENTRY, /* Tuples stored on entry database. */
POLICYTYPE_REPLICATED /* Tuples stored a copy on all segment database. */
POLICYTYPE_REPLICATED, /* Tuples stored a copy on all segment database. */
POLICYTYPE_LOCAL /* Tuples stored on coordinator */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the new policy type, is it any different from POLICYTYPE_ENTRY?

@reshke
Copy link
Contributor

reshke commented Sep 30, 2024

Could you explain more why need create table on master?

Because we need to store and access data on master, during planning phase. At the first glance it is about sr_plan extension. But turns out, it is not only about that.

Why your extensions need to store data on master, if it must be , it's a catalog. If it needs to work only on master, you need to change your extensions with Gp_role to do that. A lot of extensions are modified to be compatible in MPP like that.

While this is generally true, it is very unhandy to do that. Today I bumped into problems with another extension - pg_hint_plan. This extension also uses table to store its planner 'hints'. This table should persist on master exclusively.

So, the ability to force Greenplum to create relation without any distribution policy would help with at least two extensions now.

And yes, we can play with Gp_role inside extension installation script and all functions/sql/other that access this relation. Whilst it is achievable with sr_plan, it is too much work in other cases, especially for https://github.com/ossc-db/pg_hint_plan

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy).
But specifying one proper option in CREATE TABLE sql is just much more simple.

@reshke
Copy link
Contributor

reshke commented Oct 1, 2024

If one dislikes SQL approach, we can do it another way: using GUC to control transformDistributedBy behaviour when no distrib clause specified

@avamingli
Copy link
Contributor

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy

That is forbidden in GPDB, all tables should have a distribution policy , you are passing an invalid param with that function.

@reshke
Copy link
Contributor

reshke commented Oct 14, 2024

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy

That is forbidden in GPDB, all tables should have a distribution policy , you are passing an invalid param with that function.

It is impossible to forbid something for extension. extension install script is executed under superuser

@avamingli
Copy link
Contributor

avamingli commented Oct 15, 2024

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy

That is forbidden in GPDB, all tables should have a distribution policy , you are passing an invalid param with that function.

It is impossible to forbid something for extension. extension install script is executed under superuser

I'm not talking about the permissions.
Superuser doesn't mean volatile our MPP design is right.
It's terrible that tables are created without distribution policy entry in MPP, and you intend to use that bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants