I wrote the original `uniqueKeys` pattern (https://specs.frictionlessdata.io/patterns/#table-schema-unique-constraints) and only noticed during the documentation review yesterday that a version of `uniqueKeys` made it into the specs.

I'm fine with us supporting only the SQL standard (i.e. null values are unique) and dropping the need for my originally proposed `uniqueNulls` property. However, I believe the documentation (and perhaps the specs) could be improved with regard to how null values are then handled. The short fix would be to clarify that composite unique keys are excluded from the uniqueness check if they contain at least one `null` value. But maybe this is worth a longer explanation:
### `uniqueKeys`
The documentation currently states:
> All the field values that are on the logical level are considered to be null values MUST be excluded from the uniqueness check, as the uniqueKeys property is modeled on the concept of unique constraint in SQL.
I find this misleading, as the SQL standard considers `null` values unique (or maybe more precisely, distinct), meaning for example that these two rows are unique and thus considered valid for `uniqueKey` `[a, b]`:

| a | b    |
|---|------|
| 1 | null |
| 1 | null |
Which isn't what a reader is likely to assume if we say that `null` values are "excluded", nor what a data packager may want, hence why I originally proposed the ability to specify `uniqueNulls: false` (e.g. the behavior used by Python and R).
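This distinct-null behavior is easy to observe in practice. A minimal sketch using Python's built-in `sqlite3` module (SQLite follows the standard here, treating `NULL`s in a `UNIQUE` constraint as distinct):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER, UNIQUE (a, b))")

# Both inserts succeed: NULLs are treated as distinct, so the second
# (1, NULL) does not collide with the first under the UNIQUE constraint.
con.execute("INSERT INTO t VALUES (1, NULL)")
con.execute("INSERT INTO t VALUES (1, NULL)")

rows = con.execute("SELECT count(*) FROM t").fetchone()[0]
print(rows)  # 2
```

Swap the `NULL`s for a literal value and the second insert fails with an `IntegrityError`, which is exactly the asymmetry the documentation should spell out.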
### `foreignKeys`
The documentation currently states:
> They are directly modelled on the concept of foreign keys in SQL
If so, per the SQL standard, this would require (although nowhere stated) at least a uniqueKey (if not a primaryKey) on the reference field(s). Do we have any opinion on this?
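For reference, SQLite enforces exactly this requirement: a foreign key whose parent columns lack a `PRIMARY KEY` or `UNIQUE` constraint cannot be used at all. A minimal sketch with Python's `sqlite3`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# The parent column deliberately has no PRIMARY KEY or UNIQUE constraint.
con.execute("CREATE TABLE reference (a INTEGER)")
con.execute("CREATE TABLE local (a INTEGER REFERENCES reference (a))")

# SQLite refuses to enforce an FK whose parent key is not uniquely
# constrained: modifying the child table raises a "foreign key mismatch".
try:
    con.execute("INSERT INTO local VALUES (1)")
    fk_error = None
except sqlite3.Error as exc:
    fk_error = str(exc)

print(fk_error)
```

So if we model `foreignKeys` on SQL, requiring at least a `uniqueKey` on the reference fields would match what engines actually do.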
Furthermore (and regardless of the above), per SQL standards, local rows whose key values do not appear in the reference fields are permitted if at least one of the local fields is `null`. These two tables would be considered valid for `foreignKey` local `[a, b]` → reference `[a, b]`:

reference

| a | b |
|---|---|
| 2 | 1 |
| 3 | 1 |

local

| a | b    |
|---|------|
| 1 | null |
Is this how we intend foreign keys to operate?
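For what it's worth, this is exactly how SQLite behaves (a sketch with Python's `sqlite3`; the table and column names mirror the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE reference (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
con.execute(
    "CREATE TABLE local (a INTEGER, b INTEGER,"
    " FOREIGN KEY (a, b) REFERENCES reference (a, b))"
)
con.executemany("INSERT INTO reference VALUES (?, ?)", [(2, 1), (3, 1)])

# (1, NULL) passes the foreign-key check even though no reference row has
# a = 1: a composite FK containing any NULL component is not enforced.
con.execute("INSERT INTO local VALUES (1, NULL)")

# A fully non-null row with no match in reference is rejected.
try:
    con.execute("INSERT INTO local VALUES (1, 1)")
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True

local_count = con.execute("SELECT count(*) FROM local").fetchone()[0]
print(mismatch_rejected, local_count)  # True 1
```

This is SQL's default `MATCH SIMPLE` semantics; stricter `MATCH` modes exist in the standard but are rarely implemented.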
> Which isn't what a reader is likely to assume if we say that null values are "excluded". Or what a data packager may want, hence why I originally proposed the ability to specify uniqueNulls: false (e.g. the behavior used by Python and R).
As a heavy Python / R user and light SQL user: yes, that behavior is surprising to me given the word "excluded". It also seems nice to be able to switch between the behaviors. But for now, given the time constraints, I think we should probably just focus on matching the SQL behavior, so we can have frictionless schemas that accurately represent SQL DBs. (We can add more options like `uniqueNulls` later, right?)
Maybe in addition to changing the language we could provide some examples for folks like me who are inexperienced with these sorts of SQL details?
> If so, per the SQL standard, this would require (although nowhere stated) at least a uniqueKey (if not a primaryKey) on the reference field(s). Do we have any opinion on this?
My impressions given my limited background:
I would expect foreign keys to only work with primary keys.
I would be surprised to find they worked with uniqueKeys, but understanding.
If they worked without any constraints on the key columns, I'd start getting a little concerned and really hope that a uniqueness constraint was implicitly added to the keys being referenced…
So my instinct would be to go the strict / simple route and require a primaryKey – but I defer to folks like you with more SQL / DB design experience.
> Is this how we intend foreign keys to operate?
That behavior would surprise me! I would expect all non-null foreign keys to be capable of valid joins to their primary keys, so [1, null] would not be allowed. Is there an important reason this is the behavior in the SQL standard? Oh – is it because you might have multiple transactions to build the foreign key, so you need to allow incomplete foreign keys in the meantime?
In any case, if our goal here is to accurately describe an arbitrary SQL DB, then wouldn't we need to follow this behavior? Or if the behavior is only used in edge cases (multiple transactions to build the foreign key), should we take a more strict route and only allow complete foreign keys? Sorry I'm not more help on the design here, other than a reference for what might be expected by your average Python / R user.
One more thought – These issues potentially get further complicated by tagged missing values. So if we have:
| a | b       |
|---|---------|
| 1 | SKIPPED |
| 1 | OMITTED |
| 1 | SKIPPED |
where `missingValues = ["SKIPPED", "OMITTED"]`, then I can imagine at least 3 policies for determining uniqueness of missing values:

1. All missing values are unique (each row is unique).
2. All missing values are equal (all three rows are identical).
3. Missing values with the same reason are equal (rows 1 & 3 are equal, row 2 is distinct).
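The three policies can be sketched as code. This is purely illustrative — the policy names and the `is_unique` helper are hypothetical, not from any spec:

```python
missing_values = {"SKIPPED", "OMITTED"}
rows = [(1, "SKIPPED"), (1, "OMITTED"), (1, "SKIPPED")]

def is_unique(rows, policy):
    seen, counter = set(), 0
    for row in rows:
        if policy == "unique" and any(v in missing_values for v in row):
            # Policy 1: a row containing any missing value is always unique,
            # so give it a key no other row can share.
            counter += 1
            key = ("__row__", counter)
        elif policy == "equal":
            # Policy 2: all missing values compare equal to each other,
            # regardless of their tag.
            key = tuple("__missing__" if v in missing_values else v for v in row)
        else:  # policy == "by_reason"
            # Policy 3: missing values compare equal only when the tag matches,
            # which is just ordinary value equality on the raw row.
            key = row
        if key in seen:
            return False
        seen.add(key)
    return True

print(is_unique(rows, "unique"))     # True  — every row distinct
print(is_unique(rows, "equal"))      # False — all three rows collide
print(is_unique(rows, "by_reason"))  # False — rows 1 and 3 collide
```

Policy 1 matches the SQL-standard treatment of `NULL`, which is another argument for it as the default.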
roll transferred this issue from frictionlessdata/datapackage-v2-draft on Jun 24, 2024