Home
Library of Congress call numbers (or classification numbers), especially as actually input into library catalogs, can be incredibly messy. In their raw form, they are not comparable as strings, which can be a pain, and have some other shortcomings as well from a machine-readable standpoint.
Any routine to normalize LC call numbers should fulfill three criteria:
- The normalized call numbers must be comparable with each other, for proper sorting via standard string comparison
- The normalized call number should be as short as possible, so left-anchored wildcard searches are meaningful. Searching on "A22*" should, after normalization, give you all the A22 call numbers but not the A2201 numbers.
- The routine should be able to turn a call number into a range endpoint for the creation of call-number ranges, useful in various contexts.
That last point needs some explanation. The idea is that if someone gives a range of, say, A-AZ, what they really mean is A - AZ9999.99, not A - AZ0000, with, say, AZ2.C4 falling outside the range. There is no attempt to make end of range normalization correspond to anything in real life (e.g., an end of range call number may not itself be a valid call number, but rather will represent a string which sorts after all the call numbers in the desired range).
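A small illustration of the range problem, using a deliberately simplified key (class letters padded to three characters plus a four-digit number) rather than the full normalization. The helper names `simple_key` and `in_range` are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical simplified key: letters space-padded to 3, number zero-padded to 4.
sub simple_key {
    my ($alpha, $number) = @_;
    return sprintf('%-3s%04d', $alpha, $number);
}

# Range membership decided by plain string comparison.
sub in_range {
    my ($key, $low, $high) = @_;
    return ($key ge $low && $key le $high) ? 1 : 0;
}

my $item        = simple_key('AZ', 2);             # stands in for AZ2.C4
my $low         = sprintf('%-3s', 'A');            # start of the A-AZ range
my $naive_high  = simple_key('AZ', 0);             # "AZ" read as AZ0000
my $padded_high = simple_key('AZ', 9999) . '99';   # "AZ" read as AZ9999.99

print in_range($item, $low, $naive_high),  "\n";   # prints 0
print in_range($item, $low, $padded_high), "\n";   # prints 1
```

With the naive endpoint the item sorts after the end of the range and is missed; padding the endpoint with nines pulls it back in.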
Original | Normalized | End of Range | Notes |
---|---|---|---|
A1 | `A  0001` | `A  000199~999~999~999` | Padded alpha to three spaces and number to four digits; added decimal (99) and three right-normalized cutters (~999) to right-normalized version |
B22.3 | `B  002230` | `B  002239~999~999~999` | Left-normalized decimal is .30; right is .39 |
C1.D11 | `C  000100D110` | `C  000100D119~999~999` | A simple, single cutter |
D15.4 .D22 1990 | `D  001540D220 000 000 1990` | `D  001540D220 000 000 1990` | Left == Right because of the "extra" (1990) |
E8 C11 D22 | `E  000800C110D220` | `E  000800C110D229~999` | Flexible about finding cutters (no '.' before first cutter here) |
ZA4082G33M434.D54 1998 | `ZA 408200G330M434D540 1998` | `ZA 408200G330M434D540 1998` | More incorrect punctuation ignored |
My solution involves a ridiculous regular expression and a whole lot of guessing. I've tested it against some of the nastier stuff I can find in the University of Michigan catalog and it seems to do just fine with anything that is remotely a valid LC call number.
The basic algorithm is as follows:
- Try to match against a regexp that allows a few common prefixes and allows up to three Cutter numbers plus the "extra" (usually a year, but it can include all sorts of crap).
- Put all the components (leading alpha, number, decimal, cutter1alpha, cutter1number, ..., extra) into an array, `OriginalArray`.
- Create a left-normalized version of that array, `LeftNormalizedArray`. Pad the alpha out to three spaces, left-pad the number with zeros, right-pad the decimal with zeros, create cutters with space for a letter (since it sorts before 'A'), etc.
- If there's a number (or number.decimal) in the original, automatically replace it in `OriginalArray` with a left-normalized version (e.g., 1.4 => 0001.40), since we always want that to be left-normalized.
- Proceed backwards through `OriginalArray` to find the rightmost component that was actually set in the original entry.
- When we find the rightmost non-empty component at array index `N`, return `LeftNormalizedArray[0..N-1]` + `OriginalArray[N]`.
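Here is a rough sketch of those steps in Perl, working from components that have already been pulled apart. The names `left_normalize` and `rpad`, the ten-slot component order, and the pad widths (three for the alpha and for cutter digits, four for the number, two for the decimal) are assumptions read off the example table, not the actual implementation. One deliberate shortcut: the final component is emitted in its padded form, since that is what reproduces the table rows (D11 comes out as D110).

```perl
use strict;
use warnings;

# Right-pad $s with $ch until it is at least $width characters long.
sub rpad {
    my ($s, $width, $ch) = @_;
    $s = '' unless defined $s;
    $s .= $ch while length($s) < $width;
    return $s;
}

# Assumed component order: alpha, number, decimal, cutter1 letter, cutter1 number,
# cutter2 letter, cutter2 number, cutter3 letter, cutter3 number, extra.
# Components that were not present are undef.
sub left_normalize {
    my @orig = @_;
    $#orig = 9;    # force exactly ten slots

    # Build the fully padded (left-normalized) form of every component.
    my @left = (
        sprintf('%-3s', $orig[0]),                          # alpha, space-padded
        sprintf('%04d', defined $orig[1] ? $orig[1] : 0),   # number, zero-padded
        rpad($orig[2], 2, '0'),                             # decimal
    );
    for my $i (3, 5, 7) {
        push @left, (defined $orig[$i] ? $orig[$i] : ' '),  # cutter letter or space
                    rpad($orig[$i + 1], 3, '0');            # cutter digits
    }
    push @left, (defined $orig[9] ? $orig[9] : '');         # extra, kept as-is

    # Walk backwards to find the rightmost component that was actually set.
    my $n = 0;
    for my $i (reverse 0 .. 9) {
        if (defined $orig[$i] && length $orig[$i]) { $n = $i; last; }
    }

    # Everything past index $n is dropped, keeping the result short.
    return join '', @left[0 .. $n];
}

print left_normalize('A', 1), "\n";                  # prints "A  0001"
print left_normalize('B', 22, 3), "\n";              # prints "B  002230"
print left_normalize('C', 1, undef, 'D', 11), "\n";  # prints "C  000100D110"
```

Because trailing unset components are cut off rather than padded, a left-anchored search on the normalized form of A22 does not pick up longer class numbers.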
If we want to take a call number and make it a range endpoint (because we want to be able to ask for call numbers less than the given string), we change the algorithm in these ways:
- Try to match against a regexp that allows a few common prefixes and allows up to three Cutter numbers plus the "extra" (usually a year, but it can include all sorts of crap).
- Put all the components (leading alpha, number, decimal, cutter1alpha, cutter1number, ..., extra) into an array, `OriginalArray`.
- Create a left-normalized version of that array, `LeftNormalizedArray`. Pad the alpha out to three spaces, left-pad the number with zeros, right-pad the decimal with zeros, create cutters with space for a letter (since it sorts before 'A'), etc.
- Create a right-normalized version of `OriginalArray`, `RightNormalizedArray`. Pad the alpha out to three spaces, left-pad the number with zeros, right-pad the decimal with nines, create cutters with tilde for a letter (since it sorts after 'Z'), etc.
- If there's a number (or number.decimal) in the original, automatically replace it in `OriginalArray` with a left-normalized version (e.g., 1.4 => 0001.40), since we always want that to be left-normalized.
- Proceed backwards through `OriginalArray` to find the rightmost component that was actually set in the original entry.
- When we find the rightmost non-empty component at array index `N`, return `LeftNormalizedArray[0..N-1]` + `RightNormalizedArray[N..length(RightNormalizedArray)]`.
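The range-endpoint variant can be sketched the same way. As before, the function names (`padded_components`, `range_end`, `rpad`), the ten-slot component order, and the pad widths are assumptions for illustration; the only difference between the two arrays is the fill digit ('0' or '9') and the stand-in for a missing cutter letter (space or tilde). Unused trailing cutters show up as `~999`, as the note on the A1 row of the table describes.

```perl
use strict;
use warnings;

# Right-pad $s with $ch until it is at least $width characters long.
sub rpad {
    my ($s, $width, $ch) = @_;
    $s = '' unless defined $s;
    $s .= $ch while length($s) < $width;
    return $s;
}

# Padded form of all ten components. $digit is the fill digit and $blank
# the placeholder for a missing cutter letter.
sub padded_components {
    my ($orig, $digit, $blank) = @_;
    my @out = (
        sprintf('%-3s', $orig->[0]),
        defined $orig->[1] ? sprintf('%04d', $orig->[1]) : $digit x 4,
        rpad($orig->[2], 2, $digit),
    );
    for my $i (3, 5, 7) {
        push @out, (defined $orig->[$i] ? $orig->[$i] : $blank),
                   rpad($orig->[$i + 1], 3, $digit);
    }
    push @out, (defined $orig->[9] ? $orig->[9] : '');
    return @out;
}

sub range_end {
    my @orig = @_;
    $#orig = 9;    # force exactly ten slots
    my @left  = padded_components(\@orig, '0', ' ');
    my @right = padded_components(\@orig, '9', '~');

    # Rightmost component that was actually set.
    my $n = 0;
    for my $i (reverse 0 .. 9) {
        if (defined $orig[$i] && length $orig[$i]) { $n = $i; last; }
    }

    # Left-normalized up to N-1, right-normalized from N to the end.
    return join '', @left[0 .. $n - 1], @right[$n .. $#right];
}

print range_end('A', 1), "\n";                            # prints "A  000199~999~999~999"
print range_end('E', 8, undef, 'C', 11, 'D', 22), "\n";   # prints "E  000800C110D229~999"
```

When the "extra" is present it is the rightmost component, so nothing is left to pad and the endpoint comes out identical to the fully padded left-normalized string.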
The regular expression looks like this. The prefixes are probably UMich-specific.
```perl
my $lcregex = qr/^
    \s*
    (?:VIDEO-D)?    # for video stuff
    (?:DVD-ROM)?    # DVDs, obviously
    (?:CD-ROM)?     # CDs
    (?:TAPE-C)?     # Tapes
    \s*
    ([A-Z]{1,3})    # alpha
    \s*
    (?:             # optional numbers with optional decimal point
      (\d+)
      (?:\s*?\.\s*?(\d+))?
    )?
    \s*
    (?:             # optional cutter
      \.? \s*
      ([A-Z])       # cutter letter
      \s*
      (\d+ | \Z)?   # cutter numbers
    )?
    \s*
    (?:             # optional cutter
      \.? \s*
      ([A-Z])       # cutter letter
      \s*
      (\d+ | \Z)?   # cutter numbers
    )?
    \s*
    (?:             # optional cutter
      \.? \s*
      ([A-Z])       # cutter letter
      \s*
      (\d+ | \Z)?   # cutter numbers
    )?
    (\s+.+?)?       # a mandatory space followed by everything else
    \s*$
/x;
```
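To see what ends up in each capture group, a short driver can run the pattern over a few of the sample call numbers. The pattern is restated here in compact form (the repeated cutter clause is factored into `$cutter`) only so the snippet runs on its own; `parse_lc` is a hypothetical helper name, not part of any published API.

```perl
use strict;
use warnings;

# One optional cutter: optional dot, a letter, then optional digits.
my $cutter = qr/(?: \.? \s* ([A-Z]) \s* (\d+ | \Z)? )?/x;

my $lcregex = qr/^
    \s* (?:VIDEO-D)? (?:DVD-ROM)? (?:CD-ROM)? (?:TAPE-C)? \s*
    ([A-Z]{1,3}) \s*                        # alpha
    (?: (\d+) (?:\s*?\.\s*?(\d+))? )? \s*   # number and optional decimal
    $cutter \s* $cutter \s* $cutter         # up to three cutters
    (\s+.+?)?                               # the "extra"
    \s*$
/x;

# Returns the ten captured components (undef where absent),
# or an empty list when the string does not look like an LC call number.
sub parse_lc {
    my ($callno) = @_;
    my @parts = $callno =~ $lcregex or return;
    return @parts;
}

for my $callno ('C1.D11', 'E8 C11 D22', 'D15.4 .D22 1990') {
    my @parts = parse_lc($callno);
    # Show unset components as empty fields.
    print join('|', map { defined $_ ? $_ : '' } @parts), "\n";
}
```

The list that comes back is in the same order as the components described in the algorithm steps (alpha, number, decimal, three letter/number cutter pairs, extra), so it can be fed straight into a normalization routine.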