Galen Charlton edited this page Mar 13, 2015 · 1 revision

Introduction

Library of Congress call numbers (or classification numbers), especially as actually input into library catalogs, can be incredibly messy. In their raw form, they are not comparable as strings (for example, "A10" sorts before "A2"), which can be a pain, and they have some other shortcomings as well from a machine-readable standpoint.

Criteria for a normalization routine

Any routine to normalize LC call numbers should fulfill three criteria:

  • The normalized call numbers must be comparable with each other, for proper sorting via standard string comparison
  • The normalized call number should be as short as possible, so left-anchored wildcard searches are meaningful. Searching on "A22*" should, after normalization, give you all the A22 call numbers but not the A2201 numbers.
  • The routine should be able to turn a call number into a range endpoint for the creation of call-number ranges, useful in various contexts.

That last point needs some explanation. The idea is that if someone gives a range of, say, A-AZ, what they really mean is A - AZ9999.99, not A - AZ0000, which would leave, say, AZ2.C4 outside the range. There is no attempt to make end-of-range normalization correspond to anything in real life: an end-of-range call number may not itself be a valid call number, but it will be a string that sorts after all the call numbers in the desired range.
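To make the sorting and range criteria concrete, here is a small Perl illustration. It is only a sketch: `simple_key` is a hypothetical helper that handles nothing beyond class letters and a whole number, and the endpoint string `'AZ 9999~'` is written by hand for the example rather than produced by any real routine.

```perl
use strict;
use warnings;

# Raw call numbers compared as plain strings: "A10" lands before "A2".
my @raw = sort qw(A2 A10 A1);
print "raw:    @raw\n";                 # raw:    A1 A10 A2

# Hypothetical helper: pad the letters to three characters and the number to
# four digits, so that string order agrees with shelf order.
sub simple_key {
    my ($cn) = @_;
    my ($alpha, $num) = $cn =~ /^([A-Z]{1,3})(\d+)$/ or return;
    return sprintf('%-3s%04d', $alpha, $num);
}

my @sorted = sort { simple_key($a) cmp simple_key($b) } qw(A2 A10 A1);
print "padded: @sorted\n";              # padded: A1 A2 A10

# Range A-AZ: an endpoint treated like an ordinary call number excludes AZ2,
# while an endpoint padded to sort after every AZ key includes it.
my $key           = simple_key('AZ2');                       # "AZ 0002"
my $inside_naive  = ($key ge 'A' && $key le 'AZ');           # false
my $inside_padded = ($key ge 'A' && $key le 'AZ 9999~');     # true
print "plain endpoint:  ", ($inside_naive  ? 'in' : 'out'), "\n";   # out
print "padded endpoint: ", ($inside_padded ? 'in' : 'out'), "\n";   # in
```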

Some examples

Original | Normalized | End of Range | Notes
--- | --- | --- | ---
A1 | A  0001 | A  000199~999~999~999 | Padded alpha to three spaces and number to four digits; added decimal (99) and three right-normalized cutters (~999) to right-normalized version
B22.3 | B  002230 | B  002239~999~999~999 | Left-normalized decimal is .30; right is .39
C1.D11 | C  000100D110 | C  000100D119~999~999 | A simple, single cutter
D15.4 .D22 1990 | D  001540D220 000 000 1990 | D  001540D220 000 000 1990 | Left == Right because of the "extra" (1990)
E8 C11 D22 | E  000800C110D220 | E  000800C110D229~999 | Flexible about finding cutters (no '.' before first cutter here)
ZA4082G33M434.D54 1998 | ZA 408200G330M434D540 1998 | ZA 408200G330M434D540 1998 | More incorrect punctuation ignored

The algorithm

My solution involves a ridiculous regular expression and a whole lot of guessing. I've tested it against some of the nastier stuff I can find in the University of Michigan catalog and it seems to do just fine with anything that is remotely a valid LC call number.

Creating a normalized LC Call Number

The basic algorithm is as follows:

  1. Try to match against a regexp that allows a few common prefixes and allows up to three Cutter numbers plus the "extra" (usually a year, but can include all sorts of crap).
  2. Put all the components (leading alpha, number, decimal, cutter1alpha, cutter1number, ..., extra) into an array, OriginalArray.
  3. Create a left-normalized version of that array, LeftNormalizedArray: pad the alpha out to three characters with spaces, left-pad the number with zeros, right-pad the decimal with zeros, and build each cutter with a space standing in for a missing letter (since a space sorts before 'A').
  4. If there's a number (or number.decimal) in the original, automatically replace them in OriginalArray with a left-normalized version (e.g., 1.4 => 0001.40), since we always want that to be left-normalized.
  5. Proceed backwards through the OriginalArray to find the rightmost component that was actually set in the original entry.
  6. When we find the rightmost non-empty component at array index N, return LeftNormalizedArray[0..N-1] + OriginalArray[N]
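The six steps above can be sketched in Perl roughly as follows. This is a minimal illustration, not the production routine: the pattern is a cut-down stand-in for the full regular expression (no media prefixes), the names `normalize_lc` and `rpad` are made up for the example, and the pad widths (two decimal digits, three cutter digits) are assumptions inferred from the examples table. It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Simplified pattern with ten capture groups: letters, number, decimal,
# three (letter, digits) cutter pairs, and the trailing "extra".
my $lc = qr/^ \s*
    ([A-Z]{1,3}) \s*
    (?: (\d+) (?: \s* \. \s* (\d+) )? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+)? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+)? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+)? )?
    (\s+ .+?)?
    \s* $
/x;

# Right-pad a string with a fill character up to the given width.
sub rpad {
    my ($s, $width, $ch) = @_;
    $s .= $ch while length($s) < $width;
    return $s;
}

sub normalize_lc {
    my ($cn) = @_;
    my @orig = $cn =~ $lc or return;          # steps 1-2: undef where absent
    my ($alpha, $num, $dec, @cutters) = @orig;
    my $extra = pop @cutters;

    # Step 3: a left-normalized value for every slot, present or not.
    my @left = (sprintf('%-3s', $alpha), sprintf('%04d', $num // 0),
                rpad($dec // '', 2, '0'));
    while (my ($letter, $digits) = splice @cutters, 0, 2) {
        push @left, $letter // ' ', rpad($digits // '', 3, '0');
    }
    push @left, $extra // '';

    # Step 4: number and decimal are always used in left-normalized form.
    $orig[1] = $left[1] if defined $orig[1];
    $orig[2] = $left[2] if defined $orig[2];

    # Steps 5-6: find the rightmost component that was actually present.
    my ($n) = grep { defined $orig[$_] } reverse 0 .. $#orig;
    return join '', @left[0 .. $n - 1], $orig[$n];
}

print normalize_lc('A1'), "\n";       # A  0001
print normalize_lc('B22.3'), "\n";    # B  002230
```

Because step 6 appends the last component as entered, a call number that ends in a cutter keeps that cutter number unpadded in this sketch.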

Creating a "right-normalized" Call number (range endpoint)

If we want to take a call number and make it a range endpoint (because we want to be able to ask for call numbers less than the given string), we change the algorithm in these ways:

  1. Try to match against a regexp that allows a few common prefixes and allows up to three Cutter numbers plus the "extra" (usually a year, but can include all sorts of crap).
  2. Put all the components (leading alpha, number, decimal, cutter1alpha, cutter1number, ..., extra) into an array, OriginalArray.
  3. Create a left-normalized version of that array, LeftNormalizedArray: pad the alpha out to three characters with spaces, left-pad the number with zeros, right-pad the decimal with zeros, and build each cutter with a space standing in for a missing letter (since a space sorts before 'A').
  4. Create a right-normalized version of OriginalArray, RightNormalizedArray: pad the alpha out to three characters with spaces, left-pad the number with zeros, right-pad the decimal with nines, and build each cutter with a tilde standing in for a missing letter (since a tilde sorts after 'Z').
  5. If there's a number (or number.decimal) in the original, automatically replace them in OriginalArray with a left-normalized version (e.g., 1.4 => 0001.40), since we always want that to be left-normalized.
  6. Proceed backwards through the OriginalArray to find the rightmost component that was actually set in the original entry.
  7. When we find the rightmost non-empty component at array index N, return LeftNormalizedArray[0..N-1] + RightNormalizedArray[N..length(RightNormalizedArray)]
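A rough Perl sketch of the range-endpoint variant follows. As before, it is an illustration under assumptions rather than the actual implementation: the pattern is a simplified stand-in without media prefixes, `build`, `rpad`, and `range_end` are hypothetical names, and the pad widths are inferred from the examples table. Filling a missing number with 9999 on the right-normalized side is also an assumption, taken from the "A - AZ9999.99" remark earlier on this page. Step 5 is skipped because the original values never appear in the output here.

```perl
use strict;
use warnings;

# Simplified pattern with ten capture groups: letters, number, decimal,
# three (letter, digits) cutter pairs, and the trailing "extra".
my $lc = qr/^ \s*
    ([A-Z]{1,3}) \s*
    (?: (\d+) (?: \s* \. \s* (\d+) )? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+)? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+)? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+)? )?
    (\s+ .+?)?
    \s* $
/x;

# Right-pad a string with a fill character up to the given width.
sub rpad {
    my ($s, $width, $ch) = @_;
    $s .= $ch while length($s) < $width;
    return $s;
}

# Build a ten-slot normalized array. $letter_fill replaces a missing cutter
# letter; $digit_fill pads decimals and cutter numbers and fills a missing number.
sub build {
    my ($letter_fill, $digit_fill, @parts) = @_;
    my ($alpha, $num, $dec, @cutters) = @parts;
    my $extra = pop @cutters;
    my @out = (
        sprintf('%-3s', $alpha),
        defined $num ? sprintf('%04d', $num) : $digit_fill x 4,
        rpad($dec // '', 2, $digit_fill),
    );
    while (my ($letter, $digits) = splice @cutters, 0, 2) {
        push @out, $letter // $letter_fill, rpad($digits // '', 3, $digit_fill);
    }
    return (@out, $extra // '');
}

sub range_end {
    my ($cn) = @_;
    my @orig  = $cn =~ $lc or return;         # steps 1-2
    my @left  = build(' ', '0', @orig);       # step 3
    my @right = build('~', '9', @orig);       # step 4
    my ($n) = grep { defined $orig[$_] } reverse 0 .. $#orig;    # step 6
    return join '', @left[0 .. $n - 1], @right[$n .. $#right];   # step 7
}

print range_end('A1'), "\n";        # A  000199~999~999~999
print range_end('C1.D11'), "\n";    # C  000100D119~999~999
```

A normalized key can then be tested against the endpoint with an ordinary string comparison, for example `$key le range_end('AZ')`.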

The Regular Expression to try to match

...looks like this. The prefixes are probably UMich-specific.

my $lcregex = qr/^
        \s*
        (?:VIDEO-D)? # for video stuff
        (?:DVD-ROM)? # DVDs, obviously
        (?:CD-ROM)?  # CDs
        (?:TAPE-C)?  # Tapes
        \s*
        ([A-Z]{1,3})  # alpha
        \s*
        (?:         # optional numbers with optional decimal point
          (\d+)
          (?:\s*?\.\s*?(\d+))?
        )?
        \s*
        (?:               # optional cutter
          \.? \s*
          ([A-Z])         # cutter letter
          \s*
          (\d+ | \Z)?     # cutter numbers
        )?
        \s*
        (?:               # optional cutter
          \.? \s*
          ([A-Z])         # cutter letter
          \s*
          (\d+ | \Z)?     # cutter numbers
        )?
        \s*
        (?:               # optional cutter
          \.? \s*
          ([A-Z])         # cutter letter
          \s*
          (\d+ | \Z)?     # cutter numbers
        )?
        (\s+.+?)?         # a mandatory space followed by everything else
        \s*$
  /x;
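For illustration, here is one way the captures might be pulled out by name. The pattern below is a condensed copy of the expression above with the same ten groups; `parse_lc` and the component names in `@names` are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

my $lcregex = qr/^ \s*
    (?:VIDEO-D)? (?:DVD-ROM)? (?:CD-ROM)? (?:TAPE-C)? \s*
    ([A-Z]{1,3}) \s*
    (?: (\d+) (?: \s*? \. \s*? (\d+) )? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+ | \Z)? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+ | \Z)? )? \s*
    (?: \.? \s* ([A-Z]) \s* (\d+ | \Z)? )?
    (\s+ .+?)?
    \s* $
/x;

# One name per capture group, in order.
my @names = qw(alpha number decimal
               cutter1_letter cutter1_number
               cutter2_letter cutter2_number
               cutter3_letter cutter3_number
               extra);

# Return a hash reference of components, or undef if the string does not match.
sub parse_lc {
    my ($cn) = @_;
    my @parts = $cn =~ $lcregex or return;
    my %component;
    @component{@names} = @parts;      # absent components stay undef
    return \%component;
}

my $c = parse_lc('DVD-ROM QA76.73 .P22 2001');
printf "%-15s %s\n", $_, $c->{$_} // '(none)' for @names;
```

Note that the "extra" capture keeps its leading whitespace, and that the media prefix is matched but not captured.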