RegEx support of Datasets packages #81

JohnMrziglod · 2018-01-03T00:05:44Z

If I understand generate_filename correctly, the typhon.spareice.datasets approach assumes that the filename can be calculated using only the placeholders in the template. This is not the case for most real datasets. For example, many filenames contain orbit numbers or the string of a downlink station. That means it is necessary to include a regular expression. I'm not sure this is possible with the typhon.spareice.datasets approach but if it isn't, that would be a major limitation.

You are right. So far generate_filename only uses temporal placeholders. I thought about implementing user-defined placeholders but I have not had the time to do it. What do you need them for? Do you want to keep the information from the original filenames and create new filenames with it? A kind of filename conversion? Could you give me a more detailed example? How do you use typhon.Datasets for this?


JohnMrziglod · 2018-01-10T01:15:48Z

For example, a HIRS filename looks like 'NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz'. I describe that with the regex r"(L?\d*\.)?NSS.HIR[XS].(?P<satcode>.{2})\.D(?P<year>\d{2})(?P<doy>\d{3})\.S(?P<hour>\d{2})(?P<minute>\d{2})\.E(?P<hour_end>\d{2})(?P<minute_end>\d{2})\.B(?P<B>\d{7})\.(?P<station>.{2})\.gz". Of those groups, B and station are present in the filename but cannot be predicted from the start time. In the case of FCDR_HIRS, I am either reading or writing data, so I have both a regex approach and a template-based approach:

stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
               "{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
               "{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
               "{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")
write_subdir = "{fcdr_type:s}/{satname:s}/{year:04d}/{month:02d}/{day:02d}"
stored_re = (r"FIDUCEO_FCDR_L1C_HIRS(?P<version>[2-4])_"
             r"(?P<satname>.{6})_"
             r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
             r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})_"
             r"(?P<year_end>\d{4})(?P<month_end>\d{2})(?P<day_end>\d{2})"
             r"(?P<hour_end>\d{2})(?P<minute_end>\d{2})(?P<second_end>\d{2})_"
             r"(?P<fcdr_type>[a-zA-Z]*)_"
             r"v(?P<data_version>.+)_"
             r"fv(?P<format_version>.+)\.nc")

My file finder uses the regular expression, but the writing part uses the template. This is duplication; ideally only one of the two should be needed.
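One way to remove that duplication would be to derive the regular expression from the format template. The following is only a minimal sketch using the standard library, not something typhon provides: the helper name `template_to_regex`, the mapping from format specs to regex fragments, and the example filename are all assumptions made for illustration.

```python
import re
from string import Formatter


def template_to_regex(template):
    """Build a regex with named groups from a str.format template (hypothetical helper)."""
    parts = []
    seen = set()
    for literal, field, spec, _conversion in Formatter().parse(template):
        parts.append(re.escape(literal))
        if field is None:          # trailing literal text without a placeholder
            continue
        if field in seen:          # repeated placeholder must match the same text again
            parts.append(f"(?P={field})")
            continue
        seen.add(field)
        width = re.fullmatch(r"0(\d+)d", spec)
        if width:                  # e.g. {year:04d} -> exactly four digits
            body = rf"\d{{{width.group(1)}}}"
        elif spec == "d":          # e.g. {version:d} -> one or more digits
            body = r"\d+"
        else:                      # e.g. {satname:s} -> shortest non-empty match
            body = r".+?"
        parts.append(f"(?P<{field}>{body})")
    return "".join(parts)


stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
               "{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
               "{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
               "{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")

pattern = re.compile(template_to_regex(stored_name))

# Made-up filename, used only to exercise the pattern.
name = ("FIDUCEO_FCDR_L1C_HIRS3_metopa_20100101120000_"
        "20100101134500_easy_v0.8_fv1.1.2.nc")
fields = pattern.fullmatch(name).groupdict()
print(fields["satname"], fields["year"], fields["format_version"])
```

The generated pattern is looser than the hand-written stored_re (for example, satname is not limited to six characters), so a real implementation would probably let the user override individual fragments.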

@gerritholl spareice.datasets now partly supports this feature. A user can define regular expressions and use them as placeholders in filenames (currently only in the basename, not in the directory name). Try this example (you need a file named NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz):

from typhon.spareice.datasets import Dataset
placeholder = {
    "satcode": "(.{2})",
    "B": r"(\d{7})",
    "station": "(.{2})"
}
dataset = Dataset(
    "NSS.HIR[XS].{satcode}.D{year2}{doy}.S{hour}{minute}.E{end_hour}{end_minute}.B{B}.{station}.gz",
    placeholder=placeholder,
)
file_info = dataset.find_file("1999-05-08")
print(file_info)

This prints:

.../NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz
  Start: 1999-05-07 06:32:00
  End: 1999-05-07 08:20:00
  Attributes:
    satcode: NJ
    B: 2241718
    station: WI

file_info holds information about the file; you can access the parsed placeholders via file_info.attr. You can use them to generate filenames for other datasets:

other_dataset = Dataset("dummy_file_{year}{month}{day}_{satcode}_B{B}_{station}.dat")
other_dataset.generate_filename("1999-05-08", fill=file_info.attr)
#  '.../dummy_file_19990508_NJ_B2241718_WI.dat'
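For readers without typhon at hand, the idea behind user-defined placeholders can be sketched with the standard library alone. This is not how spareice.datasets is implemented; the `expand` helper, the `TEMPORAL` table, and the escaped-dot template are assumptions for illustration. It parses the HIRS filename from above and reuses the extracted attributes in a second template.

```python
import re
from datetime import datetime, timedelta

# Regex fragments for the time placeholders (assumed here, not typhon's definitions).
TEMPORAL = {
    "year2": r"\d{2}", "doy": r"\d{3}",
    "hour": r"\d{2}", "minute": r"\d{2}",
    "end_hour": r"\d{2}", "end_minute": r"\d{2}",
}
# User-defined placeholders, as in the example above.
user = {"satcode": r".{2}", "B": r"\d{7}", "station": r".{2}"}


def expand(template, placeholders):
    """Replace every {name} in the template with a named regex group."""
    def group(match):
        name = match.group(1)
        return f"(?P<{name}>{placeholders[name]})"
    # Names must start with a letter or underscore, so quantifiers like {2} are left alone.
    return re.sub(r"\{([A-Za-z_]\w*)\}", group, template)


template = (r"NSS\.HIR[XS]\.{satcode}\.D{year2}{doy}\.S{hour}{minute}"
            r"\.E{end_hour}{end_minute}\.B{B}\.{station}\.gz")
filename = "NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz"

fields = re.fullmatch(expand(template, {**TEMPORAL, **user}), filename).groupdict()

# %y is the two-digit year and %j the day of the year.
day = f"{fields['year2']} {fields['doy']}"
start = datetime.strptime(f"{day} {fields['hour']} {fields['minute']}", "%y %j %H %M")
end = datetime.strptime(f"{day} {fields['end_hour']} {fields['end_minute']}", "%y %j %H %M")
if end < start:  # the granule crosses midnight
    end += timedelta(days=1)

# Everything that is not a time placeholder is kept as an attribute.
attributes = {name: fields[name] for name in user}

# Reuse the attributes in another template; the date comes from the parsed start time.
other = "dummy_file_{year:04d}{month:02d}{day:02d}_{satcode}_B{B}_{station}.dat"
new_name = other.format(year=start.year, month=start.month, day=start.day, **attributes)
print(start, end, attributes, new_name)
```

Unlike the template in the Dataset example, the dots here are escaped so that they only match literal dots.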

JohnMrziglod added the discussion Conversation about feature ideas label Jan 3, 2018

JohnMrziglod assigned JohnMrziglod and gerritholl Jan 3, 2018
