Is your feature request related to a problem? Please describe.
Publishers often send data encoded in character sets other than UTF-8 (e.g. latin-1, windows-125X). We want to normalize these data before we start processing. Some of our legacy code mishandles these encodings, so we want to detect such data and convert them to their Unicode equivalents whenever possible.
Describe the solution you'd like
We need a pre-parsing step, somewhere between reading the file and parsing its contents, that checks the encoding and, if possible, automatically converts the data to Unicode. One possible way to do this is BeautifulSoup's UnicodeDammit class (bs4.UnicodeDammit).
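A minimal sketch of what such a pre-parsing step could look like, offered as an illustration rather than a settled design. The function name to_unicode and the candidate encoding list are assumptions made for this example, not existing pipeline code. It uses bs4.UnicodeDammit when BeautifulSoup is installed and otherwise falls back to trying the same candidates with the standard library.

```python
from typing import Tuple

try:
    # Optional third-party dependency (BeautifulSoup 4).
    from bs4 import UnicodeDammit
except ImportError:
    UnicodeDammit = None

# Assumed candidate list for illustration; tune it to the publishers involved.
CANDIDATE_ENCODINGS = ("utf-8", "windows-1252")


def to_unicode(raw: bytes, candidates=CANDIDATE_ENCODINGS) -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding_used) for raw publisher bytes."""
    if UnicodeDammit is not None:
        # UnicodeDammit tries the supplied encodings first, then its own detection.
        dammit = UnicodeDammit(raw, list(candidates))
        if dammit.unicode_markup is not None:
            return dammit.unicode_markup, dammit.original_encoding

    # Standard-library fallback: first candidate that decodes without error wins.
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # latin-1 maps every byte to a code point, so this last resort never raises.
    return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    text, encoding = to_unicode(b"Sch\xf6del, R. 2002")
    print(encoding, text)
```

Returning the encoding alongside the text lets the caller log which files needed conversion, which may help trace unmatched references back to specific publishers.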
Additional context
We are encountering this issue when parsing reference data originating from ADSImportPipeline/ADSManualParser, and it is resulting in unmatched references solely because of publisher encoding problems.