[FIX] CSV Import - Change datetime format parsing #6539

Merged (1 commit) on Aug 25, 2023

Conversation

PrimozGodec
Contributor

Issue

Addresses but doesn't fix #6499

Datetime parsing in the Import CSV widget was implemented to be fast when the same values repeat: each unique value is parsed on its own. This can cause problems, because values that are parsed separately are not guaranteed to be interpreted with the same format.

Description of changes

Since pandas has improved its datetime parsing speed, I suggest parsing the whole column in a single call to pd.to_datetime instead of parsing the unique values separately. This way, pandas tries to guess the format of the values in the column and then parses all of them with that same format.
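As an illustration of the two strategies described above, here is a minimal sketch. It is not the widget's actual code: the column values are made up, and the per-value mapping is only an assumption about what "parsing unique times separately" looks like in general.

```python
import pandas as pd

# Made-up column with a repeated value, all in one unambiguous format.
col = pd.Series([
    "2023-08-18 10:00:00",
    "2023-08-19 11:30:00",
    "2023-08-18 10:00:00",
])

# Strategy 1 (sketch): parse every unique string on its own, then map the
# results back onto the column. Each string is interpreted independently.
parsed_by_value = {value: pd.to_datetime(value) for value in col.unique()}
per_value = col.map(parsed_by_value)

# Strategy 2: hand the whole column to pandas in one call, so that one
# format can be applied to all values.
whole_column = pd.to_datetime(col)
```

For a column in a single unambiguous format, both strategies give the same timestamps. They can only diverge when individual strings can be read in more than one way.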

This may solve the problem for formats that pandas recognizes, but it will not solve every case. For example, when pandas cannot recognize the format, it falls back to the dateutil implementation; the dates are then still parsed one by one, so dates in the same column can be parsed differently. This is what happens in #6499, so that issue is not solved yet.
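The effect of parsing dates one by one can be reproduced with dateutil directly. This is a small sketch with made-up values (they are not taken from #6499); it calls dateutil's parser itself rather than going through pandas, so it only shows the per-value behaviour and not the widget's code path.

```python
from dateutil import parser

# Made-up values written in day/month/year order.
values = ["01/02/2023", "13/02/2023"]

# Each string is parsed on its own, so each one is interpreted independently.
parsed = [parser.parse(value) for value in values]

for text, moment in zip(values, parsed):
    print(text, "->", moment.date().isoformat())

# "01/02/2023" is read month-first (2 January), while "13/02/2023" can only
# be read day-first (13 February), because 13 is not a valid month.
```

Two values from the same column therefore end up with different field orders, which is the kind of inconsistency a single shared format avoids.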

Includes
  • Code changes
  • Tests
  • Documentation

codecov bot commented Aug 18, 2023

Codecov Report

Merging #6539 (fbd553f) into master (981317c) will decrease coverage by 0.01%.
Report is 5 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6539      +/-   ##
==========================================
- Coverage   87.68%   87.68%   -0.01%     
==========================================
  Files         321      321              
  Lines       69406    69400       -6     
==========================================
- Hits        60860    60853       -7     
- Misses       8546     8547       +1     

PrimozGodec changed the title from "CSV Import - Change datetime format parsing" to "[FIX] CSV Import - Change datetime format parsing" on Aug 18, 2023
if len(unique_values) < 100 and len(unique_values) < len(col)**0.7:
    return col.astype("category")
try:
    return pd.to_datetime(col)
Contributor

Would specifying a format here and retrying on ParserErrors be a valid fix for #6499?

Contributor Author

It should work. Maybe we can try all the formats that we currently support and fall back to the default if none of them works (since dateutil supports more formats than we do). We would need to test how time-consuming that is, but it is a solution. It would not solve the problem in #6499, though, since the d/m/y format is currently not in the list.

An even better solution would be to allow users to specify the format.
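The ideas discussed in this thread could be combined roughly as follows. This is only a sketch under stated assumptions: `KNOWN_FORMATS`, `parse_datetime_column` and the `user_format` parameter are hypothetical names, and the format list is invented for illustration because the widget's real list is not shown here.

```python
import pandas as pd

# Hypothetical list of formats; the widget's actual list is not shown in this thread.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d.%m.%Y")


def parse_datetime_column(col, user_format=None, known_formats=KNOWN_FORMATS):
    """Parse a column of strings, applying one format to every value."""
    if user_format is not None:
        # An explicit format is unambiguous; a mismatch raises ValueError.
        return pd.to_datetime(col, format=user_format)
    for fmt in known_formats:
        try:
            return pd.to_datetime(col, format=fmt)
        except ValueError:
            continue  # this format does not fit the column, try the next one
    # Nothing matched: fall back to pandas' default guessing.
    return pd.to_datetime(col)


# Matched by one of the (hypothetical) known formats.
dotted = parse_datetime_column(pd.Series(["18.08.2023", "25.08.2023"]))

# d/m/y is not in the list, so here the format is supplied explicitly.
slashed = parse_datetime_column(
    pd.Series(["01/02/2023", "13/02/2023"]), user_format="%d/%m/%Y"
)
```

With an explicit or matched format, every value in the column is read with the same field order. The cost of looping over many formats on large columns is not measured here and would need the timing test mentioned above.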

noahnovsak merged commit bb5e845 into biolab:master on Aug 25, 2023
23 checks passed
PrimozGodec deleted the csv-datetimeguess branch on September 20, 2023 11:59
Development

Successfully merging this pull request may close these issues.

CSV File Import mangles dates - locale?
2 participants