Replies: 4 comments 2 replies
-
Can you attach some sample lines of input, and the output you would like?
-
Given the attached sample (this is just sample data I found at www.datablist.com and modified) of 5 rows followed by the same 5 rows duplicated, here is the output from the awk script:
Customer Id,First Name,Last Name,Company
Here is the output from the miller script:
Customer Id,First Name,Last Name,Company
Both accomplish the task of removing the duplicate rows, and for my actual csv file miller does it faster, but the awk script is preferable in my opinion because it doesn't sanitize the data formatting and would presumably work with almost any file type. The awk script does just the one thing I wanted: remove duplicates. My original question about slurping in the entire row is asking whether there's a way for miller to act more like awk in this circumstance. Thanks.
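A duplicate-heavy test file like this can be put together by appending a file's data rows back onto itself, e.g. (a sketch only; the filenames are placeholders):
{ cat sample.csv; tail -n +2 sample.csv; } > sample_dupe_file.csv
where tail -n +2 skips the header line so only the data rows are duplicated.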
-
You could use NIDX format.
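Something along these lines should do it (a sketch only; '|' is just a field separator assumed not to occur anywhere in the file, and the filename is the attached sample):
mlr --nidx --fs '|' uniq -a sample_dupe_file.csv > output.csv
With --nidx and a separator that never appears in the data, each line is read as a single positional field, so uniq -a ends up comparing entire rows and writing them back out unchanged, much like the awk one-liner.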
-
Thank you. This works nicely on the sample file, and it's what I'll use going forward. (Can the '|' input field separator be any character, so long as it doesn't appear in the csv file, or does it need to be '|'?)
The issue I now have (with my original file) is that the numbered-index format, nidx, seems to abort if a row is too long. I tried your nidx miller script on my original file and noticed it stopped writing to the output file immediately before a very long row (67,542 characters). I then added that row to sample_dupe_file.csv as the second data row, and when I ran the nidx miller script on it, it also aborted after writing the header and the first data row. So the long row seems to be the culprit.
@johnkerl -- Is there a size limit on how long a numbered-index row can be? I didn't notice this issue previously when choosing --csv format instead of --nidx. Thanks all.
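In case it helps with reproducing, the longest line in a file can be located with plain awk (the filename is a placeholder):
awk '{ if (length($0) > max) { max = length($0); row = NR } } END { print row, max }' original_file.csv
which prints the line number and length of the longest row.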
-
I would like to remove duplicate rows from a csv file, where the duplicate rows do not necessarily appear contiguous to the original row. I was able to use this:
mlr --csv --quote-original uniq -a input.csv > output.csv
successfully, comparing all fields to determine uniqueness.
Is there a way, instead of comparing all fields, to compare the entire row in order to determine uniqueness?
I found this searching for an awk solution to remove duplicates:
awk '!seen[$0]++' your_file.csv > new_file.csv
where the entire row is the key into the associative array.
Is there a way to do something similar in the miller DSL?
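For reference, a rough equivalent in the Miller DSL might be a filter that keys an out-of-stream map on the joined field values (a sketch only, untested; it compares field values rather than the raw line, so it wouldn't address the quoting issue described below any better than uniq -a does):
mlr --csv filter 'var key = joinv($*, ","); var dup = haskey(@seen, key); @seen[key] = 1; !dup' input.csv > output.csv
Here @seen plays the role of awk's seen array, and the final bare-boolean !dup decides whether each record is kept.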
Part of my problem is that miller is removing the (unnecessary) double quotes on comment fields. Since there are no commas used inside the comments field, miller is likely correct in removing them, but it's not a change I wanted; I only wanted to remove duplicate rows. I have tried the
--quote-original
flag, but with or without it the double quotes are removed when the comments field doesn't contain internal commas. The other issue is that the awk script doesn't care about file format: my file with duplicates could be any text file, since field separators aren't an issue when the entire row is being compared for duplicates.
I see the value, in many circumstances, of being able to make key/value comparisons, but here I was hoping that treating uniqueness as a whole-row comparison would eliminate the two issues mentioned above.
In general, sticking with miller as my one utility for tasks like this is my preference.
Thanks!