Replies: 4 comments 2 replies
-
Can you attach some sample lines of input, and the output you would like?
-
Given the attached sample (this is just sample data I found at www.datablist.com and modified) of 5 rows followed by the same 5 rows duplicated, here is the output from the awk script:
Customer Id,First Name,Last Name,Company
Here is the output from the miller script:
Customer Id,First Name,Last Name,Company
Both accomplish the task of removing the duplicate rows, and for my actual csv file miller does it faster, but the awk script is preferable in my opinion because it doesn't sanitize the data formatting and would presumably work with almost any file type. The awk script does just the one thing I wanted: remove duplicates. My original question about slurping in the entire row is asking whether there's a way for miller to act more like awk in this circumstance. Thanks.
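A duplicate-heavy test file like this can be put together by appending a file's data rows back onto itself, e.g. (a sketch only; the filenames are placeholders):
{ cat sample.csv; tail -n +2 sample.csv; } > sample_dupe_file.csv
where tail -n +2 skips the header line so only the data rows are duplicated.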
-
You could use NIDX format.
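Something along these lines should do it (a sketch only; '|' is just a field separator assumed not to occur anywhere in the file, and the filename is the attached sample):
mlr --nidx --fs '|' uniq -a sample_dupe_file.csv > output.csv
With --nidx and a separator that never appears in the data, each line is read as a single positional field, so uniq -a ends up comparing entire rows and writing them back out unchanged, much like the awk one-liner.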
-
Thank you. This works nicely on the sample file, and it's what I'll use going forward. (Can the '|' input field separator be any character, so long as it doesn't appear in the csv file, or does it need to be '|'?)
The issue I now have (with my original file) is that the numbered-index format, nidx, seems to abort if a row is too long. I tried your nidx miller script on my original file and noticed it stopped writing to the output file immediately before a very long row (67,542 characters). I then added that row to sample_dupe_file.csv as the second data row, and when I ran the nidx miller script on it, it also aborted after writing the header and the first data row. So the long row seems to be the culprit.
@johnkerl -- Is there a size limit on how long a numbered-index row can be? I didn't notice this issue previously when choosing --csv format instead of --nidx. Thanks all.
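In case it helps with reproducing, the longest line in a file can be located with plain awk (the filename is a placeholder):
awk '{ if (length($0) > max) { max = length($0); row = NR } } END { print row, max }' original_file.csv
which prints the line number and length of the longest row.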
-
I would like to remove duplicate rows from a csv file, where the duplicate rows do not necessarily appear contiguous to the original row. I was able to use this:
mlr --csv --quote-original uniq -a input.csv > output.csv
successfully, comparing all fields to determine uniqueness.
Is there a way, instead of comparing all fields, to compare the entire row in order to determine uniqueness?
I found this searching for an awk solution to remove duplicates:
awk '!seen[$0]++' your_file.csv > new_file.csv
where the entire row is the key into the associative array.
Is there a way to do something similar in the miller DSL?
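For reference, a rough equivalent in the Miller DSL might be a filter that keys an out-of-stream map on the joined field values (a sketch only, untested; it compares field values rather than the raw line, so it wouldn't address the quoting issue described below any better than uniq -a does):
mlr --csv filter 'var key = joinv($*, ","); var dup = haskey(@seen, key); @seen[key] = 1; !dup' input.csv > output.csv
Here @seen plays the role of awk's seen array, and the final bare-boolean !dup decides whether each record is kept.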
Part of my problem is that miller is removing the (unnecessary) double quotes on comment fields. Since there are no commas used inside the comments field, miller is likely correct in removing them, but it's not a change I wanted; I only wanted to remove duplicate rows. I have tried the
--quote-original
flag, but with or without it the double quotes are removed when the comments field doesn't contain internal commas. The other issue is that the awk script doesn't care about file format: my file with duplicates could be any text file, since field separators aren't an issue when the entire row is being compared for duplicates.
I see the value, in many circumstances, of being able to make key/value comparisons, but here I was hoping that treating uniqueness as a whole-row comparison would eliminate the two issues mentioned above.
In general, sticking with miller as my one utility for tasks like this is my preference.
Thanks!