Replies: 2 comments
-
OK, it seems I've mostly resolved this with the following field parsers. Posting them in case they're helpful to anyone else:

let quotedField = Parse {
  "\"".utf8
  Prefix { $0 != .init(ascii: "\"") }
  "\"".utf8
}

let complexQuotedField = Parse {
  "\"".utf8
  Prefix { $0 != .init(ascii: "\"") }
  "\"".utf8
  ",".utf8
}

let field = OneOf {
  OneOf {
    quotedField
    complexQuotedField
  }
  plainField
}
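
For anyone who wants to see what `quotedField` does to its input without pulling in the library, here is a minimal standard-library sketch. `parseQuotedField` is a hypothetical helper written for illustration, not part of swift-parsing; it assumes the same rule as above (opening quote, everything up to the next quote, closing quote).

```swift
// Hypothetical helper (not part of swift-parsing): mimics `quotedField`
// using only the standard library.
func parseQuotedField(_ input: inout Substring.UTF8View) -> Substring.UTF8View? {
  let quote = UInt8(ascii: "\"")
  // Work on a copy so the input is untouched when parsing fails.
  var rest = input
  // Opening quote.
  guard rest.first == quote else { return nil }
  rest = rest.dropFirst()
  // Everything up to (not including) the next quote.
  let body = rest.prefix(while: { $0 != quote })
  rest = rest.dropFirst(body.count)
  // Closing quote.
  guard rest.first == quote else { return nil }
  input = rest.dropFirst()
  return body
}

var input = Substring("\"a,b\",c").utf8
let field = parseQuotedField(&input)
print(field.map({ String(decoding: $0, as: UTF8.self) }) ?? "nil")  // a,b
print(String(decoding: input, as: UTF8.self))                       // ,c

// A field with an embedded (escaped) quote stops at the first inner quote.
var escapedInput = Substring("\"say \\\"hi\\\"\"").utf8
let truncated = parseQuotedField(&escapedInput)
print(truncated.map({ String(decoding: $0, as: UTF8.self) }) ?? "nil")  // say \
```

The second call shows the limitation of this rule: the commas inside the quotes are handled, but an embedded quote ends the field early.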
-
I'm parsing a bunch of CSV data and used this as a starting point. However, I needed to account for fields that can contain quotes, which would of course be escaped, so the escape character needs handling as well. The code above will process a quote-delimited field that does not contain quote characters, but it will barf on a field that contains an escaped quote. What I've come up with is a parser that specifically handles quote-delimited fields that use the backslash character as the escape character:

public struct QuotedFieldParser<Input: Collection>: Parser
where
  Input.SubSequence == Input,
  Input.Element == UTF8.CodeUnit
{
  enum FieldError: Error {
    case unimplemented
    case logicError
    case malformed
  }

  enum ParsingState {
    case notStarted
    case midField
    case escapeSequence
  }

  public init() {}

  public func parse(_ input: inout Input) throws -> Substring.UTF8View {
    // Collect raw bytes and decode once at the end, so multi-byte UTF-8
    // characters are not decoded one byte at a time.
    var bytes: [UInt8] = []
    var state = ParsingState.notStarted
    let quote: UInt8 = .init(ascii: "\"")
    let backslash: UInt8 = .init(ascii: "\\")
    while !input.isEmpty {
      let character = input.removeFirst()
      switch state {
      case .notStarted:
        // A quoted field must open with a quote.
        guard character == quote else { throw FieldError.malformed }
        state = .midField
      case .midField:
        guard character != quote else {
          // Reached the closing quote.
          return String(decoding: bytes, as: UTF8.self)[...].utf8
        }
        if character == backslash {
          state = .escapeSequence
        } else {
          bytes.append(character)
        }
      case .escapeSequence:
        // Take the escaped byte literally.
        bytes.append(character)
        state = .midField
      }
    }
    // Ran out of input before the closing quote.
    throw FieldError.malformed
  }
}

extension QuotedFieldParser: ParserPrinter where Input: PrependableCollection {
  public func print(_ output: Substring.UTF8View, into input: inout Input) throws {
    let quote: UInt8 = .init(ascii: "\"")
    let backslash: UInt8 = .init(ascii: "\\")
    // Field starts with a quote, even when the field is empty.
    var escaped: [UInt8] = [quote]
    for element in output {
      if element == quote || element == backslash {
        // Write quotes and backslashes with a leading backslash,
        // including when they are the first character of the field.
        escaped.append(backslash)
      }
      escaped.append(element)
    }
    // Field ends with a quote.
    escaped.append(quote)
    input.prepend(contentsOf: escaped)
  }
}

This allows me to parse a single CSV field as:

struct FieldParser: ParserPrinter {
  var body: some ParserPrinter<Substring.UTF8View, String> {
    let quotedField: QuotedFieldParser<Substring.UTF8View> = QuotedFieldParser()
    return OneOf {
      quotedField
      Prefix { $0 != UInt8(ascii: ",") && $0 != UInt8(ascii: "\n") }
    }
    .map(.string)
  }
}

...and then parse a line of CSV fields as:

struct LineParser: ParserPrinter {
  var body: some ParserPrinter<Substring.UTF8View, [String]> {
    Many {
      FieldParser()
    } separator: {
      ",".utf8
    }
  }
}

Finally, a full CSV data dump can be parsed as a bunch of lines:

struct CSVParser: ParserPrinter {
  var body: some ParserPrinter<Substring.UTF8View, [[String]]> {
    Many {
      LineParser()
    } separator: {
      "\n".utf8
    } terminator: {
      End()
    }
  }
}

I feel like my
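
To sanity-check the backslash escaping rule on its own, here is a minimal standalone sketch that uses only the standard library. `escapeField`, `unescapeField`, and `QuotedFieldError` are hypothetical names made up for this illustration; they are not part of swift-parsing, and they only mirror the print and parse rules described above (wrap in quotes, prefix quotes and backslashes with a backslash).

```swift
enum QuotedFieldError: Error { case malformed }

let quote = UInt8(ascii: "\"")
let backslash = UInt8(ascii: "\\")

// Hypothetical helper: wraps `field` in quotes, escaping quotes and backslashes.
func escapeField(_ field: String) -> String {
  var bytes: [UInt8] = [quote]
  for byte in field.utf8 {
    if byte == quote || byte == backslash { bytes.append(backslash) }
    bytes.append(byte)
  }
  bytes.append(quote)
  return String(decoding: bytes, as: UTF8.self)
}

// Hypothetical helper: reverses `escapeField`.
func unescapeField(_ text: String) throws -> String {
  var bytes: [UInt8] = []
  var iterator = text.utf8.makeIterator()
  // The field must open with a quote.
  guard iterator.next() == quote else { throw QuotedFieldError.malformed }
  while let byte = iterator.next() {
    if byte == quote {
      // Closing quote: anything after it is not part of this field.
      return String(decoding: bytes, as: UTF8.self)
    }
    if byte == backslash {
      // Take the byte after a backslash literally.
      guard let escaped = iterator.next() else { throw QuotedFieldError.malformed }
      bytes.append(escaped)
    } else {
      bytes.append(byte)
    }
  }
  // Ran out of input before the closing quote.
  throw QuotedFieldError.malformed
}

let original = "She said \"héllo\" \\ bye"
let escaped = escapeField(original)
let roundTripped = try unescapeField(escaped)
print(roundTripped == original)  // true
```

Working on bytes and decoding once at the end keeps non-ASCII characters such as `é` intact through the round trip.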
-
Forgive me, as I'm not caught up on all the Parsing videos and don't completely understand how to use this library yet. However, I'm trying to parse CSV data using parsers similar to those in the benchmarking example.
My CSV has some fields that are always quoted but can contain nested quotes, commas, and a lot of other data, so I'm trying to determine how to parse all the data up to the end of the field. All the fields should end with a double quote and a comma (",). Here's an example of one of the complex lines of the CSV data.
Anyway, any suggestions or help are welcome. TIA.
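
The rule described above (a quoted field runs until a double quote followed by a comma) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `splitQuotedLine` is a hypothetical helper, every field is assumed to be quoted, the last field is assumed to end with a quote at the end of the line, and a nested quote that happens to be followed directly by a comma would still split the field early.

```swift
// Hypothetical helper: splits one CSV line in which every field is quoted
// and a field only ends at `",` or at a quote that closes the line.
func splitQuotedLine(_ line: String) -> [String]? {
  let bytes = Array(line.utf8)
  let quote = UInt8(ascii: "\"")
  let comma = UInt8(ascii: ",")
  var fields: [String] = []
  var index = 0
  while index < bytes.count {
    // Each field must open with a quote.
    guard bytes[index] == quote else { return nil }
    let start = index + 1
    var end = start
    // Advance until a quote that is followed by a comma or ends the line.
    while end < bytes.count {
      let closesField = bytes[end] == quote
        && (end + 1 == bytes.count || bytes[end + 1] == comma)
      if closesField { break }
      end += 1
    }
    // No closing quote found.
    guard end < bytes.count else { return nil }
    fields.append(String(decoding: bytes[start..<end], as: UTF8.self))
    // Skip the closing quote and the comma, if any.
    index = end + 2
  }
  return fields
}

let line = "\"plain\",\"has \"nested\" quotes, and commas\",\"last\""
print(splitQuotedLine(line) ?? [])
```

Because the nested quotes are not escaped, the split is inherently ambiguous; if the data can be produced with an escape character instead, a stricter parser becomes possible.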