CSV is harder than you think: quotes, BOMs, newlines, locales

CSV looks trivial. Comma-separated values, one row per line, what could go wrong? A lot, actually. CSV is barely a format — it's a folklore of conventions that mostly agree, mostly. Every export pipeline meets at least three of the problems below. Here's how to recognize them.

"CSV" is not really standardized

There is RFC 4180 from 2005, which describes the most common CSV dialect. It is not a standard in the IETF sense; it explicitly says "the CSV format has many implementations, and they vary." The variations are exactly where data corruption lives.

Open Excel and save as CSV: you'll get one dialect. Open the same workbook in LibreOffice, save as CSV: a different dialect (different separator depending on locale, different escaping). Python's csv module has a dozen options because each one represents a real CSV variant in the wild.

If you produce CSV that other tools will read, pick a dialect and document it. If you parse CSV from arbitrary sources, write code that tolerates the most common variations and falls back gracefully.

The separator isn't always a comma

In English-speaking locales, the separator is ,. In most of Europe and parts of Latin America, the system separator is ;, because , is the decimal point. Open a CSV file in Excel on a German Windows installation and it'll be interpreted with semicolons. Open the same file on a US installation and it'll be commas.

The fix is to either:

Always export with , and force a UTF-8 BOM (more on BOMs below), so Excel reads it predictably.
Export with the locale's preferred separator and let the user re-open in their tool of choice.
Use a TSV (tab-separated) variant — tabs are rarely present in actual data and skip the comma vs semicolon mess entirely.

The CSV to JSON converter on this site supports comma, semicolon, tab, and pipe. Auto-detect handles the common cases.

Quoting is where the rules diverge

The basic rule everyone agrees on:

If a value contains the separator, the value goes in double quotes: Hello, world → "Hello, world".
If a value contains a quote, the quote is escaped by doubling: She said "hi" → "She said ""hi""".
If a value contains a newline, the value goes in double quotes; the newline is preserved as-is inside the quotes.

Where dialects diverge:

Some dialects use backslash escapes (\") instead of double-doubling.
Some allow only quotes around fields that need them; others quote everything.
Excel's CSV uses =-prefix as a formula indicator, so =2+2 in a cell becomes a formula on import. Smuggling instructions into a CSV cell is a real vulnerability called CSV injection — when exporting user data, prefix any value starting with =, +, -, @, \t, \r with a single quote.

The CSV injection attack: an attacker types =CMD\|"/c calc"!A1 into a form. You export the form responses as CSV. Another user opens the CSV in Excel. Excel, by default, runs the formula. With CMD link, it can execute arbitrary commands. OWASP has documented this since 2014; the fix is to prefix risky values.

Newlines inside values

CSV allows newlines inside quoted fields. A multi-line address looks like:

"123 Main St
Apt 4B
Springfield",IL,62701

That's one record across three lines of text. Naïve parsers that split on \n produce three broken records. Even some "real" parsers (sed/awk scripts, old PHP libraries) break here. Test with multi-line data before you trust your pipeline.

The reverse problem: line endings. Windows uses \r\n, Unix uses \n, classic Mac uses \r. RFC 4180 says CSV should use \r\n. Most parsers are lenient and accept any of the three; some aren't. If you produce CSV and a downstream tool complains, normalize line endings to \r\n before writing.

The BOM problem

UTF-8 BOM is a three-byte sequence (EF BB BF) at the start of a file that signals "this file is UTF-8." It is optional in UTF-8 and mostly redundant — UTF-8 is the dominant encoding and detectable in other ways — but Excel needs it to read non-ASCII characters from CSV correctly.

The two failure modes:

No BOM, accented characters in the file. Excel on Windows reads the file as Windows-1252 (Latin-1), so é becomes Ã© and Ω becomes Î©. Hilarious until it's payroll data.
BOM present, code reading the file naïvely. Your script reads the file, takes the first row as headers, and the first header is now ID instead of ID. The "phantom column" bug. Strip the BOM in your reader or use a parser that does.

When in doubt: write a BOM if Excel users will open the file. Strip a BOM if a script will parse it.

Type inference: necessary, dangerous

CSV has no types. Everything is a string. Tools infer types on read.

1234567890 is probably an integer. Unless it's a phone number, in which case leading zeros are now gone.
0123 could be an integer with a leading zero or a zip code. Excel strips the zero.
2024-06-01 is a date in ISO 8601. Excel might or might not recognize it, depending on locale.
2/3 could be a fraction, a date, or a ratio. Excel will convert it to February 3 and reformat the cell — silently.

The Excel-date-eats-gene-names problem is real: when biologists named genes "SEPT1" or "MARCH1", Excel CSV imports turned them into dates. In 2020 the standards body renamed dozens of genes specifically to avoid this. Real corruption from a real format.

If you produce CSV, document the types. If you consume CSV from an unknown source, treat everything as a string until you can validate it. The CSV to JSON converter lets you preview type inference before committing to a parse.

Empty values, null, and the empty string

CSV has no null literal. An empty cell could mean:

The value is not provided.
The value is the empty string.
The value is null (in some encoding the producer made up).

Common producer conventions: "", NULL, \N (Postgres COPY default), or just two commas with nothing between them (a,,b). None is standard. If you control both ends, pick one and document it. If you don't, your parser needs an nullValue option (the CSV converter has one).

A short checklist before exporting CSV

When generating a CSV file someone will open in a spreadsheet:

Write a UTF-8 BOM if non-ASCII characters might appear.
Quote any field containing the separator, a quote, or a newline.
Escape internal quotes by doubling, not by backslash.
Use \r\n line endings between records.
Prefix any cell starting with =, +, -, @, \t, \r with a single quote (CSV injection defence).
Document the separator. If users in non-comma locales will open the file, consider tab-separated instead.

A short checklist before parsing CSV

When reading a CSV file you didn't produce:

Strip a leading BOM if present.
Accept \r\n, \n, \r as record separators.
Support quoted fields with embedded newlines and doubled internal quotes.
Don't infer types eagerly; let the caller decide.
Treat empty cells, the literal NULL, and \N as configurable (your default is "empty string").
Be ready for inconsistent column counts — some producers omit trailing empty cells.

CSV is one of those formats where the bugs aren't in the spec — they're in the gap between the spec and what tools actually emit. The defensive parser earns its keep.