encoding/csv: for Postgres, unquote empty strings, quote \.

In theory both of these lines encode the same three fields: a,,c a,"",c However, Postgres defines that when importing CSV, the unquoted version is treated as NULL (missing), while the quoted version is treated as a string value (empty string). If the middle field is supposed to be an integer value, the first line can be imported (NULL is okay), but the second line cannot (empty string is not). Postgres's import command (COPY FROM) has an option to force the unquoted empty to be interpreted as a string but it does not have an option to force the quoted empty to be interpreted as a NULL. From http://www.postgresql.org/docs/9.0/static/sql-copy.html: The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's COPY handles this by quoting. A NULL is output as the NULL parameter string and is not quoted, while a non-NULL value matching the NULL parameter string is quoted. For example, with the default settings, a NULL is written as an unquoted empty string, while an empty string data value is written with double quotes (""). Reading values follows similar rules. You can use FORCE_NOT_NULL to prevent NULL input comparisons for specific columns. Therefore printing the unquoted empty is more flexible for imports into Postgres than printing the quoted empty. In addition to making the output more useful with Postgres, not quoting empty strings makes the output smaller and easier to read. It also matches the behavior of Microsoft Excel and Google Drive. Since we are here and making concessions for Postgres, handle this case too (again quoting the Postgres docs): Because backslash is not a special character in the CSV format, \., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically quoted on output, and on input, if quoted, is not interpreted as the end-of-data marker. If you are loading a file created by another application that has a single unquoted column and might have a value of \., you might need to quote that value in the input file. Fixes #7586. LGTM=bradfitz R=bradfitz CC=golang-codereviews https://golang.org/cl/164760043

encoding/csv: for Postgres, unquote empty strings, quote \.
In theory both of these lines encode the same three fields: a,,c a,"",c However, Postgres defines that when importing CSV, the unquoted version is treated as NULL (missing), while the quoted version is treated as a string value (empty string). If the middle field is supposed to be an integer value, the first line can be imported (NULL is okay), but the second line cannot (empty string is not). Postgres's import command (COPY FROM) has an option to force the unquoted empty to be interpreted as a string but it does not have an option to force the quoted empty to be interpreted as a NULL. From http://www.postgresql.org/docs/9.0/static/sql-copy.html: The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's COPY handles this by quoting. A NULL is output as the NULL parameter string and is not quoted, while a non-NULL value matching the NULL parameter string is quoted. For example, with the default settings, a NULL is written as an unquoted empty string, while an empty string data value is written with double quotes (""). Reading values follows similar rules. You can use FORCE_NOT_NULL to prevent NULL input comparisons for specific columns. Therefore printing the unquoted empty is more flexible for imports into Postgres than printing the quoted empty. In addition to making the output more useful with Postgres, not quoting empty strings makes the output smaller and easier to read. It also matches the behavior of Microsoft Excel and Google Drive. Since we are here and making concessions for Postgres, handle this case too (again quoting the Postgres docs): Because backslash is not a special character in the CSV format, \., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically quoted on output, and on input, if quoted, is not interpreted as the end-of-data marker. If you are loading a file created by another application that has a single unquoted column and might have a value of \., you might need to quote that value in the input file. Fixes #7586. LGTM=bradfitz R=bradfitz CC=golang-codereviews https://golang.org/cl/164760043
6ad2749d · Russ Cox · 5361b747 · 6ad2749d · 6ad2749d
Commit 6ad2749d authored Oct 23, 2014 by Russ Cox
Hide whitespace changes
Inline Side-by-side

Showing with 25 additions and 2 deletions

src/encoding/csv/writer.go src/encoding/csv/writer.go +14 -2

src/encoding/csv/writer_test.go src/encoding/csv/writer_test.go +11 -0

No files found.
--- a/src/encoding/csv/writer.go
+++ b/src/encoding/csv/writer.go
@@ -115,10 +115,22 @@ func (w *Writer) WriteAll(records [][]string) (err error) {
 }
 // fieldNeedsQuotes returns true if our field must be enclosed in quotes.
-// Empty fields, files with a Comma, fields with a quote or newline, and
+// Fields with a Comma, fields with a quote or newline, and
 // fields which start with a space must be enclosed in quotes.
+// We used to quote empty strings, but we do not anymore (as of Go 1.4).
+// The two representations should be equivalent, but Postgres distinguishes
+// quoted vs non-quoted empty string during database imports, and it has
+// an option to force the quoted behavior for non-quoted CSV but it has
+// no option to force the non-quoted behavior for quoted CSV, making
+// CSV with quoted empty strings strictly less useful.
+// Not quoting the empty string also makes this package match the behavior
+// of Microsoft Excel and Google Drive.
+// For Postgres, quote the data termating string `\.`.
 func (w *Writer) fieldNeedsQuotes(field string) bool {
-	if len(field) == 0 || strings.IndexRune(field, w.Comma) >= 0 || strings.IndexAny(field, "\"\r\n") >= 0 {
+	if field == "" {
+		return false
+	}
+	if field == `\.` || strings.IndexRune(field, w.Comma) >= 0 || strings.IndexAny(field, "\"\r\n") >= 0 {
 		return true
 	}

--- a/src/encoding/csv/writer_test.go
+++ b/src/encoding/csv/writer_test.go
@@ -28,6 +28,17 @@ var writeTests = []struct {
 	{Input: [][]string{{"abc\ndef"}}, Output: "\"abc\r\ndef\"\r\n", UseCRLF: true},
 	{Input: [][]string{{"abc\rdef"}}, Output: "\"abcdef\"\r\n", UseCRLF: true},
 	{Input: [][]string{{"abc\rdef"}}, Output: "\"abc\rdef\"\n", UseCRLF: false},
+	{Input: [][]string{{""}}, Output: "\n"},
+	{Input: [][]string{{"", ""}}, Output: ",\n"},
+	{Input: [][]string{{"", "", ""}}, Output: ",,\n"},
+	{Input: [][]string{{"", "", "a"}}, Output: ",,a\n"},
+	{Input: [][]string{{"", "a", ""}}, Output: ",a,\n"},
+	{Input: [][]string{{"", "a", "a"}}, Output: ",a,a\n"},
+	{Input: [][]string{{"a", "", ""}}, Output: "a,,\n"},
+	{Input: [][]string{{"a", "", "a"}}, Output: "a,,a\n"},
+	{Input: [][]string{{"a", "a", ""}}, Output: "a,a,\n"},
+	{Input: [][]string{{"a", "a", "a"}}, Output: "a,a,a\n"},
+	{Input: [][]string{{`\.`}}, Output: "\"\\.\"\n"},
 }
 func TestWrite(t *testing.T) {