excel csv hell

Just when you think Excel couldn’t possibly suck any more, it does - the moment you have to do the most mundane of all things: export a goddamn sheet as a csv.

I’m using Excel 2016 for this example, but I’ve seen this for a bunch of versions of Excel and I can’t imagine the brilliant engineers over at Microsoft have ever looked into making their products compatible with, well, anything.

reading a csv in python

Reading actual csv files in python is usually a fairly trivial affair (using python 2.7):

import csv

with open('my_file.csv') as f:
    reader = csv.reader(f)
    rows = [row for row in reader]

Google Sheets

For example, if I have the following data in Google Sheets, an actual working program built out by an excellent engineering organization that takes its customers’ needs into account:

[screenshot: source_data, a sheet with an id column and a name column]

I can go ahead and read it in just fine (those gnarly trailing bytes on padmé are just the UTF-8 encoding of the accented é):

In [24]: with open('sheets_output.csv') as f:
    ...:     reader = csv.reader(f)
    ...:     rows = [row for row in reader]
    ...:

In [25]: rows
Out[25]: [['id', 'name'], ['0', 'ryan'], ['1', 'padm\xc3\xa9']]

Excel

Now let’s say I have the exact same data in Excel.

I then select ‘Save As…’ and see ‘CSV UTF-8’ as a file format option (sounds about right, of course). I save the file, and then the fun begins.

In [29]: with open('excel_output.csv') as f:
    ...:     reader = csv.reader(f)
    ...:     rows = [row for row in reader]
    ...:
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-29-0a959f4a7dde> in <module>()
      1 with open('excel_output.csv') as f:
      2     reader = csv.reader(f)
----> 3     rows = [row for row in reader]
      4

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Gross. No idea what that means. I thought I was just exporting some tabular data as a csv, not signing up for eternal damnation to hell.
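
Poking at the raw bytes makes the error slightly less mysterious (a quick sketch; the exact bytes depend on your data, but the important part is the line endings):

# open in binary mode so nothing gets translated for us
with open('excel_output.csv', 'rb') as f:
    raw = f.read()

print repr(raw)
# something like '...id,name\r0,ryan\r1,padm\xc3\xa9...'
# note the bare \r (carriage return) line terminators - no \n anywhere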

So some googling indicates I need to change up my syntax a bit to account for universal-newline mode - opening the file with 'rU' tells python 2 to treat those bare \r line endings as proper line breaks:

In [30]: with open('excel_output.csv', 'rU') as f:
    ...:     reader = csv.reader(f)
    ...:     rows = [row for row in reader]
    ...:

In [31]: rows
Out[31]: [['\xef\xbb\xbfid', 'name'], ['0', 'ryan'], ['1', 'padm\xc3\xa9']]

Okay, that’s starting to look at least a bit better, but what in the flying fuck is that gnarly thing prepended to my id column!?

It turns out this is a byte order mark (BOM) that Excel bakes into its shitty csv output to explicitly flag the file as UTF-8, so that its own shitty csv reader can work out the encoding when it reads the file back in. This stack overflow thread provides more information than I would ever care to know about it: http://stackoverflow.com/questions/155097/microsoft-excel-mangles-diacritics-in-csv-files
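
If you want to convince yourself that’s really all it is, here’s a quick sanity check in the same python 2.7 shell (a sketch; 'utf-8-sig' is the codec that knows to throw the BOM away):

import codecs

# the three bytes glued onto 'id' are exactly the UTF-8 byte order mark
print repr(codecs.BOM_UTF8)                       # '\xef\xbb\xbf'

# decoding with 'utf-8-sig' strips it; plain 'utf-8' keeps it around as u'\ufeff'
print repr('\xef\xbb\xbfid'.decode('utf-8-sig'))  # u'id'
print repr('\xef\xbb\xbfid'.decode('utf-8'))      # u'\ufeffid'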

Thankfully someone wrote a python module for handling crazy csv inputs, and it’s appropriately called unicodecsv. It solves the issue in two ways: it reads everything in as unicode, and it lets you specify the encoding so it can strip out the garbage. It basically acts as a wrapper around the builtin csv module, so you can drop it right in:

In [54]: import unicodecsv

In [55]: with open('excel_output.csv', 'rU') as f:
    ...:     reader = unicodecsv.reader(f, encoding='utf-8-sig')
    ...:     rows = [row for row in reader]
    ...:

In [56]: rows
Out[56]: [[u'id', u'name'], [u'0', u'ryan'], [u'1', u'padm\xe9']]

Link to the module: https://github.com/jdunck/python-unicodecsv

And it turns out we can use this exact code for reading the sheets_output.csv file without issue too! The 'utf-8-sig' codec happily decodes files that don’t have a BOM, and 'rU' doesn’t care whether the lines end in CR or CRLF.
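
As a quick check, here’s the same snippet pointed at the Google Sheets export (a sketch; it should give back the same rows we saw earlier, just as unicode):

import unicodecsv

with open('sheets_output.csv', 'rU') as f:
    reader = unicodecsv.reader(f, encoding='utf-8-sig')
    rows = [row for row in reader]

# rows == [[u'id', u'name'], [u'0', u'ryan'], [u'1', u'padm\xe9']]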

encoding details

The file command will show you the high-level details about a file, including its encoding and line terminators:

⚡  ~/Desktop ⚡  file excel_output.csv
excel_output.csv: UTF-8 Unicode (with BOM) text, with CR line terminators
⚡  ~/Desktop ⚡  file sheets_output.csv
sheets_output.csv: UTF-8 Unicode text, with CRLF line terminators

python 3

I’m a bit late to the python 3 party, but apparently the csv module now generally handles this stuff alright, BUT you still have to explicitly open the file with the 'utf-8-sig' encoding to eliminate the BOM. codecs.open is one way to do that:

>>> import codecs
>>> import csv
>>> with codecs.open('excel_output.csv', encoding='utf-8-sig') as f:
...     reader = csv.reader(f)
...     rows = [row for row in reader]
...
>>> rows
[['id', 'name'], ['0', 'ryan'], ['1', 'padmé']]
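
For what it’s worth, python 3’s builtin open takes an encoding argument directly, so you can skip codecs entirely. Continuing the same interpreter session (a sketch; newline='' is what the csv docs recommend when handing a file to csv.reader):

>>> with open('excel_output.csv', encoding='utf-8-sig', newline='') as f:
...     rows = [row for row in csv.reader(f)]
...
>>> rows
[['id', 'name'], ['0', 'ryan'], ['1', 'padmé']]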