excel csv hell# 9 Apr 2017
Just when you think Excel couldn’t suck less, it does when you have to do the most mundane of all things - export a goddamn sheet as a csv.
I’m using Excel 2016 for this example, but I’ve seen this for a bunch of versions of Excel and I can’t imagine the brilliant engineers over at Microsoft have ever looked into making their products compatible with, well, anything.
reading a csv in python
Reading actual csv files in python is usually a fairly trivial affair (using python 2.7):
import csv with open('my_file.csv') as f: reader = csv.reader(f) rows = [row for row in reader]
For example, if I have the following data in Google Sheets, an actual working program built out by an excellent engineering organization that takes its customers’ needs into account:
I can go ahead and read it in just fine (those gnarly
trailing characters on padmé are just the
representation of the e with the accent):
In : with open('sheets_output.csv') as f: ...: reader = csv.reader(f) ...: rows = [row for row in reader] ...: In : rows Out: [['id', 'name'], ['0', 'ryan'], ['1', 'padm\xc3\xa9']]
Now let’s say I have literally the same exact data in Excel.
I then select ‘Save As…’ and see ‘CSV UTF-8’ as a file option (sounds about right, of course). Save the file, and then the fun begins.
In : with open('excel_output.csv') as f: ...: reader = csv.reader(f) ...: rows = [row for row in reader] ...: --------------------------------------------------------------------------- Error Traceback (most recent call last) <ipython-input-29-0a959f4a7dde> in <module>() 1 with open('excel_output.csv') as f: 2 reader = csv.reader(f) ----> 3 rows = [row for row in reader] 4 Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Gross. No idea what that means. I thought I was just exporting some tabular data as a csv, not signing up for eternal damnation to hell.
So some googling indicates I need to change up my syntax a bit to account for universal-newline mode:
In : with open('excel_output.csv', 'rU') as f: ...: reader = csv.reader(f) ...: rows = [row for row in reader] ...: In : rows Out: [['\xef\xbb\xbfid', 'name'], ['0', 'ryan'], ['1', 'padm\xc3\xa9']]
Okay, that’s starting to look at least a bit better, but
what in the flying fuck is that gnarly thing prepended to my
It turns out this is a Byte-Order-Mark (BOM) that Excel had to bake into its shitty csv outputs to explicitly state that this file is in UTF-8 format so that its shitty csv reader functionality would be able to handle reading files. This stack overflow link provides more information than I would ever care to know about this: http://stackoverflow.com/questions/155097/microsoft-excel-mangles-diacritics-in-csv-files
Thankfully someone wrote a python module for handling crazy
csv inputs, and it’s appropriately called
this helps us solve the issue first by reading things in as
unicode and second, specifying the encoding so it can strip
out garbage. It basically acts as a wrapper around the
csv module, so just drop it right in:
In : with open('excel_output.csv', 'rU') as f: ...: reader = unicodecsv.reader(f, encoding='utf-8-sig') ...: rows = [row for row in reader] ...: In : rows Out: [[u'id', u'name'], [u'0', u'ryan'], [u'1', u'padm\xe9']]
Link to the module: https://github.com/jdunck/python-unicodecsv
And it turns out we can use this exact code for reading the
sheets_output.csv file without issue too!
file will show you the high-level details about a file.
⚡ ~/Desktop ⚡ file excel_output.csv excel_output.csv: UTF-8 Unicode (with BOM) text, with CR line terminators ⚡ ~/Desktop ⚡ file sheets_output.csv sheets_output.csv: UTF-8 Unicode text, with CRLF line terminators
I’m a bit late to the python 3 party, but apparently the
csv module now generally handles this stuff alright BUT
you still need the
codecs module to explicitly open it up
correctly to eliminate the BOM:
>>> import codecs >>> import csv >>> with codecs.open('excel_output.csv', encoding='utf-8-sig') as f: ... reader = csv.reader(f) ... rows = [row for row in reader] ... >>> rows [['id', 'name'], ['0', 'ryan'], ['1', 'padmé']]