Understanding Pandas read_csv and read_excel errors


In data science we often deal with messy, heterogeneous data stored in a variety of file types. Pandas is a powerful Python tool for working with such data. A simple but fairly common mistake is using the wrong Pandas function to read a file: read_excel on CSV data, or read_csv on an Excel spreadsheet.

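Note: Older Pandas versions cannot read ODS (OpenDocument Spreadsheet) files, so LibreOffice/OpenOffice users on those versions need to convert ODS data to XLSX first. Pandas 0.25 and later can read ODS directly with read_excel(..., engine="odf") once the odfpy package is installed.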

Pandas wrong function format errors

Obviously, the answer is to use:

  • read_csv() for .csv and .tsv files (pass sep="\t" for .tsv)
  • read_excel() for .xls and .xlsx files

but sometimes simple mistakes lead us to this webpage.
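The rule above can be sketched as a small dispatcher that picks the reader from the file extension. This is only an illustration: READERS, reader_name and load_table are hypothetical names, not part of Pandas, and the extension is trusted to match the file's real contents.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of the Pandas reader.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
}


def reader_name(path):
    """Return the name of the Pandas function suited to this file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader known for {suffix!r} files")
    return READERS[suffix]


def load_table(path, **kwargs):
    """Read a table with the reader chosen by reader_name()."""
    import pandas as pd  # imported lazily so reader_name() works without Pandas

    if Path(path).suffix.lower() == ".tsv":
        # read_csv splits on commas by default, so tab-separated files need sep.
        kwargs.setdefault("sep", "\t")
    return getattr(pd, reader_name(path))(path, **kwargs)
```

Calling load_table("data.xlsx") would then go through read_excel, and an unrecognized suffix fails early with a ValueError instead of a confusing parser error.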

read_excel → .csv

This mistake leads to errors including:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b''

read_csv → .xlsx

This mistake leads to errors including:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 1: invalid start byte

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
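Both kinds of error come from a mismatch between what the file really contains and the reader used, and a misleading file extension can hide that. One rough way to check is to look at the first few bytes of the file. The sketch below assumes that .xlsx files begin with the ZIP signature and .xls files with the OLE2 signature; sniff_format is a hypothetical helper, and other ZIP-based files (for example .ods or .docx) would also be reported as "xlsx".

```python
XLSX_MAGIC = b"PK\x03\x04"  # ZIP container, used by .xlsx
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container, used by .xls


def sniff_format(path):
    """Guess 'xlsx', 'xls' or 'text' from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(XLSX_MAGIC):
        return "xlsx"
    if head.startswith(XLS_MAGIC):
        return "xls"
    # Anything else is treated as delimited text such as CSV or TSV.
    return "text"
```

If sniff_format reports "text" for a file named .xlsx, read_csv is probably the right reader, and the reverse holds for a .csv file that reports "xlsx" or "xls".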

Pandas prereqs

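To keep its required dependencies small, Pandas leaves the Excel reader packages as optional installs until read_excel is actually used. xlrd reads .xls files; since xlrd 2.0 it no longer reads .xlsx, for which current Pandas versions use openpyxl. So simply do:

pip install pandas xlrd openpyxl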
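To see ahead of time whether the optional reader packages are present, a small check like the following may help. It is a sketch only: missing_engines is a hypothetical helper, and the default package names are an assumption (xlrd from the text above, plus openpyxl, which newer Pandas versions use for .xlsx).

```python
import importlib.util


def missing_engines(names=("xlrd", "openpyxl")):
    """Return the packages from names that cannot be found for import."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list means every listed package can be imported; anything else is what still needs a pip install.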
