Fix corrupt UTF-8 files to avoid Python UnicodeDecodeError

I find that files included in Python projects, for example Fortran source files, sometimes contain bytes that are not valid UTF-8, so Python raises UnicodeDecodeError when it tries to read them. Maybe it’s the result of bad OCR, which also plagues LaTeX/BibTeX references copied and pasted from journal websites. This method therefore applies to BibTeX files as well.

I have created a Python script that recursively:

  1. finds such corrupt files
  2. converts from UTF-8 to ASCII
  3. removes the corrupted characters
  4. writes the output file to a temporary location
  5. lets the end user copy the fixed file over the original
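
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names (`find_corrupt_files`, `clean_to_ascii`, `fix_tree`), the default `*.f` glob pattern, and the `utf8fix_` temporary directory prefix are all assumptions made for the example. It treats a file as corrupt if its bytes fail to decode as UTF-8, and it drops every non-ASCII character from the cleaned copy, including valid ones, which may be more aggressive than you want.

```python
import tempfile
from pathlib import Path


def find_corrupt_files(root, pattern="*.f"):
    """Yield files under root (recursively) whose bytes are not valid UTF-8."""
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            yield path


def clean_to_ascii(path, root, out_dir):
    """Write an ASCII-only copy of path under out_dir and return its location."""
    raw = path.read_bytes()
    # errors="ignore" silently drops bytes that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    # Encoding to ASCII with errors="ignore" drops any remaining non-ASCII characters.
    cleaned = text.encode("ascii", errors="ignore")
    # Mirror the source layout so files with the same name do not collide.
    out_path = Path(out_dir) / path.relative_to(root)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(cleaned)
    return out_path


def fix_tree(root, pattern="*.f"):
    """Return (original, cleaned copy) pairs; the originals are left untouched."""
    root = Path(root)
    out_dir = Path(tempfile.mkdtemp(prefix="utf8fix_"))  # hypothetical prefix
    return [
        (path, clean_to_ascii(path, root, out_dir))
        for path in find_corrupt_files(root, pattern)
    ]


if __name__ == "__main__":
    for original, fixed in fix_tree("."):
        print(f"{original} -> {fixed}")
```

Because the originals are never modified, you can diff each cleaned copy against its source before copying it back. For BibTeX files, pass `pattern="*.bib"`.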
