Fix corrupt UTF-8 files to avoid Python UnicodeDecodeError


I find that files included in Python projects, for example Fortran source files, sometimes contain bytes that are not valid UTF-8, which makes Python raise UnicodeDecodeError when it reads them. Maybe it’s a case of bad OCR, which also plagues LaTeX/BibTeX references copied and pasted from journal websites. This method therefore applies to BibTeX files as well.
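As a minimal illustration of the failure, a single stray byte is enough to make UTF-8 decoding fail, and the exception itself reports where the offending byte sits. The byte string and the helper name `first_bad_offset` below are made up for this sketch and are not part of the script.

```python
# Hypothetical sample: "café" encoded as Latin-1, so the final byte (0xE9)
# is not a valid UTF-8 sequence on its own.
raw = b"caf\xe9"


def first_bad_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded
        return err.start
    return None


print(first_bad_offset(raw))
print(first_bad_offset("café".encode("utf-8")))
```

Knowing the offset makes it easier to inspect the damaged spot by hand before deciding whether to strip it.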

The Python script “find_bad_characters.py” recursively:

  1. finds such corrupt files
  2. converts them from UTF-8 to ASCII with iconv
  3. removes the characters that cannot be converted
  4. writes the output file to a temporary location

The end user can then copy the fixed file over the original.

#!/usr/bin/env python3
"""
Recursively find files with "bad" characters that Python cannot decode as UTF-8.
Useful for f2py, BibTeX and more.
Michael Hirsch, Ph.D.
"""
import shutil
import subprocess
import warnings
from pathlib import Path
from tempfile import mkstemp

# Only attempt fixes if the iconv command-line tool is on the PATH.
# (Running iconv through a shell does not raise when iconv is missing.)
FIX = shutil.which('iconv') is not None

def scanbadchar(path,ext):
    """
    ext: file extension INCLUDING PERIOD
    """
    path = Path(path).expanduser()
    if path.is_file():
        flist = [path]
    elif path.is_dir():
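        # search recursively, skipping directories (which cannot be read as text)
        flist = (f for f in path.rglob('*' + ext) if f.is_file())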
    else:
        raise FileNotFoundError(f'{path} not found')

    for f in flist:
        try:
            # read as UTF-8 explicitly; read_text also closes the file
            f.read_text(encoding='utf-8')
        except UnicodeDecodeError:
            warnings.warn(f'BAD character in {f}')
            if FIX:
                fd, ofn = mkstemp(suffix=f.suffix)
                print(f'{f} => {ofn}')
                # iconv -c drops characters it cannot convert; it may exit with
                # status 1 when it dropped any, even though the output was written.
                with open(fd, 'wb') as out:
                    subprocess.run(['iconv', '-c', '-f', 'utf-8', '-t', 'ascii', str(f)],
                                   stdout=out, timeout=1)
                subprocess.run(['diff', str(f), ofn], timeout=1)
                print('---------------')

if __name__ == '__main__':
    import signal
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    
    from argparse import ArgumentParser
    p = ArgumentParser()
    p.add_argument('path',help='top path to search')
    p.add_argument('ext',help='file extension WITH PERIOD',nargs='?',default='')
    p = p.parse_args()

    scanbadchar(p.path,p.ext)
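
On systems without iconv, roughly the same cleanup can be sketched in pure Python. This is an alternative under stated assumptions rather than part of the script above: the helper name `strip_non_ascii` is hypothetical, and `errors="ignore"` silently drops both invalid UTF-8 bytes and valid non-ASCII characters, which is approximately what `iconv -c -f utf-8 -t ascii` does.

```python
from pathlib import Path
from tempfile import mkstemp


def strip_non_ascii(src):
    """Write an ASCII-only copy of src to a temporary file and return its path.

    Hypothetical helper, not part of find_bad_characters.py.
    """
    src = Path(src)
    raw = src.read_bytes()
    # Drop byte sequences that are not valid UTF-8, then drop any remaining
    # characters that have no ASCII representation.
    text = raw.decode("utf-8", errors="ignore")
    clean = text.encode("ascii", errors="ignore")

    fd, tmpname = mkstemp(suffix=src.suffix)
    with open(fd, "wb") as out:  # open() takes ownership of the descriptor
        out.write(clean)
    return Path(tmpname)
```

As with the iconv route, the original file is left untouched, so the cleaned copy can be compared against it before copying it over.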
