Sunday, August 28, 2016

Finding the encoding of a file in Python

chardet is a universal character encoding detector for Python. It can detect the encoding of a file and also reports a confidence score for its guess. To get chardet into your environment, install it with your package manager. With pip, that looks like:

pip install chardet

Once chardet is installed, finding the encoding of a file is simple. The package ships with a command-line tool, chardetect. Assuming data.txt is a file in the current directory:

chardetect data.txt

It comes back with "data.txt: ascii with confidence 1.0"

You can also pass multiple files to chardetect:

chardetect data.txt chinese.txt

It comes back with
data.txt: ascii with confidence 1.0
chinese.txt: utf-8 with confidence 0.99

Integrating chardet into a Python program to find the encoding looks like this:

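import chardet

file_name = 'data.txt'
# chardet.detect expects the raw bytes of the file, not its name,
# so open the file in binary mode and pass in its contents.
with open(file_name, 'rb') as f:
    result = chardet.detect(f.read())
# The encoding can be None if chardet cannot make a guess, hence str().
print("Encoding: " + str(result['encoding']) + " Confidence " + str(result['confidence']))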
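
Once you have the detected encoding, the usual next step is to decode the file with it. Below is a minimal sketch of that idea, not part of chardet itself. The function name read_text, the min_confidence threshold of 0.5 and the utf-8 fallback are my own assumptions for illustration. The detector is passed in as an argument, so anything shaped like chardet.detect (bytes in, dict with 'encoding' and 'confidence' out) will work.

```python
def read_text(path, detect, min_confidence=0.5, fallback="utf-8"):
    """Read a file as text using the encoding reported by `detect`.

    `detect` is any callable shaped like chardet.detect: it takes bytes
    and returns a dict with 'encoding' and 'confidence' keys.
    `min_confidence` and `fallback` are illustrative defaults, not
    values recommended by chardet.
    """
    with open(path, "rb") as f:
        raw = f.read()

    result = detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # The encoding can be None when no guess is possible, and a low
    # confidence means the guess is weak, so fall back in both cases.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    # errors="replace" keeps a wrong guess from raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace"), encoding
```

With chardet installed, calling read_text('data.txt', chardet.detect) returns the decoded text together with the encoding that was actually used. Keep in mind that detection is a guess, so if you already know the file's encoding it is safer to pass it to open() directly.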
