數位音符: Python - Detecting Chinese Text Encodings with chardet

Sunday, September 17, 2017

Python - Detecting Chinese Text Encodings with chardet

[Development environment: Python 3]

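Whenever I scraped web pages, I kept running into encoding problems.
Then I came across chardet, and it turned out to be a really handy tool.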

However, when the input file is large, the analysis takes quite a while.
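One way to ease the slowdown on large files might be chardet's incremental `UniversalDetector`, which can stop reading as soon as it is confident. The following is only a minimal sketch: the helper name `detect_file_encoding` and the throwaway demo file are my own assumptions for illustration, and the early stop only helps when chardet reaches a confident guess before the end of the file.

```python
import os
import tempfile

from chardet.universaldetector import UniversalDetector


def detect_file_encoding(path):
    """Feed a file to chardet line by line and stop once it is confident.

    Hypothetical helper for illustration, not part of chardet itself.
    """
    detector = UniversalDetector()
    with open(path, 'rb') as f:  # chardet works on bytes, so open in binary mode
        for line in f:
            detector.feed(line)
            if detector.done:  # chardet has seen enough to decide
                break
    detector.close()  # finalize the guess
    return detector.result


# Demo on a throwaway file (made-up data, just for illustration)
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(('使用這個偵測真讚\n' * 1000).encode('utf-8'))
    demo_path = tmp.name

result = detect_file_encoding(demo_path)
os.remove(demo_path)
print(result)
```

The result is the same kind of dict that `chardet.detect` returns, with `encoding`, `confidence`, and `language` keys.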

Here is a basic usage example of chardet.detect:

import chardet

# chardet.detect expects bytes, not str
tmpStr = b'this is english'
print('1', chardet.detect(tmpStr))

# The same Chinese sentence, encoded three different ways
tmpStr = '使用這個偵測真讚'

tmpStr1 = bytes(tmpStr, 'utf-8')
print('2', chardet.detect(tmpStr1))

tmpStr2 = bytes(tmpStr, 'big5')
print('3', chardet.detect(tmpStr2))

tmpStr3 = bytes(tmpStr, 'CP950')
print('4', chardet.detect(tmpStr3))

>>>
1 {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
2 {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
3 {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}
4 {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}

The 1.0 and 0.99 values are the confidence, that is, how sure chardet is about its guess on a scale from 0 to 1.
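Since the confidence is only a guess, it can be used to decide whether to trust the detected encoding when decoding. Below is a rough sketch under my own assumptions: the helper name `decode_with_guess`, the 0.8 threshold, and the UTF-8 fallback are all hypothetical choices, not anything chardet prescribes. The `guess` argument is a dict shaped like the detection output shown above.

```python
def decode_with_guess(raw, guess, min_confidence=0.8, fallback='utf-8'):
    """Decode raw bytes using a chardet-style result dict.

    guess looks like the output of chardet.detect, e.g.
    {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}.
    The threshold and fallback are arbitrary illustrative defaults.
    """
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    # Use the fallback when there is no guess or the guess is too weak.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    # errors='replace' substitutes undecodable bytes instead of raising.
    return raw.decode(encoding, errors='replace')


raw = '使用這個偵測真讚'.encode('big5')
guess = {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}
print(decode_with_guess(raw, guess))
```

In real use, the second argument would simply be `chardet.detect(raw)`.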
