數位音符

Sunday, September 17, 2017

Python - How Unicode strings and bytes strings relate, and how to convert between them

Python 3 uses Unicode strings by default.
Even so, encoding problems come up all the time when handling articles or web pages, so here is a summary.

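1. Creating Unicode strings and bytes strings
2. Converting between Unicode and bytes strings
3. Using encode and decode to switch between Unicode and other encodings
4. To convert between different encodings, go back to Unicode first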

=============================================================

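1. Creating Unicode strings and bytes strings
(Standard literal syntax. A literal prefixed with b may contain only ASCII characters.)
Any string we declare is Unicode by default; there is no need to add a small u in front.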

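tmpStr0 = 'byte'  # a plain literal is Unicode (str) by default
tmpStr1 = b'byte'  # the b prefix creates a bytes object
tmpStr2 = 'Unicode字串'
tmpStr3 = u'Unicode字串'  # the u prefix is accepted but redundant in Python 3
print(type(tmpStr0))
print(type(tmpStr1))
print(type(tmpStr2))
print(type(tmpStr3))

if tmpStr2 == tmpStr3:
    print("A default string is a Unicode string")

>>>
<class 'str'>
<class 'bytes'>
<class 'str'>
<class 'str'>
A default string is a Unicode string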
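To illustrate the ASCII-only rule for b'' literals, here is a minimal sketch. The variable names are illustrative and not from the original example. It shows that non-ASCII byte values can still be written with \x escapes or produced by encoding a str, while putting the character itself inside a bytes literal is rejected.

```python
# A bytes literal may contain only ASCII characters; other byte values
# must be written as \x escapes or produced by encoding a str.
escaped_bytes = b'\xe5\xad\x97'        # the UTF-8 bytes of '字', written as escapes
encoded_bytes = '字'.encode('utf-8')    # the same bytes, produced from a str
print(escaped_bytes == encoded_bytes)

# Writing non-ASCII text directly inside b'...' is a SyntaxError.
# compile() is used so the error can be caught while the script runs.
try:
    compile("b'字'", '<example>', 'eval')
    literal_allowed = True
except SyntaxError:
    literal_allowed = False
print(literal_allowed)
```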

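2. Converting between Unicode and bytes strings

Python 3 strings are Unicode by default, and there are only two string types: 'str' and 'bytes'.

When you create a string, it is a Unicode str.
Once you encode it into any specific encoding, the result is a bytes object (an array of bytes).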

tmpStr = '讚呀'
tmpStr1 = tmpStr
print("tmpStr1", tmpStr1)
print("tmpStr1", type(tmpStr1))  # original string, type is str

tmpStr2 = tmpStr.encode('big5')
print("tmpStr2", tmpStr2)
print("tmpStr2", type(tmpStr2))  # encoded string, type becomes bytes

tmpStr3 = bytes(tmpStr, 'big5')
print("tmpStr3", tmpStr3)
print("tmpStr3", type(tmpStr3))  # encoded string, type becomes bytes

if tmpStr2 == tmpStr3:
    print("Both conversions give the same result")
else:
    print("The two conversions differ")

Output:

tmpStr1 讚呀
tmpStr1 <class 'str'>
tmpStr2 b'\xc6g\xa7r'
tmpStr2 <class 'bytes'>
tmpStr3 b'\xc6g\xa7r'
tmpStr3 <class 'bytes'>
Both conversions give the same result
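One way to see the difference between str and bytes is to compare lengths: a str counts characters, while bytes counts encoded bytes, and that count depends on the encoding. This is a small sketch using the same sample text; the names as_big5 and as_utf8 are illustrative.

```python
text = '讚呀'
as_big5 = text.encode('big5')
as_utf8 = text.encode('utf-8')

print(len(text))     # number of characters in the str
print(len(as_big5))  # number of bytes when encoded as Big5
print(len(as_utf8))  # number of bytes when encoded as UTF-8

# Indexing a str gives a one-character str; indexing bytes gives an int.
print(text[0], as_big5[0])
```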

3. Using encode and decode to switch between Unicode and other encodings
A string starts out as type 'str'. Encoding it into a specific encoding turns it into bytes,
and decoding it turns it back into 'str'.

tmpStr = '讚呀'
tmpStr2 = tmpStr

tmpStr2 = tmpStr2.encode('big5')  # once encoded, tmpStr2 becomes bytes
print('1', type(tmpStr2))

tmpStr2 = tmpStr2.decode('big5')  # after decode, it is 'str' again
print('2', type(tmpStr2))

Output:

1 <class 'bytes'>
2 <class 'str'>
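decode only round-trips when it is given the same encoding that produced the bytes. The sketch below, which is not part of the original example, decodes Big5 bytes as UTF-8 to show the failure, then shows the errors='replace' option, which substitutes U+FFFD for bytes it cannot decode instead of raising.

```python
big5_bytes = '讚呀'.encode('big5')

# Decoding with the wrong codec raises UnicodeDecodeError.
try:
    big5_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)

# errors='replace' avoids the exception but loses information.
lossy = big5_bytes.decode('utf-8', errors='replace')
print(lossy)

# The matching codec restores the original text.
print(big5_bytes.decode('big5'))
```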

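4. To convert between different encodings, go back to Unicode first
The example below converts Big5 to UTF-8: decode the Big5 bytes back to a Unicode str, then encode that str as UTF-8.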

tmpStr = '讚呀'
tmpStr2 = tmpStr

tmpStr2 = tmpStr2.encode('big5')  # once encoded, tmpStr2 becomes bytes
print('1', type(tmpStr2))

tmpStr2 = tmpStr2.decode('big5')  # after decode, it is 'str' again
print('2', type(tmpStr2))

tmpStr2 = tmpStr2.encode('UTF-8')  # encode the str again, this time as UTF-8
print('3', type(tmpStr2))

Output:

1 <class 'bytes'>
2 <class 'str'>
3 <class 'bytes'>
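The decode-then-encode pattern above can be wrapped in a small helper. This is only a sketch; transcode is a hypothetical name introduced here, not a standard library function.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another by going through str."""
    text = data.decode(source)   # bytes in the source encoding -> Unicode str
    return text.encode(target)   # Unicode str -> bytes in the target encoding

big5_bytes = '讚呀'.encode('big5')
utf8_bytes = transcode(big5_bytes, 'big5', 'utf-8')
print(utf8_bytes)
print(utf8_bytes.decode('utf-8'))
```

Going through str keeps each step explicit: a failure in decode means the input was not really in the source encoding, and a failure in encode means the target encoding cannot represent some character.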

Python - Detecting Chinese encodings with chardet

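[Development environment: Python 3]

After scraping web pages, I kept running into encoding problems.
When I came across chardet, I thought it was a really useful tool.

However, detection takes quite a while when the file is large.

Here is a usage example: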

import chardet

tmpStr = b'this is english'
print('1', chardet.detect(tmpStr))

tmpStr = '使用這個偵測真讚'

tmpStr1 = bytes(tmpStr, 'utf-8')
print('2', chardet.detect(tmpStr1))

tmpStr2 = bytes(tmpStr, 'big5')
print('3', chardet.detect(tmpStr2))

tmpStr3 = bytes(tmpStr, 'CP950')
print('4', chardet.detect(tmpStr3))

>>>
1 {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
2 {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
3 {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}
4 {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}

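The 1.0 and 0.99 values in 'confidence' indicate how sure chardet is of its guess. Note that the CP950 bytes in case 4 are also reported as Big5.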
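A natural next step is to use the detected encoding to decode the bytes. The sketch below assumes a result dict shaped like the chardet.detect output shown above. decode_with_guess, min_confidence and fallback are hypothetical names introduced for illustration, and the 0.8 threshold is an arbitrary assumption. The dict is written by hand so the snippet runs without chardet installed.

```python
def decode_with_guess(data, guess, min_confidence=0.8, fallback='utf-8'):
    """Decode bytes using a chardet-style result dict.

    guess is expected to look like the output shown above, e.g.
    {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}.
    """
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    # Fall back when there is no guess or the guess is too uncertain.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    return data.decode(encoding, errors='replace')

raw = '使用這個偵測真讚'.encode('big5')
guess = {'encoding': 'Big5', 'confidence': 0.99, 'language': 'Chinese'}
print(decode_with_guess(raw, guess))
```

With chardet installed, the guess would come from chardet.detect(raw) instead of a hand-written dict.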

Friday, September 15, 2017

Encodings

Common encodings for Chinese text

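Encoding	Description
ASCII	Standard for English text, 1 byte per character
Unicode	Character set that assigns a code point to every character; in the older UCS-2 form each character takes 2 bytes
UTF-8	Variable-length encoding of Unicode (1 to 4 bytes per character); ASCII characters keep their single-byte values
Big5	Common in Taiwan, Hong Kong and Macau; used by Traditional Chinese Windows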
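The per-character sizes in the table can be checked directly in Python. This is a small sketch with illustrative names. Python has no codec named "Unicode", so 'utf-16-le' is used here as a stand-in for the 2-byte form.

```python
samples = ['A', '中']
encodings = ['utf-8', 'utf-16-le', 'big5']

# Record how many bytes each sample character needs in each encoding.
sizes = {}
for ch in samples:
    for enc in encodings:
        sizes[(ch, enc)] = len(ch.encode(enc))
        print(repr(ch), enc, sizes[(ch, enc)])

# ASCII cannot represent Chinese characters at all.
try:
    '中'.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```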

The following descriptions are from Wikipedia.

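Unicode

The part of Unicode most widely used in practice corresponds to UCS-2, which uses a 16-bit code space, so each of those characters occupies 2 bytes.

Characters in the Basic Multilingual Plane are written as U+hhhh, where each h is a hexadecimal digit; this is identical to the UCS-2 encoding. In the corresponding 4-byte UCS-4 encoding, the last two bytes are the same and all bits of the first two bytes are 0.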
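The U+hhhh notation and the UCS-2 / UCS-4 relationship can be seen with ord and encode. This is a sketch under one assumption: Python has no 'ucs-2' or 'ucs-4' codec names, so 'utf-16-be' and 'utf-32-be' are used, which give the same bytes for Basic Multilingual Plane characters.

```python
ch = '中'
code_point = ord(ch)
label = 'U+%04X' % code_point    # the U+hhhh form of the code point
print(label)

two_byte = ch.encode('utf-16-be')    # same bytes as UCS-2 for a BMP character
four_byte = ch.encode('utf-32-be')   # same bytes as UCS-4
print(two_byte.hex(), four_byte.hex())

# First two bytes are zero; last two bytes match the 2-byte form.
print(four_byte[:2] == b'\x00\x00', four_byte[2:] == two_byte)
```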

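UTF-8 (8-bit Unicode Transformation Format)

UTF-8 is a variable-length character encoding for Unicode, and also a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings remain compatible with ASCII, so software written to handle ASCII text can keep working with little or no modification. As a result, it has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text.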
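Both properties, ASCII compatibility and the prefix structure, can be observed from Python. In this short sketch with illustrative names, the bit patterns show the lead byte of a 3-byte sequence starting with 1110 and each continuation byte starting with 10.

```python
# Pure ASCII text encodes to the same bytes in UTF-8 and ASCII.
ascii_text = 'hello'
same_bytes = ascii_text.encode('utf-8') == ascii_text.encode('ascii')
print(same_bytes)

# A Chinese character takes three bytes; print each byte in binary.
utf8 = '中'.encode('utf-8')
bits = [format(b, '08b') for b in utf8]
print(bits)
```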

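Big5

Also known as 大五碼 or 五大碼, Big5 is the most commonly used computer character set standard for Chinese characters in Traditional Chinese communities. It contains 13,060 Chinese characters.

Chinese character codes fall into two categories: internal codes and interchange codes. Big5 is an internal code; well-known Chinese interchange codes include CCCII and CNS11643.

Although Big5 is widespread in Traditional Chinese regions such as Taiwan, Hong Kong and Macau, for a long time it was not a national, regional or official standard there, only an industry standard. Major systems such as the ETen Chinese System and the Traditional Chinese edition of Windows base their character sets on Big5, but each vendor added its own user-defined characters and user-defined character areas, which led to several different versions.

In 2003, Big5 was included in an appendix of the CNS11643 Chinese Standard Interchange Code, which gave it a more official status. This latest version is called Big5-2003.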

Python 3 divides text data into str and bytes.
