In Python 3.x, text strings (str) are Unicode by default. This type corresponds to the unicode type of Python 2.x, not to its str type, which held bytes.
Unicode is a character encoding standard designed to ease the handling of text in multiple languages, including those based on ideograms and those used in texts of dead languages. The term Unicode derives from the goals pursued during the project's development: universality, uniformity, and uniqueness.
In Unicode, alphabetic characters, ideograms, and symbols are treated equivalently and can be mixed freely within the same text. A single paragraph can therefore contain characters from the Arabic, Cyrillic, and Latin alphabets alongside Japanese ideograms and musical symbols.
One-character Unicode strings can also be created with the chr() built-in function, which takes an integer and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function, which takes a one-character Unicode string and returns the code point value:
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344
The unicode_escape encoding will convert all the various ways of entering Unicode characters: the '\x00' syntax, the '\u0000' syntax, and even the '\N{name}' syntax.
\x takes two hex digits: \xDF.
The lower-case \u takes four hex digits: \u00DF.
The upper-case \U takes eight hex digits: \U000000DF.
>>> from makeunicode import u
>>> print(u('\u00dcnic\u00f6de'))
Ünicöde
>>> print(u('\xdcnic\N{Latin Small Letter O with diaeresis}de'))
Ünicöde
>>> u"\N{EURO SIGN}"
'€'
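The session above relies on a helper module, makeunicode. As a rough sketch of the same idea using only built-in features (the variable names here are illustrative, not from the original), the four escape forms can be compared directly, and the unicode_escape codec can be applied to escape text held in a bytes object:

```python
# Four spellings of the same character, U+00DF, in a Python 3 str literal.
two_digit = '\xdf'
four_digit = '\u00df'
eight_digit = '\U000000df'
by_name = '\N{LATIN SMALL LETTER SHARP S}'

all_equal = two_digit == four_digit == eight_digit == by_name
print(all_equal, two_digit)

# A raw bytes literal keeps the backslashes as-is; the unicode_escape codec
# then interprets them and returns the corresponding characters as a str.
decoded = rb'\u00dcnic\xf6de \N{EURO SIGN}'.decode('unicode_escape')
print(decoded)
```

Note that unicode_escape treats any non-escaped byte as Latin-1, so it is best kept for input that is pure ASCII plus escapes.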
Escape Characters
The recognized escape sequences are:
- \newline      Ignored
- \\            Backslash (\)
- \'            Single quote (')
- \"            Double quote (")
- \a            ASCII Bell (BEL)
- \b            ASCII Backspace (BS)
- \f            ASCII Formfeed (FF)
- \n            ASCII Linefeed (LF)
- \N{name}      Character named NAME in the Unicode database (Unicode only)
- \r            ASCII Carriage Return (CR)
- \t            ASCII Horizontal Tab (TAB)
- \uxxxx        Character with 16-bit hex value XXXX (Unicode only)
- \Uxxxxxxxx    Character with 32-bit hex value XXXXXXXX (Unicode only)
- \v            ASCII Vertical Tab (VT)
- \ooo          Character with octal value OOO
- \xhh          Character with hex value HH
In Python 3 there is no such thing as a UTF-8 encoded string. Encoding means converting a string of Unicode characters into a sequence of bytes using a particular encoding (for example, UTF-8). Bytes are therefore not (Unicode) characters: bytes are bytes, a character is an abstraction, and a string is a sequence of those abstractions.
In Python 3, all strings are sequences of Unicode characters.
UTF-8 is a way of encoding characters as a sequence of bytes.
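A small sketch may make the distinction concrete. The variable names are illustrative; the point is only that the same text has one length as characters and another as UTF-8 bytes:

```python
text = 'niño'                      # a str: four Unicode characters
encoded = text.encode('utf-8')     # a bytes object: ñ needs two bytes in UTF-8

print(len(text), len(encoded))
print(list(encoded))               # the raw byte values as integers
print(encoded.decode('utf-8') == text)
```

Indexing follows the same split: text[2] is the one-character string 'ñ', while encoded[2] is the integer 195.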
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding. The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference), 'backslashreplace' (inserts a \uNNNN escape sequence) and 'namereplace' (inserts a \N{...} escape sequence).
The following example shows the different results:
>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'
Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
① You can’t concatenate bytes and strings. They are two different data types.
② You can’t count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding”? Well then, you’ll need to say that explicitly. Python 3 won’t implicitly convert bytes to strings or strings to bytes.
③ By an amazing coincidence, this line of code says “count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.”
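The three notes refer to a code listing that is not reproduced here. A minimal sketch of the kind of code they describe could look like the following; the values of by and s are hypothetical stand-ins, not the original listing:

```python
by = b'd'          # a bytes object (hypothetical example value)
s = 'abcde'        # a str (hypothetical example value)

# Note 1: concatenating bytes and str raises TypeError.
try:
    by + s
except TypeError as exc:
    concat_error = exc

# Note 2: counting a bytes object inside a str also raises TypeError.
try:
    s.count(by)
except TypeError as exc:
    count_error = exc

# Note 3: decoding explicitly first yields a str, which can be counted.
occurrences = s.count(by.decode('ascii'))
print(occurrences)
```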
And here is the link between strings and bytes: bytes objects have a decode() method that takes a character encoding and returns a string, and strings have an encode() method that takes a character encoding and returns a bytes object.
u'something'.encode('utf-8') produces a bytes object (b'something'), but so does bytes(u'something', 'utf-8'). Likewise, b'bytes'.decode('utf-8') does the same thing as str(b'bytes', 'utf-8').
For example:
>>> n = 'NIÑO'
>>> m = bytes(n, 'utf-8')    # b'NI\xc3\x91O', <class 'bytes'>
which is equivalent to:
>>> m = n.encode('utf-8')    # b'NI\xc3\x91O', <class 'bytes'>
To convert a bytes string back to Unicode, use the decode() method:
>>> l = m.decode('utf-8')    # 'NIÑO', <class 'str'>
Calling sys.getdefaultencoding() shows that the default encoding is 'utf-8'. Therefore n.encode() and m.decode() can replace n.encode('utf-8') and m.decode('utf-8'). Note that decode() belongs to the bytes object m, not to the string n.
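As a quick check of these equivalences, the following sketch reuses the names n and m from the example above and compares the explicit and implicit forms:

```python
import sys

n = 'NIÑO'
m = n.encode()                       # no argument: UTF-8 is used

print(sys.getdefaultencoding())
print(m == n.encode('utf-8'))        # same bytes either way
print(m.decode() == n)               # decode() also defaults to UTF-8
print(bytes(n, 'utf-8') == m, str(m, 'utf-8') == n)
```

Spelling out the encoding is still a reasonable habit, since it documents the intent and does not depend on defaults.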
In Python 3.x, the prefix 'b' indicates that the literal is a bytes object, which differs from the normal string (which, as we know, is a Unicode string by default). The 'b' prefix is even preserved when the object is displayed: b'prefix in Python 3.x'
To create a bytearray, use the bytearray() function:
>>> entidad = bytearray('niño', 'utf-8')
>>> type(entidad)
<class 'bytearray'>
Syntax:
bytes([source[, encoding[, errors]]])
Return a new "bytes" object, which is an immutable sequence of small integers in the range 0 <= x < 256 that is printed as ASCII characters when displayed. bytes is an immutable version of bytearray: it has the same non-mutating methods and the same indexing and slicing behavior.
Syntax:
bytearray([source[, encoding[, errors]]])
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.
Difference between bytes and bytearray object in Python
>>> # bytearray objects are a mutable counterpart to bytes objects
>>> x = bytearray("Python bytearray", "utf8")
>>> print(x)
bytearray(b'Python bytearray')
>>> # items can be removed from a bytearray
>>> del x[11:15]
>>> print(x)
bytearray(b'Python bytey')
>>> # items can be added to a bytearray
>>> x[11:15] = b" object"
>>> print(x)
bytearray(b'Python byte object')
>>> # the methods of mutable sequences, such as lists, are available
>>> x.append(45)
>>> print(x)
bytearray(b'Python byte object-')
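For contrast, here is a short sketch of the other side of the difference: the same kind of in-place edit fails on a bytes object, so one common approach is to copy into a bytearray, edit, and convert back. The names data, buffer and frozen are illustrative only.

```python
data = b'Python bytes'

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 80
except TypeError as exc:
    error = exc

# Copy into a bytearray to edit in place, then freeze the result again.
buffer = bytearray(data)
buffer[0:6] = b'python'
frozen = bytes(buffer)
print(frozen)
```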
Convert bytes to bytearray
>>> #create a bytes object from a list of integers in the range 0 through 255
>>> x = bytes([105, 100, 107, 112, 132, 118, 107, 112, 200])
>>> print(x)
b'idkp\x84vkp\xc8'
>>> #generates a new array of bytes from a bytes object
>>> x1 = bytearray(x)
>>> print(x1)
bytearray(b'idkp\x84vkp\xc8')
>>> bs = 'niño'
>>> print(type(bs))
<class 'str'>
>>> bb = bs.encode('utf-8')
>>> print(type(bb))
<class 'bytes'>
>>> bb
b'ni\xc3\xb1o'
>>> bss = bb.decode('utf-8')
>>> print(type(bss))
<class 'str'>
>>> bss
'niño'
mod_wsgi
Apache/2.4.27 (Unix) mod_wsgi/4.5.18 Python/3.6
:QUERY_STRING: searchPhrase=NI%C3%91OS
DEBUG:wsgi.wsgateway:wsgi.errors: <_io.TextIOWrapper name='<wsgi.errors>' encoding='utf-8'>
DEBUG:wsgi.wsgateway:self.request searchPhrase=NI%C3%91OS
DEBUG:parse_qs:qs parametros searchPhrase NI\xc3\x83\xc2\x91OS
- WSGI environ keys are unicode
- WSGI environ values that contain incoming request data are bytes
- headers, chunks in the response iterable as well as status code are bytes as well
2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI environment, the value of the variable must be a native string.
3. For the CGI variables contained in the WSGI environment, the values of the variables are native strings. Where native strings are Unicode strings, ISO-8859-1 encoding is used so that the original character data is preserved and, as necessary, the Unicode string can be converted back to bytes and then decoded again using a different encoding.
4. The WSGI input stream 'wsgi.input' contained in the WSGI environment, from which the request content is read, must yield byte strings.
5. The status line specified by the WSGI application must be a byte string. Where native strings are Unicode strings, the native string type can also be returned, in which case it is encoded as ISO-8859-1.
6. The list of response headers specified by the WSGI application must contain tuples consisting of two values, where each value is a byte string. Where native strings are Unicode strings, the native string type can also be returned, in which case it is encoded as ISO-8859-1.
7. The iterable returned by the application, from which the response content is derived, must yield byte strings. Where native strings are Unicode strings, the native string type can also be returned, in which case it is encoded as ISO-8859-1.
8. The value passed to the 'write()' callback returned by 'start_response()' must be a byte string. Where native strings are Unicode strings, a native string type can also be supplied, in which case it is encoded as ISO-8859-1.
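Point 3 is a likely explanation for the garbled searchPhrase value in the debug log above: the percent-encoded bytes C3 91 (UTF-8 for Ñ) appear to have been read as two ISO-8859-1 characters. The following is a sketch of that failure and its repair using only the standard library; the variable names are illustrative and the byte values are taken from the log:

```python
from urllib.parse import parse_qs, unquote_to_bytes

query_string = 'searchPhrase=NI%C3%91OS'

# Percent-decoding yields the raw bytes sent by the client (UTF-8 here).
raw = unquote_to_bytes('NI%C3%91OS')

# Reading those bytes as ISO-8859-1 gives two wrong characters instead of one.
native = raw.decode('iso-8859-1')
# Re-encoding that text as UTF-8 produces the byte sequence seen in the log.
mojibake = native.encode('utf-8')

# The ISO-8859-1 round trip is lossless, so the original text can be recovered.
repaired = native.encode('iso-8859-1').decode('utf-8')

# parse_qs percent-decodes as UTF-8 by default and returns the expected text.
params = parse_qs(query_string)
print(mojibake, repaired, params)
```

The same encode('iso-8859-1').decode('utf-8') step can be applied to a CGI value taken from the WSGI environment when the client is known to have sent UTF-8.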
With PostgreSQL
Reading data from a database is similar to reading from a file: decode when reading, process the data, encode when writing. However, some Python database libraries do this for you automatically. sqlite3, MySQLdb, and psycopg2 all allow you to pass a Unicode string directly to an INSERT or SELECT statement. When you specify the string encoding while creating the connection, the returned string is also decoded to a Unicode string automatically.
Here is a psycopg2 example:
#!/usr/bin/env python
# coding=utf-8
"""
postgres database read/write example
"""
import psycopg2


def get_conn():
    return psycopg2.connect(host="localhost", database="t1", user="t1",
                            password="fNfwREMqO69TB9YqE+/OzF5/k+s=")


def write():
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(u"""\
CREATE TABLE IF NOT EXISTS t1 (id integer, data text);
""")
        cur.execute(u"""\
DELETE FROM t1
""")
        # The Unicode string is passed directly as a query parameter.
        cur.execute(u"""\
INSERT INTO t1 VALUES (%s, %s)
""", (1, u"✓"))


def read():
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(u"""\
SELECT id, data FROM t1
""")
        for row in cur:
            # Under Python 3, psycopg2 already returns str for text columns,
            # so no explicit .decode('utf-8') is needed (str has no decode()).
            data = row[1]
            print(type(data), data)


def main():
    write()
    read()


if __name__ == '__main__':
    main()
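The psycopg2 example needs a running PostgreSQL server. As a self-contained sketch of the same round trip, the following uses sqlite3, which the paragraph above also mentions, with an in-memory database. The table layout mirrors the example; note that sqlite3 uses ? placeholders rather than %s.

```python
import sqlite3

# An in-memory database keeps the sketch free of external setup.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE t1 (id integer, data text)')

# The str value is passed as-is; the driver handles the encoding.
cur.execute('INSERT INTO t1 VALUES (?, ?)', (1, '✓'))
conn.commit()

# Text columns come back as str, already decoded.
cur.execute('SELECT id, data FROM t1')
rows = cur.fetchall()
for row_id, data in rows:
    print(type(data), data)
conn.close()
```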