Python email package: how to reliably convert/decode multipart messages to str

2018-05-01 16:52

Problem: I was trying to process old, potentially non-compliant emails with Python. I could read the message in without any problem:

In [1]: m=email.message_from_binary_file(open('/path/to/problematic:2,S',mode='rb'))

But subsequently converting it to a string raised UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 1238: illegal multibyte sequence. The affected part of this multipart message has the headers Content-Type: text/plain; charset="gb2312" and Content-Transfer-Encoding: 8bit.

In [2]: m.as_string()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-26-919a3a20e7d8> in <module>()
----> 1 m.as_string()

~/tools/conda/envs/conda3.6/lib/python3.6/email/message.py in as_string(self, unixfrom, maxheaderlen, policy)
    156                       maxheaderlen=maxheaderlen,
    157                       policy=policy)
--> 158         g.flatten(self, unixfrom=unixfrom)
    159         return fp.getvalue()
    160

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in flatten(self, msg, unixfrom, linesep)
    114                     ufrom = 'From nobody ' + time.ctime(time.time())
    115                 self.write(ufrom + self._NL)
--> 116             self._write(msg)
    117         finally:
    118             self.policy = old_gen_policy

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _write(self, msg)
    179             self._munge_cte = None
    180             self._fp = sfp = self._new_buffer()
--> 181             self._dispatch(msg)
    182         finally:
    183             self._fp = oldfp

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _dispatch(self, msg)
    212             if meth is None:
    213                 meth = self._writeBody
--> 214         meth(msg)
    215
    216     #

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _handle_multipart(self, msg)
    270             s = self._new_buffer()
    271             g = self.clone(s)
--> 272             g.flatten(part, unixfrom=False, linesep=self._NL)
    273             msgtexts.append(s.getvalue())
    274         # BAW: What about boundaries that are wrapped in double-quotes?

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in flatten(self, msg, unixfrom, linesep)
    114                     ufrom = 'From nobody ' + time.ctime(time.time())
    115                 self.write(ufrom + self._NL)
--> 116             self._write(msg)
    117         finally:
    118             self.policy = old_gen_policy

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _write(self, msg)
    179             self._munge_cte = None
    180             self._fp = sfp = self._new_buffer()
--> 181             self._dispatch(msg)
    182         finally:
    183             self._fp = oldfp

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _dispatch(self, msg)
    212             if meth is None:
    213                 meth = self._writeBody
--> 214         meth(msg)
    215
    216     #

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _handle_text(self, msg)
    241                 msg = deepcopy(msg)
    242                 del msg['content-transfer-encoding']
--> 243                 msg.set_payload(payload, charset)
    244                 payload = msg.get_payload()
    245                 self._munge_cte = (msg['content-transfer-encoding'],

~/tools/conda/envs/conda3.6/lib/python3.6/email/message.py in set_payload(self, payload, charset)
    313             if not isinstance(charset, Charset):
    314                 charset = Charset(charset)
--> 315             payload = payload.encode(charset.output_charset)
    316         if hasattr(payload, 'decode'):
    317             self._payload = payload.decode('ascii', 'surrogateescape')

UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 1238: illegal multibyte sequence
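
The situation can be approximated without the original file. The sketch below is a stand-in, not the actual message: it hand-builds a single-part message (the names raw and body and their contents are made up) whose 8bit body contains a byte that is invalid in GB2312. as_string() is wrapped in try/except because whether it raises may depend on the Python version. as_bytes() is shown alongside because the bytes generator writes such a payload back out without re-encoding it.

```python
import email

# Hypothetical stand-in for the problematic file: 0xff is not valid in GB2312.
body = b"\xd7\xaa\xff ok\n"
raw = (
    b'Content-Type: text/plain; charset="gb2312"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + body

msg = email.message_from_bytes(raw)

try:
    msg.as_string()
    as_string_failed = False
except UnicodeError as exc:
    # Same kind of failure as in the traceback above (seen there on Python 3.6).
    as_string_failed = True
    print("as_string() failed:", exc)

# Serializing to bytes keeps the original body bytes intact.
flattened = msg.as_bytes()
print("body preserved by as_bytes():", body in flattened)
```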

I’m not really familiar with the idiosyncrasies of email internals. Searching online for this type of error mostly turned up problems encountered while scraping the web, and the answers basically suggested the obvious: the raw bytes that were read in contain characters that cannot be encoded with the target codec.

My question is: what’s the correct way to reliably handle (potentially non-compliant) emails?

EDIT: Interestingly, m.get_payload(i=0).as_string() triggers the same exception, but m.get_payload(i=0).get_payload(decode=False) gives a str that displays correctly on my terminal, while m.get_payload(i=0).get_payload(decode=True) gives a bytes object (b'\xd7\xaa...') that I can’t decode. The decoding error, however, occurs at a different position than the encoding error above:

----> 1 m.get_payload(i=0).get_payload(decode=True).decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xac in position 1995: illegal multibyte sequence

----> 1 m.get_payload(i=0).get_payload(decode=True).decode('gb18030')
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xa3 in position 2033: illegal multibyte sequence

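Answer: Apparently, if Content-Transfer-Encoding is 8bit, message.get_payload(decode=False) still tries to decode the original bytes into a str, using the charset declared in the Content-Type header with the replace error handler. That is where the '\ufffd' replacement characters come from, and they are what as_string() later fails to encode back to gb2312. On the other hand, message.get_payload(decode=True) always produces bytes, although actual decoding of the transfer encoding happens only if Content-Transfer-Encoding exists and is quoted-printable or base64.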
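
The difference between the two calls can be seen on a small example. This is only a sketch on a hand-built message (the sample bytes in raw are made up, with 0xff standing in for whatever invalid byte the real message contains), not the original email.

```python
import email

# Hypothetical single-part message; 0xff is not a valid GB2312 byte.
raw = (
    b'Content-Type: text/plain; charset="gb2312"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xd7\xaa\xff ok\n"
)
part = email.message_from_bytes(raw)

# decode=False: a str, decoded with the declared charset and the
# "replace" error handler, so invalid bytes show up as U+FFFD.
as_text = part.get_payload(decode=False)

# decode=True: bytes; with an 8bit transfer encoding there is nothing to
# undo, so the original body bytes come back unchanged.
as_raw = part.get_payload(decode=True)

print(type(as_text).__name__, "\ufffd" in as_text)
print(type(as_raw).__name__, as_raw)
```

Because the str already contains U+FFFD, encoding it back with the strict gb2312 codec cannot succeed, which matches the last frame of the traceback.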

I ended up with the following code. Not sure if this is the correct way of handling emails.

# Assumes `m` is the message parsed earlier with email.message_from_binary_file().
body = []
if m.preamble is not None:
    body.extend(m.preamble.splitlines(keepends=True))

for part in m.walk():
    # Multipart containers have no text of their own; walk() also yields their children.
    if part.is_multipart():
        continue

    ctype = part.get_content_type()  # always a str, never None
    cte = part.get_params(header='Content-Transfer-Encoding')
    if not ctype.startswith('text') or \
       (cte is not None and cte[0][0].lower() == '8bit'):
        # Non-text parts stay in their transfer-encoded form; for 8bit text
        # parts get_payload() already returns a decoded str.
        part_body = part.get_payload(decode=False)
    else:
        charset = part.get_content_charset()
        # Without a declared charset, try ASCII first and then UTF-8.
        charsets = [charset] if charset else ['ascii', 'utf-8']

        raw = part.get_payload(decode=True)  # bytes
        for enc in charsets:
            try:
                part_body = raw.decode(enc)
                break
            except (UnicodeDecodeError, LookupError):
                # Wrong or unknown charset: try the next candidate.
                continue
        else:
            # No candidate worked: fall back to the undecoded str payload.
            part_body = part.get_payload(decode=False)

    body.extend(part_body.splitlines(keepends=True))

if m.epilogue is not None:
    body.extend(m.epilogue.splitlines(keepends=True))
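
A possible variation on the loop above, offered only as a sketch: instead of falling back to get_payload(decode=False), decode the bytes from get_payload(decode=True) with errors="replace". The helper name decode_part_leniently, its default_charset parameter, and the sample message are made up for illustration. The approach trades fidelity for robustness, since undecodable bytes become U+FFFD.

```python
import email


def decode_part_leniently(part, default_charset="ascii"):
    """Return the text of a non-multipart part without raising on bad bytes."""
    raw = part.get_payload(decode=True)
    if raw is None:
        # get_payload(decode=True) returns None for multipart containers.
        return ""
    charset = part.get_content_charset() or default_charset
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The declared charset is unknown to Python: use the default instead.
        return raw.decode(default_charset, errors="replace")


# Hypothetical two-part message: one part with an invalid GB2312 byte,
# one part declaring a charset that does not exist.
sample = (
    b'Content-Type: multipart/mixed; boundary="XX"\n'
    b"\n"
    b"--XX\n"
    b'Content-Type: text/plain; charset="gb2312"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xd7\xaa\xff ok\n"
    b"--XX\n"
    b'Content-Type: text/plain; charset="no-such-charset"\n'
    b"\n"
    b"plain ascii\n"
    b"--XX--\n"
)
msg = email.message_from_bytes(sample)
texts = [decode_part_leniently(p) for p in msg.walk() if not p.is_multipart()]
print(texts)
```

Joining the results, for example with "\n".join(texts), gives a single str for the whole message body.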
