Python email package: how to reliably convert/decode multipart messages to str
Problem
I was trying to process old, potentially non-compliant emails with Python. I could read in the message without any problem:
In [1]: m=email.message_from_binary_file(open('/path/to/problematic:2,S',mode='rb'))
But subsequently converting it to a string raised UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 1238: illegal multibyte sequence. The offending part of this multipart message has the headers Content-Type: text/plain; charset="gb2312" and Content-Transfer-Encoding: 8bit.
In [2]: m.as_string()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-26-919a3a20e7d8> in <module>()
----> 1 m.as_string()

~/tools/conda/envs/conda3.6/lib/python3.6/email/message.py in as_string(self, unixfrom, maxheaderlen, policy)
    156                       maxheaderlen=maxheaderlen,
    157                       policy=policy)
--> 158         g.flatten(self, unixfrom=unixfrom)
    159         return fp.getvalue()
    160

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in flatten(self, msg, unixfrom, linesep)
    114                 ufrom = 'From nobody ' + time.ctime(time.time())
    115                 self.write(ufrom + self._NL)
--> 116             self._write(msg)
    117         finally:
    118             self.policy = old_gen_policy

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _write(self, msg)
    179             self._munge_cte = None
    180             self._fp = sfp = self._new_buffer()
--> 181             self._dispatch(msg)
    182         finally:
    183             self._fp = oldfp

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _dispatch(self, msg)
    212             if meth is None:
    213                 meth = self._writeBody
--> 214         meth(msg)
    215
    216     #

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _handle_multipart(self, msg)
    270             s = self._new_buffer()
    271             g = self.clone(s)
--> 272             g.flatten(part, unixfrom=False, linesep=self._NL)
    273             msgtexts.append(s.getvalue())
    274         # BAW: What about boundaries that are wrapped in double-quotes?

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in flatten(self, msg, unixfrom, linesep)
    114                 ufrom = 'From nobody ' + time.ctime(time.time())
    115                 self.write(ufrom + self._NL)
--> 116             self._write(msg)
    117         finally:
    118             self.policy = old_gen_policy

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _write(self, msg)
    179             self._munge_cte = None
    180             self._fp = sfp = self._new_buffer()
--> 181             self._dispatch(msg)
    182         finally:
    183             self._fp = oldfp

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _dispatch(self, msg)
    212             if meth is None:
    213                 meth = self._writeBody
--> 214         meth(msg)
    215
    216     #

~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _handle_text(self, msg)
    241                 msg = deepcopy(msg)
    242                 del msg['content-transfer-encoding']
--> 243                 msg.set_payload(payload, charset)
    244                 payload = msg.get_payload()
    245                 self._munge_cte = (msg['content-transfer-encoding'],

~/tools/conda/envs/conda3.6/lib/python3.6/email/message.py in set_payload(self, payload, charset)
    313             if not isinstance(charset, Charset):
    314                 charset = Charset(charset)
--> 315             payload = payload.encode(charset.output_charset)
    316         if hasattr(payload, 'decode'):
    317             self._payload = payload.decode('ascii', 'surrogateescape')

UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 1238: illegal multibyte sequence
I’m not really familiar with the idiosyncrasies of email internals. Searching online for this type of error mostly turned up problems people hit while scraping the web, and the advice basically restated the obvious: the text being encoded contains Unicode characters that the target codec cannot represent.
My question is: what’s the correct way to reliably handle (potentially non-compliant) emails?
EDIT: Interestingly, m.get_payload(i=0).as_string() triggers the same exception, but m.get_payload(i=0).get_payload(decode=False) gives a str that displays correctly on my terminal, while m.get_payload(i=0).get_payload(decode=True) gives bytes (b'\xd7\xaa...') that I can’t decode. The decoding error, however, is reported at a different position than the encoding error above:
----> 1 m.get_payload(i=0).get_payload(decode=True).decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xac in position 1995: illegal multibyte sequence
or
----> 1 m.get_payload(i=0).get_payload(decode=True).decode('gb18030')
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xa3 in position 2033: illegal multibyte sequence
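To see exactly which bytes a codec rejects, the UnicodeDecodeError object itself can be inspected. The following is only a small diagnostic sketch, not part of the original investigation: the byte string data is a made-up stand-in for the real payload (two valid gb2312 characters followed by a stray 0xff), and errors="replace" is shown as one possible lenient fallback.

```python
# Synthetic stand-in for the real payload: valid gb2312 text plus one stray byte.
data = "你好".encode("gb2312") + b"\xff\n"

bad_start = None
bad_bytes = b""
try:
    data.decode("gb2312")  # strict decoding, as in the failing calls above
except UnicodeDecodeError as ex:
    # ex.object holds the input bytes; ex.start/ex.end delimit the rejected span.
    bad_start = ex.start
    bad_bytes = ex.object[ex.start:ex.end]
    context = ex.object[max(0, ex.start - 8):ex.end + 8]
    print(f"{ex.encoding}: cannot decode {bad_bytes!r} at offset {bad_start}")
    print(f"surrounding bytes: {context!r}")

# Lenient alternative: undecodable bytes become U+FFFD instead of raising.
lenient = data.decode("gb2312", errors="replace")
print(repr(lenient))
```

Looking at the surrounding bytes usually makes it clear whether the message is merely truncated or corrupted at one spot, or whether the declared charset is wrong altogether.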
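Answer
Apparently, when the body contains raw 8-bit bytes (as with Content-Transfer-Encoding: 8bit), message.get_payload(decode=False) does not simply hand back the stored payload: it recovers the original bytes and decodes them to str using the declared charset, substituting U+FFFD for anything that cannot be decoded. That would explain both why the text displayed correctly on my terminal and where the '\ufffd' in the UnicodeEncodeError came from. On the other hand, message.get_payload(decode=True) always produces bytes for a non-multipart part, although the transfer encoding is only actually undone if Content-Transfer-Encoding exists and is quoted-printable or base64 (or one of the uuencode variants).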
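The difference between the two decode modes can be reproduced on a tiny synthetic message. This is a sketch under assumptions: the message below is invented for illustration (a gb2312 body declared as 8bit, with one stray 0xff byte), and the behaviour shown is what I would expect from the standard library's default compat32 policy.

```python
import email

# Invented test message: gb2312 text sent as 8bit, with one invalid byte.
raw = (
    b'Content-Type: text/plain; charset="gb2312"\n'
    b'Content-Transfer-Encoding: 8bit\n'
    b'\n'
    + "你好".encode("gb2312") + b"\xff\n"
)
msg = email.message_from_bytes(raw)

# decode=False returns str; the stray byte shows up as U+FFFD.
as_text = msg.get_payload(decode=False)

# decode=True returns the body bytes untouched, since 8bit needs no un-encoding.
raw_body = msg.get_payload(decode=True)

print(type(as_text).__name__, repr(as_text))
print(type(raw_body).__name__, repr(raw_body))
```

Once the str contains U+FFFD, it can no longer be encoded back to gb2312, which is consistent with the set_payload() failure in the traceback.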
I ended up with the following code. Not sure if this is the correct way of handling emails.
# m is the message parsed with email.message_from_binary_file() above.
body = []
if m.preamble is not None:
    body.extend(m.preamble.splitlines(keepends=True))

for part in m.walk():
    if part.is_multipart():
        continue  # containers have no body of their own

    ctype = part.get_content_type()
    cte = part.get_params(header='Content-Transfer-Encoding')
    if (ctype is not None and not ctype.startswith('text')) or \
            (cte is not None and cte[0][0].lower() == '8bit'):
        # Non-text and 8bit parts: let the email package build the str.
        part_body = part.get_payload(decode=False)
    else:
        charset = part.get_content_charset()
        if charset is None or len(charset) == 0:
            charsets = ['ascii', 'utf-8']
        else:
            charsets = [charset]

        # Undo base64/quoted-printable, then try each candidate charset.
        part_body = part.get_payload(decode=True)
        for enc in charsets:
            try:
                part_body = part_body.decode(enc)
                break
            except (UnicodeDecodeError, LookupError):
                continue
        else:
            # No charset worked: fall back to the undecoded payload.
            part_body = part.get_payload(decode=False)

    body.extend(part_body.splitlines(keepends=True))

if m.epilogue is not None:
    body.extend(m.epilogue.splitlines(keepends=True))
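A more compact variant of the same idea is sketched below. It is not the code I actually ran on my mailbox: part_to_text is a hypothetical helper name, the "utf-8" default is an assumption, and instead of switching between decode=False and decode=True it always takes the bytes and decodes them leniently with errors="replace". The multipart message is synthetic.

```python
import email


def part_to_text(part, default="utf-8"):
    """Decode one message part to str without raising (hypothetical helper)."""
    data = part.get_payload(decode=True)  # bytes, with base64/QP undone
    if data is None:  # multipart containers carry no payload of their own
        return ""
    charset = part.get_content_charset() or default
    try:
        return data.decode(charset, errors="replace")
    except LookupError:  # unknown or misspelled charset name
        return data.decode(default, errors="replace")


# Synthetic two-part message: a broken gb2312/8bit part and a base64 part.
raw = (
    b'MIME-Version: 1.0\n'
    b'Content-Type: multipart/mixed; boundary="XX"\n'
    b'\n'
    b'--XX\n'
    b'Content-Type: text/plain; charset="gb2312"\n'
    b'Content-Transfer-Encoding: 8bit\n'
    b'\n'
    + "你好".encode("gb2312") + b"\xff!\n"
    b'--XX\n'
    b'Content-Type: text/plain; charset="utf-8"\n'
    b'Content-Transfer-Encoding: base64\n'
    b'\n'
    b'aGVsbG8=\n'
    b'--XX--\n'
)
msg = email.message_from_bytes(raw)
texts = [part_to_text(p) for p in msg.walk() if not p.is_multipart()]
for text in texts:
    print(repr(text))
```

The trade-off is that damaged characters are silently turned into U+FFFD, so this suits display or indexing better than anything that needs to round-trip the original bytes.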