
Recognize txt by christian-intra2net · Pull Request #836 · decalage2/oletools · GitHub

Recognize txt #836

Open
christian-intra2net wants to merge 3 commits into decalage2:master from christian-intra2net:recognize-txt

@christian-intra2net (Contributor)

olevba's heuristic for detecting plain text (no \x00 in the binary data) does not work with many Unicode encodings such as UTF-16. This PR improves on that heuristic and moves it to ftguess.py, so we can at least deal with harmless text encoded as UTF-8, Latin-1, or UTF-16 (with or without BOMs). This is far from perfect and ignores popular Asian encodings, but according to Wikipedia UTF-8 is by far the most popular encoding used in software. If we need something better still, I'd recommend not reinventing the wheel here and using libmagic or another specialized library instead.

I created sample files for all the encodings used, and unit tests to check them.
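To make the idea concrete, here is a minimal sketch of such a heuristic. It is not the code from this pull request: `guess_text_encoding`, `_decode_text`, and the "mostly Latin-1 range" rule for BOM-less UTF-16 are assumptions made for illustration. It checks for a BOM first, then tries UTF-8, then BOM-less UTF-16, and falls back to Latin-1 only if no control characters show up.

```python
import codecs
import unicodedata

# Hypothetical sketch; none of these names come from ftguess.py.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)
_ALLOWED_CONTROLS = "\t\n\r\f"
# Control, unassigned, private-use and surrogate code points count as "binary".
_REJECTED = {"Cc", "Cn", "Co", "Cs"}


def _decode_text(data, encoding):
    """Return the decoded string if it looks like text, else None."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    for ch in text:
        if ch not in _ALLOWED_CONTROLS and unicodedata.category(ch) in _REJECTED:
            return None
    return text


def guess_text_encoding(data):
    """Return an encoding name if data looks like text, else None."""
    if not data:
        return None
    # 1. A BOM settles the encoding; the payload must still decode cleanly.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding if _decode_text(data, encoding) is not None else None
    # 2. Strict UTF-8 (this also covers plain ASCII).
    if _decode_text(data, "utf-8") is not None:
        return "utf-8"
    # 3. BOM-less UTF-16: assumed rule that at least half of the characters
    #    fall into the Latin-1 range, which fits European text only.
    if len(data) % 2 == 0:
        for encoding in ("utf-16-le", "utf-16-be"):
            text = _decode_text(data, encoding)
            if text and 2 * sum(ord(ch) < 0x100 for ch in text) >= len(text):
                return encoding
    # 4. Latin-1 decodes any byte sequence, so only the control check filters.
    if _decode_text(data, "latin-1") is not None:
        return "latin-1"
    return None
```

Note that ASCII text encoded as UTF-16 contains a \x00 byte for every character, which is why the old "no \x00" rule rejects it. The sketch ignores UTF-32 and all Asian encodings, in line with the limitations stated above.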

Test-driven development: want to correctly detect these as text in ftguess.
Already use future ftguess text type.

While we're at it: slightly improve the output of the unit test.

Recognizing text is not so simple, since various text encodings can look rather
"binary", but a few simple heuristics will deal with many text types (at
least those encountered here in Europe).

Of course, all xml is text as well, so use checks for "is this text" only
after more specialized tests like "is this xml".
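
The ordering described above can be sketched as an ordered list of checks in which the generic text check comes last. This is only an illustration: `looks_like_xml`, `looks_like_text`, and `guess_type` are hypothetical names, and both predicates are deliberately crude stand-ins, not the checks used in ftguess.py.

```python
def looks_like_xml(data):
    # Crude stand-in: only recognizes an XML declaration at the start
    # (after optional ASCII whitespace); a leading BOM is not handled.
    return data.lstrip().startswith(b"<?xml")


def looks_like_text(data):
    # Placeholder for a real text heuristic; here: non-empty and NUL-free.
    return bool(data) and b"\x00" not in data


# Order matters: every XML file would also pass the text check,
# so the more specialized test has to run first.
CHECKS = [
    ("xml", looks_like_xml),
    ("text", looks_like_text),
]


def guess_type(data):
    """Return the name of the first matching check, or "unknown"."""
    for name, check in CHECKS:
        if check(data):
            return name
    return "unknown"
```

With the two entries of `CHECKS` swapped, the XML sample would be reported as "text", which is the case the commit message warns about.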

Fetched URL: http://github.com/decalage2/oletools/pull/836
