Handle dvi font names as ASCII bytestrings by jkseppan · Pull Request #6977 · matplotlib/matplotlib

jkseppan · 2016-08-25T18:37:17Z

Dvi is a binary format that includes some ASCII strings such as
TeX names of some fonts. The associated files such as psfonts.map
need to be ASCII too. This patch changes their handling to keep
them as binary strings all the time.

This avoids the ugly workaround

    try:
        result = some_mapping[texname]
    except KeyError:
        result = some_mapping[texname.decode('ascii')]

which is essentially saying that texname is sometimes a string,
sometimes a bytestring. The workaround masks real KeyErrors,
leading to incomprehensible error messages such as in #6516.
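
To illustrate the description above, here is a minimal Python 3 sketch of why the workaround hides the real problem. The table `font_table` and both helper functions are hypothetical and are not taken from dviread.py.

```python
# Hypothetical font table keyed by ASCII bytestrings, as the patch proposes.
font_table = {b"cmr10": "cmr10.pfb", b"cmex10": "cmex10.pfb"}


def lookup_with_workaround(mapping, texname):
    """Old pattern: on KeyError, retry with a decoded key."""
    try:
        return mapping[texname]
    except KeyError:
        # If texname is already a str, this line fails with
        # AttributeError instead of reporting the missing key.
        return mapping[texname.decode("ascii")]


def lookup_bytes_only(mapping, texname):
    """New pattern: keys are always bytes, so KeyError means 'not found'."""
    return mapping[texname]
```

Under Python 3, calling `lookup_with_workaround(font_table, "cmr10")` ends in an AttributeError about `'str' object has no attribute 'decode'` rather than a KeyError naming the font, which is the kind of message reported in #6516.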

anntzer · 2016-12-26T23:30:02Z

lib/matplotlib/dviread.py

    def __init__(self, scale, tfm, texname, vf):
-        if six.PY3 and isinstance(texname, bytes):
-            texname = texname.decode('ascii')
+        assert(isinstance(texname, bytes))


No parentheses here. (And in general I think we should avoid bare asserts in favor of raising explicit exceptions.)
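
A possible shape for the explicit check, as a sketch only: the helper name `check_bytes` and the choice of TypeError are assumptions, not something the PR settles on. Unlike a bare assert, the check still runs when Python is started with -O.

```python
def check_bytes(value, name):
    """Return value unchanged, or raise TypeError if it is not bytes."""
    if not isinstance(value, bytes):
        raise TypeError(
            "{} must be bytes, not {}".format(name, type(value).__name__))
    return value
```

For example, `check_bytes(texname, "texname")` at the top of a constructor would report the offending type by name.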

anntzer · 2016-12-26T23:33:19Z

lib/matplotlib/dviread.py

       is usually very different from any external font names, and
       :class:`dviread.PsfontsMap` can be used to find the external
-       name of the font.
+       name of the font. ASCII bytestring.


If we can't use numpydoc/napoleon-style docstrings I would at least put the type at a more prominent place.

anntzer · 2016-12-26T23:35:11Z

lib/matplotlib/dviread.py


    def _parse(self, file):
-        """Parse each line into words."""
+        """Parse each line into words and process them."""


This docstring seems rather pointless (it is no news that a method named _parse parses and processes). I would either make it more explicit (what is the parsed format?) or just get rid of it.

anntzer · 2016-12-26T23:39:35Z

lib/matplotlib/dviread.py

+            if line == b'' or line.startswith(b'%'):
                continue
            words, pos = [], 0
            while pos < len(line):


If I understand the logic of this method correctly it should be possible to rewrite this loop simply as something like (untested)

re.findall(b'("[^"]*"|[^ ]*)', line)

right? (as findall returns nonoverlapping matches)
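
A runnable variant of that suggestion, as a sketch: it assumes `+` instead of `*` in the second alternative, because the `*` form can also return empty strings at the spaces between words. The names `WORD_RE` and `split_words` are made up for the example.

```python
import re

# Either a double-quoted run (quotes kept) or a run of non-space bytes.
WORD_RE = re.compile(br'"[^"]*"|[^ ]+')


def split_words(line):
    """Split a map-file line into words, keeping quoted runs whole."""
    return WORD_RE.findall(line)
```

With this pattern a quoted group such as "TeXBase1Encoding ReEncodeFont" comes back as a single word, quotes included; stripping the quotes would be a separate step.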

anntzer · 2016-12-26T23:40:04Z

lib/matplotlib/dviread.py


+        # input must be bytestrings (the file format is ASCII)
+        for word in words:
+            assert(isinstance(word, bytes))


Same remark as above re: assertions.

anntzer · 2016-12-26T23:41:10Z

lib/matplotlib/dviread.py

+        # input must be bytestrings (the file format is ASCII)
+        for word in words:
+            assert(isinstance(word, bytes))
+


Again, the logic of this function may perhaps be more easily expressed using regexes.

anntzer · 2016-12-26T23:43:06Z

@jkseppan Not sure why you asked me to review the PR (I don't know that much about that part of the codebase) but I had a quick look for now.

jkseppan · 2016-12-27T18:46:25Z

@anntzer Many thanks for the review! I thought this had been waiting for quite a while and I noticed your name in the git history for dviread.py, so I figured you have at least taken a look at it recently.

anntzer · 2016-12-27T18:52:16Z

No worries.

jkseppan · 2016-12-29T21:49:46Z

I changed a bunch of the docstrings to the numpydoc format, and rebased because I couldn't get the documentation build to work otherwise.

tacaswell · 2016-12-29T22:00:12Z

lib/matplotlib/dviread.py

-                    filename = word
+        # input must be bytestrings (the file format is ASCII)
+        for word in words:
+            assert isinstance(word, bytes)


A while ago there was an effort to remove assert from all of the main code base and replace with if not ...: raise blocks. Are these here for testing reasons or run-time checks?

This is a private function and should only be called from the same class, so if this triggers it's an internal error in the code. While writing this code I felt this was more obvious than the later errors you get from mixing bytestrings and strings, but perhaps this is not really needed.

tacaswell · 2016-12-29T22:28:45Z

The test failures look real

anntzer · 2016-12-30T01:00:43Z

lib/matplotlib/dviread.py

                line = line[:comment_start]
            line = line.strip()

            if state == 0:


I tried googling the format of enc files without much success, but it looks like this is looking for patterns of the form

/FooEncoding [ /abc/def/ghi ] def

and returning the slash-separated words within the brackets. At least the search for bracket-surrounded fragments can clearly be rewritten using re.findall as above.
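
One way the enc parsing could look with find and re.findall, as a sketch under assumptions: the function name `parse_enc_names` is hypothetical, comments are assumed to start at `%`, and names are assumed to be separated by whitespace. The name pattern is the one that appears in a later revision of this PR.

```python
import re


def parse_enc_names(lines):
    """Return the names between the first '[' and the next ']'."""
    # Drop everything after a '%' comment marker on each line.
    stripped = (line.split(b'%', 1)[0].strip() for line in lines)
    data = b' '.join(stripped)
    beginning = data.find(b'[')
    if beginning < 0:
        raise ValueError("Cannot locate beginning of encoding")
    data = data[beginning:]
    end = data.find(b']')
    if end < 0:
        raise ValueError("Cannot locate end of encoding")
    # A name is a slash followed by bytes that are not delimiters.
    return re.findall(br'/([^][{}<>\s]+)', data[:end])
```

Note that the character class does not exclude `/`, so names written back to back without whitespace (as in /abc/def) would come out as one match; the sketch relies on whitespace between names.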

jkseppan · 2016-12-31T21:23:22Z

That seems to have helped with Python 3 failures, but now we get test failures on Python 2.7.

jkseppan · 2017-01-01T15:22:57Z

Now the Travis build passed after I restarted one of the jobs.

codecov-io · 2017-01-01T15:58:49Z

Current coverage is 62.10% (diff: 91.89%)

Merging #6977 into master will decrease coverage by 4.47%

@@             master      #6977   diff @@
==========================================
  Files           109        174     +65   
  Lines         46648      56012   +9364   
  Methods           0          0           
  Messages          0          0           
  Branches          0          0           
==========================================
+ Hits          31060      34786   +3726   
- Misses        15588      21226   +5638   
  Partials          0          0

Powered by Codecov. Last update b12b6a7...08c69f5

QuLogic · 2017-01-02T07:21:57Z

lib/matplotlib/dviread.py

+       Used for verifying against the dvi file.
+    design_size : int
+       Design size of the font (unknown units)
+    width : dict


Assuming they are all of similar description and the original just eschewed repeating all that, this can be combined with the next two as:

    width, height, depth : dict
        Dimensions of each character, ...

QuLogic · 2017-01-02T07:23:32Z

lib/matplotlib/dviread.py

        except KeyError:
-            result = self._font[texname.decode('ascii')]
+            matplotlib.verbose.report(textwrap.fill
+                ('A PostScript file for the font whose TeX name is "%s" '


It seems weird to me to wrap between a function call and its opening parenthesis.

QuLogic · 2017-01-02T07:24:37Z

lib/matplotlib/dviread.py

+                 'package manager.' % (texname.decode('ascii'),
+                                       self._filename),
+                 break_on_hyphens=False, break_long_words=False),
+                'helpful')


Especially because this line is in a different function call than the ones above it.
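
One way to avoid the awkward wrapping is to build the message first and pass it to the reporting call on its own line. This is only a sketch: the helper name `missing_font_message` is hypothetical and the message wording is a placeholder, not the text used in the PR.

```python
import textwrap


def missing_font_message(texname, filename):
    """Build the wrapped hint separately from the call that reports it."""
    # Placeholder wording; the real message in the PR is longer.
    message = (
        'A PostScript file for the font whose TeX name is "%s" '
        'could not be found in the file "%s".'
        % (texname.decode('ascii'), filename))
    return textwrap.fill(message, break_on_hyphens=False,
                         break_long_words=False)
```

The reporting call then takes a single, already formatted argument, so each closing parenthesis belongs to exactly one visible call.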

QuLogic · 2017-01-02T07:26:41Z

lib/matplotlib/dviread.py

    def _parse(self, file):
-        """Parse each line into words."""
        for line in file:
+            line = six.b(line)


Why not open the file in binary mode in __init__ instead of decoding and re-encoding here? Also, six.b is supposed to be applied on string literals exclusively; this should use an explicit encode if you really want to do this.
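
A sketch of the binary-mode approach, assuming the file is plain ASCII; the function name `read_map_lines` is made up for the example. If text had to be converted anyway, an explicit `line.encode('ascii')` would be the replacement for six.b.

```python
def read_map_lines(path):
    """Yield stripped, non-comment lines of an ASCII map file as bytes."""
    # 'rb' yields bytes directly, so no decode/encode round trip is needed.
    with open(path, 'rb') as file:
        for line in file:
            line = line.strip()
            if line == b'' or line.startswith(b'%'):
                continue
            yield line
```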

QuLogic · 2017-01-02T07:32:39Z

lib/matplotlib/dviread.py

            self._parse(file)

    def __getitem__(self, texname):
+        assert(isinstance(texname, bytes))


Parentheses removal, from @anntzer.

QuLogic · 2017-01-02T07:45:44Z

lib/matplotlib/dviread.py

+        if not match:
+            raise ValueError("Cannot locate end of encoding in {}"
+                             .format(file))
+        data = data[:match.span()[0]]


match.start()?

QuLogic · 2017-01-02T07:47:00Z

lib/matplotlib/dviread.py

+        lines = (line[:line.find(b'%')] if b'%' in line else line.strip()
+                 for line in file)
+        data = b''.join(lines)
+        match = re.search(six.b(r'\['), data)


six.b is for 2.5 support; it can be dropped and the proper prefix used.

Shouldn't this just be data.find('[') (update below accordingly)?

QuLogic · 2017-01-02T07:47:13Z

lib/matplotlib/dviread.py

+            raise ValueError("Cannot locate beginning of encoding in {}"
+                             .format(file))
+        data = data[match.span()[1]:]
+        match = re.search(six.b(r'\]'), data)


Remove six.b here as well.

QuLogic · 2017-01-02T07:47:33Z

lib/matplotlib/dviread.py

-                        raise ValueError("Broken name in encoding file: " + w)
-
-        return result
+        return re.findall(six.b(r'/([^][{}<>\s]+)'), data)


More six.b removal.

QuLogic · 2017-01-02T07:50:06Z

lib/matplotlib/tests/test_dviread.py

    with open(os.path.join(dir, 'test.json')) as f:
        correct = json.load(f)
+        for entry in correct:
+            entry['text'] = [[a, b, c, six.b(d), e]


Replace six.b with explicit encode.

jkseppan · 2017-01-03T09:38:52Z

The Travis failures are in test_scales.py (fix in #7726), and for some reason the OSX builder is missing LaTeX and Ghostscript.

jenshnielsen · 2017-01-03T09:49:35Z

The OSX build does not install LaTeX because it takes way too long. IMHO the tests should be robust to this and skip when LaTeX is not available.

jkseppan · 2017-01-03T10:36:06Z

I was wrong: it seems that the actual failure in the OSX case is from test_scales.py too, even though there are error messages about missing latex.

Improve a docstring, remove unneeded parens from an assert, open a file as binary instead of encoding each line read from it, don't call six.b on variable strings, simplify string handling, improve the formatting of a matplotlib.verbose.report call.

Combine the word splitting and classification in one regex so we only have to scan each line once. Add some quotation marks in the test case to check that we are handling quoted words correctly (the behavior should always have matched this test case).

jkseppan · 2017-01-29T17:45:02Z

Rebased because of conflicts with recently merged PRs

tacaswell · 2017-02-11T20:07:55Z

lib/matplotlib/backends/backend_pdf.py

        self.writeObject(fontdictObject, fontdict)
        return fontdictObject

-    def embedTeXFont(self, texname, fontinfo):


Do we care about this API change?

I don't think that's something that an external user would ever have called. Let's mark the function private with an underscore and document the change.

tacaswell · 2017-02-11T20:08:04Z

lib/matplotlib/backends/backend_pdf.py

        gc._fillcolor = orig_fill
        gc._effective_alphas = orig_alphas

-    def tex_font_mapping(self, texfont):


do we care about this API change?

tacaswell · 2017-02-11T20:10:09Z

lib/matplotlib/dviread.py

-        with open(filename, 'rt') as file:
+        self._filename = filename
+        if six.PY3 and isinstance(filename, bytes):
+            self._filename = filename.decode('ascii', errors='replace')


Why ascii instead of utf-8 or the system encoding?

I suppose the system encoding is more correct, but conversions like that make me somewhat wary. It's not really enough to specify UTF-8, you have to know which representation to choose for characters where you have a choice. (For example, the Wikipedia page on HFS+: "File and folder names in HFS Plus are [...] normalized to a form very nearly the same as Unicode Normalization Form D (NFD)". At least at one time the Linux HFS+ implementation didn't follow this.)

The correct encoding depends on where the bytestring originates. If it's out of a TeX file, I wouldn't be surprised if ASCII were good enough considering the esoteric requirements like fitting in 8 characters.

If it's something the user supplies, then there's really no good default and they really should have done it themselves. If the "user" is us, then we really need to fix that end instead.

I think the only current users in our code are the PDF backend and the text2path code, both of which just pass in the location of "pdftex.map". I initially thought this might need to be made customizable but I've never seen that as a feature request.
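
For reference, a Python 3 sketch of decoding with the file system encoding rather than a fixed codec. The helper name `filename_for_messages` is hypothetical, and the code that was eventually committed may do this differently.

```python
import os


def filename_for_messages(filename):
    """Return a text version of filename for use in messages."""
    if isinstance(filename, bytes):
        # os.fsdecode applies the file system encoding and its
        # error handler, so the caller does not pick a codec.
        return os.fsdecode(filename)
    return filename
```

This only changes how the name is displayed; the original bytes or str value can still be passed to open() unchanged.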

tacaswell · 2017-02-11T20:33:38Z

lib/matplotlib/dviread.py

-                word = word.lstrip('<')
-                if word.startswith('[') or word.endswith('.enc'):
+        empty_re = re.compile(br'%|\s*$')
+        word_re = re.compile(


I learned quite a bit about regular expressions understanding this pattern.

For example, you can make them readable and name groups!

tacaswell · 2017-02-11T20:36:53Z

lib/matplotlib/dviread.py

+            psname = w.group('eff2') or w.group('eff1')
+
+            for w in words:
+                eff = w.group('eff1') or w.group('eff2')


Everywhere else these are listed in the reverse of the order in which they appear in the expression. Does that matter?

The named groups are mutually exclusive so the order doesn't matter for correctness, but there may be a very slight performance difference. I think I was listing the groups in an approximate order of probability of occurrence, e.g. effects are almost certainly quoted because they include arguments, but file and font names are almost certainly not quoted. I can see how this would be confusing; I'll add a comment.
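
A reduced sketch of the idea, not the pattern from the PR: the group names `quoted` and `bare` and the function `words` are made up. It uses re.VERBOSE for the layout and comments, and two alternatives of which only one can match a given word.

```python
import re

# A word is either quoted (quotes excluded from the group) or bare.
WORD_RE = re.compile(br'''
    "(?P<quoted> [^"]+   )"   # a quoted word
  | (?P<bare>    [^\s"]+ )    # or a run of non-space, non-quote bytes
''', re.VERBOSE)


def words(line):
    """Return each word's content, whichever alternative matched it."""
    # Exactly one of the two groups is set for each match, so the order
    # of the operands of 'or' does not change the result.
    return [m.group('quoted') or m.group('bare')
            for m in WORD_RE.finditer(line)]
```

Swapping the operands to `m.group('bare') or m.group('quoted')` gives the same list, because the group that did not take part in the match is None.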

tacaswell · 2017-02-11T20:42:31Z

lib/matplotlib/dviread.py

+            raise ValueError("Cannot locate beginning of encoding in {}"
+                             .format(file))
+        data = data[beginning:]
+        end = data.find(b']')


should this be an rfind?

nevermind 🐑

tacaswell · 2017-02-11T20:55:44Z

That was fun to review!

I left one question, about the order of the operands of an or, that should be addressed. The two API change questions, I think, just need to be documented or shimmed back to what they were.

I am overall 👍 on this.

tacaswell · 2017-02-11T20:59:58Z

Also re-milestoned for 2.1. It isn't really fixing a regression (these issues are very long standing) and makes some pretty substantial changes to subtle code so seems higher-risk.

I am open to arguments that this should be backported though.

tacaswell · 2017-02-11T23:26:04Z

Also jkseppan#6

ENH: make texFontMap a property
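
The lazy-parsing pattern behind that change can be sketched as follows. Everything except the property name texFontMap is hypothetical here, including the class, the `_load_map` placeholder and the `parse_count` counter.

```python
class PdfFileSketch(object):
    """Hypothetical stand-in; only the lazy-property pattern matters."""

    def __init__(self, map_path):
        self._map_path = map_path
        self._tex_font_map = None
        self.parse_count = 0

    def _load_map(self):
        # Placeholder for the expensive parse of the map file.
        self.parse_count += 1
        return {b'cmr10': self._map_path}

    @property
    def texFontMap(self):
        # Parse on first access only, then reuse the cached result.
        if self._tex_font_map is None:
            self._tex_font_map = self._load_map()
        return self._tex_font_map
```

Callers read `obj.texFontMap` like a plain attribute, and the map file is parsed only if some code path actually needs it.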

jkseppan · 2017-02-12T19:59:55Z

I think the issue dates back to when Matplotlib started supporting Python 3. I think this started out as a smaller fix, but the earlier round of review suggested some improvements that I agreed with. I think the 2.1 milestone is appropriate.

mdboom added the status: needs review label Aug 25, 2016

tacaswell added this to the 2.0.1 (next bug fix release) milestone Aug 25, 2016

jkseppan mentioned this pull request Aug 27, 2016

savefig to pdf: 'str' object has no attribute 'decode' #6516

Closed

jkseppan requested a review from anntzer December 26, 2016 15:13

anntzer reviewed Dec 26, 2016

jkseppan force-pushed the dvi-ascii branch from 592cd28 to b055e2e Compare December 29, 2016 21:47

tacaswell reviewed Dec 29, 2016

anntzer reviewed Dec 30, 2016

QuLogic reviewed Jan 2, 2017

jkseppan added 5 commits January 29, 2017 19:38

Test that the KeyError is raised when the font is missing

dbc8b9e

Mention bytestrings in docstring

93fad55

Add a helpful note when raising KeyError from dviread.PsFonts

4874e4e

So if you follow the troubleshooting instructions and rerun with --verbose-helpful you get a hint about the usual reason for matplotlib#6516.

Attempted fix for Python 3.4 compatibility

a130ba7

jkseppan added 5 commits January 29, 2017 19:43

Fix dvi font name handling in pdf backend

2e19a61

These are now ASCII bytestrings so we should not assume they are strings.

Separate the handling of dvi fonts in the pdf backend

119934a

Don't mix filenames and dvi font names as keys of the same dict.

Simplify enc file parsing

8fa303f

Use re.findall, and open the file as binary.

jkseppan force-pushed the dvi-ascii branch from 3f14667 to 254e3df Compare January 29, 2017 17:44

Try to fix the KeyError test

a8674b3

QuLogic mentioned this pull request Feb 10, 2017

"savefig" bug with unicode characters (version 2.0.0) #8039

Closed

tacaswell reviewed Feb 11, 2017

tacaswell modified the milestones: 2.1 (next point release), 2.0.1 (next bug fix release) Feb 11, 2017

ENH: make texFontMap a property

25a8fed

Only used once in the code, but makes the lazy parsing more standard.

jkseppan added 4 commits February 12, 2017 20:40

Merge pull request #6 from tacaswell/dvi-ascii

92e2c52

ENH: make texFontMap a property

Use file system encoding for the psfonts file name

5ba21b0

Document minor API changes

10135bf

And add an underscore in the beginning of the method whose signature changes.

Explain named group ordering

6de9813

tacaswell approved these changes Feb 12, 2017

QuLogic merged commit 4a2c850 into matplotlib:master Feb 26, 2017

QuLogic removed the status: needs review label Mar 23, 2017

jkseppan deleted the dvi-ascii branch January 6, 2018 08:38
