Compiler implementation for FluentBundle

django-ftl · Jun 17, 2019 · d1481d6
1 parent 20f3e25
commit d1481d6
Showing 31 changed files with 4,142 additions and 95 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,11 @@
 .tox
 *.pyc
 .eggs/
+*.pot
+*.mo
+*.po
+.pytest_cache
 *.egg-info/
 _build
+.benchmarks
+.hypothesis
diff --git a/fluent.runtime/CHANGELOG.rst b/fluent.runtime/CHANGELOG.rst
@@ -8,6 +8,7 @@ fluent.runtime development version (unreleased)
   terms.
 * Refined error handling regarding function calls to be more tolerant of errors
   in FTL files, while silencing developer errors less.
+* Added ``CompilingFluentBundle`` implementation.
 
 fluent.runtime 0.1 (January 21, 2019)
 -------------------------------------

diff --git a/fluent.runtime/docs/implementations.rst b/fluent.runtime/docs/implementations.rst
@@ -0,0 +1,115 @@
+FluentBundle Implementations
+============================
+
+python-fluent comes with two implementations of ``FluentBundle``. The default is
+``fluent.runtime.InterpretingFluentBundle``, which is what you get under the
+alias ``fluent.runtime.FluentBundle``. It implements an interpreter for the FTL
+Abstract Syntax Tree.
+
+The alternative is ``fluent.runtime.CompilingFluentBundle``, which works by
+compiling a set of FTL messages to a set of Python functions using Python `ast
+<https://docs.python.org/3/library/ast.html>`_. This results in very good
+performance (see below for more info).
+
+While the two implementations have the same API, and return the same values
+in most situations, there are some differences, as follows:
+
+* ``InterpretingFluentBundle`` has some protection against malicious FTL input
+  which could attempt things like a `billion laughs attack
+  <https://en.wikipedia.org/wiki/Billion_laughs_attack>`_ to consume a large
+  amount of memory or CPU time. For the sake of performance,
+  ``CompilingFluentBundle`` does not have these protections.
+
+  It should be noted that both implementations are able to detect and stop
+  infinite recursion errors (``CompilingFluentBundle`` does this at compile
+  time), which is important to stop infinite loops and memory exhaustion which
+  could otherwise occur due to accidental cyclic references in messages.
+
+* While the error handling strategy for both implementations is the same, when
+  errors occur (e.g. a missing value in the arguments dictionary, a cyclic
+  reference, or a string passed to the ``NUMBER()`` builtin), the exact errors
+  returned by ``format`` may be different between the two implementations.
+
+  Also, when an error occurs, in some cases (such as a cyclic reference), the
+  error string embedded into the returned formatted message may be different.
+  For cases where there is no error, the output is identical (or should be).
+
+  Neither implementation guarantees that the exact errors returned will be the
+  same between different versions of ``fluent.runtime``.
+
+Performance
+-----------
+
+Due to the strategy of compiling to Python, ``CompilingFluentBundle`` has very
+good performance, especially for the simple common cases. The
+``tools/benchmark/gettext_comparisons.py`` script includes some benchmarks that
+compare speed to GNU gettext as a reference. Below is a rough summary:
+
+For the simple but very common case of a message defining a static string,
+``CompilingFluentBundle.format`` is very close to GNU gettext, or much faster,
+depending on whether you are using Python 2 or 3, and your Python implementation
+(e.g. CPython or PyPy). (The worst case we found was 5% faster than gettext on
+CPython 2.7, and the best case was about 3.5 times faster for PyPy2 5.1.2). For
+cases of substituting a single string into a message,
+``CompilingFluentBundle.format`` is between 30% slower and 70% faster than an
+equivalent implementation using GNU gettext and Python ``%`` interpolation.
+
+For messages where plural rules are involved, ``CompilingFluentBundle`` can
+currently be significantly slower than GNU gettext, partly because it uses
+plural rules from CLDR that can be much more complex (and correct) than the ones
+that gettext normally uses. Further work could be done to optimize some of these
+cases, though.
+
+For more complex operations (for example, using locale-aware date and number
+formatting), formatting messages can take a lot longer. Comparisons to GNU
+gettext fall down at this point, because it doesn't include a lot of this
+functionality. However, usually these types of messages make up a small fraction
+of the number of internationalized strings in an application.
+
+``InterpretingFluentBundle`` is, as you would expect, much slower than
+``CompilingFluentBundle``, often by a factor of 10. In cases where there are a
+large number of messages, ``CompilingFluentBundle`` will be a lot slower to
+format the first message because it first compiles all the messages, whereas
+``InterpretingFluentBundle`` does not have this compilation step, and tries to
+reduce any up-front work to a minimum (sometimes at the cost of runtime
+performance).
+
+
+Security
+--------
+
+You should not pass un-trusted FTL code to ``FluentBundle.add_messages``. This
+is because carefully constructed messages could potentially cause large resource
+usage (CPU time and memory). The ``InterpretingFluentBundle`` implementation
+does have some protection against these attacks, although it may not be
+foolproof, while ``CompilingFluentBundle`` does not have any protection against
+these attacks, either at compile time or run time.
+
+``CompilingFluentBundle`` works by compiling FTL messages to Python `ast
+<https://docs.python.org/3/library/ast.html>`_, which is passed to `compile
+<https://docs.python.org/3/library/functions.html#compile>`_ and then `exec
+<https://docs.python.org/3/library/functions.html#exec>`_. The use of ``exec``
+like this is an established technique for high performance Python code, used in
+template engines like Mako, Jinja2 and Genshi.
+
+However, there can understandably be some concerns around the use of ``exec``,
+which can open up remote execution vulnerabilities. If this is of paramount
+concern to you, you should consider using ``InterpretingFluentBundle`` instead
+(which is the default).
+
+To reduce the possibility of our use of ``exec`` harbouring security issues, the
+following things are in place:
+
+1. We generate `ast <https://docs.python.org/3/library/ast.html>`_ objects and
+   not strings. This greatly reduces the security problems, since there is no
+   possibility of a vulnerability due to incorrect string interpolation.
+
+2. We use ``exec`` only on AST derived from FTL files, never on "end user input"
+   (such as the arguments passed into ``FluentBundle.format``). This reduces the
+   attack vector to only the situation where the source of your FTL files is
+   potentially malicious or compromised.
+
+3. We employ defence-in-depth techniques in our code generation and compiler
+   implementation to reduce the possibility of cleverly crafted FTL code
+   producing security holes, and ensure these techniques have full test
+   coverage.
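Review note on the Security section above: point 1 (generating `ast` objects rather than source strings) can be illustrated with a small standalone sketch. This is not the code in `fluent/runtime/compiler.py`; the helper name `build_message_function` and the `"<ftl>"` filename are made up for illustration, the `(args, errors)` signature simply mirrors how `_format` calls compiled messages in this commit, and the node constructors assume Python 3.8 or later.

```python
import ast


def build_message_function(name, text):
    """Return a function `name(args, errors)` that returns `text`.

    Hypothetical helper: the message text is stored as an AST constant,
    so it is never spliced into Python source code.
    """
    func = ast.FunctionDef(
        name=name,
        args=ast.arguments(
            posonlyargs=[],
            args=[ast.arg(arg="args"), ast.arg(arg="errors")],
            kwonlyargs=[],
            kw_defaults=[],
            defaults=[],
        ),
        body=[ast.Return(value=ast.Constant(value=text))],
        decorator_list=[],
    )
    module = ast.Module(body=[func], type_ignores=[])
    # Nodes built by hand have no line/column info; compile() requires it.
    ast.fix_missing_locations(module)

    namespace = {}
    exec(compile(module, "<ftl>", "exec"), namespace)
    return namespace[name]


# Text that would break naive string-based code generation is harmless here.
hostile = 'hi"); import os  # '
greeting = build_message_function("greeting", hostile)
result = greeting({}, [])
```

Because the text only ever exists as the value of an `ast.Constant` node, quoting and escaping mistakes cannot turn message content into executable code. That is the property point 1 relies on.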
diff --git a/fluent.runtime/docs/usage.rst b/fluent.runtime/docs/usage.rst
@@ -93,6 +93,19 @@ module or the start of your repl session:
 
     from __future__ import unicode_literals
 
+CompilingFluentBundle
+~~~~~~~~~~~~~~~~~~~~~
+
+In addition to the default ``FluentBundle`` implementation, there is also a high
+performance implementation that compiles to Python AST. You can use it just the same:
+
+.. code-block:: python
+
+   from fluent.runtime import CompilingFluentBundle as FluentBundle
+
+Be sure to check the notes on :doc:`implementations`, especially the security
+section.
+
 Numbers
 ~~~~~~~
 
@@ -225,5 +238,6 @@ Help with the above would be welcome!
 Other features and further information
 --------------------------------------
 
+* :doc:`implementations`
 * :doc:`functions`
 * :doc:`errors`
diff --git a/fluent.runtime/fluent/runtime/__init__.py b/fluent.runtime/fluent/runtime/__init__.py
@@ -1,19 +1,23 @@
 from __future__ import absolute_import, unicode_literals
 
+from collections import OrderedDict
+
 import babel
 import babel.numbers
 import babel.plural
 
 from fluent.syntax import FluentParser
-from fluent.syntax.ast import Message, Term
+from fluent.syntax.ast import Junk, Message, Term
 
 from .builtins import BUILTINS
+from .compiler import compile_messages
+from .errors import FluentDuplicateMessageId, FluentJunkFound
 from .prepare import Compiler
-from .resolver import ResolverEnvironment, CurrentEnvironment
+from .resolver import CurrentEnvironment, ResolverEnvironment
 from .utils import ATTRIBUTE_SEPARATOR, TERM_SIGIL, ast_to_id, native_to_fluent
 
 
-class FluentBundle(object):
+class FluentBundleBase(object):
     """
     Message contexts are single-language stores of translations.  They are
     responsible for parsing translation resources in the Fluent syntax and can
@@ -33,27 +37,60 @@ def __init__(self, locales, functions=None, use_isolating=True):
             _functions.update(functions)
         self._functions = _functions
         self.use_isolating = use_isolating
-        self._messages_and_terms = {}
-        self._compiled = {}
-        self._compiler = Compiler()
+        self._messages_and_terms = OrderedDict()
+        self._parsing_issues = []
         self._babel_locale = self._get_babel_locale()
         self._plural_form = babel.plural.to_python(self._babel_locale.plural_form)
 
     def add_messages(self, source):
         parser = FluentParser()
         resource = parser.parse(source)
-        # TODO - warn/error about duplicates
         for item in resource.body:
             if isinstance(item, (Message, Term)):
                 full_id = ast_to_id(item)
-                if full_id not in self._messages_and_terms:
+                if full_id in self._messages_and_terms:
+                    self._parsing_issues.append((full_id, FluentDuplicateMessageId(
+                        "Additional definition for '{0}' discarded.".format(full_id))))
+                else:
                     self._messages_and_terms[full_id] = item
+            elif isinstance(item, Junk):
+                self._parsing_issues.append(
+                    (None, FluentJunkFound("Junk found: " +
+                                           '; '.join(a.message for a in item.annotations),
+                                           item.annotations)))
 
     def has_message(self, message_id):
         if message_id.startswith(TERM_SIGIL) or ATTRIBUTE_SEPARATOR in message_id:
             return False
         return message_id in self._messages_and_terms
 
+    def _get_babel_locale(self):
+        for l in self.locales:
+            try:
+                return babel.Locale.parse(l.replace('-', '_'))
+            except babel.UnknownLocaleError:
+                continue
+        # TODO - log error
+        return babel.Locale.default()
+
+    def format(self, message_id, args=None):
+        raise NotImplementedError()
+
+    def check_messages(self):
+        """
+        Check messages for errors and return them as a list of 2-tuples:
+           (message ID or None, exception object)
+        """
+        raise NotImplementedError()
+
+
+class InterpretingFluentBundle(FluentBundleBase):
+
+    def __init__(self, locales, functions=None, use_isolating=True):
+        super(InterpretingFluentBundle, self).__init__(locales, functions=functions, use_isolating=use_isolating)
+        self._compiled = {}
+        self._compiler = Compiler()
+
     def lookup(self, full_id):
         if full_id not in self._compiled:
             entry_id = full_id.split(ATTRIBUTE_SEPARATOR, 1)[0]
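Review note on the `add_messages` / `check_messages` changes in this hunk: the behaviour appears to be "first definition wins, later duplicates are recorded rather than raised". A minimal standalone sketch of that pattern follows. `DuplicateMessageId` and `add_entries` are hypothetical stand-ins (not `FluentDuplicateMessageId` or the real method), and the entries are plain `(id, value)` pairs instead of parsed FTL.

```python
from collections import OrderedDict


class DuplicateMessageId(ValueError):
    """Hypothetical stand-in for FluentDuplicateMessageId."""


def add_entries(store, issues, entries):
    """Add (full_id, value) pairs to `store`, keeping the first definition.

    Duplicates are not raised; they are appended to `issues` as
    (message ID, exception object) tuples, the shape check_messages returns.
    """
    for full_id, value in entries:
        if full_id in store:
            issues.append((full_id, DuplicateMessageId(
                "Additional definition for '{0}' discarded.".format(full_id))))
        else:
            store[full_id] = value


store = OrderedDict()
issues = []
add_entries(store, issues, [("hello", "Hello"), ("bye", "Bye"), ("hello", "Hi")])
```

After this call `store` still maps `hello` to the first value, and `issues` holds a single tuple for the discarded redefinition, so a caller can surface problems without `add_messages` itself failing.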
@@ -83,11 +120,55 @@ def format(self, message_id, args=None):
                                   errors=errors)
         return [resolve(env), errors]
 
-    def _get_babel_locale(self):
-        for l in self.locales:
-            try:
-                return babel.Locale.parse(l.replace('-', '_'))
-            except babel.UnknownLocaleError:
-                continue
-        # TODO - log error
-        return babel.Locale.default()
+    def check_messages(self):
+        return self._parsing_issues[:]
+
+
+class CompilingFluentBundle(FluentBundleBase):
+    def __init__(self, *args, **kwargs):
+        super(CompilingFluentBundle, self).__init__(*args, **kwargs)
+        self._mark_dirty()
+
+    def _mark_dirty(self):
+        self._is_dirty = True
+        # Clear out old compilation errors, they might not apply if we
+        # re-compile:
+        self._compilation_errors = []
+        self.format = self._compile_and_format
+
+    def _mark_clean(self):
+        self._is_dirty = False
+        self.format = self._format
+
+    def add_messages(self, source):
+        super(CompilingFluentBundle, self).add_messages(source)
+        self._mark_dirty()
+
+    def _compile(self):
+        self._compiled_messages, self._compilation_errors = compile_messages(
+            self._messages_and_terms,
+            self._babel_locale,
+            use_isolating=self.use_isolating,
+            functions=self._functions)
+        self._mark_clean()
+
+    # 'format' is the hot path for many scenarios, so we try to optimize it. To
+    # avoid having to check '_is_dirty' inside 'format', we switch 'format' from
+    # '_compile_and_format' to '_format' when compilation is done. This gives us
+    # about 10% improvement for the simplest (but most common) case of an
+    # entirely static string.
+    def _compile_and_format(self, message_id, args=None):
+        self._compile()
+        return self._format(message_id, args)
+
+    def _format(self, message_id, args=None):
+        errors = []
+        return self._compiled_messages[message_id](args, errors), errors
+
+    def check_messages(self):
+        if self._is_dirty:
+            self._compile()
+        return self._parsing_issues + self._compilation_errors
+
+
+FluentBundle = InterpretingFluentBundle
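Review note on the `_mark_dirty` / `_mark_clean` trick above: rebinding `format` on the instance avoids a dirty check on the hot path. The sketch below shows the same pattern in isolation. `LazyCompiler`, its `add` method and the `compile_count` counter are invented for illustration, and the "compilation" is just wrapping static strings in closures.

```python
class LazyCompiler(object):
    """Hypothetical miniature of the compile-on-first-format pattern."""

    def __init__(self, sources):
        self._sources = dict(sources)
        self.compile_count = 0  # only here to make the behaviour observable
        self._mark_dirty()

    def _mark_dirty(self):
        # The next format() call goes through the compiling path.
        self.format = self._compile_and_format

    def add(self, key, text):
        self._sources[key] = text
        self._mark_dirty()

    def _compile(self):
        self.compile_count += 1
        # Bind each value as a default argument so every closure keeps its own.
        self._compiled = {
            key: (lambda args, errors, value=value: value)
            for key, value in self._sources.items()
        }
        # Later format() calls skip straight to the fast path.
        self.format = self._format

    def _compile_and_format(self, key, args=None):
        self._compile()
        return self._format(key, args)

    def _format(self, key, args=None):
        errors = []
        return self._compiled[key](args, errors), errors


bundle = LazyCompiler({"hello": "Hello"})
first = bundle.format("hello")   # triggers compilation
second = bundle.format("hello")  # uses the already-compiled function
```

The instance attribute shadows any class-level `format`, so once compilation has happened each call is a plain method lookup with no `_is_dirty` branch. The cost is that `format` is no longer a normal method, which can surprise subclasses that try to override it.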