Compiler implementation for FluentBundle
spookylukey committed Jun 17, 2019
1 parent 20f3e25 commit d1481d6
Showing 31 changed files with 4,142 additions and 95 deletions.
6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@
.tox
*.pyc
.eggs/
*.pot
*.mo
*.po
.pytest_cache
*.egg-info/
_build
.benchmarks
.hypothesis
1 change: 1 addition & 0 deletions fluent.runtime/CHANGELOG.rst
@@ -8,6 +8,7 @@ fluent.runtime development version (unreleased)
  terms.
* Refined error handling regarding function calls to be more tolerant of errors
  in FTL files, while silencing developer errors less.
* Added ``CompilingFluentBundle`` implementation.

fluent.runtime 0.1 (January 21, 2019)
-------------------------------------
115 changes: 115 additions & 0 deletions fluent.runtime/docs/implementations.rst
@@ -0,0 +1,115 @@
FluentBundle Implementations
============================

python-fluent comes with two implementations of ``FluentBundle``. The default is
``fluent.runtime.InterpretingFluentBundle``, which is what you get under the
alias ``fluent.runtime.FluentBundle``. It implements an interpreter for the FTL
Abstract Syntax Tree.

The alternative is ``fluent.runtime.CompilingFluentBundle``, which works by
compiling a set of FTL messages to a set of Python functions using Python `ast
<https://docs.python.org/3/library/ast.html>`_. This results in very good
performance (see below for more info).
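
As a rough illustration of the idea (this is not the code in ``fluent.runtime.compiler``), a message that is just a static string could be turned into a Python callable by building ``ast`` nodes directly and handing them to ``compile`` and ``exec``. The helper name ``compile_static_message`` and the generated variable ``message`` are hypothetical, and the sketch assumes Python 3.8 or later. The ``(args, errors)`` call signature mirrors how ``CompilingFluentBundle._format`` calls the compiled functions.

```python
import ast


def compile_static_message(text):
    """Build `message = lambda args, errors: <text>` as an AST and execute it.

    Hypothetical helper for illustration only; assumes Python 3.8+.
    """
    params = ast.arguments(
        posonlyargs=[],
        args=[ast.arg(arg="args"), ast.arg(arg="errors")],
        vararg=None,
        kwonlyargs=[],
        kw_defaults=[],
        kwarg=None,
        defaults=[],
    )
    # The message text becomes a constant node; it is never parsed as source.
    func = ast.Lambda(args=params, body=ast.Constant(value=text))
    tree = ast.Module(
        body=[
            ast.Assign(
                targets=[ast.Name(id="message", ctx=ast.Store())],
                value=func,
            )
        ],
        type_ignores=[],
    )
    # Hand-built nodes carry no line/column info; compile() requires it.
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<ftl>", "exec"), namespace)
    return namespace["message"]


hello = compile_static_message("Hello, world!")
errors = []
value = hello({}, errors)
```

After compilation, formatting a static message is a single function call that returns a constant, which is why this case can be so fast.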

While the two implementations have the same API and return the same values in
most situations, there are some differences:

* ``InterpretingFluentBundle`` has some protection against malicious FTL input
  which could attempt things like a `billion laughs attack
  <https://en.wikipedia.org/wiki/Billion_laughs_attack>`_ to consume a large
  amount of memory or CPU time. For the sake of performance,
  ``CompilingFluentBundle`` does not have these protections.

  Both implementations detect and stop infinite recursion
  (``CompilingFluentBundle`` does this at compile time). This matters because
  accidental cyclic references in messages would otherwise cause infinite
  loops and memory exhaustion.

* While the error handling strategy for both implementations is the same, when
  errors occur (e.g. a missing value in the arguments dictionary, a cyclic
  reference, or a string passed to the ``NUMBER()`` builtin), the exact errors
  returned by ``format`` may differ between the two implementations.

  Also, when an error occurs, in some cases (such as a cyclic reference), the
  error string embedded into the returned formatted message may be different.
  For cases where there is no error, the output is identical (or should be).

  Neither implementation guarantees that the exact errors returned will be the
  same between different versions of ``fluent.runtime``.
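
The compile-time check for cyclic references mentioned above can be pictured as a depth-first search over the "message refers to message" graph. The following is a minimal sketch, assuming the references have already been extracted into a plain dict. The function name ``find_cycle`` and the dict layout are hypothetical and are not how ``fluent.runtime`` represents messages.

```python
def find_cycle(references):
    """Return one reference cycle as a list of message IDs, or None.

    `references` maps each message ID to the IDs it refers to
    (a hypothetical, simplified stand-in for parsed FTL messages).
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}
    stack = []

    def visit(message_id):
        state[message_id] = IN_PROGRESS
        stack.append(message_id)
        for target in references.get(message_id, ()):
            if state.get(target) == IN_PROGRESS:
                # Back edge: `target` is still being expanded, so the part of
                # the stack from `target` onwards forms a cycle.
                return stack[stack.index(target):]
            if target not in state:
                cycle = visit(target)
                if cycle is not None:
                    return cycle
        stack.pop()
        state[message_id] = DONE
        return None

    for message_id in references:
        if message_id not in state:
            cycle = visit(message_id)
            if cycle is not None:
                return cycle
    return None


cyclic = find_cycle({"foo": ["bar"], "bar": ["foo"]})
acyclic = find_cycle({"foo": ["bar"], "bar": []})
```

Because the check runs once over the whole message set, a compiler can reject or neutralise cyclic messages before any of them is formatted.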

Performance
-----------

Due to the strategy of compiling to Python, ``CompilingFluentBundle`` has very
good performance, especially for the simple common cases. The
``tools/benchmark/gettext_comparisons.py`` script includes some benchmarks that
compare speed to GNU gettext as a reference. Below is a rough summary:

For the simple but very common case of a message defining a static string,
``CompilingFluentBundle.format`` is very close to GNU gettext, or much faster,
depending on whether you are using Python 2 or 3, and your Python implementation
(e.g. CPython or PyPy). (The worst case we found was 5% faster than gettext on
CPython 2.7, and the best case was about 3.5 times faster for PyPy2 5.1.2). For
cases of substituting a single string into a message,
``CompilingFluentBundle.format`` is between 30% slower and 70% faster than an
equivalent implementation using GNU gettext and Python ``%`` interpolation.

For messages where plural rules are involved, ``CompilingFluentBundle`` can
currently be significantly slower than GNU gettext, partly because it uses
plural rules from CLDR, which can be much more complex (and more correct) than
the ones gettext normally uses. Further work could be done to optimize some of
these cases.

For more complex operations (for example, using locale-aware date and number
formatting), formatting messages can take a lot longer. Comparisons to GNU
gettext fall down at this point, because it doesn't include a lot of this
functionality. However, usually these types of messages make up a small fraction
of the number of internationalized strings in an application.

``InterpretingFluentBundle`` is, as you would expect, much slower than
``CompilingFluentBundle``, often by a factor of 10. However, when there are a
large number of messages, ``CompilingFluentBundle`` will be a lot slower to
format the first message, because it compiles all the messages first.
``InterpretingFluentBundle`` has no compilation step and tries to keep
up-front work to a minimum (sometimes at the cost of runtime performance).
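
The "compile everything on the first ``format`` call" behaviour can be sketched with a small stand-in class. This is an illustrative sketch only: ``LazyCompiledFormatter``, its ``add`` method and the use of ``str.format_map`` as the "compiled" form are all assumptions made for the example, not part of ``fluent.runtime``. It uses the same trick as the commit of rebinding ``format`` on the instance, so the hot path never has to test a dirty flag.

```python
class LazyCompiledFormatter(object):
    """Hypothetical stand-in: compiles all templates on first use."""

    def __init__(self, templates):
        self._templates = dict(templates)
        self.compile_count = 0
        self._mark_dirty()

    def _mark_dirty(self):
        # Rebinding the instance attribute routes the next call through
        # the compile step.
        self.format = self._compile_and_format

    def add(self, key, template):
        self._templates[key] = template
        self._mark_dirty()

    def _compile(self):
        self.compile_count += 1
        # "Compilation" here is just binding str.format_map per template.
        self._compiled = {
            key: template.format_map
            for key, template in self._templates.items()
        }
        # From now on, calls go straight to the fast path.
        self.format = self._format

    def _compile_and_format(self, key, args=None):
        self._compile()
        return self._format(key, args)

    def _format(self, key, args=None):
        return self._compiled[key](args or {})


formatter = LazyCompiledFormatter({"hello": "Hello, {name}!"})
first = formatter.format("hello", {"name": "World"})   # triggers compilation
second = formatter.format("hello", {"name": "again"})  # fast path only
```

The cost of the first call grows with the number of stored templates, while every later call is a dictionary lookup plus one function call.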


Security
--------

You should not pass untrusted FTL code to ``FluentBundle.add_messages``. This
is because carefully constructed messages could potentially cause large resource
usage (CPU time and memory). The ``InterpretingFluentBundle`` implementation
does have some protection against these attacks, although it may not be
foolproof, while ``CompilingFluentBundle`` does not have any protection against
these attacks, either at compile time or run time.
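
To see why a small amount of hostile input can be expensive, consider messages that each refer to the previous one several times. The sketch below uses a simplified, hypothetical in-memory representation (lists of literal strings and ``Ref`` markers, not real FTL or the ``fluent.syntax`` AST) and computes how long the fully expanded output would be without actually building it. It assumes the references are acyclic.

```python
from collections import namedtuple

# Hypothetical marker for "insert the value of another message here".
Ref = namedtuple("Ref", "message_id")


def expanded_length(messages, message_id, cache=None):
    """Length of a message once every reference is expanded (acyclic input)."""
    if cache is None:
        cache = {}
    if message_id not in cache:
        total = 0
        for part in messages[message_id]:
            if isinstance(part, Ref):
                total += expanded_length(messages, part.message_id, cache)
            else:
                total += len(part)
        cache[message_id] = total
    return cache[message_id]


def build_laughs(depth, fanout=10):
    """Each level refers to the previous level `fanout` times."""
    messages = {"lol0": ["lol"]}
    for level in range(1, depth + 1):
        messages["lol%d" % level] = [Ref("lol%d" % (level - 1))] * fanout
    return messages


small = expanded_length(build_laughs(3), "lol3")
large = expanded_length(build_laughs(9), "lol9")
```

With nine levels and a fan-out of ten, fewer than a hundred references in the source expand to three billion characters, which is the kind of growth an implementation without limits will try to materialise.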

``CompilingFluentBundle`` works by compiling FTL messages to Python `ast
<https://docs.python.org/3/library/ast.html>`_, which is passed to `compile
<https://docs.python.org/3/library/functions.html#compile>`_ and then `exec
<https://docs.python.org/3/library/functions.html#exec>`_. The use of ``exec``
like this is an established technique for high performance Python code, used in
template engines like Mako, Jinja2 and Genshi.

However, there can understandably be some concerns around the use of ``exec``,
which can open up remote code execution vulnerabilities. If this is of
paramount concern to you, you should consider using
``InterpretingFluentBundle`` instead (which is the default).

To reduce the possibility of our use of ``exec`` harbouring security issues, the
following things are in place:

1. We generate `ast <https://docs.python.org/3/library/ast.html>`_ objects and
   not strings. This greatly reduces the security problems, since there is no
   possibility of a vulnerability due to incorrect string interpolation.

2. We use ``exec`` only on AST derived from FTL files, never on "end user input"
   (such as the arguments passed into ``FluentBundle.format``). This reduces the
   attack vector to only the situation where the source of your FTL files is
   potentially malicious or compromised.

3. We employ defence-in-depth techniques in our code generation and compiler
   implementation to reduce the possibility of cleverly crafted FTL code
   producing security holes, and ensure these techniques have full test
   coverage.
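
The first point in the list above can be demonstrated with a small comparison. This is a sketch, not code from ``fluent.runtime``: both helper names are hypothetical, and the deliberately naive ``build_via_string`` exists only to show the failure mode that generating ``ast`` nodes avoids. It assumes Python 3.8 or later.

```python
import ast


def build_via_string(text):
    """UNSAFE on purpose: splices the text into Python source code."""
    source = 'result = "%s"' % text
    namespace = {}
    exec(source, namespace)
    return namespace["result"]


def build_via_ast(text):
    """Places the text in a constant node, so it is never parsed as code."""
    tree = ast.Module(
        body=[
            ast.Assign(
                targets=[ast.Name(id="result", ctx=ast.Store())],
                value=ast.Constant(value=text),
            )
        ],
        type_ignores=[],
    )
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<generated>", "exec"), namespace)
    return namespace["result"]


# A harmless stand-in for a hostile translation string: it closes the quote
# and smuggles in an expression.
hostile = '" + "injected".upper() + "'

from_string = build_via_string(hostile)  # the smuggled expression runs
from_ast = build_via_ast(hostile)        # the text comes back unchanged
```

In the string-based version the embedded expression is evaluated, whereas the AST-based version returns the original text verbatim because the text only ever exists as data inside a node.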
14 changes: 14 additions & 0 deletions fluent.runtime/docs/usage.rst
@@ -93,6 +93,19 @@ module or the start of your repl session:
    from __future__ import unicode_literals

CompilingFluentBundle
~~~~~~~~~~~~~~~~~~~~~

In addition to the default ``FluentBundle`` implementation, there is also a
high-performance implementation that compiles to Python AST. You can use it in
exactly the same way:

.. code-block:: python

    from fluent.runtime import CompilingFluentBundle as FluentBundle

Be sure to check the notes on :doc:`implementations`, especially the security
section.

Numbers
~~~~~~~

@@ -225,5 +238,6 @@ Help with the above would be welcome!
Other features and further information
--------------------------------------

* :doc:`implementations`
* :doc:`functions`
* :doc:`errors`
113 changes: 97 additions & 16 deletions fluent.runtime/fluent/runtime/__init__.py
@@ -1,19 +1,23 @@
from __future__ import absolute_import, unicode_literals

from collections import OrderedDict

import babel
import babel.numbers
import babel.plural

from fluent.syntax import FluentParser
from fluent.syntax.ast import Message, Term
from fluent.syntax.ast import Junk, Message, Term

from .builtins import BUILTINS
from .compiler import compile_messages
from .errors import FluentDuplicateMessageId, FluentJunkFound
from .prepare import Compiler
from .resolver import ResolverEnvironment, CurrentEnvironment
from .resolver import CurrentEnvironment, ResolverEnvironment
from .utils import ATTRIBUTE_SEPARATOR, TERM_SIGIL, ast_to_id, native_to_fluent


class FluentBundle(object):
class FluentBundleBase(object):
    """
    Message contexts are single-language stores of translations. They are
    responsible for parsing translation resources in the Fluent syntax and can
@@ -33,27 +37,60 @@ def __init__(self, locales, functions=None, use_isolating=True):
            _functions.update(functions)
        self._functions = _functions
        self.use_isolating = use_isolating
        self._messages_and_terms = {}
        self._compiled = {}
        self._compiler = Compiler()
        self._messages_and_terms = OrderedDict()
        self._parsing_issues = []
        self._babel_locale = self._get_babel_locale()
        self._plural_form = babel.plural.to_python(self._babel_locale.plural_form)

    def add_messages(self, source):
        parser = FluentParser()
        resource = parser.parse(source)
        # TODO - warn/error about duplicates
        for item in resource.body:
            if isinstance(item, (Message, Term)):
                full_id = ast_to_id(item)
                if full_id not in self._messages_and_terms:
                if full_id in self._messages_and_terms:
                    self._parsing_issues.append((full_id, FluentDuplicateMessageId(
                        "Additional definition for '{0}' discarded.".format(full_id))))
                else:
                    self._messages_and_terms[full_id] = item
            elif isinstance(item, Junk):
                self._parsing_issues.append(
                    (None, FluentJunkFound("Junk found: " +
                                           '; '.join(a.message for a in item.annotations),
                                           item.annotations)))

    def has_message(self, message_id):
        if message_id.startswith(TERM_SIGIL) or ATTRIBUTE_SEPARATOR in message_id:
            return False
        return message_id in self._messages_and_terms

    def _get_babel_locale(self):
        for l in self.locales:
            try:
                return babel.Locale.parse(l.replace('-', '_'))
            except babel.UnknownLocaleError:
                continue
        # TODO - log error
        return babel.Locale.default()

    def format(self, message_id, args=None):
        raise NotImplementedError()

    def check_messages(self):
        """
        Check messages for errors and return as a list of two tuples:
           (message ID or None, exception object)
        """
        raise NotImplementedError()


class InterpretingFluentBundle(FluentBundleBase):

    def __init__(self, locales, functions=None, use_isolating=True):
        super(InterpretingFluentBundle, self).__init__(locales, functions=functions, use_isolating=use_isolating)
        self._compiled = {}
        self._compiler = Compiler()

    def lookup(self, full_id):
        if full_id not in self._compiled:
            entry_id = full_id.split(ATTRIBUTE_SEPARATOR, 1)[0]
@@ -83,11 +120,55 @@ def format(self, message_id, args=None):
                                  errors=errors)
        return [resolve(env), errors]

    def _get_babel_locale(self):
        for l in self.locales:
            try:
                return babel.Locale.parse(l.replace('-', '_'))
            except babel.UnknownLocaleError:
                continue
        # TODO - log error
        return babel.Locale.default()
    def check_messages(self):
        return self._parsing_issues[:]


class CompilingFluentBundle(FluentBundleBase):
    def __init__(self, *args, **kwargs):
        super(CompilingFluentBundle, self).__init__(*args, **kwargs)
        self._mark_dirty()

    def _mark_dirty(self):
        self._is_dirty = True
        # Clear out old compilation errors, they might not apply if we
        # re-compile:
        self._compilation_errors = []
        self.format = self._compile_and_format

    def _mark_clean(self):
        self._is_dirty = False
        self.format = self._format

    def add_messages(self, source):
        super(CompilingFluentBundle, self).add_messages(source)
        self._mark_dirty()

    def _compile(self):
        self._compiled_messages, self._compilation_errors = compile_messages(
            self._messages_and_terms,
            self._babel_locale,
            use_isolating=self.use_isolating,
            functions=self._functions)
        self._mark_clean()

    # 'format' is the hot path for many scenarios, so we try to optimize it. To
    # avoid having to check '_is_dirty' inside 'format', we switch 'format' from
    # '_compile_and_format' to '_format' when compilation is done. This gives us
    # about 10% improvement for the simplest (but most common) case of an
    # entirely static string.
    def _compile_and_format(self, message_id, args=None):
        self._compile()
        return self._format(message_id, args)

    def _format(self, message_id, args=None):
        errors = []
        return self._compiled_messages[message_id](args, errors), errors

    def check_messages(self):
        if self._is_dirty:
            self._compile()
        return self._parsing_issues + self._compilation_errors


FluentBundle = InterpretingFluentBundle