27 March 2017

How to write (or customise) a templating engine?

While working on extending a web templating system to support a new programming language, I collected some features I wanted to implement or preserve. These features have come from my experience with various templating systems in PHP, Perl and JavaScript, and a need to create a secure and high-performance software environment for web developers.

At times the following sound like prescriptions, but in reality they just call attention to danger areas you might want to think about either when developing a new templating system, or adopting an existing one. One can freely go against these and still create a fast and correct system if one mitigates their effects by proper escaping or other workarounds.

Code security


Paying attention to what is evaluated, and in what way, can help prevent nasty surprises like code injection attacks and data leaks caused by the system interpolating into your template something other than what you intended. This general requirement surfaces in many different ways. These are closely related, but let's discuss them separately.

Avoid treating code as templating


Imagine we have a PHP templating system that replaces variables or placeholders enclosed in double square brackets, e.g.

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<h1>[[title]]</h1>

Then a requirement comes in to make the title inside the page all-caps. One may be tempted to extend the templating system to support arbitrary PHP code to make it more powerful, e.g. write

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<h1><?php echo strtoupper("[[title]]"); ?></h1>

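As the replacement of placeholders must precede running the PHP code, you can see how dangerous this can be if the title happens to be "); echo(file_get_contents('passwords.php') . ". After substitution the PHP statement reads echo strtoupper(""); echo(file_get_contents('passwords.php') . ""); which prints the contents of the file into the page. You can also run into trouble if the placeholder delimiters can occur naturally in your programming language.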

You can mitigate these dangers by, e.g. escaping all double and single quotes when replacing placeholders, but in my view there are so many edge cases in systems like this that it is best to avoid this altogether.
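
To make the order of evaluation concrete, here is a minimal JavaScript sketch of the placeholder step alone. The fillPlaceholders function is hypothetical and merely stands in for whatever the templating system does before PHP runs; the sketch only builds the resulting string and never executes it.

```javascript
// Hypothetical first pass: replace [[name]] placeholders by plain string substitution.
function fillPlaceholders(template, data) {
  return template.replace(/\[\[(\w+)\]\]/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(data, name) ? String(data[name]) : match
  );
}

const template = '<h1><?php echo strtoupper("[[title]]"); ?></h1>';

// A title taken from an untrusted source, as in the example above.
const maliciousTitle = '"); echo(file_get_contents(\'passwords.php\') . "';

// This string is what the PHP interpreter would receive in the second pass.
const phpSource = fillPlaceholders(template, { title: maliciousTitle });
console.log(phpSource);
// <h1><?php echo strtoupper(""); echo(file_get_contents('passwords.php') . ""); ?></h1>
```

The substitution step itself behaves exactly as designed; the damage comes from handing its output to a second interpreter.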

Parse templates only once


This is a more general version of the previous point. Notice that in the previous example we in fact parsed a template twice: once for placeholders, and then for PHP code. This is not limited to code, though. Imagine that we have the [[...]] templating system and we are successfully building a webpage from

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<div class="header">User: [[username]]</div>
<h1>[[title]]</h1>

As the next step, we would like to introduce caching to speed rendering up, and we realise that most of the page content stays the same while only the username may be changing. So we could introduce a new kind of placeholder that is not replaced in the first instance, but only after retrieving the page from cache. Given the template:

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<div class="header">User: ((username))</div>
<h1>[[title]]</h1>

we cache the string

<html>
<head>
<title>A4 paper, pack of 500</title>
</head>
<body>
<div class="header">User: ((username))</div>
<h1>A4 paper, pack of 500</h1>

I call this technique "caching a hole", and our plan is to replace ((...)) placeholders after caching, just before serving the page. Again you can see that this system can break if a title or any other content you have no control over contains "((".

The problem here again is that the template was parsed twice for placeholders or other active elements, and by the second pass it also contained interpolated content, which was parsed along with the template.

The solution is to parse the template only once, and never re-parse an already parsed component, including any external data that has been incorporated into it. This is, in fact, another facet of the same problem:

Don't parse user-submitted content


(Here, this includes e.g. data the marketing team enters into the CMS database.)
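
The parse-once rule from the last few sections can be sketched roughly as follows. This is only an illustration under my own assumptions: the parseTemplate and render names and the node shapes are made up, and HTML escaping is left out to keep the focus on parsing.

```javascript
// Parse the template source exactly once into a list of nodes.
// Only the template author's text is ever scanned for [[...]] delimiters.
function parseTemplate(source) {
  const nodes = [];
  const pattern = /\[\[(\w+)\]\]/g;
  let lastIndex = 0;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    if (match.index > lastIndex) {
      nodes.push({ type: 'text', value: source.slice(lastIndex, match.index) });
    }
    nodes.push({ type: 'placeholder', name: match[1] });
    lastIndex = pattern.lastIndex;
  }
  if (lastIndex < source.length) {
    nodes.push({ type: 'text', value: source.slice(lastIndex) });
  }
  return nodes;
}

// Rendering only concatenates values; interpolated data is never re-scanned.
function render(nodes, data) {
  return nodes
    .map((node) => (node.type === 'text' ? node.value : String(data[node.name] ?? '')))
    .join('');
}

const nodes = parseTemplate('<h1>[[title]]</h1><p>[[body]]</p>');
const html = render(nodes, { title: 'Using [[body]] and ((username))', body: 'Hello' });
console.log(html);
// <h1>Using [[body]] and ((username))</h1><p>Hello</p>
```

Delimiters that arrive inside the data stay inert, because by the time the data is seen the parsing is already over.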

Assist with avoiding unescaped or double-escaped content in the output


Apart from a secure language and way of parsing, a templating system should also assist developers with ensuring that its output is secure, too. It should prevent (or help avoid, or warn about) unescaped user-submitted content from appearing in its output, which would easily make a website open to code injection attacks (with, e.g. a search string <script>document.location = 'http://fakebank.com';</script> shared in a URL). For convenience, it should also help to avoid escaping content multiple times, which is usually not what a developer intended to do.

One way to keep track of what has already been escaped and what hasn't, especially in more complicated templating systems capable of producing output in different languages (HTML, CSS, plain text, JavaScript, etc.), is to use objects that store this information as metadata attached to the piece of content.

In other systems only capable of producing output in one language (e.g. HTML), it may be enough to simply escape everything, but there still needs to be a mechanism to mark some already parsed content as safe and prevent it from being escaped again.

It is worth mentioning that HTML, strictly speaking, uses two languages, HTML content and XML attribute values. However, the two can be treated in the same way by over-escaping, that is, by escaping quotation marks even in regular HTML content.
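
As a rough sketch of the "escape everything, but mark safe content" approach for an HTML-only system, consider the following. The SafeHtml class and escapeHtml function are names I made up for the illustration and do not come from any particular library.

```javascript
// Wrapper marking a string as already safe for HTML output (hypothetical name).
class SafeHtml {
  constructor(value) {
    this.value = value;
  }
  toString() {
    return this.value;
  }
}

const HTML_ENTITIES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Escape plain strings; pass SafeHtml through untouched so nothing is escaped twice.
// Quotes are always escaped ("over-escaping"), so the result can be used both in
// element content and inside quoted attribute values.
function escapeHtml(input) {
  if (input instanceof SafeHtml) {
    return input;
  }
  return new SafeHtml(String(input).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]));
}

const once = escapeHtml('<b>"Tom & Jerry"</b>');
const twice = escapeHtml(once); // no-op: already marked as safe
console.log(String(twice));
// &lt;b&gt;&quot;Tom &amp; Jerry&quot;&lt;/b&gt;
```

A system producing several output languages would need one such marker per language, or a field on the object recording which escaping has been applied.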

Speed and caching


The next area of concern is caching, which helps make a templating system as efficient as possible. On many websites (perhaps with the exception of web apps) the content generated on the server side is largely the same for identical requests, yet it has to be generated again and again, which is clearly wasteful on high-traffic websites. Caching the whole or parts of the output is a natural solution: it speeds up templating and saves electricity.

Allow caching half-ready templates


This is one of what I see as the two main approaches to caching content while still allowing changes (like the name of the logged-in user) to be introduced after caching. As suggested above, HTML pages are often largely similar for the same request but have small variations. These can include timestamps, randomised content (A/B tests), and content dependent on the current user like their name, favourites, or tailored suggestions.

Whether it is worth caching against the user (e.g. creating separate cache elements for each user) depends on a number of factors, like the cost of generating the content, cache hit rates, and the number of users. In any case, it is always useful for a templating system to allow introducing minor changes after caching.

Two main ways of doing so are caching components and caching "holes." In the first case, we cache parts of the page which are fully evaluated. The correct components are then selected and assembled for each request, allowing for variations. In the second case we "cache a hole": as described above, we cache a partially evaluated piece of templating that still has placeholders or other active elements.

Supporting this places some additional constraints on the templating system, which are connected to the suggestion that it should parse templates only once. Some templating engines indeed parse templates only once, and they convert them into code and function calls. This is a very efficient solution, although in my view these systems (like server-side React.js) struggle with caching "holes." Parsing templates into an object structure may support caching holes more easily, if you can arrange for the objects to be serialised.
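
To illustrate how an object structure can make caching holes safe, here is a minimal sketch. The node types 'early' and 'late', the function names and the cache key are all assumptions of mine, a Map stands in for a real cache, and HTML escaping is omitted for brevity.

```javascript
// Parse once into nodes; [[name]] is filled before caching, ((name)) after.
function parse(source) {
  const nodes = [];
  const pattern = /\[\[(\w+)\]\]|\(\((\w+)\)\)/g;
  let last = 0;
  let m;
  while ((m = pattern.exec(source)) !== null) {
    if (m.index > last) nodes.push({ type: 'text', value: source.slice(last, m.index) });
    nodes.push(m[1] !== undefined ? { type: 'early', name: m[1] } : { type: 'late', name: m[2] });
    last = pattern.lastIndex;
  }
  if (last < source.length) nodes.push({ type: 'text', value: source.slice(last) });
  return nodes;
}

// First stage: turn early placeholders into text nodes, keep the holes as nodes.
function fillEarly(nodes, data) {
  return nodes.map((n) => (n.type === 'early' ? { type: 'text', value: String(data[n.name]) } : n));
}

// Second stage: fill the holes. Text nodes are copied verbatim, never re-parsed.
function fillLate(nodes, data) {
  return nodes.map((n) => (n.type === 'late' ? String(data[n.name]) : n.value)).join('');
}

const cache = new Map();
const half = fillEarly(parse('<div>User: ((username))</div><h1>[[title]]</h1>'), {
  title: 'Paper ((A4)), pack of 500',
});
cache.set('product:42', JSON.stringify(half)); // plain objects serialise cleanly

const page = fillLate(JSON.parse(cache.get('product:42')), { username: 'alice' });
console.log(page);
// <div>User: alice</div><h1>Paper ((A4)), pack of 500</h1>
```

Because the cached value is a list of nodes rather than a string, a title containing "((" cannot be mistaken for a hole after it has been interpolated.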

Side channels and side effects: the correctness of caching


We also need to investigate what parts or templates can be cached at all. For the output of any code to be cached transparently, it needs to adhere to the functional paradigm: its output must depend solely on a set of well-defined inputs (like mathematical functions - no side channels), and we must be able to recreate all its output from our cache (no side effects).

The first constraint is necessary so that we can represent all relevant inputs in the cache key and avoid cache contamination. Suppose, for example, that a header template object renders the name of an item for sale and takes the item name as its input, but internally also reads the currently logged-in user from a global variable in order to add their name. Caching this object against its arguments (the item name only) would be incorrect and would leak usernames to other users. Ideally a templating system should facilitate enumerating all relevant inputs and discourage reaching out to other application data in templating code.
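
A minimal sketch of caching against all inputs might look like this. The renderHeader and cachedRender functions are hypothetical, a Map stands in for a real cache, and escaping is again omitted.

```javascript
const cache = new Map();
let misses = 0; // counts how often the template actually had to be rendered

// Render function without side channels: the output depends on `inputs` only.
function renderHeader(inputs) {
  return `<div class="header">${inputs.itemName} - User: ${inputs.username}</div>`;
}

// The cache key is derived from every input, so two users never share an entry.
// Keys are sorted so that the order of properties does not matter.
function cachedRender(name, renderFn, inputs) {
  const sortedKeys = Object.keys(inputs).sort();
  const key = name + ':' + JSON.stringify(sortedKeys.map((k) => [k, inputs[k]]));
  if (!cache.has(key)) {
    misses += 1;
    cache.set(key, renderFn(inputs));
  }
  return cache.get(key);
}

const forAlice = cachedRender('header', renderHeader, { itemName: 'A4 paper', username: 'alice' });
const forBob = cachedRender('header', renderHeader, { itemName: 'A4 paper', username: 'bob' });
const aliceAgain = cachedRender('header', renderHeader, { username: 'alice', itemName: 'A4 paper' });
console.log(misses); // 2
```

Had renderHeader read the username from a global instead, the key would have contained the item name only, and Bob would have been served Alice's header.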

The second constraint is perhaps less relevant in the case of pure templating (the "view" layer), as then the output is usually restricted to HTML or half-ready templates. Still, if any controller-like logic finds its way into our templates, or we would like to introduce caching in the controller layer, too, we need to make sure that the code, when it runs, has no other effect than returning some values that we can retrieve from the cache. It cannot, for example, access a global response object and inject a cookie, as this would simply not happen when the return value of the code is retrieved from the cache.

A solution to this could be to allow cacheable code blocks to return any kind of object or data structure, in which, apart from return values and templating, side effects such as cookies to be set, or redirect or error pages to be rendered, can also be represented. This freedom can have useful applications inside the view layer, too, as metadata (e.g. success flags) returned with templating can be very useful when processing a template further. While not strictly necessary, I think this flexibility can make a templating system a considerably more powerful tool.
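
One way to picture side effects represented as data is the sketch below. The shape of the returned object (html, effects, meta), the renderLogin and respond functions, and the plain-object stand-in for a response are all assumptions made for the illustration.

```javascript
const cache = new Map();

// Cacheable block: returns markup plus a description of its side effects
// instead of touching a global response object.
function renderLogin(inputs) {
  return {
    html: `<p>Welcome back, ${inputs.username}</p>`,
    effects: [{ type: 'setCookie', name: 'lastVisit', value: inputs.date }],
    meta: { success: true },
  };
}

// Applies the recorded effects to a response; runs on cache hits and misses alike.
function respond(response, key, renderFn, inputs) {
  if (!cache.has(key)) {
    cache.set(key, JSON.stringify(renderFn(inputs)));
  }
  const result = JSON.parse(cache.get(key));
  for (const effect of result.effects) {
    if (effect.type === 'setCookie') {
      response.cookies[effect.name] = effect.value;
    }
  }
  response.body = result.html;
  return result.meta;
}

const first = { cookies: {}, body: '' };
const second = { cookies: {}, body: '' };
const inputs = { username: 'alice', date: '2017-03-27' };
respond(first, 'login:alice:2017-03-27', renderLogin, inputs);
const meta = respond(second, 'login:alice:2017-03-27', renderLogin, inputs); // served from cache
console.log(second.cookies.lastVisit); // 2017-03-27
```

The cookie is set on the second response even though renderLogin did not run again, because the effect travels with the cached value.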

Usability


Define your language


Whatever you choose to parse your template into (e.g. objects or code), it is always a good idea to clearly define, in advance, your templating language. Then you can parse your templates using specially crafted regexes, or a lexer / parser combination using an LL parser or similar.

The definition can help one to see where it could go wrong. Are there any uncertainties in meaning? Can we, for example, always distinguish between placeholders and calls to helper functions? Creating a templating system is similar to creating a high-level programming language, and the more redundancy you build in, the more linting you can do, and the less likely it is that a developer will make a mistake by meaning one thing but writing another, which still looks correct.

For example, I suggest clearly distinguishing between placeholders (bringing in data) and calls to helper functions in the templating syntax, and not relying on what functions happen to be defined when the template is processed. If both placeholders and function calls look like [[...]], how do you address the original "name" datum masked by the helper function in "data-name" in the following example?

Templating.call('HeaderTemplate',
    { name: "John Smith", gender: "m", email: "js@me.com" }
);

--- HeaderTemplate ---

function name (templateArgs) {
    return (templateArgs.gender === 'm' ? 'Mr' : 'Ms')
        + ' ' + templateArgs.name;
}

<div class="header" data-name="???">[[name]] - [[email]]</div>

The solution Handlebars uses is to extend its path syntax for addressing data in the incoming data structure, so that "./name" accesses a masked datum. But I believe a developer knows whether they mean a piece of data or a function call, and should indicate this to the system. For example, add "()" to function calls and disallow parentheses as parts of argument names:

<div class="header" data-name="[[name]]">[[name()]] - [[email]]</div>
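
A small sketch of how such a syntax could be processed is given below. The render function and the helpers object are hypothetical, the helper mirrors the name function from the example above, and escaping is left out.

```javascript
// Helper functions available to templates (mirrors the `name` helper above).
const helpers = {
  name: (args) => (args.gender === 'm' ? 'Mr' : 'Ms') + ' ' + args.name,
};

const own = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// [[name]] always reads data; [[name()]] always calls a helper.
function render(source, args) {
  return source.replace(/\[\[(\w+)(\(\))?\]\]/g, (match, id, isCall) => {
    if (isCall) {
      if (!own(helpers, id)) throw new Error(`Unknown helper: ${id}`);
      return helpers[id](args);
    }
    if (!own(args, id)) throw new Error(`Unknown datum: ${id}`);
    return String(args[id]);
  });
}

const out = render(
  '<div class="header" data-name="[[name]]">[[name()]] - [[email]]</div>',
  { name: 'John Smith', gender: 'm', email: 'js@me.com' }
);
console.log(out);
// <div class="header" data-name="John Smith">Mr John Smith - js@me.com</div>
```

Since the two forms can never be confused, a mistyped helper or datum name can be reported as an error instead of silently resolving to the other kind.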

Avoid clashes with the output language


Creating a specification for your language also allows you to avoid any clashes with the language (text, HTML, CSS, etc.) the templating system is supposed to generate.

Clashes can limit what you can do in your templating language. Imagine that one decides to use XML tags for all placeholders and require that any template is well-formed XML. This works well in cases like

<div class="header"><Name/> <Email/></div>

but how do you interpolate data into an XML attribute? <div data-name="<Name/>"> would clearly fail.

Clashes with the output language can also upset the syntax highlighters and linters that IDEs provide to help developers avoid bugs and mistakes that can be difficult and costly to trace.

Easy internationalisation


I believe every modern templating system should support internationalisation out of the box. As elsewhere in coding, any solution used should avoid duplication; a typical example of such duplication is using the original text as a key to look up the translated versions. That approach makes it necessary to update code in multiple places merely to fix a typo or capitalisation issue in the original text, which can easily break the link between the texts and lose the translated content.
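
As a sketch of the alternative, messages can be looked up by stable identifiers, so that the original text lives in exactly one place. The keys, texts and the translate function below are made up for the illustration.

```javascript
// Message catalogues keyed by stable identifiers (hypothetical keys and texts).
const messages = {
  en: { 'basket.add': 'Add to basket', 'basket.count': 'Items in basket: {count}' },
  de: { 'basket.add': 'In den Warenkorb', 'basket.count': 'Artikel im Warenkorb: {count}' },
};

// Look up a message by key, fall back to English, then fill {param} slots.
function translate(locale, key, params = {}) {
  const catalogue = messages[locale] || {};
  const text = catalogue[key] ?? messages.en[key];
  if (text === undefined) throw new Error(`Missing message: ${key}`);
  return text.replace(/\{(\w+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(params, name) ? String(params[name]) : match
  );
}

console.log(translate('de', 'basket.count', { count: 3 }));
// Artikel im Warenkorb: 3
```

Fixing a typo in the English text now touches only the en catalogue, and the link to the translations stays intact.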

Universality


The final, and perhaps least important, feature I like to see in any templating system is universality. I think a templating system should be capable of producing any output, at least in its target language(s). In extremis, if the output languages allow, it should be capable of producing content that looks just like its input. Any arbitrary limitation on this may signal a templating language or parsing system that was not adequately designed, and one can be sure to hit this limitation sooner rather than later. For example, if the templating language allows whitespace around placeholders for readability, but removes it from the output, is it possible to add the whitespace back if needed? Can whitespace be added to the end of a templating unit? Can one add HTML comments or arbitrary XML attributes?

Different escaping mechanisms


In particular, I like to see the templating system support the different escaping mechanisms that are relevant for the output language. For HTML, useful translations apart from escaping XML metacharacters and quotes are URL escaping (for interpolating URL query values) and JS escaping (for interpolating data into in-page JS code). Easy access to these means that developers are more likely to use them, which leads to fewer errors and a more robust and resilient system.
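
The two additional mechanisms could be sketched along these lines. The function names are hypothetical; encodeURIComponent and JSON.stringify are standard JavaScript built-ins, while the extra replacements in escapeJsValue are one common precaution rather than a complete treatment.

```javascript
// URL escaping for query values: encodeURIComponent is built into JavaScript.
function escapeUrlValue(value) {
  return encodeURIComponent(String(value));
}

// JS escaping for data placed inside an inline <script> block: serialise as JSON,
// then neutralise characters that could end the script element or the line.
function escapeJsValue(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')
    .replace(/>/g, '\\u003e')
    .replace(/&/g, '\\u0026')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

const link = '/search?q=' + escapeUrlValue('paper & pens');
const hostile = '</script><script>alert(1)';
const jsLiteral = escapeJsValue(hostile);
console.log(link); // /search?q=paper%20%26%20pens
console.log('<script>var query = ' + jsLiteral + ';</script>');
```

Note that a URL built this way still needs HTML escaping when it is written into an attribute such as href, since the two mechanisms protect different layers.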

Conclusion


The above list cannot be complete, but I hope these points will help you design or choose your next templating system. Happy templating!
