15 July 2017

AI is More Instinct than Intelligence

Lately I've been spending a lot of time with machine learning, neural nets, and the question of extracting and communicating their thinking so that humans can review their conclusions and/or learn from them. It is a hardly surprising observation that what these models do is primarily pattern matching: a system that assesses whether a bank transaction is fraudulent will flag up a transaction because, in some very complicated ways, it is similar to other transactions that resulted from fraud. Even unsupervised learning, which autonomously finds patterns in data, is doing just that: finding patterns.

If so, then everything these systems do is more like what we do by instinct, and not what we do via the high-level reasoning we traditionally call intelligence. In fact, I think this is partly why these systems appear so magical: from reading handwriting to driving cars, they do things we don't know how we do ourselves. They are getting pretty good at things we learn to do by instinct.

(Classic AI / ML did in fact concern itself with symbolic computation and reasoning, but the statistical models that are becoming so powerful today represent a shift from reasoning to instinctual decisions.)

This, in turn, is why it's so difficult to understand how an AI model arrives at a conclusion: it does so based on patterns and similarity, like our amygdala, and does not complement this with some kind of abstract reasoning. Even if such a layer merely rationalised a decision already made by instinct (which is probably how most humans arrive at "rational" decisions), adding it would be truly amazing, as it would allow us to communicate with an AI system and peek into its thought processes.

20 June 2017

Display sample predictions during training using Keras callbacks

I have seen many sample Keras scripts where training a model by calling model.fit is allowed to finish after a couple of epochs to create an opportunity to compute and display a sample of the predictions the model generates; model.fit is then called again in a loop to continue training. Unfortunately, this technique confuses TensorBoard as it tries to trace how the training progresses and how indicators like loss and accuracy change. This is because restarting the fitting makes TensorBoard think that new data is coming in at previous timepoints, leading to some very confused graphs.

Fortunately, it is very easy to set up a new Keras callback using LambdaCallback to display sample predictions without having to stop the training. We can have a function called after each epoch to print out the information; it can even check the number of the epoch and print not at the end of every epoch, but less frequently.

Using this callback together with the callback that logs data for TensorBoard is illustrated below. In this code we use an iterator to get the training data, but the same pattern can be used with pre-loaded data, too.
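
Below is a minimal, self-contained sketch of this setup; the toy model and data generator are stand-ins for real ones, and the epoch interval and log directory are arbitrary choices.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import LambdaCallback, TensorBoard

# Toy data generator and model, for illustration only
def data_generator(batch_size=16):
    while True:
        x = np.random.rand(batch_size, 4)
        y = (x.sum(axis=1) > 2).astype(float)
        yield x, y

model = Sequential([Dense(8, activation='relu', input_dim=4),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# A fixed sample to monitor throughout the training
x_sample, y_sample = next(data_generator())

def print_sample_predictions(epoch, logs):
    if epoch % 10 != 0:  # print only every 10th epoch
        return
    predictions = model.predict(x_sample[:5])
    print('\nEpoch {}: predictions {} targets {}'.format(
        epoch, predictions.ravel().round(2), y_sample[:5]))

callbacks = [
    LambdaCallback(on_epoch_end=print_sample_predictions),
    TensorBoard(log_dir='./logs'),
]

# A single uninterrupted call to fit keeps TensorBoard's timeline consistent
model.fit_generator(data_generator(), steps_per_epoch=20, epochs=50,
                    verbose=0, callbacks=callbacks)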

30 March 2017

Debian installer cannot find missing firmware

During installation, Debian may ask for (non-free) firmware for certain devices, like the wi-fi adapter. Unfortunately, even if you have located the file and copied it to a USB drive or similar, the installer often still cannot find it. The solution is to switch to a terminal (Alt-Ctrl-F2), mount the USB drive manually, and copy the file to /lib/firmware (create the directory if needed). See also the relevant section of the installation documentation of Debian jessie.

27 March 2017

How to write (or customise) a templating engine?

While working on extending a web templating system to support a new programming language, I collected some features I wanted to implement or preserve. These features come from my experience with various templating systems in PHP, Perl and JavaScript, and from the need to create a secure and high-performance software environment for web developers.

At times the following may sound like prescriptions, but in reality they just call attention to danger areas you might want to think about, either when developing a new templating system or when adopting an existing one. One can freely go against them and still create a fast and correct system, as long as their effects are mitigated by proper escaping or other workarounds.

Code security


Focusing on what is evaluated, and in what way, can help prevent nasty surprises like code injection attacks and data leaks caused by the system interpolating into your template something other than what you intended. This general requirement surfaces in many different ways; these are closely related, but let's discuss them separately.

Avoid treating code as templating


Imagine we have a PHP templating system that replaces variables or placeholders enclosed in double square brackets, e.g.

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<h1>[[title]]</h1>

Then a requirement comes in to make the title inside the page all-caps. One may be tempted to extend the templating system to support arbitrary PHP code to make it more powerful, e.g. write

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<h1><?php echo strtoupper("[[title]]"); ?></h1>

As the replacement of placeholders must precede running the PHP code, you can see how dangerous this can be if the title happens to be "); echo(file_get_contents('passwords.php') . ". You can also run into trouble if the placeholder delimiters can occur naturally in your programming language.

You can mitigate these dangers by, for example, escaping all double and single quotes when replacing placeholders, but in my view there are so many edge cases in systems like this that it is best to avoid the approach altogether.

Parse templates only once


This is a more general version of the previous point. Notice that in the previous example we in fact parsed a template twice: once for placeholders, and then for PHP code. This is not limited to code, though. Imagine that we have the [[...]] templating system and we are successfully building a webpage from

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<div class="header">User: [[username]]</div>
<h1>[[title]]</h1>

As the next step, we would like to introduce caching to speed up rendering, and we realise that most of the page content stays the same while only the username changes. So we could introduce a new kind of placeholder that is not replaced in the first instance, but only after the page is retrieved from the cache. Given the template:

<html>
<head>
<title>[[title]]</title>
</head>
<body>
<div class="header">User: ((username))</div>
<h1>[[title]]</h1>

we cache the string

<html>
<head>
<title>A4 paper, pack of 500</title>
</head>
<body>
<div class="header">User: ((username))</div>
<h1>A4 paper, pack of 500</h1>

I call this technique "caching a hole", and our plan is to replace ((...)) placeholders after caching, just before serving the page. Again you can see that this system can break if a title or any other content you have no control over contains "((".

The problem here, again, is that the template was parsed twice for placeholders or other active elements, and by the second pass it also contained interpolated content, which we parsed as well.

The solution is to parse the template only once, and never re-parse an already parsed component, including any external data that has been incorporated. This is, in fact, another facet of the same problem:

Don't parse user-submitted content


(Here, this includes e.g. data the marketing team enters into the CMS database.)
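
As a minimal sketch of this and the previous point, assuming the toy [[...]] syntax: the template is split into literal and placeholder tokens exactly once, and interpolated values are treated as inert data, so delimiters inside user-submitted content can never be re-interpreted (output escaping is omitted here; see the next point).

import re

# Parse the template exactly once into a list of tokens
TOKEN_RE = re.compile(r'\[\[(\w+)\]\]')

def parse(template):
    tokens, pos = [], 0
    for match in TOKEN_RE.finditer(template):
        tokens.append(('text', template[pos:match.start()]))
        tokens.append(('placeholder', match.group(1)))
        pos = match.end()
    tokens.append(('text', template[pos:]))
    return tokens

def render(tokens, data):
    # Interpolated values are concatenated as plain data and never
    # re-parsed, so delimiters inside them have no effect
    return ''.join(data.get(value, '') if kind == 'placeholder' else value
                   for kind, value in tokens)

tokens = parse('<h1>[[title]]</h1>')
print(render(tokens, {'title': 'Sale! [[date]] ((username))'}))
# Prints: <h1>Sale! [[date]] ((username))</h1> - the delimiters stay inert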

Assist with avoiding unescaped or double-escaped content in the output


Apart from a secure language and way of parsing, a templating system should also assist developers in ensuring that its output is secure, too. It should prevent (or help avoid, or warn about) unescaped user-submitted content appearing in its output, which would easily leave a website open to code injection attacks (via, e.g. a search string <script>document.location = 'http://fakebank.com';</script> shared in a URL). For convenience, it should also help avoid escaping content multiple times, which is usually not what the developer intends.

One solution to tracking what has already been escaped and what hasn't, especially in more complicated templating systems capable of producing output in different languages (HTML, CSS, plain text, JavaScript, etc.), is to use objects that store this information as metadata attached to the piece of content.

In other systems only capable of producing output in one language (e.g. HTML), it may be enough to simply escape everything, but there still needs to be a mechanism to mark some already parsed content as safe and prevent it from being escaped again.
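
A minimal sketch of such a mechanism in Python; the Safe class is a hypothetical stand-in for a fuller metadata-carrying object.

import html

class Safe(str):
    # Marks a string as already escaped so it is not escaped again
    pass

def escape(value):
    if isinstance(value, Safe):
        return value
    return Safe(html.escape(str(value), quote=True))

user_input = '<script>document.location = "http://fakebank.com";</script>'
once = escape(user_input)
assert escape(once) == once  # escaping twice does not double-escape
print(once)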

It is worth mentioning that HTML, strictly speaking, uses two languages: HTML content and XML attribute values. However, the two can be treated in the same way by over-escaping, that is, by escaping quotation marks even in regular HTML content.

Speed and caching


The next area of concern is caching, which helps make a templating system as efficient as possible. On many websites (maybe with the exception of web apps) the content generated on the server side is largely similar for the same requests, yet must be generated repeatedly, which is clearly wasteful on high-traffic websites. Caching the whole or parts of the output is a natural solution - it speeds up templating and saves electricity.

Allow caching half-ready templates


This is one of what I see as the two main approaches to caching content while still allowing changes (like the name of the logged-in user) to be introduced after caching. As suggested above, HTML pages are often largely similar for the same request but have small variations. These can include timestamps, randomised content (A/B tests), and content that depends on the current user, like their name, favourites, or tailored suggestions.

Whether it is worth caching against the user (e.g. creating separate cache elements for each user) depends on a number of factors, like the cost of generating the content, cache hit rates, and the number of users. In any case, it is always useful for a templating system to allow introducing minor changes after caching.

There are two main ways of doing so: caching components and caching "holes." In the first case, we cache parts of the page that are fully evaluated; the correct components are then selected and assembled for each request, allowing for variations. In the second case we "cache a hole": as described above, we aim to cache a partially evaluated piece of templating that still has placeholders or other active elements.

Supporting this places some additional constraints on the templating system, connected to the suggestion that it should parse templates only once. Some templating engines indeed parse templates only once and convert them into code and function calls. This is a very efficient solution, although in my view such systems (like server-side React.js) struggle with caching "holes." Parsing templates into an object structure may support caching holes more easily, provided the objects can be serialised.
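
A minimal sketch of how a serialisable token structure can support caching a hole; the token format and cache are hypothetical. "Early" placeholders are filled before caching, "late" ones (the holes) on every request.

import json

template = [
    ('text', '<div class="header">User: '), ('late', 'username'),
    ('text', '</div><h1>'), ('early', 'title'), ('text', '</h1>'),
]

def fill(tokens, kind, data):
    # Replace placeholders of the given kind; leave all other tokens intact
    return [('text', data[value]) if k == kind else (k, value)
            for k, value in tokens]

# Fill the stable parts and cache the half-ready, serialised token list
half_ready = fill(template, 'early', {'title': 'A4 paper, pack of 500'})
cached = json.dumps(half_ready)

# On each request: deserialise the tokens and fill the holes
tokens = [tuple(t) for t in json.loads(cached)]
page = ''.join(v for k, v in fill(tokens, 'late', {'username': 'jsmith'}))
print(page)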

Side channels and side effects: the correctness of caching


We also need to investigate which parts of our templates can be cached at all. For the output of any code to be cached transparently, it needs to adhere to the functional paradigm: its output must depend solely on a set of well-defined inputs (like mathematical functions - no side channels), and we must be able to recreate all of its output from the cache (no side effects).

The first constraint is necessary so that we can represent all relevant inputs in the cache key and avoid cache contamination. If, for example, we have a header template object that renders the name of an item for sale and takes the item name as its input, but internally also retrieves the currently logged-in user from a global variable to add their name, then caching this object against its arguments (the item name only) would be incorrect and would result in usernames being leaked to other users. Ideally, a templating system should facilitate enumerating all relevant inputs and discourage reaching out to other application data in templating code.
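
As a minimal sketch, using Python's functools cache as the caching layer: every input the output depends on is an explicit argument, so all of them automatically become part of the cache key.

from functools import lru_cache

@lru_cache(maxsize=1024)
def render_header(item_name, username):
    # Both inputs are arguments; caching on item_name alone while reading
    # the user from a global would leak one user's name to another
    return '<div class="header">{} - {}</div>'.format(item_name, username)

print(render_header('A4 paper, pack of 500', 'jsmith'))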

The second constraint is perhaps less relevant in the case of pure templating (the "view" layer), as the output is then usually restricted to HTML or half-ready templates. Still, if any controller-like logic finds its way into our templates, or we would like to introduce caching in the controller layer, too, we need to make sure that the code, when it runs, has no effect other than returning some values we can store in the cache. It cannot, for example, access a global response object and inject a cookie, as this would simply not happen when the return value of the code is retrieved from the cache.

A solution could be to allow any kind of object or data structure to be returned by cacheable code blocks, in which, apart from return values and templating, side effects can also be represented: cookies to be set, or redirect or error pages to be rendered. This freedom can have useful applications inside the view layer, too, as metadata (e.g. success flags) returned with templating can be very useful when processing a template further. While not strictly necessary, I think this flexibility makes a templating system a considerably more powerful tool.
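
A minimal sketch of this idea; all field names are hypothetical. The framework would replay the described side effects whether the result is freshly computed or retrieved from the cache.

def render_login_box(username):
    # Everything the block "does" is described in the returned structure,
    # so replaying a cached result reproduces the side effects too
    if username is None:
        return {'html': '<a href="/login">Log in</a>',
                'cookies': {}, 'success': False}
    return {'html': '<span>Hello, {}!</span>'.format(username),
            'cookies': {'last_seen': 'today'}, 'success': True}

result = render_login_box('jsmith')
print(result['html'], result['success'])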

Usability


Define your language


Whatever you choose to parse your template into (e.g. objects or code), it is always a good idea to clearly define, in advance, your templating language. Then you can parse your templates using specially crafted regexes, or a lexer / parser combination using an LL parser or similar.

The definition can help you see where things could go wrong. Are there any ambiguities in meaning? Can we, for example, always distinguish between placeholders and calls to helper functions? Creating a templating system is similar to creating a high-level programming language: the more redundancy you build in, the more linting you can do, and the less likely it is that a developer will make a mistake by meaning one thing but writing another that still looks correct.

For example, I suggest clearly distinguishing between placeholders (bringing in data) and calls to helper functions in the templating syntax, and not relying on what functions happen to be defined when the template is processed. If both placeholders and function calls look like [[...]], how do you address the original "name" datum, masked by the helper function, in "data-name" in the following example?

Templating.call('HeaderTemplate', 
{ name: "John Smith", gender: "m", email: "js@me.com" }
);

--- HeaderTemplate ---

function name (templateArgs) {
return (templateArgs.gender == 'm' ? 'Mr' : 'Ms')
+ ' ' + templateArgs.name;
}

<div class="header" data-name="???">[[name]] - [[email]]</div>

The solution Handlebars uses is to extend the syntax used to address data in the incoming data structure, so that "./name" accesses a masked datum. But I believe a developer knows whether they mean a piece of data or a function call, and should indicate this to the system. For example, add "()" to function calls and disallow parentheses in argument names:

<div class="header" data-name="[[name]]">[[name()]] - [[email]]</div>
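
A minimal sketch of this distinction (in Python rather than the JavaScript above, with a toy regex-based parser):

import re

# A trailing "()" marks a helper-function call; anything else is a datum.
# The distinction is syntactic, not based on what happens to be defined.
PLACEHOLDER_RE = re.compile(r'\[\[(\w+)(\(\))?\]\]')

def render(template, data, helpers):
    def replace(match):
        name, is_call = match.group(1), match.group(2)
        if is_call:
            return helpers[name](data)  # explicitly marked function call
        return data[name]  # plain datum, even if a helper shadows its name
    return PLACEHOLDER_RE.sub(replace, template)

helpers = {'name': lambda d: ('Mr' if d['gender'] == 'm' else 'Ms') + ' ' + d['name']}
data = {'name': 'John Smith', 'gender': 'm', 'email': 'js@me.com'}
print(render('<div class="header" data-name="[[name]]">[[name()]] - [[email]]</div>',
             data, helpers))
# <div class="header" data-name="John Smith">Mr John Smith - js@me.com</div>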

Avoid clashes with the output language


Creating a specification for your language also allows you to avoid any clashes with the language (text, HTML, CSS, etc.) the templating system is supposed to generate.

Clashes can limit what you can do in your templating language. Imagine that one decides to use XML tags for all placeholders and require that any template is well-formed XML. This works well in cases like

<div class="header"><Name/> <Email/></div>

but how do you interpolate data into an XML attribute? <div data-name="<Name/>"> would clearly fail.

Clashes with the output language can also upset the syntax highlighters and linters that help developers, in any IDE, avoid bugs and mistakes that can be difficult and costly to trace.

Easy internationalisation


I believe every modern templating system should support internationalisation out of the box. As elsewhere in coding, the solution used should avoid duplication, such as using the original text as the key to look up translated versions: that would make it necessary to update code in multiple places merely to fix a typo or capitalisation issue in the original text, which can easily break the link between the texts and lose the translated content.
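
A minimal sketch of the alternative, with stable message IDs (the catalogue below is hypothetical): fixing a typo in the English text cannot break the link to its translations.

MESSAGES = {
    'header.greeting': {'en': 'Welcome back, {name}!',
                        'de': 'Willkommen zurück, {name}!'},
}

def translate(message_id, lang, **params):
    return MESSAGES[message_id][lang].format(**params)

print(translate('header.greeting', 'de', name='John'))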

Universality


The final, and perhaps least important, feature I like to see in any templating system is universality. I think a templating system should be capable of producing any output, at least in its target language(s); in extremis, if the output languages allow, it should be capable of producing content that looks just like its input. Any arbitrary limitation on this may signal a templating language or parsing system that was not adequately designed, and one can be sure to hit the limitation sooner rather than later. For example, if the templating language allows whitespace around placeholders for readability but removes it from the output, is it possible to add the whitespace back if needed? Can whitespace be added to the end of a templating unit? Can one add HTML comments or arbitrary XML attributes?

Different escaping mechanisms


In particular, I like to see support for the different escaping mechanisms relevant to the output language. For HTML, useful translations apart from escaping XML metacharacters and quotes are URL escaping (for interpolating URL query values) and JS escaping (for interpolating data into in-page JS code). Easy access to these means developers are more likely to use them, which leads to fewer errors and a more robust and resilient system.
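
As an illustration, Python's standard library provides all three translations (shown here only to make the distinction concrete):

import html
import json
from urllib.parse import quote

value = 'Tom & Jerry "classics"'

print(html.escape(value))  # HTML content and attribute values
print(quote(value))        # URL query values
print(json.dumps(value))   # a string literal for in-page JS code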

Conclusion


The above list cannot be complete, but I hope these points will help you design or choose your next templating system. Happy templating!

18 September 2016

Evaluating randomised tests

Having run a randomised test on a service or product for some time, we may see that version B produces 3% better results on average than version A. But how can we be certain that this is a significant difference? That it is statistically significant? The ideas below helped me understand what's behind this question.

A randomised test usually means that some users see a different version of the product than others. The success of both versions is then measured by a score we give to each user (e.g. how much money they ultimately spend). We see a difference in the overall average (or arithmetic mean) of these scores, but it may not be caused by the different versions.

The reason for this is simple. Each user has a different score, and from the whole pool of users we have selected a subset (e.g. those who were shown version A) for which we calculated the mean score. This mean will vary slightly from subset to subset, and this alone could explain the difference between the mean scores of users shown version A and those shown version B. Our goal, therefore, is to try to decide whether this could be the explanation.

To do this, we assume that this is indeed the case, and that the difference between the means is caused only by this normal variation between subsets. This is our null hypothesis, and it would mean that the two subsets of users came from the same distribution and the different versions of the product shown to them made no difference whatsoever. So we pool the users who were shown version A (let's say there were n of them) and those shown version B (m of them), and from this combined set we randomly select subsets of n and m users with replacement. This process is called bootstrapping, and it results in many pairs of subsets for which we calculate the difference between their mean scores.

This gives us a number of differences that we expect to see simply because the users are divided into two subsets. We can arrange them in a histogram and check where the difference we saw in our test falls on this histogram. If, say, it turns out that 70% of the differences we produced by bootstrapping are smaller than the difference given by the test, and 30% are larger, then it is entirely possible (and quite likely) that the difference in the test is caused merely by separating the users into subsets. In other words, the null hypothesis is accepted and the difference in the test is not statistically significant.

However, if only 5% or less of the generated differences are greater than the difference in the test (5% being the usual benchmark), then the null hypothesis is rejected, as it is very unlikely that the difference in the test could have occurred on its own, and we say that the difference is statistically significant. That is, showing the different versions to users mattered.
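
A minimal sketch of this procedure; the scores below are synthetic stand-ins for real per-user scores.

import numpy as np

rng = np.random.default_rng(0)
scores_a = rng.normal(10.0, 2.0, size=1000)  # users shown version A
scores_b = rng.normal(10.3, 2.0, size=1000)  # users shown version B

observed = scores_b.mean() - scores_a.mean()

# Under the null hypothesis, both subsets come from the same pool
pooled = np.concatenate([scores_a, scores_b])
n, m = len(scores_a), len(scores_b)

diffs = np.empty(10000)
for i in range(len(diffs)):
    sample_a = rng.choice(pooled, size=n, replace=True)
    sample_b = rng.choice(pooled, size=m, replace=True)
    diffs[i] = sample_b.mean() - sample_a.mean()

# The fraction of bootstrapped differences exceeding the observed one
p_value = np.mean(diffs > observed)
print('difference: {:.3f}, p = {:.4f}'.format(observed, p_value))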

There are many ways to calculate whether a result is statistically significant, but some rely on assumptions about the underlying distribution of scores. Actually performing the bootstrapping doesn't, and a simple C program, Manatee, can do just that - it's available on GitHub.

Non-cheating self-replication programs (Quines)

Just for fun, here are some short self-replicating programs I created in JavaScript, PHP and Perl. These don't cheat, in the sense that they don't use language features that print the source of functions, access the file or source of the running program, or similar.

Most of these use the following structure:

a="print('a="'.a.'";'.a);";print('a="'.a.'";'.a);

The complication is being able to include a double quote inside a string that uses the same character as its delimiter. Perl can do this as it matches braces in the q{} notation; in JavaScript I used code to create these characters. In PHP it was even more difficult, so I went for a different solution entirely.
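
For comparison, the same structure in Python (not among the original set): the %r conversion yields a quoted representation of the string, which sidesteps the quoting problem.

s = 's = %r\nprint(s %% s)'
print(s % s)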

23 January 2016

Proving the correctness of an algebraic file synchroniser

I did some work on an algebraic approach to file synchronisation back in 2000. This basically means that if one has multiple copies of the same filesystem, and different users modify them differently, then we combine all changes (synchronise them) not based on what content we find in the filesystems, but based on the updates or commands that have been applied to them. This approach works very well, as one can see whether two updates are in conflict just by looking at pairs of commands and a small set of rules.

I have recently become interested in this problem again, and in particular in offering rigorous proofs that the algorithm proposed to detect conflicting commands really works as intended, and cannot be improved on in this framework. The statements to prove were intuitive and easy to grasp, but actually proving them turned out to be a difficult task. One reason for this was that previously, all propositions about commands were of the form "these two commands behave exactly like that command on all filesystems" - that is, they had an implied universal quantifier (or an existential one if the proposition was negated). However, the synchroniser algorithm returns commands intended to modify the diverged filesystems so that they become as similar to each other as possible again, and in order to prove that these commands do not actually cause an error, one needed to deal with propositions of the form "if these commands do not cause an error, then those commands won't cause an error, either". In other words, the algebraic system needed to be extended to be able to describe implication, or some kind of conditionality.

I think I managed to do so, and I am very excited about the result. The end product of this 5-month research is a paper, now available on arXiv. And my favourite proof from the paper is actually the figure above.

Algebraic File Synchronization: Adequacy and Completeness
Elod Pal Csirmaz
Abstract
With distributed computing and mobile applications, synchronizing diverging replicas of data structures is a more and more common problem. We use algebraic methods to reason about filesystem operations, and introduce a simplified definition of conflicting updates to filesystems. We also define algorithms for update detection and reconciliation and present rigorous proofs that they not only work as intended, but also cannot be improved on.
To achieve this, we introduce a novel, symmetric set of filesystem commands with higher information content, which removes edge cases and increases the predictive powers of our algebraic model. We also present a number of generally useful classes and properties of sequences of commands.
These results are often intuitive, but providing exact proofs for them is far from trivial. They contribute to our understanding of this special type of algebraic model, and toward building more complete algebras of filesystem trees and extending algebraic approaches to other data storage protocols. They also form a theoretical basis for specifying and guaranteeing the error-free operation of applications that implement an algebraic approach to synchronization.

Read the paper in full »