Elod P Csirmaz’s Blog

29 August 2018

The correctness and completeness of algebraic file synchronisers

I did some work on an algebraic approach to file synchronisation back in 2000. This basically means that if one has multiple copies of the same filesystem, and different users modify them differently, then we combine all changes (synchronise them) not by specifying rules based on what content we find in the filesystems, but based on the updates or commands that have been applied to them. This approach works very well as one can see whether two updates are in conflict just by looking at pairs of commands and a small set of rules.
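In spirit (and only in spirit - the real calculus in the paper is far richer), pairwise conflict detection can be sketched as follows. The command representation and the single rule below are simplified assumptions for illustration only:

```python
# Toy illustration of command-based conflict detection. Commands are
# (operation, path) pairs; this is NOT the actual algebra from the paper.

def conflicts(cmd_a, cmd_b):
    """Two commands from different replicas conflict when they touch
    the same path but are not the identical command."""
    (_op_a, path_a), (_op_b, path_b) = cmd_a, cmd_b
    return path_a == path_b and cmd_a != cmd_b

def conflicting_pairs(seq_a, seq_b):
    """Compare command sequences pairwise, as the algebraic approach
    suggests: no filesystem contents are inspected at all."""
    return [(a, b) for a in seq_a for b in seq_b if conflicts(a, b)]
```

The point of the approach is visible even in this toy: the decision uses only the commands themselves, never the state of the filesystems.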

I have recently become interested in this problem again, and in particular in offering rigorous proofs that the algorithm proposed to detect conflicting commands really works as intended, and cannot be improved within this framework. The statements to prove were intuitive, but actually proving them turned out to be a difficult task. One reason for this was that previously, all propositions about commands had the form "these two commands behave exactly like that command on all filesystems" - that is, they carried an implied universal quantifier (or an existential one if the proposition was negated). However, the synchroniser algorithm returns commands intended to modify the filesystems so that they become as similar to each other again as possible, and to prove, for example, that these commands will not actually cause an error, one needed to deal with propositions of the form "if these commands do not cause an error, then those commands won't cause an error, either". In other words, the algebraic system needed to be extended to be able to describe implication, or some kind of conditionality.

This could be done based on partial functions and a partial order, and I am very excited about the result. The end product of this new research is a paper, now available on arXiv.

Algebraic File Synchronization: Adequacy and Completeness
Elod Pal Csirmaz
With distributed computing and mobile applications, synchronizing diverging replicas of data structures is a more and more common problem. We use algebraic methods to reason about filesystem operations, and introduce a simplified definition of conflicting updates to filesystems. We also define algorithms for update detection and reconciliation and present rigorous proofs that they not only work as intended, but also cannot be improved on.
To achieve this, we introduce a novel, symmetric set of filesystem commands with higher information content, which removes edge cases and increases the predictive powers of our algebraic model. We also present a number of generally useful classes and properties of sequences of commands.
These results are often intuitive, but providing exact proofs for them is far from trivial. They contribute to our understanding of this special type of algebraic model, and toward building more complete algebras of filesystem trees and extending algebraic approaches to other data storage protocols. They also form a theoretical basis for specifying and guaranteeing the error-free operation of applications that implement an algebraic approach to synchronization.

Read the paper in full »


23 July 2018

How much data do you need for an A/B test?

If you have plenty of data coming in (lucky you), but you are still not sure whether you can be confident in the difference your test results show (especially if you don't have access to the individual data points, and don't know the underlying distribution or the variance, which isn't reported by e.g. Google Analytics), this back-of-the-envelope calculation may help.

If we measure something in a system, and have a well-established mean value, and then we test a variant of the system and have \(n\) measurements from the variant, their mean can be different from the established mean merely because the measurements are by nature random. That is, it is possible that the variant system performs in exactly the same way as the original system, and the difference in means we see is due to the randomness. (This is the null hypothesis.) We use Chebyshev's Inequality to approximate the probability of this being the case.

In particular, we assume the measurements are independent random variables \(X_i\) which all have the same expectation \(\mathrm{E}(X_i)=\mu\) and variance \(\mathrm{Var}(X_i)=\sigma^2\). We assume that the mean of measurements from the original system provides the expectation (but see below for evaluating two-sided tests). From Chebyshev's Inequality (http://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf) \[\mathrm{Prob}\left(\left|\frac{\sum X_i}{n}-\mu\right|\geq\epsilon\right)\leq\frac{\sigma^2}{n\epsilon^2},\] which is an upper bound of this probability. If \(r\) is the range of the data (the difference between the maximum and minimum values), we know \(\sigma^2\leq r^2\). If we also describe the difference of the means as a proportion of the range (\(\epsilon=rd\)), we get \[\mathrm{Prob}\leq\frac{r^2}{nr^2d^2}=\frac{1}{nd^2}.\] As usual we may consider the difference significant and reject the null hypothesis if its probability is less than 5%, that is, \[\mathrm{Prob}\leq\frac{1}{nd^2}\leq\frac{1}{20},\] from which \[n\geq\frac{20}{d^2}.\]

In terms of concrete examples, this means:

\(d\) of this or more is significant with a confidence of 95% | if \(n\) is at least
10% | 2,000
5% | 8,000
2% | 50,000
1% | 200,000

If, instead of comparing one variant to the original system (which we assumed to provide the underlying distribution and expectation), we compare two variants in an A/B test, the probability we calculate is that of seeing a given difference between the means of the measurements if the measurements on the two sides have the same expectation, that is, if the two variants do not perform differently in reality. This is the probability of either the measurements on the A side or those on the B side being sufficiently far from the expectation. We approximate this probability with the (always greater) sum of the individual probabilities, and, as \(d\) is the difference between the expectation and a mean, we use \(d=d_{AB}/2\) where \(d_{AB}\) is the difference seen between the means on the A versus the B side. (We do this because in the worst case, the common expectation is halfway between the means, which allows both means to be close to it.) Assuming we have \(n\) measurements on each side we get \[\mathrm{Prob}_{AB}\leq 2\cdot\frac{1}{n\left(\frac{d_{AB}}{2}\right)^2} = \frac{8}{nd_{AB}^2}.\] For this probability to be less than 5%, we need \[\mathrm{Prob}_{AB}\leq\frac{8}{nd_{AB}^2}\leq\frac{1}{20},\] from which \[n\geq\frac{160}{d_{AB}^2}.\] In terms of some concrete numbers:

\(d_{AB}\) of this or more is significant with a confidence of 95% | if \(n\) is at least
10% | 16,000
5% | 64,000
2% | 400,000
1% | 1,600,000
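Both bounds reduce to one-line formulas, so they are easy to compute for any confidence level. A small helper (the function names are mine; the constants follow from \(\mathrm{Prob}\leq 1-c\) in the derivations above):

```python
import math

def min_n_single(d, confidence=0.95):
    """Minimum n so that a mean differing by d (as a proportion of the
    data range) from a known expectation is significant.
    From Prob <= 1/(n*d^2) <= 1-confidence: n >= 1/((1-confidence)*d^2).
    At 95% confidence this is n >= 20/d^2."""
    return math.ceil(1 / ((1 - confidence) * d ** 2))

def min_n_ab(d_ab, confidence=0.95):
    """Minimum n per side for an A/B test.
    From Prob <= 8/(n*d_ab^2) <= 1-confidence: n >= 8/((1-confidence)*d_ab^2).
    At 95% confidence this is n >= 160/d_ab^2."""
    return math.ceil(8 / ((1 - confidence) * d_ab ** 2))
```

For example, min_n_single(0.1) gives 2,000 and min_n_ab(0.1) gives 16,000, matching the tables above.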

30 June 2018

Display the full value of a tensor in Tensorflow / Keras

I have been working on visualizing the internals of a neural network. While the Keras FAQ suggests building partial models to achieve this, and this post suggests setting up functions to evaluate a layer, I was tempted to look into the Keras backend's and TensorFlow's Print function to display the values of a tensor during the actual computation, as this creates an actual node in the computational graph and prints the values as the model is used.

In the first iteration I used a Lambda layer to wrap keras.backend.print_tensor, but the output was truncated to 3 values from the tensor. I proceeded to dig deeper: tf.keras.backend.print_tensor is defined here; it uses tf.Print, which, according to the documentation, has a summarize parameter that cannot be set through Keras. So I started using tf.Print directly, but it was still unclear what the parameter meant.

tf.Print is defined in logging_ops.py, and, as this post explains, gen_logging_ops is generated from ops/logging_ops.cc, but where is the actual source? Finally, after further digging, I found it in tensorflow/core/kernels/logging_ops.cc.

Apparently, it uses SummarizeValue() to truncate the output, which is defined in core/framework/tensor.cc. Looking at the code, all we seem to need to do is to pass an integer larger than or equal to the size of the tensor. Also, from the code it is apparent that the values are written to STDERR. The code I used was therefore

layer = keras.layers.Lambda((lambda x: tf.Print(x, [x], message=message, first_n=-1, summarize=1024)), name=name)(layer)

Later, this could be extended to use the number of elements in x as the summarize parameter, perhaps through tf.size.

15 July 2017

AI is More Instinct than Intelligence

Lately I've been spending a lot of time with machine learning, neural nets, and the question of extracting and communicating their thinking so that humans could review their conclusions and/or learn from them. It is a hardly surprising observation that what these models do is primarily pattern matching: a system that assesses whether a bank transaction is fraudulent will flag up a transaction because in some very complicated ways it is similar to other transactions that were results of fraud. Even unsupervised learning, which autonomously finds patterns in data, is doing just that: finding patterns.

If so, then everything these systems do is more like what we do by instinct, and not what we do via high-level reasoning which we traditionally call intelligence. In fact, I think this is partly why these systems appear so magical: from reading handwriting to self-driving cars, they do things we don't know how we do ourselves. They are getting pretty good at things we can learn to do by instinct.

(Classic AI / ML did in fact concern itself with symbolic calculations and reasoning, but the statistical models that are becoming so powerful today meant a shift from reasoning to instinctual decisions.)

This, in turn, is why it's so difficult to understand how an AI model arrives at a conclusion: it does so based on patterns and similarity, like our amygdala, and does not complement this with some kind of abstract reasoning. Even if it simply rationalised a decision already made based on its instinct (probably much like how most humans arrive at "rational" decisions), adding this rational layer would be truly amazing, as it would allow us to communicate with an AI system and peek into its thought processes.

20 June 2017

Display sample predictions during training using Keras callbacks

I have seen many sample Keras scripts where training a model by calling model.fit is allowed to finish after a couple of epochs to create an opportunity to compute and display a sample of the predictions the model generates. model.fit is then called again in a loop to continue training. Unfortunately, this technique confuses TensorBoard as it tries to trace how the training progresses and how indicators like loss and accuracy change. This is because restarting the fitting makes TensorBoard think that new data is coming in at previous timepoints, leading to some very confused graphs.

Fortunately, it is very easy to set up a new Keras callback to display sample predictions without having to stop the training using LambdaCallback. We can have a function called after each epoch to print out the info; it can even check the number of the epoch to do so not at the end of every epoch, but less frequently.

Using this callback together with the callback that logs data for TensorBoard is illustrated below. In this code we use an iterator to get the training data, but the same pattern can be used with pre-loaded data, too.
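A minimal sketch of such a pairing, using keras.callbacks.LambdaCallback and TensorBoard (the names model, next_batch and every_n are my own illustrative assumptions, not from any original listing):

```python
def should_log(epoch, every_n):
    """True after epochs every_n, 2*every_n, ... (Keras epochs are 0-indexed)."""
    return (epoch + 1) % every_n == 0

def make_callbacks(model, next_batch, every_n=5, log_dir="./logs"):
    # Keras is imported lazily so the pure helper above is testable
    # without TensorFlow installed.
    from keras.callbacks import LambdaCallback, TensorBoard

    def on_epoch_end(epoch, logs):
        # Print sample predictions only every `every_n` epochs.
        if should_log(epoch, every_n):
            inputs, _targets = next_batch()   # e.g. next(training_iterator)
            print(model.predict(inputs))

    # Hand both callbacks to one uninterrupted model.fit() call,
    # so TensorBoard sees a single continuous training run.
    return [LambdaCallback(on_epoch_end=on_epoch_end),
            TensorBoard(log_dir=log_dir)]
```

The callbacks returned here would be passed as the callbacks argument of a single model.fit call, avoiding the stop-and-restart pattern entirely.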

30 March 2017

Debian installer cannot find missing firmware

During installation, Debian may ask for (non-free) firmware for certain devices, like the wi-fi. Unfortunately, even if you have located the file and copied it to a USB drive or similar, the installer often still cannot find it. The solution is to switch to a terminal (Alt-Ctrl-F2), mount the USB drive manually, and copy the file to /lib/firmware (create the directory if needed). See also the relevant section of the installation documentation of Debian jessie.

27 March 2017

How to write (or customise) a templating engine?

While working on extending a web templating system to support a new programming language, I collected some features I wanted to implement or preserve. These features have come from my experience with various templating systems in PHP, Perl and JavaScript, and a need to create a secure and high-performance software environment for web developers.

At times the following sound like prescriptions, but in reality they just call attention to danger areas you might want to think about either when developing a new templating system, or adopting an existing one. One can freely go against these and still create a fast and correct system if one mitigates their effects by proper escaping or other workarounds.

Code security

Focusing on what is evaluated and in what way can help prevent nasty surprises like code injection attacks and data leaks caused by the system interpolating into your template something other than what you intended. This general requirement surfaces in many different ways. These are closely related, but let's discuss them separately.

Avoid treating code as templating

Imagine we have a PHP templating system that replaces variables or placeholders enclosed in double square brackets, e.g.

<h1>[[title]]</h1>
Then a requirement comes in to make the title inside the page all-caps. One may be tempted to extend the templating system to support arbitrary PHP code to make it more powerful, e.g. write

<h1><?php echo strtoupper("[[title]]"); ?></h1>

As the replacement of placeholders must precede running the PHP code, you can see how dangerous this can be if the title happens to be "); echo(file_get_contents('passwords.php') . ". You can also run into trouble if the placeholder delimiters can occur naturally in your programming language.

You can mitigate these dangers by, e.g. escaping all double and single quotes when replacing placeholders, but in my view there are so many edge cases in systems like this that it is best to avoid this altogether.

Parse templates only once

This is a more general version of the previous point. Notice that in the previous example we in fact parsed a template twice: once for placeholders, and then for PHP code. This is not limited to code, though. Imagine that we have the [[...]] templating system and we are successfully building a webpage from

<div class="header">User: [[username]]</div>

As the next step, we would like to introduce caching to speed rendering up, and we realise that most of the page content stays the same while only the username may be changing. So we could introduce a new kind of placeholder that is not replaced in the first instance, but only after retrieving the page from cache. Given the template:

<div class="header">User: ((username))</div>

we cache the string

<title>A4 paper, pack of 500</title>
<div class="header">User: ((username))</div>
<h1>A4 paper, pack of 500</h1>

I call this technique "caching a hole", and our plan is to replace ((...)) placeholders after caching, just before serving the page. Again you can see that this system can break if a title or any other content you have no control over contains "((".

The problem here again is that the template was parsed twice for placeholders or other active elements, but by the second time it also contained interpolated content that we also parsed.

The solution is to parse the template only once, and never re-parse an already parsed component, including any external data that has been incorporated. This is, in fact, another facet of the same problem:

Don't parse user-submitted content

(Here, this includes e.g. data the marketing team enters into the CMS database.)
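A single-pass renderer addresses both points at once: substituted content, whatever it contains, is never scanned for placeholders again. A minimal Python sketch, using the [[...]] syntax from the examples above:

```python
import re

PLACEHOLDER = re.compile(r"\[\[(\w+)\]\]")

def render(template, data):
    """One pass over the template: substituted values are emitted
    verbatim and the scan position never revisits them."""
    out, pos = [], 0
    for m in PLACEHOLDER.finditer(template):
        out.append(template[pos:m.start()])          # literal text
        out.append(str(data.get(m.group(1), "")))    # value, not re-parsed
        pos = m.end()
    out.append(template[pos:])
    return "".join(out)
```

Even a value that itself contains "[[" cannot smuggle in a new placeholder: render("User: [[username]]", {"username": "[[admin]]"}) simply produces the literal text "User: [[admin]]".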

Assist with avoiding unescaped or double-escaped content in the output

Apart from a secure language and way of parsing, a templating system should also assist developers with ensuring that its output is secure, too. It should prevent (or help avoid, or warn about) unescaped user-submitted content from appearing in its output, which would easily make a website open to code injection attacks (with, e.g. a search string <script>document.location = 'http://fakebank.com';</script> shared in a URL). For convenience, it should also help to avoid escaping content multiple times, which is usually not what a developer intended to do.

One solution to tracing what has already been escaped and what hasn't, especially in more complicated templating systems capable of producing output in different languages (HTML, CSS, plain text, JavaScript, etc.) can be to use objects that store this information as metadata attached to the piece of content.

In other systems only capable of producing output in one language (e.g. HTML), it may be enough to simply escape everything, but there still needs to be a mechanism to mark some already parsed content as safe and prevent it from being escaped again.
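One way to implement such a mark-safe mechanism is a thin string subclass, sketched here in Python (the class and function names are my own):

```python
import html

class Safe(str):
    """A string already escaped for HTML output; escaping it again is a no-op."""

def esc(value):
    # Escape everything by default; Safe instances pass through untouched,
    # which also makes esc() idempotent and prevents double escaping.
    if isinstance(value, Safe):
        return value
    return Safe(html.escape(str(value), quote=True))
```

Template code then calls esc() on everything it outputs, and wraps trusted, already-rendered fragments in Safe() to exempt them.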

It is worth mentioning that HTML, strictly speaking, uses two languages, HTML content and XML attribute values. However, the two can be treated in the same way by over-escaping, that is, by escaping quotation marks even in regular HTML content.

Speed and caching

The next area of concern is caching to help make a templating system as efficient as possible. On many websites (maybe with the exclusion of web apps) content generated by the server side is largely similar for identical requests, yet must be generated repeatedly, which is clearly wasteful on high-traffic websites. Caching the whole output or parts of it is a natural solution - it speeds up templating and saves electricity.

Allow caching half-ready templates

This is one of what I see as the two main approaches to caching content that allows introducing changes (like the name of the logged-in user) after caching. As suggested above, often HTML pages are largely similar for the same request but have small variations. These can include timestamps, randomised content (A/B tests), and content dependent on the current user like their name, favourites, or tailored suggestions.

Whether it is worth caching against the user (e.g. creating separate cache elements for each user) depends on a number of factors, like the cost of generating the content, cache hit rates, and the number of users. In any case, it is always useful for a templating system to allow introducing minor changes after caching.

Two main ways of doing so are caching components and caching "holes." In the first case, we cache parts of the page which are fully evaluated. The correct components are then selected and assembled for each request, allowing for variations. In the second case we "cache a hole": as described above, we aim to cache a partially evaluated piece of templating that still has placeholders or other active elements.

Supporting this places some additional constraints on the templating system, which are connected to the suggestion that it should parse templates only once. Some templating engines indeed parse templates only once, and they convert them into code and function calls. This is a very efficient solution, although in my view these systems (like server-side React.js) struggle with caching "holes." Parsing templates into an object structure may support caching holes more easily, if you can arrange for the objects to be serialised.

Side channels and side effects: the correctness of caching

We also need to investigate what parts or templates can be cached at all. For the output of any code to be cached transparently, it needs to adhere to the functional paradigm: its output must depend solely on a set of well-defined inputs (like mathematical functions - no side channels), and we must be able to recreate all its output from our cache (no side effects).

The first constraint is necessary so that we can represent all relevant inputs in the cache key and avoid code contamination. If, for example, we have a header template object that renders the name of an item for sale and takes the item name as its input, but internally also retrieves the currently logged-in user from a global variable to add their name, then caching this object against its arguments (the item name only) would be incorrect and would result in usernames being leaked to other users. Ideally a templating system should facilitate enumerating all relevant inputs and discourage reaching out to other application data in templating code.

The second constraint is perhaps less relevant in the case of pure templating (the "view" layer), as then the output is usually restricted to HTML or half-ready templates. Still, if any controller-like logic finds its way into our templates, or we would like to introduce caching in the controller layer, too, we need to make sure that the code, when it runs, has no other effect than returning some values that we can retrieve from the cache. It cannot, for example, access a global response object and inject a cookie, as this would simply not happen when the return value of the code is retrieved from the cache.

A solution to this could be to allow any kind of object or data structure to be returned by cacheable code blocks, where apart from return values and templating, side-effects can also be represented like cookies to be set, or redirect or error pages to be rendered. This freedom can have useful applications inside the view layer, too, as metadata (e.g. success flags) returned with templating can be very useful when processing a template further. While not strictly necessary, I think this flexibility can make a templating system a considerably more powerful tool.
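One way to represent side effects as returnable data is a plain result structure; the field names below are illustrative, not from any real framework:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RenderResult:
    """Everything a cacheable block may produce, captured as plain data
    so a cache hit can replay it instead of the code running again."""
    body: str = ""
    cookies: dict = field(default_factory=dict)   # side effects, represented
    redirect: Optional[str] = None                # rather than performed
    meta: dict = field(default_factory=dict)      # e.g. success flags
```

On a cache hit, the caller sets the stored cookies and performs the stored redirect itself, so the cached block behaves identically whether or not its code actually ran.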


Define your language

Whatever you choose to parse your template into (e.g. objects or code), it is always a good idea to clearly define, in advance, your templating language. Then you can parse your templates using specially crafted regexes, or a lexer / parser combination using an LL parser or similar.

The definition can help one to see where it could go wrong. Are there any uncertainties in meaning? Can we, for example, always distinguish between placeholders and calls to helper functions? Creating a templating system is similar to creating a high-level programming language, and the more redundancy you build in, the more linting you can do, and the less likely it is that a developer will make a mistake by meaning one thing but writing another, which still looks correct.

For example, I suggest clearly distinguishing between placeholders (bringing in data) and calls to helper functions in the templating syntax, and not relying on what functions happen to be defined when the template is processed. If both placeholders and function calls look like [[...]], how do you address the original "name" datum masked by the helper function in "data-name" in the following example?

{ name: "John Smith", gender: "m", email: "js@me.com" }

--- HeaderTemplate ---

function name (templateArgs) {
    return (templateArgs.gender == 'm' ? 'Mr' : 'Ms')
        + ' ' + templateArgs.name;
}

<div class="header" data-name="???">[[name]] - [[email]]</div>

The solution Handlebars uses is to extend the syntax that addresses data in the incoming data structure, so that "./name" accesses a masked datum. But I believe a developer knows whether they mean a piece of data or a function call, and should indicate this to the system. For example, add "()" to function calls and disallow parentheses as parts of argument names:

<div class="header" data-name="[[name]]">[[name()]] - [[email]]</div>
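This distinction is easy to enforce at parse time. A minimal Python sketch (the data and helper follow the example above; the function names are mine):

```python
import re

# "[[name]]" is a datum; "[[name()]]" is a helper call. The developer's
# intent is explicit in the syntax, so data can never mask a helper.
TOKEN = re.compile(r"\[\[(\w+)(\(\))?\]\]")

def render(template, data, helpers):
    def sub(m):
        name, is_call = m.group(1), m.group(2)
        if is_call:
            return str(helpers[name](data))   # explicit helper call
        return str(data[name])                # plain datum, even if a
                                              # helper shares the name
    return TOKEN.sub(sub, template)
```

With data = {"name": "John Smith", "gender": "m", "email": "js@me.com"} and a "name" helper that prepends the title, [[name]] still yields "John Smith" while [[name()]] yields "Mr John Smith".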

Avoid clashes with the output language

Creating a specification for your language also allows you to avoid any clashes with the language (text, HTML, CSS, etc.) the templating system is supposed to generate.

Clashes can limit what you can do in your templating language. Imagine that one decides to use XML tags for all placeholders and require that any template is well-formed XML. This works well in cases like

<div class="header"><Name/> <Email/></div>

but how do you interpolate data into an XML attribute? <div data-name="<Name/>"> would clearly fail.

Clashes with the output language can also upset syntax highlighters and linters that in any IDE help developers avoid bugs and mistakes that can be difficult and costly to trace.

Easy internationalisation

I believe every modern templating system should support internationalisation out of the box. As elsewhere in coding, any solution used should avoid duplication, such as using the original text as a key to look up the translated versions: that approach makes it necessary to update code in multiple places merely to fix a typo or capitalisation issue in the original text, which can easily lead to breaking the link between the texts and losing the translated content.


Universality

The final, and perhaps least important, feature I like to see in any templating system is universality. I think a templating system should be capable of producing any output, at least in its target language(s). In extremis, if the output languages allow, it should be capable of producing content that looks just like its input. Any arbitrary limitation on this may signal a templating language or parsing system that was not adequately designed, and one can be sure to hit this limitation sooner rather than later. For example, if the templating language allows whitespace around placeholders for readability, but removes these from the output, is it possible to add the whitespace back if needed? Can whitespace be added to the end of a templating unit? Can one add HTML comments or arbitrary XML attributes?

Different escaping mechanisms

In particular, I like to see support for the different escaping mechanisms relevant to the output language. For HTML, useful translations apart from escaping XML metacharacters and quotes are URL escaping (for interpolating URL query values) and JS escaping (for interpolating data into in-page JS code). Easy access to these means that developers are more likely to use them, which leads to fewer errors and a more robust and resilient system.
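In Python, all three contexts are covered by the standard library; a sketch with one helper per output context (the function names are my own):

```python
import html
import json
from urllib.parse import quote

def esc_html(s):
    # Element content and attribute values; quote=True also escapes quotes.
    return html.escape(s, quote=True)

def esc_url(s):
    # URL query values; safe="" so even "/" is percent-encoded.
    return quote(s, safe="")

def esc_js(s):
    # json.dumps produces a quoted, escaped JS string literal.
    return json.dumps(s)
```

Having these as first-class, easily reachable helpers in the templating system makes it much more likely that the correct one is used in each context.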


The above list cannot be complete, but I hope these points will help you design or choose your next templating system. Happy templating!