## 29 August 2018

### The correctness and completeness of algebraic file synchronisers

I did some work on an algebraic approach to file synchronisation back in 2000. This basically means that if one has multiple copies of the same filesystem, and different users modify them differently, then we combine all changes (synchronise them) not by specifying rules based on what content we find in the filesystems, but based on the updates or commands that have been applied to them. This approach works very well as one can see whether two updates are in conflict just by looking at pairs of commands and a small set of rules.
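To make the idea concrete, here is a toy illustration in Python. The commands and rules below are hypothetical, invented for this sketch (they are not the algebra from the paper): updates are represented as (operation, path) pairs, and the conflict check compares pairs of commands against a small rule set, without ever inspecting filesystem contents.

```python
# Toy illustration only - the commands and rules here are hypothetical,
# not the actual algebra from the paper. Updates are (operation, path) pairs.

def conflicts(cmd_a, cmd_b):
    """Decide whether two commands conflict by looking at the pair alone."""
    (op_a, path_a), (op_b, path_b) = cmd_a, cmd_b
    if path_a == path_b:
        # Different updates to the same path conflict; identical ones do not.
        return cmd_a != cmd_b
    # Removing a directory conflicts with any update beneath it.
    if op_a == "remove" and path_b.startswith(path_a + "/"):
        return True
    if op_b == "remove" and path_a.startswith(path_b + "/"):
        return True
    return False

replica_a = [("edit", "/docs/report.txt"), ("create", "/img/logo.png")]
replica_b = [("remove", "/docs")]
print([(a, b) for a in replica_a for b in replica_b if conflicts(a, b)])
```

The point, as in the algebraic approach, is that the decision uses only the pair of commands and the rules, never the state of the filesystems themselves.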

I have recently become interested in this problem again, and in particular in offering rigorous proofs that the algorithm proposed to detect conflicting commands really works as intended, and cannot be improved on within this framework. The statements to prove were intuitive, but actually proving them turned out to be difficult. One reason was that previously, all propositions about commands had the form "these two commands behave exactly like that command on all filesystems" - that is, they carried an implied universal quantifier (or an existential one, if the proposition was negated). However, to prove, for example, that the commands returned by the synchroniser algorithm - which are designed to modify the replicas so that they become as similar to each other as possible - will not cause an error, one needs propositions of the form "if these commands do not cause an error, then those commands won't cause an error, either". In other words, the algebraic system needed to be extended to describe implication, or some kind of conditionality.

This could be done based on partial functions and a partial order, and I am very excited about the result. The end product of this new research is a paper, now available on arXiv.

**Algebraic File Synchronization: Adequacy and Completeness**
Elod Pal Csirmaz

**Abstract.**
With distributed computing and mobile applications, synchronizing diverging replicas of data structures is a more and more common problem. We use algebraic methods to reason about filesystem operations, and introduce a simplified definition of conflicting updates to filesystems. We also define algorithms for update detection and reconciliation and present rigorous proofs that they not only work as intended, but also cannot be improved on.
To achieve this, we introduce a novel, symmetric set of filesystem commands with higher information content, which removes edge cases and increases the predictive powers of our algebraic model. We also present a number of generally useful classes and properties of sequences of commands.
These results are often intuitive, but providing exact proofs for them is far from trivial. They contribute to our understanding of this special type of algebraic model, and toward building more complete algebras of filesystem trees and extending algebraic approaches to other data storage protocols. They also form a theoretical basis for specifying and guaranteeing the error-free operation of applications that implement an algebraic approach to synchronization.

## 23 July 2018

### How much data do you need for an A/B test?

If you have plenty of data coming in (lucky you), but you are still not sure whether you can be confident in the difference your test results show (especially if you don't have access to the individual data points, and don't know the underlying distribution or the variance, which isn't reported by e.g. Google Analytics), this back-of-the-envelope calculation may help.

If we measure something in a system, and have a well-established mean value, and then we test a variant of the system and have $$n$$ measurements from the variant, their mean can be different from the established mean merely because the measurements are by nature random. That is, it is possible that the variant system performs in exactly the same way as the original system, and the difference in means we see is due to the randomness. (This is the null hypothesis.) We use Chebyshev's Inequality to approximate the probability of this being the case.

In particular, we assume the measurements are independent random variables $$X_i$$ which all have the same expectation $$\mathrm{E}(X_i)=\mu$$ and variance $$\mathrm{Var}(X_i)=\sigma^2$$. We assume that the mean of measurements from the original system provides the expectation (but see below for evaluating two-sided tests). From Chebyshev's Inequality (http://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf), $\mathrm{Prob}\left(\left|\frac{\sum X_i}{n}-\mu\right|\geq\epsilon\right)\leq\frac{\sigma^2}{n\epsilon^2},$ which is an upper bound on this probability.

If $$r$$ is the range of the data (the difference between the maximum and minimum values), we know $$\sigma^2\leq \frac{r^2}{4}$$. If we also express the difference of the means as a proportion of the range ($$\epsilon=rd$$), we get $\mathrm{Prob}\leq\frac{r^2}{4nr^2d^2}=\frac{1}{4nd^2}.$

As usual, we may consider the difference significant and reject the null hypothesis if its probability is less than 5%, that is, $\mathrm{Prob}\leq\frac{1}{4nd^2}\leq\frac{1}{20},$ from which $n\geq\frac{5}{d^2}.$

In terms of concrete examples, this means:

| $$d$$ of this or more | is significant with a confidence of 95% if $$n$$ is at least |
|---|---|
| 5% | 2,000 |
| 1% | 50,000 |
| 0.5% | 200,000 |
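The bound $$n\geq\frac{5}{d^2}$$ is easy to check in a few lines of Python. The helper below (its name and the default confidence level are my own, not from the post) reproduces the sample sizes in the table:

```python
import math

def min_samples_one_sided(d, confidence=0.95):
    """Smallest n for which the Chebyshev bound 1/(4*n*d^2) on the
    null-hypothesis probability drops below 1 - confidence,
    i.e. n >= 1 / (4 * d^2 * (1 - confidence))."""
    alpha = 1 - confidence  # 0.05 for the 5% threshold used above
    return math.ceil(1 / (4 * d * d * alpha))

for d in (0.05, 0.01, 0.005):
    print(f"d = {d:.1%}: n >= {min_samples_one_sided(d):,}")
```

With `confidence=0.95` this yields 2,000, 50,000 and 200,000 for $$d$$ of 5%, 1% and 0.5%, matching the table; tightening the confidence level simply scales $$n$$ up by the same factor.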

If, instead of comparing one variant to the original system (which we assumed to provide the underlying distribution and expectation), we compare two variants in an A/B test, the probability we calculate is that of seeing a given difference between the means of the measurements if the measurements on the two sides have the same expectation - that is, if the two variants do not perform differently in reality. This is the probability that either the measurements on the A side are sufficiently far from the expectation, or the measurements on the B side are. We approximate this probability with the (always greater) sum of the individual probabilities, and, as $$d$$ is the difference between the expectation and a mean, we use $$d=d_{AB}/2$$, where $$d_{AB}$$ is the difference seen between the means on the A versus the B side. (We do this because in the worst case, the common expectation is halfway between the two means, which allows both means to be as close to it as possible.)

Assuming we have $$n$$ measurements on each side, we get $\mathrm{Prob}_{AB}\leq 2\cdot\frac{1}{4n\left(\frac{d_{AB}}{2}\right)^2} = \frac{2}{nd_{AB}^2}.$ For this probability to be less than 5%, we need $\mathrm{Prob}_{AB}\leq\frac{2}{nd_{AB}^2}\leq\frac{1}{20},$ from which $n\geq\frac{40}{d_{AB}^2}.$ In terms of some concrete numbers:

| $$d_{AB}$$ of this or more | is significant with a confidence of 95% if $$n$$ is at least |
|---|---|
| 5% | 16,000 |
| 1% | 400,000 |
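The two-sided bound $$n\geq\frac{40}{d_{AB}^2}$$ can be checked the same way; this helper (again, the name is mine) gives the per-variant sample size from the table:

```python
import math

def min_samples_ab(d_ab, confidence=0.95):
    """Per-variant n for which the two-sided Chebyshev bound
    2/(n*d_AB^2) falls below 1 - confidence,
    i.e. n >= 2 / (d_AB^2 * (1 - confidence))."""
    alpha = 1 - confidence
    return math.ceil(2 / (d_ab * d_ab * alpha))

for d in (0.05, 0.01):
    print(f"d_AB = {d:.1%}: n >= {min_samples_ab(d):,} per side")
```

Note that the A/B version needs eight times as many measurements per side as the one-sided comparison for the same observed difference: a factor of two from summing the two tail probabilities, and a factor of four from halving $$d$$.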

## 30 June 2018

### Display the full value of a tensor in Tensorflow / Keras

I have been working on visualizing the internals of a neural network. While the Keras FAQ suggests building partial models to achieve this, and this post suggests setting up functions to evaluate a layer, I was tempted to look into the Keras backend's and TensorFlow's Print functions to display the values of a tensor during the actual computation, as this creates an actual node in the computational graph and prints the values as the model is used.

In the first iteration I used a Lambda layer to wrap keras.backend.print_tensor, but the output was truncated to 3 values from the tensor. I proceeded to dig deeper: tf.keras.backend.print_tensor is defined here; it uses tf.Print, which, according to the documentation, has a summarize parameter that cannot be set through Keras. So I started using tf.Print directly, but it was still unclear what the parameter meant.

tf.Print is defined in logging_ops.py, and, as this post explains, gen_logging_ops is generated from ops/logging_ops.cc, but where is the actual source? Finally, after further digging, I found it in tensorflow/core/kernels/logging_ops.cc.

Apparently, it uses SummarizeValue() to truncate the output, which is defined in core/framework/tensor.cc. Looking at the code, all we seem to need to do is pass an integer larger than or equal to the size of the tensor. Also, from the code it is apparent that the values are written to STDERR. The code I used was therefore

```python
layer = keras.layers.Lambda(
    (lambda x: tf.Print(x, [x], message=message, first_n=-1, summarize=1024)),
    name=name,
)(layer)
```

Later, this could be extended to use the number of elements in x as the summarize parameter, perhaps through tf.size.