- Research Article
- Open Access

# Site-specific recombinatorics: in situ cellular barcoding with the Cre Lox system

*BMC Systems Biology* **volume 10**, Article number: 43 (2016)

## Abstract

### Background

Cellular barcoding is a recently developed biotechnology tool that enables the familial identification of progeny of individual cells in vivo. In immunology, it has been used to track the burst-sizes of multiple distinct responding T cells over several adaptive immune responses. In the study of hematopoiesis, it revealed fate heterogeneity amongst phenotypically identical multipotent cells. Most existing approaches rely on ex vivo viral transduction of cells with barcodes followed by adoptive transfer into an animal, which works well for some systems, but precludes barcoding cells in their native environment such as those inside solid tissues.

### Results

With a view to overcoming this limitation, we propose a new design for a genetic barcoding construct based on the Cre Lox system that induces randomly created stable barcodes in cells in situ by exploiting inherent sequence distance constraints during site-specific recombination. We identify the cassette whose provably maximal code diversity is several orders of magnitude higher than what is attainable with previously considered Cre Lox barcoding approaches, exceeding the number of lymphocytes or hematopoietic progenitor cells in mice.

### Conclusions

Its high diversity and in situ applicability make the proposed Cre Lox based tagging system suitable for whole tissue or even whole animal barcoding. Moreover, it can be built using established technology.

## Background

The fate of the progeny of two seemingly identical cells can be markedly distinct. Well studied examples include the immune system and hematopoietic system, for which the extent of clonal expansion and differentiation has been shown to vary greatly between cells of the same phenotype [1–4]. Fate and expression heterogeneity at the single-cell level are also apparent in other systems including the brain [5–7] and cancers [8–10]. Whether this heterogeneity is due to the stochastic nature of cellular decision making, reflects limitations in phenotyping, is caused by external events, or a mixture of effects, is a subject of active study in several fields [11–13]. As addressing this pivotal question through population-level analysis is not possible, experimental tools have been developed that facilitate monitoring single cells and their offspring over several generations.

Long-term fluorescence microscopy represents the most direct approach to assess fate heterogeneity at the single-cell level. Studies employing that technique are numerous [14–20], and have revealed, among many other significant observations, that although the fates of stimulated B cells are heterogeneous, there exist strong correlations at the clonal level in terms of differentiation and death versus division fates [15, 21]. Filming and tracking of cell families in vitro remains technically challenging, is labor intensive, and is only partially automatable [22, 23]. Despite significant advances in the field, continuous tracking in vivo is confined to certain tissues, and to time windows of up to twelve hours for slowly migrating or non-migrating cells.

A radically different approach to long-term clonal monitoring is to mark single cells with unique DNA tags via retroviral transduction, a technique known as cellular barcoding [2, 10, 24–27]. As tags are heritable, clonally related cells can be identified via DNA sequencing. By tagging multi-potent cells of the hematopoietic system and adoptively transferring them into irradiated mice, the contribution of single stem cells to overall hematopoiesis has been quantified [24–26, 28]. Amongst other discoveries, this has revealed statistically consistent heterogeneity in the collection of distinct cell types produced from apparently equi-potent progenitors [29–31]. Current barcoding techniques are unsuitable for tagging cells in vivo, and typically require ex vivo barcoding followed by adoptive cell transfer [26]. This restricts their scope to cells suitable for adoptive transfer, such as hematopoietic stem and progenitor cells, naive lymphocytes, and cancer cells.

Ideally, a cellular barcoding system would inducibly mark cells in situ in their native environment, would be non-toxic, permanent and heritable, barcodes would be easy to read with a high-throughput technique, and the system would enable labeling large numbers of cells with unique barcodes. Recently, two studies have been published that address some of these points. Sun et al. [32] employed a Dox inducible hyperactive form of the Sleeping Beauty transposase to genetically tag stem cells in situ, and followed clonal dynamics during native hematopoiesis in mice. In that system, tags consist of a random insertion site of an artificial transposon, which upon withdrawal of Dox is relatively stable. A second in vivo cellular barcoding system based on site-specific DNA recombination with the Rci invertase was implemented by Peikon and co-workers [33–35]. Inspired by the brainbow mouse [36], this system induces a random barcode by stochastically shuffling a synthetic cassette pre-integrated into the genome of a cell. The authors predicted high code diversity from relatively small constructs (approx. 2 kb) and demonstrated feasibility of random barcode generation in Escherichia coli [35].

Each of those approaches provides elegant advances on shortcomings of previous systems by generating largely unique tags without significant perturbation to the system of interest, but some difficulties remain. For barcode readout, the method in [32] requires whole-genome amplification technology and three arm-ligation-mediated PCR to efficiently amplify unknown insertion sites. Furthermore, the random location of the transposon may impact the behavior of some barcoded clones and thus lead to biased data. Moreover, some background transposon mobilization was detected in certain cell types, subverting the stability of the barcodes. The Rci invertase based system remains to be implemented in cells other than bacteria. As with the Sleeping Beauty transposase, the method requires tight temporal control over Rci expression to make codes permanent.

In the present article, we consider the Cre Lox system as a driver to induce in situ, from a series of tightly spaced Lox sites, large numbers of distinct, permanent, randomly determined barcodes. In contrast to the Brainbow construct [36, 37], which relies on overlapping pairs of incompatible Lox sites to recombine randomly into one of several stable DNA sequence configurations, our design exploits constraints on the distance between Lox sites that arise during DNA loop formation, a prerequisite for site-specific recombination [38–40]. This known feature has not previously been exploited, but is a crucial design element for obtaining high barcode diversity. First, by allowing repeated usage of the same Lox site, code diversity is solely restricted by cassette size and not, as in the Brainbow construct, by the relatively small set of non-interacting Lox sites [41]. Second, for a design without distance constraints, the diversity of stable barcodes creatable with the Cre Lox system is of order *O*(*n*) at best, where *n* is the number of Lox sites [35], whereas with distance constraints, optimal barcode diversities of order *O*(*n*^{3}) are possible. As will be shown in this article, by boosting this scaling with the four incompatible Lox sites that have been reported in the literature [41], 10^{12} distinct codes of about 600 bp each can be generated from a genetic construct as small as 2.5 kb. In combination with the CreEr system [42], this is sufficient to inducibly barcode, for example, all naive CD8 T cells in a mouse [43]. Desirable features are inherently part of the Lox barcode cassette design, including: short and stable barcodes; a single barcode per cell; and robust read-out.

### Cre Lox biology

Before introducing the Lox barcode cassette, we revisit some facts about Cre Lox biology [44, 45]. Cre is a bacteriophage P1 recombinase that catalyzes site-specific recombination between Lox sites. A Lox site is a 34 bp long sequence composed of two 13 bp palindromic flanking regions and an asymmetric 8 bp core region (Fig. 1 a). For recombination to occur, four Cre proteins bind to the four palindromic regions of two Lox sites and form a synaptic complex. A first pair of strand exchanges leads to a Holliday junction intermediate [46]. Isomerization of the intermediate then allows a second pair of strand exchanges, and formation of the final recombinant product [40]. The DNA cleavage site is situated in the asymmetric core region. If the Lox sites are on the same chromosome, their interaction requires formation of a DNA loop. If they have the same orientation (direct repeats), recombination results in excision of the intervening sequence, and this reaction is essentially unidirectional [47]. If Lox sites are in the opposite orientation to each other (inverted repeats), the sequence between the sites is inverted, becoming its reverse complement (Fig. 1 b). In the absence of Cre, Lox-Lox recombination events are below detection limits (e.g. [37], Fig. S1). Due to its compatibility with eukaryotes, the Cre Lox system has become an essential tool in genetic engineering, and a large array of transgenic mouse models with inducible cell-type specific expression of Cre have been created [42].

In in vitro trials with Cre mediated Lox reactions, a sharp decrease in recombination efficiency has been observed when the sequence separating two Lox sites is less than 94 bp [38]. Recombination is still detectable at low levels at 82 bp, but not at 80 bp where DNA stiffness appears to prevent DNA loop formation, and as a consequence Lox site interaction. For the distinct, but similar, Flp/FRT system this minimal distance was established to be smaller in vivo, with interactions still possible at 74 bp [39]. The existence of a minimal distance is one of the key features that we will exploit to make random barcodes stable, but in our proposed design it will only prove necessary for it to be greater than 44 bp.

### Lox barcode cassettes

In full generality, a Lox barcode cassette is a series of Lox sites interlaced with *n* distinguishable DNA code elements of size *m* bp each. On Cre expression, code elements change orientation and position, or are excised [34]. Through Cre mediated excision, the number of elements eventually decreases until reaching a stable number (Fig. 1 c). Sequences that have attained a stable number of code elements form size-stable barcodes. A cassette’s code diversity is the number of size-stable barcodes that can be generated from the cassette via site-specific recombination.

Our main result is a robust Lox cassette design that provably maximizes code diversity for any given cassette length *n* and element length *m*≥5 bp. The design is robust to both sequencing errors and to the minimal interaction distance between Lox sites. The analysis that leads us to the design is provided in the “Optimal design” Section. The identification of code element sequences that avoid misclassification due to sequencing mismatch errors then follows. Finally, probabilistic aspects of code generation from an optimal barcode cassette are explored via Monte Carlo simulation. Lox cassettes with code elements of size 4 bp, higher order Lox interactions, the impact of transient Cre activation, and distance-dependent Lox-Lox complex formation are considered in the discussion.

## Results

### A robust cassette design that maximizes code diversity

The optimal design will prove to have the orientation of both the outermost, and any two consecutive, Lox sites inverted (Fig. 1 c). Code elements between Lox sites are longer than four bp, but shorter than 24 bp. The lower limit ensures that elements can be chosen sufficiently distinctly to correct for at least two sequencing errors per element. Due to the minimal Lox interaction distance, the upper limit is necessary to ensure that barcodes with three code elements are size-stable.

The barcode diversity for this cassette design with *n* code elements under constitutive Cre expression will, as shown in the Optimal design Section, transpire to be

\[ D(n) = \frac{(n+1)(n-1)^{2}}{2} + (n+1), \tag{1} \]

which is maximal for code elements that are larger than four base pairs.

A good compromise between cassette length, robustness to sequencing errors and barcode diversity is given by an alternating Lox cassette with 13 elements of length 7 bp each as shown in Fig. 1 c. The cassette is initially 567 bp long, which after excisions and inversions, generates size-stable barcodes that are composed of either a single element or three elements, with lengths 75 bp and 157 bp respectively, including remaining inactive Lox sites. This generates a code diversity of 1022 barcodes, far less than the 3×10^{9} base pairs of the mouse genome, i.e. the maximal theoretical diversity achievable by the Sleeping Beauty transposase barcoding system [32].

However, concatenating four such cassettes with poorly-interacting Lox variants (e.g. LoxP, Lox2272, Lox5171 and m2 [41], Fig. 1 d) yields a size-stable code diversity of 1022^{4}≈10^{12}. In mice, this is sufficient to tag all CD8 T cells [43] or all nucleated cells in the bone marrow [48].
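The arithmetic behind these figures can be reproduced with a short sketch. It assumes the diversity expression implied by the Optimal design Section, namely the upper bound (*n*+1)(*n*−1)^{2}+(*n*+1) with the three-element term halved under constitutive Cre expression. The function names are illustrative and are not part of the original work.

```python
LOX_BP = 34  # length of a single Lox site in bp, as stated in the text


def code_diversity(n: int) -> int:
    """Size-stable codes of an alternating cassette with n (odd) elements."""
    three_element_codes = (n + 1) * (n - 1) ** 2 // 2  # halved: residual inversions
    one_element_codes = n + 1
    return three_element_codes + one_element_codes


def cassette_length(n: int, m: int) -> int:
    """Initial cassette: n elements of m bp interlaced with n + 1 Lox sites."""
    return (n + 1) * LOX_BP + n * m


def barcode_length(k: int, m: int) -> int:
    """Barcode with k elements, including its k + 1 remaining Lox sites."""
    return (k + 1) * LOX_BP + k * m


n, m, copies = 13, 7, 4
print("codes per cassette:", code_diversity(n))
print("cassette length (bp):", cassette_length(n, m))
print("barcode lengths (bp):", barcode_length(1, m), barcode_length(3, m))
print("concatenated construct (bp):", copies * cassette_length(n, m))
print("longest concatenated barcode (bp):", copies * barcode_length(3, m))
print("concatenated diversity:", code_diversity(n) ** copies)
```

The values for a single cassette (1022 codes, 567 bp, 75 bp and 157 bp) and for four concatenated cassettes (2268 bp construct, barcodes of at most 628 bp) agree with those quoted in the text.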

### A practical implementation

To implement Cre Lox barcoding in the mouse, one could cross mice generated from embryonic stem cells that had previously been transduced with the concatenated Lox barcoding cassettes described above (2268 bp) onto a Tamoxifen inducible cell-type specific CreEr expressing background [42]. A barcoding experiment would then be initiated by administering Tamoxifen to the animal, which activates Cre and the generation of a barcode (≤628 bp) in each cell where Cre becomes active. Some time after activation, cells of interest would be harvested, sorted for specific phenotypes, and sequenced using a next generation sequencing platform that allows read-lengths >600 bp. Cells originating from the same progenitor alive at the time of Tamoxifen administration would carry the same barcode. This information would then be used for inference on, for example, lineage pathways and clonal fate tracking. To identify frequent barcodes that are to be discarded in the analysis (see the Barcode distribution is heterogeneous Section), in a control experiment large numbers of cells would be harvested shortly after Tamoxifen administration and sequenced.

### Optimal design

A simple upper bound on the barcode diversity of *k* elements from a cassette initially containing *n* elements is the number of possible outcomes when choosing *k* from *n* elements in arbitrary order and orientation:

\[ 2^{k}\,\frac{n!}{(n-k)!}. \]

Although loose, it will become clear that it captures the dominant growth, *O*(*n*^{k}), indicating the importance of *k* in generating barcode diversity and motivating a closer look at how cassette designs influence it.
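As a small numerical illustration of this bound, the sketch below counts ordered selections of *k* out of *n* elements, each in one of two orientations. The function name is hypothetical, and the example values (*n* = 13, *k* up to 3) are chosen only to match the cassette discussed earlier.

```python
from math import perm


def loose_upper_bound(n: int, k: int) -> int:
    """Ordered choices of k out of n elements, each in one of two orientations."""
    return 2 ** k * perm(n, k)


for k in (1, 2, 3):
    print(k, loose_upper_bound(13, k))
```

For fixed *k* the count grows like *n*^{k}, which is why the largest attainable barcode size dominates the diversity.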

For what follows, we introduce some terminology: a cassette is alternating if the orientation of any two consecutive Lox sites is inverted (Fig. 1 c); outermost Lox sites are termed flanking Lox sites; and flanking sites are direct or inverted if they have the same or opposite orientation, respectively.

#### Code diversity is determined by code element length and orientation of flanking sites

Cre recombination requires a minimal distance between the interacting Lox sites. In what follows we assume that the minimal distance for Lox interaction is 82 bp, but our results will be robust for any minimal interaction distance greater than 44 bp.

To understand how a minimal Lox-Lox interaction distance and cassette design determine size-stable barcodes and code diversity, we start with the simplest case, a barcode with a single code element (Fig. 2 a). If the code element is less than 82 bp, the barcode is size-stable irrespective of the orientation of its flanking sites. If the element is larger than 82 bp, the code is only size-stable if the flanking sites are inverted, as excision would otherwise remove the element.

For a barcode with two elements, the sequence between the flanking sites contains an additional element and a Lox site (34 bp), giving a sequence of 2*m*+34 bp. If the flanking sites have the same orientation, the barcode is size-stable if 2*m*+34<82 bp, hence if *m*<24 bp. If they are in opposite orientation, excisions can only occur if flanking sites interact with the middle Lox site, and *m*<82 bp is sufficient for stability (Fig. 2 b). For given *m*, in general if there exists a barcode of size *k* with direct flanking sites, a barcode with *k*+1 elements is possible that has inverted flanking sites. Thus *m* and the orientation of the flanking sites are critical features that determine the maximum *k*.

In Fig. 2 c, the stability of barcodes with *k*∈{2,3,4,5} is shown as a function of *m* for a cassette with inverted flanking sites. The stability depends on a critical distance, i.e., the largest distance between two Lox sites in the barcode that are, or can be brought into, the same orientation via recombination. As shown, barcodes of size three and four become unstable if *m*≥24 bp and *m*≥5 bp, respectively, while barcodes of size five or greater are always unstable.
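The thresholds quoted above can be recovered from a simple rule, sketched here under an assumption that is inferred from the text rather than stated in it: the critical distance spans *k* elements when the flanking sites are direct and *k*−1 elements when they are inverted, together with the Lox sites lying in between. The function names and the default `d_min` of 82 bp are illustrative.

```python
LOX_BP = 34  # length of a Lox site in bp


def critical_distance(k: int, m: int, inverted_flanks: bool) -> int:
    """Longest sequence (bp) separating two Lox sites that can be directly repeated.

    Assumption: with inverted flanking sites that stretch spans k - 1 elements,
    with direct flanking sites it spans all k elements.
    """
    spanned = k - 1 if inverted_flanks else k
    if spanned <= 0:
        return 0
    return spanned * m + (spanned - 1) * LOX_BP


def is_size_stable(k: int, m: int, inverted_flanks: bool = True, d_min: int = 82) -> bool:
    """A barcode is size-stable if no excision-capable pair is d_min or more apart."""
    return critical_distance(k, m, inverted_flanks) < d_min


for m in (4, 5, 7, 23, 24):
    stable_sizes = [k for k in range(1, 6) if is_size_stable(k, m)]
    print(f"m = {m:2d} bp: stable barcode sizes with inverted flanks = {stable_sizes}")
```

Under this rule, three-element barcodes with *m* = 5 bp have a critical distance of 44 bp, which is consistent with the statement that the design only needs the minimal interaction distance to exceed 44 bp.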

Orientation of a cassette’s flanking sites is immutable under recombination. Therefore cassettes with direct and inverted flanking sites generate barcodes with direct and inverted flanking sites only. Having seen that maximal code diversity grows as *O*(*n*^{k}), and that having inverted flanking sites relative to direct ones increases the maximum size of barcodes by one, it follows that the diversity for cassettes with inverted flanking sites is of the order *O*(*n*^{k+1}). Inverted flanking sites are thus superior in terms of code diversity and are an essential design decision.

Optimality regarding the size of the elements, *m*, is more intricate. For *m*<5, the maximum size of barcodes is four elements, and according to the formula above, their diversity grows as *O*(*n*^{4}). The stability of barcodes with four elements is, however, sensitive to the minimal distance estimate (illustrated by the gray interval in Fig. 2 c). In addition, the short length of code elements limits error correction, a point revisited later. Thus we focus on cassettes in the regime 5 bp ≤*m*<24 bp, which generate error-robust barcodes of up to size three and a code diversity that is insensitive to the reported minimal Lox interaction distance.

#### Alternating Lox cassettes with inverted flanking sites maximize code diversity

For the orientation of the remaining Lox sites we prove, via a two-step strategy, that the alternating design produces maximal code diversity. First we derive a refined upper bound for the diversity that takes into account the structure of the Lox cassette, but ignores constraints imposed by the recombination process. We then show that alternating Lox cassettes with inverted flanking sites and *n*≥7 elements are unconstrained in terms of barcode generation via sequential recombination events, thus achieving this upper bound.

#### An upper bound for Lox barcode diversity

During Cre induced recombination, Cre proteins cleave the core region of the interacting Lox sites asymmetrically [40]. The sequences between subsequent cleavage sites are not affected by Cre and represent the fundamental building blocks of the Lox barcode cassette. Each block contains a code element and half a Lox site on each side.

Depending on the orientation of the Lox sites, there are four possible types of blocks (Fig. 2 d). Three colours have been used to code these: red, green and blue. By definition, the reverse complement of a block is of the same colour class. In contrast to blue blocks, red and green blocks have their Lox cores cleaved in a way such that their flanking Lox sites are unchanged after inversion, while the intervening sequence is reverse-complemented.

Blocks are similar to the concept of units defined in [34], which proves instrumental to derive expressions for the total number of sequences, stable or unstable, that are generated from a Lox cassette where all (*n*+1) sites can interact. In our context, the latter condition implies *m*>82 and, as discussed above, a code diversity of order *O*(*n*). Here we focus on enumerating exclusively size-stable sequences that arise in the regime 5 bp ≤*m*<24 bp with code diversities of order *O*(*n*^{3}).

Stable codes are necessarily made of blocks from the initial cassette, and as shown in Fig. 2 e, their composition in terms of block colors is prescribed. Letting *n*_{r}, *n*_{g}, and *n*_{b} be the number of red, green, and blue blocks in the initial cassette with *n* elements, an upper bound on the number of possible barcodes of size *k* with *k*_{r} red, *k*_{g} green and *k*_{b} blue blocks is

\[ 2^{k_{r}+k_{g}}\,\frac{n_{r}!}{(n_{r}-k_{r})!}\,\frac{n_{g}!}{(n_{g}-k_{g})!}\,\frac{n_{b}!}{(n_{b}-k_{b})!}, \]

where *n*_{r}+*n*_{g}+*n*_{b}=*n* and *k*_{r}+*k*_{g}+*k*_{b}=*k*. It is the number of possible outcomes when choosing *k*_{r}, *k*_{g} and *k*_{b} from *n*_{r}, *n*_{g} and *n*_{b} elements in arbitrary order. The additional factor \(2^{k_{r}+k_{g}}\) arises as there are two valid orientations of every code element of a red and green block after recombination, whereas blue blocks do not enjoy this property. Conditioned on *n*_{r}, *n*_{g}, and *n*_{b}, to derive an upper bound for a cassette’s diversity, we add the numbers for the four possible stable barcode configurations of *k*_{r}, *k*_{g}, and *k*_{b}, shown in Fig. 2 e, taking into account that certain configurations appear more than once (e.g. the configuration with one red and two blue blocks appears three times). For 5 bp ≤*m*<24 bp, and cassettes with inverted flanking sites pointing at each other (the opposite case is similar) this yields, by applying the expression above to each of the four configurations,

\[ 2n_{r} + 4n_{r}n_{b} + 8n_{r}(n_{r}-1)n_{g} + 6n_{r}n_{b}(n_{b}-1). \]

By construction, *n*_{g}=*n*_{r}−1, and since *n*_{b}=*n*−2*n*_{r}+1, substituting the respective terms leads to an expression that is a function of *n* and *n*_{r} alone. For given *n* odd, this reduces the task of finding the optimal cassette design to an explicitly solvable one-dimensional optimization problem,

For *n*≥5, the global maximum is achieved at the boundary *n*_{r}=(*n*+1)/2. This implies *n*_{b}=0, and a global upper diversity bound of (*n*+1)(*n*−1)^{2}+(*n*+1), of order *O*(*n*^{3}). It is easily verified that *n*_{b}=0 is only possible if the cassette design is alternating and *n* is odd, which implies the flanking sites are inverted.
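The one-dimensional optimization can be explored numerically with the sketch below. It rests on an assumption about the four stable configurations that is inferred from the description above and not taken from Fig. 2 e directly: one red block; one red and one blue block (two arrangements); red-green-red; and one red with two blue blocks (three arrangements). The function names are hypothetical.

```python
def upper_bound(n: int, n_r: int) -> int:
    """Diversity bound for n elements (n odd) given the number of red blocks n_r.

    Assumed configurations, each counted as 2^(k_r + k_g) times ordered choices:
    R; R+B (x2 arrangements); R-G-R; R+B+B (x3 arrangements).
    """
    n_g = n_r - 1            # flanking sites point at each other
    n_b = n - 2 * n_r + 1
    if n_g < 0 or n_b < 0:
        raise ValueError("n_r is incompatible with n")
    one_red = 2 * n_r
    red_blue = 2 * (2 * n_r * n_b)
    red_green_red = 8 * n_r * (n_r - 1) * n_g
    red_two_blue = 3 * (2 * n_r * n_b * (n_b - 1))
    return one_red + red_blue + red_green_red + red_two_blue


def best_red_count(n: int) -> int:
    """Number of red blocks that maximizes the bound for a given odd n."""
    return max(range(1, (n + 1) // 2 + 1), key=lambda n_r: upper_bound(n, n_r))


for n in (3, 5, 7, 13):
    n_r = best_red_count(n)
    print(f"n = {n:2d}: best n_r = {n_r}, bound = {upper_bound(n, n_r)}")
```

Under these assumptions the maximum sits at the boundary *n*_{r}=(*n*+1)/2 for the odd values *n*≥5 tried here but not for *n*=3, in line with the statement in the text, and the boundary value equals (*n*+1)(*n*−1)^{2}+(*n*+1).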

#### Alternating Lox cassette design achieves the upper diversity bound

For an alternating cassette design, achieving the code diversity upper bound requires complete freedom in code generation via recombination events. By construction, we show that this is the case if *n*≥7. To aid understanding, we illustrate in Fig. 2 f operations on odd and even elements via Cre Lox recombination, in cassettes with five, six and seven elements. Note that each operation (or move) inverts the orientation of the respective element.

Consider an alternating cassette with five elements and *m*≥5 bp, and recombination events that do not alter the size of the cassette (i.e., inversions). First note that red blocks in position three and five can move into the first position via a single recombination event (Fig. 2 f). Furthermore, a red block in position one can be inverted by first moving to position three, then to five, and back again. A straightforward recipe to create an arbitrary code made of a single red block is then to: i) move the block into the first position (if required); ii) change its orientation (if required); and finally iii) excise the remaining blocks.

Similarly, to generate an arbitrary code composed of a red and a green block from an alternating cassette with six elements, we can perform steps i) and ii). Then we apply the same procedure to the green blocks, leaving the first block untouched. This results in the first two blocks of the cassette being identical to the desired code. To generate the size-stable code, elements that are not part of the code are excised.

Finally, for a cassette with seven elements, sequentially following the recipe given above, the first three blocks can be populated such that they match any possible code before excising the remaining blocks. This shows that any possible code of size one to three can be created via Lox recombination if the cassette is alternating, *n*≥7, *m*≥5 bp, and flanking sites are inverted.

Under constitutive Cre expression, barcodes with three elements can still undergo inversions via the flanking sites, which reduces their code diversity by a factor of two. The code diversity is therefore that given in Eq. (1).

### Design of code element sequences

That barcodes generated from a Lox cassette are pre-defined in terms of sequence and position in the genome represents an advantage over existing in situ barcoding systems that rely on insertion site analysis for barcode readout [32, 49]. If code reading were error-free, choosing code elements of a particular color (red, green or blue, see Fig. 2 d for the definition) from a set of sequences that differ by at least one base pair in both orientations would be sufficient. The maximum number of such elements is easily computed as (4^{m}−4^{m/2})/2 and 4^{m}/2 for *m* even or odd, respectively, which is large even for small *m*.
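These counts can be checked by brute force for small *m*. The sketch below interprets "differ in both orientations" as follows, which is an assumption: a sequence and its reverse complement count as one element, and sequences equal to their own reverse complement are excluded because they cannot report orientation. The function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def max_elements_formula(m: int) -> int:
    """Closed form quoted in the text for even and odd m."""
    if m % 2 == 0:
        return (4 ** m - 4 ** (m // 2)) // 2
    return 4 ** m // 2


def max_elements_bruteforce(m: int) -> int:
    """Count sequences of length m up to reverse complement, skipping self-complementary ones."""
    classes = set()
    for letters in product("ACGT", repeat=m):
        seq = "".join(letters)
        rc = reverse_complement(seq)
        if seq != rc:
            classes.add(min(seq, rc))
    return len(classes)


for m in range(1, 7):
    print(m, max_elements_formula(m), max_elements_bruteforce(m))
```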

With reading errors, in order to remain perfectly robust to one mismatch error, elements of a given color need to differ by at least three base pairs in both orientations for nearest-neighbor matching to be able to correct the error [50]. To ensure correction of *j* mismatch errors, the minimal required Hamming distance between the code elements is 2*j*+1 bp. The size of the sets of elements that meet this condition quickly decreases with increasing *j* (see Fig. 3 a for numerical estimates). To reliably be able to correct for two sequencing errors requires *m*≥5 bp.

Assuming that sequencing errors arise independently and error rates are identical for all bases, the number of mismatch sequencing errors in a code element of size *m* is Binomial with parameters *m* and the error probability per bp [51]. Any element that has *j* or fewer errors will be classified correctly by nearest-neighbor matching. The probability of more than *j* errors gives an upper bound for the expected proportion of misclassified code elements. Fig. 3 b shows this for elements of size *m*=7 bp as a function of the minimal distance and the mismatch error rates for next-generation sequencing platforms [52]. Different symbols indicate different sequence data. Even for low-fidelity platforms like Pacific Biosciences single molecule real time sequencing, a minimal distance of five bp results in less than ten misclassified elements per million.
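The Binomial bound described here is straightforward to evaluate. In the sketch below the per-base mismatch rates are placeholder values chosen for illustration only; they are not the platform error rates used for Fig. 3 b, and the function names are hypothetical.

```python
from math import comb


def correctable_errors(min_distance: int) -> int:
    """Largest j with 2j + 1 <= min_distance (nearest-neighbor matching)."""
    return (min_distance - 1) // 2


def prob_more_than(j: int, m: int, p: float) -> float:
    """P(X > j) for X ~ Binomial(m, p): bound on the share of misclassified elements."""
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(j + 1, m + 1))


m = 7
for p in (0.001, 0.005, 0.01):  # illustrative per-base mismatch rates
    for d in (1, 3, 5):
        j = correctable_errors(d)
        print(f"p = {p}, distance = {d}: bound = {prob_more_than(j, m, p):.3e}")
```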

A concrete example for an alternating Lox barcoding cassette with 13 code elements of size seven bp each (in bold), and robust to two sequencing errors per element (i.e. the minimal Hamming distance between elements of the same color is 5), is:

ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **AAAAAAC**ATAACTTCGTATA GCATACAT TATACGAAGTTAT **AAACCCG**ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **AACGCTA**ATAACTTCGTATA GCATACAT TATACGAAGTTAT **AGTCATC**ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **ACCCGCC**ATAACTTCGTATA GCATACAT TATACGAAGTTAT **ATGAACA**ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **ACGTTAA**ATAACTTCGTATA GCATACAT TATACGAAGTTAT **CACTGAA**ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **CAACTGA**ATAACTTCGTATA GCATACAT TATACGAAGTTAT **CCAATCC**ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **CCTCCAG**ATAACTTCGTATA GCATACAT TATACGAAGTTAT **GCCCCGA**ATAACTTCGTATA ATGTATGC TATACGAAGTTAT **CGTAGCA**ATAACTTCGTATA GCATACAT TATACGAAGTTAT.
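The distance property of the thirteen example elements can be inspected with a short script. It assumes that, in an alternating cassette, elements at odd and even positions belong to the two different color classes, and it compares each pair within a class in both orientations. The helper names are illustrative.

```python
ELEMENTS = [
    "AAAAAAC", "AAACCCG", "AACGCTA", "AGTCATC", "ACCCGCC", "ATGAACA", "ACGTTAA",
    "CACTGAA", "CAACTGA", "CCAATCC", "CCTCCAG", "GCCCCGA", "CGTAGCA",
]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(group: list[str]) -> int:
    """Smallest pairwise distance within a group, checking both orientations."""
    best = len(group[0])
    for i, a in enumerate(group):
        for b in group[i + 1:]:
            best = min(best, hamming(a, b), hamming(a, reverse_complement(b)))
    return best


odd_positions = ELEMENTS[0::2]   # positions 1, 3, ..., 13 (assumed one color class)
even_positions = ELEMENTS[1::2]  # positions 2, 4, ..., 12 (assumed the other class)
print("minimal distance, odd positions:", min_distance(odd_positions))
print("minimal distance, even positions:", min_distance(even_positions))
```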

### Probabilistic features of optimal Lox cassettes

In this section we explore probabilistic features of the optimal design: the probability to generate each of the final codes; and the number of recombination events that are needed to create size-stable codes. For the analysis, we make two assumptions: first, all interactions with Lox sites that are at least 82 bp apart are equally likely; second, recombination events occur sequentially and independently.

#### Barcode distribution is heterogeneous

Size-stable barcodes of a Lox cassette are randomly generated and not all codes are equally likely. This is in contrast to the Rci invertase based approach implemented by Peikon et al. [35], who reported a close-to-uniform distribution of barcodes generated in E. coli after several recombination events.

Although an analytical expression for the probability mass function of final codes is not available, stochastic simulations enable us to study properties of practical importance such as the probability of generating a code more than once. Ensuring this probability is low is important in practice because progeny of two cells that independently generate the same code will be confounded as pertaining to the same clone.

Figure 3 c shows the generation probability of each of the 1022 codes that ensue from a cassette with 13 elements (sorted in ascending order). To produce this plot, 10^{8} barcodes were Monte Carlo generated in silico via sequential recombination of the initial cassette. The number of times a specific code appeared was recorded, normalized and sorted. While some codes are relatively frequent, most are rare. In Fig. 3 d, the average number of recombination events (inversions: blue, excisions: black) is plotted as a function of barcode probability. The number of inversions and barcode probability are negatively correlated, an indication that rare codes undergo, on average, more inversions. The number of excisions is close to two for all codes.
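A minimal Monte Carlo sketch of this procedure is given below. It follows the two stated assumptions (all Lox pairs at least 82 bp apart are equally likely to interact, and events occur sequentially), and it additionally assumes that the simulation may stop once no excision is possible and that a three-element code and its full inversion are counted as one code. It is not the authors' simulation code; all names are illustrative.

```python
import random
from collections import Counter

LOX_BP, D_MIN = 34, 82


def productive_pairs(n: int, m: int) -> list[tuple[int, int]]:
    """Lox site pairs (i, j) separated by at least D_MIN bp in an n-element cassette."""
    return [(i, j) for i in range(n + 1) for j in range(i + 1, n + 1)
            if (j - i) * m + (j - i - 1) * LOX_BP >= D_MIN]


def can_excise(n: int, m: int) -> bool:
    """In an alternating cassette, sites an even number of elements apart are direct repeats."""
    return any((j - i) % 2 == 0 for i, j in productive_pairs(n, m))


def generate_barcode(n: int, m: int, rng: random.Random):
    """Return (code, inversions, excisions); elements are signed ints, sign = orientation."""
    cassette = list(range(1, n + 1))
    inversions = excisions = 0
    while can_excise(len(cassette), m):
        i, j = rng.choice(productive_pairs(len(cassette), m))
        if (j - i) % 2 == 0:                      # direct repeats: excise elements i..j-1
            del cassette[i:j]
            excisions += 1
        else:                                     # inverted repeats: reverse and flip
            cassette[i:j] = [-e for e in reversed(cassette[i:j])]
            inversions += 1
    code = tuple(cassette)
    if productive_pairs(len(cassette), m):        # residual inversions remain possible
        code = min(code, tuple(-e for e in reversed(cassette)))
    return code, inversions, excisions


rng = random.Random(0)
counts = Counter(generate_barcode(13, 7, rng)[0] for _ in range(20000))
print("distinct codes observed:", len(counts))
print("most frequent codes:", counts.most_common(3))
```

With far fewer samples than the 10^{8} used for Fig. 3 c, rare codes may not appear, so the number of distinct codes observed is only a lower bound on the diversity.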

Ideally, each cell is tagged with a unique barcode. As with all existing barcoding techniques, however, 100 % unique barcodes cannot be guaranteed unless each cell is separately transduced with a different code, an approach pursued by Grosselin et al. [53]. The expected number of unique barcodes depends on the code diversity *D*; on *p*_{i}, the probability of code *i*, where *i*∈{1,2,…,*D*}; and on *j*, the total number of codes that are generated. Using analysis of the generalised birthday party problem [54], the expected proportion of unique codes is

where the numerically convenient approximation on the right hand side arises from a Taylor expansion around 0 and is appropriate if (*j*−1)≪1/(max_{i} *p*_{i}). Relatively large *p*_{i}’s negatively affect the expected proportion of unique codes. Therefore, for heterogeneous barcode distributions, a natural strategy is to discard the most frequent codes in order to exclude from the analysis barcodes that are more likely to be induced more than once. In the following, we assume that from all induced barcodes, keeping a subset that contains on average 99 % unique barcodes is sufficient for most applications and call these barcode sets 99 %-unique.
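The quantity discussed here can be sketched numerically. The code below uses the standard argument that a generated code *i* is unique if none of the other *j*−1 draws produces it, giving Σ_{i} *p*_{i}(1−*p*_{i})^{j−1}; the first-order expansion shown alongside it is an assumption about the form of the approximation and may differ from the one used in the article. Function names are illustrative.

```python
def expected_unique_fraction(probs: list[float], j: int) -> float:
    """Expected share of j independently generated codes that occur exactly once."""
    return sum(p * (1 - p) ** (j - 1) for p in probs)


def first_order_approximation(probs: list[float], j: int) -> float:
    """Taylor expansion around p = 0; reasonable when (j - 1) * max(probs) << 1."""
    return 1 - (j - 1) * sum(p * p for p in probs)


# Illustration with a uniform distribution over 1022 codes (the real one is heterogeneous).
uniform = [1 / 1022] * 1022
for j in (2, 10, 100):
    print(j, expected_unique_fraction(uniform, j), first_order_approximation(uniform, j))
```

Because the sum of squared probabilities grows with heterogeneity, removing the most frequent codes from the analysis raises the expected share of unique codes among those that are kept.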

Using the approximation Eq. (2), in Fig. 3 e we computed the maximum number of cells in which a barcode is induced versus the number of induced barcodes that are 99 %-unique, for one to four sequential cassettes (indicated by the numbers 1 to 4). The color represents the percentage of discarded codes relative to the total code diversity. This parameter can be adjusted to meet the specific needs of a given experiment. For instance, for four concatenated cassettes with 13 elements each, inducing barcodes in a target population of 10^{8} cells yields 10 %, or 10^{7} 99 %-unique barcodes (indicated by a circle). If the target population is larger, e.g., 10^{12} cells (indicated by a square), the proportion of 99 %-unique to total induced barcodes is reduced (approximately 0.1 %), giving 10^{9} single cells that carry a 99 %-unique barcode.

These results show that by discarding frequent codes from the read-out, large numbers of clones can be tracked with high confidence, suggesting Cre Lox in situ barcoding is suitable for high-throughput lineage tracing experiments.
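
The discarding strategy can be sketched in a few lines of Python. As one possible reading of the 99 %-unique criterion, we assume here that the relevant quantity is the expected number of cells carrying a retained barcode exactly once, divided by the expected number of cells carrying any retained barcode. The function names and the toy distribution in the usage note are hypothetical.

```python
def unique_fraction_after_discard(p, j, n_discard):
    """Uniqueness of the read-out after dropping the n_discard most frequent codes.

    Returns (fraction, expected_unique_cells), where fraction is the expected
    share of retained cells whose barcode occurs in exactly one of the j cells.
    """
    kept = sorted(p, reverse=True)[n_discard:]
    if not kept:
        return 1.0, 0.0                      # nothing retained, trivially unique
    kept_mass = sum(kept)                    # expected share of cells retained
    unique_mass = sum(q * (1.0 - q) ** (j - 1) for q in kept)
    return unique_mass / kept_mass, j * unique_mass


def min_discard_for_target(p, j, target=0.99):
    """Smallest number of frequent codes to discard to reach the target fraction."""
    for n_discard in range(len(p) + 1):
        fraction, n_unique = unique_fraction_after_discard(p, j, n_discard)
        if fraction >= target:
            return n_discard, n_unique
```

For example, `min_discard_for_target([0.5] + [0.0005] * 1000, 11)` discards the single dominant code and keeps the rare ones. Removing the most frequent code always raises the retained fraction, so the loop stops at the first threshold crossing.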

#### Number of recombination events to generate barcodes does not diverge with cassette size

If Cre is expressed for long enough, Lox cassettes will eventually become size-stable. The time this will take correlates with the number of recombination events that separate a stable barcode from its initial cassette. Below, we estimate this quantity using the theory of absorbing Markov chains.

In a cassette with *n* elements, there are *n*+1 Lox sites. The number of Lox pairs that flank *k* elements is *n*+1−*k*. Lox pairs that have fewer than three elements in between do not interact, as they are separated by less than the minimal 82 bp distance. Pairs of Lox sites that have three or more elements in between are termed productive. For *n*≥3 the number of productive pairs is \(\sum _{k=3}^{n}(n+1-k)=(n-1)(n-2)/2\), and the number of productive pairs where recombination leads to excision, i.e. where an even number of elements separates the two sites, is

\[\sum_{\substack{k=4\\ k\ \mathrm{even}}}^{n-1}(n+1-k)=\frac{(n-1)(n-3)}{4}\]

for *n* odd. The equalities are a direct consequence of evaluating the respective sums. The probability that a productive pair excises exactly *k* elements is given by the ratio of the number of productive pairs that are separated by *k* elements to the total number of productive pairs, i.e.

\[\frac{n+1-k}{(n-1)(n-2)/2}=\frac{2(n+1-k)}{(n-1)(n-2)} \tag{3}\]

for *k* even, 3≤*k*≤*n*, otherwise it is zero. Similarly, the number of productive pairs where recombination leads to inversion is (for *n* odd)

\[\sum_{\substack{k=3\\ k\ \mathrm{odd}}}^{n}(n+1-k)=\frac{(n-1)^{2}}{4}, \tag{4}\]

and the probability that interaction of a productive pair leads to an inversion is

\[\frac{(n-1)^{2}/4}{(n-1)(n-2)/2}=\frac{n-1}{2(n-2)}. \tag{5}\]
Equations (3)–(5) enable a description of the formation of size-stable barcodes as a discrete-time absorbing Markov chain. The number of elements in the cassette corresponds to its state, and Eqs. (3) and (5) give the transition probabilities from *n* to *n*−*k*, and from *n* to *n* elements respectively. There are *n*−3 transient and 4 absorbing states. Absorbing states are cassettes that have either three, two, one, or zero elements. Absorbing Markov models are well understood, and a wealth of theoretical predictions regarding their properties is available [55]. These include the average number of steps until reaching an absorbing state, starting in one of the transient states. The fundamental matrix of this Markov chain is

\[N=(I_{n-3}-Q)^{-1}, \tag{6}\]

where *I*_{n−3} is an (*n*−3)×(*n*−3) identity matrix, and *Q* is the transition matrix corresponding to the transient states. The expected number of recombination events, starting with a cassette of *n* elements, until reaching a final code is then the entry of the vector \(t=Nc\) that corresponds to the state with *n* elements, where *c* is a column vector all of whose entries are 1.
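
As an illustration of this calculation, the Python sketch below enumerates the Lox pairs of an alternating cassette, builds the transition probabilities from the pair counts, and obtains the expected number of recombination events. Because excisions only shorten the cassette, *Q* is triangular, so the sketch solves \((I-Q)t=c\) by forward substitution instead of forming the inverse explicitly, which yields the same vector as \(t=Nc\). Uniform choice among productive pairs, the alternating orientation pattern, and all function names are assumptions of this sketch.

```python
from fractions import Fraction


def transition_counts(n):
    """Classify the productive Lox pairs of an alternating cassette with n elements.

    Sites are numbered 0..n and sites a < b flank k = b - a elements. Pairs with
    k < 3 are too close to interact. With alternating orientations, an even k
    means equal orientation (excision of k elements) and an odd k means
    opposite orientation (inversion, size unchanged).
    """
    excisions = {}              # k -> number of pairs that excise k elements
    inversions = 0
    for a in range(n + 1):
        for b in range(a + 3, n + 1):
            k = b - a
            if k % 2 == 0:
                excisions[k] = excisions.get(k, 0) + 1
            else:
                inversions += 1
    return excisions, inversions


def expected_events(n_max):
    """Expected number of recombination events until a size-stable state is reached."""
    t = {n: Fraction(0) for n in range(4)}      # absorbing states 0, 1, 2, 3
    for n in range(4, n_max + 1):
        excisions, inversions = transition_counts(n)
        total = inversions + sum(excisions.values())
        stay = Fraction(inversions, total)      # inversion keeps the state at n
        jump = sum(Fraction(c, total) * t[n - k] for k, c in excisions.items())
        t[n] = (1 + jump) / (1 - stay)          # one row of (I - Q) t = c
    return t


if __name__ == "__main__":
    for n, value in expected_events(25).items():
        if n >= 5 and n % 2 == 1:
            print(n, float(value))
```

Exact rational arithmetic is used so that the pair counts can be compared directly with the closed-form sums for odd *n*.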

In Fig. 3 f, the average number of recombination events that separate the initial cassette from a final code is shown as a function of the cassette length. Although code diversity grows as *O*(*n*^{3}), the number of recombination events required to generate a code increases linearly in *n*.

## Discussion

*Lox barcode cassettes with code elements of size four*

In deriving the upper code diversity bound and the optimal Lox barcode cassette, we focused on code elements in the regime 5 bp≤*m*<24 bp. These have maximal size-stable barcodes of three elements that are largely insensitive to over- and underestimation of the minimal Lox interaction distance. For *m*<5 bp, size-stable barcodes of four elements are possible and their maximal code diversity grows as *O*(*n*^{4}). These are stable, however, only if the minimal interaction distance between two Lox sites is greater than 80 bp, a distance at which interactions have been shown to still be possible in vivo in the similar Flp/FRT system [39].
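
The element-size regimes can be checked with simple arithmetic. The Python sketch below assumes that the distance between two Lox sites flanking *k* elements consists of the *k* elements of *m* bp each plus the *k*−1 intervening Lox sites of 34 bp (the length of a loxP site). We adopt this convention here because it reproduces the 80 bp separation for *m*=4 bp and the regime 5 bp≤*m*<24 bp for a minimal distance of 82 bp; the function names are hypothetical.

```python
LOX_BP = 34            # length of a loxP site in base pairs
MIN_DISTANCE_BP = 82   # minimal interaction distance used in the text


def lox_distance(k, m, lox_bp=LOX_BP):
    """Base pairs between two Lox sites that flank k >= 1 elements of m bp each.

    Assumption: the spacer is made of the k elements and the k - 1 Lox sites
    that lie between the two interacting sites.
    """
    return k * m + (k - 1) * lox_bp


def min_interacting_elements(m, d_min=MIN_DISTANCE_BP):
    """Smallest number of flanked elements for which two Lox sites can interact."""
    k = 1
    while lox_distance(k, m) < d_min:
        k += 1
    return k


if __name__ == "__main__":
    for m in (4, 5, 23, 24):
        print(m, min_interacting_elements(m))
```

Under these assumptions, sites flanking three elements are 80 bp apart for *m*=4 bp, which is why the four-element codes depend on the exact value of the minimal interaction distance.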

Most interesting is the case *m*=4 bp, which permits correction of one sequencing error with six code elements that are 3 bp apart in both orientations (see gray bars in Fig. 3 a). The upper diversity bound is derived along the same lines as for *m*≥5 bp (see Fig. 4 a for possible stable codes), which gives

To maximize usage of the six code elements, we start with a cassette that has six red, five green and six blue blocks, i.e. {*n*_{r},*n*_{g},*n*_{b}}={6,5,6}. This gives an upper diversity bound of 36,996 barcodes. As confirmed by Monte Carlo simulations, this upper bound is attained by a cassette with inverted flanking sites in which the first 11 Lox sites are alternating, and the remaining sites, except the last, are oriented in the same direction as the first Lox site (Fig. 4 b). Under constitutive Cre expression, barcodes with four elements can still undergo inversions, and the effective code diversity is 19,716.

Careful measurements will be needed to determine whether Lox sites at a distance of 80 bp still interact. If they do not, the cassette shown in Fig. 4 b with *m*=4 bp represents an interesting alternative to the barcode cassette design described in the main text, as it reaches higher code diversity with fewer elements, but at the cost of less robustness to sequencing errors and hence lower barcode readout fidelity.

*Higher order Lox interactions*

In the Cre Lox system, single recombination events always involve exactly two Lox sites. However, nothing except DNA flexibility prevents several pairs of Lox sites from interacting at the same time. The rate at which pairs of Lox sites bind simultaneously depends on the number of Lox sites and the kinetic rates of Lox-Lox complexes. In vitro, these complexes appear surprisingly stable [40], which, together with the potentially large number of Lox sites in the barcode cassettes, makes simultaneous interactions plausible.

Higher order Lox interactions lead to unexpected and in certain cases novel recombination products (Fig. 4 c). For example, simultaneous interactions of two overlapping pairs of Lox sites oriented in the same direction do not result in excision, but in a reordering of the sequences between the sites. Similarly, if pairs are inverted, simultaneous recombinations do not invert but excise the sequence between the outermost sites.

For the alternating cassette and *n*≥7, multiple concurrent Lox interactions do not generate additional codes as the upper code diversity bound is already attained. Therefore, our results on Lox barcode design and code elements remain unchanged in the presence of higher order Lox interactions. What does change is the distribution over barcodes, which flattens in the tail if more than one Lox pair recombines at a time (Fig. 4 d).

*Transient Cre expression*

Code diversity strongly depends on the number of elements in size-stable barcodes. If Cre is expressed constitutively, size-stable barcodes with code elements of size *m*≥5 bp have a maximum of three elements. One possibility for obtaining barcodes with more elements is to make Cre activity transient rather than constitutive.

A well tested system that provides temporal control over Cre activity is tamoxifen inducible CreEr [42]. In the presence of tamoxifen, the fusion protein CreEr, which is normally located in the cytoplasm, is transported into the nucleus, where it can bind to Lox sites and induce recombination. Depending on the duration of Cre activation and its efficiency, stable sequences with more than three elements are likely to be generated from a Lox barcode cassette. Although most of these sequences are stable only in the absence of Cre, in this section we make no distinction between these and the size-stable barcodes defined earlier.

Figure 4 e shows barcode probabilities after activation of CreEr in 10^{6} cells with an optimal Lox cassette of size 13. The number of recombination events induced by transient CreEr activity is assumed to be Poisson distributed with mean one. About 10^{4} distinct barcodes are generated, and 30 % of these appear only once. For comparison, the inset, similar to Fig. 3 e, indicates that a maximum of circa 170 99 %-unique barcodes are generated from a single cassette, by inducing barcodes in about 17,000 cells. For a Poisson distributed number of recombination events with mean one, this is 30 times more than what is feasible with size-stable codes from the same cassette.
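
A transient induction of this kind can be mimicked with a short Monte Carlo sketch in Python. It is not the simulation used for Fig. 4 e: for simplicity it assumes a fully alternating cassette rather than the optimal cassette, uniform choice among Lox pairs that flank at least three elements, and a Poisson number of events sampled with Knuth's multiplication method. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def alternating_cassette(n):
    """Elements 1..n in forward orientation and n + 1 Lox sites of alternating orientation."""
    elements = list(range(1, n + 1))
    sites = [1 if i % 2 == 0 else -1 for i in range(n + 1)]
    return elements, sites


def recombine(elements, sites, a, b):
    """Apply one recombination between Lox sites a < b and return the new cassette."""
    if sites[a] == sites[b]:
        # Same orientation: excise elements a..b-1 together with sites a+1..b.
        return elements[:a] + elements[b:], sites[:a + 1] + sites[b + 1:]
    # Opposite orientation: reverse and flip everything between the two sites.
    flipped_elements = [-e for e in reversed(elements[a:b])]
    flipped_sites = [-s for s in reversed(sites[a + 1:b])]
    return (elements[:a] + flipped_elements + elements[b:],
            sites[:a + 1] + flipped_sites + sites[b:])


def poisson(rng, mean):
    """Poisson sample via Knuth's multiplication method (adequate for small means)."""
    threshold, k, prod = math.exp(-mean), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def transient_barcode(n, rng, mean_events=1.0, min_elements=3):
    """Barcode (tuple of signed elements) after a Poisson number of recombinations."""
    elements, sites = alternating_cassette(n)
    for _ in range(poisson(rng, mean_events)):
        pairs = [(a, b) for a in range(len(sites))
                 for b in range(a + min_elements, len(sites))]
        if not pairs:
            break                       # no productive pair left
        elements, sites = recombine(elements, sites, *rng.choice(pairs))
    return tuple(elements)


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter(transient_barcode(13, rng) for _ in range(10_000))
    print(len(counts), "distinct barcodes;",
          sum(1 for c in counts.values() if c == 1), "seen once")
```

Signed integers encode element identity and orientation, so inverted and non-inverted copies of the same element are counted as different read-outs.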

Although highly promising in terms of code diversity, it should be noted that potential drawbacks of this approach are the length of the barcodes (leading to more involved code sequencing), leakiness of CreEr into the nucleus in non-induced cells [56], the relatively long half-life of tamoxifen [57], and a barcode probability that depends on the efficiency of Cre induction.

*Distance dependent Lox-Lox complex formation*

Cre and co-localization are necessary for two Lox sites to form a complex. Therefore the distance between two sites, in addition to the minimal distance constraint considered so far, is likely to impact on Lox-Lox recombination efficiencies. In this section we analyze how distance dependent Lox-Lox interactions change barcode probabilities relative to uniform interactions.

Modelling DNA as a flexible polymer, the probability of a Lox-Lox complex is predicted to be inversely proportional to the distance in bp between sites [58]. Together with a minimal distance of 82 bp, we use this model to compare the distribution over barcodes with the distance-independent scenario for the 13-element optimal Lox barcoding cassette. As shown in Fig. 4 f, Lox sites that are closer form complexes more often relative to the uniform case (inset), and barcode probabilities are more homogeneous. Thus, in our model, distance dependent Lox-Lox complex formation improves mixing of Lox barcode cassettes before they reach their size-stable configuration.
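
The shift towards short-range interactions can be illustrated with a small Python sketch that weights each Lox pair by the inverse of its separation. We assume, as a simplification, that two sites flanking *k* elements are separated by the *k* elements of *m* bp plus the *k*−1 intervening 34 bp Lox sites. The element size *m*=7 bp in the usage example and the function name are hypothetical choices, not values from the cassette design.

```python
def pair_probabilities(n, m, lox_bp=34, d_min=82, distance_dependent=True):
    """Probability that the next recombination uses a pair flanking k elements.

    Returns {k: probability}, summed over the n + 1 - k pairs that flank k
    elements. Pairs closer than d_min bp are excluded. With
    distance_dependent=True each pair is weighted by 1 / distance, otherwise
    all interacting pairs are equally likely.
    """
    weights = {}
    for k in range(1, n + 1):
        d = k * m + (k - 1) * lox_bp        # assumed distance in bp
        if d < d_min:
            continue                        # too close to form a complex
        per_pair = 1.0 / d if distance_dependent else 1.0
        weights[k] = (n + 1 - k) * per_pair
    total = sum(weights.values())
    if total == 0:
        return {}
    return {k: w / total for k, w in weights.items()}


if __name__ == "__main__":
    uniform = pair_probabilities(13, 7, distance_dependent=False)
    weighted = pair_probabilities(13, 7)
    for k in sorted(uniform):
        print(k, round(uniform[k], 4), round(weighted[k], 4))
```

Compared with the uniform case, the inverse-distance weights move probability mass from pairs that flank many elements to pairs that flank few.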

## Conclusions

Existing cellular barcoding approaches have already led to significant biological discoveries, and so new approaches that overcome their shortcomings are inherently desirable. Here we have established that, using Cre Lox, it would be feasible to create an in situ, triggerable barcoding system with sufficient diversity to label a whole mouse, and we propose this as a system for experimental implementation.

## Abbreviations

bp, base pair; DOX, doxycycline; kb, kilobase

## References

- 1
Buchholz VR, Flossdorf M, Hensel I, Kretschmer L, Weissbrich B, Gräf P, Verschoor A, Schiemann M, Höfer T, Busch DH. Disparate individual fates compose robust CD8+ T cell immunity. Science. 2013; 340(6132):630–5.

- 2
Gerlach C, van Heijst JWJ, Swart E, Sie D, Armstrong N, Kerkhoven RM, Zehn D, Bevan MJ, Schepers K, Schumacher TNM. One naive T cell, multiple fates in CD8+ T cell differentiation. J Exp Med. 2010; 207(6):1235–46. doi:10.1084/jem.20091175.

- 3
Verovskaya E, Broekhuis MJ, Zwart E, Ritsema M, van Os R, de Haan G, Bystrykh LV. Heterogeneity of young and aged murine hematopoietic stem cells revealed by quantitative clonal analysis using cellular barcoding. Blood. 2013; 122(4):523–32.

- 4
Ema H, Morita Y, Suda T. Heterogeneity and hierarchy of hematopoietic stem cells. Exp Hematol. 2014; 42(2):74–82.

- 5
Johnson MB, Wang PP, Atabay KD, Murphy EA, Doan RN, Hecht JL, Walsh CA. Single-cell analysis reveals transcriptional heterogeneity of neural progenitors in human cortex. Nat Neurosci. 2015; 18(5):637–46. doi:10.1038/nn.3980.

- 6
Yagi T. Genetic basis of neuronal individuality in the mammalian brain. J Neurogenet. 2013; 27(3):97–105. doi:10.3109/01677063.2013.801969.

- 7
Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015; 347(6226):1138–42.

- 8
Nolan-Stevaux O, Tedesco D, Ragan S, Makhanov M, Chenchik A, Ruefli-Brasse A, Quon K, Kassner PD. Measurement of cancer cell growth heterogeneity through lentiviral barcoding identifies clonal dominance as a characteristic of in vivo tumor engraftment. PLoS ONE. 2013; 8(6):67316.

- 9
Bhang H-EC, Ruddy DA, Krishnamurthy Radhakrishna V, Caushi JX, Zhao R, Hims MM, Singh AP, Kao I, Rakiec D, Shaw P, Balak M, Raza A, Ackley E, Keen N, Schlabach MR, Palmer M, Leary RJ, Chiang DY, Sellers WR, Michor F, Cooke VG, Korn JM, Stegmeier F. Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat Med. 2015; 21(5):440–8. doi:10.1038/nm.3841.

- 10
Klauke K, Broekhuis MJC, Weersing E, Dethmers-Ausema A, Ritsema M, González MV, Zwart E, Bystrykh LV, de Haan G. Tracing dynamics and clonal heterogeneity of Cbx7-induced leukemic stem cells by cellular barcoding. Stem Cell Rep. 2015; 4(1):74–89. doi:10.1016/j.stemcr.2014.10.012.

- 11
Rohr JC, Gerlach C, Kok L, Schumacher TN. Single cell behavior in T cell differentiation. Trends Immunol. 2014; 35(4):170–7.

- 12
Reiner SL, Adams WC. Lymphocyte fate specification as a deterministic but highly plastic process. Nat Rev Immunol. 2014; 14(10):699–704.

- 13
Duffy KR, Hodgkin PD. Intracellular competition for fates in the immune system. Trends Cell Biol. 2012; 22(9):457–64. doi:10.1016/j.tcb.2012.05.004.

- 14
Smith JA, Martin L. Do cells cycle? Proc Natl Acad Sci U S A. 1973; 70(4):1263–7.

- 15
Hawkins ED, Markham JF, McGuinness LP, Hodgkin PD. A single-cell pedigree analysis of alternative stochastic lymphocyte fates. Proc Natl Acad Sci U S A. 2009; 106(32):13457–62.

- 16
Markham JF, Wellard CJ, Hawkins ED, Duffy KR, Hodgkin PD. A minimum of two distinct heritable factors are required to explain correlation structures in proliferating lymphocytes. J R Soc Interface. 2010; 7(48):1049–59.

- 17
Rieger MA, Hoppe PS, Smejkal BM, Eitelhuber AC, Schroeder T. Hematopoietic cytokines can instruct lineage choice. Science. 2009; 325(5937):217–8.

- 18
Gomes FL, Zhang G, Carbonell F, Correa JA, Harris WA, Simons BD, Cayouette M. Reconstruction of rat retinal progenitor cell lineages in vitro reveals a surprising degree of stochasticity in cell fate decisions. Development. 2011; 138(2):227–35.

- 19
Giurumescu CA, Kang S, Planchon TA, Betzig E, Bloomekatz J, Yelon D, Cosman P, Chisholm AD. Quantitative semi-automated analysis of morphogenesis with single-cell resolution in complex embryos. Development. 2012; 139(22):4271–9.

- 20
Richards JL, Zacharias AL, Walton T, Burdick JT, Murray JI. A quantitative model of normal Caenorhabditis elegans embryogenesis and its disruption after stress. Dev Biol. 2013; 374(1):12–23.

- 21
Duffy KR, Wellard CJ, Markham JF, Zhou JHS, Holmberg R, Hawkins ED, Hasbold J, Dowling MR, Hodgkin PD. Activation-induced B cell fates are selected by intracellular stochastic competition. Science. 2012; 335(6066):338–41.

- 22
Etzrodt M, Endele M, Schroeder T. Quantitative single-cell approaches to stem cell research. Cell Stem Cell. 2014; 15(5):546–58.

- 23
Cohen AR. Extracting meaning from biological imaging data. Mol Biol Cell. 2014; 25(22):3470–3.

- 24
Gerrits A, Dykstra B, Kalmykowa OJ, Klauke K, Verovskaya E, Broekhuis MJC, de Haan G, Bystrykh LV. Cellular barcoding tool for clonal analysis in the hematopoietic system. Blood. 2010; 115(13):2610–8. doi:10.1182/blood-2009-06-229757.

- 25
Lu R, Neff NF, Quake SR, Weissman IL. Tracking single hematopoietic stem cells in vivo using high-throughput sequencing in conjunction with viral genetic barcoding. Nat Biotech. 2011; 29(10):928–33. doi:10.1038/nbt.1977.

- 26
Naik SH, Schumacher TN, Perié L. Cellular barcoding: a technical appraisal. Exp Hematol. 2014; 42(8):598–608. doi:10.1016/j.exphem.2014.05.003.

- 27
Schepers K, Swart E, van Heijst JW, Gerlach C, Castrucci M, Sie D, Heimerikx M, Velds A, Kerkhoven RM, Arens R, Schumacher TN. Dissecting T cell lineage relationships by cellular barcoding. J Exp Med. 2008; 205(10):2309–18. doi:10.1084/jem.20072462.

- 28
Capel B, Hawley R, Covarrubias L, Hawley T, Mintz B. Clonal contributions of small numbers of retrovirally marked hematopoietic stem cells engrafted in unirradiated neonatal W/Wv mice. Proc Natl Acad Sci U S A. 1989; 86(12):4564–8.

- 29
Naik SH, Perié L, Swart E, Gerlach C, van Rooij N, de Boer RJ, Schumacher TN. Diverse and heritable lineage imprinting of early haematopoietic progenitors. Nature. 2013; 496(7444):229–32. doi:10.1038/nature12013.

- 30
Perié L, Hodgkin PD, Naik SH, Schumacher TN, de Boer RJ, Duffy KR. Determining lineage pathways from cellular barcoding experiments. Cell Rep. 2014; 6(4):617–24.

- 31
Perié L, Duffy KR, Kok L, de Boer RJ, Schumacher TN. The branching point in erythro-myeloid differentiation. Cell. 2015; 163(7):1655–62.

- 32
Sun J, Ramos A, Chapman B, Johnnidis JB, Le L, Ho YJ, Klein A, Hofmann O, Camargo FD. Clonal dynamics of native haematopoiesis. Nature. 2014; 514(7522):322–7. doi:10.1038/nature13824.

- 33
Zador AM, Dubnau J, Oyibo HK, Zhan H, Cao G, Peikon ID. Sequencing the Connectome. PLoS Biol. 2012; 10(10):1001411. doi:10.1371/journal.pbio.1001411.

- 34
Wei Y, Koulakov AA. An exactly solvable model of random site-specific recombinations. Bull Math Biol. 2012; 74(12):2897–916.

- 35
Peikon ID, Gizatullina DI, Zador AM. In vivo generation of DNA sequence diversity for cellular barcoding. Nucleic Acids Res. 2014; 42(16):127. doi:10.1093/nar/gku604.

- 36
Livet J, Weissman TA, Kang H, Draft RW, Lu J, Bennis RA, Sanes JR, Lichtman JW. Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature. 2007; 450(7166):56–62.

- 37
Cai D, Cohen KB, Luo T, Lichtman JW, Sanes JR. Improved tools for the Brainbow toolbox. Nat Methods. 2013; 10(6):540–7. doi:10.1038/nmeth.2450.

- 38
Hoess R, Wierzbicki A, Abremski K. Formation of small circular DNA molecules via an in vitro site-specific recombination system. Gene. 1985; 40(2-3):325–9.

- 39
Ringrose L, Chabanis S, Angrand PO, Woodroofe C, Stewart AF. Quantitative comparison of DNA looping in vitro and in vivo: chromatin increases effective DNA flexibility at short distances. EMBO J. 1999; 18(23):6630–41.

- 40
Pinkney JN, Zawadzki P, Mazuryk J, Arciszewska LK, Sherratt DJ, Kapanidis AN. Capturing reaction paths and intermediates in Cre-loxP recombination using single-molecule fluorescence. Proc Natl Acad Sci U S A. 2012; 109(51):20871–6.

- 41
Parrish M, Unruh J, Krumlauf R. BAC modification through serial or simultaneous use of CRE/Lox technology. J Biomed Biotechnol. 2011; 2011:1–12. doi:10.1155/2011/924068.

- 42
Nagy A. Cre recombinase: the universal reagent for genome tailoring. Genesis. 2000; 26:99–109.

- 43
Blattman JN, Antia R, Sourdive DJD, Wang X, Kaech SM, Murali-Krishna K, Altman JD, Ahmed R. Estimating the precursor frequency of naive antigen-specific CD8 T cells. J Exp Med. 2002; 195(5):657–64.

- 44
Sternberg N, Hamilton D, Hoess R. Bacteriophage P1 site-specific recombination. II. Recombination between loxP and the bacterial chromosome. J Mol Biol. 1981; 150(4):487–507.

- 45
Hamilton DL, Abremski K. Site-specific recombination by the bacteriophage P1 lox-Cre system. Cre-mediated synapsis of two lox sites. J Mol Biol. 1984; 178(2):481–6.

- 46
Guo F, Gopaul DN, Van Duyne GD. Structure of Cre recombinase complexed with DNA in a site-specific recombination synapse. Nature. 1997; 389(6646):40–6. doi:10.1038/37925.

- 47
Oberdoerffer P, Otipoby KL, Maruyama M, Rajewsky K. Unidirectional Cre-mediated genetic inversion in mice using the mutant loxP pair lox66/lox71. Nucleic Acids Res. 2003; 31(22):e140.

- 48
Colvin GA, Lambert JF, Abedi M, Hsieh CC, Carlson JE, Stewart FM, Quesenberry PJ. Murine marrow cellularity and the concept of stem cell competition: geographic and quantitative determinants in stem cell biology. Leukemia. 2004; 18(3):575–83. doi:10.1038/sj.leu.2403268.

- 49
Bystrykh LV, Verovskaya E, Zwart E, Broekhuis M, de Haan G. Counting stem cells: methodological constraints. Nat Meth. 2012; 9(6):567–74. doi:10.1038/nmeth.2043.

- 50
Cover TM, Thomas JA. Elements of information theory. New York: Wiley-Interscience; 1991.

- 51
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008; 18(11):1851–8. doi:10.1101/gr.078212.108.

- 52
Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB. Characterizing and measuring bias in sequence data. Genome Biol. 2013; 14(5):51. doi:10.1186/gb-2013-14-5-r51.

- 53
Grosselin J, Sii-Felice K, Payen E, Chretien S, Roux DT, Leboulch P. Arrayed lentiviral barcoding for quantification analysis of hematopoietic dynamics. Stem Cells. 2013; 31(10):2162–71. doi:10.1002/stem.1383.

- 54
Koot MR, Mandjes M. The analysis of singletons in generalized birthday problems. Probab Eng Inform Sc. 2012; 26(2):245–62. doi:10.1017/s0269964811000350.

- 55
Grinstead CM, Snell JL. Introduction to Probability, 2 revised edn. Providence, Rhode Island, U.S.A: American Mathematical Society; 1997.

- 56
Kretzschmar K, Watt FM. Lineage Tracing. Cell. 2012; 148(1-2):33–45. doi:10.1016/j.cell.2012.01.002.

- 57
Reinert RB, Kantz J, Misfeldt AA, Poffenberger G, Gannon M, Brissova M, Powers AC. Tamoxifen-induced Cre-loxP recombination is prolonged in pancreatic islets of adult mice. PLoS ONE. 2012; 7(3):33529.

- 58
Saiz L, Rubi JM, Vilar JMG. Inferring the in vivo looping properties of DNA. Proc Natl Acad Sci U S A. 2005; 102(49):17642–5. doi:10.1073/pnas.0505693102.

## Acknowledgements

The authors thank Ton Schumacher (Netherlands Cancer Institute) for informative discussions.

### Funding

The work of T.W., S.N. and K.D. was supported by Human Frontier Science Program grant RGP0060/2012. K.D. was also supported by Science Foundation Ireland grant 12 IP 1263. D.M. was supported by a National Health and Medical Research Council Early Career Fellowship grant 1052195.

### Availability of supporting data

The C++ code supporting the results is provided in Additional file 1.

### Authors’ contributions

TW, with input from SN and KD, conceived the study. TW, MD and KD performed the mathematical analysis. TW, MD, DM, SG, SN and KD interpreted the data. TW, MD, DM, SG, SN, and KD wrote the paper. All authors read and approved the final manuscript.

### Authors’ information

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Consent for publication

Not applicable.

### Ethics approval and consent to participate

Not applicable.

## Additional file

### Additional file 1

The C++ code provided in ‘Additional file 1’ computes barcode probabilities (Fig. 3 c) and average number of inversions and excisions (Fig. 3 d) for a Lox barcoding cassette with m Lox sites. With modifications (specified at the end of the file) it also computes the data for Figs. 3 e and 4 e (inset) and distributions shown in Fig. 4 d-f. CPP 12.1 kb

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Keywords

- Cell fate tracking
- Cellular barcoding
- Cre lox system
- DNA stochastic programme
- Combinatorial explosion