LPTMS, CNRS and Univ. Paris-Sud, UMR8626, Bat. 100, 91405 Orsay, France

Dipartimento di Fisica "G. Galilei", Università di Padova, via Marzolo 8, I-35131 Padova, Italy

IGM, CNRS and Univ. Paris-Sud, UMR8621, Bat. 400, F-91405 Orsay cedex, France

Unité de Génétique Mycobactérienne, Institut Pasteur, Paris, France

Abstract

Background

Classification and naming are key steps in the analysis, understanding and adequate management of living organisms. However, where to set limits between groups can be puzzling, especially in clonal organisms. Within the

Results

To adequately infer the relative divergence time between strains, we used a distance method inspired by the recent evolutionary model by Reyes

Conclusion

Altogether, this study shows how the new clustering algorithm Affinity Propagation can help build or refine classifications of clonal organisms. It also describes well-supported families and subfamilies among

Background

The advent of powerful genotyping methods, either by global sequencing or by high-throughput analysis of variation at specific loci (mini- or micro-satellites

Clustering methods can be applied to different types of loci, ranging from repetitive sequences such as insertion sequences, micro-, mini-satellites or the CRISPR loci to single nucleotide polymorphisms (SNPs), provided an appropriate method is available to calculate the distance between individuals. Such methods usually rely on a model of the mutation process. Which loci should be targeted depends on the mean divergence time between individuals, as repetitive sequences mutate faster than SNP loci. Several mutation models have been developed for DNA sequence with point mutations

CRISPR loci (Clustered Regularly Interspaced Short Palindromic Repeats) form a new family of repetitive sequences

The worldwide database of spoligotyping in

Here we wanted to take advantage of a recently developed algorithm, Affinity Propagation, to confirm and extend these methods. This algorithm identifies a reference for every data point so that the data are grouped and centered on these references while a specific cost function is minimized. The cost of adding a new reference point, which is set by the user, determines the final number of clusters. Before using this algorithm, we tested different methods to calculate pairwise distances between spoligotype patterns. We took advantage of previously identified references and expert assignations to rank these distance methods, some of which are derived from previously proposed evolutionary models

Altogether, this approach allowed us to assess the robustness of previously identified sublineages among MTC, to identify new relevant sublineages and to provide re-assignations of the spoligotype patterns described in SpolDB4. These re-assignations interestingly matched those of studies using VNTR and/or SNP data.

Results

Comparison of classifications based on new distances or on Jaccard index to expert classification of SpolDB4

Clustering of CRISPR patterns (spoligotypes) of _{correct} = 1402), and the "Domain Walls" method (84.0%, n_{correct} = 1399). These methods also provided the smallest number of assignations that differed from those of the experts ("Deletions": 11.0%, n_{false} = 183; "Domain Walls": 12.3%, n_{false} = 204). Of the four methods we tested, these two thus appear to fit the expert classification best.
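The percentages quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the denominator is the number of patterns carrying an expert family label, taken here as the 1937 SITs minus the 272 "U" patterns reported elsewhere in this article; this denominator is not stated in the passage and is only an inference that happens to reproduce the quoted values. The "Deletions" correct rate is computed from the counts rather than quoted.

```python
# Assumption: percentages are relative to patterns with an expert family label,
# i.e. all SITs (1937) minus the "U" patterns (272). Not stated in the passage.
N_TOTAL = 1937
N_UNKNOWN = 272
n_labelled = N_TOTAL - N_UNKNOWN  # 1665

# Counts quoted in the text for the two best methods.
counts = {
    ("Deletions", "correct"): 1402,
    ("Deletions", "false"): 183,
    ("Domain Walls", "correct"): 1399,
    ("Domain Walls", "false"): 204,
}


def percent(count, total=n_labelled):
    """Share of labelled patterns, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


rates = {key: percent(value) for key, value in counts.items()}

for (method, outcome), rate in rates.items():
    print(f"{method:12s} {outcome:8s} {rate:5.1f}%")
```

Under this assumption the rounded rates for "Domain Walls" (84.0% and 12.3%) and for the "Deletions" mismatches (11.0%) agree with the figures in the text.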

References of the ten best acknowledged

| SIT | SpolDB4 family | SpolDB4 subfamily | Reference spoligotype pattern |
|---|---|---|---|
| 1 | BEIJ | BEIJ | |
| 26 | CAS | CAS1 | |
| 42 | LAM | LAM9 | |
| 50 | H | H3 | |
| 53 | T | T1 | |
| 100 | MANU | MANU1 | |
| 119 | X | X1 | |
| 236 | EAI | EAI5 | |
| 181 | AFRI | AFRI1 | |
| 482 | animal | BOV1 | |

BEIJ = Beijing also referred to as East Asia; CAS = Central Asia also referred to as East Africa and India; LAM = Latino-American and Mediterranean; H = Haarlem; EAI = East African Indian.

**Distance methods**. **A: **classically implemented Jaccard index. **B-D**: newly proposed distance methods. w = Domain Walls also referred to as walls. Numbers below the spoligotype patterns count the number of their common features: either the number of common spacers (A), common walls (B), common blocks (C), or common deletions (D). These numbers are summed and divided by the total number of features in the two spoligotype patterns to obtain the similarity between the two spoligotypes.
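As an illustration of the four feature types named in this caption, here is a minimal Python sketch. Several points are assumptions made for the example rather than the authors' definitions: patterns are strings of "1" (spacer present) and "0" (spacer absent); a block or a deletion is a maximal run identified by its first and last position; a wall is a boundary between a present and an absent spacer, tagged with its direction; and the similarity is read from the caption as 2K / (i + j), where K is the number of shared features and i, j the feature counts of the two patterns. The toy patterns have 8 positions instead of 43.

```python
def runs(pattern, symbol):
    """Maximal runs of `symbol`, as (first, last) 1-based spacer positions."""
    found, start = [], None
    for pos, char in enumerate(pattern, start=1):
        if char == symbol and start is None:
            start = pos
        elif char != symbol and start is not None:
            found.append((start, pos - 1))
            start = None
    if start is not None:
        found.append((start, len(pattern)))
    return set(found)


def features(pattern, method):
    """Feature set of a pattern under one of the four methods (assumed definitions)."""
    if method == "spacers":    # panel A: spacers taken independently
        return {pos for pos, char in enumerate(pattern, 1) if char == "1"}
    if method == "walls":      # panel B: present/absent boundaries, with direction
        pairs = zip(pattern, pattern[1:])
        return {(pos, left + right)
                for pos, (left, right) in enumerate(pairs, 1) if left != right}
    if method == "blocks":     # panel C: runs of present spacers
        return runs(pattern, "1")
    if method == "deletions":  # panel D: runs of absent spacers
        return runs(pattern, "0")
    raise ValueError(f"unknown method: {method}")


def similarity(p, q, method):
    """Shared features counted in both patterns, divided by the total feature count."""
    fp, fq = features(p, method), features(q, method)
    total = len(fp) + len(fq)
    if total == 0:
        return 1.0  # neither pattern has any feature of this type
    return 2 * len(fp & fq) / total


def distance(p, q, method):
    return 1.0 - similarity(p, q, method)


a, b = "11100111", "11100101"  # hypothetical 8-spacer toy patterns
for method in ("spacers", "walls", "blocks", "deletions"):
    print(method, round(similarity(a, b, method), 3))
```

In this toy case the two patterns share the deletion of positions 4-5, and the second pattern carries one additional single-spacer deletion, so the "deletions" similarity is 2/3 while the spacer-based value is 10/11.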

**Assignations matches between SpolDB4 and the different distance methods on whole SpolDB4 database (n = 1937 SIT)**. References are those described in Table 1. Each pattern was assigned to the reference with the lowest distance. Patterns whose most similar reference was the same as the one indicated by their SpolDB4 assignation were scored as "Correct". Note that "Domain Walls" and "Deletions" have equally high proportions of assignations agreeing with the expert classification. When the method identified two equally similar references for a pattern, this pattern was scored as Unassigned and described as an Ambiguous assignation. Ambiguity was lowest with the "Domain Walls" method.
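The assignment and scoring rule described in this caption can be sketched as follows. The function names, the "False" label for mismatches, and the Hamming distance used in the demonstration are illustrative assumptions; any of the distance methods discussed in this article could be passed in instead, and the reference patterns are hypothetical.

```python
def assign(pattern, references, distance):
    """Family of the closest reference, or "Ambiguous" if the minimum is tied."""
    scored = sorted((distance(pattern, ref), family)
                    for family, ref in references.items())
    best = scored[0][0]
    tied = [family for dist, family in scored if dist == best]
    return tied[0] if len(tied) == 1 else "Ambiguous"


def score(assigned, expert):
    """Compare an automatic assignation with the expert (SpolDB4) one."""
    if assigned == "Ambiguous":
        return "Unassigned"
    return "Correct" if assigned == expert else "False"


def hamming(p, q):
    """Stand-in distance for the demo: number of differing positions."""
    return sum(x != y for x, y in zip(p, q))


# Hypothetical 8-spacer references, for illustration only.
references = {"A": "11110000", "B": "00001111"}

print(score(assign("11100000", references, hamming), "A"))  # closest to A
print(score(assign("11111111", references, hamming), "A"))  # tie between A and B
```

Treating exact ties as a separate outcome keeps ambiguous patterns out of both the "Correct" and the mismatch counts, which is how the caption reports them.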

"Deletions" method succeeds in correcting false SpolDB4 assignations

Some family assignations provided by SpolDB4 have been debated. For instance, patterns classified as LAM7-TUR ^{1212} mutation that defines the LAM family _{LAM7-TUR} = 8) to the T family as did methods using VNTRs _{H4} = 34) in SpolDB4 were recently excluded from the Haarlem family based on them not carrying the mgtC^{545} mutation

Assignations of LAM7, H4 and selected "U" spoligotype patterns from SpolDB4, according to different methods.

**spoligotype pattern**

**SpolDB4 **

**Recent literature **

**Deletions **

**Domain Walls **

**SpotClust subfamily**

**SIT**

**family **

**Sub-family**

**assignation**

**family**

**family**

**SpolDB3-based **

**RIM**

41

LAM

LAM7

T-TUR

**T**

186

LAM

LAM7

T-TUR

**T**

367

LAM

LAM7

T-TUR

**T**

930

LAM

LAM7

T-TUR

**T**

1261

LAM

LAM7

T-TUR

**T**

1589

LAM

LAM7

T-TUR

**T**

1924

LAM

LAM7

T-TUR

**T**

1937

LAM

LAM7

T-TUR

**T**

35

H

H4

T-Ural

**T**

262

H

H4

T-Ural

**T**

361

H

H4

T-Ural

**T**

399

H

H4

T-Ural

**T**

**T2**

596

H

H4

T-Ural

**T**

597

H

H4

T-Ural

**T**

656

H

H4

T-Ural

**T**

762

H

H4

T-Ural

**T**

777

H

H4

T-Ural

**T**

817

H

H4

T-Ural

**T**

920

H

H4

T-Ural

**T**

921

H

H4

T-Ural

**T**

922

H

H4

T-Ural

**T**

1117

H

H4

T-Ural

**T**

1134

H

H4

T-Ural

**T**

1165

H

H4

T-Ural

**T**

1174

H

H4

T-Ural

**T**

1242

H

H4

T-Ural

**T**

1269

H

H4

T-Ural

**T**

1276

H

H4

T-Ural

**T**

1281

H

H4

T-Ural

**T**

1292

H

H4

T-Ural

**T**

1447

H

H4

T-Ural

**T**

1448

H

H4

T-Ural

**T**

1457

H

H4

T-Ural

**T**

1568

H

H4

T-Ural

**T**

1581

H

H4

T-Ural

**T**

1384

H

H4

T-Ural

**T**

U

**T3**

**N40**

1446

H

H4

T-Ural

**T**

U

1452

H

H4

T-Ural

**T**

U

1455

H

H4

T-Ural

**T**

U

U

1456

H

H4

T-Ural

**T**

U

1461

H

H4

T-Ural

**T**

U

1480

H

H4

T-Ural

**T**

U

105

U

U

H

U

1274

U

U

LAM

U

1531

U

U

X

**X**

**X**

**X1**

**N44**

"Recent literature assignation" represents the standard, and refers to studies using loci other than the CRISPR locus: T-TUR classification has been suggested both by Millet et al.

**Plot of similarity to their reference for patterns assigned as the expert classification (Black) and differently than the expert classification (Gray)**.

Interestingly, Beijing, X and EAI families exhibited no incongruence between the "Deletions" and the expert method (no light gray box), suggesting that these families are clearly and appropriately defined. As reported above (Figure

**Plot of similarity to their reference for patterns assigned as the expert classification (Green), patterns not assigned due to ambiguity (Gray) and patterns assigned differently than the expert classification (Red)**.

Assignations of U spoligotype patterns

Assignations thus seem phylogenetically relevant when using the "Deletions" method and the references of the well-acknowledged families. We thus propose an alternative classification of the 1939 spoligotype patterns reported in SpolDB4 (Additional File

**SpolDB4 new assignations, using the previously identified references or the newly identified ones**.

**Assignations of "U" patterns managed by the different methods**. Percentage was calculated based on the 272 "U" patterns found in SpolDB4.

Automatized identification of references by Affinity Propagation clustering

The "Deletions" method is very useful for classifying spoligotype patterns into the described families, but this classification depends strongly on the identification of references. These references are widely acknowledged for major families, but the relevance of finer classification is recurrently debated

numerous, and the mean similarity with the representative increases. Interestingly, when the number of clusters does not vary even if the penalty changes, this indicates that the data points are not evenly distributed,

**Number of clusters found by Affinity Propagation as a function of the penalty p for the distance between a data point and its reference**. Note that two plateaus can be detected, at 14 and 32 clusters respectively, indicating that the corresponding clustering is robust, and therefore might be relevant.
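Plateaus of this kind can be located automatically once the number of clusters has been recorded for a series of evenly spaced penalty values. The helper below is a minimal sketch of that idea; the function name, the `min_length` threshold and the input series are hypothetical and are not taken from the study's data.

```python
from itertools import groupby


def plateaus(cluster_counts, min_length=3):
    """Runs of identical cluster counts spanning at least `min_length` penalty values.

    `cluster_counts` is the number of clusters obtained for successive,
    evenly spaced penalties. Returns a list of (n_clusters, run_length).
    """
    result = []
    for count, group in groupby(cluster_counts):
        length = len(list(group))
        if length >= min_length:
            result.append((count, length))
    return result


# Hypothetical scan, made up for illustration (not the values behind the figure).
scan = [40, 36, 32, 32, 32, 32, 25, 19, 14, 14, 14, 14, 14, 9, 5]
print(plateaus(scan))
```

A longer run means the partition survives a wider range of penalties, which is the robustness argument made in the caption.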

**Mean similarity of patterns with their representative as a function of the cluster size, and for different clustering methods (AP: Affinity Propagation; Bio: Bionumerics; KM: K-Means)**.

References after Affinity Propagation clustering for n_{clusters} = 12.

| AP-family | Reference: SpolDB4 subfamily | Reference: SIT | Reference: spoligotype pattern | Majority SpolDB4 family | Proportion in the AP-family | Total Nb |
|---|---|---|---|---|---|---|
| **animal1(Bov1-3-Cap-Mic-Pin)** | bovis_1 | **482** | | Animal | 0.888 | 206 |
| animal2(Bov2) | bovis_2 | 683 | | Animal | 0.621 | 66 |
| Beij-afri | BEIJ | 255 | | Afri | 0.339 | 56 |
| **CAS** | CAS_1 | **26** | | CAS | 0.760 | 96 |
| **EAI** | EAI_5 | **236** | | EAI | 0.84 | 250 |
| H1-2 | H_1 | 47 | | H | 0.853 | 68 |
| **H**3 | H_3 | **50** | | H | 0.874 | 111 |
| **LAM**_{(9-3-11-6-4)} | LAM_9 | **42** | | LAM | 0.721 | 179 |
| T2 | T_2 | 52 | | T | 0.545 | 145 |
| T3-LAM_{(2-5)} | LAM_2 | 17 | | LAM | 0.432 | 148 |
| **T**-(Ural-H3-LAM_{(10-7)}) | T_1 | **53** | | T | 0.823 | 351 |
| S(&U) | S | 34 | | T | 0.554 | 74 |
| T(&U) | T_1 | 173 | | T | 0.420 | 81 |
| **X** | X_1 | **119** | | X | 0.75 | 108 |

BEIJ = Beijing (also East Asia); afri=

References after Affinity Propagation clustering for n_{clusters} = 32.

| AP-subfamily naming | Reference: classical subfamily naming | Reference: SIT | Reference: spoligotype (43 format) | First most represented subfamily | Prop. | Second most repr. family | Nb of spoligotype patterns |
|---|---|---|---|---|---|---|---|
| **Afri1** | AFRI1 | 181 | | AFRI1 | 0.531 | AFRI | 32 |
| **Afri2-3** | AFRI2 | 331 | | AFRI2 | 0.364 | AFRI3 | 22 |
| **Beij** | BEIJ | 1 | | BEIJ | 0.842 | U | 19 |
| **Bov1-3** | BOV1 | 482 | | BOV1 | 0.585 | BOV | 159 |
| **Bov2** | BOV2 | 683 | | BOV2 | 0.467 | BOV | 45 |
| **Cap** | CAP | 647 | | CAP | 0.75 | U | 20 |
| **CAS** | CAS1 | 26 | | CAS1 | 0.487 | CAS | 80 |
| **EAI1** | EAI1 | 48 | | EAI1 | 0.804 | U | 46 |
| **EAI3-5 (del2-3-37-38-39)** | EAI2 | 11 | | EAI5 | 0.383 | EAI3 | 55 |
| **EAI2 (del3-20-21)** | EAI2 | 19 | | EAI2 | 0.5 | U | 48 |
| **EAI** | EAI5 | 236 | | EAI5 | 0.651 | EAI4 | 86 |
| **EAI6 (del23-37)** | EAI6 | 292 | | EAI6 | 0.5 | EAI5 | 42 |
| **H1-2** | H1 | 47 | | H1 | 0.790 | U | 62 |
| **H3** | H3 | 50 | | H3 | 0.927 | U | 96 |
| **Ural** | H4 | 262 | | H4 | 0.714 | U | 28 |
| **LAM5-2-1(del3-13)** | LAM2 | 17 | | LAM5 | 0.207 | U | 92 |
| **LAM3** | LAM3 | 33 | | LAM3 | 0.455 | U | 33 |
| **LAM** | LAM9 | 42 | | LAM9 | 0.574 | LAM11 | 136 |
| **Manu** | MANU2 | 54 | | MANU2 | 0.793 | U | 29 |
| **Pin-Mic** | PIN | 637 | | BOV | 0.391 | U | 23 |
| **S** | S | 34 | | S | 0.678 | U | 59 |
| T (T1-H3-Lam10-Cam) | T1 | 53 | | T1 | 0.828 | H3 | 261 |
| T1a (del5-40-43) | T1 | 833 | | T1 | 0.484 | U | 31 |
| **T1b (del21)** | T1 | 291 | | U | 0.367 | T1 | 30 |
| **T1c (del15)** | T2 | 118 | | T1 | 0.432 | U | 37 |
| **T2 (del40)** | T2 | 52 | | T2 | 0.521 | U | 119 |
| **T3 (del13)** | T3 | 37 | | T3 | 0.373 | U | 59 |
| T4 (del19-23-24-38-39) | T4 | 39 | | T1 | 0.406 | T4 | 32 |
| **T5 (del23)** | T5 | 44 | | T5 | 0.561 | U | 41 |
| **X** | X1 | 119 | | X1 | 0.492 | U | 61 |
| **X2** | X2 | 137 | | X2 | 0.824 | T1 | 34 |
| **SEA1 (del29-34)** | U | 458 | | U | 0.955 | CAS | 22 |

BEIJ = Beijing (also East Asia); afri=

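Spoligotype patterns clustered with SIT 458 with Affinity Propagation when n_{clusters} = 32.

| SIT | Spoligotype pattern | SpolDB4 assignation | Main country |
|---|---|---|---|
| 458* | | U | THA |
| 354 | | U | GBR |
| 526 | | U | GNB |
| 527 | | U | GNB |
| 863 | | U | BRA |
| 1172 | | U | EST |
| 1186 | | U | THA |
| 1187 | | U | THA |
| 1374 | | U | MYS |
| 1386 | | U | BGD |
| 1436 | | U | BGD |
| 1462 | | U | GEO |
| 1515 | | U | MDG |
| 1518 | | U | MDG |
| 1519 | | U | MDG |
| 1520 | | U | MDG |
| 1521 | | U | MDG |
| 1524 | | U | MDG |
| 710 | | U | NLD |
| 405 | | U | VNM |
| 426 | | CAS_2 | USA |
| 523 | | U | MYS |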

Note that most of the patterns carry the deletion of spacers 29-34, and that most of them are unclassified by SpolDB4. "Main country" refers to the country where the highest number of isolates carrying this pattern was found according to SpolDB4

Discussion

Here we first validated a simple distance method that can be used to classify CRISPR genetic profiles based on a worldwide

Clustering power of CRISPR patterns

Here we used an automated approach for clustering CRISPR patterns. Our clusters largely reproduced the well-acknowledged MTC families and provided meaningful clustering for Ural, TUR and Cameroon. In fact, the misclassification of Ural within the Haarlem family was due to the merging of all signatures having spacer 31 deleted and spacer 32 present, regardless of the left border of the deletion. This classification criterion is not relevant given the evolutionary dynamics of CRISPR loci, due either to the insertion of IS

Still, the fact that some families are "ill-defined" is an intrinsic problem of spoligotyping: CRISPR loci in

We thus argue that CRISPR profiles evolving by the insertion of transposable elements or by deletion such as those of

Distance methods for CRISPR profiles

If CRISPR loci can be used to infer phylogenetic relationships, the evolutionary model or distance method used for the inference is also of great importance. Several developments have been proposed so far. Here we discuss what our approach adds to previous ones.

CRISPR profiles (spoligotype patterns) form a sequence of binary data. As such, they have been analyzed with tools developed for binary information, such as the Jaccard index, which focuses on the sharing of every unit in the profile (here the spacers) taken independently. This however ignores an essential feature of the corresponding CRISPR locus: it evolves by the loss of spacers. These losses can occur either through the insertion of a transposable element that disrupts the sequence used in the spoligotyping technique, or through deletion. Deletions can remove several spacers at once, even if the frequency of large deletions may be lower

Performance of the Affinity Propagation algorithm on CRISPR profile clustering

Affinity Propagation is a message-passing algorithm that treats clustering as the problem of minimizing an "energy" function of the cluster configuration in the data set (see Methods section for a general review of the algorithm, and ^{2} if N is the size of the dataset) and thus the possibility to analyze very large networks is encouraging the use of this algorithm. With this method we identified both families and subfamilies in MTC. A single family out of 14 made no sense (Beijing-africanum). This is likely due to a lack of information in the Beijing spoligotype pattern, as the large 1-36 deletion limits the recognition of other signatures. When considering patterns carrying a larger number of spacers, the classification was largely congruent with the literature. In addition, we could identify new signatures, especially one, termed SEA1, among previously unclassified patterns. We therefore believe that this algorithm is very useful for classifying the widely used 43-spoligotype patterns in

"Euro-american" lineage evolution

Despite large sequencing efforts

Combining the families and subfamilies identification, we could provide a simplified evolutionary scheme for this lineage (Figure

**Evolutionary scheme of the Euro-American supported sublineages**. Note that our study does not identify the monophyly of H1-H2 and H3. Monophyly of T sublineages is not supported by this method either. LAM monophyly (once LAM7 and LAM10 were extracted) is in contrast well supported.

Conclusion

This study describes 1) a novel distance method to be applied on genetic loci evolving by deletion, as for instance do inactive CRISPR loci, 2) a framework to take advantage of identified references for classifying individuals using such loci, 3) a way to identify new references using the Affinity Propagation algorithm

Methods

Spoligotyping data

SpolDB4_{tot }= 231), Beijing (n = 21), CAS (n = 86), EAI (n = 213), H (n = 233), LAM (n = 224), MANU (n = 39), T (n = 482) including S and H37Rv ST as suggested by Brudey et al (2006), X (n = 90). We excluded SIT69 which was suppressed by Institut Pasteur

Methods to compute distances

Three new methods to compute distances were designed that fit CRISPR loci evolutionary dynamics such as that of _{w }and j_{w }Domain Walls respectively and K_{w }are common, the distance is:

The "Blocks" method considers blocks of spacers; let i_{b }and j_{b }be the numbers of blocks carried by the two profiles and K_{b }the number of blocks they share,

The "Deletions" method considers deleted blocks; let i_{d }and j_{d }be the numbers of blocks of deletions carried by the two profiles and K_{d }the number of shared deletions,

These distance methods were used to compute the distance between each SpolDB4 spoligotype pattern and the references of the ten main

Clustering algorithms

Affinity Propagation (AP), proposed first in ^{2}), where N is the total number of nodes to cluster). The starting point is thus a set of data points, representing the nodes of the network, and a similarity matrix S defining the similarities among all the nodes as deduced from the distance between all these nodes. The similarity between two points

provided that the distance _{i }

The first term of the function defined above is (minus) the sum of all the similarities between a point and its exemplar, while the second term is introduced to avoid any configuration in which an exemplar does not belong to the cluster that it represents, that is, an exemplar must be the exemplar of itself. This is guaranteed by defining the function

and by taking the log function of it and summing it over all the nodes, so that the energy becomes infinite if at least one exemplar is represented by a different exemplar. The parameter

Once the Cavity equations are written one is left with two coupled update rules for each pair of nodes

These update rules represent messages that the nodes are exchanging between iteration _{i}
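To make the message-passing idea concrete, here is a small pure-Python sketch of Affinity Propagation. It uses the responsibility and availability updates of the standard formulation by Frey and Dueck, with damping, which may differ in notation and detail from the Cavity-equation form used in this section. The names `preference`, `damping` and `iterations`, and the toy data, are assumptions made for the illustration; `preference` is the self-similarity placed on the diagonal and plays the role of the cost of creating an exemplar.

```python
def affinity_propagation(S, preference, damping=0.5, iterations=200):
    """Cluster points from a similarity matrix S (list of lists).

    Lower `preference` values favour fewer exemplars.
    Returns, for each point, the index of its exemplar.
    """
    n = len(S)
    # Put the preference on the diagonal without modifying the caller's matrix.
    S = [[preference if i == k else S[i][k] for k in range(n)] for i in range(n)]
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            first = max(range(n), key=scores.__getitem__)
            second = max((scores[k] for k in range(n) if k != first),
                         default=float("-inf"))
            for k in range(n):
                best_other = second if k == first else scores[first]
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        raise RuntimeError("no exemplar found; try another preference or more iterations")
    # Exemplars represent themselves; other points join the most similar exemplar.
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]


# Toy data: six points on a line, similarity = minus the squared distance.
points = [0, 1, 2, 10, 11, 12]
S = [[-float((x - y) ** 2) for y in points] for x in points]
labels = affinity_propagation(S, preference=-10.0)
print(labels)
```

On this toy input the two well-separated groups each end up represented by their middle point. Each iteration of this plain implementation touches every pair of points, so it is meant for reading rather than for large data sets.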

Authors' contributions

CS and SF initiated this work through informal discussions; ML did the first experiments under SF supervision and CS guidance for classification (University Paris-Sud Master 1 program, Physics). ML wrote his Master report on this topic. SF supervised CB (Master 2 program) in program writing and in the acquisition and analysis of data. GR performed complementary program writing, data acquisition and analysis. CB and GR were both involved in data analysis and writing of the manuscript. CS provided the taxonomic expertise and contributed to the revision of the manuscript. All authors approved the final version.

Acknowledgements

Marc Mézard is acknowledged for fruitful discussions on the design of the study. Edgar Abadia, Jian Zhang and Michel Gomgnimbou are acknowledged for discussions on the design of the manuscript. CS and SF acknowledge financial support of Univ. Paris-Sud in the form of a « Chaire d'excellence » respectively in Physics (SF) and Microbiology (CS).