Wednesday, 27 August 2025

Tuberculosis

At reference 1, I took a look at some whizzy graphics that I had come across in my travels. Today I start with a much simpler graphic, snapped above.

To give this graphic some context, tuberculosis has been around since at least the time of the cave men; was the scourge of France in Simenon’s day; carried off Molière, D. H. Lawrence, Muhammad Ali Jinnah, George Orwell and Vivien Leigh amongst many others; and lingered on here in the UK until my own childhood in the 1950s. It is still a major problem in the world at large, particularly in south Asia. It is also a major problem with cattle. One result of all of which is the huge amount of effort being put into bringing it under control.

The modern tuberculosis lineages L2-L4 account for most of the continuing & considerable disease burden, especially L4. The older L1 is mainly confined to South Asia; L7 to East Africa, possibly the original seat of the disease. Lineages L5-L9 more generally are confined to Africa, where they are important. Mycobacterium bovis infects a range of mammals, including both humans and cattle, for which last tuberculosis is an economically important disease. Studying these lineages is an important tool in trying to deal with the tuberculosis we have now.

All this arises from trying to read reference 2, part of trying to make a bit more sense of genome sequencing – not so much why one does it but how one does it. Figure 1 above was all very well, but where did it come from? What did the colours mean? The size of the triangles, the lengths of the horizontal or vertical bars?

The first stop was reference 10, the paper referenced at the bottom of the snap above. But before I get into that, a bacterial digression.

M. tuberculosis

It seems that the bacteria which cause tuberculosis have been very thoroughly studied, with the central cluster being called the Mycobacterium tuberculosis complex (MTBC). Some of them specialise in humans, others in various other animals and still others can infect a variety of hosts.

Mycobacterium is a genus of over 190 species of bacteria in the group Actinobacteria, with the ‘myco’ prefix referring to their tendency to form mold-like colonies. So not necessarily unicellular.

Turning to reference 4, I find that the Actinobacteria are one of fourteen groups into which the Eubacteria are divided, with the Eubacteria being one of the legs of the tripod of life, with the other two legs being the Archaea and the Eukaryotes. Viruses don’t count as being properly alive. Active or inactive, but not alive or dead.

Moving from the general to the particular, lots of things are easier in the world of genome sequencing if one has a good quality reference genome to tie one’s new sequences into. It seems that in this case the reference genome is something called H37Rv, to be found at reference 5.

Then moving forward from 1905, when the parent H37 strain was first isolated, to just a few years ago, we have reference 6, where the abstract tells us that:

… three independently cultured and sequenced H37Rv aliquots of a single laboratory stock. Two of the 4,417,942 base-pair long H37Rv assemblies are 100% identical, with the third differing by a single nucleotide. Compared to the existing H37Rv reference, the new sequence contains ~6.4kb additional base pairs…

So a good deal more accurate than I had thought. First, in replication, organic or otherwise: these cultured bacteria are much more stable than I would have thought. Second, in sequencing: perhaps clever statistics have neutralised, have tamed the messy chemistry. But I have not worked out how an assembly differs from a genome.

I have tried analysing a tuberculosis genome taken from reference 7, expressed as 5,647 genes specified in terms of start position and end position, expressed as base pair number, running from 1 to something over 4 million – leaving out of account all the non-coding data. Which is all very straightforward, much more straightforward than I would expect the much larger human genome to be. With one catch being that there is a lot of overlap, with the total length of the genes exceeding the total number of base pairs by something more than 50,000 base pairs. All this illustrated by the figure snapped above. Something appears to be wrong – quite possibly something to do with the many repeats I understand to be present.
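The overlap check described here can be sketched in a few lines of Python. This is only a minimal sketch, assuming the genes have been exported as (start, end) pairs with 1-based, inclusive coordinates; the function names and the toy coordinates are made up for illustration and are not taken from reference 7.

```python
def total_gene_length(genes):
    """Sum of the individual gene lengths; a base shared by two genes counts twice."""
    return sum(end - start + 1 for start, end in genes)


def covered_length(genes):
    """Number of distinct base positions covered by at least one gene."""
    covered = 0
    run_start, run_end = None, None
    for start, end in sorted(genes):
        if run_end is None or start > run_end:
            # This gene starts a new run; bank the previous run, if any.
            if run_end is not None:
                covered += run_end - run_start + 1
            run_start, run_end = start, end
        else:
            # This gene overlaps the current run; extend the run if needed.
            run_end = max(run_end, end)
    if run_end is not None:
        covered += run_end - run_start + 1
    return covered


# Hypothetical toy coordinates (1-based, inclusive), NOT real tuberculosis genes.
toy_genes = [(1, 100), (90, 200), (300, 400), (350, 360)]

summed = total_gene_length(toy_genes)
covered = covered_length(toy_genes)
overlap = summed - covered

print("summed gene length:", summed)
print("distinct bases covered:", covered)
print("double counted bases:", overlap)
```

The difference between the summed length and the covered length is the amount of double counting from overlapping genes. Comparing the covered length with the length of the whole genome would then show how much of the sequence is not assigned to any gene.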

There is a superficially simpler presentation of a similar genome at reference 8, snapped above, from more than twenty years ago. Despite which, the count agrees pretty well with that given above, if you allow for the newly added 6,500 or so base pairs. We have clearly known quite a lot about the tuberculosis genome for quite a long time.

Superficially simpler, but it would take me a good while to work out what it was all about. Not attempted and, oddly, not even tempted.

In any event, all rather different to the purple figure of the human chromosome set to be found at reference 9.

From a sequencing point of view, bacteria have the advantages that one can store them for long periods and that they are easily cultured, that is to say they replicate themselves. One does not need to go in for chemical trickery. And they are not very big, with a genome of the order of one thousandth the size of the human genome: a much more accessible object of study. On the other hand, some of them are dangerous and may only be handled in a secure laboratory.

Varieties of tuberculosis

People have been interested in the various varieties of tuberculosis bacteria for a long time. Also in the evolution of those varieties in time and their distribution in space across the world. The current story seems to be that, to some extent at least, tuberculosis bacteria evolve to work with the human genome available. To that extent, the geographic spread of the bacteria matches that of the human genome. Figure 1 from reference 2, at the top of the present post, is evidence of that interest.

A figure which is primarily about lineage, about the tree structure, which, in itself, is scale free. Notwithstanding which, one supposes the vertical and horizontal scales are significant too.

[Taken from reference 10]

This figure is said to have been adapted from reference 10, so that was clearly the place to go to find out how it was put together. To find that my Figure 1 was probably some version of Figure 2 there. Where ‘L’ is for lineage.

A figure which admits more or less indefinite zoom, from which I deduced that the vertical scale is the number of samples in, for example, lineage L2. Which might, depending on the data used by the study at hand, reflect its prevalence in the population at large.

[Figure taken into Powerpoint from reference 3]

Then there is Figure 1 in reference 3, very much the same sort of thing. Furthermore, I was able to download the table of the 851 genomes involved into an Excel worksheet, where I was able to count up the various lineages.

This is snapped above. BCG is the live but feeble version of the bacillus which is the active ingredient in the widely used BCG inoculation. The ‘M’ lines in the lower half of the table are for various animal related strains of the tuberculosis bacillus, with some of these strains having a strong preference for some particular host, for example Mycobacterium mungi for the mongoose.

And the ratio of human to animal genomes there very roughly agreed with the corresponding vertical distances in the figure.

However, the numbers of the individual lineages in the upper half of the table did not correspond to the size of the triangles on the figure (Figure 1) that I started with.

I then found my way to the downloadable Table 1 of reference 11, possibly a more population orientated sample of genomes, resulting in the pivot table snapped above. But this did not agree very well with my starting figure either.
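The counting behind pivot tables of this sort can equally be done outside Excel. What follows is a minimal sketch, assuming the table has been saved as CSV with one row per genome and a column naming its lineage; the column names and the handful of rows are hypothetical stand-ins, not the real tables from reference 3 or reference 11.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded table: one row per genome.
toy_table = io.StringIO(
    "genome_id,lineage\n"
    "G001,L4\n"
    "G002,L2\n"
    "G003,L4\n"
    "G004,M. bovis\n"
    "G005,L1\n"
    "G006,L4\n"
)

# Count how many genomes fall into each lineage, much as a pivot table would.
counts = Counter(row["lineage"] for row in csv.DictReader(toy_table))
total = sum(counts.values())

# Print the lineages, most common first, with each one's share of the total.
for lineage, n in counts.most_common():
    print(f"{lineage:10s} {n:4d} {n / total:6.1%}")
```

The shares printed at the end are the numbers one would want to compare with the heights of the triangles in the figure, always supposing that the vertical scale is a count of samples.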

At this point I gave up on the vertical scale. I neglect the possibility that the vertical scale is all about geographical spread, a thought arising from the correlation of strain with geography.

The horizontal scale is about the difference between the various strains, measured by the number of single nucleotide polymorphisms – which I take to mean the number of differences between corresponding nucleotides – expressed as a ratio of the number of differences per so many base pairs: differences at the level of base pairs. Sometimes used as a proxy for time, given a steady rate of mutation.
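As I understand it, the basic count can be sketched as follows. This is a minimal sketch, assuming two sequences which have already been aligned to the same length; the function name and the two short toy sequences are made up for illustration. Real tree building software uses more elaborate models of substitution than a raw count of differences.

```python
def snp_distance(seq_a, seq_b):
    """Count differing sites between two aligned sequences.

    Returns (differences, sites compared). Sites where either sequence has a
    gap or an ambiguous call (anything other than A, C, G or T) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences, compared


# Hypothetical toy sequences, far shorter than any real genome.
toy_a = "ACGTACGTAC"
toy_b = "ACGTTCGTAA"

diffs, sites = snp_distance(toy_a, toy_b)
per_site = diffs / sites

print(f"{diffs} differences over {sites} sites = {per_site:.2f} per site")
```

Dividing by the number of sites compared gives the ratio of differences per base pair, which is the sort of number that a scale bar on a tree of this kind usually represents.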

A lot of these differences in base pairs make no appreciable difference to the bacillus. No need to invent new species or varieties.

[A zoom of the figure taken from reference 3]

We have a scale bar, no doubt meaningful to workers in the field. And maybe the long grey bar reflects the amount of change in that lineage, since it split off from the pink lineage at the left, before it split again at the right, before the amount of change or the sort of change warranted splitting the lineage. But what all the 100s are I have not been able to fathom.

The colouring of the triangles in my figure, the one I started with, has largely been lifted from the version at reference 10. Here they are just for decoration, to brighten up the graphic; there they serve to better identify the clusters of interest in a busy graphic.

The left hand vertices look to be very roughly indicative of when the lineage started to branch out. 

But, leaving aside most of the animal lineages and leaving aside order down the page, we seem, at least, to have broad agreement about the tree structure of the human lineages across the three sources – just about visible in the snap above – very possibly generated by the same sort of algorithm as generated the trees decorating the graphic at the top of reference 1. The weasel words ‘simplified’ and ‘adapted’ in the caption to the left hand figure cover it.

Other odds and ends arising along the way

Deaths

Along the way I stumbled across reference 12, which, as it turned out, was not relevant to the present inquiry. But I was struck by the figure snapped below.

Perhaps 50 million deaths here out of total deaths for the year of 70 million, for a world population of near 8 billion. Which does not seem very many. I have not checked, but I can only suppose that most of the world’s population is very young – with only the rich countries growing old.
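The arithmetic behind ‘not very many’ can be made explicit. This is a rough sketch only: the deaths and population are the round numbers quoted above, while the life expectancy of 73 years is an illustrative assumption of my own, not a figure taken from reference 12.

```python
# Round numbers quoted in the text.
deaths_per_year = 70e6
world_population = 8e9

# Crude death rate: deaths per 1,000 people per year.
crude_rate_per_1000 = deaths_per_year / world_population * 1000

# ASSUMPTION for illustration only: a life expectancy of 73 years.
# In a population that was neither growing nor shrinking, and where
# everyone lived to that age, about 1 in 73 would die each year.
assumed_life_expectancy = 73
stationary_rate_per_1000 = 1000 / assumed_life_expectancy

print(f"crude death rate: {crude_rate_per_1000:.2f} per 1,000")
print(f"stationary population rate: {stationary_rate_per_1000:.2f} per 1,000")
```

On these assumptions the actual rate comes out well below the stationary rate, which is at least consistent with the supposition that the population as a whole is young.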

Google’s Gemini seems to have a very reasonable grasp of the matter, adding the idea that infant mortality continues to decline in countries with high birth rates into the mix. But without coming up with anything specific.

Note 1: ischemic heart disease is about problems arising from a narrowing or blocking of coronary arteries, in particular poor blood supply to the heart. Hypertensive heart disease is about problems arising from high blood pressure.

Note 2: you can get a very different picture if you look at individual countries. See, for example, Afghanistan at reference 13.

Data

Reference 10 explains that: ‘… Supporting external data includes sequences from studies PRJEB3334, PRJNA52007, PRJEB3223, PRJEB23179, PRJEB5162, PRJEB9680, PRJEB2138, PRJEB7727, PRJNA211633, PRJNA211637…’.

I tried searching in Bing for ‘PRJNA211663’ and all this turned up was reference 10 itself – but at least it was able to turn the paper up on the basis of what amounted to an obscure identifier. This was a test of search capability that we used back when searching large datasets was relatively new and I was involved in a procurement to buy some such capability. I then tried ‘PRJNA211663 tuberculosis archived bacteria sample’ and that was not much better – but I did get to an interesting paper in ResearchGate, another candidate digression. On the same search term, Google offers just two results, neither of them containing the string of interest. But Gemini cracked the problem in short order, with the start of my interchange with him snapped above. He also puts me in my place: a serious person would have known without having to ask!

Bovine TB

According to Gemini, ‘if left untreated, bovine tuberculosis (bTB) in cattle is a chronic, progressive, and ultimately fatal disease’. Infected cattle can be a hazard to humans, primarily through infected milk, and the value of the carcase is much reduced. It is well worth spending serious money to keep TB in cattle down and here in the UK, as in most developed countries, infected cattle are identified and culled, usually before there are visible symptoms.

I note that it is getting very easy just to ask Gemini and to make use of what he says without checking. He is a very convenient source of convincing information and I do not catch him out in a significant way very often these days.

Famous victims

Asking the Internet for famous victims of tuberculosis proved to be a bit hit and miss, with some of those listed not checking out. Or at least only checking out to the extent of possible or probable. A reminder that one needs to be careful.

Conclusions

[copy of the figure we started with]

I conclude that the Figure 1 I started with, copied above, is intended to show a simple family tree of the human portion of the tuberculosis bacillus. M. canettii is an outgroup of the sort featured at reference 14. See also reference 15.

The vertical scale is vaguely indicative of prevalence in humans and the horizontal scale is vaguely indicative of difference between the lineages. But vague: don’t get picky about it.

Which, to my data analyst turn of mind, is not really good enough. One ought to be more careful about such things. But then again, it is all too likely that my reading of reference 2 has been careless too.

I might say in passing that the standard of the Excel worksheets that I come across in this context is not high. They are there and the data is available – which is good – but the authors could have taken a bit more care with their design. Maybe thought about the likely needs of the intended user.

[BP514-14. Turned up by Google. FIG. 1. Colony morphology of the “Mycobacterium canettii” isolate on Middlebrook 7H10 agar (4 weeks old). “Mycobacterium canettii” Isolated from a Human Immunodeficiency Virus-Positive Patient: First Case Recognized in the United States - Akos Somoskovi and others - 2009]

References

Reference 1: https://psmv5.blogspot.com/2025/08/another-whizzy-graphic.html

Reference 2: The Mycobacterium tuberculosis genome at 25 years: lessons and lingering questions - Benjamin N. Koleske, William R. Jacobs Jr., William R. Bishai – 2023. 

Reference 3: A new phylogenetic framework for the animal-adapted Mycobacterium tuberculosis complex - Brites, D. et al. – 2018. 

Reference 4: The tree of life: a phylogenetic classification – Lecointre and Le Guyader – 2006.

Reference 5: https://en.wikipedia.org/wiki/H37Rv

Reference 6: A comprehensive update to the Mycobacterium tuberculosis H37Rv reference genome – Poonam Chitale, Alexander D. Lemenze, Emily C. Fogarty, Avi Shah, Courtney Grady, Aubrey R. Odom-Mabey, W. Evan Johnson, Jason H. Yang, A. Murat Eren, Roland Brosch, Pradeep Kumar, David Alland – 2022.

Reference 7: https://mycobacterium.biocyc.org/

Reference 8: Mycobacterium tuberculosis Pathogenesis and Molecular Determinants of Virulence – Issar Smith – 2003. 

Reference 9: https://psmv5.blogspot.com/2025/06/a-cell.html

Reference 10: Phylogenomics of Mycobacterium africanum reveals a new lineage and a complex evolutionary history – Coscolla M, et al. – 2021. Reference 60 in reference 3 above. 

Reference 11: Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies – Gary Napier, Susana Campino, Yared Merid, Markos Abebe, Yimtubezinash Woldeamanuel, Abraham Aseffa, Martin L. Hibberd, Jody Phelan, Taane G. Clark – 2020. The source of the 92 row table of lineages.

Reference 12: Global tuberculosis report – WHO – 2024. 

Reference 13: https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-leading-causes-of-death

Reference 14: https://psmv5.blogspot.com/2024/05/outgroups.html

Reference 15: https://en.wikipedia.org/wiki/Mycobacterium_canettii
