[Caption: Heat maps of gene expression values show how experimental conditions influenced production (expression) of mRNA for a set of genes. Green indicates reduced expression. Cluster analysis has placed a group of down regulated genes in the upper left corner]
Not whizzing through reference 1, still having only got around as far as page 111. I got stuck yesterday morning on tegmentum, a region of the brain stem, a region where lots of important stuff goes on.
But I was a bit vague about where it was, so I turned up reference 2. This was not quite what I wanted, so back to Bing and he turned up reference 3, where there was much talk of transcription and lots of fancy diagrams. Needing to remind myself what transcription was all about, I let Bing point me to reference 4, which is illustrated with the figure above (hereafter WF). About which all that the text offers is:
… Apart from selecting a clustering algorithm, user usually has to choose an appropriate proximity measure (distance or similarity) between data objects. The figure above represents the output of a two-dimensional cluster, in which similar samples (rows, above) and similar gene probes (columns) were organized so that they would lie close together. The simplest form of class discovery would be to list all the genes that changed by more than a certain amount between two experimental conditions…
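The 'simplest form of class discovery' mentioned in that passage can be sketched in a few lines of Python. The gene names, the expression values and the two-fold threshold below are all invented for illustration; none of them comes from the data behind WF.

```python
import math

# Hypothetical mean expression levels under two conditions (invented numbers).
control = {"geneA": 100.0, "geneB": 80.0, "geneC": 50.0, "geneD": 200.0}
treated = {"geneA": 25.0, "geneB": 90.0, "geneC": 400.0, "geneD": 210.0}


def changed_genes(before, after, min_abs_log2_fc=1.0):
    """Return (gene, log2 fold change) for genes that moved by at least the threshold.

    A threshold of 1.0 on the log2 scale means at least a doubling or a halving.
    """
    hits = []
    for gene in sorted(before):
        log2_fc = math.log2(after[gene] / before[gene])
        if abs(log2_fc) >= min_abs_log2_fc:
            hits.append((gene, log2_fc))
    return hits


hits = changed_genes(control, treated)
for gene, log2_fc in hits:
    direction = "up" if log2_fc > 0 else "down"
    print(f"{gene}: log2 fold change {log2_fc:+.2f} ({direction})")
```

Working on the log scale is my choice here: it treats a halving and a doubling as changes of the same size, which a plain difference would not.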
The caption, as it turned out, was accurate enough, but I did not find it very informative. Clicked on ‘more information’, to be told all about the picture and its pixels, but not much about its content.
Searching for the label for the first row, ‘E217S298R783’, did not produce anything useful.
Next stop, Google Images which turned up this picture and lots of others like it. At this point I work out that maybe ‘heatmap’ is the way in. So at reference 5, I find what appears to be the very same picture but with a new caption, snapped below.
Heat map generated from DNA microarray data reflecting gene expression values in several conditions (Eisen et al.).
But maybe the row labels are the three-dimensional coordinates of sample points in a brain and the column labels are some kind of genetic marker? However, searching for the label for the rightmost column did not produce anything useful.
Hoping to confirm this hypothesis otherwise, I dig up Eisen & Co at reference 6, to find figures like WF, but not very like. We also have reference 7, where we have a very slightly different version of the very same WF. Which leads to reference 8, an old paper from Science which has not leaked into the public domain – and $30 is a bit strong. Maybe get a day ticket at UCL, should I happen to be in London for some other reason? It is the summer holiday and the place will probably not be busy.
But back at reference 7, we do get an improved caption, snapped below.
Figure 1: Cluster heat map from Andrade (2008), based on Eisen et al. (1998). The aspect ratio has been adjusted to make the pixels square. The rows (or columns) of a microarray heat map represent genes and the columns (or rows) represent samples. Each cell is colorized based on the level of expression of that gene in that sample.
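To get a feel for how similar rows and similar columns end up lying close together, I tried a toy version in Python. This is a minimal sketch, assuming Euclidean distance and average linkage, both of which are my choices rather than anything confirmed for WF. The matrix is invented.

```python
import math

# Hypothetical expression matrix: rows are genes, columns are samples.
genes = ["g1", "g2", "g3", "g4"]
matrix = [
    [0.1, 0.2, 2.0, 2.1],  # g1: low, low, high, high
    [1.9, 2.0, 0.1, 0.0],  # g2: high, high, low, low
    [0.0, 0.2, 2.0, 2.0],  # g3: resembles g1
    [2.1, 1.8, 0.2, 0.1],  # g4: resembles g2
]


def euclidean(a, b):
    """Straight-line distance between two equal-length lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_order(rows):
    """Average-linkage agglomerative clustering; returns row indices in leaf order."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average member distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [euclidean(rows[p], rows[q])
                         for p in clusters[i] for q in clusters[j]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


row_order = cluster_order(matrix)
columns = [list(col) for col in zip(*matrix)]  # transpose to cluster the samples
col_order = cluster_order(columns)

for r in row_order:
    print(genes[r], [matrix[r][c] for c in col_order])
```

After reordering, g1 sits next to g3 and g2 next to g4, so colouring the cells would show solid blocks rather than a chequerboard. Real packages also draw the merge tree alongside, which this sketch leaves out.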
Andrade appears to be a Wikipedia editor who appears in the file information for the copy of our figure at reference 5. Which does not help me.
StemBase, another lead from the file information, appears to be some kind of gene database, which I could probably get into, despite reference 10 having gone missing, but I decide against. I would almost certainly get lost. Not worth the amount of effort involved.
Reference 7 also suggests that the colour code might be blue for low, green for mean and red for high, but this does not work for WF where we have lots of black. On the other hand, reference 5 says green for weak expression and red for high expression - which agrees with the strong association of red and hot.
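One reading which would account for all the black, and it is only my guess, is a scale which runs from green through black to red, with black for expression near the mean. A minimal sketch in Python follows; the function name and the clipping limit of 2.0 are my own inventions.

```python
def expression_colour(value, limit=2.0):
    """Map a centred expression value to an (R, G, B) triple on a green-black-red scale.

    Values at or below -limit come out pure green, 0 comes out black and
    values at or above +limit come out pure red. The limit is arbitrary.
    """
    clipped = max(-limit, min(limit, value))
    intensity = round(255 * abs(clipped) / limit)
    if clipped < 0:
        return (0, intensity, 0)
    return (intensity, 0, 0)


for v in (-3.0, -1.0, 0.0, 1.0, 2.0):
    print(v, expression_colour(v))
```

On this scheme a map dominated by genes sitting near their average would be mostly black, with green and red reserved for the interesting cells.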
I tried asking Gemini about the row and column labels, but while he offered reasonable general advice, he was unable to be specific. But he did offer me a research plan to click on, suggested various paths forward, paths which he was not going to take himself. Not clear where his boundaries are here: some paths he does take, others not. Copyright issues? Doesn’t like dynamic content?
But then I tried searching for the column label again, ‘1425824_a_at’, using Google this time, and turned up reference 11. We do indeed have some kind of mouse flavoured gene. And the chart in the snap above looks to quantify its expression in various places in the murine brain. Burning up the space by using bar height rather than cell colour for visual quantification.
PCSK4 is well documented, including a brief entry at Wikipedia, at reference 11 and above.
Dot plots
It so happens that reference 3 contains a number of two-dimensional dot plots, which might be thought of as a version of the heat map – unfortunately as puzzling in the first instance as WF.
[Caption: Figure 1 … The top marker genes that specify the identity of each “excitatory” or “inhibitory” cluster are in (h) and (i), respectively. h, i Dot plots illustrating the expression level of the topmarker gene for the “excitatory” (h) and “inhibitory” (i) neuronal groups. All differentially expressed genes in the dot plot have an average log fold-change…]
The highlighted ‘Col12a1’ and ‘Zfp804b’ are both the names of genes, both known to Google. If generalised, this says that most, but by no means all, the column labels are pairs of genes. I make the row count 65 and the column count 71.
I suppose that the marker genes are the rows and the groups are the columns, these last mostly identified by their top two genes. This sort of works in the bottom left hand corner, snapped above.
[Caption: Figure 1 … e Dot plot of 35 cell marker genes that univocally identify each cell type. 3 marker genes were plotted for all cell types except for CPE cells, where only the top 2 were used]
However, in the same figure we had a rather simpler dot plot, so I had a go at that. Here the row labels are genes and the column labels are for groups of neurons. Individual dots (or discs) are for the expression of this gene in the cells of that group. I think the idea is that the colour of the dot indicates departure from the all-cell mean while the size of the dot indicates percentage expression in cells of that group. I assume for present purposes that for any one cell, a gene is either expressed or not, a binary choice at that level – an assumption which may be invalid in the case that the chromosome set contains more than one copy of a gene, possibly many copies. One might have degrees of expression, driving changes in the volume of production of the chemical concerned.
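My reading of the two quantities behind each dot can be put into a few lines of Python. This is a sketch of my guess at the scheme, not the method of reference 3, which may well normalise or scale things differently. The group names and counts are invented, and a cell is taken to express the gene if its count is above zero.

```python
# Hypothetical per-cell expression counts for one gene, keyed by cell group.
counts_by_group = {
    "group1": [0, 0, 3, 5, 0, 4],
    "group2": [0, 0, 0, 0, 1, 0],
    "group3": [7, 6, 0, 8, 9, 5],
}


def dot_plot_row(counts_by_group):
    """For each group, return (fraction of cells expressing, group mean minus all-cell mean)."""
    all_counts = [c for counts in counts_by_group.values() for c in counts]
    overall_mean = sum(all_counts) / len(all_counts)
    row = {}
    for group, counts in counts_by_group.items():
        fraction = sum(1 for c in counts if c > 0) / len(counts)  # would drive dot size
        departure = sum(counts) / len(counts) - overall_mean      # would drive dot colour
        row[group] = (fraction, departure)
    return row


row = dot_plot_row(counts_by_group)
for group, (fraction, departure) in row.items():
    print(f"{group}: {fraction:.0%} of cells expressing, "
          f"mean differs from overall by {departure:+.2f}")
```

The two numbers carry different information: a group can have most of its cells expressing the gene a little, or a few cells expressing it a lot, which is presumably why the plot uses both size and colour.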
For the two rows that I checked, ‘Vcan’ and ‘Slc38a11’, Bing turns up references 15 and 16. The odd looking ‘9630013A20Rik’ works too, giving the mouse flavoured reference 17. Clearly got a problem with naming conventions!
There is strong correlation between the rows and columns, brought out by appropriate ordering, but it is not the case that positive expression of a marker gene identifies the group in an unambiguous way, as is suggested by the caption. Nor do I see what is going on with the exceptional CPE cells.
So while I get the general idea, not happy with the detail. I don’t think my difficulties are explained away by my ignorance of the subject matter, so once again, a case where there could have been better labelling.
t-SNE plots
[Caption from original: Fig. 2: Embeddings generated from our entire corpus visualised using t-SNE and coloured according to their grammatical class. Adjective: Green, Verb (Past Participle): Blue, Verb (Present): Yellow, Female: Large pink, Verb (Past): Purple, Character: Orange nodes, Verb: Red, Male word: Large grey, Noun: White]
Yet another sort of two-dimensional plot which crops up in reference 3, a wheeze for visualising complex entities in two dimensions, described at reference 18, from which the snap above is taken.
This time I can do better at tracking down the source, which is at reference 19. And what we have is an analysis of the near six million words used in 49 19th century British and Irish novels, lifted from Project Gutenberg. The idea being to locate female authored gendered words (pink) and male authored gendered words (grey) in a space of other words, coloured by grammatical class. Gendered words being words like he, she, priest, wife, king and uncle. All this without having to bother to read the novels yourself.
A former correspondent would have been horrified by this invasion of his literary world by IT nerds. Nerds from the other island too, sometime home of the incomparable Joyce. He might have needed to pay a visit to Wetherspoon’s to calm down.
But at least in this case, I do get the general idea and I was able to track down the example used in Wikipedia.
Conclusions
I have not been able to run down the source of the Wikipedia figure (WF), but it is indeed a heat map, just like it says on the tin, a popular example of the genre, seemingly illustrating expression of various genes (columns) in various places in a murine brain (rows). And that is going to have to do for now.
Heat maps, together with their clusters, look to be a powerful and widely used tool for data exploration.
It also appears that there are lots of computer packages out there supporting their generation. And lots of more or less publicly available data about genes and brains to exercise them on.
I was reminded that there are lots of different sorts of neurons – which must take a lot of work to model accurately at scale.
All a bit long winded, but not without peripheral interest. Maybe I would have done better with less clicking and more reading. That said, the labelling of WF could have been better.
References
Reference 1: The brain and the inner world: an introduction to the neuroscience of subjective experiences - Mark Solms, Oliver Turnbull – 2002.
Reference 2: https://en.wikipedia.org/wiki/Tegmentum.
Reference 3: A spatially-resolved transcriptional atlas of the murine dorsal pons at single-cell resolution – Stefano Nardone, Roberto De Luca, Antonino Zito, Nataliya Klymko and others – 2024.
Reference 4: https://en.wikipedia.org/wiki/Gene_expression_profiling.
Reference 5: https://en.wikipedia.org/wiki/Heat_map.
Reference 6: Cluster analysis and display of genome-wide expression patterns – Eisen, M., Spellman, P., Brown, P., Botstein, D. – 1998.
Reference 7: The History of the Cluster Heat Map – Leland Wilkinson, Michael Friendly – 2008.
Reference 8: A Postgenomic Visual Icon – Weinstein, J – 2008. Inaccessible – although Weinstein is visible enough.
Reference 9: https://faculty.mdanderson.org/profiles/john_weinstein.html. A lot of work here on cancer, atlases and genes. See snap above.
Reference 10: http://www.stembase.ca/. Closed down.
Reference 11: http://brainstars.org/genesym/Pcsk4.
Reference 12: https://en.wikipedia.org/wiki/PCSK4.
Reference 13: https://omabrowser.org/oma/vps/ENSG00000115257/.
Reference 14: ESR Modern Radiology eBook: Central Nervous System – Laura Oleaga, European Society of Radiology - 2025. I came across this one in the margins. Looks like a useful chunk of accessible tutorial material.
Reference 15: https://en.wikipedia.org/wiki/Versican.
Reference 16: https://www.genecards.org/cgi-bin/carddisp.pl?gene=SLC38A11.
Reference 17: https://www.informatics.jax.org/allele/MGI:6158415.
Reference 18: https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding. Where the ‘t’ is for Student’s t distribution.
Reference 19: Exploring the Role of Gender in 19th Century Fiction Through the Lens of Word Embeddings - Siobhan Grayson, Maria Mulvany, Karen Wade, Gerardine Meaney, Derek Greene – 2017.







