psmv5: A genome

[Figure 01]

Still pegging away at reference 1, in the course of which I came across a human reference genome called hg18, something I have been looking out for for a while. From 2006, so rather long in the tooth now by the standards of such things, but it still has a big Internet footprint. This does not include its own Wikipedia page, but it does get a couple of mentions at reference 2.

[Figure 02]

This led to reference 3, from where I lifted the promising looking table above, thoughtfully laid out in a way which makes it easy enough to import into an Excel worksheet – where the comma’d numbers are easier to read. And where one can do sums to check one’s understanding.

An alternative overview

[Figure 03]

Anticipating the deeper dive which follows, I offer this alternative presentation of this genome in the figure above, in which I have not bothered to replicate the structure for an individual. Which last might well include both copies of all the regular chromosomes.

A rather simplified main structure to the right of ‘chromosome’, to which the brown boxes hanging down add gaps, alternates and landmarks – some of which will have been worked away in the near twenty years since this genome was produced. Alternate where, despite this being a reference genome, choice seems appropriate. Landmark or signpost for some feature which has been identified before or apart from the main sequencing operation. Perhaps a stake in the ground around which sequences can be arranged.

The box labelled quality is there to remind us that sequences can vary a good deal in quality. It is not a binary split between good and bad.

A structure which is mostly about the way in which this assembly was built – involving many teams and even more people – rather than the detailed, functional structure of the genome. Or the organisation of the coding part of the genome into genes – which, roughly speaking, code for proteins – or the three base pair groups which code (slightly redundantly) for their constituent amino acids. I dare say that all this is part of what the specialised browsers do.

For my purposes, it would be useful to convert the hierarchically organized zip file into a flat file, an Excel worksheet, listing all the sequences. With fewer than 300 contigs, and judging by the above-average-sized chromosome 6 at Figure 05 below, there can only be a few tens of thousands of sequences. But that, sadly, is beyond my IT skills. Or patience.
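For what it is worth, a flattening of this sort might be sketched in Python along the following lines. This is a minimal sketch, not something I have run against the real file: it assumes that every file in the zip whose name ends in ‘.agp’ is tab-separated text, one sequence to the line, with comment lines starting with ‘#’. The function names are made up for the purpose.

```python
import csv
import io
import zipfile


def flatten_agp_zip(zip_source):
    """Return one row per sequence line found in any '.agp' member of the zip.

    zip_source may be a path or a file-like object. Each row is the member's
    path inside the zip followed by the tab-separated fields of the line.
    """
    rows = []
    with zipfile.ZipFile(zip_source) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith(".agp"):
                continue  # skip folders and anything that is not an agp file
            with zf.open(name) as member:
                for line in io.TextIOWrapper(member, encoding="utf-8"):
                    line = line.rstrip("\r\n")
                    if not line or line.startswith("#"):
                        continue  # assumed: blank and comment lines carry no data
                    rows.append([name] + line.split("\t"))
    return rows


def write_flat_file(rows, csv_path):
    """Write the rows to a CSV file which Excel can open."""
    with open(csv_path, "w", newline="") as out:
        csv.writer(out).writerows(rows)
```

Something like `write_flat_file(flatten_agp_zip("contigAgp.zip"), "hg18_sequences.csv")` should then give a single worksheet, with the folder path in the first column standing in for the hierarchy.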

Drilling down in Figure 01

So starting top left, a species may have a number of reference genomes and each reference genome may go through a number of builds before the builders are satisfied with it. So this build is NCBI 36.1, aka hg18, aka golden path. A series of genomes which ended with hg38 in 2013. I suppose that ‘hg’ stands for ‘human genome’. Much the same as one gets with successive releases of software – or, for that matter, of hardware.

NCBI, the National Center for Biotechnology Information at the National Library of Medicine, a very impressive resource, is to be found at reference 4. Let us hope that it survives the good-government-is-small-government turmoil in the US.

The column letters added to Figure 01 are the same as the Excel column headers in Figure 02.

Sizes are given in base pairs.

Column A

This is the chromosome number (or letter). The 22 regular chromosomes have been numbered in descending order of size, as established a long time ago using optical microscopes and electrophoresis. This order has been retained and largely agrees with what we now know to be the size in base pairs, although chromosome 21 has turned out to be a little smaller than chromosome 22.

One can also stain chromosomes to show their G-bands (of reference 9), which are pretty much the same for everybody. One chromosome is reliably different from another in this way. An example of this banding can be seen at Figure 05 below.

The upshot of which is that, with training, one can say which chromosome is which by looking at them under a microscope.

This order has little to do with the order or arrangement of chromosomes within a cell – which is, in any case, a three-dimensional rather than a one-dimensional object. Things do not have to have a natural, linear order – although it so happens that DNA and proteins do – which makes them much more tractable than they might otherwise be.

Columns B, C and D

The total size of a chromosome is given by column B, which is the sum of column C and column D.

Columns E, F, G and H

Euchromatin (here ‘Euch.’) is the less-condensed, gene-rich portion of the genome. It is where most of the active genes are located and transcribed.

Heterochromatin (here ‘non-Euch.’) is the highly-condensed, gene-poor portion of the genome. It is often found in regions like centromeres and telomeres. These areas are typically difficult to sequence and assemble due to their highly repetitive nature.

I assume that a gap is where two sequences have been placed on the chromosome, leaving a gap between them which has not been sequenced. I do not know how this placing was done, nor have I found out how gaps are expressed in this zip file. The base pairs of each contig, as listed in ‘agp’ files, for which see below, are numbered from one until the end, but while I assume that these contigs are listed in the order in which they belong on their chromosome, I have found no provision for gaps between them. Or for start position on the chromosome. Or provision for their not having been precisely placed at all.

Although I believe that the random contigs of columns L and M of Figure 01, where present, do come at the end of the chromosome level lists.

Column I

Clones in this context are segments of DNA, often derived from bacterial artificial chromosomes (BACs; see reference 10), that were sequenced individually and then used as building blocks for the genome assembly. In this 2006 build, they accounted for a significant proportion of the total and I think that this is what column K is about. I believe that the balance was sequenced using whole genome, shotgun sequencing.

Unplaced means those clones that could not be definitively mapped to a known location on their chromosome.

Column B less Column I equals the sum of the lengths of the constituent contigs, for which see below. Which suggests that these contigs do not include the unplaced clones.

Columns J, K, L and M

A chromosome is mostly built out of contigs. Column J plus column L is the number of contigs in the chromosome.

Column I appears to be the sum of the lengths of all the random contigs (column L), while column N appears to be the sum of the lengths of just some of them. A bit untidy.

Summary

All rather incomplete and I have failed to find an accessible description of this table.

Drilling down some more

I got pointed at references 5 and 6. The first, a glossary of some of the terms used in this space, is a useful tutorial; the second leads me to contigAgp.zip – a zip-filed description of how the assembly was generated from chromosomes, contigs and sequences – a file which Gemini had already mentioned. A file which I can get at without needing to get to grips with the sophisticated browsing tools offered at reference 4 and elsewhere.

In what follows, I avoid the term ‘scaffold’, used for elements somewhere between the chromosome and the sequence.

[Figure 04]

With sub-folders here corresponding more or less to the rows of Figures 01 and 02 – although without all the properties of a Windows folder, this being a zip-file.

The ‘hap’ folders are variations of regions of chromosomes, a consensus sequence not being appropriate at that time; the ‘alternates’ of Figure 03. Their contigs – two in the case of ‘5_hp_hap1’, one for ‘6_cox_hap1’, six for ‘6_cox_hap1’ and one for ‘22_h2_hap1’ – appear to be excluded from the counts at Figure 01. Nor have I worked out how they are mapped onto the rest of the chromosome, into their host contig.

[Figure 05. Right hand section lifted from a previous post; the image with the infinite zoom]

Taking the sixth chromosome by way of example, we get another layer of structure, with 11 more sub-folders, corresponding to the 9 ordered contigs plus the 2 random contigs of Figure 02 above. I call the entities at this level contigs, which may follow professional usage.

Scaffold is another term here, which I avoid, and which seems to be used for something between a chromosome from a genome and a sequence from a sequencing machine.

An alternative, older view of this chromosome is included right, lifted from the previous post at reference 8 and probably originally taken from reference 9. A view which shows, inter alia, the G-banding explained at reference 8 and the centromere towards the top. And gives a size of 170m base pairs, which agrees pretty well with the total given in the middle panel – and that at Figure 02.

We are now at the bottom of the zip-folder structure, with, for example, ‘NT_007583’ containing just the one file ‘NT_007583.agp’, where ‘.agp’ is for ‘a golden path’.

[Figure 06]

These files can be opened in Microsoft’s Notepad, giving another level of structure, nine rows of it in the example above. I call the entities at this level sequences. While the number of sequences to the contig given in the middle of the previous figure (Figure 05) looks to vary a good deal, their average size is fairly steady at around 95,000bp, with the exception of the exceptional 10th and 11th contigs.

[Figure 07]

Analysis of the large 6th contig of chromosome 6 gives the rather broad distribution shown above.
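An analysis of this sort could also be done outside Excel. What follows is a rough Python sketch, assuming that the second and third tab-separated fields of each line are the start and end of the sequence within the contig. It also assumes that rows with ‘N’ or ‘U’ in the fifth field are gaps rather than sequences, which is my reading of the published AGP format rather than something I have seen in these files. The bin size of 25,000 is arbitrary.

```python
from collections import Counter
from statistics import mean


def sequence_lengths(agp_text):
    """Return the length in base pairs of each sequence row in an agp file."""
    lengths = []
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5 or fields[4] in ("N", "U"):
            continue  # assumed: malformed rows and gap rows are not sequences
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start + 1)  # positions are 1-based and inclusive
    return lengths


def length_histogram(lengths, bin_size=25_000):
    """Count the lengths falling into bins of bin_size, keyed by bin start."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))
```

With the text of one ‘.agp’ file in hand, `mean(sequence_lengths(text))` gives the average size and `length_histogram(sequence_lengths(text))` gives the counts behind a chart of the distribution.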

The second and third columns of the main panel of Figure 06 give the start and end position of the constituent sequences in terms of base pairs from the start of the contig, all nice and neat with no gaps or overlaps.
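The ‘no gaps or overlaps’ observation is the sort of thing that can be checked mechanically rather than by eye. A minimal Python sketch follows, assuming one contig to the file, numbered from one, with start and end in the second and third tab-separated fields. The function name is my own.

```python
def find_breaks(agp_text):
    """Return (line_number, expected_start, actual_start) for every row
    which does not start exactly one base pair after the previous row ends."""
    breaks = []
    expected = 1  # assumed: each contig is numbered from one
    for number, line in enumerate(agp_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        start, end = int(fields[1]), int(fields[2])
        if start != expected:
            # start > expected is a gap, start < expected is an overlap
            breaks.append((number, expected, start))
        expected = end + 1
    return breaks
```

An empty list back means that the file is as neat as it looks.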

The sixth column gives the accession number of a publicly lodged sequence, an example of which can be found at reference 7. Most of these accession numbers start with ‘AL’, but there are other prefixes, for example ‘B’. I have failed to find a list of these prefixes.

[Figure 08]

I turn now to the odd looking seventh column, column G in the snap above. This appears to be the start point to use in the accession file. Most of the time the contig uses the whole of the accession file, but sometimes it only starts at position 2001, with the first 2,000 base pairs not being used, perhaps because they duplicate those of the preceding accession file. I have no idea what this is about. At one point I had wondered whether ‘2001’ was a date!

The last column, column I in the snap above, says whether the sequence is to be read forwards or backwards.
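Putting the seventh and last columns together, the part of an accession file which a contig actually uses might be picked out as in the Python sketch below. It assumes that the eighth column gives the matching end position in the accession file, that positions are counted from one and are inclusive, and that reading ‘backwards’ means taking the reverse complement, that is to say reversing the letters and swapping A with T and C with G. The function names are my own.

```python
# Translation table swapping each base for its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def component_bases(accession_seq, comp_beg, comp_end, orientation):
    """Return the stretch of an accession sequence used by the contig.

    comp_beg and comp_end are 1-based and inclusive (assumed to be the
    seventh and eighth columns); orientation is '+' or '-' (last column).
    """
    part = accession_seq[comp_beg - 1:comp_end]
    if orientation == "-":
        return reverse_complement(part)
    return part


print(component_bases("AAACGT", 2, 5, "+"))  # prints AACG
print(component_bases("AAACGT", 2, 5, "-"))  # prints CGTT
```

So a start of 2001 would simply mean dropping the first 2,000 letters of the accession file before using it.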

The accession file for the first sequence in the snap above, ‘AL603783.14’, follows.

[Figure 09]

A sequence which might be obsolete – a warning that these reference genomes can still be work in progress – but which was the work of the Sanger Institute here in the UK.

[Figure 10]

After some descriptive material, we get the sequence itself, arranged in rows made up of six groups of ten base pairs, each represented by one of the letters A, C, G or T.

Note that no attempt is made here to mark the frames of three letters which code for amino acids: the many frameshifts would make this much too complicated to visualise. But one can, if one wants, use Notepad search to look for particular sequences – a search complicated by the presence of the spacing spaces.
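The spacing problem goes away if one strips out everything which is not a letter before searching. A small Python sketch, assuming that the sequence part of the accession file has been copied out as text, with its position numbers and spaces left in; the sample lines below are invented for illustration and the function names are my own.

```python
import re


def clean_sequence(text):
    """Drop position numbers, spaces and line breaks, leaving just the bases."""
    return re.sub(r"[^A-Za-z]", "", text).upper()


def find_motif(sequence, motif):
    """Return the 1-based positions of every occurrence, overlaps included."""
    positions = []
    motif = motif.upper()
    index = sequence.find(motif)
    while index != -1:
        positions.append(index + 1)
        index = sequence.find(motif, index + 1)
    return positions


sample = """
        1 gatcctccat atacaacggt
       21 atctccacct
"""
bases = clean_sequence(sample)
print(find_motif(bases, "TCC"))     # prints [3, 6, 24]
print(find_motif(bases, "CATATA"))  # prints [8]
```

The second search finds a run which straddles two of the groups of ten, which is exactly what Notepad would miss.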

Additional information

[Figure 11]

The unhelpful table of prefixes that I did find at reference 11 - but we do, at least, have ‘NT_’ in the folder structure exhibited above. Offering contigs, scaffolds and whole genome shotgun sequences.

Noting, in passing, that while one can sort out nuclear or DNA material from tissue on a large scale, I have not come across sorting out individual chromosomes.

I took a quick look at reference 12. A proper look would no doubt be good for my understanding of the foundations of all this, despite it being twenty years old - but I suspect this would soak up more time than I want to spend. In any event, there was one quick win; one piece of low hanging fruit – a phrase which was popular at the time I left the world of work. This was some material about how we came to know that (the more visible) genes are on chromosomes, that they are organised serially on those chromosomes and about how we can use recombination to tell us something about that serial order. Maybe this provided a framework for the sequencers to work with.

Conclusions

On this view, all we have is a very long sequence of base pairs – around three billion of them – just analysed into twenty odd chromosomes. All the other structure in my zip file is down to the accidents of sequencing – although I dare say there is some correlation with function.

No doubt the specialised browsers can do a lot more.

An interesting dive into hg18, which has nicely demonstrated the inevitable complexity of building such a large object – very large in numbers and very small in grams. It would be interesting to see how all this pans out in today’s version of the genome – but we shall have to see. Maybe the time would be better spent playing with one of the browsers…

References

Reference 1: A spatially-resolved transcriptional atlas of the murine dorsal pons at single-cell resolution – Stefano Nardone, Roberto De Luca, Antonino Zito, Nataliya Klymko and others – 2024.

Reference 2: https://en.wikipedia.org/wiki/Reference_genome.

Reference 3: https://genome.ucsc.edu/goldenPath/stats.html#hg18.

Reference 4: https://www.ncbi.nlm.nih.gov/.

Reference 5: https://www.ensembl.org/info/website/glossary.html.

Reference 6: https://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/. A list of some of the stuff that is available from the Golden Path.

Reference 7: https://www.ncbi.nlm.nih.gov/nuccore/AC009704.8. A sample accession file from NCBI.

Reference 8: https://psmv5.blogspot.com/2025/06/a-cell.html.

Reference 9: https://en.wikipedia.org/wiki/G_banding.

Reference 10: https://en.wikipedia.org/wiki/Bacterial_artificial_chromosome.

Reference 11: https://www.ncbi.nlm.nih.gov/books/NBK21091/.

Reference 12: Genes VII - Benjamin Lewin - 2000. Originally published in 1983, now at least at Genes XI published in 2014, with a raft of co-authors.