Wednesday, 17 July 2024

A word frequency analysis from a dream diary

This hare was started on the occasion noticed at reference 1, that is to say, my wanting to produce a word frequency analysis on the contents of a (Microsoft) Excel worksheet. As explained there, this led me to the people at reference 2. The thought was that a word frequency count might be a way to get into the content of my dream diary, held in the form of an Excel worksheet, presently containing getting on for a thousand rows. And with the thought behind that being that looking at this twilight zone between sleeping and waking might tell me something about the nature of consciousness.

All this prompted in turn by thinking that it was about time I got round to following up the report from my dream diary at reference 3, getting on for a couple of years ago now.

I could produce such a frequency count from my dream diary, from my writings generally and from the world in general. One could then use the statistical machinery which I assume is fairly readily available to look at the differences between the three counts. Or one could just look at the counts and at the differences between them in a more low-tech way. Mark one eyeball, as people sometimes used to say.
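Excel does, as it happens, carry a little of that machinery itself. CHISQ.TEST, for example, compares a column of observed counts with a column of expected counts and returns the probability of the difference between them arising by chance – the ranges below being illustrative only, not my actual layout:

    =CHISQ.TEST(B2:B101,C2:C101)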

Noting in passing that analysis of this sort does not depend on the way in which my dream world has been carved up into rows on the Excel worksheet, as all the words in all the dream reports are just run together into one big pot.

Testing

Having had the first go reported at reference 1, using some Visual Basic code supplied by a company called Ablebits, I then tried to be a bit more systematic, first with the real data and then with some toy data.

And this time, I read some of the instructions that came with the code from Ablebits before I kicked off. So we have a worksheet function which delivers a dynamic output array with its top left corner at the cell in which the function is called, by something of the form ‘=FreqWords(…)’. You get a mild error (‘#SPILL!’) if there is not enough space, that is if something else has already taken some of it – so care is needed, as while the array might only be two columns across, it might be many rows down, perhaps thousands.

Now the data in question had one row for each object and five consecutive columns containing the text of interest: 893 rows and 4,465 cells in total. So I started off with one row of data and worked up. This worked well enough up to 600 rows of data, containing something more than 6,000 distinct words in a total of around 56,000 words. But at 800 rows, after a pause of a few minutes, it produced a frequency distribution which was clearly wrong. At least it had not crashed, as it had the day before, as reported at reference 1. Wrong in the sense that while its length was plausible, at more than 7,500 rows, the counts at the top of the distribution were far too small and the counts from around row 3,000 on were zero – and in this context, frequency counts are not zero, by definition. So the code does not include enough internal checks, enough to catch and report failure. Nor does it report progress, a device which I have found useful with resource-greedy programs in the past.
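For what it is worth, the sort of thing I have in mind is sketched below in Visual Basic – my own toy version of a word count, nothing to do with the Ablebits code – with progress reported on the status bar and a check for the impossible zero counts:

    ' Toy word count over the selected cells, with progress reporting
    ' and an internal check. My own sketch, not the Ablebits code.
    Sub CountWordsWithProgress()
        Dim dict As Object, cell As Range, word As Variant
        Dim done As Long, total As Long
        Set dict = CreateObject("Scripting.Dictionary")
        total = Selection.Cells.Count
        For Each cell In Selection.Cells
            For Each word In Split(LCase(CStr(cell.Value)), " ")
                If Len(word) > 0 Then dict(word) = dict(word) + 1
            Next word
            done = done + 1
            ' Report progress every 100 cells
            If done Mod 100 = 0 Then Application.StatusBar = done & " of " & total & " cells"
        Next cell
        Application.StatusBar = False
        ' Internal check: a word in the dictionary with a zero count means failure
        For Each word In dict.Keys
            If dict(word) = 0 Then MsgBox "Zero count for '" & word & "': failure"
        Next word
        MsgBox dict.Count & " distinct words in " & total & " cells"
    End Sub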

I ran the 800 row test again and got the same answers. I followed this up with some rather desultory testing on toy data. Apart from one minor glitch at the boundary, the function performed as it should.

Carrying on with this same workbook a few days later, I found it to be very sticky. It was taking a very long time, for example, just to move the cursor around what I thought was some unrelated worksheet. After a while I deleted the worksheet containing the FreqWords function calls, having copied the results elsewhere, and this seemed to cure the problem. Maybe the Excel cell computation engine was being fired up unnecessarily, in this case freezing the system while it ground through all the FreqWords computations again.
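If the recalculation engine was indeed the culprit, one workaround would have been to put Excel into manual calculation mode while working on that workbook, along the following lines:

    ' Possible workaround, assuming recalculation was the culprit
    Sub CalmExcelDown()
        Application.Calculation = xlCalculationManual    ' stop automatic recalculation
        ' ... move about and edit in peace ...
        Application.Calculate                            ' recalculate on demand only (or press F9)
        Application.Calculation = xlCalculationAutomatic ' restore the default when done
    End Sub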

Tutorial

Along the way I needed to take some time out to think about what Excel does with its worksheets. The results of which I now share – and while what follows is a rather simplified version of what actually goes on in Excel, hopefully it makes some teaching points which are near enough the truth.

In the beginning, an Excel Workbook was made up of Worksheets and Worksheets were made up of cells, arranged in a rectangular array of rows and columns. Rows were numbered from the top and columns were numbered from the left. In later releases of the product, the size of cells and the numbers of rows and columns were for most practical purposes unlimited.

Cells were rather more than what you saw on your computer screen, being made up of a number of properties, one of which was the display value. Other properties might specify, for example, the number of decimal places or the font. There were various labour-saving wheezes whereby one did not have to specify every property for every cell, one by one. One could, for example, do that at the level of rows and columns.

But there was a product-defining complication. One cell was allowed to be derived from one or more other cells by virtue of a device called a formula. So, for example, this cell might be the sum of the values in those cells. 
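Putting the following in cell B5, for example, makes B5 derived from cells B2, B3 and B4:

    =SUM(B2:B4)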

In the figure above, the funny arrow connecting formula and cell is intended to suggest that a cell may or may not have a formula, with the little circle carrying the possibility of not – while a formula may refer to one or more other cells, these possibilities being carried by the arrows.

The big rule was that the directed network of derivations was not allowed to loop: the derivation engine had to be able to work through the network in an orderly way, starting from the cells which depend on nothing (the minima) and ending at the cells on which nothing depends (the maxima).

This was mixed in with a related device called the function, whereby the value of a cell was computed by the system in some more or less complicated way. For example, the square root of something or the time of day.
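Two such functions which certainly exist in Excel:

    =SQRT(A1)    the square root of whatever number is in A1
    =NOW()       the current date and time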

In the beginning, Excel was about doing stuff with numbers. But this gradually broadened out to include doing stuff with text. And there was plenty of it, Excel being the sort of thing that grows rather than shrinks. You are, for example, allowed to add images to worksheets – this being, as it happens, how the graphics produced by Excel itself are included in a workbook.

Then, after a while, it was thought worthwhile to complicate this relatively simple story and introduce arrays, an intermediate structure between the worksheet and the cell. So an array is a rectangular array of cells, more than a single cell and less than the worksheet. The specification of the array lives as a formula in the top left hand corner and the rest of the array must start empty.

Any one cell in the worksheet can belong to at most one array. And in the case that it does not belong to an array, it is just a regular cell of the worksheet. To some extent, the properties of the cells of an array are specified at the array level.

Adding icing to the cake, Microsoft now allow dynamic arrays, arrays which can change their size as the input data changes. So, in the present case, if one changes the text in a cell, the word frequency analysis of a block of text which includes that cell is updated, gaining new rows to the extent that the change introduces new words and losing old rows to the extent that the change removes all occurrences of old words.
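Excel’s own UNIQUE function, in the newer versions of the product, behaves in just this way. Entered in a single cell, the following spills down as far as it needs to, growing and shrinking as the words in column A come and go:

    =SORT(UNIQUE(A2:A1000))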

Ablebits

At this point I was mindful of the dangers of using other people’s code, uncertificated as it were – although, as noted at reference 1, this code looked harmless enough: a not very big chunk of source code, open for inspection. Still and all, I am not qualified to do that sort of inspection and would probably not notice a call to some unknown but sinister function within it.

Ablebits appears to be a supplier of Excel add-ins, some of them free – but I have not been able to find out much about it, apart from it being a Polish company. The results of searches for ‘ablebits’ are nearly all from the company website.

I tried asking Gemini, and while he could not add much, he did suggest various places to look, most of which required you to create a sign-in and some of which required you to flash the plastic. I settled for LinkedIn, which was content with a fairly soft sign-in – and did not ask for my plastic. From which I learned that Ablebits was a Polish company with perhaps 50 employees and associates, located on the outskirts of Warsaw and going for getting on for twenty years. They specialise in add-ins for popular Microsoft and Google products. Nothing recorded against. Probably safe enough, although I dare say a serious IT department would need more.

They might operate out of the upper part of the building snapped above, my best guess with their address and Street View, with the amusingly named off-license below.

Looking at the data

Now while I had not been able to produce frequency counts from all my data, I had produced what looked like frequency counts from three quarters of it. Maybe I could extract something useful from them.

So a very skew distribution, with a small number of noise words with lots of occurrences and a very long tail of words occurring just once – more than 3,000 words occurring just once and more than 4,000 just once or twice.

Of the words mentioned above, fish stands out for me. I apply the Excel find function to the source data and find that there is indeed quite a lot of it, this search catching all the variations on fish – fishy, fishing, fishes and so on. A good proportion is not really fishy, in that I write of fishing something out of something. But is the remainder just an artefact of our having just one word for fish, covering both the live and the culinary, while we do not have such a word for animals – cow and beef, pig and pork? Not really interesting at all?
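For counting rather than finding, the wildcard version of COUNTIF does much the same job – noting that it counts the cells containing a fishy word rather than the fishy words themselves, and that the range below is illustrative:

    =COUNTIF(A2:E894,"*fish*")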

This going back to the source data produced the table above, which tallies well enough with the 6 of the previous table when one discounts the later entries. Where incidental means that while the fishy word occurred clearly enough, there was little if any fishy content; as far as my dream awareness was concerned it was no more than a label. A consequence of the dream being in a context where such crops up, no more. And then there were fishy words like haddock and octopus which do not make this particular cut.

However, for any of this to be of much interest, surely I need to know both whether there is more fish in my dreams than when waking and whether there is more fish in my thoughts than in those of other people? Is there any signal here for a Freudian style, detailed dream analysis to bite on?

I next tried to pull what I called the content words out of the first 200 or so rows of the frequency table.

So quite a lot of words suggestive of other people populating my dreams, in particular BH. A fair number suggestive of place and/or work. With ‘US’, ‘white’ and ‘car’ being oddities. But here again, is this of any use without some yardstick to say whether these counts are normal, are significant, signal or noise?

Searching the raw data for ‘us’ mostly turns up ‘us’ within some larger word; not relevant at all. While doing a case sensitive search for ‘US’ probably turns up most of the relevant ones. The number perhaps reflecting both the importance of the US here in the UK and the fact that I have paid more visits to the US than any other country. Visits which were apt, of their nature, to generate vivid memories. So perhaps this count does no more than reflect the content of my life. A point worth following up: to what extent does dream content follow what I have done in life, rather than what I would like to have done? A contrast which I imagine varies from person to person.
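On which point, Excel’s FIND function is handy, being case sensitive where SEARCH is not. So something of the following sort counts the cells containing an upper case ‘US’ – although it would still be caught out by the likes of ‘USA’:

    =SUMPRODUCT(--ISNUMBER(FIND("US",A2:E894)))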

I then think that I need to group words together in some way, for example, to bring all the sexed pronouns together.

A first go at this is included above, a pivot table computed from some extra columns that I created for the purpose. The right hand column is the number of words included in the group, the middle column the aggregate number of occurrences in 600 rows of data. ‘Context’, for example, is words suggestive of where a dream is taking place, words like house, room or building – while ‘Place’ is particular places like London or Cambridge.
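Extra columns of this sort can be built with a lookup, assuming a small two-column table – word in the first column, group in the second – parked on a worksheet called, say, ‘Groups’:

    =VLOOKUP(A2,Groups!$A$2:$B$500,2,FALSE)

Words not in the table come back as ‘#N/A’, which is itself useful for spotting the ones not yet grouped.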

Sadly, while I have learned a bit more about the vagaries of pivot tables, I do not feel I have learned much about the content of my dreams.

Odds and ends

My dreams quite often involve people or places which are clearly named. But while the name may well be present in the dream, there might or might not be any clues in the images that come with it. Quite often there are not. And sometimes there are images associated with the name, but set apart from the dream proper, as if one were looking them up. Contrariwise, I recently had a dream involving, at one point, being outside a strip club. The dream included this label, but there was no sexual content; the name did not come with any baggage, at least none that I was conscious of in the dream. The business of relating image to word is clearly complicated.

Here we stick with the methods available to the amateur dream diarist, that is to say we have no ambitions to use electrical or physiological activity to prompt an alarm call when one is likely to be dreaming. We just record what we can when we wake up in the ordinary way of things – while recognising that alarm calls might well generate more interesting reports.

My present view is that one row on the dream workbook for each proper waking up would have been the way to do it, with proper waking up corresponding to the union of getting out of bed, perhaps only temporarily, and taking morning tea, perhaps early in the morning. Plus something for out of hours siestas or snoozes. Things will happen at the margin, but hopefully not too often. This mapping of the world of dream onto some number of rows of an Excel worksheet would be reasonably tidy. 

It might be possible to work over the existing data, but possible or not, it is unlikely to happen, although I have adopted this rule going forward.

Another possible issue is the spectrum of states from which dream-like material comes. There is a big grey area between regular dreaming and regular waking, an area which seems to be getting bigger as I get older and my sleep becomes less regular. Plenty of time is spent on both going to sleep and waking up. Sometimes in broad daylight, as in day dreaming. But my present view is that it is usually clear enough when conscious content has become detached from reality, and any such content is grist to this particular mill. That said, maybe there is a coding issue: it would be nice to analyse content by the state from which it came, for which purpose one needs codes, codes which are presently missing.

A problem which arose during my foray with Hurlburt’s Descriptive Experience Sampling (which can be reached through reference 4 below) is the interaction between the reporting and the phenomenon being reported on, in this case dreaming. I am sure that there is interaction, but am presently quite unsure whether it matters or what to do about it.

Dreaming that one is flying seems a good way around the problem that most of our skeletal muscles are turned off when we are asleep. Nothing much doing there, which might make it difficult to emulate the sensations arising from physical activity in dreams. Emulation which occurs, in my case, very rarely. And, as it happens, flying never, despite occasional efforts to promote such and despite flying being said to be common enough in the dreams of others.

There is no sound in my dreams. Quite possibly they are set in contexts which involve sound, but with the sound turned off.

Similarly, there is no food or drink taken. Again, they might be about, but I do not consume.

I liked the jagged edge inserted at the bottom of some of the shots taken from Excel included at reference 5. Very neat! But how was it done?

Conclusions

It is a pity that I did not get the helpful looking FreqWords function to work properly.

But before I invest more time in that, I clearly need to spend more time thinking about how I am going to get anything useful out of such frequency analyses. A bit more analysis before I dive back into implementation – the mantra, as it happens, of the bit of the civil service that used, for a while, to promote the use of structured methods in the construction of government computer systems. Don’t start building before you have worked out what it was that you wanted to build! Ditto buying. Which was all well and good, but I always retained the subversive feeling that it was the business of building that really revealed the interesting possibilities, possibilities which would be missed if you locked down the requirements right at the beginning. Subversion which could be countered with the time and expense of pilot studies, a luxury that was not always available.

In this case, I think some kind of cluster analysis might be the way forward. So first, one produces a cluster analysis of words in general, perhaps mapping them onto a two-dimensional rectangle or disc. Then second, one maps the words in the dream reports onto that same rectangle. Does this produce any interesting clusters? This might cut across the difficulty of capturing meaning with single words – but has the disadvantage that I do not have the skills or tools needed to produce such an analysis for myself. My Visual Basic is good for sums but not for pictures. Maybe I could leverage the Excel scattergram function?
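Getting the points onto a scattergram from Visual Basic is, as it happens, straightforward enough. A minimal sketch below, which assumes that the hard part – computing a pair of coordinates for each word – has already been done and the results parked in columns B and C:

    ' Minimal sketch: scatter the word coordinates assumed to be in B2:C101.
    ' Computing those coordinates - the clustering proper - is the hard part.
    Sub PlotWordMap()
        Dim ws As Worksheet, shp As Shape
        Set ws = ActiveSheet
        Set shp = ws.Shapes.AddChart2(XlChartType:=xlXYScatter)
        shp.Chart.SetSourceData Source:=ws.Range("B2:C101")
        shp.Chart.HasTitle = True
        shp.Chart.ChartTitle.Text = "Dream words"
    End Sub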

References

Reference 1: https://psmv5.blogspot.com/2024/07/limitations.html

Reference 2: https://www.ablebits.com

Reference 3: https://psmv5.blogspot.com/2022/10/dream-diary-first-report.html

Reference 4: https://psmv3.blogspot.com/2017/10/on-taxonomy-of-consciousness.html

Reference 5: https://www.ablebits.com/office-addins-blog/excel-word-frequency-analysis/

Reference 6: https://psmv5.blogspot.com/2024/07/invasion.html – a rather different sort of frequency analysis!
