The post in which I discuss the difficulty of working with a small, repetitive set of formulaic text chunks

This week I had the opportunity to wonder at the black box of topic modeling again, but with a better understanding of the type of content that this black box was attempting to magic into interpretability. I’ve seen conference presentations and various workshops that discuss how topic modeling is supposed to “work,” but I never truly understood how the process of moving from one bag of words to another, smaller bag of 10 or so words made sense. I’m still not sure I fully get it, but after poking and prodding at the tool until I could generally convince myself that there was a common thread between most of the (non-garbage, non-stopword) words, I think I sort of get why one might want to use this?

Data entry across the centuries

This week I was able to wrangle two of my Latin texts (the manorial records for 1208-1209 and 1210-1211) into files for ingest into the Topic Modeling Tool. I made sure to break the texts out both by year (labeled 1209 and 1211) and by manor (using the various semi-corrected Latin manor names). This let me run the topic model analysis not only on the combined set of ~120 records, but also on each year’s grouping, to see if anything jumped out as interesting and particular to one grouping or another. Unfortunately, most of the actual text of my documents is a three-part formula: ‘[This person hereby renders an account for XYZ debts/income for the year]. [The total is £A B shillings, C pence]. [He has paid up this amount, no debt remains]’ with some iteration within those three sections. As a result, a lot of administrative cruft language was being incorporated into my topics alongside various renditions of cash numbers. For conceptual topics, that wasn’t very useful to me.

What did end up helping quite a lot was pulling a Latin stopword list to remove some of the non-constructive prepositions and similar function words. I modified that list slightly, as it included verbs for giving, seeing, making, saying, having, carrying, and becoming, and I wanted to retain these words since I figured they might have value for economic activities. Once the topic modeling was run with the stopword list in place, the results were a bit more constructive. I also needed to add Roman numerals to the list, including those with terminal “shilling” and “pence” denominations and those with the terminal “j” indicator (used instead of a final “i,” which would otherwise indicate the Roman numeral for 1) found in the printed edition of the 1209 volume I pulled from. This was an attempt to focus not on the highly variable price/cost element but on the concrete vocabulary used.
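
As a rough illustration of that cleanup step, here is a minimal Python sketch. It is not what the Topic Modeling Tool does internally, and it swaps the approach described above (adding each numeral to the stopword file) for a regular expression applied before ingest. The tiny stopword set, the denomination suffixes (li, s, d), and the sample line are all assumptions made up for the example, not taken from the edition or from the stopword list actually used.

```python
import re

# Hypothetical, heavily abbreviated stopword set; a real list would be
# loaded from the downloaded stopword file.
STOPWORDS = {"et", "in", "de", "ad", "per", "pro", "cum"}

# Roman numeral in medieval additive style (iiii as well as iv), where the
# last unit may be written "j" (e.g. "xij"), optionally followed by an
# assumed denomination suffix: li (pounds), s (shillings), d (pence).
NUMERAL = re.compile(
    r"(?=[mdclxvij])"              # require at least one numeral letter
    r"m{0,4}"                      # thousands
    r"(?:cm|cd|d?c{0,4})"          # hundreds
    r"(?:xc|xl|l?x{0,4})"          # tens
    r"(?:ix|iv|v?(?:i{0,3}[ij])?)" # units, with terminal i or j
    r"(?:li|s|d)?"                 # assumed denomination suffix
)

def clean_tokens(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords and
    anything that looks like a Roman numeral or cash amount."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        t for t in tokens
        if t not in STOPWORDS and not NUMERAL.fullmatch(t)
    ]

# Invented sample line in the style of the accounts, not a quotation.
sample = "Idem reddit compotum de xij li vjs iiijd in defectu terre"
print(clean_tokens(sample))
```

One caveat with a pattern like this: very short Latin words such as “id” or “vi” are also valid numeral-plus-suffix strings and would be dropped, so it is worth checking the removed tokens by eye before trusting the result.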

Topic model discussion

Unfortunately, with the limited sample of texts and the highly repetitive administrative language used almost exclusively throughout the documents, there was not a great deal of insight to be had from the topic modeling exercise on just these two texts (even when split amongst the various manorial records). Most topics had a string that included basically “sum, half, quarter, account, church, expense, grain, receipt, mill, week, year” in one order or another. While this list would be fascinating in a larger set of Latin texts, it is to be expected here, since this set of documents is exclusively about the payment of landowners (including mill owners) to the church. The only real highlight was that each set (texts for 1209 alone, texts for 1211 alone, and both sets combined) produced one topic that included many named entities (William, Richard, Robert, Lord, etc.) as well as “the Church” (episcopi, which is most often referenced in these documents as an entity rather than a place), “son” (filio, usually in reference to the person reporting the account or from whom something was received, as in “X amount from purchase of land from the son of Richard,” indicated by the ablative ending), and “land” (terre). This inclusion of “land,” also in the ablative, seems most often to reference a default in rent, usually rendered as “X person ‘In defectu terre’”. I don’t know if there is anything to the linking of names and these administrative ablatives, but as they are often closely related in the text itself I can see how the topic modeling program would have come to the conclusion that they were related. However, I would like to try with a larger set of manorial documents to see if anything else interesting might jump out of this kind of formulaic material. Unfortunately, the only other printed volumes available to me so far are in English, and do not have sufficiently clean OCR for this process just yet.

I would very much like to see if there are broader trends within smaller categories as well, across many years. Perhaps particular names are more closely linked to defaults in rent, or to sales of particular goods such as eggs or wheat on particular manors. Finding out will require massive amounts of work to transcribe and encode the remaining Latin manuscripts, but I’m hopeful that advances in handwriting analysis programs over the coming years will make this more feasible.