My first true CL/NLP project was a Python program that used regular expressions to analyze Akkadian nouns specifically (check it out here). I had the idea in mind for a good while before that, but my first real implementation was built sometime in December of 2025. I chose Akkadian mainly because, among Semitic languages, its nouns are agglutinative and concatenative: there are no broken plurals, so features are expressed by suffixes strung onto a stem. Verbs, however, go beyond what simple regexes can do without backreferences, lookaheads and lookbehinds, which is why I didn't handle them here. The concatenative character of Akkadian nominal morphology makes it a good fit for a plain regular language, as opposed to something more expressive, perhaps between regular and context-free. (Since verbs require processing root and stem in conjunction, I think they go beyond a memoryless regular language and would require recursion to extract features, though backreferences give Python regexes that extra expressive power anyway.) Regexes also let a letter appear either once or not at all (the ? quantifier), so handling mimation was a breeze.
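To make the expressivity point concrete, here is a minimal illustration (my own example, not code from the project): a backreference like (\w)\1 matches a doubled consonant, i.e. gemination, which a strictly regular expression cannot express in general, and which is exactly the kind of stem-internal pattern Akkadian verbs involve.

```python
import re

# (\w)\1 matches any character followed by itself, i.e. a geminated consonant.
# Backreferences like \1 push Python's "regexes" beyond strictly regular languages.
print(bool(re.search(r"(\w)\1", "iparras")))  # True (the doubled -rr- of the present stem)
print(bool(re.search(r"(\w)\1", "iprus")))    # False (no gemination in the preterite)
```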
The code's design is extensible: the regex patterns are self-contained and can be added, removed or rewritten. The GUI and engine need comparatively little reworking, since the idea behind them is static; only the implementation varies slightly (e.g. feature structures could be a dict instead of class attributes, regexes could use more expressive techniques to extract the roots and stems of verbal nouns, new types of nouns could be added, et cetera).
I also used PySimpleGUI (or rather FreeSimpleGUI) for the frontend, which is a wrapper around different GUI frameworks (including Tkinter, which ships with Python). It's simple to use, and I like the minimalist design it offers; it makes me feel like I just downloaded something from 2005. The image (also made by me!) was meant to represent that as well, with the blue gradient, a cheatsheet from "Mastering Regular Expressions" by Friedl and the image of a cuneiform tablet fading together.
Fundamentally, it works like this: for a word (or words, in later versions), it creates an object of class Noun whose attributes are its morphological features, extracted by matching the word against a dictionary that maps regex patterns to feature lists. Version 1 supported at most one word; version 2, two.
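As a rough sketch of that design (the names NOUN_PATTERNS and Noun are my assumptions here, not necessarily the project's actual identifiers, and the feminine pattern is illustrative):

```python
import re

# Hypothetical pattern dictionary: each regex maps to a list of features.
NOUN_PATTERNS = {
    r"[^t]um?$": ["masc", "sg", "nom"],
    r"atum?$":   ["fem", "sg", "nom"],  # assumed entry for illustration
}

class Noun:
    """Holds a word and the features of the first pattern that matches it."""
    def __init__(self, word):
        self.word = word
        self.features = []
        for pattern, features in NOUN_PATTERNS.items():
            if re.search(pattern, word):
                self.features = features
                break

print(Noun("šarrum").features)    # ['masc', 'sg', 'nom']
print(Noun("šarratum").features)  # ['fem', 'sg', 'nom']
```

Storing the features on the object rather than returning a bare list is what makes the later extensions (more words, more feature types) easy to bolt on.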
Version 1.0 supported a maximum of one word and only analyzed for case, gender and number. This was just to get things started and to build a base I could customize later. It's made up of three parts:
- noun_patterns.py: the dictionary of regexes and their respective features as a list/array. For example,
r"[^t]um?$": ["masc", "sg", "nom"],
This regex means: match a word that ends in -um, with the m optional, where the letter t does not immediately precede the ending (note that [^t] also requires at least one character to stand in that position). The absence of stem-final -t marks it as masculine, and the ending -um or -u marks it as nominative and singular.
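The behavior of this entry can be checked directly (the words here are my own test cases, not the project's):

```python
import re

# The masculine singular nominative pattern discussed above.
pattern = r"[^t]um?$"

print(bool(re.search(pattern, "šarrum")))    # True: -um with no preceding t ("king")
print(bool(re.search(pattern, "šarru")))     # True: mimation dropped, m? is optional
print(bool(re.search(pattern, "šarratum")))  # False: feminine -t- precedes -um ("queen")
```

The last case shows why the [^t] guard is there: without it, the feminine šarratum would also match the masculine pattern.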