MIME is where Legacy Systems go to die
Your new system went live. Migration of current, active data went well. A decision was made not to move historic data and keep the old system around in "read-only" mode, just in case some information needs to be looked up. Over time your zoo of legacy systems grows. I'll outline a way to put them to rest.
In all recent systems (that is, anything younger than 30 years), data is stored more or less normalized. A business document, like a contract, is split over multiple tables like customer, address, header, line items, item details, product etc.
The issue gets aggravated by the prevalence of magic numbers and abbreviations that are only resolved inside the legacy system. So looking at one piece of data tells you squid. Only an old hand would be able to make sense of
Status 82 or
Access to meaningful information is confined to the user interface of the legacy application. It provides search and assembly of business-relevant context.
Solving this puzzle requires a three step approach:
- make accessible
The persistence format needs to be something that is closer to a document structure. Suitable formats would be XML, JSON or YAML. Probably owing to my age, I would make a case for XML. I would argue that <Status id="42">Universe questions answered</Status> is way more readable than JSON when you try to link the magic number 42 to its business meaning, "Universe questions answered".
To be very clear: You will massively duplicate data, and if any of the relations were to change, you would face a data nightmare. But our use case (archive, read-only) doesn't face any penalties for this duplication, other than storage.
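A minimal sketch of that denormalization step in Python, using only the standard library. The table contents, the lookup table STATUS_LABELS and the element names are all made up for illustration; a real export would read them from the legacy database.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table that normally lives only inside the legacy system.
STATUS_LABELS = {42: "Universe questions answered"}

# Hypothetical normalized rows, as they might come out of separate tables.
customers = {7: {"name": "ACME Corp", "city": "Springfield"}}
contract = {"id": 1001, "customer_id": 7, "status": 42}
line_items = [
    {"contract_id": 1001, "product": "Widget", "quantity": 3},
    {"contract_id": 1001, "product": "Gadget", "quantity": 1},
]


def denormalize(contract, customers, line_items, status_labels):
    """Build one self-describing XML document for a single contract."""
    root = ET.Element("Contract", id=str(contract["id"]))

    # Keep the magic number, but store its business meaning right next to it.
    status = ET.SubElement(root, "Status", id=str(contract["status"]))
    status.text = status_labels.get(contract["status"], "unknown")

    # Copy the customer data into the document instead of referencing it.
    customer = customers[contract["customer_id"]]
    cust_el = ET.SubElement(root, "Customer", id=str(contract["customer_id"]))
    ET.SubElement(cust_el, "Name").text = customer["name"]
    ET.SubElement(cust_el, "City").text = customer["city"]

    # Embed only the line items that belong to this contract.
    items_el = ET.SubElement(root, "LineItems")
    for item in line_items:
        if item["contract_id"] == contract["id"]:
            item_el = ET.SubElement(items_el, "LineItem", quantity=str(item["quantity"]))
            item_el.text = item["product"]
    return root


xml_text = ET.tostring(
    denormalize(contract, customers, line_items, STATUS_LABELS), encoding="unicode"
)
```

The customer block gets repeated in every contract document. That is exactly the duplication described above, and for a read-only archive it is the price of each document being understandable on its own.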
The reflexive approach for an IT department tasked with this would be the creation of another future legacy system to visualize the data.
Now you have at least 2 files: the denormalized XML and the HTML (or PDF). You might also have binary data, like image files or office attachments, that belongs to that business record.
Enter MIME. MIME is a container format, typically found in email systems. Nothing stops us from reusing its capabilities. The MIME file consists of a MIME header and one or more MIME parts that qualify their content by a type.
A MIME part can itself contain a MIME header and one or more MIME parts, ad nauseam or until stack overflow.
Using MIME we can compose a single file. In the header would be meta information: what legacy system it came from, when it was exported, where the documentation could be found etc. Thereafter would be MIME parts with HTML, XML, PDF and binary files.
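Composing such a file could look roughly like this with Python's standard email package. Treat it as a sketch: the X-Legacy-* header names, the archive@legacy.invalid address, the documentation URL and the function name build_archive_record are my own placeholders, not an established convention.

```python
from email.headerregistry import Address
from email.message import EmailMessage
from email.utils import formatdate


def build_archive_record(system, subject, html, xml_text, binaries=()):
    """Pack one business record into a single MIME message."""
    msg = EmailMessage()
    # Header: meta information about where the record came from.
    msg["From"] = Address(display_name=system, username="archive", domain="legacy.invalid")
    msg["Subject"] = subject
    msg["Date"] = formatdate()  # export timestamp
    msg["X-Legacy-System"] = system  # hypothetical custom header
    msg["X-Legacy-Documentation"] = "https://docs.example.invalid/old-erp"  # placeholder

    # Human-readable view: plain-text fallback plus the HTML rendering.
    msg.set_content("Archived record. Open the HTML part for a readable view.")
    msg.add_alternative(html, subtype="html")

    # Machine-readable view: the denormalized XML as an attachment.
    msg.add_attachment(
        xml_text.encode("utf-8"), maintype="application", subtype="xml", filename="record.xml"
    )

    # Any binary files that belong to the record: (filename, maintype, subtype, bytes).
    for filename, maintype, subtype, data in binaries:
        msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg


record = build_archive_record(
    system="OldERP",
    subject="Contract 1001",
    html="<h1>Contract 1001</h1><p>Status: Universe questions answered</p>",
    xml_text='<Status id="42">Universe questions answered</Status>',
    binaries=[("scan.png", "image", "png", b"\x89PNG placeholder bytes")],
)
eml_bytes = record.as_bytes()  # write these bytes to e.g. contract-1001.eml
```

The result is a multipart/mixed message whose first part is a multipart/alternative (text and HTML), followed by the attachments, so it also shows the nesting of MIME parts in action.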
Just double-clicking on that file would open it in the standard mail viewer. With cleverly set header fields (e.g. From as the name of the legacy system) it will render the HTML part into something directly readable and understandable. Thus the file is useful in itself. If automated further processing is required, the XML or binary parts can be harvested.
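Harvesting the parts again is a few lines with the same standard library. In this sketch the sample message, the file name record.xml and the function name harvest are invented for the example; in practice the bytes would come from the archived file.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def harvest(raw_bytes):
    """Return the sender, the HTML view and all attachments of an archived record."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)

    # The part a mail viewer would render.
    html_part = msg.get_body(preferencelist=("html",))
    html = html_part.get_content() if html_part is not None else None

    # Everything else: XML and binary parts, keyed by file name.
    attachments = {
        part.get_filename(): part.get_content() for part in msg.iter_attachments()
    }
    return str(msg["From"]), html, attachments


# Build a tiny sample record so the sketch runs on its own.
sample = EmailMessage()
sample["From"] = "OldERP <archive@legacy.invalid>"
sample.set_content("<p>Status: Universe questions answered</p>", subtype="html")
sample.add_attachment(
    b'<Status id="42">Universe questions answered</Status>',
    maintype="application",
    subtype="xml",
    filename="record.xml",
)

sender, html, attachments = harvest(sample.as_bytes())
```

Text parts come back as strings, while non-text parts such as application/xml come back as raw bytes, ready to be handed to an XML parser.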
Now take that file and store it in an append-only object store (S3, anyone?). The final piece is: how to make the information findable?
Users might search for a part, a customer, a time frame, or a combination of all sorts of criteria. This type of requirement can be covered by a fulltext index and a search engine. Lucene is content format aware, so queries can be very specific or very simple. The beauty of the approach: unlike with a database, the MIME documents don't need to have a shared structure to be searchable by Lucene. Still, the indexing process can do some level of standardization like merging
Id, ID, id, recordid, account_id to be searchable as
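To illustrate the kind of field-name standardization an indexing step could do, here is a small Python sketch. It is not Lucene, just an exact-match lookup table standing in for a real index, and the target name canonical_id is a placeholder I picked for the example, as are the document names and field values.

```python
from collections import defaultdict

# Field names that should be merged, compared case-insensitively.
FIELD_ALIASES = {"id", "recordid", "account_id"}
# Hypothetical target name, chosen only for this example.
CANONICAL_FIELD = "canonical_id"


def normalize_field(name):
    """Map the known alias spellings to one field name, lowercase everything else."""
    key = name.strip().lower()
    return CANONICAL_FIELD if key in FIELD_ALIASES else key


def build_index(documents):
    """Map (field, value) pairs to the set of documents that contain them."""
    index = defaultdict(set)
    for doc_name, fields in documents.items():
        for field, value in fields.items():
            index[(normalize_field(field), str(value).lower())].add(doc_name)
    return index


# Two records from different (made-up) legacy systems with different field names.
documents = {
    "erp-1001.eml": {"ID": "1001", "Status": "Universe questions answered"},
    "crm-77.eml": {"account_id": "1001", "customer": "ACME Corp"},
}

index = build_index(documents)
hits = index[(CANONICAL_FIELD, "1001")]  # finds both records
```

A real search engine would add tokenizing, ranking and query parsing on top, but the alias mapping itself stays about this simple.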
Of course implementation needs to be planned carefully. Nevertheless this can be used as a blueprint for the one archive system to rule them all.
As usual, YMMV.