eMail archival in PDF and electronic record keeping

The question pops up quite regularly: "Our compliance department has decided to use PDF/A for long-term record storage, how can I save my eMail to it?" (The question applies to ALL eMail systems.) The short answer: it is not as easy as you think. The biggest obstacle is legal need vs. user expectation. To make that clear: I'm not a lawyer, this is not legal advice, just my opinion; talk to your legal counsel before taking action.

User expectation (and thus problem awareness): "Storing as PDF is like storing on paper, so what's the big deal?" In reality electronic record keeping has a few different requirements (and NO, printing an eMail as seen on screen is NOT record keeping, more on this in a second). Every jurisdiction has its own regulations, but they are strikingly similar (for the usual devil in the details ask your lawyer), so I just take Singapore's Electronic Transactions Act as a sample:
Retention of electronic records
9. —(1) Where a rule of law requires any document, record or information to be retained, or provides for certain consequences if it is not, that requirement is satisfied by retaining the document, record or information in the form of an electronic record if the following conditions are satisfied:
(a) the information contained therein remains accessible so as to be usable for subsequent reference;
(b) the electronic record is retained in the format in which it was originally generated, sent or received, or in a format which can be demonstrated to represent accurately the information originally generated, sent or received;
(c) such information, if any, as enables the identification of the origin and destination of an electronic record and the date and time when it was sent or received, is retained; and
(d) any additional requirements relating to the retention of such electronic records specified by the public agency which has supervision over the requirement for the retention of such records are complied with.
(colour emphasis mine)
So there is "more than meets the eye": an eMail record is only completely kept if you keep the header information. Now you have 2 possibilities: change the way you "print" to PDF to include all header / hidden fields (probably at the end of the message), or use PDF capabilities to keep them accessible as PDF properties. The latter case is more interesting since it resembles the user experience in your mail client: users don't see the "techie stuff", but it is a click away to have a peek. There are a number of ways to create the PDF:

Use a commercial package like DominoPDF, AGE Exporter or IntelliPrint that is capable of generating PDF directly. They have limitations in what they can output around signatures and encryption. Big advantage: vendor support.
Use a PDF printer driver that can be programmed to automatically assign a known file name. Advantage: works like printing. Disadvantage: depends on the OS printing system.
Export your eMail as MIME or DXL and use an XSLT transformation to generate XSL-FO that can be saved as PDF. There is Apache FOP and a series of commercial tools. Advantage of this approach: you could have more than one pipeline (e.g. email, Notes apps, client apps, web apps) that ends at the XSL-FO processor. Disadvantage: XSL-FO is simply a beast.
Generate the PDF yourself in code using a Java library. Two open source libraries are quite popular: iText and PDFBox. iText is Affero GPL licensed, so it might not be suitable for your project. PDFBox is licensed under the Apache License.
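
Whichever route you pick, the "hidden" part has to come from somewhere. If you go the MIME export route, it is mostly the header block at the top of the raw message. Here is a minimal sketch in plain Java (no mail library) that collects those headers into a map. The class and method names are made up for illustration, and it assumes you already have the raw message text in a String. It does not decode encoded words or validate anything, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper, not part of any library.
final class MailHeaderSketch {

    /**
     * Reads the header block of a raw message (everything up to the first
     * empty line) and returns header name -> list of values in order of
     * appearance. Folded continuation lines (starting with space or tab)
     * are joined to the header they belong to. Repeated headers such as
     * Received keep every value. Names are kept exactly as written.
     */
    static Map<String, List<String>> parseHeaders(String rawMessage) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        String currentName = null;
        StringBuilder currentValue = new StringBuilder();

        for (String line : rawMessage.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // the first empty line ends the header block
            }
            char first = line.charAt(0);
            if ((first == ' ' || first == '\t') && currentName != null) {
                // continuation of the previous header: unfold it
                currentValue.append(' ').append(line.trim());
                continue;
            }
            store(headers, currentName, currentValue);
            currentValue.setLength(0);
            int colon = line.indexOf(':');
            if (colon <= 0) {
                currentName = null; // not a header line, skip it
                continue;
            }
            currentName = line.substring(0, colon).trim();
            currentValue.append(line.substring(colon + 1).trim());
        }
        store(headers, currentName, currentValue);
        return headers;
    }

    // Adds the header collected so far, if there is one.
    private static void store(Map<String, List<String>> headers,
                              String name, StringBuilder value) {
        if (name != null) {
            headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
        }
    }
}
```

Calling MailHeaderSketch.parseHeaders(raw) on a message gives you From, To, Date and every Received line, which is roughly the origin, destination and date/time information that clause (c) above asks for. Keeping the values in a list matters because Received usually appears several times.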

So now you have a PDF showing the visible part of your message (or document). Now it is time to add the (hidden) fields. In the sample below I used PDFBox and exported all fields short of RichText. The first possibility is to use custom PDF properties. You could also use XML meta data: store the individual fields as XML, store the entire document as round-trip-safe DXL, or store it converted to MIME (if it isn't MIME already) as a meta data property. Adobe's meta data specification is called XMP (Extensible Metadata Platform); it is based on XML, more precisely on RDF (of semantic web fame), and makes heavy use of the Dublin Core vocabulary. I like that since it is a good example of implementing an open standard.

Sample 1: Store all regular Notes fields as custom properties in PDF

import java.util.Vector;

import lotus.domino.Document;
import lotus.domino.Item;
import lotus.domino.NotesException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

/**
 * @param pDoc The PDF document that will receive the meta data
 * @param nDoc The Notes document the data is read from
 * @throws NotesException
 */
@SuppressWarnings("unchecked")
public void saveNoteToMeta(PDDocument pDoc, Document nDoc) throws NotesException {

    Vector<Item> allItems = nDoc.getItems();
    PDDocumentInformation info = pDoc.getDocumentInformation();

    for (Item curItem : allItems) {
        int type = curItem.getType();
        // TODO: exclude more items
        if (type != Item.RICHTEXT && type != Item.ATTACHMENT && type != Item.EMBEDDEDOBJECT) {
            // Copy the item as a custom entry into the PDF document information
            String itemName = curItem.getName();
            String itemValue = curItem.getValueString();
            info.setCustomMetadataValue(itemName, itemValue);
        }
    }
}
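
One thing to watch with Sample 1: custom properties live in the same PDF document information dictionary as the standard entries (Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate, Trapped). A mail document has a Subject item, so, as far as I can tell, copying it under its own name ends up in the standard Subject entry. That might be what you want, or not. Below is a small sketch of a name mapper that puts a prefix on colliding names. The class name and the "Notes." prefix are my own invention for illustration, not a convention of PDFBox or Notes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative helper, not part of PDFBox or the Notes API.
final class MetaKeySketch {

    // Standard entries of the PDF document information dictionary, lower-cased.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "title", "author", "subject", "keywords", "creator",
            "producer", "creationdate", "moddate", "trapped"));

    // Hypothetical prefix, pick whatever suits your archive.
    static final String PREFIX = "Notes.";

    /** True if the item name matches a standard entry, ignoring case. */
    static boolean isReserved(String itemName) {
        return RESERVED.contains(itemName.toLowerCase(Locale.ROOT));
    }

    /**
     * Returns the key to use for the custom property: the item name itself,
     * or the prefixed name if it would collide with a standard entry.
     */
    static String toMetaKey(String itemName) {
        return isReserved(itemName) ? PREFIX + itemName : itemName;
    }
}
```

In Sample 1 you would then pass MetaKeySketch.toMetaKey(itemName) instead of itemName. The comparison ignores case because Notes item names are not case sensitive while PDF keys are. Prefixing every item instead of only the colliding ones is an equally valid choice and makes a later rebuild simpler.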

I will, time permitting, publish more samples for meta data storage: Sample 2: store all regular Notes fields as XML in PDF, Sample 3: store the entire Note as XML in PDF, and Sample 4: store the Note as MIME entries.
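
To give an idea of the XML flavour in the meantime (this is not the promised Sample 2, just a rough sketch), here is how a flat map of field names and values could be turned into an XMP packet with plain Java. The namespace URI and the "notes" prefix are placeholders I made up, and so is the rule for turning item names like $MessageID into legal XML names. PDF/A, as far as I know, also wants custom XMP properties described in an extension schema, which this sketch does not do.

```java
import java.util.Map;

// Illustrative helper, not part of any library.
final class XmpPacketSketch {

    // Hypothetical namespace for the exported Notes fields.
    static final String NOTES_NS = "http://example.com/ns/notes/1.0/";

    /** Wraps the given fields as simple properties in one XMP packet. */
    static String toXmp(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n");
        sb.append("<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n");
        sb.append(" <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n");
        sb.append("  <rdf:Description rdf:about=\"\" xmlns:notes=\"")
          .append(NOTES_NS).append("\">\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = toXmlName(e.getKey());
            sb.append("   <notes:").append(name).append('>')
              .append(escape(e.getValue()))
              .append("</notes:").append(name).append(">\n");
        }
        sb.append("  </rdf:Description>\n");
        sb.append(" </rdf:RDF>\n");
        sb.append("</x:xmpmeta>\n");
        sb.append("<?xpacket end=\"w\"?>");
        return sb.toString();
    }

    /**
     * Made-up rule: keep ASCII letters, digits, '_', '-' and '.', replace
     * everything else with '_', and make sure the name does not start with
     * a digit, '-' or '.'. Two different item names can end up identical.
     */
    static String toXmlName(String itemName) {
        StringBuilder sb = new StringBuilder();
        for (char c : itemName.toCharArray()) {
            boolean ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '_' || c == '-' || c == '.';
            sb.append(ok ? c : '_');
        }
        char first = sb.length() == 0 ? '0' : sb.charAt(0);
        boolean validStart = (first >= 'A' && first <= 'Z')
                || (first >= 'a' && first <= 'z') || first == '_';
        if (!validStart) {
            sb.insert(0, '_');
        }
        return sb.toString();
    }

    /** Escapes the characters that would break XML element content. */
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The resulting string would then go into the metadata stream of the PDF document catalog. PDFBox has a PDMetadata class for that, but check the exact calls against the version you use.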
As usual YMMV

Posted by Stephan H Wissel on 26 January 2011 | Comments (3) | categories: Show-N-Tell Thursday Software

posted by Jasper Duizendstra on Wednesday 26 January 2011 AD:
Hi Stephan, How does the EML drag and drop functionality fit in this story? As far as I know there is no API yet, but it could function as a translation before generating the PDF.

posted by Warren Elsmore on Thursday 27 January 2011 AD:
Stephan,

Personally, for any of these types of request we'd always recommend looking at an enterprise archiving solution that handles all this for you - and makes sure that you *cannot* 'lose' anything accidentally :) It's an infrastructure problem, not a development requirement IMHO.

IBM makes one, so do Symantec and a number of other vendors. I have strong views as to which is best though :)

posted by Stephan H. Wissel on Thursday 27 January 2011 AD:
@Jasper: the D&D is just a representation of the API that has been there for a while. The challenge here is: MIME is not XML, so you have to hammer it into shape

@Warren: I concur. Nevertheless "we want PDF" is something that also applies to enterprise archival solutions. Storing the meta data inside the PDF (in addition to the archival database) has the huge advantage that, being self-contained, a collection of PDFs would "survive" a terminal destruction of the enterprise archival database. You simply rebuild it from the embedded meta data.

So in conclusion: an enterprise archival solution answers the question where to store the data, while my little PDF musings are more around what to store there.

stw