What document am I?
Determining the nature of a document
Last November I headed up a team* competing in Rocket Software’s inaugural Rocket.Build hackathon. We placed third with a solution that allows our document archiving product Arkivio to determine archiving strategies based on the nature of the document.
Arkivio allows administrators to set rules that determine when and how to archive documents on corporate file servers. These rules look at externalities of the document such as location, size, document type extension, and when the document was last created/modified/accessed. We wanted to demonstrate the ability to include rules that would say something like “if this document is a legal document, then archive it to a storage medium with a 30 year retention.”
To achieve this, we had to build a classifier that would inspect said document, and then return a measure of likelihood that the document was indeed a legal document.
A bag of words
A common approach to determining the nature of a document is what is called the “bag of words” concept. It’s very simple in its implementation. You:
1. Extract all words, and throw away common stop words such as “and,” “or,” “the,” etc.
2. Build a list of the frequency of each word.
3. Repeat this for many documents of each type and determine the frequencies per type.
Step 3 is called the training phase. When a new document is assessed, steps 1 and 2 are repeated, and the resulting frequencies are compared with the frequencies for each category of document. A classifier such as a neural network can then calculate the probability that the document belongs to each of the previously analyzed categories.
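To make the three steps concrete, here is a minimal sketch in Python. It is not what we built for the hackathon: the stop word list, the sample sentences and all function names are made up for illustration, and a plain cosine similarity stands in for the neural network classifier mentioned above.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is", "was", "for"}

def word_frequencies(text):
    """Steps 1 and 2: extract words, drop stop words, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def train(labelled_documents):
    """Step 3: accumulate word frequencies per category."""
    per_category = {}
    for category, text in labelled_documents:
        per_category.setdefault(category, Counter()).update(word_frequencies(text))
    return per_category

def cosine_similarity(a, b):
    """Similarity between two frequency tables, from 0.0 to 1.0."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score(text, per_category):
    """Compare a new document's frequencies with each trained category."""
    freqs = word_frequencies(text)
    return {cat: cosine_similarity(freqs, table) for cat, table in per_category.items()}

# Made-up training sentences, two per category.
training_set = [
    ("legal", "The defendant appeared before the court and the court awarded damages."),
    ("legal", "The plaintiff asked the court for damages from the defendant."),
    ("medical", "The patient received a diagnosis and the treatment reduced symptoms."),
    ("medical", "Symptoms of the patient improved after treatment."),
]
model = train(training_set)
scores = score("The court ordered the defendant to pay damages.", model)
print(max(scores, key=scores.get))  # prints "legal"
```

The similarity scores are not probabilities. Turning them into probabilities is the job of the trained classifier.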
This is an extremely simple algorithm to implement. There are existing toolkits out there that already do this, so coding is minimal. Memory requirements are also reasonable for this approach. The Oxford English Dictionary, the gold standard when it comes to the English language, contains about 170,000 words (in current use). Even when analyzing hundreds of documents, the number of actual words counted would be an order of magnitude lower.
But there are a few issues. One is that you need to train the system (steps 1-3) with a large number of documents for each category. Otherwise a new document may belong to one of these categories yet contain words that were rarely or never encountered during training. This is an issue in our scenario, as an enterprise wanting to introduce a new category may struggle to find enough documents to train on.
A second issue is that documents often suffer from “cross-contamination.” For instance, court documents from medical malpractice cases contain lengthy sections full of medical terminology, which degrades the ability to distinguish between categories.
Last, and in part related to the previous issue, it is a very wasteful algorithm: by counting words without context, it throws away semantic information that could be very valuable for improving classification performance. That might be countered in part by counting not just single word frequencies but also two-word (“supreme court” vs. “supreme” and “court”) or three-word combinations (these are called bigrams and trigrams respectively). But this has two detrimental consequences: it hugely inflates the number of entries being tracked and it slows down processing considerably.
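A short Python sketch of counting word combinations follows. The sentence and the helper name `ngrams` are invented for illustration, and stop words are kept so that the word order stays intact.

```python
import re
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words as a single string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Made-up example sentence.
text = "the supreme court ruled that the supreme court of the state had erred"
words = re.findall(r"[a-z]+", text.lower())

single_counts = Counter(ngrams(words, 1))
pair_counts = Counter(ngrams(words, 2))
triple_counts = Counter(ngrams(words, 3))

# "supreme court" is now counted as one entry of its own.
print(pair_counts["supreme court"])  # prints 2
# Number of distinct entries tracked at each level.
print(len(single_counts), len(pair_counts), len(triple_counts))
```

On a one-sentence sample the three tables are of similar size. On a real corpus the number of distinct two- and three-word combinations grows far faster than the vocabulary, which is where the inflation comes from.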
So the bag of words concept is attractive because it is simple to implement, but it is wasteful. The approach is still useful in many circumstances, but we felt that we could do considerably better. After all, the human mind does not recognize the nature of a document by counting word frequencies. It does so by deducing semantic meaning, which is based on pattern recognition, something the human mind is extremely efficient at.
We know that we are light years away from competing with the human mind, but we believed that taking a step in this direction would improve our solution.
A bag of concepts
Luckily Rocket Software has a product called Aerotext. Aerotext was developed about 15 years ago to deal with detecting patterns in natural language texts. It’s rule-based, allowing the definition of patterns to identify and extract.
Rule-based natural language systems fell out of favor quite a while ago. They were never very effective in areas such as automated text translation, where their performance in determining semantic meaning was surpassed by statistical translation systems (think Google Translate). But in our case, we did not need to extract the actual meaning of the document. What we wanted was to extract concepts that were domain specific. In other words, we wanted to improve the bag of words approach described above by introducing a bag of concepts approach instead (which would still include bag of words to a minor extent).
Aerotext comes with an extensive rule base in the English language. This rule base does a respectable job of identifying and extracting common types of entities, such as references to people, companies, addresses, dates, events. The rule base provides a good foundation to write new rules that are specific to the concepts we wanted to address.
An example would be that a legal document commonly has phrases such as “John Doe, Defendant.” In Aerotext we would then have a rule that matches [Person], “DEFENDANT”: any person reference (recognized using the existing built-in rule base) followed by the literal string “DEFENDANT.” Detecting such a pattern carries more informational value than detecting the word “DEFENDANT” alone.
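The idea can be approximated outside Aerotext. The Python sketch below is not Aerotext rule syntax: a deliberately crude regular expression stands in for the built-in [Person] entity, and the sample text and the concept names are invented for illustration.

```python
import re

# Crude stand-in for a [Person] entity: two or three capitalized words.
# A real rule base recognizes people far more robustly than this.
PERSON = r"[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}"

# [Person], "DEFENDANT": a person reference followed by the literal word.
PERSON_DEFENDANT = re.compile(rf"(?P<person>{PERSON}), (?i:defendant)\b")
BARE_DEFENDANT = re.compile(r"\bdefendant\b", re.IGNORECASE)

def count_concepts(text):
    """Count the pattern and the bare word separately."""
    return {
        "person_defendant": len(PERSON_DEFENDANT.findall(text)),
        "defendant_word": len(BARE_DEFENDANT.findall(text)),
    }

# Made-up sample text.
sample = (
    "John Doe, Defendant, moved to dismiss. The defendant was absent. "
    "Counsel for Jane Roe, DEFENDANT, objected."
)
print(count_concepts(sample))
# prints {'person_defendant': 2, 'defendant_word': 3}
```

The bare word appears three times, but only two of those occurrences carry the stronger signal of a named party to a case.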
Building a rule base
For Rocket.Build we decided to stick with three categories of documents: medical papers, legal documents and real estate advertisements. The first two are typified by specific patterns of wording, giving rise to a domain-specific variant of the English language. As a contrast we added real estate advertisements, which are less subject to formal constraints on the use of language (although they are prone to a bit of “flowery” language).
For the medical domain we had the advantage that there is a large set of medical terms. Aerotext supports the use of lists of specific terms, so we could load it with extracts of common terms. In legal practice this set of terms is considerably smaller, but the domain does have very specific formalisms that can be used. Courts and participants are identified in a particular manner. Key phrases such as “damages to be awarded” are common, albeit in multiple variations.
In real estate advertisements we found common terminology to some extent (after all, it is about selling houses and apartments), but the domain gave great opportunity to identify common patterns, such as listing “appliances (insert your brand here)” in “kitchens” and phrases such as “just a (short walk/stroll/hop skip and jump) away from (insert your local amenity here, such as shopping, pools, beaches etc.).”
There is a clear case of diminishing returns when identifying phrases. For Rocket.Build we looked at the most common ones. That permitted us to write a relatively small set of rules, with about 50-100 match patterns per document type, made possible by the fact that a match pattern will reuse existing patterns or word list occurrences, hence greatly reducing the amount of work.
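The reuse of word lists and sub-patterns is what keeps the rule count low. The sketch below illustrates that in Python with regular expressions rather than Aerotext rules. The word lists, concept names and sample sentence are all hypothetical.

```python
import re
from collections import Counter

def alternation(terms):
    """Turn a word list into a regex alternation, longest term first."""
    ordered = sorted(terms, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"

# Hypothetical word lists; real ones would hold many more entries.
AMENITIES = alternation(["shopping", "pools", "beaches"])
SHORT_TRIP = alternation(["short walk", "stroll", "hop skip and jump"])
MEDICAL_TERMS = alternation(["diagnosis", "prognosis", "biopsy"])

# Each rule is (domain, concept, pattern). Patterns reuse the lists above,
# so one list can feed several rules.
RULES = [
    ("real_estate", "near_amenity", rf"just a {SHORT_TRIP} away from {AMENITIES}"),
    ("real_estate", "amenity_term", rf"\b{AMENITIES}\b"),
    ("medical", "medical_term", rf"\b{MEDICAL_TERMS}\b"),
    ("legal", "damages_phrase", r"\bdamages (?:to be )?awarded\b"),
]
COMPILED = [(d, c, re.compile(p, re.IGNORECASE)) for d, c, p in RULES]

def concept_counts(text):
    """Count how often each concept occurs, keyed by (domain, concept)."""
    counts = Counter()
    for domain, concept, pattern in COMPILED:
        counts[(domain, concept)] += len(pattern.findall(text))
    return counts

ad = "Modern kitchen, and just a short walk away from shopping and beaches."
counts = concept_counts(ad)
print(counts[("real_estate", "near_amenity")])  # prints 1
print(counts[("real_estate", "amenity_term")])  # prints 2
```

The resulting counts per domain are the “bag of concepts” that replaces the raw word frequencies as input to the classifier.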
Building a demonstration
For a hackathon, rules are quite liberal with regard to what you build. You are demonstrating a concept, not a product. To implement our concept, we built the following components:
- Arkivio was extended to include conditions that match a category type against a probability threshold, and to send Aerotext a link to any document it wanted inspected.
- Aerotext was run in a daemon mode which would watch a directory, process any files it would find there and return a count of frequencies of concepts found for each domain. Concepts ranged from identified specialty words to specific patterns of phrase.
- In Azure Machine Learning we trained a predictor that would take the concept frequencies as an input to determine the probabilities of the category of document.
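The threshold condition on the Arkivio side can be pictured with a small Python sketch. It is not Arkivio code (that part was written in C++). The class, the function, the threshold values and the action labels are hypothetical, and the probabilities are made-up stand-ins for what the trained predictor would return.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CategoryRule:
    category: str     # document category the rule applies to
    threshold: float  # minimum probability needed for the rule to fire
    action: str       # hypothetical label for the archiving strategy

def matching_actions(probabilities, rules):
    """Return the action of every rule whose category meets its threshold."""
    return [
        rule.action
        for rule in rules
        if probabilities.get(rule.category, 0.0) >= rule.threshold
    ]

rules = [
    CategoryRule("legal", 0.80, "archive to 30-year retention storage"),
    CategoryRule("medical", 0.90, "archive to medical records storage"),
]

# Made-up predictor output for a single document.
probabilities = {"legal": 0.91, "medical": 0.06, "real_estate": 0.03}
print(matching_actions(probabilities, rules))
# prints ['archive to 30-year retention storage']
```

Keeping the threshold in the rule rather than in the predictor lets an administrator decide how certain the classification must be before a document is moved.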
Overall, the implementation effort was amazingly small. We counted the amount of code we created and came up with:
- Arkivio: 200 lines of C++
- Aerotext: about 500 lines of rules (excluding word lists)
- Aerotext daemon: 100 lines of Java to build a directory watcher
- Flask: about 30 lines of Python, using the Flask web service framework, to tie it all together
We set out to demonstrate the ability to identify the nature of a document. To do this we took the default bag of words approach and updated it to a bag of concepts approach. The cost was that we had to write specific rules to identify concepts for each domain, as opposed to simply training a generic system with many existing documents. We felt that this was worth the extra effort for the efficiency of the implementation as well as for recall and correctness.
The writing of rules to identify specific patterns could conceivably be automated with its own learning approach.
* Amedee Poitier, Jie Wei, Stephan Meyn, and Zhi Li
Aerotext also comes with rule bases for several other languages.