Matching Account Names – Panama Papers
Classic dirty data matching issue
Beyond the high-profile casualties, our clients have been faced with the apparently easy question, “How many of my clients are in the 11.5 million files?” Because the IRS wants to know. And so do my regulator(s).
Finding one name in the Panama List is relatively easy. What if you have to check one thousand names? Or ten thousand? Or a million?
And do it quickly.
CPRA developed an analytical application containing a series of rules that address the nuances in the data, carry out a series of matching processes of differing strength and categorise the resulting matches.
In one case the application was reviewed by a Big Four auditor, who came back with few comments.
As is common with many of CPRA’s applications, we use modern data tools that represent the logical data flow as a living picture: it aids understanding and auditing, and is highly flexible.
So far we have only discussed name matching, though when we speak about Client Data more generally, we can apply similar matching processes to data such as address fields and phone numbers.
The Data Picture
Superficially, matching a company’s client list (we call this the “Private List”) against the Panama Papers is a straightforward exercise. However, a number of practical issues immediately become apparent once the data is examined.
Both the Panama Papers and the Private Lists have their own data nuances. Critically, the data nuances may not be consistent within a given data set nor between the data sets. The data needs to be normalised to ensure we are matching names on a consistent basis.
Any matching process has to take these nuances into account. The traditional approach is some form of “Fuzzy Matching”, but fuzzy matching comes with its own issues.
The amount of fuzziness will be determined by a number of criteria:
- Balance between false positives and false negatives
- The increased run time from the fuzzy process
- Black box (eg numerical threshold based) versus rules based
It’s heuristic, or put more prosaically, it’s try, see and modify. It is important for clients (and their regulators and tax authorities) to understand that such an exercise is empirical and has no right answer.
Sophisticated clients recognise that they are accepting and agreeing to the logic of a process and there is not a black and white answer to all combinations.
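To make the trade-off concrete, here is a minimal sketch of matching in tiers of differing strength. It is not CPRA's application: the function name `classify_match`, the category labels and the default threshold of 0.9 are all assumptions chosen for illustration, and the similarity score comes from Python's standard `difflib` rather than any particular matching product. The inputs are assumed to be cleaned, lower-case names.

```python
from difflib import SequenceMatcher


def classify_match(a: str, b: str, fuzzy_threshold: float = 0.9) -> str:
    """Categorise a pair of cleaned names by match strength.

    The labels and the 0.9 default are illustrative assumptions only.
    """
    # Strongest tier: the two strings are identical.
    if a == b:
        return "exact"
    # Rules-based tier: same words, different order.
    if sorted(a.split()) == sorted(b.split()):
        return "reordered"
    # Threshold-based tier: a similarity score between 0 and 1.
    # This is the expensive step, so it only runs when the cheap tests fail.
    if SequenceMatcher(None, a, b).ratio() >= fuzzy_threshold:
        return "fuzzy"
    return "none"


print(classify_match("james bond", "james bond"))
print(classify_match("bond james", "james bond"))
print(classify_match("james bond", "james bonde"))
print(classify_match("james bond", "ernst blofeld"))
```

Raising `fuzzy_threshold` reduces false positives at the cost of more false negatives, and lowering it does the reverse. Where to set it is exactly the try, see and modify decision described above.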
And then there are local data privacy laws. For some countries and business units it is highly unlikely that client data is allowed to cross borders. This has a significant practical effect: the matching process cannot be co-ordinated on a powerful central server, which is in direct conflict with an exercise that is complex, incorporates computationally intensive fuzzy matching and runs on a large data set.
If the matching process needs to be deployed at several different locations, differences and limitations of hardware and IT expertise need to be considered.
Punctuation (including full stops for abbreviations, commas to separate names, and hyphens) needs to be cleaned. And then accents (or more generally diacritics) need to be considered. And diacritics shine a spotlight on character encoding: it’s not always going to be UTF-8.
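The cleaning steps just described can be sketched with Python's standard library. This is an illustration under stated assumptions, not the rules CPRA applied: the helper names `decode_bytes` and `normalise_name`, the list of candidate encodings, and the choice to strip diacritics outright (rather than, say, transliterating ü to ue) are all hypothetical.

```python
import string
import unicodedata


def decode_bytes(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in turn (the list is an assumption)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1")


# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalise_name(name: str) -> str:
    """Strip diacritics and punctuation, fold case, collapse whitespace."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = no_marks.translate(_PUNCT_TO_SPACE)
    # casefold() is a more aggressive lower(); split/join collapses spaces.
    return " ".join(no_punct.casefold().split())


print(normalise_name("Müller-Lüdenscheidt, J."))
print(normalise_name(decode_bytes("José".encode("cp1252"))))
```

Applying the same function to both the Private List and the Panama Papers names is what puts the two data sets on a consistent basis before any matching is attempted.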
Company types and suffixes: they can come in multiple flavours, and could be abbreviated or written in full.
GmbH or Gesellschaft mit beschränkter Haftung? Which also nicely comes with multiple cases and an umlaut.
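One way to handle this is a lookup that maps every known spelling of a company type to a single canonical form. The sketch below assumes the input has already been lower-cased and stripped of punctuation and diacritics (so "beschränkter" appears as "beschrankter"). The `SUFFIX_VARIANTS` table and the function `split_suffix` are hypothetical and deliberately tiny; a real table would be far longer and jurisdiction-specific.

```python
from typing import Optional, Tuple

# Hypothetical lookup: every known variant maps to one canonical suffix.
SUFFIX_VARIANTS = {
    "gesellschaft mit beschrankter haftung": "gmbh",
    "gmbh": "gmbh",
    "limited": "ltd",
    "ltd": "ltd",
    "incorporated": "inc",
    "inc": "inc",
}


def split_suffix(cleaned_name: str) -> Tuple[str, Optional[str]]:
    """Return (base name, canonical suffix), or (name, None) if none is found."""
    # Check the longest variants first so the full-length form wins
    # over any shorter variant that happens to end the same way.
    for variant in sorted(SUFFIX_VARIANTS, key=len, reverse=True):
        if cleaned_name.endswith(" " + variant):
            base = cleaned_name[: -(len(variant) + 1)]
            return base, SUFFIX_VARIANTS[variant]
    return cleaned_name, None


print(split_suffix("acme holdings gesellschaft mit beschrankter haftung"))
print(split_suffix("acme holdings gmbh"))
print(split_suffix("james bond"))
```

Separating the suffix from the base name lets the two be compared independently, so that a difference only in how the company type is written does not prevent a match.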
Word order: what would 007 be? Bond; James Bond; Mr (optional full stop) Bond or just Agent Bond? We cannot rely upon names being in a consistent order.
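A simple way to remove the dependence on word order is to compare names on a sorted-token key. The sketch below is an assumption for illustration: the function `order_free_key` and the `TITLES` set are hypothetical, and the input is assumed to be lower-case with punctuation already removed.

```python
# Hypothetical set of titles to ignore; a real list would be much longer.
TITLES = {"mr", "mrs", "ms", "dr", "agent"}


def order_free_key(cleaned_name: str) -> str:
    """Drop titles and sort the remaining words to get an order-independent key."""
    tokens = [tok for tok in cleaned_name.split() if tok not in TITLES]
    return " ".join(sorted(tokens))


print(order_free_key("bond james"))
print(order_free_key("james bond"))
print(order_free_key("mr bond"))
```

"Bond James" and "James Bond" now share a key, while "Mr Bond" reduces to the surname alone. A surname-only hit is a much weaker match than a full-name hit, which is one reason to categorise matches by strength rather than treat them all alike.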
A robust and repeatable regulatory response – delivered fast
The Panama Papers data leak led to highly visible regulatory and tax authority scrutiny.
A solution to a complex data issue that superficially looks trivial had to be delivered quickly and robustly, sometimes on desktop computers in local offices.
Which is what CPRA did.