Tagging data and standardisation
Calimere. Calimere Point.
CPRA. calimere point risk advisory. CP.
Identifying and tagging the data you want, and their numerous variants, is a prosaic but difficult task.
Tagging text data
The ability to identify, extract and standardise the data variants you need from text descriptions should be commonplace, right? It is not when you are looking for scores of data components across thousands of data rows.
The problem touches many day-to-day data sets: client names, client addresses, inventory data, product databases, invoice itemisation and marketing data.
The spectrum of data variants and inconsistencies
Data in text form is highly vulnerable to inconsistency, as it is often entered in a free-form text field. Even relatively structured data, such as country names (which also have an ISO standard), can have several variants:
USA; US; US of A; U.S.A.; United States; United States of America
Holland; The Netherlands; Netherlands
UK; United Kingdom; GB
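As a minimal sketch of what mapping such variants to a standard could look like, the snippet below looks up each variant in a hand-built table and returns an ISO 3166-1 alpha-2 code. The table `COUNTRY_VARIANTS` and the function `standardise_country` are hypothetical names used for illustration only; a real table would be far larger.

```python
# Hypothetical lookup table: cleaned variant -> ISO 3166-1 alpha-2 code.
COUNTRY_VARIANTS = {
    "usa": "US", "us": "US", "us of a": "US",
    "united states": "US", "united states of america": "US",
    "holland": "NL", "the netherlands": "NL", "netherlands": "NL",
    "uk": "GB", "united kingdom": "GB", "gb": "GB",
}


def standardise_country(raw):
    """Return the ISO code for a known variant, or None if it is not recognised."""
    # Drop full stops, lower-case and collapse whitespace before the lookup.
    key = " ".join(raw.replace(".", "").lower().split())
    return COUNTRY_VARIANTS.get(key)
```

Unrecognised values return `None` rather than a guess, so they can be routed to a person for review and then added to the table.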
And it gets worse with client names.
JPMike Charlie & Co
JP Mike Charlie
21st Charlie Foxtrot
Twenty-First Charlie Foxtrot
Mr. John Smith
Mister John Smith
John T Smith mag.
Mr Chi pa Choi
Mr Choi Chi-pa
pa Choi Chi
CPRA private limited company
CPRA Société anonyme
CPRA Societe Anonyme
Nuit-Saint-Georges 1er Cru ‘Les Vaucrains’, Henri Gouges 2013, Half Bottle
Domaine Henri Gouges NSG PC Les Vaucrains 37.5 cl
Nuits St Georges premier cru Les Vaucrains, Dom. Henri Gouges
486589 Jean Heaven Skinny Blue 3232
AW17/NEW/486589/JN Heaven Skinny Blu 32-32
Heaven Skinny Blue 32 JN H486589
Web Banner Advertising / Keyword Searches
Wall Street Journal
Wall St Journal
The Financial Times
For physical inventory, even something as simple as volume can be tricky:
Adaptive text cleaning
The solution is to i) identify and extract the data variants, ii) map them to an agreed data standard and iii) thereby create a structured data model.
However, a normal find and replace is too blunt a tool: context is very important.
Mr Sam Smith
Smith Industries SA
Norwegian Company ASA
ch – financial services – 300 x 600
chapter and verse – 300 x 600
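To illustrate why context matters, the sketch below contrasts a blunt find and replace with a rule that only treats "SA" or "ASA" as a legal form when it is the final standalone token of a name. The function names and the rule itself are assumptions made for this example, not a description of any particular product.

```python
import re

# Assumed rule: a legal-form suffix is the last token of the name.
SUFFIX_RE = re.compile(r"\s+(ASA|SA)\s*$")


def naive_remove(name):
    """Blunt find and replace: deletes every 'sa', wherever it occurs."""
    return re.sub("sa", "", name, flags=re.IGNORECASE)


def tag_legal_form(name):
    """Split a name into (base name, legal form); legal form is None if absent."""
    match = SUFFIX_RE.search(name)
    if match is None:
        return name.strip(), None
    return name[:match.start()].strip(), match.group(1)
```

With these definitions, `naive_remove("Mr Sam Smith")` damages the personal name, whereas `tag_legal_form` leaves it intact and only extracts the suffix from "Smith Industries SA" and "Norwegian Company ASA".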
For wine buffs:
Stag’s Leap <> Stags’ Leap <> Stags Leap
And Saint Julien is a famous Bordeaux wine region, right?
Chateau Leoville Poyferré, 2Éme Cru Classé, Saint-Julien Bordeaux
Chateau Beychevelle Grand Cru Classe St Julien
Cave Beaujolaise de Saint-Julien
Prieure St-Julien Côte du Rhône
CPRA’s Data Tagger
The engine identifies and extracts attributes that can be used to uniquely define each data row.
Attributes are moved into the appropriate fields of the structured data model.
Data is cleaned for case, accents and punctuation as a matter of course.
A key feature of the data tagger is its flexibility: it is driven by parameters that business users can access, so the tagger stays relevant as new data comes in, with no need to access the engine itself.
Automated, on-premises or on CPRA’s cloud.
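The routine cleaning for case, accents and punctuation mentioned above could look something like the following. This is a generic sketch using Python's standard library; the function name `normalise` is hypothetical and the exact rules (for example, replacing punctuation with a space) are assumptions.

```python
import re
import unicodedata


def normalise(text):
    """Lower-case, strip accents and replace punctuation with spaces."""
    # Decompose accented characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, underscore or whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    # Lower-case and collapse runs of whitespace.
    return " ".join(no_punct.lower().split())
```

After this step, "CPRA Société anonyme" and "CPRA Societe Anonyme" compare as equal, which makes the later matching stages simpler.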
| Name | Stag’s Leap – Cask 23 Cabernet Sauvignon 2009 | Stags’ Leap Winery Ne Cede Malis Syrah | Pine Ridge Stags Leap Cabernet Sauvignon |
| Producer | Stag’s Leap Wine Cellars | Stags’ Leap Winery | Pine Ridge |
| Brand | Cask 23 | Ne Cede Malis | Stags Leap |
| Grape | Cabernet Sauvignon | Shiraz | Cabernet Sauvignon |
| Sub_region01 | North Coast | North Coast | North Coast |
| Sub_region02 | Napa Valley | Napa Valley | Napa Valley |
| Sub_region03 | Stags Leap District | Stags Leap District | Stags Leap District |
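A parameter-driven tagger of this kind can be sketched in a few lines. The parameter table below is hypothetical and covers only two of the fields in the table above; the point is that the variants live in data that a business user could edit, while the function `tag_row` stays unchanged. It is an illustration under those assumptions, not CPRA's actual engine.

```python
# Hypothetical parameters: field -> list of (variant to look for, standard value).
PARAMETERS = {
    "Producer": [
        ("stag’s leap", "Stag’s Leap Wine Cellars"),
        ("stags’ leap", "Stags’ Leap Winery"),
        ("pine ridge", "Pine Ridge"),
    ],
    "Grape": [
        ("cabernet sauvignon", "Cabernet Sauvignon"),
        ("syrah", "Shiraz"),
        ("shiraz", "Shiraz"),
    ],
}


def tag_row(description, parameters=PARAMETERS):
    """Return a structured record for one free-text description."""
    text = description.lower()
    record = {"Name": description}
    for field, variants in parameters.items():
        # The first variant found in the text decides the field value.
        record[field] = next(
            (standard for variant, standard in variants if variant in text),
            None,
        )
    return record
```

Adding a newly seen spelling then means adding one line to `PARAMETERS`, with no change to the code.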
Creating a structured, consistent data set then unlocks the data.
Attributes can be used to perform a structured matching process to identify any duplicates.
Hierarchies can then be imposed, and removing duplicates finally allows accurate data consolidation and reconciliation.
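Once rows have been tagged, a simple form of structured matching is to group rows that share the same values for a chosen set of attributes. The sketch below assumes the rows are already tagged; the field names and sample records are hypothetical, loosely based on the wine descriptions earlier on this page.

```python
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Group row indexes by their key attributes; keep groups with more than one row."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        key = tuple(record.get(field) for field in key_fields)
        groups[key].append(index)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical tagged output for four free-text rows.
rows = [
    {"Producer": "Henri Gouges", "Vineyard": "Les Vaucrains"},
    {"Producer": "Henri Gouges", "Vineyard": "Les Vaucrains"},
    {"Producer": "Henri Gouges", "Vineyard": "Les Vaucrains"},
    {"Producer": "Pine Ridge", "Vineyard": None},
]
duplicates = find_duplicates(rows, ("Producer", "Vineyard"))
```

Which attributes make up the key is a business decision: adding vintage or bottle size to `key_fields` would treat different formats of the same wine as distinct items.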