
Tagging data and standardisation

Calimere Point Risk Advisory (CPRA)

Identifying and tagging the data you want, and their numerous variants, is a prosaic but difficult task.

Tagging text data

The ability to identify, extract and standardise the data variants you need from text descriptions should be commonplace, right?  It is not when you are looking for scores of data components across thousands of data rows.

The problem covers many day-to-day things: client names; client addresses; inventory data; product databases; invoice itemisation and marketing data.

The spectrum of data variants and inconsistencies

Data in text form is highly vulnerable to inconsistency, as it is often entered in a free-form text field.  Even relatively structured data, such as country names (which also have an ISO standard), can have several variants.

USA; US; US of A; U.S.A.; United States; United States of America

Holland; The Netherlands; Netherlands

UK; United Kingdom; GB
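As an illustration only, variants like these could be reconciled with a lookup table keyed on a lightly cleaned form of the text. The sketch below maps the examples above to ISO 3166-1 alpha-2 codes; the table `COUNTRY_VARIANTS` and the function `country_code` are hypothetical names, and a real table would need to be far longer.

```python
import re

# Hypothetical lookup: cleaned variant -> ISO 3166-1 alpha-2 code.
COUNTRY_VARIANTS = {
    "usa": "US", "us": "US", "us of a": "US",
    "united states": "US", "united states of america": "US",
    "holland": "NL", "the netherlands": "NL", "netherlands": "NL",
    "uk": "GB", "united kingdom": "GB", "gb": "GB",
}

def country_code(raw):
    """Return the ISO code for a known variant, or None if unrecognised."""
    key = raw.replace(".", "")                      # "U.S.A." -> "USA"
    key = re.sub(r"\s+", " ", key).strip().lower()  # collapse spaces, fold case
    return COUNTRY_VARIANTS.get(key)
```

Returning None for anything unrecognised, rather than guessing, leaves unknown variants visible so they can be reviewed and added to the table.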

And it gets worse with client names.

JPMike Charlie & Co

JP Mike

JP Mike Charlie

JPMC Bank

JPMike

21st Charlie Foxtrot

Foxtrot

Twenty-First Charlie Foxtrot

And people

Mr. John Smith

Smith, John

Mister John Smith

John T Smith mag.

Mr Chi pa Choi

Mr Choi Chi-pa

pa Choi Chi

Company suffixes

CPRA Ltd

CPRA Limited

CPRA private limited company

CPRA L.T.D.

CPRA SA

CPRA S.A.

CPRA Société anonyme

CPRA Societe Anonyme

Inventory

Nuit-Saint-Georges 1er Cru ‘Les Vaucrains’, Henri Gouges 2013, Half Bottle

Domaine Henri Gouges NSG PC Les Vaucrains 37.5 cl

Nuits St Georges premier cru Les Vaucrains, Dom. Henri Gouges

486589 Jean Heaven Skinny Blue 3232

AW17/NEW/486589/JN Heaven Skinny Blu 32-32

Heaven Skinny Blue 32 JN  H486589

Web Banner Advertising / Keyword Searches

Wall Street Journal

Wall St Journal

WSJ

The Financial Times

FT

F.T.

Financial Times

For physical inventory, even something as simple as volume can be tricky:

0.75 L

75 cl

750 ml

bottle

0.75 liter

.75
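One possible way to bring these onto a single standard is to convert every variant to millilitres. The sketch below is not CPRA's method; it assumes that "bottle" means 750 ml and that a bare number such as ".75" is in litres, both of which are assumptions a business user would need to confirm. `volume_ml` and the two tables are hypothetical names.

```python
import re

# Millilitres per unit; the spellings listed are assumptions for illustration.
UNIT_TO_ML = {"l": 1000, "liter": 1000, "litre": 1000, "cl": 10, "ml": 1}
# Assumed defaults: a "bottle" is 750 ml and a "half bottle" is 375 ml.
NAMED_SIZES_ML = {"bottle": 750, "half bottle": 375}

# A number (with optional leading digits) followed by an optional unit.
VOLUME_RE = re.compile(r"^(\d*\.?\d+)\s*([a-z]*)$")

def volume_ml(raw):
    """Return the volume in millilitres, or None if it cannot be parsed."""
    text = raw.strip().lower()
    if text in NAMED_SIZES_ML:
        return NAMED_SIZES_ML[text]
    match = VOLUME_RE.match(text)
    if not match:
        return None
    number, unit = match.groups()
    unit = unit or "l"  # assumption: a bare number is in litres
    if unit not in UNIT_TO_ML:
        return None
    return round(float(number) * UNIT_TO_ML[unit])
```

Under these assumptions all six variants listed above resolve to the same value, 750.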

Adaptive text cleaning

The solution is to i) identify and extract the data variants, ii) map them to an agreed data standard and iii) thereby create a structured data model.

However, a normal find and replace is too blunt a tool: context is very important.

Mr Sam Smith

Smith Industries SA

NASA

Norwegian Company ASA

ch – financial services – 300 x 600

chapter and verse – 300 x 600
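To make the "SA" examples concrete: a blind replace of the letters "SA" would also hit "Sam", "NASA" and "ASA". A minimal sketch of a more careful approach, assuming a legal-form suffix only counts when it is a whole token at the end of the name, is shown below. The pattern table and `split_legal_form` are hypothetical, and the suffix spellings are taken from the examples in this article.

```python
import re

# Hypothetical table of legal-form suffix patterns -> standard form.
# Each pattern requires whitespace before the suffix and the end of the
# string after it, so "Sam", "NASA" and "ASA" are left alone.
SUFFIX_PATTERNS = [
    (re.compile(r"\s+(S\.?A\.?|Soci[eé]t[eé] anonyme)$", re.IGNORECASE), "SA"),
    (re.compile(r"\s+(L\.?T\.?D\.?|Limited|private limited company)$",
                re.IGNORECASE), "Ltd"),
]

def split_legal_form(name):
    """Return (base name, standard suffix), or (name, None) if no suffix."""
    for pattern, standard in SUFFIX_PATTERNS:
        match = pattern.search(name)
        if match:
            return name[:match.start()], standard
    return name, None
```

With this sketch, "Smith Industries SA" splits into "Smith Industries" and "SA", while "Mr Sam Smith", "NASA" and "Norwegian Company ASA" come back unchanged.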

For wine buffs:

Stag’s Leap <> Stags’ Leap <> Stags Leap

And Saint Julien is a famous Bordeaux wine region, right?

Chateau Leoville Poyferré, 2Éme Cru Classé, Saint-Julien  Bordeaux

Chateau Beychevelle Grand Cru Classe St Julien

Not always:

Cave Beaujolaise de Saint-Julien

Prieure St-Julien Côte du Rhône
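The Saint-Julien case can be sketched as a default reading that other words in the same description are allowed to override. The rules below are illustrative assumptions covering only the four examples above, not a complete wine gazetteer; `tag_region`, `fold` and `CONTEXT_CUES` are hypothetical names.

```python
import re
import unicodedata

def fold(text):
    """Lower-case and strip accents so 'Côte' matches 'cote'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

SAINT_JULIEN = re.compile(r"\b(saint|st)[ -]julien\b")

# Hypothetical context cues that override the default Bordeaux reading.
CONTEXT_CUES = [
    (re.compile(r"\bbeaujolais"), "Beaujolais"),
    (re.compile(r"\bcotes? du rhone\b"), "Rhône"),
]

def tag_region(description):
    """Return a region for a Saint-Julien mention, or None if there is none."""
    text = fold(description)
    if not SAINT_JULIEN.search(text):
        return None
    for cue, region in CONTEXT_CUES:
        if cue.search(text):
            return region
    return "Bordeaux"  # assumed default when no other cue is present
```

The design choice here is that the same token yields different tags depending on its neighbours, which is exactly what a plain find and replace cannot do.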

CPRA’s Data Tagger

The engine identifies and extracts attributes that can be used to uniquely define each data row.

Attributes are moved into the appropriate fields of the structured data model.

Data is cleaned for case, accents and punctuation as a matter of course.
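A minimal sketch of this kind of cleaning, using only the Python standard library, might look as follows. It is an assumption about what such a step could do, not a description of the CPRA engine; `clean` is a hypothetical name.

```python
import re
import unicodedata

def clean(text):
    """Fold case, strip accents and normalise punctuation and spacing."""
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Drop full stops and apostrophes outright: "S.A." -> "SA".
    no_dots = re.sub(r"[.'’]", "", no_accents)
    # Turn any other punctuation into a space: "St-Julien" -> "St Julien".
    spaced = re.sub(r"[^\w\s]", " ", no_dots)
    # Collapse runs of whitespace and fold case.
    return re.sub(r"\s+", " ", spaced).strip().lower()
```

Note that this step deliberately discards detail: it would make "Stag’s Leap" and "Stags’ Leap" identical, so any context-sensitive tagging that depends on such differences would need to run on the original text first.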

A key feature of the data tagger is its flexibility: it is driven by parameters that business users can access, so the tagger stays relevant as new data comes in without anyone having to modify the engine.

Automated, on-premises or on CPRA’s cloud.

Type | Data 1 | Data 2 | Data 3
Name | Stag’s Leap – Cask 23 Cabernet Sauvignon 2009 | Stags’ Leap Winery Ne Cede Malis Syrah | Pine Ridge Stags Leap Cabernet Sauvignon
Producer | Stag’s Leap Wine Cellars | Stags’ Leap Winery | Pine Ridge
Brand | Cask 23 | Ne Cede Malis | Stags Leap
Grape | Cabernet Sauvignon | Shiraz | Cabernet Sauvignon
Region | California | California | California
Sub_region01 | North Coast | North Coast | North Coast
Sub_region02 | Napa Valley | Napa Valley | Napa Valley
Sub_region03 | Stags Leap District | Stags Leap District | Stags Leap District

Creating a structured, consistent data set then unlocks the data.

Attributes can be used to perform a structured matching process to identify any duplicates.
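One simple form of structured matching is to group rows that share the same values across a chosen set of attributes. The sketch below assumes the rows have already been tagged into fields; the field names and sample rows are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

def find_duplicates(rows, key_fields):
    """Return lists of row indexes that share the same key attributes."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row.get(field) for field in key_fields)
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]

# Hypothetical, already-tagged rows for illustration only.
rows = [
    {"producer": "Henri Gouges", "brand": "Les Vaucrains", "volume_ml": 375},
    {"producer": "Henri Gouges", "brand": "Les Vaucrains", "volume_ml": 375},
    {"producer": "Henri Gouges", "brand": "Les Vaucrains", "volume_ml": 750},
]
print(find_duplicates(rows, ["producer", "brand", "volume_ml"]))  # [[0, 1]]
```

Exact matching on attributes only works because the earlier tagging and standardisation steps have removed the spelling variants; on the raw descriptions the first two rows would not compare as equal.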

Hierarchies can now be imposed and the removal of duplicates allows accurate data consolidation and reconciliation to finally occur.
