Cleaning Data

Posted on October 1, 2010 by Dick Kusleika

Are there any good utilities for cleaning up Excel data so that it could be put into a database? A quick search didn’t reveal anything of note. I think one problem is that data sets, and the problems with them, are so varied that it would be difficult to develop a general purpose utility that had any real, you know, utility.

But if such a utility could exist, what would it look like? First, I think it shouldn’t endeavor to fix 100% of the problems. I think more like 80% would be a good number. If you got a spreadsheet that was a real piece of work and could run a macro that “fixes” 80% of it, that would be useful. Maybe even 50% would be useful. Let’s start a list of the features of this utility and see what happens.

Column Identification

The first no-brainer is identifying the data type of all the columns: String, Double, Integer, Boolean, and all the rest of our data type friends. The utility should flag those entries that aren’t the correct data type. Maybe it should optionally fix them so they are correct and only flag those that can’t be converted.

Beyond data type, there should be some special columns.

Name – An option to split first, last, middle, saluattion, suffix into separate columns. An option to keep it all together, but in a predefined format like Last, First M.
Address – Split into street1, street2, city, state, zip, country
Phone – Format in a predefined format like (999) 999-9999 or 999-999-9999. Option to include empty parens or omit them for no area code found.

I’m sure there’s more special columns that should be included.

Data Manipulation

In addition to the data manipulation that comes from specifying a special column type, there could more general purpose data manipulation

Maximum or minimum string length – truncate or pad as necessary
String replace – just like Excel’s built-in replace
Remove text – remove anything that’s not a number
Remove numbers – remove anything that’s not text
Clean – remove leading, trailing, and extra spaces or any non-printing characters
Simple truncating – chop of the first or last x characters
Math – perform some operation on all data, like dividing by 1,000.
Removing records – delete any rows that contain certain data

Flagging Problems

Lists – flag any data that isn’t part of a list. Lists could be in a range somewhere or could be entered. They could be simple lists like {Male, Female} or larger lists like all the cities in Minnesota.
Regular expressions – flag any data that doesn’t fit. Have some built-in regular expressions, such as for URLs or email addresses.
Number ranges – flag any data that’s not in a certain number range (or date range)
Contains text – flag any data that doesn’t contain a certain string
Duplicates – flag any data that appears anywhere else in the column

This should probably be combined with the column identification with an option to flag or modify.

Splitting

Split the column using a certain number of characters or based on a certain sequence of characters.

Normalization

Identify a column that should be its own table. A new sheet would be created with an Autonumber field and the unique entries from that column. Then the original data would be replaced with the Autonumber.

I suppose you should be able to save the settings for those repeated data cleansing activities. Thoughts? Any existing products out there?

15 thoughts on “Cleaning Data”

Ron De Bruin says:

October 1, 2010 at 5:53 am

Hi Dick

Will be a lot of work to make a good program for this

I have a old page on my site with some info
http://www.rondebruin.nl/clean.htm
sam says:

October 1, 2010 at 6:04 am

Data Normalization

Name Jan Feb Mar…
A 10 20 30

To be converted in to

Name Month Amt
A Jan 10
A Feb 20 etc

I use a Mutiple Consolidation Range Pivot/ Or a Union Query in Access
if it can be done in a click it would be great
Tiimothy says:

October 1, 2010 at 6:16 am

I would also insert the option to indicate that some of the columns need unique records.
And of course flag all the duplicate ones.
Glenn says:

October 1, 2010 at 6:40 am

In my previous job, I created just such a macro called “FormatThisSheet”. It did much of what you mention above by cycling through and recognizing the column headers.

For example, if it found three consecutive columns labelled: HNPA | HNXX | HNUM
It would insert a new column in front of the three labelled “HomePhone”, and cycle through the rows, combining as apropriate and formatting as (999)999-9999. A yes/no prompt would then ask me if the individual columns should be deleted, or left to remain in the data. Similarly, WNPA | WNXX | WNUM would be combined into “WorkPhone”.

Similar processes worked for columns:
CSYear | CSMonth | CSDay combining into dates labelled “ContractStart”
CEYear | CEMonth | CEDay combining into dates labelled “ContractEnd”
FName | MName | LName | NameSuf combining into strings labelled “CustName”
etc.

It also formatted Zip Codes correctly since here in the northeastern US many zip codes start with zero (I’d like to shoot the guy who designed that system!) and Excel always removes the leading zero.

It also called macro: “FormatHeaderRow” which makes each cell in row 1 bold font, with a border, and uses “freeze panes” to lock it into place.

I loved that macro – it saved me thousands of hours of formatting work! I am now at a different job where “FormatThisSheet” is no longer used… but my “FormatHeaderRow” is still hard at work and saving me countless hours and headaches!

Glenn
Chris Greaves says:

October 1, 2010 at 8:24 am

Hi Dick, It just so happens ….
Chris Greaves says:

October 1, 2010 at 8:29 am

Hi Dick, It just so happens …
… That I am partway through such a project.
My documentation is incomplete, but is updated as i spend time on it, once every few days.
Current details at http://www.chrisgreaves.com/Greaves/Contacts/Notes.htm
I’d be happy to make the code available to you, and to others through this site.
Also to indulge ion correspondence with any of your readers.

P.S. I think a partial-reply got sent prior to this, if so please feel free to delete it.
Kanwaljit says:

October 1, 2010 at 12:37 pm

Hi Dick,

You must have tried ASAP-Utilities ! ? Haven’t you ! ?

Kanwaljit
Rob van Gelder says:

October 1, 2010 at 1:46 pm

I’ve just recently been through a name / phone / address cleansing activity.

A nice feature would be to pull a column of data (or columns!) then present to the end user row at a time. The data would be coloured according to an automated detection pattern (in this case, the detection pattern selected is “Full Name”).
A column which contains Full Name would colour the first word blue to signal first name, and the second and subsequent words red to signal last name.
At this point, I’d be able to tab between the coloured selections and arrow left or right to absorb less or more of the adjacent colour.
eg. “Mary Kate Smith” – “Mary” is selected blue, “Kate Smith” is coloured red. I tab to the red colour, then press right-arrow to reduce the selection to Smith alone.
(in fact, since “Mary” was selected first, I’d probably just do a right-arrow to make “Mary Kate” blue)

Once the colours are in the correct position, ok commits the split.

This technique could also work well with addresses, and pulling area codes out of telephone numbers.
sam says:

October 2, 2010 at 7:11 am

Existing products:
For Remove duplicates and Higlight Duplicates – Built in Excel 2007/2010
For Fuzzy Duplicates (typos) – Excellent addin from Able Bits – Fuzzy Duplicate finder – commercial
For leading trailing, excess spaces + removal of non printing ascii carachters + insert before/after + truncate specified number of leading/trailing characters + delete rows that contain some data – some of the utilities in ASAP utilities – free
For transposing data – one of the utilities in PUP – commercial
Divide numbers by 1000 – paste special divide

Unfortunately no one product that does it all
Ed Ferrero says:

October 3, 2010 at 6:58 pm

Hi Dick,

One little trick I use. If I have a list of standard entries that I want in clean data, and some dodgy data.
e.g. I want clean data to be a value in this list (in a named range called KnownList)
Gateway
Kuraby
Stapylton Rd

And I have some data like;
My Gateway North –
Gateway South –
Kuraby South –
Stapylton Rd Exit

Then I use a formula like;
=INDEX(KnownList,-SUMPRODUCT(ISERR(SEARCH(KnownList,$A2,1))-1,ROW(KnownList))-1)

The formula is in five parts;
A
1) SEARCH(TollList, Value)
This searches for the string TollList in the string Value
It either returns the position of TollList in Value (1) or an error if the Value string does not contain TollList
But, if TollList is a range, the formula returns an array {error,error,1,error,error}
We can look for the position of the non-error value in the array to see which of the array strings is contained in Value
A
2) ISERR(array1) a” 1
If array1 contains some errors and some values, then the ISERR function returns another array
ISERR(error) returns TRUE (=1), and ISERR(not-error) returns FALSE (=0)
So ISERR(array1)-1 returns an array with 1 in the position of TollList where we have a match
{0,0,0,1,0,0}
A
3) ROW(TollList)
This returns an array of the row numbers for each cell in TollList
Since the TollList starts in row 2, I have subtracted 1 from this
{1,2,3,4,5,6}
A
4) SUMPRODUCT(array1, array2)
Multiplies array1 by array2
Array1 has 1 for the position in TollList where we have a match {0,0,0,1,0,0}
Array2 is the row numbers for TollList values {1,2,3,4,5,6}
A
Multiply these together =4
A
5) INDEX(TollList, value from SumProduct)
Simple Index formula to return the matched value in TollList

HTH
Rob van Gelder says:

October 3, 2010 at 10:58 pm

Ed: I like your -SUMPRODUCT() – 1
When needed I’d use SIGN(), but I think I’ll use your technique now.
Jam says:

October 4, 2010 at 2:13 am

Hi all,

ETL (Extract Transform Load) softwares should be the solution. Have a look to Talend Studio, open source program (but note that there are other programs like this one).
Jamie Collins says:

October 4, 2010 at 3:54 am

From Wikipedia: “normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristicsa”insertion, update, and deletion anomalies”. There are lots of tutorials and examples available online. The definition used here (“Identify a column that should be its own table”) misses that point that there are a number of stages a design should be taken through (1NF, 2NF, 3NF, BCNF, 4NF and 5NF), noting that even a 5NF design may still exhibit insertion, update, and deletion anomalies. Finally, normalization has nothing to do with “Autonumber fields”: surrogate keys vs natural keys is a subject in its own right and a controversial one too.
Joe Mako says:

October 4, 2010 at 6:04 am

Have you taken a look at http://www.datamartist.com/ yet? for example: http://www.datamartist.com/data-profiling-rules-and-data-formats

It is a straightforward data profiling and data transformation tool. So you can both see the patterns/quality of your data and change/clean your data.

If you are looking for more of a code (Java/Peal) generator that does profiling/data quality checks and ETL, then http://www.talend.com is worth a look, although the UI is not really intuitive.
Joseph A. Smith says:

December 13, 2010 at 12:28 pm

Very recently released is software: Google Refine 2.0
This does most of what you want. It is downloadable.
Price cannot be beat. Check it out.

Posting code? Use <pre> tags for VBA and <code> tags for inline.

Daily Dose of Excel

Haphazardly Posted Excel Information and Other Stuff

Cleaning Data

15 thoughts on “Cleaning Data”

Leave a Reply