Cleaning Data

Are there any good utilities for cleaning up Excel data so that it could be put into a database? A quick search didn’t reveal anything of note. I think one problem is that data sets, and the problems with them, are so varied that it would be difficult to develop a general purpose utility that had any real, you know, utility.

But if such a utility could exist, what would it look like? First, I think it shouldn’t endeavor to fix 100% of the problems. I think more like 80% would be a good number. If you got a spreadsheet that was a real piece of work and could run a macro that “fixes” 80% of it, that would be useful. Maybe even 50% would be useful. Let’s start a list of the features of this utility and see what happens.

Column Identification

The first no-brainer is identifying the data type of all the columns: String, Double, Integer, Boolean, and all the rest of our data type friends. The utility should flag those entries that aren’t the correct data type. Maybe it should optionally fix them so they are correct and only flag those that can’t be converted.

Beyond data type, there should be some special columns.

  • Name – An option to split first, last, middle, saluattion, suffix into separate columns. An option to keep it all together, but in a predefined format like Last, First M.
  • Address – Split into street1, street2, city, state, zip, country
  • Phone – Format in a predefined format like (999) 999-9999 or 999-999-9999. Option to include empty parens or omit them for no area code found.

I’m sure there’s more special columns that should be included.

Data Manipulation

In addition to the data manipulation that comes from specifying a special column type, there could more general purpose data manipulation

  • Maximum or minimum string length – truncate or pad as necessary
  • String replace – just like Excel’s built-in replace
  • Remove text – remove anything that’s not a number
  • Remove numbers – remove anything that’s not text
  • Clean – remove leading, trailing, and extra spaces or any non-printing characters
  • Simple truncating – chop of the first or last x characters
  • Math – perform some operation on all data, like dividing by 1,000.
  • Removing records – delete any rows that contain certain data

Flagging Problems

  • Lists – flag any data that isn’t part of a list. Lists could be in a range somewhere or could be entered. They could be simple lists like {Male, Female} or larger lists like all the cities in Minnesota.
  • Regular expressions – flag any data that doesn’t fit. Have some built-in regular expressions, such as for URLs or email addresses.
  • Number ranges – flag any data that’s not in a certain number range (or date range)
  • Contains text – flag any data that doesn’t contain a certain string
  • Duplicates – flag any data that appears anywhere else in the column

This should probably be combined with the column identification with an option to flag or modify.

Splitting

Split the column using a certain number of characters or based on a certain sequence of characters.

Normalization

Identify a column that should be its own table. A new sheet would be created with an Autonumber field and the unique entries from that column. Then the original data would be replaced with the Autonumber.

I suppose you should be able to save the settings for those repeated data cleansing activities. Thoughts? Any existing products out there?

Posted in Uncategorized

15 thoughts on “Cleaning Data

  1. Data Normalization

    Name Jan Feb Mar…
    A 10 20 30

    To be converted in to

    Name Month Amt
    A Jan 10
    A Feb 20 etc

    I use a Mutiple Consolidation Range Pivot/ Or a Union Query in Access
    if it can be done in a click it would be great

  2. I would also insert the option to indicate that some of the columns need unique records.
    And of course flag all the duplicate ones.

  3. In my previous job, I created just such a macro called “FormatThisSheet”. It did much of what you mention above by cycling through and recognizing the column headers.

    For example, if it found three consecutive columns labelled: HNPA | HNXX | HNUM
    It would insert a new column in front of the three labelled “HomePhone”, and cycle through the rows, combining as apropriate and formatting as (999)999-9999. A yes/no prompt would then ask me if the individual columns should be deleted, or left to remain in the data. Similarly, WNPA | WNXX | WNUM would be combined into “WorkPhone”.

    Similar processes worked for columns:
    CSYear | CSMonth | CSDay combining into dates labelled “ContractStart”
    CEYear | CEMonth | CEDay combining into dates labelled “ContractEnd”
    FName | MName | LName | NameSuf combining into strings labelled “CustName”
    etc.

    It also formatted Zip Codes correctly since here in the northeastern US many zip codes start with zero (I’d like to shoot the guy who designed that system!) and Excel always removes the leading zero.

    It also called macro: “FormatHeaderRow” which makes each cell in row 1 bold font, with a border, and uses “freeze panes” to lock it into place.

    I loved that macro – it saved me thousands of hours of formatting work! I am now at a different job where “FormatThisSheet” is no longer used… but my “FormatHeaderRow” is still hard at work and saving me countless hours and headaches!

    Glenn

  4. Hi Dick, It just so happens …
    … That I am partway through such a project.
    My documentation is incomplete, but is updated as i spend time on it, once every few days.
    Current details at http://www.chrisgreaves.com/Greaves/Contacts/Notes.htm
    I’d be happy to make the code available to you, and to others through this site.
    Also to indulge ion correspondence with any of your readers.

    P.S. I think a partial-reply got sent prior to this, if so please feel free to delete it.

  5. I’ve just recently been through a name / phone / address cleansing activity.

    A nice feature would be to pull a column of data (or columns!) then present to the end user row at a time. The data would be coloured according to an automated detection pattern (in this case, the detection pattern selected is “Full Name”).
    A column which contains Full Name would colour the first word blue to signal first name, and the second and subsequent words red to signal last name.
    At this point, I’d be able to tab between the coloured selections and arrow left or right to absorb less or more of the adjacent colour.
    eg. “Mary Kate Smith” – “Mary” is selected blue, “Kate Smith” is coloured red. I tab to the red colour, then press right-arrow to reduce the selection to Smith alone.
    (in fact, since “Mary” was selected first, I’d probably just do a right-arrow to make “Mary Kate” blue)

    Once the colours are in the correct position, ok commits the split.

    This technique could also work well with addresses, and pulling area codes out of telephone numbers.

  6. Existing products:
    For Remove duplicates and Higlight Duplicates – Built in Excel 2007/2010
    For Fuzzy Duplicates (typos) – Excellent addin from Able Bits – Fuzzy Duplicate finder – commercial
    For leading trailing, excess spaces + removal of non printing ascii carachters + insert before/after + truncate specified number of leading/trailing characters + delete rows that contain some data – some of the utilities in ASAP utilities – free
    For transposing data – one of the utilities in PUP – commercial
    Divide numbers by 1000 – paste special divide

    Unfortunately no one product that does it all

  7. Hi Dick,

    One little trick I use. If I have a list of standard entries that I want in clean data, and some dodgy data.
    e.g. I want clean data to be a value in this list (in a named range called KnownList)
    Gateway
    Kuraby
    Stapylton Rd

    And I have some data like;
    My Gateway North –
    Gateway South –
    Kuraby South –
    Stapylton Rd Exit

    Then I use a formula like;
    =INDEX(KnownList,-SUMPRODUCT(ISERR(SEARCH(KnownList,$A2,1))-1,ROW(KnownList))-1)

    The formula is in five parts;
    A
    1) SEARCH(TollList, Value)
    This searches for the string TollList in the string Value
    It either returns the position of TollList in Value (1) or an error if the Value string does not contain TollList
    But, if TollList is a range, the formula returns an array {error,error,1,error,error}
    We can look for the position of the non-error value in the array to see which of the array strings is contained in Value
    A
    2) ISERR(array1) a€” 1
    If array1 contains some errors and some values, then the ISERR function returns another array
    ISERR(error) returns TRUE (=1), and ISERR(not-error) returns FALSE (=0)
    So ISERR(array1)-1 returns an array with 1 in the position of TollList where we have a match
    {0,0,0,1,0,0}
    A
    3) ROW(TollList)
    This returns an array of the row numbers for each cell in TollList
    Since the TollList starts in row 2, I have subtracted 1 from this
    {1,2,3,4,5,6}
    A
    4) SUMPRODUCT(array1, array2)
    Multiplies array1 by array2
    Array1 has 1 for the position in TollList where we have a match {0,0,0,1,0,0}
    Array2 is the row numbers for TollList values {1,2,3,4,5,6}
    A
    Multiply these together =4
    A
    5) INDEX(TollList, value from SumProduct)
    Simple Index formula to return the matched value in TollList

    HTH

  8. Ed: I like your -SUMPRODUCT() – 1
    When needed I’d use SIGN(), but I think I’ll use your technique now.

  9. Hi all,

    ETL (Extract Transform Load) softwares should be the solution. Have a look to Talend Studio, open source program (but note that there are other programs like this one).

  10. From Wikipedia: “normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristicsa€”insertion, update, and deletion anomalies”. There are lots of tutorials and examples available online. The definition used here (“Identify a column that should be its own table”) misses that point that there are a number of stages a design should be taken through (1NF, 2NF, 3NF, BCNF, 4NF and 5NF), noting that even a 5NF design may still exhibit insertion, update, and deletion anomalies. Finally, normalization has nothing to do with “Autonumber fields”: surrogate keys vs natural keys is a subject in its own right and a controversial one too.

  11. Have you taken a look at http://www.datamartist.com/ yet? for example: http://www.datamartist.com/data-profiling-rules-and-data-formats

    It is a straightforward data profiling and data transformation tool. So you can both see the patterns/quality of your data and change/clean your data.

    If you are looking for more of a code (Java/Peal) generator that does profiling/data quality checks and ETL, then http://www.talend.com is worth a look, although the UI is not really intuitive.

  12. Very recently released is software: Google Refine 2.0
    This does most of what you want. It is downloadable.
    Price cannot be beat. Check it out.

Leave a Reply

Your email address will not be published. Required fields are marked *