Importing Fixed Width Text Files with ActiveX Data Objects

A few weeks ago I was struggling with fixed width text files. Remember 1995? Me neither, but I’m living it every day. I happened upon a stackoverflow.com post that dealt with using ADO for this purpose. I don’t remember the post, so no link. It did, however, lead me to this MSDN article, which was very helpful.

If you haven’t been following along, I wrote a post about creating a sample fixed-width file and one about importing said file. This post is about importing that file with ADO. You may remember ADO from such database objects as Connection, Command, and Recordset. But, like me, you may never have considered using to import text files; or even knew that it could.

Before I get into the specifics, there are two things that endeared me to this method. First, it allows me to only import the data I want and easily ignores headers, totals, and other non-record data. Second, it’s super fast. I had a huge text file that took several minutes to read using the Input$ function. I got it down to 90 seconds using ADO. Ninety seconds still stinks, but it beats having to get a coffee every time the code runs. The bonus third reason I love ADO is that I can replicate it for different text files easily. Usually, fixed width text files are not arranged as normalized data, so there are some challenges. But I went from setting this up for one specific report to a dozen reports very quickly. Let’s see how it’s done.

Go read the MSDN article if you want the specifics, but basically you need a file called Schema.ini that tells ADO what’s what in your file. I don’t take a crap without a class module, so we’ll be using class modules here. Didn’t this use to be a family friendly blog?

If you couldn’t tell, the text file in question contains transactions from a general ledger. That means I need a Transaction class to hold each of them. Using the column headers from the file, I create a CTransaction class module.

My VBHelpers add-in quickly converts those to properties and creates a CTransactions parent class. Next, I create an MEntryPoints standard module and insert the following code.

I don’t have a FillFromFile method or a WriteToRange method, but I like to write my main procedure as if I already had those. You’ll need a reference to Microsoft ActiveX Data Objects 2.8 Library (although any version close to that will do). The FillFromFile method is pretty simple. It creates and ADO Connection and an ADO Recordset, then loops through the recordset adding CTransaction instances as it reads them in. It’s treating our text file as if it’s a database with field names and everything.

We’ll hold off on how to create the Schema file for now. The Connection string is created with this little utility. You pass in the path and returns an array of strings ready to be joined.

I’ve recently starting using arrays and Joins to concatenate strings of any length. I find it makes the code much more readable and manageable once you get used to it. Let’s talk about that SQL statement. In Schema.ini, I’ve defined column names and column widths. We’ll look at it in a moment. The SQL statement selects all the columns from the text file based on some criteria. The first section of the SQL statement is the SELECT keyword. For the second section, I have a property that returns an array of columns. You could just as easily use “SELECT * FROM”, but I was recently shown the benefit of following the never-select-astrisk rule, so I’m trying to be good.

Just an array of column names used in the SQL statment and in Schema.ini. The third section of the SQL statement is the FROM keyword followed by the file name in brackets. The Dir$ function strips the path out of the fullname and returns only the file name. You don’t need the path in the SQL statement because Schema.ini is in the same directory as the text file. It has to be, so it’s not looking anywhere else.

The final section of the SQL statement is the WHERE clause. This is where you have to get a little creative. As I scan down my text file, I need to find some characteristic of “good” rows that is not present in “bad” rows. For this example, it was pretty easy. Every row that I want has a real date in PostDate and every row that I don’t want doesn’t. They aren’t all that easy. Would you like to see some examples of WHERE clauses I’ve used? Well, would you?

Some of those are pretty gnarly. Our example file has lots of options, not just PostDate. We have three Yes/No fields and we could use any of those. It’s pretty unlikely that header information or totals rows are going to have a Yes or No in that same position. The idea is only get the rows you want. For some of the dates I used "__/__/____" and I think it’s pretty safe. But for this example, I used "[0-1][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]" which is a little more specific.

With a good connection string and SQL statement, I open the connection, open the recordset, and start looping. There’s not much to filling the CTransaction class via the FillFromRecordset method. It uses an Nz function that I wrote to avoid Null problems and return a default. For numbers and dates, I specify that I want to return a zero in place of a Null. For strings, it automatically returns vbNullString. The FillFromRecordset procedure is a method in CTransaction. The Nz function is in a standard module.

Now on to Schema.ini, at long last. I have a MakeSchema procedure in a standard module that simply creates the file where it’s supposed to. One of the arguments to MakeSchema is a string for the contents of the file. That comes from the Schema property of the CTranscations class (shown as Me.Schema in the FillFromFile method above). The Schema property takes the columns from the Columns property and puts them together with column widths to create the string.

The widths array is simply how many characters wide each column is. The lines in my file are 132 characters long. Counting them is a pain. Usually, I grab a couple of representative lines from the text file and put them in a spreadsheet. Here’s how I came up with the column widths for this file.

Lines 2-5 are Courier New 9pt and the first two lines are typed – no fancy formula to get those numbers. I like to get a couple of representative lines so I don’t miss anything. Then I go put pipes where I want the column breaks to be and put this formula in A7

=FIND("|",$A$4,A6+1)

and fill down until I get an error. Column B is just the difference. Finally, the MakeSchema utility takes that string and puts into a file.

And the Schema.ini file looks like this:

The final piece is writing all of the CTransaction objects to a range. The CTransactions collection class has a WriteToRange method that calls an OutputRange property. It’s pretty straightforward.

I put apostrophes in front of the strings so they don’t get converted just because they look like a number or a date. And when I’m done, I sum up the Debit column and compare it to the text file.

It’s a match! I must have done something right. No comments for you accountants about how my debits don’t equal my credits. You try to make fake data this internally consistent. :)

You can download FixedWidthADO.zip

Importing Fixed Width Text Files

I’ve been working with fixed width text files quite a bit lately. The built-in method for importing these files is terrible. First, it’s yet another wizard and I think it would be better as a single form (but then I think everything is better as a single form). The worst part, by far, is the tiny window that shows a preview of your data.

After a few days of using this wizard, it was pretty obvious that I was going to write my own, if for no other reason than to make that window bigger. My importer would be significantly different than Excel’s. Microsoft has to make their method flexible and universal, but I don’t. I know certain things about my text files and can build in some assumptions to make things better. For one, my files have repeating page headers. While Excel allows you to start your import at something other than first row to skip all that crap, it doesn’t do me any good because there’s just more crap to come.

Step 2 of the wizard allows you to add, delete, and move the dividing lines that delineate your columns. I find that Excel does a pretty poor job at placing these lines. But I grant that I haven’t used this on a wide variety of files so it’s entirely possible that their algorithm is the best – it’s just not the best for the types of files I’m using.

Did you think I was going to blow by this step without commenting on the lack of keyboard support? Not a chance. To move the lines, click them and drag them. To add a line, single click. To delete a line, double click. Actually, to delete a line, double click next to the line you want to delete, then double click that new line to delete it, then carefully double click on the original line to delete it. You know what I’m talking about. Click, click, click. Where’s the keyboard love?

The lack of large enough preview window really hurts on this step. I’ll discuss determining where the column lines should go later in this post. Normally after step 2, I just hit finish. But let’s take a look at the last step anyway.

I don’t use this step much because the defaults are really good. As much as I don’t like wizards, I have to give MS props for putting a Finish button on step 2 so I can skip step 3. Now that you’ve defined your columns, this step allows you to specify a data type. The General type works well for most situations, but if you have some text that happens to look like a number, it’s best to set the column to the Text format.

Under the Advanced button, you can switch what decimals and commas mean and, most importantly to me, tell it how to handle trailing minus signs.

So back to how nothing’s ever good enough and I can do everything better. I’m only dealing with fixed width files and I’m always starting on the first row, so step 1 of the wizard is gone. Step 2 and 3 could be combined, I think. It should draw a combobox over each column that let’s you choose the format. When you add or delete columns, it redraws the boxes. The best use of resources is getting the columns right in the first place. If you don’t have to move columns because it guesses so well, then the whole thing becomes a breeze.

Let’s look at this sample file in terms of columns. In order to get all the numbers to show, this chart is kind of big.

I wrote a little macro to analyze the file and report how many characters are in each column.

Then I charted them. It seems pretty clear that there’s a break around 7, 12, 23, and 35. The points at 41 and 45 are a little less clear, but starting at 47, you see a clear downward trend. This is the tell-tale sign of left-justified text. Similarly, 87-104 is a right-justified number. The headers muddy up the waters a bit because they contain data that’s no good to me, but still adds to the character count. As I mentioned before, I’m not building a general-purpose solution and it just so happens I can remove the headers. So I did.

The ASCII code for the page break character is 12. All of my headers end in a string of 132 dashes. That’s damn convenient. Look what happens when I remove the headers.

Based on that, I think I could predict the column breaks perfectly. But there’s one aspect of this file that allows me to say that. I don’t have any truly variable length, left-justified text followed by any truly variable length, right-justified number. For instance, if the description column was immediately followed by the debit column, it might be difficult to determine exactly where to break it.

How did I solve that problem? I didn’t. This is as far as I got writing my own text import wizard. Out of nowhere, I read something about reading text files with ADO and all my problems were solved. I had used external data tables to read text files, but never ADO. I’ll post about how I’m importing text files with ADO in my next post.

Sample Fixed Width Text File

I have a post or two in my head that deals with fixed width text files, something I’ve been dealing with quite a bit. Like me, you probably don’t run into fixed width text files in your personal life. It’s usually some crappy computer program at your job that forces you to deal with them. The problem is that I can’t simply throw around private company information, so I had to replicate a real fixed width file with obfuscated information. Before I post about all the lovely things I’ve been doing with these files, I thought I’d post about how I made the sample.

fixed width text file sample

That procedure is a bit longer than I like, but it’s not something I wanted to spend a lot of time on. It reads in the legitimate file, then goes line-by-line through it. When it encounters a line that needs changing, it creates fake data to put in place of the real data. I want the totals to match so I have something to compare it to after I parse the text file (in a later post), so I have to keep track of the fake numbers I make up and total them appropriately.

The GL Account is a 10 digit number that start with 1-5 and can have any other digits after that.

The description obfuscator respects spaces, but replaces anything else with an upper case letter.

The Yes/No generator gives me a Yes about 80% of the time. You have to stay positive, you know.

For the debits and credits, I wanted to stay somewhat realistic, so I kept the same scale as whatever number is there. To determine the scale, I remove all spaces, commas, and periods from the screen, then take the length x 2. A number like $9,453.65 will have a scale of 4 and will produce a number between 1,000 and 9,999. Rnd, as you know, generates a number between 0 and 1. I multiply Rnd by 10^Scale to get the right number of digits, then round it to two decimal places.

Finally, I have a Pad function to fill out spaces around numbers. First, Pad truncates the string to the right width in case it’s already too long. If it’s not too long, spaces are inserted in front of it to fill out the right width. Then there’s the sAfter argument that you’re probably wondering about. I don’t have trailing negatives in this report, but I have had them in some others. I need to be able to stick a negative or a space after the string and used the sAfter argument to do it. I believe they’re all null strings in this example.

Making realistic sample data sucks.

You can download FixedWidthExample2.txt

Opening a PDF from VBA

Several years ago I needed to open the newest CSV file from a particular directory. Now I’m faced with a similar problem. My accounting system produces PDFs in some kind of proprietary PDF reader. It doesn’t have near the feature set of Foxit, my preferred reader. Instead of fighting it, now I immediately save the pdf and open it in Foxit.

Drawing from that previous post, I made a function to find the name of the most recently created PDF.

Once I have the name, a simple FollowHyperlink method will get me where I want to go. Oh, except that hyperlinks are bad and Excel needs to show me a warning. That’s not going to work. Instead, I take the long way around. I create a batch file to open the PDF and run that.

The file opens and there’s no warning to click through. Coincidentally, JW has been working around that same security measure, only for MP3 files. There’s an interesting approach.

It works and no message. Good one John.

Finally, I wanted a third method. fzz commented that I should use a console command because that’s what consoles are good at. I made a batch file following his example:

for /F %%a in ('dir /b/o-d "K:*.pdf"') do (start %%a & exit)

No warnings, obviously, and even though the VBA above is lightning quick, I think we can all appreciate that this is the quickest and most direct way. I’m having a problem running from VBA though.

It got the file name right, but says it can’t find it. As you can see, I tried ShellExecute too. Same result. It doesn’t matter. I have a batch file, so I don’t need Excel. I put a shortcut to the batch file on my desktop and set the shortcut key. Now I can open it regardless of which applications are open or have the focus.

Monitor Worksheet Changes via RSS

I love RSS. If you have a website and don’t have a feed, I don’t follow it. I monitor stuff via Google Reader and it’s a great way to keep up on a lot of sites. So it should only follow that monitoring spreadsheet changes via RSS would be great too. Right? Well, not really, but that never stopped me before.

First some caveats. This is just an experiment and not meant for general use. There may actually be some bugs in it, if you can believe that. If you want to modify the code for your own use, be warned that if you monitor too many cells it might be slow. Or you might get so much information that it’s worthless. OK, now that that’s over.

An RSS feed is an XML file that sits on a web server. This blog has such an XML file. When I post this blog entry, WordPress will update the XML file with an entry for this post. Occasionally Google Reader will check the XML file and see if there’s anything new. If there is, it will display the new stuff for anyone who has subscribed to the feed.

It’s fairly trivial to create an XML file, even from Excel. The potentially difficult part is putting that file on a web server. Oh, except for one little thing. I have a webserver right on my computer and you might too. Mine is called “C:\Users\dick\Dropbox\Public\”. Did I trick you? Dropbox has a Public folder and you can get a “public link” from any file in that folder, including the XML file we’re about to create. What a simple way to publish something to the intertubes. Here’s how you get that public link.

Alright, enough screwing around. Let’s get to the code.

I start with a class module called CChange (and its parent CChanges). CChange has the following read/write properties:

Address - the cell address we're watching for changes and a way to uniquely identify the instance.
OldValue - the value in the cell before it changed.
NewValue - the value in the cell after it changed.
Modified - a time stamp when the changed occurred.

I’ve named a range on the sheet called RSSWatch. When the workbook opens, a CChange object is created for every cell in that range.

Sub Auto_Open()

Set gclsChanges = New CChanges

gclsChanges.Initialize

End Sub

Public Sub Initialize()

Dim clsChange As CChange
Dim rCell As Range

For Each rCell In Sheet1.Range(gsNAMEDRNG).Cells
Set clsChange = New CChange
With clsChange
.Address = rCell.Parent.Name & "!" & rCell.Address
.OldValue = rCell.Value
.NewValue = .OldValue
End With
Me.Add clsChange
Next rCell

End Sub

These CChange objects are just sitting out there waiting to record any changes. They all have a Modified date of 12:00:00 AM (because I didn’t set anything) and in this state they won’t be written to the XML file. In the ThisWorkbook module, I use the Workbook_SheetChange event to monitor my range for changes. You might notice that I switch pretty liberally between sheet-specific references and general references. For instance, in the above Initialize method, I limit my range to Sheet1. When I went to code the event, I thought that someday I would want this to monitor different ranges on different sheets, so I used a Workbook level event. It’s totally inconsistent, but it will be helpful if you just ignore it.

Private Sub Workbook_SheetChange(ByVal Sh As Object, ByVal Target As Range)

Dim clsChange As CChange
Dim rCell As Range
Dim rRng As Range

On Error Resume Next
Set rRng = Sh.Range(gsNAMEDRNG)
On Error GoTo 0

If Not rRng Is Nothing Then
If Not Intersect(Target, rRng) Is Nothing Then
For Each rCell In Target.Cells
Set clsChange = gclsChanges.Change(Sh.Name & "!" & rCell.Address)
If Not clsChange Is Nothing Then
clsChange.NewValue = rCell.Value
clsChange.Modified = Now
End If
Next rCell
End If
End If

End Sub

For every cell that has changed, I find it’s CChange brother and change the NewValue and Modified properties. I haven’t written the XML file yet. I still just have a bunch of CChange objects, except that at least one of them has a Modified property that will make it eligible to be included in the file. Changes get made and recorded and only the last value and time are saved. Then, when the workbook is saved, any eligible CChange objects are written to the file.

Private Sub Workbook_BeforeSave(ByVal SaveAsUI As Boolean, Cancel As Boolean)

If Not gbDEBUG Then
gclsChanges.WriteRSS
End If

End Sub

And that’s pretty much all there is. Just kidding, there’s lots more. Before we look at more code, let’s take a look at what the file might look like after a few changes.

I’m sure you’re all experts on XML, so I’ll spare you the details, but the general hierarchy goes like this

rss
channel
title
link
description
language
lastBuildDate
ttl
item - one or more of these guys
title
link
description
pubDate
/item
/channel
/rss

Now that you know what the file looks like, let’s write one. I called the WriteRSS method from the Workbook_BeforeSave event.

Public Sub WriteRSS()

Dim xmlDoc As MSXML2.DOMDocument
Dim xmlChannel As MSXML2.IXMLDOMElement
Dim xmlLastBuild As MSXML2.IXMLDOMElement
Dim xmlItem As MSXML2.IXMLDOMElement
Dim clsChange As CChange
Dim dtMax As Date

10 If Me.HasChanges Then
20 If Not Me.FileExists Then Me.CreateFile

30 Set xmlDoc = New MSXML2.DOMDocument
40 xmlDoc.Load gsPATH & Me.Filename
50 Set xmlChannel = xmlDoc.SelectSingleNode(gsXRSS).SelectSingleNode(gsXCHANNEL)
60 Set xmlLastBuild = xmlChannel.SelectSingleNode(gsXBUILD)
70 Me.LastBuildDate = ConvertDate(xmlLastBuild.Text)
80 dtMax = Me.LastBuildDate

90 For Each clsChange In Me
100 If clsChange.ShouldWrite Then
110 Set xmlItem = clsChange.xmlItem(xmlDoc)
120 xmlChannel.appendChild xmlItem

130 If clsChange.Modified > dtMax Then dtMax = clsChange.Modified
140 End If
150 Next clsChange

160 Me.LastBuildDate = dtMax
170 xmlLastBuild.Text = Format(Me.LastBuildDate, gsFMTDATE)
180 FormatXMLDoc xmlDoc
190 xmlDoc.Save gsPATH & Me.Filename
200 End If

End Sub

Generally, I’m creating an XML file if it doesn’t exist, reading that file in, appending Items to it for any changes, and writing that file back out. The first thing I do is make sure there’s something to write. In line 10, I call the HasChanges property, which loops through all the CChange instances to see what’s eligible. If nothing has changed, there’s no need to create the XML file.

Line 20: If there isn’t already a file, I need to make one. I check to see if it’s out there.

Public Property Get FileExists() As Boolean

FileExists = Len(Dir(gsPATH & Me.Filename)) > 0

End Property

Public Sub CreateFile()

Dim xmlDoc As MSXML2.DOMDocument
Dim xmlRss As MSXML2.IXMLDOMElement
Dim xmlVer As MSXML2.IXMLDOMAttribute
Dim xmlChannel As MSXML2.IXMLDOMElement
Dim xmlNode As MSXML2.IXMLDOMElement

Set xmlDoc = New MSXML2.DOMDocument

Set xmlRss = xmlDoc.createElement(gsXRSS)
Set xmlVer = xmlDoc.createAttribute(gsXVER)
xmlVer.Value = gsRSSVERSION
xmlRss.Attributes.setNamedItem xmlVer

Set xmlChannel = xmlDoc.createElement(gsXCHANNEL)

Set xmlNode = xmlDoc.createElement(gsXTITLE)
xmlNode.Text = Me.Filename
xmlChannel.appendChild xmlNode

Set xmlNode = xmlDoc.createElement(gsXLINK)
xmlNode.Text = gsLINK
xmlChannel.appendChild xmlNode

Set xmlNode = xmlDoc.createElement(gsXDESC)
xmlNode.Text = "Changes made to " & ThisWorkbook.Name
xmlChannel.appendChild xmlNode

Set xmlNode = xmlDoc.createElement(gsXLANG)
xmlNode.Text = gsLANG
xmlChannel.appendChild xmlNode

Set xmlNode = xmlDoc.createElement(gsXBUILD)
xmlNode.Text = Format(Now - 1, gsFMTDATE)
xmlChannel.appendChild xmlNode

Set xmlNode = xmlDoc.createElement(gsXTTL)
xmlNode.Text = glTTL
xmlChannel.appendChild xmlNode

xmlRss.appendChild xmlChannel

xmlDoc.appendChild xmlRss

xmlDoc.Save gsPATH & Me.Filename

End Sub

You’ll need to set a reference to Microsoft XML, v6.0 or similar. CreateFile sets up everything in the file that’s not an Item, like the title, link, description, language, etc. The basics of XML generation are 1) create a new node and 2) append it to its parent node.

Line 40: I read in the existing file or the one I just created. The Load method populates the XMLDOMDocument with all the hierarchies and data. Line 50 and 60 go find specific nodes in the file that I’m interested in. I want the Channel node because I’ll be appending Items to it. I want the lastBuildDate node so I can compare that to CChange Modified properties and only write new changes.

Line 70: The RSS example I got off the web showed dates formatted like Tue, 06 Mar 2012 21:28:01 CST. Neither Excel’s CDate or Datevalue functions can convert that to a date, so I have to strip off the day and the timezone, which I do in a separate function.

Line 80: Ultimately I want to make my lastBuildDate in the XML file equal to the latest date of all the CChange objects. I’m initializing the maximum variable here. If I don’t, every cell’s Modified would be greater than this date.

Line 100: For every CChange, I check to see if it needs to be written to the file.

Public Property Get ShouldWrite() As Boolean

ShouldWrite = Me.Modified >= gclsChanges.LastBuildDate

End Property

If it’s been modified since the last time I wrote the file, it gets written this time. I really should also check to see if OldValue is different than NewValue. Right now if you change a cell and then change it back, it will still show up as a change. Sounds like a V2 enhancement.

The rest of the loop creates an Item, appends it to the Channel, and updates the maximum Modified date. The xmlItem property returns an object that can be appended.

Public Property Get xmlItem(xmlDoc As MSXML2.DOMDocument) As MSXML2.IXMLDOMElement

Dim xmlReturn As MSXML2.IXMLDOMElement
Dim xmlSubItem As MSXML2.IXMLDOMElement

Set xmlReturn = xmlDoc.createElement(gsXITEM)

Set xmlSubItem = xmlDoc.createElement(gsXTITLE)
xmlSubItem.Text = Me.Address
xmlReturn.appendChild xmlSubItem

Set xmlSubItem = xmlDoc.createElement(gsXLINK)
xmlSubItem.Text = gsLINK
xmlReturn.appendChild xmlSubItem

Set xmlSubItem = xmlDoc.createElement(gsXDESC)
xmlSubItem.Text = Me.Description
xmlReturn.appendChild xmlSubItem

Set xmlSubItem = xmlDoc.createElement(gsXPUBDATE)
xmlSubItem.Text = Format(Me.Modified, gsFMTDATE)
xmlReturn.appendChild xmlSubItem

Set xmlItem = xmlReturn

End Property

Not much to this – make a node and append it. The Description property is a read-only property that makes a nice English sentence describing what happened.

Line 160: I change the LastBuildDate to equal the max, then change that node in the XML file.

Line 180: This code I stole from VB Helper. It adds the line breaks and indentation that, while not necessary, is really helpful when debugging. No, I didn’t write this code perfectly the first time.

Finally I save the modified XML document. Dropbox publishes to the web and Google Reader reads it. Here’s what the last change looks like in the reader.

If I ever look at this again, here’s what I’d do in version 2

  • Put all the changes in one Item rather than one per cell
  • Check to see if a cell was changed and changed back and exclude it
  • Get the timezone from the Windows API rather than hardcoding it in a constant
  • Modify to use mulitple ranges on multiple sheets
  • Put the code in an add-in and look for workbooks to monitor
  • Change the Description to be easier to read

I like this Public Dropbox folder. I’m thinking of using it as a version control system to keep my add-ins up-to-date from multiple computers. It’s a shame that some companies block it. Oh well, your thoughts on the RSS code are welcome.

You can download RSSChanges.zip