Warning: Special Cells slows to a crawl across multiple columns

Further to my previous post on Goto Special/Blanks, it looks like there is a pretty serious problem with the SpecialCells method if you use it on more than one column.

That’s because it’s blindingly fast at finding things in one column, but painfully slow in finding things in any other columns. Which is a problem, because anyone who’s anyone knows that using SpecialCells is supposed to be much more efficient than looping through a big range. Well, Mr and Mrs Anyone may want to reconsider using the loop on multiple columns. Or rather, they might want to consider looping through the columns, and use SpecialCells on each one individually. Best of both worlds.

So how fast is it on one column? And how slow on multiple columns? Let’s find out.

Here’s your test code.

Sub TestingTestingIsThisThingOn()
Dim TimeTaken As Date
Dim rng As Range
Dim Jeff As AboveAverage

TimeTaken = Now()
Set rng = Range("A:B")
rng.SpecialCells(xlCellTypeConstants, xlTextValues).Select

TimeTaken = Now() - TimeTaken

Debug.Print "UsedRange:" & vbTab & ActiveSheet.UsedRange.Address
Debug.Print "Range Searched:" & vbTab & rng.Address
Debug.Print "Time Taken:" & vbTab & Format(TimeTaken, "HH:MM:SS") & " seconds."
Debug.Print "Cells: " & Selection.Cells.Count & vbNewLine & vbNewLine
End Sub

Fill A:A with some text, such as “Blindingly Fast”, and run that sub. Here’s what I get:

UsedRange: $A:$A
Range Searched: $A:$B
Time Taken: 00:00:00 seconds.
Cells: 1048576

Now fill B:B with “Painfully Slow”, and run it again. Here’s what I get:
UsedRange: $A:$B
Range Searched: $A:$B
Time Taken: 00:02:02 seconds.
Cells: 2097152

What the…? So it took a blindingly fast 1 second to process one column, and a painfully slow three minutes to process two? That’s about as linear as my life story before I met and married Excel 2000. (My first wife. I still see her round from time to time in the forums. She hasn’t aged gracefully.)

So either there’s something SpecialCells doesn’t like about the number of columns, or there’s something it doesn’t like about the number 1048576. Let’s find out.

If you delete rows 524288 down, then you’re left with 2 columns of text totaling 1048576 cells. Here’s what I get if I run the code on that:

UsedRange: $A$1:$B$524287
Range Searched: $A:$B
Time Taken: 00:00:57 seconds.
Cells: 1048574

It took one minute to process this, as opposed to the one second it took to process the same amount of cells in one column. So it seems that it’s the multiple columns aspect that is causing the issue, and that the extra time incurred is dependent on the number of cells in that 2nd column. In fact, if I fill those blank cells in column A with our text (while leaving half of Column B with blanks) and rerun the code, then it doesn’t take Excel any longer to process those extra half million cells:

UsedRange: $A:$B
Range Searched: $A:$B
Time Taken: 00:00:52 seconds.
Cells: 1572863

So it seems the issue is in how Excel is handling the union of all those cells it identified in the first column with any cells it identifies in the second. And given how slow it is, I wouldn’t be surprised if it’s unioning those cells one at a time.

The same thing goes for blanks, by the way. This turned up in the comments of Doug Jenkin’s latest post.

Moral of the story: Do NOT use SpecialCells on more than one column at a time.

Question: How are you supposed to bring this stuff to MS’s attention, apart from posting it somewhere conspicuous like here?


Over in the comments of his blog, Doug Jenkins hypothesizes:

A possibility is that a single column is treated as being a single array with 1 million values, but two columns are treated as being 1 million arrays, each with 2 values. When the operation is done on the spreadsheet with F5-Special, the name box shows that it works row by row, rather than all of Col A then all of Col B.

I think he’s right. And also after further testing, I realize that those times are down to how long Excel took to search all these arrays, and not to do with how long it took Excel to union the results. Because there weren’t that many results to union in my simple example, so it took practically no time at all.

But if you set up the data like so:
Every 2nd blank
…then you see that the actual unioning takes a heck of a lot longer than the searching. For instance, as per above it took nearly two minutes for SpecialCells to search 1 million rows across two columns and identify a handful of blank areas. Whereas from the below table, you can see that with the greatly increased number of blanks that it’s got to union, it took over two minutes to union just 45,000 cells in one column.

What good is the Data Model without PowerPivot?

So in the book I’m forever writing, I have an example of how you can use the Data Model to effectively do VLOOKUPS. Take the example where we’ve got a transactional list of sales that has a product ID but no ‘friendly’ name or price, and another table that maps Product ID to price and friendly name.
Two tables
We can use the Data Model to set up a relationship between these two tables…
Edit Relationship
…so that we can display friendly product names from the Price table against aggregated sales quantities from the Sales Data:
Sweet. No VLOOKUPs necessary. So…given that we’ve got quantity, and prices, how much money did we make? Let’s find out. First, lets’ drag that Price field into the Values area of the PivotTable and see what it gives us:
Well, that obviously isn’t what we’re after…it gives us the price of every item, not total revenue. Maybe we need to create a calculated field to multiply Qty against Price:
Calculated Dialog
Oh wait…we can’t. Unless the average user has a premium version of Excel that includes the glorious thing that is PowerPivot (not to mention the wherewithal to not be daunted by a completely different, programmer-centric User Interface) then they can’t do this rather basic datamodel-ish thing.

So here we are…15.64% of the way through this century, and yet your average user still doesn’t have the ability to mash together relational tables in the world’s most widely used business productivity tool. (Or mash together flat files from different sheets or workbooks, for that matter.) Unless of course they go back to last century and do it like this:
ms query
Query Pivot
I just don’t get this. What’s the point of exposing non-premium users to something like the Data Model if it doesn’t let them do diddly squat? Without calculated fields, surely it’s a big red herring that unnecessarily confuses users? (And okay, so if the Data Model on its own isn’t the hammer for this particular nut, what is?)

I thought I was missing something here, so asked someone smarter than me about these matters (Sam). Sam tells me:

From 2013 onward the Data Model and the DAX Engine are separate
The Data Model is only to create Relationships (Joins) – available to all versions of Excel. The DAX engine if for the Calculations – the Pro Plus and Stand Alone versions.
The only extra function on which you can aggregate data in the Data Model based Pivot Table is Distinct Count at the expense of Numerical Count and Product of a normal pivot. A Data model based pivot table disables both Calculated fields and Items as well as the ability to Group. So the short answer is No – you need DAX alternately use SQL to create the Joins

Don’t get me wrong…Powerthis and Powerthat is all very exciting. But Microsoft: How about showing the tiniest bit of love for the common man (and woman) who live in the common version. Those folk shouldn’t have to upgrade to some fully-featured business intelligence flagship product merely so they can mash data from within Excel together, surely? (And yes, they might be able to use PowerQuery for free to join their relational data together, but leaving aside the fact that they’d probably end up with a flat file in Excel after importing the data from PQ, it’s just not good enough that they would have to go learn another programmer-centric tool just to do something that should be childs-play in Excel itself. )

Just to be clear, I’m not one of those that think that PowerPivot should be bundled into all versions of Excel just because we want it to be. Microsoft have spent a lot of money on PowerPivot, and I don’t expect them to have a sudden attack of benevolence merely because it suits users. Theirs is a money-making venture. And their shareholders expect them to do the best job possible at that. But here’s the thing: It’s nearly 2016, and MS still have not given users the ability to do some pretty basic stuff natively in Excel, such as mash together data in separate sheets or workbooks and serve it up piping hot as a PivotTable. Don’t they know that in the real world, people have to mash together data from different sheets all the time? You get a sheet emailed to you from division A, a sheet from division B, and a sheet from division C. You need to mash it up to get an overview. A lot.

I’m not asking for Tableau here. I’m just asking for a core product that also moves with the times. Or at the very least, keeps up with last century.

Searching Text Files in a Directory

I have several years of vendor invoices, in text file format, in some directories on a share. I need to search through these text files to find an order number, manifest number, or some other piece of information. I can’t search everything because it would take too long. And I don’t have control over the server, so if there is some indexing that could be done, I can’t do it. I’m stuck with good old VBA.

The folders are yyyymmdd (ex: 20150725 for July 25th) and corresponds to the invoice dates for any invoices in the file. Each file starts with a three letter abbreviation of the vendors name. Invoice date and vendor name are the only two pieces of information I can use to limit the search. The final piece of information is, of course, the search term. Here’s what the form looks like

I have a table of vendors and codes to populate the Vendor combobox. The QuickDate combobox populates the Date Range textboxes and contains common date ranges, namely, Last Month, This Month, Last Quarter, This Quarter, Last Year, This Year. I can change the dates to whatever I want if there isn’t a Quick Date that suits me. The Search Terms textbox takes a space separated list of terms to search for.

And now the fun part. The code. This converts the Quick Dates into real dates

Private Sub cbxQuick_Change()
    Dim dtStart As Date, dtEnd As Date
    Select Case Me.cbxQuick.Value
        Case "Last Month"
            dtStart = DateSerial(Year(Now), Month(Now) - 1, 1)
            dtEnd = DateSerial(Year(Now), Month(Now), 0)
        Case "This Month"
            dtStart = DateSerial(Year(Now), Month(Now), 1)
            dtEnd = DateSerial(Year(Now), Month(Now) + 1, 0)
        Case "Last Quarter"
            dtStart = DateSerial(Year(Now), Month(Now) - (((Month(Now) - 1) Mod 3) + 3), 1)
            dtEnd = DateSerial(Year(dtStart), Month(dtStart) + 3, 0)
        Case "This Quarter"
            dtStart = DateSerial(Year(Now), Month(Now) - (((Month(Now) - 1) Mod 3)), 1)
            dtEnd = DateSerial(Year(dtStart), Month(dtStart) + 3, 0)
        Case "Last Year"
            dtStart = DateSerial(Year(Now) - 1, 1, 1)
            dtEnd = DateSerial(Year(Now), 1, 0)
        Case "This Year"
            dtStart = DateSerial(Year(Now), 1, 1)
            dtEnd = DateSerial(Year(Now) + 1, 1, 0)
    End Select
    Me.tbxStartDate.Text = Format(dtStart, "mm/dd/yyyy")
    Me.tbxEndDate.Text = Format(dtEnd, "mm/dd/yyyy")
End Sub

This makes sure a real date is entered, but provides for 6 or 8 digit date entry.

Private Sub tbxEndDate_Exit(ByVal Cancel As MSForms.ReturnBoolean)
    If IsDate(Me.tbxEndDate.Value) Then
        tbxEndDate.Text = FormatDateTime(tbxEndDate.Value, vbShortDate)
    ElseIf Len(tbxEndDate.Text) = 6 Then
        tbxEndDate.Text = DateSerial(Right(tbxEndDate.Text, 2), Left(tbxEndDate.Text, 2), Mid(tbxEndDate.Text, 3, 2))
    ElseIf Len(tbxEndDate.Text) = 8 Then
        tbxEndDate.Text = DateSerial(Right(tbxEndDate.Text, 4), Left(tbxEndDate.Text, 2), Mid(tbxEndDate.Text, 3, 2))
        MsgBox "You must enter a valid date."
        Cancel = True
    End If

End Sub

And the big one, the actual search. This is pretty long and needs to be refactored, but it works for now.

Private Sub cmdSearch_Click()
    Dim vaTerms As Variant
    Dim i As Long, j As Long
    Dim aFolders() As String
    Dim sFolder As String, sFile As String, lFile As Long
    Dim lCnt As Long
    Dim dtFolder As Date
    Dim sText As String
    Const sPATH As String = "\\yourserver\rawdata\"
    ReDim aFolders(1 To 1000)
    sFolder = Dir(sPATH & "*", vbDirectory)
    'get a list of folders in the date range
    Do While Len(sFolder) > 0
        If Len(sFolder) = 8 Then
            dtFolder = DateSerial(Left$(sFolder, 4), Mid$(sFolder, 5, 2), Right$(sFolder, 2))
            If dtFolder >= CDate(Me.tbxStartDate.Text) And dtFolder <= CDate(Me.tbxEndDate.Text) Then
                lCnt = lCnt + 1
                aFolders(lCnt) = sFolder
                sFolder = Dir
            End If
        End If
        sFolder = Dir
    ReDim Preserve aFolders(1 To lCnt)
    lCnt = 0
    vaTerms = Split(Me.tbxSearch.Text, Space(1))
    'Make a dummy result
    Me.lbxResults.AddItem vbNullString
    For i = LBound(aFolders) To UBound(aFolders)
        sFolder = sPATH & aFolders(i) & "\"
        sFile = Dir(sFolder & Me.cbxVendor.Value & "*.IN?")
        Do While Len(sFile) > 0
            'Show the current folder as a result
            Me.lbxResults.Column(0, 0) = sFolder & sFile
            'Open the file and read in all the text
            lFile = FreeFile
            Open sPATH & aFolders(i) & "\" & sFile For Binary As lFile
                sText = Space$(LOF(lFile))
                Get #1, , sText
            Close lFile
            'Loop through the space separated search terms and see if
            'they're in the file
            For j = LBound(vaTerms) To UBound(vaTerms)
                If InStr(1, sText, vaTerms(j), vbTextCompare) > 0 Then
                    'This is the animation part
                    Me.lbxResults.AddItem vbNullString, 0
                    Me.lbxResults.TopIndex = 0
                    lCnt = lCnt + 1
                    Exit For
                End If
            Next j
            sFile = Dir
    Next i
    'Get rid of the dummy
    Me.lbxResults.RemoveItem 0
End Sub

It takes about 60 seconds per month to search the files. That’s a long time so it’s necessary to entertain the user while he waits. The top entry in the results listbox is whatever the current file is. It rapidly changes the display as it loops through the folder. When there’s a hit, that file becomes the second entry and any prior hits move down. This little animation lets the user know that it’s still working and gives him a list of what hits have been found already.

You can download SearchTextFiles.zip

Setting the Tab Order of Userform Controls Programmatically

I hate designing userforms. I mean the part where I’m lining up controls, renaming controls, and all the other super-fiddly parts of making a form look and act right – like setting the tab order. I always wait until the very end so I don’t have to do it twice. And that Tab Order dialog? Forget about it.

When I’m finally ready to set the tab order, I find the first control and click Move Up until it’s at the top. Next, I find the second control and click Move Up until it’s just below the first control. Then I go to Whammyburger and force them to serve me breakfast. Then I go back and do the rest of the controls. It’s maddening. I decided finally to do something about it.

First I set a reference to the VBA Extensibility Library.

Then I put this code in a standard module

Public Sub FixTabOrder()
    Dim ctl As Control
    Dim i As Long, j As Long
    Dim lCnt As Long
    Dim vbc As VBIDE.VBComponent
    'Change this to the name of your userform
    Set vbc = ThisWorkbook.VBProject.VBComponents("UInvoice")
    For i = 1 To vbc.Properties("Width")
        For j = 1 To vbc.Properties("Height")
            For Each ctl In vbc.Designer.Controls
                If ctl.Top = i And ctl.Left = j Then
                    ctl.TabIndex = lCnt
                    lCnt = lCnt + 1
                End If
            Next ctl
        Next j
    Next i
    For Each ctl In vbc.Designer.Controls
        Debug.Print ctl.Name, ctl.TabIndex
    Next ctl
End Sub

This is pretty brute force, but it only take a few seconds to run, so who cares. The i and j variables move through the userform point-by-point, left to right, top to bottom (that’s the typographical point, not the agenda items point). For every point on the userform, it loops through every control to see if that control starts at that point. If it finds one, it sets the controls TabIndex to the next number. At the end it just prints out the control names and tabs.

Even if that’s not exactly how I want my tab order, it gets it close and I can make other modifications manually.

If you have a super-wide or super-long form, it will take longer. The right way to do this is to put all the controls in an array and sort them by their Top and Left properties. If this method is too slow, you should write that up. Then send me a copy.

DDoE Servers

Don’t call it a streak, but it’s been almost five days since MySQL crashed. I made a change late last week that masks the problem appropriately. I still don’t know what causes the problem, and probably never will, but as long as the server stays up, I’m in a better place. I’ve been fighting this for over a year.

I run DDoE on a 1GB virtual server at Digital Ocean. The heart of the problem is that a memory usage spike causes the Ubuntu kernel to shut down the MySQL service and it never gets started again. Memory management is one of the things a kernel does, so it’s not strange that services get kicked out of RAM when there’s a spike. MySQL will try to restart itself, but it’s not able to because it doesn’t have enough memory. I don’t know what’s causing the memory spike. I do know that the only services of consequence that are running are MySQL and a bunch of Apache2 services. Apache2 is the web server that directs the traffic.

I don’t recall where I read it, but someone on the internet said that it’s likely Apache2 is spawning too many processes. I don’t know how to verify that. I followed the advice on ServerFault on how to tune Apache on Ubuntu. When I shut down Apache, I could see that I was using about 660MB of memory, leaving 340MB free. I was surprised how much memory it was using without Apache, which I assumed would be the biggest component. Using the top command, I could see that an Apache2 process was using 30MB on average. That means if I limit my Apache2 processes to 10, I should not run out of memory. I changed the MaxRequestWorkers parameter from 150 to 10, and so far it’s worked.

There must be a cost to that change. I imagine that when the memory spikes, for whatever reason, and Apache isn’t allowed to spawn more processes, that the user gets an error and is unable to reach the site. As bad as that is, it’s better than what was happening – MySQL would shut down and nobody could reach the site until I got up the next morning and rebooted.

Now that we’re seemingly stable, maybe I’ll make some posts. I’ve been working in SQL Server a lot lately and have a lot to say about it.

Document Not Saved Error on ExportAsFixedFormat

Recently, I was accusing ExportAsFixedFormat of causing a problem with some code. It turned out to not be the culprit. But I was suspicious because I very often get a Document Not Saved runtime error when I export to PDF in a loop. When I get that error, I can click Debug and F5, and it happily continues until it errors again. Of course, watching something loop eighty-six times to make sure it doesn’t error kind of defeats the purpose of the loop. Today, I took a stand.

I have this code that changes the page fields on a pivot table and exports a range. The result is eighty-six PDFs in a folder that I sew together into one big report. There are two page fields, so there are two loops. Here’s the inner loop.

For Each rCell In .Range(.Cells(vaRows(i), 2), .Cells(vaRows(i), 2).End(xlDown)).Cells
    pt.PivotFields("Final").CurrentPage = rCell.Value
    wshDashFuelCust.Range("AK7").Resize(62, 4).ExportAsFixedFormat xlTypePDF, sPath & rCell.Value & "_" & aProducts(i)
Next rCell

You see my pathetic DoEvents attempt at avoiding the error. The error stops on the ExportAsFixedFormat line, so the DoEvents doesn’t actually help. When the error happens, a .tmp file is left in the directory. And when I do the Debug, F5 thing, the .tmp file stays there forever. Clearly this is a temporary file that would someday become a PDF if not for this error.

The .tmp file is my evidence that the processed finished. I can ignore the error and as long as there is no temp file, I’ll know the error never occurred. I rewrote the loop thusly:

For Each rCell In .Range(.Cells(vaRows(i), 2), .Cells(vaRows(i), 2).End(xlDown)).Cells
    pt.PivotFields("Final").CurrentPage = rCell.Value
        On Error Resume Next
            Kill sPath & Dir(sPath & "*.tmp")
            wshDashFuelCust.Range("AK7").Resize(62, 4).ExportAsFixedFormat xlTypePDF, sPath & rCell.Value & "_" & aProducts(i)
        On Error GoTo 0
    Loop While Len(Dir(sPath & "*.tmp")) > 0
Next rCell

As long as there’s a file with a tmp extension, I keep trying to export. The Kill statement needs to be inside the On Error because there won’t be anything to Kill the first time.

After a rigorous test of one time, it worked. I should have put a little loop counter in there to see how many times it errored. And maybe to exit out in case it gets into an infinite loop situation.

Converting SUMs to SUBTOTALs

Everyone knows that SUBTOTAL ignores filtered rows. Readers of DDoE know that SUBTOTAL also ignores other SUBTOTAL formulas. I tell everyone who will listen about the benefits of SUBTOTAL. It’s one of the best received tips in the ‘Tips and Tricks’ portion of the training I do. But I still get spreadsheets that use SUM and individual adding of cells. When I do, I convert them to SUBTOTAL to make sure there are no errors. Today, I decided to automate that process.

I’ve filled column B over to the right into column C so I can preserve the original data.

With Excel’s color coding and this simple worksheet, you may have spotted the error in the grand total formula. Below is the code I wrote to correct this situation without having to put in all the SUBTOTALs manually.

Public Sub ConvertSumToSubtotal()
    Dim rCell As Range
    Dim rStart As Range
    Const sSUM As String = "=SUM("
    'Only work on ranges
    If TypeName(Selection) = "Range" Then
        'Only work on single columns
        If Selection.Columns.Count = 1 Then
            'rStart will adjust to be where ever the SUBTOTAL range will start
            Set rStart = Selection.Cells(1)
            'loop through the cells and replace SUM with SUBTOTAL
            'change rStart to point to cell just below the SUBTOTAL
            For Each rCell In Selection.Cells
                If rCell.HasFormula And Left(rCell.Formula, 5) = sSUM Then
                    rCell.Formula = "=SUBTOTAL(9," & rStart.Address(0, 0) & ":" & rCell.Offset(-1, 0).Address(0, 0) & ")"
                    Set rStart = rCell.Offset(1, 0)
                End If
            Next rCell
        End If
    End If
    'Make the last cell a SUBTOTAL of the whole range
    Selection.Cells(Selection.Rows.Count).Formula = "=SUBTOTAL(9," & Selection.Resize(Selection.Rows.Count - 1, 1).Address(0, 0) & ")"
End Sub

This won’t work in every situation, but this layout is the one I see the most. This layout being SUMs for the subtotals and a big =A1+A2+A3 style formula for the grand total.

Once again SUBTOTAL saves the day and fixes the error. The most common error I see with this layout is in the grand total, but not always. Sometimes the subtotals don’t cover the correct range. It would seem easier when replacing the SUMs to use the same range the SUM uses, but I wanted to make sure I fixed any of those errors too. To do that, I SUBTOTAL from the cell below the previous SUBTOTAL to the cell above the current one.

Pro tip: Use Ctrl+` to toggle between viewing formulas and values (that’s an accent grave, left of the 1 key on US keyboards).

PowerPivot Stats

Here’s what a recent PowerPivot process looked like

  • 14,512,074 Rows
  • 7 Columns
  • 14:09 to load data from SQL Server at 30M
  • 5:36 to save file
  • 129MB file size
  • 0:14 recalc time
  • 1,559,700KB used by the EXCEL.EXE process

On this client machine:

  • Windows 8.1 Pro 64bit
  • Excel 2010 64bit
  • 36GB RAM
  • 2.4Ghz Xeon processor

Leave your story in the comments.