Retrieving Lost Comments

I’ve restored a few posts in the last few months that were lost. I didn’t restore any of the comments. Honestly, I should have but I didn’t even think about it. But when I went to restore the In Cell Charting post, I noticed there were 85 comments. That seemed worth my while.

First I set a reference to Microsoft XML, v6.0 and Microsoft HTML Object Library. Here’s the main procedure.

I looked at the source for the web page to figure out how it was laid out and how to get at the data I needed. The CComment and CComments classes store the data as I loop through the list index items in the comment list. The first CComment method is AddNameFromCite. I didn’t even know there was a Cite tag in HTML (but you could fill a warehouse with what I don’t know about HTML).

I made this a method because I generally reserve properties to getting/setting values. If I change more than one property or do any extensive manipulation, I go with a method instead of a property. I’m not uber-consistent about it though. The comment author’s name is the innertext of the PhraseElement (that’s what a Cite is, at least according to the TypeName function). To get the AuthorLink, I need to find the anchor and get the href attribute. Because the wayback machine put its own URL in from of other URLs, I had to find the second instance of “http://” to get the real link. Next the AddDate method.

This really should have been a property instead of a method, but oh well. The innertext of the DivElement is something like “January 1, 2010 at 6:16 am”. I split that string on the “at” and used DateValue and TimeValue to build a date. Finally the content of the comment.

I passed in a collection of elements that are ParaElements (tag=p=paragraph). Then I looped through them and concatenated a string for the content. By looping through just the p elements, I skip all the comment meta crap that is auto-generated by WordPress and just get to the text.

At this point I have 85 CComment objects and I’m ready to build the SQL string.

Just a bunch string building and putting in a file that I can import into PHPMyAdmin. In the CComment class, the values are put together like this

I really like this method of building a string – putting it into an array and using Join – so I think I’ll start using it. The EscSq function turns any single quotes into two single quotes. The ContentScrubbed property converts any vbNewLines into \r\n. I exported some existing comments from MySQL to see how all this stuff went together. In the end, I ended up with a file that looks like this.

phpMyAdmin kept erroring out that the file was using too much memory. It’s 51kb, so I knew that wasn’t true. But the helpful people at HostGator imported it for me and set me up with console access so I can do it myself next time. I just need to learn the commandline stuff for importing.

I took a quick look through through the comments and they look alright. It’s hard to tell what I screwed up formatting-wise because some people use code tags and most don’t. But the info appears to be there and that’s the most important thing. I guess since I have this set up, I should go back and make sure any other lost posts get their comments too.

As always, if you see something that’s not right on the site, shoot me an email. I have a few hundred posts that still look like crap, but are readable and I’m fixing them as I see them.

MSHTML Version and getElementsByClassName

Doug alerted me that Rob van Gelder’s In-Cell Charting post was yet another casualty of the great data loss of 2011. I got the post back up, but there were 85 comments on that post that really deserve to be put back. And I’m working on that.

To that end, I wrote some code that includes the getElementsByClassName method of the HTMLLIElement object.

I’m parsing through the HTML from the wayback machine to reconstruct the database entries. When I ran this code on a different machine, I got an error. It did not recognize getElementsByClassName as a valid method. I checked the Object Browser and, sure enough, it wasn’t listed. Googling was fruitless. The only thing I saw was that IE doesn’t support getElementsByClassName prior to version 9. Well, that doesn’t apply to me. I’m not automating IE, I’m using the MSHTML Object Library.

So I check the version of mshtml.tlb and find that it’s 4.0.

I check on the other machine and…

The exact same version with different members. Nice. Finally, it dawned on me that I probably have never updated IE on the second system. I checked and noted that I still had version 8 installed. I updated to IE9 and the code ran perfectly. I checked the ObjectBrowser and the method was listed. I checked the TLB version and it’s still 4.0.

Clearly I don’t understand how object libraries work. I had assumed that MSHTML was independent of IE. It makes sense that they’re related, but I would have thought that IE9 used the object definitions in MSHTML, not the other way around. In reality, they’re probably both using object definitions from a third file. Anyway, I’m happy to have the problem solved, if not the mystery.

Seven Segment Display

I was reading about seven segment displays over at Sparkfun and thought it would be a fun exercise in Excel. I’m sure it’s been done a million times, but not by me. The first one was VBA based. Type a number into a cell and this code fills cells to display the number as a seven segment display.

OK, it’s really a 13 segment display – the seven segments and six connecting cells. Next, I did the same thing with conditional formatting. I tried to make the conditional formatting formula consistent across the cells, but I just couldn’t. The TRUEs and FALSEs change for each cell depending on if that cell is lit for that number.

Here’s the CF formula for cell H3.

=CHOOSE(MID(TEXT($C$9,"0000"),(COLUMN()+MOD(MOD(MOD(16-COLUMN(),12),8),4))/4,1)+1,TRUE,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE)

H3 is lit for every number except 5 and 6. There’s data validation on the input cell to keep it under five digits. The CF formula is a CHOOSE function with nine TRUEs/FALSEs. To determine which character to represent, I use a MID function after padding the text to four digits. The starting position (second argument of MID) is determine by this:

(COLUMN()+MOD(MOD(MOD(16-COLUMN(),12),8),4))/4,1)

Column 16-Column Mod 12 Mod 8 Mod 4 Column+ /4
2 14 2 2 2 4 1
3 13 1 1 1 4 1
4 12 0 0 0 4 1
6 10 10 2 2 8 2
7 9 9 1 1 8 2
8 8 8 0 0 8 2
10 6 6 6 2 12 3
11 5 5 5 1 12 3
12 4 4 4 0 12 3
14 2 2 2 2 16 4
15 1 1 1 1 16 4
16 0 0 0 0 16 4

You can download SevenSegment.zip

Placing Chart Data Labels

I made this chart in Excel 2003:

I used this data:

Vehicle 1/31/2012 2/29/2012 3/31/2012 4/30/2012 5/31/2012 6/30/2012 7/31/2012 8/31/2012 9/30/2012
Cars 603 615 627 634 646 659 672 679 693
Trucks 405 413 433 442 451 440 430 445 475
Vans 545 556 567 578 590 602 590 585 580
SUVs 465 512 527 570 604 634 672 726 750
Hybrids 510 519 571 628 672 725 775 805 855

Normally I would put the series name at the end of the line (because I like how it looks), but I wanted to include the starting and ending values too. So I decided to put the series name on the second to last data point. If all the labels are at the top or bottom, it looks terrible because they run into other lines. I needed to put the label above or below the line based on how much space was available. That was harder than I thought it would be.

I ended up looping through the collection a lot: once to store the values at position 8, another couple times to sort, and yet another time to apply the labels. I started with some of Peltier’s code and modified for my needs.

The lines diverge toward the end, so using the 8th data point turned out good. If they converged, it wouldn’t really matter whether there was more space on the top or bottom, they would still run into each other. In that case I would change lPNTOFFSET to a more appropriate place.

Cloning Class Instances

I’m working on a project that requires me to make multiple copies of a custom object. That generally means I make a Duplicate method and assign each property of the source instance to the copy instance. It’s not terrible, but I just have to be careful that when I change the properties of the class that I update the Duplicate method. I decided that I would see what other options are available and I found this StackOverflow answer that I think is interesting. So I tried it.

Instead of declaring a bunch of private variables in the class and all the getters and setters for those variables, all of the property values will be in a user-defined type. The UDT variable will be private to the class, but the actual UDT has to be in a standard module because VBA won’t let you put types in a class module. Now when I want to clone that class instance, I can pass the whole UDT variable. Unlike a class that passes the reference, the UDT makes a copy of all of the values in memory so you end up with two separate structures in memory – just what I want for a clone.

I made a CContact class that will have a FirstName, LastName, and LastContact property. The poorly named LastContact property is the date last contacted. I wanted to put at least two different data types (String and Date) in the class for demonstration purposes. The declaration of CContact looks like this:

I only have two private variables; ContactID that never want to copy and a ContactMomento variable that contains all the data that I do want to copy. I keep the ContactID getter and setter the same as I normally would, but the mtContactMomento property statements are a little different. First, let’s look at the ContactMomento type. This is in my MGlobals standard module.

I named the type ObjectName + Memento. Memento is from the Memento pattern, which is typically used to store a previous state for Undo purposes, although we’re not really using it like that here. The UDT holds all of the properties that I want to copy to a new class instance. Back in the class, the typical getters and setters look like this:

That’s pretty much the same as typical property statements except I’m pulling the data from my private ContactMemento variable instead of individual variables for each project. With that all set up, this works just like a class that I would create normally – not using a UDT. The last bit of setup that I need is property statements for the ContactMemento as a whole.

And then I need a way to clone the class, so I create a Clone property.

By assigning the ContactMemento from the source instance to the class instance, all the properties I want to copy are done so in one shot rather than one at a time. If I add any properties to the class, I don’t have to adjust the Clone property, which means that I won’t forget and miss a property. Let’s test it out. If create one instance, then clone it and change the LastContact property just for some variety.

That outputs:

I like a system that doesn’t rely on me being careful to update a Clone or Duplicate method. But there are some things I don’t like too. With this method, my class is not longer self-contained. The UDT declaration has to live in a standard module. If I copy my class module to another project, it won’t compile until I declare the UDT in the new project. I would also have to change all of my code generation stuff to use a UDT rather than individual private variables. In all the code I’ve written, I’ve probably only had to clone class instances about a half dozen times, so it’s probably not worth it.

You can download CloneClass.zip

Testing First

I’m not a TDD guy, but I am intrigued by it. With TDD, you write a test that fails and then write code to make it pass. Then you write another test that fails and write code to make it pass. Every so often you stop to refactor the code. I’m not an expert, so this may be overly simplistic or just plain wrong. If I used this methodology, I would end up with a crap load more tests than I do now. Maybe that’s a good thing.

What I like, in principal, about TDD is that nothing slips through the cracks. What I don’t like about it is that implementing it in VBA is probably a big headache. Mike Woodhouse has done some good work in this area, but I wanted to trying something simple from scratch to see what there is to see.

In my scenario, I’m computing commission for salesguys. A salesguy has a standard rate that he earns up to $100k of sales. From $100k to $500k, he gets 1.5 times that rate. And from $500k up, he gets 2x that rate. My first test is to check the commission on the first dollar.

This doesn’t even compile. Literally all I did was add a module called MTest and write this test. I get no intellisense because I haven’t even created the classes yet. The first compile error is on the csalesguy line, so I’m going to create that class.

Now I need a CSales class. I create a CSale class with SaleDate and Amount and use my VBHelpers add-in to create the CSales parent class. The next compile error is that I don’t have a Commission property of the CSalesGuy class.

And in the CSales class:

Everything compiles, I run my test, and it works. So far so good. Next I want to test just below the first threshold, so I add another test and a procedure to run all tests.

And they pass. I guess I didn’t write a test that failed, but I will this time. Next I want to test just over the first tier of sales.

That fails. So I fix my commission property.

Note that I changed the data type to Currency to eliminate the precision errors. This is obviously not optimal code, but I think the idea is to write the minimum code to make the test pass, then refactor. Next up, I test just under the next threshold.

That passes as expected. My last test case is over the second threshold.

Now to fix up Commission to pass the test

And they all pass. Now I have only to refactor. That’s not insignificant as this is some really crappy code. The first thing I want to do is create a CRates class to hold all of the rates. The business rule of 1.5x and 2x is fine, but that could change so I need some flexibility. I create a CRate class with Rate and Threshold properties and make the parent CRates class a child of the CSalesGuy class. In CSalesGuy:

Then I create a test setup function to instantiate the SalesGuy

Finally, I need to change the Commission property to use the Rates rather than the (now eliminated) StandardRate property.

That requires an IncrementalRate property of the CRate class.

All of my tests pass, so I should be good right? It’s nice to be able to refactor the code and see that all the results are still correct. I made the commission rates model more flexible to allow for an unlimited number of sales tiers, but I’m still only testing my original business rules. Maybe I should add more tests, but then it’s not really the tests that drive the development. Do you write tests first? If so, how would you do it differently?

You can download TDD_VBA.zip

String Diffing

I’ve wanted to have some wiki-like diffing in my userform textboxes for a while now. Since I’ve been using wikis almost daily, I want the revisioning feature in everything I do. I’m not there yet, but I decided to see what kind of algorithm I would need to do it. I read the Wikipedia article on longest common subsequence and played around with it a little.

This code is called LCSLength in the article. It returns a matrix (2d array) with counts of matching elements at each position. For instance, if you’re diffing “Dick” and “Rick”, they have three letters in common and this table will compute that. It looks like this

R i c k
0 0 0 0 0
D 0 0 0 0 0
i 0 0 1 1 1
c 0 0 1 2 2
k 0 0 1 2 3

The rest of the functions use this table to figure out what’s what.

This function (called backtrack in the article) traces back through the table and outputs the longest common subsequence. It’s a recursive function (it calls itself) and continually appends letters (or other elements) on to the return string.

When both the i and j counters are zero, it stops calling itself. Otherwise, if the two letters match, it appends the current letter to the end and calls itself using the element up and to the left. If there’s no match, it goes to the larger of the element above (i-1) and the one to the left (j-1). By following the path of the larger numbers through the matrix, it can find the common letters. It’s originally called with the largest i and j – in the above table, it’s called looking at the 3 (the bottom right cell). Here’s how it tracks through the matrix (I’ll use cell references, but it’s not really cells).

  1. F6: k=k, so add k to the end of the string.
  2. E5: c=c, so add c to the end of the string.
  3. D4: i=i, so add i to the end of the string.
  4. C3: D <> R so find the larger of C2 or B3
  5. C2: i=0 so that’s it.
  6. Return “ick”

Thrilling, isn’t it?

This is another recursive function working backward through the matrix. When it finds a match, there’s no prefix. If it’s a new element (in Revised, but not Original) the prefix is a “+”. If it’s a deleted element, you get a “-“. This prints the results to the immediate window. Let’s look at some examples.

This shows the diff on a letter-by-letter basis.

The first line is a printout of the longest common subsequence. The rest is a letter-by-letter diff that shows which elements were added, deleted, and unchanged. We can also diff on words.

Instead of filling the array with letters, I split the string on spaces to get words. Note that I put a leading space in front of each string before the split. The array needs to be 1-based and the Split function is zero based. The array doesn’t actually need to be 1-based, but the first row and column is ignored, so I made sure that it was something I didn’t care about. Once the arrays are filled, everything is the same.

Traditionally, diffing text is done line-by-line. So let’s do that. I found an example essay and made two files; OriginalDiff.txt and RevisedDiff.txt. I changed one thing in Revised and used this code to diff them.

And that’s as far as I got. Next, I need to put the diffs into a database so I can display diffs and revert to prior versions. Or, quite possibly, I’ll lose interest because I don’t have a burning need for this. It’s just something I’ve wanted to do.

You can download Diffing.zip