Since Excel 2007 (and later) can handle a million rows of data, I figured it made sense to explore PowerPivot (PP) with a data set larger than that. In addition, I have had a long time curiosity about the US census data. So, this seemed an appropriate time to combine the two. After some exploration of www.census.gov, I learned of http://usa.ipums.org/usa/index.shtml This is a project of the Minnesota Population Center of the University of Minnesota.
Do note that this post is not meant to be a step-by-step tutorial on how to use the IPUMS website. While it took some time, it took even more effort to understand the census data but an explanation of the data is also not my intent with this post. The focus is on loading a large data set in PowerPivot.
The final data set I downloaded expanded to nearly 18 million rows of data. But, it took a while before I got to the point of getting that data set prepped for PP.
A relatively superficial exploration of the IPUMS site led to two conclusions. One, the result of ‘analyze data online’ would yield results online but the result sets were intentionally limited in size. Two, trying to get detailed data through the ‘Create an Extract’ or the ‘Download or Revise Extracts’ led to large data sets but of the kind that were intended for use with SPSS, SAS, or STATA. The data set itself consisted of fixed length columns. In addition, there was a delimiter at the end of each row record (discovered through trial and error).
In addition to the data set, the website provided the command sets to load the data into SPSS, SAS, and STATA.
The SPSS instructions were fairly helpful in understanding the layout and the content of the data (other than the secret delimiter).
So, for one data set the columns were
In addition, the SPSS instructions also contained information about the text values associated with the numeric codes in the data set. For example, ownershpd values were:
10 “Owned or being bought”
11 “Check mark (owns?)”
12 “Owned free and clear”
13 “Owned with mortgage or loan”
21 “No cash rent”
22 “With cash rent”
The above information is important since we can create Excel tables of these number-to-text maps and link them to the data table in PP.
What made the process of getting the data in PP challenging was that PP will not read fixed column width data from a text file though it will read comma delimited text files. I would have thought this would be elementary but I couldn’t find any way to get PP to do so. Maybe, I missed something that would have made the rest of the exercise below unnecessary.
So, as far as I could tell, I had to convert the fixed-column-width text file into a file where the columns were delimited by some character. I picked the comma. But, I couldn’t use Excel (too many records) or Word (too large a file) to do the conversion. I also tried Notepad and WordPad before falling back on true and tried VB(A).
I did not start with the 18 million record data set. I used the smaller data set for a single census from the 19th century to explore the structure of the data. Through trial and error I discovered the secret delimiter at the end of each record. I also discovered that the software did not summarize the data in any way. The code that I felt confident I could use for data from all the censuses combined is below.
It’s not elegant. It’s not flexible. It’s not meant to be either. At some point I might be tempted to paste the information about the column names and widths into a Excel worksheet and have the code automagically work off that. But, for now, the goal was to prep the data for PP.
One interesting thing I did learn was that using FreeFile() twice in succession yields the same integer value. That, of course, leads to an error when opening the 2nd file. So, the correct approach is to get a free file number, open that file, and then get a second free file number for another file.
Also, if one were to add up the column widths of the various fields one would get 101 but the record being read has 102 characters. The last byte is the delimiter that I want to drop.
Sub writeOneRec(OutF As Long, Rslt As String)
Dim Yr As Integer, DataNum As Integer, Serial As Long, _
NumPrec As Integer, HHwt As Integer, _
StateFIP As Integer, County As Integer, Urban As Integer, _
City As Integer, CityPop As Long, _
UrbPop As Integer, _
OwnerShip As Integer, MorgTotal As Integer, _
MorgAmt1 As Integer, RentGross As Integer, _
HHIncome As Long, _
PerNum As Long, PerWt As Integer, Age As Integer, _
Sex As Integer, MarSt As Integer, Race As Integer, _
GradeAtt As Integer, Occ1950 As Integer, Ind1950 As Integer, _
IncTot As Long, FamTotInc As Long
On Error Resume Next
Yr = Left(Rslt, 4)
DataNum = Mid(Rslt, 5, 2)
Serial = Mid(Rslt, 7, 8)
NumPrec = Mid(Rslt, 15, 2)
HHwt = Mid(Rslt, 17, 4)
StateFIP = Mid(Rslt, 21, 2)
County = Mid(Rslt, 23, 4)
Urban = Mid(Rslt, 27, 1)
City = Mid(Rslt, 28, 4)
CityPop = Mid(Rslt, 32, 5)
UrbPop = Mid(Rslt, 37, 5)
OwnerShip = Mid(Rslt, 42, 2)
MorgTotal = Mid(Rslt, 44, 4)
MorgAmt1 = Mid(Rslt, 48, 5)
RentGross = Mid(Rslt, 53, 4)
HHIncome = Mid(Rslt, 57, 7)
PerNum = Mid(Rslt, 64, 4)
PerWt = Mid(Rslt, 68, 4)
Age = Mid(Rslt, 72, 3)
Sex = Mid(Rslt, 75, 1)
MarSt = Mid(Rslt, 76, 1)
Race = Mid(Rslt, 77, 3)
GradeAtt = Mid(Rslt, 80, 2)
Occ1950 = Mid(Rslt, 82, 3)
Ind1950 = Mid(Rslt, 85, 3)
IncTot = Mid(Rslt, 88, 7)
FamTotInc = Mid(Rslt, 95, 7)
On Error GoTo 0
Write #OutF, Yr, DataNum, Serial, NumPrec, HHwt, StateFIP, _
County, Urban, City, CityPop, UrbPop, OwnerShip, _
MorgTotal, MorgAmt1, RentGross, HHIncome, _
PerNum, PerWt, Age, Sex, MarSt, Race, _
GradeAtt, Occ1950, Ind1950, IncTot, FamTotInc
Dim X, Rslt As String, OutF As Long
X = FreeFile()
Open “c: usharworkusa_00009.datusa_00009.dat” _
For Binary As #X Len = 102
OutF = FreeFile()
Open “c: usharworkusa_00009.txt” _
For Output As #OutF
Write #OutF, “Yr”, “DataNum”, “Serial”, _
“NumPrec”, “HHwt”, _
“StateFIP”, “County”, “Urban”, _
“City”, “CityPop”, _
“OwnerShip”, “MorgTotal”, _
“MorgAmt1”, “RentGross”, _
“PerNum”, “PerWt”, “Age”, _
“Sex”, “MarSt”, “Race”, _
“GradeAtt”, “Occ1950”, “Ind1950”, _
Rslt = String(102, ” “)
Get #X, , Rslt
Do While Not EOF(X)
writeOneRec OutF, Rslt
Get #X, , Rslt
Close #X: Close #OutF
And, so now, I have a CSV data set ready to load into PowerPivot.