A Hardcore Declaration of Independence (cont'd)

Appendix A. The Dictionary Class

What exactly is a Dictionary? Well, according to what passes for documentation on the subject, "A Dictionary object (sic) is the equivalent of a PERL (sic) associative array." So why aren't we all using Perl instead of reading crap from someone who can't spell the name of the language and doesn't know the difference between a class and an object?

I tried to make sense out the documentation, but alas there's no sense to be made. You should write the docs off as a total loss. They leave gapping holes in the description of the class members, and give no context to explain the purpose. The few examples are in VBScript, and are thus perfect examples of how not to write VB code. The only way you're going to figure out what the Dictionary class is all about is to learn how Perl associative arrays work. And once you learn that, you'll see that the Dictionary class is hardly an equivalent to real associative arrays--although its features are similar.

The problem is that Perl associative arrays work in the full context of the Perl language‑-which is optimized for text and array processing. Visual Basic has added some new Perl-like language features and tools, but they fall far short of offering Perl's power to process files and other text with a few lines of code. But once you get the idea, it's relatively easy to add new Perl-like tools that do what the built-in ones fail to do.

Note: Logically, Appendix A fits in Chapter 4 after the end of "The Collection Class" and before "What Makes a Collection Walk?" (page 196). It is located here because it is too long for the page-by-page corrections section.

Not Just a Better Collection
The first thing you should understand about the Dictionary class is that it's not just an attempt to fix the obvious problems of the Collection class (see Chapter 3, "The Collection Class," page 191). The specification for VB5 originally had a Dictionary class. Whenever I complained about the failings of the Collection class, I was told that those problems would be fixed in Dictionary, so there was no need to fix Collection. Then they ran out of time to complete Dictionary, and Collection didn't get fixed either.

So when Dictionary made its appearance in VB6, I was expecting a better Collection. Not so. There is little connection or overlap between the two classes. The people who wrote this class for VBScript apparently had something very different in mind. If you try to use a Dictionary like a Collection, you'll get nothing but frustration. But a little frustration never hurt anyone. Let's try using a Dictionary as if it were a Collection.

But first, before you even think about using Dictionary, you have to check Microsoft Scripting Runtime in the References dialog. The Dictionary class is not part of the VB6 run-time library. It's a part of the Microsoft Scripting Host that happens to be shipped with VB6. You can use it just as easily with VB5 or even with VB4. If you have Windows 98 or the Windows NT Option Pack, you have the Dictionary class (as well as the FileSystemObject class). If not, you can download the Scripting Host from http://msdn.microsoft.com/scripting/. This means that if you use Dictionary, you'll have to ship the Microsoft Scripting Runtime (SCRRUN.DLL) with your product. You can't count on all your users having it.

Let's start by putting some items in a Dictionary:

Dim animals As New Dictionary
With animals
    .Add "Lion", "fierce"
    .Add "Tiger", "cunning"
    .Add "Bear", "savage"
    .Add "Shrew", "bloodthirsty"
    .Add "Weasel", "ruthless"
End With

The syntax for Add looks kind of like Collection Add, but it's not. The first argument is the key, and the second is the item, which is the reverse of the Collection order. You'll see another difference when you iterate through a Dictionary:

Dim animal As Variant
For Each animal In animals
    Debug.Print animal
Next

This prints: Lion, Tiger, Bear, Shrew, Weasel. At first glance this might seem normal, but when you look closer, you'll see that you're printing the keys, not the items. A Collection would print: fierce, cunning, savage, bloodthirsty, ruthless. This seems a little weird. Why won't it print the items? Well, it will if you ask it the right way:

Dim character As Variant
For Each character In animals.Items
    Debug.Print character
Next

How does this work? Well, it turns out that a Dictionary has two array properties. There's a Keys method containing all the keys, and there's an Items method containing all the items. A Collection only has an Items method. You can't get at the keys of a Collection--one of the flaws that I flamed about in Chapter 4. Dictionary fixes that by providing a Keys method. For Each blocks iterate Keys rather than items because this is the method used by the hidden NewEnum property (see Chapter 4, "What Makes a Collection Walk," page 198). In other words this:

For Each animal In animals

Actually means:

For Each animal In animals.Keys

This seems like a weird change if you expect Dictionary to be like Collection, but the difference will eventually make sense.

You might wonder whether indexing in the Dictionary class is one-based as in the Collection class or zero-based as in the Forms or Controls collections. Dictionary has a Count property like Collection's, so maybe it's one-based:

Dim i As Long
For i = 1 To animals.Count
    Debug.Print animals(i)
Next

This code compiles and runs, but it doesn't print anything. Apparently the Dictionary class isn't one-based. In fact, it isn't indexed at all. If you look at the locals window while you execute this code, you'll see that for each iteration a new element is added to the Dictionary with the keys 1, 2, 3, 4, and 5. The value for each is empty. Apparently any access of an Item (the default member) through a key adds that key to the Dictionary. You may not guess the intention of this behavior (and the documentation doesn't say), but it certainly isn't like a Collection. By the way, Dictionary has an Exist member so that you can check for a key to avoid adding those empty Dictionary items.

If you look at the Keys example in the documentation, it appears to iterate by index, but if you look more carefully you can see that the array returned by the Keys method is assigned to another array, and that array is in turn iterated by index. Why would you want to do that? Patience. We'll get to that.

Another difference from a Collection is that you can change on item with the assignment operator:

animals.Item("Lion") = "ferocious"  ' Full syntax
animals("Bear") = "wild"            ' Shortcut using default member

The reason this works with Dictionary is the same reason it ought to work with Collection. Dictionary implements Let and Set properties for its default Item property. Collection doesn't. This syntax can change an existing item, or create a new one. In other words, the Add method is redundant. I usually add new members by assignment:

animals("Mouse") = "timid"
' Same as animals.Add "Mouse", "timid"

To mention a few other differences, Dictionary has a CompareMode property with text and binary settings so that you decide whether "Tiger" and "tiger" are the same key. It also has a RemoveAll method so that you can clear a Dictionary quickly--an operation that is surprisingly tricky with Collection. Finally, it has a Key property which allows you to change the key of an existing item. Key deserves a special note as the first write-only property in history--or at least the first I've encountered. Of course an indexed Property Get for Key would be meaningless since you would have to supply the key to get the key, but I still think they should have supplied a Get to be orthogonal.

Given all these features, you can see that the Dictionary class is quite different than the Collection class as it now stands--although I doubt it would take much to add the same features to Collection. But knowing the Dictionary features doesn't give you much clue about what it's for. I happen to know what it's for because I spent several hours studying Perl associative arrays. It's a shame that you have to learn another language to figure out what the documentation should tell you, but that's how it is.

Associative Arrays and Dictionaries

A VBScript Dictionary object (or a Perl associative array) is simply a table of names (called keys) and values. If you take the name literally, a Dictionary data sample might looks this:

Keys

Items

a

the first letter of the English alphabet

aardvark

a large burrowing nocturnal African animal

Aaron

a brother of Moses and high priest of the Hebrews

aba

a fabric woven from the hair of camels or goats

abaca

a fiber obtained from the leafstalk of a banana

Surely, you think, there must be more to a Dictionary than that. But no. That's it. You tell it the key; it tells you the value.

In Perl, the items are always strings, but in Visual Basic items are Variants and can contain any Variant subtype including objects and, starting in version 6, UDTs. This means that a Dictionary can serve as a simple database for those occasions when a real database is overkill.

In Perl, associative arrays work closely with regular arrays, which work closely with strings. There's a whole set of operators and functions to process strings and arrays. The most useful with Dictionaries are filter functions that take some input--a string, a file name, an array--and returns an array of strings (technically an array of Variants containing Strings). VB6 adds two of these, Filter and Split, but serious Dictionary work requires all kinds of filters. I had to write several others in order to illustrate a practical way of creating a useful mini-database in a Dictionary.

Here's the code to create a database of all the words in a text file with the number of times each word occurs. This code is from the TDictionary.vbp project, and it counts the words in the TDictionary.frm file:

Dim dicWordCount As New Dictionary
Dim avsWords() As Variant, i As Long, sKey As String, sSep As String
' Separate with all the symbols used in VB programming
sSep = sWhitePunct & sQuote2 & sQuote1 & "()#<>+-*/=&\:;"
' Read the words from a file into an array
avsWords = GetFileTokens("TDictionary.frm", sSep)
' Put all the words into a table, counting each
For i = 0 To UBound(avsWords)
    sKey = avsWords(i)
    ' Increment count when a word is encountered
    dicWordCount(sKey) = dicWordCount(sKey) + 1
Next

The sWhitePunct constant used here comes from the Windows API Type Library and includes white space characters and punctuation characters. The GetFileTokens function is from the Parse module in VBCore. We'll look at the code for it later.

After executing this code, you have a Dictionary full of data, but using that data for anything useful is a separate problem. For one thing the data will be unsorted. Perl documentation clearly states that associative arrays have an undefined order. VB Dictionary documentation doesn't say anything about order, but I have found by experience that the order when iterated with For Each is the order in which items were added. I wouldn't count on this order even if it were meaningful. In any case, you'll probably want to have sorted data for most reports. You might also like to apply other operations--such as filtering out all the numeric tokens.

In order to see different views of your data, you don't need to change the order or content of the Dictionary. You just need to get a different view of the keys, and then use the keys to look at the corresponding items. They tell me that this is what database programmers do with queries, but as the only Visual Basic programmer in the world who has never used a RecordSet, I wouldn't know. Here's the code to apply two filters to the keys of the word count Dictionary:

Dim vas As Variant
' Sort the keys
vas = SortKeys(dicWordCount.Keys, fDescend, cmp)
' Filter all numbers out of the list using a Like pattern
vas = FilterLike(vas, "#*", False)

The first function sorts the keys and returns them in an array. The fDescend and cmp arguments determine whether the sort is ascending or descending, and whether the sort comparisons are binary or text (vbCompareBinary or vbCompareText). These variables are linked to Descending and Text checkboxes on the test form. SortKeys is in the Sort module in VBCore.

The second filter is based on my FilterLike function from the Utility module of VBCore. FilterLike is a more powerful version of VB's new Filter function. Filter can be used to filter out or in all the strings from an array strings that contain or don't contain a given substring. For example, I could have applied Filter like this to get all the strings containing the substring "as" in either upper- or lowercase:

vas = Filter(vas, "as", True, vbTextCompare)

FilterLike does the same thing except that it uses the patterns recognized by Visual Basic's Like operator.

I showed the two filters separately for clarity. It's actually more efficient to chain them together like this:

vas = FilterLike(SortKeys(dicWordCount.Keys, fDescend, cmp), _
                 "#*", False)

Once you have the keys you want in the order you want, it's easy to use the corresponding items:

 
' Output the table contents
Dim sOut As String
For i = 0 To UBound(vas)
    sKey = vas(i)
    sOut = sOut & sKey & " : " & dicWordCount(sKey) & sCrLf
Next

The items here are integers, but it's not hard to see that you could process Dictionary data full of objects or UDT variables in a similar way. The TDictionary project has another example in which file data (length, date, attributes, and so on) are stored in Variant UDTs in a Dictionary.

Incidentally, notice that we always use the Keys property to access the Items. We never access the items directly. That's probably why the internal enumerator that works with For Each iterates over the keys rather than the items. But frankly, there's not much reason to ever iterate through keys or items directly because their order is usually random.

Arrays, Strings, and Variants

All the filters shown so far--GetFileTokens, SortKeys, Filter, and FilterLike--return a Variant rather than an arrays of strings or even an array of Variants. This is a bad design choice, but one forced on us by the origin of the Dictionary class and the Filter and Split functions. They came from VBScript, which recognizes only one type--Variant. This is fine for VBScript, but an efficient VB6 Dictionary class would return a String array for Keys, a String for Key, a Variant array for Items, and so on.

But the more you check the actual types used by the Dictionary class, the more surprises and anomalies you'll find. Most methods and properties use Variants, even when that type seems inappropriate. My first thought was that this was a requirement for VBScript, but in fact there are a few specific types used in Dictionary. The Exist method returns Boolean, the Count property returns Long, and the CompareMode property returns a CompareMethod enum.

Even more surprisingly to me was what I found in some of the Variants. For example, the Keys method returns a Variant containing an array of Variants. If you're used to associative arrays in Perl or Dictionaries in the C++ Standard Template Library, you would probably assume that the Variants in the Keys array would contain Strings. Not so. You can actually add keys of any type--Date, Currency, whatever--and they will be stored internally in the entered type. This is a curious fact, but I don't think a very relevant one. Perhaps some reader can think of a good reason to use a type other than String for keys, but I can't. All of my filter functions assume String keys.

As an example of how to write a filter function, let's take a look at my Splits function from the Parse module:

Function Splits(sExpression As String, _
                Optional sDelimiters As String = sWhiteSpace, _
                Optional cMax As Long = -1) As Variant
    Dim sToken As String, asRet() As String, c As Long
    ' Error trap to resize on overflow
    On Error GoTo SplitResize
    ' Break into tokens and put in an array
    sToken = GetToken(sExpression, sDelimiters)
    Do While sToken <> sEmpty
        If cMax <> -1 Then If c >= cMax Then Exit Do
        asRet(c) = sToken
        c = c + 1
        sToken = GetToken(sEmpty, sDelimiters)
    Loop
    ' Size is an estimate, so resize to counted number of tokens
    ReDim Preserve asRet(0 To c - 1)
    Splits = asRet
    Exit Function
    
SplitResize:
    ' Resize on overflow
    Const cChunk As Long = 20
    If Err.Number = eeOutOfBounds Then
        ReDim Preserve asRet(0 To c + cChunk) As String
        Resume              ' Try again
    End If
    ErrRaise Err.Number     ' Other VB error for client
End Function

First let's look at the signature. This code returns a Variant, even though the data being returned is strings. VB6 adds the ability to ability to return arrays of any type from functions or properties, but we're ignoring that new feature. VB6 also adds the ability to assign one array to another, but we're not using that one either. You could return arrays in Variants and assign arrays to Variants in VB5, and since that's what the Dictionary does, it's what I do. This has the extra advantage of making my filter functions portable to VB5.

You might ask why I'm writing a Splits function instead of using the new Split function provided in VB6. At first glance, the signatures of Split and Splits look very similar. Unfortunately, Split can only "parse" using a single character separator rather than a list of separators. The "Split Sucks" section in Appendix B has a few more choice words about the uselessness of this function and its author.

The Splits code itself isn't very interesting. We've already seen the overflow resizing technique in Chapter 4, "Vectors as Resizeable Arrays," page 179, and the GetToken function in Chapter 5, "Code Review," page 253. What is interesting is how easy it is to write the GetFileTokens function once we have the Splits function:

Function GetFileTokens(sFile As String, _
                       Optional sDelimiters As String = sWhiteSpace _
                       ) As Variant
    ' File to string to token array
    GetFileTokens = Splits(MUtility.GetFileText(sFile), sDelimiters)
End Function

Here is a list of filter functions from VB and VBCore:

Function

Module

Description

GetFileTokens

Parse

Splits the contents of a text file into a token array

GetFileLines

Utility

Splits the contents of a text file into a line array

FileArray

Utility

Returns array of file names matching a file spec

Splits

Parse

Splits a string into a token array

QSplits

Parse

Splits a string into a quoted token array

Split

VBA

Fails to split a string into a token array

Filter

VBA

Filters elements in or out of array using substring compares

FilterLike

Utility

Filters elements in or out of array using the Like operator

SortKeys

Sort

Returns a sorted version of array

ShuffleKeys

Sort

Returns a shuffled version of array

ReverseKeys

Utility

Returns a reversed version of array

VB5 doesn't have the Filter and Split functions--or at least it didn't until now. They're defined in a conditional block (#If iVBVer <= 5) in the new VB6Funcs.bas module. In fact, VB5 users get a Split function that actually works. It uses code similar to the Splits function shown earlier. The new Join function is also there, as is an inefficient implementation of InStrRev.

The new array features in VB6 raise some interesting questions about the relationship between arrays and variants. Consider the following declarations:

Dim avsT() As Variant, asT() As String
Dim vasT As Variant, vavT As Variant
ReDim avsT(0 To 2) As Variant   ' Array of variants containing strings
ReDim asT(0 To 2) As String     ' Array of strings
ReDim vasT(0 To 2) As String    ' Variant containing array of strings
ReDim vavT(0 To 2) As Variant   ' Variant containing array of variants

They might all contain exactly the same data, but does that mean that any one can be assigned to any other? No, a test shows that some combinations are illegal:

'avsT = asT  ' Can't assign string array to variant array
'avsT = vasT ' Can't variant with string array to variant array
'asT = avsT  ' Can't assign variant array to string array
'asT = vavT  ' Can't assign variant with variant array to string array

When you think about this for a minute, the results make sense. You can't assign Variant arrays to String arrays or vice versa. But you can assign any kind of array to a Variant, because the Variant automatically resizes to accommodate it regardless of the prior contents. You can also assign one String or Variant array to another of the same kind, even if the array being assigned is stored in a Variant. This gets confusing, but only when assigning to arrays. If you assign to Variants you're always OK.

VB5 has even more limitations. You can't assign any array to another array, but you can assign arrays to Variants. The Sort module has some inefficient VB5 hacks in conditional blocks to work around these limitations.

What does all this unnecessary Variant conversion do to performance? Nothing good. I added a test to the TimeIt application to compare different kinds of array returns. Here are the results:

Return array of strings from function: 0.075 sec
Return array of variants from function: 0.0793 sec
Return variant with array of strings from function: 0.2534 sec
Return variant with array of variants from function: 0.2235 sec

The first shows how fast a typical filter operation would be if the Keys method returned an array of strings, as I think it should. The last shows the same filter operation when returning an array of variants in a variant, as the Keys method does now. The first operation is about three times faster. We're paying a heavy price for using a Dictionary component that was designed for VBScript rather than for VB.

Appendix B. VB Strings and Pointer Arithmetic

One of my favorite songs from the days when I watched Sesame Street with my daughter went like this:

Mama don't allow no guitar playing round here. 
Mama don't allow no guitar playing round here. 
We don't care what mama don't allow, 
we're gonna play guitar anyhow.
Mama don't allow no guitar playing round here. 

And that was the cue for the guitar solo--played dramatically by a big blue muppet. The next verse continued in the same vein:

Mama don't allow no bass playing round here...

Well, it turned out that Mama didn't much care for drums or harmonica or piano, but she had to put up with them anyway. The final verse went like this:

Mama don't allow no pointer arithmetic in Visual Basic...

Hmmm. I think it's about time for our solo. But before we start breaking rules, let me set the stage by talking about Visual Basic's String type.

Basic has always been known for its easy and powerful string handling. Unfortunately, the Visual Basic String type is showing its age. While processing strings in Visual Basic is still easier than in C, it doesn't have any advantages over competing languages like Delphi and C++. Although Visual Basic is becoming more object-oriented, its string capabilities remain completely functional--unlike the string classes in Java and C++. Finally, performance of Visual Basic strings is not particularly good, even with native code produced by VB5 and continued in VB6.

This appendix takes an inside look at Visual Basic strings. By learning how the String type works, we can see why some operations are slow, and how we can speed them up by breaking the rules.

How String Data is Stored
If you do a hex dump of an EXE file created by Visual Basic, you're likely to come to a section of data that looks something like this:

00003950  00 00 00 00 4D 00 6F 00  76 00 65 00 00 00 00 00  ....M.o.v.e.....
00003960  52 00 65 00 66 00 72 00  65 00 73 00 68 00 00 00  R.e.f.r.e.s.h...
00003970  53 00 65 00 74 00 46 00  6F 00 63 00 75 00 73 00  S.e.t.F.o.c.u.s.
00003980  00 00 00 00 5A 00 4F 00  72 00 64 00 65 00 72 00  ....Z.O.r.d.e.r.

Looks like a whole lot of zeros going nowhere. If you wrote the code, you'll recognize it as your string data. All those extra zeros are there because your data is stored in Unicode format. When talking about Unicode, I like to quote the late David Kruglinski--author of Inside Visual C++, Microsoft Press. He said, "Microsoft took away segments and gave us Unicode." So even though 32-bit programming gets rid of all those stupid data size limitations, it gives us a new problem that often seems just as arbitrary and senseless.

I believe that someday Unicode will be worth the pain. We'll have one sensible character format that works for all the languages of the world. But that day is not now. In fact with Visual Basic 5 we have a very ironic situation. Even though strings are stored in Unicode format, you can't actually use Unicode as your character format for internationalization.

For now, let's just think about the different ways that Visual Basic and Windows store strings. The LPSTR, LPWSTR, and BSTR formats are shown in Chapter 2, "Strings Inside Out," page 73. We're going to keep coming back to these three formats throughout the rest of this discussion.

Toggling Case
I first got involved in using APIs to optimize string operations when I needed to search backward within a string. Although the InsStrRev added for VB6 makes this problem irrelevant, we're going to look at it anyway later on. But not yet. Any string searching algorithm other than brute force would be complicated and difficult aside from it's use of strings. So let's start out with a simpler problem--one that will illustrate most of the optimization issues and solutions without confusing us.

We'll start by toggling the case of every character in a string. This isn't a particularly realistic problem. For one thing, we're much more likely to lowercase or uppercase a string, but Visual Basic already provides the LCase and UCase functions. For another, if we did want to toggle case, we probably wouldn't care about performance. Nevertheless we're going to test the performance of the TCase function. I actually use a TCase function written in another language for a professional text editor program because it's easier to remember one toggle command than to remember separate uppercase and lowercase commands. The toggle does what I want 95 percent of the time, and I can adjust the other 5 percent of the time. So we'll pretend that the Visual Basic IDE has keyboard macros like every other programmer's editor in the world and that we're writing TCase as a VB editor command.

Our first version is simple and obvious:

Function TCase1(ByVal s As String) As String
    Dim i As Long, ch As Integer
    For i = 1 To Len(s)
        ch = Asc(Mid$(s, i, 1))
        Select Case ch
        Case 65 To 90       ' A to Z
            ch = ch + 32
        Case 97 To 122      ' a to z
            ch = ch - 32
        ' Ignore non-characters
        End Select
        Mid$(s, i, 1) = Chr$(ch)
    Next
    TCase1 = s
End Function

We loop through each character of the string, extract it, change its case, and reinsert it at the same position. When we've processed every character, we return the modified string. Notice that we pass the string by value so that we're processing a temporary copy, not the original.

You can see this function and all the related TCase functions code in the Test Strings project (TString.vbp). The program provides a means to test toggling of long and short strings, alphanumeric strings that are alpha-heavy and ones that are numeric-heavy, and strings in different languages.

At this point we could start asking optimization questions and start making minor changes. For example, we could ask whether a Select Case statement is a faster way to identify modifiable characters than an If Then block. We could ask whether receiving a string by value is faster than assigning to a temporary variable inside the function. But before we get into micro optimization, let's take a closer look at what's going on behind the scenes.

Programming With B--
In Chapter 10, "What You Don't Know," page 561, I invented a language called B-- to help illustrate all the wonderful things Visual Basic takes care of for us so that we can worry about programming problems. B-- has the syntax of Basic, but it lacks the high-level language features we've come to expect. For example, it doesn't know about automatic memory management or the Common Object Model (COM). Earlier I used B-- to show how tough programming would be if we had to create and manage objects by calling QueryInterface, AddRef, and Release ourselves. Here I'm going to use B-- to show what string allocation would look like if Visual Basic didn't manage it for us.

Here's how the TCase1 function shown above would be written in B--. The original Visual Basic lines being replaced are shown in comments for comparison. Since uncivilized languages like C++ and B-- don't know about the String type, they must use the underlying BSTR type and the BSTR system functions. There are nine of these functions, but for our purposes we're only going to use three--SysStringLen (returns string length), SysAllocStringLen (allocates a string of given length), and SysFreeString (frees an allocated string). The code looks something like this:

Function TCase1(ByVal s As BSTR) As BSTR
    Dim i As Long, ch As Integer, sT As BSTR
    ' For i = 1 To Len(s)
    For i = 1 To SysStringLen(s)
 
        ' ch = Asc(Mid$(s, i, 1))
        sT = SysAllocStringLen("", 1)
        sT = Mid$(s, i, 1)
        ch = Asc(sT)
        SysFreeString sT
 
        Select Case ch
        Case 65 To 90       ' A to Z
            ch = ch + 32
        Case 97 To 122      ' a to z
            ch = ch - 32
        ' Ignore non-characters
        End Select
 
        ' Mid$(s, i, 1) = Chr$(ch)
        sT = SysAllocStringLen("", 1)
        sT = Chr$(ch)
        Mid$(s, i, 1) = sT
        SysFreeString sT
 
    Next
    TCase1 = s
End Function

This is speculative code. If you compile TCase1 to binary code with symbolic information and no optimization, you can load it into a C debugger (such as Microsoft's Visual Studio) and step through the code. You'll see that this isn't exactly what happens. Instead Visual Basic calls internal functions such as __vbaLenBstr and __vbaFreeStr. But trust me. I've written similar code in C++, and this is essentially what happens behind the scenes. Besides, as the inventor and only user of B-- (with no compiler to check up on me), I'm entitled to a little poetic license.

You don't have to know much about memory allocation to figure out that this is really bad code. All we want is to test and modify some characters. We don't even need to extract them. We ought to be able to make the changes in place. If you look back at the diagram of how the String data is stored, you can see how tantalizing close those characters are. We want to just reach out and touch them, but Visual Basic doesn't provide any syntax for that. Instead we have to create temporary strings just to hold one character. We wouldn't need those temporary strings in any other language. C and Delphi, for example, allow you to index into a string and touch a character just as if it were in an array rather than a string. In fact, that's what a string is in those languages: a character array.

This discovery has wide implications. Any algorithm that requires processing characters one at a time is going to hit the same problem. For example, the best-known string search algorithm is called the Boyer-Moore algorithm. But this algorithm depends on a character table and intelligent analysis of potentially matching characters--not necessarily compared in the order in which they appear. It's not difficult to figure out that any efficiencies from the algorithm would be eaten up by the cost of allocating and freeing temporary strings just to check one character. A dumb "brute force" search algorithm would probably come out faster than any smart algorithm.

So what are we going to do about Visual Basic's character processing problems? Before we get to that I want to pose a completely different question about our case toggling algorithm so far. What will TCase1 do if you pass the following string: "Sobre Visual Basic hay muchos libros, pero sólo conozco uno que pueda servirte para ir un paso más allá de lo aprendido aquí." You end up with "sOBRE vISUAL bASIC HAY MUCHOS LIBROS, PERO SóLO CONOZCO UNO QUE PUEDA SERVIRTE PARA IR UN PASO MáS ALLá DE LO APRENDIDO AQUí." Notice that the accented characters don't get uppercased.

By the way, this sentence comes from Manual Imprescindible De Visual Basic 5 (The Indispensable VB5 Book), Anaya Multimedia, 1997, by hardcore programmer Juan Diego Guiterrez Gallardo. In the back of his book, Juan Diego gives a very complimentary plug for my book. According to my Spanish translator (the same daughter I used to watch Sesame Street with), the sentence means "There are many books about Visual Basic, but only one that can serve you to step beyond what can be learned here." He goes on to identify my book with some phrases too flowery to repeat without looking even more like a shameless promoter than I do already.

Make It Right Before You Make It Fast

I've quoted Joe Hacker's favorite saying in other contexts, but let me repeat it here: "It doesn't matter how fast your code is if it doesn't work."

That's what we've got so far. The TCase1 function works for me as an American. It might work for you if you're English, Irish, Scottish, or Australian. But it's not much use if you speak Spanish, Danish, German, or Russian. And I hate to even think of what it might do in Arabic, Japanese, Chinese, or Sumerian. As far as I know, there may not even be such a thing as case in those languages.

So before we make it fast, we're going to have to make it right. That means correcting two bugs. The first bug is that we don't take national language issues into consideration when we test the case of a character. The second is that we don't consider national language when changing case. The second problem is easy. Visual Basic provides the UCase$ and LCase$ functions for changing case. These are smart functions that follow the rules and work correctly regardless of the language.

Unfortunately, Visual Basic doesn't have a solution to the first bug. Many languages have functions for testing character type. C, for example, has a variety of character analysis functions including isalpha, isdigit, isspace, ispunct, islower, and isupper. The closest Visual Basic has to any of these is IsNumeric. Fortunately, the designers of Windows anticipated that some languages might have incomplete libraries, and provided API functions to do whatever we want: IsCharUpper and IsCharLower.

Here's how I handle the code in TCase2:

Function TCase2(ByVal s As String) As String
    Dim i As Long, sCh As String
    For i = 1 To Len(s)
        sCh = Mid$(s, i, 1)
        If IsCharUpper(Asc(sCh)) Then
            sCh = LCase$(sCh)
        Else
            sCh = UCase$(sCh)
        End If
        Mid$(s, i, 1) = sCh
    Next
    TCase2 = s
End Function

The complication here is that UCase$ and LCase$ work on strings, but IsCharUpper and IsCharLower work on characters. You have to use the Asc function to convert the character string into a character (of Integer size) before you can test it.

Keep in mind that there are actually two versions of IsCharUpper--IsCharUpperA (the ANSI version) and IsCharUpperU (the Unicode version). In this case we're using the ANSI version with a type library entry equivalent to following Declare statement:

Declare Function IsCharUpper Lib "user32" Alias "IsCharUpperA" ( _
    ByVal cChar As Byte) As Long

I'd like to say that this code is definitely right, but unfortunately I have to qualify that claim. It wouldn't work for double-byte character languages such as Japanese. In those languages we would have to check to see if the character is a lead byte, and if it was we'd have to skip to the next character before testing the case--again assuming that Japanese has uppercase ideograms or whatever they're called. I don't know enough about internationalization issues to pursue this line inquiry further. Just keep in mind that the issues hang over our heads, even as we ignore them and pretend for the duration of this discussion that all languages work like European languages.

So assuming this code is right, let's see how fast it is. The String Test applications produces the following timings for TCase1 and TCase2:

 
TCase1 time: 321
TCase2 time: 410

Well isn't that special? Do the right thing and end up 50 percent slower. Back to the drawing board.

Do It Faster with the API
If we're making one API call to test the case, why not make another to change the case? It turns out that there are API functions for changing case: CharLower, CharUpper, CharLowerBuff, and CharUpperBuff. The plain versions work on null-terminated strings while the Buff versions work on length specified substrings within strings. Since we're going to be changing one character at a time, we'll need CharLowerBuff and CharUpperBuff.

Like most string-oriented API functions, these come in ANSI and Unicode versions. As we saw earlier, Visual Basic strings are stored in Unicode format, so why bother with ANSI versions that will convert back and forth between the two formats? If we use Unicode all the way, we can skip the translation. (If you have doubts about this strategy, hold your peace.)

One problem with Unicode APIs is that Visual Basic doesn't provide any way to write a Unicode Declare statement. I was hoping this stupid limitation would disappear in VB6, but no such luck. You can use a type library to work around the problem, but that wouldn't do us any good in this case because we're not going to be accessing whole strings. We're going to be accessing individual characters within strings. And the only way to do that is with pointers. But everybody knows Visual Basic doesn't allow pointer arithmetic.

Well, as I said at the beginning, we don't care what Mama don't allow. Here's the code:

Function TCase3(ByVal s As String) As String
    Dim i As Long, ch As Integer
    For i = 0 To Len(s) - 1
        ' Extract a Unicode character from current position
        CopyMemory ch, ByVal StrPtr(s) + (i * 2), 2
        ' Test case of character 
        If IsCharUpperW(ch) Then
            ' Change case of character at current position
            CharLowerBuffW StrPtr(s) + (i * 2), 1
        Else
            CharUpperBuffW StrPtr(s) + (i * 2), 1
        End If
    Next
    TCase3 = s
End Function

Dereferencing Pointers
We'll start by reviewing the CopyMemory statement that extracts a character from the string. Here's the Declare:

Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" ( _
    lpvDest As Any, lpvSource As Any, ByVal cbCopy As Long)

See the sidebar "CopyMemory: A Strange and Terrible Saga," Chapter 2, page 90, for the whole story. For now all we care is that you tell it where you want to move to, where you want to move from, and how many bytes to move. In this case we want to move a two-byte Unicode character from a variable location in a Unicode string into an Integer variable called ch.

Notice that CopyMemory is declared to take its first two arguments by reference. This gives us maximum flexibility because we can always override a ByRef default in the Declare by specifying ByVal in the call (but we couldn't override ByVal with ByRef). By reference means that we will use the address of the variable given rather than the value in the variable. For those who know C, a ByRef argument is equivalent to using the C addressof (&) operator.

Consider this statement:

CopyMemory ch, ByVal StrPtr(s) + (i * 2), 2

The first argument is the destination--the address that will be moved to. If putting the destination first and the source second seems strange, you're probably not an assembly language programmer. In any case, the first argument says that the source value should be moved to the address of the ch variable. In other words, it's an assignment to ch.

In the second argument the default ByRef is overridden with the ByVal keyword. This means that we're going to pass a value, not the address of a variable. CopyMemory expects its second argument to be an address, so we had better make sure that the value we pass is actually an address. An address passed as a value is called a pointer--that is it's a value that points to the address of a variable. The undocumented StrPtr function returns a pointer to the Unicode string in its argument.

To that we add a calculation of the offset of the current character. The first time through the loop the second argument is a pointer to the first character of the string--StrPtr(s)--plus an offset of 0 (0 * 2 is 0). The second time through the argument is a pointer to the first character plus 2 (1 * 2 is 2)--which is the second character. And so on to the end.

The third argument is passed by value and is always two--the width in bytes of a Unicode character.

What we have done is called dereferencing a pointer. With the public introduction of AddressOf and the secret introduction of VarPtr, StrPtr, and ObjPtr, VB5 officially became a pointer language. And not just any pointer language. It's the worst pointer language ever created--the only one that doesn't support dereferencing. For example, in C we would dereference this pointer with the * operator like this:

ch = *s;

This is equivalent to the following VB statement:

CopyMemory ch, ByVal StrPtr(s), 2

In my opinion VB should add features that allow you to do the things you do with pointer in C without pointers. But it's too late for that. If they're going to add undocumented pointer functions like StrPtr, they should also add undocumented pointer dereferencing functions such as PtrToChar, PtrToStr, and PtrToLong?

Once you've got dereferencing down, the rest of the code is easy. First you test the case of the extracted character with IsCharUpperW. Then you change the current character with CharLowerBuffW or CharUpperBuffW. The Declares look like this:

Declare Function CharLowerBuffW Lib "user32" ( _
    ByVal lpsz As Long, ByVal cchLength As Long) As Long

Normally the first argument would be declared as a ByVal String, but since this is a Unicode function and since we're going to be accessing the middle of the string, we have to pass a pointer by value and make sure we pass the value as an address from StrPtr.

The statement

CharLowerBuffW StrPtr(s) + (i * 2), 1

works the same as the CopyMemory statement we saw earlier except that CharLowerBuffW deals with characters rather than bytes. When we tell it to process one character, it knows we really want two bytes.

We've jumped through the hoops as requested. Let's check the payoff. Here are the timings from the test program:

 
TCase1 time: 321
TCase2 time: 410
TCase3 time: 140

That's better. These timing were done on Windows NT and they show a 293 percent increase from TCase1. And it's even faster on Windows 98. My timings show it to be 433 percent faster.

Wait a minute! How can it be 293 percent faster on Windows NT and 433 percent faster on Windows 98? Remember what Joe Hacker says? "It doesn't matter how fast your code is if it doesn't work." And this doesn't work. The test program shows a faster speed, but it also shows that the case of the string doesn't get toggled. The reason is simple. Under Windows 98 the code for IsCharUpperW and CharUpperBuffW is stubbed out. If Windows 98 were written in Visual Basic, the implementation of CharUpperBuffW would look like this:

Function CharUpperBuffW(s As String, c As Long) As Long
End Function

A Short Flame About Windows, 95 and 98
I
try to be reasonable and positive about Microsoft. I own some of their stock. I want them to do the right thing and be successful. And that's why I have to flame at the C bigots who made this idiotic decision when designing Windows 95. They made a strategic decision not to do Unicode in Windows 95, and I can respect that and acknowledge the tradeoffs that went into the choice.

Supporting Unicode for every function in the general purpose API that happened to have a string argument would have taken a huge commitment of resources and a lot of extra processing down in the bowels of the operating system. I know why they chose not to support Unicode for miscellaneous API functions like GetCurrentDirectory and SetWindowText. But that's a very different decision than deciding not to support Unicode in utility functions like IsCharUpper, CharLowerBuff, and even lstrcpy. It's not as if they didn't know how. They did give ANSI and Unicode implementations of lstrlen in Windows 95, and they could have done it for other utility functions.

The designers of Windows 95 knew that they had to support COM. They knew that all COM-compliant strings are Unicode. They had access (from Windows NT) to source for general Unicode support functions. They knew that these are independent functions that don't require major changes throughout the operating system. In other words, they knew better. What they didn't seem to know is that some people program Windows in languages other than C and C++. In C-based languages if the operating systems doesn't support some operation, the C library probably does. But Visual Basic programmers depend on the operating system to provide essential functionality. In this case, the operating system let us down. Probably this decision was made by the same person who created the CopyMemory/RtlMoveMemory mess.

As if the original crime weren't bad enough, they've repeated it for Windows 98. The decision not to provide general Unicode support might have been justifiable then, but it's extremely questionable now. The decision not to support Unicode in utility and national language functions is even more debatable--especially since Microsoft has proved in another context how easy it would have been. We'll get to that other context shortly.

What this means is that we must accept that maximum performance in portable API string operations is just out of our reach. We'll have to either assume all our customers have Windows NT (unlikely) or we'll have to deal with the ANSI version of API functions. That choice has a cost, not only in speed, but in correctness for all languages. Since this chapter isn't intended as a discussion of internationalization, we're not going to talk about string processing for languages that have double byte characters except to note that it shouldn't be a problem. When TCase3 works, it works for any language you throw at it. Too bad we can't just stop here.

A First Try With ANSI APIs
In theory we ought to be able to do exactly the same thing with ANSI functions that we just did with Unicode functions. In reality, it's not quite the same. Let's look at the code before we discuss the problems.

Function TCase4(ByVal s As String) As String
    Dim i As Long, ch As Byte, c As Long
    For i = 0 To Len(s) - 1
        CopyMemory ch, ByVal StrPtr(s) + (i * 2), 1
        If IsCharUpper(ch) Then
            CharLowerBuffPtr StrPtr(s) + (i * 2), 1
        Else
            CharUpperBuffPtr StrPtr(s) + (i * 2), 1
        End If
    Next
    TCase4 = s
End Function

This looks a lot like the last version, but notice that the ch variable is a Byte rather than an Integer. The CopyMemory statement copies one character rather than two. But we know that StrPtr is returning the address of the actual Unicode characters being copied. How can we chop off half of a Unicode character and expect to get the right result? Well, we can't. Not always. This code certainly won't work for Unicode character 823 or 1458. It will work for most of the first 256 Unicode characters because they are the same as the first 256 characters for most European code pages. The other half of the character is zero, so there's no need to copy it over.

There are exceptions to the rule. Try the following code:

For i = 0 To 255
    Debug.Print i, Hex$(AscW(Chr$(i))) 
Next

Results may vary depending on your regional settings, but you'll find that some of the codes in the upper half (including 145 to 153 on all the machines I've tried) don't have zeros in the upper half. Fortunately, the upper half is zero for the codes that represent the accented characters frequently used in text in European countries. Still, any technique based on the assumption that Unicode characters are always half zero has a chance of failing for some languages.

Now let's look at the second part of the processing:

CharUpperBuffPtr StrPtr(s) + (i * 2), 1

This statement is using a specialized API statement that exists in the ANSI Windows API Type Library and is equivalent to this Declare statement:

Declare Function CharUpperBuffPtr Lib "user32" _
    Alias "CharUpperBuffA" (ByVal lpsz As Long, _
    ByVal cchLength As Long) As Long

The key to writing Declare statements is to understand that Windows has an unvarying and non-negotiable expectation about the arguments it will receive. In this case the CharUpperBuffA function expects it's first argument to be the address of a string of ANSI characters. Fortunately, there are numerous ways that Visual Basic can satisfy Windows' request, depending on how you write the Declare. This alias satisfies the request by requiring that the address by passed by value as a Long. Since the ByVal is in the Declare, no need to put it in the call.

The pointer arithmetic works for this statement as long as the Unicode character being pointed to is only 8 bits. As far as the ANSI version of the function knows, each character is only 8 bits wide. Since we're only looking at one character at a time, it works. But what if we looked at more than one character. Well, it would probably still work because "AbC" would be processed as A, null, b, null, C, null. Changing a null to uppercase has no effect--except wasted processing time. You would also have to double the buffer length passed in the third parameter.

Even if this technique works, it looks awkward and fragile. Let's check the performance:

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140
TCase4 time: 220

Well, it's not the worst possible algorithm, but it is slower than the correct (but Windows-95-hostile) code in TCase3. This is discouraging. We're getting slower and less correct. Maybe it's time to try something completely different.

Working On Byte Arrays Rather Than Strings
An ANSI string is in reality just an array of bytes. So why not look at it as an array of bytes. Here's a new version of TCase that does just that:

Function TCase5(s As String) As String
    Dim i As Long, ab() As Byte, ch As Byte
    ab = StrConv(s, vbFromUnicode)
    For i = 0 To Len(s) - 1
        If IsCharUpper(ab(i)) Then
            CharLowerBuffB ab(i), 1
        Else
            CharUpperBuffB ab(i), 1
        End If
    Next
    TCase5 = StrConv(ab, vbUnicode)
End Function

The first thing to notice about this function is that it passes it's string by reference. Passing a string by reference is faster--but it's wrong if you modify the original string unintentionally. In this case we'll be copying to a temporary byte array and working on that rather than on the original string. Unlike other arrays, byte arrays have a special relationship with strings. You can use the assignment operator to copy between the two types. For example:

ab = s   ' Copy string to byte array
s = ab   ' Copy byte array to string

These are legal statements, but of course the result is a byte array with a zero in every other byte. In this case we wanted a byte array consisting of the original Unicode converted to ANSI. That's what StrConv does in the code above.

With a byte array, it's a simple matter to pass each character to IsCharUpper. Passing to CharUpperBuff is a little more tricky. We need the following Declare statement (or a type library equivalent):

Declare Function CharUpperBuffB Lib "user32" _
    Alias "CharUpperBuffA" (lpsz As Byte, _
    ByVal cchLength As Long) As Long

Windows expects a pointer to the character data, and this Declare provides one as a ByRef Byte parameter. If we pass the current byte of the array by reference, Windows will see the address of that byte.

Once we've processed all the bytes, we have to convert back to a string. Processing should be fast here, but what's it going to cost to convert the strings back and forth between ANSI and Unicode?

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140
TCase4 time: 220
TCase5 time: 251

At first glance this appears to be one more step backward, but think again. On a small string converting to ANSI and back to Unicode is going to be a significant part of the processing time. On a large string the conversion will take a much smaller percentage of the processing time. You can see this by running the test program on a much larger string.

TCase1 time: 2364
TCase2 time: 3064
TCase3 time: 982
TCase4 time: 1613
TCase5 time: 1502

This is much better. It's still not as fast as the Unicode version in TCase3, but it's better than any of the others. Of course speed isn't everything. This version takes a lot more memory. We don't really want to waste time or data space creating a temporary target string rather than working on the real thing. Imagine the wasted memory if the string were a one megabyte text file read from a disk.

Forget the String, Just Use the Byte Array
One way to avoid all the unnecessary conversion is to use Byte arrays all along. If you're reading a string from a file, read it directly into a byte array rather than a string. Of course this isn't always feasible. If the data comes from a TextBox or other control, it's going to be a String whether you like it or not. But you may be able to pass it around as a byte array for quite a while before you have to assign it to something that expects a string.

If you plan on doing this, it's not realistic to include the string conversion in the cost of the function.

What you really want is a TCase function that works directly on Byte arrays. You might want to do this in either of two ways. First, here's a Sub that processes a byte array in place:

Sub TCase6(ab() As Byte)
    Dim i As Long
    For i = LBound(ab) To UBound(ab)
        If IsCharUpper(ab(i)) Then
            CharLowerBuffB ab(i), 1
        Else
            CharUpperBuffB ab(i), 1
        End If
    Next
End Sub

VB6 allows you to return a byte array. Here's a Function version that returns a modified copy of an array in a function:

 
Function TCase7(ab() As Byte) As Byte()
    Dim i As Long, abRet() As Byte
    ' Copy the input array to a temporary value and process
    abRet = ab
    For i = LBound(abRet) To UBound(abRet)
        If IsCharUpper(abRet(i)) Then
            CharLowerBuffB abRet(i), 1
        Else
            CharUpperBuffB abRet(i), 1
        End If
    Next
    ' Return temporary
    TCase7 = abRet
End Function

VB won't let you pass an array by value, so you have to create a separate temporary array to hold the return value. That way you won't modify the input array.

These two perform respectably, although still not as well at the NT-only TCase3.

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140
TCase4 time: 220
TCase5 time: 251
TCase6 time: 200
TCase7 time: 221

The byte array versions won't work for every situation, but sometimes one of them is a perfect fit.

What About the Algorithm?

So far we've been experimenting with different ways of testing and changing the characters. But what about the algorithm itself? You may have noticed that we always change the case--even if the character is numeric or punctuation. Of course CharUpperBuff isn't going to change the case of a comma or a zero. That's impossible by definition. But we could save a function call if we already knew that a particular character had no case. On the other hand, we'd have to do two tests instead of one. It's a tradeoff, and the results depend on the internals of IsCharUpper/Lower and CharUpper/LowerBuff. There's only one way to find out. Here's a slightly modified version of TCase6:

Sub TCase8(ab() As Byte)
    Dim i As Long
    For i = LBound(ab) To UBound(ab)
        If IsCharUpper(ab(i)) Then
            CharLowerBuffB ab(i), 1
        ElseIf IsCharLower(ab(i)) Then
            CharUpperBuffB ab(i), 1
        End If
    Next
End Sub

Here are the comparative timings:

TCase6 time: 200
TCase8 time: 210

We're going backward again, but if you look at the test program it's easy to see why. The initial test string--"Make things simpler than possible"--is all alphabetic. We're adding a test, but we're still changing the case for every character. To get a benefit from this change we'd have to test with a string containing many non-alphabetic characters. For example with the string "See 3,987,654,456,429 bugs" we get these results:

TCase6 time: 160
TCase8 time: 121

The choice depends on the circumstances. In most cases where we want to toggle case, the text to be modified would be mostly alphabetic. The cost of a few unnecessary case changes would be worth it if we could avoid an extra case test on every character. So we'd probably stick with the first algorithm.

The point is, you have to consider the data you're most likely to be processing to determine the best way of processing it. Perhaps this lesson doesn't matter much when toggling case, but then toggling case isn't a very typical problem, which is what I said right up front. Let's get back to the original problem--searching strings backwards.

Using the SHLWAPI Library

Before we waste a lot of time figuring out search algorithms, let's try stealing some that are already available--or at least sort of available. Internet Explorer comes with a DLL