String Functions Performance Considerations

I’ve been doing some tests and I see Instr$ in LotusScript is still a lot slower when you start searching in the middle of the string. I wrote an SPR about this some time back.

The same is true of Mid$ – I wrote a timing test that uses Mid$ to get the 1st character of a string as opposed to the 27000th. The latter takes much longer, and I don’t understand why. According to the help docs it’s two bytes per character, so it should be trivial to determine the location of a character from its number position.

Timing test results

0.0008ms – Mid$(x, 1, 1)
0.0566ms – Mid$(x, 27000, 1)
0.0010ms – instr$(1, t, “x”) (where “x” occurs at position 1)
0.2924ms – instr$(27000, t, “x”) (where “x” occurs at position 27000)

What’s going on here? How can it take so much longer to access a string starting at a later position? It must be looping to get to the right starting position instead of just calculating the offset — character position X 2.
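
One way to probe the looping hypothesis is to time the same call at several start positions and see whether the cost grows roughly in proportion to the offset. The following is only a sketch under stated assumptions: the sub name, the repetition count, the chosen offsets and the all-"x" test string are illustrative and are not the original test code.

```lotusscript
Sub SweepStartOffsets
	Const REPS = 50000 ' hypothetical repetition count; tune for your machine
	Dim s As String, report As String
	Dim offset As Long, i As Long, pos As Long, k As Integer
	Dim t0 As Single, perCallMs As Double

	' Hypothetical test data: every character is "x", so Instr matches
	' immediately at whatever start position it is given.
	s = String$(27000, "x")

	For k = 0 To 3
		offset = k * 9000 ' 0, 9000, 18000, 27000
		If offset = 0 Then offset = 1 ' positions are 1-based
		t0 = Timer
		For i = 1 To REPS
			pos = Instr(offset, s, "x")
		Next
		' Timer counts seconds since midnight, so avoid running this across midnight.
		perCallMs = (Timer - t0) * 1000 / REPS
		report = report & offset & ": " & Format(perCallMs, "0.0000") & " ms" & Chr(10)
	Next
	Msgbox report
End Sub
```

Because the match is always found at the start position itself, any growth in time per call should come from reaching that position rather than from scanning for the target.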

Do we care?

Generally, performance of in-memory operations like this is swamped by the much slower things scripts normally spend a lot of time doing, like accessing documents and views. But sometimes it really matters — like when finding string differences in the compareDBs application, for instance. And this seems like an easy fix, worth doing.

As they say in the political ads, “Contact your representative and ask them to do something about this.”

Notes: Performance testing technique

To do this performance test, I created a class PerfTimer. This is more complex than it needs to be for a “quick and dirty” test, but it’s adaptable to other situations, particularly where you can’t guess within an order of magnitude how long something might take. “Let’s do this a million times and see how long it takes” isn’t a great idea when you’re not sure you won’t be waiting four hours for it to finish. By setting the testing time rather than the number of repetitions, we get a predictable experience: it won’t take much longer than your target time, but there will be enough repetitions to be meaningful, without having to hunt around for how many repetitions are enough for the signal to exceed the noise. Here’s the code:

%REM
	Class PerfTimer
	Constructor: New PerfTimer(seconds)

	Description: A timer for performance testing. The timer will run for a specified number of seconds and count how many times something repeated during that time.
		You can run up to 10 different scenarios for comparison purposes, getting the results at the end. The sequence of operations is:
		 - Create a PerfTimer that will run for a stated number of seconds.
		 - "Start" the timer, supplying the name of the test scenario you're timing.
		 - Use a loop of this form:
		 
		 	Dim pt as New PerfTimer(15) ' we'll count how many times we can do something in 15 seconds.
		 	pt.start "Scenario 1"
			Do
				' Insert operation you're timing here.
			Loop Until pt.isdone
			pt.start "Scenario 2"
			Do
				' Insert operation you're comparing to the first.
			Loop Until pt.isdone
			Msgbox pt.results ' show time per iteration for each scenario, in seconds.
		
		Note: if the operation you're timing is pretty quick, it's a good idea to add an inner loop to repeat it (let's say) 1000 times,
			so the timing differences between scenarios aren't swamped by the performance testing overhead.
%END REM
Class PerfTimer
	z_timePer(10) As Double
	z_id(10) As String
	z_index As Integer
	z_curCount As Long
	z_targetTime As Single
	z_runtimeTarget As Single ' seconds
	z_startTime As Single

	Sub New(seconds As Integer)
		z_index = -1
		z_runtimeTarget = seconds
	End Sub
	
	%REM
		Sub start
		Description: Start a timer for a test run, specifying which code variant we're testing.
	%END REM
	Sub start(testName$)
		z_index = z_index + 1
		z_id(z_index) = testName
		z_curCount = 0
		Dim startTime As Single
		startTime = Timer
		Do
			z_startTime = timer
		Loop While startTime = z_startTime ' wait for timer to tick
		z_targetTime = z_startTime + z_runtimeTarget
	End Sub
	
	%REM
		Property Get isDone
		Description: Return True if timer target is reached, meanwhile counting how many times this routine is called.
	%END REM
	Property Get isDone As Boolean
		z_curCount = z_curCount + 1
		Dim tim As Single
		tim = Timer
		If tim >= z_targetTime Then ' use actual time elapsed to calculate rate.
			z_timePer(z_index) = (tim - z_startTime) / z_curCount
			isDone = True
		End If
	End Property
	
	%REM
		Property Get results
		Description: Return a list of test names and time per iteration.
	%END REM
	Public Property Get results As String
		Dim i%
		ReDim ans(0 To z_index) As String
		
		For i = 0 To z_index
			ans(i) = Format(z_timePer(i), "0.########") & " - " & z_id(i) 
		Next
		results = Join(ans, {
})
	End Property
End Class
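
The usage pattern described in the class comment, applied to the Mid$ comparison from the results above, might look like the following sketch. It assumes the PerfTimer class above is in scope. The sub name, the 5-second run time, the filler string and the 1000-call inner loop are illustrative choices, not the original test code.

```lotusscript
Sub TimeMidPositions
	Dim s As String, c As String
	Dim i As Integer

	' Hypothetical test data: 26999 filler characters followed by "x",
	' so the string is exactly 27000 characters long.
	s = String$(26999, "a") & "x"

	Dim pt As New PerfTimer(5) ' run each scenario for about 5 seconds
	pt.start "Mid$(s, 1, 1) x 1000"
	Do
		' Inner loop so the timer overhead does not swamp a very fast operation.
		For i = 1 To 1000
			c = Mid$(s, 1, 1)
		Next
	Loop Until pt.isDone

	pt.start "Mid$(s, 27000, 1) x 1000"
	Do
		For i = 1 To 1000
			c = Mid$(s, 27000, 1)
		Next
	Loop Until pt.isDone

	Msgbox pt.results ' seconds per outer iteration, i.e. per 1000 calls
End Sub
```

Because each outer iteration makes 1000 calls, the reported seconds per iteration can be read directly as approximate milliseconds per call, loop overhead included.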

Tags: performance testing

8 thoughts on “String Functions Performance Considerations”

Ben Langhinrichs November 25, 2022 at 6:27 pm

I assume the content is in LMBCS, which isn’t two bytes per character. It is variable length depending on the characters and code pages. I assume that means it does really need to cycle through the content. See https://en.wikipedia.org/wiki/Lotus_Multi-Byte_Character_Set

  Andre November 25, 2022 at 7:16 pm

  That was my initial thought as well, but the LotusScript documentation of the String data type says it’s 2 bytes per character.

    Ben Langhinrichs November 27, 2022 at 3:37 pm

    My bad. It gets converted to UTF-8 before it is saved into a String. I should know that as I have to do it in my Midas LSX, but my memory is going away at about the same pace as the documentation and support.

Lars Berntrop-Bos November 25, 2022 at 8:42 pm

The documentation is sadly not in a good state. Large parts of it have been neglected and are a diminished reflection of reality.

Lars Berntrop-Bos November 25, 2022 at 9:03 pm

An example, pertaining to the subject of String: the documentation for Uni and UChr incorrectly states the range of Unicode values as 0-65535.
That has become untrue as of the inclusion of UTF-16 in LMBCS (per https://en.wikipedia.org/wiki/Lotus_Multi-Byte_Character_Set ).
Since LMBCS includes UTF-16, I don’t think two bytes will fit the entire set, and I think it must be a variable number of bytes construction. The neglect of the docs also shows up as bugs in parts of the Notes API. For example, the NotesStream class has problems with long strings of UTF-8 with lots of accents, like Czech (around 17000 chars on a line). The bug exhibits as corruption when reading back added text from a NotesStream. Japanese also triggers it, but needs longer lines, approx. 25000. I have supplied HCL with a db reproducing the bug; it’s added to SPR # KKOOBZ9B2E. The bug reproduces from V9 through 12.0.1FP1 (it was reported before 12.0.2).

Nashed November 25, 2022 at 10:58 pm

LMBCS is 1 to 3 bytes. Those 3 byte chars are quite rare. But still if the operation is in chars and not in bytes, the operation needs to parse the string as if it would contain a variable number of bytes per char.

From what I just looked up UTF-8 can have up to 4 bytes.

The exposed C-API calls for LMBCS are quite limited. And you would have to loop thru the string manually char by char.

It would be interesting to see how the C-API would perform. C will be faster (because it has less overhead in general), but if we see a similar relationship in speed in the examples with position 1 vs 27000, this would be an underlying LMBCS performance limitation.
Else the Lotus Script implementation would have room for improvement.

@Lars, LMBCS does not directly contain UTF-16. The encoding for LMBCS is older than Unicode. And it has its own format. There is a very old help database about LMBCS. And Notes has its own routines to work with LMBCS and there are conversion routines to convert from and to different encodings.

I would not expect incorrect data returned from a NotesStream. This would be clearly a bug. If it would be just performance, this would be understandable.
When did you report the SPR? What info did you get back about getting it fixed?

Andre November 26, 2022 at 3:51 am

Thanks for your comments, all. I’ve been looking into this further to determine whether there are characters that LotusScript represents internally with more than two bytes. The String datatype absolutely uses Unicode character codes in memory, not LMBCS, as I will show.

The functions Len and LenB return the length in characters and bytes respectively, and so far I can’t find any cases where LenB is not 2*Len. I’ve been trying weird characters like “ʙ̥” (voiceless bilabial trill). This one is represented as two Unicode characters internally, 0x299 (“Latin Letter Small Capital B”) and 0x325 (“Combining Ring Below”). LotusScript treats them as separate characters but they are drawn in a single character space. Here’s the code I use to test potential oddball character candidates.
```
Sub Click(Source As Button)
    Dim fo$, i%, ans$
    fo = "ʙ̥"
    For i = 1 To Len(fo)
        ans = ans & ":" & Hex(Uni(Mid$(fo, i, 1)))
    Next
    Print Len(fo) & ", " & Lenb(fo) & " -- " & ans
End Sub
```
The output is “2, 4 -- :299:325”.
So in brief, I still can’t find any excuse for these functions to be so slow. Certainly in the case of Instr, even if it has to scan the string to find position 27000, there’s no reason for that to take 5x longer than Mid$ does to do the same thing.

Mikle January 29, 2023 at 2:44 am

Check the search times with different offsets – 1, 10000, 20000, 30000, 40000, 50000 and see what curve they fit into – if instr is really dumbly going through the whole string, there will be about a linear relationship.
