I’ve been doing some tests and I see InStr in LotusScript is still a lot slower when you start searching in the middle of the string. I wrote an SPR about this some time back.
The same is true of Mid$ – I wrote a timing test that uses Mid$ to get the 1st character of a string as opposed to the 27000th. The latter takes much longer, and I don’t understand why. According to the help docs it’s two bytes per character, so it should be trivial to determine the location of a character from its number position.
Timing test results
- 0.0008ms – Mid$(x, 1, 1)
- 0.0566ms – Mid$(x, 27000, 1)
- 0.0010ms – InStr(1, t, “x”) (where “x” occurs at position 1)
- 0.2924ms – InStr(27000, t, “x”) (where “x” occurs at position 27000)
What’s going on here? How can it take so much longer to access a string starting at a later position? It must be looping to get to the right starting position instead of just calculating the byte offset from the character position (position × 2).
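For reference, a comparison like this could be set up roughly as follows. This is a minimal sketch, not the exact script behind the numbers above. It assumes the PerfTimer class from the notes at the end of this post is defined in the same script, and the test string, the 5-second run time and the inner loop of 1000 repetitions are illustrative choices.

```lotusscript
Sub Initialize
	' Assumes Class PerfTimer (from the notes in this post) is defined in (Declarations).
	Dim t As String, c As String
	Dim p As Long, i As Integer

	' Hypothetical test data: 26999 filler characters, then "x" at position 27000.
	t = String$(26999, "a") & "x"

	Dim pt As New PerfTimer(5)	' run each scenario for about 5 seconds

	pt.start "Mid$ at position 1"
	Do
		For i = 1 To 1000	' inner loop so the timer overhead doesn't dominate
			c = Mid$(t, 1, 1)
		Next
	Loop Until pt.isDone

	pt.start "Mid$ at position 27000"
	Do
		For i = 1 To 1000
			c = Mid$(t, 27000, 1)
		Next
	Loop Until pt.isDone

	pt.start "InStr starting at position 27000"
	Do
		For i = 1 To 1000
			p = InStr(27000, t, "x")
		Next
	Loop Until pt.isDone

	' Each figure is seconds per 1000 calls, which is numerically the same as ms per call.
	MsgBox pt.results
End Sub
```

Because each pass through the Do loop makes 1000 calls, the seconds-per-iteration figure PerfTimer reports can be read directly as milliseconds per call.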
Do we care?
Generally, performance of in-memory operations like this is swamped by the much slower things scripts normally spend a lot of time doing, like accessing documents and views. But sometimes it really matters — like when finding string differences in the compareDBs application, for instance. And this seems like an easy fix, worth doing.
As they say in the political ads, “Contact your representative and ask them to do something about this.”
Notes: Performance testing technique
To do this performance test, I created a class, PerfTimer. This is more complex than it needs to be for a “quick and dirty” test, but it’s adaptable to other situations, particularly where you don’t know within an order of magnitude how long something might take. “Let’s do this a million times and see how long it takes” isn’t a great idea when you’re not sure you won’t be waiting four hours for it to finish. By setting the testing time rather than the number of repetitions, we get a predictable experience: it won’t take much longer than your target time, but there will be enough repetitions to be meaningful, without having to hunt around for how many repetitions will be enough for the signal to exceed the noise. Here’s the code:
%REM
	Class PerfTimer
	Constructor: New PerfTimer(seconds)
	Description: A timer for performance testing. The timer will run for a specified
		number of seconds and count how many times something repeated during that time.
		You can run up to 10 different scenarios for comparison purposes, getting the
		results at the end. The sequence of operations is:
		- Create a PerfTimer that will run for a stated number of seconds.
		- "Start" the timer, supplying the name of the test scenario you're timing.
		- Use a loop of this form:
			Dim pt as New PerfTimer(15) ' we'll count how many times we can do something in 15 seconds.
			pt.start "Scenario 1"
			Do
				' Insert operation you're timing here.
			Loop Until pt.isdone
			pt.start "Scenario 2"
			Do
				' Insert operation you're comparing to the first.
			Loop Until pt.isdone
			Msgbox pt.results ' show timing results in seconds.
	Note: if the operation you're timing is pretty quick, it's a good idea to add an
		inner loop to repeat it (let's say) 1000 times, so the timing differences between
		scenarios aren't swamped by the performance testing overhead.
%END REM
Class PerfTimer
	z_timePer(10) As Double
	z_id(10) As String
	z_index As Integer
	z_curCount As Long
	z_targetTime As Single
	z_runtimeTarget As Single ' seconds
	z_startTime As Single

	Sub New(seconds As Integer)
		z_index = -1
		z_runtimeTarget = seconds
	End Sub

%REM
	Sub start
	Description: Start a timer for a test run, specifying which code variant we're testing.
%END REM
	Sub start(testName$)
		z_index = z_index + 1
		z_id(z_index) = testName
		z_curCount = 0
		Dim startTime As Single
		startTime = Timer
		Do
			z_startTime = Timer
		Loop While startTime = z_startTime ' wait for timer to tick
		z_targetTime = z_startTime + z_runtimeTarget
	End Sub

%REM
	Property Get isDone
	Description: Return True if timer target is reached, meanwhile counting how many
		times this routine is called.
%END REM
	Property Get isDone As Boolean
		z_curCount = z_curCount + 1
		Dim tim As Single
		tim = Timer
		If tim >= z_targetTime Then
			' use actual time elapsed to calculate rate.
			z_timePer(z_index) = (tim - z_startTime) / z_curCount
			isDone = True
		End If
	End Property

%REM
	Property Get results
	Description: Return a list of test names and time per iteration.
%END REM
	Public Property Get results As String
		Dim i%
		ReDim ans(0 To z_index) As String
		For i = 0 To z_index
			ans(i) = Format(z_timePer(i), "0.########") & " - " & z_id(i)
		Next
		' join with a newline so each scenario appears on its own line
		results = Join(ans, {
})
	End Property
End Class
I assume the content is in LMBCS, which isn’t two bytes per character. It is variable length depending on the characters and code pages. I assume that means it does really need to cycle through the content. See https://en.wikipedia.org/wiki/Lotus_Multi-Byte_Character_Set
That was my initial thought as well, but the LotusScript documentation of the String data type says it’s 2 bytes per character.
My bad. It gets converted to UTF-8 before it is saved into a String. I should know that as I have to do it in my Midas LSX, but my memory is going away at about the same pace as the documentation and support.
The documentation is sadly not in a good state. Large parts of it have been neglected and are a diminished reflection of reality.
An example, pertaining to the subject of String: the documentation for Uni and UChr incorrectly states the range of Unicode values as 0-65535.
That has become untrue as of the inclusion of UTF-16 in LMBCS (per https://en.wikipedia.org/wiki/Lotus_Multi-Byte_Character_Set ).
Since LMBCS includes UTF-16, I don’t think two bytes will fit the entire set, so I think it must be a variable-length encoding. The neglect of the docs also shows up as bugs in parts of the Notes API. For example, the NotesStream class has problems with long strings of UTF-8 with lots of accents, like Czech (around 17000 characters on a line). The bug shows up as corruption when reading back added text from a NotesStream. Japanese also triggers it, but needs longer lines, approx. 25000. I have supplied HCL with a db reproducing the bug; it has been added to SPR # KKOOBZ9B2E. The bug reproduces from V9 through 12.0.1FP1 (it was reported before 12.0.2).
LMBCS is 1 to 3 bytes. Those 3 byte chars are quite rare. But still if the operation is in chars and not in bytes, the operation needs to parse the string as if it would contain a variable number of bytes per char.
From what I just looked up UTF-8 can have up to 4 bytes.
The exposed C-API calls for LMBCS are quite limited, and you would have to loop through the string manually, char by char.
It would be interesting to see how the C-API would perform. C will be faster (because it has less overhead in general), but if we see a similar relationship in speed in the 1 vs 27000 examples, this would be an underlying LMBCS performance limitation.
Otherwise, the LotusScript implementation would have room for improvement.
@Lars, LMBCS does not directly contain UTF-16. The encoding for LMBCS is older than Unicode, and it has its own format. There is a very old help database about LMBCS. Notes has its own routines to work with LMBCS, and there are conversion routines to convert to and from different encodings.
I would not expect incorrect data returned from a NotesStream. This would be clearly a bug. If it would be just performance, this would be understandable.
When did you report the SPR? What info did you get back about getting it fixed?
Thanks for your comments, all. I’ve been looking into this further to determine whether there are characters that LotusScript represents internally with more than two bytes. The String datatype absolutely uses Unicode character codes in memory, not LMBCS, as I will show.
The functions Len and LenB return the length in characters and bytes respectively, and so far I can’t find any cases where LenB is not 2*Len. I’ve been trying weird characters like “ʙ̥” (unvoiced labial trill). This one is represented as two Unicode characters internally, 0x299 (“Latin Letter Small Capital B”) and 0x325 (“Combining Ring Below”). LotusScript treats them as separate characters, but they are drawn in a single character space. Here’s the code I use to test potential oddball character candidates.
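The snippet itself is not reproduced here. A minimal sketch of a check along those lines could look like the following. It is a reconstruction under assumptions, not the original code: the variable names are made up, and the candidate string is built from its two code points with UChr$ rather than pasted in.

```lotusscript
Sub Initialize
	Dim s As String, codes As String
	Dim i As Integer

	' Build the candidate from its code points: U+0299 followed by U+0325.
	s = UChr$(&H299) & UChr$(&H325)

	' Collect each character's Unicode value in hex, each prefixed with ":".
	For i = 1 To Len(s)
		codes = codes & ":" & Hex$(Uni(Mid$(s, i, 1)))
	Next

	' Character count, byte count, then the character codes.
	Print CStr(Len(s)) & ", " & CStr(LenB(s)) & " - " & codes
End Sub
```

If LenB ever came back as something other than twice Len for a candidate, that would be the case worth investigating.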
The output is “2, 4 — :299:325”.
So in brief, I still can’t find any excuse for these functions to be so slow. Certainly in the case of Instr, even if it has to scan the string to find position 27000, there’s no reason for that to take 5x longer than Mid$ does to do the same thing.
Check the search times with different offsets – 1, 10000, 20000, 30000, 40000, 50000 and see what curve they fit into – if instr is really dumbly going through the whole string, there will be about a linear relationship.
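A rough sketch of such a sweep, using only built-in functions. The repetition count and the test string are arbitrary assumptions. Because every position in the string holds "a", InStr matches immediately at its start offset, so any growth in time should come from reaching that offset rather than from the search itself. Timer has coarse resolution, so treat the numbers as indicative only.

```lotusscript
Sub Initialize
	Const REPS = 20000	' assumed repetition count; tune for your machine
	Dim t As String
	Dim offsets As Variant
	Dim i As Integer, n As Long
	Dim startPos As Long, found As Long
	Dim started As Single, elapsed As Single

	' 50000 identical characters, built in two halves.
	t = String$(25000, "a") & String$(25000, "a")
	offsets = Split("1,10000,20000,30000,40000,50000", ",")

	For i = 0 To UBound(offsets)
		startPos = CLng(offsets(i))
		started = Timer
		For n = 1 To REPS
			found = InStr(startPos, t, "a")	' match is found at startPos itself
		Next
		elapsed = Timer - started
		Print "offset " & CStr(startPos) & ": " & _
			Format$(elapsed * 1000 / REPS, "0.0000") & " ms per call"
	Next
End Sub
```

If the per-call time grows roughly in proportion to the offset, that points to a character-by-character walk to the start position. A flat line would suggest the cost lies elsewhere.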