467,926 Members | 1,410 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 467,926 developers. It's quick & easy.

How to speed-up access to strings

32bit
Dear Bytes comunity...
I'm trying to speed-up the way I'm interacting with big strings.
I have an string with more than 2M caracteres and would not like to use the function mid$ to read, and change a specific character.
I'd like to use VARPTR or STRPTR, but I could not understand how to access the memory to get the content of a character directly.
Could somebody give me some help?
Best regards
Ricardo
4 Weeks Ago #1

✓ answered by Rabbit

In essence, you'll do something like this:
Expand|Select|Wrap|Line Numbers
  1. Dim arrBytes() As Byte
  2. Dim intFileNum As Long
  3. Dim i As Long
  4.  
  5. intFileNum = FreeFile
  6. Open "C:\somefile.ext" For Binary Access Read As intFileNum
  7. ReDim arrBytes(LOF(intFileNum) - 1)
  8. Get intFileNum, , arrBytes
  9.  
  10. For i = 0 To UBound(arrBytes)
  11.     arrBytes(i) = (arrBytes(i) + 1) Mod 256
  12. Next i
  13.  
  14. Close intFileNum
The portion that manipulates the bytes is a many times faster than the string version but the magnitude of the improvement isn't huge. Both the string and byte array versions are extremely quick, the byte array is just slightly quicker. Does it matter if one takes a second and the other takes a quarter of a second? Probably not, but the byte array version is quicker and cleaner.

  • viewed: 3325
Share:
63 Replies
cactusdata
Expert 128KB
What would "speed up" mean?
This runs in 0.15Ķs where s is a string of 2 mio. characters:

Expand|Select|Wrap|Line Numbers
  1. Mid(s, 10 ^ 6, 1) = "x" 
3 Weeks Ago #2
NeoPa
Expert Mod 16PB
Hi Ricardo.

Your question is pretty unclear. What do you mean by using the Mid$() function? This is the only function that I can think of that is used both on the left AND on the right of the =. On the right it may make sense. On the left? Not so much.

Also, you introduce two terms - VARPTR and STRPTR - but give nothing to indicate where you got them from. I was able to determine they're both hidden members of the VBA library, but only in more recent versions.

My suspicion here is that they're there to allow you to see where the data's stored, but the absence of any way to use that to manipulate the data is very deliberate.

I suspect you just need to explore using the Mid() function on the left of the = in your code. It should do all you need.
3 Weeks Ago #3
ADezii
Expert 8TB
Just as a side note, to the best of my knowledge, Mid$() is more efficient than its counterpart Mid() when dealing with Strings.
3 Weeks Ago #4
isladogs
Expert 128KB
Just as a side note, to the best of my knowledge, Mid$() is more efficient than its counterpart Mid() when dealing with Strings.
Interestingly, I was going to reply that Mid$ is now largely obsolete, along with Left$, Right$, LTrim$, RTrim$, LCase$, UCase$, but all can still be used for backwards compatibility. Am I wrong in that belief?
3 Weeks Ago #5
Rabbit
Expert Mod 8TB
The purpose of the PTR types are to be used as pointers to be passed to API calls. VBA itself doesn't deal directly with pointers and has no way of dereferencing the pointers to get access to the memory. If you want to manipulate memory with pointers, you'll have to do it through an API call. But that isn't necessarily faster than using the built in string functions.
3 Weeks Ago #6
NeoPa
Expert Mod 16PB
Rabbit:
The purpose of the PTR types are to be used as pointers to be passed to API calls.
Thank you. That's very helpful info, and it makes total sense where I was struggling to see any.

ADezii:
Just as a side note, to the best of my knowledge, Mid$() is more efficient than its counterpart Mid() when dealing with Strings.
You may be right there but I've not heard that.

I do know that, while the $ versions of those functions are still supported, because they have general usability limitations when compared to the non-$ versions, they are no longer recommended. Essentially they're only there to support the existing code base. I find it very hard to imagine that any reduction in performance would even be detectable over hundreds of iterations so while it may be possible, I certainly wouldn't worry about it. The recommendation from Microsoft, going back many years now, is to use the versions that return Variant values so the caller can handle results where no string value can be returned.

Essentially, that would indicate that IslaDogs' understanding is correct.
3 Weeks Ago #7
ADezii
Expert 8TB
I`ve always had a preference for Functions that return Strings rather than Variants (NULL returns excluded), thus avoiding the Data Type Coercion factor. Does that make sense, or am I too, obsolete? (LOL).
3 Weeks Ago #8
isladogs
Expert 128KB
Of course ADezii$ is an exception to the rule regarding obsolescence :~>).
3 Weeks Ago #9
ADezii
Expert 8TB
Nice! Out of curiosity, when I get a chance, I'll run Benchmark Tests using both Versions of the Function, although I am fairly sure as to what the end result will be.
3 Weeks Ago #10
128KB
I've posted code that compares speeds using Mid, Mid$ InStr, InStrRev Replace functions, and Regular expressions.
In my PC environment, the results show that Replace is much faster than Mid.
but the administrator decides it's not relevant, I removed it.
3 Weeks Ago #11
NeoPa
Expert Mod 16PB
Hi SioSio.

Have you posted that in the right thread? I can't see where it answers the question or otherwise adds to this discussion. Am I missing something?
3 Weeks Ago #12
NeoPa
Expert Mod 16PB
ADezii:
I`ve always had a preference for Functions that return Strings rather than Variants (NULL returns excluded), thus avoiding the Data Type Coercion factor.
I'm not sure that there would be any Data Type Coercion involved - even with the Variant returning version. If you assign a String value Variant to a String variable then it just assigns. No coercion required. In fact, I'm pretty confident that any coercion for one version would be matched by the other. I can't think of any exceptions.

IslaDogs:
Of course ADezii$ is an exception to the rule regarding obsolescence :~>).
Quite right :-D

He may have an extraordinary fondness for archaic versions of some functions but we love him anyway. Few of us have been around on Bytes.com longer than he and I suspect few are as well loved. His eagerness to volunteer his time for all & sundry is legendary.
3 Weeks Ago #13
128KB
The sample code I showed was a function that could specify the start address of a character variable in memory and the length to retrieve. In other words, this function behaves the same as the Mid function in memory. However since it was pointed out that the problem was not solved, I deleted the post.
3 Weeks Ago #14
cactusdata
Expert 128KB
But you asked for faster alternatives to Mid.
Will this run a replacement in your large string faster than 0.15Ķs?

PS: Mid$ runs at exactly the same speed as Mid.
3 Weeks Ago #15
NeoPa
Expert Mod 16PB
CactusData:
But you asked for faster alternatives to Mid.
The OP (Ricardo de Mila) made that request, rather than SioSio, but they also asked about VARPTR & STRPTR.

It seems that SioSio, if I understand their post correctly, is offering some code as an illustration of how these values can be accessed using VBA, but in reality is confirming the point made earlier by Rabbit that native VBA can do nothing with these and they can only be used to access the (VARPTR & STRPTR) data directly using Windows API calls. It certainly isn't code you could expect to outperform the Mid() function.

CactusData:
PS: Mid$ runs at exactly the same speed as Mid.
Cheers. As I suspected but now confirmed.
3 Weeks Ago #16
ADezii
Expert 8TB
Ok, I ran some Benchmark Tests and it appears as though NeoPa, cactusdata, and isladogs were right all along in that the String Versions of these Functions are both obsolete and possess no advantage over their Variant counterparts. I'll give you the boring details as well as an Image of the results. You will see that the versions are, for the most part, identical, with Mid() having a slight advantage of .25+ seconds overall.
  1. I created a String consisting of 500,000, randomly generated, lower case letters (a-z).
  2. I created a Looping Structure of 5,000 Iterations. With each Iteration, I created a Random Number between 1 and Len(<Test String>) - 1,000.
  3. Then I extracted 1,000 characters from the Test String (using Mid() and Mid$()) in Randomly Generate locations, so the first 1,000 characters, the last 1,000 characters, or any block of 1,000 characters in between these could be extracted.
  4. These 1,000 character blocks were then assigned to a String Variable, to test my coercion Theory with Mid(), which failed miserably.
  5. I am done with Theories and starting fresh with the Variant Versions of these Functions.
  6. Sorry if I mislead anyone.
P.S. - I saw no need for further testing with variable Test String Lengths, given the prior results.

Attached Images
File Type: jpg BIG String.JPG (127.0 KB, 586 views)
3 Weeks Ago #17
isladogs
Expert 128KB
Thanks for confirming that info @ADezii$ - as you can see, you now are stuck with a $ attached to your user name :~>)
3 Weeks Ago #18
ADezii
Expert 8TB
That`s fine with me, since I have been called a lot worse in my lifetime. Does the $ suffix mean that I am now a Paid Consultant on BYTES?
3 Weeks Ago #19
isladogs
Expert 128KB
Of course....if it was up to me....
Perhaps niheel will double your existing remuneration rate!
3 Weeks Ago #20
NeoPa
Expert Mod 16PB
IslaDogs:
Perhaps niheel will double your existing remuneration rate!
Consider it done! Well worth every cent :-D
3 Weeks Ago #21
isladogs
Expert 128KB
Well done @NeoPa.
That was a deliberate 'test' to see if you would spot the transposed m & n.
A test you passed with flying colours! Award yourself a pay rise at the same time!
3 Weeks Ago #22
ADezii
Expert 8TB
All the time, I just thought that you were a bad spellar!
3 Weeks Ago #23
isladogs
Expert 128KB
Hi @ADezii$
No, it really was deliberate this time.
It followed a PM I sent to @NeoPa based on my unintentional misspelling of obsolescence in post #9 which he corrected!

BTW I think we've scared off the OP again between us all
3 Weeks Ago #24
NeoPa
Expert Mod 16PB
IslaDogs:
BTW I think we've scared off the OP again between us all
Scared or not he's had every aspect of his question answered in depth so he should have no complaints.
IslaDogs:
A test you passed with flying colours! Award yourself a pay rise at the same time!
Yay. I think I just caught up with ADezii's rate!!
3 Weeks Ago #25
NeoPa
Expert Mod 16PB
ADezii$:
All the time, I just thought that you were a bad spellar!
No. I'm not falling for that one :-D
3 Weeks Ago #26
Rabbit
Expert Mod 8TB
The issue here probably has less to do with the string manipulation and more to do with whatever else they're trying to do in the code or how they're calling it.
3 Weeks Ago #27
NeoPa
Expert Mod 16PB
Hi Rabbit.

I suspect the original issue was actually that they'd either forgotten, or never known, that VBA povides the ability to manipulate substrings with the use of the Mid() function on the left hand side of the equals (=).
3 Weeks Ago #28
32bit
Hello everybody... Thank you very much for replying. Didn't expect to have some many answers here.
Let me try to clarify a little bit better what I'm doing and let me see if you can help me.
In my specific case, My interactions are taking more than 3 days to be completed.
The intention here is to have a simple criptografy process applyed into a file that is around 230MB long.
This big file is splited in blocks of 2MB. Each block is initialy stored into a string to be processed and then, using mid$, I replace the raw caracters of the string by the criptografed caracters. Just to clarify a little bit better, see the example below:
Expand|Select|Wrap|Line Numbers
  1. Dim Sx1, Sx2 as string
  2. Dim Ix1 as long
  3.  
  4. Sx1="RICARDO":Sx2=""
  5. For Ix1 = 1 to Len(Sx1)
  6. Sx2 = Sx2+Chr$((ASC(Mid$(Sx1,Ix1,1))+1) Mod 256)
  7. Next Ix1
In this interactions, I'm speding more than 3 days.
The critografy I'm using is not much more than what I have showed here.

So... I was think that maybe could be possible to gain some time using STRPTR to replace the caracter inside the variable Sx1 directly.
Some years ago, when I still was a little boy and dinossaurs where still alive, I made a test using the old BASIC for similar things and the difference was really significative.

I'd like to learn how to do that using an API. Could somebody help me to understand that?
Why do we have STRPTR if it is not possible to access that memory position?
What for STRPTR is used for? If we can just use STRPTR trough an API, it seems that somebody don't want us to see some part of the memory.

Any way...
I'd like to conduct some experiments using the APIs.
Can somebody help me?

Best regards
Ricardo
3 Weeks Ago #29
128KB
Often I'm sorry.
The post has been deleted.
3 Weeks Ago #30
Rabbit
Expert Mod 8TB
The larger issue here is that you are treating binary data like strings. Just read, process, and write the data as a byte array. Skip all this extra overhead of string handling.
3 Weeks Ago #31
128KB
Is there a problem with my code?
If it is harmful answer, I delete it.
3 Weeks Ago #32
cactusdata
Expert 128KB
VBA is too slow for such tasks.
I would use C#.
3 Weeks Ago #33
Rabbit
Expert Mod 8TB
Siosio, your code to profile two different approaches is fine. My reply was directed at Ricardo at their overall approach of using strings in the first place
3 Weeks Ago #34
128KB
Rabbit, thank you for following me.
Next week, I will finally post the code that solves the Sx2 problem.
Ricardo de Mira, please wait for a while.
3 Weeks Ago #35
isladogs
Expert 128KB
SioSio
I'm confused.
IIRC you have posted and removed code twice now before I had a chance to study it properly
I may still have that code in my notification emails.
Are you suggesting I should just ignore earlier code as a different version will follow in a few days?
3 Weeks Ago #36
128KB
Hi isladogs.
The code I posted isn't perfect in terms of speeding up solutions, but it makes sense to want to learn how to use the OP's StrPtr, so I sent a message only to him.
The code I posted and deleted still has Sx2 issues. This needs to be resolved.
3 Weeks Ago #37
isladogs
Expert 128KB
OK thanks for letting me know
3 Weeks Ago #38
Rabbit
Expert Mod 8TB
CactusData, VBA is perfectly fine for this from a speed perspective. You just have to use the right approach.

Treat the data as a byte array and you can process 4 million bytes in roughly 10 milliseconds. The slowest portion of working with file data is the bottleneck at the disk IO.

The slowness of Ricardo's code has to do with constntly rebuilding the string variable. As suggested by others, it can be sped up by applying the change directly by also placing the Mid call on the left side of the equation. But this is roughly 10 times slower than just dealing directly with byte arrays, 100 ms vs 10 ms.
3 Weeks Ago #39
NeoPa
Expert Mod 16PB
@SioSio.
As Administrator, or any Moderator, I see no problems with any of your code. It doesn't break any rules. I just struggle to see how it helps. It seems to me that you have maybe not properly understood the requirement, but I say that with the understanding that you're working in a foreign language, so without criticism. Certainly don't worry that you're doing anything I need to be involved with.

I would suggest, only suggest mind you, that you read what Rabbit has said very carefully before you post any sort of solution. It would be a shame to waste time on a solution that either won't work or that would be more complicated than it needs to be.

@Rabbit.
Very interesting. Perhaps you could flesh out the concept with techniques for loading and saving the Byte Arrays you speak of. I doubt most people would know where to start on that and it's your idea so only fair (to you) that you get the first opportunity to fill out the answer so it can be used.

Please consider the idea of Best Answer so that your single post includes all parts of the solution including an explanation of why you suggest what you do.
3 Weeks Ago #40
NeoPa
Expert Mod 16PB
@Ricardo.
May I suggest that you read the existing replies carefully before deciding that VarPtr() & StrPtr() are necessarily what you need. It seems to me that you haven't understood what's been written here already. Again, there may be difficulties with the language, but we can only really help in English so it makes sense that you try to understand what is already there. Otherwise you'll miss the main benefits of having so many very clever experts who are trying to help you.
3 Weeks Ago #41
Rabbit
Expert Mod 8TB
Perhaps I made it sound more obscure than I intended. When you read a file as binary, it returns an array of bytes. When you write to a file as binary, you pass an array of bytes.
3 Weeks Ago #42
NeoPa
Expert Mod 16PB
Interesting. I've done a bit of work using Open#, Close#, Input# & Write# but I was unable to to find anything in help that tells how to do that. Neither old 2003 (proper) help files nor the new ones in docs.microsoft.com mention anything about Byte Arrays. I should add here that though the old Help system was brilliant and the newer ones have been a real let-down, much work has been done in the last number of years to get this back to the level it was at before.

Anyway, having found nothing helpful in the Help systems, and not having much of a lead from your post, I dug up some old work of my own and at least re-discovered the Get# & Put# statements. These are not linked to from the Open# but clearly are fundamental to follow where you're going. However, even these I found to refer to string variables for input & output.

Eventually, after a great deal of going back & forth in the help systems, I found a short paragraph under opening for Random (Within the Get Statement.), that explains how a Byte Array might work. Later on it goes on to explain which of the paragraphs relating to Random also pertain to Binary. The relevant paragraph was :
2003 Help:
If the variable being read into is a fixed-size array, Get reads only the data. No descriptor is read.
Though I've used Get# & Put# before (Obviously - from my earlier comments.), I've never used Arrays with it. I'd always previously used strings (Pre-set to the desired length). I've found the performance (When buffered properly etc.) to be extremely impressive. Mostly my code works on specific parts of large files but I've hardly noticed any delays.

This is extraordinarily useful & powerful. It's also extremely difficult to get helpful Help on. I've reported the page as being less helpful than it could be.

For anyone thinking of using these commands I recommend great care. Test your code on copies of files until you know it's working solidly. It's very powerful - and can do great damage if not used carefully.
3 Weeks Ago #43
Rabbit
Expert Mod 8TB
In essence, you'll do something like this:
Expand|Select|Wrap|Line Numbers
  1. Dim arrBytes() As Byte
  2. Dim intFileNum As Long
  3. Dim i As Long
  4.  
  5. intFileNum = FreeFile
  6. Open "C:\somefile.ext" For Binary Access Read As intFileNum
  7. ReDim arrBytes(LOF(intFileNum) - 1)
  8. Get intFileNum, , arrBytes
  9.  
  10. For i = 0 To UBound(arrBytes)
  11.     arrBytes(i) = (arrBytes(i) + 1) Mod 256
  12. Next i
  13.  
  14. Close intFileNum
The portion that manipulates the bytes is a many times faster than the string version but the magnitude of the improvement isn't huge. Both the string and byte array versions are extremely quick, the byte array is just slightly quicker. Does it matter if one takes a second and the other takes a quarter of a second? Probably not, but the byte array version is quicker and cleaner.
3 Weeks Ago #44
ADezii
Expert 8TB
Expand|Select|Wrap|Line Numbers
  1. Dim intFileNum%
Data Type Identifier at the end of a Variable Declaration? I thought that I was old school, wait till NeoPa and isladogs see this, then they will have someone else to pick on! (LOL)
3 Weeks Ago #45
NeoPa
Expert Mod 16PB
Rabbit:
Both the string and byte array versions are extremely quick, the byte array is just slightly quicker. Does it matter if one takes a second and the other takes a quarter of a second? Probably not, but the byte array version is quicker and cleaner.
Generally not, but if Ricardo is dealing with multi-megabytes then possibly. From my experience of the string version (The only one I even knew of until you chipped in with this very interesting concept.) is already very quick so it probably won't make much difference across just a couple of megabytes.

I did notice, however, that your code doesn't include any hash/octothorp (#) characters, yet I was under the impression they are required in such situations. Am I wrong? Is this a different version from what I've been using?
ADezii:
I thought that I was old school, wait till NeoPa and isladogs see this, then they will have someone else to pick on! (LOL)
(ROFL) - I really did guffaw. Nice one :-D

NB. You should also notice that Rabbit is also a moderator himself so he might also pick himself up on his code should he choose to ;-)

I suspect Rabbit is older than he seems and this part of the code is left over from when he first created the basic code many, many years ago.

@Rabbit.
I'd like to set this as Best Answer. In fact, I will anyway. However, it would be improved if you'd make updates to deal with the percent (%) (It's great to know about but many users will simply be confused.) and the hashes if necessary.
3 Weeks Ago #46
NeoPa
Expert Mod 16PB
It may be interesting to know that whenever I do such file work using Open# etc (Actually, the # always goes as a prelude to the File Number specification rather than at the end of Open or any of the other hashed keywords, but that makes it hard to talk about so the Help system refers to them with the hash suffix instead.), I always have to go back to some previous code to find out how to do it. Once, long ago, I went through the Help system and worked out how to get it to work without the addition of all the metadata that it likes to add by default.

I managed to work it out way back when, but it certainly wasn't for the faint-hearted. Since then I've avoided having to repeat that unpleasant experience by reviewing my existing code every time I've needed to do any work in that area. The Help system was always reasonably decent once you knew what you were looking for but :
  1. Get# & Put# are not easily discoverable. These are the Statements to use for direct file access as opposed to BASIC file access which likes to add the metadata for you - thus making it useless for anything other than files created using those same Statements (Write# & Print#.) Get# & Put# are listed in the Help system under Get & Put yet both describe the hashed versions.
  2. The parameters required for direct access when using Open# are also pretty obscure. It takes a good deal of reading through the comments, as well as a knowledge of Get# & Put#, in order to work out how it should be done.
  3. As far as I'm aware, it's also unbuffered I/O. This means very little nowadays I suppose, with the hardware buffering for you anyway, but it doesn't help to have O/S calls firing off so frequently in running code.
I have a buffered I/O Class Module somewhere that deals with that for you. If I can dig it up I'll add it here - or maybe even a link to the article if I manage to write one.
3 Weeks Ago #47
NeoPa
Expert Mod 16PB
Hopefully Direct File I/O in VBA will prove helpful to people who are looking for some help with such work.
3 Weeks Ago #48
ADezii
Expert 8TB
@Rabbit:
  1. Out of curiosity, I have a File named SmallFile.txt in the C:\Test\Folder which consists of a single Line of 26 characters (a-z).
  2. When I Open this File in BINARY READ Access using your Code, I get the Results depicted below (Before.jpg).
  3. I can obtain the correct results (After.jpg) by Redimensioning arrBytes to LOF(intFileNum) and not processing the last three Bytes of arrBytes(), but I am not really sure what is going on here, any ideas?


Attached Images
File Type: jpg Before.JPG (35.2 KB, 528 views)
File Type: jpg After.JPG (35.0 KB, 529 views)
3 Weeks Ago #49
Rabbit
Expert Mod 8TB
I suspect Rabbit is older than he seems and this part of the code is left over from when he first created the basic code many, many years ago.
Or perhaps not. I googled it. I only knew that it would be odd if I wasn't able to read a file in binary mode. I was admittedly confused by the purpose of the % symbol. The code seems to run fine without any of the special symbols.
3 Weeks Ago #50
63 Replies

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

8 posts views Thread by andrewpalumbo | last post: by
17 posts views Thread by Noen | last post: by
23 posts views Thread by Mark Dickinson | last post: by
reply views Thread by thri | last post: by
1 post views Thread by Tim Mulholland | last post: by
8 posts views Thread by TM | last post: by
1 post views Thread by Dave | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.