Bytes | Developer Community

Faster or more Efficient parsing of strings?

SwissProgrammer
I am new at C++ and I have a question that relates to the implied logic of some code in someone else's post:
{ char m,z; int n,s,i;
  for (s=0,m='A'; m <='Z'; m++) {
I think that the char m is treated as a number that can be incremented and not as a string even though it is coded in as a string. I am OK with that. But, it brings up a parsing question.

I thought that
"Char is a C++ data type designed for the storage of letters."

And

"It is an integral data type, meaning the value is stored as an integer."
It looks like the value ('A') of the variable (m) is not stored anywhere (other than in an ASCII reference table); what is stored in memory is an integer number, the ASCII code. Since the ASCII code for 'A' can be found via int('A'), this tells me that when I input "hello world" the string is not being saved as an English string but is being saved (in memory) as the ASCII values of each letter of "hello world".

It looks to me like it would be more efficient to parse via the numbers in memory rather than the string letters that I am using in my programs.

Someone a lot more advanced than me, am I correct in this and is there a process to do this that is faster than parsing via string letters? I do not want to struggle through months of writing one to find that it is more efficient to use processes (in the operating system) that are already included in the operating system to do it.

Why I am asking this:
A basic premise to a program that I am programming has been for it to be useful in all Unicode referenced languages (within reason). I think that I have found that different operating systems work with Unicode differently, but they seem to (maybe) see the code points in the same binary/byte representation. I have found reports of conflicts for various UTF versions with the most stable and most universal being UTF-8. When I tried to parse Unicode strings I think that I found that parsing via the bits of the bytes is the most cross-platform process. I am thinking, "Is there a better, or more efficient way of doing this?" Thus, this question as it relates to simple ASCII Character/strings.

Staying to characters and ASCII via C++, please supply comments and/or help.

Thank you.
Dec 14 '20 #1
9 Replies
misto
There are indeed ways to speed up parsing, but they don't really apply to the particular example you gave. As far as the compiler is concerned, there is absolutely no difference between 'A', 65, 0x41, etc. They all refer to the same integer.
Dec 15 '20 #2
SwissProgrammer
misto,

Thank you. I look forward to reading more wisdom from you.

Directly answered. Thank you. But that brings up an issue about user-input text while staying with the question of "Faster or more Efficient parsing of strings".

If a user inputs text via typing, or copy and paste, or drag and drop, and they are using ASCII (abc 123 etc.); when my C++ program parses that input text does the parsing work the same as on text that I write hard coded into my program? I think not, but I have to ask. And particularly how is it different. And the main question again is, is there a maximum speed parsing process for that (user-input) text?

Thank you.
Dec 15 '20 #3
Banfa
Expert Mod 8TB
You need to be careful about code like the snippet you have posted because it makes assumptions that, depending on your platform, just aren't true; they are all related to the execution character set.

So when writing C/C++ code (and possibly other languages) you have to worry about 2 character sets:
  • The source character set: the character set used to write the source code in, often related to what is available on the platform used for development
  • The execution character set: the character set used by the platform running the compiled executable.

These can be different, particularly if you are cross compiling for an embedded platform.

Additionally, char does not store letters; it stores numbers that are meant to represent characters (most of them printable), but the actual character represented is defined by the execution character set.

Which comes onto what the code is doing; consider this
char a = 'A';
char b = a + 1;
You might expect b to have a value representing 'B', but the C/C++ standards do not guarantee that; it depends on the execution character set. This only works where 'A' and 'B' have contiguous values in the character set, which is true for ASCII and UTF-8, which many platforms use. However, there are character sets (notably EBCDIC, long used on IBM mainframes) where the alphabet is not contiguous, so that 'J' - 'I' != 1.

The C/C++ standards do require that the decimal digits are contiguous in every character set, so maths using those characters does produce the expected result, i.e. '9' - '2' == 7 for all character sets.

The point is that this sort of code only works because the execution character set is defined so that it works. So if you want to create portable code you should either avoid this sort of operation on your characters or confine it to a few functions that can be changed if necessary for different platforms rather than let it pervade your code.

Sometimes it is easier to process textual data as the numbers they represent: e.g. a ^= 0x20; toggles the case of a letter (for ASCII), but this is by definition character-set dependent and not fully portable.


Oh and finally remember that computers have no concept of letters at all; they only do numbers (certainly at the sort of low level place we are talking about); they may have code that tells them that a given number should be displayed as a certain pattern of pixels but it takes a human looking at the pattern before any concept of letter appears in the process. It can be useful for the programmer to think of the computer as handling letters but it isn't really doing that.
Dec 15 '20 #4
SwissProgrammer
Banfa,

To be clear, I am not talking about .net (JIT, "managed code"). I have collided with .net so many times, and had so much difficulty removing that stuff that I mentioned it. Just to make certain.

You refer to "The execution character set", "used by the platform running the compiled executable." I shall now consider that. Thank you.

Looking at ASCII and also at Unicode:

I think that if I parse via UTF-8 byte representations of each character, ASCII or Unicode, then this might be cross-platform in that these byte representations are the same in most platforms. I do not know, but I think so.

If UTF-8 is not so universal in its byte representations of ASCII and/or Unicode, then maybe have my program determine the operating system that it is running on and use the appropriate UTF (8 or 16 or 32) for that system. Then parse via those bytes. Does that make sense? Or, are the bytes different on different platforms for "123 办 abc"? If they are different then I plan to account for that in parsing the user-input strings. If they are the same then that seems more simple to plan for.

This post has become narrowed to faster parsing of the user-input strings.

Comments on this are appreciated.

Thank you.
Dec 16 '20 #5
misto
Parsing text is an entire field of study. So you really have to narrow down what you mean by that. Are you trying to parse CSV files? Regular expressions? HTML code? The approach used strictly depends on the input format, what you need to extract from the data contained therein, etc.

Parsing source code is typically much more intensive than your everyday "garden variety" parsing tasks, although (again) it just depends on what's being parsed.

As far as particular approaches there are many. Iterative, sequential parsing is common but table-based lookup strategies are popular too. Another funny technique is to pack the bytes into large integers for faster lookups.

UTF-8 is great but unfortunately makes parsing a HUGE headache. Thankfully there are libraries such as utf8.h. Otherwise, if at all possible stick to ASCII.
Dec 16 '20 #6
Banfa
Expert Mod 8TB
Another point to remember is that, just like with sorting algorithms, unless you have a lot of text to parse, or you are performing an academic exercise, searching for the best parsing algorithm, as opposed to settling for one that is good enough, is a waste of time.

This falls into the category of errors known as premature optimisation.
Dec 16 '20 #7
SwissProgrammer
misto, and Banfa, Thank you.

misto: "The approach used strictly depends on the input format".
Graphical User Input. Text boxes. User input boxes. Not so much CLI at this time. Some place that the user could type into (via a keyboard) or drag-n-drop into or copy-n-paste into.

I would like to have my program be usable in many languages including English, French, German, Chinese, Japanese, Korean, and my program to be able to work with the input of the user.

Example: An English language user types into a text box (CreateWindowExW / Edit) and some of their words are not acceptable, or not nice, or whatever our reason for deciding to change them, as we the owners of the software decide. I would like for the software to search the input (input from the user) as it is input and delete or replace those words.

Example: A user inputs in English, "You are mean" and we want to change that to "You are not nice".

Example: A user inputs in Malayalam, "നിങ്ങൾ മോശക്കാരനാണ്" (You are mean) and we want to change that to "നിങ്ങൾ നല്ലവനല്ല" (You are not nice).

Just staying with the parsing, I think that our program now has the ability to work with the full range of Unicode (all 17 Planes) via bytes (UTF-8 and UTF-16 and UTF-32) as needed. But, I would like to have experienced comments about what direction to study parsing.

I asked about ASCII first because I wanted at least some foundation in parsing to start with before going into the more complicated parsing of Unicode types of input.

I am now focused beyond ASCII. If I receive comments on parsing ASCII, thank you. If I receive comments on parsing beyond ASCII, or in the Unicode ranges, thank you. Your small or large comments are all being taken into consideration. Thank you.



Banfa: "unless you have a lot of text to parse":
I am guessing that I can limit the input to less than 500 words each time.

A typed page in English has averaged about 100 words per page, so 5 pages of a communication between users might be a lot for our program to expect.

I expect to get mostly 1 to about 10 words.
"Hello", "Did you want to go fishing today or later this week?". Similar in Japanese: "こんにちは", "今日または今週後半に釣りに行きたいですか?"

You gave me some basics that I am now considering. Thank you. I am now looking for comments on parsing words or sentences as I have further described for faster parsing of the user-input strings.

In my limited experience and understanding, I am considering converting all non-ASCII to hex, or binary, or decimal, and parsing via that. I might be wrong. I am looking for comments and ideas. If there is a better or faster approach, I am interested.

Banfa: "errors known as early optimisation". It reminds me of "know thyself and in 100 battles you shall lose none." I like that. Thank you.

Comments on parsing as it relates to the further information?

Thank you.
Dec 16 '20 #8
SioSio
The first thing to do is identify the language of the characters the user typed.
For example, Google search uses UTF-8, so it can support multiple languages.
The Google Translate API also provides the ability to identify the language of a string.
https://googleblog.blogspot.com/2008...tools-for.html

By the way, Chinese characters are used in Japanese, Korean, Vietnamese, and Chinese, but they are slightly different.
Since the kanji code points in UTF-8 are not grouped by language, it is almost impossible to identify the language from the character code alone.


Information on other language identification.
http://www.let.rug.nl/~vannoord/TextCat/
https://code.google.com/archive/p/language-detection/
Dec 18 '20 #9
SwissProgrammer
SioSio,

Are you saying that it might be the most universal across platforms to convert what that platform represents as a character (in its own UTF (7,8,16,32,etc.)) into UTF-8 Octets and then use that in my parsing?

I could go from that to a database of intended edits.

If so then I think that I have found the answer.

Thank you.
Dec 19 '20 #10
