Bytes | Software Development & Data Engineering Community

Faster or more Efficient parsing of strings?

SwissProgrammer
220 New Member
I am new at C++ and I have a question that relates to the implied logic of some code in someone else's post:
  { char m,z; int n,s,i;
    for (s=0,m='A'; m <='Z'; m++) {
I think that the char m is treated as a number that can be incremented and not as a string even though it is coded in as a string. I am OK with that. But, it brings up a parsing question.

I thought that
"Char is a C++ data type designed for the storage of letters."

And

"It is an integral data type, meaning the value is stored as an integer."
It looks like the value ("A") of the variable (m) is not stored anywhere (other than in an ASCII reference table) but an integer number is what is stored in memory. I think that this is stored in ASCII. This is telling me that since ASCII for A can be found via int(A), then when I input "hello world" that the string is not being saved as an English string but it is being saved (in memory) as the ASCII values of each letter of "hello world".

It looks to me like it would be more efficient to parse via the numbers in memory rather than the string letters that I am using in my programs.

Someone a lot more advanced than me, am I correct in this and is there a process to do this that is faster than parsing via string letters? I do not want to struggle through months of writing one to find that it is more efficient to use processes (in the operating system) that are already included in the operating system to do it.

Why I am asking this:
A basic premise to a program that I am programming has been for it to be useful in all Unicode referenced languages (within reason). I think that I have found that different operating systems work with Unicode differently, but they seem to (maybe) see the code points in the same binary/byte representation. I have found reports of conflicts for various UTF versions with the most stable and most universal being UTF-8. When I tried to parse Unicode strings I think that I found that parsing via the bits of the bytes is the most cross-platform process. I am thinking, "Is there a better, or more efficient way of doing this?" Thus, this question as it relates to simple ASCII Character/strings.

Staying to characters and ASCII via C++, please supply comments and/or help.

Thank you.
Dec 14 '20 #1
misto
2 New Member
There are indeed ways to speed up parsing, but they don't really apply to the particular example you gave. As far as the compiler is concerned, there is absolutely no difference between 'A', 65, 0x41, etc. They all refer to the same integer.
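
To illustrate the point above, here is a minimal sketch (not part of the original reply) showing that the character literal and the integer spellings are interchangeable to the compiler. It assumes an ASCII-compatible execution character set, and code_of is a hypothetical helper name.

```cpp
#include <type_traits>

// Assumption: the execution character set is ASCII-compatible.
// All three checks are evaluated at compile time.
static_assert('A' == 65, "holds only for ASCII-compatible character sets");
static_assert('A' == 0x41, "hexadecimal spelling of the same integer");
static_assert(std::is_integral<char>::value, "char is an integral type");

// Returns the number that is actually stored for a character.
constexpr int code_of(char c) { return static_cast<int>(c); }
```

If the first static_assert fails to compile, the platform is not using an ASCII-compatible character set.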
Dec 15 '20 #2
SwissProgrammer
220 New Member
misto,

Thank you. I look forward to reading more wisdom from you.

Directly answered. Thank you. But that brings up an issue about user-input text while staying with the question of "Faster or more Efficient parsing of strings".

If a user inputs text via typing, or copy and paste, or drag and drop, and they are using ASCII (abc 123 etc.); when my C++ program parses that input text does the parsing work the same as on text that I write hard coded into my program? I think not, but I have to ask. And particularly how is it different. And the main question again is, is there a maximum speed parsing process for that (user-input) text?
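
A hedged note on this question: once the text is in memory, typed text and a hard-coded literal are both just sequences of char values, so the same parsing code applies to both. What differs is where the bytes come from: a literal is converted by the compiler at build time, while input arrives at run time in whatever encoding the input source delivers. The sketch below simulates typing with std::istringstream; read_line, simulate_typing and same_bytes are hypothetical names.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads one line of input from any stream (std::cin, a file, ...).
std::string read_line(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
}

// Stands in for a user typing text followed by Enter.
std::string simulate_typing(const std::string& keys) {
    std::istringstream in(keys);
    return read_line(in);
}

// True if the typed text and a hard-coded literal hold the same bytes.
bool same_bytes(const std::string& typed, const char* literal) {
    assert(literal != nullptr);
    return typed == literal;
}
```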

Thank you.
Dec 15 '20 #3
Banfa
9,065 Recognized Expert Moderator Expert
You need to be careful about code like the snippet you have posted, because it makes assumptions that, depending on your platform, just aren't true. They all relate to the execution character set.

So when writing C/C++ code (and possibly other languages) you have to worry about 2 character sets:
  • The source character set: the character set used to write the source code in, often related to what is available on the platform used for development.
  • The execution character set: the character set used by the platform running the compiled executable.

These can be different, particularly if you are cross compiling for an embedded platform.

Additionally, char does not store letters. It stores numbers that are meant to represent characters (most of them printable), but the actual character represented is defined by the execution character set.

Which brings us to what the code is doing. Consider this:
  char a = 'A';
  char b = a + 1;
You might expect b to have a value representing 'B', but the C and C++ standards do not guarantee that; it depends on the execution character set. This only works where 'A' and 'B' have contiguous values in the character set, which is true for ASCII and UTF-8, which many platforms use. However, there are character sets (notably EBCDIC, which is used on IBM mainframes) where the letters of the alphabet are not contiguous, such that 'J' - 'I' != 1.

The C and C++ standards do require that the decimal digits are contiguous in every character set, so arithmetic on those characters produces the expected result, i.e. '9' - '2' == 7 for all character sets.
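
As a small sketch of the digit guarantee (an illustration added here, not code from the thread), the helpers below rely only on '0' to '9' being contiguous. digit_value and parse_unsigned are hypothetical names, and overflow is deliberately not checked.

```cpp
// Converts a decimal digit character to its value, or -1 if it is not a digit.
// Relies only on the guarantee that '0'..'9' have contiguous values.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? (c - '0') : -1;
}

// Parses the leading decimal digits of a string, stopping at the first
// non-digit (including the terminating '\0'). Sketch only: no overflow check.
constexpr unsigned parse_unsigned(const char* s, unsigned acc = 0) {
    return digit_value(*s) < 0
               ? acc
               : parse_unsigned(s + 1, acc * 10 + digit_value(*s));
}
```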

The point is that this sort of code only works because the execution character set is defined so that it works. So if you want to create portable code you should either avoid this sort of operation on your characters or confine it to a few functions that can be changed if necessary for different platforms rather than let it pervade your code.
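
One way to confine the character-set dependence to a few functions, sketched under the assumption of a C++14 or later compiler: spell the alphabet out once and look letters up in it, instead of incrementing a char. kUpper, upper_index and next_upper are hypothetical names.

```cpp
// The alphabet is written out, so nothing here assumes contiguous letter codes.
constexpr char kUpper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Returns the zero-based position of an uppercase letter, or -1 if c is not one.
constexpr int upper_index(char c) {
    for (int i = 0; kUpper[i] != '\0'; ++i) {
        if (kUpper[i] == c) return i;
    }
    return -1;
}

// Returns the letter after c, or '\0' if c is 'Z' or not an uppercase letter.
constexpr char next_upper(char c) {
    int i = upper_index(c);
    return (i >= 0 && i < 25) ? kUpper[i + 1] : '\0';
}
```

A loop over the alphabet can then walk kUpper directly, and only these helpers would need attention on an unusual platform.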

Sometimes it is easier to process textual data as the numbers they represent: e.g. a ^= 0x20; toggles the case of your letter (for ASCII), but this is by definition character set dependent and not fully portable.
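
A hedged sketch of the two styles side by side. toggle_case_ascii is the bit trick and is only correct where letters follow the ASCII layout (its range checks share that assumption); toggle_case_portable leaves the mapping to the <cctype> functions of the current locale. Both function names are hypothetical.

```cpp
#include <cctype>

// ASCII only: upper and lower case letters differ in the single bit 0x20.
// Characters outside A-Z and a-z are returned unchanged.
constexpr char toggle_case_ascii(char c) {
    return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
               ? static_cast<char>(c ^ 0x20)
               : c;
}

// Character-set independent: the library decides what counts as a letter.
// The cast to unsigned char avoids undefined behaviour for negative chars.
char toggle_case_portable(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    if (std::isupper(u)) return static_cast<char>(std::tolower(u));
    if (std::islower(u)) return static_cast<char>(std::toupper(u));
    return c;
}
```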


Oh and finally remember that computers have no concept of letters at all; they only do numbers (certainly at the sort of low level place we are talking about); they may have code that tells them that a given number should be displayed as a certain pattern of pixels but it takes a human looking at the pattern before any concept of letter appears in the process. It can be useful for the programmer to think of the computer as handling letters but it isn't really doing that.
Dec 15 '20 #4
SwissProgrammer
220 New Member
Banfa,

To be clear, I am not talking about .net (JIT, "managed code"). I have collided with .net so many times, and had so much difficulty removing that stuff that I mentioned it. Just to make certain.

You refer to "The execution character set", "used by the platform running the compiled executable." I shall now consider that. Thank you.

Looking at ASCII and also at Unicode:

I think that if I parse via UTF-8 byte representations of each character, ASCII or Unicode, then this might be cross-platform in that these byte representations are the same in most platforms. I do not know, but I think so.

If UTF-8 is not so universal in its byte representations of ASCII and/or Unicode, then maybe have my program determine the operating system that it is running on and use the appropriate UTF (8 or 16 or 32) for that system. Then parse via those bytes. Does that make sense? Or, are the bytes different on different platforms for "123 办 abc"? If they are different then I plan to account for that in parsing the user-input strings. If they are the same then that seems more simple to plan for.
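
A hedged illustration of the byte question, using the string from the paragraph above. UTF-8 is defined as a sequence of single bytes, so its bytes do not depend on the platform. UTF-16 and UTF-32 use larger code units, so their in-memory byte order follows the machine's endianness. The bytes are written out as numbers so the example does not depend on how the source file is encoded; kUtf8, low_byte and high_byte are hypothetical names.

```cpp
#include <cstdint>

// "123 办 abc" in UTF-8. The character U+529E encodes as E5 8A 9E;
// everything else is a single ASCII byte.
constexpr unsigned char kUtf8[] = {
    0x31, 0x32, 0x33, 0x20, 0xE5, 0x8A, 0x9E, 0x20, 0x61, 0x62, 0x63
};

// In UTF-16 the same character is the single 16-bit code unit 0x529E.
// A little-endian machine stores the low byte first (9E 52),
// a big-endian machine stores the high byte first (52 9E).
constexpr unsigned char low_byte(std::uint16_t unit) {
    return static_cast<unsigned char>(unit & 0xFF);
}
constexpr unsigned char high_byte(std::uint16_t unit) {
    return static_cast<unsigned char>(unit >> 8);
}
```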

This post has become narrowed to faster parsing of the user-input strings.

Comments on this are appreciated.

Thank you.
Dec 16 '20 #5
misto
2 New Member
Parsing text is an entire field of study. So you really have to narrow down what you mean by that. Are you trying to parse CSV files? Regular expressions? HTML code? The approach used strictly depends on the input format, what you need to extract from the data contained therein, etc.

Parsing source code is typically much more intensive than your everyday "garden variety" parsing tasks, although (again) it just depends on what's being parsed.

As for particular approaches, there are many. Iterative, sequential parsing is common, but table-based lookup strategies are popular too. Another funny technique is to pack the bytes into large integers for faster lookups.
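
As a rough sketch of the table-based idea (one possible shape, not a reference implementation): classify each of the 256 possible byte values once, then let the parsing loop do a single array lookup per byte. The names CharClass, build_table and count_words are hypothetical, and treating every byte of 0x80 or above as a word byte is an assumption made here so that UTF-8 sequences stay inside a word.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum CharClass : unsigned char { kOther = 0, kDigit = 1, kLetter = 2, kSpace = 3 };

// Builds the 256-entry classification table once.
std::array<unsigned char, 256> build_table() {
    std::array<unsigned char, 256> t{};  // every entry starts as kOther
    for (int i = 0; i < 256; ++i) {
        if (i >= 0x80) t[i] = kLetter;  // assumption: part of a UTF-8 sequence
        else if (std::isdigit(i)) t[i] = kDigit;
        else if (std::isalpha(i)) t[i] = kLetter;
        else if (std::isspace(i)) t[i] = kSpace;
    }
    return t;
}

// Counts words, where a word is a maximal run of letter or digit bytes.
std::size_t count_words(const std::string& text) {
    static const std::array<unsigned char, 256> table = build_table();
    std::size_t words = 0;
    bool in_word = false;
    for (unsigned char byte : text) {
        unsigned char cls = table[byte];  // one lookup, no branching on ranges
        bool word_byte = (cls == kDigit || cls == kLetter);
        if (word_byte && !in_word) ++words;
        in_word = word_byte;
    }
    assert(words <= text.size());
    return words;
}
```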

UTF-8 is great but unfortunately makes parsing a HUGE headache. Thankfully there are libraries such as utf8.h. Otherwise, if at all possible stick to ASCII.
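
A small example of why UTF-8 needs extra care: the number of bytes is not the number of characters. The sketch below counts code points by skipping continuation bytes (bit pattern 10xxxxxx). It assumes the input is already valid UTF-8 and does no validation; is_continuation and count_code_points are hypothetical names, not functions from utf8.h.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which always look like 10xxxxxx.
constexpr bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Counts code points by counting every byte that starts a sequence.
// Assumes valid UTF-8; malformed input gives a meaningless count.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if (!is_continuation(byte)) ++n;
    }
    assert(n <= utf8.size());
    return n;
}
```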
Dec 16 '20 #6
Banfa
9,065 Recognized Expert Moderator Expert
Another point to remember is that, just as with sorting algorithms, unless you have a lot of text to parse or you are performing an academic exercise, searching for the best parsing algorithm rather than settling for one that is good enough is a waste of time.

This falls into the category of errors known as early optimisation.
Dec 16 '20 #7
SwissProgrammer
220 New Member
misto, and Banfa, Thank you.

misto: "The approach used strictly depends on the input format".
Graphical User Input. Text boxes. User input boxes. Not so much CLI at this time. Some place that the user could type into (via a keyboard) or drag-n-drop into or copy-n-paste into.

I would like to have my program be usable in many languages including English, French, German, Chinese, Japanese, Korean, and my program to be able to work with the input of the user.

Example: An English language user types into a text box (CreateWindowExW / Edit) and some of their words are not acceptable, or not nice, or whatever our reason for deciding to change them, as we the owners of the software decide. I would like for the software to search the input (input from the user) as it is input and delete or replace those words.

Example: A user inputs in English, "You are mean" and we want to change that to "You are not nice".

Example: A user inputs in Malayalam, "നിങ്ങൾ മോശക്കാരനാണ്" (You are mean) and we want to change that to "നിങ്ങൾ നല്ലവനല്ല" (You are not nice).
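
A minimal sketch of the replacement step described in these examples, assuming the text is held as UTF-8 in a std::string. Because a complete UTF-8 sequence cannot match in the middle of another character, a plain byte-wise search works for non-English text as well. replace_all is a hypothetical helper; it matches substrings rather than whole words and does no Unicode normalisation, so a real filter would need more than this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces every occurrence of `from` with `to`, comparing raw bytes.
std::string replace_all(std::string text, const std::string& from,
                        const std::string& to) {
    assert(!from.empty());  // an empty pattern would match at every position
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}
```

Note that "meaning" would also be changed by this sketch, which is one reason word boundaries need separate handling.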

Just staying with the parsing, I think that our program now has the ability to work with the full range of Unicode (all 17 Planes) via bytes (UTF-8 and UTF-16 and UTF-32) as needed. But, I would like to have experienced comments about what direction to study parsing.

I asked about ASCII first because I wanted at least some foundation in parsing to start with before going into the more complicated parsing of Unicode types of input.

I am now focused beyond ASCII. If I receive comments on parsing ASCII, thank you. If I receive comments on parsing beyond ASCII, or in the Unicode ranges, thank you. Your small or large comments are all being taken into consideration. Thank you.



Banfa: "unless you have a lot of text to parse":
I am guessing that I can limit the input to less than 500 words each time.

A typed page in English has averaged about 100 words per page, so 5 pages of a communication between users might be a lot for our program to expect.

I expect to get mostly 1 to about 10 words.
"Hello", "Did you want to go fishing today or later this week?". Similar in Japanese: "こんにちは", "今日または今週後半に釣りに行きたいですか?"

You gave me some basics that I am now considering. Thank you. I am now looking for comments on parsing words or sentences as I have further described for faster parsing of the user-input strings.

In my limited experience and understanding, I am considering converting all non-ASCII to Hex, or Binary, or Decimal and parse via that. I might be wrong. I am looking for comments and ideas. If there is a better or faster approach, I am interested.

Banfa: "errors known as early optimisation". It reminds me of "know thyself and in 100 battles you shall lose none." I like that. Thank you.

Comments on parsing as it relates to the further information?

Thank you.
Dec 16 '20 #8
SioSio
272 Contributor
The first thing to do is to identify the language of the characters that were typed.
For example, Google Search uses UTF-8, so it can support multiple languages.
The Google Translate API also provides the ability to identify the language of a string.
https://googleblog.blogspot.com/2008...tools-for.html

By the way, Chinese characters are used in Japanese, Korean, Vietnamese, and Chinese, but they are slightly different.
Since Unicode assigns these shared characters to the same code points regardless of language, it is almost impossible to identify the language by the character code alone.


Information on other language identification.
http://www.let.rug.nl/~vannoord/TextCat/
https://code.google.com/archive/p/language-detection/
Dec 18 '20 #9
SwissProgrammer
220 New Member
SioSio,

Are you saying that it might be the most universal across platforms to convert what that platform represents as a character (in its own UTF (7,8,16,32,etc.)) into UTF-8 Octets and then use that in my parsing?
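
For reference, a hedged sketch of that conversion for the UTF-16 case, which is what the wide-character Windows controls mentioned earlier deliver. It is an illustration only: append_utf8 and utf16_to_utf8 are hypothetical names, and replacing unpaired surrogates with U+FFFD is a design choice made here. On Windows, the existing WideCharToMultiByte function with CP_UTF8 does the same job.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // the highest Unicode code point
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (a choice made for this sketch).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```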

I could go from that to a database of intended edits.

If so then I think that I have found the answer.

Thank you.
Dec 19 '20 #10
