misto and Banfa, thank you.
misto: "The approach used strictly depends on the input format".
The input is graphical user input: text boxes and other user input boxes, not so much CLI at this time. Anywhere the user can type (via a keyboard), drag-and-drop, or copy-and-paste text.
I would like my program to be usable in many languages, including English, French, German, Chinese, Japanese, and Korean, and to be able to work with the user's input in any of them.
Example: An English-language user types into a text box (CreateWindowExW / Edit) and some of their words are not acceptable, or not nice, or need changing for whatever reason we, the owners of the software, decide. I would like the software to search the user's input as it is entered and delete or replace those words.
Example: A user inputs in English, "You are mean" and we want to change that to "You are not nice".
Example: A user inputs in Malayalam, "നിങ്ങൾ മോശക്കാരനാണ്" (You are mean) and we want to change that to "നിങ്ങൾ നല്ലവനല്ല" (You are not nice).
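To make the examples above concrete, here is a minimal sketch of the find-and-replace step, assuming the text has already been read out of the edit control as UTF-16 (for instance with GetWindowTextW in response to an EN_CHANGE notification; that wiring is an assumption and is not shown). The names Replacement and apply_replacements are hypothetical, made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule type: one phrase to find and the text to put in its place.
struct Replacement {
    std::u16string from;
    std::u16string to;
};

// Replace every occurrence of each rule's `from` with its `to`.
// Comparison is by exact UTF-16 code units: no case folding and no
// Unicode normalization is done (both are assumptions of this sketch).
std::u16string apply_replacements(std::u16string text,
                                  const std::vector<Replacement>& rules) {
    for (const Replacement& rule : rules) {
        assert(!rule.from.empty());  // an empty search phrase would loop forever
        std::size_t pos = 0;
        while ((pos = text.find(rule.from, pos)) != std::u16string::npos) {
            text.replace(pos, rule.from.size(), rule.to);
            pos += rule.to.size();  // continue after the inserted text
        }
    }
    return text;
}
```

With a rule {u"mean", u"not nice"}, the input u"You are mean" comes back as u"You are not nice", and the same function handles the Malayalam example unchanged because it only compares code units. Two caveats: this matches substrings, so "meaning" would also be changed unless a word-boundary check is added, and languages written without spaces (Chinese, Japanese) need a different idea of what a "word" is. On Windows, wchar_t is a 16-bit UTF-16 code unit, so the same logic should carry over to std::wstring.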
Staying with just the parsing: I think our program can now work with the full range of Unicode (all 17 planes) as bytes (UTF-8, UTF-16, and UTF-32) as needed. But I would like comments from experienced people about which direction to take in studying parsing.
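As an illustration of what "all 17 planes" means at the code-unit level, here is a small sketch that decodes UTF-16 into code points (UTF-32). Characters outside plane 0 arrive as a surrogate pair of two 16-bit units, which is the main thing a parser working on UTF-16 has to watch for. The function name utf16_to_code_points is hypothetical, and replacing unpaired surrogates with U+FFFD is one possible policy, not the only one.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into Unicode code points (UTF-32).
// Unpaired surrogates become U+FFFD (a policy chosen for this sketch).
std::u32string utf16_to_code_points(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // A valid pair: combine the two 10-bit halves and add 0x10000.
            char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else if (is_high || is_low) {
            out.push_back(0xFFFD);  // surrogate without its partner
        } else {
            out.push_back(unit);  // plane 0 character, one unit
        }
    }
    return out;
}
```

For plain find-and-replace this decoding step is often unnecessary, since matching whole UTF-16 sequences against whole UTF-16 sequences already works. It becomes useful when the code needs to reason about individual characters, for example counting them or classifying them.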
I asked about ASCII first because I wanted at least some foundation in parsing to start with before going into the more complicated parsing of Unicode types of input.
I am now focused beyond ASCII. If I receive comments on parsing ASCII, thank you. If I receive comments on parsing beyond ASCII, or in the Unicode ranges, thank you. Your small or large comments are all being taken into consideration. Thank you.
Banfa: "unless you have a lot of text to parse":
I am guessing that I can limit the input to less than 500 words each time.
A typed page in English averages about 100 words, so 5 pages of communication between users might be a lot for our program to expect.
I expect to get mostly 1 to about 10 words.
"Hello", "Did you want to go fishing today or later this week?". Similar in Japanese: "こんにちは", "今日または今週後半に釣りに行きたいですか?"
You gave me some basics that I am now considering. Thank you. I am now looking for comments on parsing words or sentences, as described above, and on how to parse the user-input strings faster.
In my limited experience and understanding, I am considering converting all non-ASCII input to hex, binary, or decimal and parsing via that. I might be wrong. I am looking for comments and ideas. If there is a better or faster approach, I am interested.
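One point that may help with the hex/decimal idea: a Unicode string in memory is already a sequence of numbers, so searching and comparing can work on it directly, and hex is just one way of printing those numbers. The sketch below, with the hypothetical name describe_code_points, formats each code point as "U+XXXX", which is probably only useful for logging or debugging.

```cpp
#include <cstdio>
#include <string>

// Render each code point as "U+XXXX" for logging or debugging only.
// Matching does not need this: the string already holds these numbers.
std::string describe_code_points(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(cp));
        if (!out.empty()) out += ' ';  // separate entries with one space
        out += buf;
    }
    return out;
}
```

For example, describe_code_points(U"Hi") gives "U+0048 U+0069". Converting the text to a hex string first and then parsing the hex would add a step without making the comparison any faster, since the comparison is numeric either way.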
Banfa: "errors known as early optimisation". It reminds me of "know thyself and in 100 battles you shall lose none." I like that. Thank you.
Any comments on parsing in light of this further information?
Thank you.