Read text file to Temp file and apply formatting and color changes

Brian Cook

I want to open a text file, reformat it into a specific single-line layout,
apply color to specific locations in the text, and then display it in a
RichTextBox once all of that is done.

I can do all of the above after the file is loaded into the RichTextBox, but
I am trying to speed the process up by doing the work in a temp file first.

Aug 21 '07 #1


Nicholas Paldino [.NET/C# MVP]

Brian,

You might as well do it in the RichTextBox. Depending on how much
formatting you need, you would basically have to write the code that
generates the RTF format yourself. If you can find another component that
will write RTF files for you, you might be better off using that.

Or, if you don't need to edit your files, use HTML (you can just create
the markup on the fly).
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Aug 21 '07 #2

Peter Duniho

Brian Cook wrote:

> [...]
> I have tried it with the control invisible, and there is no difference in
> the time it takes to apply the color whether it is set that way or
> visible.

So what's the problem? Did you try using the Rtf property to initialize
the control with pre-formatted text? Just how much time are we talking
about here? My experience has been that on a fast computer, control
initializations/interactions are practically instant. What kind of
delay are you running into and why is it such a big deal?

It would really help if you could be a little more specific and
objective about the exact behavior you're seeing and what you'd like
instead.

Pete

Aug 21 '07 #3

Brian Cook

It takes 3Min30Sec for a 7MB file.

I am running a P4 3.4Mhz PC with 2GB of RAM.

The big deal is that the next step of this project is to be able to search,
filter and highlight all entries that match a specific string, or to filter
the file as it is loaded for one or more specific strings.

For example

filter on:
[144007]
[144008]
[144008]

Thanks, and I hope I explained myself better.

Brian

Aug 21 '07 #4

Peter Duniho

Brian Cook wrote:

> It takes 3Min30Sec for a 7MB file.
>
> I am running a P4 3.4Mhz PC with 2GB of RAM.

I assume that's "3.4Ghz", you mean. :)

Maybe you could post a concise-but-complete example of code that
reliably demonstrates your scenario. I am not seeing nearly so awful a
performance result.

I created a short application that simply puts 7MB worth of dummy text
into a RichTextBox. Actually, it's technically 14MB...I took the
conservative view that your "7MB file" is regular ASCII, and so I
actually generate 7 * 1024 * 1024 characters. Since .NET uses Unicode,
this is actually 14MB, not 7MB.

I generate the dummy text randomly one character at a time, one fake
sentence at a time, appending each new sentence to a StringBuilder that
has been preinitialized to the 7 million (or so :) ) characters. Then I
use the ToString() method and assign the result to the Text property
of the RichTextBox.

I ran this on my Core 2 Duo 2.33Ghz computer with 2GB of RAM.

The initialization of the data takes roughly 1 second, and roughly
another 3 seconds to copy it all to the RichTextBox.

So yes, there's a fair amount of overhead in the control, relatively
speaking (copying the text takes three times as long as generating it in
the first place). But three and a half minutes is WAY too slow. I
don't know what your code is doing, but it seems likely to me that you
have some basic algorithmic issues going on, and your problem will be
better solved addressing those rather than looking for ways to make the
control itself faster. My experiment suggests that the control itself
has acceptable performance, when used efficiently.

IMHO, the best sample code would include something that autogenerates
the data equivalent to your 7MB file, and which also includes whatever
initialization code you are using, along with a RichTextBox contained by
the form for the program. In other words, the program itself should be
as small as possible (no downloading 7MB files :) ), but should still
reproduce the same performance results you're seeing, by doing the exact
same initialization you're doing.
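
For anyone unsure how to auto-generate such data, here is a minimal sketch. The line layout (30 filler characters, then a "TX" or "RX" marker and a number) is only an assumption loosely based on the code posted later in this thread, and `DummyLog` and its parameters are hypothetical names, not part of anyone's project.

```csharp
using System;
using System.Text;

// Hypothetical helper: builds fake log text so a repro needs no real 7MB file.
static class DummyLog
{
    // Returns at least targetChars characters of CRLF-terminated lines.
    // The layout is an assumption, not the real file format.
    public static string Generate(int targetChars, int seed)
    {
        Random rng = new Random(seed);               // fixed seed gives repeatable runs
        StringBuilder sb = new StringBuilder(targetChars + 128);
        while (sb.Length < targetChars)
        {
            // 30 digits of filler, so the filler can never contain "TX" or "RX".
            for (int i = 0; i < 30; i++)
                sb.Append((char)('0' + rng.Next(10)));
            sb.Append(' ');
            sb.Append(rng.Next(2) == 0 ? "TX" : "RX");
            sb.Append(' ');
            sb.Append(rng.Next(100000, 999999));
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

Calling `DummyLog.Generate(7 * 1024 * 1024, 1)` would give roughly the 7 million characters used in the experiment above; write it to disk with `File.WriteAllText` if the code under test expects a file path.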

Pete

Aug 21 '07 #5

Brian Cook

Sure thing Peter, Here is the code. Do you need me to email you a sample file?

Yes I meant Ghz.

I call this routine from the OpenFileDialog;

This reads and formats the text into the single line format;
----- Begin Code -----
using System;
using System.Drawing;
using System.IO;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;

namespace TCEditor
{
    class Streamer
    {

        #region File Stream and Format Routine
        public static string LayoutInput(string input)
        {
            StreamReader sr = File.OpenText(input);
            StringBuilder sb = new StringBuilder(input.Length);
            bool firstLine = true;
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.Trim() == "")
                    continue;
                if (line.Length < 29) { throw new InvalidOperationException("invalid input"); }
                if (line[29] != ' ')
                {
                    int txPos;
                    int rxPos = -1;
                    int len = 0;
                    if (firstLine)
                        firstLine = false;
                    else
                        sb.Append("\r\n");
                    if (((txPos = line.IndexOf("TX")) > -1) || ((rxPos = line.IndexOf("RX")) > 0))
                    {
                        int charactersTillPoint;
                        if (txPos > -1)
                        {
                            charactersTillPoint = txPos;
                            len = line.Substring(txPos).Length;
                        }
                        else
                        {
                            charactersTillPoint = rxPos;
                            len = line.Substring(rxPos).Length;
                        }
                        string part0 = line.Substring(0, charactersTillPoint);
                        string part1 = line.Substring(charactersTillPoint);
                        sb.Append(part0.PadRight(86));
                        sb.Append(part1);
                    }
                    else
                        sb.Append(line);
                    sb.Append(' ');
                    if (len == 12)
                        sb.Append(' ');
                }
                else
                {
                    sb.Append(line.Substring(31));
                }
            }
            return sb.ToString();
        }
        #endregion
    }
}
-----End Code-----
This is the routine that adds the color to the specified locations;
I call this as the next routine in the OpenFileDialog
-----Begin Code-----
private void findSequenceNumbers()
{
    toolStripStatusLabel.Visible = toolStripProgressBar.Visible = true;
    toolStripProgressBar.Value = 0;
    int lineNum = 0;
    bool startingNewLine = true;
    FontStyle style = FontStyle.Bold;
    string[] lines = rtbDoc.Lines;
    string text = rtbDoc.Text;
    toolStripProgressBar.Maximum = text.Length;
    for (int i = 0; i < text.Length; i++)
    {
        if (startingNewLine)
        {
            if ((lines[lineNum].Contains("ARES_EINDICATION")) ||
                (lines[lineNum].Contains("ARES_INDICATION")))
            {
                i += 169;

                rtbDoc.Select(i, 2);
                rtbDoc.SelectionFont = new Font(rtbDoc.SelectionFont,
                    rtbDoc.SelectionFont.Style ^ style);
                rtbDoc.SelectionColor = Color.DarkBlue;
            }
            else if (lines[lineNum].Contains("]CODELINE_INDICATION_MSG"))
            {
                i += 160;
                rtbDoc.Select(i, 2);
                rtbDoc.SelectionFont = new Font(rtbDoc.SelectionFont,
                    rtbDoc.SelectionFont.Style ^ style);
                rtbDoc.SelectionColor = Color.DarkBlue;
            }
            else
            {
                i += lines[lineNum].Length - 1;
            }
            startingNewLine = false;
            Application.DoEvents();
        }
        if (text[i] == '\n')
        {
            startingNewLine = true;
            lineNum++;
        }
        toolStripProgressBar.Value = i;
    }
    toolStripStatusLabel.Visible = toolStripProgressBar.Visible = false;
    rtbDoc.Select(0, 0);
    rtbDoc.ScrollToCaret();
}
-----End Code-----

Aug 21 '07 #6

Peter Duniho

Brian Cook wrote:

> Sure thing Peter, Here is the code. Do you need me to email you a sample file?

No. As I said in my post, your sample code should auto-generate some
relevant data. That way, no one has to have an exact copy of the sample
file you are using.

Also, I have asked twice already whether you have tried pre-formatting
the text and then assigning the RTF string to the RichTextBox.Rtf
property. You haven't answered that question yet. IMHO, that technique
is likely to give you the best performance. If nothing else, it will
make clear where the performance bottleneck is (assuming you bother to
time the individual sections of your initialization, of course).
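
One way to time the individual sections is sketched below with `System.Diagnostics.Stopwatch`. `SectionTimer` is a hypothetical helper name, and which stages to wrap (reading and formatting the file, assigning the text, applying the color) is up to the caller.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: measures one stage of the load at a time.
static class SectionTimer
{
    // Runs the stage and returns the elapsed wall-clock time in milliseconds.
    public static long Time(Action stage)
    {
        Stopwatch sw = Stopwatch.StartNew();
        stage();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

For example, `long formatMs = SectionTimer.Time(delegate { text = Streamer.LayoutInput(path); });` followed by a second call around the coloring routine would show which of the two accounts for most of the three and a half minutes.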

Please at least answer that question, even if you have no intention of
trying the suggestion.

Pete

Aug 21 '07 #7

Brian Cook

"Peter Duniho" wrote:

> Brian Cook wrote:
>> Sure thing Peter, Here is the code. Do you need me to email you a sample file?
>
> No. As I said in my post, your sample code should auto-generate some
> relevant data. That way, no one has to have an exact copy of the sample
> file you are using.

Sorry, I thought I had answered that. I have not tried to auto-generate
relevant data, as I am unsure how to do that, being new to C#.

> Also, I have asked twice already whether you have tried pre-formatting
> the text and then assigning the RTF string to the RichTextBox.Rtf
> property. You haven't answered that question yet. IMHO, that technique
> is likely to give you the best performance. If nothing else, it will
> make clear where the performance bottleneck is (assuming you bother to
> time the individual sections of your initialization, of course).

As above, no, I have not tried doing that, as I was not aware of that
technique and do not know how to do it.

I am doing what I can to learn this as I go along, and have made good
progress on some of what I consider the harder things. This is still very
new to me.

> Please at least answer that question, even if you have no intention of
> trying the suggestion.

Pete, I will gladly try your suggestion. I just don't know how to do it.
That is why I posted here, so that I can get suggestions (e.g. examples) of
how I can do it, or assistance in figuring it out.

We all have to start someplace, and I am at the beginning, without any
classes, years of C++ experience or other practical knowledge.

Brian

> Pete

Aug 21 '07 #8

Peter Duniho

Okay...had a chance to look through your code. I don't know if I can
cut three minutes or more from the cost while sticking with the same
basic algorithm, but there is some low-hanging fruit. You haven't
provided any information regarding which parts of your initialization are
slow, so it may or may not be that some or all of these changes are
significant. But they are still potentially issues that can be fixed.

See what happens if you clean some of these things up (and by the way,
please when posting code make sure that your indentation is
preserved...it's a lot harder to read the code without it):

Brian Cook wrote:

> This reads and formats the text into the single line format;
> ----- Begin Code -----
> using System;
> using System.Drawing;
> using System.IO;
> using System.Collections.Generic;
> using System.Text;
> using System.Windows.Forms;
>
> namespace TCEditor
> {
>     class Streamer
>     {
>
>         #region File Stream and Format Routine
>         public static string LayoutInput(string input)
>         {
>             StreamReader sr = File.OpenText(input);
>             StringBuilder sb = new StringBuilder(input.Length);
>             bool firstLine = true;
>             string line;
>             while ((line = sr.ReadLine()) != null)
>             {
>                 if (line.Trim() == "")
>                     continue;
>                 if (line.Length < 29) { throw new InvalidOperationException("invalid input"); }
>                 if (line[29] != ' ')
>                 {
>                     int txPos;
>                     int rxPos = -1;
>                     int len = 0;
>                     if (firstLine)
>                         firstLine = false;
>                     else
>                         sb.Append("\r\n");
>                     if (((txPos = line.IndexOf("TX")) > -1) || ((rxPos = line.IndexOf("RX")) > 0))

Minor issue above: IndexOf() has to scan through the string each time
you call it. Worst-case is, of course, when the text you're looking for
doesn't exist.

You might do better using a regular expression (see Regex class) to
search for both "TX" and "RX" at the same time. Then you can check the
actual match (if any) to determine which matched. Though, that said, I
don't see anything that actually depends on which one matched; you seem
to treat both the same. So even more reason to just use Regex.
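
A minimal sketch of the single-pass search follows; `MarkerFinder` is a hypothetical name. Note one behavioral difference from the quoted code: the regex returns whichever marker comes first in the line, whereas the quoted condition prefers "TX" wherever it appears and only accepts "RX" at an index greater than zero. Whether one regex pass actually beats two `IndexOf()` calls is an assumption that should be measured on real data.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: finds the first "TX" or "RX" marker in one scan.
static class MarkerFinder
{
    // Built once and reused for every line.
    private static readonly Regex Marker = new Regex("TX|RX", RegexOptions.Compiled);

    // Returns the index of the leftmost "TX" or "RX", or -1 if neither occurs.
    public static int Find(string line)
    {
        Match m = Marker.Match(line);
        return m.Success ? m.Index : -1;
    }
}
```

If the code ever needs to know which marker matched, `m.Value` holds the matched text.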

>                     {
>                         int charactersTillPoint;
>                         if (txPos > -1)
>                         {
>                             charactersTillPoint = txPos;
>                             len = line.Substring(txPos).Length;

Major issue above: this is the worst way to calculate "len". Getting
the length of the original string is fast. Doing a subtraction is fast.
Creating a whole new string just so you can see how long it is? Not fast.

Change this line to:

len = line.Length - txPos;

>                         }
>                         else
>                         {
>                             charactersTillPoint = rxPos;
>                             len = line.Substring(rxPos).Length;

Likewise.

>                         }
>                         string part0 = line.Substring(0, charactersTillPoint);
>                         string part1 = line.Substring(charactersTillPoint);
>                         sb.Append(part0.PadRight(86));
>                         sb.Append(part1);

Medium issue: creating new strings just to append to the StringBuilder
incurs the overhead of the instance creation. But the StringBuilder has
overloads for the Append() method to avoid that.

I did a quick-and-dirty test, and it appears to me that creating the
string almost doubles the total time it takes to append text to a
StringBuilder.

At the very least, I would change the "part1" appending so that it looks
more like this:

sb.Append(line, charactersTillPoint, len);

I suspect you would also gain a win by not using PadRight, and instead
doing that work yourself:

sb.Append(line, 0, charactersTillPoint);
sb.Append(new string(' ', 86 - charactersTillPoint));

That way, you avoid the creation of both substrings (but not the new
padding string, of course). This cuts your string instantiation in this
area of the code from three strings down to just one.
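
The two suggestions can be folded into one helper, sketched here under the assumption that the padding width stays at 86; `PaddedAppend` is a hypothetical name. It uses the `Append(char, int)` overload so that not even the padding string is created, and it clamps the pad count at zero, because `new string(' ', 86 - charactersTillPoint)` throws when the marker sits beyond column 86 while `PadRight(86)` simply adds nothing.

```csharp
using System;
using System.Text;

// Hypothetical helper: split-and-pad without creating any intermediate strings.
static class PaddedAppend
{
    // Appends line[0..splitAt), pads with spaces up to column `width`,
    // then appends the rest of the line.
    public static void AppendSplit(StringBuilder sb, string line, int splitAt, int width)
    {
        sb.Append(line, 0, splitAt);
        // PadRight is a no-op when the text is already wider than `width`,
        // so clamp at zero to keep that behavior instead of throwing.
        sb.Append(' ', Math.Max(0, width - splitAt));
        sb.Append(line, splitAt, line.Length - splitAt);
    }
}
```

In the quoted method this would replace the four `part0`/`part1` lines with `PaddedAppend.AppendSplit(sb, line, charactersTillPoint, 86);`.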

>                     }
>                     else
>                         sb.Append(line);
>                     sb.Append(' ');
>                     if (len == 12)
>                         sb.Append(' ');
>                 }
>                 else
>                 {
>                     sb.Append(line.Substring(31));

Likewise here, use the Append() overload that extracts the substring for
you:

sb.Append(line, 31, line.Length - 31);

One last note on the substring thing: I also tested the overload that
takes a char[] instead of a string, along with the substring index and
length. It's actually even a little faster than passing a string, but
not by a lot. The big win here will be to just stop instantiating new
strings to append.

>                 }
>             }
>             return sb.ToString();
>         }
>         #endregion
>     }
> }
> -----End Code-----
> This is the routine that adds the color to the specified locations;
> I call this as the next routine in the OpenFileDialog

I'm not going to provide specific comments for this method. Probably
the most costly part of it is all of the selecting and formatting that's
going on, and the best way to fix that would be to simply move the
formatting logic into the same method where you are reading the file,
and insert the necessary RTF format codes, rather than interacting with
the control directly.

That said, from a style perspective, I'd suggest that one significant
thing wrong with this method is that you wind up with two copies of the
text from the control. IMHO, since you want to process the text on a
line-by-line basis, you should just get the string[] from the Lines
property and operate on that. The RichTextBox control has methods such
as GetFirstCharIndexFromLine to allow you to determine actual character
indices for the purpose of formatting, and I would be surprised if using
that method significantly reduces the overall performance of this
method. As a result, your memory footprint will be halved, and the code
will be much closer to your intended algorithm.

A couple of other changes I'd make are to not call DoEvents(), and to
not update your progress control so often.

If you can't get the performance of this stuff down to something
acceptable for being in-line with the UI code, the correct solution is
to move the processing to a background thread. The BackgroundWorker
class is designed especially for this sort of thing and would work
nicely for you.
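
A minimal sketch of the BackgroundWorker pattern follows. `BackgroundFormat` and the shape of its two callbacks are hypothetical; only the `BackgroundWorker` events themselves come from the framework.

```csharp
using System;
using System.ComponentModel;

// Hypothetical helper: runs slow formatting work off the UI thread.
static class BackgroundFormat
{
    public static BackgroundWorker Start(Func<string> work, Action<string, Exception> done)
    {
        BackgroundWorker worker = new BackgroundWorker();

        // DoWork runs on a thread-pool thread, so it must not touch any control.
        worker.DoWork += delegate(object sender, DoWorkEventArgs e)
        {
            e.Result = work();
        };

        // When started from the UI thread of a WinForms app, this handler is
        // raised back on that thread, where updating controls is safe.
        worker.RunWorkerCompleted += delegate(object sender, RunWorkerCompletedEventArgs e)
        {
            // e.Result must not be read if the work threw; pass the error along instead.
            if (e.Error != null)
                done(null, e.Error);
            else
                done((string)e.Result, null);
        };

        worker.RunWorkerAsync();
        return worker;
    }
}
```

A form could call `BackgroundFormat.Start(() => Streamer.LayoutInput(path), (text, error) => { if (error == null) rtbDoc.Text = text; });` so that the file is read and formatted while the window stays responsive.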

As far as the updating of the progress control goes, the main issue
there is that it potentially generates UI updates. I haven't checked
its exact implementation, but at the very least you are calling the
control many more times than one would actually be able to perceive.
IMHO, it'd be better to set the max for the control to 100 and update it
any time you progress 1%. A possible middle-ground would be to base the
maximum on the number of lines, and only update it when you hit the code
that checks for the start of a new line. Of course, if you fix the code
to be line-based in the first place, this becomes even easier.
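
The 1% idea might look like the sketch below. `PercentReporter` is a hypothetical name, and the callback is where the real code would set the progress bar's value after setting its maximum to 100.

```csharp
using System;

// Hypothetical helper: forwards progress only when the whole-number percentage changes.
class PercentReporter
{
    private readonly long total;
    private readonly Action<int> report;
    private int lastPercent = -1;

    public PercentReporter(long total, Action<int> report)
    {
        this.total = total <= 0 ? 1 : total;   // avoid dividing by zero on empty input
        this.report = report;
    }

    // Cheap to call on every iteration; the callback fires at most 101 times.
    public void Update(long done)
    {
        int percent = (int)(done * 100 / total);
        if (percent > 100) percent = 100;
        if (percent != lastPercent)
        {
            lastPercent = percent;
            report(percent);
        }
    }
}
```

With `new PercentReporter(lines.Length, p => toolStripProgressBar.Value = p)` and one `Update(lineNum)` per line, a file of any size touches the control about a hundred times instead of once per character.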

All of this is moot if you change the design to generate RTF text
instead. That's actually the solution that I think would provide the
best results.
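
For readers who have not generated RTF before, here is a minimal sketch under stated assumptions: the input is plain ASCII, a single offset (`markAt`) stands in for the per-message offsets 169 and 160 in the posted coloring routine, and `RtfSketch` is a hypothetical name. It writes bold unconditionally instead of toggling it as the posted code does.

```csharp
using System;
using System.Text;

// Hypothetical helper: builds RTF text to assign to RichTextBox.Rtf in one step.
static class RtfSketch
{
    // One font, plus a color table whose entry 1 is dark blue (0, 0, 139).
    private const string Header =
        @"{\rtf1\ansi\deff0{\fonttbl{\f0 Courier New;}}{\colortbl;\red0\green0\blue139;}\f0 ";

    // Appends text[start, start + count) with RTF's three special characters escaped.
    private static void AppendEscaped(StringBuilder sb, string text, int start, int count)
    {
        for (int i = start; i < start + count; i++)
        {
            char c = text[i];
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\');
            sb.Append(c);
        }
    }

    // Turns plain lines into one RTF document. On lines long enough, the two
    // characters at markAt are written bold and dark blue.
    public static string Build(string[] lines, int markAt)
    {
        StringBuilder sb = new StringBuilder(Header);
        foreach (string line in lines)
        {
            if (line.Length >= markAt + 2)
            {
                AppendEscaped(sb, line, 0, markAt);
                sb.Append(@"\cf1\b ");
                AppendEscaped(sb, line, markAt, 2);
                sb.Append(@"\b0\cf0 ");
                AppendEscaped(sb, line, markAt + 2, line.Length - markAt - 2);
            }
            else
            {
                AppendEscaped(sb, line, 0, line.Length);
            }
            sb.Append(@"\par").Append("\r\n");   // \par ends the paragraph; CR/LF is ignored by RTF readers
        }
        sb.Append('}');
        return sb.ToString();
    }
}
```

The result would be assigned once, as in `rtbDoc.Rtf = RtfSketch.Build(lines, 169);`, so the control parses the formatting in a single pass rather than handling thousands of `Select()` calls.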

Pete

Aug 22 '07 #9

Brian Cook

Peter, thank you for the suggestions. Those helped me understand how to
accomplish what I need. Samples/snippets are easier for me to understand,
learn from and implement.

For the File Open method, I moved to a StringCollection, which has
significantly reduced the time it takes to open and format the file.

I am still working on the color portion. I have placed it into the file open
method, yet I am getting an index out of bounds error that I am tracking down.

Thanks again.

Brian

Aug 25 '07 #10
