Replacing a string inside of a PDF

P: n/a
I am having a lot more trouble with this than I thought I would. Here
is what I want to do in pseudocode.

Open c:\some.pdf
Replace "Replace this" with "Replaced!"
Save c:\some_edited.pdf

I can do this in Notepad and it works fine, but when I try reading the
file in code I think I hit some encoding problem. I tried saving the
file with every encoding option. When I open a PDF in the text editor
I normally use, it says the file is ANSI with Mac-style carriage
returns. WinMerge will not let me compare the files because it says
they are binary.

Anyone know what I have to do?

Jul 20 '06 #1
14 Replies


P: n/a
Please explain: are you trying to read the file as binary data and then
write that binary data out to another file?

Jul 20 '06 #2

P: n/a
Samuel,

I have tried it several ways. The end goal is just to end up with an
edited PDF. If I have to overwrite the original file that is fine.

Jul 21 '06 #3

P: n/a
I'm assuming that I should somehow be using a BinaryReader and a
BinaryWriter. I just don't know how to work with the data inside as
strings and then put it back into the writer.

Jul 21 '06 #4

P: n/a
I think the key to your question is how to actually read the file (I
should have realized before that this is the main issue).

Have you managed to read parts of the file? Only if you can do that can
you replace the text.

Jul 21 '06 #5

P: n/a
Maybe this link will be useful:
http://groups.google.com/group/micro...ace184fa716b5a

Jul 21 '06 #6

P: n/a
I have written code that at least reads the contents of the file as a
string or a stream, and I can find the chunk I want to replace easily
enough. But I think it loses some special characters, or maybe mangles
the line endings (I believe PDF files use Mac-style CR only, instead of
CR LF like a lot of Windows-based files).

So I guess my problem is actually the reading and writing. I can write
code that looks like it is reading the file with a StreamReader, but I
think I am really losing data. I can write code that reads it as binary,
but then I have trouble working with the contents. After all that is
worked out I have to figure out how to write the edited file back to
disk (I believe the BinaryWriter will do that, but I have not tested
much).

I'm not sure what else I can tell you. This is just a matter of me not
fully understanding how I am supposed to read and edit a file like this,
as opposed to the other formats I have worked with, which were all
plain text.

Thanks a lot for the feedback. I looked at the other post you linked
to and read the linked page. I think that would be useful to me if the
PDFs were compressed, but I can open these in Notepad and find my
string right now (and that works when I do the edit that way.)

Jul 21 '06 #7

P: n/a
You may be able to build a byte sequence identical to the string you
want to replace (send the string to a binary stream; it doesn't have to
be a file), then look for that byte sequence within the main binary
buffer that holds the PDF file and replace it with another byte sequence
created from the string you want to use as the replacement.
You still have the problem of the funny characters, which you can imitate
by writing CR instead of CRLF (or whatever the file normally uses).
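
The byte-sequence search described above can be sketched roughly as
follows. This is only a minimal sketch under stated assumptions: the
module and function names (PdfByteReplace, IndexOfBytes, ReplaceBytes)
are made up for illustration, the text is assumed to be stored as plain
ASCII in an uncompressed part of the file, and the old and new text are
assumed to have the same length so that no other bytes move.

```vbnet
Imports System
Imports System.Text

Module PdfByteReplace

    ' Returns the index of the first occurrence of pattern in data
    ' at or after start, or -1 if there is none.
    Function IndexOfBytes(ByVal data As Byte(), ByVal pattern As Byte(), ByVal start As Integer) As Integer
        If pattern.Length = 0 Then Return -1
        For i As Integer = start To data.Length - pattern.Length
            Dim matched As Boolean = True
            For j As Integer = 0 To pattern.Length - 1
                If data(i + j) <> pattern(j) Then
                    matched = False
                    Exit For
                End If
            Next
            If matched Then Return i
        Next
        Return -1
    End Function

    ' Overwrites every occurrence of oldText with newText in place and
    ' returns the number of replacements. Assumption: both texts are
    ' ASCII and encode to the same number of bytes.
    Function ReplaceBytes(ByVal data As Byte(), ByVal oldText As String, ByVal newText As String) As Integer
        Dim oldBytes As Byte() = Encoding.ASCII.GetBytes(oldText)
        Dim newBytes As Byte() = Encoding.ASCII.GetBytes(newText)
        If oldBytes.Length = 0 OrElse oldBytes.Length <> newBytes.Length Then
            Throw New ArgumentException("Texts must be non-empty and the same length.")
        End If

        Dim count As Integer = 0
        Dim pos As Integer = IndexOfBytes(data, oldBytes, 0)
        While pos >= 0
            Array.Copy(newBytes, 0, data, pos, newBytes.Length)
            count += 1
            pos = IndexOfBytes(data, oldBytes, pos + oldBytes.Length)
        End While
        Return count
    End Function

    Sub Main()
        ' Stand-in for file contents: "%", two high bytes, CR, "ABC123", CR.
        Dim data As Byte() = New Byte() {&H25, &HE2, &HE3, &HD, &H41, &H42, &H43, &H31, &H32, &H33, &HD}
        Dim replaced As Integer = ReplaceBytes(data, "ABC123", "ABC111")
        Console.WriteLine(replaced.ToString() & " replacement(s): " & BitConverter.ToString(data))
    End Sub

End Module
```

For a real file, the buffer would come from IO.File.ReadAllBytes(path)
and go back to disk with IO.File.WriteAllBytes(outPath, data). Because
the bytes are never decoded as text, the CR line endings and any high
bytes are left exactly as they were.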

And finally, once the code works, please send it over. It seems
interesting to me (if that is OK with you/your company).

Regards,
Samuel

Jul 21 '06 #8

P: n/a
I'm not sure I know how to do what you are saying, but here is a test I
made that writes the file using a string converted into a byte array.
This is not working.

::::::::::::::::::::::::::::::::::::::::::::::::::::
Public Sub ByteTest()
    Response.Write("Start Byte:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

    For Each PDFFile As String In IO.Directory.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim Reader As IO.StreamReader = IO.File.OpenText(PDFFile)

        'Load the file into a string
        Dim Contents As String = Reader.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        Reader.Close()

        'Create byte based output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "BYTE.pdf")
        Dim fs As IO.FileStream = IO.File.Create(OutputFileName)
        fs.Close()

        'Convert the string to bytes
        Dim info As Byte() = New System.Text.UTF8Encoding(True).GetBytes(Contents)

        'Write string as bytes to output file
        fs = IO.File.OpenWrite(OutputFileName)
        fs.Write(info, 0, info.Length)
        fs.Close()
    Next

    Response.Write("Stop Byte:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")
End Sub
::::::::::::::::::::::::::::::::::::::::::::::::::::
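
A likely reason the test above produces broken files: File.OpenText
decodes the file as UTF-8, and a PDF usually contains bytes above 127
that are not valid UTF-8 sequences. The decoder substitutes a replacement
character for each invalid sequence, so the original bytes cannot be
recovered when the string is encoded again. The sketch below illustrates
that round trip on a few sample bytes. The module and function names are
invented for the example, and the sample bytes only imitate the kind of
high-byte comment line commonly found near the start of a PDF.

```vbnet
Imports System
Imports System.Text

Module Utf8RoundTripDemo

    ' Decodes the bytes to a string and encodes the string back to bytes,
    ' which is what reading with a StreamReader and writing the result does.
    Function RoundTrip(ByVal enc As Encoding, ByVal input As Byte()) As Byte()
        Return enc.GetBytes(enc.GetString(input))
    End Function

    Sub Main()
        ' "%" followed by four bytes above 127 and a CR (sample data only).
        Dim original As Byte() = New Byte() {&H25, &HE2, &HE3, &HCF, &HD3, &HD}
        Dim viaUtf8 As Byte() = RoundTrip(New UTF8Encoding(False), original)

        Console.WriteLine("Original:    " & BitConverter.ToString(original))
        Console.WriteLine("After UTF-8: " & BitConverter.ToString(viaUtf8))
    End Sub

End Module
```

The second line printed should differ from the first, which is the data
loss suspected earlier in the thread.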

Jul 21 '06 #9

P: n/a
Here is another test I wrote that successfully generates a bunch of
useless files encoded in different ways.

::::::::::::::::::::::::::::::::::::::::::::::::::::
Public Sub StringTest()
    Response.Write("Start String:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

    For Each PDFFile As String In IO.Directory.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim Reader As IO.StreamReader = IO.File.OpenText(PDFFile)

        'Load the file into a string
        Dim Contents As String = Reader.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        Reader.Close()

        'Create ASCII output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-ASCII.pdf")
        Dim fs As IO.FileStream = IO.File.Create(OutputFileName)
        Dim PDFStream As IO.StreamWriter = New IO.StreamWriter(fs, System.Text.Encoding.ASCII)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create BigEndianUnicode output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-BigEndianUnicode.pdf")
        fs = IO.File.Create(OutputFileName)
        PDFStream = New IO.StreamWriter(fs, System.Text.Encoding.BigEndianUnicode)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create default formatted output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-Default.pdf")
        fs = IO.File.Create(OutputFileName)
        PDFStream = New IO.StreamWriter(fs, System.Text.Encoding.Default)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create Unicode output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-Unicode.pdf")
        fs = IO.File.Create(OutputFileName)
        PDFStream = New IO.StreamWriter(fs, System.Text.Encoding.Unicode)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create UTF7 output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-UTF7.pdf")
        fs = IO.File.Create(OutputFileName)
        PDFStream = New IO.StreamWriter(fs, System.Text.Encoding.UTF7)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create UTF8 output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-UTF8.pdf")
        fs = IO.File.Create(OutputFileName)
        PDFStream = New IO.StreamWriter(fs, System.Text.Encoding.UTF8)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()
    Next

    Response.Write("Stop String:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")
End Sub
::::::::::::::::::::::::::::::::::::::::::::::::::::

Jul 21 '06 #10

P: n/a
Finally, can you achieve what you actually need?

Samuel

Jul 23 '06 #11

P: n/a
WinMerge is right: a PDF file is actually binary data, not plain text
in a given encoding. You should load it as a stream of bytes.

On the other hand, since you want to perform text replacements in the
file, you may load it with an encoding that doesn't transform the bytes
in the file, such as the ANSI encoding (code page 1252):

Sub PDFReplaceText(ByVal Path As String, ByVal OldText As String, _
                   ByVal OutPath As String, ByVal NewText As String)

    Const ANSI As Integer = 1252

    Dim Encoding As Text.Encoding = Text.Encoding.GetEncoding(ANSI)
    Dim sr As New IO.StreamReader(Path, Encoding)
    Dim Data As String = sr.ReadToEnd
    sr.Close()

    Data = Data.Replace(OldText, NewText)

    Dim sw As New IO.StreamWriter(OutPath, False, Encoding)
    sw.Write(Data)
    sw.Close()

End Sub
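
Whether a given encoding really leaves every byte untouched can be
checked by pushing all 256 byte values through it and comparing the
result. The following is a small sketch of such a check, with invented
names (EncodingRoundTripCheck, IsLossless). It uses code page 28591
(ISO-8859-1), which maps each byte to the code point of the same value,
as the known-good case and UTF-8 as the known-bad one. Passing
Text.Encoding.GetEncoding(1252) to the same function would test the code
page used above. On newer .NET runtimes that code page may need to be
registered first, which is an assumption about the environment rather
than something stated in this thread.

```vbnet
Imports System
Imports System.Text

Module EncodingRoundTripCheck

    ' Returns True if decoding and re-encoding preserves all 256 byte values.
    Function IsLossless(ByVal enc As Encoding) As Boolean
        Dim allBytes(255) As Byte
        For i As Integer = 0 To 255
            allBytes(i) = CByte(i)
        Next

        Dim back As Byte() = enc.GetBytes(enc.GetString(allBytes))
        If back.Length <> allBytes.Length Then Return False
        For i As Integer = 0 To 255
            If back(i) <> allBytes(i) Then Return False
        Next
        Return True
    End Function

    Sub Main()
        Console.WriteLine("ISO-8859-1 lossless: " & IsLossless(Encoding.GetEncoding(28591)).ToString())
        Console.WriteLine("UTF-8 lossless:      " & IsLossless(Encoding.UTF8).ToString())
    End Sub

End Module
```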

HTH.

Regards,

Branco.

Jul 23 '06 #12

P: n/a
Branco,

This worked perfectly. My knowledge of the encoding options in general
is very weak, so thanks for spelling it out for me with some code.

Samuel,

Thank you to you too. You have both been a big help.

Thank you,
Josh Baltzell

Jul 24 '06 #13

P: n/a
I am glad to hear it.

Does Branco's code work as is?

Jul 24 '06 #14

P: n/a
I put the encoding options into my own code, so I am not positive.
This is the final sub I ended up with.

Public Sub ReplaceText(ByVal FilePath As String, ByVal OriginalText As String, ByVal NewText As String)
    Dim Encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(1252)

    'Open the file
    Dim Reader As New IO.StreamReader(FilePath, Encoding)

    'Load the file into a string
    Dim Contents As String = Reader.ReadToEnd

    'Replace text in string
    Contents = Contents.Replace(OriginalText, NewText)

    'Close stream
    Reader.Close()

    'Write the string back over the original file using the same encoding
    Dim OutputFileName As String = FilePath
    Dim sw As New IO.StreamWriter(OutputFileName, False, Encoding)
    sw.Write(Contents)
    sw.Close()

End Sub
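
One caveat with any in-place text replacement in a PDF: the file's
cross-reference table stores byte offsets of its objects, and stream
objects record their length, so a replacement whose length differs from
the original can leave those numbers stale. Some viewers may cope with
that, but it is safer to keep the length unchanged, as the
ABC1234567890 / ABC1111111111 test strings earlier in the thread do. A
hedged sketch of a guard that could be applied to NewText before calling
a sub like the one above is shown below. FitReplacement is a
hypothetical helper, and padding with trailing spaces is only an
assumption about what is acceptable in the rendered text.

```vbnet
Imports System

Module SameLengthGuard

    ' Returns newText padded with trailing spaces to the length of oldText.
    ' Throws if newText is longer, since that would shift later bytes.
    Function FitReplacement(ByVal oldText As String, ByVal newText As String) As String
        If newText.Length > oldText.Length Then
            Throw New ArgumentException("Replacement is longer than the original text.")
        End If
        Return newText.PadRight(oldText.Length)
    End Function

    Sub Main()
        Dim fitted As String = FitReplacement("Replace this", "Replaced!")
        Console.WriteLine("[" & fitted & "] length " & fitted.Length.ToString())
    End Sub

End Module
```

With a single-byte encoding such as code page 1252, equal string length
means equal byte length, which is what matters for the offsets.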

Jul 24 '06 #15
