extracting the text out of a binary file

RSH
Hi,

I have quite a few .DAT data files that I need to extract the data from.
When I open the files in a text editor I can see all of the text that I need,
BUT there are a lot of junk (binary?) characters and white space in
illogical positions.

Here is a small sample of what the data looks like:

0~ 050110101051250 5011132451235 >   ô ô
  X 3> Bô
 

MICHAEL B SMITH, DVM PC 123 MAIN ST., STE 600 DALLAS, TX 75252
MICHAEL SMITH Õ
s ' 'MICHAEL B NORTON, DVM PC 123 MAIN ST., STE 600 DALLAS, TX
75252
ÌZkM Ô  Bô
< Ô ' Ò '
 ' 
'  ' '  ' â '
Bô
 -D Õ  â '
Bô

Is it possible to filter out all the junk characters and just get at the
text? If so, could you provide a sample of how?

I tried doing a string replace, but the number of characters in the array
was getting way out of control. I was wondering if the BinaryReader class
might be what I'm looking for, but I have no idea what it really does.

Thanks!
RSH
Nov 21 '05 #1
You can filter the string so that it keeps only the character range that covers numbers and letters.
ASCII table: http://www.lookuptables.com/
But why do you read binary files using text file methods?
--
Saber S.
Nov 21 '05 #2
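
One way to act on the question above is to skip the text-file methods, read the raw bytes, and keep only the byte values that map to printable ASCII. This is a minimal sketch, assuming the readable text in the .DAT files is plain single-byte ASCII. The path sample.dat and the names ByteFilterSketch and ExtractPrintable are placeholders, not anything taken from the original files.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ByteFilterSketch
    ' Keep only printable ASCII bytes (32 = space through 126 = tilde).
    Function ExtractPrintable(ByVal data As Byte()) As String
        Dim sb As New StringBuilder(data.Length)
        For Each b As Byte In data
            If b >= 32 AndAlso b <= 126 Then
                sb.Append(Convert.ToChar(b))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' "sample.dat" is a hypothetical path used only for illustration.
        Dim raw As Byte() = File.ReadAllBytes("sample.dat")
        Console.WriteLine(ExtractPrintable(raw))
    End Sub
End Module
```

BinaryReader.ReadBytes returns the same kind of byte array from an open stream. File.ReadAllBytes is simply shorter when the whole file fits in memory.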
RSH
Primarily I just need to get at the data, which is all there, but
unfortunately there are thousands of useless characters mixed in with the
text. I tried Regex to strip the useless characters, but everything
disappears when I try that, which leads me to believe it is an encoding
mismatch. If you look at my example of the text file you will see what is
happening. The files are from an old Micro Focus ISAM database for which no
simple database viewer exists. Many products are out there, but they cost
several thousand dollars, which is way out of range for us.
Nov 21 '05 #3
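
If the Regex attempt really is failing because of how the bytes are decoded, one thing to try is decoding with ISO-8859-1 before running the pattern, since that encoding maps each byte to exactly one character. This is a sketch under that assumption, not a diagnosis of the actual files. The path sample.dat and the names RegexFilterSketch and StripNonPrintable are placeholders.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module RegexFilterSketch
    Function StripNonPrintable(ByVal data As Byte()) As String
        ' ISO-8859-1 maps every byte 0..255 to the character with the same
        ' code, so no byte is dropped or merged while decoding.
        Dim decoded As String = Encoding.GetEncoding("iso-8859-1").GetString(data)
        ' Delete everything outside printable ASCII (space through tilde).
        Return Regex.Replace(decoded, "[^\x20-\x7E]", "")
    End Function

    Sub Main()
        ' "sample.dat" is a hypothetical path used only for illustration.
        Dim raw As Byte() = File.ReadAllBytes("sample.dat")
        Console.WriteLine(StripNonPrintable(raw))
    End Sub
End Module
```

VB string literals do not treat the backslash as an escape, so the pattern reaches the Regex engine exactly as written.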
I mean using a *range*. For example, imagine you've stored the whole data in a
string named oldStr and you want to exclude every ASCII character except
numbers and letters:

Dim newStr As String = ""
Dim i As Integer
For i = 0 To oldStr.Length - 1
    ' Keeps codes 48 ("0") through 172: digits, letters, and the punctuation
    ' and extended characters in between. Narrow the bounds as needed.
    If oldStr.Chars(i) >= Chr(48) AndAlso oldStr.Chars(i) <= Chr(172) Then
        newStr += oldStr.Chars(i)
    End If
Next

You can play with the If statement to get your desired result.
--
Saber S.
Nov 21 '05 #4
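
For large files, appending to a String inside the loop copies the whole string on every pass, so a StringBuilder is usually the safer choice. The sketch below also narrows the test to an explicit allow-list of ASCII digits and letters. The names AllowListSketch, IsWanted, and KeepLettersAndDigits are illustrative, not from the thread.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module AllowListSketch
    ' True only for ASCII digits and ASCII letters.
    Function IsWanted(ByVal c As Char) As Boolean
        Return (c >= "0"c AndAlso c <= "9"c) OrElse _
               (c >= "A"c AndAlso c <= "Z"c) OrElse _
               (c >= "a"c AndAlso c <= "z"c)
    End Function

    Function KeepLettersAndDigits(ByVal oldStr As String) As String
        Dim sb As New StringBuilder(oldStr.Length)
        For Each c As Char In oldStr
            If IsWanted(c) Then sb.Append(c)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' A NUL (code 0) and an accented character (code 244) stand in for junk.
        Dim sample As String = "AB" & ChrW(0) & "12" & ChrW(244)
        Console.WriteLine(KeepLettersAndDigits(sample)) ' prints AB12
    End Sub
End Module
```

This strict allow-list also drops spaces, commas, and periods, so addresses would run together. Extend IsWanted with any punctuation that should survive.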
RSH
Oh, that's good... that worked, BUT is there any way to keep single instances
of spaces? In other words, if there are two or more spaces next to each
other, the extras can be eliminated. I would also like to keep the linefeeds.

Thanks a lot!
Nov 21 '05 #5
I found a tricky way around your "spacing" problem,
but maybe there are better ways:

For i = 0 To oldStr.Length - 1
    If oldStr.Chars(i) >= Chr(32) AndAlso oldStr.Chars(i) <= Chr(172) Then
        ' Skip a space that is immediately followed by another space.
        ' The length check keeps i + 1 from running past the end of the string.
        If Not (oldStr.Substring(i).StartsWith(Chr(32)) AndAlso _
                i + 1 < oldStr.Length AndAlso _
                oldStr.Chars(i + 1) = Chr(32)) Then newStr += oldStr.Chars(i)
    End If
Next

About LFs, I'm not sure what to do. Please send me a sample
of your files (if they are less than 1 MB!) and the code you use to read
those files.
--
Saber S.
Nov 21 '05 #6
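A better loop is:

For i = 0 To oldStr.Length - 1
    If oldStr.Chars(i) >= Chr(32) AndAlso oldStr.Chars(i) <= Chr(172) Then
        ' Skip a space that is followed by another space. The bounds check
        ' lets the loop examine the last character too.
        If Not (i < oldStr.Length - 1 AndAlso oldStr.Substring(i, 2) = "  ") Then newStr += oldStr.Chars(i) '*
    End If
Next

* There are 2 space characters inside the double quotation marks ( ...(i, 2) = "  " )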

--
Saber S.
Nov 21 '05 #7
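
The two open requests in the thread, keeping linefeeds and collapsing repeated spaces, can also be handled with two Regex passes. This is a rough sketch, assuming the string was decoded so that each junk byte became one character. The names WhitespaceSketch and CleanText are placeholders.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WhitespaceSketch
    Function CleanText(ByVal source As String) As String
        ' Pass 1: drop everything except printable ASCII, CR, and LF.
        Dim kept As String = Regex.Replace(source, "[^\x20-\x7E\r\n]", "")
        ' Pass 2: collapse each run of two or more spaces into one space.
        Return Regex.Replace(kept, " {2,}", " ")
    End Function

    Sub Main()
        ' ChrW(0) stands in for a junk character between the fields.
        Dim sample As String = "DALLAS,   TX" & ChrW(0) & "  75252" & vbLf & "MICHAEL    SMITH"
        Console.WriteLine(CleanText(sample))
        ' Prints two lines:
        ' DALLAS, TX 75252
        ' MICHAEL SMITH
    End Sub
End Module
```

Filtering before collapsing matters: spaces that were separated only by junk characters become adjacent after the first pass, so the second pass can merge them.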