Extract data from a file and write it to another file

2 New Member
I want to open a word file,
check it against my list of words or phrases to extract
(such as Monday_Tuesday, Happy_birthday, etc.),
write the matching words or phrases to another file,
and also state which words or phrases in my list were not found.
Jul 27 '12 #1
charley Situ
2 New Member
I know nothing about Python. My knowledge at this point is only that I downloaded it and wrote hello world.
Jul 27 '12 #2
Jory R Ferrell
62 New Member
First, you should realize that your question is unlikely to receive many responses besides mine. It asks for a lot of higher-level (higher than noob level :P) concepts to be addressed, but you skipped a very important step: you should have made an attempt to write your own code. That shows what level of knowledge you have about lists, dictionaries, indexing, function calls, etc. It also shows that you made an effort and are not trying to have someone else whip up a highly efficient piece of code so that you can avoid putting in the effort yourself.

You asked for many things in your code but provided none of your own progress to build on. I do understand why, though; I have inadvertently done the same in the past. Just try to keep this in mind for the future.
Jul 28 '12 #3
Jory R Ferrell
62 New Member
#------------------------------------------------------------------------------#
#                            Preparation of Data                               #
#------------------------------------------------------------------------------#
# "user_input" holds all words and/or phrases that you would like to search for.

user_input = raw_input('Please input words and phrases to search for, separating each standalone term with a comma: ')

# split(',') separates the search terms at the commas you are asked to use at input.
# strip() removes the spaces left around each term, and empty terms are dropped.
user_input = [term.strip() for term in user_input.split(',') if term.strip()]

print user_input

# Single words are fairly simple to search for; phrases take a little more work.
# "phrases" holds every multi-word term and "words" holds the single words.
# Two new lists are built because removing items from user_input while looping over it skips elements.
phrases = []
words = []
for var in user_input:
    if len(var.split()) > 1: # Leaving the params empty in the .split() call separates everything by whitespace.
        phrases.append(var)
    else:
        words.append(var)


#------------------------------------------------------------------------------#
#                            Search Data For Matches                           #
#------------------------------------------------------------------------------#

path_to_file = 'C:/Users/JRFerrell/Desktop/sample_parse.txt' # Replace my example path with the path to your own file.

search_text = open(path_to_file, 'r') # You can find several good tutorials on youtube for dealing with file IO.

matches = []

# We need to combine two lines at a time in order to search for phrases, because a phrase could be split over two lines.
# "prev_line" stores the previous line and is combined with the current line to form a new line for searching.
prev_line = ''

for line in search_text: # Each line is counted as a separate, complete object in itself.
    new_line = prev_line + line # Create a new line from the previous and current line.
    prev_line = line # Keep the current line around for the next pass.

    # split() separates on whitespace but leaves punctuation attached: "Hello there!" gives "there!", not "there".
    # So each word is stripped of surrounding punctuation before it is compared with the search terms.
    new_line = [word.strip('!?.,') for word in new_line.split()]

    # SEARCH FOR SINGLE WORDS #
    for var in words:
        if var in new_line and var not in matches: # "not in matches" avoids recording the same term twice.
            matches.append(var) # You could also write the match to another file here, for example: file.write(var + ' ')

    # SEARCH FOR PHRASES #
    for var in phrases:
        var_split = var.split()
        length = len(var_split) # Number of words in the phrase.

        # Slide a window of "length" words across the line. For a 3-word phrase this compares
        #   [new_line[index], new_line[index+1], new_line[index+2]]
        # with the words of the phrase, for every possible starting index.
        for index in range(len(new_line) - length + 1):
            if new_line[index:index + length] == var_split and var not in matches:
                matches.append(var)

search_text.close()

for match in matches:
    print 'Match:', match
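The phrase search described in the comments above (take a word, join it with the words that follow, compare with the phrase) can be pulled out into a small function. This is only a sketch, written for Python 3 rather than the Python 2 used in the rest of this thread; `tokenize` and `contains_phrase` are names made up for the illustration, and stripping every character in `string.punctuation` is an assumption about what counts as punctuation.

```python
import string

def tokenize(text):
    """Split text on whitespace and strip punctuation from the ends of each word."""
    return [word.strip(string.punctuation) for word in text.split()]

def contains_phrase(tokens, phrase):
    """Return True if the words of `phrase` appear consecutively in `tokens`."""
    phrase_tokens = phrase.split()
    length = len(phrase_tokens)
    # Slide a window of `length` words across the token list.
    for start in range(len(tokens) - length + 1):
        if tokens[start:start + length] == phrase_tokens:
            return True
    return False

tokens = tokenize("Well, happy birthday to you!\nSee you on Monday.")
print(contains_phrase(tokens, "happy birthday"))   # True
print(contains_phrase(tokens, "birthday Monday"))  # False
```

Comparing list slices avoids the index bookkeeping entirely, and because the text is tokenized as a whole, a phrase broken across a line ending is still found.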
Jul 29 '12 #4
Jory R Ferrell
62 New Member
I am going to try to read up on regular expressions and see if I can write something a little more streamlined.
Jul 29 '12 #5
Jory R Ferrell
62 New Member
So...I stopped being lazy last night (....sorta .... :P), and I figured out some issues that were holding me back.
The code below works as far as I have tested it (on a few lines of a short test txt doc). It may not be very efficient for extremely large searches, but, as I said about the last code, it'll do for short, quick work. Anyway, it turns out that Python ships with a standard library module called "re", which stands for Regular Expressions. This module is purpose-built for searching strings for matches of a user-defined pattern.
This is more efficient than a custom, self-built franken-parser (unless you know what is efficient, memory-wise...I do not. :P), because it has been optimized by serious programmers. :) All you have to do with the module is set up the text to be searched, create a way to iterate through the text, and condition the user input so that it can be used as a search pattern.

When you go to use re.search, keep in mind that it involves strings in a way you won't commonly run into as a beginner (like me). Regular expression patterns are usually written as "raw string literals", for example: 'Hello' becomes r'Hello', with an 'r' in front. re.search does not require this, but it keeps the backslashes intact. To match the exact word or phrase as is, rather than as part of a longer word, you wrap it in the '\b' word-boundary marker (so r'\bHello\b'), which is part of the regular expression syntax. The catch is that my search terms arrive in variables, and the 'r' prefix only applies to literals typed into the source code; there is no way to tack it onto an existing string. Here is why that matters: Python uses the backslash ('\') as an escape character, so in a normal string literal '\b' is turned into a single backspace character and the regex never sees a word boundary. The fix is to escape the backslash itself: '\b' becomes '\\b'. This is exactly the typing a raw string saves you, so r'\bHello\b' and '\\bHello\\b' are the same string, and you can concatenate either form around a variable: '\\b' + var + '\\b' (r'\b' + var + r'\b' works just as well). Not doing this will lead to confusion. You have been warned. ;P
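To make the backslash business concrete, here is a short sketch, written for Python 3 rather than the Python 2 used elsewhere in this thread. `find_whole` is a name made up for the illustration, and the `re.escape` call is an extra precaution of mine so that characters such as '.' or '?' in a search term are matched literally.

```python
import re

# A raw literal and a doubled backslash produce the same two-character string.
assert r'\b' == '\\b'
# Without either, '\b' is a single backspace character, not a word boundary.
assert '\b' == '\x08'

def find_whole(term, text):
    """Return the matched text if `term` occurs as a whole word or phrase, else None."""
    pattern = r'\b' + re.escape(term) + r'\b'
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before calling group().
    return match.group(0) if match else None

print(find_whole('Hello', 'Hello there!'))                  # Hello
print(find_whole('Hell', 'Hello there!'))                   # None
print(find_whole('Happy birthday', 'Happy birthday, Sam'))  # Happy birthday
```

The second call returns None because 'Hell' is followed by 'o', so there is no word boundary after it.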
import re

#------------------------------------------------------------------------------#
#                            Preparation of Data                               #
#------------------------------------------------------------------------------#
# "user_input" holds all words and/or phrases that you would like to search for.

user_input = raw_input('Please input words and phrases to search for, separating each standalone term with a comma: ')

# split(',') separates the search terms at the commas you are asked to use at input.
# strip() removes the spaces left around each term, and empty terms are dropped.
user_input = [term.strip() for term in user_input.split(',') if term.strip()]

print user_input


#------------------------------------------------------------------------------#
#                            Search Data For Matches                           #
#------------------------------------------------------------------------------#

path_to_file = 'C:/Users/JRFerrell/Desktop/sample_parse.txt' # Replace my example path with the path to your own file.

search_text = open(path_to_file, 'r')
# You can find several good tutorials on youtube for dealing with file IO.

matches = []

prev_line = ''

for line in search_text: # Each line is counted as a separate, complete object in itself.
    # Join the previous and current line, turning the line breaks into spaces
    # so that a phrase split over two lines can still match.
    new_line = (prev_line + line).replace('\n', ' ')
    prev_line = line # Keep the current line around for the next pass.

    # SEARCH FOR WORDS AND PHRASES #
    for var in user_input:
        if var not in matches: # Skip terms that have already been found.
            # re.escape makes characters such as '.' or '?' in the term match literally.
            match = re.search('\\b' + re.escape(var) + '\\b', new_line)
            if match: # re.search returns None when there is no match.
                matches.append(match.group(0))

search_text.close()

for match in matches:
    print 'Match:', match

If you have any questions, or you find a mistake in my code, please let me know. Have fun.
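For the two parts of the original question that the code above leaves out (writing the matches to another file and listing the terms that were not found), something along these lines might work. Treat it as a sketch under a few assumptions: it targets Python 3 rather than the Python 2 used above, it reads the whole file into memory instead of going line by line, and `extract_terms`, `run`, and the file paths passed to `run` are names made up for the illustration.

```python
import re

def extract_terms(text, terms):
    """Split `terms` into (found, missing) using a whole-word search in `text`."""
    # Collapse line breaks and repeated spaces so phrases can span lines.
    flat = ' '.join(text.split())
    found, missing = [], []
    for term in terms:
        pattern = r'\b' + re.escape(term) + r'\b'
        if re.search(pattern, flat):
            found.append(term)
        else:
            missing.append(term)
    return found, missing

def run(source_path, output_path, terms):
    """Read source_path, write the found terms to output_path, return both lists."""
    with open(source_path, encoding='utf-8') as source:
        found, missing = extract_terms(source.read(), terms)
    with open(output_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(found) + '\n')
    return found, missing

found, missing = extract_terms('Happy\nbirthday! See you Monday.',
                               ['Happy birthday', 'Monday', 'Tuesday'])
print('Found:', found)        # Found: ['Happy birthday', 'Monday']
print('Not found:', missing)  # Not found: ['Tuesday']
```

Calling `run` with a source path, an output path, and the list of terms writes one found term per line to the output file; the returned `missing` list is what you would print as "not found".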
Aug 4 '12 #6
numberwhun
3,509 Recognized Expert Moderator Specialist
It is recommended that you find one of the plethora of Python tutorials on the internet and go through it. Python is fun and easier than one would think, especially for beginners.

Regards,

Jeff
Aug 7 '12 #7
