Bytes | Software Development & Data Engineering Community
Extract data from a file and write it to another file

I want to open a word file,
check it against my list of words or phrases to extract
(such as Monday_Tuesday, Happy_birthday, etc.),
write the matching words or phrases to another file,
and also state which words or phrases in my list were not found.
Jul 27 '12 #1
I know nothing about Python. My knowledge at this time is only that I downloaded it and wrote hello world.
Jul 27 '12 #2
First, you should realize that your question is unlikely to receive very many responses besides mine. Your question asks for a lot of higher-level (higher than noob level :P) concepts to be addressed, but you left out a very important thing: you should have made an attempt to write your own code. You should show what level of knowledge you have about lists, dictionaries, indexing, function calls, etc. This shows that you made an effort and are not trying to have someone else whip up a highly efficient piece of code so that you can avoid putting in the effort yourself.

You asked for many things in your code but provided none of your own progress to build on. I do understand why you did, though. I myself have inadvertently done it in the past. Just try to keep this in mind for the future.
Jul 28 '12 #3
#------------------------------------------------------------------------------#
#                            Preparation of Data                               #
#------------------------------------------------------------------------------#
# "user_input" holds all words and/or phrases that you would like to search for.

user_input = input('Please input words and phrases to search for, separating each standalone term with a comma: ')

# split(',') separates the search terms at the commas you are asked to use at input;
# strip() removes the spaces left around each term, and empty terms are dropped.
user_input = [term.strip() for term in user_input.split(',') if term.strip()]

print(user_input)
# Now I am not entirely sure how to go about efficiently searching for phrases, but single words are fairly simple.
# "phrases" is a list which will contain all...well...phrases, separate from the single words. :P
# Build separate lists instead of removing items from user_input while looping over it,
# which would make the loop skip elements.

phrases = [term for term in user_input if len(term.split()) > 1] # split() with no arguments separates on any whitespace.
words = [term for term in user_input if len(term.split()) == 1]


#------------------------------------------------------------------------------#
#                            Search Data For Matches                           #
#------------------------------------------------------------------------------#

path_to_file = 'C:/Users/JRFerrell/Desktop/sample_parse.txt' # My example path; replace it with the path to your own file.
# You can find several good tutorials on YouTube for dealing with file IO.

matches = []

# We'll need to combine two lines at a time in order to search for phrases. Phrases could be split over multiple lines,
# so you'll need conventions for dealing with that.
# "prev_line" stores the previous line, which is combined with the current line to form a new line for searching.
prev_line = ''

with open(path_to_file, 'r') as search_text: # "with" closes the file when the block ends.
    for line in search_text: # Each line is counted as a separate, complete object in itself.
        new_line = prev_line + line # Create a new line from the previous and current line.
        prev_line = line # Re-assign the previous line variable with the current line in preparation for the next search.

        # split() leaves punctuation attached to words, so splitting "Hello there!" gives "there!", not "there".
        # Strip common punctuation from each word so it can be compared with the search terms.
        new_line = [word.strip('!?.,;:') for word in new_line.split()]

        # SEARCH FOR SINGLE WORDS #
        for var in words:
            if var in new_line and var not in matches:
                matches.append(var) # Record the match; you could also write it to another file here, for example: file.write(var + ' ')

        # SEARCH FOR PHRASES #
        for var in phrases: # If "phrases" is empty, this loop simply does nothing.
            var_split = var.split()
            length = len(var_split) # Number of words in the phrase.

            # For each position in the line, take the word at that position plus the words after it
            # until the slice is as long as the phrase. So for a 3-word phrase, the check is equivalent to:
            #
            #       if var_split == [new_line[index], new_line[index+1], new_line[index+2]]:
            #           do_something()
            #
            for index in range(len(new_line) - length + 1):
                if new_line[index:index + length] == var_split and var not in matches:
                    matches.append(var)


for match in matches:
    print('Match:', match)
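The phrase search described in the comments above boils down to sliding a window over the words of the text. Here is a minimal Python 3 sketch of that idea on its own. The names `tokenize` and `contains_phrase` are made up for illustration, and stripping every character in `string.punctuation` from the ends of each word is an assumption about what counts as punctuation.

```python
import string

def tokenize(text):
    """Split text on whitespace and strip punctuation from the ends of each word."""
    words = (word.strip(string.punctuation) for word in text.split())
    return [word for word in words if word]

def contains_phrase(text, phrase):
    """Return True if the words of phrase appear consecutively in text."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return False
    size = len(target)
    # Slide a window of `size` words across the text and compare each slice.
    return any(words[i:i + size] == target
               for i in range(len(words) - size + 1))

print(contains_phrase("Hello there! Happy birthday,\nfriend.", "Happy birthday"))
print(contains_phrase("Happy belated birthday", "Happy birthday"))
```

The first call prints True even though punctuation and a line break surround the phrase; the second prints False because the two words are not adjacent.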
Jul 29 '12 #4
I am going to try to read up on regular expressions and see if I can write something a little more streamlined.
Jul 29 '12 #5
So...I stopped being lazy last night (....sorta.... :P), and I figured out some issues that were holding me back.
The code below works as far as I have tested it (on a few lines of a short test txt doc). It may not be very efficient for extremely large searches, but it'll do for short, quick work, as I said for the last code. Anyway, it turns out that Python's standard library includes a module called "re", which stands for Regular Expressions. This module is purpose-built for searching strings for a match of a user-defined pattern.
This is likely to be more efficient than a custom, self-built, franken-parser (unless you know what is efficient, memory-wise...I do not. :P), because it's been optimized by serious programmers. :) All you have to do with the module is set up the text to be searched, create a way to iterate through the text, and condition the user input so that it can be used as a search pattern.
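As a rough illustration of "conditioning" the user input, here is a small Python 3 sketch. The helper name `prepare_terms` and the sample terms are made up for this example. It trims each comma-separated term and then uses `re.escape` so that characters such as `.` in a term are matched literally instead of being treated as regular expression syntax.

```python
import re

def prepare_terms(raw):
    """Split on commas, trim surrounding spaces, and drop empty entries."""
    return [term.strip() for term in raw.split(',') if term.strip()]

terms = prepare_terms(" Monday_Tuesday, Happy birthday ,, e.g. ")
print(terms)

# Unescaped, each '.' in "e.g." is a wildcard, so the pattern matches "eggs".
print(bool(re.search("e.g.", "eggs")))
# Escaped, the dots must match literal dots.
print(bool(re.search(re.escape("e.g."), "eggs")))
print(bool(re.search(re.escape("e.g."), "fruit, e.g. apples")))
```

The three searches print True, False and True, which is why escaping is worth doing before a user-supplied term goes into a pattern.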

When you go to use re.search, keep in mind that it deals with some strings in a way that you won't commonly run into as a beginner (like me). Patterns are usually written as "raw string literals", for example: 'Hello' becomes r'Hello', with an 'r' in front. re.search does not strictly require this, but the 'r' keeps Python from interpreting the backslashes before the re module gets to see them. To match the exact word or phrase as a whole, rather than as part of a longer word, you wrap it in the '\b' word-boundary marker (so r'\bHello\b'), which is part of the regular expression syntax. The catch is that the 'r' prefix only applies to literals typed into the source code, so you can't add it to a string that comes from user input. Luckily, there is a workaround: Python uses the backslash ('\') as an escape character, and in a normal string '\b' means a single backspace character, not the backslash followed by 'b' that the re module needs. So you use an escape character on the backslash itself: '\b' becomes '\\b'. This is the exact thing a raw string saves you from typing, so r'\bHello\b' and '\\bHello\\b' are the same string, and you can simply concatenate '\\b' onto either side of the user's term. Not doing this will lead to confusion. You have been warned. ;P
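The backslash behaviour is easy to check in the interpreter. The following Python 3 sketch is only an illustration, and the search term 'cat' and the sample sentences are invented. It compares the lengths of the different spellings of '\b' and shows what the word boundary changes in a search.

```python
import re

# In a normal string literal, '\b' is one character (backspace);
# the raw and doubled-backslash spellings are two characters each.
print(len('\b'), len(r'\b'), len('\\b'))

# The raw literal and the doubled-backslash literal are the same string.
print(r'\bHello\b' == '\\bHello\\b')

term = 'cat'  # hypothetical search term
# re.escape keeps any special characters in the term literal.
pattern = '\\b' + re.escape(term) + '\\b'

print(bool(re.search(pattern, 'The cat sat.')))  # whole word: found
print(bool(re.search(pattern, 'concatenate')))   # inside a longer word: not found
print(bool(re.search(term, 'concatenate')))      # no boundaries: found
```

The first line prints `1 2 2` and the comparison prints True. The last three lines print True, False and True.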
import re

#------------------------------------------------------------------------------#
#                            Preparation of Data                               #
#------------------------------------------------------------------------------#
# "user_input" holds all words and/or phrases that you would like to search for.

user_input = input('Please input words and phrases to search for, separating each standalone term with a comma: ')

# split(',') separates the search terms at the commas you are asked to use at input;
# strip() removes the spaces left around each term, and empty terms are dropped.
user_input = [term.strip() for term in user_input.split(',') if term.strip()]

print(user_input)


#------------------------------------------------------------------------------#
#                            Search Data For Matches                           #
#------------------------------------------------------------------------------#

path_to_file = 'C:/Users/JRFerrell/Desktop/sample_parse.txt' # My example path; replace it with the path to your own file.
# You can find several good tutorials on YouTube for dealing with file IO.

matches = []

prev_line = ''

with open(path_to_file, 'r') as search_text: # "with" closes the file when the block ends.
    for line in search_text: # Each line is counted as a separate, complete object in itself.
        # Create a new line from the previous and current line, joined with single spaces
        # so that a phrase split across the two lines can still match.
        new_line = ' '.join((prev_line + line).split())
        prev_line = line # Re-assign the previous line variable with the current line in preparation for the next search.

        # SEARCH FOR WORDS AND PHRASES #
        for var in user_input:
            if var not in matches: # Skip terms that have already been found.
                # re.escape() makes characters such as '.' or '?' in the term match literally.
                match = re.search('\\b' + re.escape(var) + '\\b', new_line)
                if match: # re.search returns None when there is no match.
                    matches.append(match.group(0))


for match in matches:
    print('Match:', match)

If you have any questions, or you find a mistake in my code, please let me know. Have fun.
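Neither listing above writes the matches to a second file or reports the terms that were not found, which the original question also asked for. Here is a minimal Python 3 sketch of those two steps. The function names `find_terms` and `extract`, the one-term-per-line output format, and the choice to read the whole file into memory at once are all assumptions made for illustration, not part of the code above.

```python
import re

def find_terms(text, terms):
    """Return (found, missing) lists, keeping the order of terms."""
    # Collapse runs of whitespace so a phrase broken across lines still matches.
    normalized = ' '.join(text.split())
    found, missing = [], []
    for term in terms:
        pattern = r'\b' + re.escape(term) + r'\b'
        if re.search(pattern, normalized):
            found.append(term)
        else:
            missing.append(term)
    return found, missing

def extract(source_path, output_path, terms):
    """Search source_path, write found terms to output_path, return (found, missing)."""
    with open(source_path, 'r') as source:
        text = source.read()
    found, missing = find_terms(text, terms)
    with open(output_path, 'w') as output:
        for term in found:
            output.write(term + '\n')
    return found, missing

sample = "See you on Monday_Tuesday.\nHappy\nbirthday to you!"
found, missing = find_terms(sample, ['Monday_Tuesday', 'Happy birthday', 'Friday'])
print('Found:', found)
print('Not found:', missing)
```

With the sample text, 'Monday_Tuesday' and 'Happy birthday' are reported as found and 'Friday' as not found. Calling `extract` with two real paths does the same search on a file and writes the found terms to the output file.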
Aug 4 '12 #6
numberwhun
It is recommended that you find one of the plethora of Python tutorials on the internet and go through it. Python is fun and easier than one would think, especially for beginners.

Regards,

Jeff
Aug 7 '12 #7
