473,385 Members | 1,780 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

Parse HTML file to EXCEL

Hello Everbody,
Hopefully you all are fine.

Question:
I have few html files (invoices) at a directory.
What i required is read html contents (customer name,bill_no,dues etc) and store them in a .csv file...
[i have attached file invoice.txt, plz open it in IE]

I have an already existing parser (attached parser.sh) which is working fine.
Please help me in converting my existing shell parser to python parser.

Shell Parser Code is below:
Expand|Select|Wrap|Line Numbers
  1. #!/bin/bash
  2.  
  3. echo "Script started \n"; 
  4. echo "\"BILL_NUMBER\",\"ACCOUNT_NUMBER\",\"USERNAME\",\"CUSTOMER_NAME\",\"CONTACT_NO\",\"EMAIL\",
  5. \"PACKAGE_PLAN\",\"TOTAL_AMOUNT_DUE_TOP\",\"PAYMENT_DUE_DATE\",\"TOTAL_AMOUNT_AFTER_DUE_DATE\",\"BILLING_PERIOD\",
  6. \"PERVIOUS_BALANCE\",\"PAYMENTS\",\"NET_PREVIOUS_BALANCE\",\"SUBORDINATE_AC_CHARGES\",\"INITIAL_CHARGES\",
  7. \"MONTHLY_LINE_RENT\",\"ANTIVIRUS_LINE_RENT\",\"PARENTAL_LINE_RENT\",\"EXTRA_USAGE\",\"Extra Usage-2GB_COUNT\",
  8. \"Extra Usage-2GB_AMOUNT\",\"Extra Usage-5GB_COUNT\",\"Extra Usage-5GB_AMOUNT\",\"SPEED_BOOST_COUNT\",
  9. \"SPEED_BOOST_AMOUNT\",\"HAPPY_DAYS_1_COUNT\",\"HAPPY_DAYS_1_AMOUNT\",\"HAPPY_DAYS_3_COUNT\",
  10. \"HAPPY_DAYS_3_AMOUNT\",\"StaticIP_COUNT\",\"StaticIP_AMOUNT\",\"PayAsYouGo_05Day_COUNT\",\"PayAsYouGo_05Day_AMOUNT\",
  11. \"PayAsYouGo_10Day_COUNT\",\"PayAsYouGo_10Day_AMOUNT\",\"PayAsYouGo_30Day_COUNT\",\"PayAsYouGo_30Day_AMOUNT\",
  12. \"PayAsYouGo_03Day_COUNT\",\"PayAsYouGo_03Day_AMOUNT\",\"PayAsYouGo_07Day_COUNT\",\"PayAsYouGo_07Day_AMOUNT\",
  13. \"PayAsYouGo_15Day_COUNT\",\"PayAsYouGo_15Day_AMOUNT\",\"Power Hours-30Days_COUNT\",\"Power Hours-30Day_AMOUNT\",
  14. \"DISCOUNTS\",\"ADJUSTMENTS\",\"DEVICE_CHANGE_CHARGES\",\"PLAN_CHANGE_CHARGES\",\"DEVICE_DAMAGE_CHARGES\",
  15. \"DEVICE_LOST_CHARGES\",\"ACCOUNT_FREEZE_CHARGES\",\"LATE_PAYMENT_CHARGES\",\"SUBTOTAL\",\"FEDERAL_EXCISE_DUTY\",
  16. \"ADVANCE_WITHHOLDING_TAX\",\"CURRENT_CHARGES\",\"ToTAL_DUE\"" > parsed.csv
  17.  
  18. for file in *.html; do
  19.  cat $file | grep '<!--B' |  awk 'BEGIN{ RS = ""; FS = "|"}  { for( i=1;i<=NF;i+=1)    { split($i,a,"##"); printf "\"" a[2] "\","}  printf "\n"  }'  
  20.   >> parsed.csv
  21. done
  22.  echo "Script finished";


Please if anyone can share it in python format.thnx
Attached Files
File Type: txt invoice.txt (333.5 KB, 3933 views)
Dec 15 '11 #1
7 14840
Glenton
391 Expert 256MB
Hi

There are three modules that you can use to help you.
glob - to help you go through your files
re - for regular expressions to extract your data
csv - for writing csv files

Below is a basic structure to get you started

Expand|Select|Wrap|Line Numbers
  1. import glob
  2. import re
  3. import csv
  4.  
  5. headings=[*list of your headings*]
  6. output=csv.writer(open("parser.csv","w"))
  7. output.writerow(headings)
  8.  
  9. for file in glob.glob("*.html"):
  10.     inputFile=open(file[-1])
  11.     data=[]
  12.     for heading in headings:
  13.         *code to extract data for heading
  14.         data.append(extracted data)
  15.     output.writerow(data)
  16.     inputFile.close()
  17.  
The code to extract your data is obviously specific to your file. I haven't gone through to see what is the best way, but generally using a regular expression will sort you out.

Good luck!
Dec 15 '11 #2
bvdet
2,851 Expert Mod 2GB
To add to Glenton's information, Python module BeautifulSoup is ideal for parsing HTML files. Having never used it, I thought I would give it a shot.

Here's what takes place:
  1. Read the file in it's entirety and create a BeautifulSoup object
  2. Find the text you want to parse
  3. Replace "=\n" with "" in the text
  4. Create a list of strings by splitting the text on "|"
  5. Create a file object for writing the csv data
  6. Create a csv.writer object
  7. Iterate on the list of strings, split each string on "##", and write each row
  8. Close the file object

Now for the code:
Expand|Select|Wrap|Line Numbers
  1. import re
  2. from BeautifulSoup import BeautifulSoup
  3.  
  4. fnIn = "invoice.htm"
  5. fnOut = "invoice.csv"
  6.  
  7. soup = BeautifulSoup(open(fnIn).read())
  8. comments = soup.find(text=re.compile("BILL_NUMBER")).replace("=\n", "").split("|")
  9.  
  10. f = open(fnOut, 'w')
  11. writer = csv.writer(f)
  12. for s in comments:
  13.     writer.writerow(s.split("##"))
  14. f.close()
Looks simple, doesn't it?

The csv module automatically accounts for embedded commas in the text.
Dec 15 '11 #3
Glenton
391 Expert 256MB
Wow! That's handy. I have an old script that I use to download share data - wish I'd known about BeautifulSoup then!
Dec 16 '11 #4
Thanks Glenton & bvdet,

@bvdet: i am getting error "ImportError: No module named BeautifulSoup" on executing the provided code.
Is it beacause of any missing plugin/utility.?
Please guide.
Dec 22 '11 #5
bvdet
2,851 Expert Mod 2GB
BeautifulSoup is not built into Python. You have to download and install it.
Dec 22 '11 #6
Dear Both,
Thanks for your help, now it giving error on writer, as undefine..

Anyways I have simplified my requirement as below, kindly provide me python code ......

I have an HTML File (say myfile.html) having only 3 lines as shown below

Hello EverBody.
<!--FName##Adam|SName##John|PhoneNo##0987654321|Gender ##Male|EMAIL##abbassalam@yahoo.com-->
Good Bye

I want to write this file(myfile.html) contents to a csv file (say mycsv.csv) in such a way that....
1)Program only read line 2 starting from "<!--" to "-->"
2)Extract all strings between '##" and "|" and store it into a csv file as below

Adam John 0987654321 Male abbassalam@yahoo.com
Jan 4 '12 #7
bvdet
2,851 Expert Mod 2GB
Amad Khan,

We are not here to write your code for you. You should be able to write your own from the examples we have provided. You can post the code you have attempted along with the error you received, including traceback, and we will be glad to assist in correcting your problem.

bvdet
Moderator
Jan 4 '12 #8

Sign in to post your reply or Sign up for a free account.

Similar topics

3
by: Prakash | last post by:
Hi, We face problems uploading excel (with macros) documents using HTML File Upload. The file contents are corrupted while viewing the same. However, we are able to upload excel (w/o. macros)...
1
by: Onur | last post by:
How can import a HTML table to Excel?
0
by: harini | last post by:
Hi, I am working on excel VBA and i have a control on my excel sheet say a command button. Now when i click on the command button i shud be able to generate a state web page written using html...
1
by: sandy2481 | last post by:
Hi Friends, I'm a beginner in perl. Can anyone please provide me with a perl code to read a HTML file and write into excel file. Regards, Sandy
0
by: dodoG | last post by:
Hello, How to parse html file tags <h1>-<h4> to excel file with c#. 1. How i create excel 2007 object . 2. What should i do in visual studio 2005 (which references in com object). 3. How i...
6
scubak1w1
by: scubak1w1 | last post by:
Hello, I have been Googling for a bit now as to how to let the user save the current page via a 'clickable icon' to an HTML file (i.e., on the client side.) That is, I am presenting a lot of...
7
by: veer | last post by:
Hi! i am making a program in which i want to search a particular word in html file and if word found save it into excel file with html file name e.g i want to search a string "visual" in a.html...
3
by: Mark | last post by:
Hello This might be an odd question. I have an Excel VBA program that searches through a folder for any .htm or .html file, searches the file for any 'hrefs' and prints them to the spreadsheet....
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.