parsing a file

Hey all.
(Hopefully) a quick question here. I am processing data using Hadoop Streaming Map/Reduce. The map.py is straightforward: it takes the input data (from sys.stdin), loads it into a list, sorts that list, then... well, I'm not exactly sure what Hadoop does with that, but I'm pretty sure it creates a temporary file, much like a CSV, in memory.

    import sys

    TmpArr = []
    for line in sys.stdin:
        TmpArr.append(line.strip())  # append into a list
    TmpArr.sort()                    # then sort
    for m in TmpArr:
        print m

I then have a reduce.py that reads each line from the mysterious Hadoop temp file via sys.stdin, like this:

    for line in sys.stdin:
        <load into temporary list and do some stuff>

My issue is that as soon as line is empty, the process ends, even if there is still data to process. Is there an error-checking way to fix this with stdin?
An example would be this. My table looks like this:
ID   VAL
01   20
01   22
01   25
02   10
02   15
02   17
03   5
03   7

My output SHOULD look like this:
ID   AVG    COUNT
01   22.3   3
02   14.0   3
03   6.0    2

but it's coming out like this:
ID   AVG    COUNT
01   22.3   3
02   14.0   3

Sorry this is so long-winded, and thanks for any input. Also, I could post my whole code if needed, but it's a bit long-winded too!
Cheers,
Eric
Apr 28 '10 #1
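
A missing final row like the 03 row above is typical of a reducer that only emits a group when it sees the key change: the last group never sees a later key, so it has to be flushed once the loop is over. The following is a minimal sketch of that idea, not the poster's actual reduce.py. It assumes comma-separated "ID,VAL" lines already sorted by ID, is written in Python 3 syntax (the thread itself uses Python 2), and the name reduce_averages is made up for illustration. In a real streaming job the function would be given sys.stdin instead of the sample stream.

```python
import io


def reduce_averages(lines):
    """Yield (key, average, count) for each run of equal keys in sorted input."""
    current_key = None
    total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines instead of treating them as end of data
        key, value = line.split(",")
        if key != current_key:
            if current_key is not None:
                # A new key means the previous group is complete.
                yield current_key, total / count, count
            current_key, total, count = key, 0.0, 0
        total += float(value)
        count += 1
    if current_key is not None:
        # Flush the final group: no later key change will ever trigger it.
        yield current_key, total / count, count


# Assumed sample input, mirroring the table in the question.
sample = io.StringIO("01,20\n01,22\n01,25\n02,10\n02,15\n02,17\n03,5\n03,7\n")
for key, avg, n in reduce_averages(sample):
    print("%s,%.1f,%d" % (key, avg, n))
```

With the sample data this prints three rows, including the one for ID 03.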

✓ answered by Glenton

Er...I don't really understand what the question is now. What's the pseudocode for what you're trying to do?

The trickiness of what you're doing is not the calculation, but handling the large dataset, right? If you had a way of converting your keys uniquely into 0,...,n-1 you might find it easier to create an array/list which you just update on the fly.

erbrose
I am messing around with just running reducer.py against a text file, and I am able to process the whole file by adding this:

    while True:
        line = reader.readline()
        if len(line) != 0:
            <my code here>
        else:
            <repeat my code here>

It seems slightly wrong to have to repeat all my code in the if and the else, but it works. I am not able to get it to work using sys.stdin, though.
Thanks again
Apr 28 '10 #2
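
One way to avoid repeating the per-group code in both branches is to append an end-of-input marker to the stream, so the "key changed" branch fires one last time on its own. This is only a sketch under assumptions: the input is taken to be comma-separated "ID,VAL" lines sorted by ID, the names END and group_counts are hypothetical, and the code uses Python 3 syntax rather than the Python 2 used in the thread.

```python
import io
from itertools import chain

END = object()  # hypothetical end-of-input marker; never equal to a real line


def group_counts(reader):
    """Return [(key, count)] for sorted 'ID,VAL' lines with one copy of the group logic."""
    results = []
    current_key, count = None, 0
    for line in chain(reader, [END]):
        if line is END:
            key = None  # forces a final "key changed" event
        else:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key = line.split(",")[0]
        if key != current_key:
            if current_key is not None:
                results.append((current_key, count))  # the only place a group is emitted
            current_key, count = key, 0
        count += 1
    return results


print(group_counts(io.StringIO("01,20\n01,22\n02,10\n")))
```

The same loop works unchanged when reader is sys.stdin, because chain only needs something it can iterate over.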
Glenton
Okay, that's interesting. In your second example, I assume reader is a file object for a text file (i.e. reader = open("whatever.txt") or something)?
If so, it should work to go:

    for line in reader:
        <your code here>

I'm assuming that when the length of the line is zero, you just want to carry on to the next line? So you could do this like this:

    for line in reader:
        if not line.strip(): continue  # a blank line still contains "\n", so len(line) is never 0 here
        <your code here>

I definitely can't see any reason to repeat your code!

Regarding the sys.stdin, I suppose it goes until it hits a blank line and then the iterator stops running. I've not used this kind of thing, but can imagine that would be problematic. I would suppose that the easier fix is in your reduce.py. Perhaps you can ensure that it goes through the whole hadoop file there before the iterator ends.

Perhaps you could post reduce.py? I'm guessing that it could be written neatly as a class with an iterator. And that it's not written like that now ;P
Apr 29 '10 #3
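
On the sys.stdin supposition in post #3: iterating over a file object in Python, sys.stdin included, ends at end of file and not at a blank line. A blank line arrives as "\n", and readline() returns the empty string only at end of file. The small check below illustrates this with an in-memory stream standing in for sys.stdin; the sample data is made up, and the syntax is Python 3.

```python
import io

# Stand-in for sys.stdin, with a blank line in the middle.
stream = io.StringIO("01,20\n\n02,10\n")

# Iteration ends only at end of file; the blank line is yielded as "\n".
seen = [line for line in stream]
print(seen)

# readline() behaves the same way: "" means end of file, "\n" means a blank line.
stream.seek(0)
via_readline = []
while True:
    line = stream.readline()
    if line == "":
        break
    via_readline.append(line)
print(via_readline == seen)
```

So a reducer that stops early is more likely missing a final flush of its last group than being cut short by the iterator.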
erbrose
Thanks!
Alright, well, the only reason I check for line == 0 (or in this case line == "") is to detect the actual end of file; there will be no empty lines from the map input. I am still having to pretty much duplicate the code, as you see. The code is all over the place too, as I'm still in debug mode, but it is working properly with a CSV file as the input. I am calculating the average, standard deviation, median, min and max values too. I will eventually look into NumPy or SciPy, but for now I'm calculating values the old-fashioned way.

    #!/usr/bin/python
    import sys
    import math
    TmpArr = []
    Unique = []
    SortArr = []
    OutArr = []
    tmp_avgspd = float(0)
    sqr_sum = float(0)
    i = int(0)
    l = int(0)
    b = int(0)
    c = int(0)
    n = int(0)
    j = int(0)
    k = float(0)
    a = int(0)
    devsum = float(0)
    deviation = []
    reader = open("d:/temp/tmp/sample.csv",'r')

    while True:
        line = reader.readline()

        line = line.strip()
        StrTemp = line
        TmpArr.append(line.split(','))
        SortArr.append(line.split(','))
        #if last line.. finish processing stuff in my Unique List
        if line == "":
            l = len(Unique)
            if l == 0:
                median_val=Unique[l][1]
                avg_val=Unique[l][1]
                min_val = Unique[l][1]
                std_dev=0.0
            elif l == 1:
                median_val=Unique[l-1][1]
                avg_val=Unique[l-1][1]
                std_dev=0.0
                min_val = Unique[l-1][1]
                max_val = Unique[l-1][1]
            elif l == 2:
                for a in Unique:
                    c = c + int(Unique[b][1])
                    b = b + 1
                avg_val = c/l
                median_val = c/l
                tmp_avgspd = float(c)/float(l)
                b = 0
                c = 0
                for a in Unique:
                    deviation.append(float((float(Unique[b][1])-tmp_avgspd)*(float(Unique[b][1])-tmp_avgspd)))
                    b = b + 1
                b = 0
                for a in deviation:
                    devsum = devsum + float(deviation[b])
                    b = b + 1
                devsum = devsum/1.0
                sqr_sum = math.sqrt(devsum)
                std_dev = round(sqr_sum,3)
                min_val = Unique[0][1]
                max_val = Unique[l-1][1]
            elif l%2==0:
                median_val=(int(Unique[l/2][1])+int(Unique[(l/2)+1][1]))/2
                for a in Unique:
                    c = c + int(Unique[b][1])
                    b = b + 1
                avg_val = c/l
                tmp_avgspd = float(c)/float(l)
                b = 0
                c = 0
                for a in Unique:
                    deviation.append(float((float(Unique[b][1])-tmp_avgspd)*(float(Unique[b][1])-tmp_avgspd)))
                    b = b + 1
                b = 0
                devsum = 0.0
                for a in deviation:
                    devsum = devsum + float(deviation[b])
                    b = b + 1
                devsum2 = devsum
                k = l - 1
                devsum = devsum/k
                sqr_sum = math.sqrt(devsum)
                std_dev = round(sqr_sum,3)
                min_val = Unique[0][1]
                max_val = Unique[l-1][1]
            else:
                median_val=Unique[l/2][1]
                for a in Unique:
                    d = Unique[b][1]
                    d = int(d)
                    c = c + d
                    b = b + 1
                avg_val = c/l
                tmp_avgspd = float(c)/float(l)
                b = 0
                c = 0
                for a in Unique:
                    deviation.append(float((float(Unique[b][1])-tmp_avgspd)*(float(Unique[b][1])-tmp_avgspd)))
                    b = b + 1
                b = 0
                devsum = 0.0
                for a in deviation:
                    devsum = devsum + float(deviation[b])
                    b = b + 1
                devsum2 = devsum
                k = l - 1
                devsum = devsum/k
                sqr_sum = math.sqrt(devsum)
                std_dev = round(sqr_sum,3)
                min_val = Unique[0][1]
                max_val = Unique[l-1][1]

            id = Unique[0][0]
            TempString = str(id) + ',' + str(avg_val) + ',' + str(median_val) + ',' + str(std_dev) + ',' + str(min_val) + ',' + str(max_val) + ',' + str(l)
            print TempString
            break
        else:

        #if first row go ahead and put into unique array
            if i == 0:
                Unique.append(line.split(','))
            else:
                #print Unique

                if SortArr[i][0]==SortArr[i-1][0]:
                    #StrTemp = str(TmpArr[j][0]) + ',' + str(TmpArr[j][1])
                    StrTemp = str(TmpArr[j][0]) + ',' + str(TmpArr[j][1])
                    Unique.append(StrTemp.split(','))
                else:

                    l = len(Unique)
                    if l == 0:
                        median_val=Unique[l][1]
                        avg_val=Unique[l][1]
                        min_val = Unique[l][1]
                        max_val = Unique[l][1]
                        std_dev=0.0
                    elif l == 1:
                        median_val=Unique[l-1][1]
                        avg_val=Unique[l-1][1]
                        std_dev=0.0
                        min_val = Unique[l-1][1]
                        max_val = Unique[l-1][1]
                    elif l == 2:
                        for a in Unique:
                            c = c + int(Unique[b][1])
                            b = b + 1
                        avg_val = c/l
                        median_val = c/l
                        tmp_avgspd = float(c)/float(l)
                        b = 0
                        c = 0
                        for a in Unique:
                            deviation.append(float((float(Unique[b][1])-tmp_avgspd)*(float(Unique[b][1])-tmp_avgspd)))
                            b = b + 1
                        b = 0
                        for a in deviation:
                            devsum = devsum + float(deviation[b])
                            b = b + 1
                        devsum = devsum/1.0
                        sqr_sum = math.sqrt(devsum)
                        std_dev = round(sqr_sum,3)
                        min_val = Unique[0][1]
                        max_val = Unique[l-1][1]
                    elif l%2==0:
                        median_val=(int(Unique[l/2][1])+int(Unique[(l/2)+1][1]))/2
                        for a in Unique:
                            c = c + int(Unique[b][1])
                            b = b + 1
                        avg_val = c/l
                        tmp_avgspd = float(c)/float(l)
                        b = 0
                        c = 0
                        for a in Unique:
                            deviation.append(float((float(Unique[b][1])-tmp_avgspd)*(float(Unique[b][1])-tmp_avgspd)))
                            b = b + 1
                        b = 0
                        devsum = 0.0
                        for a in deviation:
                            devsum = devsum + float(deviation[b])
                            b = b + 1
                        devsum2 = devsum
                        k = l - 1
                        devsum = devsum/k
                        sqr_sum = math.sqrt(devsum)
                        std_dev = round(sqr_sum,3)
                        min_val = Unique[0][1]
                        max_val = Unique[l-1][1]
                    else:
                        median_val=Unique[l/2][1]
                        for a in Unique:
                            d = Unique[b][1]
                            d = int(d)
                            c = c + d
                            b = b + 1
                        avg_val = c/l
                        tmp_avgspd = float(c)/float(l)
                        b = 0
                        c = 0
                        for a in Unique:
                            deviation.append(float((float(Unique[b][1])-tmp_avgspd)*(float(Unique[b][1])-tmp_avgspd)))
                            b = b + 1
                        b = 0
                        devsum = 0.0
                        for a in deviation:
                            devsum = devsum + float(deviation[b])
                            b = b + 1
                        devsum2 = devsum
                        k = l - 1
                        devsum = devsum/k
                        sqr_sum = math.sqrt(devsum)
                        std_dev = round(sqr_sum,3)
                        min_val = Unique[0][1]
                        max_val = Unique[l-1][1]

                    id = Unique[0][0]
                    TempString = str(id) + ',' + str(avg_val) + ',' + str(median_val) + ',' + str(std_dev) + ',' + str(min_val) + ',' + str(max_val) + ',' + str(l)
                    print TempString
                    deviation = []
                    a = len(TmpArr)

                    Unique = []
                    SortArr = []
                    SortArr.append(StrTemp.split(','))
                    Unique.append(StrTemp.split(','))
                    i = 0
                    b = int(0)
                    avg_val = int(0)
                    median_val = int(0)
                    c = int(0)


        i = i + 1
        n = n + 1
        j = j + 1
Apr 29 '10 #4
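
For comparison, the per-group statistics in post #4 (average, median, standard deviation, min, max, count) can be written once and reused for every group, the last one included. This sketch is a swapped-in approach, not the poster's code: it uses itertools.groupby and the standard-library statistics module, which needs Python 3.4 or later, so it would not run under the Python 2 interpreter used in the thread. It assumes comma-separated "ID,VAL" lines sorted by ID, and the function names summarize and reduce_stats are hypothetical. The standard deviation is the sample form (dividing by n - 1), which matches the posted code.

```python
import io
import statistics
from itertools import groupby


def summarize(values):
    """Return (mean, median, sample stdev, min, max, count) for a non-empty list."""
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return (statistics.mean(values), statistics.median(values),
            round(stdev, 3), min(values), max(values), len(values))


def reduce_stats(lines):
    """Yield (key, summary) per run of equal keys; input must be sorted by key."""
    rows = (line.strip().split(",") for line in lines if line.strip())
    # groupby closes the last group itself when the input ends,
    # so no separate end-of-file branch is needed.
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield key, summarize([float(row[1]) for row in group])


# Assumed sample input, mirroring the table in the first post.
sample = io.StringIO("01,20\n01,22\n01,25\n02,10\n02,15\n02,17\n03,5\n03,7\n")
for key, summary in reduce_stats(sample):
    print(key, summary)
```

Each group's values are still held in a list here, which is what the median needs; only one group is in memory at a time.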
Glenton
Er...I don't really understand what the question is now. What's the pseudocode for what you're trying to do?

The trickiness of what you're doing is not the calculation, but handling the large dataset, right? If you had a way of converting your keys uniquely into 0,...,n-1 you might find it easier to create an array/list which you just update on the fly.
Apr 29 '10 #5
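
The suggestion in post #5, mapping each key to an index 0,...,n-1 and updating an array on the fly, could look roughly like the sketch below. It is one possible reading of that idea, not something spelled out in the thread. A dict assigns the dense indices in first-seen order, and the running mean and variance use Welford's online update so that no per-key list of values has to be kept. The function name running_stats and the "ID,VAL" comma format are assumptions, and the syntax is Python 3.

```python
import io
import math


def running_stats(lines):
    """Return (key_index, counts, means, m2s), updated one value at a time."""
    key_index = {}  # key -> dense index 0, 1, ..., n-1 in first-seen order
    counts = []     # counts[i]: number of values seen for key i
    means = []      # means[i]: running mean for key i
    m2s = []        # m2s[i]: running sum of squared deviations from the mean
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, raw = line.split(",")
        value = float(raw)
        i = key_index.setdefault(key, len(key_index))
        if i == len(counts):  # first time this key appears
            counts.append(0)
            means.append(0.0)
            m2s.append(0.0)
        # Welford's update: fold one value into the count, mean and m2.
        counts[i] += 1
        delta = value - means[i]
        means[i] += delta / counts[i]
        m2s[i] += delta * (value - means[i])
    return key_index, counts, means, m2s


# Assumed sample input; it does not need to be sorted for this approach.
sample = io.StringIO("01,20\n02,10\n01,22\n03,5\n02,15\n01,25\n03,7\n02,17\n")
key_index, counts, means, m2s = running_stats(sample)
for key, i in key_index.items():
    stdev = math.sqrt(m2s[i] / (counts[i] - 1)) if counts[i] > 1 else 0.0
    print("%s,%.1f,%.3f,%d" % (key, means[i], stdev, counts[i]))
```

The trade-off is that an exact median cannot be maintained this way, since it needs all of a key's values; count, mean, standard deviation, min and max can.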
