By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
440,172 Members | 727 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 440,172 IT Pros & Developers. It's quick & easy.

Regex for replacing missing value period

P: 21
I am parsing a csv data file with comma as the delimiter. The file mostly contains integers but also some decimal numbers/floats.

It has some missing values, which are marked by a period ("."). In order to insert into a database, I want to replace the missing value periods, but NOT the decimal periods.

With python's regex flavor and the re.sub construct applied to each line read as a string, can someone help what the pattern might be for this?

Best,
Aksel
Dec 20 '10 #1

✓ answered by bvdet

This is a CSV file, right? All data is separated by a comma or other known delimiter. This is the easiest way:
Expand|Select|Wrap|Line Numbers
  1. line = '.,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,.\n'
  2. lineList = line.split(',')
  3. output = []
  4. for item in lineList:
  5.     if item.strip() == ".":
  6.         output.append("***")
  7.     else:
  8.         output.append(item)
  9. print ",".join(output)
Output:
Expand|Select|Wrap|Line Numbers
  1. >>> ***,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,***,***,***,***,***,***,***,***,***,***,***,***,***,593,474,159,49,110,0,315,119,541772500,438522500,103250000,***,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,***,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,***
  2. >>> 

Share this Question
Share on Google+
17 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
There is no need to use regex for this.
Expand|Select|Wrap|Line Numbers
  1. >>> line = "12,15,45.1,.,16"
  2. >>> lineList = line.split(',')
  3. >>> output = []
  4. >>> for item in lineList:
  5. ...     if item == ".":
  6. ...         output.append(0.0)
  7. ...     else:
  8. ...         output.append(float(item))
  9. ...         
  10. >>> output
  11. [12.0, 15.0, 45.100000000000001, 0.0, 16.0]
If you must:
Expand|Select|Wrap|Line Numbers
  1. >>> import re
  2. >>> patt = re.compile(r', ?. ?,')
  3. >>> re.sub(patt,',0.0,',"12,15,45.1,.,16")
  4. '12,15,45.1,0.0,16'
  5. >>> re.sub(patt,',0.0,',"12,15,45.1, . ,16")
  6. '12,15,45.1,0.0,16'
  7. >>> 
Dec 20 '10 #2

Thekid
100+
P: 145
Wouldn't something like this work where you could replace the 'comma period comma' with just a 'comma':

Expand|Select|Wrap|Line Numbers
  1. >>> line='2.5,3,.,100,.,4.10,.,8'
  2. >>> line.replace(',.,',',')
  3. '2.5,3,100,4.10,8'
  4.  
Dec 20 '10 #3

P: 21
That's what I initially hoped, but it fails to replace all periods if several periods are adjacent because the last quote is seemingly used by the first match and skipped as the first quote in the next match. It accordingly skips every second period.
Expand|Select|Wrap|Line Numbers
  1. line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. line.replace(',.,',',')
  3. '2.5,3,100,4.10,8,.,8.9,.'
  4.  
Dec 20 '10 #4

P: 21
I like this one, but it doesn't like repeated period values, like so:
Expand|Select|Wrap|Line Numbers
  1. patt = re.compile(r', ?. ?,')
  2. line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  3. re.sub(patt,',0.0,',line)
  4. '2.5,0.0,.,100,0.0,4.10,0.0,8,0.0,.,0.0,8.9,.'
  5.  
Dec 20 '10 #5

bvdet
Expert Mod 2.5K+
P: 2,851
I was a bit hasty with my previous post. See if this works:
Expand|Select|Wrap|Line Numbers
  1. >>> patt = re.compile(r', ?[.] ?')
  2. >>> line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  3. >>> re.sub(patt,',0.0',line)
  4. '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
  5. >>> 
Dec 20 '10 #6

P: 9
It looks like all your floats are "0.#" format. Why not try to replace just the single value ".,". This way the search is looking the actual value that needs replacing. And although my python isn't perfect you get the idea with...

Expand|Select|Wrap|Line Numbers
  1. # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # line.replace('.,','0.0,')
  3. # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
This way you don't loose the field and mess the database up completely. If you don't need the "0.0" you can write it in to the code that it doesn't display.
Dec 20 '10 #7

P: 21
The option of searching for the literal is at one level compelling
Expand|Select|Wrap|Line Numbers
  1. # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # line.replace('.,','0.0,')
  3. # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
  4.  
But this doesn't work if the last item is a period and not a number as the replacement looks for a trailing comma. This could of course be handled separately, but I was hoping for a both general and clean solution to this issue.

Also, what is the "best" way of dealing with the carriage return in the last item of each line when iterating over lines in a file? Replacing \n with blank space?
Dec 20 '10 #8

Thekid
100+
P: 145
I was going to suggest just replacing the 'period comma' with a blank space which works except if the last value is a period:

Expand|Select|Wrap|Line Numbers
  1. # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # line.replace('.,','')
  3. # '2.5,3,100,4.10,8,8.9,.'
  4.  
Dec 20 '10 #9

P: 9
You could switch the comma placement from "0.0," to ",0.0".
Expand|Select|Wrap|Line Numbers
  1. # line.replace(',.',',0.0')
And generally the new line at the end of the is to keep the format of the table or database.

If I split a database in to fields. Every field needs to have some value thus the ".".

AA|AB|AC|AD
BA|BB|BC|BD
CA|CB|CC|CD
DA|DB|DC|DD

If I have a 4 field database and I kill the \n it becomes a run-on and then the parsing engine looses everything beyond it's initial 4 fields.

AA|AB|AC|AD|BA|BB|BC|BD|CA|CB|CC|CD|DA|DB|DC|DD

to the engine looks like AA|AB|AC|AD

Now If you just want a longlist of numbers and dont care what the numbers mean you can in most languages just replace the "\n".

Expand|Select|Wrap|Line Numbers
  1. # # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.\n2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # # line.replace(',.',',0.0')
  3. # # line.replace('\n',',')
  4. # # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0,2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
  5.  
Dec 20 '10 #10

Thekid
100+
P: 145
I think another issue will be if a value is a fraction like:
line='.,.45,.25,.,100,7.5,.'
Dec 20 '10 #11

P: 21
Last column values are often 'period'. How about a pattern like
Expand|Select|Wrap|Line Numbers
  1. #line='ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n'
  2. #patt = re.compile(r'\D\.\D')
  3. #'ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,0.0,.,0.0,.,0.0,.,0.0,.,0.0,.,0.0,.,0.0,593,474,159,49,110,0,315,119,541772500,438522500,103250000,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n'
  4.  
  5.  
It doesn't give me the desired result, but conceptually straightforward: return a match whereever a period exists if surrounded by non-digits. Repeat periods cause grief here.
Dec 20 '10 #12

P: 9
From the code that you were showing at the beginning all your fractions were in were in a "0.#" format. But if that is an issue try...

Expand|Select|Wrap|Line Numbers
  1. line.replace(',.',',0.')
  2. line.replace('0.,',',0.0,')
  3. line.replace('\n',',')
( On a side note I think that the regex for the find would be ",\.," or ",[\.],". And to keep out the numbers and letters, ",(*[^0-9a-zA-Z\s][^\.]*[^0-9a-zA-Z\s])," but my regex is not perfect.)
Dec 20 '10 #13

Thekid
100+
P: 145
I don't have python on this particular computer but how about:
Expand|Select|Wrap|Line Numbers
  1. line='ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n' 
  2.  
  3. newline1=line.replace('.,','')
  4. newline2=newline1.replace(',.\n','') #incase last value is a period
  5.  
...or you could make your replacement patterns contain '.,',,.\n','\n'
Dec 20 '10 #14

bvdet
Expert Mod 2.5K+
P: 2,851
This is a CSV file, right? All data is separated by a comma or other known delimiter. This is the easiest way:
Expand|Select|Wrap|Line Numbers
  1. line = '.,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,.\n'
  2. lineList = line.split(',')
  3. output = []
  4. for item in lineList:
  5.     if item.strip() == ".":
  6.         output.append("***")
  7.     else:
  8.         output.append(item)
  9. print ",".join(output)
Output:
Expand|Select|Wrap|Line Numbers
  1. >>> ***,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,***,***,***,***,***,***,***,***,***,***,***,***,***,593,474,159,49,110,0,315,119,541772500,438522500,103250000,***,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,***,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,***
  2. >>> 
Dec 20 '10 #15

P: 9
Expand|Select|Wrap|Line Numbers
  1. # newline1=line.replace('.,','')
  2. # newline2=newline1.replace(',.\n','') #incase last value is a period
This code works. The only problem is a float that is an int... e.g in C# and C++ you values like 100f == (float)100 == 100.0, which could be in your db as ",100.,".

Your second replace deletes the field and mashes the rows together screwing the db. Try instead....
Expand|Select|Wrap|Line Numbers
  1. # newline1=line.replace('.,','')
  2. # newline2=newline1.replace(',.\n',',0.0\n') #incase last value is a period
(I don't know about python but in c, c++ and c# you can chain code together like... "temp=line.replace('.,','').replace(',.\n',',0.0\n ')". To me its easier to think this way.)
Dec 20 '10 #16

P: 21
Thanks all for your help. I must say that it makes sense to just use the built-in split-to-list feature of strings and check each element. It seems pretty safe. However, it takes a long time to process a large dataset and check each value to see if it is a period, in this case the census numbers for the state of California. But efficiency doesn't matter that much in this case as releases are rare.
Dec 21 '10 #17

Expert 100+
P: 621
it takes a long time to process a large dataset and check each value to see if it is a period
Only do that if it is necessary, i.e. if you are summing one field of the list for example, check for a "." and ignore (since it is zero). When you are already iterating through the list it takes an insignificant amount of time since it is one additional if() statement. You do not have to change every field, just check the ones you are using.
Dec 22 '10 #18

Post your reply

Sign in to post your reply or Sign up for a free account.