473,382 Members | 1,373 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,382 software developers and data experts.

Regex for replacing missing value period

21
I am parsing a csv data file with comma as the delimiter. The file mostly contains integers but also some decimal numbers/floats.

It has some missing values, which are marked by a period ("."). In order to insert into a database, I want to replace the missing value periods, but NOT the decimal periods.

With python's regex flavor and the re.sub construct applied to each line read as a string, can someone help what the pattern might be for this?

Best,
Aksel
Dec 20 '10 #1

✓ answered by bvdet

This is a CSV file, right? All data is separated by a comma or other known delimiter. This is the easiest way:
Expand|Select|Wrap|Line Numbers
  1. line = '.,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,.\n'
  2. lineList = line.split(',')
  3. output = []
  4. for item in lineList:
  5.     if item.strip() == ".":
  6.         output.append("***")
  7.     else:
  8.         output.append(item)
  9. print ",".join(output)
Output:
Expand|Select|Wrap|Line Numbers
  1. >>> ***,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,***,***,***,***,***,***,***,***,***,***,***,***,***,593,474,159,49,110,0,315,119,541772500,438522500,103250000,***,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,***,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,***
  2. >>> 

17 2349
bvdet
2,851 Expert Mod 2GB
There is no need to use regex for this.
Expand|Select|Wrap|Line Numbers
  1. >>> line = "12,15,45.1,.,16"
  2. >>> lineList = line.split(',')
  3. >>> output = []
  4. >>> for item in lineList:
  5. ...     if item == ".":
  6. ...         output.append(0.0)
  7. ...     else:
  8. ...         output.append(float(item))
  9. ...         
  10. >>> output
  11. [12.0, 15.0, 45.100000000000001, 0.0, 16.0]
If you must:
Expand|Select|Wrap|Line Numbers
  1. >>> import re
  2. >>> patt = re.compile(r', ?. ?,')
  3. >>> re.sub(patt,',0.0,',"12,15,45.1,.,16")
  4. '12,15,45.1,0.0,16'
  5. >>> re.sub(patt,',0.0,',"12,15,45.1, . ,16")
  6. '12,15,45.1,0.0,16'
  7. >>> 
Dec 20 '10 #2
Thekid
145 100+
Wouldn't something like this work where you could replace the 'comma period comma' with just a 'comma':

Expand|Select|Wrap|Line Numbers
  1. >>> line='2.5,3,.,100,.,4.10,.,8'
  2. >>> line.replace(',.,',',')
  3. '2.5,3,100,4.10,8'
  4.  
Dec 20 '10 #3
akselo
21
That's what I initially hoped, but it fails to replace all periods if several periods are adjacent because the last quote is seemingly used by the first match and skipped as the first quote in the next match. It accordingly skips every second period.
Expand|Select|Wrap|Line Numbers
  1. line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. line.replace(',.,',',')
  3. '2.5,3,100,4.10,8,.,8.9,.'
  4.  
Dec 20 '10 #4
akselo
21
I like this one, but it doesn't like repeated period values, like so:
Expand|Select|Wrap|Line Numbers
  1. patt = re.compile(r', ?. ?,')
  2. line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  3. re.sub(patt,',0.0,',line)
  4. '2.5,0.0,.,100,0.0,4.10,0.0,8,0.0,.,0.0,8.9,.'
  5.  
Dec 20 '10 #5
bvdet
2,851 Expert Mod 2GB
I was a bit hasty with my previous post. See if this works:
Expand|Select|Wrap|Line Numbers
  1. >>> patt = re.compile(r', ?[.] ?')
  2. >>> line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  3. >>> re.sub(patt,',0.0',line)
  4. '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
  5. >>> 
Dec 20 '10 #6
It looks like all your floats are "0.#" format. Why not try to replace just the single value ".,". This way the search is looking the actual value that needs replacing. And although my python isn't perfect you get the idea with...

Expand|Select|Wrap|Line Numbers
  1. # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # line.replace('.,','0.0,')
  3. # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
This way you don't loose the field and mess the database up completely. If you don't need the "0.0" you can write it in to the code that it doesn't display.
Dec 20 '10 #7
akselo
21
The option of searching for the literal is at one level compelling
Expand|Select|Wrap|Line Numbers
  1. # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # line.replace('.,','0.0,')
  3. # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
  4.  
But this doesn't work if the last item is a period and not a number as the replacement looks for a trailing comma. This could of course be handled separately, but I was hoping for a both general and clean solution to this issue.

Also, what is the "best" way of dealing with the carriage return in the last item of each line when iterating over lines in a file? Replacing \n with blank space?
Dec 20 '10 #8
Thekid
145 100+
I was going to suggest just replacing the 'period comma' with a blank space which works except if the last value is a period:

Expand|Select|Wrap|Line Numbers
  1. # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # line.replace('.,','')
  3. # '2.5,3,100,4.10,8,8.9,.'
  4.  
Dec 20 '10 #9
You could switch the comma placement from "0.0," to ",0.0".
Expand|Select|Wrap|Line Numbers
  1. # line.replace(',.',',0.0')
And generally the new line at the end of the is to keep the format of the table or database.

If I split a database in to fields. Every field needs to have some value thus the ".".

AA|AB|AC|AD
BA|BB|BC|BD
CA|CB|CC|CD
DA|DB|DC|DD

If I have a 4 field database and I kill the \n it becomes a run-on and then the parsing engine looses everything beyond it's initial 4 fields.

AA|AB|AC|AD|BA|BB|BC|BD|CA|CB|CC|CD|DA|DB|DC|DD

to the engine looks like AA|AB|AC|AD

Now If you just want a longlist of numbers and dont care what the numbers mean you can in most languages just replace the "\n".

Expand|Select|Wrap|Line Numbers
  1. # # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.\n2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
  2. # # line.replace(',.',',0.0')
  3. # # line.replace('\n',',')
  4. # # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0,2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
  5.  
Dec 20 '10 #10
Thekid
145 100+
I think another issue will be if a value is a fraction like:
line='.,.45,.25,.,100,7.5,.'
Dec 20 '10 #11
akselo
21
Last column values are often 'period'. How about a pattern like
Expand|Select|Wrap|Line Numbers
  1. #line='ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n'
  2. #patt = re.compile(r'\D\.\D')
  3. #'ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,0.0,.,0.0,.,0.0,.,0.0,.,0.0,.,0.0,.,0.0,593,474,159,49,110,0,315,119,541772500,438522500,103250000,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n'
  4.  
  5.  
It doesn't give me the desired result, but conceptually straightforward: return a match whereever a period exists if surrounded by non-digits. Repeat periods cause grief here.
Dec 20 '10 #12
From the code that you were showing at the beginning all your fractions were in were in a "0.#" format. But if that is an issue try...

Expand|Select|Wrap|Line Numbers
  1. line.replace(',.',',0.')
  2. line.replace('0.,',',0.0,')
  3. line.replace('\n',',')
( On a side note I think that the regex for the find would be ",\.," or ",[\.],". And to keep out the numbers and letters, ",(*[^0-9a-zA-Z\s][^\.]*[^0-9a-zA-Z\s])," but my regex is not perfect.)
Dec 20 '10 #13
Thekid
145 100+
I don't have python on this particular computer but how about:
Expand|Select|Wrap|Line Numbers
  1. line='ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n' 
  2.  
  3. newline1=line.replace('.,','')
  4. newline2=newline1.replace(',.\n','') #incase last value is a period
  5.  
...or you could make your replacement patterns contain '.,',,.\n','\n'
Dec 20 '10 #14
bvdet
2,851 Expert Mod 2GB
This is a CSV file, right? All data is separated by a comma or other known delimiter. This is the easiest way:
Expand|Select|Wrap|Line Numbers
  1. line = '.,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,.\n'
  2. lineList = line.split(',')
  3. output = []
  4. for item in lineList:
  5.     if item.strip() == ".":
  6.         output.append("***")
  7.     else:
  8.         output.append(item)
  9. print ",".join(output)
Output:
Expand|Select|Wrap|Line Numbers
  1. >>> ***,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,***,***,***,***,***,***,***,***,***,***,***,***,***,593,474,159,49,110,0,315,119,541772500,438522500,103250000,***,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,***,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,***
  2. >>> 
Dec 20 '10 #15
Expand|Select|Wrap|Line Numbers
  1. # newline1=line.replace('.,','')
  2. # newline2=newline1.replace(',.\n','') #incase last value is a period
This code works. The only problem is a float that is an int... e.g in C# and C++ you values like 100f == (float)100 == 100.0, which could be in your db as ",100.,".

Your second replace deletes the field and mashes the rows together screwing the db. Try instead....
Expand|Select|Wrap|Line Numbers
  1. # newline1=line.replace('.,','')
  2. # newline2=newline1.replace(',.\n',',0.0\n') #incase last value is a period
(I don't know about python but in c, c++ and c# you can chain code together like... "temp=line.replace('.,','').replace(',.\n',',0.0\n ')". To me its easier to think this way.)
Dec 20 '10 #16
akselo
21
Thanks all for your help. I must say that it makes sense to just use the built-in split-to-list feature of strings and check each element. It seems pretty safe. However, it takes a long time to process a large dataset and check each value to see if it is a period, in this case the census numbers for the state of California. But efficiency doesn't matter that much in this case as releases are rare.
Dec 21 '10 #17
dwblas
626 Expert 512MB
it takes a long time to process a large dataset and check each value to see if it is a period
Only do that if it is necessary, i.e. if you are summing one field of the list for example, check for a "." and ignore (since it is zero). When you are already iterating through the list it takes an insignificant amount of time since it is one additional if() statement. You do not have to change every field, just check the ones you are using.
Dec 22 '10 #18

Sign in to post your reply or Sign up for a free account.

Similar topics

2
by: DN | last post by:
Is it possible to use Missing.Value in C#, or is that just a VB concept? I'm using Persits ASPEmail component in an ASP.NET page, and their sample code shows something like this: <%@ Page...
0
by: darrel | last post by:
I'm new to .net's regex capabilities. I've made a regex: dim r as new regex( _ "(?<tagOpening>(<))" & _ "(?<theTag>(" & tagToFind & "))" & _ "(?<anything>(.*))" & _ "(?<theAttribute>(" &...
1
by: amraam35 | last post by:
Here is the problem. I have made a log in script that prompts you to log in with a username and password as the "splash screen" It is an unbounded forum that checks that the username and password...
1
by: dotnetnoob | last post by:
i need to copy several xml files and use it as xml element and insert them into a xml file. i use DOM support in .Net and from the pervious help i was able to copy and insert them into the xml file...
2
by: BF | last post by:
Hi, I have some tables where I import data in, lots of field have gotten a NULL value which the application can not handle. Now can I replace each NULL value with '' in a columns with: update...
4
by: Trev | last post by:
Hi everyone, I'm having some problems using Regex; I have a long string that is delimetered in a random fashion by a combination of spaces and \n's for newlines. I have five possiblities: //...
0
by: Jason | last post by:
The following code removes a particular item from my drop down. Instead, I would like to replace it's value with another string. How can I do that in vb.net? Dim ddl =...
2
by: silversubey | last post by:
Hello, I have two tables: CltDue, Checklist SELECT CltDue.CDClientname, CltDue.CDID FROM CltDue WHERE CLtdue.CDstatus = 'More Client info reqst' shows the following:
4
by: shapper | last post by:
Hello, I have a control with the following property: Private _Sections As New List(Of FormSection) Public Property Sections() As List(Of FormSection) Get If _Sections Is Nothing Then...
0
by: isisqiero | last post by:
I use a list of names in a Word document as the value of SELECT statement's parameter. However, my datagridview shows only the column names without the data I am hoping to see. Public Class Form1 ...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.