Hi everybody,
Does anyone know if the adict.has_key(k) command can be used to match a string against a dictionary key? I'm trying to append a value from my dictionary to a string when it is found.
String example:
browser details Musmusculuslet-7g 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details Musmusculuslet-7i 21 1 21 21 100.0% 5 + 50605174 50605194 21
Dictionary example:
'Musmusculuslet-7g': 'UGAGGUAGUAGUUUGUACAGU'
'Musmusculuslet-7i': 'UGAGGUAGUAGUUUGUGCUGU'
What I want:
browser details Musmusculuslet-7g UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
etc.
Thanks,
Mark
I'm not sure if this is exactly what you need:

    import re

    patt = re.compile("Musmusculuslet-..")

    teststr = "browser details Musmusculuslet-7g 21 1 21 21 100.0% 22 + 46884872 46884892 21"

    match = patt.findall(teststr)

    if match:
        if adict.has_key(match[0]):
            # Position just past the end of the matched key
            end = teststr.index(match[0]) + len(match[0])
            # Insert the dictionary value, preceded by a space, right after the key
            finalstring = "%s %s%s" % (teststr[:end], adict[match[0]], teststr[end:])
bvdet (Expert Mod):
Following are a couple of ways:

    print dd

    import re

    s1 = "browser details Musmusculuslet-7g 21 1 21 21 100.0% 22 + 46884872 46884892 21\nbrowser details Musmusculuslet-7i 21 1 21 21 100.0% 5 + 50605174 50605194 21"
    patt = re.compile(r'Musmusculuslet-[0-9a-z]+|MusmusculusmiR-\d+')

    # First way: find candidate names with a regex, then look each one up
    strList = patt.findall(s1)
    s2 = s1
    for item in strList:
        if dd.has_key(item):
            s2 = s2.replace(item, '%s %s' % (item, dd[item]))

    print s2

    print

    # Second way: test every dictionary key against the string
    s3 = s1
    for key in dd:
        if key in s3:
            s3 = s3.replace(key, '%s %s' % (key, dd[key]))

    print s3
Output:
>>> {'MusmusculusmiR-1': 'UGGAAUGUAAAGAAGUAUGUA', 'Musmusculuslet-7i': 'UGAGGUAGUAGUUUGUGCUGU', 'Musmusculuslet-7g': 'UGAGGUAGUAGUUUGUACAGU'}
browser details Musmusculuslet-7g UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details Musmusculuslet-7i UGAGGUAGUAGUUUGUGCUGU 21 1 21 21 100.0% 5 + 50605174 50605194 21
browser details Musmusculuslet-7g UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details Musmusculuslet-7i UGAGGUAGUAGUUUGUGCUGU 21 1 21 21 100.0% 5 + 50605174 50605194 21
>>>
Thanks for the help, ilikepython and bvdet. I'm running into only one problem: I am getting multiple matches for certain strings. For example, the key MusmusculusmiR-1 also matches MusmusculusmiR-146b, so I get the following output:
browser details MusmusculusmiR-1 UGGAAUGUAAAGAAGUAUGUA46b UGAGAACUGAAUUCCAUAGGCU 22 1 22 22 100.0% 26 - 20924724 20924745 22
when the original string is:
browser details MusmusculusmiR-146b 22 1 22 22 100.0% 26 - 20924724 20924745 22
Is there a way to prevent this?
My full list of keys is:
['MusmusculusmiR-106a', 'MusmusculusmiR-433-3p', 'MusmusculusmiR-126-5p', 'MusmusculusmiR-106b', 'MusmusculusmiR-216a', 'MusmusculusmiR-324-5p', 'MusmusculusmiR-762', 'MusmusculusmiR-7121', 'MusmusculusmiR-760', 'MusmusculusmiR-200b', 'MusmusculusmiR-200c', 'MusmusculusmiR-200a', 'MusmusculusmiR-241', 'MusmusculusmiR-30a-5p', 'MusmusculusmiR-802', 'MusmusculusmiR-801', 'MusmusculusmiR-805', 'MusmusculusmiR-804', 'MusmusculusmiR-216b', 'MusmusculusmiR-667', 'MusmusculusmiR-666', 'MusmusculusmiR-665', 'MusmusculusmiR-741', 'MusmusculusmiR-742', 'MusmusculusmiR-668', 'MusmusculusmiR-744', 'MusmusculusmiR-1401', 'MusmusculusmiR-34a', 'MusmusculusmiR-34b', 'MusmusculusmiR-34c', 'MusmusculusmiR-592', 'MusmusculusmiR-455-5p', 'MusmusculusmiR-698', 'MusmusculusmiR-376a1', 'MusmusculusmiR-344', 'MusmusculusmiR-697', 'MusmusculusmiR-694', 'MusmusculusmiR-695', 'MusmusculusmiR-340', 'MusmusculusmiR-341', 'MusmusculusmiR-342', 'MusmusculusmiR-691', 'MusmusculusmiR-542-5p', 'MusmusculusmiR-764-5p', 'MusmusculusmiR-122a', 'MusmusculusmiR-142-5p', 'MusmusculusmiR-449', 'MusmusculusmiR-448', 'MusmusculusmiR-23a', 'MusmusculusmiR-23b', 'MusmusculusmiR-6741', 'MusmusculusmiR-135b', 'MusmusculusmiR-135a', 'MusmusculusmiR-301b', 'MusmusculusmiR-129-5p', 'MusmusculusmiR-30b', 'MusmusculusmiR-30c', 'MusmusculusmiR-30d', 'MusmusculusmiR-30e', 'MusmusculusmiR-292-3p', 'MusmusculusmiR-713', 'MusmusculusmiR-499', 'MusmusculusmiR-711', 'MusmusculusmiR-710', 'MusmusculusmiR-717', 'MusmusculusmiR-715', 'MusmusculusmiR-714', 'MusmusculusmiR-490', 'MusmusculusmiR-491', 'MusmusculusmiR-719', 'MusmusculusmiR-718', 'MusmusculusmiR-494', 'MusmusculusmiR-495', 'MusmusculusmiR-496', 'MusmusculusmiR-497', 'MusmusculusmiR-297b', 'MusmusculusmiR-485-5p', 'MusmusculusmiR-300', 'MusmusculusmiR-301', 'MusmusculusmiR-302', 'MusmusculusmiR-422b', 'MusmusculusmiR-33', 'MusmusculusmiR-32', 'MusmusculusmiR-31', 'MusmusculusmiR-181d', 'MusmusculusmiR-27a', 'MusmusculusmiR-27b', 'MusmusculusmiR-450b1', 
'MusmusculusmiR-551b', 'MusmusculusmiR-302b1', 'MusmusculusmiR-155', 'MusmusculusmiR-154', 'MusmusculusmiR-151', 'MusmusculusmiR-150', 'MusmusculusmiR-153', 'MusmusculusmiR-152', 'MusmusculusmiR-409', 'MusmusculusmiR-470', 'MusmusculusmiR-471', 'MusmusculusmiR-15a', 'MusmusculusmiR-15b', 'MusmusculusmiR-675-3p', 'MusmusculusmiR-712', 'MusmusculusmiR-199a', 'MusmusculusmiR-199b', 'MusmusculusmiR-148b', 'MusmusculusmiR-148a', 'MusmusculusmiR-615', 'MusmusculusmiR-759', 'MusmusculusmiR-758', 'MusmusculusmiR-30e1', 'MusmusculusmiR-374-3p', 'MusmusculusmiR-291a-5p', 'MusmusculusmiR-488', 'MusmusculusmiR-689', 'MusmusculusmiR-688', 'MusmusculusmiR-685', 'MusmusculusmiR-684', 'MusmusculusmiR-687', 'MusmusculusmiR-686', 'MusmusculusmiR-681', 'MusmusculusmiR-680', 'MusmusculusmiR-683', 'MusmusculusmiR-682', 'MusmusculusmiR-351', 'MusmusculusmiR-350', 'MusmusculusmiR-720', 'MusmusculusmiR-721', 'MusmusculusmiR-4671', 'MusmusculusmiR-181a1', 'MusmusculusmiR-7b', 'MusmusculusmiR-130a', 'MusmusculusmiR-130b', 'MusmusculusmiR-4881', 'MusmusculusmiR-380-5p', 'MusmusculusmiR-127', 'MusmusculusmiR-467b', 'MusmusculusmiR-467a', 'MusmusculusmiR-431', 'MusmusculusmiR-291b-5p', 'MusmusculusmiR-532', 'MusmusculusmiR-539', 'MusmusculusmiR-128a', 'MusmusculusmiR-128b', 'MusmusculusmiR-543', 'MusmusculusmiR-540', 'MusmusculusmiR-542-3p', 'MusmusculusmiR-546', 'MusmusculusmiR-547', 'MusmusculusmiR-223', 'MusmusculusmiR-222', 'MusmusculusmiR-693-5p', 'MusmusculusmiR-224', 'MusmusculusmiR-91', 'MusmusculusmiR-93', 'MusmusculusmiR-92', 'MusmusculusmiR-96', 'MusmusculusmiR-98', 'MusmusculusmiR-99b', 'MusmusculusmiR-17-5p', 'MusmusculusmiR-434-3p', 'MusmusculusmiR-770-3p', 'MusmusculusmiR-763', 'MusmusculusmiR-489', 'MusmusculusmiR-761', 'MusmusculusmiR-486', 'MusmusculusmiR-484', 'MusmusculusmiR-483', 'MusmusculusmiR-652', 'MusmusculusmiR-21', 'MusmusculusmiR-22', 'MusmusculusmiR-24', 'MusmusculusmiR-25', 'MusmusculusmiR-146b', 'MusmusculusmiR-28', 'MusmusculusmiR-362', 'MusmusculusmiR-363', 
'MusmusculusmiR-361', 'MusmusculusmiR-367', 'MusmusculusmiR-365', 'MusmusculusmiR-302c1', 'MusmusculusmiR-692', 'MusmusculusmiR-182', 'MusmusculusmiR-183', 'MusmusculusmiR-186', 'MusmusculusmiR-187', 'MusmusculusmiR-184', 'MusmusculusmiR-185', 'MusmusculusmiR-324-3p', 'MusmusculusmiR-188', 'MusmusculusmiR-124a', 'MusmusculusmiR-463', 'MusmusculusmiR-464', 'MusmusculusmiR-466', 'MusmusculusmiR-469', 'MusmusculusmiR-468', 'MusmusculusmiR-505', 'MusmusculusmiR-503', 'MusmusculusmiR-500', 'MusmusculusmiR-501', 'MusmusculusmiR-212', 'MusmusculusmiR-210', 'MusmusculusmiR-211', 'MusmusculusmiR-26b', 'MusmusculusmiR-26a', 'MusmusculusmiR-215', 'MusmusculusmiR-218', 'MusmusculusmiR-219', 'MusmusculusmiR-465-3p', 'MusmusculusmiR-376a', 'MusmusculusmiR-376b', 'MusmusculusmiR-376c', 'MusmusculusmiR-369-5p', 'MusmusculusmiR-133a', 'MusmusculusmiR-133b', 'MusmusculusmiR-6761', 'MusmusculusmiR-9', 'MusmusculusmiR-129-3p', 'MusmusculusmiR-1', 'MusmusculusmiR-7', 'MusmusculusmiR-675-5p', 'MusmusculusmiR-101a', 'MusmusculusmiR-101b', 'MusmusculusmiR-217', 'MusmusculusmiR-214', 'MusmusculusmiR-699', 'MusmusculusmiR-326', 'MusmusculusmiR-696', 'MusmusculusmiR-325', 'MusmusculusmiR-322', 'MusmusculusmiR-323', 'MusmusculusmiR-320', 'MusmusculusmiR-345', 'MusmusculusmiR-346', 'MusmusculusmiR-328', 'MusmusculusmiR-329', 'MusmusculusmiR-18', 'MusmusculusmiR-764-3p', 'MusmusculusmiR-16', 'MusmusculusmiR-690', 'MusmusculusmiR-429', 'MusmusculusmiR-425', 'MusmusculusmiR-424', 'MusmusculusmiR-423', 'MusmusculusmiR-132', 'MusmusculusmiR-137', 'MusmusculusmiR-136', 'MusmusculusmiR-134', 'MusmusculusmiR-139', 'MusmusculusmiR-138', 'MusmusculusmiR-30a-3p', 'MusmusculusmiR-541', 'MusmusculusmiR-199a1', 'MusmusculusmiR-291b-3p', 'MusmusculusmiR-221', 'MusmusculusmiR-292-5p', 'MusmusculusmiR-450b', 'MusmusculusmiR-455-3p', 'MusmusculusmiR-181b', 'MusmusculusmiR-708', 'MusmusculusmiR-709', 'MusmusculusmiR-704', 'MusmusculusmiR-705', 'MusmusculusmiR-376b1', 'MusmusculusmiR-706', 
'MusmusculusmiR-291a-3p', 'MusmusculusmiR-700', 'MusmusculusmiR-701', 'MusmusculusmiR-485-3p', 'MusmusculusmiR-678', 'MusmusculusmiR-679', 'MusmusculusmiR-674', 'MusmusculusmiR-676', 'MusmusculusmiR-677', 'MusmusculusmiR-670', 'MusmusculusmiR-671', 'MusmusculusmiR-672', 'MusmusculusmiR-673', 'MusmusculusmiR-19a', 'MusmusculusmiR-19b', 'MusmusculusmiR-379', 'MusmusculusmiR-378', 'MusmusculusmiR-29b', 'MusmusculusmiR-370', 'MusmusculusmiR-29a', 'MusmusculusmiR-375', 'MusmusculusmiR-377', 'MusmusculusmiR-10b', 'MusmusculusmiR-10a', 'MusmusculusmiR-487b', 'MusmusculusmiR-702', 'MusmusculusmiR-191', 'MusmusculusmiR-190', 'MusmusculusmiR-193', 'MusmusculusmiR-192', 'MusmusculusmiR-195', 'MusmusculusmiR-194', 'MusmusculusmiR-380-3p', 'MusmusculusmiR-450', 'MusmusculusmiR-451', 'MusmusculusmiR-452', 'MusmusculusmiR-126-3p', 'MusmusculusmiR-103', 'MusmusculusmiR-100', 'MusmusculusmiR-107', 'MusmusculusmiR-133a1', 'MusmusculusmiR-298', 'MusmusculusmiR-299', 'MusmusculusmiR-293', 'MusmusculusmiR-290', 'MusmusculusmiR-296', 'MusmusculusmiR-297', 'MusmusculusmiR-294', 'MusmusculusmiR-295', 'MusmusculusmiR-743', 'MusmusculusmiR-201', 'MusmusculusmiR-203', 'MusmusculusmiR-202', 'MusmusculusmiR-205', 'MusmusculusmiR-204', 'MusmusculusmiR-207', 'MusmusculusmiR-206', 'MusmusculusmiR-208', 'MusmusculusmiR-433-5p', 'MusmusculusmiR-693-3p', 'Musmusculuslet-7d1', 'MusmusculusmiR-125b', 'MusmusculusmiR-125a', 'MusmusculusmiR-381', 'MusmusculusmiR-99a', 'MusmusculusmiR-434-5p', 'MusmusculusmiR-17-3p', 'MusmusculusmiR-5011', 'MusmusculusmiR-374-5p', 'MusmusculusmiR-465-5p', 'MusmusculusmiR-142-3p', 'MusmusculusmiR-20a', 'MusmusculusmiR-20b', 'MusmusculusmiR-146', 'MusmusculusmiR-144', 'MusmusculusmiR-335', 'MusmusculusmiR-181a', 'MusmusculusmiR-337', 'MusmusculusmiR-181c', 'MusmusculusmiR-331', 'MusmusculusmiR-330', 'MusmusculusmiR-669c', 'MusmusculusmiR-669b', 'MusmusculusmiR-669a', 'MusmusculusmiR-707', 'MusmusculusmiR-339', 'MusmusculusmiR-338', 'MusmusculusmiR-369-3p', 
'MusmusculusmiR-703', 'MusmusculusmiR-302c', 'MusmusculusmiR-302b', 'MusmusculusmiR-141', 'MusmusculusmiR-302d', 'Musmusculuslet-7b', 'Musmusculuslet-7c', 'Musmusculuslet-7a', 'Musmusculuslet-7f', 'Musmusculuslet-7g', 'Musmusculuslet-7d', 'Musmusculuslet-7e', 'Musmusculuslet-7i', 'MusmusculusmiR-449b', 'MusmusculusmiR-382', 'MusmusculusmiR-383', 'MusmusculusmiR-384', 'MusmusculusmiR-410', 'MusmusculusmiR-411', 'MusmusculusmiR-412', 'MusmusculusmiR-145', 'MusmusculusmiR-143', 'MusmusculusmiR-140', 'MusmusculusmiR-29c', 'MusmusculusmiR-196a', 'MusmusculusmiR-196b', 'MusmusculusmiR-149'
thanks,
Mark
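The mangled line reported above can be reproduced in a few lines. This is only a sketch, written for Python 3 (an assumption; the thread itself uses Python 2), and the two-entry dictionary is a cut-down stand-in for the full one. It uses the same substring test and `replace` call as the second approach posted earlier. The longer key is deliberately inserted first, because the result depends on the order in which the keys are visited, and that order was arbitrary for Python 2 dictionaries.

```python
# Cut-down stand-in for the real dictionary (values taken from the thread).
# Python 3.7+ iterates dicts in insertion order, so the longer key is seen first.
dd = {
    "MusmusculusmiR-146b": "UGAGAACUGAAUUCCAUAGGCU",
    "MusmusculusmiR-1": "UGGAAUGUAAAGAAGUAUGUA",
}
line = "browser details MusmusculusmiR-146b 22 1 22 22 100.0% 26 - 20924724 20924745 22"

mangled = line
for key in dd:
    # Substring test: "MusmusculusmiR-1" is also found inside "MusmusculusmiR-146b"
    if key in mangled:
        mangled = mangled.replace(key, "%s %s" % (key, dd[key]))

print(mangled)
```

The first pass annotates the line correctly; the second pass then finds the shorter key inside the longer name and splits it, which gives the output quoted in the post.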
This is similar to Bv's second way:

    teststr = "browser details MusmusculusmiR-146b 22 1 22 22 100.0% 26 - 20924724 20924745 22"

    words = teststr.split()

    key = words[2]  # will the key always be the third word?
    if key in adict:
        finalstring = teststr.replace(key, "%s %s" % (key, adict[key]))

If the key is not always the third word, you could check every word, provided there is only one key per string.
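The "check every word" idea could look roughly like the sketch below. It is written for Python 3 (an assumption; the thread uses Python 2), and the function name `annotate_line` and the parameter name `sequences` are made up for illustration. Because each whole word is looked up as a dictionary key, a short key such as MusmusculusmiR-1 cannot match inside a longer name.

```python
def annotate_line(line, sequences):
    """Append the sequence after the first word that is an exact dictionary key."""
    words = line.split()
    for i, word in enumerate(words):
        if word in sequences:  # exact, whole-word lookup; no substring matching
            words[i] = "%s %s" % (word, sequences[word])
            break  # assumes at most one key per line
    # Note: joining with single spaces normalizes any runs of whitespace
    return " ".join(words)


sequences = {
    "MusmusculusmiR-1": "UGGAAUGUAAAGAAGUAUGUA",
    "MusmusculusmiR-146b": "UGAGAACUGAAUUCCAUAGGCU",
}
teststr = "browser details MusmusculusmiR-146b 22 1 22 22 100.0% 26 - 20924724 20924745 22"
print(annotate_line(teststr, sequences))
```

Lines that contain no key come back unchanged apart from whitespace normalization.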
I tried your suggestion but received the same result. Is there a statement I could write that checks each line for a capital A, T, C, or G? If I could put that into an 'if' statement, then maybe it wouldn't re-format a line that has already been formatted. Of course, then there would be the problem of whether it replaced it with Mus..R-1 or with Mus..R-106a, etc. Is there an order, or is it random because I am using a dictionary?
Mark
I'm not really sure what you mean. Are you checking each string more than once? Every time you finish formatting a string, you can append it to a list, and the next time, if it is in the list, don't format it. I don't think you should have a problem with matching the wrong key. Could you post the code you used?
bvdet (Expert Mod):
Try this regex solution to see if it works for you. The \b in the pattern matches the empty string at the beginning or end of a word. The string is then split on the space character, so the replacement should happen only on a full match:

    print dd

    import re

    s1 = "browser details Musmusculuslet-7g 21 1 21 21 100.0% 22 + 46884872 46884892 21\nbrowser details Musmusculuslet-7i 21 1 21 21 100.0% 5 + 50605174 50605194 21\nbrowser details MusmusculusmiR-314-5p 21 1 21 21 100.0% 22 + 46884872 46884892 21\nbrowser details MusmusculusmiR-31 21 1 21 21 100.0% 22 + 46884872 46884892 21"

    patt = re.compile(r'''\bMusmusculuslet-[0-9a-z]+\b      # Matches "Musmusculuslet-" followed by alphanumeric
                                                     # characters at word borderlines
                      |\bMusmusculusmiR-[0-9a-z\-]+\b   # Matches "MusmusculusmiR-" followed by alphanumeric
                                                     # characters or dashes at word borderlines
                      ''', re.VERBOSE)

    strList = patt.findall(s1)
    s2 = s1
    for item in strList:
        if dd.has_key(item):
            s2List = s2.split(' ')
            idx = s2List.index(item)
            s2List[idx] = '%s %s' % (item, dd[item])
            s2 = ' '.join(s2List)

    print s2
Output:
>>> {'MusmusculusmiR-1': 'UGGAAUGUAAAGAAGUAUGUA', 'MusmusculusmiR-314-5p': 'UGAGGUAGUAGUUUGUACAGU', 'Musmusculuslet-7i': 'UGAGGUAGUAGUUUGUGCUGU', 'Musmusculuslet-7g': 'UGAGGUAGUAGUUUGUACAGU', 'MusmusculusmiR-31': 'UGGAAUGUAAAGAAGUAUGUA'}
browser details Musmusculuslet-7g UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details Musmusculuslet-7i UGAGGUAGUAGUUUGUGCUGU 21 1 21 21 100.0% 5 + 50605174 50605194 21
browser details MusmusculusmiR-314-5p UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details MusmusculusmiR-31 UGGAAUGUAAAGAAGUAUGUA 21 1 21 21 100.0% 22 + 46884872 46884892 21
>>>
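To see what the word-boundary anchors in the pattern above contribute, here is a small sketch in Python 3 (an assumption; the thread uses Python 2). It writes the same pattern on one line and compares what the regex extracts with what a plain substring test reports, using the MusmusculusmiR-146b line from earlier in the thread.

```python
import re

# The pattern from the post above, on a single line and without comments
patt = re.compile(r"\bMusmusculuslet-[0-9a-z]+\b|\bMusmusculusmiR-[0-9a-z\-]+\b")

line = "browser details MusmusculusmiR-146b 22 1 22 22 100.0% 26 - 20924724 20924745 22"

# The regex consumes the whole name up to the next word boundary
found = patt.findall(line)

# A plain substring test also reports the shorter key as present
short_key_present = "MusmusculusmiR-1" in line

print(found, short_key_present)
```

Because `found` holds the complete name, looking it up in the dictionary is an exact-key test, so the shorter MusmusculusmiR-1 entry is never selected for this line.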
The code I am currently using, which still gives me the same problem:

    def EditFile ( s1, dd ):
        print dd
        import re
        patt = re.compile(r'''\bMusmusculuslet-[0-9a-z]+\b+|\bMusmusculusmiR-[0-9a-z\-]+\b''', re.VERBOSE)
        strList = patt.findall(s1)
        s2 = s1
        for item in strList:
            if dd.has_key(item):
                s2List = s2.split(' ')
                idx = s2List.index(item)
                s2List[idx] = '%s %s' % (item, dd[item])
                s2 = ' '.join(s2List)
        print s2
    ##    print
    ##    s3 = s1
    ##    words = s3.split()
    ##    key = words[2]
    ##    for key in dd:
    ##        if key in s3:
    ##            s3 = s3.replace(key, '%s %s' % (key, dd[key]))
    ##    print s3
        f = open('editted BLAT Search Results-Mouse.txt', 'w')
        f.writelines(s2)
        f.close()
        return s2

It seems to choke on the following matches:
MusmusculusmiR-1 is matched when it reads MusmusculusmiR-124a, so the line gets written with two separate values from two separate keys:
UGGAAUGUAAAGAAGUAUGUA24a
followed by UAAGGCACGCGGUGAAUGCC
The first is the value for key MusmusculusmiR-1 (without the 24a at the end); the second is the value for key MusmusculusmiR-124a.
It is also still choking on the following matches:
MusmusculusmiR-126-5p (weird, since it doesn't mind MusmusculusmiR-126-3p)
MusmusculusmiR-127, MusmusculusmiR-128a, MusmusculusmiR-130, MusmusculusmiR-129-5p
and MusmusculusmiR-324-3p because there is a MusmusculusmiR-32.
I ran the above code and got the same results I did with the previous code, which is strange. Did I miss something in my transcription? What exactly does the \b do in your code?
Mark
Okay, now I am getting a new error:

    Traceback (most recent call last):
      File "<pyshell#28>", line 1, in <module>
        newfile = EditFile ( data, mouse )
      File "BatchEditor.py", line 45, in EditFile
        patt = re.compile(r'''\bMusmusculuslet-[0-9a-z]+\b+|\bMusmusculusmiR-[0-9a-z\-]+\b''', re.VERBOSE)
      File "C:\Python25\lib\re.py", line 180, in compile
        return _compile(pattern, flags)
      File "C:\Python25\lib\re.py", line 233, in _compile
        raise error, v # invalid expression
    error: nothing to repeat

I edited the code, so it is now like this:

    def EditFile ( s1, dd ):

        #print dd
        import re
        patt = re.compile(r'''\bMusmusculuslet-[0-9a-z]+\b+|\bMusmusculusmiR-[0-9a-z\-]+\b''', re.VERBOSE)
        strList = patt.findall(s1)
        s2 = s1
        print strList
        for item in strList:
            if dd.has_key(item):
                s2List = s2.split(' ')
                idx = s2List.index(item)
                s2List[idx] = '%s %s' % (item, dd[item])
                s2 = ' '.join(s2List)
        print s2
    ##    print
    ##    s3 = s1
    ##    words = s3.split()
    ##    key = words[2]
    ##    for key in dd:
    ##        if key in s3:
    ##            s3 = s3.replace(key, '%s %s' % (key, dd[key]))
    ##    print s3
        f = open('editted BLAT Search Results-Mouse.txt', 'w')
        f.writelines(s2)
        f.close()
        return s2

I should ask: is that a single quote followed by a double quote at the beginning and end of the re.compile statement? I had it set as three single quotes and then realized that is probably wrong.
Mark
bvdet (Expert Mod):
The error you received is caused by the extra '+' character after '\b'. Since '\b' is a zero-width assertion that matches only the empty string at a word boundary, there is nothing to repeat.
Three single quotes or three double quotes would be correct.
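Both points can be checked with a short sketch. It is written for Python 3 (an assumption; the traceback above comes from Python 2.5), where the same mistake is reported as `re.error`. The variable names are made up for illustration.

```python
import re

# The pattern from the traceback: the '+' after the first \b has nothing to repeat
bad = r'''\bMusmusculuslet-[0-9a-z]+\b+|\bMusmusculusmiR-[0-9a-z\-]+\b'''
try:
    re.compile(bad, re.VERBOSE)
    compiled_ok = True
    message = ""
except re.error as exc:
    compiled_ok = False
    message = str(exc)

# Triple single quotes and triple double quotes build exactly the same string
single = r'''\bMusmusculuslet-[0-9a-z]+\b'''
double = r"""\bMusmusculuslet-[0-9a-z]+\b"""
same_pattern = (single == double)

print(compiled_ok, message, same_pattern)
```

Removing the stray '+' is the only change needed for the pattern to compile.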
I re-copied and re-pasted the code, and it is working much better now. The program is no longer splitting the keys, but it is pasting multiple values back-to-back instead of next to the key for multiple matches:
browser details MusmusculusmiR-450b1 AUUGGGAACAUUUUGCAUGCAU AUUGGGAACAUUUUGCAUGCAU 20 1 22 22 95.5% Un.003.104 - 440337 440358 22
browser details MusmusculusmiR-450b1 20 1 22 22 95.5% Un.003.104 - 440652 440673 22
This is something I can live with, unless there is some easy way to fix it. I am going to import the whole thing into Access for a database when I am through.
Thanks again,
Mark
bvdet (Expert Mod):
Do you have multiple occurrences of the key 'MusmusculusmiR-450b1' in the string? That would explain the double values. Try this:

    print dd

    import re

    s1 = "browser details Musmusculuslet-7g 21 1 21 21 100.0% 22 + 46884872 46884892 21\nbrowser details Musmusculuslet-7i 21 1 21 21 100.0% 5 + 50605174 50605194 21\nbrowser details MusmusculusmiR-314-5p 21 1 21 21 100.0% 22 + 46884872 46884892 21\nbrowser details MusmusculusmiR-31 21 1 21 21 100.0% 22 + 46884872 46884892 21\nbrowser details MusmusculusmiR-31 21 1 21 21 100.0% 22 + 46884872 46884892 21"

    patt = re.compile(r'''\bMusmusculuslet-[0-9a-z]+\b      # Matches "Musmusculuslet-" followed by alphanumeric
                                                     # characters at word borderlines
                      |\bMusmusculusmiR-[0-9a-z\-]+\b   # Matches "MusmusculusmiR-" followed by alphanumeric
                                                     # characters or dashes at word borderlines
                      ''', re.VERBOSE)

    sList = s1.split('\n')
    outList = []
    for item in sList:
        tem = patt.search(item)
        if tem:
            if dd.has_key(tem.group(0)):
                item = item.replace(tem.group(0), '%s %s' % (tem.group(0), dd[tem.group(0)]))
        outList.append(item)

    s2 = '\n'.join(outList)
    print s2
>>> {'MusmusculusmiR-1': 'UGGAAUGUAAAGAAGUAUGUA', 'MusmusculusmiR-314-5p': 'UGAGGUAGUAGUUUGUACAGU', 'Musmusculuslet-7i': 'UGAGGUAGUAGUUUGUGCUGU', 'Musmusculuslet-7g': 'UGAGGUAGUAGUUUGUACAGU', 'MusmusculusmiR-31': 'UGGAAUGUAAAGAAGUAUGUA'}
browser details Musmusculuslet-7g UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details Musmusculuslet-7i UGAGGUAGUAGUUUGUGCUGU 21 1 21 21 100.0% 5 + 50605174 50605194 21
browser details MusmusculusmiR-314-5p UGAGGUAGUAGUUUGUACAGU 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details MusmusculusmiR-31 UGGAAUGUAAAGAAGUAUGUA 21 1 21 21 100.0% 22 + 46884872 46884892 21
browser details MusmusculusmiR-31 UGGAAUGUAAAGAAGUAUGUA 21 1 21 21 100.0% 22 + 46884872 46884892 21
>>>
That has done the trick! Thanks for all of the help, I didn't even know about Python having a regex module. Still wondering why the triple quotes, but I will go to python.org and read up on it.
Mark
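On the triple-quote question: triple quotes let a string literal span several lines, and the re.VERBOSE flag tells the regex compiler to ignore whitespace and '#' comments inside the pattern, which is what makes the commented layout possible. The sketch below, in Python 3 (an assumption; the thread uses Python 2), compares the verbose form with a one-line equivalent. It also shows an alternative to the line-by-line loop: passing a callback to `sub` so each matched name is annotated in a single pass. The names `annotate` and `repl` are hypothetical, and the two-entry dictionary is a cut-down stand-in.

```python
import re

# Triple quotes allow the line breaks; re.VERBOSE ignores the layout and comments
verbose = re.compile(r'''
    \bMusmusculuslet-[0-9a-z]+\b       # "Musmusculuslet-" plus letters or digits
  | \bMusmusculusmiR-[0-9a-z\-]+\b     # "MusmusculusmiR-" plus letters, digits or dashes
''', re.VERBOSE)

# The same pattern on one line, compiled without re.VERBOSE
compact = re.compile(r'\bMusmusculuslet-[0-9a-z]+\b|\bMusmusculusmiR-[0-9a-z\-]+\b')


def annotate(text, dd, patt=verbose):
    """Insert dd[name] after every matched name that is a key in dd."""
    def repl(m):
        name = m.group(0)
        # Names that are not keys are left exactly as they were
        return "%s %s" % (name, dd[name]) if name in dd else name
    return patt.sub(repl, text)


dd = {
    "MusmusculusmiR-1": "UGGAAUGUAAAGAAGUAUGUA",
    "MusmusculusmiR-31": "UGGAAUGUAAAGAAGUAUGUA",
}
text = ("browser details MusmusculusmiR-314-5p 21 1 21 21 100.0% 22 + 46884872 46884892 21\n"
        "browser details MusmusculusmiR-31 21 1 21 21 100.0% 22 + 46884872 46884892 21\n"
        "browser details MusmusculusmiR-31 21 1 21 21 100.0% 22 + 46884872 46884892 21")
result = annotate(text, dd)
print(result)
```

Since every match is replaced where it occurs, repeated names are each annotated once, and a name that is missing from the dictionary (MusmusculusmiR-314-5p in this cut-down example) passes through untouched.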