Bytes IT Community

Parsing tab separated .txt files with common and distinct attributes

P: n/a
I would like to parse tab-separated .txt files and separate the common attributes from the distinct attributes. I want to parse only the attributes on the first line, not the values. Could you please correct this script? The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...MX-10.sdrf.txt

The source code I have written is below:
    #!/usr/bin/python
    import glob
    outfile = open('output_attribute.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    for file in files:
        infile = open(file)
        #ret = False
        for line in infile:
            lineArray = line.split('\t')

            if '\n\n' in line:
                ret = false
                outfile.write('')
                break;
            elif len(lineArray) > 2:
                output = "%s\t%s\n\n"%(lineArray[0],lineArray[1])
                outfile.write(output)
            else:
                output = "%s\t\n"%(lineArray[0])
                outfile.write(output)
        infile.close()
    outfile.close()
Oct 11 '10 #1
13 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
I am unclear about your end goal. It seems you want to read multiple files, read the first line of each file, split the line on the tab character, then write the first two elements to the output file. Would you please clarify what output you want?
Oct 11 '10 #2

P: 16
Dear,

Please find the attached zip file. I would like to extract only the headers from the parsed files. The header in every file starts with Array Design Name but does not end with a fixed attribute. The headers are separated from the rest of the file by a blank line (\n\n), as shown in the attached zip file, and I would like to extract only the headers circled in red. I would be glad of your support and cooperation.

With regards,
Haobijam
Attached Files
File Type: zip headers.zip (73.7 KB, 73 views)
Oct 13 '10 #3

P: 16
I would like to extract only the headers from the parsed files. The header in every file starts with Array Design Name but does not end with a fixed attribute. The headers are separated from the rest of the file by a blank line (\n\n), as shown in the attached zip file, and I would like to extract only the headers circled in red. I have attached the output of the script I wrote. I would be glad of your support and cooperation.
The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FFY-10.adf.txt

The source code I have written is below:

    #!/usr/bin/python
    import glob

    outfile = open('output_attri.txt' , 'w')
    files = glob.glob('*.adf.txt')

    for file in files:
        infile = open(file)

        for line in infile:
            line = line.replace('^' , '\n\n').replace('!' , '').replace('#' , '').replace('\n','')
            lineArray = line.split('%s\t')
            if line == '\n\n':
                outfile.write('')
                break;
            elif len(lineArray) > 2:
                output = "%s\t%s\n"%(lineArray[0],lineArray[1])
                outfile.write(output)
            else:
                output = "%s\t\n"%(lineArray[0])
                outfile.write(output)
        infile.close()
    outfile.close()

With regards,
Haobijam
Attached Files
File Type: zip output_attribute.zip (2.46 MB, 93 views)
Oct 13 '10 #4

bvdet
Expert Mod 2.5K+
P: 2,851
Is there always a blank line separating the header info you want from the data you do not want? You only want the first two elements of each header line? Untested:
    outFile = open(outFileName, 'w')
    for fn in fileNameList:
        f = open(fn)
        output = []
        for line in f:
            line = line.strip().split("\t")
            if line:
                output.append("\t".join(line[:2]))
            else:
                outFile.write("\n".join(output))
                break
    outFile.close()
Oct 13 '10 #5

P: 16
Dear,
Yes, in all the files there is always a blank line separating the header information I want from the text data I do not want to extract.

Regards,
Haobijam
Oct 13 '10 #6

P: 16
Dear,

What is wrong with this script? I could not print any output.
The code is here:

    #!/usr/bin/python
    import glob

    outFile = open('output.txt', 'w')
    fileNameList = glob.glob('*.adf.txt')
    for file in fileNameList:
        f = open(file)
        output = []
        for line in f:
            line = line.strip().split("\t")
            #lineArray = line.split('\t')
            if line:
                #output = "%s\t%s\n"%(lineArray[0],lineArray[1])
                output.append("\t".join(line[:2]))
            else:
                outFile.write("\n".join(output))
                break
        f.close()
    outFile.close()
Oct 13 '10 #7

bvdet
Expert Mod 2.5K+
P: 2,851
PLEASE use code tags when posting code. That way I will not have to edit your post.

There are no print statements. Is there any content in the output file? Add print statements, as in print line, to see what is being read.

BV - Moderator
Oct 13 '10 #8

bvdet
Expert Mod 2.5K+
P: 2,851
This writes the header information to disk:
    outFile = open(outFileName, 'w')
    for fn in fileNameList:
        f = open(fn)
        output = []
        for line in f:
            line = line.strip()
            if line:
                output.append(line)
            else:
                outFile.write("\n".join(output))
                f.close()
                break
    outFile.close()
Oct 13 '10 #9
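
The loop in reply #9 above only writes a file's header when it reaches a blank line, and it puts no separator between the headers of consecutive files. A minimal Python 3 sketch of the same idea follows, assuming that a file without any blank line should be treated as header in its entirety. The function names read_header_block and write_headers are hypothetical and are not part of the thread.

```python
def read_header_block(path):
    """Return the lines that come before the first blank line of a text file."""
    header = []
    with open(path) as infile:          # the file is closed automatically
        for line in infile:
            line = line.rstrip("\r\n")
            if not line.strip():        # a blank line marks the end of the header
                break
            header.append(line)
    return header                       # assumption: no blank line means the whole file


def write_headers(paths, out_path):
    """Write each file's header block to out_path, followed by one blank line."""
    with open(out_path, "w") as outfile:
        for path in paths:
            header = read_header_block(path)
            outfile.write("\n".join(header) + "\n\n")
```

It could be called as write_headers(sorted(glob.glob('*.adf.txt')), 'output.txt') after importing glob. The with blocks close every file even when the loop exits early.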

Expert 100+
P: 624
Note that the test below will never succeed: the file is read one line at a time, so the two newlines arrive as two separate records. Test whether len(line.strip()) is zero instead to find an empty record.

    if '\n\n' in line:
Oct 13 '10 #10
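
The point made in reply #10 can be checked with a small Python 3 sketch. The sample text and the helper name first_blank_line_index are made up for illustration, and io.StringIO stands in for a real file.

```python
import io


def first_blank_line_index(lines):
    """Return the index of the first empty or whitespace-only line, or None."""
    for index, line in enumerate(lines):
        if not line.strip():
            return index
    return None


# Iterating over a file-like object yields one line per step, newline included.
sample = io.StringIO("Array Design Name\tX\n\nReporter Name\tCol\n")
lines = list(sample)

# No single line ever contains two newlines, so the original test cannot match.
print(any("\n\n" in line for line in lines))  # False
print(first_blank_line_index(lines))          # 1
```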

P: 16
Dear Sir,
I have written a script to extract the first line, which starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI]. That part works, but I could not extract the distinct (unique) attributes, that is, those not repeated in every file. I would like to parse only the first-line attributes, not the table values. Could you please correct this script? I have attached a zip file containing all the sdrf.txt files. The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt

Regards,
Haobijam
Attached Files
File Type: zip sdrf.txt.zip (95.7 KB, 60 views)
File Type: txt sdrf.txt (536 Bytes, 361 views)
File Type: zip output_att.zip (3.0 KB, 65 views)
Oct 14 '10 #11

P: 16
    #!/usr/bin/python
    import glob
    #import linecache
    outfile = open('output_att.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    for file in files:
        infile = open(file)
        #count = 0
        for line in infile:

            lineArray = line.rstrip()
            if not line.startswith('Source Name') : continue
            #count = count + 1
            lineArray = line.split('%s\t')
            print lineArray[0]
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
        infile.close()
    outfile.close()
Oct 14 '10 #12

P: 16
Dear Sir,

I would like to extract only the terms that are unique across all sdrf.txt files, but this Python code outputs the unique terms for every file individually. Terms like Array Data File and Array Design REF are repeated in most of the sdrf.txt files, so I do not want to print them as unique terms. Could you also please tell me how to ignore case in Python? Characteristics[OrganismPart] is printed as a separate term from Characteristics[organism part], and similarly Characteristics[Sex] from Characteristics[sex]. I am eagerly waiting for your support and reply.
    #!/usr/bin/python
    import glob
    import string

    outfile = open('output.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    previous = set()
    for file in files:
        print('\n'+file)
        infile = open(file)
        #previous = set() # uncomment this if do not need to be unique between the files
        for line in infile:
            lineArray = line.rstrip()
            if not line.startswith('Source Name') : continue
            lineArray = line.split('%s\t')
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
            uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                            if word.strip() and word.strip() not in previous)
            print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
            previous |=  uniqwords
        infile.close()
    outfile.close()
    print('='*80)
    print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
Attached Files
File Type: zip sdrf.zip (95.7 KB, 61 views)
File Type: zip attribute.zip (2.9 KB, 77 views)
Oct 19 '10 #13
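
One possible way to make the comparison in post #13 ignore case is to compare a normalized key rather than the raw term. The Python 3 sketch below assumes that whitespace inside a term should be ignored as well, since Characteristics[OrganismPart] and Characteristics[organism part] differ by a space and not only by case. The names normalize and unique_terms are hypothetical, and the two sample header lines are built from the terms quoted in the post.

```python
def normalize(term):
    """Key used to compare attribute names: ignores case and all whitespace."""
    return "".join(term.split()).casefold()


def unique_terms(header_lines):
    """Map each normalized key to the first spelling seen in the header lines."""
    seen = {}
    for line in header_lines:
        for term in line.rstrip("\r\n").split("\t"):
            term = term.strip()
            if term:
                seen.setdefault(normalize(term), term)
    return seen


headers = [
    "Source Name\tCharacteristics[OrganismPart]\tCharacteristics[Sex]",
    "Source Name\tCharacteristics[organism part]\tCharacteristics[sex]",
]
terms = unique_terms(headers)
print(sorted(terms.values()))
# ['Characteristics[OrganismPart]', 'Characteristics[Sex]', 'Source Name']
```

Keeping the first spelling as the dictionary value means the output still shows a term as it appears in a file, while duplicates that differ only in case or spacing collapse into one entry.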

P: 16
Dear Sir,

I have a query about parsing attributes and extracting unique terms from the adf.txt files from ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/mi...y/data/array/]. The Python code below works for individual files that share the same starting term, but it does not work across all of the roughly 2270 adf.txt files at once. Could you please correct it or suggest some tips for line number 12? I would like to parse the first line of every adf.txt file (2270 in total) and then extract the unique terms and the common terms. For your convenience I have attached a zip file of adf.txt files; more are available on the FTP site mentioned above. I would be glad of your support and cooperation.

With warm regards,
Haobijam

  1. #!/usr/bin/python
  2. import glob
  3. import string
  4. with open('output_Reporter Name.txt' , 'w') as outfile:
  5.     files = glob.glob('*.adf.txt')
  6.     uniqwords = set()
  7.     previous = set()
  8.     for file in files:
  9.         with open(file) as infile:
  10.             #previous = set() # uncomment this if do not need to be unique between the files
  11.             for line in infile:
  12.                 if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
  13.                 output = line
  14.                 uniqwords = set(word.strip() for word in line.rstrip().split('\t')
  15.                                 if word.strip() and word.strip() not in previous)
  16.                 previous |=  uniqwords
  17.                 print (output)
  18.                 outfile.write(output)
  19. print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))                  
  20. print('='*80)
  21. print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
  22.  
Attached Files
File Type: zip adf.zip (1.01 MB, 74 views)
Oct 28 '10 #14
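
The startswith('Reporter Name') test on line 12 of post #14 ties the script to files whose column line begins with that one term. A Python 3 sketch of one alternative follows. It assumes, as stated earlier in the thread, that a blank line separates the header block from the table, and additionally assumes that the column names sit on the first non-blank line after it. It treats a term as common when it occurs in every file and as distinct when it occurs in exactly one. The function names are hypothetical.

```python
from collections import Counter


def table_header_terms(path):
    """Return the column names on the first non-blank line after the first blank line."""
    with open(path) as infile:
        past_blank = False
        for line in infile:
            if not line.strip():
                past_blank = True       # reached the separator after the header block
            elif past_blank:
                return [t.strip() for t in line.rstrip("\r\n").split("\t") if t.strip()]
    return []                           # assumption: no blank line means no table found


def common_and_distinct(paths):
    """Split terms into those present in every file and those present in only one."""
    counts = Counter()
    n_files = 0
    for path in paths:
        n_files += 1
        counts.update(set(table_header_terms(path)))  # count each term once per file
    common = sorted(t for t, n in counts.items() if n == n_files)
    distinct = sorted(t for t, n in counts.items() if n == 1)
    return common, distinct
```

It could be called as common_and_distinct(glob.glob('*.adf.txt')) after importing glob. Only one line per file is kept in memory, so the number of files should not matter much.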
