Parsing tab separated .txt files with common and distinct attributes

I would like to parse tab-separated .txt files, separating the common attributes from the distinct attributes in the files. I want to parse only the attributes on the first line, not the values. Could you please correct this script? The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...MX-10.sdrf.txt

The source code I have written is below:

    #!/usr/bin/python
    import glob
    outfile = open('output_attribute.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    for file in files:
        infile = open(file)
        #ret = False
        for line in infile:
            lineArray = line.split('\t')

            if '\n\n' in line:
                ret = false
                outfile.write('')
                break;
            elif len(lineArray) > 2:
                output = "%s\t%s\n\n"%(lineArray[0],lineArray[1])
                outfile.write(output)
            else:
                output = "%s\t\n"%(lineArray[0])
                outfile.write(output)
        infile.close()
    outfile.close()
Oct 11 '10 #1
bvdet
I am unclear about your end goal. It seems you want to read multiple files, read the first line of each file, split the line on the tab character, then write the first two elements to the output file. Would you please clarify what output you want?
Oct 11 '10 #2
Dear,

Please find the attached zip file. I would like to extract only the headers from the parsed file. The header in every file starts with Array Design Name but does not end with a fixed attribute. So I would like to extract the header block up to the blank line (\n\n) that separates it from the rest of the file, as shown in the attached zip file. I would like to extract only the headers circled in red. I would be grateful for your support and cooperation.

With regards,
Haobijam
Attached Files
File Type: zip headers.zip (73.7 KB, 185 views)
Oct 13 '10 #3
I would like to extract only the headers from the parsed file. The header in every file starts with Array Design Name but ends with an attribute that is not fixed. So I would like to extract the header block up to the blank line (\n\n) that separates it from the rest of the file, as shown in the attached zip file. I would like to extract only the headers circled in red. I have attached the output of the script below. I would be grateful for your support and cooperation.
The file may be located from this url -
ftp://ftp.ebi.ac.uk/pub/databases/mi...FFY-10.adf.txt

The source code I have written is below:

    #!/usr/bin/python
    import glob

    outfile = open('output_attri.txt' , 'w')
    files = glob.glob('*.adf.txt')

    for file in files:
        infile = open(file)

        for line in infile:
            line = line.replace('^' , '\n\n').replace('!' , '').replace('#' , '').replace('\n','')
            lineArray = line.split('%s\t')
            if line == '\n\n':
                outfile.write('')
                break;
            elif len(lineArray) > 2:
                output = "%s\t%s\n"%(lineArray[0],lineArray[1])
                outfile.write(output)
            else:
                output = "%s\t\n"%(lineArray[0])
                outfile.write(output)
        infile.close()
    outfile.close()

With regards,
Haobijam
Attached Files
File Type: zip output_attribute.zip (2.46 MB, 126 views)
Oct 13 '10 #4
bvdet
Is there always a blank line separating the header info you want from the data you do not want? You only want the first two elements of each header line? Untested:

    outFile = open(outFileName, 'w')
    for fn in fileNameList:
        f = open(fn)
        output = []
        for line in f:
            line = line.strip().split("\t")
            if line:
                output.append("\t".join(line[:2]))
            else:
                outFile.write("\n".join(output))
                break
    outFile.close()
Oct 13 '10 #5
Dear,
Yes, in all the files there is always a blank line separating the header information I want from the data I do not want to extract.

Regards,
Haobijam
Oct 13 '10 #6
Dear,

What is wrong with this script? It does not produce any output. The code is here:

    #!/usr/bin/python
    import glob

    outFile = open('output.txt', 'w')
    fileNameList = glob.glob('*.adf.txt')
    for file in fileNameList:
        f = open(file)
        output = []
        for line in f:
            line = line.strip().split("\t")
            #lineArray = line.split('\t')
            if line:
                #output = "%s\t%s\n"%(lineArray[0],lineArray[1])
                output.append("\t".join(line[:2]))
            else:
                outFile.write("\n".join(output))
                break
        f.close()
    outFile.close()
Oct 13 '10 #7
bvdet
PLEASE use code tags when posting code. That way I will not have to edit your post.

There are no print statements. Is there any content in the output file? Add print statements, as in print line, to see what is being read.

BV - Moderator
Oct 13 '10 #8
bvdet
This writes the header information to disk:
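
    # outFileName and fileNameList are assumed to be defined beforehand,
    # for example fileNameList = glob.glob('*.adf.txt').
    outFile = open(outFileName, 'w')
    for fn in fileNameList:
        f = open(fn)
        output = []
        for line in f:
            line = line.strip()
            if not line:
                # A blank line marks the end of the header block.
                break
            output.append(line)
        f.close()
        # Write the header even if the file has no blank line, and end with
        # a newline so headers from successive files do not run together.
        outFile.write("\n".join(output) + "\n")
    outFile.close()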
Oct 13 '10 #9
dwblas
Note that this test will never succeed, because the file is read one line at a time, so the two newlines arrive as two separate records. Test len(line.strip()) instead to find an empty record.

    if '\n\n' in line:
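To illustrate the point, here is a small sketch that uses an in-memory stand-in for a file (io.StringIO) with made-up contents. Iterating over it yields one record per line, so no record can contain two consecutive newlines, while the blank line appears as a record that is empty once stripped.

```python
import io

# Hypothetical contents: two header lines, a blank line, one table line.
text = (
    "Array Design Name\tExample design\n"
    "Provider\tExample provider\n"
    "\n"
    "Reporter Name\tReporter Sequence\n"
)

# Iterating over a file object yields one line (record) at a time.
records = list(io.StringIO(text))

# Each record ends with at most one newline, so this test is never true.
found_double_newline = any("\n\n" in record for record in records)

# The blank line is the first record whose stripped length is zero.
blank_index = next(
    i for i, record in enumerate(records) if len(record.strip()) == 0
)

print(found_double_newline)  # False
print(blank_index)           # 2
```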
Oct 13 '10 #10
Dear Sir,
I have written a script to extract the first line, which starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI]. That part works, but I could not parse the distinct or unique attributes, that is, the ones that are not repeated in every file. I would like to parse only the attributes on the first line, not the table values. Could you please correct this script? I have attached a zip file of all the sdrf.txt files. The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt

Regards,
Haobijam
Attached Files
File Type: zip sdrf.txt.zip (95.7 KB, 79 views)
File Type: txt sdrf.txt (536 Bytes, 440 views)
File Type: zip output_att.zip (3.0 KB, 94 views)
Oct 14 '10 #11
    #!/usr/bin/python
    import glob
    #import linecache
    outfile = open('output_att.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    for file in files:
        infile = open(file)
        #count = 0
        for line in infile:

            lineArray = line.rstrip()
            if not line.startswith('Source Name') : continue
            #count = count + 1
            lineArray = line.split('%s\t')
            print lineArray[0]
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
        infile.close()
    outfile.close()
Oct 14 '10 #12
Dear Sir,

I would like to extract only the terms that are unique across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most sdrf.txt files, so I do not want to print them as unique terms. Could you also tell me how to ignore case in Python? Characteristics[OrganismPart] is printed as a different term from Characteristics[organism part], and likewise Characteristics[Sex] and Characteristics[sex]. I look forward to your support and reply.

    #!/usr/bin/python
    import glob
    import string

    outfile = open('output.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    previous = set()
    for file in files:
        print('\n'+file)
        infile = open(file)
        #previous = set() # uncomment this if do not need to be unique between the files
        for line in infile:
            lineArray = line.rstrip()
            if not line.startswith('Source Name') : continue
            lineArray = line.split('%s\t')
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
            uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                            if word.strip() and word.strip() not in previous)
            print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
            previous |=  uniqwords
        infile.close()
    outfile.close()
    print('='*80)
    print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
Attached Files
File Type: zip sdrf.zip (95.7 KB, 78 views)
File Type: zip attribute.zip (2.9 KB, 109 views)
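One possible way to ignore case, offered as a sketch rather than a tested fix for these files: compare a normalized form of each term (lower-cased, with all whitespace removed) and keep the first original spelling for display. Counting in how many header lines each normalized term occurs then separates the terms found in only one file from the repeated ones. The function names and the sample header lines, including Factor Value[dose], are made up for illustration.

```python
import re
from collections import Counter


def normalize(term):
    """Lower-case a term and drop all whitespace so variants compare equal."""
    return re.sub(r"\s+", "", term).lower()


def header_terms(header_line):
    """Map normalized term -> original spelling for one tab-separated line."""
    terms = {}
    for word in header_line.rstrip("\n").split("\t"):
        word = word.strip()
        if word:
            terms.setdefault(normalize(word), word)
    return terms


def terms_in_one_file_only(header_lines):
    """Return the terms (first spelling seen) that occur in exactly one line."""
    counts = Counter()
    spelling = {}
    for line in header_lines:
        terms = header_terms(line)
        counts.update(terms.keys())  # a term counts once per header line
        for key, word in terms.items():
            spelling.setdefault(key, word)
    return sorted(spelling[key] for key, n in counts.items() if n == 1)


# Hypothetical header lines standing in for two sdrf.txt files.
headers = [
    "Source Name\tCharacteristics[OrganismPart]\tCharacteristics[Sex]\tArray Data File\n",
    "Source Name\tCharacteristics[organism part]\tCharacteristics[sex]\tArray Data File\tFactor Value[dose]\n",
]
print(terms_in_one_file_only(headers))  # ['Factor Value[dose]']
```

Removing whitespace as well as case is a design choice made here because the two spellings in the question differ in both; if terms that differ only by spacing should stay distinct, normalize could simply return term.lower().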
Oct 19 '10 #13
Dear Sir,

I have a question about parsing attributes and extracting unique terms from the adf.txt files from ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/mi...y/data/array/]. The Python code below works for individual files whose attribute line starts with the same term, but it does not work for running about 2270 adf.txt files at once. Could you please correct line number 12 of this code or suggest some tips? What I want is to parse the first line of every adf.txt file (2270 in total) and then extract the unique terms and the common terms from them. For your convenience I have attached a zip file of adf.txt files; more are available on the FTP site mentioned above. I would be very grateful for your support and cooperation.

With warm regards,
Haobijam

  1. #!/usr/bin/python
  2. import glob
  3. import string
  4. with open('output_Reporter Name.txt' , 'w') as outfile:
  5.     files = glob.glob('*.adf.txt')
  6.     uniqwords = set()
  7.     previous = set()
  8.     for file in files:
  9.         with open(file) as infile:
  10.             #previous = set() # uncomment this if do not need to be unique between the files
  11.             for line in infile:
  12.                 if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
  13.                 output = line
  14.                 uniqwords = set(word.strip() for word in line.rstrip().split('\t')
  15.                                 if word.strip() and word.strip() not in previous)
  16.                 previous |=  uniqwords
  17.                 print (output)
  18.                 outfile.write(output)
  19. print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))                  
  20. print('='*80)
  21. print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
  22.  
Attached Files
File Type: zip adf.zip (1.01 MB, 108 views)
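A rough sketch of one way line 12 could be generalized, under the assumption that the attribute row of each file starts with one of a small set of known terms. str.startswith accepts a tuple of prefixes, so a single test can cover several layouts. Counting in how many files each term occurs then gives the common terms (present in every file that has an attribute row) and the distinct terms (present in exactly one). The prefix list and all function names are assumptions and would need to be checked against the real files.

```python
import glob
from collections import Counter

# Assumed prefixes of the attribute row; extend the tuple for other layouts.
HEADER_PREFIXES = ("Reporter Name", "Composite Element Name")


def first_header_row(lines, prefixes=HEADER_PREFIXES):
    """Return the set of terms on the first line starting with any prefix."""
    for line in lines:
        if line.startswith(prefixes):  # str.startswith accepts a tuple
            words = line.rstrip("\n").split("\t")
            return {word.strip() for word in words if word.strip()}
    return set()


def common_and_distinct(term_sets):
    """Split terms into those in every set and those in exactly one set."""
    term_sets = [terms for terms in term_sets if terms]  # skip files with no row
    counts = Counter(term for terms in term_sets for term in terms)
    common = {term for term, n in counts.items() if n == len(term_sets)}
    distinct = {term for term, n in counts.items() if n == 1}
    return common, distinct


def scan(pattern="*.adf.txt"):
    """Collect one term set per file, reading each file only up to its row."""
    term_sets = []
    for path in glob.glob(pattern):
        with open(path) as infile:
            term_sets.append(first_header_row(infile))
    return common_and_distinct(term_sets)
```

Calling scan() in the directory that holds the files would return the two sets. Because each file is read only up to its attribute row, the number of files should matter less than it does in a script that reads every line of every file.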
Oct 28 '10 #14
