Parsing tab separated .txt files with common and distinct attributes

haobijam

I would like to parse tab-separated .txt files and separate the common attributes from the distinct attributes. I only want to parse the attributes on the first line, not the values. Could you please correct this script? The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...MX-10.sdrf.txt

The source code I have written is below:


 
#!/usr/bin/python

import glob

outfile = open('output_attribute.txt', 'w')
files = glob.glob('*.sdrf.txt')

for file in files:
    infile = open(file)
    #ret = False
    for line in infile:
        lineArray = line.split('\t')
        if '\n\n' in line:
            ret = false
            outfile.write('')
            break;
        elif len(lineArray) > 2:
            output = "%s\t%s\n\n"%(lineArray[0],lineArray[1])
            outfile.write(output)
        else:
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
    infile.close()

outfile.close()

Oct 11 '10 #1


bvdet

2,851

Expert Mod 2GB

I am unclear about your end goal. It seems you want to read multiple files, read the first line of each file, split the line on the tab character, then write the first two elements to the output file. Would you please clarify what output you want?

Oct 11 '10 #2

haobijam

Dear,

Please find the attached zip file. I would like to extract only the headers from the parsed file. The header of every file starts with Array Design Name, but the remaining attributes are not fixed. So I would like to extract the header block, which is separated from the rest of the file by a blank line (\n\n), as shown in the attached zip file. I would like to extract only the headers circled in red. I would be glad of your support and cooperation.

With regards,
Haobijam

Attached Files

headers.zip (73.7 KB, 185 views)

Oct 13 '10 #3

haobijam

I would like to extract only the headers from the parsed file. The header of every file starts with Array Design Name but ends with an attribute that is not fixed. So I would like to extract the header block, which is separated from the rest of the file by a blank line (\n\n), as shown in the attached zip file. I would like to extract only the headers circled in red. I have attached the output of the script below. I would be glad of your support and cooperation.
The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FFY-10.adf.txt

The source code I have written is below:


 
#!/usr/bin/python

import glob

outfile = open('output_attri.txt', 'w')
files = glob.glob('*.adf.txt')

for file in files:
    infile = open(file)
    for line in infile:
        line = line.replace('^' , '\n\n').replace('!' , '').replace('#' , '').replace('\n','')
        lineArray = line.split('%s\t')
        if line == '\n\n':
            outfile.write('')
            break;
        elif len(lineArray) > 2:
            output = "%s\t%s\n"%(lineArray[0],lineArray[1])
            outfile.write(output)
        else:
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
    infile.close()

outfile.close()

With regards,
Haobijam

Attached Files

output_attribute.zip (2.46 MB, 126 views)

Oct 13 '10 #4

bvdet


Is there always a blank line separating the header info you want from the data you do not want? You only want the first two elements of each header line? Untested:


outFile = open(outFileName, 'w')

for fn in fileNameList:
    f = open(fn)
    output = []
    for line in f:
        line = line.strip().split("\t")
        if line:
            output.append("\t".join(line[:2]))
        else:
            outFile.write("\n".join(output))
            break

outFile.close()

Oct 13 '10 #5

haobijam

Dear,
Yes, in all the files there is always a blank line separating the header information I want from the data I do not want to extract.

Regards,
Haobijam

Oct 13 '10 #6

haobijam

Dear,

What is wrong with this script? It does not produce any output. The code is here:

#!/usr/bin/python

import glob

outFile = open('output.txt', 'w')
fileNameList = glob.glob('*.adf.txt')

for file in fileNameList:
    f = open(file)
    output = []
    for line in f:
        line = line.strip().split("\t")
        #lineArray = line.split('\t')
        if line:
            #output = "%s\t%s\n"%(lineArray[0],lineArray[1])
            output.append("\t".join(line[:2]))
        else:
            outFile.write("\n".join(output))
            break
    f.close()

outFile.close()

Oct 13 '10 #7

bvdet


PLEASE use code tags when posting code. That way I will not have to edit your post.

There are no print statements. Is there any content in the output file? Add print statements, as in print line, to see what is being read.
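A likely reason the output file stays empty, offered as a sketch rather than a tested diagnosis of the real data: splitting an empty string on a tab returns a list holding one empty string, and any non-empty list is truthy. The `if line:` test therefore never takes the `else` branch, so the write is never reached. The variable names below are only for illustration.

```python
# An empty line, once stripped and split, becomes a one-element list
# holding an empty string, and any non-empty list is truthy.
blank = "\n".strip().split("\t")
print(blank)        # ['']
print(bool(blank))  # True, so `if line:` never reaches the else branch

# Testing the stripped string before splitting behaves as intended.
stripped = "\n".strip()
print(bool(stripped))  # False for a blank line
```

Checking the stripped string first, and splitting only afterwards, avoids the problem.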

BV - Moderator

Oct 13 '10 #8

bvdet


This writes the header information to disk:


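# outFileName and fileNameList must be defined beforehand,
# e.g. fileNameList = glob.glob('*.adf.txt')
outFile = open(outFileName, 'w')

for fn in fileNameList:
    f = open(fn)
    output = []
    for line in f:
        line = line.strip()
        if not line:
            # The first blank line marks the end of the header block.
            break
        output.append(line)
    f.close()
    # End with a newline so that the headers of consecutive files
    # do not run together in the output file.
    outFile.write("\n".join(output) + "\n")

outFile.close()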

Oct 13 '10 #9

dwblas

626

Expert 512MB

Note that the test below will never be true, because the two newlines are read as parts of two separate records. Test len(line.strip()) instead to find an empty record.


if '\n\n' in line:
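As a minimal sketch of that suggestion, the loop below stops at the first empty record. The sample text is made up to stand in for a real .adf.txt file, and `io.StringIO` is used only so that the example runs without any files on disk.

```python
import io

# Hypothetical sample standing in for an .adf.txt file: two header
# lines, a blank separator line, then table data.
sample = io.StringIO("Array Design Name\tX\nVersion\t1\n\nReporter Name\tA\n")

header = []
for line in sample:
    # Each iteration yields one line ending in a single '\n', so the
    # test "'\n\n' in line" can never be true.
    if len(line.strip()) == 0:   # an empty record ends the header
        break
    header.append(line.rstrip('\n'))

print(header)  # ['Array Design Name\tX', 'Version\t1']
```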

Oct 13 '10 #10

haobijam

Dear Sir,
I have written a script to extract the first line, which starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI]. That part works, but I could not parse the distinct (unique) attributes, that is, those that are not repeated in every file. I would like to parse only the first-line attributes, not the table values. Could you please correct this script? I have attached a zip file of all the sdrf.txt files. The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt


Regards,
Haobijam

Attached Files

	sdrf.txt.zip (95.7 KB, 79 views)
	sdrf.txt (536 Bytes, 440 views)
	output_att.zip (3.0 KB, 94 views)

Oct 14 '10 #11

haobijam


 
#!/usr/bin/python

import glob
#import linecache

outfile = open('output_att.txt', 'w')
files = glob.glob('*.sdrf.txt')

for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        #count = count + 1
        lineArray = line.split('%s\t')
        print lineArray[0]
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
    infile.close()

outfile.close()

Oct 14 '10 #12

haobijam

Dear Sir,

I would like to extract only the terms that are unique across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most sdrf.txt files, so I do not want to print them as unique terms. Could you also tell me how to ignore case in Python? Characteristics[OrganismPart] is printed as a term distinct from Characteristics[organism part], and the same happens with Characteristics[Sex] and Characteristics[sex]. I look forward to your support and reply.


 
#!/usr/bin/python

import glob
import string

outfile = open('output.txt', 'w')
files = glob.glob('*.sdrf.txt')
previous = set()

for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous)
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |= uniqwords
    infile.close()

outfile.close()

print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
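On the case-sensitivity question, one possible approach is to compare terms by a normalized key instead of by their raw text. The sketch below lower-cases each term and removes all whitespace, so that Characteristics[OrganismPart] and Characteristics[organism part] count as the same term. The `normalize` helper and the sample list are hypothetical and not part of the script above; whether removing whitespace is acceptable depends on the data.

```python
def normalize(term):
    """Return a comparison key: lower-cased with all whitespace removed."""
    return ''.join(term.lower().split())

# Hypothetical header terms collected from several files.
terms = ['Characteristics[OrganismPart]', 'Characteristics[organism part]',
         'Characteristics[Sex]', 'Characteristics[sex]', 'Array Data File']

seen = {}
for term in terms:
    # Keep the first spelling encountered for each normalized key.
    seen.setdefault(normalize(term), term)

unique_terms = sorted(seen.values())
print(unique_terms)
```

In the script, the same idea would mean storing `normalize(word)` in the `previous` set rather than the word itself.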

Attached Files

	sdrf.zip (95.7 KB, 78 views)
	attribute.zip (2.9 KB, 109 views)

Oct 19 '10 #13

haobijam

Dear Sir,

I have a question about parsing attributes and extracting unique terms from the adf.txt files from ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/mi...y/data/array/]. The Python code below works when it is run on individual files whose header line starts with the same term, but it does not work for running around 2270 adf.txt files at one time. Could you please correct it, or suggest some tips for line 12 of the code (the startswith test)? What I would like to do is parse the first line of every adf.txt file (2270 in number) and then extract the unique terms and the common terms from them. For your convenience I have attached a zip file of sample adf.txt files; more are available on the FTP site mentioned above. I would be glad of your support and cooperation.

With warm regards,
Haobijam


 
#!/usr/bin/python
import glob
import string
with open('output_Reporter Name.txt' , 'w') as outfile:
    files = glob.glob('*.adf.txt')
    uniqwords = set()
    previous = set()
    for file in files:
        with open(file) as infile:
            #previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |= uniqwords
                print (output)
                outfile.write(output)

print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
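One way to get both the common and the unique terms, sketched under the assumption that each file contributes a single header line: count in how many files each term appears. A term found in every file is common, and a term found in exactly one file is unique. The helper names and the sample header lines below are hypothetical. In real use the dictionary would be filled by reading the header line of each file returned by glob.glob('*.adf.txt').

```python
from collections import Counter

def header_terms(line):
    """Split a tab-separated header line into a set of non-empty terms."""
    return set(t.strip() for t in line.rstrip('\n').split('\t') if t.strip())

def common_and_distinct(headers):
    """headers maps a file name to the set of terms in its header line."""
    counts = Counter()
    for terms in headers.values():
        counts.update(terms)          # each file counts a term at most once
    n = len(headers)
    common = set(t for t, c in counts.items() if c == n)
    distinct = set(t for t, c in counts.items() if c == 1)
    return common, distinct

# Hypothetical header lines standing in for three .adf.txt files.
headers = {
    'a.adf.txt': header_terms('Reporter Name\tReporter Sequence\tComment[A]\n'),
    'b.adf.txt': header_terms('Reporter Name\tReporter Sequence\n'),
    'c.adf.txt': header_terms('Reporter Name\tComment[B]\n'),
}
common, distinct = common_and_distinct(headers)
print(sorted(common))    # ['Reporter Name']
print(sorted(distinct))  # ['Comment[A]', 'Comment[B]']
```

Because only one set of terms per file is held in memory, this should scale to a few thousand files without difficulty.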

Attached Files

adf.zip (1.01 MB, 108 views)

Oct 28 '10 #14
