Difficulties POSTing to RDP Hierarchy Browse Page

Hello,
I'm trying to write a tool to scrape some of the Ribosomal
Database Project II's (http://rdp.cme.msu.edu/) pages, specifically
the Hierarchy Browser (http://rdp.cme.msu.edu/hierarchy/).
The Hierarchy Browser is accessed first through a page with a form.
There are four fields with several options to be chosen from (Strain,
Source, Size, and Taxonomy) and then a submit button labeled "Browse".
The HTML of the form is as follows (note, I am also including the
Javascript code, as it is called by the submit button):

--------excerpted HTML----------------
<script language="Javascript">

function resetHiddenVar(){
var f_form = document.forms['hierarchyForm'];
f_form.action= "HierarchyControllerServlet/start";
return ;
}

</script>

<form name="hierarchyForm" method="POST"
action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />

<h1>Hierarchy Browser - Start</h1><div class="cart" style="float:
right">[&nbsp;<a href="hb_help.jsp">help</a>&nbsp;]</div>

<p>&nbsp;</p>
<div id="options">

<table summary="options area" cellpadding="0" cellspacing="0"
border="0"><tr><td align="left" valign="middle">
<table border="0" cellspacing="0" cellpadding="0" summary="Options"
align="left" class="borderup">
<tr>
<th align="right" valign="middle" class="bottom greenbg"
nowrap="nowrap">Strain:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="type"
name="strain" type="radio" value="type">
<label for="type">Type</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="nontype"
name="strain" type="radio" value="nontype">
<label for="nontype">Non Type</label>&nbsp;</td>
<td class="bottom formtext" nowrap="nowrap"><input name="strain"
type="radio" id="strainboth" value="both" checked>
<label for="strainboth">Both</label>&nbsp;</td>

</tr>
<tr>
<th align="right" valign="middle" class="bottom greenbg">Source:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="environmental"
name="source" type="radio" value="environ">
<label for="environmental">Uncultured&nbsp;</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="isolates"
name="source" type="radio" value="isolates">
<label for="isolates">Isolates</label></td>
<td class="bottom formtext" nowrap="nowrap"><input name="source"
type="radio" id="sourceboth" value="both" checked >
<label for="sourceboth">Both</label></td>
</tr>

<tr>
<th align="right" valign="middle" class="bottom greenbg">Size:</th>
<td class="bottom formtext" nowrap="nowrap"><input
id="greaterthan1200" name="size" type="radio" value="gt1200" checked>
<label for="greaterthan1200"><u>&gt;</u>1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="lessthan1200"
name="size" type="radio" value="lt1200">
<label for="lessthan1200">&lt;1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="sizeboth"
name="size" type="radio" value="both">
<label for="sizeboth">Both</label></td>
</tr>
<tr>

<th align="right" valign="middle" class="bottom
greenbg">Taxonomy:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="bergeys"
name="taxonomy" type="radio" value="rdpHome" checked>
<label for="bergeys">Bergey's</label></td>
<td colspan="2" class="bottom formtext" nowrap="nowrap"><input
id="ncbi" name="taxonomy" type="radio" value="ncbiHome">
<label for="ncbi">NCBI</label></td>
</tr>
</table>
</td>
<td align="left" valign="middle">&nbsp;&nbsp;&nbsp;
<input name="browse" type="submit" id="browse"
onclick="resetHiddenVar(); return true;" value="Browse">

</td></tr></table></p>
</div>
<!-- end options -->
</form>
----------end excerpted HTML--------------
The options I would like to simulate are browsing by strain=type,
source=both, size = gt1200, and taxonomy = bergeys. I see that the
form method is POST, and I read through the urllib documentation, and
saw that the syntax for POSTing is urllib.urlopen(url, data). Since
the submit button calls HierarchyControllerServlet/start (see the
Javascript), I figure that the url I should be contacting is
http://rdp.cme.msu.edu/hierarchy/Hie...rServlet/start
Thus, I came up with the following test code:

--------Python test code---------------
#!/usr/bin/python

import urllib

options = [("strain", "type"), ("source", "both"),
           ("size", "gt1200"), ("taxonomy", "bergeys"),
           ("browse", "Browse")]

params = urllib.urlencode(options)

rdpbrowsepage = urllib.urlopen(
    "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
    params)

pagehtml = rdpbrowsepage.read()

print pagehtml
---------end Python test code----------
However, the page that is returned is an error page that says the
request could not be completed. The correct page should show various
bacterial taxonomies, which are clickable to reveal greater detail of
that particular taxon.

I'm a bit stumped, and admittedly, I am in over my head on the subject
matter of networking and web-clients. Perhaps I should be using the
httplib module for connecting to the RDP instead, but I am unsure what
methods I need to use to do this. This is complicated by the fact that
these are JSP generated pages and I'm unsure what exactly the server
requires before giving up the desired page. For instance, there's a
jsessionid that's given and I'm unsure if this is required to access
pages, and if it is, how to place it in POST requests.
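
On the jsessionid point, here is a minimal sketch of one way a session cookie could be carried between requests. It assumes (the thread does not confirm this) that the server tracks the session with a cookie rather than only through the URL, and it is written for Python 3, where the urllib functions used in this thread live in urllib.request and urllib.parse. The helper names make_session_opener and browse are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# URL taken from the thread; whether a session cookie is actually
# required by this server is an assumption, not something confirmed here.
BASE_URL = "http://rdp.cme.msu.edu/hierarchy/"


def make_session_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def browse(opener, fields):
    """GET the form page first (to pick up any cookie), then POST the form."""
    opener.open(BASE_URL).close()
    data = urllib.parse.urlencode(fields).encode("ascii")
    target = BASE_URL + "HierarchyControllerServlet/start"
    with opener.open(target, data) as response:
        return response.read()
```

Nothing is sent until browse() is called. Any cookie the server sets on the first request is kept in the jar and returned automatically on the POST, so the session ID never has to be placed in the request by hand.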

If anyone has suggestions, I would greatly appreciate them. If any
more information is needed that I haven't provided, please let me know
and I'll be happy to give what I am able. Thanks very, very much in
advance.

Chris
Jul 18 '05 #1
1 Reply


[Chris Lasher]
I'm trying to write a tool to scrape through some of the Ribosomal
Database Project II's (http://rdp.cme.msu.edu/) pages, specifically,
through the Hierarchy Browser. (http://rdp.cme.msu.edu/hierarchy/)
I'm sure that urllib is the right tool to use. However, there may be one
or two problems with the way you're using it.
--------excerpted HTML----------------
<!-- snip -->
<form name="hierarchyForm" method="POST"
action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />
This field is missing from the params you are passing to the
HierarchyControllerServlet. Although the "printParams" field is not
visible to you in a browser, the browser still submits its name/value
pair with the form. So your code should submit it too, as shown below.
<input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
Also, you are using the wrong value for the taxonomy field. You are
setting a value of "bergeys", which is the ID of the field, not its
value. The correct value is "rdpHome".
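
One way to avoid mixing up id and value is to let a parser read the form and list what a browser would submit by default. The following is only a sketch, written for Python 3 with the standard html.parser module; the FormFieldParser class is a hypothetical helper, and SNIPPET is a shortened copy of a few inputs from the form quoted above, not the full page.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the name/value pairs a browser would submit by default."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name is None:
            return
        kind = (attr.get("type") or "text").lower()
        if kind in ("radio", "checkbox"):
            # Only checked radio buttons and checkboxes are submitted.
            if "checked" in attr:
                self.fields.append((name, attr.get("value", "on")))
        elif kind in ("hidden", "text"):
            self.fields.append((name, attr.get("value", "")))
        # Submit buttons are skipped: only the one actually clicked is sent.


# Shortened excerpt of the form quoted in this thread.
SNIPPET = """
<input type='hidden' name='printParams' value='no' />
<input id="type" name="strain" type="radio" value="type">
<input name="strain" type="radio" id="strainboth" value="both" checked>
<input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
<input id="ncbi" name="taxonomy" type="radio" value="ncbiHome">
"""

parser = FormFieldParser()
parser.feed(SNIPPET)
print(parser.fields)
# [('printParams', 'no'), ('strain', 'both'), ('taxonomy', 'rdpHome')]
```

The parser reads the value attribute, never the id, so the taxonomy field comes out as "rdpHome", and the hidden printParams field is picked up even though it is invisible in a browser.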
--------Python test code---------------
#!/usr/bin/python

import urllib

options = [("strain", "type"), ("source", "both"),
("size", "gt1200"), ("taxonomy", "bergeys"),
("browse", "Browse")]
Try this instead:

options = [("printParams", "no"), ("strain", "type"),
           ("source", "both"), ("size", "gt1200"),
           ("taxonomy", "rdpHome"), ("browse", "Browse")]

params = urllib.urlencode(options)

rdpbrowsepage = urllib.urlopen(
    "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
    params)

pagehtml = rdpbrowsepage.read()

print pagehtml
---------end Python test code----------
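
For readers on Python 3, where urllib.urlencode and urllib.urlopen have moved to urllib.parse and urllib.request, a rough equivalent of the corrected code might look like the sketch below. It builds the request and prints what would be sent without contacting the server; the fetch helper is hypothetical, and whether the server still answers at this URL is not something this thread can confirm.

```python
import urllib.parse
import urllib.request

# URL and field values are the ones given in the thread.
URL = "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start"

options = [
    ("printParams", "no"), ("strain", "type"),
    ("source", "both"), ("size", "gt1200"),
    ("taxonomy", "rdpHome"), ("browse", "Browse"),
]

# In Python 3 the POST body must be bytes, not str.
body = urllib.parse.urlencode(options).encode("ascii")

# Supplying data makes urllib issue a POST instead of a GET.
request = urllib.request.Request(URL, data=body)

print(request.get_method())
print(body.decode("ascii"))


def fetch(req):
    """Send the request and return the response body as bytes."""
    with urllib.request.urlopen(req) as response:
        return response.read()
```

Printing the encoded body before sending is a cheap way to check that every field the form declares, including the hidden one, is actually in the request.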


HTH,

--
alan kennedy
------------------------------------------------------
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #2
