
Difficulties POSTing to RDP Hierarchy Browse Page

Hello,
I'm trying to write a tool to scrape through some of the Ribosomal
Database Project II's (http://rdp.cme.msu.edu/) pages, specifically,
through the Hierarchy Browser. (http://rdp.cme.msu.edu/hierarchy/)
The Hierarchy Browser is accessed first through a page with a form.
There are four fields with several options to be chosen from (Strain,
Source, Size, and Taxonomy) and then a submit button labeled "Browse".
The HTML of the form is as follows (note, I am also including the
Javascript code, as it is called by the submit button):

--------excerpted HTML----------------
<script language="Javascript">

function resetHiddenVar(){
var f_form = document.forms['hierarchyForm'];
f_form.action= "HierarchyControllerServlet/start";
return ;
}

</script>

<form name="hierarchyForm" method="POST"
action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />

<h1>Hierarchy Browser - Start</h1><div class="cart" style="float:
right">[&nbsp;<a href="hb_help.jsp">help</a>&nbsp;]</div>

<p>&nbsp;</p>
<div id="options">

<table summary="options area" cellpadding="0" cellspacing="0"
border="0"><tr><td align="left" valign="middle">
<table border="0" cellspacing="0" cellpadding="0" summary="Options"
align="left" class="borderup">
<tr>
<th align="right" valign="middle" class="bottom greenbg"
nowrap="nowrap">Strain:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="type"
name="strain" type="radio" value="type">
<label for="type">Type</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="nontype"
name="strain" type="radio" value="nontype">
<label for="nontype">Non Type</label>&nbsp;</td>
<td class="bottom formtext" nowrap="nowrap"><input name="strain"
type="radio" id="strainboth" value="both" checked>
<label for="strainboth">Both</label>&nbsp;</td>

</tr>
<tr>
<th align="right" valign="middle" class="bottom greenbg">Source:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="environmental"
name="source" type="radio" value="environ">
<label for="environmental">Uncultured&nbsp;</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="isolates"
name="source" type="radio" value="isolates">
<label for="isolates">Isolates</label></td>
<td class="bottom formtext" nowrap="nowrap"><input name="source"
type="radio" id="sourceboth" value="both" checked >
<label for="sourceboth">Both</label></td>
</tr>

<tr>
<th align="right" valign="middle" class="bottom greenbg">Size:</th>
<td class="bottom formtext" nowrap="nowrap"><input
id="greaterthan1200" name="size" type="radio" value="gt1200" checked>
<label for="greaterthan1200"><u>&gt;</u>1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="lessthan1200"
name="size" type="radio" value="lt1200">
<label for="lessthan1200">&lt;1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="sizeboth"
name="size" type="radio" value="both">
<label for="sizeboth">Both</label></td>
</tr>
<tr>

<th align="right" valign="middle" class="bottom
greenbg">Taxonomy:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="bergeys"
name="taxonomy" type="radio" value="rdpHome" checked>
<label for="bergeys">Bergey's</label></td>
<td colspan="2" class="bottom formtext" nowrap="nowrap"><input
id="ncbi" name="taxonomy" type="radio" value="ncbiHome">
<label for="ncbi">NCBI</label></td>
</tr>
</table>
</td>
<td align="left" valign="middle">&nbsp;&nbsp;&nbsp;
<input name="browse" type="submit" id="browse"
onclick="resetHiddenVar(); return true;" value="Browse">

</td></tr></table></p>
</div>
<!-- end options -->
</form>
----------end excerpted HTML--------------
The options I would like to simulate are browsing by strain=type,
source=both, size = gt1200, and taxonomy = bergeys. I see that the
form method is POST, and I read through the urllib documentation, and
saw that the syntax for POSTing is urllib.urlopen(url, data). Since
the submit button calls HierarchyControllerServlet/start (see the
Javascript), I figure that the url I should be contacting is
http://rdp.cme.msu.edu/hierarchy/Hie...rServlet/start
Thus, I came up with the following test code:

--------Python test code---------------
#!/usr/bin/python

import urllib

options = [("strain", "type"), ("source", "both"),
("size", "gt1200"), ("taxonomy", "bergeys"),
("browse", "Browse")]

params = urllib.urlencode(options)

rdpbrowsepage = urllib.urlopen(
"http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
params)

pagehtml = rdpbrowsepage.read()

print pagehtml
---------end Python test code----------
However, the page that is returned is an error page that says the
request could not be completed. The correct page should show various
bacterial taxonomies, which are clickable to reveal greater detail of
that particular taxon.

I'm a bit stumped, and admittedly, I am in over my head on the subject
matter of networking and web-clients. Perhaps I should be using the
httplib module for connecting to the RDP instead, but I am unsure what
methods I need to use to do this. This is complicated by the fact that
these are JSP generated pages and I'm unsure what exactly the server
requires before giving up the desired page. For instance, there's a
jsessionid that's given and I'm unsure if this is required to access
pages, and if it is, how to place it in POST requests.

If anyone has suggestions, I would greatly appreciate them. If any
more information is needed that I haven't provided, please let me know
and I'll be happy to give what I am able. Thanks very, very much in
advance.

Chris
Jul 18 '05 #1
[Chris Lasher]
> I'm trying to write a tool to scrape through some of the Ribosomal
> Database Project II's (http://rdp.cme.msu.edu/) pages, specifically,
> through the Hierarchy Browser. (http://rdp.cme.msu.edu/hierarchy/)

I'm sure that urllib is the right tool to use. However, there may be one
or two problems with the way you're using it.
> --------excerpted HTML----------------
> <!-- snip -->
> <form name="hierarchyForm" method="POST"
> action="HierarchyControllerServlet/start/">
> <input type='hidden' name='printParams' value='no' />

This field is missing from the params you are passing to the
HierarchyControllerServlet. Although the "printParams" field is not
visible to you in a browser, the browser still submits its name/value
pair with the form. So your code should submit it too, as shown in the
corrected options list below.
> <input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>

Also, you are using the wrong value for the taxonomy field. You are
setting a value of "bergeys", which is the ID of the field, not its
value. The correct value is "rdpHome".
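
As a rough illustration of the two points above (hidden fields are submitted, and it is the "value" attribute rather than the "id" that gets sent), here is a minimal Python 3 sketch that lists the name/value pairs a browser would send by default. The class name FormFieldCollector, the helper default_fields, and the trimmed SAMPLE markup are made up for this example; they are not part of the RDP site or of urllib. It only looks at hidden inputs and pre-checked radio buttons, which is all this particular form uses.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs for hidden inputs and checked radios.

    Hypothetical helper for illustration only. It ignores the submit
    button, which a browser adds itself when that button is clicked.
    """

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        kind = (attr.get("type") or "text").lower()
        if not name:
            return
        if kind == "hidden":
            # Hidden fields are always submitted, visible or not.
            self.fields.append((name, attr.get("value", "")))
        elif kind == "radio" and "checked" in attr:
            # The submitted string is the "value" attribute, never the "id".
            self.fields.append((name, attr.get("value", "on")))


# Trimmed copy of the form quoted in this thread (assumed representative).
SAMPLE = """
<form name="hierarchyForm" method="POST"
 action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />
<input id="type" name="strain" type="radio" value="type">
<input name="strain" type="radio" id="strainboth" value="both" checked>
<input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
<input id="ncbi" name="taxonomy" type="radio" value="ncbiHome">
</form>
"""


def default_fields(html):
    """Return the default name/value pairs found in the given HTML."""
    parser = FormFieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Calling default_fields(SAMPLE) gives the printParams pair plus the pre-selected strain and taxonomy values, so it is a quick way to spot a field or value that a hand-written options list has missed.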
> --------Python test code---------------
> #!/usr/bin/python
>
> import urllib
>
> options = [("strain", "type"), ("source", "both"),
> ("size", "gt1200"), ("taxonomy", "bergeys"),
> ("browse", "Browse")]

Try this:

options = [ ("printParams", "no"), ("strain", "type"),
("source", "both"), ("size", "gt1200"),
("taxonomy", "rdpHome"), ("browse", "Browse"),]

params = urllib.urlencode(options)

rdpbrowsepage = urllib.urlopen(
"http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
params)

pagehtml = rdpbrowsepage.read()

print pagehtml
---------end Python test code----------
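
For anyone on Python 3, where urllib.urlopen and urllib.urlencode have moved, the same request might look roughly like the sketch below. It also touches on the jsessionid question: it assumes the server hands out the session as a cookie when the form page is first fetched, which I have not verified (a servlet container can also put the session id in the URL instead). The names PAGE_URL, FORM_ACTION, OPTIONS, build_request and fetch are my own, and the UTF-8 decoding is a guess at the page encoding.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

PAGE_URL = "http://rdp.cme.msu.edu/hierarchy/"
# The action that resetHiddenVar() sets in the page's JavaScript.
FORM_ACTION = "HierarchyControllerServlet/start"

OPTIONS = [
    ("printParams", "no"), ("strain", "type"),
    ("source", "both"), ("size", "gt1200"),
    ("taxonomy", "rdpHome"), ("browse", "Browse"),
]


def build_request(page_url=PAGE_URL, action=FORM_ACTION, options=OPTIONS):
    """Build the POST request without sending it."""
    # Resolve the relative form action against the page that holds the form.
    url = urllib.parse.urljoin(page_url, action)
    # In Python 3 the POST body must be bytes, not str.
    data = urllib.parse.urlencode(options).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=data)


def fetch(request, page_url=PAGE_URL):
    """Send the request, keeping any cookies the server sets on the way."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Assumption: loading the form page first lets the server set a
    # session cookie, which the opener then replays on the POST.
    opener.open(page_url).close()
    with opener.open(request) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like pagehtml = fetch(build_request()). Splitting the two steps means you can inspect the URL and the encoded body before anything goes over the wire, which helps when the server only answers with a generic error page.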


HTH,

--
alan kennedy
------------------------------------------------------
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #2
