472,119 Members | 1,665 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,119 software developers and data experts.

problem with grouping (?)

Hello to all,

I am quite new to sql and I don't understand why a query has a certain output.I apologise for the long post but I'd rather describe exactly what I am doing in, hoping that someone can spot the problem that I fail to.
I have created and populated two tables in my database that look like:

Table: binding_sites

id start_bindsite end_bindsite
1 10212250 10212257
2 10105334 10105368
3 101556063 101556073


Table:genome

gene_start gene_end gene_name
10212248 10312253 gene1
10212258 10322250 gene2
101556070 102556060 gene3

What I want is to find which gene from table genome is closer to each of the sites 1, 2, 3 in table binding_sites.

I have created a nested query that calculates the linear distances between the gene_start and both the start_bindsite and end_bindsite and then with the outer select I wish to choose the minimum distance. The problem is that when I group by the start_bindsite field the output is indeed for each binding site the minimum distance but the gene_name is always gene1 , whereas for the last two binding sites it is gene2 and gene3 respectively that should be presented. And when grouping by other fields or combination of other fields doesn't seem to help.

The query:

Expand|Select|Wrap|Line Numbers
  1. select   temp.start, temp.end, temp.gene_name, min(temp.distance) as minimum_distance
  2.           from (select  start_bindsite, end_bindsite, gene_name, abs(gene_start-start_bindsite) as distance
  3.                    from genome, binding_sites
  4.                    group by  gene_name, start_bindsite
  5.                    union 
  6.                   select  start_bindsite, end_bindsite, gene_name, abs(gene_start-end_bindsite) as distance
  7.                   from genome, binding_sites 
  8.                   group by  gene_name, start_bindsite) as temp
  9.         group by  temp.start
  10.  
  11.  
And the output
+-----------+-----------+-----------+------------------+
| start | end | gene_name | minimum_distance |
+------------+-----------+-----------+------------------+
| 10105334 | 10105368 | gene1 | 106880 |
| 10212250 | 10212257 | gene1 | 1 |
| 101556063 | 101556073 | gene1 | 3 |
+-----------+-----------+------------+------------------+
3 rows in set (0.00 sec)


Does anyone have any idea what am I doing wrong here? Any hint or suggestion would be hugely appreciated!!
Thanks for your time in advance,

Eva
Jul 12 '08 #1
2 1429
coolsti
310 100+
Without really getting into your particular query, I suspect the problem you have is similar to the problem statement of this thread:

http://bytes.com/forum/thread819577.html

and that the solution to your problem lies in my answer to this thread.

The problem is that your select query calculates the minimum distance as an output variable, and then you join your tables and do your group by. Your output before doing the group by then has the columns you have asked to be selected including this minimum distance. But when you then do the group by "id", MySQL has no way of knowing which of the rows for each "id" should be selected, and so in general will take the first row for each "id", in which case your gene name is wrong, although the minimum distance value is correct.

If I am correct in the above, there are two solutions to the problem. If you expect your output query to always only contain one row, you can do it in an easy way. Drop the "group by" clause, and just select what you wish, calculate your distance and give it some alias name, let us say "distance", and then add the clause "order by distance limit 1". This will not attempt to group by anything, but will order the rows in ascending order by the distance, and since you want the minimum distance, you limit the output to only the first row.

The above only works if your output will contain only one row. That is, you are limiting your query to handle only one "id" at a time. If you wish to have multiple rows in your output, say the row that corresponds to the minimum distance for many values of "id", then you need to use the technique I suggest in the thread I mention above.

Good luck!
Jul 13 '08 #2
Hello again,

Thank you so much for your reply, it was really helpgul.
I solved the problem after all by applying an inner join to the table described as temp and the table occuring from my whole previous query (only this time I had the same fields to the select and the group by ;-) ).
Thanks again.

Cheers,

Eva
Jul 13 '08 #3

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

16 posts views Thread by cody | last post: by
6 posts views Thread by Christoph | last post: by
3 posts views Thread by Steven Bethard | last post: by
10 posts views Thread by amitabh.mehra | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.