Saturday, June 26, 2010

My sql -- finding duplicates in a column

Here is how to find all entries that have duplicates in a mysql database


select email, count(email) from emails group by email having count(email) > 1;

So what is happening. To me it appears that the SQL boys have decided that selects can act in two independent ways:
1) "normal mode" - give me all the results
2) "aggregate mode" - give me groups of results represented by the first group member

It appears to me that this "aggregate mode" is triggered if either the "group by" or an aggregate function is present in the statement.

To be more specific;
1) When the "GROUP BY" is not present but an aggregate function is used then it is assumed that the data is one large group and only a single row is returned. This row appears to have the data from the first row of the normal select and the results of any aggregate function applied access the entire normal select results.
2) However if the "GROUP BY" is present then the functions operate on the individual groups. The resulting row(s) will use the normal selects first row for the normal data in the group and will apply any functions across the entire group.

This statement here will demonstrate the second rule in action more clearly.
select email, count(email), num, sum(num), avg(num), min(num), max(num) from emails group by email;

Here is the full test sql which brought me to my conclusions;
create table emails (email text, num integer);
describe emails;
insert into emails values("a@a.com", 1);
insert into emails values("b@a.com", 2);
insert into emails values("b@a.com", 3);
insert into emails values("c@a.com", 4);
insert into emails values("d@a.com", 5);
insert into emails values("d@a.com", 6);
insert into emails values("d@a.com", 7);
insert into emails values("b@a.com", 8);
select * from emails;
select distinct email from emails;
select email from emails group by email;
select email, count(email) from emails;
select email, count(email), num, sum(num), avg(num), min(num), max(num) from emails group by email;
select email, count(email) from emails group by email having count(email) > 1;

There are several functions that operate on the "group"
sum(), count(), avg(), min(), max(), etc

Refer here for more info;
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html

No comments:

Post a Comment