Find and Delete Duplicate Rows in MySQL with GROUP BY and HAVING Clauses
Sometimes, duplicate records may appear in a MySQL database table, especially when importing data from external sources like spreadsheets that lack relational database functionalities. These duplicates can negatively impact your application’s performance and business logic. For example, if a customer is registered multiple times in an invoicing system, it could lead to complications in assigning credit limits.
To resolve this issue, you need to remove duplicate customer records from your database. In this guide, you’ll create a test_db database and a sample customers table. You’ll then populate the table with random customer details, including some duplicates. Finally, you’ll use GROUP BY and HAVING clauses in MySQL to identify and delete duplicate rows efficiently.
Prerequisites
To follow along with this guide, make sure you have the following:
- An Ubuntu 20.04 server.
- A sudo user.
- A LAMP Stack. To test this guide, you only need to install the MySQL database server.
- Create a test_db Database
- First, SSH to your server and run the command below to log in to your MySQL database server as root.
$ sudo mysql -u root -p
Run the command below to create the test_db database and enter the root password of your MySQL server .
mysql> CREATE DATABASE test_db;
Use the test_db database.
mysql> USE test_db;
Next, set up a customers table. In this table, you’ll uniquely identify the customers by referring to the customer_id column, which should be auto-populated since you’ll define it with a PRIMARY KEY statement. The table will then record the first_name, last_name and phone number of the customer.
Create the customers table.
mysql> CREATE TABLE customers ( customer_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), phone VARCHAR(15) ) ENGINE = InnoDB;
You can INSERT details of the same customer multiple times in the customers table, so make sure you populate the table with some records . As earlier indicated, you might be importing some records from an external data source that lacks the functionalities of a relational database.
Run the INSERT commands one by one to populate the table.
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘JOHN’, ‘DOE’, ‘111111’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘MARY’, ‘SMITH’, ‘222222’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘JACOB’, ‘JAMES’, ‘444444’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘STEVE’, ‘JACKES’, ‘888888’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘MARY’, ‘SMITH’, ‘222222’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘RIAN’, ‘WHITE’, ‘101010’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘APPLE’, ‘GRABRIEL’, ‘242424’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘RIAN’, ‘WHITE’, ‘101010’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘JACOB’, ‘JAMES’, ‘444444’);
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES (‘JOHN’, ‘DOE’, ‘111111’);
After each INSERT command, you should get the following output indicating that the record was inserted successfully.
Query OK, 1 row affected (0.01 sec)
Ensure the customer’s details are in place by running the following SELECT statement against the customers table.
mysql> SELECT customer_id, first_name, last_name, phone FROM customers;
Confirm the output below.
+-------------+------------+-----------+--------+ | customer_id | first_name | last_name | phone | +-------------+------------+-----------+--------+ | 1 | JOHN | DOE | 111111 | | 2 | MARY | SMITH | 222222 | | 3 | JACOB | JAMES | 444444 | | 4 | STEVE | JACKES | 888888 | | 5 | MARY | SMITH | 222222 | | 6 | RIAN | WHITE | 101010 | | 7 | APPLE | GRABRIEL | 242424 | | 8 | RIAN | WHITE | 101010 | | 9 | JACOB | JAMES | 444444 | | 10 | JOHN | DOE | 111111 | +-------------+------------+-----------+--------+ 10 rows in set (0.00 sec)
In the output above, you can observe that the records for JOHN DOE, MARY SMITH, RIAN WHITE, and JACOB JAMES appear more than once. While spotting duplicates is easy with only a few records, in a production setting with thousands or even millions of entries, manually detecting and removing them isn’t practical. The next section demonstrates how to tackle this issue with a single SQL command.
Identifying Duplicate Rows
In MySQL, you can identify duplicate rows by using the GROUP BY clause on a specific column and then applying the HAVING clause to filter groups that contain more than one record. For example, in your sample customers table, you can utilize the CONCAT function to merge the first_name, last_name, and phone fields into a single derived column. This derived column will serve as the grouping key. Then, by using the HAVING clause, you can select only those groups where the record count is greater than one. Your SQL query should look similar to the statement below.
mysql> SELECT customer_id, first_name, last_name, phone, COUNT(CONCAT(first_name, last_name, phone)) as total_count FROM customers GROUP BY CONCAT(first_name, last_name, phone) HAVING total_count > 1 ;
Execute the command above. Then, examine the output below. As you can see from the following list, the SQL command has retrieved all customers having more than one record.
+-------------+------------+-----------+--------+-------------+ | customer_id | first_name | last_name | phone | total_count | +-------------+------------+-----------+--------+-------------+ | 1 | JOHN | DOE | 111111 | 2 | | 2 | MARY | SMITH | 222222 | 2 | | 3 | JACOB | JAMES | 444444 | 2 | | 6 | RIAN | WHITE | 101010 | 2 | +-------------+------------+-----------+--------+-------------+ 4 rows in set (0.01 sec)
The next step is determining the rows that should remain when the duplicates are deleted. Use the command below to get the PRIMARY KEYs of these rows.
mysql> SELECT MIN(customer_id) FROM CUSTOMERS GROUP BY CONCAT(first_name, last_name, phone);
The MySQL MIN() function helps retrieve the first record in each group of customer entries when grouped by the concatenated column. If a customer’s details appear multiple times, the MIN() function ensures that only the PRIMARY KEY of the first occurrence in each group is selected, as shown in the example below.
+------------------+ | MIN(customer_id) | +------------------+ | 1 | | 2 | | 3 | | 4 | | 6 | | 7 | +------------------+ 6 rows in set (0.00 sec)
Drop the Duplicate Rows
Once you have identified the unique customer_id values, you can remove duplicate records from the customers table while keeping only the allowed entries. To achieve this, use the DELETE FROM customers statement, ensuring that only records not included in the allowlist are deleted using the NOT IN clause.
Important Consideration
In MySQL, modifying a table while simultaneously selecting from it in the same query is not allowed. To prevent errors, you must nest the SELECT MIN() statement one level deeper, as demonstrated in the query below.
mysql> DELETE FROM customers WHERE customer_id NOT IN ( SELECT customer_id FROM ( SELECT MIN(customer_id) as customer_id FROM CUSTOMERS GROUP BY CONCAT(first_name, last_name, phone) ) AS duplicate_customer_ids );
Once you execute the command above, the MySQL server should delete the 4 duplicate records as you can confirm from the output below.
Query OK, 4 rows affected (0.01 sec)
Ensure that your customers table doesn’t contain any more duplicates by executing the GROUP BY and HAVING statement one more time.
mysql> SELECT customer_id, first_name, last_name, phone, COUNT(CONCAT(first_name, last_name, phone)) as total_count FROM customers GROUP BY CONCAT(first_name, last_name, phone) HAVING total_count > 1 ;
This time around, you should get an Empty set since there are no duplicates.
Empty set (0.00 sec)
Query the customers table again.
mysql> SELECT customer_id, first_name, last_name, phone FROM customers;
Even by physically examining the list below, you can see that the duplicate records have been removed completely.
+-------------+------------+-----------+--------+ | customer_id | first_name | last_name | phone | +-------------+------------+-----------+--------+ | 1 | JOHN | DOE | 111111 | | 2 | MARY | SMITH | 222222 | | 3 | JACOB | JAMES | 444444 | | 4 | STEVE | JACKES | 888888 | | 6 | RIAN | WHITE | 101010 | | 7 | APPLE | GRABRIEL | 242424 | +-------------+------------+-----------+--------+ 6 rows in set (0.00 sec)
Conclusion
By following this guide, you’ve learned how to identify and remove duplicate records from a MySQL database using SQL queries. The GROUP BY and HAVING clauses help detect duplicates, while the MIN() function ensures only the earliest records remain. Finally, the DELETE query effectively removes the unwanted duplicates.
Cleaning up duplicate data is crucial for maintaining data integrity and optimizing database performance, especially when handling large datasets in production environments. Always remember to back up your database before executing mass DELETE commands to prevent accidental data loss.