Removing duplicate files in a given directory
The following code is a Python script that removes duplicate files in a given directory. At first I considered the most basic approach: relying on identical names, but you might have 2 different files with the same name in 2 different directories. So I decided to rely on the md5 checksum instead, since any 2 files that yield the same md5 checksum almost invariably have the same content.
#A simple Python script to remove duplicate files. Coded by MCoury AKA python-scripter
import hashlib
import os

#define a function to calculate md5checksum for a given file:
def md5(f):
    """takes one file f as an argument and generates an md5checksum for that file"""
    return hashlib.md5(open(f, 'rb').read()).hexdigest()

#define our main function:
def rm_dup(path):
    """relies on the md5 function above to remove duplicate files"""
    if not os.path.isdir(path):  #make sure the given directory exists
        print('specified directory does not exist!')
    else:
        md5_dict = {}
        for root, dirs, files in os.walk(path):  #os.walk allows checking subdirectories too
            for f in files:
                checksum = md5(os.path.join(root, f))
                if checksum not in md5_dict:
                    md5_dict[checksum] = [os.path.join(root, f)]
                else:
                    md5_dict[checksum].append(os.path.join(root, f))
        #for every checksum, keep one file and delete the remaining copies
        for key in md5_dict:
            while len(md5_dict[key]) > 1:
                item = md5_dict[key].pop()
                os.remove(item)
        print('Done!')

if __name__ == '__main__':
    print('=======A simple Python script to remove duplicate files===========')
    print()
    print('============Coded by MCoury AKA python-scripter===================')
    print()
    print('===========The script counts on the fact that=====================')
    print('=========if 2 files have the same md5checksum=====================')
    print('==========they most likely have the same content==================')
    print()
    path = input(r'Please provide the target path\directory. for example: c: or c:\directory. ')
    print()
    rm_dup(path)
Ever since discovering the Zen of Python I have been obsessed with using the fewest possible lines of code. I also have another (and perhaps more serious) concern: calculating an md5 checksum for a large file takes precious memory real estate, right? Could that limit the usefulness of the script? Also, what do you think of the implementation?
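(One way to address the memory concern, sketched here rather than taken from the question: hashlib lets you feed md5 the file in fixed-size chunks via update(), so only one block is held in memory at a time. The 65536-byte block size below is an arbitrary choice.)

import hashlib

def md5_chunked(path, blocksize=65536):
    """Sketch: hash a file block by block so large files never sit in memory whole."""
    hasher = hashlib.md5()
    with open(path, 'rb') as fh:
        chunk = fh.read(blocksize)
        while chunk:
            hasher.update(chunk)
            chunk = fh.read(blocksize)
    return hasher.hexdigest()

This is essentially the approach the article below takes in its hashFile() function.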
Remove Duplicate Files | Python
In this article, we are going to create a Python script that removes duplicate files.
Introduction:
This Python script removes duplicate files in the directory where the script runs. It first lists all the files in the directory and checks whether the content of each file already exists under another name; if it does, it deletes the duplicate file.
A duplicate file is a copy of a file on your computer that may be stored in the same folder or in another folder.
Duplicate files have absolutely identical content, size, and extensions but might have different file names.
Project Prerequisites:
There are no external libraries needed to run this simple Python script; both modules used below are part of the standard library.
The os module in Python provides functions for creating and removing a directory (folder), fetching its contents, and changing and identifying the current directory. With it, many operating system tasks can be performed automatically.
You first need to import the os module to interact with the underlying operating system. So, import it using the import os statement before using its functions.
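For example, a few of the os functions the script relies on can be tried directly (the file name 'a.txt' here is only an illustrative placeholder):

import os

print(os.getcwd())               # the current working directory
print(os.listdir())              # names of the entries in that directory
print(os.path.isfile('a.txt'))   # True only if 'a.txt' exists and is a regular file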
To know more about this module, refer to: https://www.tutorialsteacher.com/python/os-module
A hashlib hashing function takes a variable-length sequence of bytes and converts it into a fixed-length digest. The Python hashlib module is an interface for hashing messages easily; it provides numerous methods that can turn any raw message into a fixed-size digest.
This module implements a common interface to many different secure hash and message digest algorithms. Included are the FIPS secure hash algorithms SHA1, SHA224, SHA256, SHA384, and SHA512 (defined in FIPS 180-2) as well as RSA’s MD5 algorithm (defined in Internet RFC 1321). The terms “secure hash” and “message digest” are interchangeable. Older algorithms were called message digests. The modern term is secure hash.
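As a quick standalone illustration (not part of the script itself), hashing the same bytes always produces the same hex digest:

import hashlib

h = hashlib.md5()
h.update(b'hello world')                                          # feed bytes to the hash object
print(h.hexdigest())                                              # 32-character hex digest
print(hashlib.md5(b'hello world').hexdigest() == h.hexdigest())   # True: same content, same digest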
Code Implementation:
Firstly, we have to import the os and hashlib modules. Without these two modules we cannot run this program, because they help in walking directories and converting file data into hash values.
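These two imports appear at the top of the script:

import hashlib
import os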
Next, we have to define a function that reads the file one block at a time and returns its hash.
def: a function is a logical unit of code containing a sequence of statements indented under a name given using the “def” keyword.
Basically, what this function hashFile() does is return the hash string for the given file name; we fix BLOCKSIZE at 65536 bytes so that reading the file does not lead to memory overflow in the case of large files.
# Returns the hash string of the given file name
def hashFile(filename):
    # For large files, if we read it all together it can lead to memory overflow,
    # so we read one blocksize at a time
    BLOCKSIZE = 65536
    hasher = hashlib.md5()
    with open(filename, 'rb') as file:
        # Reads the particular blocksize from file
        buf = file.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            buf = file.read(BLOCKSIZE)
    return hasher.hexdigest()
These are some of the methods we use in the following code:
os.path.isfile() this method is used to check whether the specified path is an existing regular file or not.
os.listdir() this method is used to get the list of all files and directories in the specified directory.
append() this method is used to append a passed object to the end of an existing list.
In the following block of code, we build the list of deleted files and keep track of existing files using conditional statements and looping statements.
os.remove() this method removes (deletes) the file at the specified path.
if __name__ == "__main__":
    # Dictionary to store the hash and filename
    hashMap = {}
    # List to store deleted files
    deletedFiles = []
    filelist = [f for f in os.listdir() if os.path.isfile(f)]
    for f in filelist:
        key = hashFile(f)
        # If the hash already exists, the file is a duplicate and gets deleted
        if key in hashMap.keys():
            deletedFiles.append(f)
            os.remove(f)
        else:
            hashMap[key] = f
    if len(deletedFiles) != 0:
        print('Deleted Files')
        for i in deletedFiles:
            print(i)
    else:
        print('No duplicate files found')
If we run the program successfully, it tells us which files are duplicates and immediately removes them from our system. If there are no duplicates, we get the output "No duplicate files found".
Source code:
Here is the complete source code of our project.
You can copy and run this on your machine.
import hashlib
import os

# Returns the hash string of the given file name
def hashFile(filename):
    # For large files, if we read it all together it can lead to memory overflow,
    # so we read one blocksize at a time
    BLOCKSIZE = 65536
    hasher = hashlib.md5()
    with open(filename, 'rb') as file:
        # Reads the particular blocksize from file
        buf = file.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            buf = file.read(BLOCKSIZE)
    return hasher.hexdigest()

if __name__ == "__main__":
    # Dictionary to store the hash and filename
    hashMap = {}
    # List to store deleted files
    deletedFiles = []
    filelist = [f for f in os.listdir() if os.path.isfile(f)]
    for f in filelist:
        key = hashFile(f)
        # If the hash already exists, the file is a duplicate and gets deleted
        if key in hashMap.keys():
            deletedFiles.append(f)
            os.remove(f)
        else:
            hashMap[key] = f
    if len(deletedFiles) != 0:
        print('Deleted Files')
        for i in deletedFiles:
            print(i)
    else:
        print('No duplicate files found')
We have completed the coding part; now we have to run the program to see the output.
Output:
We can run this from the command prompt as follows:
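For example, assuming the script is saved as duplicate_remover.py (the file name is only an assumption), it can be run with:

python duplicate_remover.py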
This is the output on my system, which means there are no duplicates in the current directory.
Let us also try creating a duplicate file in the same directory.
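For instance, on Ubuntu a duplicate can be created by copying an existing file under a new name (the file names here are only examples):

cp notes.txt notes_copy.txt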
Here is the output on Ubuntu. (Windows does not allow two files with the same name in the same folder, so the duplicate there has to be created under a different name.)
Congratulations! You have now made a simple Python script to remove duplicate files from your system.