Web Development: Compressing and Extracting Files in Python

If you have been using computers for some time, you have probably come across files with the .zip extension. They are special files that can hold the compressed content of many other files, folders, and subfolders. This makes them pretty useful for transferring files over the internet. Did you know that you can use Python to compress or extract files?

This tutorial will teach you how to use Python's zipfile module to compress and extract files, either one at a time or several at once.

Compressing Individual Files

This one is easy and requires very little code. We begin by importing the zipfile module and then create a ZipFile object in write mode by passing 'w' as the second parameter. The first parameter is the path to the zip archive itself. Keep in mind that write mode replaces any existing archive at that path. Here is the code that you need:

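import zipfile

# Create the archive in write mode; an existing jungle.zip would be overwritten.
jungle_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\jungle.zip', 'w')
# Add the PDF to the archive, compressed with the DEFLATED method.
jungle_zip.write('C:\\Stories\\Fantasy\\jungle.pdf', compress_type=zipfile.ZIP_DEFLATED)

# Closing the archive finishes writing it to disk.
jungle_zip.close()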
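
If you would rather not call close() yourself, a ZipFile object can also be used in a with statement. The sketch below assumes a temporary scratch folder and a made-up text file (neither is part of the original example) so that it runs on any system:

```python
import os
import tempfile
import zipfile

# Hypothetical scratch folder and file names, used only so the sketch runs anywhere.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, 'jungle.txt')
archive_path = os.path.join(work_dir, 'jungle.zip')

with open(source_path, 'w') as source:
    source.write('Once upon a time in the jungle...\n' * 100)

# The with statement closes the archive on exit, even if write() raises.
with zipfile.ZipFile(archive_path, 'w') as jungle_zip:
    jungle_zip.write(source_path, arcname='jungle.txt',
                     compress_type=zipfile.ZIP_DEFLATED)

print(zipfile.is_zipfile(archive_path))
```

The remaining snippets in this tutorial keep the explicit close() call, but either style should work.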

Please note that I will write the paths in all the code snippets in Windows style; you will need to make appropriate changes if you are on Linux or macOS.

You can choose between several compression methods. The newer BZIP2 and LZMA methods were added in Python 3.3, and some other zip tools cannot read archives that use them. For this reason, DEFLATED is the safe choice. Note that if you don't specify a method at all, zipfile uses ZIP_STORED, which stores files without compressing them. You should still try out the other methods to compare the size of the resulting archives.
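
One way to run that comparison is sketched below. It builds each archive in memory with io.BytesIO and writestr(), and sets the method through the compression parameter of ZipFile; none of these are used elsewhere in this tutorial, and the sample data is made up, so treat the numbers as an illustration only:

```python
import io
import zipfile

# Made-up, highly repetitive sample data; real files will compress differently.
data = b'The quick brown fox jumps over the lazy dog. ' * 2000

methods = {
    'ZIP_STORED': zipfile.ZIP_STORED,
    'ZIP_DEFLATED': zipfile.ZIP_DEFLATED,
    'ZIP_BZIP2': zipfile.ZIP_BZIP2,
    'ZIP_LZMA': zipfile.ZIP_LZMA,
}

sizes = {}
for name, method in methods.items():
    buffer = io.BytesIO()  # build the archive in memory instead of on disk
    try:
        with zipfile.ZipFile(buffer, 'w', compression=method) as archive:
            archive.writestr('sample.txt', data)
    except RuntimeError:
        # Raised when the matching module (zlib, bz2 or lzma) is unavailable.
        continue
    sizes[name] = len(buffer.getvalue())

for name, size in sizes.items():
    print(name, size)
```

On repetitive text like this, every compressed archive should come out far smaller than the ZIP_STORED one; on files that are already compressed, the difference is usually much smaller.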

Compressing Multiple Files

This is slightly more complex, as you need to iterate over all the files. The code below compresses every file with the .pdf extension in a given folder and its subfolders:

import os
import zipfile

source_folder = 'C:\\Stories\\Fantasy'
fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w')

for folder, subfolders, files in os.walk(source_folder):
    for file in files:
        if file.endswith('.pdf'):
            full_path = os.path.join(folder, file)
            # Store each file under its path relative to the source folder.
            fantasy_zip.write(full_path,
                              os.path.relpath(full_path, source_folder),
                              compress_type=zipfile.ZIP_DEFLATED)

fantasy_zip.close()

This time, we have imported the os module and used its walk() function to go over all the files and subfolders inside our original folder. I am only compressing the PDF files in the directory. You can also create a separate archive for each file format using if statements.
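
As a rough sketch of the one-archive-per-format idea, the code below keeps one open ZipFile per extension in a dictionary instead of writing a chain of if statements. The temporary folders, the placeholder files, and the pdf_files.zip naming scheme are all assumptions made for illustration:

```python
import os
import tempfile
import zipfile

# Hypothetical folders with a few empty placeholder files.
source_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
for name in ('jungle.pdf', 'desert.pdf', 'notes.txt', 'cover.jpg'):
    open(os.path.join(source_dir, name), 'w').close()

archives = {}  # maps an extension such as '.pdf' to its open ZipFile
for folder, subfolders, files in os.walk(source_dir):
    for file in files:
        extension = os.path.splitext(file)[1].lower()
        if extension not in ('.pdf', '.txt'):
            continue  # skip formats we are not archiving
        if extension not in archives:
            archive_path = os.path.join(output_dir, extension[1:] + '_files.zip')
            archives[extension] = zipfile.ZipFile(archive_path, 'w')
        full_path = os.path.join(folder, file)
        archives[extension].write(full_path,
                                  os.path.relpath(full_path, source_dir),
                                  compress_type=zipfile.ZIP_DEFLATED)

for archive in archives.values():
    archive.close()
```

The archives are written to a separate output folder here so that they are not picked up by os.walk() while it is still running.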

If you don't want to preserve the directory structure, you can put all the files together by using the following line:

fantasy_zip.write(os.path.join(folder, file), file, compress_type = zipfile.ZIP_DEFLATED)

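The write() method accepts several parameters, and we have used three of them here. The first is the path of the file that you want to compress. The second is optional and sets the name under which the file is stored inside the archive; if it is omitted, the original path is used, without a drive letter or leading path separators. The third, compress_type, is also optional and overrides the compression method chosen when the archive was opened.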
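
To see what the second parameter does, here is a small sketch that adds the same file twice under different names and then lists the archive. The scratch folder and file names are made up for the example:

```python
import os
import tempfile
import zipfile

# Hypothetical nested folder containing one small file.
work_dir = tempfile.mkdtemp()
nested_dir = os.path.join(work_dir, 'Fantasy', 'Drafts')
os.makedirs(nested_dir)
source_path = os.path.join(nested_dir, 'jungle.txt')
with open(source_path, 'w') as source:
    source.write('draft')

archive_path = os.path.join(work_dir, 'archive.zip')
with zipfile.ZipFile(archive_path, 'w') as archive:
    # Stored under a path relative to work_dir, so the folders are kept.
    archive.write(source_path, os.path.relpath(source_path, work_dir))
    # Stored under a new, flat name chosen by us.
    archive.write(source_path, 'renamed.txt')
    names = archive.namelist()

print(names)
```

Names inside the archive always use forward slashes, even when the archive is created on Windows.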

Extracting All Files

You can use the extractall() method to extract all the files and folders from a zip file into the current working directory. You can also pass a folder name to extractall() to extract everything into a specific directory. If the folder that you passed does not exist, this method will create it for you. Here is the code that you can use to extract files:

import zipfile
        
fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip')
fantasy_zip.extractall('C:\\Library\\Stories\\Fantasy')

fantasy_zip.close()

If you only want to extract some of the files, pass their names as a list through the members parameter of extractall(). Each name must match an entry returned by namelist().
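
The following sketch shows a partial extraction into a folder that does not exist yet. The archive is created on the fly with writestr() in a temporary folder, and all of the file names are placeholders:

```python
import os
import tempfile
import zipfile

# Build a small hypothetical archive with three members.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, 'archive.zip')
with zipfile.ZipFile(archive_path, 'w') as archive:
    for name in ('jungle.txt', 'desert.txt', 'ocean.txt'):
        archive.writestr(name, 'story text')

target_dir = os.path.join(work_dir, 'Library', 'Fantasy')  # does not exist yet
with zipfile.ZipFile(archive_path) as archive:
    # Only the listed members are extracted; target_dir is created as needed.
    archive.extractall(target_dir, members=['jungle.txt', 'ocean.txt'])

extracted = sorted(os.listdir(target_dir))
print(extracted)
```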

Extracting Individual Files

This is similar to extracting multiple files. One difference is that this time you pass the name of the file as the first argument and the folder to extract it to as the second. Also, you need to use the extract() method instead of extractall(). The name must match the full name of the entry inside the archive. Here is a basic code snippet to extract an individual file:

import zipfile
        
fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip')
fantasy_zip.extract('Fantasy Jungle.pdf', 'C:\\Stories\\Fantasy')

fantasy_zip.close()
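
Two details of extract() are worth knowing: it returns the path of the file it created, and it raises a KeyError when the archive has no member with the given name. The sketch below demonstrates both with a throwaway archive; the file names are invented for the example:

```python
import os
import tempfile
import zipfile

# Build a small hypothetical archive with a single member.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, 'archive.zip')
with zipfile.ZipFile(archive_path, 'w') as archive:
    archive.writestr('Fantasy Jungle.txt', 'story text')

with zipfile.ZipFile(archive_path) as archive:
    # extract() returns the path of the file it created.
    extracted_path = archive.extract('Fantasy Jungle.txt', work_dir)
    try:
        archive.extract('Missing Story.txt', work_dir)
        missing_found = True
    except KeyError:
        # Raised when the archive has no member with that name.
        missing_found = False

print(extracted_path, missing_found)
```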

Reading Zip Files

Consider a scenario where you need to see if a zip archive contains a specific file. With what we have covered so far, your only option would be to extract all the files in the archive. Similarly, you may need to extract only those files which are larger than a specific size. The zipfile module allows us to inquire about the contents of an archive without ever extracting it.

Using the namelist() method of the ZipFile object will return a list of all members of an archive by name. To get information on a specific file in the archive, you can use the getinfo() method of the ZipFile object. This will give you access to information specific to that file, like the compressed and uncompressed size of the file or its last modification time. We will come back to that later.

Calling the getinfo() method one by one on all files can be a tiresome process when there are a lot of files that need to be processed. In this case, you can use the infolist() method to return a list containing a ZipInfo object for every single member in the archive. The objects appear in the same order as the entries in the actual zip file.
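
Here is a short sketch of these inspection methods. It builds a tiny in-memory archive with io.BytesIO and writestr() (both are assumptions made to keep the example self-contained), then checks for a file by name and collects the uncompressed size of every member:

```python
import io
import zipfile

# Build a small hypothetical archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('The Thirsty Crow.txt', 'caw ' * 500)
    archive.writestr('The Fox.txt', 'short')

with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    # Check for a member without extracting anything.
    has_crow = 'The Thirsty Crow.txt' in names
    # One ZipInfo object per member, in archive order.
    sizes = {info.filename: info.file_size for info in archive.infolist()}

print(names, has_crow, sizes)
```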

You can also read the contents of a specific file directly from the archive using the read(file) method, where file is the name of the file that you intend to read. The contents are returned as bytes. To do this, the archive must be opened in read or append mode.
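
A minimal sketch of read() follows. The in-memory archive and the moral.txt member are made up, and the decoding step assumes the file holds UTF-8 text:

```python
import io
import zipfile

# Build a small hypothetical archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('moral.txt', 'Slow and steady wins the race.')

with zipfile.ZipFile(buffer) as archive:
    raw = archive.read('moral.txt')  # the member's contents as bytes
    text = raw.decode('utf-8')       # assumes the member is UTF-8 text

print(text)
```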

To get the compressed size of an individual file from the archive, you can use the compress_size attribute. Similarly, to know the uncompressed size, you can use the file_size attribute.

The following code uses the attributes and methods we just discussed to extract only those files that are smaller than 1 MB.

import zipfile

stories_zip = zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip')

for file in stories_zip.namelist():
    if stories_zip.getinfo(file).file_size < 1024*1024:
        stories_zip.extract(file, 'C:\\Stories\\Short\\Funny')

stories_zip.close()

To know the time and date when a specific file from the archive was last modified, you can use the date_time attribute. This will return a tuple of six values. The values will be the year, month, day of the month, hours, minutes, and seconds, in that specific order. The year will always be greater than or equal to 1980, and hours, minutes, and seconds are zero-based.

import zipfile

stories_zip = zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip')

thirsty_crow_info = stories_zip.getinfo('The Thirsty Crow.pdf')

print(thirsty_crow_info.date_time)
print(thirsty_crow_info.compress_size)
print(thirsty_crow_info.file_size)
        
stories_zip.close()
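
Since date_time is a plain tuple, one convenient option is to unpack it into a datetime object for formatting or comparisons. The sketch below assumes an in-memory archive whose single member is given a made-up timestamp through a ZipInfo object:

```python
import io
import zipfile
from datetime import datetime

# Build a hypothetical archive whose member has a known, made-up timestamp.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    info = zipfile.ZipInfo('The Thirsty Crow.txt',
                           date_time=(2016, 12, 19, 10, 30, 0))
    archive.writestr(info, 'story text')

with zipfile.ZipFile(buffer) as archive:
    crow_info = archive.getinfo('The Thirsty Crow.txt')
    # Unpack (year, month, day, hours, minutes, seconds) into a datetime.
    modified = datetime(*crow_info.date_time)

print(modified.strftime('%Y-%m-%d %H:%M:%S'))
```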

This information about the original and compressed file sizes can help you decide whether it is worth compressing a file. Files that are already compressed, such as JPEG images, often shrink very little.
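
As one possible way to put the two sizes to use, the sketch below computes the fraction of space saved for each member. The compression_savings() helper is hypothetical (it is not part of zipfile), and the archive contents are made up:

```python
import io
import zipfile

def compression_savings(info):
    """Return the fraction of space saved for one ZipInfo entry (0.0 for empty files)."""
    if info.file_size == 0:
        return 0.0
    return 1 - info.compress_size / info.file_size

# Build a small hypothetical archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('repetitive.txt', 'la ' * 5000)
    archive.writestr('empty.txt', '')

with zipfile.ZipFile(buffer) as archive:
    savings = {info.filename: compression_savings(info)
               for info in archive.infolist()}

for name, saved in savings.items():
    print(name, round(saved * 100, 1), '% saved')
```

A value close to 1 means the file compressed very well, while a value near 0 suggests that compressing it was not worth the effort.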

Final Thoughts

As evident from this tutorial, using the zipfile module to compress files gives you a lot of flexibility. You can compress different files in a directory to different archives based on their type, name, or size. You also get to decide whether you want to preserve the directory structure or not. Similarly, while extracting the files, you can extract them to the location you want, based on your own criteria like size, etc.

To be honest, it was also pretty exciting for me to compress and extract files by writing my own code. I hope you enjoyed the tutorial, and if you have any questions, please let me know in the comments.

Monday, December 19, 2016