Wednesday, February 8, 2017

Working with Tar Files in Python

1. Introduction
"Tar" is an archiving format that has become rather popular in the opensource world. In essence, it takes several files and bundles them into onefile. Originally, the tar format was made for tape archives, hence the name;today it is often used for distributing source code or for making backups ofdata. Most Linux distributions have tools in the standard installation forcreating and unpacking tar files.Python's standard library comes with a module which makes creating andextracting tar files very simple. Examples of when individuals might want suchfunctionality include programming a custom backup script or a script to createa snapshot of other personal projects

2. Tutorial
This is a basic tutorial designed to teach three things: how to add files to an archive, how to retrieve information on files in the archive, and how to extract files from the archive.

Adding Files
To begin, import the tarfile module. Then create what is called a "TarFile Object". This is an object with methods for interacting with the tar file. In this case, we are opening the file "archive.tar.gz". Note that the mode is "w:gz", which opens the file for writing with gzip compression. As usual, "w" does not preserve the previous contents of the file. If the tar file already exists, use "a" to append files to the end of the archive (n.b.: you cannot append to a compressed archive; there is no such mode as "a:gz").

Create a TarFile Object
>>> import tarfile
>>> tar = tarfile.open("archive.tar.gz", "w:gz")
>>> tar
<tarfile.TarFile object at 0x2af77c060990>

Adding files to the archive is very simple. If you want the file to have a different name in the archive, use the arcname option.

Adding a File to the Archive
>>> tar.add("file.txt")
>>> tar.add("file.txt", arcname="new.txt")

Adding directories works in the same way. Note that by default a directory will be added recursively: every file and folder under it will be included. This behavior can be changed by setting recursive to False.

Adding a Directory to the Archive
>>> tar.add("docs/")
>>> tar.add("financial/", recursive=False)
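
Recursion can also be narrowed rather than switched off. The following is a minimal sketch, assuming you want to skip files with a particular extension while adding a directory: add() accepts a filter callable that receives each TarInfo and returns it to keep the member or None to drop it. The directory, file names, and the skip_tmp helper are made up for illustration.

```python
import os
import tarfile
import tempfile

# Build a throwaway directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
docs = os.path.join(workdir, "docs")
os.mkdir(docs)
for name in ("notes.txt", "draft.tmp"):
    with open(os.path.join(docs, name), "w") as handle:
        handle.write("example\n")


def skip_tmp(tarinfo):
    """Hypothetical filter: return None to exclude a member, else keep it."""
    if tarinfo.name.endswith(".tmp"):
        return None
    return tarinfo


archive_path = os.path.join(workdir, "docs.tar.gz")
# The with block closes the archive when it ends.
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(docs, arcname="docs", filter=skip_tmp)

# Reopen the archive to see which members were stored.
with tarfile.open(archive_path, "r:gz") as tar:
    stored = sorted(tar.getnames())

print(stored)
```

The directory entry itself is still stored; only members whose names end in ".tmp" are left out.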

As with normal file objects, always be sure to close a TarFile Object.

Close the TarFile Object
>>> tar.close()
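
A TarFile object can also be used as a context manager, which closes the archive automatically even if an error occurs while adding files. A minimal sketch, with a temporary directory and file names chosen only for illustration:

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file.txt")
with open(source, "w") as handle:
    handle.write("hello\n")

archive_path = os.path.join(workdir, "archive.tar.gz")

# The with block closes the archive on exit, even if add() raises.
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(source, arcname="file.txt")

# Reopen for reading to confirm what was stored.
with tarfile.open(archive_path, "r:gz") as tar:
    names = tar.getnames()

print(names)
```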

File Information
The tarfile module includes the ability to retrieve information about the individual contents of a tar file. Each item is accessed as a "TarInfo Object". For example, getmembers() will return a list of all TarInfo objects in a tar file:

Listing TarInfo Objects
>>> import tarfile
>>> tar = tarfile.open("archive.tar.gz", "r:gz")
>>> members = tar.getmembers()
>>> members
[<TarInfo 'text.txt' at 0x2b0b73e46a90>, <TarInfo 'text2.txt' at 0x2b0b73e46ad0>]

Each TarInfo object has several attributes (such as name) and methods (such as isfile()) associated with it.

TarInfo information
>>> members[0].name
'text.txt'
>>> members[0].isfile()
True
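
To show a few more of these attributes together, here is a small sketch that builds a one-file archive and then summarizes each member by name, size in bytes, and type. The file contents and the summary list are assumptions made for the example.

```python
import os
import tarfile
import tempfile

# Create a small archive to inspect.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "text.txt")
with open(path, "w") as handle:
    handle.write("12345")

archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(path, arcname="text.txt")

summary = []
with tarfile.open(archive_path, "r:gz") as tar:
    for member in tar.getmembers():
        # isdir() and isfile() report the member type stored in the archive.
        if member.isdir():
            kind = "dir"
        elif member.isfile():
            kind = "file"
        else:
            kind = "other"
        summary.append((member.name, member.size, kind))
        print(member.name, member.size, kind)
```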

Extracting Files
Extracting the contents is a very simple process. To extract the entire tar file, simply use extractall(). This will extract the contents to the current working directory. Optionally, a path may be specified to have the tar extract elsewhere.

Extracting an entire tar file
>>> import tarfile
>>> tar = tarfile.open("archive.tar.gz", "r:gz")
>>> tar.extractall()
>>> tar.extractall("/tmp/")

If only specific files need to be extracted, use extract() with the name of the member.

Extracting a single file from a tar file
>>> import tarfile
>>> tar = tarfile.open("archive.tar.gz", "r:gz")
>>> tar.extract("text.txt")
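
When several members are wanted at once, extractall() also takes a members argument, which must be a subset of the list returned by getmembers(). The sketch below assumes the goal is to extract only the ".txt" members; the archive contents and target directory are invented for the example.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.tar.gz")

# Build a small archive with three members.
with tarfile.open(archive_path, "w:gz") as tar:
    for name in ("text.txt", "text2.txt", "image.png"):
        source = os.path.join(workdir, name)
        with open(source, "w") as handle:
            handle.write("example\n")
        tar.add(source, arcname=name)

target = os.path.join(workdir, "extracted")
os.makedirs(target)

with tarfile.open(archive_path, "r:gz") as tar:
    # Select only the members whose names end in .txt.
    wanted = [m for m in tar.getmembers() if m.name.endswith(".txt")]
    tar.extractall(path=target, members=wanted)

extracted = sorted(os.listdir(target))
print(extracted)
```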

You should be aware that there is at least one security concern to take into account when extracting tar files. Namely, a tar can be designed to overwrite files outside of the current working directory (/etc/passwd, for example). Never extract a tar as the root user if you do not trust it.
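
One way to look for this problem before extracting is to check where each member would end up. The following is only a sketch under stated assumptions: unsafe_members is a hypothetical helper, not part of the tarfile module, and it looks at member names only (absolute paths and ".." components). It does not inspect symbolic or hard links, so it should not be treated as a complete defense.

```python
import io
import os
import tarfile
import tempfile


def unsafe_members(tar, destination):
    """Hypothetical helper: names of members that would land outside destination."""
    root = os.path.realpath(destination)
    flagged = []
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        # commonpath() equals root only when target lies inside root.
        if os.path.commonpath([root, target]) != root:
            flagged.append(member.name)
    return flagged


# Build a deliberately suspicious archive in a throwaway directory.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "untrusted.tar")
with tarfile.open(archive_path, "w") as tar:
    for name in ("ok.txt", "../evil.txt"):
        data = b"example\n"
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Nothing is extracted here; the archive is only inspected.
with tarfile.open(archive_path, "r") as tar:
    flagged = unsafe_members(tar, os.path.join(workdir, "out"))

print(flagged)
```

Recent Python releases also let extractall() and extract() take a filter argument (for example filter="data") that rejects such members during extraction; check the documentation for your Python version before relying on it.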

3. Examples
Archiving Select Files from a Directory
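>>> import os
>>> import tarfile
>>> # File extensions that should be included in the backup
>>> whitelist = ['.odt', '.pdf']
>>> contents = os.listdir(os.getcwd())
>>> tar = tarfile.open('backup.tar.gz', 'w:gz')
>>> for item in contents:
...     # splitext() returns (root, extension), e.g. ('report', '.pdf')
...     if os.path.splitext(item)[1] in whitelist:
...         tar.add(item)
...
>>> tar.close()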

4. Extending
Removing Files
The tarfile module does not contain any function to remove an item from an archive. It is presumed that this is because of the nature of tape drives, which were not designed to move back and forth (see the discussion of this on the Python tutor mailing list). Nevertheless, other programs for creating tar archives do have a delete feature.

The following code uses the popular GNU tar program that comes with most Linux distributions. The GNU tar documentation describes the "--delete" flag; note that it warns not to use it on an actual tape drive. The reliance on an external program obviously makes the code far less portable, but it is suitable for personal scripts.

Removing an Item from a Tar

>>> import subprocess
>>> def remove(archive, unwanted):
...     # --delete is a GNU tar feature, so check the version banner first.
...     external = subprocess.getoutput("tar --version")
...     if not external.startswith("tar (GNU tar)"):
...         raise Exception("err: need GNU tar to delete individual files.")
...     command = 'tar --delete --file="{0}" "{1}"'.format(archive, unwanted)
...     # getstatusoutput() returns (exit status, output); 0 means success.
...     output = subprocess.getstatusoutput(command)[0]
...     return output
...
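
A portable alternative, sketched here as an assumption rather than a feature of the tarfile module, is to copy every member except the unwanted one into a new archive and then replace the original. The remove_member helper is hypothetical, and it assumes a gzip-compressed archive that is small enough to rewrite in full.

```python
import io
import os
import tarfile
import tempfile


def remove_member(archive, unwanted):
    """Hypothetical helper: rewrite a .tar.gz archive without one member.

    Returns True if a member named `unwanted` was dropped, else False.
    """
    removed = False
    # Write the copy next to the original so os.replace() stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(archive))
    fd, tmp_path = tempfile.mkstemp(suffix=".tar.gz", dir=directory)
    os.close(fd)
    try:
        with tarfile.open(archive, "r:gz") as src, \
                tarfile.open(tmp_path, "w:gz") as dst:
            for member in src.getmembers():
                if member.name == unwanted:
                    removed = True
                    continue
                # Only regular files carry data; directories and links do not.
                fileobj = src.extractfile(member) if member.isfile() else None
                dst.addfile(member, fileobj)
        os.replace(tmp_path, archive)
    except Exception:
        os.remove(tmp_path)
        raise
    return removed


# Demo on a throwaway archive with two members.
workdir = tempfile.mkdtemp()
demo_archive = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(demo_archive, "w:gz") as tar:
    for name in ("text.txt", "text2.txt"):
        data = b"example\n"
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

removed = remove_member(demo_archive, "text.txt")
with tarfile.open(demo_archive, "r:gz") as tar:
    remaining = tar.getnames()

print(removed, remaining)
```

The trade-off is that the whole archive is rewritten for each removal, which may be slow for large backups.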
