Sunday, May 27, 2007

dar: a solution to archiving video

Like myself, you may have gigs and gigs of video sitting on your hard drive, taking up space that should be used for live projects or new media. And you've filled up your 500GB drive, so that you are constantly having to do piecemeal copies of older material to DVD+R. I have the same problem. But as I get older, I try to be a little wiser and actually solve my problems instead of living with them.

I will make the distinction that you should be using DVD+R for your archives. I have found +Rs to be more reliable for data archival than -Rs.

Here's the problem with doing a straight copy of media files to DVD+R. Look at the directory listing below:
[root@computer ~]# ll /mnt/videos/20050721/
total 20871112
-rwxr-xr-x 1 root root 1469679848 Oct 18 2006 1.m2t
-rwxr-xr-x 1 root root 58916 Oct 21 2006 20050721b.xml
-rwxr-xr-x 1 root root 3877924864 Oct 22 2006 20050721dvd.mpg
-rwxr-xr-x 1 root root 1130843124 Oct 29 2006 20050721.m2t
-rwxr-xr-x 1 root root 45197 Oct 20 2006 20050721.xml
-rwxr-xr-x 1 root root 70478 Oct 21 2006 20060721c.xml
-rwxr-xr-x 1 root root 70431 Oct 24 2006 20060721d.xml
-rwxr-xr-x 1 root root 71543 Oct 24 2006 20060721e.xml
-rwxr-xr-x 1 root root 8549 Oct 25 2006 20060721small.xml
-rwxr-xr-x 1 root root 2065588524 Oct 18 2006 2.m2t
-rwxr-xr-x 1 root root 1755983168 Oct 18 2006 3.m2t
-rwxr-xr-x 1 root root 902360332 Oct 18 2006 4.m2t
-rwxr-xr-x 1 root root 1491412084 Oct 18 2006 5.m2t
-rwxr-xr-x 1 root root 2865367596 Oct 18 2006 6.m2t
-rwxr-xr-x 1 root root 5461152404 Oct 18 2006 7.m2t


Here we see the typical mess of Cinelerra project files, source material (HDV MPEGTS files), and final renders (MPEGs). Now, totaling up the space used for these files, you get about 21GB. Ugh. And given the odd file sizes, you'd end up using about 7 DVDs just to back up what you've got. This is because HDV files are huge, ranging from 1.5GB to 5GB, and DVDs only hold about 4.38GB of usable space each. So you're forced into a combinatorial balancing act to fit as many files on each DVD as efficiently as possible. We who live in the land of video production are all living and breathing this headache. What a pain in the ass. But what is the alternative?
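The listing also shows that a straight copy is not just inefficient: some files cannot go on a disc at all. Here is a minimal sketch, assuming a single-layer DVD+R holds 4,700,372,992 bytes (roughly the 4.38GB quoted above). The helper name `oversize_files` is made up for illustration; it reads "size name" pairs on stdin.

```shell
#!/bin/sh
# Assumed single-layer DVD+R capacity in bytes (about 4.38 GiB).
DISC_BYTES=4700372992

# Hypothetical helper: read "size name" pairs on stdin and print the
# names of files that are too large for one disc.
oversize_files() {
    awk -v cap="$DISC_BYTES" '$1 + 0 > cap + 0 { print $2 }'
}

# A few sizes taken from the directory listing above.
oversize_files <<'EOF'
1469679848 1.m2t
3877924864 20050721dvd.mpg
2865367596 6.m2t
5461152404 7.m2t
EOF
```

With these sizes only 7.m2t is reported. On a GNU system, `stat -c '%s %n' *` should produce input in this shape, though names containing spaces would need extra handling.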

The alternative is to find a program that can compress and archive files over multiple DVDs, grouping these files of different sizes and compacting them together. Most importantly, the software should then evenly divide the compressed file archive across multiple DVDs in the most efficient and space conscious manner possible.

Well, lucky for us, the Linux Gods have brought down "dar" from the heavens. Dar (disk archive), available at http://dar.linux.free.fr/, is a command line backup and restore tool that can compress files using the bz2 algorithm, put files into a single archive and divide that archive into manageable chunks destined for backup media of one type or another. For the file listing you see above, dar was able to take it and turn it into this:
[root@computer ~]# ll /mnt/videos/2007-05-27_data.*
-rw-r--r-- 1 root root 4194304000 May 27 2007-05-27_data.1.dar
-rw-r--r-- 1 root root 4194304000 May 27 2007-05-27_data.2.dar
-rw-r--r-- 1 root root 4194304000 May 27 2007-05-27_data.3.dar
-rw-r--r-- 1 root root 4194304000 May 27 2007-05-27_data.4.dar
-rw-r--r-- 1 root root 3912021642 May 27 2007-05-27_data.5.dar


Nice! Easily digestible chunks for a single layer DVD to handle!

Now, the compression that dar achieved was not very much. Total file size went from about 21.35GB to about 20.6GB. This is because the MPEGTS files are already compressed, so I'm not going to gain much from bz2. My 3.2GHz, 2GB, PC3200, RAID0 (stripe set of two IDE drives), Dell 400SC took about three hours and twenty minutes to compress that 21GB. So it's not fast.
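The saving can be checked from the two listings. This is a rough sketch using shell integer arithmetic. It assumes the `total 20871112` line counts 1 KiB blocks (the GNU `ls` default), so the "before" figure is disk usage rather than an exact byte count.

```shell
#!/bin/sh
# Before: "total 20871112" from the first listing, assumed to be 1 KiB blocks.
before=$((20871112 * 1024))

# After: four full 4000M slices plus the final, shorter slice.
after=$((4 * 4194304000 + 3912021642))

saved=$((before - after))
# Saving in tenths of a percent, using integer arithmetic only.
tenths=$((saved * 1000 / before))

echo "before: $before bytes"
echo "after:  $after bytes"
echo "saved:  $saved bytes ($((tenths / 10)).$((tenths % 10))%)"
```

The result is a saving of roughly 3%, which is about what you would expect from material that is already MPEG compressed.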

Before you get too excited, here are some known limitations of dar:
http://dar.linux.free.fr/doc/Limitations.html

I made sure to give dar a full system test using the steps below.

1) archive the above directory of files
TIME: about three hours and twenty minutes on the system described above.
dar -m 256 -v -y -s 4000M -D -R /mnt/videos/20050721/ -c `date -I`_data
Adding file to archive: /mnt/videos/20050721/20050721e.xml
Adding file to archive: /mnt/videos/20050721/addSecond.sh
..

Update 2008/12/22
If you have 120 minute, 4.7GB DVD+Rs, you can raise the size of each dar slice to 4400MB, which is 4,613,734,400 bytes (4400 x 1024 x 1024):
dar -m 256 -v -y -s 4400M -D -R /mnt/videos/20050721/ -c `date -I`_data

Note: you may need the latest and greatest version of dvd+rw-tools for this large filesize burning to work! I tested this on Fedora 10 and I was able to store and retrieve a 25GB dar archive using this procedure.

Note that you will need to use the "-allow-limited-size" switch to growisofs when you burn these larger than normal files to dvd:
growisofs -Z /dev/dvd -R -J -allow-limited-size filename.dar
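Whether a slice needs that switch can be decided from its size. This is a small sketch, assuming the cutoff is the 4GB mark (2 to the 32 power, or 4,294,967,296 bytes). `burn_command` is a hypothetical helper that only prints the command it would run.

```shell
#!/bin/sh
# Assumed cutoff: files of 2^32 bytes or more need the extra switch.
LIMIT=4294967296

# Hypothetical helper: print the growisofs command for a file of a given size.
burn_command() {
    size=$1
    file=$2
    if [ "$size" -ge "$LIMIT" ]; then
        echo "growisofs -Z /dev/dvd -R -J -allow-limited-size $file"
    else
        echo "growisofs -Z /dev/dvd -R -J $file"
    fi
}

burn_command $((4000 * 1024 * 1024)) 2007-05-27_data.1.dar   # 4000M slice
burn_command $((4400 * 1024 * 1024)) 2007-05-27_data.1.dar   # 4400M slice
```

Only the 4400M slice crosses the cutoff, so only the second command carries -allow-limited-size.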
end update

In short, the switches I used mean:

-m 256   = don't compress files less than 256 bytes
-v       = verbose output showing what is being archived
-y       = activate bz2 compression
-s 4000M = create archives 4000MB in size.  4000MB is 1024x1024x4000 bytes or 4,194,304,000 bytes.
    By the way, 4GB is actually 2 to the 32 power or 4,294,967,296 bytes.
-D       = store directories excluded by the -P option or absent from the command line path list as empty directories
-R       = specify the root directory for saving or restoring files
-c       = create the archive with the following name, using the current date
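The slice arithmetic is easy to check in the shell. In this sketch the helper names `mib_to_bytes` and `slices_needed` are made up for illustration and are not part of dar. The archive total used is simply the sum of the five slice sizes listed earlier.

```shell
#!/bin/sh
# The "M" suffix is treated here as 1024 x 1024 bytes, as described above.
mib_to_bytes() {
    echo $(($1 * 1024 * 1024))
}

# Hypothetical helper: how many slices of a given size are needed to hold
# a given total (ceiling division).
slices_needed() {
    total=$1
    slice=$2
    echo $(((total + slice - 1) / slice))
}

slice=$(mib_to_bytes 4000)
echo "slice size: $slice bytes"
echo "slices for a 20689237642-byte archive: $(slices_needed 20689237642 "$slice")"
```

This gives 4,194,304,000 bytes per slice and five slices, matching the five .dar files shown above.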

Here's the output of that command:
--------------------------------------------
17 inode(s) saved
with 0 hard link(s) recorded
0 inode(s) changed at the moment of the backup
0 inode(s) not saved (no file change)
0 inode(s) failed to save (filesystem error)
0 files(s) ignored (excluded by filters)
0 files(s) recorded as deleted from reference backup
--------------------------------------------
Total number of file considered: 17
--------------------------------------------


The command line switches I used above are well summarized in this HowTo:
http://dar.linux.free.fr/doc/mini-howto/index.html

Also, for you man page readers, here's the nitty gritty:
http://www.die.net/doc/linux/man/man1/dar.1.html

2) validate that the archive does not contain errors
TIME: about an hour and a half.
dar -t <archive name>

Here is the output of that command:
--------------------------------------------
17 file(s) treated
0 file(s) with error
0 file(s) ignored (excluded by filters)
--------------------------------------------
Total number of file considered: 17
--------------------------------------------

Also, it is helpful to list out the contents of the created dar in order to verify it matches the files you want archived. Here is sample output from another archive I created:

[root@computer ~]# dar -l 20081016_data
[data ][ EA  ][compr] | permission | user  | group | size  |          date                 |    filename
----------------------+------------+-------+-------+-------+-------------------------------+------------
[Saved]       [  90%]   -rw-r--r--   root       root    46335   Tue Oct 21 22:09:02 2008        20081016e.xml
[Saved]       [  46%]   -rwxr-xr-x   root       root    990     Sat Oct 18 16:56:45 2008        addSecond.sh
[Saved]       [   8%]   -rw-r--r--   root       root    5663820164      Sat Oct 18 09:25:36 2008        20081016_6.m2t
[Saved]       [   2%]   -rw-r--r--   root       root    26411454        Sun Oct 26 16:29:52 2008        test.mov
[Saved]       [  91%]   -rw-r--r--   root       root    55587   Mon Oct 27 08:23:02 2008        20081016i.xml
[Saved]       [  79%]   -rw-r--r--   root       root    13680   Sat Oct 18 15:41:23 2008        20081016a.xml
[Saved]       [  47%]   -rwxr-xr-x   root       root    1408    Sat Oct 18 17:50:51 2008        2songlist.sh
[Saved]       [  51%]   -rw-r--r--   root       root    1688    Wed Oct 22 08:19:37 2008        vodcastNew.xml
[Saved]       [  47%]   -rwxr-xr-x   root       root    2411    Sat Oct 18 19:13:01 2008        1encode.sh.bak
[Saved]       [   8%]   -rw-r--r--   root       root    1143234956      Sat Oct 18 09:11:36 2008        20081016_4.m2t
[Saved]       [   5%]   -rw-r--r--   root       root    3805975877      Mon Oct 27 04:49:50 2008        StormPigs20081016.m2v
[Saved]       [     ]   -rwxr-xr-x   root       root    146     Sat Oct 18 17:56:25 2008        5ftp.sh
[Saved]       [  10%]   -rw-r--r--   root       root    804122436       Sat Oct 18 09:02:08 2008        20081016_1.m2t
[Saved]       [   8%]   -rw-r--r--   root       root    2091780308      Sat Oct 18 09:05:45 2008        20081016_2.m2t
[Saved]       [  88%]   -rw-r--r--   root       root    30648   Tue Oct 21 21:07:01 2008        20081016c.xml
[Saved]       [  89%]   -rw-r--r--   root       root    40105   Tue Oct 21 21:32:46 2008        20081016d.xml
[Saved]       [  79%]   -rw-r--r--   root       root    12197   Sat Oct 18 15:13:20 2008        20081016.xml
..

3) write each output file from dar to DVD
TIME: with an 18x burner running at 16x speed to DVD+R, this takes about an hour.

First, check your media:
dvd+rw-mediainfo /dev/dvd

Then burn your archive to disc:
growisofs -Z /dev/dvd -R -J /root/2007-05-27_data.1.dar
..
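With five or six slices per archive, a loop saves some typing. This is a sketch only: `burn_slices` and the `DRY_RUN` variable are hypothetical, and the growisofs switches are the same ones used above. It counts slice numbers rather than using a `*.dar` glob, so slice 10 would not sort ahead of slice 2.

```shell
#!/bin/sh
# Hypothetical helper: walk the slices of one archive in numeric order
# (base.1.dar, base.2.dar, ...) and burn each one to its own disc.
# With DRY_RUN=1 (the default here) it only prints the commands.
burn_slices() {
    base=$1
    device=${2:-/dev/dvd}
    n=1
    while [ -f "$base.$n.dar" ]; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "growisofs -Z $device -R -J $base.$n.dar"
        else
            printf 'Insert a blank disc for slice %s, then press Enter: ' "$n"
            read -r _answer
            growisofs -Z "$device" -R -J "$base.$n.dar" || return 1
        fi
        n=$((n + 1))
    done
}

burn_slices /root/2007-05-27_data
```

Set DRY_RUN=0 once the printed commands look right.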


If you intend to do a lot of archiving, I suggest you purchase a recent model DVD+R recorder. When I first tested dar this past weekend, I had a mess of problems reading the archive files I had burned successfully to DVD. I figured my DVD drive was three years old and it was time for an upgrade, so I bought the internal version of this drive, the HP DVD940E External 18x Super Multi DVD Writer, for $60 with a $30 rebate from Office Depot. The thing performs like a champ!

4) copy the archive from the DVDs to disk
TIME: with an 18x burner, this takes about twenty minutes.
mount /dev/cdrom /mnt/cdrom
cp /mnt/cdrom/* /mnt/videos/


5) validate that the archive files off the DVD do not contain errors
TIME: about an hour and a half.
dar -t <archive name>

While validating my archives off DVD, I encountered one problem:
[root@computer ~]# dar -t /mnt/videos/2007-05-27_data
ERR /6.m2t : compressed data CRC error
--------------------------------------------
17 file(s) treated
1 file(s) with error
0 file(s) ignored (excluded by filters)
--------------------------------------------
Total number of file considered: 17
--------------------------------------------

Bad news. It looks like the data written to one of the DVDs is corrupt. Since I had the original files and they tested out correct, I re-wrote the archive to new DVDs and did not encounter this problem again. By the way, the test of my 20GB archives took about an hour.
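The error above names the damaged file inside the archive, not the disc that went bad. One optional extra, which is not part of the procedure in this post, is to record a checksum for each slice before burning and check the copies afterwards. This sketch assumes GNU `sha256sum` is installed; the function names and the manifest filename are made up for illustration.

```shell
#!/bin/sh
# Record a checksum for every slice of an archive (run before burning).
make_manifest() {
    dir=$1
    base=$2
    (cd "$dir" && sha256sum "$base".*.dar > "$base.sha256")
}

# Re-check the slices after copying them back from disc.
# sha256sum -c prints one OK/FAILED line per slice and returns non-zero
# if any slice differs from the recorded checksum.
check_manifest() {
    dir=$1
    base=$2
    (cd "$dir" && sha256sum -c "$base.sha256")
}
```

For example, `make_manifest /mnt/videos 2007-05-27_data` before burning, then `check_manifest /mnt/videos 2007-05-27_data` after the copy in step 4. A FAILED line would point at the one slice, and so the one disc, that needs re-burning.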

Here is what a successful validation looks like:
[root@computer ~]# dar -t 20081016_data
--------------------------------------------
17 inode(s) treated
0 inode(s) with error
0 inode(s) ignored (excluded by filters)
--------------------------------------------
Total number of inode considered: 17
--------------------------------------------
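For unattended runs, the summary can be checked by a script. This minimal sketch reads `dar -t` output on stdin and looks at the "with error" line. It assumes the summary keeps the wording shown in the samples above ("file(s)" in one run, "inode(s)" in another), and `test_passed` is a hypothetical name.

```shell
#!/bin/sh
# Succeed only if the summary reports "0 file(s) with error" or
# "0 inode(s) with error". A missing summary line counts as a failure.
test_passed() {
    awk '
        / (file|inode)\(s\) with error/ { seen = 1; errors = $1 + 0 }
        END { exit ((seen && errors == 0) ? 0 : 1) }
    '
}
```

Usage would look like `dar -t 20081016_data | test_passed && echo "archive OK"`.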


6) if no errors, restore original files and verify file sizes
TIME: about three hours.
This step is optional, if you've already run "dar -t" to verify the integrity of the archive coming off the DVD. Here is the output:
dar -x 2007-05-27_data
--------------------------------------------
17 file(s) restored
0 file(s) not restored (not saved in archive)
0 file(s) ignored (excluded by filters)
0 file(s) less recent than the one on filesystem
0 file(s) failed to restore (filesystem error)
0 file(s) deleted
--------------------------------------------
Total number of file considered: 17
--------------------------------------------
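For the "verify file sizes" part of this step, here is one way to compare the restored files against the originals. This sketch uses a hypothetical helper, `compare_sizes`, that checks only the regular files in the top level of the two directories.

```shell
#!/bin/sh
# Hypothetical helper: for every regular file in the original directory,
# compare its byte count with the file of the same name in the restored
# directory. Prints each problem found and returns non-zero if there is any.
compare_sizes() {
    orig=$1
    restored=$2
    status=0
    for f in "$orig"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        if [ ! -f "$restored/$name" ]; then
            echo "MISSING $name"
            status=1
            continue
        fi
        a=$(wc -c < "$f")
        b=$(wc -c < "$restored/$name")
        if [ "$a" -ne "$b" ]; then
            echo "SIZE MISMATCH $name: $a vs $b"
            status=1
        fi
    done
    return $status
}
```

Call it as `compare_sizes /mnt/videos/20050721 /path/to/restored`, where the second path is wherever you ran `dar -x`. No output and a zero exit status mean every file matched.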


There was some slowness copying the archives back from DVD (which took about two hours at 4x speed), but that's just the speed of the DVD player. Aside from that 4GB limit, dar lives up to its reputation! So I'm pretty happy.

Review
1) archive your files
TIME: about three hours and twenty minutes on the system described above.
dar -m 256 -v -y -s 4000M -D -R /mnt/videos/20050721/ -c `date -I`_data

2) validate that the archive does not contain errors
TIME: about an hour and a half.
dar -t <archive name>

3) write each output file from dar to DVD
TIME: with an 18x burner running at 16x speed to DVD+R, this takes about an hour.
growisofs -Z /dev/dvd -R -J /root/2007-05-27_data.1.dar

4) copy the archive from the DVDs to disk
TIME: with an 18x burner, this takes about twenty minutes.
cp /mnt/cdrom/* /mnt/videos/

5) validate that the archive files off the DVD do not contain errors
TIME: about an hour and a half.
dar -t <archive name>

OPTIONAL:
6) if no errors, restore original files and verify file sizes
TIME: about three hours.
dar -x 2007-05-27_data
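Steps 1 and 2 chain naturally, since there is no point burning an archive that fails its own test. Here is a sketch of a wrapper that could be kicked off overnight. The function name and the `DAR` override are assumptions for illustration, the dar switches are copied unchanged from step 1, and it assumes dar exits with a non-zero status when the create step fails.

```shell
#!/bin/sh
# Set DAR=echo to print the arguments instead of running dar.
DAR=${DAR:-dar}

# Hypothetical wrapper: create the archive, then test it only if the
# create step succeeded.
archive_and_test() {
    src=$1
    name=$2
    "$DAR" -m 256 -v -y -s 4000M -D -R "$src" -c "$name" &&
        "$DAR" -t "$name"
}
```

For example: `archive_and_test /mnt/videos/20050721/ "$(date -I)_data"`.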

Summary
If you wish to use dar and want to keep your valuable video data intact for years to come, I strongly suggest you run through steps 1-5 each time you make an archive! Of course, just the basic steps take a total of eight hours for 20GB of data. The optional step brings that total to eleven hours of your time spent.

Of course, you don't have to archive EVERYTHING. Only archive the source videos and maybe the primary intermediates. For example, I archive all my MPEG-TS files from my cam, plus the MPEG2 video and MP3 audio rendered from my project. I DON'T archive the finals: DVD format, iTunes format and MPEG program streams, as I can always reproduce those from the primary intermediates that are rendered from the project.

In the end, you have to ask yourself "How much do I value the work that I've done?"
Going through these steps every time you make an archive may seem like a pain, but the pain will be worse if your data goes away! You could opt to store your media on a hard drive, but if that hard drive gets near a speaker or large magnet, your data could be lost. If you are going to archive this data for years, it makes more sense to do it on optical formats that are not susceptible to damage by magnetism.

If you do decide to go the dar route and follow these steps, you'll have the peace of mind that your archives are error free.

Hopefully, dar might fit into your backup and recovery schemes. There are a number of other programs that do something similar. Partimage, found on http://www.sysresccd.org, comes to mind, though that is used for entire partitions. Duplicity is also available, but its strength is in encryption and network backups. In its favor, dar is a proven solution and is very well documented:
http://dar.linux.free.fr/doc/

As I have time, I will post a bit more technical information about the commands used, but the best idea is to research the documentation at the link above, as well as do a simple "dar -h" at the command line for a listing of all the available features.

Update 1/4/2014
The Extraction Process Redux
I've been restoring dar archives from DVDs.  Today, I pulled out a couple of five-DVD dar archives that I originally created four years ago.  Each DVD took about six minutes to copy over to my hard drive.  I'm happy to say that dar restored the individual video files that I specified without any problems.  Here's a sample command:

dar -x 20090430_data -g 20090430.m2v

However, dar did spit out this message:
File ownership will not be restored as dar is not run as root. to avoid this message use -O option [return = YES | Esc = NO]
Continuing...

Error met while opening the last slice: This is an old archive, it can only be opened starting by the first slice. Trying to open the archive using the first slice...

Even with this message, the archived files restored without error.

The commands above mean:
-x = extract
-g = file or subdirectory (path relative to the archive root) to include in the operation

Another good switch is -O, which avoids the "root ownership" message seen above.  Be careful with the placement of -O: it has to be the first parameter, like so:

dar -O -x 20090430_data -g 20090430.m2v

After giving the -O parameter in the above command, all you should see is the "Error met while opening the last slice" message.

Update 10/1/2008
The Extraction Process
I pulled out a 6 DVD dar archive that I originally created more than a year ago and I'm happy to say that dar restored the files without any problems. Specifically, I needed to pull one MPEG video from a dar archive of about 25 files. The dar command to extract one specific file was relatively simple:
dar -x <archive name> -I "*.mpg"

-x = extract
-I = include following filespec in operation


So my command ended up looking like this:
dar -x /mule/20060831 -I "*.mpg"

One thing I noticed is that, depending on the archive, wildcards (like *.mpg) work some of the time but not always. Quoting the pattern, as above, keeps the shell from expanding it before dar sees it. If it still fails, drop the wildcard from the include specification and give the exact filename, e.g.:
dar -x /mule/20060831 -I file.mpg
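A wildcard that works only some of the time is often the shell's doing rather than dar's: an unquoted `*.mpg` is expanded by the shell before dar starts whenever the current directory happens to contain matching files. Here is a small demonstration with no dar involved; `count_args` is a throwaway helper.

```shell
#!/bin/sh
# Throwaway helper: report how many arguments arrived, and the first one.
count_args() {
    echo "$# argument(s), first: $1"
}

# Work in a scratch directory that contains two matching files.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch a.mpg b.mpg

count_args *.mpg      # unquoted: the shell expands this to a.mpg b.mpg
count_args "*.mpg"    # quoted: the pattern is passed through literally
```

The first call receives two filenames and the second receives the pattern itself, which is what a program that does its own matching needs to see.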

That's it!

Have a good day!
The Video Mule

5/30/07 update - After using dar for the past couple of days and releasing about 50GB, I have to say that I am really starting to like this new process. It is a consistent, repeatable and efficient approach to archiving my material that I can kick off before bedtime.

10/1/08 update - Dargui is a nice, simple graphical front end to dar. For some reason, though, the filter did not work properly, so I reverted to command line. Perhaps someone else will have better luck.


References
http://dargui.sourceforge.net/
http://dar.linux.free.fr/doc/mini-howto/dar-differential-backup-mini-howto.en.html
