Quick and Dirty Dedupe Analyzer
QDDA - The Quick & Dirty Dedupe Analyzer
- Bart Sjerps (DellEMC)
- Source code
QDDA can analyze files, block devices and data streams to find duplicate blocks so one can estimate potential storage savings by implementing deduplication-capable storage (such as Dell-EMC XtremIO or VMAX All-Flash (experimental). It will also estimate the compression ratio.
QDDA is written in C/C++ for performance and requires Linux to run on. It uses MD5 for hash calculations, LZ4 for compression and SQLite3 as database.
Tested on EL6 (CentOS, but RHEL/OEL and EL7 should also work fine) and Linux Mint.
QDDA is licensed under GPLv3+ which means it's free to use, but use at your own risk. I accept no claims on any damage resulting from QDDA showing incorrect results.
(phew, that was the legal stuff. See the section below as well on protection against bugs).
3 Protection against bugs
QDDA is safe to run even on files/devices that are in use. It opens streams read-only and cannot modify any files except SQLite3 databases. It writes to a database file that needs to be either newly created or a pre-existing SQLite3 database. It can remove the database file but ONLY if it is an SQLite3 file (it tests the file magic).
For added safety you may run "qdda" as a non-privileged user and only open named pipes or standard input (stdin), or use ACL's (see below)
qdda 1.9.0 - The Quick & Dirty Dedupe Analyzer Use for educational purposes only - actual array reduction results may vary Usage: qdda [-D] [-B blksize] [-b <bandw>] [-c] [-d] [-f <dbpath>] [h|H] [-i importdb] [-k] [-n] [-q] [-r] [-t [gb]] [-v] [-x] [device list] -D (debug) : Show lots of annoying debug info -B <blksize_kb> : Set blocksize to blksize_kb kilobytes -P (purge) : Reclaim unused space in database (sqlite vacuum) -T <tempdir> : Use tempdir instead of /var/tmp for SQLite temp files -a (append) : Append data to existing database -b <bandw> : Throttle bandwidth in MB/s (default 200, set to 0 to disable throttling) -c <method|l> : Compression method for reporting (use -c l to list methods) -d (dump) : Dump block offsets and hashes (to search for block offset with hash) -f <database> : Specify alternative database location (default /var/tmp/qdda.db) -i <import-db> : Import data from another database (merge with existing data) -n (noupdate) : Dryrun - just read data (no update to database) -q (quiet) : Don't show progress indicator or intermediate results -r (no-Report) : Don't show report, don't merge staging data and keep staging database -t [size_gb|0] : Run raw performance test (size_gb 0) or merge test (data size in GB) -V (version) : print version and copyright info -x (eXtended) : Extended report (file info and dedupe/compression histograms) -h or ? (help) : This help -H Long help : More detailed information, tips & tricks qdda is safe to run even on files/devices that are in use. It opens streams read-only and cannot modify any files. It writes to a database file that needs to be either newly created or a pre-existing SQLite3 database. It can remove the database file but ONLY if it is an SQLite3 file. See the long help for additional safety options. *** Overview *** total = Total data blocks scanned used = Used (non-zero) space deduped = Required space after deduplication allocated = Required space after applying compression (array specific) *** Details *** Compression method = Description of array/method blocksize = Dedupe blocksize free (zero) = Zeroed blocks (not allocated) compress pre dedup = Sum of bytes of compressed blocks (before dedupe) merged by dedupe = Space saved by deduplication compress post dedup = Sum of bytes of compressed blocks (after dedupe) unique data = Blocks that can't be deduped duped 2:1 = Blocks that are deduped 2:1 duped >2x = Blocks that are deduped more than 2:1 duped total = Blocks that are deduped 2:1 or more *** Summary *** percentage used = Used capacity / Total capacity percentage free = Free capacity (zeroed) deduplication ratio = used capacity / deduped capacity compression ratio = deduped and compressed / deduped thin ratio = Free capacity / Total capacity combined = dedup * compressed * thin ratio raw capacity = Total scanned (allocated) net capacity = Total required after optimizations Hash calculations are highly accurate up to roughly 10TB data using 16K blocksize, after which hash collisions may cause increasingly inaccurate results. More info: http://outrun.nl/wiki/qdda
5 Usage example
Output from a run against a test VM running Oracle on ASM.
[root@outrun01 ~](-) # qdda -b0 /dev/oracleasm/* qdda 1.9.1 - The Quick & Dirty Dedupe Analyzer Use for educational purposes only - actual array reduction results may vary File 01, 196608 16k blocks (3072 MiB) processed, 68/9999 MB/s, 101 MB/s avg File 02, 196608 16k blocks (3072 MiB) processed, 64/9999 MB/s, 62 MB/s avg File 03, 131072 16k blocks (2048 MiB) processed, 126/9999 MB/s, 121 MB/s avg File 04, 65536 16k blocks (1024 MiB) processed, 59/9999 MB/s, 59 MB/s avg File 05, 65536 16k blocks (1024 MiB) processed, 68/9999 MB/s, 59 MB/s avg File 06, 65536 16k blocks (1024 MiB) processed, 58/9999 MB/s, 54 MB/s avg File 07, 65536 16k blocks (1024 MiB) processed, 66/9999 MB/s, 62 MB/s avg File 08, 65536 16k blocks (1024 MiB) processed, 69/9999 MB/s, 62 MB/s avg File 09, 65536 16k blocks (1024 MiB) processed, 62/9999 MB/s, 60 MB/s avg Merging 917504 blocks (14336 MiB) with 0 blocks (0 MiB) Indexing in 2.95 sec (310510 blocks/s, 4851 MiB/s), Joining in 1.36 sec (674995 blocks/s, 10546 MiB/s) *** Overview *** total = 14336.00 MiB ( 917504 blocks) used = 9327.33 MiB ( 596949 blocks) deduped = 5396.56 MiB ( 345380 blocks) allocated = 2178.44 MiB ( 139420 blocks) *** Details *** Compression method = XtremIO X2 blocksize = 16 KiB free (zero) = 5008.67 MiB ( 320555 blocks) compress pre dedup = 3084.41 MiB ( 66.93 %) merged by dedupe = 3930.77 MiB ( 251569 blocks) compress post dedup = 1990.34 MiB ( 63.12 %) unique data = 2712.03 MiB ( 173570 blocks) duped 2:1 = 2025.27 MiB ( 129617 blocks) duped >2:1 = 659.27 MiB ( 42193 blocks) duped total = 2684.53 MiB ( 171810 blocks) *** Summary *** percentage used = 65.06 % percentage free = 34.94 % deduplication ratio = 1.73 compression ratio = 2.48 thin ratio = 1.54 combined = 6.58 raw capacity = 14336.00 MiB net capacity = 2178.44 MiB
Explanation: We scanned 9 Oracle ASM devices with various types of data using default blocksize (16K), total 14GB of which about 9.3 GB is actually used. Because we use 16K block size, QDDA assumes we simulate an XtremIO X2.
Oracle rarely writes duplicate blocks (with some exceptions) so the dedupe ratio is not very high. But we copied the data disk to data-copy to simulate a cloned database which shows because almost 4GB of capacity is merged (saved) by deduplication.
So we find a lot of blocks in the "deduped 2x" category and a smaller amount in the higher dedupe ratios.
Pre-dedupe compression is just the sum of all blocks after compression (without dedupe so some similar blocks get both compressed). After dedupe the compressed size would be 1990 MB which is the sum of the compressed sizes of all blocks. But we store compressed blocks in "buckets" for performance reasons, so a 16K block that compresses into 4100 bytes goes into the 5K bucket because it's just too large for a 4096 bytes slot. Which explains why the allocated size is a little higher than compressed post dedup.
The space savings percentage post dedup is roughly 63% ( savings = 1 - (compressed post dedup/deduped size)) or about 1:3 which is usual for Oracle data.
The compressed block allocation into buckets comes at the expense of a bit additional capacity but avoids heavy fragmentation and performance issues. The required capacity on an array that does thin provisioning, bucket compression and deduplication with 16K blocks would be roughly 2.2 GB MB (and we had 14GB) so the overall efficiency here is 1:6.58. Real world environments may differ from my test environment).
Various factors influence how close QDDA can get to what you would really save on an AFA. Some factors that influence accuracy:
- Compression algorithm. QDDA uses LZ4 which is not what an actual XtremIO or VMAX AFA uses (for various reasons). So by design the compression ratio is not 100% accurate although for XtremIO they turned out to be pretty close. VMAX compression is experimental, results TBD in the future and the algorithm may change
- Hash collisions. There is a statistical chance that 2 different blocks generate the same hash. Statistically this happens once in a 5 TB data set (at 16KB) and at 5TB a deviation of 16K is acceptable as we don't store real data
- Differences in allocation algorithms. Especially VMAX AFA has a dynamic backend which is practically impossible to simulate with a tool so the allocation rates are slightly optimistic (for now)
- VMAX AFA currently does not dedupe so you need to look only at the "compress pre dedup" line. The allocated capacity is what would be required if VMAX would dedupe at 128K chunks - which may or may not happen in the future
- Bugs. Impossible of course... :-/
Note that the accuracy may get worse for datasets much larger than, say, 10TB@16K or 80TB@128K due to hash collisions.
- EL6 (CentOS/RHEL/OEL)
# Install the Outrun Extras repository yum install http://yum.outrun.nl/outrun-extras.rpm # Install qdda from repo yum install qdda
- Non-EL (Ubuntu, mint, etc)
Copy /usr/bin/qdda from an RPM-compatible system.
ready to use!
QDDA creates a key-value store using an SQLite3 database. The key-value pairs are filled with hash values (the key) and the number of occurrences (the value).
The hash algorithm used is MD5 (stripped to 56 bits - 7 bytes) and calculated for every block. The compressed size for each block is also fetched. The results will be merged in a key-value table with 3 columns: key (the MD5 hash), value (block count with same hash), bytes (compressed size). A zeroed block will have the special "hash" value of 0.
When scanning is complete, the number of unique keys represents the actual storage capacity that would be required after de-duplication. The sum of all values is the amount of raw blocks scanned. So by dividing the amount of unique keys by the total scanned blocks you get the deduplication ratio.
Every (unique) block is also compressed using the LZ4 compression algorithm . We keep track of the total bytes before and after compression for each block - so we can estimate the compression ratio. Some all-flash arrays use a performance optimized method for storing compressed data in "buckets" to increase performance and avoid fragmentation overhead. This "bucket compression" can also be calculated.
A zero block (only zero values) will generate the special hash value 0 (the real hash is ignored) so we can also figure out how much space is allocated vs. unused.
Using a simple key-value table "kv" where k is the hash and v equals the number of times the hash is found.
For simplicity, say we scan 10 blocks where the hashes would be:
0 5 6 7 8 <-- hashes (simplified) block 0 = 5 * block 1 = 0 * block 2 = 6 * block 3 = 7 * block 4 = 0 * block 5 = 6 * block 6 = 8 * block 7 = 6 * block 8 = 6 * block 9 = 5 * ------------- Sums: 2 2 4 1 1 ^--^ - non-dedupeable ^------ - deduped 4:1 ^--------- - deduped 2:1 ^------------ - zero (not used) here we have total = 10 free = 2 used = 8 unique = 2 (i.e. 7,8 appear only once) duped = 2 (5 and 6 appear more than once) deduped = 4 (i.e. 5 non-zero hashes: 5,6,7,8) FYI, The k,v table would look like this (5 rows): k=0, v=2 k=5, v=2 k=6, v=4 k=7, v=1 k=8, v=1 How to query the kv table: Total blocks = sum(v) - total logical blocks scanned (2+2+4+1+1 = 10) Free blocks = count(*) where k=0 - total logical zero blocks (v=2) Used blocks = sum(v) where k!=0 - total logical non-zero blocks (2+4+1+1 = 8) Deduped blocks = count(*) where k!=0 - total amount of hashes - required capacity after dedupe - excluding free blocks (4 keys with k!=0) Free + used = count(*) - equal to # rows in the table Unique blocks = count(*) where k!=0 and v=1 - total blocks which occur only once (no deduplication) (2, k and k ) Duped blocks = count(*) where k!=0 and v>1 - total blocks which occur more than once (deduplication occurs) (2, k and k)
For added safety you may run qdda as a non-privileged user. Non-root users usually do not have access to block devices so we need a way to get raw disk data to qdda without adding security risks.
To run QDDA as non-root user, there are various methods (assuming the "qdda" user without root privileges):
- Add qdda to group "disk". This is still unsafe as the "disk" group has read/write access to disks, and still causes problems with ASM disks for example which are usually owned by a different group
- Use "sudo cat /dev/<disk> | qdda <options>". Requires sudo access for qdda, still unsafe.
- Use named pipes:
- as root run 'mkfifo /tmp/p ; cat /dev/<disk> > /tmp/p'
- as qdda run 'qdda <options> /tmp/p'
- This still requires some commands to be run as 'root' which opens the gates for mistakes.
- Change the group/permissions on /dev/<disk>: problematic as it either gives all users read access (chmod 664) or alters permissions
- for other applications, again such as Oracle ASM (chgrp qdda /dev/<disk>; chmod 640 </dev/disk>)
The best solution I have found is to use extended ACLs on Linux:
setfacl -m u:qdda:r /dev/<disk>
This gives qdda read-only access without altering any of the existing ownerships/permissions. The permissions will be reset at next reboot as all device nodes are reconstructed. You need to have ACL enabled on the root file system (holding /dev/) and the setfacl tool installed (RPM package acl).
To make life easier, the QDDA package holds a simple script to set and reset the ACLs:
acl-qdda user=qdda /dev/sd* [root@outrun01 ~](-) # ls -al /dev/sd[abc] brw-rw----+ 1 root disk 8, 0 Nov 9 10:45 /dev/sda brw-rw----+ 1 root disk 8, 16 Nov 9 09:36 /dev/sdb brw-rw----+ 1 root disk 8, 32 Nov 9 09:36 /dev/sdc # note the + appended to the permissions [root@outrun01 ~](-) # acl-qdda list /dev/sda | head # file: /dev/sda # owner: root # group: disk user::rw- user:qdda:r-- <--- "r" means read-only group::rw- mask::rw- other::---
User qdda may now read all /dev/sd* disks. The standard UNIX permissions and ownership are preserved.
QDDA can produce deep insight in dedupe and compression distribution. For this run "qdda -x" (extended report) Example:
[root@outrun01 ~](-) # qdda -x qdda 1.9.0 - The Quick & Dirty Dedupe Analyzer Use for educational purposes only - actual array reduction results may vary File Blksz Blocks MiB Filename 1 16 196608 3072 /dev/oracleasm/data 2 16 196608 3072 /dev/oracleasm/data-cp 3 16 131072 2048 /dev/oracleasm/fra 4 16 65536 1024 /dev/oracleasm/iops 5 16 65536 1024 /dev/oracleasm/iops2 6 16 65536 1024 /dev/oracleasm/iops-cp 7 16 65536 1024 /dev/oracleasm/redo 8 16 65536 1024 /dev/oracleasm/vol1 9 16 65536 1024 /dev/oracleasm/vol2 Dedupe histogram: Dupcount Blocks Bytes 1 173570 2843770880 2 129617 2123644928 3 22973 376389632 4 10911 178765824 5 2552 41811968 6 4187 68599808 7 192 3145728 8 432 7077888 9 582 9535488 10 342 5603328 11 22 360448 Compression Histogram (XtremIO X2): Bucket(KiB) Buckets Blocks Bytes 1 73012 4563 74764288 2 15525 1940 31795200 3 8567 1606 26317824 4 7424 1856 30408704 5 10749 3359 55034880 6 19021 7132 116865024 7 21167 9260 151725056 8 60652 30326 496861184 9 75050 42215 691660800 10 32465 20290 332441600 11 12633 8685 142298112 12 3132 2349 38486016 13 685 556 9118720 14 115 100 1648640 16 5183 5183 84918272
11 More features
- Scanning from named pipes, stdin
- You can scan from other hosts via (named or unnamed) pipes. Examples:
dd if=/dev/urandom bs=1M count=1024 | qdda # via pipe on stdin ssh root@bighost "dd if=/dev/oracleasm/data" | qdda # via pipe from remote ssh-> dd # Using named pipe mkfifo pipe # start background dump to pipe dd if=/dev/urandom of=pipe bs=1M count=1024 & # let qdda read from pipe qdda pipe
- Note that ssh encrypts all data so performance may be limited.
- Scanning across a network pipe
- You can do this using netcat (nc). This sends the data unencrypted which gives better performance. Example:
- target host: (as non-root user): nc -l 19000 | qdda
- source host: (as root) : cat /dev/<disk> | nc targethost 19000
- Change blocksize
- You can change the default 16K blocksize between 1 and 128K in multiples of 1K (1024 bytes). Note that you need to reset the database when starting with a new blocksize.
- Non-standard blocksizes cause compression estimates to be disabled as we don't know what buckets are available to put the compressed blocks into.
- Bandwidth throttling
- qdda prevents overloading IO stacks by silently limiting bandwidth to 200MB/s. You can go fullspeed with "-b 0" or throttle to other values, i.e. "-b 50" for 50MB/s.
- Compression method
- You can select a non-default compression method. The metadata for this needs to be loaded in the database first. TBD.
- Change DB filename
- Default database is /var/tmp/qdda.db. Change with "-f mydb" if you want to store the database elsewhere (or don't want to lose the contents of the other one).
- If you don't care about keeping the results and have enough RAM, you can specify the SQLite name for keeping the database in memory with "-f :memory:". Qdda will not create a database file.
- Append instead of replace
- The default behaviour is to wipe the database with each scan. The "-a" (append) option prevents deleting the database. This allows multiple qdda runs where the results are added to the existing database.
- Use "-q" (Quiet) to avoid showing the progress indicator or intermediate results.
- Use "-r" (no-Report) to avoid merging and showing the results when finished.
- Performance test
- Use "-t <gb>" to run a performance test. Qdda will show individual speeds for hashing, compression, db operations and reading from a stream and an estimate for the max performance on your system. 0 = load test (hashing, checksumming, database insert) or a large dataset merge test with -t <gb>
- use "-D" (Debug) to show lots of annoying debug info.
- use "-Q" (Queries) to show which queries are executed against SQLite (note not all queries are displayed)
- Run your own queries against the data
- and run your own queries. Example:
# count blocks that have exactly 3 duplicates select count(k) from kv where v=3; # Get list of blocks that appear more than twice and sort them by amount select v,count(k) from kv where k!=0 and v>2 group by v order by v;
- You can test other than 16K blocksizes with the -B option. The max (currently) is 128K and must be specified in kibibytes (1024 bytes)
- Not tested (yet) with very large data sets - I've tested up to 4TB datasets which took a few hours to process. SQLite3 can handle very large amounts of values but I have not tuned SQLite much for performance. I expect it to run fine however up to at least 10TB with default blocksize.
- The compression ratio can differ from what the storage system uses. It's just a rough estimate. Mileage may vary. The algorithm is based on LZ4 and compress buckets and may work very different for various platforms.
- SQLite is used because it is available on all EL6 systems (and probably most other Linux distro's) and because it's simple and fast. Other (real) key-value stores might be better but this does the job. We can also store other info than key-value data in the database which makes it a good choice.
- The database roughly needs 20 bytes per block for scanning and about 3X that size for processing (non-dedupable) data (I found 16MiB DB size for scanning 8GB non-dedupable data). It may require more when processing huge datasets (due to require deeper indices).
- You can set the database filename to "/bigfilesystem/bla.db" if you run out of space on /var/tmp. Also note that many systems clean up /var/tmp after a day or so, place your database elsewhere if you want to keep results.
Tables with index require 20 bytes per block data + 20 bytes index, which means for a 1TB dataset at 16K blocksize, no deduplication:
Primary database (/var/tmp/qdda.db) = 1TB/16K * (20 + 20) = 67 108 864 blocks * 40 ~ 2.6 GB. staging database (/var/tmp/qdda-staging.db) = 2.6 GB (deleted after loading into the primary) temp size for indexing/joining (/var/tmp/<hidden file>) = Roughly 50% of staging database ~ 1.3 GB
Reporting on large databases also requires some temp space. Note:
- the larger the blocksize, the fewer rows are needed and the smaller the required space
- the staging database contains a row for every scanned block
- the primary database contains a row for every deduped block. If dedupe is 2:1 then the database size is 50%
These numbers can be calculated using the -t <size> option, for example for a 1TB dataset @ 16K: qdda -t1024 A 2TB dataset, 128K blocksize: qdda -B128 -T2048
If you need more space than available on the default paths, you can change the locations of the files:
- Primary database with -f option, for example -f /bigdata/myqdda.db - Staging database and hidden directories with -T option - choose a temp directory, i.e. -T /hugetemp/
qdda is single-threaded (for now). On my Intel i5 @ 3.10GHz I get the following profile test results:
[root@outrun01 ~](-) # qdda -t 0 -f /var/tmp/p.db qdda 1.9.0 - The Quick & Dirty Dedupe Analyzer Use for educational purposes only - actual array reduction results may vary *** Synthetic performance test *** Initializing: 131072 blocks, 16k (2048 MiB) Hashing: 3747010 usec, 573.12 MB/s Compressing: 1440107 usec, 1491.20 MB/s DB insert: 129703 usec, 16556.93 MB/s Total: 5316820 usec, 403.90 MB/s
The actual scanning is a little bit slower than a synthetic data generator, so expect roughly 300-600MB/s on typical systems. I expect high-end server type CPUs to be even faster. Note that to get more than about 200MB/s you need to disable throttling (-b 0) which was built in as a safeguard against I/O starvation on production systems.
Merge performance - the database internally has to count, sort and join a lot of rows (each block is a row in the staging database). Performance depends on CPU and I/O speed. Test the merging speed with the '-t size_gb' option. On my system the merge performance is (for 16K blocksize on empty database, 2 threads):
Size (GB) Indexing(s) Index(MB/s) Joining(s) Join(MB/s) 2 0.08 27170 0.16 12931 16 0.77 21251 1.35 12099 64 3.71 17654 5.56 11781 128 19.81 13230 23.57 11121 1024 53.52 19592 98.39 10657