Quick and Dirty Dedupe Analyzer



Intro

QDDA - The Quick & Dirty Dedupe Analyzer

  • Author
Bart Sjerps (DellEMC)
  • License
GPLv3+
  • Source code
qdda on github

QDDA can analyze files, block devices and data streams to find duplicate blocks, so one can estimate the potential storage savings of implementing deduplication-capable storage such as Dell EMC XtremIO or VMAX All-Flash (the latter experimental). It also estimates the compression ratio.

QDDA is written in C/C++ for performance and requires Linux. It uses MD5 for hash calculations, LZ4 for compression and SQLite3 as its database.

Tested on EL6 (CentOS, but RHEL/OEL and EL7 should also work fine) and Linux Mint.

Disclaimer

QDDA is licensed under GPLv3+, which means it is free to use, but use it at your own risk. I accept no claims for any damage resulting from QDDA showing incorrect results.

(Phew, that was the legal stuff. See also the section below on protection against bugs.)

Protection against bugs

QDDA is safe to run even on files/devices that are in use. It opens streams read-only and cannot modify any files except SQLite3 databases. It writes to a database file that needs to be either newly created or a pre-existing SQLite3 database. It can remove the database file but ONLY if it is an SQLite3 file (it tests the file magic).

For added safety you may run qdda as a non-privileged user and only open named pipes or standard input (stdin), or use ACLs (see below).
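
For illustration, the file-magic test mentioned above can be as simple as comparing the first 16 bytes of the file against the SQLite3 header string "SQLite format 3" (followed by a nul byte). The sketch below is a minimal, hypothetical version of such a check, not qdda's actual code:

// Minimal, hypothetical sketch of an SQLite3 file-magic test (not qdda's actual code).
#include <cstdio>
#include <cstring>

static bool is_sqlite3_file(const char* path) {
  char magic[16] = {0};
  std::FILE* f = std::fopen(path, "rb");
  if (!f) return false;
  std::size_t n = std::fread(magic, 1, sizeof magic, f);
  std::fclose(f);
  // "SQLite format 3" is 15 characters; the 16th byte of a valid header is 0.
  return n == 16 && std::memcmp(magic, "SQLite format 3", 16) == 0;
}

int main(int argc, char** argv) {
  for (int i = 1; i < argc; i++)
    std::printf("%s: %s\n", argv[i],
                is_sqlite3_file(argv[i]) ? "SQLite3 database" : "not SQLite3 - refuse to delete");
  return 0;
}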

Usage

qdda 1.9.0 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

Usage: qdda [-D] [-B blksize] [-b <bandw>] [-c] [-d] [-f <dbpath>] [h|H] [-i importdb] 
            [-k] [-n] [-q] [-r] [-t [gb]] [-v] [-x] [device list]
  -D (debug)        : Show lots of annoying debug info
  -B <blksize_kb>   : Set blocksize to blksize_kb kilobytes
  -P (purge)        : Reclaim unused space in database (sqlite vacuum)
  -T <tempdir>      : Use tempdir instead of /var/tmp for SQLite temp files
  -a (append)       : Append data to existing database
  -b <bandw>        : Throttle bandwidth in MB/s (default 200, set to 0 to disable throttling)
  -c <method|l>     : Compression method for reporting (use -c l to list methods)
  -d (dump)         : Dump block offsets and hashes (to search for block offset with hash)
  -f <database>     : Specify alternative database location (default /var/tmp/qdda.db)
  -i <import-db>    : Import data from another database (merge with existing data)
  -n (noupdate)     : Dryrun - just read data (no update to database)
  -q (quiet)        : Don't show progress indicator or intermediate results
  -r (no-Report)    : Don't show report, don't merge staging data and keep staging database
  -t [size_gb|0]    : Run raw performance test (size_gb 0) or merge test (data size in GB)
  -V (version)      : print version and copyright info
  -x (eXtended)     : Extended report (file info and dedupe/compression histograms)
  -h or ? (help)    : This help
  -H Long help      : More detailed information, tips & tricks

qdda is safe to run even on files/devices that are in use. It opens streams read-only and cannot modify any files.
It writes to a database file that needs to be either newly created or a pre-existing SQLite3 database.
It can remove the database file but ONLY if it is an SQLite3 file.

See the long help for additional safety options.

                      *** Overview ***
total               = Total data blocks scanned
used                = Used (non-zero) space
deduped             = Required space after deduplication
allocated           = Required space after applying compression (array specific)
                      *** Details ***
Compression method  = Description of array/method
blocksize           = Dedupe blocksize
free (zero)         = Zeroed blocks (not allocated)
compress pre dedup  = Sum of bytes of compressed blocks (before dedupe)
merged by dedupe    = Space saved by deduplication
compress post dedup = Sum of bytes of compressed blocks (after dedupe)
unique data         = Blocks that can't be deduped
duped 2:1           = Blocks that are deduped 2:1
duped >2x           = Blocks that are deduped more than 2:1
duped total         = Blocks that are deduped 2:1 or more
                      *** Summary ***
percentage used     = Used capacity / Total capacity
percentage free     = Free capacity (zeroed)
deduplication ratio = used capacity / deduped capacity
compression ratio   = deduped and compressed / deduped
thin ratio          = Free capacity / Total capacity
combined            = dedup * compressed * thin ratio
raw capacity        = Total scanned (allocated)
net capacity        = Total required after optimizations

Hash calculations are highly accurate up to roughly 10TB data using 16K blocksize,
after which hash collisions may cause increasingly inaccurate results.

More info: http://outrun.nl/wiki/qdda

Usage example

Output from a run against a test VM running Oracle on ASM.

[root@outrun01 ~](-) # qdda -b0 /dev/oracleasm/*
qdda 1.9.1 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

File 01, 196608 16k blocks (3072 MiB) processed, 68/9999 MB/s, 101 MB/s avg                  
File 02, 196608 16k blocks (3072 MiB) processed, 64/9999 MB/s, 62 MB/s avg                 
File 03, 131072 16k blocks (2048 MiB) processed, 126/9999 MB/s, 121 MB/s avg                 
File 04, 65536 16k blocks (1024 MiB) processed, 59/9999 MB/s, 59 MB/s avg                 
File 05, 65536 16k blocks (1024 MiB) processed, 68/9999 MB/s, 59 MB/s avg                 
File 06, 65536 16k blocks (1024 MiB) processed, 58/9999 MB/s, 54 MB/s avg                 
File 07, 65536 16k blocks (1024 MiB) processed, 66/9999 MB/s, 62 MB/s avg                 
File 08, 65536 16k blocks (1024 MiB) processed, 69/9999 MB/s, 62 MB/s avg                 
File 09, 65536 16k blocks (1024 MiB) processed, 62/9999 MB/s, 60 MB/s avg                 
Merging 917504 blocks (14336 MiB) with 0 blocks (0 MiB)
Indexing in 2.95 sec (310510 blocks/s, 4851 MiB/s), Joining in 1.36 sec (674995 blocks/s, 10546 MiB/s)
                      *** Overview ***
total               =    14336.00 MiB (    917504 blocks)
used                =     9327.33 MiB (    596949 blocks)
deduped             =     5396.56 MiB (    345380 blocks)
allocated           =     2178.44 MiB (    139420 blocks)
                      *** Details ***
Compression method  =      XtremIO X2
blocksize           =          16 KiB
free (zero)         =     5008.67 MiB (    320555 blocks)
compress pre dedup  =     3084.41 MiB (     66.93 %)
merged by dedupe    =     3930.77 MiB (    251569 blocks)
compress post dedup =     1990.34 MiB (     63.12 %)
unique data         =     2712.03 MiB (    173570 blocks)
duped 2:1           =     2025.27 MiB (    129617 blocks)
duped >2:1          =      659.27 MiB (     42193 blocks)
duped total         =     2684.53 MiB (    171810 blocks)
                      *** Summary ***
percentage used     =       65.06 %
percentage free     =       34.94 %
deduplication ratio =        1.73
compression ratio   =        2.48
thin ratio          =        1.54
combined            =        6.58
raw capacity        =    14336.00 MiB
net capacity        =     2178.44 MiB

Explanation: We scanned 9 Oracle ASM devices with various types of data using the default blocksize (16K): 14 GB in total, of which about 9.3 GB is actually used. Because we use a 16K block size, QDDA assumes we are simulating an XtremIO X2.

Oracle rarely writes duplicate blocks (with some exceptions), so the dedupe ratio is not very high. But we copied the data disk (to data-cp) to simulate a cloned database, which shows: almost 4 GB of capacity is merged (saved) by deduplication.

So we find a lot of blocks in the "duped 2:1" category and a smaller amount at the higher dedupe ratios.

Pre-dedupe compression is simply the sum of all block sizes after compression but before dedupe, so identical blocks are each compressed and counted. After dedupe the compressed size would be 1990 MiB, the sum of the compressed sizes of all unique blocks. But compressed blocks are stored in "buckets" for performance reasons, so a 16K block that compresses to 4100 bytes goes into the 5K bucket because it is just too large for a 4096-byte slot. This explains why the allocated size is a little higher than the compressed post-dedupe size.

The space savings percentage post dedupe is roughly 63% (savings = 1 - (compressed post dedup / deduped size)), or about 1:3, which is usual for Oracle data.

The allocation of compressed blocks into buckets comes at the expense of a bit of additional capacity, but avoids heavy fragmentation and performance issues. The required capacity on an array that does thin provisioning, bucket compression and deduplication with 16K blocks would be roughly 2.2 GB (and we started with 14 GB), so the overall efficiency here is 1:6.58. Real-world environments may differ from my test environment.
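
To make the bucket logic concrete, here is a minimal sketch of bucket allocation. It assumes buckets in 1 KiB steps up to the 16 KiB blocksize, roughly matching the extended report further down; real arrays may use a different bucket set, and this is not qdda's actual code:

// Minimal bucket-allocation sketch, assuming 1 KiB bucket steps up to the blocksize.
#include <cstdio>

static int bucket_kb(int compressed_bytes, int blocksize_kb = 16) {
  int kb = (compressed_bytes + 1023) / 1024;   // round up to the next whole KiB
  if (kb < 1) kb = 1;
  if (kb >= blocksize_kb) kb = blocksize_kb;   // (nearly) incompressible: store as-is
  return kb;
}

int main() {
  // The example from the text: 4100 bytes no longer fits a 4 KiB slot -> 5 KiB bucket.
  std::printf("4100 bytes -> %d KiB bucket\n", bucket_kb(4100));   // prints 5
  std::printf("4096 bytes -> %d KiB bucket\n", bucket_kb(4096));   // prints 4
  return 0;
}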

Accuracy

Various factors influence how close QDDA can get to what you would really save on an AFA. Some factors that influence accuracy:

  • Compression algorithm. QDDA uses LZ4, which is not what an actual XtremIO or VMAX AFA uses (for various reasons). So by design the compression ratio is not 100% accurate, although for XtremIO the results turned out to be pretty close. VMAX compression is experimental; results are TBD and the algorithm may change
  • Hash collisions. There is a statistical chance that 2 different blocks generate the same hash. Statistically this happens about once in a 5 TB data set (at 16 KiB blocksize), and at 5 TB a deviation of one 16K block is acceptable as we don't store real data
  • Differences in allocation algorithms. Especially VMAX AFA has a dynamic backend which is practically impossible to simulate with a tool so the allocation rates are slightly optimistic (for now)
  • VMAX AFA currently does not dedupe so you need to look only at the "compress pre dedup" line. The allocated capacity is what would be required if VMAX would dedupe at 128K chunks - which may or may not happen in the future
  • Bugs. Impossible of course... :-/

Note that the accuracy may get worse for datasets much larger than, say, 10TB@16K or 80TB@128K due to hash collisions.
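
As a back-of-the-envelope check of these collision numbers (my own arithmetic, not qdda code): with a hash truncated to 56 bits, the birthday approximation gives roughly n²/2^57 expected collisions for n scanned blocks.

// Collision estimate for a 56-bit hash: expected collisions ~ n^2 / 2^57.
#include <cmath>
#include <cstdio>

int main() {
  const double blocksize = 16.0 * 1024;                        // 16 KiB blocks
  for (double tb : {1.0, 5.0, 10.0, 50.0}) {
    double n = tb * 1024 * 1024 * 1024 * 1024 / blocksize;     // number of blocks scanned
    double expected = n * n / std::pow(2.0, 57);               // birthday approximation
    std::printf("%5.0f TB @ 16K: %.2e blocks, ~%.2f expected collisions\n", tb, n, expected);
  }
  return 0;
}

This prints roughly 0.8 expected collisions at 5 TB and about 3 at 10 TB, consistent with the statements above.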

Install

EL6 (CentOS/RHEL/OEL)
# Install the Outrun Extras repository
yum install http://yum.outrun.nl/outrun-extras.rpm
# Install qdda from repo
yum install qdda
Non-EL (Ubuntu, Mint, etc.)

Copy /usr/bin/qdda from an RPM-compatible system.

Ready to use!

Concept

QDDA creates a key-value store using an SQLite3 database. The key-value pairs are filled with hash values (the key) and the number of occurrences (the value).

The hash algorithm used is MD5, stripped to 56 bits (7 bytes) and calculated for every block. The compressed size of each block is also recorded. The results are merged into a key-value table with 3 columns: key (the MD5 hash), value (the count of blocks with the same hash) and bytes (compressed size). A zeroed block gets the special "hash" value of 0.

When scanning is complete, the number of unique keys represents the actual storage capacity that would be required after deduplication. The sum of all values is the number of raw blocks scanned. So by dividing the used (non-zero) blocks by the number of unique keys you get the deduplication ratio.

Every (unique) block is also compressed using the LZ4 compression algorithm. We keep track of the total bytes before and after compression for each block, so we can estimate the compression ratio. Some all-flash arrays store compressed data in "buckets" to increase performance and avoid fragmentation overhead; this "bucket compression" can also be calculated.

A zero block (only zero values) will generate the special hash value 0 (the real hash is ignored) so we can also figure out how much space is allocated vs. unused.
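
To make the concept concrete, below is a heavily simplified sketch of such a scan loop: read a 16K block, treat an all-zero block specially, hash it with MD5 truncated to 56 bits, compress it with LZ4, and count occurrences in a key-value structure (an in-memory map stands in for the SQLite3 kv table). Names and structure are illustrative only, not qdda's actual source:

// Simplified concept sketch (not qdda's actual source).
// Build with something like: g++ scan.cpp -lcrypto -llz4
#include <openssl/md5.h>
#include <lz4.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <map>
#include <vector>

struct Counters { uint64_t blocks = 0; uint64_t compressed_bytes = 0; };

int main(int argc, char** argv) {
  const std::size_t blocksize = 16 * 1024;               // default 16 KiB
  std::FILE* in = (argc > 1) ? std::fopen(argv[1], "rb") : stdin;
  if (!in) { std::perror("open"); return 1; }

  std::vector<char> buf(blocksize), cbuf(LZ4_compressBound(blocksize));
  std::map<uint64_t, Counters> kv;                        // stands in for the SQLite kv table

  std::size_t n;
  while ((n = std::fread(buf.data(), 1, blocksize, in)) > 0) {
    std::fill(buf.begin() + n, buf.end(), 0);             // pad a short last block with zeroes

    uint64_t key = 0;                                     // zero block -> special key 0
    bool zero = std::all_of(buf.begin(), buf.end(), [](char c){ return c == 0; });
    if (!zero) {
      unsigned char md[MD5_DIGEST_LENGTH];
      MD5(reinterpret_cast<unsigned char*>(buf.data()), blocksize, md);
      std::memcpy(&key, md, 7);                           // keep 56 bits (7 bytes) of the hash
    }
    int csize = zero ? 0 : LZ4_compress_default(buf.data(), cbuf.data(),
                                                blocksize, cbuf.size());
    kv[key].blocks++;                                     // occurrence count per hash
    kv[key].compressed_bytes = csize;                     // compressed size of this unique block
  }

  uint64_t total = 0, deduped = 0, zeroed = kv.count(0) ? kv[0].blocks : 0;
  for (auto& e : kv) { total += e.second.blocks; if (e.first) deduped++; }
  std::printf("total=%llu used=%llu deduped=%llu\n",
              (unsigned long long)total,
              (unsigned long long)(total - zeroed),
              (unsigned long long)deduped);
  if (in != stdin) std::fclose(in);
  return 0;
}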

Summary

We use a simple key-value table "kv" where k is the hash and v is the number of times the hash is found.

For simplicity, say we scan 10 blocks where the hashes would be:

              0  5  6  7  8  <-- hashes (simplified)
block 0 = 5      *
block 1 = 0   *
block 2 = 6         *
block 3 = 7            *
block 4 = 0   *
block 5 = 6         *
block 6 = 8               *
block 7 = 6         *
block 8 = 6         *
block 9 = 5      *
              -------------
Sums:         2  2  4  1  1
                       ^--^ - non-dedupeable
                    ^------ - deduped 4:1
                 ^--------- - deduped 2:1
              ^------------ - zero (not used)

here we have 

total   = 10
free    = 2
used    = 8
unique  = 2 (i.e. 7,8 appear only once)
duped   = 2 (5 and 6 appear more than once)
deduped = 4 (i.e. 4 non-zero hashes: 5,6,7,8)

FYI, the k,v table would look like this (5 rows):

k=0, v=2
k=5, v=2
k=6, v=4
k=7, v=1
k=8, v=1

How to query the kv table:

Total blocks      = sum(v)                      - total logical blocks scanned  (2+2+4+1+1 = 10)
Free blocks       = sum(v)   where k=0          - total logical zero blocks     (v[0]=2)
Used blocks       = sum(v)   where k!=0         - total logical non-zero blocks (2+4+1+1 = 8)
Deduped blocks    = count(*) where k!=0         - total amount of hashes - required capacity after dedupe - excluding free blocks (4 keys with k!=0)
Deduped + zero    = count(*)                    - equal to # rows in the table  (4+1 = 5)

Unique blocks     = count(*) where k!=0 and v=1 - total blocks which occur only once (no deduplication) (2, k[7] and k[8] )
Duped blocks      = count(*) where k!=0 and v>1 - total blocks which occur more than once (deduplication occurs) (2, k[5] and k[6])
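
For what it's worth, the toy example above can be reproduced in a few lines of code; the sketch below (illustrative only, not qdda code) derives the same counters from the simplified hashes:

// Sketch reproducing the 10-block example above (hash 0 = zero block).
#include <cstdio>
#include <map>

int main() {
  int hashes[10] = {5, 0, 6, 7, 0, 6, 8, 6, 6, 5};       // blocks 0..9 from the example
  std::map<int, int> kv;                                 // k = hash, v = occurrence count
  for (int h : hashes) kv[h]++;

  int total = 0, free_ = 0, deduped = 0, unique = 0, duped = 0;
  for (auto& e : kv) {
    total += e.second;                                   // sum(v)
    if (e.first == 0) { free_ = e.second; continue; }    // zero blocks
    deduped++;                                           // count(*) where k!=0
    if (e.second == 1) unique++; else duped++;
  }
  std::printf("total=%d free=%d used=%d deduped=%d unique=%d duped=%d\n",
              total, free_, total - free_, deduped, unique, duped);
  // prints: total=10 free=2 used=8 deduped=4 unique=2 duped=2
  return 0;
}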

Safety

For added safety you may run qdda as a non-privileged user. Non-root users usually do not have access to block devices so we need a way to get raw disk data to qdda without adding security risks.

To run QDDA as a non-root user, there are various methods (assuming a "qdda" user without root privileges):

  • Add qdda to group "disk". This is still unsafe as the "disk" group has read/write access to disks, and still causes problems with ASM disks for example which are usually owned by a different group
  • Use "sudo cat /dev/<disk> | qdda <options>". Requires sudo access for qdda, still unsafe.
  • Use named pipes:
as root run 'mkfifo /tmp/p ; cat /dev/<disk> > /tmp/p'
as qdda run 'qdda <options> /tmp/p'
This still requires some commands to be run as 'root' which opens the gates for mistakes.
  • Change the group/permissions on /dev/<disk>: problematic as it either gives all users read access (chmod 664) or alters permissions
for other applications, again such as Oracle ASM (chgrp qdda /dev/<disk>; chmod 640 /dev/<disk>)

The best solution I have found is to use extended ACLs on Linux:

setfacl -m u:qdda:r /dev/<disk>

This gives qdda read-only access without altering any of the existing ownerships/permissions. The permissions will be reset at next reboot as all device nodes are reconstructed. You need to have ACL enabled on the root file system (holding /dev/) and the setfacl tool installed (RPM package acl).

To make life easier, the QDDA package holds a simple script to set and reset the ACLs:

acl-qdda user=qdda /dev/sd*
[root@outrun01 ~](-) # ls -al /dev/sd[abc]
brw-rw----+ 1 root disk 8,  0 Nov  9 10:45 /dev/sda
brw-rw----+ 1 root disk 8, 16 Nov  9 09:36 /dev/sdb
brw-rw----+ 1 root disk 8, 32 Nov  9 09:36 /dev/sdc
# note the + appended to the permissions
[root@outrun01 ~](-) # acl-qdda list /dev/sda | head
# file: /dev/sda
# owner: root
# group: disk
user::rw-
user:qdda:r-- <--- "r" means read-only
group::rw-
mask::rw-
other::---

User qdda may now read all /dev/sd* disks. The standard UNIX permissions and ownership are preserved.

Histograms

QDDA can provide deep insight into the dedupe and compression distribution. For this, run "qdda -x" (extended report). Example:

[root@outrun01 ~](-) # qdda -x
qdda 1.9.0 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary


File  Blksz    Blocks   MiB         Filename 
1     16       196608   3072        /dev/oracleasm/data 
2     16       196608   3072        /dev/oracleasm/data-cp 
3     16       131072   2048        /dev/oracleasm/fra 
4     16       65536    1024        /dev/oracleasm/iops 
5     16       65536    1024        /dev/oracleasm/iops2 
6     16       65536    1024        /dev/oracleasm/iops-cp 
7     16       65536    1024        /dev/oracleasm/redo 
8     16       65536    1024        /dev/oracleasm/vol1 
9     16       65536    1024        /dev/oracleasm/vol2 

Dedupe histogram:
Dupcount     Blocks       Bytes 
1            173570       2843770880 
2            129617       2123644928 
3            22973        376389632 
4            10911        178765824 
5            2552         41811968 
6            4187         68599808 
7            192          3145728 
8            432          7077888 
9            582          9535488 
10           342          5603328 
11           22           360448 

Compression Histogram (XtremIO X2): 
Bucket(KiB)  Buckets      Blocks       Bytes 
1            73012        4563         74764288 
2            15525        1940         31795200 
3            8567         1606         26317824 
4            7424         1856         30408704 
5            10749        3359         55034880 
6            19021        7132         116865024 
7            21167        9260         151725056 
8            60652        30326        496861184 
9            75050        42215        691660800 
10           32465        20290        332441600 
11           12633        8685         142298112 
12           3132         2349         38486016 
13           685          556          9118720 
14           115          100          1648640 
16           5183         5183         84918272 


More features

Scanning from named pipes, stdin
You can scan from other hosts via (named or unnamed) pipes. Examples:
dd if=/dev/urandom bs=1M count=1024          | qdda # via pipe on stdin
ssh root@bighost "dd if=/dev/oracleasm/data" | qdda # via pipe from remote ssh-> dd

# Using named pipe
mkfifo pipe
# start background dump to pipe
dd if=/dev/urandom of=pipe bs=1M count=1024 &
# let qdda read from pipe
qdda pipe
Note that ssh encrypts all data so performance may be limited.
Scanning across a network pipe
You can do this using netcat (nc). This sends the data unencrypted which gives better performance. Example:
target host: (as non-root user): nc -l 19000 | qdda
source host: (as root)  : cat /dev/<disk> | nc targethost 19000
Change blocksize
You can change the default 16K blocksize between 1 and 128K in multiples of 1K (1024 bytes). Note that you need to reset the database when starting with a new blocksize.
Non-standard blocksizes cause compression estimates to be disabled as we don't know what buckets are available to put the compressed blocks into.
Bandwidth throttling
qdda prevents overloading I/O stacks by silently limiting bandwidth to 200 MB/s. You can go full speed with "-b 0" or throttle to other values, e.g. "-b 50" for 50 MB/s.
Compression method
You can select a non-default compression method. The metadata for this needs to be loaded in the database first. TBD.
Change DB filename
Default database is /var/tmp/qdda.db. Change with "-f mydb" if you want to store the database elsewhere (or don't want to lose the contents of the other one).
In-memory
If you don't care about keeping the results and have enough RAM, you can specify the SQLite name for keeping the database in memory with "-f :memory:". Qdda will not create a database file.
Append instead of replace
The default behaviour is to wipe the database with each scan. The "-a" (append) option prevents deleting the database. This allows multiple qdda runs where the results are added to the existing database.
Quiet
Use "-q" (Quiet) to avoid showing the progress indicator or intermediate results.
No-report
Use "-r" (no-Report) to avoid merging and showing the results when finished.
Performance test
Use "-t <gb>" to run a performance test. Qdda will show individual speeds for hashing, compression, db operations and reading from a stream and an estimate for the max performance on your system. 0 = load test (hashing, checksumming, database insert) or a large dataset merge test with -t <gb>
Debug
use "-D" (Debug) to show lots of annoying debug info.
Queries
use "-Q" (Queries) to show which queries are executed against SQLite (note not all queries are displayed)
Advanced
Run your own queries against the data
Open the database with
sqlite3 /var/tmp/qdda.db
and run your own queries. Example:
-- count the hashes (deduped blocks) that occur exactly 3 times
select count(k) from kv where v=3;
-- for each dupcount above 2, show how many hashes occur that many times
select v,count(k) from kv where k!=0 and v>2 group by v order by v;
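
The same kind of query can also be run from a program. Below is a minimal sketch using the SQLite3 C API against the kv table described above; the query itself is just an example and this is not part of qdda:

// Minimal sketch: run a dedupe-histogram style query against qdda's kv table.
// Build with: g++ query.cpp -lsqlite3
#include <sqlite3.h>
#include <cstdio>

int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open_v2("/var/tmp/qdda.db", &db, SQLITE_OPEN_READONLY, nullptr) != SQLITE_OK) {
    std::fprintf(stderr, "cannot open database: %s\n", db ? sqlite3_errmsg(db) : "out of memory");
    sqlite3_close(db);
    return 1;
  }
  const char* sql = "select v, count(k) from kv where k!=0 group by v order by v";
  sqlite3_stmt* stmt = nullptr;
  if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK) {
    std::printf("dupcount  hashes\n");
    while (sqlite3_step(stmt) == SQLITE_ROW)
      std::printf("%8lld  %lld\n",
                  (long long)sqlite3_column_int64(stmt, 0),
                  (long long)sqlite3_column_int64(stmt, 1));
  }
  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}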

Notes

  • You can test blocksizes other than 16K with the -B option. The maximum (currently) is 128K and it must be specified in kibibytes (1024 bytes)
  • Not tested (yet) with very large data sets - I've tested up to 4TB datasets which took a few hours to process. SQLite3 can handle very large amounts of values but I have not tuned SQLite much for performance. I expect it to run fine however up to at least 10TB with default blocksize.
  • The compression ratio can differ from what the storage system achieves. It's just a rough estimate; mileage may vary. The algorithm is based on LZ4 and compression buckets and may behave very differently on various platforms.
  • SQLite is used because it is available on all EL6 systems (and probably most other Linux distros) and because it's simple and fast. Other (real) key-value stores might be better, but this does the job. We can also store other info than key-value data in the database, which makes it a good choice.
  • The database needs roughly 20 bytes per block for scanning and about 3x that size for processing (non-dedupable) data (I found a 16 MiB DB size for scanning 8 GB of non-dedupable data). It may require more when processing huge datasets (due to requiring deeper indices).
  • You can set the database filename to "/bigfilesystem/bla.db" if you run out of space on /var/tmp. Also note that many systems clean up /var/tmp after a day or so, place your database elsewhere if you want to keep results.

Sizing

Tables with index require 20 bytes per block data + 20 bytes index, which means for a 1TB dataset at 16K blocksize, no deduplication:

Primary database (/var/tmp/qdda.db) = 1TB/16K * (20 + 20) = 67,108,864 blocks * 40 bytes ~ 2.6 GB
Staging database (/var/tmp/qdda-staging.db) ~ 2.6 GB (deleted after loading into the primary)
Temp space for indexing/joining (/var/tmp/<hidden file>) = roughly 50% of the staging database ~ 1.3 GB
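
A quick way to reproduce these numbers (my own arithmetic, using the ~40 bytes per block figure above; not an official sizing tool):

// Rough sizing estimate based on ~40 bytes (data + index) per block row.
#include <cstdio>

int main() {
  const double dataset_tb = 1.0, blocksize_kb = 16.0, bytes_per_row = 40.0;
  double blocks  = dataset_tb * 1024 * 1024 * 1024 / blocksize_kb;   // dataset in KiB / blocksize
  double primary = blocks * bytes_per_row / 1e9;                     // GB, assuming no dedupe
  std::printf("%.0f blocks, primary db ~%.2f GB, staging ~%.2f GB, temp ~%.2f GB\n",
              blocks, primary, primary, primary / 2);
  return 0;
}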

Reporting on large databases also requires some temp space. Note:

  • the larger the blocksize, the fewer rows are needed and the smaller the required space
  • the staging database contains a row for every scanned block
  • the primary database contains a row for every deduped block. If dedupe is 2:1 then the database size is 50%

These numbers can be checked using the -t <size> option, for example for a 1TB dataset @ 16K: qdda -t1024. For a 2TB dataset at 128K blocksize: qdda -B128 -t2048

If you need more space than available on the default paths, you can change the locations of the files:

  • Primary database with the -f option, for example -f /bigdata/myqdda.db
  • Staging database and hidden directories with the -T option - choose a temp directory, i.e. -T /hugetemp/

Performance

qdda is single-threaded (for now). On my Intel i5 @ 3.10GHz I get the following profile test results:

[root@outrun01 ~](-) # qdda -t 0 -f /var/tmp/p.db
qdda 1.9.0 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

*** Synthetic performance test ***
Initializing:         131072 blocks, 16k (2048 MiB)
Hashing:             3747010 usec,     573.12 MB/s
Compressing:         1440107 usec,    1491.20 MB/s
DB insert:            129703 usec,   16556.93 MB/s
Total:               5316820 usec,     403.90 MB/s

The actual scanning is a little bit slower than a synthetic data generator, so expect roughly 300-600MB/s on typical systems. I expect high-end server type CPUs to be even faster. Note that to get more than about 200MB/s you need to disable throttling (-b 0) which was built in as a safeguard against I/O starvation on production systems.
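
For the curious, this kind of throttling can be done by simply sleeping whenever the running throughput gets ahead of the target; a simplified sketch of the idea (not qdda's actual implementation):

// Simplified bandwidth-throttling sketch: sleep when ahead of the target MB/s.
#include <chrono>
#include <thread>
#include <cstdint>
#include <cstdio>

int main() {
  using clock = std::chrono::steady_clock;
  const double target_mbps = 200.0;                        // default bandwidth cap
  const std::size_t blocksize = 16 * 1024;
  uint64_t bytes_done = 0;
  auto start = clock::now();

  for (int i = 0; i < 10000; ++i) {                        // stand-in for the read loop
    // ... read and process one block here ...
    bytes_done += blocksize;
    double elapsed  = std::chrono::duration<double>(clock::now() - start).count();
    double expected = bytes_done / (target_mbps * 1024 * 1024);  // seconds this should have taken
    if (expected > elapsed)                                // ahead of schedule: slow down
      std::this_thread::sleep_for(std::chrono::duration<double>(expected - elapsed));
  }
  double elapsed = std::chrono::duration<double>(clock::now() - start).count();
  std::printf("throttled to %.1f MB/s\n", bytes_done / elapsed / 1024 / 1024);
  return 0;
}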

Merge performance - the database internally has to count, sort and join a lot of rows (each block is a row in the staging database). Performance depends on CPU and I/O speed. Test the merging speed with the '-t size_gb' option. On my system the merge performance is (for 16K blocksize on empty database, 2 threads):

Size (GB) Indexing(s)  Index(MB/s) Joining(s) Join(MB/s)
        2        0.08        27170       0.16      12931
       16        0.77        21251       1.35      12099
       64        3.71        17654       5.56      11781
      128       19.81        13230      23.57      11121
     1024       53.52        19592      98.39      10657