# Summarize

Compute quick chunk density, size and skew statistics over SciDB arrays.

# Example

Some spacing was added to the output below for quicker readability:

```
$ iquery -naq "store(apply(build(<val:double> [x=1:10000000,1000000,0], random()), val2, iif(x%2=0, 'abc','def'), val3, 0), temp)"

$ iquery -aq "summarize(temp)"
{inst,attid}   att,    count,     bytes, chunks, min_count, avg_count, max_count, min_bytes,   avg_bytes, max_bytes
{0,0}        'all', 10000000, 170002720,     40,   1000000,     1e+06,   1000000,        48, 4.25007e+06,   9000072
```

# Options

Summarize accepts the string-style boolean flag options per_attribute and per_instance, which both default to 0 and can be combined:

```
$ iquery -aq "summarize(temp, 'per_instance=1')"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',3000000,51000816,12,1000000,1e+06,1000000,48,4.25007e+06,9000072
{1,0} 'all',3000000,51000816,12,1000000,1e+06,1000000,48,4.25007e+06,9000072
{2,0} 'all',2000000,34000544,8,1000000,1e+06,1000000,48,4.25007e+06,9000072
{3,0} 'all',2000000,34000544,8,1000000,1e+06,1000000,48,4.25007e+06,9000072
```

```
$ iquery -aq "summarize(temp, 'per_attribute=1')"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'val',10000000,80000720,10,1000000,1e+06,1000000,8000072,8.00007e+06,8000072
{0,1} 'val2',10000000,90000720,10,1000000,1e+06,1000000,9000072,9.00007e+06,9000072
{0,2} 'val3',10000000,800,10,1000000,1e+06,1000000,80,80,80
{0,3} 'EmptyTag',10000000,480,10,1000000,1e+06,1000000,48,48,48
```

```
$ iquery -aq "summarize(temp, 'per_attribute=1', 'per_instance=1')"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'val',3000000,24000216,3,1000000,1e+06,1000000,8000072,8.00007e+06,8000072
{0,1} 'val2',3000000,27000216,3,1000000,1e+06,1000000,9000072,9.00007e+06,9000072
{0,2} 'val3',3000000,240,3,1000000,1e+06,1000000,80,80,80
{0,3} 'EmptyTag',3000000,144,3,1000000,1e+06,1000000,48,48,48
{1,0} 'val',3000000,24000216,3,1000000,1e+06,1000000,8000072,8.00007e+06,8000072
{1,1} 'val2',3000000,27000216,3,1000000,1e+06,1000000,9000072,9.00007e+06,9000072
{1,2} 'val3',3000000,240,3,1000000,1e+06,1000000,80,80,80
{1,3} 'EmptyTag',3000000,144,3,1000000,1e+06,1000000,48,48,48
{2,0} 'val',2000000,16000144,2,1000000,1e+06,1000000,8000072,8.00007e+06,8000072
{2,1} 'val2',2000000,18000144,2,1000000,1e+06,1000000,9000072,9.00007e+06,9000072
{2,2} 'val3',2000000,160,2,1000000,1e+06,1000000,80,80,80
{2,3} 'EmptyTag',2000000,96,2,1000000,1e+06,1000000,48,48,48
{3,0} 'val',2000000,16000144,2,1000000,1e+06,1000000,8000072,8.00007e+06,8000072
{3,1} 'val2',2000000,18000144,2,1000000,1e+06,1000000,9000072,9.00007e+06,9000072
{3,2} 'val3',2000000,160,2,1000000,1e+06,1000000,80,80,80
{3,3} 'EmptyTag',2000000,96,2,1000000,1e+06,1000000,48,48,48
```

# Returned Fields

The returned fields are described as follows (a usage sketch follows the list):

* inst (dimension): the logical ID of the instance returning the data
* attid (dimension): the attribute ID, or 0 when returning totals across attributes
* att: the attribute name as a string, or 'all'
* count: the total count of non-empty cells; equal across attributes, so not summed in totals
* bytes: the total number of bytes used
* chunks: the total number of chunks, summed across attributes
* min/avg/max_count: statistics on the number of non-empty cells in each chunk
* min/avg/max_bytes: statistics on the sizes of chunks
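
For example, the per-chunk statistics make it easy to flag skewed attributes. The following is a minimal sketch, not from the plugin docs: it assumes SciDB's standard filter operator and double() cast, and the 2x threshold is an arbitrary choice:

```
# Hypothetical skew check: list attributes whose largest chunk is more
# than twice the average chunk size (the threshold is arbitrary).
$ iquery -aq "filter(summarize(temp, 'per_attribute=1'), double(max_bytes) > 2 * avg_bytes)"
```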

# Comparison to Other Tools

Summarize is one of several ways to get chunk sizing. We'll list some advantages and disadvantages.

## Advantages

* Very quick: data is gathered from chunk headers, so chunks are not scanned off disk. For arrays of tens of terabytes or more, op_count may take a while, whereas summarize performs much faster.
* Does not require stored arrays: for example, summarize(apply(temp, b, random())) is possible, letting the user avoid a store operation. Note, however, that an implicit materialization (full or partial, depending on the query) still happens during execution.
* Easy to summarize the latest version of an array: list('chunk map') is good at describing how much total space an array takes up, but finding the size of the most recent version requires a much more complicated query (see the sketch after this list).
* Measures exactly the binary size of SciDB's internal RLE format.
* Easy: a memorable keyword and quick access. Simpler than aggregate(filter(list('chunk map'), uaid=...), ...)
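
For comparison, here is a sketch of the chunk-map route. This is an illustration, not from the plugin docs: it assumes the chunk map exposes uaid and usize attributes (check the output of list('chunk map') on your build), and 42 is a placeholder for the target array's unversioned array ID:

```
# Hypothetical chunk-map equivalent: total chunk count and logical bytes
# for one array; 42 stands in for the real unversioned array ID.
$ iquery -aq "aggregate(filter(list('chunk map'), uaid = 42), count(*) as chunks, sum(usize) as bytes)"
```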

## Disadvantages

* Does not return the compressed size or compression ratio for arrays with additional (zlib) compression enabled. You can find that data in list('datastores') or list('chunk map'); see the sketch below.
* Does not return other on-disk size overhead for stored arrays (tombstones, free lists, etc. are not counted).
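
A sketch of those catalog queries, again with assumptions: the chunk-map attribute names (usize, csize) vary by SciDB version, and 42 is a placeholder unversioned array ID:

```
# Overall datastore usage, which reflects compression:
$ iquery -aq "list('datastores')"

# Hypothetical per-chunk compressed vs. uncompressed sizes:
$ iquery -aq "project(filter(list('chunk map'), uaid = 42), usize, csize)"
```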

# Updated Features

* A new option, size_type, which allows the data to be pulled from the physical chunk descriptors, thus allowing the compressed size to be discovered. Values may be:
  * uncompressed or usize
  * compressed or csize
  * allocated or asize
* A new function, human_bytes(), which can convert an unwieldy value such as 341306424 to a more readable 325.50MB (see the sketch below)
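
A sketch of how these might be combined, following the option syntax shown earlier. The exact spelling 'size_type=csize' and the application of human_bytes() to the bytes attribute are assumptions based on the list above:

```
# Assumed usage: report compressed rather than uncompressed sizes.
$ iquery -aq "summarize(temp, 'per_attribute=1', 'size_type=csize')"

# Assumed usage: add a human-readable rendering of the bytes column.
$ iquery -aq "apply(summarize(temp), readable_bytes, human_bytes(bytes))"
```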
