Predicting compression ratio of data with compressible decision

Item request has been placed!

Item request cannot be made.

Processing Request

اقرأ أكثر حفظ في قائمتي

Publication Date:
April 16, 2024

معلومة اضافية
- Patent Number:
  11962,333
- Appl. No:
  17/891389
- Application Filed:
  August 19, 2022
- نبذة مختصرة :
  A data-compression analyzer can rapidly make a binary decision to compress or not compress an input data block or can use a slower neural network to predict the block's compression ratio with a regression model. A Concentration Value (CV) that is the sum of the squares of the frequencies and a Number of Zero (NZ) symbols are calculated from an un-sorted symbol frequency table. A rapid decision to compress is signaled when their product CV*NZ exceeds a horizontal threshold THH. During training, CV*NZ is plotted as a function of compression ratio C % for many training data blocks. Different test values of THH are applied to the plot to determine true and false positive rates that are plotted as a Receiver Operating Characteristic (ROC) curve. The point on the ROC curve having the largest Youden index is selected as the optimum THH for use in future binary decisions.
- Inventors:
  Hong Kong Applied Science and Technology Research Institute Company Limited (Hong Kong, HK)
- Assignees:
  Hong Kong Applied Science and Technology Research Institute Company Limited (Hong Kong, HK)
- Claim:
  1. A data-compression analyzer comprising a processor or an Application-Specific Integrated Circuit (ASIC) which comprises: a counter for counting occurrences of symbols in an input data block to generate a symbol frequency table storing symbol frequencies for the input data block, wherein the symbol frequency table is stored in a computer memory; a calculator for calculating a Concentration Value (CV) of the input data block by summing squares of the symbol frequencies in the symbol frequency table; a zero tracker having a Number of Zeros (NZ) that indicates a number of symbols in the input data block having a symbol value of zero; a multiplier that multiplies the Concentration Value (CV) with the Number of Zeros (NZ) to get a CV*NZ product; and a comparator for comparing the CV*NZ product to a horizontal threshold and signaling that the input data block is compressible when the CV*NZ product exceeds the horizontal threshold, and signaling that the input data block is not compressible when the CV*NZ product is less than the horizontal threshold, whereby the CV*NZ product is compared to the horizontal threshold to signal that the input data block is compressible or not compressible.
- Claim:
  2. The data-compression analyzer of claim 1 further comprising; a zeros counter that counts a number of symbols having a value of zero in the input data block to generate the Number of Zeros (NZ).
- Claim:
  3. The data-compression analyzer of claim 1 wherein the Number of Zeros (NZ) is obtained by reading an entry in the symbol frequency table for symbols having a symbols value of zero.
- Claim:
  4. The data-compression analyzer of claim 1 further comprising; a threshold optimizer comprising: a trainer that applies a plurality of training data blocks to the counter, causing the calculator and multiplier to generate a plurality of CV*NZ products for the plurality of training data blocks; wherein a compression ratio is received or generated for each training data block in the plurality of training data blocks, the compression ratio for a training data block being a ratio of a size of the training data block after compression to a size of the training data block; a tabulator that tabulates the plurality of CV*NZ products as a function of the compression ratio for each of the plurality of training data blocks; a Receiver Operating Characteristic (ROC) tabulator that compares the plurality of CV*NZ products to the horizontal threshold to obtain predicted results, and compares the compression ratio for each training data block to a vertical threshold to obtain actual results, and generates true and false positive rates from the predicted and actual results; an ROC optimizer that selects an optimal combination of the true and false positive rates and outputs a test value of the horizontal threshold corresponding to the optimal combination as the horizontal threshold for use by the comparator for input data blocks.
- Claim:
  5. The data-compression analyzer of claim 4 further comprising; a Youden index generator that generates a plurality of Youden indexes, wherein each Youden index corresponds to a point on an ROC curve of the combination of the true and false positive rates, wherein the ROC optimizer selects a Youden index having a maximum value to identify an optimal combination and the horizontal threshold generating the optimal combination.
- Claim:
  6. The data-compression analyzer of claim 4 wherein the ROC tabulator receives a vertical threshold that indicates a compression ratio, wherein training data blocks with the compression ratio greater than the vertical threshold are considered to be non-compressible, and wherein training data blocks with the compression ratio less than the vertical threshold are considered to be compressible; the ROC tabulator testing a plurality of test values of the horizontal threshold, the test values of the horizontal threshold being compared to the plurality of CV*NZ products, wherein a training data block having a CV*NZ product that is greater than the test value of the horizontal threshold is considered to be predicted compressible and a training data block having a CV*NZ product that is less than the test value of the horizontal threshold is considered to be predicted non-compressible; wherein the ROC tabulator tabulates a ROC curve wherein each point on the ROC curve is for a different one of the test values of the horizontal threshold, wherein the ROC tabulator tabulates true positives for predicted non-compressible training data blocks that are considered non-compressible, true negatives for predicted compressible training data blocks that are considered compressible, false positives for predicted compressible training data blocks that are considered non-compressible, and false negatives for predicted non-compressible training data blocks that are considered compressible; wherein the ROC tabulator tabulates a true positive rate as a function of a false positive rate for each of the plurality of test values of the horizontal threshold to determine points on the ROC curve; wherein the true positive rate is calculated by dividing the true positives by a sum of the true positives and the false negatives; wherein the false positive rate is calculated by dividing the false positives by a sum of the false positives and the true negatives.
- Claim:
  7. The data-compression analyzer of claim 6 wherein the ROC optimizer selects an optimal point on the ROC curve and outputs the test value of the horizontal threshold corresponding to the optimal point as the horizontal threshold for use by the comparator for input data blocks.
- Claim:
  8. The data-compression analyzer of claim 7 wherein the ROC optimizer calculates a plurality of Youden indexes, wherein each Youden index corresponds to a point on the ROC curve, the ROC optimizer selecting a Youden index having a maximum value to identify the optimal point.
- Claim:
  9. The data-compression analyzer of claim 1 further comprising; a sorter for sorting the symbol frequency table based on the symbol value to generated a sorted symbol frequency table; a clipper for discarding entries in the sorted symbol frequency table having symbol values below a cutoff value to generate a clipped sorted symbol frequency table having fewer entries than the sorted symbol frequency table; a neural network that receives the clipped sorted symbol frequency table as an input and generates a predicted compression ratio for the input data block.
- Claim:
  10. The data-compression analyzer of claim 9 wherein a compression ratio is received or generated for each training data block in the plurality of training data blocks, the compression ratio for a training data block being a ratio of a size of the training data block after compression to a size of the training data block; a neural network trainer that trains the neural network using the plurality of training data blocks that are processed by the counter, sorter, and clipper and input to the neural network, wherein weights of the neural network are adjusted to converge the predicted compression ratio to the compression ratio for each training data block during training, whereby the neural network is trained by the plurality of training data blocks to converge the predicted compression ratio to the compression ratio for each training data block during training.
- Claim:
  11. A computer-implemented method for managing compression comprising: receiving an input data block and storing at least a portion of the input data block on a computer memory device; generating a symbol frequency table by counting occurrences of symbols in an input data block to generate symbol frequencies stored in a symbol frequency table in the computer memory device; calculating a Concentration Value (CV) of the input data block by summing squares of the symbol frequencies read from the symbol frequency table; wherein the symbol frequency table includes a zero entry for a zero-valued symbol, the symbol frequency for the zero entry being a Number of Zeros (NZ) that indicates a number of symbols in the input data block having the zero-valued symbol; using a computer to multiply the Concentration Value (CV) with the Number of Zeros (NZ) to get a CV*NZ product; and comparing the CV*NZ product to a horizontal threshold and signaling that the input data block is compressible when the CV*NZ product exceeds the horizontal threshold, and signaling that the input data block is not compressible when the CV*NZ product is less than the horizontal threshold, whereby the computer compares the CV*NZ product to the horizontal threshold to signal that the input data block is compressible or is not compressible.
- Claim:
  12. The computer-implemented method of claim 11 further comprising during training: processing a plurality of training data blocks each having a compression ratio, the computer generating CV*NZ products for each block in the plurality of training data blocks; generating a first table of the CV*NZ products as a function of the compression ratio; wherein a vertical threshold indicates a threshold compression ratio, wherein a block with a compression ratio greater than the threshold compression ratio is classified as an actual compressible block, and a block with a compression ratio less than the threshold compression ratio is classified as an actual non-compressible block; setting a horizontal threshold to a plurality of test values, wherein a block with a CV*NZ product greater than the horizontal threshold is a predicted compressible block, and a block with a CV*NZ product less than the horizontal threshold is a predicted non-compressible block; generating a second table of true prediction rates and false prediction rates, wherein each entry in the second table has a true prediction rate and a false prediction rate and a test value of the horizontal threshold that generated the true prediction rate and the false prediction rate in the entry; wherein the true prediction rates and false prediction rates are a function of at least two of: a number of the predicted compressible blocks, a number of the predicted non-compressible blocks, a number of the actual compressible blocks, and a number of the actual non-compressible blocks; that are determined by the plurality of test values of the horizontal threshold; generating a Youden index for each entry in the second table; selecting as a selected entry an entry in the second table having a Youden index that is larger than Youden indexes for other entries in the second table; and outputting the horizontal threshold from the selected entry.
- Patent References Cited:
  9509336 November 2016 Henry
  9564918 February 2017 Asher et al.
  9606750 March 2017 Seo et al.
  10097202 October 2018 Ki et al.
  20130007346 January 2013 Khan
  20190141167 May 2019 Nolan et al.
  20210193076 June 2021 Morrell et al.
  115843366 March 2023
- Other References:
  ISR and Written Opinion, PCT/CN2022/118272, dated May 4, 2023. cited by applicant
- Primary Examiner:
  Sandifer, Matthew D
- Attorney, Agent or Firm:
  Auvinen, Stuart T.
  gPatent LLC
- الرقم المعرف:
  edspgr.11962333

تعليقات

No Comments.

Predicting compression ratio of data with compressible decision

اتصل بنا

اتبع