How crunching works

Understand how Crunch crunches your data.

Once you've integrated the Granica API/SDK into your application, it's time to get crunching. Granica Crunch works first and foremost at the bucket level. You first specify which buckets are eligible to be crunched via policy, and then you run the granica crunch <bucket_name> command to begin crunching an eligible bucket and the objects within it. "Crunching" is our shorthand for "evaluation and processing", and in our compression context the purpose of processing is to eliminate inefficiencies in data, especially large-scale AI-related data. We think there's no better way to do that than to "crunch" the data down to its purest, information-rich state.

Crunch lexicon

  • Crunched buckets are those which have been put under active management, processing and monitoring by Crunch. A bucket is in the state "crunched" immediately after you run the granica crunch command.
  • Crunched objects are those objects in a crunched bucket which have been evaluated by Crunch.
  • Vanilla buckets are those which have not been crunched.
  • Vanilla objects are those which have not been crunched. All objects in a vanilla bucket are by definition vanilla objects. Objects in a crunched bucket that are yet to be processed (i.e. crunched!) are also vanilla objects.
  • Ingested objects are those which have been crunched and reduced and/or intelligently batched.

Two ways to crunch

You can use Crunch to crunch incoming data either in the background, after objects land in your buckets, or inline, directly via the Granica API. Most Crunch users start out with background crunching and, once they are familiar with how Crunch works, switch to inline crunching, which further lowers costs. Once your data is crunched you'll see immediate savings in your storage costs. Coming soon, you'll also be able to take advantage of instant, free secondary copies of your data for various use cases.

Background crunch workflow

With this approach you write (PUT) objects into your buckets just as you normally would, using the vanilla S3/GCS SDK. No change to the write path.
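
For example, a background-crunching workload keeps using its existing S3 writes unchanged. Below is a minimal sketch using boto3; the bucket name, key, and file are hypothetical placeholders.

```python
import boto3

# Standard S3 client -- no Granica-specific configuration is needed for
# background crunching, because the write path is unchanged.
s3 = boto3.client("s3")

# Hypothetical bucket and key. Once the object lands in the crunched bucket,
# Crunch picks it up asynchronously in the background.
with open("batch-0001.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-crunched-bucket",
        Key="training-data/batch-0001.parquet",
        Body=f,
    )
```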


1. User runs the granica crunch CLI command to initiate crunching on an eligible vanilla source data bucket, at which time the bucket becomes a Crunched data bucket as it is now managed by Crunch. The Controller component of Crunch retrieves a copy of any vanilla objects in the Crunched data bucket using LIST and GET operations.

2. When an application or user issues write (PUT) requests to the Crunched data bucket, the Controller receives notifications about the new vanilla objects via real-time SQS pub/sub events and GETs the objects from the Crunched data bucket.

3. The Controller sends the vanilla objects to a load-balanced Data Reducer component.

4. The Data Reducer checks Crunch policy to determine whether the vanilla objects are eligible to be crunched. If they are not eligible, the objects are not ingested and thus remain vanilla objects inside the Crunched data bucket. If they are eligible, the Data Reducer reduces and/or intelligently batches the data and writes it to a Reduced data bucket. The degree of data reduction varies by data type, and the reduction is completely lossless. Objects are typically batched at 10:1, reducing per-object monitoring costs for storage classes such as Amazon S3 Intelligent-Tiering (S3 IT) by ~90%. From the point of view of the Crunched data bucket, the objects have been ingested by Crunch. All Reduced data buckets are internally managed by Granica and invisible to your applications.

Finally, any ingested objects in the Crunched data bucket are immediately cleaned up (i.e. deleted) to ensure that only the reduced versions of the ingested objects are retained, thus reducing your monthly cloud storage bill. The sketch below summarizes this end-to-end flow.
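
Putting the four steps together, here is an illustrative Python sketch of the background flow. It is not the actual Crunch implementation: the helper functions, the policy check, and the Reduced data bucket name are hypothetical stand-ins for the Controller and Data Reducer behavior described above.

```python
REDUCED_DATA_BUCKET = "granica-reduced-data"  # hypothetical; Granica-managed


def is_eligible(bucket: str, key: str) -> bool:
    # Stand-in for the Crunch policy check in step 4.
    return not key.endswith(".tmp")


def reduce_and_batch(data: bytes) -> bytes:
    # Stand-in for lossless reduction and ~10:1 batching.
    return data


def handle_new_object_event(event: dict, s3) -> None:
    """Handle a notification about a new vanilla object (step 2)."""
    bucket, key = event["bucket"], event["key"]  # the Crunched data bucket
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    if not is_eligible(bucket, key):
        return  # ineligible objects remain vanilla in the Crunched data bucket

    # Step 4: reduce and/or batch, then write to the Reduced data bucket.
    s3.put_object(Bucket=REDUCED_DATA_BUCKET, Key=key,
                  Body=reduce_and_batch(body))

    # Cleanup: delete the ingested original so only the reduced copy is billed.
    s3.delete_object(Bucket=bucket, Key=key)
```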

note

If you have bucket data isolation enabled (say for compliance reasons), then the reduced data is written to separate buckets that map 1:1 to your Crunched data buckets.

note

When you crunch a bucket currently in the Amazon S3 Intelligent-Tiering (S3 IT) storage class, the crunched data will change to the frequently accessed sub-class of S3 IT (i.e. the most expensive sub-class) regardless of which sub-class it was previously in, thus resetting the S3 IT "clock". Depending on access patterns this reset may or may not incur incremental operational costs, but, as with all other infrastructure and operational costs incurred by Crunch, they are paid for out of the generated savings.

The following video demo explains how to interact with Granica Crunch to background crunch an S3 bucket and then read objects that were in that bucket: Granica bgcrunch Demo Video

Inline crunch workflow

With this approach you primarily write (PUT) objects into your bucket using the Granica S3-compatible API. Crunch crunches the data inline and also intelligently batches the PUT requests to reduce ops costs and generate additional savings.

note

Enabling your custom applications to write (and read) through Granica is a simple process, typically requiring only a single-line code change. Follow this Get Started guide to learn how to integrate the Granica API and thus Crunch into your apps. Granica supports multiple languages and multiple clouds.
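
As a rough sketch, assume that for an S3-compatible SDK the single-line change amounts to pointing the client at your Granica deployment's endpoint; the endpoint URL and bucket below are hypothetical placeholders, and the Get Started guide is the authoritative reference for the actual integration in your language and cloud.

```python
import boto3

# Hypothetical: point the S3-compatible client at the Granica endpoint.
# This is an assumption for illustration; follow the Get Started guide for
# the exact change required by your SDK and cloud.
s3 = boto3.client("s3", endpoint_url="https://granica.example.internal")

# The write itself is unchanged. Granica crunches the object inline and
# intelligently batches PUTs before anything reaches cloud storage.
s3.put_object(
    Bucket="my-crunched-bucket",
    Key="logs/2024-06-01/events-0001.json",
    Body=b'{"event": "example"}',
)
```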


1. User runs the granica crunch CLI command to initiate crunching on an eligible vanilla source data bucket, at which time the bucket becomes a Crunched data bucket as it is now managed by Crunch. The Controller component of Crunch retrieves a copy of any vanilla objects in the Crunched data bucket using LIST and GET operations.

2a. (Inline) When an application uses the Granica API to write (PUT) vanilla objects into a Crunched data bucket, those inline writes bypass the Crunched data bucket, flow directly into the Controller, and incur zero PUT ops costs. Metadata about each write, such as the API request, object key, bucket, principal, and so forth, is captured and stored, enabling Granica to manage and maintain the logical bucket-to-object mapping (illustrated in the sketch after this workflow).

note

If an application uses the Granica API to write objects to a vanilla bucket (i.e. a bucket that is not crunched), then the Controller simply passes them through unmodified to the vanilla bucket.

2b. (Background) When an application or user issues write (PUT) requests to the Crunched data bucket, the Controller receives notifications about the new vanilla objects via real-time SQS pub/sub events and GETs the objects from the Crunched data bucket.

3. The Controller sends the vanilla objects to a load-balanced Data Reducer component.

4. The Data Reducer checks Crunch policy to determine whether the vanilla objects are eligible to be crunched. If they are not eligible, the objects are not ingested and are instead written unmodified to the Crunched data bucket. If they are eligible, the Data Reducer reduces and/or intelligently batches the data and writes the deeply reduced data to a Reduced data bucket. For inline writes, this is the only place where PUT ops costs are incurred, and since the writes are batched (typically 10:1) those costs are reduced by ~90%. The degree of data reduction varies by data type, and the reduction is completely lossless. From the point of view of the Crunched data bucket, the objects have been ingested. All Reduced data buckets are internally managed by Granica and invisible to your applications.

Finally, any ingested objects in the Crunched data bucket are immediately cleaned up (i.e. deleted) to ensure that only the reduced versions of the ingested objects are retained, thus reducing your monthly cloud storage bill.
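
To make step 2a more concrete, the sketch below shows the kind of per-write metadata record that could back the logical bucket-to-object mapping. The field names and structure are hypothetical illustrations, not Granica's actual schema; they only show how a logical (bucket, key) written through the API can be resolved to the physical location of the reduced, batched data.

```python
from dataclasses import dataclass


@dataclass
class WriteRecord:
    """Hypothetical metadata captured for one inline PUT (step 2a)."""
    api_request: str     # e.g. "PutObject"
    logical_bucket: str  # the Crunched data bucket the caller targeted
    logical_key: str     # the object key the caller wrote
    principal: str       # the identity that issued the request
    reduced_bucket: str  # the Granica-managed Reduced data bucket
    reduced_key: str     # where the (batched) data actually lives


# Illustrative lookup from logical (bucket, key) to physical location; this
# mapping is what lets reads through the Granica API find inline-written data.
mapping: dict[tuple[str, str], WriteRecord] = {}
```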

note

If you have bucket data isolation enabled (say for compliance reasons), then the reduced data is written to separate buckets that map 1:1 to your Crunched data buckets.

tip

To avoid cross-AZ data transfer charges for writes, simply deploy Granica into the same AZ as your application. If an application can't be co-located with Granica, consider writing to the Crunched data bucket using the vanilla S3/GCS SDK so that the data is crunched in the background. Note that while this approach eliminates cross-AZ charges for writes, it also eliminates PUT ops savings since batching can no longer be performed. If your application is write-intensive, the PUT ops savings will likely be much larger than the cross-AZ charges.
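
As a rough illustration of that trade-off, the back-of-the-envelope calculation below compares batched PUT savings against cross-AZ transfer charges for a small-object, write-heavy workload. All prices and workload figures are assumptions for illustration only; substitute current pricing and your own numbers.

```python
# Back-of-the-envelope comparison; illustration only, with assumed figures.
PUT_PRICE_PER_1K = 0.005   # USD per 1,000 PUT requests (assumed list price)
XAZ_PRICE_PER_GB = 0.01    # USD per GB, per direction, cross-AZ (assumed)

puts_per_day = 10_000_000  # example write-intensive workload
avg_object_kb = 50         # example small-object size

put_cost = puts_per_day / 1_000 * PUT_PRICE_PER_1K
put_savings = put_cost * 0.9                       # ~90% fewer PUTs at 10:1 batching

gb_per_day = puts_per_day * avg_object_kb / 1_048_576
cross_az_cost = gb_per_day * XAZ_PRICE_PER_GB * 2  # charged in both directions

print(f"PUT ops savings/day:  ${put_savings:,.2f}")    # ~ $45
print(f"Cross-AZ charges/day: ${cross_az_cost:,.2f}")  # ~ $9.54
```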

The following video demo explains how to interact with Granica Crunch to write and read objects using inline crunch: Granica incrunch Demo Video

See also