February 13, 2024 · 5 min · SQD Team

Creating Parquet Datasets with Squid SDK

Overview

This article explains how to use the Squid SDK to build Parquet datasets for blockchain data analytics. Apache Parquet is a columnar file format widely used for large-scale analytics.

Key Features of Parquet

The format offers several advantages:

  1. Columnar Storage: Values from the same column are stored together, which compresses well and lets a query read only the columns it needs
  2. Compression Support: Works with codecs such as Snappy, Gzip, and LZO
  3. Cross-Platform Compatibility: Readable from Java, Python, R, and other languages
  4. Python Integration: Loads directly into pandas DataFrames for analysis with NumPy and related tools
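
To make the columnar storage point concrete, here is a minimal sketch that uses only the Python standard library. It does not write real Parquet; it compresses the same hypothetical records once in row order and once in column order so the two sizes can be compared. The field names and values are invented for illustration.

```python
import json
import zlib

# Hypothetical sample records; field names and values are illustrative only.
rows = [
    {"block": 1_000_000 + i, "contract": "0xabc", "value": i % 7}
    for i in range(10_000)
]


def row_layout_size(records):
    """Compressed size when records are serialized one after another."""
    payload = json.dumps([list(r.values()) for r in records]).encode()
    return len(zlib.compress(payload))


def column_layout_size(records):
    """Compressed size when all values of each column are stored contiguously."""
    columns = {name: [r[name] for r in records] for name in records[0]}
    payload = json.dumps(list(columns.values())).encode()
    return len(zlib.compress(payload))


print("row layout:   ", row_layout_size(rows), "bytes")
print("column layout:", column_layout_size(rows), "bytes")
```

On repetitive data like this, the column layout will typically come out smaller, because similar values sit next to each other. Real Parquet files add per-column encodings on top of the general-purpose codec.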

Implementation Steps

Converting a squid to write Parquet files to an S3 bucket takes three main steps:

  1. Import necessary Squid SDK packages for Parquet and S3 operations
  2. Translate each entity in the GraphQL schema into a table definition with appropriate column types
  3. Modify data-saving logic to use Parquet format with batch saving
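
Steps 2 and 3 can be sketched in a language-neutral way. A squid itself is written in TypeScript against the Squid SDK packages, so the Python below is not SDK code. It only illustrates the two ideas: mapping schema fields to column types, and buffering rows so they are written in batches. The type mapping, the `BatchedTable` class, and the `Transfer` fields are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed mapping from GraphQL scalars to column types. The labels on the
# right are illustrative, not Squid SDK identifiers.
GRAPHQL_TO_COLUMN = {
    "ID": "string",
    "String": "string",
    "Int": "int32",
    "BigInt": "string",  # uint256 values overflow int64, so kept as text here
    "Boolean": "boolean",
    "DateTime": "timestamp",
}


def table_from_entity(fields):
    """Turn {field_name: graphql_type} into {column_name: column_type}."""
    return {
        name: GRAPHQL_TO_COLUMN[gql_type.rstrip("!")]
        for name, gql_type in fields.items()
    }


@dataclass
class BatchedTable:
    """Buffers rows and hands them to a sink in batches."""

    columns: dict
    flush_threshold: int
    sink: list = field(default_factory=list)  # stands in for "write one file"
    buffer: list = field(default_factory=list)

    def write(self, row):
        missing = set(self.columns) - set(row)
        if missing:
            raise ValueError(f"row is missing columns: {sorted(missing)}")
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Emit the buffered rows as one batch, then start a fresh buffer.
        if self.buffer:
            self.sink.append(list(self.buffer))
            self.buffer.clear()


# Hypothetical Transfer entity from a GraphQL schema.
transfer_columns = table_from_entity(
    {"id": "ID!", "sender": "String!", "tokenId": "BigInt!", "timestamp": "DateTime!"}
)
table = BatchedTable(transfer_columns, flush_threshold=2)
for i in range(5):
    table.write({"id": str(i), "sender": "0xabc", "tokenId": str(i), "timestamp": i})
table.flush()  # write out the final partial batch
print([len(batch) for batch in table.sink])
```

Batching matters because each flush produces a separate file. A larger threshold means fewer, bigger files, which are generally cheaper to store and faster to scan.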

Data Access and Analysis

Once created, Parquet datasets can be accessed through:

  • AWS SDK for listing and reading S3 objects
  • DuckDB for efficient querying
  • Python notebooks with boto3 for downloading files
  • Visualization libraries such as Plotly for charts and graphs
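
The listing and querying steps above might be combined as in the following sketch. `list_parquet_keys` uses the paginator interface that a boto3 S3 client provides (`get_paginator("list_objects_v2")`), and `build_duckdb_query` produces SQL for DuckDB's `read_parquet` function. The bucket name, the key layout, and the in-memory `FakeS3Client` are assumptions made so the example runs offline.

```python
def list_parquet_keys(s3_client, bucket, prefix=""):
    """Return the keys of all .parquet objects under a prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects carry no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                keys.append(obj["Key"])
    return keys


def build_duckdb_query(bucket, keys):
    """Build a DuckDB query that reads the given S3 objects as one table."""
    files = ", ".join(f"'s3://{bucket}/{key}'" for key in keys)
    return f"SELECT * FROM read_parquet([{files}])"


class _FakePaginator:
    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        prefix = kwargs.get("Prefix", "")
        for page in self._pages:
            yield {"Contents": [o for o in page if o["Key"].startswith(prefix)]}


class FakeS3Client:
    """In-memory stand-in for a boto3 S3 client, for offline runs only."""

    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == "list_objects_v2"
        return _FakePaginator(self._pages)


# Hypothetical bucket contents split across two result pages.
client = FakeS3Client([
    [{"Key": "transfers/0000000000-0000999999/transfers.parquet"}],
    [{"Key": "transfers/0001000000-0001999999/transfers.parquet"},
     {"Key": "transfers/status.txt"}],
])
keys = list_parquet_keys(client, "my-squid-bucket", prefix="transfers/")
query = build_duckdb_query("my-squid-bucket", keys)
print(query)
```

Against a real bucket, `boto3.client("s3")` would take the place of `FakeS3Client`, and the resulting query could be passed to `duckdb.sql(query)`, assuming DuckDB's httpfs extension is loaded and S3 credentials are configured.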

The article provides examples of tracking NFT transfers and contract deployments, demonstrating practical applications of this approach for blockchain data analysis workflows.
