Snowflake integration with Trino

In this article, we explore accessing Snowflake via the Trino project, highlighting the seamless integration of Trino with Snowflake and offering an analysis of its benefits and implications.

Previous articles

Previous articles about Snowflake and Trino:

Motivation

A common query among potential Snowflake adopters concerns its compatibility with on-premises data and cloud platforms such as Azure. In this article, we address this question directly, exploring the feasibility of accessing Snowflake with local data via the Trino project. Let’s uncover the possibilities together.

High level steps

  • Deploy Trino in Docker
  • Get a Snowflake trial account
  • Connect the dots
  • Conclusion

Step-by-step instructions

Navigating the data integration landscape can be daunting, especially when considering Snowflake’s compatibility with on-premises environments. In this guide, we aim to simplify the process by using a Docker environment to simulate local conditions. Our approach prioritizes simplicity, leveraging standard Snowflake configurations and basic Trino Docker settings. Be sure to consult the relevant documentation for your specific scenario, but let’s start with the basics.

Deploy Trino in Docker

I have a Compose file called compose-trino.yaml with the following content:

services:

  trino:
    container_name: trino
    hostname: trino
    build: trino/.
    ports:
      - "8080:8080"
    environment:
      - _JAVA_OPTIONS=-Dfile.encoding=UTF-8
    volumes:
      - ./trino/catalog:/etc/trino/catalog
      - ./trino/etc:/etc/trino

In the current directory I have a folder called trino. Inside that folder I have a Dockerfile with the following content:

FROM trinodb/trino:442
LABEL version="1.0"
LABEL description="trino container"
ENV REFRESHED_AT 2024_03_15

I also have two more folders called etc and catalog.

Within the catalog directory, I created a snowflake.properties file with the following content:

connector.name=snowflake
connection-url=jdbc:snowflake://<account>.snowflakecomputing.com
connection-user=root
connection-password=secret
snowflake.account=account
snowflake.database=database
snowflake.role=role
snowflake.warehouse=warehouse

If you run into any obstacles along the way, don’t hesitate to refer to the Trino Snowflake connector documentation. Let’s dive in!

After you set up the Snowflake environment, you can adjust these properties to your own values.

Within the etc directory, I have a jvm.config file with the following content:

--add-opens=java.base/java.nio=ALL-UNNAMED
-Djdk.module.illegalAccess=permit

These particular JVM flags are required by the Snowflake connector.

I also have a config.properties file with the following content:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://example.net:8080

And finally, node.properties with the following content:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/trino/data
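
For reference, the resulting directory layout should look roughly like this, based on the files described above:

.
├── compose-trino.yaml
└── trino/
    ├── Dockerfile
    ├── catalog/
    │   └── snowflake.properties
    └── etc/
        ├── config.properties
        ├── jvm.config
        └── node.properties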

With everything in place, you are now ready to launch the Compose environment. Run the following command to start the environment: docker compose -f compose-trino.yaml up -d.

After successful configuration, you should observe a running container named trino. You can confirm this by running the command: docker ps.

f426506aa443   snowflake-docker-trino   "/usr/lib/trino/bin/…"   53 minutes ago   Up 47 minutes (healthy)   0.0.0.0:8080->8080/tcp   trino

If you run into any issues, you can troubleshoot further by examining the Trino logs with the following command: docker logs trino.

You can access the Trino container with the following command:

docker exec -it trino trino

After logging in, you can verify that the Snowflake catalog is properly configured by running the following command:

trino> show catalogs;
  Catalog  
-----------
 snowflake 
 system    
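
Once valid Snowflake credentials are in place, you can also list the schemas the catalog exposes as an additional sanity check (the exact schema names will depend on your Snowflake account):

trino> SHOW SCHEMAS FROM snowflake;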

For the next stage of this tutorial, sign up for a Snowflake trial account. Opt for the Standard edition, since we won’t be using any higher-tier features. During the signup process, I selected the Azure eastus2 region for my Snowflake deployment.

After completing the signup, you will receive a confirmation email. Once verified, you will have access to your Snowflake environment. Retrieve the required details, especially the account identifier and credentials, from the email Snowflake sent, and fill out the snowflake.properties file located in the trino/catalog directory.

Snowflake trial account

Connect the dots

Snowflake provides a variety of demo tutorials, including the Tasty Bytes series. In this tutorial, we’ll focus on the “Load sample data using SQL from an S3 bucket” worksheet. Alternatively, feel free to select a dataset of your choice.

---> set the Role
USE ROLE accountadmin;

---> set the Warehouse
USE WAREHOUSE compute_wh;

---> create the Tasty Bytes Database
CREATE OR REPLACE DATABASE tasty_bytes_sample_data;

---> create the Raw POS (Point-of-Sale) Schema
CREATE OR REPLACE SCHEMA tasty_bytes_sample_data.raw_pos;

---> create the Raw Menu Table
CREATE OR REPLACE TABLE tasty_bytes_sample_data.raw_pos.menu
(
    menu_id NUMBER(19,0),
    menu_type_id NUMBER(38,0),
    menu_type VARCHAR(16777216),
    truck_brand_name VARCHAR(16777216),
    menu_item_id NUMBER(38,0),
    menu_item_name VARCHAR(16777216),
    item_category VARCHAR(16777216),
    item_subcategory VARCHAR(16777216),
    cost_of_goods_usd NUMBER(38,4),
    sale_price_usd NUMBER(38,4),
    menu_item_health_metrics_obj VARIANT
);

---> confirm the empty Menu table exists
SELECT * FROM tasty_bytes_sample_data.raw_pos.menu;

---> create the Stage referencing the Blob location and CSV File Format
CREATE OR REPLACE STAGE tasty_bytes_sample_data.public.blob_stage
url = 's3://sfquickstarts/tastybytes/'
file_format = (type = csv);

---> query the Stage to find the Menu CSV file
LIST @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;

---> copy the Menu file into the Menu table
COPY INTO tasty_bytes_sample_data.raw_pos.menu
FROM @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;

---> how many rows are in the table?
SELECT COUNT(*) AS row_count FROM tasty_bytes_sample_data.raw_pos.menu;

---> what do the top 10 rows look like?
SELECT TOP 10 * FROM tasty_bytes_sample_data.raw_pos.menu;

---> what menu items does the Freezing Point brand sell?
SELECT 
   menu_item_name
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE truck_brand_name = 'Freezing Point';

---> what is the profit on Mango Sticky Rice?
SELECT 
   menu_item_name,
   (sale_price_usd - cost_of_goods_usd) AS profit_usd
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE 1=1
AND truck_brand_name = 'Freezing Point'
AND menu_item_name = 'Mango Sticky Rice';

---> to finish, let's extract the Mango Sticky Rice ingredients from the semi-structured column
SELECT 
    m.menu_item_name,
    obj.value:"ingredients"::ARRAY AS ingredients
FROM tasty_bytes_sample_data.raw_pos.menu m,
    LATERAL FLATTEN (input => m.menu_item_health_metrics_obj:menu_item_health_metrics) obj
WHERE 1=1
AND truck_brand_name = 'Freezing Point'
AND menu_item_name = 'Mango Sticky Rice';

We have a dataset in Snowflake, so now let’s go back to Trino and access the Snowflake data from there.

If your Compose environment is currently running but is missing essential configuration such as snowflake.database, snowflake.warehouse, or other relevant Snowflake properties, stop it first. Before proceeding, make sure these properties are configured correctly; once they are, you can restart the Compose environment and continue the integration without any problems.

docker compose -f compose-trino.yaml down

Back in the snowflake.properties file, change the properties to:

connection-user=snowflakeuser
connection-password=snowflakepassword
snowflake.database=tasty_bytes_sample_data
snowflake.role=accountadmin
snowflake.warehouse=compute_wh
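
Putting it all together, the complete snowflake.properties might look roughly like the following; the user and password are placeholders, and <account> stands for the account identifier from your trial account email:

connector.name=snowflake
connection-url=jdbc:snowflake://<account>.snowflakecomputing.com
connection-user=snowflakeuser
connection-password=snowflakepassword
snowflake.account=<account>
snowflake.database=tasty_bytes_sample_data
snowflake.role=accountadmin
snowflake.warehouse=compute_wh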

Restart the environment and access the Trino shell.
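
For example, reusing the commands from earlier in this guide:

docker compose -f compose-trino.yaml up -d
docker exec -it trino trino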

Inside the Trino shell, try running some queries. Since the Snowflake catalog is already configured in our Trino environment, we can omit the catalog name from the fully qualified table name once the session is pointed at it. Select any of the queries from the Snowflake worksheet above and try running them in the Trino container.
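
One way to point the session at the right catalog and schema is the USE statement (assuming the catalog is named snowflake, as in the properties file above):

trino> USE snowflake.raw_pos;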

trino:raw_pos> SELECT COUNT(*) AS row_count FROM raw_pos.menu;
 row_count 
-----------
       100 
(1 row)

Query 20240315_185131_00013_45g27, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.84 [1 rows, 0B] [1 rows/s, 0B/s]
trino:raw_pos> SELECT 
            ->    menu_item_name
            -> FROM raw_pos.menu
            -> WHERE truck_brand_name = 'Freezing Point';
   menu_item_name   
--------------------
 Lemonade           
 Sugar Cone         
 Waffle Cone        
 Two Scoop Bowl     
 Bottled Water      
 Bottled Soda       
 Ice Tea            
 Ice Cream Sandwich 
 Mango Sticky Rice  
 Popsicle           
(10 rows)

Query 20240315_185212_00015_45g27, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.23 [10 rows, 0B] [8 rows/s, 0B/s]

Indeed, accessing Snowflake datasets using Trino from our local environment demonstrates the flexibility and interoperability of these tools. This integration allows us to seamlessly work with data across platforms, improving our analytical capabilities and workflow efficiency.
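
To illustrate that flexibility, if a second catalog were configured (for example, a hypothetical postgresql catalog pointing at an on-premises database with a stores table), a single Trino query could join data across both systems. The catalog, table, and column names below are assumptions for illustration only:

SELECT
    m.truck_brand_name,
    s.city
FROM snowflake.raw_pos.menu m
JOIN postgresql.public.stores s   -- hypothetical on-premises table
    ON m.truck_brand_name = s.brand_name;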

Additionally, you can access the Trino user interface at http://localhost:8080. With the default configuration, no password is required and the username is set to admin. In the “Finished queries” section, you can review the queries you’ve executed, which provides valuable insight into your workflow and makes debugging easier. This feature improves the visibility and transparency of your data operations within the Trino environment.

Overview of the Trino cluster

Conclusion

Trino and its commercial version, Starburst, are powerful tools for unifying data from disparate sources. This article showed how easily Snowflake can be accessed from a local environment with Trino. The synergy between Snowflake and Trino offers a robust data management and analytics solution, empowering organizations to leverage cloud data storage and distributed query processing for enhanced insights.
