Introducing Serverless Tasks in Snowflake
Some background…
One of the recent additions to the Data Engineering capabilities of Snowflake was the introduction of Scheduled Tasks. A scheduled task is pretty much exactly what it sounds like: a way to take a single SQL statement or stored procedure call and execute it on a recurring schedule. (Note: these are not to be confused with triggers, which many database platforms support as “event-driven” task executions.) If you’re already familiar with Snowflake tasks (and their equally cool counterpart, Table Streams), feel free to skip ahead. Otherwise, read on for some background basics.
Tasks
Tasks can run on a defined schedule or as part of a larger task hierarchy. When scheduling directly, a task can run every N minutes (where N is a positive integer up to 11520, i.e., eight days), or on a more finely-grained CRON expression. It’s also possible to create a dependency between two tasks, such that the “child” task doesn’t run on a schedule of its own; it runs only after its “parent” task has completed. This creates a task hierarchy, or “task tree”. Only the root task at the top of a tree requires a defined schedule.
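Here’s a minimal sketch of both options; the warehouse, task, and table names are all hypothetical:
CREATE TASK root_task
  WAREHOUSE = my_wh
  SCHEDULE = '30 MINUTE'  -- or: SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO stage_table SELECT * FROM landing_table;
CREATE TASK child_task
  WAREHOUSE = my_wh
  AFTER root_task         -- no schedule; runs when the parent completes
AS
  INSERT INTO final_table SELECT * FROM stage_table;
-- Tasks are created in a suspended state; resume the tree from the bottom up.
ALTER TASK child_task RESUME;
ALTER TASK root_task RESUME;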

As a former “mainframer”, this reminds me of the good old days of JCL, where you defined an element of code, and then provided the system with execution and task dependency details.
Table Streams
Tasks are a great feature, but their real power doesn’t become evident until they’re combined with a corresponding Table Stream object. (Note: we call them “streams”, but they have nothing whatsoever to do with streaming data sources like Kafka or Kinesis.) Snowflake streams are our method of implementing Change Data Capture on a specific base table. Think of them as a specialized view on the base table. They don’t contain any data of their own; they represent an offset into the source table and reflect any rows that have been added, changed, or removed since the stream was last consumed. When a DML statement reads from the stream and commits, the offset advances past those changes, and the stream is “emptied”.
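Creating a stream takes a single statement. A minimal sketch, with hypothetical names that carry into the example below:
-- Track inserts, updates, and deletes on base_table from this point forward
CREATE STREAM str_base_table ON TABLE base_table;
-- Peek at pending changes; a plain SELECT does NOT consume the stream.
-- The result includes METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID columns.
SELECT * FROM str_base_table;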
For example, assume you have a table base_table with 1000 rows in it, and a stream on that table called str_base_table. Here are a few queries that illustrate the concept of a stream:
SELECT count(*) from base_table; -- returns 1000
SELECT count(*) from str_base_table; -- returns 0
INSERT INTO base_table (100 new rows); -- ETL/ELT job or whatever
SELECT count(*) from base_table; -- returns 1100
SELECT count(*) from str_base_table; -- returns 100
INSERT INTO curated_table            -- inserts 100 rows
  SELECT * from str_base_table;      -- into curated_table
SELECT count(*) from str_base_table; -- returns 0 (stream is empty)
Tasks on Streams
Now let’s say you have a recurring task that wakes up every 20 minutes and checks a table containing raw IoT sensor data streaming in from a Kafka topic. Its job is to curate that raw data into a set of downstream tables: normalizing datatypes, stripping out duplicate records, and so on. The first step in the task is to query the raw table to see whether any new data has arrived. That query spins up a virtual warehouse cluster, and you pay the 60-second minimum even if no new data was present.
But a Table Stream allows the task to perform that check for free! The system function SYSTEM$STREAM_HAS_DATA() can be invoked in the task’s WHEN clause, which is evaluated before the task body runs. If the corresponding stream is empty, the task simply goes back to sleep until its next scheduled run. No warehouse required, and lots of potential cost savings…
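Assuming a stream named str_raw_sensor_data already exists on the raw table, a sketch of that task (all names hypothetical) might look like this:
CREATE TASK curate_sensor_data
  WAREHOUSE = my_wh
  SCHEDULE = '20 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('STR_RAW_SENSOR_DATA')  -- metadata-only check
AS
  INSERT INTO curated_sensor_data
    SELECT DISTINCT sensor_id, TO_TIMESTAMP(reading_ts) AS reading_ts, reading
      FROM str_raw_sensor_data;  -- this committed DML advances the stream offset
When the stream is empty, the WHEN condition fails, the run is skipped, and no warehouse compute is billed for that iteration.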
Serverless Tasks
Now on to the good stuff! Snowflake has taken the Task object and made it serverless! What does that mean, and why is it good? Essentially, a task is now decoupled from any named warehouse cluster. Before this enhancement was released, the WAREHOUSE parameter tightly coupled the task definition to one of your warehouse clusters, and you had to guess which warehouse size would best suit the task. Too large, and you’ve wasted resources and money. Too small, and performance could suffer enough that the task misses its next scheduled iteration. Serverless tasks take all that guesswork out of the equation.
To enable the serverless feature, all you do is remove the existing WAREHOUSE parameter and replace it with the new USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE parameter. This takes a t-shirt-size string as its argument (e.g., ‘SMALL’, ‘MEDIUM’, and so on), which sets the size of the compute resource allocated for the first few runs of the task. Once there are a few runs in TASK_HISTORY, Snowflake uses that metadata to effectively “right-size” the compute resource for all subsequent executions. It essentially “learns” the most effective warehouse size from the task’s own history. No more guesswork!
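Continuing with the earlier IoT example, the serverless version of that task (names still hypothetical) would look like this:
CREATE OR REPLACE TASK curate_sensor_data
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'SMALL'  -- starting point; Snowflake right-sizes from here
  SCHEDULE = '20 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('STR_RAW_SENSOR_DATA')
AS
  INSERT INTO curated_sensor_data
    SELECT DISTINCT sensor_id, TO_TIMESTAMP(reading_ts) AS reading_ts, reading
      FROM str_raw_sensor_data;
The only change is the compute parameter; the schedule, WHEN condition, and task body are untouched.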
Some additional things to know
The resources that get assigned to a serverless task are still billed per second, just like regular warehouse compute. You’ll see them broken out on your bill alongside other serverless features like materialized view maintenance, automatic clustering, and so on. But note that there is a 1.5x markup on these credits. Even with that markup, switching to the serverless compute model can still reduce compute costs compared with user-managed warehouses, because you’ll never inadvertently allocate a 16-node XL warehouse when a single-node XS would have done the job...
The new SERVERLESS_TASK_HISTORY view has been added to the ACCOUNT_USAGE schema for tracking serverless task executions and consumption, along with a corresponding SERVERLESS_TASK_HISTORY table function in each database’s INFORMATION_SCHEMA. These can be queried by an ACCOUNTADMIN, by any role that has been granted MONITOR USAGE, or by the owner of a specific task.
Here’s a query that a task owner can run that shows the current credit usage for a serverless task:
SELECT SUM(credits_used) AS consumption
  FROM TABLE(<database_name>.information_schema.serverless_task_history(
    date_range_start => DATEADD(d, -1, CURRENT_TIMESTAMP()),
    date_range_end   => DATEADD(d,  1, CURRENT_TIMESTAMP()),
    task_name        => '<task_name>'));
Replace <database_name> and <task_name> with your specific details.
Conclusion
Serverless tasks are here! I encourage everyone to experiment with them and do in-depth comparisons of their task consumption with the traditional warehouse method. I see this feature gaining widespread adoption very soon.
Let it Snow!