close
close

Leveraging Snowflake AI/ML for Anomaly Detection

Anomaly detection is the process of identifying the data deviation from the expected results in a time-series data. This deviation can have a huge impact on forecasting models if not identified before the model creation. Snowflake Cortex AL/ML suite helps you train the models to spot and correct these outliers in order to help improve the quality of your results. Detecting outliers also helps in identifying the source of the deviations in processes.

Anomaly detection works with both single and multi-series data. Multi-series data represents multiple independent threads of events. For example, if you have sales data for multiple stores, each store’s sales can be checked separately by a single model based on the store identifier. These outliers can be detected in time-series data using the Snowflake built-in class SNOWFLAKE.ML.ANOMALY_DETECTION.

Please follow the steps below to implement the anomaly detection in a time-series dataset.

  • Create an anomaly detection object by passing the training data. This object fits a model to the training data that you provide.
  • Using this anomaly detection model object, call the DETECT_ANOMALIES function to identify anomalies by passing the data to analyze.

In this article, I will be focusing on leveraging the SNOWFLAKE.ML.ANOMALY_DETECTION function to detect anomalies in superstore sales.

Data Setup and Exploration

In this article, we will be using the historical Technology sales data for a superstore. The following code can be used to explore the technology sales. 

select * from superstore.superstore_ml_functions.superstore_sales where category = 'Technology';

select * from superstore.superstore_ml_functions.superstore_sales where category = 'Technology'; results

Having explored the historical sales, let’s create a table to store the last year of sales. This data will be used as training data.

CREATE OR REPLACE TABLE superstore_tech_sales_last_year AS (
    SELECT
     to_timestamp_ntz(Order_Date) AS timestamp,
        Segment,
        Category,
        Sub_Category,
        Sales
    FROM
        superstore_sales
    WHERE
        Order_Date > (SELECT max(Order_Date) - interval '1 year' FROM superstore_sales where category = 'Technology')
        and category = 'Technology'
    GROUP BY
        all
);

CREATE OR REPLACE TABLE superstore_tech_sales_historical AS (
    SELECT
        to_timestamp_ntz(Order_Date) AS timestamp,
        Segment,
        Category,
        Sub_Category,
        Sales
    FROM
        superstore_sales
    WHERE
        Order_Date 

Identifying Anomalies

In this section, we will focus on creating training datasets, analysis datasets, and models to detect the anomalies in a time series dataset.

The following code can be used to create training datasets for 6 months of historical sales to enrich the forecast models to detect the anomalies.

CREATE OR REPLACE VIEW superstore_tech_sales_historical_training
  AS SELECT timestamp,sum(sales) as sales  FROM superstore_tech_sales_historical where timestamp = '2022-01-01' group by timestamp;

After creating the training dataset, let’s create the model to detect the anomalies using SNOWFLAKE.ML.ANOMALY_DETECTION class, key parameters to this function are as follows.

  • Model name: anomaly_basic_model
  • Training dataset: superstore_tech_sales_historical_training

Along with these two key attributes, we also need to specify the timestamp column and key metric column in the dataset. In our use, SALES is the key metric where we want to identify the outliers. This call might take a few minutes to build the models.

  CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION anomaly_basic_model(
  INPUT_DATA => TABLE(superstore_tech_sales_historical_training),
  TIMESTAMP_COLNAME => 'TIMESTAMP',
  TARGET_COLNAME => 'SALES',
  LABEL_COLNAME => '');

Instance ANOMALY_BASIC_MODEL successfully created

Now that we have the basic model ready, we will create the data to analyze using this model. You can use the code below to create a view. This view will be the source for the anomaly detection.

CREATE OR REPLACE VIEW superstore_tech_sales_for_analysis
  AS SELECT timestamp,sum(sales) as sales  FROM superstore_tech_sales_last_year where timestamp 

After creating the training data, models, and analysis datasets, the final stage in this process is to identify the anomalies. Please use the code below to look for the outliers in our analysis dataset.

CALL anomaly_basic_model!DETECT_ANOMALIES(
  INPUT_DATA => TABLE(superstore_tech_sales_for_analysis),
  TIMESTAMP_COLNAME =>'TIMESTAMP',
  TARGET_COLNAME => 'SALES'
);

The above code will forecast the actuals, lower band, and upper band, and also call out if there are any anomalies in the data.

Results forecasting actuals, lower and upper bands, and call out of anomalies in data

While detecting the anomalies, you can also provide the labeled data to the model. For example, if you want to identify a few abnormal sales and want the forecast models to consider them as outliers while forecasting the sales, you can use labeled data as part of the LABEL_COLNAME parameter. This is called supervised anomalies detection.

The following code block will create a new training model with an additional attribute called LABEL. This will be a boolean type to identify the outliers. Any sale of $1,000 or more is being labeled here. 

CREATE OR REPLACE VIEW superstore_tech_sales_historical_training_with_label
  AS SELECT DATE_TRUNC('day',timestamp) as timestamp,sum(sales) as sales, case when sum(sales) > 1000 then true else false end as label  FROM superstore_tech_sales_historical where timestamp = '2022-01-01' group by DATE_TRUNC('day',timestamp);

CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION anomaly_labeled_model(
  INPUT_DATA => TABLE(superstore_tech_sales_historical_training_with_label),
  TIMESTAMP_COLNAME => 'TIMESTAMP',
  TARGET_COLNAME => 'SALES',
  LABEL_COLNAME => 'LABEL');

CALL anomaly_labeled_model!DETECT_ANOMALIES(
  INPUT_DATA => TABLE(superstore_tech_sales_for_analysis),
  TIMESTAMP_COLNAME =>'TIMESTAMP',
  TARGET_COLNAME => 'SALES'
); 

The following is the output for a supervised forecast model.

Output for a supervised forecast model

Conclusion

In this article, we have explored the Snowflake AI/ML capabilities to uncover the anomalies by creating forecast models. As a next step, I would recommend continued learning the Snowflake Cortex framework. You can explore designing anomaly visualizations, and create an automated anomaly detection pipeline for recurring training and execution.