refresh-dataflow

Incrementally Refresh Dataflow

Advanced incremental refresh framework for Microsoft Fabric dataflows with intelligent bucketing, automatic retry mechanisms, and CI/CD support.



Table of Contents


Why This Tool?

Microsoft Fabric’s standard incremental refresh capabilities have limitations that prevent effective data loading in certain scenarios:

Common Problems:

How This Framework Solves These Issues:

This solution is ideal for data engineers working with Microsoft Fabric who need robust, production-ready incremental refresh capabilities beyond what’s available out of the box.


Use Cases

This framework excels in the following scenarios:


Prerequisites

Before using this notebook, ensure you have the following:

Required Access & Permissions

Technical Requirements

Knowledge Prerequisites

Overview

This notebook implements an advanced framework for incremental data refresh in Microsoft Fabric dataflows. It is designed to handle scenarios where standard incremental refresh does not work due to limitations in the data source, such as query folding issues or bucket size constraints.

The notebook supports both regular Dataflow Gen2 and CI/CD Dataflow Gen2 objects, automatically detecting the dataflow type and using the appropriate Microsoft Fabric REST API endpoints.

Architecture

The notebook orchestrates a coordinated refresh process across multiple Fabric components:

┌─────────────┐
│   Pipeline  │ (Triggers notebook with parameters)
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────────────────┐
│              Notebook (DataflowRefresher)               │
│  ┌────────────────────────────────────────────────────┐ │
│  │ 1. Read tracking table from Warehouse              │ │
│  │ 2. Recover from interrupted runs (if needed)       │ │
│  │ 3. Calculate date ranges & buckets                 │ │
│  │ 4. Backup data in affected range                   │ │
│  │ 5. Delete overlapping data from destination table  │ │
│  │ 6. Update tracking table (Running status)          │ │
│  │ 7. Call Fabric REST API to trigger dataflow        │ │
│  │    (passes RangeStart/RangeEnd if supported)      │ │
│  │ 8. Poll for completion status                      │ │
│  │ 9. On success: drop backup, update tracking table  │ │
│  │    On failure: restore from backup, retry or fail  │ │
│  └────────────────────────────────────────────────────┘ │
└─────────┬───────────────────────────────────┬───────────┘
          │                                   │
          ▼                                   ▼
┌─────────────────────┐           ┌─────────────────────┐
│   Warehouse         │           │  Fabric REST API    │
│                     │           │  - Regular DF:      │
│  [Incremental       │           │    /dataflows/...   │
│   Update] Table     │           │  - CI/CD DF:        │
│  - range_start      │           │    /items/.../jobs  │
│  - range_end        │           └──────────┬──────────┘
│  - status           │                      │
│                     │                      ▼
│  [temporary_backup] │           ┌───────────────────┐
│  (temporary backup  │◄─────────>│   Dataflow Gen2   │
│   during refresh)   │           │  (Power Query M)  │
└─────────┬───────────┘           └─────────┬─────────┘
          │                                 │
          │                                 ▼
          │                       ┌───────────────────┐
          └──────────────────────>│  Data Source      │
            (Dataflow reads range)│  (Filtered by     │
                                  │   date range)     │
                                  └─────────┬─────────┘
                                            │
                                            ▼
                                  ┌───────────────────┐
                                  │   Warehouse       │
                                  │   Destination     │
                                  │   Table           │
                                  └───────────────────┘

Key Flow:

  1. Pipeline passes parameters to notebook
  2. Notebook recovers from any interrupted previous run, then manages the tracking table and orchestrates refresh
  3. Tracking table stores date ranges (also read by dataflows without parameter support)
  4. Notebook backs up affected data before deleting, restores on failure
  5. Dataflow executes with filtered date range — passed directly as parameters (CI/CD with public parameters) or read from tracking table (regular dataflows)
  6. Data flows from source to warehouse destination table

Key Features

Pipeline Parameters

Configure the following parameters when setting up the notebook activity in your Fabric pipeline:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| workspace_name | String | No | Name of the Fabric workspace containing the dataflow. Defaults to the current workspace if omitted. The notebook resolves this to a workspace ID automatically. |
| dataflow_name | String | Yes | Name of the dataflow to refresh. The notebook resolves this to a dataflow ID automatically. |
| initial_load_from_date | String | Yes* | Start date for initial historical load (format: 'YYYY-MM-DD'). *Required only for first load |
| bucket_size | Integer or String | No | Size of each refresh bucket. Accepts: integer for days (e.g., 7), or string with suffix: "1M" for months, "1Y" for years (default: 1) |
| bucket_retry_attempts | Integer | No | Total number of attempts per bucket, including the initial attempt (e.g., 3 = 1 initial + 2 retries; default: 3) |
| incrementally_update_last | Integer or String | No | Period to overlap/refresh in incremental updates. Accepts: integer for days (e.g., 1), or string with suffix: "3M" for months, "1Y" for years (default: 1) |
| reinitialize_dataflow | Boolean | No | Set to True to delete tracking data and restart from scratch (default: False) |
| destination_table | String | Yes | Name of the destination table in the warehouse where data is written. Can be just the table name (uses dbo schema) or schema.table format for tables in other schemas |
| incremental_update_column | String | Yes | DateTime column used for incremental filtering |
| is_cicd_dataflow | Boolean | No | Explicitly specify if this is a CI/CD dataflow (auto-detected if not provided) |
| load_from_date | String | No | Ad-hoc extract start date (format: 'YYYY-MM-DD'). Both load_from_date and load_to_date must be set. |
| load_to_date | String | No | Ad-hoc extract end date (format: 'YYYY-MM-DD'). Both load_from_date and load_to_date must be set. |
| backup_schema | String | No | Schema name for backup tables, created automatically if it doesn't exist (default: temporary_backup) |
| skip_backup | Boolean | No | Skip backup/restore before delete operations for faster runs. Warning: data is not recoverable if a refresh fails (default: False) |
| force_run | Boolean | No | Override the concurrency guard if a previous run appears active. Use with caution — only set if you are certain the active-looking run is stale (default: False) |
| validate_only | Boolean | No | Resolve IDs and print the execution plan (bucket count, date ranges, parameters) without modifying any data. Useful for verifying configuration before a live run (default: False) |
| adhoc_fail_if_unsupported | Boolean | No | When ad-hoc mode is requested on a dataflow that cannot consume the date range parameters, raise an error instead of silently proceeding with a best-effort reload (default: True) |
| cicd_execute_route_mode | String | No | CI/CD Execute route compatibility mode. 'auto' (default) tries the documented endpoint first and falls back to the legacy Background Jobs route on 400 InvalidJobType when no job was created. 'documented' always uses the documented Dataflow Execute path. 'legacy_query' always uses the legacy ?jobType=Execute path. See CI/CD Execute — InvalidJobType for details. |
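
The period-style parameters (bucket_size, incrementally_update_last) follow the convention above: an integer means days, while a string with an "M" or "Y" suffix means calendar months or years. A minimal Python sketch of that convention (the notebook's actual parsing helper may differ):

from dateutil.relativedelta import relativedelta

def parse_period(value):
    # Integers (or plain numeric strings) are days; "M" / "Y" suffixes are calendar months / years.
    if isinstance(value, int):
        return relativedelta(days=value)
    text = str(value).strip().upper()
    if text.endswith("M"):
        return relativedelta(months=int(text[:-1]))
    if text.endswith("Y"):
        return relativedelta(years=int(text[:-1]))
    return relativedelta(days=int(text))

parse_period(7)     # 7-day buckets
parse_period("3M")  # 3 calendar months
parse_period("1Y")  # 1 calendar year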

Notebook Constants

The following constants must be configured inside the notebook:

| Constant | Example | Description |
| --- | --- | --- |
| SCHEMA | "[Warehouse DB].[dbo]" | Database schema where the tracking table resides |
| INCREMENTAL_TABLE | "[Incremental Update]" | Name of the metadata tracking table |
| CONNECTION_ARTIFACT | "Your Warehouse Name" | Name of the warehouse artifact |
| CONNECTION_ARTIFACT_ID | "your-warehouse-id-guid" | Technical ID (GUID) of the warehouse artifact (not the workspace ID). Find it in the warehouse URL: .../warehouses/{warehouse_id}. The default fabric.get_workspace_id() is a placeholder only — this must be replaced with the actual warehouse GUID. |
| CONNECTION_ARTIFACT_TYPE | "Warehouse" | Type of artifact (typically "Warehouse") |

Setup Instructions

Follow these steps to set up and configure the notebook:

Step 1: Import the Notebook

  1. Navigate to your Microsoft Fabric workspace
  2. Click NewImport notebook
  3. Upload the Incrementally Refresh Dataflow.ipynb file
  4. Wait for the import to complete

Step 2: Configure Notebook Constants

Open the notebook and update the following constants in the third code cell:

# Update these constants with your warehouse details
SCHEMA = "[YourWarehouseName].[dbo]"
INCREMENTAL_TABLE = "[Incremental Update]"
CONNECTION_ARTIFACT = "YourWarehouseName"
CONNECTION_ARTIFACT_ID = "your-warehouse-id-guid"
CONNECTION_ARTIFACT_TYPE = "Warehouse"

How to find your Warehouse ID:

  1. Open your warehouse in Fabric
  2. Check the URL: https://app.fabric.microsoft.com/groups/{workspace_id}/warehouses/{warehouse_id}
  3. Copy the warehouse_id GUID from the URL

Step 3: Create a Fabric Pipeline

  1. In your workspace, create a new Data Pipeline
  2. Add a Notebook activity to the pipeline canvas
  3. Configure the notebook activity:
    • Notebook: Select the imported notebook
    • Parameters: Add the required parameters (see Quick Start Example below)

Step 4: Configure Your Dataflow

Ensure your dataflow Power Query includes logic to read range_start and range_end from the tracking table (see Integration section).

Step 5: Test the Setup

  1. Run the pipeline with all required parameters
  2. Monitor the notebook execution in Fabric
  3. Check the [Incremental Update] table in your warehouse to verify tracking records are created
  4. Verify data appears in your destination table

Before running the pipeline for the first time, call bootstrap_tracking_row() once during initial deployment to pre-seed the singleton tracking row with status='New'.

Why this matters: Fabric Warehouse constraints are NOT ENFORCED. If two pipeline runs start simultaneously on a fresh deployment (no prior tracking row), both will attempt to INSERT a new row and succeed — creating duplicate rows that violate the singleton invariant. Pre-seeding with bootstrap_tracking_row() avoids the first-run INSERT race when bootstrap is called before any concurrent pipeline execution begins: the runtime always finds an existing row and uses the atomic UPDATE acquire path, which is race-safe (only one writer wins the compare-and-swap).

How to call it (run once from a notebook cell or setup script):

refresher.bootstrap_tracking_row(
    dataflow_id="<your-dataflow-guid>",
    workspace_id="<your-workspace-guid>",
    dataflow_name="Sales Data Incremental Refresh",
    initial_load_from_date="2024-01-01",
    bucket_size=7,
    incrementally_update_last=2,
    destination_table="FactSales",
    incremental_update_column="OrderDate",
    is_cicd_dataflow=False
)

This method is idempotent — serially safe to call multiple times (it is a no-op if a row already exists). Note: concurrent bootstrap calls on an empty table can still both insert; serialize deployment manually to avoid this edge case.
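
For reference, the race-safe acquire path described above reduces to a single conditional UPDATE: only the writer whose statement actually changes the row proceeds. A conceptual sketch, assuming a pyodbc-style warehouse connection and illustrative column names (the notebook's real statement may differ):

def try_acquire(conn, dataflow_id, run_id):
    # conn: an open pyodbc-style connection to the warehouse (illustrative helper, not the notebook API)
    acquire_sql = """
        UPDATE [dbo].[Incremental Update]
        SET status = 'Running', run_id = ?
        WHERE dataflow_id = ?
          AND status <> 'Running'
    """
    cursor = conn.cursor()
    cursor.execute(acquire_sql, (run_id, dataflow_id))
    conn.commit()
    # Exactly one concurrent caller sees rowcount == 1 and proceeds; the others see 0 and stop.
    return cursor.rowcount == 1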

Quick Start Example

Here’s a complete example of how to configure the pipeline parameters for your first run:

Example Configuration

Pipeline Parameters:

{
  "workspace_name": "My Fabric Workspace",
  "dataflow_name": "Sales Data Incremental Refresh",
  "initial_load_from_date": "2024-01-01",
  "bucket_size": 7,
  "bucket_retry_attempts": 3,
  "incrementally_update_last": 2,
  "reinitialize_dataflow": false,
  "destination_table": "FactSales",
  "incremental_update_column": "OrderDate",
  "is_cicd_dataflow": null
}

Ad-hoc mode example – reload June 2024 data without affecting the incremental tracking state:

{
  "load_from_date": "2024-06-01",
  "load_to_date": "2024-06-30"
}

Parameter Explanation:

Expected First Run Behavior

  1. No tracking record exists → Initial load scenario
  2. Calculates the range: 2024-01-01 to yesterday at 23:59:59
  3. Splits into 7-day buckets
  4. Processes each bucket sequentially
  5. Creates tracking table entry
  6. Dataflow reads range_start and range_end from tracking table
  7. Data is loaded into dbo.FactSales table

Expected Subsequent Runs

  1. Tracking record exists → Incremental update scenario
  2. Reads last successful range end date
  3. Calculates new range with 2-day overlap
  4. Processes new buckets
  5. Updates tracking table with new range

How It Works

1. Metadata Table Management

The notebook automatically creates and manages two warehouse tables.

[Incremental Update] — current orchestration state (one row per dataflow):

[Incremental Update History] — execution ledger (one row per bucket attempt):

Columns: id (identity), dataflow_id, dataflow_name, run_id, job_instance_id, trigger_mode (initial_load / incremental / adhoc / failed_bucket), bucket_index, range_start, range_end, status, attempt_number, is_cicd_dataflow, is_adhoc, start_time, end_time, created_at.

A row is inserted with status=Running at bucket start and then finalized in place by id at completion. This provides a full audit trail of every bucket attempt for RCA, compliance, and performance analysis without relying on external logs. History write failures are non-fatal and do not block orchestration.

2. Dataflow Type Detection

The notebook automatically detects whether the dataflow is a CI/CD or regular dataflow by probing the official Microsoft Fabric API endpoints:

3. Processing Logic

Scenario A: Initial Load (No Previous Refresh)

  1. Validates that initial_load_from_date is provided
  2. Calculates date range from initial_load_from_date to yesterday at 23:59:59
  3. Splits the date range into buckets based on bucket_size
  4. For each bucket:
    • Backs up data in the affected range (unless skip_backup is True)
    • Deletes any overlapping data in the destination table
    • Updates tracking table with “Running” status
    • Triggers dataflow refresh
    • Waits for completion and monitors status
    • If a bucket fails: Restores data from backup, then retries with exponential backoff (30s, 60s, 120s, ...) until the total of bucket_retry_attempts attempts is exhausted
    • If all retries fail: Data has been restored from the last backup. Logs the error, updates the tracking table, and raises RuntimeError to fail the pipeline
    • If the bucket succeeds: Drops the backup table, then moves to the next bucket
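
Taken together, the per-bucket control flow looks roughly like the sketch below. The method names (backup_range, delete_range, trigger_refresh, and so on) are illustrative stand-ins, not the notebook's actual API:

def process_bucket(refresher, bucket_start, bucket_end, attempts=3, skip_backup=False):
    if not skip_backup:
        refresher.backup_range(bucket_start, bucket_end)       # copy affected rows into the backup schema
    for attempt in range(1, attempts + 1):
        refresher.delete_range(bucket_start, bucket_end)       # remove overlapping rows from the destination
        refresher.mark_running(bucket_start, bucket_end)       # tracking table -> 'Running'
        refresher.trigger_refresh(bucket_start, bucket_end)    # Fabric REST API call
        if refresher.wait_for_completion():                    # poll job / transaction status
            refresher.drop_backup()
            refresher.mark_success(bucket_start, bucket_end)
            return
        refresher.restore_from_backup()                        # put the deleted rows back
        if attempt < attempts:
            refresher.wait_backoff(attempt)                    # 30s, 60s, 120s, ...
    refresher.mark_failed(bucket_start, bucket_end)
    raise RuntimeError(f"Bucket failed after {attempts} attempts: {bucket_start} to {bucket_end}")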

Scenario B: Previous Refresh Failed

  1. Detects failed status from previous run
  2. Retrieves the failed bucket’s date range
  3. Backs up data in the affected range (unless skip_backup is True)
  4. Retries the failed bucket using the same retry logic as above, restoring from backup on each failure
  5. If all retries fail: Data has been restored from backup. Raises exception to fail the pipeline
  6. If retry succeeds: Drops backup table, completes the run. The next pipeline run continues with normal incremental processing (Scenario C)

Scenario C: Incremental Update (Previous Refresh Successful)

  1. Calculates new date range:
    • If incrementally_update_last is set: Uses min(last_end_date + 1 second, yesterday - N days) to ensure overlap without gaps
    • Otherwise: Starts from last_end_date + 1 second
    • End date is always yesterday at 23:59:59
  2. Splits date range into buckets if needed
  3. Processes each bucket with the same backup/restore, retry, and failure logic as initial load
  4. If any bucket fails after all retries: Data has been restored from backup. Raises exception to fail the pipeline
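
A minimal sketch of the start/end calculation described in step 1, assuming incrementally_update_last is expressed in days (the notebook's own implementation may differ in details such as time zone handling):

from datetime import datetime, timedelta

def incremental_range(last_end_date, overlap_days=None):
    yesterday_eod = (datetime.now() - timedelta(days=1)).replace(
        hour=23, minute=59, second=59, microsecond=0)
    start = last_end_date + timedelta(seconds=1)
    if overlap_days is not None:
        # Re-refresh the last N days, but never start later than last_end + 1s, so no gaps appear.
        start = min(start, yesterday_eod - timedelta(days=overlap_days))
    return start, yesterday_eod

# Example: last successful end was 2024-06-10 23:59:59, overlap of 2 days
incremental_range(datetime(2024, 6, 10, 23, 59, 59), 2)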

4. Retry Mechanism with Exponential Backoff

When a bucket refresh fails, the notebook retries with exponential backoff. The bucket_retry_attempts parameter controls the total number of attempts (including the initial attempt). With the default of 3:

  1. Attempt 1: Initial attempt (no wait)
  2. Attempt 2: Waits 30 seconds, then retries
  3. Attempt 3: Waits 60 seconds, then retries

Each subsequent retry doubles the wait time (30s × 2^(attempt-2) for attempt ≥ 2) plus a small random jitter (±5–10 s) to reduce thundering-herd pressure when multiple buckets fail simultaneously. Setting bucket_retry_attempts to 4 would add a fourth attempt after approximately a 120-second wait.

If all attempts fail:
    • Updates tracking table with failed status
    • Logs detailed error message
    • Raises RuntimeError — the exception propagates to Fabric, marking the notebook as failed in the pipeline
    • No further buckets are processed

This ensures transient issues (network glitches, temporary service unavailability) are handled gracefully, while persistent failures properly fail the pipeline.
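
A minimal sketch of the backoff schedule (the exact jitter range and formula in the notebook may vary):

import random

def backoff_seconds(attempt, base=30):
    # Attempt 2 -> ~30s, attempt 3 -> ~60s, attempt 4 -> ~120s, each with a few seconds of jitter.
    wait = base * 2 ** (attempt - 2)
    jitter = random.uniform(5, 10) * random.choice((-1, 1))
    return max(0, wait + jitter)

for attempt in range(2, 5):
    print(attempt, round(backoff_seconds(attempt)))  # the notebook sleeps this long before the retry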

5. Date Range Logic

Inclusive end-of-day contract: RangeEnd is always the inclusive end of the bucket — 23:59:59 of the last day. Downstream Power Query filters should use <= RangeEnd (not < RangeEnd). This contract assumes source data does not contain meaningful sub-second values.

6. Connection Management

The notebook proactively manages database connections to prevent timeout issues:

7. Data Safety (Backup & Restore)

Before deleting data from the destination table, the notebook backs up the affected rows into a separate schema:

Set skip_backup = True to disable backups for faster execution. Warning: with backups disabled, data deleted before a failed refresh cannot be recovered.
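
Conceptually, the backup, delete, and restore steps map onto plain T-SQL over the affected range. The statements below are an illustrative sketch (expressed as Python strings) using the Quick Start table names FactSales and OrderDate; the real backup table name is derived from the dataflow ID with hyphens stripped, as shown under Troubleshooting:

# Illustrative bucket range (inclusive end of day, matching the date range contract)
bucket_start, bucket_end = "2024-01-01 00:00:00", "2024-01-07 23:59:59"

backup_sql = f"""
CREATE TABLE [temporary_backup].[_backup_<dataflow_id_no_hyphens>] AS
SELECT * FROM [dbo].[FactSales]
WHERE [OrderDate] BETWEEN '{bucket_start}' AND '{bucket_end}'
"""

delete_sql = f"""
DELETE FROM [dbo].[FactSales]
WHERE [OrderDate] BETWEEN '{bucket_start}' AND '{bucket_end}'
"""

restore_sql = """
INSERT INTO [dbo].[FactSales]
SELECT * FROM [temporary_backup].[_backup_<dataflow_id_no_hyphens>]
"""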

8. Ad-hoc Extract Mode

Set both load_from_date and load_to_date to load data for an arbitrary date range without modifying the incremental tracking state.

Compatibility:

| Dataflow type | Ad-hoc behaviour |
| --- | --- |
| CI/CD with public RangeStart/RangeEnd parameters | Full support — range is passed directly via the Execute API at refresh time |
| CI/CD without public parameters | Range is written to the tracking table ad-hoc row; dataflow must read it explicitly |
| Regular dataflow (standard incremental filter) | Dataflow re-runs the last incremental range — ad-hoc range is ignored |
| Regular dataflow with custom ad-hoc query logic | Works if Power Query explicitly reads the is_adhoc=1 row |

By default (adhoc_fail_if_unsupported = True) the notebook raises an error when ad-hoc mode is used on a dataflow that does not support direct parameter passing, rather than silently running with unintended data. Set adhoc_fail_if_unsupported = False to downgrade this to a warning and proceed with a best-effort reload.

Execution Results

Upon completion or failure, the notebook prints:

Dataflow refresh execution completed:
Status: Completed / Completed with N failures / Failed: [error message]
Total buckets processed: N
Successful refreshes: N
Failed refreshes: N
Total retry attempts: N
Duration: X.XX seconds
Dataflow type: CI/CD / Regular

If a bucket fails after all retries:

================================================================================
DATAFLOW REFRESH FAILED
================================================================================
Error: Bucket N/M failed after 3 attempts with status Failed. Range: YYYY-MM-DD to YYYY-MM-DD

The dataflow refresh has been terminated due to bucket failure after all retry attempts.
Please check the logs above for detailed error information.
================================================================================

For unexpected (non-retry) errors, the output uses a distinct header:

================================================================================
DATAFLOW REFRESH ERROR
================================================================================
Error: [error message]

An unexpected error occurred during the dataflow refresh process.
Please check the logs above for detailed error information.
================================================================================

In both cases, the notebook re-raises the exception, causing the Fabric pipeline to mark the notebook activity as Failed and surface the real error message.

Integration with Dataflow Power Query

Option 1: Pass Parameters Directly (CI/CD Dataflows)

If your dataflow is a CI/CD Dataflow Gen2, you can define RangeStart and RangeEnd as public parameters (type: DateTime). The notebook automatically discovers these parameters and passes the date range values directly when triggering the refresh — no need for the dataflow to query the tracking table.

  1. In your dataflow, create two parameters:
    • Name: RangeStart, Type: DateTime
    • Name: RangeEnd, Type: DateTime
  2. Use them in your Power Query filter:
let
    FilteredData = Table.SelectRows(YourDataSource,
        each [YourDateColumn] >= RangeStart and [YourDateColumn] <= RangeEnd)
in
    FilteredData

The notebook uses the Dataflow Execute endpoint to pass these values at refresh time. All CI/CD dataflows must expose public RangeStart and RangeEnd parameters — there is no unparameterized CI/CD path. Parameter discovery is performed as a pre-flight check before any backup or delete; if discovery fails transiently (429, 5xx, network error) or if the parameters are absent (confirmed non-support), execution fails closed with a descriptive error before any data is modified. The operator must resolve the configuration or retry when the control plane is available — no automatic fallback to a different endpoint or API family occurs.

Note: The tracking table is still updated with range_start/range_end regardless of whether parameters are passed directly. This maintains backward compatibility, status tracking, and retry logic.

executeOption policy: The Execute call uses executeOption: SkipApplyChanges by default, which runs only the last successfully published dataflow definition. This avoids a CI/CD publish or git-sync failure causing a refresh failure. The alternative, ApplyChangesIfNeeded, publishes the latest saved CI/CD state before executing — but since February 2026, if that publish step fails, the refresh also fails (instead of silently running the older published version). To change this policy, update the _CICD_EXECUTE_OPTION class constant in DataflowRefresher.
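
For orientation, a direct-parameter trigger call looks roughly like the sketch below. The endpoint path and executeOption value are the ones referenced in this document; the JSON shape of the parameter payload is an assumption and should be verified against the current Fabric REST API reference:

import requests

workspace_id = "<workspace-guid>"
dataflow_id = "<dataflow-guid>"
token = "<delegated-user-token>"   # requires Member workspace permissions (see Platform Constraints)

url = (f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
       f"/dataflows/{dataflow_id}/jobs/execute/instances")

payload = {
    "executeOption": "SkipApplyChanges",   # run the last successfully published definition
    # Assumed parameter shape; depending on the API version it may need to be nested under "executionData".
    "parameters": [
        {"parameterName": "RangeStart", "type": "DateTime", "value": "2024-06-01T00:00:00"},
        {"parameterName": "RangeEnd",   "type": "DateTime", "value": "2024-06-30T23:59:59"},
    ],
}

response = requests.post(url, json=payload, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
# On acceptance the job instance URL is typically returned in the Location header for status polling.
print(response.headers.get("Location"))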

Option 2: Read from Tracking Table (Regular Dataflows)

Your dataflow Power Query reads the range_start and range_end values from the [Incremental Update] table:

let
    Source = Sql.Database("[server]", "[database]"),
    TrackingTable = Source{[Schema="dbo", Item="Incremental Update"]}[Data],
    FilteredRows = Table.SelectRows(TrackingTable, each
        [dataflow_id] = "your-dataflow-id"
        and ([is_adhoc] = false or [is_adhoc] = null)),
    RangeStart = FilteredRows{0}[range_start],
    RangeEnd = FilteredRows{0}[range_end],

    // Use RangeStart and RangeEnd to filter your data source
    FilteredData = Table.SelectRows(YourDataSource,
        each [YourDateColumn] >= RangeStart and [YourDateColumn] <= RangeEnd)
in
    FilteredData

Note: The filter on is_adhoc ensures the dataflow reads the incremental tracking row, not a temporary ad-hoc row. If you are using ad-hoc extract mode, the ad-hoc row’s is_adhoc flag is set to true so both rows can coexist without conflict.

Platform Constraints and Known Limitations

CI/CD Background Jobs API — Public Preview

The Fabric REST endpoints used for CI/CD dataflow triggering and status polling are currently in Public Preview and are explicitly not recommended for production use by Microsoft. They may change or be removed without notice. The Dataflow Execute trigger endpoint (/v1/workspaces/.../dataflows/.../jobs/execute/instances) requires a delegated user token with at least Member workspace permissions — it does not support Service Principals or Managed Identities, making it incompatible with standard enterprise service accounts (Azure DevOps, GitHub Actions, etc.). Parameter discovery and job-status polling endpoints may have different authentication requirements; verify per endpoint in your tenancy.

Mitigation: Parameter discovery runs as a pre-flight check before any data modification and fails closed if RangeStart/RangeEnd parameters are absent or if discovery itself fails transiently. There is no automatic fallback to a different endpoint or API family. Once a CI/CD Execute is triggered, polling remains on the exact job-instance contract (/items/{id}/jobs/instances/{jobInstanceId}) and never crosses to the regular transactions API. The internal defaults wait_for_completion=True and timeout_minutes=120 are not exposed as notebook parameters in the current configuration — to change them, pass them programmatically when calling execute_incremental_refresh() directly, or add them to the parameters cell and top-level call. Monitor Microsoft’s changelog for GA availability. Until then, consider this path Tier-2/Tier-3 suitable and not Tier-1 mission-critical.

Regular Dataflow Polling — Probabilistic Correlation

The regular (non-CI/CD) dataflow refresh path correlates the run with the correct refresh transaction by matching the trigger timestamp within a 60-second skew window. In shared workspaces where another operator could trigger a manual refresh close in time, a false correlation is theoretically possible. Mitigate this by running notebook-managed regular dataflows in a dedicated workspace with no manual refreshes.
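
As an illustration of that correlation (a sketch only; the field names of the transaction payload are assumptions):

from datetime import datetime, timedelta

def match_transaction(transactions, trigger_time, skew=timedelta(seconds=60)):
    # transactions: list of dicts with a "startTime" ISO timestamp (assumed field name);
    # trigger_time: when the notebook issued the refresh, in the same timezone representation.
    candidates = [
        t for t in transactions
        if abs(datetime.fromisoformat(t["startTime"]) - trigger_time) <= skew
    ]
    # More than one candidate means another refresh started inside the window: correlation is ambiguous.
    return candidates[0] if len(candidates) == 1 else None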

Concurrency — Advisory, Not Atomic

The concurrency guard is a check-then-act pattern, not a database-level lock. Two pipeline runs starting within milliseconds of each other could both pass the guard. The run_id suffix on backup tables isolates their data operations, but the risk of redundant processing exists in extreme concurrent scenarios.

Best Practices

  1. Bucket Size: Start with 1 day buckets. Increase if your data volume is low and performance is not a concern.
  2. Retry Attempts: Default of 3 is recommended. Increase only if you experience frequent transient failures.
  3. Overlap Days: Set incrementally_update_last to 1 or more if your source data can be updated retroactively.
  4. Destination Table Schema:
    • If your table is in the dbo schema: Use just the table name (e.g., "SalesData")
    • If your table is in another schema: Use schema.table format (e.g., "staging.SalesData" or "analytics.SalesData")
  5. Monitoring: Monitor the [Incremental Update] table in your warehouse to track refresh history and troubleshoot issues.
  6. Pipeline Design: Use the notebook activity failure to trigger alerts or retry logic at the pipeline level.
  7. Backup Strategy: Leave skip_backup as False (default) for production pipelines. Only use skip_backup = True for development/testing or when the source data can be easily reloaded.

Troubleshooting

Notebook fails immediately

Symptoms:

Solutions:

  1. Verify all required parameters are provided:
    • dataflow_name (required), workspace_name (optional — defaults to current workspace)
    • destination_table, incremental_update_column
    • initial_load_from_date (required for first run only)
  2. Check warehouse connection:
    -- Test warehouse connection by running this in your warehouse
    SELECT TOP 1 * FROM INFORMATION_SCHEMA.TABLES
    
  3. Verify notebook constants are configured:
    • Open the notebook
    • Check the third code cell for SCHEMA, CONNECTION_ARTIFACT_ID, etc.
    • Ensure the warehouse ID is correct (copy from warehouse URL)

Common Errors:

Bucket keeps failing after retries

Symptoms:

Solutions:

  1. Check dataflow execution logs:
    • Open the dataflow in Fabric
    • Navigate to Refresh history
    • Review error messages from failed refreshes
  2. Verify Power Query configuration:
    • Ensure dataflow reads range_start and range_end correctly
    • Test with a small date range manually
    • Check for data type mismatches
  3. Inspect the tracking table:
    -- Check current tracking state
    SELECT TOP 5
        dataflow_id,
        dataflow_name,
        status,
        range_start,
        range_end,
        update_time
    FROM [dbo].[Incremental Update]
    WHERE dataflow_id = 'your-dataflow-id'
    ORDER BY update_time DESC
    
  4. Reduce bucket size:
    • Try smaller bucket_size (e.g., 1 day instead of 7)
    • Large date ranges may timeout or exceed memory limits
  5. Check data source:
    • Verify source system is accessible
    • Check for connectivity issues during refresh window
    • Confirm source data exists for the date range

Wrong dataflow type detected

Symptoms:

Solutions:

  1. Explicitly set the dataflow type:
    {
      "is_cicd_dataflow": true   // or false for regular dataflows
    }
    
  2. Verify workspace and dataflow IDs:
    • Check the URL when viewing your dataflow
    • Ensure no typos in the GUID values
    • Confirm the dataflow exists in the specified workspace
  3. Check dataflow type in Fabric:
    • CI/CD dataflows are typically created through Git integration
    • Regular dataflows are created directly in the workspace

Database connection timeouts

Symptoms:

Solutions:

Tracking table issues

Problem: Tracking table shows “Running” but notebook completed

Solution:

-- Manually check for stuck records
SELECT * FROM [dbo].[Incremental Update]
WHERE status = 'Running'
  AND update_time < DATEADD(HOUR, -2, GETDATE())

-- If needed, manually update stuck records
UPDATE [dbo].[Incremental Update]
SET status = 'Failed'
WHERE dataflow_id = 'your-dataflow-id'
  AND status = 'Running'
  AND update_time < DATEADD(HOUR, -2, GETDATE())

Problem: Need to restart from scratch

Solution:

-- Option 1: Delete tracking record (will trigger initial load on next run)
DELETE FROM [dbo].[Incremental Update]
WHERE dataflow_id = 'your-dataflow-id'

-- Option 2: Use reinitialize_dataflow parameter
-- Set reinitialize_dataflow = true in pipeline parameters

Backup table recovery

Problem: Leftover backup table exists

Solution: The notebook automatically detects and recovers from leftover backup tables on the next run. If you need to manually intervene:

-- Check for backup tables
SELECT name FROM sys.tables WHERE schema_id = SCHEMA_ID('temporary_backup')

-- Manually restore from a backup if needed (note: hyphens are stripped from dataflow ID)
INSERT INTO [dbo].[YourTable] SELECT * FROM [temporary_backup].[_backup_<dataflow_id_no_hyphens>]

-- Drop the backup table
DROP TABLE IF EXISTS [temporary_backup].[_backup_<dataflow_id_no_hyphens>]

Monitoring and debugging

Check tracking table history:

-- View refresh history
SELECT
    dataflow_name,
    status,
    range_start,
    range_end,
    update_time,
    DATEDIFF(SECOND,
        LAG(update_time) OVER (PARTITION BY dataflow_id ORDER BY update_time),
        update_time
    ) as seconds_since_last_refresh
FROM [dbo].[Incremental Update]
WHERE dataflow_id = 'your-dataflow-id'
ORDER BY update_time DESC

Check for data gaps:

-- Identify gaps in refreshed date ranges
WITH RangedData AS (
    SELECT
        range_start,
        range_end,
        LEAD(range_start) OVER (ORDER BY range_start) as next_start
    FROM [dbo].[Incremental Update]
    WHERE dataflow_id = 'your-dataflow-id'
      AND status = 'Success'
)
SELECT
    range_end as gap_start,
    next_start as gap_end,
    DATEDIFF(DAY, range_end, next_start) as gap_days
FROM RangedData
WHERE DATEADD(SECOND, 1, range_end) < next_start

CI/CD Execute — InvalidJobType (HTTP 400)

Symptoms:

Root cause: The documented Dataflow Execute endpoint (/v1/workspaces/{wid}/dataflows/{dfid}/jobs/execute/instances) is a preview API. Some tenant clusters have not yet adopted this route shape and return 400 InvalidJobType even though Microsoft Learn documents the path. The older Background Jobs API route (?jobType=Execute) is still accepted by these clusters.

Default behaviour (cicd_execute_route_mode = 'auto'): The notebook tries the documented route first. If it receives 400 InvalidJobType with no Location header (meaning no job was created), it logs a warning and retries once on the legacy Background Jobs route using the identical parameter payload. No data risk: the retry only fires when the first request was rejected before any job instance was created.

Permanently switch to the legacy route: If the documented route consistently fails in your tenant, set:

cicd_execute_route_mode = 'legacy_query'

Permanently lock to the documented route: Once a smoke test confirms the documented route works consistently in your tenant, set:

cicd_execute_route_mode = 'documented'

This avoids the auto-retry overhead on every run.
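
The 'auto' behaviour can be summarised as the following simplified sketch (not the notebook's actual code):

import requests

def trigger_cicd_execute(documented_url, legacy_url, payload, headers, mode="auto"):
    if mode == "legacy_query":
        return requests.post(legacy_url, json=payload, headers=headers)
    response = requests.post(documented_url, json=payload, headers=headers)
    rejected_before_job = (
        response.status_code == 400
        and "InvalidJobType" in response.text
        and "Location" not in response.headers   # no job instance was created
    )
    if mode == "auto" and rejected_before_job:
        # Safe to retry on the legacy route: the first request was rejected before any job existed.
        return requests.post(legacy_url, json=payload, headers=headers)
    return response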

Same-run restore on trigger rejection: If the trigger is rejected after the production range has already been deleted, the notebook restores from the pinned backup immediately in the same run — the production range is not left hollow until the next notebook run. The log will confirm: "Same-run restore completed — production data is intact." If same-run restore fails, the backup is preserved and the log will say: "Backup preserved for manual recovery." Next-run recovery via _recover_from_backup_if_needed() remains active as a safety net for crash/OOM scenarios.

Getting help

If you continue experiencing issues:

  1. Check the notebook execution logs in Fabric
  2. Review the dataflow refresh history
  3. Verify all configuration values match your environment
  4. Test with a minimal date range (1-2 days) first

Version History