Migrating Oracle ODI to Databricks: Knowledge Modules to PySpark and Delta Lake

April 8, 2026 · 18 min read · MigryX Team

Oracle Data Integrator (ODI) occupies a unique position in the ETL landscape. Unlike traditional ETL tools that extract data, transform it in a middle-tier engine, and load results to a target, ODI uses an ELT (Extract-Load-Transform) architecture that pushes transformation logic to the target database engine. This is implemented through Knowledge Modules (KMs) — code templates in SQL, PL/SQL, or Jython that generate database-specific scripts at runtime. The Knowledge Module abstraction makes ODI powerful but also creates a deep dependency on the ODI runtime, the ODI repository, and Oracle-specific SQL dialects.

Databricks provides a modern lakehouse platform where the ELT philosophy is native: data lands in Delta Lake on cloud object storage, and PySpark or Spark SQL transformations execute on elastic Spark clusters directly against that storage. There is no middle-tier engine, no proprietary repository, and no code-generation abstraction layer. This guide maps every major ODI concept — Knowledge Modules, interfaces/mappings, packages, scenarios, load plans, variables, sequences, journalizing, and topology — to its Databricks equivalent, with production-ready code examples.

ODI Architecture vs. Databricks Lakehouse Architecture

ODI's architecture centers on three key abstractions: the Topology (physical and logical architecture definitions), the Model (metadata about data stores), and the Project (interfaces, packages, and procedures that define transformation logic). The ODI Agent orchestrates execution, calling Knowledge Modules that generate and execute SQL on source and target databases. The ODI Repository (master and work repositories) stores all metadata, versioning, and execution history.

Databricks replaces this entire stack with a unified platform. Unity Catalog serves as the metadata layer (replacing ODI Topology and Model). PySpark notebooks and Databricks SQL replace interfaces and Knowledge Modules. Databricks Workflows replace packages, scenarios, and load plans. Delta Lake provides the target storage with ACID transactions, and Spark clusters replace the ODI Agent as the compute engine.

| ODI Concept | Databricks Equivalent | Notes |
| --- | --- | --- |
| Integration Knowledge Module (IKM) | PySpark transformation / Delta Lake MERGE | Target-side transformation and loading logic |
| Loading Knowledge Module (LKM) | Auto Loader / COPY INTO / JDBC read | Source-to-staging data extraction |
| Check Knowledge Module (CKM) | Delta Lake expectations / PySpark validation | Data quality checks with quarantine handling |
| Interface / Mapping | PySpark notebook / Spark SQL script | Source-to-target transformation definition |
| Package | Databricks Workflow (multi-task job) | Orchestrated sequence of transformation steps |
| Scenario | Databricks Job (compiled, schedulable) | Deployed, versioned execution unit |
| Load Plan | Databricks Workflow with task dependencies | DAG-based orchestration with parallelism |
| ODI Variable | Databricks widget / job parameter | Runtime parameterization |
| ODI Sequence | monotonically_increasing_id() / row_number() | Surrogate key generation |
| ODI Agent | Databricks cluster / SQL warehouse | Elastic compute with auto-scaling |
| Journalizing (CDC) | Delta Lake Change Data Feed (CDF) | Row-level change tracking with timestamps |
| ODI Topology (Physical/Logical) | Unity Catalog (catalog / schema) | Three-level namespace with governance |
| Flow Control (CKM) | Delta Lake expectations / quality checks | Constraint validation with error tables |
| ODI Repository | Unity Catalog + Git Repos | Metadata storage with version control |

Oracle ODI to Databricks migration — automated end-to-end by MigryX

Understanding Knowledge Module Translation

The core challenge of ODI migration is that Knowledge Modules are not simple transformations — they are code templates that generate SQL dynamically based on the interface definition, source/target metadata, and KM options. A single IKM like "IKM Oracle Incremental Update" contains dozens of template steps that produce INSERT, UPDATE, MERGE, and error-handling SQL. Migrating from ODI to Databricks means replacing this code-generation layer with explicit PySpark or Spark SQL that achieves the same result.

IKM Pattern: Incremental Update (Append/Merge)

The most common IKM pattern is incremental update, where incoming data is merged into a target table based on primary key matching. In ODI, this is handled by IKMs like "IKM Oracle Incremental Update" or "IKM SQL to SQL Append." In Databricks, Delta Lake MERGE provides this natively with ACID guarantees.

# ODI IKM Oracle Incremental Update equivalent
# ODI generates:
#   1. CREATE staging table (C$ table)
#   2. INSERT INTO staging FROM source (via LKM)
#   3. UPDATE target WHERE keys match AND data differs
#   4. INSERT INTO target WHERE keys don't exist in target
#   5. Optional: DELETE from target WHERE keys don't exist in source (full refresh)
#   6. DROP staging table

# Databricks Delta Lake MERGE: All steps in one atomic operation
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Step 1: Read source data (replaces LKM + staging)
source_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
    .option("dbtable", "source_schema.customer_orders")
    .option("user", dbutils.secrets.get("oracle", "username"))
    .option("password", dbutils.secrets.get("oracle", "password"))
    .load()
)

# Step 2: Apply transformations (replaces ODI interface mappings)
transformed = (source_df
    .withColumn("order_amount_usd",
        F.when(F.col("currency") == "EUR", F.col("amount") * 1.08)
        .when(F.col("currency") == "GBP", F.col("amount") * 1.27)
        .otherwise(F.col("amount"))
    )
    .withColumn("order_quarter",
        F.concat(
            F.year("order_date").cast("string"),
            F.lit("-Q"),
            F.quarter("order_date").cast("string")
        )
    )
    .withColumn("_load_timestamp", F.current_timestamp())
    .select(
        "order_id", "customer_id", "order_date",
        "order_amount_usd", "order_quarter",
        "product_id", "quantity", "status", "_load_timestamp"
    )
)

# Step 3: MERGE into target (replaces IKM incremental update)
target_table = DeltaTable.forName(spark, "silver.customer_orders")

target_table.alias("tgt").merge(
    transformed.alias("src"),
    "tgt.order_id = src.order_id"
).whenMatchedUpdate(
    condition="""
        tgt.order_amount_usd != src.order_amount_usd OR
        tgt.status != src.status OR
        tgt.quantity != src.quantity
    """,
    set={
        "customer_id": "src.customer_id",
        "order_date": "src.order_date",
        "order_amount_usd": "src.order_amount_usd",
        "order_quarter": "src.order_quarter",
        "product_id": "src.product_id",
        "quantity": "src.quantity",
        "status": "src.status",
        "_load_timestamp": "src._load_timestamp"
    }
).whenNotMatchedInsertAll().execute()

print(f"MERGE completed: {transformed.count()} source records processed")

ODI's IKM generates a multi-step SQL script that creates staging tables, performs separate UPDATE and INSERT operations, and drops staging tables. Delta Lake MERGE eliminates all of this ceremony — it atomically handles inserts, updates, and optionally deletes in a single transactional operation. There are no staging tables to manage, no orphaned temporary objects, and no partial-failure states.
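When ODI's optional step 5 (removing target rows that no longer exist in the source during a full synchronization) is needed, newer Delta Lake releases add a whenNotMatchedBySource clause. A minimal sketch reusing the DataFrames from the example above:

# Optional: also delete target rows absent from the source (ODI step 5)
# Requires whenNotMatchedBySource support (DBR 12.1+ / delta-spark 2.3+)
target_table.alias("tgt").merge(
    transformed.alias("src"),
    "tgt.order_id = src.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .whenNotMatchedBySourceDelete() \
 .execute()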

LKM Pattern: Source Data Extraction

ODI Loading Knowledge Modules (LKMs) handle data extraction from heterogeneous sources. "LKM SQL to SQL" uses JDBC to move data between databases. "LKM File to SQL" reads flat files into staging tables. In Databricks, these patterns map to JDBC reads, Auto Loader for file ingestion, and COPY INTO for batch file loads.

# LKM File to SQL equivalent: Load CSV files from cloud storage
# ODI LKM generates: CREATE external table, SELECT INTO staging

# Databricks Auto Loader: Schema-aware, incremental file processing
from pyspark.sql import functions as F

df_raw = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("header", "true")
    .option("delimiter", ",")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/landing/oracle_export/orders/")
)

# Add audit metadata
df_bronze = (df_raw
    .withColumn("_source_file", F.input_file_name())
    .withColumn("_ingested_at", F.current_timestamp())
)

# Write to bronze layer
(df_bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("bronze.customer_orders")
)

# LKM SQL to SQL equivalent: JDBC extraction from Oracle
# ODI generates: INSERT INTO staging SELECT * FROM source@dblink

source_data = (spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
    .option("dbtable", "(SELECT * FROM hr.employees WHERE modified_date > SYSDATE - 1)")
    .option("user", dbutils.secrets.get("oracle", "username"))
    .option("password", dbutils.secrets.get("oracle", "password"))
    .option("fetchsize", "10000")
    .option("numPartitions", "8")
    .option("partitionColumn", "employee_id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .load()
)

source_data.write.format("delta").mode("overwrite").saveAsTable("bronze.employees")
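For scheduled batch loads where streaming semantics are unnecessary, COPY INTO is the closer analogue to a single LKM run, and it is idempotent: files that were already loaded are skipped on re-runs. A minimal sketch, assuming the same landing path as the Auto Loader example:

# COPY INTO: idempotent batch file load (LKM File to SQL alternative)
spark.sql("""
    COPY INTO bronze.customer_orders
    FROM '/mnt/landing/oracle_export/orders/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")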

MigryX: Purpose-Built Parsers for Every Legacy Technology

MigryX does not rely on generic text matching or regex-based parsing. For every supported legacy technology, MigryX has built a dedicated Abstract Syntax Tree (AST) parser that understands the full grammar and semantics of that platform. This means MigryX captures not just what the code does, but why — understanding implicit behaviors, default settings, and platform-specific quirks that generic tools miss entirely.

ODI Interface Mappings to PySpark DataFrame Transformations

An ODI interface (called "mapping" in ODI 12c) defines source-to-target column mappings with transformation expressions. Each target column has an expression that can reference source columns, ODI functions, lookups, and variables. In PySpark, these become DataFrame select() or withColumn() operations.

# ODI Interface mapping example:
# Source: SALES.TRANSACTIONS (Oracle)
# Target: DW.FACT_SALES (Oracle)
#
# Mappings:
#   SALE_KEY       = SALES_SEQ.NEXTVAL                    (ODI sequence)
#   CUSTOMER_SK    = LOOKUP(DIM_CUSTOMER, CUST_ID)        (lookup)
#   PRODUCT_SK     = LOOKUP(DIM_PRODUCT, PROD_CODE)       (lookup)
#   SALE_DATE      = TRUNC(TXN_TIMESTAMP)                 (expression)
#   SALE_AMOUNT    = QUANTITY * UNIT_PRICE * (1 - DISCOUNT)  (expression)
#   CURRENCY_CODE  = NVL(CURRENCY, 'USD')                 (null handling)
#   REGION         = DECODE(REGION_CODE, 1,'NA',2,'EU',3,'APAC','OTHER') (decode)

from pyspark.sql import functions as F

# Read source
transactions = spark.table("bronze.transactions")

# Lookup tables (replaces ODI lookup definitions)
dim_customer = spark.table("gold.dim_customer").select("customer_sk", "cust_id")
dim_product = spark.table("gold.dim_product").select("product_sk", "prod_code")

# Apply transformations (replaces ODI interface column mappings)
fact_sales = (transactions
    # Joins replace ODI lookups
    .join(F.broadcast(dim_customer),
          transactions.cust_id == dim_customer.cust_id, "left")
    .join(F.broadcast(dim_product),
          transactions.prod_code == dim_product.prod_code, "left")
    # Column expressions replace ODI derivations
    .withColumn("sale_date", F.to_date("txn_timestamp"))
    .withColumn("sale_amount",
        F.col("quantity") * F.col("unit_price") * (1 - F.col("discount")))
    .withColumn("currency_code",
        F.coalesce(F.col("currency"), F.lit("USD")))
    .withColumn("region",
        F.when(F.col("region_code") == 1, "NA")
        .when(F.col("region_code") == 2, "EU")
        .when(F.col("region_code") == 3, "APAC")
        .otherwise("OTHER"))
    # Surrogate key replaces ODI sequence (IDs are unique but not sequential)
    .withColumn("sale_key", F.monotonically_increasing_id())
    # Select final columns matching target schema
    .select(
        "sale_key", "customer_sk", "product_sk",
        "sale_date", "sale_amount", "currency_code", "region"
    )
)

fact_sales.write.format("delta").mode("append").saveAsTable("gold.fact_sales")

ODI Variables to Databricks Widgets and Parameters

ODI variables serve multiple purposes: they parameterize interfaces and packages, hold query results (refreshing variables), store global configuration, and control flow logic. Databricks provides widgets for notebook-level parameterization and job parameters for workflow-level configuration.

| ODI Variable Type | Databricks Equivalent | Scope |
| --- | --- | --- |
| Project Variable (static) | Notebook widget with default value | Notebook-level parameter |
| Global Variable | Job parameter / cluster environment variable | Workflow-level or cluster-level |
| Refreshing Variable (query-based) | Spark SQL query result assigned to Python variable | Runtime computed value |
| Sequence Variable | MAX(column) + row_number() pattern | Incrementing numeric key |
| Declare Variable (procedure-local) | Python local variable | Function-level scope |

# ODI refreshing variable equivalent
# ODI: #GLOBAL.LAST_EXTRACT_DATE = SELECT MAX(extract_date) FROM ctl.extract_log WHERE status='SUCCESS'

# Databricks: Query-based variable assignment
last_extract_date = spark.sql("""
    SELECT MAX(extract_date) AS last_date
    FROM control.extract_log
    WHERE status = 'SUCCESS'
""").collect()[0]["last_date"]

print(f"Last successful extract: {last_extract_date}")

# Use in subsequent queries (replaces ODI variable substitution)
new_records = spark.sql(f"""
    SELECT * FROM bronze.transactions
    WHERE modified_date > '{last_extract_date}'
""")

# ODI project variable equivalent: Notebook widgets
dbutils.widgets.text("source_schema", "bronze", "Source Schema")
dbutils.widgets.text("target_schema", "silver", "Target Schema")
dbutils.widgets.dropdown("load_mode", "incremental", ["full", "incremental"], "Load Mode")

source_schema = dbutils.widgets.get("source_schema")
target_schema = dbutils.widgets.get("target_schema")
load_mode = dbutils.widgets.get("load_mode")

if load_mode == "full":
    df = spark.table(f"{source_schema}.transactions")
else:
    df = spark.sql(f"""
        SELECT * FROM {source_schema}.transactions
        WHERE modified_date > '{last_extract_date}'
    """)

ODI Sequences to Surrogate Key Generation

ODI sequences generate auto-incrementing numeric values for surrogate keys, similar to Oracle database sequences. In Databricks, surrogate key generation uses monotonically_increasing_id() for unique (but not sequential) identifiers, or row_number() with a maximum-value offset for sequential keys that continue from the last loaded value.

# ODI Sequence: ODI_SEQ_CUSTOMER_KEY (native sequence, or ODI-managed identity)
# Pattern: Get current max, assign sequential keys to new records

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Get current maximum key from target table
max_key = spark.sql("""
    SELECT COALESCE(MAX(customer_key), 0) AS max_key
    FROM gold.dim_customer
""").collect()[0]["max_key"]

# Assign sequential keys to new records.
# Note: an unpartitioned window funnels all rows through one partition,
# which is acceptable for dimension-sized loads but not high-volume facts.
window_spec = Window.orderBy("customer_id")
new_customers = (spark.table("silver.customer_staging")
    .withColumn("customer_key",
        F.row_number().over(window_spec) + F.lit(max_key))
).cache()

new_count = new_customers.count()
new_customers.write.format("delta").mode("append").saveAsTable("gold.dim_customer")
print(f"Assigned keys {max_key + 1} through {max_key + new_count}")

ODI Packages and Scenarios to Databricks Workflows

ODI packages orchestrate the execution of interfaces, procedures, variables, and other packages in a defined sequence with conditional branching (OK/KO paths). Scenarios are compiled, deployable versions of packages. Load plans provide parallel execution with step-level restart capability. All of these map to Databricks Workflows, which support task DAGs, conditional execution, retry policies, and parameterized runs.

# ODI Package structure:
#   Step 1: Refresh Variable (get last extract date)
#   Step 2 (OK): Interface: Load staging (LKM File to SQL)
#   Step 3 (OK): Interface: Transform and merge (IKM Incremental Update)
#   Step 4 (OK): Interface: Build aggregates
#   Step 4 (KO): Procedure: Send error notification
#   Step 5: Procedure: Update control table

# Databricks Workflow equivalent
{
  "name": "odi_migration_daily_load",
  "tasks": [
    {
      "task_key": "get_extract_date",
      "notebook_task": {
        "notebook_path": "/odi_migration/01_get_extract_date"
      },
      "max_retries": 1,
      "timeout_seconds": 300
    },
    {
      "task_key": "load_staging",
      "depends_on": [{"task_key": "get_extract_date"}],
      "notebook_task": {
        "notebook_path": "/odi_migration/02_load_staging",
        "base_parameters": {
          "last_extract_date": "{{tasks.get_extract_date.values.last_extract_date}}"
        }
      },
      "run_if": "ALL_SUCCESS",
      "max_retries": 2
    },
    {
      "task_key": "transform_merge",
      "depends_on": [{"task_key": "load_staging"}],
      "notebook_task": {
        "notebook_path": "/odi_migration/03_transform_merge"
      },
      "run_if": "ALL_SUCCESS"
    },
    {
      "task_key": "build_aggregates",
      "depends_on": [{"task_key": "transform_merge"}],
      "notebook_task": {
        "notebook_path": "/odi_migration/04_build_aggregates"
      },
      "run_if": "ALL_SUCCESS"
    },
    {
      "task_key": "send_error_notification",
      "depends_on": [{"task_key": "transform_merge"}],
      "notebook_task": {
        "notebook_path": "/odi_migration/error_notification"
      },
      "run_if": "AT_LEAST_ONE_FAILED"
    },
    {
      "task_key": "update_control_table",
      "depends_on": [
        {"task_key": "build_aggregates"},
        {"task_key": "send_error_notification"}
      ],
      "notebook_task": {
        "notebook_path": "/odi_migration/05_update_control_table"
      },
      "run_if": "ALL_DONE"
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/New_York"
  }
}
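Deploying this JSON definition plays the role of generating an ODI scenario. One option is the databricks-sdk Python package; a sketch showing the first two tasks, assuming workspace authentication comes from environment variables or a configuration profile:

# Create the job programmatically (the ODI "generate scenario" step)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
job = w.jobs.create(
    name="odi_migration_daily_load",
    tasks=[
        jobs.Task(
            task_key="get_extract_date",
            notebook_task=jobs.NotebookTask(
                notebook_path="/odi_migration/01_get_extract_date"
            ),
        ),
        jobs.Task(
            task_key="load_staging",
            depends_on=[jobs.TaskDependency(task_key="get_extract_date")],
            notebook_task=jobs.NotebookTask(
                notebook_path="/odi_migration/02_load_staging"
            ),
        ),
    ],
)
print(f"Created job {job.job_id}")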

ODI Load Plans: Parallel Execution with Dependencies

ODI load plans provide sophisticated orchestration with parallel steps, serial steps, exception handling, and restart capabilities. Databricks Workflows natively support parallel task execution through the dependency graph — tasks without dependency relationships run in parallel automatically.

# ODI Load Plan with parallel steps:
#   Serial Step: Initialize
#   Parallel Step:
#     - Load Customers (interface)
#     - Load Products (interface)
#     - Load Regions (interface)
#   Serial Step: Load Orders (depends on all dimension loads)
#   Serial Step: Build Aggregates

# Databricks Workflow: Parallel tasks run automatically when dependencies allow
{
  "name": "odi_load_plan_equivalent",
  "tasks": [
    {
      "task_key": "initialize",
      "notebook_task": {"notebook_path": "/pipelines/initialize"}
    },
    {
      "task_key": "load_customers",
      "depends_on": [{"task_key": "initialize"}],
      "notebook_task": {"notebook_path": "/pipelines/load_customers"}
    },
    {
      "task_key": "load_products",
      "depends_on": [{"task_key": "initialize"}],
      "notebook_task": {"notebook_path": "/pipelines/load_products"}
    },
    {
      "task_key": "load_regions",
      "depends_on": [{"task_key": "initialize"}],
      "notebook_task": {"notebook_path": "/pipelines/load_regions"}
    },
    {
      "task_key": "load_orders",
      "depends_on": [
        {"task_key": "load_customers"},
        {"task_key": "load_products"},
        {"task_key": "load_regions"}
      ],
      "notebook_task": {"notebook_path": "/pipelines/load_orders"}
    },
    {
      "task_key": "build_aggregates",
      "depends_on": [{"task_key": "load_orders"}],
      "notebook_task": {"notebook_path": "/pipelines/build_aggregates"}
    }
  ]
}

ODI Journalizing (CDC) to Delta Lake Change Data Feed

ODI journalizing captures changes on source tables by creating journal tables (J$ tables) and triggers or log-based mechanisms. This enables incremental extraction where only changed rows are processed. Delta Lake provides Change Data Feed (CDF), which records row-level changes (inserts, updates, pre-image and post-image of updates, deletes) automatically when enabled on a table.

# Enable Change Data Feed on a Delta table (replaces ODI journalizing setup)
from delta.tables import DeltaTable
from pyspark.sql import functions as F

spark.sql("""
    ALTER TABLE silver.customer_orders
    SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Read changes since last processing (replaces ODI journal table query)
# ODI equivalent: SELECT * FROM J$_CUSTOMER_ORDERS WHERE JRN_FLAG IN ('I','U')

changes = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 15)  # or startingTimestamp
    .table("silver.customer_orders")
)

# Filter by change type
inserts = changes.filter(F.col("_change_type") == "insert")
updates = changes.filter(F.col("_change_type") == "update_postimage")
deletes = changes.filter(F.col("_change_type") == "delete")

print(f"Changes: {inserts.count()} inserts, {updates.count()} updates, {deletes.count()} deletes")

# Apply changes to downstream gold table using MERGE
gold_target = DeltaTable.forName(spark, "gold.customer_orders_summary")

# Process only changed records (incremental, like ODI journalizing).
# Re-aggregate affected customers from the full silver table: aggregating
# just the changed rows would overwrite summaries with partial totals.
affected_customers = (changes
    .filter(F.col("_change_type").isin("insert", "update_postimage", "delete"))
    .select("customer_id")
    .distinct()
)

updated_summary = (spark.table("silver.customer_orders")
    .join(affected_customers, "customer_id")
    .groupBy("customer_id")
    .agg(
        F.sum("order_amount_usd").alias("total_orders"),
        F.max("order_date").alias("last_order_date"),
        F.count("*").alias("order_count")
    )
)

gold_target.alias("tgt").merge(
    updated_summary.alias("src"),
    "tgt.customer_id = src.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

ODI journalizing requires creating journal tables (J$ tables), journal triggers or log-mining configuration, and journal subscribers in the ODI topology. Delta Lake Change Data Feed eliminates all of this infrastructure — enabling CDF is a single table property change, and the change feed is automatically maintained by Delta Lake's transaction log. No triggers, no journal tables, no subscriber configuration.
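ODI journal subscribers track how far each consumer has read the journal; the lakehouse analogue is a small watermark table recording the last Delta version processed. A sketch, assuming a control.cdf_watermarks table (table_name, last_version, updated_at) exists:

# Subscriber-style checkpointing for CDF reads (watermark table is assumed)
last_version = spark.sql("""
    SELECT COALESCE(MAX(last_version), 0) AS v
    FROM control.cdf_watermarks
    WHERE table_name = 'silver.customer_orders'
""").collect()[0]["v"]

# Capture the table's current version before reading the change feed
current_version = (spark.sql("DESCRIBE HISTORY silver.customer_orders LIMIT 1")
    .collect()[0]["version"])

changes = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_version + 1)
    .table("silver.customer_orders")
)

# ... process changes, then advance the watermark ...
spark.sql(f"""
    INSERT INTO control.cdf_watermarks
    VALUES ('silver.customer_orders', {current_version}, current_timestamp())
""")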

From parsed legacy code to production-ready modern equivalents — MigryX automates the entire conversion pipeline

From Legacy Complexity to Modern Clarity with MigryX

Legacy ETL platforms encode business logic in visual workflows, proprietary XML formats, and platform-specific constructs that are opaque to standard analysis tools. MigryX’s deep parsers crack open these proprietary formats and extract the underlying data transformations, business rules, and data flows. The result is complete transparency into what your legacy code actually does — often revealing undocumented logic that even the original developers had forgotten.

ODI Topology to Unity Catalog

ODI Topology defines physical and logical schemas that abstract database connections. Physical schemas map to actual database schemas, while logical schemas provide a layer of indirection for environment promotion (dev/test/prod). Unity Catalog provides a three-level namespace (catalog.schema.table) that serves the same purpose with additional governance capabilities.

| ODI Topology Concept | Unity Catalog Equivalent | Notes |
| --- | --- | --- |
| Physical Schema | Unity Catalog Schema | Direct mapping to database schema |
| Logical Schema | Unity Catalog Catalog | Environment abstraction (dev/test/prod catalogs) |
| Context (DEV/TEST/PROD) | Catalog name (dev_catalog / prod_catalog) | Environment promotion via catalog switching |
| Data Server (connection) | External connection / Secret scope | Managed credentials with secret management |
| Agent (physical/standalone) | Databricks cluster / SQL warehouse | Elastic compute with auto-scaling |
| Technology definition | Built-in format support + connectors | Delta, Parquet, CSV, JSON, JDBC, etc. |

# ODI Context-based environment switching equivalent
# ODI: Logical schema "LS_DW" maps to "DW_DEV" in DEV context, "DW_PROD" in PROD context

# Databricks: Use catalog name for environment switching
environment = dbutils.widgets.get("environment")  # "dev" or "prod"

if environment == "dev":
    catalog = "dev_catalog"
elif environment == "prod":
    catalog = "prod_catalog"
else:
    raise ValueError(f"Unknown environment: {environment}")

spark.sql(f"USE CATALOG {catalog}")
spark.sql("USE SCHEMA warehouse")

# All subsequent queries automatically use the correct environment
df = spark.table("fact_orders")  # Resolves to dev_catalog.warehouse.fact_orders or prod_catalog.warehouse.fact_orders
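Unity Catalog also layers governance onto this namespace, replacing ODI's repository security profiles with plain SQL grants. A brief sketch; the catalog and group names are illustrative:

# Catalog- and schema-level grants (principal names are illustrative)
spark.sql("GRANT USE CATALOG ON CATALOG prod_catalog TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA prod_catalog.warehouse TO `analysts`")
spark.sql("GRANT SELECT ON SCHEMA prod_catalog.warehouse TO `analysts`")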

CKM Flow Control to Delta Lake Data Quality

ODI Check Knowledge Modules (CKMs) validate data against constraints defined on the target datastore. Rows that fail validation are moved to error tables (E$ tables) with error messages and codes. In Databricks, data quality checks can be implemented using PySpark validation logic, Delta Lake expectations (in Delta Live Tables), or SQL-based constraint checks.

# ODI CKM Oracle equivalent: Validate data against target constraints
# ODI generates: INSERT INTO E$_TARGET SELECT * FROM I$_TARGET WHERE constraint_violated

from pyspark.sql import functions as F

df_staged = spark.table("staging.order_lines")

# Define validation rules (replaces ODI CKM constraint checks)
validation_rules = {
    "NOT_NULL_ORDER_ID": F.col("order_id").isNotNull(),
    "NOT_NULL_PRODUCT_ID": F.col("product_id").isNotNull(),
    "POSITIVE_QUANTITY": F.col("quantity") > 0,
    "POSITIVE_AMOUNT": F.col("amount") > 0,
    "VALID_STATUS": F.col("status").isin("OPEN", "SHIPPED", "DELIVERED", "CANCELLED"),
    "VALID_DATE": F.col("order_date").between("2020-01-01", "2030-12-31")
}

# Apply all validation rules
valid_mask = F.lit(True)
error_reasons = F.lit("")

for rule_name, rule_condition in validation_rules.items():
    valid_mask = valid_mask & rule_condition
    error_reasons = F.when(
        ~rule_condition,
        F.concat(error_reasons, F.lit(f"{rule_name}; "))
    ).otherwise(error_reasons)

df_validated = (df_staged
    .withColumn("_is_valid", valid_mask)
    .withColumn("_error_reasons", error_reasons)
)

# Separate valid and error records (replaces E$ error table)
valid_records = df_validated.filter(F.col("_is_valid")).drop("_is_valid", "_error_reasons")
error_records = (df_validated
    .filter(~F.col("_is_valid"))
    .withColumn("_error_timestamp", F.current_timestamp())
    .withColumn("_source_system", F.lit("ODI_MIGRATION"))
)

# Write valid records to target
valid_records.write.format("delta").mode("append").saveAsTable("silver.order_lines")

# Write errors to quarantine table (replaces E$ table)
error_records.write.format("delta").mode("append").saveAsTable("quarantine.order_lines_errors")

valid_count = valid_records.count()
error_count = error_records.count()
print(f"Flow control results: {valid_count} valid, {error_count} rejected")

# Fail the notebook if error rate exceeds threshold
error_rate = error_count / (valid_count + error_count) if (valid_count + error_count) > 0 else 0
if error_rate > 0.05:
    raise Exception(f"Error rate {error_rate:.2%} exceeds 5% threshold. Check quarantine.order_lines_errors.")
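Where pipelines run under Delta Live Tables, the same flow control can be declared as expectations instead of hand-built masks. A minimal sketch; this code executes only inside a DLT pipeline, and the rule names are illustrative:

# Declarative CKM-style checks as Delta Live Tables expectations.
# expect_or_drop removes failing rows; expect records violations only.
import dlt

@dlt.table(name="order_lines_clean")
@dlt.expect_or_drop("not_null_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_quantity", "quantity > 0")
@dlt.expect("valid_status",
            "status IN ('OPEN', 'SHIPPED', 'DELIVERED', 'CANCELLED')")
def order_lines_clean():
    return spark.read.table("staging.order_lines")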

Medallion Architecture: Organizing the Migrated Pipelines

ODI projects typically organize data flows across staging schemas, intermediate work tables, and target data warehouse schemas. This maps to the Databricks Medallion Architecture pattern, which provides a standardized approach to data quality progression.

| ODI Data Layer | Medallion Tier | Delta Lake Implementation |
| --- | --- | --- |
| Staging area (C$ tables) | Bronze | Raw ingestion with audit metadata columns |
| Work tables (I$ tables) | Silver | Cleansed, deduplicated, conformed data |
| Target data warehouse | Gold | Business-level dimensions, facts, and aggregates |
| Error tables (E$ tables) | Quarantine | Rejected records with error reasons and timestamps |
| Journal tables (J$ tables) | Delta Lake Change Data Feed | Automatic change tracking via transaction log |

MLflow: Extending Beyond ODI's ETL Scope

ODI is a pure data integration tool with no native machine learning capabilities. Databricks extends the platform with MLflow for experiment tracking, model training, model registry, and model serving. Feature engineering pipelines that previously terminated in ODI-loaded tables and fed external ML tools can now run end-to-end on Databricks.

# Feature engineering pipeline that extends beyond what ODI can do
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler, StringIndexer

# Features from ODI-migrated gold tables
features = spark.sql("""
    SELECT
        c.customer_id,
        c.tenure_months,
        c.credit_tier,
        o.total_orders,
        o.avg_order_value,
        o.days_since_last_order,
        CASE WHEN o.days_since_last_order > 180 THEN 1 ELSE 0 END AS is_churned
    FROM gold.dim_customer c
    JOIN gold.customer_order_summary o ON c.customer_id = o.customer_id
""")

# Build ML pipeline; hold out a test split so the logged metric is real
train, test = features.randomSplit([0.8, 0.2], seed=42)

with mlflow.start_run(run_name="churn_model_v1"):
    indexer = StringIndexer(inputCol="credit_tier", outputCol="credit_tier_idx")
    assembler = VectorAssembler(
        inputCols=["tenure_months", "credit_tier_idx", "total_orders",
                    "avg_order_value", "days_since_last_order"],
        outputCol="features"
    )
    rf = RandomForestClassifier(labelCol="is_churned", featuresCol="features", numTrees=100)
    pipeline = Pipeline(stages=[indexer, assembler, rf])

    model = pipeline.fit(train)

    # Evaluate on held-out data rather than logging a hardcoded number
    accuracy = MulticlassClassificationEvaluator(
        labelCol="is_churned", metricName="accuracy"
    ).evaluate(model.transform(test))

    mlflow.spark.log_model(model, "churn_model")
    mlflow.log_metric("accuracy", accuracy)

How MigryX Automates ODI-to-Databricks Migration

MigryX uses AST-based deterministic parsing to analyze ODI repository exports (XML-based smart exports and SDK extracts), building a complete abstract syntax tree of each interface mapping, Knowledge Module template, package flow, and load plan dependency graph. Unlike regex-based tools that cannot parse nested ODI expressions, or AI-only approaches that approximate KM behavior, MigryX's parser achieves over 95% accuracy by understanding the full structure of ODI metadata — including KM option settings, variable substitutions, flow control conditions, and multi-technology topology mappings.

MigryX ODI Migration Workflow

  1. Repository Analysis — MigryX connects to the ODI repository (or processes smart exports) and inventories all projects, folders, interfaces/mappings, packages, scenarios, load plans, variables, sequences, and Knowledge Module assignments. Column-level lineage is traced from source datastores through every mapping expression to target columns.
  2. Knowledge Module Decomposition — Each KM assignment is decomposed into its generated SQL steps. IKM incremental patterns become Delta Lake MERGE operations. LKM extraction patterns become Auto Loader or JDBC read configurations. CKM validation patterns become PySpark quality checks with quarantine tables.
  3. Interface-to-PySpark Translation — ODI interface column mappings, expressions, joins, filters, and lookup definitions are parsed into an AST and deterministically translated to PySpark DataFrame operations. ODI functions (NVL, DECODE, INSTR, SUBSTR) are mapped to their PySpark/Spark SQL equivalents (see the sketch after this list).
  4. Package-to-Workflow Conversion — ODI package step sequences with OK/KO branching are converted to Databricks Workflow task DAGs with run_if conditions (ALL_SUCCESS, AT_LEAST_ONE_FAILED) that preserve the original error-handling logic.
  5. STTM Documentation — Source-to-Target Mapping documents capture every column mapping, transformation expression, KM option, and data type conversion for comprehensive audit trails and regulatory compliance.
  6. Merlin AI — MigryX's Merlin AI handles edge cases including custom Knowledge Modules with non-standard template steps, complex Jython procedures, and Oracle-specific SQL constructs (CONNECT BY, MODEL clause, XMLTABLE) that require contextual translation beyond deterministic rules.
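
A brief illustration of the function mapping in step 3; the email and phone columns are hypothetical:

# Oracle/ODI expressions mapped to PySpark equivalents (illustrative columns)
from pyspark.sql import functions as F

df = spark.table("bronze.transactions")
mapped = (df
    # NVL(currency, 'USD')
    .withColumn("currency_code", F.coalesce(F.col("currency"), F.lit("USD")))
    # DECODE(region_code, 1, 'NA', 'OTHER')
    .withColumn("region",
        F.when(F.col("region_code") == 1, "NA").otherwise("OTHER"))
    # INSTR(email, '@')
    .withColumn("at_pos", F.instr(F.col("email"), "@"))
    # SUBSTR(phone, 1, 3)
    .withColumn("area_code", F.substring(F.col("phone"), 1, 3))
)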

Key Takeaways

Migrating from Oracle ODI to Databricks eliminates the Knowledge Module abstraction layer that sits between your transformation logic and its execution. In ODI, you define mappings in a visual tool, assign Knowledge Modules that generate SQL, and deploy scenarios that the ODI Agent executes against target databases. In Databricks, you write PySpark or Spark SQL directly — the code is the transformation, and it executes on elastic Spark clusters against Delta Lake storage. The result is a simpler, more transparent architecture where every transformation is visible, testable, and version-controlled in Git. Unity Catalog provides governance that exceeds ODI's topology-based access control, and Delta Lake delivers ACID transactions, time travel, and Change Data Feed capabilities that ODI's journal-table approach cannot match. With MigryX's automated, AST-based approach, organizations can migrate hundreds of ODI interfaces, packages, and load plans while preserving complete business logic fidelity and generating the comprehensive STTM documentation that enterprise migrations demand.

Why MigryX Is the Only Platform That Handles This Migration

The challenges described throughout this article are exactly what MigryX was built to solve.

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to migrate from Oracle ODI to Databricks?

See how MigryX converts ODI Knowledge Modules, interfaces, packages, and load plans to production-ready PySpark notebooks and Databricks Workflows.

Explore ODI Migration   Schedule a Demo