Step-by-Step Guide to Using madsql Infer-Schema

This is Part 2 of a series on madsql.
Part 1 covered split-statements. This post covers infer-schema, recovering CREATE TABLE DDL from a SQL workload when your source schema is incomplete or missing.

One of the most common situations on Oracle migration projects isn’t “we have the schema and need to convert it.” It’s “we have a query workload, some application SQL, maybe a partial DDL dump, and the authoritative schema definition is somewhere between incomplete and nonexistent.”

Before you can run converted SQL against Oracle, you need tables. Before you have tables, you need DDL. infer-schema reads the SQL you do have (queries, DML, DDL fragments, or any combination) and infers CREATE TABLE statements from it. The result is a working starter schema, not a finished product, but it gets you to a testable state without manually reverse-engineering table structures from query patterns.


When to Use infer-schema

Reach for infer-schema when:

  • You’ve inherited a query workload with no accompanying DDL
  • The source DDL exists but is incomplete: missing columns that appear in queries, stale types, wrong nullability
  • You need a baseline Oracle schema to test converted SQL against before the source team delivers formal DDL
  • You’re migrating from a platform where schema wasn’t formally managed and the queries are the best documentation you have
  • You want a fast structural inventory of what tables and columns a workload is actually touching

The output is intentionally a starter schema. Explicit CREATE TABLE statements in the input give the strongest type evidence. Query-only inputs produce best-effort inference that you’ll want to review before using in production. That’s the honest boundary of the feature, and it’s still far faster than reconstructing DDL manually from a hundred queries.


What infer-schema Can Read

The following statement classes contribute to schema inference:

  • CREATE TABLE: strongest type evidence; contributes explicit column names and types
  • CREATE VIEW, CREATE MATERIALIZED VIEW, CREATE INDEX: contribute table and column references
  • SELECT, INSERT, UPDATE, DELETE: contribute table membership and column usage patterns
  • CREATE SCHEMA, CREATE DATABASE, USE: contribute schema/namespace names

Evidence is merged across all statements in the input. A column that appears in both a SELECT and an INSERT INTO table(col, ...) VALUES gets type evidence from both. The more statements that reference a column, and the more explicit those references are, the more confident the inferred type.
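Conceptually, the merge step picks the strongest evidence seen for each column. The sketch below is a toy illustration of that idea, not madsql's actual implementation; the evidence categories and ranking are assumptions made for the example:

```python
# Toy illustration of evidence merging across statements.
# Hypothetical model: madsql's real evidence categories may differ.
from collections import defaultdict

# Rank of type evidence: explicit DDL beats typed usage beats a bare reference.
STRENGTH = {"create_table": 3, "typed_usage": 2, "reference": 1}

def merge_evidence(observations):
    """observations: iterable of (table, column, kind, inferred_type)."""
    best = defaultdict(dict)  # table -> column -> (strength, type)
    for table, column, kind, typ in observations:
        strength = STRENGTH[kind]
        current = best[table].get(column)
        if current is None or strength > current[0]:
            best[table][column] = (strength, typ)
    return {t: {c: typ for c, (_, typ) in cols.items()}
            for t, cols in best.items()}

schema = merge_evidence([
    ("orders", "amount", "reference", "TEXT"),      # weak: bare SELECT reference
    ("orders", "amount", "typed_usage", "NUMBER"),  # stronger: numeric comparison
    ("orders", "id", "create_table", "INT"),        # strongest: explicit DDL
])
print(schema)  # {'orders': {'amount': 'NUMBER', 'id': 'INT'}}
```

The same column seen in several statements keeps whichever type came with the strongest evidence, which matches the behaviour described above.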


Basic Usage

Infer DDL from a file and print to stdout:

madsql infer-schema \
  --source tsql \
  ./workload.sql

Write the DDL artifact to a directory:

madsql infer-schema \
  --source tsql \
  --in ./workload.sql \
  --out ./schema

When --out is a directory, the artifact uses a deterministic dialect-aware name:

inferred_schema-tsql.sql.

This keeps output predictable across repeated runs.

Merge a directory of SQL files into a single schema artifact:

madsql infer-schema \
  --source tsql \
  --in ./sql \
  --out ./schema

Oracle-Targeted Output

Adding --target oracle translates inferred types to Oracle equivalents. This is the mode you want when the goal is a working Oracle schema, not a generic DDL representation.

madsql infer-schema \
  --source tsql \
  --target oracle \
  --pretty \
  ./workload.sql

With --target oracle, type translations are applied automatically:

Source type → Oracle equivalent:

  • TEXT → CLOB
  • VARCHAR → VARCHAR2
  • GEOGRAPHY → SDO_GEOMETRY
  • DOUBLE → NUMBER
  • BIGINT → INT
  • BOOLEAN → NUMBER(1)
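As a mental model, the translation step amounts to a lookup with a fallback for unknown types. This is an illustrative sketch only; the real mapping lives inside madsql and covers far more types:

```python
# Hypothetical sketch of dialect type translation with a fallback.
# The mapping entries mirror the table above; everything else is illustrative.
ORACLE_TYPES = {
    "TEXT": "CLOB",
    "VARCHAR": "VARCHAR2",
    "GEOGRAPHY": "SDO_GEOMETRY",
    "DOUBLE": "NUMBER",
    "BIGINT": "INT",
    "BOOLEAN": "NUMBER(1)",
}

def to_oracle(source_type, default="VARCHAR2(255)"):
    # Unknown types fall back to the --default-type value (see below).
    return ORACLE_TYPES.get(source_type.upper(), default)

print(to_oracle("text"))     # CLOB
print(to_oracle("mystery"))  # VARCHAR2(255)
```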


Generating Oracle User and Grant DDL

In Oracle, a schema is a user. When you’re migrating a workload that uses schema-qualified tables, like nyc_taxi.trips, you need to create the Oracle user, grant it the right privileges, and set the session schema before your CREATE TABLE statements will work.

madsql handles this with --create-user and --create-user-password:

madsql infer-schema \
  --source singlestore \
  --target oracle \
  --create-user \
  --create-user-password 'ChangeMe123' \
  --pretty \
  ./examples/input/singlestore/nyc_taxi_queries.sql

Output:

CREATE USER nyc_taxi IDENTIFIED BY "ChangeMe123";
GRANT CREATE SESSION, CREATE TABLE TO nyc_taxi;
ALTER SESSION SET CURRENT_SCHEMA = nyc_taxi;
CREATE TABLE nyc_taxi.neighborhoods (
  id INT,
  name CLOB,
  polygon SDO_GEOMETRY
);
CREATE TABLE nyc_taxi.trips (
  accept_time NUMBER,
  dropoff_location SDO_GEOMETRY,
  dropoff_time NUMBER,
  num_riders NUMBER,
  pickup_location SDO_GEOMETRY,
  pickup_time NUMBER,
  price NUMBER,
  request_time NUMBER,
  status CLOB
);

The schema name nyc_taxi is inferred from the qualified table references in the source queries. The output is ready to run against an Oracle instance.

For non-Oracle targets, use --create-schema instead; it prepends CREATE SCHEMA IF NOT EXISTS statements. The two flags are mutually exclusive and target-specific:

  • --create-user requires --target oracle
  • --create-schema is for everything else


Controlling Fallback Types

When madsql can’t infer a stronger type for a column from the evidence available, it falls back to a default. By default that fallback is TEXT. For Oracle work, you’ll almost always want to override this:

madsql infer-schema \
  --source tsql \
  --target oracle \
  --default-type 'VARCHAR2(255)' \
  --pretty \
  ./workload.sql

Common --default-type values for Oracle:

  • VARCHAR2(255) — general-purpose string fallback
  • VARCHAR2(100) — tighter string fallback for known-short columns
  • NUMBER — when most untyped columns are expected to be numeric
  • CLOB — when you expect unknown columns to hold long text
  • DATE — when the workload is primarily date-oriented

Choose a fallback that matches the domain of the workload. A reporting schema full of string dimensions is different from a transactional schema full of IDs and amounts.


Handling Unqualified Columns in Multi-Table Queries

A query like

SELECT name, amount FROM orders o JOIN customers c ON o.cust_id = c.id

has unqualified column references: name and amount could belong to either table. madsql’s default is to assign these to the first table in scope. For Oracle migration work, it’s often cleaner to skip them rather than risk placing columns in the wrong table:

madsql infer-schema \
  --source tsql \
  --target oracle \
  --unqualified-columns skip \
  --default-type 'VARCHAR2(255)' \
  ./workload.sql

With --unqualified-columns skip, only columns with clear table qualification contribute to the inferred schema. The output is smaller and more trustworthy, at the cost of potentially missing some columns. You can always add them manually after reviewing the output against the source queries.
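The two policies can be pictured with a toy model (purely illustrative, not madsql internals):

```python
# Toy illustration of unqualified-column policies (hypothetical model).
def assign_columns(tables_in_scope, columns, policy="first"):
    """columns: list of (qualifier_or_None, column_name) pairs."""
    assigned = {t: [] for t in tables_in_scope}
    for qualifier, name in columns:
        if qualifier is not None:
            assigned[qualifier].append(name)     # unambiguous: keep
        elif policy == "first":
            assigned[tables_in_scope[0]].append(name)  # default: may guess wrong
        # policy == "skip": drop the ambiguous column entirely

    return assigned

# Mirrors the orders/customers join above.
cols = [(None, "name"), (None, "amount"),
        ("orders", "cust_id"), ("customers", "id")]
print(assign_columns(["orders", "customers"], cols, policy="first"))
print(assign_columns(["orders", "customers"], cols, policy="skip"))
```

With "first", name and amount land on orders whether or not they belong there; with "skip", only the qualified columns survive, which is the smaller-but-trustworthy trade-off described above.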


Quick Schema Inference From stdin

For interactive exploration, pipe a query directly in:

echo "
SELECT e.employee_id, e.first_name, e.last_name, e.salary,
       d.department_name, d.location_id
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE e.salary > 50000
ORDER BY e.last_name;" |
madsql infer-schema \
  --source postgres \
  --target oracle \
  --default-type 'VARCHAR2(100)' \
  --pretty

Output:

CREATE TABLE departments (
  department_id NUMBER,
  department_name VARCHAR2(100),
  location_id NUMBER
);
CREATE TABLE employees (
  department_id NUMBER,
  employee_id NUMBER,
  first_name VARCHAR2(100),
  last_name VARCHAR2(100),
  salary NUMBER
);

salary is typed as NUMBER because it appears in the numeric comparison > 50000. The join key department_id is typed as NUMBER from the join condition. String columns fall back to VARCHAR2(100) from --default-type.


JSON Output for Downstream Processing

If you need the inferred schema in a machine-readable format, to feed into another tool, to diff schema changes over time, or to generate custom DDL, use --format json:

madsql infer-schema \
  --source tsql \
  --format json \
  --in ./workload.sql \
  --out ./schema

This writes inferred_schema-tsql.json to ./schema. The JSON payload includes table names, column names, inferred types, and confidence indicators for low-certainty assignments.
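A downstream consumer might regenerate DDL or diff schemas from that JSON. Note that the payload shape used below is a guess for illustration; inspect the actual artifact before relying on these field names:

```python
import json

# Hypothetical payload shape -- check your real inferred_schema-tsql.json
# before depending on these field names.
payload = json.loads("""
{
  "tables": [
    {"name": "employees",
     "columns": [{"name": "salary", "type": "NUMBER"},
                 {"name": "last_name", "type": "VARCHAR2(100)"}]}
  ]
}
""")

# Example: regenerate minimal CREATE TABLE DDL from the machine-readable schema.
ddl = []
for table in payload["tables"]:
    cols = ",\n  ".join(f"{c['name']} {c['type']}" for c in table["columns"])
    ddl.append(f"CREATE TABLE {table['name']} (\n  {cols}\n);")
print("\n".join(ddl))
```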


Hybrid Inference for DDL-Heavy Workloads

For workloads that mix standard SQL with DDL statements, dashboard-style constructs, or template placeholders (${variable}, $name), hybrid inference mode supplements SQLGlot with sql-metadata and simple-ddl-parser:

madsql infer-schema \
  --source tsql \
  --target oracle \
  --infer-engine hybrid \
  --create-user \
  --create-user-password 'ChangeMe123' \
  --pretty \
  ./workload.sql

SQLGlot stays primary; hybrid mode adds supplemental signals from the two additional parsers without replacing the main inference path. Both are included in the standard install; no extra setup is needed.

Use hybrid mode when the default run is missing tables or columns you can see referenced in the source SQL. The hybrid report uses a distinct filename: YYYYMMDD-HHMMSS-madsql-infer-schema-hybrid.md.


Full Run With Artifacts

For a schema inference run you’ll hand off for review, include the error report and Markdown summary:

madsql infer-schema \
  --source tsql \
  --target oracle \
  --create-user \
  --create-user-password 'ChangeMe123' \
  --default-type 'VARCHAR2(255)' \
  --pretty \
  --in ./workload.sql \
  --out ./schema \
  --errors infer-errors.json \
  --log 1 \
  --report

What lands in ./schema:

  • inferred_schema-tsql-to-oracle.sql, the Oracle DDL with user creation boilerplate
  • infer-errors.json, structured error report with version info and any parse failures
  • YYYYMMDD-HHMMSS-madsql-infer-schema.log, detailed run log
  • YYYYMMDD-HHMMSS-madsql-infer-schema-report.md, Markdown summary with table count, column count, type distribution, and version info

Common Pitfalls

Expecting production-ready DDL from query-only input. If the workload has no CREATE TABLE statements, everything is inferred from query patterns. The output is a useful starting point, not a finished schema. Review it before using it as the basis for a migration.

Not setting --default-type for Oracle. The default fallback is TEXT, which isn’t a valid Oracle type. Always set --default-type when targeting Oracle; VARCHAR2(255) is a safe general-purpose choice.

Using --create-schema for Oracle targets. --create-schema is for non-Oracle targets and emits CREATE SCHEMA IF NOT EXISTS. For Oracle, use --create-user and --create-user-password. Mixing them produces a fatal CLI error.

Unqualified columns adding noise to the schema. Multi-table queries with unqualified column references can place columns in the wrong table. Add --unqualified-columns skip on complex workloads and add the ambiguous columns back manually after reviewing the output.

Missing schema names. If the workload uses unqualified table names (no schema.table prefix), madsql won’t infer a schema name. The CREATE TABLE statements will use bare table names. This is correct behavior but means --create-user won’t have a schema name to derive; qualify your table references in the source where you can.


CI Integration Notes

Schema inference is a good fit for a pre-migration validation step in CI: run it against the query workload and diff the output artifact against a known-good baseline to detect schema drift.

madsql infer-schema \
  --source tsql \
  --target oracle \
  --default-type 'VARCHAR2(255)' \
  --in ./sql \
  --out ./schema \
  --errors infer-errors.json \
  --continue

The deterministic output naming (inferred_schema-tsql-to-oracle.sql) means you can commit a baseline artifact to source control and use a simple diff in CI to detect when the inferred schema changes, which is a useful signal that the underlying query workload has changed in a structurally meaningful way.
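The drift check itself needs nothing beyond a diff. A minimal sketch in Python (the artifact path schema/inferred_schema-tsql-to-oracle.sql follows the naming convention above; the inline strings here stand in for the real files):

```python
import difflib

def schema_drift(baseline, current):
    """Return a unified diff of two DDL artifacts; an empty list means no drift."""
    return list(difflib.unified_diff(
        baseline.splitlines(), current.splitlines(),
        fromfile="baseline", tofile="current", lineterm=""))

# In CI, read the committed baseline and the freshly generated
# schema/inferred_schema-tsql-to-oracle.sql instead of these strings.
old = "CREATE TABLE t (\n  id NUMBER\n);"
new = "CREATE TABLE t (\n  id NUMBER,\n  name CLOB\n);"
drift = schema_drift(old, new)
if drift:
    print("\n".join(drift))  # non-empty diff: fail the CI job
```

A plain `diff` in the CI shell works just as well; the point is that deterministic naming makes the comparison trivial.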

To keep a CI job green while still capturing parse failures for review:

madsql infer-schema \
  --source tsql \
  --target oracle \
  --in ./sql \
  --out ./schema \
  --errors infer-errors.json \
  --ignore-errors

Next in the series, Part 3 covers convert: full batch SQL dialect conversion for Oracle migration, including schema side artifacts, split output, reporting, and CI pipeline integration.

