16. Working with Data Frames
Python-oracledb can query directly to a data frame format, and can also insert data frames into Oracle Database. This can improve performance and reduce memory requirements when your application uses Python data frame libraries such as Apache PyArrow, Pandas, Polars, NumPy, Dask, PyTorch, or writes files in Apache Parquet format.
Python-oracledb has a DataFrame object that exposes an Apache Arrow ArrowArrayStream PyCapsule Interface. This enables zero-copy data interchanges to the data frame objects of other libraries.
16.1. Fetching Data Frames
Data frames can be fetched by using a standard SQL query with Connection or AsyncConnection methods.
16.1.1. Data Frame Queries
The python-oracledb methods for fetching rows into data frames are:
Connection.fetch_df_all()fetches all rows from a queryConnection.fetch_df_batches()implements an iterator for fetching batches of rows
These methods can also be called from AsyncConnection. The methods all return python-oracledb DataFrame objects.
For example, to fetch all rows from a query and print some information about the results:
sql = "select * from departments where department_id > :1"
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=sql, parameters=[100], arraysize=100)
print(odf.column_names())
print(f"{odf.num_columns()} columns")
print(f"{odf.num_rows()} rows")
With Oracle Database’s standard DEPARTMENTS table, this would display:
['DEPARTMENT_ID', 'DEPARTMENT_NAME', 'MANAGER_ID', 'LOCATION_ID']
4 columns
17 rows
To fetch in batches, use an iterator:
import pyarrow
sql = "select * from departments where department_id < :1"
# Adjust "size" to tune the query fetch performance
# Here it is small to show iteration
for odf in connection.fetch_df_batches(statement=sql, parameters=[80], size=4):
df = pyarrow.table(odf).to_pandas()
print(df)
With Oracle Database’s standard DEPARTMENTS table, this would display:
DEPARTMENT_ID DEPARTMENT_NAME MANAGER_ID LOCATION_ID
0 10 Administration 200 1700
1 20 Marketing 201 1800
2 30 Purchasing 114 1700
3 40 Human Resources 203 2400
DEPARTMENT_ID DEPARTMENT_NAME MANAGER_ID LOCATION_ID
0 50 Shipping 121 1500
1 60 IT 103 1400
2 70 Public Relations 204 2700
Converting to other data frame formats is shown later in this chapter.
Asynchronous Data Frame Queries
With asynchronous programming, use the appropriate syntax. For example, to fetch all rows at once:
connection = await oracledb.connect_async(...)
odf = await connection.fetch_df_all(sql="select ...", parameters=..., arraysize=...)
Or to iterate:
connection = await oracledb.connect_async(...)
async for odf in connection.fetch_df_batches(sql="select ...", parameters=..., size=...):
do_something(odf)
16.1.2. Data Frame Type Mapping
16.1.2.1. Default Data Frame Type Mapping
Internally, python-oracledb’s DataFrame support makes use of Apache nanoarrow libraries to build data frames.
When querying, the following default data type mapping occurs from Oracle Database types to the Arrow types used in python-oracledb DataFrame objects. Querying any other data types from Oracle Database will result in an exception. Output type handlers cannot be used to map data types.
Oracle Database Type |
Apache Arrow Data Type |
|---|---|
DOUBLE |
|
FLOAT |
|
LARGE_BINARY |
|
BOOLEAN |
|
LARGE_STRING |
|
LARGE_STRING |
|
TIMESTAMP |
|
LARGE_STRING |
|
LARGE_BINARY |
|
LARGE_STRING |
|
LARGE_STRING |
|
DECIMAL128, INT64, or DOUBLE |
|
LARGE_STRING |
|
LARGE_BINARY |
|
TIMESTAMP |
|
TIMESTAMP |
|
TIMESTAMP |
|
LARGE_STRING |
|
List or struct with DOUBLE, FLOAT, INT8, or UINT8 values |
Numbers
When converting Oracle Database NUMBERs:
If the column has been created without a precision and scale, or you are querying an expression that results in a number without precision or scale, then the Apache Arrow data type will be DOUBLE.
If
oracledb.defaults.fetch_decimalsis set to True, then the Apache Arrow data type is DECIMAL128.If the column has been created with a scale of 0, and a precision value that is less than or equal to 18, then the Apache Arrow data type is INT64.
In all other cases, the Apache Arrow data type is DOUBLE.
Strings
When converting Oracle Database character types:
If the number of records being fetched by
Connection.fetch_df_all(), or fetched in each batch byConnection.fetch_df_batches(), can be handled by 32-bit offsets, you can use an explicit mapping to fetch as STRING instead of the default LARGE_STRING. This will save 4 bytes per record.
Vectors
When converting Oracle Database VECTORs:
Dense vectors are fetched as lists.
Sparse vectors are fetched as structs with fields
num_dimensions,indicesandvaluessimilar to SparseVector objects.Fixed and flexible dimensions are supported for dense VECTOR columns. For sparse VECTOR columns, the dimension of each vector must be the same.
VECTOR columns with flexible formats are not supported. Each vector value must have the same storage format data type.
Vector values are fetched as the following types:
Oracle Database VECTOR format
Apache Arrow data type
FLOAT64
DOUBLE
FLOAT32
FLOAT
INT8
INT8
BINARY
UINT8
See Fetching VECTOR columns to Data Frames for more information.
CLOB and NCLOB
When converting Oracle Database CLOBs and NCLOBs:
LOBs must be no more than 1 GB in length.
If the number of records being fetched by
Connection.fetch_df_all(), or fetched in each batch byConnection.fetch_df_batches(), can be handled by 32-bit offsets, you can use an explicit mapping to fetch CLOBs and NCLOBs as STRING instead of the default LARGE_STRING. This will save 4 bytes per record.
BLOB
When converting Oracle Database BLOBs:
LOBs must be no more than 1 GB in length.
If the number of records being fetched by
Connection.fetch_df_all(), or fetched in each batch byConnection.fetch_df_batches(), can be handled by 32-bit offsets, you can use an explicit mapping to fetch BLOBs as BINARY instead of the default LARGE_BINARY. This will save 4 bytes per record.
Dates and Timestamps
When converting Oracle Database DATEs and TIMESTAMPs:
Apache Arrow TIMESTAMPs will not have timezone data.
For Oracle Database DATE columns, the Apache Arrow TIMESTAMP will have a time unit of “seconds”.
For Oracle Database TIMESTAMP types, the Apache Arrow TIMESTAMP time unit depends on the Oracle type’s fractional precision as shown in the table below:
Oracle Database TIMESTAMP fractional second precision range
Apache Arrow TIMESTAMP time unit
0
seconds
1 - 3
milliseconds
4 - 6
microseconds
7 - 9
nanoseconds
16.1.2.2. Explicit Data Frame Type Mapping
You can explicitly set the data types and names that a DataFrame will use for query results. This provides fine-grained control over the physical data representation of the resulting Arrow arrays. It allows you to specify a representation that is more efficient for its specific use case. This can reduce memory consumption and improve processing speed.
The parameter requested_schema parameter to
Connection.fetch_df_all(), Connection.fetch_df_batches(),
AsyncConnection.fetch_df_all(), or
AsyncConnection.fetch_df_batches() should be an object implementing the
Arrow PyCapsule schema interface.
For example, the pyarrow.schema() factory function can be used to create a
new schema. This takes a list of field definitions as input. Each field can be
a tuple of (name, DataType):
import pyarrow
# Default fetch
odf = connection.fetch_df_all(
"select 123 c1, 'Scott' c2 from dual"
)
tab = pyarrow.table(odf)
print("Default Output:", tab)
# Fetching with an explicit schema
schema = pyarrow.schema([
("col_1", pyarrow.int16()),
("C2", pyarrow.string())
])
odf = connection.fetch_df_all(
"select 456 c1, 'King' c2 from dual",
requested_schema=schema
)
tab = pyarrow.table(odf)
print("\nNew Output:", tab)
The schema should have an entry for each queried column.
Running the example shows that the number column with the explicit schema was fetched into the requested type INT16. Its name has also changed:
Default Output: pyarrow.Table
C1: double
C2: string
----
C1: [[123]]
C2: [["Scott"]]
New Output: pyarrow.Table
col_1: int16
C2: string
----
col_1: [[456]]
C2: [["King"]]
Supported Explicit Type Mapping
The following table shows the explicit type mappings that are supported. The Oracle Database types in each row can be converted to any of the Apache Arrow types in the same row. An error will occur if the database type or the data cannot be represented in the requested schema type.
Oracle Database Types
Apache Arrow Data Types
DECIMAL128(p, s) DOUBLE FLOAT INT8 INT16 INT32 INT64 UINT8 UINT16 UINT32 UINT64
BINARY FIXED SIZE BINARY LARGE_BINARY
BOOLEAN
DB_TYPE_DATEDB_TYPE_TIMESTAMPDB_TYPE_TIMESTAMP_LTZDB_TYPE_TIMESTAMP_TZDATE32 DATE64 TIMESTAMP
DOUBLE FLOAT
DB_TYPE_CHARDB_TYPE_CLOBDB_TYPE_LONGDB_TYPE_NCHARDB_TYPE_NCLOBDB_TYPE_NVARCHARDB_TYPE_VARCHARLARGE_STRING STRING
16.1.3. Converting python-oracledb’s DataFrame to Other Data Frames
To use data frames in your chosen analysis library, DataFrame objects can be converted. Examples for some libraries are shown in the following sections. Other libraries will have similar methods.
Conversion Overview
Guidelines for converting python-oracledb DataFrame objects to data frames for other libraries are:
To convert to a PyArrow Table, use pyarrow.table() which leverages the Arrow PyCapsule interface.
To convert to a Pandas DataFrame, use pyarrow.table().to_pandas().
If you want to use a library other than Pandas or PyArrow, use the library’s
from_arrow()method to convert a PyArrow Table to the applicable data frame, if your library supports this. For example, with Polars use polars.from_arrow().
You should test and benchmark to find the best option for your applications.
16.1.3.1. Creating PyArrow Tables
An example that creates and uses a PyArrow Table is:
import pyarrow
# Get a python-oracledb DataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id, name from mytable order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)
# Create a PyArrow table
pyarrow_table = pyarrow.table(odf)
print("\nNumber of rows and columns:")
(r, c) = pyarrow_table.shape
print(f"{r} rows, {c} columns")
Internally pyarrow.table() leverages the Apache Arrow PyCapsule interface that python-oracledb DataFrame objects expose.
See samples/dataframe_pyarrow.py for a runnable example.
16.1.3.2. Creating Pandas DataFrames
An example that creates and uses a Pandas DataFrame is:
import pandas
import pyarrow
# Get a python-oracledb DataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select * from mytable where id = :1"
myid = 12345 # the bind variable value
odf = connection.fetch_df_all(statement=sql, parameters=[myid], arraysize=1000)
# Get a Pandas DataFrame from the data.
pdf = pyarrow.table(odf).to_pandas()
# Perform various Pandas operations on the DataFrame
print(pdf.T) # transform
print(pdf.tail(3)) # last three rows
The to_pandas() method supports arguments like
types_mapper=pandas.ArrowDtype and deduplicate_objects=False, which may
be useful for some data sets.
See samples/dataframe_pandas.py for a runnable example.
16.1.3.3. Creating Polars DataFrames
An example that creates and uses a Polars DataFrame is:
import polars
import pyarrow
# Get a python-oracledb DataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id from mytable order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)
# Convert to a Polars DataFrame
pdf = polars.from_arrow(odf)
# Perform various Polars operations on the DataFrame
r, c = pdf.shape
print(f"{r} rows, {c} columns")
print(pdf.sum())
See samples/dataframe_polars.py for a runnable example.
16.1.3.4. Writing Apache Parquet Files
To write output in Apache Parquet file
format, you can use data frames as an efficient intermediary. Use the
Connection.fetch_df_batches() iterator and convert to a PyArrow Table that can be written by the PyArrow
library.
import pyarrow
import pyarrow.parquet as pq
FILE_NAME = "sample.parquet"
# Tune the fetch batch size for your query
BATCH_SIZE = 10000
sql = "select * from mytable"
pqwriter = None
for odf in connection.fetch_df_batches(statement=sql, size=BATCH_SIZE):
# Get a PyArrow table from the query results
pyarrow_table = pyarrow.table(odf)
if not pqwriter:
pqwriter = pq.ParquetWriter(FILE_NAME, pyarrow_table.schema)
pqwriter.write_table(pyarrow_table)
pqwriter.close()
See samples/dataframe_parquet_write.py for a runnable example.
16.1.3.5. The DLPack Protocol
The DataFrame format facilitates working with query results as tensors. Conversion can be done using the standard DLPack Protocol implemented by PyArrow.
Using NumPy Arrays
For example, to convert to NumPy ndarray format:
import numpy
import pyarrow
SQL = "select id from mytable order by id"
# Get a python-oracledb DataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)
# Convert to an ndarray via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
np = numpy.from_dlpack(pyarrow_array)
# Perform various numpy operations on the ndarray
print(numpy.sum(np))
print(numpy.log10(np))
See samples/dataframe_numpy.py for a runnable example.
16.1.3.6. Using Torch
An example of working with data as a Torch tensor is:
import pyarrow
import torch
SQL = "select id from mytable order by id"
# Get a python-oracledb DataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)
# Convert to a Torch tensor via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
tt = torch.from_dlpack(pyarrow_array)
# Perform various Torch operations on the tensor
print(torch.sum(tt))
print(torch.log10(tt))
See samples/dataframe_torch.py for a runnable example.
16.1.4. Fetching VECTOR columns to Data Frames
Columns of the VECTOR data type can be fetched with
the methods Connection.fetch_df_all() and
Connection.fetch_df_batches(). VECTOR columns can have flexible
dimensions, but flexible storage formats are not supported: each vector value
must have the same format data type. Vectors can be dense or sparse.
See Data Frame Type Mapping for the type mapping for VECTORs.
Dense Vectors
By default, Oracle Database vectors are “dense”. These are fetched in python-oracledb as Arrow lists. For example, if the table:
create table myvec (v64 vector(3, float64));
contains these two vectors:
[4.1, 5.2, 6.3]
[7.1, 8.2, 9.3]
then the code:
odf = connection.fetch_df_all("select v64 from myvec")
pyarrow_table = pyarrow.table(odf)
will result in a PyArrow table containing lists of doubles. The table can be converted to a data frame of your chosen library.
For example, to convert the PyArrow table to Pandas:
pdf = pyarrow_table.to_pandas()
Or you can convert the python-oracledb DataFrame directly if the library supports it. For example, to fetch to Pandas the syntax is the same as shown in Creating Pandas DataFrames:
odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.table(odf).to_pandas()
print(pdf)
The output will be:
V64
0 [4.1, 5.2, 6.3]
1 [7.1, 8.2, 9.3]
Sparse Vectors
Sparse vectors (where many of the values are 0) are fetched as structs with
fields num_dimensions, indices, and values similar to
SparseVector objects which are discussed in a
non-data frame context in Using SPARSE Vectors.
If the table:
create table myvec (v64 vector(3, float64, sparse));
contains these two vectors:
[3, [1,2], [4.1, 5.2]]
[3, [0], [9.3]]
then the code to fetch as data frames:
import pyarrow
odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.table(odf).to_pandas()
print(pdf)
print("First row:")
num_dimensions = pdf.iloc[0].V64['num_dimensions']
print(f"num_dimensions={num_dimensions}")
indices = pdf.iloc[0].V64['indices']
print(f"indices={indices}")
values = pdf.iloc[0].V64['values']
print(f"values={values}")
will display:
V64
0 {'num_dimensions': 3, 'indices': [1, 2], 'valu...
1 {'num_dimensions': 3, 'indices': [0], 'values'...
First row:
num_dimensions=3
indices=[1 2]
values=[4.1 5.2]
You can convert each struct as needed. One way to convert into Pandas dataframes with sparse values is via a SciPy coordinate format matrix. The Pandas method from_spmatrix() can then be used to create the final sparse dataframe:
import numpy
import pandas
import pyarrow
import scipy
def convert_to_sparse_array(val):
dimensions = val["num_dimensions"]
col_indices = val["indices"]
row_indices = numpy.zeros(len(col_indices))
values = val["values"]
sparse_matrix = scipy.sparse.coo_matrix(
(values, (col_indices, row_indices)), shape=(dimensions, 1))
return pandas.arrays.SparseArray.from_spmatrix(sparse_matrix)
odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.table(odf).to_pandas()
pdf["SPARSE_ARRAY_V64"] = pdf["V64"].apply(convert_to_sparse_array)
print(pdf.SPARSE_ARRAY_V64)
The code will print:
0 [0.0, 4.1, 5.2]
Fill: 0.0
IntIndex
Indices: ar...
1 [9.3, 0.0, 0.0]
Fill: 0.0
IntIndex
Indices: ar...
Name: SPARSE_ARRAY_V64, dtype: object
16.2. Inserting Data Frames
Python-oracledb DataFrame instances, or third-party
DataFrame instances that support the Apache Arrow PyCapsule Interface, can be
inserted into Oracle Database by passing them directly to
Cursor.executemany() or AsyncCursor.executemany(). They can
also be passed to Connection.direct_path_load() and
AsyncConnection.direct_path_load().
16.2.1. Inserting Data Frames with executemany()
For example, with the table:
create table t (col1 number, col2 number);
The following code will insert a Pandas DataFrame:
import pandas
d = {'A': [1.2, 2.4, 8.9], 'B': [3.333, 4.9, 0.0]}
pdf = pandas.DataFrame(data=d)
cursor.executemany("insert into t (col1, col2) values (:1, :2)", pdf)
Inserting to a dense VECTOR column:
create table SampleVectorTab (v64 vector(3, float64));
Can be done like:
import pandas
d = {"v": [[3.3, 1.32, 5.0], [2.2, 2.32, 2.0]]}
pdf = pandas.DataFrame(data=d)
cursor.executemany("insert into SampleVectorTab (v64) values (:1)", pdf)
See dataframe_insert.py for a runnable example.
For general information about fast data ingestion, and discussion of
Cursor.executemany() and AsyncCursor.executemany() options, see
Batch Statement and Bulk Copy Operations.
16.2.2. Inserting Data Frames with Direct Path Loads
Very large DataFrame objects can be efficiently
inserted using Oracle Database Direct Path Loading by passing them to
Connection.direct_path_load(). You can also pass third-party DataFrame
instances that support the Apache Arrow PyCapsule Interface.
See Direct Path Loads for general information about Direct Path Loads.
For example, if the user “HR” has the table:
create table mytab (
id number(9),
name varchar2(100));
The following code will insert a Pandas DataFrame:
import pandas
d = [
(1, "Abigail"),
(2, "Anna"),
(3, "Janey"),
(4, "Jessica"),
]
pdf = pandas.DataFrame(data=d)
connection.direct_path_load(
schema_name="hr",
table_name="mytab",
column_names=["id", "name"],
data=pdf
)
16.3. Explicit Conversion to DataFrame or ArrowArray
Data frames that support the Apache Arrow PyCapsule Interface can be explicitly
converted to DataFrame and ArrowArray objects by calling oracledb.from_arrow(). The
resulting object depends on what interface is supported by the source object.
For example:
import pandas
d = {'A': [1.2, 2.4, 8.9], 'B': [3.333, 4.9, 0.0]}
pdf = pandas.DataFrame(data=d)
print(type(pdf))
odf = oracledb.from_arrow(pdf)
print(type(odf))
will print:
<class 'pandas.core.frame.DataFrame'>
<class 'oracledb.dataframe.DataFrame'>