SparkSQL is a utility class for interacting with Spark SQL.
SparkSQL(
self,
spark_session: Optional[SparkSession] = None,
catalog: Optional[str] = None,
schema: Optional[str] = None,
ignore_tables: Optional[List[str]] = None,
include_tables: Optional[List[str]] = None,
sample_rows_in_table_info: int = 3
)

| Name | Type | Description |
|---|---|---|
| spark_session | Optional[SparkSession] | Default: None. A SparkSession object. If not provided, one will be created. |
| catalog | Optional[str] | Default: None. The catalog to use. If not provided, the default catalog will be used. |
| schema | Optional[str] | Default: None. The schema to use. If not provided, the default schema will be used. |
| ignore_tables | Optional[List[str]] | Default: None. A list of tables to ignore. If not provided, all tables will be used. |
| include_tables | Optional[List[str]] | Default: None. A list of tables to include. If not provided, all tables will be used. |
| sample_rows_in_table_info | int | Default: 3. The number of sample rows to include in the table info. |
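As a sketch of how include_tables and ignore_tables interact (the helper below is hypothetical, for illustration only; the actual implementation may differ), the usable-table selection can be expressed in plain Python:

```python
from typing import List, Optional

def usable_table_names(
    all_tables: List[str],
    include_tables: Optional[List[str]] = None,
    ignore_tables: Optional[List[str]] = None,
) -> List[str]:
    # If include_tables is given, only those tables are used;
    # otherwise every table except the ignored ones is used.
    if include_tables:
        return [t for t in all_tables if t in include_tables]
    if ignore_tables:
        return [t for t in all_tables if t not in ignore_tables]
    return all_tables

print(usable_table_names(["users", "orders", "tmp"], ignore_tables=["tmp"]))
# → ['users', 'orders']
```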
Creates a remote Spark session via Spark Connect. For example: SparkSQL.from_uri("sc://localhost:15002")
Get the names of the available tables.
Get information about specified tables.
Follows best practices as specified in: Rajkumar et al, 2022 (https://arxiv.org/abs/2204.00498)
If sample_rows_in_table_info is nonzero, that number of sample rows is appended to each table description. As the paper demonstrates, this can improve performance on text-to-SQL tasks.
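The idea of appending sample rows can be sketched as follows (a minimal illustration with a hypothetical helper, not the library's actual implementation): each table's DDL is followed by up to N rows of real data.

```python
from typing import Sequence

def table_info_with_samples(
    ddl: str,
    rows: Sequence[Sequence[object]],
    sample_rows_in_table_info: int = 3,
) -> str:
    # Append up to sample_rows_in_table_info rows beneath the
    # table's CREATE statement, one tab-separated row per line.
    sampled = rows[:sample_rows_in_table_info]
    lines = [ddl, f"{len(sampled)} rows from table:"]
    lines += ["\t".join(str(v) for v in row) for row in sampled]
    return "\n".join(lines)

ddl = "CREATE TABLE users (id INT, name STRING)"
rows = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
print(table_info_with_samples(ddl, rows))
```

With the default of 3, only the first three rows appear in the description, keeping the prompt small while still grounding the model in concrete values.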
Execute a SQL command and return a string representing the results.
If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.
If the statement throws an error, the error message is returned.
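The three-way contract described above (rows → their string form, no rows → empty string, error → the error message) can be sketched with a stubbed query executor; run_sql and the fake executor here are illustrative stand-ins, not the library's API:

```python
from typing import Callable, List, Tuple

def run_sql(execute: Callable[[str], List[Tuple]], command: str) -> str:
    # Rows -> string of the results; no rows -> ""; error -> its message.
    try:
        rows = execute(command)
    except Exception as exc:
        return str(exc)
    return str(rows) if rows else ""

def fake_execute(sql: str) -> List[Tuple]:
    # Stand-in for a real Spark SQL call.
    return [(1, "alice")] if "users" in sql else []

print(run_sql(fake_execute, "SELECT * FROM users"))  # → [(1, 'alice')]
print(run_sql(fake_execute, "SELECT * FROM empty"))  # → (empty string)
```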