DataWorks provides ODPS SQL nodes so that you can schedule MaxCompute SQL tasks on a regular basis, integrate the nodes with other types of nodes, and jointly schedule them. MaxCompute SQL tasks use an SQL-like syntax and can process terabytes of data in distributed scenarios that do not require real-time processing. This topic describes the precautions for and provides guidance on the development of MaxCompute SQL tasks in DataWorks.
Prerequisites
An ODPS SQL node is created. For more information, see Create and manage MaxCompute nodes.
Background information
MaxCompute SQL is used to process and query data in MaxCompute. MaxCompute SQL supports common SQL statements, such as SELECT, INSERT, UPDATE, and DELETE statements, as well as MaxCompute-specific syntax and functions. You can use MaxCompute SQL to write SQL-like statements to query and process data without the need to write complex data processing logic. For more information about the SQL syntax, see Overview of MaxCompute SQL.
Limits
The following table describes the limits on the development of MaxCompute SQL tasks in DataWorks.
| Item | Description |
| --- | --- |
| Use of comments | You cannot separately add comments to a SET or USE statement. You cannot use semicolons (;) in comments. You cannot add comments after the end of a complete statement. If a semicolon (;) is added to the end of an SQL statement, the statement is considered complete. |
| Execution of SQL statements | You cannot separately use a SET or USE statement in the code of an ODPS SQL node. Such statements must be executed together with other SQL statements. |
| SQL development | The code of an ODPS SQL node can be a maximum of 128 KB in size and can contain a maximum of 200 SQL statements. |
| Result query | Only statements that start with SELECT or WITH return formatted result sets. A maximum of 10,000 rows of result data can be displayed. |
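For example, the following minimal sketch complies with the comment-related limits: comments are placed on their own lines before the statement, contain no semicolons, and are not appended after the terminating semicolon of the statement. The demo_orders table and its order_id, amount, and ds columns are hypothetical names used only for illustration.
-- Query orders for a specific partition. demo_orders and ds are hypothetical
-- names that are used only for illustration.
select order_id, amount
from demo_orders
where ds = '20240101';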
Precautions
The sequence in which the SET and USE statements are executed in the code of an ODPS SQL node varies based on the environment in DataWorks.
- DataStudio: The SET and USE statements are combined in the node code and are executed before other SQL statements.
- Scheduling environment: All statements are executed in sequence.
Sample code that is defined in an ODPS SQL node:
set a=b;
create table name1(id string);
set c=d;
create table name2(id string);
The following table describes the sequences of executing the preceding statements in different environments.
| SQL statement | DataStudio | Scheduling environment |
| --- | --- | --- |
| First SQL statement | set a=b; set c=d; create table name1(id string); | set a=b; create table name1(id string); |
| Second SQL statement | set a=b; set c=d; create table name2(id string); | set c=d; create table name2(id string); |
Simple code editing example
SQL statements
The MaxCompute SQL syntax is similar to the standard SQL syntax. DDL, DML, and DQL statements and MaxCompute-specific statements are supported. For information about the syntax requirements and usage examples of each SQL statement, see Overview of MaxCompute SQL. The following examples show how to write and execute SQL statements.
- If new data types are used for the additional functions of MaxCompute V2.0, you must add set odps.sql.type.system.odps2=true; before the SQL statements that use the functions, and commit and execute the SET statement together with the SQL statements. For more information about data types in MaxCompute V2.0, see MaxCompute V2.0 data type edition.
- DataWorks provides scheduling parameters. To enable values to be dynamically passed in the code of an ODPS SQL node, define variables in the node code in the ${Variable name} format, and assign values to the variables in the Parameters section of the Properties tab on the configuration tab of the ODPS SQL node. For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters. A minimal usage sketch is shown after these notes.
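For example, the following minimal sketch reads the partition that matches the value assigned to a scheduling variable. The demo_orders table, its ds partition column, and the bizdate variable name are hypothetical; the sketch assumes that bizdate is assigned a value, such as the data timestamp in the yyyymmdd format, in the Parameters section of the Properties tab.
-- ${bizdate} is replaced at run time with the value that is assigned to the
-- bizdate variable in the Parameters section of the Properties tab.
-- demo_orders and ds are hypothetical names that are used only for illustration.
select order_id, amount
from demo_orders
where ds = '${bizdate}';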
- Create a table
  You can execute the CREATE TABLE statement to create a non-partitioned table, a partitioned table, an external table, or a clustered table. For more information, see CREATE TABLE. Sample SQL statement:
  -- Create a partitioned table named students.
  create table if not exists students (
      id BIGINT,
      name STRING,
      age BIGINT,
      birth DATE)
  partitioned by (gender STRING);
- Insert data
  You can execute the INSERT INTO or INSERT OVERWRITE statement to insert data into a destination table. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE). Sample SQL statements:
  -- Insert data.
  INSERT INTO students partition(gender='boy') VALUES (1,'John',15,DATE '2008-05-15');
  INSERT INTO students partition(gender='boy') VALUES (2,'Jack',17,DATE '2006-07-20');
  INSERT INTO students partition(gender='girl') VALUES (3,'Alice',20,DATE '2003-04-20');
  INSERT INTO students partition(gender='girl') VALUES (4,'Lily',21,DATE '2002-01-08');
  INSERT INTO students partition(gender='boy') VALUES (5,'Bob',17,DATE '2006-09-12');
  Important: The INSERT INTO statement may result in unexpected data duplication. We recommend that you execute the INSERT OVERWRITE statement instead of the INSERT INTO statement. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE). For an illustrative INSERT OVERWRITE sketch, see the example after this list.
- Query data
  You can execute SELECT statements to perform operations such as nested queries, sorting, and grouping. For more information, see SELECT syntax. Sample SQL statements:
  -- Enable the full table scan feature. The setting is valid only for the current session.
  set odps.sql.allow.fullscan=true;
  -- Query the information about all boys and sort the information by ID in ascending order.
  select * from students where gender='boy' order by id;
  Note: By default, RAM users do not have the permissions that are required to query MaxCompute tables in the production environment. If you want to query MaxCompute tables in the production environment as a RAM user, you must request the required permissions for the RAM user in Security Center in the DataWorks console. For information about the permissions of built-in workspace-level roles on MaxCompute data and how to manage the permissions, see Manage permissions on data in a MaxCompute compute engine instance. For information about how to grant permissions to users by running commands, see Manage user permissions by using commands.
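As noted in the Important message in the Insert data step, INSERT OVERWRITE is recommended when reruns must not duplicate data. The following minimal sketch assumes a hypothetical staging table named students_tmp that has the same schema as the students table:
-- Rewrite the gender='boy' partition of the students table from the hypothetical
-- staging table students_tmp. Rerunning this statement does not duplicate data
-- because the partition is overwritten instead of appended to.
insert overwrite table students partition (gender='boy')
select id, name, age, birth
from students_tmp
where gender = 'boy';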
SQL functions
MaxCompute allows you to use built-in functions and user-defined functions (UDFs) for data development and analysis. For information about built-in functions, see Overview of built-in functions. For information about UDFs, see Overview of UDFs. The following examples show how to use SQL functions.
- Built-in functions
  MaxCompute provides a large number of built-in functions that you can directly call. Based on the example in the preceding section, you can use the dateadd function to add a specified interval, in a specified unit, to the values in the birth column. Sample statements:
  -- Enable the full table scan feature. The setting is valid only for the current session.
  set odps.sql.allow.fullscan=true;
  select id, name, age, birth, dateadd(birth, 1, 'mm') as birth_dateadd from students;
- UDFs
  To use a UDF, you must write the function code, upload the code as a resource, and register the function. For more information, see Create and use a MaxCompute function. A usage sketch is shown below.
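After a UDF is registered, you can call it in an SQL statement in the same way as a built-in function. The following minimal sketch assumes a hypothetical UDF named my_lower that converts a string to lowercase:
-- Enable the full table scan feature. The setting is valid only for the current session.
set odps.sql.allow.fullscan=true;
-- my_lower is a hypothetical UDF that is assumed to have been registered from an
-- uploaded resource.
select id, name, my_lower(name) as name_lower from students;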
Running results and returned data
- Running results are displayed in a workbook. You can analyze the running results, view or manage the running results in the workbook, or copy the running results to an Excel file on your on-premises machine. For more information, see Debugging procedure.
  Note: Due to the adjustment made by the International Organization for Standardization (ISO) to the UTC+8 time zone, a difference exists between the actual time and the output time when you execute related SQL statements in DataWorks. For the years 1900 to 1928, the time difference is 352 seconds. For years earlier than 1900, the time difference is 9 seconds.
- Run logs: You can click the Runtime Log tab to view the run logs in Logview. For more information, see Use Logview V2.0 to view job information.
- Returned data:
  - Query the information about all boys and sort the information by ID in ascending order.
    +------------+------------+------------+------------+------------+
    | id         | name       | age        | birth      | gender     |
    +------------+------------+------------+------------+------------+
    | 1          | John       | 15         | 2008-05-15 | boy        |
    | 2          | Jack       | 17         | 2006-07-20 | boy        |
    | 5          | Bob        | 17         | 2006-09-12 | boy        |
    +------------+------------+------------+------------+------------+
  - Change the values in the birth column based on a specified unit and a specified interval.
    +------------+------------+------------+------------+---------------+
    | id         | name       | age        | birth      | birth_dateadd |
    +------------+------------+------------+------------+---------------+
    | 4          | Lily       | 21         | 2002-01-08 | 2002-02-08    |
    | 3          | Alice      | 20         | 2003-04-20 | 2003-05-20    |
    | 2          | Jack       | 17         | 2006-07-20 | 2006-08-20    |
    | 1          | John       | 15         | 2008-05-15 | 2008-06-15    |
    | 5          | Bob        | 17         | 2006-09-12 | 2006-10-12    |
    +------------+------------+------------+------------+---------------+
  On the Result tab, you can sort result data by a specific field and perform operations such as downloading the result data. For more information, see Process query results.
Advanced code editing examples
For more information, see the following topics: