Data Lake Guide
We have built a new-generation data lake on top of ClickHouse, a modern columnar database. Tightly integrated with our cloud services, it automatically ingests data from the Data Storage service and from log tables, providing powerful data support for statistical analysis of business operations.
The new data lake offers the following unique features:
- Column-oriented storage with high compression rates, which can significantly reduce storage costs;
- Column-oriented scans and vectorized execution for efficient queries, which can bring complex queries that would otherwise take minutes down to seconds;
- Intuitive data views that can store intermediate results and, combined with join queries, efficiently support complex second-order queries;
- Richer functionality that supports more complex SQL queries; see the Common Use Cases section below;
- Seamless connection to log tables, supporting real-time ingestion and querying from a variety of data sources, which allows more flexible integration of external data and more complete data analysis for the business.
Data Entry
Before querying the data, we need to store the data first. At present, we mainly support data input from classes in the Data Storage service. Classes are divided into two categories: log tables and basic classes.
Log Tables
Log tables are a special class of data storage designed to meet the business needs of storing events, logs, and other "immutable" data that cannot be changed after it is written. Once collected, this data can be used for business auditing and operational analysis, and for developers to track events. Because this type of data is appended and not modified, we are able to provide greater concurrent write throughput.
Enabling and Accessing Log Tables
On the "Data Storage" page of the dashboard, click "Create class" and check the "log table" option to create a log table. For example, we can create a log table named `EventLog`.
At the SDK level, creating a log table object is the same as creating a normal object:
```java
// for Android
AVObject event = new AVObject("EventLog");
event.put("eventType", "buttonClick");
event.put("eventName", "orderSubmit");
event.put("eventDate", new Date());
event.saveInBackground();
```
Logs are submitted and stored directly in the data lake and are available in real time.
Syncing With Basic Classes
The process of synchronizing basic class objects to the data lake is a bit more complicated because of the support for updates.
On the dashboard, go to "Data Storage" - "Data lake", click "Create Class Sync", and select the fields you want to query and analyze in the data lake; synchronization then starts immediately. Depending on the size of the data, it may take a long time, so please be patient.
For classes with synchronization enabled, we will synchronize the previous day's updated data in the early morning of the next day. Therefore, the updated objects will not be visible until the next day. We will continue to tweak this process to achieve more real-time synchronization.
Data Type Conversion
The field types in the data lake are significantly different from those in the Data Storage service, and we convert the data during the entry process. Pay special attention to the following:
- The data lake does not support the `null` type; missing fields are stored as zero values. The zero value of a string type is the empty string, and the zero value of a numeric field is the number 0.
- The `Boolean` type is converted to `UInt8`: 0 means false and 1 means true. Combined with the zero-value behavior above, a missing field defaults to 0, which means false.
- Nested objects are not supported. They are stored as JSON strings and must be extracted using the JSON functions.
- For array types, the elements must all be of the same type. They are converted to `Array(String)` during data entry, and you must use a type conversion function to convert them back to their original type.
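To illustrate the behaviors above, here is a sketch of queries you might run; the `EventLog` table and its field names are hypothetical:

```sql
-- Count rows where a string field was missing at write time:
-- missing strings are stored as '' rather than NULL.
SELECT count() FROM EventLog WHERE eventName = '';

-- Array elements are stored as strings; convert them back explicitly,
-- e.g. read the first element of a numeric array field as an integer
-- (ClickHouse arrays are 1-indexed).
SELECT toInt64(scores[1]) FROM EventLog;
```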
See the following table for specific type conversion behavior:
| Data Storage Type | Data Lake Type | Description | Example |
| --- | --- | --- | --- |
| Array | Array(String) | Elements are stored as strings; use a type conversion function to restore the original type after extracting an element | toInt64(xs[1]) |
| Boolean | UInt8 | 0 means false, 1 means true | emailVerified = 1 |
| Bytes | N/A | Not supported | |
| Date | DateTime | | |
| File | *.className String, *.objectId String | Expanded into two fields: className and objectId | |
| GeoPoint | Point | Experimental support | |
| Number | Float64 | | |
| Object | String | Stored as a JSON string; extract fields with the JSON functions | visitParamExtractInt64(object, 'id') |
| Pointer | *.className String, *.objectId String | Expanded into two fields: className and objectId | |
See ClickHouse Data Types for all the types supported by the data lake, and Type Conversion Functions for how to convert data between different types.
Query Syntax
The data lake only supports `SELECT` queries. The syntax is as follows:
```sql
SELECT column1, column2, expr ...
FROM table | (subquery)
[INNER|LEFT|RIGHT|FULL|CROSS] JOIN (subquery)|table (ON <expr_list>)
[WHERE expr]
[GROUP BY expr_list]
[HAVING expr]
[ORDER BY expr_list]
[LIMIT [skip, ]n]
[UNION ...]
```
Note that strings, dates, and certain other values must be enclosed in single quotes. If a field or table name contains special characters or is a reserved word, enclose it in backquotes. For example:

```sql
WHERE order_status = 'delivered'
AND `from` = 'example.com'
```
For more query syntax, please refer to the ClickHouse documentation.
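Putting the syntax together, a complete query might look like the following; the `EventLog` table and its fields are hypothetical:

```sql
-- Top event types since a given date, only keeping frequent ones.
SELECT eventType, count() AS cnt
FROM EventLog
WHERE eventDate >= '2023-01-01'
GROUP BY eventType
HAVING cnt > 100
ORDER BY cnt DESC
LIMIT 10
```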
Common Use Cases
Extracting JSON Fields Using visitParamExtract*
During data synchronization, multiple layers of nested objects are imported as JSON strings. To extract fields from a JSON string, it is recommended to use `visitParamExtract*`. For example:

```sql
SELECT visitParamExtractString('{"abc":"\\u263a"}', 'abc') = '☺';
```
If the nesting is complex and there are overlapping fields, you can use `JSONExtract*`. For example:

```sql
SELECT JSONExtractFloat('{"a": "hello", "b": [-100, 200.0, 300]}', 'b', 2) = 200.0
```
More JSON extraction functions can be found in the official documentation.
Time Zone Conversion With toTimeZone
By default, dates are displayed in the server-side timezone. If this is not what you expect, you can use `toTimeZone` to convert the date to a specific timezone:

```sql
toTimeZone(createdAt, 'Asia/Shanghai')
```
Using toYYYYMMDD for Statistics by Date
When computing statistics by date, you can use `toYYYYMMDD` to convert a date to a number in YYYYMMDD form and group by it. Be aware of time zone handling here as well, for example:

```sql
GROUP BY toYYYYMMDD(toTimeZone(createdAt, 'Asia/Shanghai'))
```
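For instance, a daily event count in a given timezone could be written as follows; the table and field names are hypothetical:

```sql
SELECT
    toYYYYMMDD(toTimeZone(createdAt, 'Asia/Shanghai')) AS day,
    count() AS cnt
FROM EventLog
GROUP BY day
ORDER BY day
```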
Use argMax to Extract the Most Recent Version
When you encounter duplicate data, you can use `argMax` to pick the most recently written row as the "latest version" of the data. For example, we can extract the latest status of each order:

```sql
SELECT
    orderId,
    argMax(status, updatedAt) AS status
FROM my_class
GROUP BY orderId
```
Other Restrictions
- For security reasons, the `sessionToken`, `password`, `salt`, and `authData` fields of the `_User` table and the `ACL` field of all tables are not synchronized.
- For performance reasons, large array fields such as `m` and `mu` of the `_Conversation` table are not synchronized at this time.
- To better isolate applications from each other, we impose quota limits on slow queries from the same application: for example, at most 3 slow queries may run simultaneously. A query that takes more than 5 seconds counts as a slow query.
REST API
Coming soon.
Pricing
The data lake feature is available to all games and is currently in beta preview, so it is free for now. When the feature is officially released, we will charge for both "storage" and "compute resources".