Ongoing Tasks: ETL Basics
ETL (Extract, Transform & Load) is a three-stage RavenDB process that transfers data from a RavenDB database to an external target. The data can be filtered and transformed along the way.
The external target can be:
- Another RavenDB database instance (outside of the Database Group)
- A relational database
- Elasticsearch
- OLAP (Online Analytical Processing)
- A message broker such as Apache Kafka, RabbitMQ, or Azure Queue Storage
ETL can be used on sharded and non-sharded databases alike.
Learn more about how ETL works on a sharded database here.
Why use ETL
Share relevant data
Send data in a well-defined format to match specific requirements, ensuring only relevant data is transmitted
(e.g., sending data to an existing reporting solution).
Protect your data - Share partial data
Limit access to sensitive data: details that should remain private can be filtered out, so only partial data is shared.
Reduce system calls
Distribute data across related services within your system architecture, allowing each service to access its own copy of the data without cross-service calls (e.g., sharing a product catalog among multiple stores). -
Transform the data
- Modify the content being sent as needed using JavaScript code.
- Multiple documents can be sent from a single source document.
- Data can be transformed to match the target destination's model.
Aggregate your data
Data sent from multiple locations can be aggregated in a central server
(e.g., aggregating sales data from point-of-sale systems for centralized calculations).
Defining ETL Tasks
The following ETL tasks can be defined:
- RavenDB ETL - send data to another RavenDB database
- SQL ETL - send data to an SQL database
- Snowflake ETL - send data to a Snowflake warehouse
- OLAP ETL - send data to an OLAP destination
- Elasticsearch ETL - send data to an Elasticsearch destination
- Kafka ETL - send data to a Kafka message broker
- RabbitMQ ETL - send data to a RabbitMQ exchange
- Azure Queue Storage ETL - send data to an Azure Queue Storage message queue
- Amazon SQS ETL - send data to an Amazon SQS message queue
All ETL tasks can be defined from the Client API or from the Studio.
The destination address and access options are set using a pre-defined connection string, simplifying deployment across different environments. For example, with RavenDB ETL, multiple URLs can be configured in the connection string since the target database can reside on multiple nodes within the Database Group of the destination cluster.
If one of the destination nodes is unavailable, RavenDB automatically executes the ETL process against another node specified in the connection string. Learn more in the Connection Strings article.
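The failover behavior described above can be sketched as a simple "try each node in turn" loop. This is illustrative logic only, not RavenDB's implementation; the URLs and the `tryExecute` callback are hypothetical:

```javascript
// Illustrative sketch - not RavenDB's actual failover code.
// Try each URL from the connection string until one succeeds.
function executeAgainstAnyNode(urls, tryExecute) {
    for (var i = 0; i < urls.length; i++) {
        try {
            return tryExecute(urls[i]); // success: done
        } catch (e) {
            // This node is unavailable - fall through to the next URL.
        }
    }
    throw new Error("All destination nodes are unavailable");
}

// Demo: the first (hypothetical) node is down, the second responds.
var usedNode = executeAgainstAnyNode(
    ["https://node-a:443", "https://node-b:443"],
    function (url) {
        if (url === "https://node-a:443") throw new Error("connection refused");
        return url;
    }
);
```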
ETL Stages
ETL's three stages are:
- Extract - Extract the documents from the database
- Transform - Transform and filter the document data according to the supplied script (optional)
- Load - Load (write) the transformed data into the target destination
Extract
The ETL process starts with retrieving the documents from the database.
You can choose which documents will be processed by the next two stages (Transform and Load).
The possible options are:
- Documents from a single collection
- Documents from multiple collections
- All documents
Transform
This stage transforms and filters the extracted documents according to a provided script, so that only relevant data is shared.
The script is written in JavaScript, and its input is a document.
A task can be provided with multiple transformation scripts.
Different scripts run in separate processes, allowing multiple scripts to run in parallel.
The following is an example of a RavenDB ETL script that processes documents from the "Employees" collection:
```javascript
var managerName = null;

// 'this' refers to the source document being processed
if (this.ReportsTo !== null) {
    var manager = load(this.ReportsTo);
    managerName = manager.FirstName + " " + manager.LastName;
}

// Load the object to the target database (into collection "EmployeesWithManager")
loadToEmployeesWithManager({
    Name: this.FirstName + " " + this.LastName,
    Title: this.Title,
    BornOn: new Date(this.Birthday).getFullYear(),
    Manager: managerName
});
```
Syntax
The transformation script has access to all predefined JavaScript functions available in RavenDB's scripting engine.
See the Predefined JavaScript functions article for the complete reference.
The script context exposes the following members for working with the current document:
| Member / Function | Type | Description |
|---|---|---|
| `this` | object | The current document being processed (including its metadata). |
| `id(document)` | function | Returns the document's string ID. Pass `this` to get the ID of the current document. |
| `load(id)` | function | Loads and returns another document by its ID. Note: loading a document increases the allowed script step count. Changes to a loaded related document do not trigger the ETL process. |
In addition, the following ETL-specific functions are available.
These are not part of the general JavaScript engine and apply to ETL scripts only:
| Function / Member | Type | Description |
|---|---|---|
| `loadTo` | function | Loads an object to the specified target. An object will only be sent to the destination if this method is called. This command has several syntax options, see The loadTo method below. |
| **Attachments:** | | |
| `loadAttachment(name)` | function | Loads an attachment of the current document. |
| `hasAttachment(name)` | function | Returns `true` if an attachment with the given name exists on the current document. |
| `getAttachments()` | function | Returns a collection of attachment details for the current document. Each item has: `Name`, `Hash`, `ContentType`, `Size`. |
| `<doc>.addAttachment([name,] attachmentRef)` | function | Adds an attachment to a transformed document (`<doc>`) that will be sent to the target. For RavenDB ETL specifics, see this section. |
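For illustration, the attachment functions above can be combined in a script along these lines. This is a sketch only, and the target collection name `EmployeesWithAttachments` is hypothetical:

```javascript
// Sketch: forward all attachments of the source document to the target.
// The target collection name "EmployeesWithAttachments" is illustrative.
var doc = loadToEmployeesWithAttachments({
    Name: this.FirstName + " " + this.LastName
});
var attachments = getAttachments();
for (var i = 0; i < attachments.length; i++) {
    doc.addAttachment(loadAttachment(attachments[i].Name));
}
```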
The loadTo method
An object will only be sent to the destination if the loadTo method is called.
To specify which target to load the data into, use either of the following two equivalent syntax options in your script:
- `loadTo<TargetName>(obj, {attributes})` -
  Here the target name, `<TargetName>`, is embedded directly in the function name.
  `<TargetName>` is not a variable and cannot be used as one.
- `loadTo('TargetName', obj, {attributes})` -
  Here the target name is passed as an argument to the `loadTo` method.
  Separating the target name from the `loadTo` function name makes it possible to include symbols like `'-'` and `'.'` in target names. This is not possible with the `loadTo<TargetName>` syntax, because special characters in the name of a JavaScript function make it invalid.
  Note that this syntax may vary for some ETL types; refer to each type's documentation for the exact form.
The TargetName must correspond to the appropriate target entity for the ETL type:
- RavenDB ETL: a collection name
- SQL ETL: a table name
- OLAP ETL: a folder name
- Elasticsearch ETL: an index name
- Kafka ETL: a topic name
- RabbitMQ ETL: an exchange name
- Azure Queue Storage ETL: a queue name
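For example, in a RavenDB ETL script the first form below embeds the target collection in the function name, while the second passes it as a string, which also allows names containing special characters (the collection names here are illustrative):

```javascript
// Target name embedded in the function name:
loadToEmployees({ Name: this.FirstName + " " + this.LastName });

// Target name passed as a string argument - the only way to address
// a target whose name contains characters such as '-' or '.':
loadTo('Employees-Archive', { Name: this.FirstName + " " + this.LastName });
```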
Batch processing
Documents are extracted and transformed by the ETL process in a batch manner.
The number of documents processed depends on the following configuration limits:
- `ETL.ExtractAndTransformTimeoutInSec` (default: 30 sec)
  Time-frame for the extraction and transformation stages (in seconds), after which the loading stage will start.
- `ETL.MaxNumberOfExtractedDocuments` (default: 8192)
  Maximum number of extracted documents in an ETL batch.
- `ETL.MaxNumberOfExtractedItems` (default: 8192)
  Maximum number of extracted items (documents, counters) in an ETL batch.
- `ETL.MaxBatchSizeInMb` (default: 64 MB)
  Maximum size of an ETL batch in MB.
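The way these limits interact can be sketched as "a batch closes as soon as any configured limit is reached". This is illustrative logic only, not RavenDB's implementation; the function and field names are hypothetical, but the defaults match the configuration options above:

```javascript
// Illustrative sketch only - not RavenDB's actual batching code.
// A batch closes as soon as any configured limit is reached.
function shouldCloseBatch(stats, limits) {
    return stats.elapsedSec >= limits.extractAndTransformTimeoutInSec
        || stats.documents >= limits.maxNumberOfExtractedDocuments
        || stats.items >= limits.maxNumberOfExtractedItems
        || stats.sizeInMb >= limits.maxBatchSizeInMb;
}

// Defaults mirroring the configuration options listed above:
var defaults = {
    extractAndTransformTimeoutInSec: 30,
    maxNumberOfExtractedDocuments: 8192,
    maxNumberOfExtractedItems: 8192,
    maxBatchSizeInMb: 64
};

// The document-count limit was hit, so this batch closes:
var closes = shouldCloseBatch(
    { elapsedSec: 2, documents: 8192, items: 8192, sizeInMb: 3 },
    defaults
);
```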
Load
Loading the results to the target destination is the last stage.
In contrast to Replication, ETL is a push-only process that writes data to the destination whenever documents from the relevant collections are changed. Existing entries on the target will always be overwritten.
Updates are implemented by executing consecutive DELETEs and INSERTs.
When a document is modified, the delete command is sent before the new data is inserted, and both are processed in the same transaction on the destination side. This applies to all ETL types, with two exceptions:
- In RavenDB ETL, when documents are loaded to the same collection there is no need to send a DELETE, because the document on the other side has the same identifier and is simply updated.
- In SQL ETL, you can configure the task to use inserts only, which is a viable option for append-only systems.
Securing ETL processes for encrypted databases:
If your RavenDB database is encrypted, then by default you cannot send data in an ETL process over a non-encrypted channel. This means the connection to the target must be secured:
- In RavenDB ETL, the URL of the destination server must use HTTPS
  (the server certificate of the source server must be registered as a client certificate on the destination server).
- In SQL ETL, the connection string to the SQL database must specify an encrypted connection (specifics vary per SQL engine).
This validation can be turned off by selecting the Allow ETL on a non-encrypted communication channel option in the Studio, or by setting `AllowEtlOnNonEncryptedChannel` if the task is defined using the Client API.
Please note that in such cases, your data, although encrypted at rest, will not be protected in transit.
Troubleshooting
ETL errors and warnings are logged to files and displayed in the notification center panel.
You will be notified if any of the following events happen:
- Connection error to the target
- Invalid JavaScript script
- Transformation error
- Load error
- Slow SQL detected
Fallback Mode:
If the ETL process cannot proceed with the load stage (e.g., it cannot connect to the destination), it enters fallback mode.
In fallback mode the process is suspended and retried periodically.
The fallback time starts at 5 seconds and doubles on every consecutive error,
but it never exceeds the `ETL.MaxFallbackTimeInSec` configuration value (default: 900 sec).
While the process is in fallback mode, the Reconnect state is shown in the Studio.
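The growth of the retry delay can be sketched as capped exponential backoff. This is illustrative only, not RavenDB's actual code; the function name is hypothetical, and the cap matches the `ETL.MaxFallbackTimeInSec` default:

```javascript
// Illustrative exponential-backoff sketch - not RavenDB's actual code.
// The delay starts at 5 seconds and doubles per consecutive error,
// capped at ETL.MaxFallbackTimeInSec (default: 900 seconds).
function fallbackDelaySec(consecutiveErrors, maxFallbackTimeInSec) {
    if (maxFallbackTimeInSec === undefined) maxFallbackTimeInSec = 900;
    var delay = 5 * Math.pow(2, consecutiveErrors - 1);
    return Math.min(delay, maxFallbackTimeInSec);
}
```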