Transport UDFs
Transport is a framework for writing performant user-defined functions (UDFs) that are portable across a variety of engines including Apache Spark, Apache Hive, and Trino. Transport UDFs are also capable of directly processing data stored in serialization formats such as Apache Avro. With Transport, developers only need to implement their UDF logic once using the Transport API. Transport then takes care of translating the UDF to native UDF version targeted at various engines or formats. Currently, Transport is capable of generating engine-artifacts for Spark, Hive, and Trino, and format-artifacts for Avro. Further details on Transport can be found in this LinkedIn Engineering blog post.
Documentation
- Project information: README.md
- Transport release notes: docs/release-notes.md
- Contributing: #contributing
- User guides
- Transport UDFs API: docs/transport-udfs-api.md
- Authoring Transport UDFs: docs/authoring-transport-udfs.md
- Using Transport UDFs: docs/using-transport-udfs.md
- Writing Tests for Transport UDFs: docs/writing-tests.md
- FAQ: docs/faq.md
- Developer guides
- How the Transport Plugin works: [TODO]
- How the Transport Test Framework works: [TODO]
Example
This example shows how a portable UDF is written using the Transport APIs.
public class MapFromTwoArraysFunction extends StdUDF2<StdArray, StdArray, StdMap> implements TopLevelStdUDF {
private StdType _mapType;
@Override
public List<String> getInputParameterSignatures() {
return ImmutableList.of(
"array(K)",
"array(V)"
);
}
@Override
public String getOutputParameterSignature() {
return "map(K,V)";
}
@Override
public void init(StdFactory stdFactory) {
super.init(stdFactory);
_mapType = getStdFactory().createStdType(getOutputParameterSignature());
}
@Override
public StdMap eval(StdArray a1, StdArray a2) {
if (a1.size() != a2.size()) {
return null;
}
StdMap map = getStdFactory().createMap(_mapType);
for (int i = 0; i < a1.size(); i++) {
map.put(a1.get(i), a2.get(i));
}
return map;
}
@Override
public String getFunctionName() {
return "map_from_two_arrays";
}
@Override
public String getFunctionDescription() {
return "A function to create a map out of two arrays";
}
}
In the example above, StdMap
and StdArray
are interfaces that provide high-level map and array operations to their objects. Depending on the engine where this UDF is executed, those interfaces are implemented differently to deal with native data types used by that engine. getStdFactory()
is a method used to create objects that conform to a given data type (such as a map whose keys are of the type of elements in the first array and values are of the type of elements in the second array). StdUDF2
is an abstract class to express a UDF that takes two parameters. It is parametrized by the UDF input types and the UDF output type. Please consult the Transport UDFs API for more details and examples.
How to Build
Clone the repository:
git clone https://github.com/linkedin/transport.git
Change directory to transport
:
cd transport
Build:
./gradlew build
Please note that this project requires Java 1.8.0_151
or higher. Either set JAVA_HOME
to the home of an appropriate version and then use ./gradlew build
as described above, or set the org.gradle.java.home
gradle property to the Java home of an appropriate version as below:
./gradlew -Dorg.gradle.java.home=/path/to/java/home build
How to Use
The project under the directory transportable-udfs-examples
is a standalone Gradle project that shows how to setup a project that uses the Transport UDFs framework to write Transportable UDFs. You can model your project after that standalone project. It implements a number of example UDFs to showcase different features and aspects of the API. Basically, you need to check out three components:
-
UDF examples code to familiarize yourself with the API, and how to write new UDFs.
-
Test code to find out how to write UDF tests in a unified testing API, but have the framework test them on multiple platforms.
-
Root
build.gradle
file to find out how to apply thetransport
plugin, which enables generating Hive, Spark, and Trino UDFs out of the transportable UDFs you define once you build your project. To see that in action:
Change directory to transportable-udfs-examples
:
cd transportable-udfs-examples
Build transportable-udfs-examples
:
gradle build
You will notice that the build process generates some code. This is the platform-specific versions of the UDFs. Once the build succeeds, check out the output artifacts:
ls transportable-udfs-example-udfs/build/libs/
The results should be like:
transportable-udfs-example-udfs-hive.jar
transportable-udfs-example-udfs-trino.jar
transportable-udfs-example-udfs-spark.jar
transportable-udfs-example-udfs.jar
That is it! While only one version of the UDFs is implemented, multiple jars are produced upon building the project. Each of those jars uses native platform APIs and data models to implement the UDFs. So from an execution engine's perspective, there is no data transformation needed for interoperability or portability. Only suitable classes are used for each engine.
To call those jars from your SQL engine (i.e., Hive, Spark, or Trino), the standard process for deploying UDF jars is followed for each engine. For example, in Hive, you add the jar to the classpath using the ADD JAR
statement, and register the UDF using CREATE FUNCTION
statement. In Trino, the jar is deployed to the plugin
directory. However, a small patch is required for the Trino engine to recognize the jar as a plugin, since the generated Trino UDFs implement the SqlScalarFunction
API, which is currently not part of Trino's SPI architecture. You can find the patch here and apply it before deploying your UDFs jar to the Trino engine.
Contributing
The project is under active development and we welcome contributions of different forms:
-
Contributing new general-purpose Transport UDFs (e.g., Machine Learning UDFs, Spatial UDFs, Linear Algebra UDFs, etc).
-
Contributing new platform support.
-
Contributing a framework for new types of UDFs, e.g., aggregate UDFs (UDAFs), or table functions (UDTFs).
Please take a look at the Contribution Agreement.
Questions?
Please send any questions or discussion topics to [email protected]
License
BSD 2-CLAUSE LICENSE
Copyright 2018 LinkedIn Corporation.
All Rights Reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the
distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.