# Bigstep DataLake client libraries
These libraries enable the `dl://` scheme in Hadoop and associated tools so that commands such as `hdfs dfs -ls` or `distcp` work properly. They can also be used as a standalone `FileSystem` implementation to enable easy interaction with the DataLake from Java or Scala applications. The build also produces a standalone version of `hdfs dfs`.
### Before getting started

- Deploy a DataLake in Bigstep's Control Panel.
- Install the Java Cryptography Extension (JCE) Unlimited Strength policy files on all nodes of the cluster.
### Using the standalone tool

The command-line tool is independent of the environment and can coexist with any Kerberos installation. To use it:

1. Have Java installed, with the Unlimited Strength policy files (OpenJDK includes them).
2. Download the binaries.
3. Generate a keytab:

       ./bin/dl genkeytab [email protected] /etc/kxxxx.keytab

4. Add your Kerberos identity to the embedded core-site.xml:

       vim conf/core-site.xml

5. Execute any `hdfs dfs` command:

       ./bin/dl fs -ls dl://node10930-datanodes-data-lake01-uk-reading.bigstep.io:14000/data_lake/dlxxx/
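A DataLake location is an ordinary URI: the `dl` scheme, the endpoint host and port (14000 in the example above), and the path inside the data lake. As a minimal sketch of how those parts break down (the hostname is just the example endpoint from above), they can be pulled apart with the standard `java.net.URI`:

```java
import java.net.URI;

public class DlUriDemo {
    // Split a DataLake URI into the parts the client needs:
    // scheme, endpoint host, endpoint port, and path inside the lake.
    public static String describe(String url) {
        URI u = URI.create(url);
        return u.getScheme() + " " + u.getHost() + " " + u.getPort() + " " + u.getPath();
    }

    public static void main(String[] args) {
        // Example endpoint from this README
        String url = "dl://node10930-datanodes-data-lake01-uk-reading.bigstep.io:14000/data_lake/dlxxx/";
        System.out.println(describe(url));
        // → dl node10930-datanodes-data-lake01-uk-reading.bigstep.io 14000 /data_lake/dlxxx/
    }
}
```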
### Using as part of a Hadoop environment

To use the library within a Hadoop environment:

- Install the Kerberos client libraries:

  CentOS:

      yum install krb5-workstation

  Ubuntu:

      apt-get install krb5-config krb5-user krb5-clients

  macOS:

      brew install krb5
- Update /etc/krb5.conf or download the auto-generated file from your Bigstep account:

      [appdefaults]
      validate=false

      [libdefaults]
      default_realm = bigstep.io
      dns_lookup_realm = false
      dns_lookup_kdc = false
      ticket_lifetime = 24h
      renew_lifetime = 7d
      forwardable = true
      validate=false
      rdns = false
      ignore_acceptor_hostname = true
      udp_preference_limit = 1
      default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1 arcfour-hmac-md5 camellia256-cts-cmac camellia128-cts-cmac des-cbc-crc des-cbc-md5 des-cbc-md4
      default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1 arcfour-hmac-md5 camellia256-cts-cmac camellia128-cts-cmac des-cbc-crc des-cbc-md5 des-cbc-md4
      permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1 arcfour-hmac-md5 camellia256-cts-cmac camellia128-cts-cmac des-cbc-crc des-cbc-md5 des-cbc-md4

      [realms]
      bigstep.io = {
          kdc = fullmetal.bigstep.com
          admin_server = fullmetal.bigstep.com
          max_renewable_life = 5d 0h 0m 0s
      }
- Add the following to core-site.xml. If using standalone Spark, place this file in the spark-2.x.x/conf directory:

      <property>
          <name>fs.dl.impl</name>
          <value>com.bigstep.datalake.DLFileSystem</value>
      </property>
      <property>
          <name>fs.dl.impl.kerberosPrincipal</name>
          <value>[email protected]</value>
      </property>
      <property>
          <name>fs.dl.impl.kerberosKeytab</name>
          <value>/etc/hadoop/kx.keytab</value>
      </property>
      <property>
          <name>fs.dl.impl.kerberosRealm</name>
          <value>bigstep.io</value>
      </property>
      <property>
          <name>fs.dl.impl.homeDirectory</name>
          <value>/data_lake/dlxxxx</value>
      </property>
      <!-- optional -->
      <property>
          <name>fs.dl.impl.defaultFilePermissions</name>
          <value>00640</value>
      </property>
      <!-- optional -->
      <property>
          <name>fs.dl.impl.defaultUMask</name>
          <value>007</value>
      </property>
- Create a keytab:

      # ktutil
      ktutil: addent -password -p [email protected] -k 1 -e aes256-cts
      ktutil: wkt /root/.k5keytab
      ktutil: exit
- Make sure the jar is available on all the cluster machines, and that the keytab is readable by the yarn user (e.g. not under /root):
  - Vanilla Hadoop: hadoop-2.7.x/share/hadoop/common/
  - Vanilla Spark 2.0: spark-2.x.x/jars
  - Cloudera CDH 5.x: /opt/cloudera/parcels/CDH/lib/hadoop/
- Technically it should be possible to use the library with any Hadoop-enabled application, as long as the jar is added to the classpath.
You should now be able to use regular Hadoop commands like distcp:

    hadoop distcp hdfs://localhost/user/hdfs/test dl://node10930-datanodes-data-lake01-uk-reading.bigstep.io:14000/data_lake/dlzzz
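For reference, the optional `fs.dl.impl.defaultFilePermissions` and `fs.dl.impl.defaultUMask` values in core-site.xml above are octal. Under the usual POSIX convention, the effective permissions are the requested permissions AND NOT the umask; whether the client applies exactly this rule is an assumption here, and the sketch below only illustrates the arithmetic with the example values from this README:

```java
public class UmaskDemo {
    // Standard POSIX rule: effective permissions = requested permissions & ~umask.
    // Values are octal strings as they appear in core-site.xml.
    public static String apply(String permOctal, String umaskOctal) {
        int perm = Integer.parseInt(permOctal, 8);
        int umask = Integer.parseInt(umaskOctal, 8);
        return String.format("%04o", perm & ~umask);
    }

    public static void main(String[] args) {
        // With the README's example values: permissions 00640, umask 007
        System.out.println(apply("00640", "007")); // 0640 — the umask removes no bits here
        System.out.println(apply("0666", "007"));  // 0660 — the "others" bits are cleared
    }
}
```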
## Using as part of a Spark environment

- Extract a vanilla Spark 2.* directory.
- Copy the datalake-client-libraries jar into the jars directory of spark-2.x.x:

      cp datalake-client-libraries-*.jar jars

- Modify core-site.xml to include the properties above.
- Start spark-shell.
### Using via Spark

    ./bin/spark-shell --packages com.bigstep:datalake-client-libraries:1.4
### Using programmatically

To use the library directly in a Maven project, add:

    <dependency>
        <groupId>com.bigstep</groupId>
        <artifactId>datalake-client-libraries</artifactId>
        <version>1.4</version>
    </dependency>

Javadoc is available from Maven Central.

To compile, use:

    mvn package
More information can be found at the DataLake documentation.
### Troubleshooting

You can find out a lot more about what the library is doing by enabling debug logging. Change the log4j.rootLogger to DEBUG in the conf/log4j.properties file:

    log4j.rootLogger=DEBUG, A1

If you see errors along the lines of:

    Key for the principal [email protected] not available in /etc/kxxx.keytab
    [Krb5LoginModule] authentication failed
    Unable to obtain password from user

try enabling Kerberos debug via:

    export KERBEROS_DEBUG=true
If you see something along the lines of:

    Found unsupported keytype (18) for [email protected]

you need to enable AES-256 (assuming it's not illegal in your country) by adding the "Unlimited Strength JCE" policy files. Follow your operating system's guide for installing them. For Mac, try: http://bigdatazone.blogspot.ro/2014/01/mac-osx-where-to-put-unlimited-jce-java.html
If a command stalls for an unknown reason and then times out with "KDC not found", try adding the following to the [libdefaults] section of your /etc/krb5.conf file:

    udp_preference_limit=1
If you see errors related to a missing StringUtil.toLowerCase, try the hadoop2.6 branch, as master is linked against hadoop-2.7.
### Using file encryption

It is possible to encrypt/decrypt files when uploading to/downloading from the DataLake. To enable this, add the following properties to core-site.xml:

    <!-- This tells the DataLake client whether it should encrypt/decrypt files when
    uploading/downloading. If the property is missing, the default value is false. -->
    <property>
        <name>fs.dl.impl.shouldUseEncryption</name>
        <value>true</value>
    </property>
    <!-- The location of the AES key. The file should be exactly 16 bytes long.
    This property is required if fs.dl.impl.shouldUseEncryption is set to true. -->
    <property>
        <name>fs.dl.impl.encryptionKeyPath</name>
        <value>/etc/PUT_YOUR_KEY_PATH_HERE</value>
    </property>
When setting fs.dl.impl.shouldUseEncryption to true, you must also provide the path to the file containing the AES key in fs.dl.impl.encryptionKeyPath (the file should be exactly 16 bytes long). This enables encryption of files when uploading or appending, and decryption when downloading. You can disable this at any time by setting fs.dl.impl.shouldUseEncryption back to false.
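Since the key file holds exactly 16 raw bytes, it is an AES-128 key. The sketch below is a generic illustration of generating such a key and round-tripping data through AES with it; the cipher mode and padding the DataLake client actually uses are not specified in this document, so `AES/ECB/PKCS5Padding` here is purely an assumption for demonstration:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

public class AesKeyDemo {
    // The key file must hold exactly 16 raw bytes: an AES-128 key.
    public static byte[] newKey() {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        return key;
    }

    // Generic AES-128 encrypt/decrypt to illustrate the key size.
    // The mode/padding are an assumption, not the client's documented choice.
    public static byte[] crypt(int mode, byte[] key, byte[] data) {
        try {
            Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
            c.init(mode, new SecretKeySpec(key, "AES"));
            return c.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = newKey();
        byte[] plain = "hello datalake".getBytes(StandardCharsets.UTF_8);
        byte[] enc = crypt(Cipher.ENCRYPT_MODE, key, plain);
        byte[] dec = crypt(Cipher.DECRYPT_MODE, key, enc);
        System.out.println(Arrays.equals(plain, dec)); // true
    }
}
```

Note that AES-128 works with the standard JDK policy files; the Unlimited Strength JCE files mentioned earlier are only needed for the AES-256 Kerberos keytabs.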