Spark Layer of the BOM for Verticals
References
- This GitHub repository is a component of the BOM4V project, aiming at demonstrating end-to-end Spark-based examples of Machine Learning (ML) pipelines, for instance churn detection in telecoms and transport industries.
- Central Maven repository with BOM4V Jar artefacts
- Docker cloud with ready-to-use images
Machine Learning (ML)
- Churn Prediction with Apache Spark Machine Learning, by Carol McDonald on MapR blog, 5 June 2017
- Realtime prediction using Spark Structured Streaming, XGBoost and Scala, by Bogdan Cojocar on Medium, 24 June 2018
Installation
Short version
Just add the dependency on ti-spark-examples
in the SBT project configuration (typically, build.sbt
in the project root directory):
libraryDependencies += "org.bom4v.ti" %% "ti-spark-examples" % "0.0.1-spark2.3"
Run the demonstrator
$ mkdir -p ~/dev/ti
$ cd ~/dev/ti
$ git clone https://github.com/bom4v/metamodels.git
$ cd metamodels
$ rake clone && rake checkout
$ rake offline=true deliver
$ cd workspace/src/ti-spark-examples
$ ./fillLocalDataDir.sh
$ sbt run
[info] Loading global plugins from ~/.sbt/1.0/plugins
[info] Loading project definition from ~/dev/ti/metamodels/workspace/src/ti-spark-examples/project
[info] Set current project to ti-spark-examples (in build file:~/dev/ti/metamodels/workspace/src/ti-spark-examples/)
[info] Compiling 1 Scala source to ~/dev/ti/metamodels/workspace/src/ti-spark-examples/target/scala-2.11/classes...
[info] Running org.bom4v.ti.Demonstrator
17/08/06 18:04:26 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/08/06 18:04:26 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/08/06 18:04:28 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/08/06 18:04:28 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/08/06 18:04:28 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/08/06 18:04:28 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/08/06 18:04:28 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/08/06 18:04:29 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
cdrDF:
root
|-- specificationVersionNumber: integer (nullable = true)
|-- releaseVersionNumber: integer (nullable = true)
|-- fileName: string (nullable = true)
|-- fileAvailableTimeStamp: timestamp (nullable = true)
|-- fileUtcTimeOffset: integer (nullable = true)
|-- sender: string (nullable = true)
|-- recipient: string (nullable = true)
|-- sequenceNumber: integer (nullable = true)
|-- callEventsCount: string (nullable = true)
|-- eventType: string (nullable = true)
|-- imsi: long (nullable = true)
|-- imei: long (nullable = true)
|-- callEventStartTimeStamp: timestamp (nullable = true)
|-- utcTimeOffset: integer (nullable = true)
|-- callEventDuration: integer (nullable = true)
|-- causeForTermination: integer (nullable = true)
|-- accessPointNameNI: string (nullable = true)
|-- accessPointNameOI: string (nullable = true)
|-- dataVolumeIncoming: string (nullable = true)
|-- dataVolumeOutgoing: string (nullable = true)
|-- sgsnAddress: string (nullable = true)
|-- ggsnAddress: string (nullable = true)
|-- chargingId: string (nullable = true)
|-- chargeAmount: integer (nullable = true)
|-- teleServiceCode: integer (nullable = true)
|-- bearerServiceCode: string (nullable = true)
|-- supplementaryServiceCode: string (nullable = true)
|-- dialledDigits: string (nullable = true)
|-- connectedNumber: string (nullable = true)
|-- thirdPartyNumber: string (nullable = true)
|-- callingNumber: long (nullable = true)
|-- recEntityId: long (nullable = true)
|-- callReference: string (nullable = true)
|-- locationArea: string (nullable = true)
|-- cellId: string (nullable = true)
|-- msisdn: string (nullable = true)
|-- servingNetwork: string (nullable = true)
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType|imsi|imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:55| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:10| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:14| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:39| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:46| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:51| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:05:08| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
only showing top 7 rows
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType|imsi|imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType|imsi|imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType|imsi|imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
dfFilteredBySQL:
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType|imsi|imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+----+----+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType| imsi| imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:01:54| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:09| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:19| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:24| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:28| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:51| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:55| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:10| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:14| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:39| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
only showing top 10 rows
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
|specificationVersionNumber|releaseVersionNumber|fileName|fileAvailableTimeStamp|fileUtcTimeOffset|sender|recipient|sequenceNumber|callEventsCount|eventType| imsi| imei|callEventStartTimeStamp|utcTimeOffset|callEventDuration|causeForTermination|accessPointNameNI|accessPointNameOI|dataVolumeIncoming|dataVolumeOutgoing|sgsnAddress|ggsnAddress|chargingId|chargeAmount|teleServiceCode|bearerServiceCode|supplementaryServiceCode|dialledDigits|connectedNumber|thirdPartyNumber|callingNumber|recEntityId|callReference|locationArea|cellId|msisdn|servingNetwork|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:01:54| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:09| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:19| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:24| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:28| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:51| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:02:55| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:10| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:14| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
| 2| 1| null| 2017-04-26 14:11:29| -400| FRAKS| ITAUT| 304561| null| mtc|250209890003854|355587045959660| 2017-04-26 21:04:39| 300| 0| 0| null| null| null| null| null| null| null| 0| 21| null| null| null| null| null| 39043490004|33672054372| null| null| null| null| null|
+--------------------------+--------------------+--------+----------------------+-----------------+------+---------+--------------+---------------+---------+---------------+---------------+-----------------------+-------------+-----------------+-------------------+-----------------+-----------------+------------------+------------------+-----------+-----------+----------+------------+---------------+-----------------+------------------------+-------------+---------------+----------------+-------------+-----------+-------------+------------+------+------+--------------+
only showing top 10 rows
copyOfCDRDF:
+-----------+-------------+
| number|callingNumber|
+-----------+-------------+
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
+-----------+-------------+
newCDRDF:
root
|-- number: string (nullable = true)
|-- callingNumber: string (nullable = true)
+-----------+-------------+
| number|callingNumber|
+-----------+-------------+
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
|33672054372| 39043490004|
+-----------+-------------+
[success] Total time: 17 s, completed Aug 6, 2017 6:04:35 PM
Interacting with a Spark installation
So far, we have seen how to launch the application on the Spark engine embedded by the JVM spawned by SBT. That embedded Spark engine has some limitations, and a vanilla version of Spark installation may be preferred for more demanding use cases.
On recent Spark installations, there is no need to prefix file-paths by hdfs://
or to specify absolute file-paths:
- In stand-alone mode, Spark will look in the local file-system
- In cluster mode, Spark will look in HDFS. If the file-paths are relative, then Spark will look relatively from the user home directory (typically,
/user/$USER
) on HDFS
In the following sections, details are given on how to interact with HDFS for instance, to transfer back and forth betwwen the local filesystem and HDFS), but most of those operations are now optional on a local Spark installation.
(Optional) Copy the data onto HDFS
$ export HDFS_URL="hdfs://127.0.0.1:9000"
$ alias hdfsfs='hdfs dfs -Dfs.defaultFS=$HDFS_URL'
$ export HDFS_USR_DIR="/user/<user>"
$ hdfsfs -mkdir -p $HDFS_USR_DIR/data/cdr
$ hdfsfs -put data/cdr/CDR-sample.csv $HDFS_USR_DIR/data/cdr
$ hdfsfs -cat $HDFS_USR_DIR/data/cdr/CDR-sample.csv|head -3
Local Spark cluster
- It is assumed here that Spark has been installed locally
$ export MVN_CHD_REPO="$HOME/.m2/repository"
$ $SPARK_HOME/bin/spark-submit \
--class org.bom4v.ti.Demonstrator \
--master local --deploy-mode client \
--jars \
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-calls_2.11/0.0.1/ti-models-calls_2.11-0.0.1.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-calls_2.11/0.0.1-spark2.3/ti-serializers-calls_2.11-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-customers_2.11/0.0.1-spark2.3/ti-serializers-customers_2.11-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-customers_2.11/0.0.1/ti-models-customers_2.11-0.0.1.jar \
target/scala-2.11/ti-spark-examples_2.11-0.0.1-spark2.3.jar
Spark cluster - Client mode
- It is assumed here that a Spark cluster has been installed somewhere, and that you are allowed to launch jobs on that cluster
- On some recent local installations of Spark, for instance on MacOS, the Yarn cluster client mode is equivalent to the local mode
$ $SPARK_HOME/bin/spark-submit \
--class org.bom4v.ti.Demonstrator \
--master yarn --deploy-mode client \
--jars \
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-calls_2.11/0.0.1/ti-models-calls_2.11-0.0.1.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-calls_2.11/0.0.1-spark2.3/ti-serializers-calls_2.11-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-customers_2.11/0.0.1-spark2.3/ti-serializers-customers_2.11-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-customers_2.11/0.0.1/ti-models-customers_2.11-0.0.1.jar \
target/scala-2.11/ti-spark-examples_2.11-0.0.1-spark2.3.jar
Spark cluster - Server mode
If the jobs are to be launched from a remote machine, you may want to map the local HDFS port to the HDFS port of the remote machine. For instance, from an independent terminal window on the local machine:
$ The -N option allows to not launch any command (eg, bash)
$ ssh <user>@<remote-machine> -N -L 9000:127.0.0.1:9000
Then, the following commands will work:
- remotely if the above SSH port forwarding has been set up
- locally if the above SSH port forwarding has not been set up
$ export HDFS_URL="hdfs://127.0.0.1:9000"
$ alias hdfsfs='hdfs dfs -Dfs.defaultFS=${HDFS_URL}'
$ export ATF_USR_DIR="/user/<user>/artefacts"
$ export ATF_USR_URL="${HDFS_URL}${ATF_USR_DIR}"
$ hdfsfs -mkdir -p $ATF_USR_DIR
$ hdfsfs -put -f target/scala-2.11/ti-spark-examples_2.11-0.0.1-spark2.3.jar $ATF_USR_DIR
$ $SPARK_HOME/bin/spark-submit \
--class org.bom4v.ti.Demonstrator \
--master yarn --deploy-mode cluster \
--jars \
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-calls_2.11/0.0.1/ti-models-calls_2.11-0.0.1.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-calls_2.11/0.0.1-spark2.3/ti-serializers-calls_2.11-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-serializers-customers_2.11/0.0.1-spark2.3/ti-serializers-customers_2.11-0.0.1-spark2.3.jar,\
file:$MVN_CHD_REPO/org/bom4v/ti/ti-models-customers_2.11/0.0.1/ti-models-customers_2.11-0.0.1.jar \
target/scala-2.11/ti-spark-examples_2.11-0.0.1-spark2.3.jar