morpheus-db

A data science library for the JVM

License

License

Categories

Categories

Morpheus Business Logic Libraries Science
GroupId

GroupId

com.d3xsystems
ArtifactId

ArtifactId

morpheus-db
Last Version

Last Version

1.0.3
Release Date

Release Date

Type

Type

jar
Description

Description

morpheus-db
A data science library for the JVM
Project Organization

Project Organization

D3X Systems

Download morpheus-db

How to add to project

<!-- https://jarcasting.com/artifacts/com.d3xsystems/morpheus-db/ -->
<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>morpheus-db</artifactId>
    <version>1.0.3</version>
</dependency>
// https://jarcasting.com/artifacts/com.d3xsystems/morpheus-db/
implementation 'com.d3xsystems:morpheus-db:1.0.3'
// https://jarcasting.com/artifacts/com.d3xsystems/morpheus-db/
implementation ("com.d3xsystems:morpheus-db:1.0.3")
'com.d3xsystems:morpheus-db:jar:1.0.3'
<dependency org="com.d3xsystems" name="morpheus-db" rev="1.0.3">
  <artifact name="morpheus-db" type="jar" />
</dependency>
@Grapes(
@Grab(group='com.d3xsystems', module='morpheus-db', version='1.0.3')
)
libraryDependencies += "com.d3xsystems" % "morpheus-db" % "1.0.3"
[com.d3xsystems/morpheus-db "1.0.3"]

Dependencies

compile (1)

Group / Artifact Type Version
com.d3xsystems : morpheus-core jar 1.0.3

provided (1)

Group / Artifact Type Version
org.projectlombok : lombok jar 1.16.22

test (5)

Group / Artifact Type Version
org.apache.commons : commons-dbcp2 jar 2.1.1
org.xerial : sqlite-jdbc jar 3.21.0.1
com.h2database : h2 jar 1.4.196
org.hsqldb : hsqldb jar 2.4.0
org.testng : testng jar 6.13.1

Project Modules

There are no modules declared in this project.

Introduction

The Morpheus library is designed to facilitate the development of high performance analytical software involving large datasets for both offline and real-time analysis on the Java Virtual Machine (JVM). The library is written in Java 8 with extensive use of lambdas, but is accessible to all JVM languages.

Motivation

At its core, Morpheus provides a versatile two-dimensional memory efficient tabular data structure called a DataFrame, similar to that first popularised in R. While dynamically typed scientific computing languages like R, Python & Matlab are great for doing research, they are not well suited for large scale production systems as they become extremely difficult to maintain, and dangerous to refactor. The Morpheus library attempts to retain the power and versatility of the DataFrame concept, while providing a much more type safe and self describing set of interfaces, which should make developing, maintaining & scaling code complexity much easier.

Another advantage of the Morpheus library is that it is extremely good at scaling on multi-core processor architectures given the powerful threading capabilities of the Java Virtual Machine. Many operations on a Morpheus DataFrame can seamlessly be run in parallel by simply calling parallel() on the entity you wish to operate on, much like with Java 8 Streams. Internally, these parallel implementations are based on the Fork & Join framework, and near linear improvements in performance are observed for certain types of operations as CPU cores are added.

Capabilities

A Morpheus DataFrame is a column store structure where each column is represented by a Morpheus Array of which there are many implementations, including dense, sparse and memory mapped versions. Morpheus arrays are optimized and wherever possible are backed by primitive native Java arrays (even for types such as LocalDate, LocalDateTime etc...) as these are far more efficient from a storage, access and garbage collection perspective. Memory mapped Morpheus Arrays, while still experimental, allow very large DataFrames to be created using off-heap storage that are backed by files.

While the complete feature set of the Morpheus DataFrame is still evolving, there are already many powerful APIs to affect complex transformations and analytical operations with ease. There are standard functions to compute summary statistics, perform various types of Linear Regressions, apply Principal Component Analysis (PCA) to mention just a few. The DataFrame is indexed in both the row and column dimension, allowing data to be efficiently sorted, sliced, grouped, and aggregated along either axis.

Data Access

Morpheus also aims to provide a standard mechanism to load datasets from various data providers. The hope is that this API will be embraced by the community in order to grow the catalogue of supported data sources. Currently, providers are implemented to enable data to be loaded from Quandl, The Federal Reserve, The World Bank, Yahoo Finance and Google Finance.

Morpheus at a Glance

A Simple Example

Consider a dataset of motor vehicle characteristics accessible here. The code below loads this CSV data into a Morpheus DataFrame, filters the rows to only include those vehicles that have a power to weight ratio > 0.1 (where weight is converted into kilograms), then adds a column to record the relative efficiency between highway and city mileage (MPG), sorts the rows by this newly added column in descending order, and finally records this transformed result to a CSV file.

DataFrame.read().csv(options -> {
    options.setResource("https://www.d3xsystems.com/public/data/samples/cars93.csv");
    options.setExcludeColumnIndexes(0);
}).rows().select(row -> {
    var weightKG = row.getDouble("Weight") * 0.453592d;
    var horsepower = row.getDouble("Horsepower");
    return horsepower / weightKG > 0.1d;
}).cols().add("MPG(Highway/City)", Double.class, v -> {
    var cityMpg = v.row().getDouble("MPG.city");
    var highwayMpg = v.row().getDouble("MPG.highway");
    return highwayMpg / cityMpg;
}).rows().sort(false, "MPG(Highway/City)").write().csv(options -> {
    options.setFile("/Users/witdxav/cars93m.csv");
    options.setTitle("DataFrame");
});

This example demonstrates the functional nature of the Morpheus API, where many method return types are in fact a DataFrame and therefore allow this form of method chaining. In this example, the methods csv(), select(), add(), and sort() all return a frame. In some cases the same frame that the method operates on, or in other cases a filter or shallow copy of the frame being operated on. The first 10 rows of the transformed dataset in this example looks as follows, with the newly added column appearing on the far right of the frame.

 Index  |  Manufacturer  |     Model      |   Type    |  Min.Price  |   Price   |  Max.Price  |  MPG.city  |  MPG.highway  |       AirBags        |  DriveTrain  |  Cylinders  |  EngineSize  |  Horsepower  |  RPM   |  Rev.per.mile  |  Man.trans.avail  |  Fuel.tank.capacity  |  Passengers  |  Length  |  Wheelbase  |  Width  |  Turn.circle  |  Rear.seat.room  |  Luggage.room  |  Weight  |  Origin   |           Make            |  MPG(Highway/City)  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     9  |      Cadillac  |       DeVille  |    Large  |    33.0000  |  34.7000  |    36.3000  |        16  |           25  |         Driver only  |       Front  |          8  |      4.9000  |         200  |  4100  |          1510  |               No  |             18.0000  |           6  |     206  |        114  |     73  |           43  |              35  |            18  |    3620  |      USA  |         Cadillac DeVille  |             1.5625  |
    10  |      Cadillac  |       Seville  |  Midsize  |    37.5000  |  40.1000  |    42.7000  |        16  |           25  |  Driver & Passenger  |       Front  |          8  |      4.6000  |         295  |  6000  |          1985  |               No  |             20.0000  |           5  |     204  |        111  |     74  |           44  |              31  |            14  |    3935  |      USA  |         Cadillac Seville  |             1.5625  |
    70  |    Oldsmobile  |  Eighty-Eight  |    Large  |    19.5000  |  20.7000  |    21.9000  |        19  |           28  |         Driver only  |       Front  |          6  |      3.8000  |         170  |  4800  |          1570  |               No  |             18.0000  |           6  |     201  |        111  |     74  |           42  |            31.5  |            17  |    3470  |      USA  |  Oldsmobile Eighty-Eight  |         1.47368421  |
    74  |       Pontiac  |      Firebird  |   Sporty  |    14.0000  |  17.7000  |    21.4000  |        19  |           28  |  Driver & Passenger  |        Rear  |          6  |      3.4000  |         160  |  4600  |          1805  |              Yes  |             15.5000  |           4  |     196  |        101  |     75  |           43  |              25  |            13  |    3240  |      USA  |         Pontiac Firebird  |         1.47368421  |
     6  |         Buick  |       LeSabre  |    Large  |    19.9000  |  20.8000  |    21.7000  |        19  |           28  |         Driver only  |       Front  |          6  |      3.8000  |         170  |  4800  |          1570  |               No  |             18.0000  |           6  |     200  |        111  |     74  |           42  |            30.5  |            17  |    3470  |      USA  |            Buick LeSabre  |         1.47368421  |
    13  |     Chevrolet  |        Camaro  |   Sporty  |    13.4000  |  15.1000  |    16.8000  |        19  |           28  |  Driver & Passenger  |        Rear  |          6  |      3.4000  |         160  |  4600  |          1805  |              Yes  |             15.5000  |           4  |     193  |        101  |     74  |           43  |              25  |            13  |    3240  |      USA  |         Chevrolet Camaro  |         1.47368421  |
    76  |       Pontiac  |    Bonneville  |    Large  |    19.4000  |  24.4000  |    29.4000  |        19  |           28  |  Driver & Passenger  |       Front  |          6  |      3.8000  |         170  |  4800  |          1565  |               No  |             18.0000  |           6  |     177  |        111  |     74  |           43  |            30.5  |            18  |    3495  |      USA  |       Pontiac Bonneville  |         1.47368421  |
    56  |         Mazda  |          RX-7  |   Sporty  |    32.5000  |  32.5000  |    32.5000  |        17  |           25  |         Driver only  |        Rear  |     rotary  |      1.3000  |         255  |  6500  |          2325  |              Yes  |             20.0000  |           2  |     169  |         96  |     69  |           37  |              NA  |            NA  |    2895  |  non-USA  |               Mazda RX-7  |         1.47058824  |
    18  |     Chevrolet  |      Corvette  |   Sporty  |    34.6000  |  38.0000  |    41.5000  |        17  |           25  |         Driver only  |        Rear  |          8  |      5.7000  |         300  |  5000  |          1450  |              Yes  |             20.0000  |           2  |     179  |         96  |     74  |           43  |              NA  |            NA  |    3380  |      USA  |       Chevrolet Corvette  |         1.47058824  |
    51  |       Lincoln  |      Town_Car  |    Large  |    34.4000  |  36.1000  |    37.8000  |        18  |           26  |  Driver & Passenger  |        Rear  |          8  |      4.6000  |         210  |  4600  |          1840  |               No  |             20.0000  |           6  |     219  |        117  |     77  |           45  |            31.5  |            22  |    4055  |      USA  |         Lincoln Town_Car  |         1.44444444  |

A Regression Example

The Morpheus API includes a regression interface in order to fit data to a linear model using either OLS, WLS or GLS. The code below uses the same car dataset introduced in the previous example, and regresses Horsepower on EngineSize. The code example prints the model results to standard out, which is shown below, and then creates a scatter chart with the regression line clearly displayed.

//Load the data
var data = DataFrame.read().csv(options -> {
    options.setResource("https://www.d3xsystems.com/public/data/samples/cars93.csv");
    options.setExcludeColumnIndexes(0);
});

//Run OLS regression and plot
var regressand = "Horsepower";
var regressor = "EngineSize";
data.regress().ols(regressand, regressor, true, model -> {
    IO.println(model);
    var xy = data.cols().select(regressand, regressor);
    Chart.create().withScatterPlot(xy, false, regressor, chart -> {
        chart.title().withText(regressand + " regressed on " + regressor);
        chart.subtitle().withText("Single Variable Linear Regression");
        chart.plot().style(regressand).withColor(Color.RED).withPointsVisible(true);
        chart.plot().trend(regressand).withColor(Color.BLACK);
        chart.plot().axes().domain().label().withText(regressor);
        chart.plot().axes().domain().format().withPattern("0.00;-0.00");
        chart.plot().axes().range(0).label().withText(regressand);
        chart.plot().axes().range(0).format().withPattern("0;-0");
        chart.show();
    });
    return Optional.empty();
});
==============================================================================================
                                   Linear Regression Results                                                            
==============================================================================================
Model:                                   OLS    R-Squared:                            0.5360
Observations:                             93    R-Squared(adjusted):                  0.5309
DF Model:                                  1    F-Statistic:                        105.1204
DF Residuals:                             91    F-Statistic(Prob):                  1.11E-16
Standard Error:                      35.8717    Runtime(millis)                           52
Durbin-Watson:                        1.9591                                                
==============================================================================================
   Index     |  PARAMETER  |  STD_ERROR  |  T_STAT   |   P_VALUE   |  CI_LOWER  |  CI_UPPER  |
----------------------------------------------------------------------------------------------
  Intercept  |    45.2195  |    10.3119  |   4.3852  |   3.107E-5  |    24.736  |   65.7029  |
 EngineSize  |    36.9633  |     3.6052  |  10.2528  |  7.573E-17  |    29.802  |   44.1245  |
==============================================================================================

UK House Price Trends

It is possible to access all UK residential real-estate transaction records from 1995 through to current day via the UK Government Open Data initiative. The data is presented in CSV format, and contains numerous columns, including such information as the transaction date, price paid, fully qualified address (including postal code), property type, lease type and so on.

Let us begin by writing a function to load these CSV files from Amazon S3 buckets, and since they are stored one file per year, we provide a parameterized function accordingly. Given the requirements of our analysis, there is no need to load all the columns in the file, so below we only choose to read columns at index 1, 2, 4, and 11. In addition, since the files do not include a header, we re-name columns to something more meaningful to make subsequent access a little clearer.

/**
 * Loads UK house price from the Land Registry stored in an Amazon S3 bucket
 * Note the data does not have a header, so columns will be named Column-0, Column-1 etc...
 * @param year      the year for which to load prices
 * @return          the resulting DataFrame, with some columns renamed
 */
private DataFrame<Integer,String> loadHousePrices(Year year) {
    var resource = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-%s.csv";
    return DataFrame.read().csv(options -> {
        options.setResource(String.format(resource, year.getValue()));
        options.setHeader(false);
        options.setCharset(StandardCharsets.UTF_8);
        options.setIncludeColumnIndexes(1, 2, 4, 11);
        options.getFormats().setParser("TransactDate", Parser.ofLocalDate("yyyy-MM-dd HH:mm"));
        options.setColumnNameMapping((colName, colOrdinal) -> {
            switch (colOrdinal) {
                case 0:     return "PricePaid";
                case 1:     return "TransactDate";
                case 2:     return "PropertyType";
                case 3:     return "City";
                default:    return colName;
            }
        });
    });
}

Below we use this data in order to compute the median nominal price (not inflation adjusted) of an apartment for each year between 1995 through 2014 for a subset of the largest cities in the UK. There are about 20 million records in the unfiltered dataset between 1993 and 2014, and while it takes a fairly long time to load and parse (approximately 3.5GB of data), Morpheus executes the analytical portion of the code in about 5 seconds (not including load time) on a standard Apple Macbook Pro purchased in late 2013. Note how we use parallel processing to load and process the data by calling results.rows().keys().parallel().

//Create a data frame to capture the median prices of Apartments in the UK'a largest cities
var results = DataFrame.ofDoubles(
    Range.of(1995, 2015).map(Year::of),
    Array.of("LONDON", "BIRMINGHAM", "SHEFFIELD", "LEEDS", "LIVERPOOL", "MANCHESTER")
);

//Process yearly data in parallel to leverage all CPU cores
results.rows().keys().parallel().forEach(year -> {
    System.out.printf("Loading UK house prices for %s...\n", year);
    var prices = loadHousePrices(year);
    prices.rows().select(row -> {
        //Filter rows to include only apartments in the relevant cities
        var propType = row.getValue("PropertyType");
        var city = (String)row.getValue("City");
        var cityUpperCase = city != null ? city.toUpperCase() : null;
        return propType != null && propType.equals("F") && results.cols().contains(cityUpperCase);
    }).rows().groupBy("City").forEach(0, (groupKey, group) -> {
        //Group row filtered frame so we can compute median prices in selected cities
        var city = (String)groupKey.item(0);
        var priceStat = group.col("PricePaid").stats().median();
        results.setDouble(year, city, priceStat);
    });
});

//Map row keys to LocalDates, and map values to be percentage changes from start date
var plotFrame = results.mapToDoubles(v -> {
    var firstValue = v.col().getDoubleAt(0);
    var currentValue = v.getDouble();
    return (currentValue / firstValue - 1d) * 100d;
}).rows().mapKeys(row -> {
    var year = row.key();
    return LocalDate.of(year.getValue(), 12, 31);
});

//Create a plot, and display it
Chart.create().withLinePlot(plotFrame, chart -> {
    chart.title().withText("Median Nominal House Price Changes");
    chart.title().withFont(new Font("Arial", Font.BOLD, 14));
    chart.subtitle().withText("Date Range: 1995 - 2014");
    chart.plot().axes().domain().label().withText("Year");
    chart.plot().axes().range(0).label().withText("Percent Change from 1995");
    chart.plot().axes().range(0).format().withPattern("0.##'%';-0.##'%'");
    chart.plot().style("LONDON").withColor(Color.BLACK);
    chart.legend().on().bottom();
    chart.show();
});

The percent change in nominal median prices for apartments in the subset of chosen cities is shown in the plot below. It shows that London did not suffer any nominal house price decline as a result of the Global Financial Crisis (GFC), however not all cities in the UK proved as resilient. What is slightly surprising is that some of the less affluent northern cities saw a higher rate of appreciation in the 2003 to 2006 period compared to London. One thing to note is that while London did not see any nominal price reduction, there was certainly a fairly severe correction in terms of EUR and USD since Pound Sterling depreciated heavily against these currencies during the GFC.

Visualization

Visualizing data in Morpheus DataFrames is made easy via a simple chart abstraction API with adapters supporting both JFreeChart as well as Google Charts (with others to follow by popular demand). This design makes it possible to generate interactive Java Swing charts as well as HTML5 browser based charts via the same programmatic interface. For more details on how to use this API, see the section on visualization here, and the code here.

Maven Artifacts

Morpheus is published to Maven Central so it can be easily added as a dependency in your build tool of choice. The codebase is split into modules to better manage external dependencies. The core library currently has only three external dependencies, and this will be reduced to zero after further modularization. The various Maven artifacts currently available are as follows:

Morpheus Core

The core library that contains Morpheus Arrays, DataFrames and other key interfaces & implementations.

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-core</artifactId>
    <version>1.0.31</version>
</dependency>

Morpheus Excel

An adapter to load Excel spreadsheets into a Morpheus DataFrame

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-excel</artifactId>
    <version>1.0.31</version>
</dependency>

Morpheus JSON

An adapter to read and write Morpheus DataFrames in an efficient JSON format leveraging the Google GSON library.

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-json</artifactId>
    <version>1.0.31</version>
</dependency>

Morpheus DB

An adapter to read and write Morpheus DataFrames against a SQL data store.

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-db</artifactId>
    <version>1.0.31</version>
</dependency>

Morpheus Visualization

A library of components to display Morpheus DataFrames in charts and tables.

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-viz</artifactId>
    <version>1.0.31</version>
</dependency>

Morpheus Quandl

An adapter to load data from Quandl into a Morpheus DataFrame

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-quandl</artifactId>
    <version>1.0.31</version>
</dependency>

Morpheus Worldbank

An adapter to load data from the Worldbank into a Morpheus DataFrame.

<dependency>
    <groupId>com.d3xsystems</groupId>
    <artifactId>d3x-morpheus-worldbank</artifactId>
    <version>1.0.31</version>
</dependency>

Q&A Forum

A Questions & Answers forum has been setup using Google Groups and is accessible here

Javadocs

Morpheus Javadocs can be accessed online here.

Build Status

A Continuous Integration build server can be accessed here, which builds code after each merge.

License

Morpheus is released under the Apache Software Foundation License Version 2.

com.d3xsystems

D3X Systems

Data-Driven Decision Execution Systems

Versions

Version
1.0.3
1.0.2