com.cldellow:manu-common

Utilities to manage timeseries data.

GroupId: com.cldellow
ArtifactId: manu-common
Last Version: 0.2.2
Type: jar
Description: Utilities to manage timeseries data.
Project URL: https://github.com/cldellow/manu

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.cldellow/manu-common/ -->
<dependency>
    <groupId>com.cldellow</groupId>
    <artifactId>manu-common</artifactId>
    <version>0.2.2</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.cldellow/manu-common/
implementation 'com.cldellow:manu-common:0.2.2'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.cldellow/manu-common/
implementation ("com.cldellow:manu-common:0.2.2")

Buildr:

'com.cldellow:manu-common:jar:0.2.2'

Ivy:

<dependency org="com.cldellow" name="manu-common" rev="0.2.2">
  <artifact name="manu-common" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='com.cldellow', module='manu-common', version='0.2.2')
)

SBT:

libraryDependencies += "com.cldellow" % "manu-common" % "0.2.2"

Leiningen:

[com.cldellow/manu-common "0.2.2"]

Dependencies

test (3)

Group / Artifact                            Type  Version
junit : junit                               jar   4.12
com.pholser : junit-quickcheck-core         jar   0.7
com.pholser : junit-quickcheck-generators   jar   0.7

Project Modules

There are no modules declared in this project.

Manu: "Mostly archived, not updated"

A time series storage format for integers and floats, using efficient delta encodings from FastPFOR.

Examples: pageviews by article in Wikipedia, stock open/close/high/low prices, weather temperatures.

Components

  • manu-format, a library for maintaining the data on disk
  • manu-cli, a command-line tool for ingesting data into the format
  • manu-serve, a web server to expose the data over REST

Design criteria

Priorities

  • Cheap
    • I'm doing this to drive a hobby project; my dream would be to host a variety of datasets for $10/month.
    • A Fermi estimate suggests the Wikipedia pageviews dataset has about 100B datapoints over the last 10 years. At that scale, storage costs will dominate.
  • Doesn’t need to be always-on
    • This follows from being cheap: the ability to load subsets of data, or to run on spot instances, will be a useful way to cut costs.
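
To make the storage claim concrete, here is a back-of-the-envelope sketch. Only the 100B figure comes from the estimate above. The 32 bits per raw value and the 8 bits per encoded value are assumptions chosen for illustration, not measurements of Manu, and the class and method names are hypothetical.

```java
class StorageEstimate {
    // Bytes needed to store `datapoints` values at a fixed `bitsPerValue`, rounded up.
    static long bytesFor(long datapoints, int bitsPerValue) {
        return (datapoints * bitsPerValue + 7) / 8;
    }

    public static void main(String[] args) {
        long datapoints = 100_000_000_000L; // the 100B Fermi estimate from the text

        // Assumption: one uncompressed 32-bit integer per datapoint.
        long raw = bytesFor(datapoints, 32);
        // Hypothetical: an encoding that averages 8 bits per datapoint.
        long packed = bytesFor(datapoints, 8);

        System.out.printf("raw:    %d GB%n", raw / 1_000_000_000L);
        System.out.printf("packed: %d GB%n", packed / 1_000_000_000L);
    }
}
```

Under these assumptions the raw data is about 400 GB, so the number of bits spent per datapoint is the main lever on cost.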

Non-priorities

  • Concurrent / fast writes
    • These can happen offline.
  • Fast reads
    • The Pareto principle will likely apply to queries: 1% of keys will get 99% of reads. We can use Varnish or similar to cache at the application level.

Assumptions

  • Dense datasets
    • Keys: if we see a key once, we expect to see it again.
    • Values: if key X has a datapoint at T1, we expect most other keys will as well.
  • Correlated values
    • The value for key X at T1 is likely related to its value at T2.
  • Some datasets can be lossy
    • Wikipedia pageviews, for example, are likely insensitive to lost precision so long as the trend is generally correct.
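
The "correlated values" assumption is what makes delta encoding pay off. The following is a minimal sketch in plain Java, not Manu's implementation and not the FastPFOR API: the class name, method names, and sample values are all hypothetical. It stores each value as the difference from its predecessor and then counts how many bits the largest difference needs.

```java
import java.util.Arrays;

class DeltaSketch {
    // Replace each value with its difference from the previous value.
    // The first value is kept as-is (difference from an implicit 0).
    static int[] deltaEncode(int[] values) {
        int[] deltas = new int[values.length];
        int previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Invert deltaEncode by keeping a running sum.
    static int[] deltaDecode(int[] deltas) {
        int[] values = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    // Zigzag maps signed integers to non-negative ones so that
    // small magnitudes stay small: 0, -1, 1, -2 become 0, 1, 2, 3.
    static int zigzag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Bit width of the largest zigzagged entry in xs[from..].
    static int bitsNeeded(int[] xs, int from) {
        int combined = 0;
        for (int i = from; i < xs.length; i++) {
            combined |= zigzag(xs[i]);
        }
        return 32 - Integer.numberOfLeadingZeros(combined);
    }

    public static void main(String[] args) {
        // Hypothetical daily pageviews for one key.
        int[] pageviews = {1000, 1004, 998, 1010, 1007};
        int[] deltas = deltaEncode(pageviews);

        System.out.println("deltas: " + Arrays.toString(deltas));
        // Skip index 0, which holds the full first value.
        System.out.println("bits per delta: " + bitsNeeded(deltas, 1));
        System.out.println("round trip ok: " + Arrays.equals(pageviews, deltaDecode(deltas)));
    }
}
```

For this sample the deltas after the first value fit in 5 bits each, where the raw values would need at least 10. Integer-packing schemes of the kind FastPFOR provides exploit exactly this sort of small, uniform bit width.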

Obligatory

Manu

Credit: Our Greatest Asset, Saturday Morning Breakfast Cereal

Versions

Version
0.2.2
0.2.1
0.2.0