org.brutusin:flea-db
A Java library for creating standalone, portable, schema-full object databases supporting pagination and faceted search, and offering strong-typed and generic APIs.
Built on top of Apache Lucene.
Main features:
- Schema-full/self-descriptive
- Simple and powerful API. Strong-typed and generic flavors
- High robustness. Record, field names and type validation.
- Pagination
- Faceted search
- In memory and persistent versions
Motivation
- Create a self-descriptive, highly robust library with a very simple API, aimed at indexing objects and providing advanced search capabilities, pagination and faceted search.
- Originally conceived for indexing raw data files and (almost) static data sets.
- Lucene is an extensive and powerful low-level library, but its API is not very easy to understand.
- Putting schemas into play, self-description can be used to simplify the API (field types), to provide strong validation mechanisms, and to enable the creation of flexible and generic downstream components.
- Lucene has a lot of experimental APIs that may (and often do) change over time. This library adds a level of indirection, providing a stable high-level interface. Upgrades of the underlying Lucene version are absorbed by flea-db.
Maven dependency
<dependency>
<groupId>org.brutusin</groupId>
<artifactId>flea-db</artifactId>
</dependency>
Click here to see the latest available version released to the Maven Central Repository.
If you are not using maven and need help you can ask here.
APIs
All flea-db functionality is defined by the FleaDB interface.
The library provides two implementations for it:
- A low-level generic implementation, GenericFleaDB.
- A high-level strong-typed implementation, ObjectFleaDB, built on top of the previous one.
GenericFleaDB
GenericFleaDB is the lowest-level flea-db implementation. It defines the database schema using a JSON schema, and stores and indexes records of type JsonNode. It uses Apache Lucene APIs and the org.brutusin:json SPI to maintain two different indexes (one for the terms and another for the taxonomy, see index structure), hiding the underlying complexity from the user.
This is how it works:
- On instantiation: A JsonSchema and an index folder are passed depending on whether the database is new and/or persistent. Then the JSON schema (passed, or read from the flea.json descriptor file of an existing database) is processed, looking for its index properties, and finally a database schema is created.
- On storing: The passed JsonNode record is validated against the JSON schema. Then a JsonTransformer instance (making use of the processed database schema) transforms the record into terms understandable by Lucene (documents, fields, facet fields ...) and finally the storage is delegated to the Lucene API.
- On commit: The underlying index and taxonomy writers are committed and searchers are refreshed to reflect the changes.
- On querying: The Query and Sort objects are transformed into terms understandable by Lucene making use of the database schema. The returned paginator is basically a wrapper around the underlying Lucene IndexSearcher and Query objects that lazily (on demand) performs searches on the index.
ObjectFleaDB
ObjectFleaDB is built on top of GenericFleaDB.
Basically, an ObjectFleaDB delegates all its functionality to a wrapped GenericFleaDB instance, making use of org.brutusin:json to perform the POJO<->JsonNode and Class<->JsonSchema transformations. This is the reason why all flea-db databases can be used with GenericFleaDB.
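The delegation described above can be sketched with plain Java collections. This is only an illustration of the wrapping idea, not the library's code: UntypedStore, TypedStore and the two converter functions are hypothetical names, and a Map stands in for JsonNode.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical untyped store: records are plain maps, standing in for JsonNode. */
final class UntypedStore {
    private final List<Map<String, Object>> records = new ArrayList<>();

    void store(Map<String, Object> record) {
        records.add(record);
    }

    List<Map<String, Object>> all() {
        return new ArrayList<>(records);
    }
}

/** Hypothetical typed facade: converts at the boundary and delegates everything else. */
final class TypedStore<E> {
    private final UntypedStore delegate;
    private final Function<E, Map<String, Object>> toGeneric;
    private final Function<Map<String, Object>, E> fromGeneric;

    TypedStore(UntypedStore delegate,
               Function<E, Map<String, Object>> toGeneric,
               Function<Map<String, Object>, E> fromGeneric) {
        this.delegate = delegate;
        this.toGeneric = toGeneric;
        this.fromGeneric = fromGeneric;
    }

    void store(E record) {
        delegate.store(toGeneric.apply(record));
    }

    List<E> all() {
        List<E> out = new ArrayList<>();
        for (Map<String, Object> generic : delegate.all()) {
            out.add(fromGeneric.apply(generic));
        }
        return out;
    }

    public static void main(String[] args) {
        UntypedStore raw = new UntypedStore();
        TypedStore<String> typed = new TypedStore<String>(
                raw,
                name -> Collections.<String, Object>singletonMap("name", name),
                generic -> (String) generic.get("name"));
        typed.store("flea");
        System.out.println(raw.all());   // the same data, seen through the untyped store
        System.out.println(typed.all()); // the same data, seen through the typed facade
    }
}
```

Because the typed facade keeps no state of its own, whatever it writes remains readable through the untyped store, which mirrors the remark that every database can also be used generically.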
Schema
JSON SPI
This library makes use of org.brutusin:json, so a JSON service provider like json-provider is needed at runtime. The chosen provider will determine JSON serialization, validation, parsing, schema generation and expression semantics.
JSON Schema extension
The standard JSON schema specification has been extended to declare indexable properties (the "index":"index" and "index":"facet" options).
See http://brutusin.org/json/json-schema-spec for more details.
Example:
{
  "type": "object",
  "properties": {
    "age": {
      "type": "integer",
      "index": "index"
    },
    "category": {
      "type": "string",
      "index": "facet"
    }
  }
}
- "index":"index": Means that the property is indexed by Lucene under a field whose name is set according to the rules explained in the nomenclature section.
- "index":"facet": Means that the property is indexed as in the previous case, but also a facet is created with this field name.
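As a rough illustration of how a schema can be scanned for these two options, the following sketch walks the example schema held as nested maps. It is not the library's implementation: the class IndexOptionScan is hypothetical, and the "$.name" field naming is an assumption modeled on the "$.id" expression used in the query examples of this document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a schema held as nested maps, scanned for "index" options. */
final class IndexOptionScan {

    /** Returns field id -> index option for every property that declares one. */
    static Map<String, String> scan(Map<String, Object> schema) {
        Map<String, String> found = new LinkedHashMap<>();
        collect("$", schema, found);
        return found;
    }

    @SuppressWarnings("unchecked")
    private static void collect(String path, Map<String, Object> schema, Map<String, String> found) {
        Object option = schema.get("index");
        if (option instanceof String) {
            found.put(path, (String) option);
        }
        Object properties = schema.get("properties");
        if (properties instanceof Map) {
            for (Map.Entry<String, Object> e : ((Map<String, Object>) properties).entrySet()) {
                if (e.getValue() instanceof Map) {
                    // Assumed naming: parent path, a dot, then the property name.
                    collect(path + "." + e.getKey(), (Map<String, Object>) e.getValue(), found);
                }
            }
        }
    }

    /** Builds the example schema of this section as nested maps. */
    static Map<String, Object> exampleSchema() {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("age", property("integer", "index"));
        properties.put("category", property("string", "facet"));
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("type", "object");
        root.put("properties", properties);
        return root;
    }

    private static Map<String, Object> property(String type, String index) {
        Map<String, Object> p = new LinkedHashMap<>();
        p.put("type", type);
        p.put("index", index);
        return p;
    }

    public static void main(String[] args) {
        System.out.println(scan(exampleSchema())); // {$.age=index, $.category=facet}
    }
}
```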
Annotations
See documentation in JSON SPI for supported annotations used in the strong-typed scenario.
Indexed fields nomenclature
Databases are self-descriptive: they provide information about their schema and indexed fields (via Schema).
Field semantics are inherited from the expression semantics defined in the schema specification http://brutusin.org/json/json-schema-spec
Indexation values
Suppose a JsonNode node is to be stored, and let fieldId be the expression identifying a database field, according to the previous section.
Expression exp = JsonCodec.getInstance().compile(fieldId);
JsonSchema fieldSchema = exp.projectSchema(rootSchema);
JsonNode fieldNode = exp.projectNode(node);
Then, the following rules apply to extract index and facet values for that field:
fieldSchema | index:index | index:facet |
---|---|---|
String | fieldNode.asString() | fieldNode.asString() |
Boolean | fieldNode.asString() | fieldNode.asString() |
Integer | fieldNode.asLong() | Unsupported |
Number | fieldNode.asDouble() | Unsupported |
Object | each of its property names | each of its property names |
Array | recurse for each of its elements | recurse for each of its elements |
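The rules in the table can be mimicked with plain Java values, which may help when reasoning about what ends up in the index. The sketch below is an approximation under stated assumptions: the class IndexValueRules is hypothetical, and it dispatches on the runtime Java type (String, Boolean, Integer/Long, other Number, Map, List), whereas the library decides based on the projected field schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Illustrative only: applies the extraction rules of the table to plain Java values. */
final class IndexValueRules {

    /** Values extracted for a field declared with "index":"index". */
    static List<Object> indexValues(Object node) {
        List<Object> out = new ArrayList<>();
        collect(node, false, out);
        return out;
    }

    /** Values extracted for a field declared with "index":"facet". */
    static List<Object> facetValues(Object node) {
        List<Object> out = new ArrayList<>();
        collect(node, true, out);
        return out;
    }

    private static void collect(Object node, boolean facet, List<Object> out) {
        if (node instanceof String || node instanceof Boolean) {
            out.add(node.toString());                      // String and Boolean: string form
        } else if (node instanceof Integer || node instanceof Long) {
            if (facet) {
                throw new UnsupportedOperationException("Integer fields cannot be facets");
            }
            out.add(((Number) node).longValue());          // Integer: long value
        } else if (node instanceof Number) {
            if (facet) {
                throw new UnsupportedOperationException("Number fields cannot be facets");
            }
            out.add(((Number) node).doubleValue());        // Number: double value
        } else if (node instanceof Map) {
            for (Object key : ((Map<?, ?>) node).keySet()) {
                out.add(key.toString());                   // Object: each property name
            }
        } else if (node instanceof List) {
            for (Object element : (List<?>) node) {
                collect(element, facet, out);              // Array: recurse per element
            }
        }
        // Any other value (including null) contributes nothing in this sketch.
    }

    public static void main(String[] args) {
        System.out.println(indexValues(Arrays.asList("a", true, 3, 2.5))); // [a, true, 3, 2.5]
    }
}
```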
Usage
Database persistence
Databases can be created in RAM or on disk, depending on the characteristics of the problem being addressed (performance, dataset size, indexation time ...).
To create a persistent database, use one of the constructors that take a File argument:
FleaDB<JsonNode> db1 = new GenericFleaDB(indexFolder, jsonSchema);
// or
FleaDB<Record> db2 = new ObjectFleaDB(indexFolder, Record.class);
NOTE: Multiple instances can be used to read the same persistent database (for example different concurrent JVM executions), but only one can hold the writing file-lock (claimed the first time a write method is called).
If a constructor without a File argument is used instead, the database is kept in RAM and lost at the end of the JVM execution:
FleaDB<JsonNode> db1 = new GenericFleaDB(jsonSchema);
// or
FleaDB<Record> db2 = new ObjectFleaDB(Record.class);
Write operations
The following operations perform modifications on the database.
Store
To store a record, use the store(...) method:
db1.store(jsonNode);
// or
db2.store(record);
Internally, this ends up calling addDocument in the underlying Lucene IndexWriter.
Delete
The API allows deleting a set of records using delete(Query q).
NOTE: Due to Lucene facet internals, categories are never deleted from the taxonomy index, even when they become orphaned.
Commit
Previous operations (store and delete) are not visible (and never will be) until commit() is called. Underlying searchers and writers are released, to be lazily created in further read or write operations.
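The visibility rule can be pictured with a small in-memory model, shown here only as a sketch of the semantics and not of how Lucene or flea-db implement them. StagedStore is a hypothetical class; it models store and commit only (delete is left out for brevity).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: writes are staged and become visible to readers on commit. */
final class StagedStore<E> {
    private final List<E> pending = new ArrayList<>();
    private List<E> visible = Collections.emptyList();

    /** Stages a record; readers do not see it yet. */
    void store(E record) {
        pending.add(record);
    }

    /** Publishes all staged records at once as a new read-only snapshot. */
    void commit() {
        List<E> next = new ArrayList<>(visible);
        next.addAll(pending);
        pending.clear();
        visible = Collections.unmodifiableList(next);
    }

    /** Returns the last committed snapshot. */
    List<E> query() {
        return visible;
    }

    public static void main(String[] args) {
        StagedStore<String> store = new StagedStore<>();
        store.store("a");
        System.out.println(store.query()); // [] (nothing committed yet)
        store.commit();
        System.out.println(store.query()); // [a]
    }
}
```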
Optimization
Databases can be optimized to achieve better performance by using optimize(). This method triggers a highly costly (in terms of free disk space and computation) merging of the Lucene index segments into a single one.
Nevertheless, this operation is useful for immutable databases, which can be optimized once before use.
Read operations
Two kinds of read operations can be performed, both supporting a Query argument that defines the search criteria.
Record queries
Record queries can be paginated and the ordering of the results can be specified via a Sort argument.
public E getSingleResult(final Query q)
public Paginator<E> query(final Query q)
public Paginator<E> query(final Query q, final Sort sort)
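The page arithmetic behind a paginator can be sketched as follows. This is a simplified stand-in, not the library's Paginator: ListPaginator is a hypothetical class that holds all hits in memory, whereas the returned paginator is described in this document as searching the index lazily. The method names getTotalPages and getPage, and the 1-based page numbering, mirror the usage shown in the examples of this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: 1-based pagination over an in-memory list of hits. */
final class ListPaginator<E> {
    private final List<E> hits;

    ListPaginator(List<E> hits) {
        this.hits = new ArrayList<>(hits);
    }

    /** Number of pages needed to show all hits, rounding up. */
    int getTotalPages(int pageSize) {
        requirePositive(pageSize);
        return (int) ((hits.size() + (long) pageSize - 1) / pageSize);
    }

    /** Returns the given page (first page is 1), or an empty list past the end. */
    List<E> getPage(int pageNum, int pageSize) {
        requirePositive(pageSize);
        if (pageNum < 1) {
            throw new IllegalArgumentException("pageNum must be >= 1");
        }
        long from = (long) (pageNum - 1) * pageSize;
        if (from >= hits.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, hits.size());
        return new ArrayList<>(hits.subList((int) from, to));
    }

    private static void requirePositive(int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
    }

    public static void main(String[] args) {
        ListPaginator<Integer> paginator = new ListPaginator<>(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(paginator.getTotalPages(2)); // 3
        System.out.println(paginator.getPage(3, 2));    // [5]
    }
}
```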
Facet queries
FacetResponse represents the faceting info returned by the database.
public List<FacetResponse> getFacetValues(final Query q, FacetMultiplicities activeFacets)
public List<FacetResponse> getFacetValues(final Query q, int maxFacetValues)
public List<FacetResponse> getFacetValuesStartingWith(String facetName, String prefix, Query q, int max)
public int getNumFacetValues(Query q, String facetName)
public double getFacetValueMultiplicity(String facetName, String facetValue, Query q)
Faceting is provided by lucene-facet.
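Conceptually, a facet multiplicity is the number of records matching the query that carry a given facet value. The sketch below computes such counts over an in-memory list. It is only an illustration of the concept: FacetCounter is a hypothetical class, and lucene-facet obtains these numbers from the taxonomy index rather than by scanning records.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Illustrative only: counts how many hits carry each facet value. */
final class FacetCounter {

    /** Returns facet value -> multiplicity, sorted by facet value. */
    static <E> Map<String, Integer> count(List<E> hits, Function<E, String> facetValue) {
        Map<String, Integer> counts = new TreeMap<>();
        for (E hit : hits) {
            String value = facetValue.apply(hit);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each string plays the role of a record whose "category" facet is the string itself.
        List<String> categories = Arrays.asList("book", "toy", "book");
        System.out.println(count(categories, c -> c)); // {book=2, toy=1}
    }
}
```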
Closing
Databases must be closed after use via the close() method, in order to free the resources and locks held. Closing a database makes it no longer usable.
Threading issues
Both implementations are thread safe and can be shared across multiple threads.
Index structure
Persistent flea-db databases create the following index structure:
/flea-db/
|-- flea.json
|-- record-index
| |-- ...
|-- taxonomy-index
| |-- ...
where flea.json is the database descriptor containing its schema, and the record-index and taxonomy-index subfolders are the underlying Lucene index structures.
ACID properties
flea-db offers the following ACID properties, inherited from those of Lucene:
- Atomicity: When changes are performed, and then committed, either all (if the commit succeeds) or none (if the commit fails) of them will be visible.
- Consistency: if the computer or OS crashes, or the JVM crashes or is killed, or power is lost, indexes will remain intact (ie, not corrupt).
- Isolation: Changes performed are not visible until committed.
- Durability: In the case of a persistent database, when the commit returns, all changes have been written to disk. If the JVM crashes, all changes will still be present in the index, even if the database was not properly closed.
Examples:
Generic API:
// Generic interaction with a previously created database
// (indexFolder and pageSize are assumed to be defined elsewhere)
FleaDB<JsonNode> db = new GenericFleaDB(indexFolder);
// Store records
JsonNode json = JsonCodec.getInstance().parse("...");
db.store(json);
db.commit();
// Query records
Query q = Query.createTermQuery("$.id", "0");
Paginator<JsonNode> paginator = db.query(q);
int totalPages = paginator.getTotalPages(pageSize);
for (int i = 1; i <= totalPages; i++) {
    List<JsonNode> page = paginator.getPage(i, pageSize);
    for (int j = 0; j < page.size(); j++) {
        JsonNode record = page.get(j);
        System.out.println(record);
    }
}
db.close();
Strong-typed API:
// Create object database
FleaDB<Record> db = new ObjectFleaDB(indexFolder, Record.class);
// Store records
for (int i = 0; i < REC_NO; i++) {
    Record r = new Record();
    // ... populate record
    db.store(r);
}
db.commit();
// Query records
Query q = Query.createTermQuery("$.id", "0");
Paginator<Record> paginator = db.query(q);
int totalPages = paginator.getTotalPages(pageSize);
for (int i = 1; i <= totalPages; i++) {
    List<Record> page = paginator.getPage(i, pageSize);
    for (int j = 0; j < page.size(); j++) {
        Record r = page.get(j);
        System.out.println(r);
    }
}
db.close();
See available test classes for more examples.
Main stack
This module would not be possible without:
- Apache Lucene.
- The following json-codec-jackson dependencies:
  - FasterXML/jackson stack: The underlying JSON stack.
  - com.fasterxml.jackson.module:jackson-module-jsonSchema: For Java class to JSON schema mapping.
  - com.github.fge:json-schema-validator: For validation against a JSON schema.
Lucene version
4.10.3 (Dec, 2014)
Support, bugs and requests
https://github.com/brutusin/flea-db/issues
Authors
- Ignacio del Valle Alles (https://github.com/idelvall/)
Contributions are always welcome and greatly appreciated!
License
Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0