parallel-collectors

A set of utilities for parallel collection processing in Java

License	Apache License, Version 2.0
GroupId	com.pivovarit
ArtifactId	parallel-collectors
Last Version	2.5.0
Release Date	Feb 11, 2021
Type	jar
Description	A set of utilities for parallel collection processing in Java
Source Code Management	https://github.com/pivovarit/parallel-collectors

Download parallel-collectors

Filename	Size
parallel-collectors-2.5.0.pom
parallel-collectors-2.5.0.jar	28 KB
parallel-collectors-2.5.0-sources.jar	8 KB
parallel-collectors-2.5.0-javadoc.jar	394 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.pivovarit/parallel-collectors/ -->
<dependency>
    <groupId>com.pivovarit</groupId>
    <artifactId>parallel-collectors</artifactId>
    <version>2.5.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.pivovarit/parallel-collectors/
implementation 'com.pivovarit:parallel-collectors:2.5.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.pivovarit/parallel-collectors/
implementation ("com.pivovarit:parallel-collectors:2.5.0")

Apache Buildr

'com.pivovarit:parallel-collectors:jar:2.5.0'

Apache Ivy

<dependency org="com.pivovarit" name="parallel-collectors" rev="2.5.0">
  <artifact name="parallel-collectors" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.pivovarit', module='parallel-collectors', version='2.5.0')
)

Scala SBT

libraryDependencies += "com.pivovarit" % "parallel-collectors" % "2.5.0"

Leiningen

[com.pivovarit/parallel-collectors "2.5.0"]

Dependencies

test (9)

Group / Artifact	Type	Version
org.openjdk.jmh : jmh-core	jar	1.27
org.openjdk.jmh : jmh-generator-annprocess	jar	1.27
org.slf4j : slf4j-simple	jar	1.7.30
org.junit.jupiter : junit-jupiter-engine	jar	5.7.1
org.junit.vintage : junit-vintage-engine	jar	5.7.1
org.assertj : assertj-core	jar	3.19.0
org.awaitility : awaitility	jar	4.0.3
com.tngtech.archunit : archunit-junit5-api	jar	0.16.0
com.tngtech.archunit : archunit-junit5-engine	jar	0.16.0

Project Modules

There are no modules declared in this project.

Java Stream API Parallel Collectors - overcoming limitations of standard Parallel Streams

Parallel Collectors is a toolkit easing parallel collection processing in Java using Stream API... but without limitations imposed by standard Parallel Streams.

list.stream()
  .collect(parallel(i -> foo(i), toList(), executor, parallelism))
    .orTimeout(1000, MILLISECONDS)
    .thenAcceptAsync(System.out::println, otherExecutor)
    .thenRun(() -> System.out.println("Finished!"));

They are:

lightweight (yes, you could achieve the same with Project Reactor, but that's often a hammer way too big for the job)
powerful (the combined power of Stream API and CompletableFutures lets you specify timeouts, compose with other CompletableFutures, or just perform the whole processing asynchronously)
configurable (it's possible to provide your own Executor, parallelism)
non-blocking (no need to block the calling thread while waiting for the result to arrive)
short-circuiting (if one of the operations raises an exception, remaining tasks will get interrupted)
non-invasive (they are just custom implementations of Collector interface, no magic inside, zero-dependencies)
versatile (missing an API for your use case? process the resulting Stream with the whole generosity of Stream API by reusing already available Collectors)

Maven Dependencies

<dependency>
    <groupId>com.pivovarit</groupId>
    <artifactId>parallel-collectors</artifactId>
    <version>2.5.0</version>
</dependency>

Gradle

implementation 'com.pivovarit:parallel-collectors:2.5.0'

Philosophy

Parallel Collectors are unopinionated by design, so it's up to their users to use them responsibly, which involves things like:

proper configuration of a provided Executor and its lifecycle management
choosing the appropriate parallelism level
making sure that the tool is applied in the right context

Make sure to read API documentation before using these in production.

Basic API

The main entrypoint is the com.pivovarit.collectors.ParallelCollectors class - which follows the convention established by java.util.stream.Collectors and features static factory methods returning custom java.util.stream.Collector implementations spiced up with parallel processing capabilities.

By design, it's obligatory to supply a custom Executor instance and manage its lifecycle.
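
Managing the lifecycle of a supplied Executor can be sketched with the JDK alone. The example below is only an illustration under stated assumptions: `foo` is a hypothetical stand-in for the real operation, the pool size of 4 is arbitrary, and plain CompletableFuture calls are used in place of the library's collectors so that the snippet runs without extra dependencies.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class ExecutorLifecycleSketch {

    // Hypothetical stand-in for a slow or blocking operation.
    static String foo(int i) {
        return "item-" + i;
    }

    static List<String> process(List<Integer> input) {
        // The caller creates the pool, so the caller is responsible for shutting it down.
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            // Submit every element first, then wait for the results in input order.
            List<CompletableFuture<String>> futures = input.stream()
                .map(i -> CompletableFuture.supplyAsync(() -> foo(i), executor))
                .collect(Collectors.toList());
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            // Always release the threads, even if a task failed.
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(process(List.of(1, 2, 3)));
    }
}
```

In a long-running application the pool would more likely be created once and shut down when the application stops, rather than per call as in this sketch.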

All parallel collectors are one-off and must not be reused.

Available Collectors:

CompletableFuture<Collection<T>> parallel(Function, Collector, Executor, parallelism)
CompletableFuture<Stream<T>> parallel(Function, Executor, parallelism)
Stream<T> parallelToStream(Function, Executor, parallelism)
Stream<T> parallelToOrderedStream(Function, Executor, parallelism)
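
To give a feel for the shape of the first signature, here is a plain-JDK sketch of "apply a function to every element on an Executor, then collect the results inside a CompletableFuture". It is not the library's implementation: `mapAsync` is a hypothetical helper name, it takes a List rather than acting as a Collector, and it ignores the parallelism limit.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class ParallelShapeSketch {

    // Hypothetical helper, NOT the library's code: runs the mapper for every
    // element on the executor and collects the results once all are done.
    static <T, R, RR> CompletableFuture<RR> mapAsync(
            List<T> input, Function<T, R> mapper, Collector<R, ?, RR> collector, Executor executor) {
        List<CompletableFuture<R>> futures = input.stream()
            .map(t -> CompletableFuture.supplyAsync(() -> mapper.apply(t), executor))
            .collect(Collectors.toList());
        // allOf completes when every future has completed, so join() below never blocks.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
            .thenApply(ignored -> futures.stream()
                .map(CompletableFuture::join)
                .collect(collector));
    }

    static List<Integer> demo() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            Function<Integer, Integer> timesTen = i -> i * 10;
            CompletableFuture<List<Integer>> future =
                mapAsync(List.of(1, 2, 3), timesTen, Collectors.toList(), executor);
            return future.join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```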

Batching Collectors

By default, all ExecutorService threads compete for each task separately, which results in a basic form of work-stealing. This is not free, but it can decrease the overall processing time when subtasks vary in duration.

However, if the processing time for all subtasks is similar, it might be better to distribute tasks in batches to avoid excessive contention.

Batching alternatives are available under the ParallelCollectors.Batching namespace.
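
The idea of batching can be sketched without the library: split the input into at most `parallelism` contiguous chunks and submit one task per chunk instead of one task per element. The `partition` and `mapInBatches` helpers below are hypothetical names used for illustration and are not taken from the ParallelCollectors.Batching code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class BatchingSketch {

    // Splits the input into at most `parallelism` contiguous batches of near-equal size.
    static <T> List<List<T>> partition(List<T> input, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        int size = input.size();
        int batchCount = Math.min(parallelism, size);
        int from = 0;
        for (int b = 0; b < batchCount; b++) {
            // The first (size % batchCount) batches take one extra element.
            int length = size / batchCount + (b < size % batchCount ? 1 : 0);
            batches.add(input.subList(from, from + length));
            from += length;
        }
        return batches;
    }

    // Submits one task per batch, so threads contend once per batch rather than once per element.
    static <T, R> List<R> mapInBatches(
            List<T> input, Function<T, R> mapper, int parallelism, Executor executor) {
        List<CompletableFuture<List<R>>> futures = new ArrayList<>();
        for (List<T> batch : partition(input, parallelism)) {
            futures.add(CompletableFuture.supplyAsync(
                () -> batch.stream().map(mapper).collect(Collectors.toList()), executor));
        }
        List<R> result = new ArrayList<>();
        for (CompletableFuture<List<R>> future : futures) {
            result.addAll(future.join());
        }
        return result;
    }

    static List<Integer> demo() {
        ExecutorService executor = Executors.newFixedThreadPool(3);
        try {
            Function<Integer, Integer> doubled = i -> i * 2;
            return mapInBatches(List.of(1, 2, 3, 4, 5, 6, 7), doubled, 3, executor);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(1, 2, 3, 4, 5, 6, 7), 3));
        System.out.println(demo());
    }
}
```

The trade-off is the one described above: fewer, larger tasks mean less contention, but a single slow element now delays its whole batch.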

Leveraging CompletableFuture

Parallel Collectors™ expose results wrapped in CompletableFuture instances, which provides great flexibility and makes it possible to work with them in a non-blocking fashion:

CompletableFuture<List<String>> result = list.stream()
  .collect(parallel(i -> foo(i), toList(), executor));

This makes it possible to conveniently apply callbacks, and compose with other CompletableFutures:

list.stream()
  .collect(parallel(i -> foo(i), toSet(), executor))
  .thenAcceptAsync(System.out::println, otherExecutor)
  .thenRun(() -> System.out.println("Finished!"));

Or call join() if you simply want to block the calling thread and wait for the result:

List<String> result = list.stream()
  .collect(parallel(i -> foo(i), toList(), executor))
  .join();

What's more, since JDK 9, you can easily provide your own timeout using CompletableFuture#orTimeout.
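
A minimal sketch of the timeout behaviour, using only the JDK (9 or newer). The 2-second sleep and the 100 ms limit are arbitrary values chosen for illustration, and a plain `supplyAsync` call stands in for a future returned by a parallel collector.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutSketch {

    static String outcome() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a slow computation that takes far longer than the limit.
            CompletableFuture<String> slow = CompletableFuture.supplyAsync(() -> {
                try {
                    Thread.sleep(2_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return "done";
            }, executor);

            // orTimeout completes the future exceptionally with a TimeoutException
            // if it has not completed within the given time.
            return slow.orTimeout(100, TimeUnit.MILLISECONDS).join();
        } catch (CompletionException e) {
            // join() wraps the original failure in a CompletionException.
            return e.getCause() instanceof TimeoutException ? "timed out" : "failed";
        } finally {
            // Interrupt the still-sleeping task so the pool can terminate promptly.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome());
    }
}
```

Note that orTimeout only completes the future; whether the underlying work is stopped depends on the code that produced it, which is why the sketch shuts the pool down explicitly.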

Examples

1. Apply `i -> foo(i)` in parallel on a custom `Executor` and collect to `List`

Executor executor = ...

CompletableFuture<List<String>> result = list.stream()
  .collect(parallel(i -> foo(i), toList(), executor));

2. Apply `i -> foo(i)` in parallel on a custom `Executor` with max parallelism of 4 and collect to `Set`

Executor executor = ...

CompletableFuture<Set<String>> result = list.stream()
  .collect(parallel(i -> foo(i), toSet(), executor, 4));

3. Apply `i -> foo(i)` in parallel on a custom `Executor` and collect to `LinkedList`

Executor executor = ...

CompletableFuture<List<String>> result = list.stream()
  .collect(parallel(i -> foo(i), toCollection(LinkedList::new), executor));

4. Apply `i -> foo(i)` in parallel on a custom `Executor` and stream results in completion order

Executor executor = ...

list.stream()
  .collect(parallelToStream(i -> foo(i), executor))
  .forEach(i -> ...);

5. Apply `i -> foo(i)` in parallel on a custom `Executor` and stream results in original order

Executor executor = ...

list.stream()
  .collect(parallelToOrderedStream(i -> foo(i), executor))
  .forEach(i -> ...);
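
The difference between completion order and original order (examples 4 and 5) can be illustrated with the JDK alone. This is a sketch of the two orderings, not of how parallelToStream or parallelToOrderedStream are implemented; `sleepAndReturn` is a hypothetical task whose duration equals its input, and the delays are chosen far enough apart to make the completion order predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class OrderingSketch {

    // Hypothetical task: sleeps for the given number of milliseconds, then returns it.
    static int sleepAndReturn(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return millis;
    }

    // Results come back in the order of the input, regardless of which task finishes first.
    static List<Integer> inOriginalOrder(List<Integer> delays, ExecutorService executor) {
        List<CompletableFuture<Integer>> futures = delays.stream()
            .map(d -> CompletableFuture.supplyAsync(() -> sleepAndReturn(d), executor))
            .collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    // Results come back as soon as each task finishes.
    static List<Integer> inCompletionOrder(List<Integer> delays, ExecutorService executor) {
        CompletionService<Integer> completionService = new ExecutorCompletionService<>(executor);
        delays.forEach(d -> completionService.submit(() -> sleepAndReturn(d)));
        List<Integer> results = new ArrayList<>();
        for (int i = 0; i < delays.size(); i++) {
            try {
                results.add(completionService.take().get());
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    // Runs the given body on a three-thread pool and shuts the pool down afterwards.
    static List<Integer> withPool(Function<ExecutorService, List<Integer>> body) {
        ExecutorService executor = Executors.newFixedThreadPool(3);
        try {
            return body.apply(executor);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> delays = List.of(400, 200, 10);
        System.out.println(withPool(e -> inOriginalOrder(delays, e)));
        System.out.println(withPool(e -> inCompletionOrder(delays, e)));
    }
}
```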

Rationale

Stream API is a great tool for collection processing, especially if you need to parallelize execution of CPU-intensive tasks, for example:

public static void parallelSetAll(int[] array, IntUnaryOperator generator) {
    Objects.requireNonNull(generator);
    IntStream.range(0, array.length).parallel().forEach(i -> { array[i] = generator.applyAsInt(i); });
}

However, Parallel Streams execute tasks on a shared ForkJoinPool instance.

Unfortunately, it's not the best choice for running blocking operations, even when using ManagedBlocker (as explained here by Tagir Valeev). This could easily lead to the saturation of the common pool and to performance degradation of everything that uses it.

For example:

List<String> result = list.parallelStream()
  .map(i -> foo(i)) // runs implicitly on ForkJoinPool.commonPool()
  .collect(Collectors.toList());

In order to avoid such problems, the solution is to isolate blocking tasks and run them on a separate thread pool... but there's a catch.

Sadly, Streams can only run parallel computations on the common ForkJoinPool, which effectively restricts their applicability to CPU-bound jobs.

However, there's a trick that allows running parallel Streams in a custom FJP instance... but it's not considered reliable:

Note, however, that this technique of submitting a task to a fork-join pool to run the parallel stream in that pool is an implementation "trick" and is not guaranteed to work. Indeed, the threads or thread pool that is used for execution of parallel streams is unspecified. By default, the common fork-join pool is used, but in different environments, different thread pools might end up being used.

Says Stuart Marks on StackOverflow.

On top of that, this approach was seriously flawed before JDK 10: if a Stream was targeted at another pool, splitting would still adhere to the parallelism of the common pool, and not that of the targeted pool [JDK8190974].
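
For reference, the "trick" discussed above usually looks like the sketch below: the parallel stream is started from inside a task submitted to a custom ForkJoinPool. As the quote says, this relies on unspecified behaviour, so treat it as an illustration of the technique rather than a recommendation. The pool size of 4 and the sum-of-squares workload are arbitrary choices.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

class CustomPoolTrickSketch {

    static int sumOfSquares(int upToInclusive) {
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            // The parallel stream is created and consumed inside a task of the custom pool.
            // In current OpenJDK builds this makes the stream use that pool, but the
            // behaviour is not guaranteed by the Stream API specification.
            Callable<Integer> task = () ->
                IntStream.rangeClosed(1, upToInclusive)
                    .parallel()
                    .map(i -> i * i)
                    .sum();
            return pool.submit(task).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10));
    }
}
```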

Dependencies

None - the library is implemented using core Java libraries.

Limitations

Upstream Stream is always evaluated as a whole, even if the following operation is short-circuiting. This means that none of these should be used for working with infinite streams.

This limitation is imposed by the design of the Collector API.
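
The effect is easy to observe with any collector. The sketch below uses the standard Collectors.toList() as a stand-in (an assumption made so the example needs no extra dependency) and counts how many upstream elements are pulled when a short-circuiting findFirst() follows the collector, compared with a fully lazy pipeline.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class EagerCollectorSketch {

    // collect() is a terminal operation: it drains the whole upstream
    // before the downstream findFirst() even starts.
    static int pulledViaCollector() {
        AtomicInteger pulled = new AtomicInteger();
        Stream.of(1, 2, 3, 4, 5)
            .peek(i -> pulled.incrementAndGet())
            .collect(Collectors.toList())
            .stream()
            .findFirst();
        return pulled.get();
    }

    // Without a collector in between, findFirst() short-circuits after one element.
    static int pulledLazily() {
        AtomicInteger pulled = new AtomicInteger();
        Stream.of(1, 2, 3, 4, 5)
            .peek(i -> pulled.incrementAndGet())
            .findFirst();
        return pulled.get();
    }

    public static void main(String[] args) {
        System.out.println("via collector: " + pulledViaCollector());
        System.out.println("lazily: " + pulledLazily());
    }
}
```

With an infinite upstream, the first variant would never finish, which is why these collectors should not be used with infinite streams.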

Good Practices

Consider providing reasonable timeouts for CompletableFutures so that you do not block for an unreasonably long time when something goes wrong (how-to)
Name your thread pools - it makes debugging easier (how-to)
Limit the size of a working queue of your thread pool (source)
Limit the level of parallelism (source)
A no-longer-used ExecutorService should be shut down to allow reclamation of its resources
Keep in mind that CompletableFuture#then(Apply|Combine|Compose|Run|Accept) might be executed by the calling thread. If this is not suitable, use CompletableFuture#then(Apply|Combine|Compose|Run|Accept)Async instead, and provide a custom executor instance.
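
Several of these practices can be combined in a short JDK-only sketch: a thread pool with named threads and a bounded work queue, plus an Async callback pinned to a dedicated executor. The `newPool` helper, the thread-name prefixes, and the pool and queue sizes are all hypothetical choices made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class NamedBoundedPoolSketch {

    // Hypothetical helper: a fixed-size pool with named threads and a bounded queue.
    static ThreadPoolExecutor newPool(String prefix, int threads, int queueCapacity) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            // Named threads are much easier to spot in thread dumps and logs.
            Thread thread = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            thread.setDaemon(true);
            return thread;
        };
        return new ThreadPoolExecutor(
            threads, threads,
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),   // bounded: excess work is rejected
            factory,
            new ThreadPoolExecutor.AbortPolicy());     // rejects with RejectedExecutionException
    }

    // Returns the name of the thread that ran the callback.
    static String callbackThreadName() {
        ThreadPoolExecutor workers = newPool("worker", 2, 16);
        ThreadPoolExecutor callbacks = newPool("callback", 1, 16);
        try {
            return CompletableFuture.supplyAsync(() -> "value", workers)
                // The Async variant with an explicit executor never runs on the calling thread.
                .thenApplyAsync(value -> Thread.currentThread().getName(), callbacks)
                .join();
        } finally {
            // A no-longer-used pool is shut down so its threads can be reclaimed.
            workers.shutdown();
            callbacks.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(callbackThreadName());
    }
}
```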

Words of Caution

Even if this tool makes it easy to parallelize things, it doesn't always mean that you should. Parallelism comes with a price that can often be higher than not using it at all. Threads are expensive to create, maintain and switch between, and you can only create a limited number of them.

It's essential to follow up on the root cause and double-check if parallelism is the way to go.

It often turns out that the root cause can be addressed by using a simple JOIN statement, batching, reorganizing your data... or even just by choosing a different API method.

See CHANGELOG.MD for a complete version history.

Versions

Version
2.5.0 Feb 11, 2021
2.4.1 Nov 22, 2020
2.4.0 Sep 26, 2020
2.3.3 Apr 25, 2020
2.3.2 Feb 27, 2020
2.3.1 Feb 22, 2020
2.3.0 Feb 9, 2020
2.2.0 Jan 19, 2020
2.1.0 Dec 11, 2019
2.0.0 Nov 20, 2019
1.2.1 Nov 17, 2019
1.2.0 Nov 16, 2019
1.1.0 Jun 24, 2019
1.0.3 Jun 9, 2019
1.0.2 Jun 3, 2019
1.0.1 May 2, 2019
1.0.0 Apr 30, 2019
0.3.0 Apr 5, 2019
0.2.0 Mar 30, 2019
0.1.2 Mar 26, 2019
0.1.1 Mar 24, 2019
0.1.0 Mar 17, 2019
0.0.3 Feb 21, 2019
0.0.2 Feb 2, 2019
0.0.1 Jan 31, 2019
0.0.1-RC3 Jan 29, 2019
0.0.1-RC2 Jan 28, 2019
0.0.1-RC1 Jan 27, 2019
