Written by Poogle
on August 08, 2022

Modern Java in Action - Ch.6(2)

참고: 책 - Modern Java in Action

책 Modern Java in Action을 읽고 정리합니다. 이번 포스트에서는 Ch 6.4 ~ Ch 6.6의 내용을 읽고 정리합니다.

Ch 6. 스트림으로 데이터 수집
6.4 분할
- 6.4.1 분할의 장점
- 6.4.2 숫자를 소수와 비소수로 분할하기
6.5 Collector 인터페이스
- 6.5.1 Collector 인터페이스의 메서드 살펴보기
- 6.5.2 응용하기
6.6 커스텀 컬렉터를 구현해서 성능 개선하기
- 6.6.1 소수로만 나누기
- 6.6.2 컬렉터 성능 비교

6.4 분할

분할은 분할 함수(partitioning function)라 불리는 predicate를 분류 함수로 사용하는 특수한 그룹화 기능입니다. 분할 함수는 Boolean을 반환하므로 맵의 키 형식이 Boolean입니다.

ex. 채식 요리와 채식이 아닌 요리로 나누는 예제입니다.

Map<Boolean, List<Dish>> partitionedMenu = 
        menu.stream()
            .collect(partitioningBy(Dish::isVegetarian)); // 분할 함수

List<Dish> vegetarianDishes = 
        menu.stream()
            .filter(Dish::isVegetarian).collect(toList());  // 메뉴 리스트로 생성한 스트림을 필터링하고 수집해도 가능

6.4.1 분할의 장점

분할 함수가 반환하는 참, 거짓 두 가지 요소의 스트림 리스트를 모두 유지한다는 것이 분할의 장점입니다. 컬렉터를 두 번째 인수로 전달할 수 있는 오버로드된 버전의 partitioningBy 메서드도 있습니다.

ex. 재료 별로 채식 요리와 채식이 아닌 요리로 나누는 예제입니다.

Map<Boolean, Map<Dish.Type, List<Dish>>> vegetarianDishesByType = 
        menu.stream()
            .collect(partitioningBy(Dish::isVegetarian, groupingBy(Dish::getType)));

ex. 채식 요리와 채식이 아닌 요리 각각 그룹에서 가장 칼로리가 높은 요리를 찾는 예제입니다.

Map<Boolean, Dish> mostCaloricPartitionedByVegetarian = 
        menu.stream()
            .collect(partitioningBy(Dish::isVegetarian, collectingAndThen(maxBy(comparingInt(Dish::getCalories)), Optional::get)));

6.4.2 숫자를 소수와 비소수로 분할하기

ex. 정수 n을 인수로 받아서 2에서 n까지의 자연수를 소수와 비소수로 나누는 프로그램을 구현합니다.

public boolean isPrime(int candidate) {
    int candidateRoot = (int) Math.sqrt((double)candidate); //제곱근 이하의 수로 제한
    return IntStream.rangeClosed(2, candidateRoot) //2부터 candidateRoot미만 사이의 자연수 생성
                    .noneMatch(i -> candidate % i == 0); //스트림의 모든 정수로 candidate를 나눌 수 없으면 true 반환
}

주어진 수가 소수인지 아닌지 판단하는 predicate를 구현합니다.

public Map<Boolean, List<Integer>> partitionPrimes(int n) {
    return IntStream.rangeClosed(2, n).boxed()
        .collect(partitioningBy(candidate -> isPrime(candidate)));
}

isPrime()을 predicate로 이용하고 partitioningBy 컬렉터로 reduce해서 소수와 비소수로 분류합니다.

Collectors 클래스의 정적 팩토리 메서드

toList: 스트림의 모든 항목을 리스트로 수집
- 반환 형식: List<T>
- ex. List<Dish> dishes = menuStream.collect(toList());
toSet: 스트림의 모든 항목을 중복이 없는 집합으로 수집
- 반환 형식: Set<T>
- ex. Set<Dish> dishes = menuStream.collect(toSet());
toCollection: 스트림의 모든 항목을 발행자가 제공하는 컬렉션으로 수집
- 반환 형식: Collection<T>
- ex. Collection<Dish> dishes = menuStream.collect(toCollection(), ArryList::new);
counting: 스트림의 항목 수 계산
- 반환 형식: Long
- ex. long howManyDishes = menuStream.collect(counting());
summingInt: 스트림의 항목에서 정수 프로퍼티값을 더함
- 반환 형식: Integer
- ex. int totalCalories = menuStream.collect(summingInt(Dish::getCalories));
averagingInt: 스트림 항목의 정수 프로퍼티의 평균값 계산
- 반환 형식: Double
- ex. double avgCalories = menuStream.collect(averagingInt(Dish::getCalories));
summarizingInt: 스트림 내 항목의 최댓값, 최솟값, 합계, 평균 등의 정수 정보 통계 수집
- 반환 형식: IntSummaryStatistics
- ex. IntSummaryStatistics menuStatistics = menuStream.collect(summarizingInt(Dish::getCalories));
joining: 스트림의 각 항목에 toString 메서드를 호출한 결과 문자열 연결
- 반환 형식: String
- ex. String shortMenu = menuStream.map(Dish::getName).collect(joining(","));
maxBy: 주어진 비교자를 이용해서 스트림의 최댓값 요소를 Optional로 감싼 값을 반환, 스트림에 요소가 없을 때는 Optional.empty() 반환
- 반환 형식: Optional<T>
- ex. Optional<Dish> fattest = menuStream.collect(maxBy(comparingInt(Dish::getCalories)));
minBy: 주어진 비교자를 이용해서 스트림의 최솟값 요소를 Optional로 감싼 값을 반환, 스트림에 요소가 없을 때는 Optional.empty() 반환
- 반환 형식: Optional<T>
- ex. Optional<Dish> lightest = menuStream.collect(minBy(comparingInt(Dish::getCalories)));
reducing: 누적자를 초깃값으로 설정한 다음에 BinaryOperator로 스트림의 각 요소를 반복적으로 누적자와 합쳐 스트림을 하나의 값으로 리듀싱
- 반환 형식: The type produced by the reduction operation
- ex. int totalCalories = menuStream.collect(reducing(0, Dish::getCalories, Integer::sum));
collectingAndThen: 다른 컬렉터를 감싸고 그 결과에 변환 함수 적용
- 반환 형식: The type returned by the transforming function
- ex. int howManyDishes = menuStream.collect(collectingAndThen(toList(), List::size));
groupingBy: 하나의 프로퍼티값을 기준으로 스트림의 항목을 그룹화하며 기준 프로퍼티값을 결과 맵의 키로 사용
- 반환 형식: Map<K, List<T>>
- ex. Map<Dish.Type, List<Dish>> dishesByType = menuStream.collect(groupingBy(Dish::getType));
partitioningBy: 프레디케이트를 스트림의 각 항목에 적용한 결과로 항목 분할
- 반환 형식: Map<Boolean, List<T>>
- ex. Map<Boolean, List<Dish>> vegetarianDishes = menuStream.collect(partitioningBy(Dish::isVegetarian));

6.5 Collector 인터페이스

Collector 인터페이스는 리듀싱 연산(즉, 컬렉터)을 어떻게 구현할지 제공하는 메서드 집합으로 구성됩니다.

public interface Collector<T, A, R> {
    Supplier<A> supplier();
    BiConsumer<A, T> accumulator();
    Function<A, R> finisher();
    BinaryOperator<A> combiner();
    Set<Characteristics> characteristics();
}

T는 수집될 항목의 제네릭 형식입니다.
A는 누적자, 즉 수집 과정에서 중간 결과를 누적하는 객체의 형식입니다.
R은 수집 연산 결과 객체의 형식(대개 컬렉션 형식)입니다.

6.5.1 Collector 인터페이스의 메서드 살펴보기

supplier 메서드: 새로운 결과 컨테이너 만들기

public Supplier<List<T>> supplier() {
    return () -> new ArrayList<T>();
}

//생성자 참조 전달
public Supplier<List<T>> supplier() {
    return ArrayList::new;
}

supplier 메서드는 빈 결과로 이루어진 Supplier를 반환해야 합니다.
즉, supplier는 수집 과정에서 빈 누적자 인스턴스를 만드는 파라미터가 없는 함수입니다.
ToListCollector처럼 누적자를 반환하는 컬렉터에서는 빈 누적자가 비어있는 스트림의 수집 과정의 결과가 될 수 있습니다.

accumulator 메서드: 결과 컨테이너에 요소 추가하기

public BiConsumer<List<T>, T> accumulator() {
    return(list, item) -> list.add(item);
}

//메서드 참조
public BiConsumer<List<T>, T> accumulator() {
    return List::add;
}

accumulator메서드는 리듀싱 연산을 수행하는 함수를 반환합니다.
스트림에서 n번째 요소를 탐색할 때 두 인수, 즉 누적자(스트림의 첫 n-1개 항목을 수집한 상태)와 n번째 요소를 함수에 적용합니다.
함수의 반환값은 void, 즉 요소를 탐색하면서 적용하는 함수에 의해 누적자 내부 상태가 바뀌므로 누적자가 어떤 값일지 단정할 수 없습니다.

finisher 메서드: 최종 변환값을 결과 컨테이너로 적용하기

public Function<List<T>, List<T>> finisher() {
    return Function.identity();
}

finisher 메서드는 스트림 탐색을 끝내고 누적자 객체를 최종 결과로 변환하면서 누적 과정을 끝낼 때 호출할 함수를 반환해야 합니다.

순차 리듀싱 과정의 논리적 순서

실제로는 collect가 동작하기 전에 다른 중간 연산과 파이프라인을 구성할 수 있게 해줘야 하고, 병렬 실행 등도 고려해야 하므로 스트림 리듀싱 기능 구현은 생각보다 복잡합니다.

combiner 메서드: 두 결과 컨테이너 병합

public BinaryOperator<List<T>> combiner() {
    return (list1, list2) -> {
        list1.addAll(list2);
        return list1;
    }
}

combiner는 스트림의 서로 다른 서브파트를 병렬로 처리할 때 누적자가 이 결과를 어떻게 처리할지 정의합니다.
스트림의 두 번째 서브 파트에서 수집한 항목 리스트를 첫 번째 서브파트 결과 리스트 뒤에 추가하면 됩니다.
combiner를 통해 스트림의 리듀싱을 병렬로 수행할 수 있습니다.
1. 스트림을 분할해야 하는지 정의하는 조건이 거짓을 바뀌기 전까지 원래 스트림을 재귀적으로 분할합니다.
  - 보통 분산된 작업의 크기가 너무 작아지면 병렬 수행의 속도는 순차 수행의 속도보다 느려집니다. -> 병렬 수행의 효과가 상쇄!!
  - 일반적으로 프로세싱 코어의 개수를 초과하는 병렬 작업은 효율적이지 않습니다.
2. 모든 서브 스트림의 각 요소에 리듀싱 연산을 순차적으로 적용해서 서브스트림을 병렬로 처리할 수 있습니다.
3. 마지막에는 컬렉터의 combiner 메서드가 반환하는 함수로 모든 부분결과를 쌍으로 합칩니다. 즉, 분할된 모든 서브스트림의 결과를 합치면서 연산이 완료됩니다.

Characteristics 메서드

Characteristics는 스트림을 병렬로 리듀스할 것인지 그리고 병렬로 리듀스한다면 어떤 최적화를 선택해야 할지 힌트를 제공합니다.
UNORDERED: 리듀싱 결과는 스트림 요소의 방문 순서나 누적 순서에 영향을 받지 않습니다.
CONCURRENT: 다중 스레드에서 accumulator 함수를 동시에 호출할 수 있으며 이 컬렉터는 스트림의 병렬 리듀싱을 수행할 수 있습니다. 컬렉터의 플래그에 UNORDERED를 함께 설정하지 않았다면 데이터 소스가 정렬되어 있지 않은(즉, 집합처럼 요소의 순서가 무의미한) 상황에서만 병렬 리듀싱을 수행할 수 있습니다.
IDENTITY_FINISH: finisher메서드가 반환하는 함수는 단순히 identity를 적용할 뿐이므로 이를 생략할 수 있습니다. 따라서 리듀싱 과정의 최종결과로 누적자 객체를 바로 사용할 수 있습니다. 또한 누적자 A를 결과 R로 안전하게 형변환할 수 있습니다.

6.5.2 응용하기

ex. 커스텀 ToListCollector를 구현하는 예제입니다.

import java.util.*;
import java.util.function.*;
import java.util.stream.Collector;
import static java.util.stream.Collector.Characteristics.*;

public class ToListCollector<T> implements Collector<T, List<T>, List<T>> {
    @Override
    public Supplier<List<T>> supplier() {
        return ArrayList::new; //수집 연산의 시발점
    }
    
    @Override
    public BiConsumer<List<T>, T> accumulator() {
        return List::add; //탐색한 항목을 누적하고 바로 누적자를 고침
    }
    
    @Override
    public Function<List<T>, List<T>> finisher() {
        return Function.identity(); //항등 함수
    }
    
    @Override
    public BinaryOperator<List<T>> combiner() {
        return (list1, list2) -> { //두 번째 콘텐츠와 합쳐서 첫 번째 누적자를 고침
            list1.addAll(list2); //변경된 첫 번째 누적자를 반환
            return list1;
        };
    }
    
    @Override
    public Set<Characterisics> characteristics() {
        return Collections.unmodifiableSet(EnumSet.of(IDENTITY_FINISH, CONCURRENT));
        return ArrayList::new;
    }
}

6.6 커스텀 컬렉터를 구현해서 성능 개선하기

ex. 커스텀 컬렉터로 n까지의 자연수를 소수와 비소수로 분할하는 예제입니다.

//Before

public Map<Boolean, List<Integer>> partitionPrimes(int n) {
    return IntStream.rangeClosed(2, n).boxed()
                    .collect(partitioningBy(candidate -> isPrime(candidate)));
}

//Before

public boolean isPrime(int candidate) {
    int candidateRoot = (int) Math.sqrt((double)candidate); //제곱근 이하의 수로 제한
    return IntStream.rangeClosed(2, candidateRoot) //2부터 candidateRoot미만 사이의 자연수 생성
                    .noneMatch(i -> candidate % i == 0); //스트림의 모든 정수로 candidate를 나눌 수 없으면 true 반환
}

6.6.1 소수로만 나누기

소수로 나누어 떨어지는지 확인해서 대상의 범위를 좁힐 수 있는데 약수(divisor)가 소수여야 하기 때문에 현재 숫자 이하에서 발견한 소수로 약수(제수)를 제한합니다.
주어진 숫자가 소수인지 아닌지 판단해야 하는데 기존의 컬렉터로는 컬렉터로는 컬렉터 수집 과정에서 부분결과(지금까지 발견한 소수 리스트)에 접근할 수 없습니다.
성능을 개선하기 위해 커스텀 컬렉터를 이용합니다.

//After - 중간 결과 리스트를 전달하도록 구현

public boolean isPrime(List<Integer> primes, int candidate) {
    return primes.stream().noneMatch(i -> candidate % i == 0);
}

대상 숫자의 제곱근보다 작은 소수만 사용하도록 최적화 해야하는데 스트림 API에는 이런 기능을 제공하는 메서드가 없습니다.
filter(p -> p <= candidateRoot)를 이용해서 대상의 제곱근보다 작은 소수를 필터링할 수 있지만, 결국 filter는 전체 스트림을 처리한 다음에 결과를 반환하게 되어 소수 리스트와 대상 숫자의 범위가 아주 크다면 성능 문제가 발생할 수 있습니다.
대상의 제곱보다 큰 소수를 찾으면 검사를 중단함으로써 성능 문제를 없앨 수 있습니다.

//After - 중간 결과 리스트를 전달하도록 구현 & 제곱근 이하의 수로 제한

public boolean isPrime(List<Integer> primes, int candidate) {
    int candidateRoot = (int) Math.sqrt((double)candidate); //제곱근 이하의 수로 제한
    return primes.stream()
                 .takeWhile(i -> i <= candidateRoot)
                 .noneMatch(i -> candidate % i == 0);
}

정렬된 리스트와 프레디케이트를 인수로 받아 리스트의 첫 요소에서 시작해서 프레디케이트를 만족하는 가징 긴 요소로 이루어진 리스트를 반환하는 takeWhile 메서드를 구현합니다.

참고 - Java 8에서 takeWhile(Java 9부터 지원) 흉내내기

정렬된 리스트와 프레디케이트를 인수로 받아 프레디케이트를 만족하는 가징 긴 첫 요소 리스트를 반환하도록 구현합니다.

public static <A> List<A> takeWhile(List<A> list, Predicate<A> p) {
    int i = 0;
    for (A item : list) {
        if (!p.test(item)) { //리스트의 현재 항목이 프레디케이트를 만족하는지 확인
            return list.subList(0, i); //만족하지 않으면 현재 검사한 항목의 이전 항목 하위 리스트를 반환
        }
        i++;
    }
    return list; //리스트의 모든 항목이 프레디케이트를 만족하므로 리스트 자체를 반환
}

1단계: Collector 클래스 시그니처 정의

Collector 인터페이스 정의

public interface Collector<T, A, R>
/*
    T: 스트림 요소의 형식
    A: 중간 결과를 누적하는 객체 형식
    R: collect 연산의 최종 결과 형식
 */

public class PrimeNumbersCollector implements Collector<Integer, 
        Map<Boolean, List<Integer>>, //누적자 형식
        Map<Boolean, List<Integer>>> { //수집 연산의 결과 형식
}

2단계: 리듀싱 연산 구현

Collector 인터페이스 메서드 구현

public class PrimeNumbersCollector implements Collector<Integer, 
        Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> {
    
    //supplier - 누적자를 만드는 함수를 반환
    public Supplier<Map<Boolean, List<Integer>>> supplier() {
        return () -> new HashMap<Boolean, List<Integer>>() { {
            put(true, new ArrayList<Integer>());
            put(false, new ArrayList<Integer>());
        } };
    }
    
    //accumulator - 스트림 요소를 어떻게 수집할지 결정
    public BiConsumer<Map<Boolean, List<Integer>>, Integer> accumulator() {
        return (Map<Boolean, List<Integer>> acc, Integer candidate) -> {
            acc.get( isPrime(acc.get(true), candidate) )
                    .add(candidate);
        };
    }
}

3단계: 병렬 실행할 수 있는 컬렉터 만들기(가능하다면)

두 번째 맵의 소수 리스트와 비소수 리스트의 모든 수를 첫 번째 맵에 추가하는 연산을 구현합니다.

public class PrimeNumbersCollector implements Collector<Integer, 
        Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> {
    
    public BinaryOperator<Map<Boolean, List<Integer>>> combiner() {
        return (Map<Boolean, List<Integer>> map1, Map<Boolean, List<Integer>> map2) -> {
            map1.get(true).addAll(map2.get(true));
            map1.get(false).addAll(map2.get(false));
            return map1;
        };
    }
}

참고로 알고리즘 자체가 순차적이라 컬렉터를 실제 병렬로 사용할 수는 없습니다.
=> combiner는 호출될 일이 없기 때문에 빈 구현으로 남겨도 괜찮습니다! (또는 UnsupportedOperationException 던지도록 구현)

4단계: finisher 메서드와 컬렉터의 characteristics 메서드

public class PrimeNumbersCollector implements Collector<Integer, 
        Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> {

    //finisher - 항등 함수 identity를 반환
    public Function<Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> finisher() {
        return Function.identity();
    }
    
    //Characteristics - CONCURRENT & UNORDERED(X), IDENTITY_FINISH(O)
    public Set<Characteristics> characteristics() {
        return Collections.unmodifiableSet(EnumSet.of(IDENTITY_FINISH));
    }
}

최종 `PrimeNumbersCollector`

public class PrimeNumbersCollector implements Collector<Integer, 
        Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> {

    @Override
    public Supplier<Map<Boolean, List<Integer>>> supplier() {
      return () -> new HashMap<Boolean, List<Integer>>() { { //두 개의 빈 리스트를 포함하는 맵으로 수집 동작 시작
        put(true, new ArrayList<Integer>());
        put(false, new ArrayList<Integer>());
      } };
    }
  
    @Override
    public BiConsumer<Map<Boolean, List<Integer>>, Integer> accumulator() {
      return (Map<Boolean, List<Integer>> acc, Integer candidate) -> {
        acc.get( isPrime(acc.get(true), candidate) ) //지금까지 발견한 소수 리스트를 isPrime()에 전달
           .add(candidate); //isPrime()의 결과에 따라 맵에서 알맞은 리스트를 받아 현재 candidate 추가
      };
    }

    @Override
    public BinaryOperator<Map<Boolean, List<Integer>>> combiner() {
      return (Map<Boolean, List<Integer>> map1, Map<Boolean, List<Integer>> map2) -> { //두 번째 맵을 첫 번째 맵에 병합
        map1.get(true).addAll(map2.get(true));
        map1.get(false).addAll(map2.get(false));
        return map1;
      };
    }

    @Override
    public Function<Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> finisher() {
        return Function.identity(); //최종 수집 과정에서 데이터 변환이 필요하지 않으므로 항등 함수 반환
    }
    
    @Override
    public Set<Characteristics> characteristics() {
        return Collections.unmodifiableSet(EnumSet.of(IDENTITY_FINISH)); //발견한 소수의 순서에 의미가 있으므로 컬렉터는 IDENTITY_FINISH
    }
}

public Map<Boolean, List<Integer>> partitionPrimesWithCustomController(int n) {
    return IntStream.rangeClosed(2, n).boxed().collect(newPrimeNumbersCollector());
}

커스텀 컬렉터로 교체

6.6.2 컬렉터 성능 비교

ex. 팩토리 메서드 partitioningBy로 만든 코드와 커스텀 컬렉터의 성능을 비교하는 예제입니다.

public class CollectorHarness {
    public static void main(String[] args) {
        long fastest = Long.MAX_VALUE;
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            partitionPrimes(1_000_000);
            long duration = (System.nanoTime() - start) / 1_000_000;
            if (duration < fastest) fastest = duration;
        }
        System.out.println("Fastest execution done in " + fastest + " msecs");
    }
}

partitioningBy로 백만 개의 자연수를 소수와 비소수로 분류하는 작업을 10번 반복하면서 가장 빨리 실행된 속도를 기록합니다.

← → Top