
[METRICS] MetricStore takes a considerable amount of time to sample a large-scale cluster. #1615

Open
garyparrot opened this issue Mar 28, 2023 · 8 comments

@garyparrot
Collaborator

Applying the scenario from #1476, a cluster with more than 10,000 partitions, I ran the following data-fetching test:

  @Test
  void test10000PartitionCluster() {
    try (var admin = Admin.of(bootstrap)) {
      var brokers = admin.brokers().toCompletableFuture().join();
      var jmxPort = 16926;
      var clients = brokers.stream()
          .collect(Collectors.toUnmodifiableMap(
              NodeInfo::id,
              b -> MBeanClient.jndi(b.host(), jmxPort)));
      var cost = HasClusterCost.of(Map.ofEntries(
          Map.entry(new NetworkIngressCost(), 1.0),
          Map.entry(new NetworkEgressCost(), 1.0)));
      try (var store = MetricsStore.builder()
          .localReceiver(() -> CompletableFuture.completedStage(clients))
          .sensorsSupplier(() -> Map.of(cost.metricSensor().get(), (a, b) -> {}))
          .beanExpiration(Duration.ofSeconds(30))
          .build()) {
        while (true) {
          var cb = store.clusterBean();
          System.out.println(cb.all().entrySet()
              .stream()
              .collect(Collectors.toUnmodifiableMap(
                  Map.Entry::getKey,
                  x -> x.getValue().size())));
          Utils.sleep(Duration.ofSeconds(1));
        }
      }
    }
  }

Then I measured how long the fetch at line 209 of the following code takes.

https://github.com/skiptests/astraea/blob/0b9fae3eb5026993f97d2918bc2a2c8087656f5b/common/src/main/java/org/astraea/common/metrics/collector/MetricsFetcher.java#L205-L215

Measurement code:

        long s = System.nanoTime();
        beans = clients.get(identity.id).beans(BeanQuery.all(), e -> {});
        long t = System.nanoTime();
        System.out.println("[" + identity.id  +"]MetricsFetcher#updateData " + Duration.ofNanos(t - s).toMillis() + "ms, sample " + beans.size() + " mbeans");

Results (the RTT between the client and the brokers' JMX endpoints is about 1.5 ms):

[5]MetricsFetcher#updateData 13089ms, sample 22912 mbeans
[1]MetricsFetcher#updateData 22183ms, sample 23148 mbeans
[0]MetricsFetcher#updateData 36837ms, sample 22939 mbeans
[4]MetricsFetcher#updateData 37239ms, sample 22728 mbeans
[3]MetricsFetcher#updateData 25484ms, sample 22583 mbeans
[2]MetricsFetcher#updateData 25208ms, sample 22388 mbeans
[5]MetricsFetcher#updateData 30096ms, sample 22912 mbeans
[1]MetricsFetcher#updateData 31196ms, sample 23148 mbeans
[4]MetricsFetcher#updateData 30611ms, sample 22728 mbeans
[0]MetricsFetcher#updateData 46564ms, sample 22939 mbeans
[3]MetricsFetcher#updateData 27797ms, sample 22583 mbeans
[2]MetricsFetcher#updateData 37786ms, sample 22388 mbeans
[5]MetricsFetcher#updateData 35129ms, sample 22912 mbeans
[1]MetricsFetcher#updateData 36829ms, sample 23148 mbeans
[4]MetricsFetcher#updateData 41201ms, sample 22728 mbeans
[0]MetricsFetcher#updateData 31216ms, sample 22939 mbeans
[3]MetricsFetcher#updateData 27864ms, sample 22583 mbeans

A single fetch from one node takes roughly 13 to 46 seconds.

@garyparrot garyparrot changed the title [METRICS] MetricStore sampling at large-scale cluster takes a considerable amount of time [METRICS] MetricStore takes a considerable amount of time to sample a large-scale cluster. Mar 28, 2023
@chia7712
Contributor

@chinghongfang any thoughts? Perhaps, as we discussed last time, we should exclude some unused metrics from the `all` query?

@chinghongfang
Collaborator

These are the numbers I measured with the previous MetricCollector:

[12329]MetricsFetcher#updateData 24954ms, sample 12008 mbeans
[14344]MetricsFetcher#updateData 26082ms, sample 12008 mbeans
[19171]MetricsFetcher#updateData 27247ms, sample 12008 mbeans
[11774]MetricsFetcher#updateData 32754ms, sample 12008 mbeans
[11641]MetricsFetcher#updateData 39145ms, sample 12008 mbeans
[12329]MetricsFetcher#updateData 31732ms, sample 12008 mbeans
[19171]MetricsFetcher#updateData 30419ms, sample 12008 mbeans
[14344]MetricsFetcher#updateData 39835ms, sample 12008 mbeans
[11641]MetricsFetcher#updateData 41180ms, sample 12008 mbeans
[11774]MetricsFetcher#updateData 48508ms, sample 12008 mbeans

And these are the results measured with the current version:

[11774]MetricsFetcher#updateData 38064ms, sample 20615 mbeans
[14344]MetricsFetcher#updateData 55894ms, sample 20613 mbeans
[19171]MetricsFetcher#updateData 57072ms, sample 20612 mbeans
[12329]MetricsFetcher#updateData 70108ms, sample 20613 mbeans
[11641]MetricsFetcher#updateData 38222ms, sample 20611 mbeans
[14344]MetricsFetcher#updateData 49942ms, sample 20613 mbeans
[11774]MetricsFetcher#updateData 56219ms, sample 20615 mbeans
[19171]MetricsFetcher#updateData 43439ms, sample 20612 mbeans
[12329]MetricsFetcher#updateData 61944ms, sample 20613 mbeans
[11641]MetricsFetcher#updateData 67746ms, sample 20611 mbeans

Pulling all mbeans yields roughly 1.71 times as many mbeans as pulling only the specified ones, so I think fetching only the specified mbeans should reduce the fetch time.
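
To make the "fetch only what we need" idea concrete, here is a minimal JMX-level sketch (plain javax.management rather than Astraea's BeanQuery API; the broker host and the ObjectName pattern are illustrative assumptions, while port 16926 matches the jmxPort used in the test above) comparing an unrestricted query with a narrowed one:

    import javax.management.ObjectName;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class NarrowQuerySample {
      public static void main(String[] args) throws Exception {
        // hypothetical broker host; 16926 is the jmxPort from the test above
        var url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:16926/jmxrmi");
        try (var connector = JMXConnectorFactory.connect(url)) {
          var connection = connector.getMBeanServerConnection();
          // everything, comparable to BeanQuery.all()
          var all = connection.queryNames(null, null);
          // only the byte-rate meters the network costs presumably need (assumed pattern)
          var narrowed = connection.queryNames(
              new ObjectName("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,*"), null);
          System.out.println("all=" + all.size() + ", narrowed=" + narrowed.size());
        }
      }
    }

Counting names alone already shows how much a pattern can cut the candidate set, and every bean excluded here also saves its attribute round trips later, which is the part that grows with the number of topics and partitions.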

@chia7712
Contributor

Nice. Let's open a PR to work out which metrics we really have to fetch.

@chia7712
Contributor

@chinghongfang @garyparrot

I tried to test the scenario of fetching a large number of metrics. In a fetch test against a cluster with a large number of empty topics/partitions, I saw the following numbers:

elapsed: 8606, beans: 201504

That differs quite a lot from the numbers you posted, so I'd like to ask: were you writing data while running the test? In other words, was anything else consuming resources?

@chinghongfang
Collaborator

were you writing data while running the test? In other words, was anything else consuming resources?

No data was being written. After the 10,000 topic-partitions were created on the 5 brokers, nothing else was done.

I tried to test the scenario of fetching a large number of metrics. In a fetch test against a cluster with a large number of empty topics/partitions, I saw the following numbers

Is that bean count the number MBeanClient actually pulled, or the number obtained from MetricStore or ClusterBean?
If it comes from MetricSensor or ClusterBean, the count may be inflated by "duplicate MetricSensors" (two costs may both need the same metric).

                allBeans.forEach(
                    (id, bs) -> {
                      var client = MBeanClient.of(bs);
                      var clusterBean = clusterBean();
                      lastSensors.forEach(
                          (sensor, errorHandler) -> {
                            try {
                              beans
                                  .computeIfAbsent(id, ignored -> new ConcurrentLinkedQueue<>())
                                  .addAll(sensor.fetch(client, clusterBean));
                            } catch (Exception e) {
                              errorHandler.accept(id, e);
                            }
                          });
                    });

The fetch may happen only once, but if the same MetricSensor shows up more than once, the same metric gets stored multiple times. Could that be the cause?
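
As a toy illustration of that amplification (hypothetical code: it assumes an MBeanClient `client` for one broker, uses ClusterBean.EMPTY in place of the real clusterBean(), and assumes both network costs expose a metricSensor() as in the test above):

    // Two cost functions whose sensors read the same mbeans each store their own copy,
    // so counts taken from ClusterBean can exceed what MBeanClient pulled once from JMX.
    var ingressSensor = new NetworkIngressCost().metricSensor().get();
    var egressSensor = new NetworkEgressCost().metricSensor().get();
    var stored = new ConcurrentLinkedQueue<HasBeanObject>();
    stored.addAll(ingressSensor.fetch(client, ClusterBean.EMPTY));
    stored.addAll(egressSensor.fetch(client, ClusterBean.EMPTY));
    // If both sensors read the same BrokerTopicMetrics, every such metric now appears twice in "stored".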

@chia7712
Contributor

The fetch may happen only once, but if the same MetricSensor shows up more than once, the same metric gets stored multiple times. Could that be the cause?

If so, then there are two stories to discuss: one is the fetch speed, the other is the processing (sensor) speed. I'd like to confirm first whether the first story really exists.

@garyparrot
Collaborator Author

garyparrot commented Mar 31, 2023

That differs quite a lot from the numbers you posted, so I'd like to ask: were you writing data while running the test? In other words, was anything else consuming resources?

MetricsFetcher's fetch speed seems to be strongly correlated with the network latency between the machine running it and the Kafka brokers.

Over a cross-county connection, I reran the measurement method described in #1615 on the latest main; no data was being read or written during the measurement.

The network latency between my machine and the brokers is currently about 5 ms (measured with ping):

[4]MetricsFetcher#updateData 29391ms, sample 12856 mbeans
[3]MetricsFetcher#updateData 29641ms, sample 13704 mbeans
[5]MetricsFetcher#updateData 31500ms, sample 13789 mbeans
[2]MetricsFetcher#updateData 31515ms, sample 12716 mbeans
[0]MetricsFetcher#updateData 28213ms, sample 13369 mbeans
[1]MetricsFetcher#updateData 30431ms, sample 13814 mbeans
[3]MetricsFetcher#updateData 30377ms, sample 13704 mbeans
[4]MetricsFetcher#updateData 30543ms, sample 12856 mbeans
[5]MetricsFetcher#updateData 30719ms, sample 13789 mbeans
[2]MetricsFetcher#updateData 29431ms, sample 12716 mbeans
[1]MetricsFetcher#updateData 28420ms, sample 13814 mbeans
[0]MetricsFetcher#updateData 30786ms, sample 13369 mbeans
[4]MetricsFetcher#updateData 30903ms, sample 12856 mbeans
[5]MetricsFetcher#updateData 29811ms, sample 13789 mbeans
[3]MetricsFetcher#updateData 32391ms, sample 13705 mbeans
[2]MetricsFetcher#updateData 31543ms, sample 12716 mbeans
[0]MetricsFetcher#updateData 30123ms, sample 13370 mbeans
[1]MetricsFetcher#updateData 30518ms, sample 13814 mbeans
[4]MetricsFetcher#updateData 30409ms, sample 12856 mbeans
[5]MetricsFetcher#updateData 29454ms, sample 13789 mbeans
[2]MetricsFetcher#updateData 28634ms, sample 12716 mbeans
[1]MetricsFetcher#updateData 28850ms, sample 13814 mbeans

I then switched to a machine that is topologically closer to the brokers and ran the same test.

The network latency between that machine and the Kafka brokers is 0.20 ms (measured with ping):

[4]MetricsFetcher#updateData 747ms, sample 12858 mbeans
[3]MetricsFetcher#updateData 770ms, sample 13706 mbeans
[2]MetricsFetcher#updateData 795ms, sample 12717 mbeans
[5]MetricsFetcher#updateData 868ms, sample 13791 mbeans
[0]MetricsFetcher#updateData 391ms, sample 13371 mbeans
[1]MetricsFetcher#updateData 393ms, sample 13816 mbeans
[3]MetricsFetcher#updateData 538ms, sample 13706 mbeans
[2]MetricsFetcher#updateData 536ms, sample 12717 mbeans
[4]MetricsFetcher#updateData 583ms, sample 12858 mbeans
[5]MetricsFetcher#updateData 532ms, sample 13791 mbeans
[0]MetricsFetcher#updateData 399ms, sample 13371 mbeans
[1]MetricsFetcher#updateData 386ms, sample 13816 mbeans
[4]MetricsFetcher#updateData 558ms, sample 12858 mbeans
[3]MetricsFetcher#updateData 591ms, sample 13706 mbeans
[2]MetricsFetcher#updateData 568ms, sample 12717 mbeans
[5]MetricsFetcher#updateData 590ms, sample 13791 mbeans
[0]MetricsFetcher#updateData 379ms, sample 13371 mbeans
[1]MetricsFetcher#updateData 394ms, sample 13816 mbeans
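
A rough back-of-the-envelope reading of the two runs above (my own assumption: each mbean costs on the order of one JMX round trip to read): with roughly 13,000 beans per broker, 5 ms per round trip corresponds to tens of seconds for a full pass, while 0.2 ms corresponds to a couple of seconds at most. Both are in the same ballpark as the measurements (the exact factor depends on how many attributes a single round trip carries), which suggests the fetch time is dominated by per-bean round trips rather than by bandwidth or broker-side work.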

@chia7712
Contributor

@garyparrot OK, thanks for the reply. The two data points you provided are quite important. Generally the balancer runs close to the servers (operated by the ops team), whereas other clients (producers and consumers) may conversely be farther away (possibly over a VPN). So, considering the producer/consumer scenario, we still need to try to trim the redundant queries/fetches here.

@chia7712 chia7712 added this to the 0.3.0 milestone Mar 31, 2023