XOR


Profile

Handle: nwpfh


【2025/07/19 09:42 】 |

Parallelizing Monte Carlo with C++17 (GNU Version)

Inspired by the post "Parallelizing Monte Carlo with C++17", I wrote a version that uses the GNU parallel mode.
■ Compile

$ g++ test.cc -Wall -march=native -std=c++17 -O3  -fopenmp

■ Source

#include <parallel/algorithm>
//#include <execution>
#include <atomic>
#include <cstdlib>
#include <iostream>
#include <random>
#include <vector>

int main(int argc, char *argv[])
{
    static int const NUM = 1000000000; // total number of samples
    static int threads = 8;            // number of work items (one per intended thread)
    static_assert(std::atomic<int>::is_always_lock_free);
    if (argc >= 2)
    {
        threads = std::atoi(argv[1]);
    }
    // One entry per work item; each entry holds that item's share of the samples.
    auto nums = std::vector<int>(threads);
    for (auto &num : nums)
    {
        num = NUM / threads;
    }
    // Force the parallel implementation even though the range is very short.
    __gnu_parallel::_Settings s;
    s.algorithm_strategy = __gnu_parallel::force_parallel;
    __gnu_parallel::_Settings::set(s);

    std::atomic<int> counter{0}; // hit count shared by all work items

    __gnu_parallel::for_each(nums.begin(), nums.end(), [&counter](int num) {
        // One engine per thread, seeded from std::rand(). std::rand() is never
        // seeded here, so the seeds are the same on every run.
        thread_local std::mt19937 mt(std::rand());
        std::uniform_real_distribution<double> score(0.0, 1.0);
        for (int no = 0; no < num; ++no)
        {
            const double x = score(mt);
            const double y = score(mt);
            if ((x * x + y * y) < 1.0)
            {
                counter++; // atomic increment on the shared counter for every hit
            }
        }
    });

    std::cout << "PI = " << 4.0 * counter / NUM << std::endl;
}

■ Results

$ time ./a.out 1
PI = 3.14155

real	0m27.461s
user	0m27.469s
sys	0m0.001s
$ time ./a.out 2
PI = 3.14159

real	0m29.238s
user	0m58.008s
sys	0m0.016s
$ time ./a.out 3
PI = 3.14151

real	0m28.447s
user	1m24.458s
sys	0m0.000s
$ time ./a.out 4
PI = 3.14155

real	0m32.101s
user	2m5.238s
sys	0m0.020s

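For some reason only the total CPU time (user) goes up, while the real (wall-clock) time does not go down...
So I modified the source: each work item now counts hits in a local variable and adds the result to the shared atomic counter once at the end.
■ Source (modified)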

#include <parallel/algorithm>
//#include <execution>
#include <atomic>
#include <cstdlib>
#include <iostream>
#include <random>
#include <vector>

int main(int argc, char *argv[])
{
    static int const NUM = 1000000000; // total number of samples
    static int threads = 8;            // number of work items (one per intended thread)
    static_assert(std::atomic<int>::is_always_lock_free);
    if (argc >= 2)
    {
        threads = std::atoi(argv[1]);
    }
    // One entry per work item; each entry holds that item's share of the samples.
    auto nums = std::vector<int>(threads);
    for (auto &num : nums)
    {
        num = NUM / threads;
    }
    // Force the parallel implementation even though the range is very short.
    __gnu_parallel::_Settings s;
    s.algorithm_strategy = __gnu_parallel::force_parallel;
    __gnu_parallel::_Settings::set(s);

    std::atomic<int> counter{0}; // hit count shared by all work items

    __gnu_parallel::for_each(nums.begin(), nums.end(), [&counter](int num) {
        // One engine per thread, seeded from std::rand(). std::rand() is never
        // seeded here, so the seeds are the same on every run.
        thread_local std::mt19937 mt(std::rand());
        int _counter = 0; // local hit count, private to this work item
        std::uniform_real_distribution<double> score(0.0, 1.0);
        for (int no = 0; no < num; ++no)
        {
            const double x = score(mt);
            const double y = score(mt);
            if ((x * x + y * y) < 1.0)
            {
                _counter++;
            }
        }
        counter += _counter; // a single atomic update per work item
    });

    std::cout << "PI = " << 4.0 * counter / NUM << std::endl;
}
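
The only difference between the two listings is where hits are counted: the first increments the shared std::atomic on every hit, the second counts in a local variable and touches the atomic once per work item. A plausible explanation for the gap, which the post itself does not measure, is that all threads contend for the same atomic in the first version. The sketch below isolates the local-accumulation pattern. It is only an illustration: it uses plain std::thread instead of the GNU parallel mode, and the names count_hits_local, estimate_pi, and base_seed are made up for this example. Seeding each worker with base_seed plus its index is a simplification, not a recommendation.

```cpp
#include <atomic>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Hypothetical helper (not from the post): counts samples that fall inside
// the unit quarter circle, using one thread and one local tally per worker.
std::int64_t count_hits_local(unsigned workers, std::int64_t samples_per_worker,
                              std::uint32_t base_seed)
{
    std::atomic<std::int64_t> total{0};
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned w = 0; w < workers; ++w)
    {
        pool.emplace_back([&total, samples_per_worker, base_seed, w] {
            // Each worker owns its engine, so no random state is shared.
            std::mt19937 mt(base_seed + w);
            std::uniform_real_distribution<double> score(0.0, 1.0);
            std::int64_t local = 0; // private tally, no synchronization needed
            for (std::int64_t i = 0; i < samples_per_worker; ++i)
            {
                const double x = score(mt);
                const double y = score(mt);
                if (x * x + y * y < 1.0)
                {
                    ++local;
                }
            }
            total += local; // the only atomic operation this worker performs
        });
    }
    for (auto &t : pool)
    {
        t.join();
    }
    return total.load();
}

// Hypothetical helper: turns the hit count into an estimate of pi.
double estimate_pi(unsigned workers, std::int64_t samples_per_worker,
                   std::uint32_t base_seed)
{
    const std::int64_t hits = count_hits_local(workers, samples_per_worker, base_seed);
    const double samples = static_cast<double>(workers) * static_cast<double>(samples_per_worker);
    return 4.0 * static_cast<double>(hits) / samples;
}
```

Calling estimate_pi(4, 250000000, 1) would mirror the 4-way split of the post's 1,000,000,000 samples. On older toolchains, building code that uses std::thread may additionally require -pthread.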

■ Results

$ time ./a.out 1
PI = 3.14155

real	0m20.263s
user	0m20.255s
sys	0m0.000s
$ time ./a.out 2
PI = 3.14159

real	0m10.117s
user	0m20.223s
sys	0m0.000s
$ time ./a.out 3
PI = 3.14151

real	0m7.409s
user	0m22.209s
sys	0m0.000s
$ time ./a.out 4
PI = 3.14155

real	0m6.129s
user	0m24.492s
sys	0m0.001s

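With 4 cores the run time dropped from 20 seconds to 6 seconds. Raising the count further to 8 gave no additional improvement...
Pretty decent performance all the same.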
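
To put numbers on the scaling, the real times from the two result listings can be converted into speedup (single-item time divided by N-item time) and efficiency (speedup divided by N). The sketch below does that arithmetic. The Run struct and the function names are made up for this illustration; the timings are copied from the result listings in this post.

```cpp
#include <vector>

// One measured run: number of work items and wall-clock ("real") seconds.
struct Run
{
    int threads;
    double real_seconds;
};

// Speedup of `run` relative to the single-item baseline `base`.
double speedup(const Run &base, const Run &run)
{
    return base.real_seconds / run.real_seconds;
}

// Parallel efficiency: speedup divided by the number of work items.
double efficiency(const Run &base, const Run &run)
{
    return speedup(base, run) / run.threads;
}

// Real times of the first version (atomic increment on every hit).
std::vector<Run> first_version_runs()
{
    return {{1, 27.461}, {2, 29.238}, {3, 28.447}, {4, 32.101}};
}

// Real times of the modified version (local count, one atomic add per item).
std::vector<Run> modified_runs()
{
    return {{1, 20.263}, {2, 10.117}, {3, 7.409}, {4, 6.129}};
}
```

For the modified version this gives speedups of roughly 2.0x, 2.7x, and 3.3x at 2, 3, and 4 work items, or an efficiency of about 100%, 91%, and 83%. For the first version the 4-item run comes out slower than the single-item run, with a speedup of about 0.86x.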
