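Problem description
Issue with MPI spawn and merge
I am trying to get started with dynamic process creation in MPI. I have a parent program (main.c) that tries to spawn new worker/child processes (worker.c) and merge both into a single intra-communicator. The parent code (main.c) is: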
#include<stdio.h>
#include "mpi.h"
MPI_Comm child_comm;
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if(rank == 0 )
{
    int num_processes_to_spawn = 2;
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, num_processes_to_spawn, MPI_INFO_NULL, 0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE );
    MPI_Comm intra_comm;
    MPI_Intercomm_merge(child_comm,0, &intra_comm);
    MPI_Barrier(child_comm);
    int tmp_size;
    MPI_Comm_size(intra_comm, &tmp_size);
    printf("size of intra comm world = %d\n", tmp_size);
    MPI_Comm_size(child_comm, &tmp_size);
    printf("size of child comm world = %d\n", tmp_size);
    MPI_Comm_size(MPI_COMM_WORLD, &tmp_size);
    printf("size of parent comm world = %d\n", tmp_size);
}
MPI_Finalize();
The worker (child) code is:
#include<stdio.h>
#include "mpi.h"
int main( int argc, char *argv[] )
{
    int numprocs, myrank;
    MPI_Comm parentcomm;
    MPI_Comm intra_comm;
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    MPI_Comm_get_parent( &parentcomm );
    MPI_Intercomm_merge(parentcomm, 1, &intra_comm);
    MPI_Barrier(parentcomm);
    if(myrank == 0)
    {
        int tmp_size;
        MPI_Comm_size(parentcomm, &tmp_size);
        printf("child size of parent comm world = %d\n", tmp_size);
        MPI_Comm_size(MPI_COMM_WORLD, &tmp_size);
        printf("child size of child comm world = %d\n", tmp_size);
        MPI_Comm_size(intra_comm, &tmp_size);
        printf("child size of intra comm world = %d\n", tmp_size);
        MPI_Finalize( );
        return 0;
    }
}
I run this code using
mpirun -np 12 main.c
After the spawn and merge, I expect the output to be
size of intra comm world = 14
size of child comm world = 2
size of parent comm world = 12
child size of parent comm world = 12
child size of child comm world = 2
child size of intra comm world = 14
But I get the following incorrect output.
size of intra comm world = 3
size of child comm world = 1
size of parent comm world = 12
child size of parent comm world = 2
child size of child comm world = 2
child size of intra comm world = 3
I don't understand where the mistake is. Could someone please point it out?
Thanks, Chris
Reference solution
Method 1:
Your code suffers from a few problems, which I'll list here:
- In the master part, only process 0 calls MPI_Comm_spawn(). This isn't a mistake as such (especially since you use MPI_COMM_SELF as the parent communicator), but it de facto excludes all other processes from the subsequent merge.
- In both the master and worker parts, you use MPI_Comm_size() instead of MPI_Comm_remote_size() to get the size of the remote group. On an inter-communicator, MPI_Comm_size() returns only the size of the local group, not the size of the remote group.
- In the master code, only process 0 calls MPI_Finalize() (not to mention that main() and MPI_Init() are missing).
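To make the numbers in your output concrete, here is a small sketch of the size arithmetic. It is a toy model in plain C and makes no MPI calls; the struct and the functions parent_view() and child_view() are names I made up for illustration. It assumes that every requested process is actually spawned and that all processes on both sides take part in the merge.

```c
#include <assert.h>

/* Toy model only: these are hypothetical helpers, not part of MPI. */
typedef struct {
    int local_size;   /* what MPI_Comm_size() reports on the inter-communicator */
    int remote_size;  /* what MPI_Comm_remote_size() reports on it */
    int merged_size;  /* MPI_Comm_size() of the merged intra-communicator */
} intercomm_sizes;

/* Sizes as seen by the parents. spawning_comm_size is the size of the
   communicator passed to MPI_Comm_spawn(): 1 for MPI_COMM_SELF, or the
   number of parent processes for MPI_COMM_WORLD. */
intercomm_sizes parent_view(int spawning_comm_size, int num_spawned)
{
    assert(spawning_comm_size > 0 && num_spawned > 0);
    intercomm_sizes s;
    s.local_size  = spawning_comm_size;
    s.remote_size = num_spawned;
    s.merged_size = spawning_comm_size + num_spawned;
    return s;
}

/* Sizes as seen by the children: local and remote groups swap roles,
   while the merged communicator is the same on both sides. */
intercomm_sizes child_view(int spawning_comm_size, int num_spawned)
{
    intercomm_sizes p = parent_view(spawning_comm_size, num_spawned);
    intercomm_sizes c;
    c.local_size  = p.remote_size;
    c.remote_size = p.local_size;
    c.merged_size = p.merged_size;
    return c;
}
```

With your original call, parent_view(1, 2) gives local 1, remote 2 and merged 3, which matches the 1 and 3 you printed with MPI_Comm_size(), and child_view(1, 2) gives a local size of 2. Spawning over all 12 parents corresponds to parent_view(12, 2), with remote 2 and merged 14.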
Here are fixed versions of your code:
master.c
#include <stdio.h>
#include <mpi.h>
int main( int argc, char *argv[] ) {
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm child_comm;
    int num_processes_to_spawn = 2;
    /* Collective over MPI_COMM_WORLD: every parent process makes this call */
    MPI_Comm_spawn( "./worker", MPI_ARGV_NULL,
                    num_processes_to_spawn, MPI_INFO_NULL,
                    0, MPI_COMM_WORLD,
                    &child_comm, MPI_ERRCODES_IGNORE );
    MPI_Comm intra_comm;
    /* high = 0: the parents are ordered before the children after the merge */
    MPI_Intercomm_merge( child_comm, 0, &intra_comm );
    if ( rank == 0 ) {
        int tmp_size;
        MPI_Comm_size( intra_comm, &tmp_size );
        printf( "size of intra comm world = %d\n", tmp_size );
        /* child_comm is an inter-communicator: query the remote group */
        MPI_Comm_remote_size( child_comm, &tmp_size );
        printf( "size of child comm world = %d\n", tmp_size );
        MPI_Comm_size( MPI_COMM_WORLD, &tmp_size );
        printf( "size of parent comm world = %d\n", tmp_size );
    }
    MPI_Finalize();
    return 0;
}
worker.c
#include <stdio.h>
#include <mpi.h>
int main( int argc, char *argv[] ) {
    MPI_Init( &argc, &argv );
    int myrank;
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    MPI_Comm parentcomm;
    /* Inter-communicator connecting the spawned workers to their parents */
    MPI_Comm_get_parent( &parentcomm );
    MPI_Comm intra_comm;
    /* high = 1: the children are ordered after the parents after the merge */
    MPI_Intercomm_merge( parentcomm, 1, &intra_comm );
    if ( myrank == 0 ) {
        int tmp_size;
        /* Seen from a child, the remote group is the set of parents */
        MPI_Comm_remote_size( parentcomm, &tmp_size );
        printf( "child size of parent comm world = %d\n", tmp_size );
        MPI_Comm_size( MPI_COMM_WORLD, &tmp_size );
        printf( "child size of child comm world = %d\n", tmp_size );
        MPI_Comm_size( intra_comm, &tmp_size );
        printf( "child size of intra comm world = %d\n", tmp_size );
    }
    MPI_Finalize();
    return 0;
}
On my laptop, this gives:
~> mpirun -n 12 ./master
child size of parent comm world = 12
child size of child comm world = 2
child size of intra comm world = 14
size of intra comm world = 14
size of child comm world = 2
size of parent comm world = 12