Segmentation Fault when using MPI_Isend


Problem Description


The purpose of my program is to calculate the electrostatic potential between an inner and an outer conductor by dividing the domain into a grid, and then into grid slices. Each processor gets one slice and runs the calculation on it. I use MPI_Isend and MPI_Irecv to send data between the processors. When testing the code I get a segmentation fault:

[physnode5:81440] *** Process received signal ***
[physnode5:81440] Signal: Segmentation fault (11)
[physnode5:81440] Signal code: Address not mapped (1)
[physnode5:81440] Failing at address: 0x58
[physnode5:81440] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab8069df5d0]
[physnode5:81440] [ 1] /opt/yarcc/libraries/openmpi/2.1.0/1/default/lib/libmpi.so.20(ompi_request_default_wait+0xd)[0x2ab8066495ed]
[physnode5:81440] [ 2] /opt/yarcc/libraries/openmpi/2.1.0/1/default/lib/libmpi.so.20(MPI_Wait+0x5d)[0x2ab80667a00d]
[physnode5:81440] [ 3] ./mpi_tezt.exe[0x400ffc]
[physnode5:81440] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab806c0e3d5]
[physnode5:81440] [ 5] ./mpi_tezt.exe[0x4009b9]
[physnode5:81440] *** End of error message ***

when this code is executed. Just to note, I am ssh'd onto a cluster. The file name is mpi_tezt.exe (yes, I misspelled it). I have checked that the arrays I want to send are correctly allocated, and that the sends and receives do not send or receive data that isn't there (i.e. data outside the bounds of the arrays). My MPI_Isend and MPI_Irecv code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  /*MPI Specific Variables*/
  int my_size, my_rank, up, down;
  MPI_Request reqU, reqD, sreqU, sreqD;
  MPI_Status rUstatus, rDstatus, sUstatus, sDstatus;

   /*Physical Dimensions*/
  double Linner = 5.0;/*mm*/
  double Rinner = 1.0;/*mm*/
  double phi_0 = 1000.0;/*V*/

  /*Other Variables*/
  int grid_size = 100;
  int slice;
  int x,y;
  double grid_res_y = 0.2;
  double grid_res_x = 0.1;
  int xboundary, yboundary;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &my_size);

  /*Determining neighbours*/
  if (my_rank != 0) /*if statements used so the highest and lowest ranks' neighbours aren't outside the 0 to my_size-1 range of ranks*/
    {
      up = my_rank-1;
    }
  else
    {
      up = 0;
    }

  if(my_rank != my_size-1)
    {
      down = my_rank+1;
    }
  else
    {
      down = my_size-1;
    }

  /*cross-check: presumed my_size is a factor of grid_size else there are odd sized slices and this is not coded for*/
  if (grid_size%my_size != 0)
    {
      printf("ERROR - number of procs =  %d, this is not a factor of grid_size %d\n", my_size, grid_size);
      exit(0);
    }

  /*Set Up Distributed Data Approach*/
  slice = grid_size/my_size;

  yboundary = Linner/grid_res_y; /*y grid index of inner conductor wall*/ 
  xboundary = Rinner/grid_res_x; /*x grid and individual array index of inner conductor wall*/


  double phi[slice+2][grid_size]; /*extra 2 rows to allow for halo data*/

  for (y=0; y < slice+2; y++)
    {
      for (x=0; x < grid_size; x++)
        { 
          phi[y][x] = 0.0;
        }
    }

  if(my_rank == 0) /*Boundary Containing rank does 2 loops. One over part with inner conductor and one over part without inner conductor*/
    {
      for(y=0; y < slice+1; y++)
        {
          for(x=xboundary; x < grid_size; x++)
            {
              phi[y][x] = phi_0;
            }
        }   
    }


  if (my_rank < my_size-1)
    {
      /*send top most strip up one node to be received as bottom halo*/
      MPI_Isend(&phi[1][0], grid_size  , MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &sreqU);
      /*recv top halo from up one node*/
      MPI_Irecv(&phi[slice+1][0], grid_size, MPI_DOUBLE, down, 2, MPI_COMM_WORLD, &reqU);
    }

  if (my_rank > 0)
    {
      /*recv top halo from down one node*/
      MPI_Irecv(&phi[0][0], grid_size , MPI_DOUBLE, up, 2, MPI_COMM_WORLD, &reqD);
      /*send bottom most strip down one node to be received as top halo*/
      MPI_Isend(&phi[slice][0], grid_size , MPI_DOUBLE, up, 1, MPI_COMM_WORLD, &sreqD);
    }

  if (my_rank<my_size-1)
    {
      /*Wait for send to down one rank to complete*/
      MPI_Wait(&sreqD, &sDstatus);
      /*Wait for receive from up one rank to complete*/
      MPI_Wait(&reqD, &rDstatus);
    }

  if (my_rank>0)
    {
      /*Wait for send to up one rank to complete*/
      MPI_Wait(&sreqU, &sUstatus);
      /*Wait for receive from down one rank to complete*/
      MPI_Wait(&reqU, &rUstatus);
    }


  MPI_Finalize();

  return 0;
}


Solution

Method 1:

You're faulting in the first MPI_Wait (for rank 0). This is step 7 in the example code below.

Using mpirun -np 2 ./whatever:

It appears that sreqD is not being set correctly. This is set at step 5 by rank 1.

But step 7 is being executed by rank 0, which does not set sreqD.

So, you need to adjust your if statements to match up correctly for which rank does which MPI_Wait, etc.
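
In other words, each MPI_Wait has to be paired with a request that the same rank actually posted: the my_rank < my_size-1 branch posts sreqU and reqU (the exchange with down), while the my_rank > 0 branch posts sreqD and reqD (the exchange with up). A minimal sketch of matched wait blocks, keeping the original variable names and changing only which requests each branch waits on:

  if (my_rank < my_size-1)
    {
      /*this branch posted sreqU (Isend to down) and reqU (Irecv from down)*/
      MPI_Wait(&sreqU, &sUstatus);
      MPI_Wait(&reqU, &rUstatus);
    }

  if (my_rank > 0)
    {
      /*this branch posted sreqD (Isend to up) and reqD (Irecv from up)*/
      MPI_Wait(&sreqD, &sDstatus);
      MPI_Wait(&reqD, &rDstatus);
    }

Another option is to initialize all four requests to MPI_REQUEST_NULL when they are declared; MPI_Wait on MPI_REQUEST_NULL returns immediately with an empty status, so a rank that never posted a given operation can still call the wait unconditionally.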


Here is your code with some debug printf statements:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
    /* MPI Specific Variables */
    int my_size,
     my_rank,
     up,
     down;
    MPI_Request reqU,
     reqD,
     sreqU,
     sreqD;
    MPI_Status rUstatus,
     rDstatus,
     sUstatus,
     sDstatus;

    /* Physical Dimensions */
    double Linner = 5.0;                /* mm */
    double Rinner = 1.0;                /* mm */
    double phi_0 = 1000.0;              /* V */

    /* Other Variables */
    int grid_size = 100;
    int slice;
    int x,
     y;
    double grid_res_y = 0.2;
    double grid_res_x = 0.1;

    int xboundary,
     yboundary;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &my_size);

    /* Determining neighbours */
    /* if statements used so the highest and lowest ranks' neighbours aren't
    outside the 0 to my_size-1 range of ranks */
    if (my_rank != 0) {
        up = my_rank - 1;
    }
    else {
        up = 0;
    }

    if (my_rank != my_size - 1) {
        down = my_rank + 1;
    }
    else {
        down = my_size - 1;
    }

    printf("my_rank=%d my_size=%d up=%d down=%d\n",my_rank,my_size,up,down);

    /* cross-check: presumed my_size is a factor of gridsize else there are
    odd sized slices and this is not coded for */
    if (grid_size % my_size != 0) {
        printf("ERROR - number of procs =  %d, this is not a factor of grid_size %d\n", my_size, grid_size);
        exit(0);
    }

    /* Set Up Distributed Data Approach */
    slice = grid_size / my_size;

    /* y grid index of inner conductor wall */
    yboundary = Linner / grid_res_y;
    /* x grid and individual array index of inner conductor wall */
    xboundary = Rinner / grid_res_x;

    if (my_rank == 0) {
        printf("Linner=%g grid_res_y=%g yboundary=%d\n",
            Linner,grid_res_y,yboundary);
        printf("Rinner=%g grid_res_x=%g xboundary=%d\n",
            Rinner,grid_res_x,xboundary);
        printf("slice=%d grid_size=%d phi=%ld\n",
            slice,grid_size,sizeof(double) * (slice + 2) * grid_size);
    }

    /* extra 2 rows to allow for halo data */
    double phi[slice + 2][grid_size];

    for (y = 0; y < slice + 2; y++) {
        for (x = 0; x < grid_size; x++) {
            phi[y][x] = 0.0;
        }
    }

    /* Boundary Containing rank does 2 loops. One over part with inner
    conductor and one over part without inner conductor */
    if (my_rank == 0) {
        for (y = 0; y < slice + 1; y++) {
            for (x = xboundary; x < grid_size; x++) {
                phi[y][x] = phi_0;
            }
        }
    }

    if (my_rank < my_size - 1) {
        /* send top most strip up one node to be received as bottom halo */
        printf("1: my_rank=%d MPI_Isend\n",my_rank);
        MPI_Isend(&phi[1][0], grid_size, MPI_DOUBLE, down, 1, MPI_COMM_WORLD,
            &sreqU);

        /* recv top halo from up one node */
        printf("2: my_rank=%d MPI_Irecv\n",my_rank);
        MPI_Irecv(&phi[slice + 1][0], grid_size, MPI_DOUBLE, down, 2,
            MPI_COMM_WORLD, &reqU);

        printf("3: my_rank=%d\n",my_rank);
    }

    if (my_rank > 0) {
        /* recv top halo from down one node */
        printf("4: my_rank=%d MPI_Irecv\n",my_rank);
        MPI_Irecv(&phi[0][0], grid_size, MPI_DOUBLE, up, 2, MPI_COMM_WORLD,
            &reqD);

        /* send bottom most strip down one node to be received as top halo */
        printf("5: my_rank=%d MPI_Isend\n",my_rank);
        MPI_Isend(&phi[slice][0], grid_size, MPI_DOUBLE, up, 1, MPI_COMM_WORLD,
            &sreqD);

        printf("6: my_rank=%d\n",my_rank);
    }

    if (my_rank < my_size - 1) {
        /* Wait for send to down one rank to complete */
        printf("7: my_rank=%d\n",my_rank);
        MPI_Wait(&sreqD, &sDstatus);
        printf("8: my_rank=%d\n",my_rank);

        /* Wait for receive from up one rank to complete */
        printf("9: my_rank=%d\n",my_rank);
        MPI_Wait(&reqD, &rDstatus);
        printf("10: my_rank=%d\n",my_rank);
    }

    if (my_rank > 0) {
        /* Wait for send to up one rank to complete */
        printf("11: my_rank=%d\n",my_rank);
        MPI_Wait(&sreqU, &sUstatus);
        printf("12: my_rank=%d\n",my_rank);

        /* Wait for receive from down one rank to complete */
        printf("13: my_rank=%d\n",my_rank);
        MPI_Wait(&reqU, &rUstatus);
        printf("14: my_rank=%d\n",my_rank);
    }

    MPI_Finalize();

    return 0;
}

Here is the output. Notice that step 7 prints (which is before the first MPI_Wait for rank 0), but rank 0 never gets to step 8 (the printf after that call).

my_rank=0 my_size=2 up=0 down=1
Linner=5 grid_res_y=0.2 yboundary=25
Rinner=1 grid_res_x=0.1 xboundary=10
slice=50 grid_size=100 phi=41600
1: my_rank=0 MPI_Isend
2: my_rank=0 MPI_Irecv
3: my_rank=0
7: my_rank=0
my_rank=1 my_size=2 up=0 down=1
4: my_rank=1 MPI_Irecv
5: my_rank=1 MPI_Isend
6: my_rank=1
11: my_rank=1
[manderly:230404] *** Process received signal ***
[manderly:230403] *** Process received signal ***
[manderly:230403] Signal: Segmentation fault (11)
[manderly:230403] Signal code: Address not mapped (1)
[manderly:230403] Failing at address: 0x58
[manderly:230404] Signal: Segmentation fault (11)
[manderly:230404] Signal code: Address not mapped (1)
[manderly:230404] Failing at address: 0x58
[manderly:230403] [ 0] [manderly:230404] [ 0] /lib64/libpthread.so.0(+0x121c0)/lib64/libpthread.so.0(+0x121c0)[0x7fa5478341c0]
[0x7fa0ebe951c0]
[manderly:230404] [ 1] [manderly:230403] [ 1] /usr/lib64/openmpi/lib/libmpi.so.20(ompi_request_default_wait+0x31)[0x7fa0ec0e9a81]
[manderly:230404] [ 2] /usr/lib64/openmpi/lib/libmpi.so.20(ompi_request_default_wait+0x31)[0x7fa547a88a81]
[manderly:230403] [ 2] /usr/lib64/openmpi/lib/libmpi.so.20(PMPI_Wait+0x60)[0x7fa0ec12c350]
[manderly:230404] [ 3] ./fix2[0x400f93]
[manderly:230404] [ 4] /usr/lib64/openmpi/lib/libmpi.so.20(PMPI_Wait+0x60)[0x7fa547acb350]
[manderly:230403] [ 3] ./fix2[0x400ef7]
/lib64/libc.so.6(__libc_start_main+0xea)[0x7fa0ebaedfea]
[manderly:230404] [ 5] ./fix2[0x40081a[manderly:230403] [ 4] ]
[manderly:230404] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xea)[0x7fa54748cfea]
[manderly:230403] [ 5] ./fix2[0x40081a]
[manderly:230403] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node manderly exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

(by AaronCraig Estey)

References

  1. Segmentation Fault when using MPI_Isend (CC BY-SA 2.5/3.0/4.0)

#mpi #C





