Subversion Repositories Open64

[/] - Rev 4065


Rev Log message Author Age Path
4065 [CAF] Misc. fixes in CAF implementation

* The front-end may generate temp STs with coarray types for handling
allocatable components of coarrays. The back-end needs to ignore these, as
no stack space should be reserved for them. Not doing so resulted in the
back-end allocating arbitrarily large stack frame sizes for program units.

* Fixing the IR generated in the front-end for allocatable coarray
components. The problem originates from the duplication of statement
expansion operands (specifically, for the PE subscript) in the IR, as the
statement expansion operand is duplicated in different IR_List nodes. The
process_deferred_functions routine will fail to update the duplicated operands
after completing the expansion (and it should only do expansion once, not for
all of the duplicates!). To fix this, I introduce a Plus_Opr into the PE
subscript, with the Statement Expansion being the left operand of that
operator, and a constant 0 being the right operand. This way, there is only
one copy of the statement expansion operand in memory.

* Correct illegal assignment of pointer to int type

* Correct usage of ARMCI_Test in libcaf. The ARMCI_Test routine actually
returns 0 if the transfer is still in progress, and non-zero if it has
completed. This is the opposite of what the online ARMCI documentation
(for the now-defunct ARMCI 1.4) says. Needed to correct the usage according to
the latest spec.
dreachem 3d 09h /
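
The return-value convention described in the last item of r4065 can be sketched as a polling loop. This is only an illustration under the assumption stated in that log message (0 means the transfer is still in progress, non-zero means it has completed). The names `xfer_handle`, `fake_test` and `wait_for_completion` are hypothetical stand-ins, so the example does not depend on ARMCI or libcaf and does not show their actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-blocking transfer handle. */
typedef struct {
    int polls_until_done;   /* polls that will still report "in progress" */
} xfer_handle;

/* Hypothetical test routine following the convention in the log message:
 * returns 0 while the transfer is in progress, non-zero once complete. */
static int fake_test(xfer_handle *h)
{
    assert(h != NULL);
    if (h->polls_until_done > 0) {
        h->polls_until_done--;
        return 0;           /* still in progress */
    }
    return 1;               /* completed */
}

/* Poll until completion; returns how many polls reported "in progress". */
static int wait_for_completion(xfer_handle *h)
{
    int pending_polls = 0;
    while (fake_test(h) == 0) {
        pending_polls++;
    }
    return pending_polls;
}
```

The point is that the loop condition compares against 0 for "keep waiting". Code written against the opposite convention would exit the loop while the transfer was still in flight.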
4064 [CAF] Allocate Fortran target objects in remote-access segment

Before, this implementation did not allocate Fortran objects with the target
attribute in a remotely-accessible memory segment. We allowed accesses to such
objects via a pointer dereference by using GASNet active messages, but this
was not a good, general approach. Now, the compiler will allocate objects with
the target attribute in remote-accessible memory. This allows both GASNet and
ARMCI based CAF implementations to support access of remote targets.

* We add a pass in coarray prelowering phase to replace global variables and
automatic variables (not dummy variables) with dereferenced pointers pointing
into remote-access memory. Also, because pointer allocations may be pointed to
by other pointers, allocate them in the remote-access segment using
coarray_asymmetric_allocate_ if libcaf is being linked in. They will be
deallocated using coarray_asymmetric_deallocate_.

* Generate copyin/copyout code for args with target attribute. When a dummy
argument has the target attribute and the effective argument does not, the
compiler generates copyin/copyout code to ensure the dummy argument is
associated with remote-access memory. This will allow the called procedure (or
some procedure called within it) to associate a pointer component of a coarray
with the target.

* Support allocations for target objects in remote-access segment. We add runtime
and compiler-script support for allocation of objects with the target attribute
in remote-access memory. Global variables with target attribute are now allocated
in the symmetric heap along with coarrays. Additionally, the program can allocate
objects in the asymmetric portion of the remote-access segment for handling local
objects with a target attribute.

Contributor: Deepak
dreachem 3d 09h /
4063 set ST_addr_saved in FFE for LDA'd STs.

When there is an ISTORE statement with an LDA on the RHS, set the addr_saved
attribute for the LDA's ST. This was being done for STID statements, but not
ISTORE. Doing this fixes a bug that can surface when the FFE generates code
for the ALLOCATE statement. Specifically, when the FFE uses the _ALLOCATE
library routine, it can generate code that causes an unsafe propagation
across the _ALLOCATE call.

Contributor: Deepak
dreachem 3d 09h /
4062 [CAF] Handle non-array Iload/Mload/Istore/Mstore in coarray assignments

A recent bug was introduced where if an assignment statement had an
Iload/Mload/Istore/Mstore tree that didn't contain an array operator, it would
cause a crash in the prelowering phase. Now, we add an array operator
into such nodes in the IR tree to prevent this.

Also, some changes were made in runtime to prevent nested calls to
gasnet_hold_interrupts
dreachem 3d 09h /
4061 [CAF] Misc. improvements/fixes to CAF implementation

Several improvements/fixes were made in CAF implementation, in particular to
improve the existing support for non-blocking communication and the logic for
when/how to create communication buffers.

* We add an environment variable, UHCAF_NB_XFER_LIMIT, to control the number
of outstanding non-blocking PUT or GET calls that are allowed for a given
image. By default, it is set to 16. We found that allowing too many PUT or
GET calls to accumulate can really slow down the underlying library (be it
GASNet or ARMCI), so this provides a configurable way of throttling that.

* An LCB transfer was sometimes unnecessarily used when the root of the RHS tree
was OPR_CVTL, which should now be fixed. Also, the algorithm for statically
detecting if an array section access is contiguous was flawed. It always
returned false if the lower and upper bounds weren't constant, but for the
case of an access like X(l:u) we know this is a contiguous access. This has
been fixed.

* Fix Coarray_Prelower logic for creating LCB on coarray write

* Fix inter-image assignment of single value to array section

A bug was introduced in 3.0.27 where inter-image assignment statements of a
single value to an array section were not working. The following statements
were broken:

a(:)[i] = b(1)
a(:)[i] = d
b(:) = a(1)[i]
b(:) = d[i]

This was fixed by correcting the logic used to determine whether an LCB
should be created for a found co-indexed term in the IR. Additionally, the
change handles statements like the following more efficiently so that bulk
transfers are generated.

x(z(:)) = y(:)[i]
y(:)[i] = x(z(:))

* The defer_sync directive was not being applied properly to statements
containing function calls, like:

x(:) = y(:)[num_images()]

This is now corrected. Also, we now use integer type instead of pointer
type for caf_sync_handle

* Print more useful message when cafrun can't find executable to run.

* GASNet buffering optimization for nbread. It is observed that when doing an
nbread where the local destination is not in the shared memory segment, there
is a significant overhead (apparently due to GASNet having to register/pin the
destination buffer). This instead will allocate a buffer from the asymmetric
heap in the shared memory segment to use as the destination buffer. Upon
syncing on the handle, the data will be copied from this buffer to the actual
destination.

* A new routine is added for asymmetric memory allocation which will return NULL
if there wasn't sufficient space. This is now used for allocation of LCB and
for the GASNet nbread local buffer.

* Fix setting of final_dest for nbread. The field handle_node->final_dest
should only be set if a temporary local buffer is allocated from asymmetric
shared memory heap for the purposes of this nbread transfer.

Contributor: Deepak
dreachem 3d 09h /
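
The throttling described in the first item of r4061 might look roughly like the sketch below. Only the variable name UHCAF_NB_XFER_LIMIT and the default of 16 come from the log message. The helper names, and the choice to fall back to the default on an unset or malformed value, are assumptions made for illustration and are not taken from the runtime.

```c
#include <assert.h>
#include <stdlib.h>

#define DEFAULT_NB_XFER_LIMIT 16   /* default stated in the log message */

/* Read the limit from the environment. Falling back to the default when
 * the variable is unset or not a positive integer is an assumption. */
static long nb_xfer_limit(void)
{
    const char *s = getenv("UHCAF_NB_XFER_LIMIT");
    if (s != NULL) {
        char *end = NULL;
        long v = strtol(s, &end, 10);
        if (end != s && *end == '\0' && v > 0) {
            return v;
        }
    }
    return DEFAULT_NB_XFER_LIMIT;
}

/* Hypothetical bookkeeping check: returns 1 if another non-blocking PUT or
 * GET may be started, 0 if the caller should first wait on an outstanding
 * one. */
static int may_start_transfer(long outstanding, long limit)
{
    assert(outstanding >= 0 && limit > 0);
    return outstanding < limit;
}
```

A caller would evaluate `may_start_transfer(count, nb_xfer_limit())` before issuing a transfer and complete an older transfer first when it returns 0.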
4060 OMP runtime update/fix and fix for newer perl versions

* In the previous commit, accidentally removed the initialization of __omp_seed and
__omp_myid in __ompc_level_1_slave, resulting in a potential runtime crash.
Now initializing them both to uthread_index, as before.

* Also, if O64_OMP_VERBOSE is set to true, then the program will print OMP runtime
variable settings.

* Migrate from old getopt require to module, for newest perl versions
(such as on Fedora 18).

updating patch level to 3.0.29

Contributors: Deepak, Tony
dreachem 3d 09h /
4059 [CAF] fix check for sync handle before call to __coarray_sync

The compiler was generating a bad opcode when checking the value of the sync
handle returned by a __coarray_*read call, resulting in an error in the code
expansion phase. This was corrected.
dreachem 56d 01h /
4058 [CAF] adds opaque derived type caf_sync_handle and FFE fix for enums check

This adds a new opaque derived type for the sync-handle variables,
type(caf_sync_handle), to be used with defer_sync and sync directive. It
requires 'use caf_types' statement. E.g.

use caf_types
...
type(caf_sync_handle) :: s1
integer :: a(SZ)
integer :: b(SZ)[*]
...

!dir$ defer_sync(s1)
a(:) = b(:)[i]
...

!dir$ sync(s1)
!! now a(:) has been defined and can be safely read.

Also, fixing check_enums_for_change in fortran FE. Previous merge from open64
added back a couple unused operators in the operator_values enum type to get
FFE_BUILD_OPTIMIZE=DEBUG to work. But earlier, I had adjusted
check_enums_for_change for the same purpose. To keep OpenUH in sync with
Open64, I'm using their fix reverting check_enums_for_change to what it was
initially.

Contributor: Deepak
dreachem 56d 07h /
4057 [CAF] several improvements for remote accesses and defer_sync

Runtime:
- For coarray writes, no LCB used for direct accesses. The API for the
compiler includes a *write_from_lcb and a *write interface. *write is used
for "direct" writes that don't use an intermediate LCB. It requires local
completion to be guaranteed before returning, unless the statement had a
defer_sync directive attached to it.

- Remove ENABLE_NB_PUT variable (since we assume it is set).

- When allocating LCB, try to do so out of asymmetric heap if there is
sufficient space remaining (conservatively, if requested size is less than
half of the remaining space). Relying on OS calls for allocating/freeing LCB
memory can result in unpredictable behavior in GASNet due to an issue in the
GASNet Firehose implementation.

- Fix address calculation in comm_end_symmetric_mem.

- Handle case where RHS of a coarray write statement has a single term with an
MLOAD operator.

I also now allow defer_sync for coarray writes and adjust semantics for
defer_sync in general:

!dir$ defer_sync[(s)]
t(...) = u(...)[i]

- it will block until previous conflicting PUTs to same image complete
- it will initiate a non-blocking GET and return.
- t will not be defined until
- image synchronizes with i using sync images or sync all
- image encounters sync memory or matching !dir$ sync directive

!dir$ defer_sync[(s)]
u(...)[i] = t(...)

- it will block until previous conflicting PUTs to same image complete
- it will initiate a non-blocking PUT and return
- u[i] will not be defined until
- image synchronizes with i using sync images or sync all
- image encounters sync memory or matching !dir$ sync
directive
- while the PUT can be subsequently synchronized on by the
programmer, subsequent coarray reads or writes will not
block on it even if their addresses conflict with it.

Contributor: Deepak
dreachem 56d 07h /
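
The LCB placement rule from r4057 (use the asymmetric heap only if the requested size is less than half of the remaining space) can be sketched as a small decision function. The enum and function names are hypothetical, and treating "system allocation" as the only alternative is an assumption based on the log message's mention of OS calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for where an LCB would be allocated. */
typedef enum { LCB_FROM_ASYMMETRIC_HEAP, LCB_FROM_SYSTEM } lcb_source;

/* Choose the asymmetric heap only when requested < remaining / 2.
 * The comparison is written as requested < remaining - requested so that
 * it neither overflows nor loses precision to integer division. */
static lcb_source choose_lcb_source(size_t requested, size_t remaining)
{
    if (requested <= remaining && requested < remaining - requested) {
        return LCB_FROM_ASYMMETRIC_HEAP;
    }
    return LCB_FROM_SYSTEM;
}
```

For example, a 2-byte request with 5 bytes remaining qualifies, while a 3-byte request with 6 bytes remaining does not, because it is exactly half rather than less than half.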
4056 [CAF] changes tracing behavior and modification for GASNet get/put calls

The user now supplies UHCAF_TRACE_DIR instead of UHCAF_TRACE_FILE. Traces
generated are now per-image -- $UHCAF_TRACE_DIR/trace.x for each image x. The
default value for UHCAF_TRACE_DIR is uhcaf.traces.

Also, routine uhcaf_tracedump_shared_mem_alloc will print to the image's trace
file (so, tracing should be enabled to use this).

Also, adds routines uhcaf_trace_suspend and uhcaf_trace_resume to suspend and
resume tracing in the runtime.

Also, we use the bulk versions for GASNet get and put in the general case, in
case data is unaligned.

Contributor: Deepak
dreachem 56d 07h /
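
The per-image trace file naming from r4056 ($UHCAF_TRACE_DIR/trace.x, defaulting to the directory uhcaf.traces) can be sketched as follows. The function name and signature are hypothetical, and treating an empty string the same as an unset variable is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<dir>/trace.<image>" into buf. Pass the value of
 * getenv("UHCAF_TRACE_DIR") as dir; when it is NULL or empty, the default
 * directory named in the log message is used instead.
 * Returns 0 on success, -1 if buf is too small. */
static int trace_file_path(char *buf, size_t len, const char *dir,
                           unsigned image)
{
    if (dir == NULL || dir[0] == '\0') {
        dir = "uhcaf.traces";
    }
    int n = snprintf(buf, len, "%s/trace.%u", dir, image);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```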
4055 [CAF] changes to how shared memory is allocated in CAF runtime

This makes several changes to how shared memory is allocated and managed in
CAF runtime:

- When using GASNet ibv conduit with FAST segment configuration, we observe
that the segment size returned by gasnet_MaxLocalSegmentSize() may still be
an unsafe shared memory allocation size. So, we set the maximum allowable
segment size to 75% of what it returns in this case, to be safe.

- Before, the size of the allocated shared memory segment was controlled by
UHCAF_IMAGE_HEAP_SIZE. This could lead to under-allocation, even when the
program only consists of save coarray data (for which the requisite total
size is statically known).

- We redefined the term "IMAGE HEAP" to refer specifically to shared memory
space for which dynamic allocations and deallocations can occur. This is
what the user can control (either with UHCAF_IMAGE_HEAP_SIZE, or with
--image-heap). The default of 30 MB is still retained. The total shared
memory per image requested by the program will be the image heap size plus
the space needed to accommodate the save coarrays.

- To help the user know what heap size to use, we add a trace MEMORY_SUMMARY.
Running with --trace=memory_summary will print heap memory usage for each
image (compiler/runtime should be configured with --enable-cafrt-traces).

Other fixes/improvements:

- forgot to initialize some armci handles

- cafrun: set UHCAF_LAUNCHER appropriately for gasnet/armci
uhcaf: use gasnet-compatible MPI instead of MPI_LIB.

- Fix check for MPI_INCLUDE during configure. Ensures that the MPI_INCLUDE it
uses contains mpi.h.

TODO: Allow uhcaf to set the image heap size for a program, so that the user
doesn't need to specify it when launching the program. Of course, should still
be able to override the value set during compilation, if need be.

Contributor: Deepak
dreachem 56d 07h /
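
The sizing arithmetic described in r4055 can be illustrated with a short sketch: the per-image request is the image heap plus the statically known save-coarray space, and in the GASNet ibv/FAST case the usable maximum is taken as 75% of the reported maximum. All function names are hypothetical, and interpreting "30 MB" as 30 * 2^20 bytes is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* 30 MB default from the log message; 1 MB = 2^20 bytes is an assumption. */
#define DEFAULT_IMAGE_HEAP_SIZE ((size_t)30 * 1024 * 1024)

/* Total shared memory requested per image: dynamic image heap plus the
 * space needed for save coarrays. */
static size_t shared_mem_request(size_t image_heap, size_t save_coarrays)
{
    return image_heap + save_coarrays;
}

/* Conservative cap: 75% of the reported maximum segment size, rounded
 * down. */
static size_t safe_max_segment(size_t reported_max)
{
    return (reported_max / 4) * 3;
}

/* Returns 1 if the request fits under the conservative cap, else 0. */
static int request_fits(size_t image_heap, size_t save_coarrays,
                        size_t reported_max)
{
    return shared_mem_request(image_heap, save_coarrays)
           <= safe_max_segment(reported_max);
}
```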
4054 runtime fix for OMP level 1 par regions race condition

We observed a race condition which could potentially cause a deadlock in the OMP
runtimes used by OpenUH and Open64. An OMP program initializes by allocating
several slave threads, depending on the omp-num-threads ICV. Suppose there's
the scenario of a small parallel region executed on a subset of those
threads, and then immediately another parallel region is executed. It's
possible that slave threads not involved in the first region can "get stuck"
in a pthread_cond_wait call from __ompc_level_1_slave.

See the comments embedded in the following code snippet to see how the
deadlock situation can occur.

-- code snippet from omp_thread.c, __ompc_level_1_slave --

for( counter = 0; __omp_level_1_team_manager.new_task != task_expect;
     counter++) {

  // **********
  // (1) Current slave thread spins without seeing a signal from the master
  // that a new parallel region is to be executed
  // *********

  if (counter > __omp_spin_count) {

    // *********
    // (2) In the mean time while current thread is trying to acquire
    // the mutex, the master toggles new_task and broadcasts a signal
    // to start parallel region with a subset of threads. The
    // parallel region completes with a subset of the other slave
    // threads (i.e., not the current one).
    //
    // (3) Then, the master immediately sends a signal for *another*
    // parallel region. The new_task variable has now been toggled
    // twice, all before the current slave thread has a chance to
    // acquire the level-1 mutex and recheck the value of new_task.
    // *********

    pthread_mutex_lock(&__omp_level_1_mutex);

    // **********
    // (4) Now, the current slave thread thinks new_task never
    // changed, and it calls pthread_cond_wait. It completely misses
    // the signal from the master for the second parallel region, and
    // it gets stuck here.
    // **********

    while (__omp_level_1_team_manager.new_task != task_expect) {
      pthread_cond_wait(&__omp_level_1_cond,
                        &__omp_level_1_mutex);
    }
    pthread_mutex_unlock(&__omp_level_1_mutex);
  }
}

-- end of code snippet --

To sum it up, the issue occurs because the implementation allowed the master
thread to create two parallel regions, one after the other, before all the
allocated slave threads had a chance to observe the signal for the first
parallel region. And because new_task is treated as a boolean flag, the slave
thread cannot tell when it has been toggled twice. Note that this is currently
only an issue for level 1 parallel regions, because slave threads are created
and destroyed for every nested parallel region.

To resolve this, instead of toggling new_task, the master thread will
increment it. Slave threads will check if the value has changed from what it
was, and if so they will break out of the spin loop. This will prevent a slave
thread from executing the pthread_cond_wait within the spin loop if the master
has already started a new parallel region, eliminating the deadlock situation.

Contributor: Deepak
dreachem 56d 07h /
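
The core of the r4054 fix, replacing a toggled flag with an incrementing counter, can be demonstrated without threads. The sketch below only models the value a slave thread compares against. It is not the runtime's code, and all function names are hypothetical.

```c
#include <assert.h>

/* Old scheme: the master toggles new_task for each parallel region. */
static unsigned toggle(unsigned new_task)
{
    return !new_task;
}

/* New scheme: the master increments new_task for each parallel region. */
static unsigned increment(unsigned new_task)
{
    return new_task + 1u;
}

/* A slave remembers the value it last saw and checks whether the current
 * value differs from it. */
static int saw_new_region(unsigned last_seen, unsigned current)
{
    return current != last_seen;
}
```

With the toggle, two regions started back to back return the flag to its starting value, so `saw_new_region(v, toggle(toggle(v)))` is 0 and the slave believes nothing happened. With the counter, `saw_new_region(v, increment(increment(v)))` is 1, so the slave can leave the spin loop instead of waiting on the condition variable.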
4053 [CAF] fix missing #ifdef _UH_COARRAYS and fix for glibc 2.16+

* Added a missing #ifdef _UH_COARRAYS in s_directiv.c. Without it, the build
failed when not configuring with --enable-coarrays.

* Use siginfo_t instead of struct siginfo *. Some glibc (2.16+) requires the
use of siginfo_t instead of struct siginfo *.

Also, increasing patch level to 3.0.28.
dreachem 85d 06h /
4052 adding GNUFE_PREFIX_OVERRIDE arg to C/C++ front-end rules

When building the C/C++ front-ends, need to always pass in the
GNUFE_PREFIX_OVERRIDE. This will ensure that a modified prefix in Makefile.in
will propagate down to the gcc Makefiles.
dreachem 95d 16h /
4051 [CAF] fix return type for Is_LCB and Was_Visisted in coarray_lower.cxx

Return type for these macros should be void * instead of BOOL.
dreachem 96d 01h /
4050 [CAF] fixes for non-blocking read and subsequent sync

This fixes some problems with the non-blocking read support. The runtime now
exposes blocking and non-blocking read interfaces. All non-blocking reads will
return a handle which can be sync'd on with __coarray_sync. If the handle is
NULL, then a __coarray_sync on it will simply return.

Generated __coarray_sync is now conditionally executed. The call to
__coarray_sync is made only if the handle argument is not NULL. If the
argument is NULL, that indicates that there is nothing to wait on. This
reduces overhead for unnecessary syncs on NULL handles. Also, we add a simple
change to make the check for outstanding, potentially overlapping
communication less expensive. It will first check that there exists
outstanding PUTs/GETs before making any function calls.

This commit also fixes the generation of remote reads in Coarray_Lower so that
it uses the non-blocking interface correctly. Also, an LCB is not used in
Coarray_Lower if it's a direct remote read into the LHS (single term on RHS).

Contributor: Deepak
dreachem 96d 02h /
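
The conditional sync described in r4050 amounts to guarding the call with a NULL check on the handle. A minimal sketch follows, in which `fake_coarray_sync` is a hypothetical stand-in for the runtime routine and `sync_calls` exists only to make the effect observable.

```c
#include <assert.h>
#include <stddef.h>

static int sync_calls = 0;   /* counts calls, for illustration only */

/* Hypothetical stand-in for the runtime's sync routine. */
static void fake_coarray_sync(void *handle)
{
    assert(handle != NULL);
    sync_calls++;
}

/* Shape of the guarded call: a NULL handle means there is nothing to wait
 * on, so the function call is skipped entirely. */
static void sync_if_needed(void *handle)
{
    if (handle != NULL) {
        fake_coarray_sync(handle);
    }
}
```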
4049 export __omp_collector_api symbol for libopenmp

The symbol __omp_collector_api should be exported to allow an OpenUH compiled
OpenMP program to run with TAU.
dreachem 96d 02h /
4048 fix diagnostic phase name string copy for W2F

For W2F, the phase name (path to 'be' executable) is copied into a string
Diag_Phase_Name using strcpy. However, if it exceeded 80 characters, it would
overrun the string and write into the subsequent variable Diag_File. This
caused a seg fault later on, when the compiler erroneously calls fclose on
Diag_File. To fix, simply check if the phase name exceeds the allowable number
of characters (80), and if so we truncate it. A warning is also printed to
stderr in this case.
dreachem 96d 02h /
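
The r4048 fix (check the length, truncate, and warn instead of an unbounded strcpy) can be sketched as below. The 80-character limit comes from the log message. The function name, the buffer layout with one extra byte for the terminator, and the warning text are assumptions rather than the compiler's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PHASE_NAME_MAX 80   /* limit mentioned in the log message */

/* Copy src into dst, which must hold PHASE_NAME_MAX + 1 bytes. Names that
 * are too long are truncated and a warning goes to stderr.
 * Returns 1 if the name was truncated, 0 otherwise. */
static int copy_phase_name(char dst[PHASE_NAME_MAX + 1], const char *src)
{
    size_t n = strlen(src);
    int truncated = n > PHASE_NAME_MAX;
    if (truncated) {
        n = PHASE_NAME_MAX;
        fprintf(stderr, "warning: phase name truncated to %d characters\n",
                PHASE_NAME_MAX);
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return truncated;
}
```

Because the copy is bounded by the destination size, an overly long path can no longer spill into whatever variable happens to follow the buffer in memory.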
4047 [CAF] adds defer_sync/sync directives + other fixes

We changed the runtime to use non-blocking get and put by default. So, without
optimization the compiler will insert a "sync" after every coarray read
operation to ensure completion. To give the programmer control over when the
image will sync on completion, the compiler now supports defer_sync and sync
directives. The defer_sync directive will effectively prevent the compiler
from inserting the subsequent sync, allowing the programmer to manually manage
communication-computation overlap.

Optionally, the programmer can specify a "sync" variable with the defer_sync
directive. The type for this variable should be the same word length as the
machine address (usually, 8 bytes).

Example:

! using the defer_sync directive in combination with sync memory, the
! programmer can manually overlap communication and computation.

!dir$ defer_sync(s1)
a1 = x1[p]
!dir$ defer_sync(s2)
a2 = x2[p]
!dir$ defer_sync(s3)
a3 = x3[p]
!dir$ sync(s1) ! wait specifically on the first read to complete
!dir$ sync ! wait on all pending reads to complete
y = a1 + a2 + a3

Runtime will automatically generate non-blocking read and write. In case of
writes, it is assumed that the source buffer is LCB, unless a call to
comm_write_unbuffered is made. LCBs for writes are freed once the write has
completed (user code doesn't need to explicitly call __release_lcb for it).

Other fixes and improvements made:

* Fix computation of local strides when using LCB and other fixes. The
compiler was generating wrong values for the local strides array when a
local communication buffer (LCB) is used. We now set it based on the count
array values: count(1), count(1)*count(2), count(1)*count(2)*..., etc.

- Fix for coindexed MSTORE

- Don't use ARMCI lock when doing atomic update in sync images. We were using
ARMCI lock for some reason when doing an atomic RMW update in the sync
images implementation for ARMCI. This shouldn't be necessary.

- Change LAUNCHER to UHCAF_LAUNCHER in cafrun. The cafrun script now uses the
environment variable UHCAF_LAUNCHER instead of LAUNCHER. Also,
UHCAF_LAUNCHER_OPTS instead of LAUNCHER_OPTS.

- Allow single element remote read without LCB. Also don't use LCB for single
element remote read accesses. Before, the compiler generated a temporary
buffer for accesses of the form: Y = A[i], where Y is a scalar or array
reference that is not an array section. This wasn't strictly necessary, and
it prevented us from making use of the defer_sync directive which could be
useful for such statements. Now, we avoid the use of the intermediate LCB.

Contributor: Deepak
dreachem 96d 02h /
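
The local stride rule from r4047 (count(1), count(1)*count(2), and so on) is a running product over the count array. A sketch using zero-based C arrays is shown below. The function name is hypothetical, and the unit of the resulting strides is whatever unit count(1) is expressed in, which the log message does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Fill strides[k] with count[0] * ... * count[k] for k = 0 .. n-1, which
 * describes a densely packed buffer such as an LCB. */
static void lcb_local_strides(const size_t *count, size_t n, size_t *strides)
{
    size_t running = 1;
    for (size_t k = 0; k < n; k++) {
        running *= count[k];
        strides[k] = running;
    }
}
```

For counts {4, 3, 2} this yields strides {4, 12, 24}.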
4046 fixing configure script

Added a bad configure script in last commit, so fixing it.
dreachem 104d 09h /
