Subversion Repositories Open64

[/] - Rev 4059

Rev Log message Author Age Path
4059 [CAF] fix check for sync handle before call to __coarray_sync

The compiler was generating a bad opcode when checking the value of the sync
handle returned by a __coarray_*read call, resulting in an error in the code
expansion phase. This was corrected.
dreachem 25d 23h /
4058 [CAF] adds opaque derived type caf_sync_handle and FFE fix for enums check

This adds a new opaque derived type for the sync-handle variables,
type(caf_sync_handle), to be used with defer_sync and sync directive. It
requires 'use caf_types' statement. E.g.

use caf_types
...
type(caf_sync_handle) :: s1
integer :: a(SZ)
integer :: b(SZ)[*]
...

!dir$ defer_sync(s1)
a(:) = b(:)[i]
...

!dir$ sync(s1)
!! now a(:) has been defined and can be safely read.

Also, fixing check_enums_for_change in fortran FE. Previous merge from open64
added back a couple unused operators in the operator_values enum type to get
FFE_BUILD_OPTIMIZE=DEBUG to work. But earlier, I had adjusted
check_enums_for_change for the same purpose. To keep OpenUH in sync with
Open64, I'm using their fix reverting check_enums_for_change to what it was
initially.

Contributor: Deepak
dreachem 26d 05h /
4057 [CAF] several improvements for remote accesses and defer_sync

Runtime:
- For coarray writes, no LCB used for direct accesses. The API for the
compiler includes a *write_from_lcb and a *write interface. *write is used
for "direct" writes that don't use an intermediate LCB. It requires local
completion to be guaranteed before returning, unless the statement had a
defer_sync directive attached to it.

- Remove ENABLE_NB_PUT variable (since we assume it is set).

- When allocating LCB, try to do so out of the asymmetric heap if there is
sufficient space remaining (conservatively, if requested size is less than
half of the remaining space). Relying on OS calls for allocating/freeing LCB
memory can result in unpredictable behavior in GASNet due to an issue in the
GASNet Firehose implementation.

- Fix address calculation in comm_end_symmetric_mem.

- Handle case where RHS of a coarray write statement has a single term with an
MLOAD operator.
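
The LCB allocation policy in the list above (prefer the asymmetric heap when the request is smaller than half of the remaining space) could look roughly like the sketch below. It is only an illustration of the stated rule; `lcb_source_t` and `choose_lcb_source` are hypothetical names, not part of the runtime.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not the actual UHCAF runtime API. */
typedef enum { LCB_FROM_ASYMMETRIC_HEAP, LCB_FROM_SYSTEM } lcb_source_t;

/* Conservative rule from the log message: take the LCB from the asymmetric
 * heap only if the requested size is less than half of the space remaining.
 * The comparison is written as requested < remaining - requested so that it
 * neither overflows nor rounds the half down for odd sizes. */
lcb_source_t choose_lcb_source(size_t requested, size_t heap_remaining)
{
    if (requested <= heap_remaining && requested < heap_remaining - requested)
        return LCB_FROM_ASYMMETRIC_HEAP;
    return LCB_FROM_SYSTEM; /* otherwise fall back to an OS allocation */
}
```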

I also now allow defer_sync for coarray writes and adjust semantics for
defer_sync in general:

!dir$ defer_sync[(s)]
t(...) = u(...)[i]

- it will block until previous conflicting PUTs to same image complete
- it will initiate a non-blocking GET and return.
- t will not be defined until
  - image synchronizes with i using sync images or sync all
  - image encounters sync memory or matching !dir$ sync directive

!dir$ defer_sync[(s)]
u(...)[i] = t(...)

- it will block until previous conflicting PUTs to same image complete
- it will initiate a non-blocking PUT and return
- u[i] will not be defined until
  - image synchronizes with i using sync images or sync all
  - image encounters sync memory or matching !dir$ sync directive
- while the PUT can be subsequently synchronized on by the
  programmer, subsequent coarray reads or writes will not
  block on it even if their addresses conflict with it.

Contributor: Deepak
dreachem 26d 05h /
4056 [CAF] changes tracing behavior and modification for GASNet get/put calls

The user now supplies UHCAF_TRACE_DIR instead of UHCAF_TRACE_FILE. Traces
generated are now per-image -- $UHCAF_TRACE_DIR/trace.x for each image x. The
default value for UHCAF_TRACE_DIR is uhcaf.traces.
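
As a rough sketch of the per-image naming described above, a trace file path could be assembled as follows. The helper name `trace_path_for_image` and its signature are assumptions made for illustration; only the `trace.x` naming and the `uhcaf.traces` default come from the log message.

```c
#include <stdio.h>

/* Hypothetical helper: writes "<dir>/trace.<image>" into buf.
 * dir is the value of UHCAF_TRACE_DIR, or NULL/empty to use the default
 * directory "uhcaf.traces". Returns 0 on success, -1 if buf is too small. */
int trace_path_for_image(char *buf, size_t len, const char *dir,
                         unsigned long image)
{
    if (dir == NULL || dir[0] == '\0')
        dir = "uhcaf.traces";
    int n = snprintf(buf, len, "%s/trace.%lu", dir, image);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would typically pass `getenv("UHCAF_TRACE_DIR")` as `dir`.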

Also, routine uhcaf_tracedump_shared_mem_alloc will print to the image's trace
file (so, tracing should be enabled to use this).

Also, adds routines uhcaf_trace_suspend and uhcaf_trace_resume to suspend and
resume tracing in the runtime.

Also, we use the bulk versions for GASNet get and put in the general case, in
case data is unaligned.

Contributor: Deepak
dreachem 26d 05h /
4055 [CAF] changes to how shared memory is allocated in CAF runtime

This makes several changes to how shared memory is allocated and managed in
CAF runtime:

- When using GASNet ibv conduit with FAST segment configuration, we observe
that the segment size returned by gasnet_MaxLocalSegmentSize() may still be
an unsafe shared memory allocation size. So, we set the maximum allowable
segment size to 75% of what it returns in this case, to be safe.

- Before, the size of the allocated shared memory segment was controlled by
UHCAF_IMAGE_HEAP_SIZE. This could lead to under-allocation, even when the
program only consists of save coarray data (for which the requisite total
size is statically known).

- We redefined the term "IMAGE HEAP" to refer specifically to shared memory
space for which dynamic allocations and deallocations can occur. This is
what the user can control (either with UHCAF_IMAGE_HEAP_SIZE, or with
--image-heap). The default of 30 MB is still retained. The total shared
memory per image requested by the program will be the image heap size plus
the space needed to accommodate the save coarrays.

- To help the user know what heap size to use, we add a trace MEMORY_SUMMARY.
Running with --trace=memory_summary will print heap memory usage for each
image (compiler/runtime should be configured with --enable-cafrt-traces).
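
The sizing rules in the list above amount to simple arithmetic, sketched below. The function names are hypothetical, and the sketch takes the value of gasnet_MaxLocalSegmentSize() as a plain argument so that it does not depend on GASNet.

```c
#include <stddef.h>

/* Hypothetical helpers illustrating the sizing rules in the log message. */

/* Total shared memory requested per image: the user-controlled image heap
 * plus the statically known space needed for the save coarrays. */
size_t total_shared_mem(size_t image_heap_size, size_t save_coarray_size)
{
    return image_heap_size + save_coarray_size;
}

/* Safety cap for the ibv conduit with a FAST segment configuration:
 * 75% of the size reported by gasnet_MaxLocalSegmentSize(), which the
 * caller passes in as max_local_segment. Dividing first avoids overflow. */
size_t capped_segment_size(size_t max_local_segment)
{
    return max_local_segment / 4 * 3;
}
```

For example, with the default 30 MB image heap (taken here as 30 * 1024 * 1024 bytes, which is an assumption about the unit) and 1 KB of save coarrays, the request would be 31458304 bytes per image.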

Other fixes/improvements:

- forgot to initialize some armci handles

- cafrun: set UHCAF_LAUNCHER appropriately for gasnet/armci

- uhcaf: use gasnet-compatible MPI instead of MPI_LIB.

- Fix check for MPI_INCLUDE during configure. Ensures that the MPI_INCLUDE it
uses contains mpi.h.

TODO: Allow uhcaf to set the image heap size for a program, so that the user
doesn't need to specify it when launching the program. Of course, should still
be able to override the value set during compilation, if need be.

Contributor: Deepak
dreachem 26d 05h /
4054 runtime fix for OMP level 1 par regions race condition

We observed a race condition which could potentially cause a deadlock in the OMP
runtimes used by OpenUH and Open64. An OMP program initializes by allocating
several slave threads, depending on the omp-num-threads ICV. Suppose there's
the scenario of a small parallel region executed on a subset of those
threads, and then immediately another parallel region is executed. It's
possible that slave threads not involved in the first region can "get stuck"
in a pthread_cond_wait call from __ompc_level_1_slave.

See the comments embedded in the following code snippet to see how the
deadlock situation can occur.

-- code snippet from omp_thread.c, __ompc_level_1_slave --

for( counter = 0; __omp_level_1_team_manager.new_task != task_expect;
     counter++) {

  // **********
  // (1) Current slave thread spins without seeing a signal from the master
  // that a new parallel region is to be executed
  // *********

  if (counter > __omp_spin_count) {

    // *********
    // (2) In the mean time while current thread is trying to acquire
    // the mutex, the master toggles new_task and broadcasts a signal
    // to start parallel region with a subset of threads. The
    // parallel region completes with a subset of the other slave
    // threads (i.e., not the current one).
    //
    // (3) Then, the master immediately sends a signal for *another*
    // parallel region. The new_task variable has now been toggled
    // twice, all before the current slave thread has a chance to
    // acquire the level-1 mutex and recheck the value of new_task.
    // *********

    pthread_mutex_lock(&__omp_level_1_mutex);

    // **********
    // (4) Now, the current slave thread thinks new_task never
    // changed, and it calls pthread_cond_wait. It completely misses
    // the signal from the master for the second parallel region, and
    // it gets stuck here.
    // **********

    while (__omp_level_1_team_manager.new_task != task_expect) {
      pthread_cond_wait(&__omp_level_1_cond,
                        &__omp_level_1_mutex);
    }
    pthread_mutex_unlock(&__omp_level_1_mutex);
  }
}

-- end of code snippet --

To sum it up, the issue occurs because the implementation allowed the master
thread to create two parallel regions, one after the other, before all the
allocated slave threads had a chance to observe the signal for the first
parallel region. And because new_task is treated as a boolean flag, the slave
thread cannot tell when it has been toggled twice. Note that this is currently
only an issue for level 1 parallel regions, because slave threads are created
and destroyed for every nested parallel region.

To resolve this, instead of toggling new_task, the master thread will
increment it. Slave threads will check if the value has changed from what it
was, and if so they will break out of the spin loop. This will prevent a slave
thread from executing the pthread_cond_wait within the spin loop if the master
has already started a new parallel region, eliminating the deadlock situation.
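
The difference between toggling and incrementing new_task can be seen in a small single-threaded model of what a slave observes after the master has started some number of regions while the slave was not looking. This is a sketch under that simplification, not the runtime code; both function names are hypothetical and counter wraparound is ignored.

```c
#include <stdbool.h>

/* Old scheme: the master toggles new_task for each parallel region and the
 * slave waits for the value it expects after exactly one toggle. */
bool slave_notices_with_toggle(int regions_started)
{
    int new_task = 0;
    int task_expect = 1;
    for (int i = 0; i < regions_started; i++)
        new_task = !new_task;
    return new_task == task_expect;
}

/* New scheme: the master increments new_task and the slave checks whether
 * the value differs from the last one it saw. */
bool slave_notices_with_counter(int regions_started)
{
    int new_task = 0;
    int last_seen = new_task;
    for (int i = 0; i < regions_started; i++)
        new_task++;
    return new_task != last_seen;
}
```

With two regions started back to back, the toggled flag is back at its old value, so the first function returns false, while the counter version still reports a change.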

Contributor: Deepak
dreachem 26d 05h /
4053 [CAF] fix missing #ifdef _UH_COARRAYS and fix for glibc 2.16+

* Added a missing #ifdef _UH_COARRAYS in s_directiv.c. Without it, the build
failed when not configuring with --enable-coarrays.

* Use siginfo_t instead of struct siginfo *. Some glibc (2.16+) requires the
use of siginfo_t instead of struct siginfo *.

Also, increasing patch level to 3.0.28.
dreachem 55d 04h /
4052 adding GNUFE_PREFIX_OVERRIDE arg to C/C++ front-end rules

When building the C/C++ front-ends, need to always pass in the
GNUFE_PREFIX_OVERRIDE. This will ensure that a modified prefix in Makefile.in
will propagate down to the gcc Makefiles.
dreachem 65d 15h /
4051 [CAF] fix return type for Is_LCB and Was_Visisted in coarray_lower.cxx

Return type for these macros should be void * instead of BOOL.
dreachem 65d 23h /
4050 [CAF] fixes for non-blocking read and subsequent sync

This fixes some problems with the non-blocking read support. The runtime now
exposes blocking and non-blocking read interfaces. All non-blocking reads will
return a handle which can be sync'd on with __coarray_sync. If the handle is
NULL, then a __coarray_sync on it will simply return.

Generated __coarray_sync is now conditionally executed. The call to
__coarray_sync is made only if the handle argument is not NULL. If the
argument is NULL, that indicates that there is nothing to wait on. This
reduces overhead for unnecessary syncs on NULL handles. Also, we add a simple
change to make the check for outstanding, potentially overlapping
communication less expensive. It will first check that there exists
outstanding PUTs/GETs before making any function calls.
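
The shape of the conditionally executed sync can be sketched as below. The real signature of __coarray_sync is not given in this log, so `coarray_sync_stub` is a stand-in with an assumed signature, and `sync_calls` exists only to make the effect visible.

```c
#include <stddef.h>

int sync_calls = 0;   /* counts stub invocations, for illustration only */

/* Stand-in for the runtime's __coarray_sync; the signature is an assumption. */
void coarray_sync_stub(void *handle)
{
    (void)handle;
    sync_calls++;
}

/* Generated-code shape: call into the runtime only when there is something
 * to wait on, so a NULL handle costs a single comparison. */
void sync_if_pending(void *handle)
{
    if (handle != NULL)
        coarray_sync_stub(handle);
}
```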

This commit also fixes the generation of remote reads in Coarray_Lower so that
it uses the non-blocking interface correctly. Also, an LCB is not used in
Coarray_Lower if it's a direct remote read into the LHS (single term on RHS).

Contributor: Deepak
dreachem 66d 00h /
4049 export __omp_collector_api symbol for libopenmp

The symbol __omp_collector_api should be exported to allow an OpenUH compiled
OpenMP program to run with TAU.
dreachem 66d 00h /
4048 fix diagnostic phase name string copy for W2F

For W2F, the phase name (path to 'be' executable) is copied into a string
Diag_Phase_Name using strcpy. However, if it exceeded 80 characters, it would
overrun the string and write into the subsequent variable Diag_File. This
caused a seg fault later on, when the compiler erroneously calls fclose on
Diag_File. To fix, simply check if the phase name exceeds the allowable number
of characters (80), and if so we truncate it. A warning is also printed to
stderr in this case.
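
A bounded copy along the lines described above might look like this sketch. The helper name `copy_phase_name` is hypothetical, and the sketch assumes a destination buffer that holds 80 characters plus the terminating NUL, which the log does not spell out.

```c
#include <stdio.h>
#include <string.h>

#define PHASE_NAME_MAX 80  /* limit quoted in the log message */

/* Hypothetical helper: copies name into dst, truncating to PHASE_NAME_MAX
 * characters and warning on stderr if it was too long.
 * Returns 1 if the name was truncated, 0 otherwise. */
int copy_phase_name(char dst[PHASE_NAME_MAX + 1], const char *name)
{
    size_t n = strlen(name);
    int truncated = n > PHASE_NAME_MAX;
    if (truncated) {
        fprintf(stderr, "warning: phase name truncated to %d characters\n",
                PHASE_NAME_MAX);
        n = PHASE_NAME_MAX;
    }
    memcpy(dst, name, n);
    dst[n] = '\0';   /* always terminate, so nothing past dst is written */
    return truncated;
}
```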
dreachem 66d 00h /
4047 [CAF] adds defer_sync/sync directives + other fixes

We changed the runtime to use non-blocking get and put by default. So, without
optimization the compiler will insert a "sync" after every coarray read
operation to ensure completion. To give the programmer control over when the
image will sync on completion, the compiler now supports defer_sync and sync
directives. The defer_sync directive will effectively prevent the compiler
from inserting the subsequent sync, allowing the programmer to manually manage
communication-computation overlap.

Optionally, the programmer can specify a "sync" variable with the defer_sync
directive. The type for this variable should be the same word length as the
machine address (usually, 8 bytes).

Example:

! using the defer_sync directive in combination with sync memory, the
! programmer can manually overlap communication and computation.

!dir$ defer_sync(s1)
a1 = x1[p]
!dir$ defer_sync(s2)
a2 = x2[p]
!dir$ defer_sync(s3)
a3 = x3[p]
!dir$ sync(s1) ! wait specifically on the first read to complete
!dir$ sync ! wait on all pending reads to complete
y = a1 + a2 + a3

Runtime will automatically generate non-blocking read and write. In case of
writes, it is assumed that the source buffer is LCB, unless a call to
comm_write_unbuffered is made. LCBs for writes are freed once the write has
completed (user code doesn't need to explicitly call __release_lcb for it).

Other fixes and improvements made:

- Fix computation of local strides when using LCB and other fixes. The
compiler was generating wrong values for the local strides array when a
local communication buffer (LCB) is used. We now set it based on the count
array values: count(1), count(1)*count(2), count(1)*count(2)*..., etc.

- Fix for coindexed MSTORE

- Don't use ARMCI lock when doing atomic update in sync images. We were using
ARMCI lock for some reason when doing an atomic RMW update in the sync
images implementation for ARMCI. This shouldn't be necessary.

- Change LAUNCHER to UHCAF_LAUNCHER in cafrun. The cafrun script now uses the
environment variable UHCAF_LAUNCHER instead of LAUNCHER. Also,
UHCAF_LAUNCHER_OPTS instead of LAUNCHER_OPTS.

- Allow single element remote read without LCB. Also don't use LCB for single
element remote read accesses. Before, the compiler generated a temporary
buffer for accesses of the form: Y = A[i], where Y is a scalar or array
reference that is not an array section. This wasn't strictly necessary, and
it prevented the use of the defer_sync directive, which could be useful for
such statements. Now, we avoid the use of the intermediate LCB.
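
The local stride rule from the first item in the list above (count(1), count(1)*count(2), and so on) is a running product over the count array. The sketch below uses 0-based C arrays and a hypothetical function name; how many stride entries the runtime actually needs is not stated in the log, so the caller chooses `nstrides`.

```c
#include <stddef.h>

/* Hypothetical helper: strides[i] = count[0] * count[1] * ... * count[i]
 * for i in [0, nstrides). This matches a packed (contiguous) local buffer. */
void lcb_local_strides(const size_t *count, size_t *strides, int nstrides)
{
    size_t product = 1;
    for (int i = 0; i < nstrides; i++) {
        product *= count[i];
        strides[i] = product;
    }
}
```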

Contributor: Deepak
dreachem 66d 00h /
4046 fixing configure script

Added a bad configure script in last commit, so fixing it.
dreachem 74d 07h /
4045 tweaks to barrier and task pool implementation

The existing barrier implementation, in which waiting threads would
continuously check the tasks queue for work, had high overhead. This was
especially noticeable when oversubscribing the system with threads. We now use
pthread_cond_timedwait instead of pthread_cond_wait when a thread enters a
wait state. This ensures that threads will exit the wait state every 5
microseconds in case they missed a wakeup signal.
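
Since pthread_cond_timedwait takes an absolute deadline rather than a duration, a wait of this kind needs the current time plus 5 microseconds, normalized so that the nanosecond field stays below one second. The helper below is a sketch of that step only; `wait_deadline` and `WAIT_TIMEOUT_NS` are hypothetical names.

```c
#include <time.h>

#define WAIT_TIMEOUT_NS 5000L   /* 5 microseconds, per the log message */

/* Hypothetical helper: returns now + WAIT_TIMEOUT_NS as a normalized
 * timespec, suitable as the deadline argument of pthread_cond_timedwait. */
struct timespec wait_deadline(struct timespec now)
{
    struct timespec t = now;
    t.tv_nsec += WAIT_TIMEOUT_NS;
    if (t.tv_nsec >= 1000000000L) {   /* carry into the seconds field */
        t.tv_sec += 1;
        t.tv_nsec -= 1000000000L;
    }
    return t;
}
```

The `now` value would normally come from clock_gettime with the clock the condition variable uses (CLOCK_REALTIME by default).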

Also, modified how the default task pool implementation does work stealing. If
a thread attempts to remove a task from a non-empty queue and is unsuccessful,
it will return NULL rather than continue on to the next queue. This was done
to reduce overall contention on the task queues, especially when there are a
lot of threads in the team.

Contributor: Deepak
dreachem 77d 22h /
4044 open64 merge r3826:4037

This merges several months' worth of commits from Open64 5.x main trunk into
OpenUH. This spans numerous bug fixes in LNO, WOPT, and CG, as well as several
new features. Included is a patch for x6-ppc32 cross compiling support, CGSSA
implementation, auto-detection of CPU on the host system for a more accurate
target architecture, correction of ipa-link arguments (for debugging),
allowing debugging of sources compiled from different directory, supporting
variable length arrays in structs, removing some unused PROMPF implementation
code, updates to CG scheduler, removal of unused GCC 4.0 C/C++ front-end, a
new perl version of kopencc script, some misc. transformations to enable
various loop optimizations, fixing build errors when compiling with gcc 4.7,
and a configure-based build system for PPC native compiler.

The removal of PROMPF is something we may try to reverse in the future,
because some analyses implemented in OpenUH several years ago used it.

Also, I reversed a previous fix done in OpenUH for debugging codes compiled
outside its source directory, because this has been fixed in a different way
in this merge.

Open64 bugs fixed by this merge are: 362, 778-779, 783, 785, 787, 798, 830,
889, 891, 897, 903, 908, 910-912, 924, 929, 933-934, 938-941, 943-944,
947-952, 954-955, 962-963, 966-967, 969, 970-972, 978, 999, 1003-1004, 1800
(from Open64 bugzilla).

See the open64 (main trunk) commit log for more details.

Updating patch level to 3.0.27.
dreachem 78d 14h /
4043 OMP runtime: remove busy wait loops and fixes for nested par regions

* Remove busy wait loops. Busy wait was used while threads were waiting at a
barrier for other threads to reach and for the tasks queues to empty. A busy
wait was also used for the implicit barrier at the end of a nested parallel
region, and also while slave threads in nested parallel regions are waiting
for the master thread's signal to start executing. All of this resulted in
high CPU contention, particularly when over-subscribing threads and when
generating nested parallel regions. Now, a condition variable + mutex is
used instead, which frees up the CPU to do useful work while the threads are
waiting for their signal to proceed.

* Fix call to __ompc_set_state at end of nested par region. The call to the
__ompc_set_state (used for collector interface) at the end of a nested
parallel region should occur after the original thread ID
(__omp_current_v_thread) has been restored. This wasn't happening, and as a
result it was trying to set the state for a stale thread.

* Fix thread stack memory leak. Before, the runtime was creating thread stack
manually using the pthread_attr_setstack call. However, for nested regions
it wasn't freeing the memory allocated for these stacks. So now, instead we
use pthread_attr_setstacksize. At the end of nested parallel regions, the
master thread will call pthread_detach on all the nested slave threads in
order to properly free its memory.

* Also, we are no longer using a hash table to store the u_threads for threads
spawned in a nested parallel region. It appears that the hash table was
getting corrupted somehow. The hash table really isn't necessary anymore
since we're storing the current v_thread in __omp_current_v_thread, and the
current u_thread is available through __omp_current_v_thread->executor.

Updating patch level to 3.0.26

Contributor: Deepak
dreachem 196d 07h /
4042 Clean up FFE checks of OMP directives and MP-lowering warnings

* Fix semantic checking of OMP directive placement (Fortran). This refines
the way in which the Fortran front-end checks the placement of OMP directives
with respect to each other. In particular, more checks are added to prevent
work-sharing constructs from being closely nested within each other or within
explicit tasks. This also accounts for nested directives to more accurately
track what the current "directive state" is.

* Fix MP-lowering warnings for writing to shared variables. The MP-lowering
phase was giving spurious warnings about writing to variables in a parallel
region that had implicit shared scope. Fixed it so that warnings aren't given
for writes in a single or critical region. Same for shared variables in a task
region that occurs lexically within a master, single, or critical section.

Contributor: Deepak
dreachem 202d 23h /
4041 prevent potentially unsafe if-conversion optimization

Open64/OpenUH will perform an if-conversion optimization in WOPT that
converts an if statement with single store statement into a SELECT
instruction. The SELECT instruction is of the form:

x = (cond)? y1 : y2

This optimization can potentially result in a data-race when applied in the
context of a parallel region.

For example, consider the following statement to be executed in a parallel
region, where x is shared and j has the value of the thread id:

if (j == 0) x = y;

WOPT may convert it to:
x = (j == 0)? y : x; // if statement becomes a SELECT statement.

This introduces a data race if, for example, thread 0 and thread 1 attempt to
execute it simultaneously.

We fix it (at least partially) by adding an extra condition in CFG:Screen_cand
function, which screens candidates for if-conversion. If it's a store to a
variable that is declared in a scope outside the current PU (potentially
shared), and there is either an empty if-block or an empty else-block, we
don't allow if-conversion.
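
Why the converted form is unsafe can be seen by counting stores. The sketch below is a sequential model with hypothetical helper names; it only shows that the SELECT form stores to x on every path, which is what lets a thread with j != 0 write a stale value over another thread's update in a real parallel region.

```c
int x = 0;          /* stands in for the shared variable */
int x_writes = 0;   /* counts stores to x, for illustration only */

void store_x(int v) { x = v; x_writes++; }

/* Original form: only the caller with j == 0 stores to x. */
void original_form(int j, int y)
{
    if (j == 0)
        store_x(y);
}

/* If-converted form: every caller stores to x; when j != 0 it writes back
 * the value of x it just read. */
void if_converted_form(int j, int y)
{
    store_x(j == 0 ? y : x);
}
```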

Contributor: Deepak
dreachem 204d 09h /
4040 Several OpenMP fixes for Fortran

This commit adds several fixes for the Fortran OpenMP implementation.

* Fix support for OMP WORKSHARE construct. The implementation for the
workshare construct was broken. This fixes it by ensuring only one thread
executes the work units within it, except for intrinsics which are converted
into parallelized reduction loops or array assignments which are converted
into parallelized do loops. The fix also allows for orphaned workshare
construct and the parallel omp workshare construct. New runtime APIs were
added for selecting which thread will execute the work units, similar to the
routines used for the single construct.

* Fix handling of task construct in front-end. The entry for the
Open_MP_End_Task_Stmt in the stmt_in_blk array (p_driver.h, Fortran FE) was
incorrect. This caused the !$omp end task statement to fail in the front-end
(a problem since it was added in 3.0.13 of the compiler). The entry needed to
exclude Open_MP_Task_Blk and Open_MP_Workshare_Blk as blocks that it can
appear in. Also, some fixes are included to allow nested task constructs and also
to give an error if a task construct occurs within a workshare construct.

* Consider scope for enclosing task construct for loop index variables. When
choosing the default scope for sequential loop index variables (Fortran)
occurring in a parallel region, the compiler should consider the scope for the
variable in the innermost parallel or task construct. However, it was only
considering the scope in the innermost parallel construct. This was fixed by
adding a case for WN_PRAGMA_TASK_BEGIN in Pragmas_For_Par_Region in
omp_lower.cxx.

* Use old Gen_MP_Copyin for Fortran in wn_mp.cxx. Last year, the Gen_MP_Copyin
routine was modified for supporting a TLS implementation of threadprivate.
However, Fortran threadprivate is still done the old way (without using TLS).
This caused the copyin clause to not work for Fortran programs. So now, we
have a separate Gen_MP_Copyin_TLS which will be used if the source language of
the PU being compiled is not Fortran, otherwise the original Gen_MP_Copyin is
used.

Updating patch level to 3.0.25

Contributor: Deepak
dreachem 204d 09h /
