Introduction to OpenMP Data Structure: kmp_base_info

In order to understand how OpenMP works internally, in addition to the essential part of functions, we also need to learn some basic data structures. This series of posts are going to give you a basic impression on some OpenMP essential data structures in runtime library. Let’s get started from the most basic and also one of the most complicated one: kmp_base_info.

typedef struct kmp_base_info {
  kmp_desc_t th_info;
  kmp_team_p *th_team; /* team we belong to */
  kmp_root_p *th_root; /* pointer to root of task hierarchy */
  kmp_info_p *th_next_pool; /* next available thread in the pool */
  kmp_disp_t *th_dispatch; /* thread's dispatch data */
  int th_in_pool; /* in thread pool (32 bits for TCR/TCW) */

  int th_team_nproc; /* number of threads in a team */
  kmp_info_p *th_team_master; /* the team's master thread */
  int th_team_serialized; /* team is serialized */
  microtask_t th_teams_microtask; /* save entry address for teams construct */
  int th_teams_level; /* save initial level of teams construct */

  kmp_uint64 th_team_bt_intervals;
  omp_allocator_handle_t th_def_allocator; /* default allocator */
  /* The data set by the master at reinit, then R/W by the worker */
  int th_set_nproc; /* if > 0, then only use this request for the next fork */
  kmp_hot_team_ptr_t *th_hot_teams; /* array of hot teams */
      th_set_proc_bind; /* if != proc_bind_default, use request for next fork */
      th_teams_size; /* number of teams/threads in teams construct */
  int th_prev_level; /* previous level for affinity format */
  int th_prev_num_threads; /* previous num_threads for affinity format */
  kmp_local_t th_local;
  struct private_common *th_pri_head;

  kmp_team_p *th_serial_team; /*serialized team held in reserve*/
  /* The following are also read by the master during reinit */
  struct common_table *th_pri_common;
  volatile kmp_uint32 th_spin_here; /* thread-local location for spinning */
  /* while awaiting queuing lock acquire */
  volatile void *th_sleep_loc; // this points at a kmp_flag
  ident_t *th_ident;
  unsigned th_x; // Random number generator data
  unsigned th_a; // Random number generator data

  /* Tasking-related data for the thread */
  kmp_task_team_t *th_task_team; // Task team struct
  kmp_taskdata_t *th_current_task; // Innermost Task being executed
  kmp_uint8 th_task_state; // alternating 0/1 for task team identification
  kmp_uint8 *th_task_state_memo_stack; // Stack holding memos of th_task_state
  // at nested levels
  kmp_uint32 th_task_state_top; // Top element of th_task_state_memo_stack
  kmp_uint32 th_task_state_stack_sz; // Size of th_task_state_memo_stack
  kmp_uint32 th_reap_state; // Non-zero indicates thread is not
  // tasking, thus safe to reap

  /* More stuff for keeping track of active/sleeping threads (this part is
     written by the worker thread) */
  kmp_uint8 th_active_in_pool; // included in count of #active threads in pool
  int th_active; // ! sleeping; 32 bits for TCR/TCW
  struct cons_header *th_cons; // used for consistency check

  /* Add the syncronizing data which is cache aligned and padded. */
  kmp_balign_t th_bar[bs_last_barrier];

  volatile kmp_int32
      th_next_waiting; /* gtid+1 of next thread on lock wait queue, 0 if none */
  kmp_cond_align_t th_suspend_cv;
  kmp_mutex_align_t th_suspend_mx;
  std::atomic th_suspend_init_count;
  std::atomic th_blocking;
  kmp_cg_root_t *th_cg_roots; // list of cg_roots associated with this thread
} kmp_base_info_t;

In order to make it more understandable, I’ve removed some unimportant fields controlled by macros, like OMPT. I also remove OS dependent part, only focusing on Linux. However, they just have different types on different OS. It does not affect the essential concept.

(To be continued…)

Internal OpenMP: How parallel directive works

Let’s start with parallel directive, which is maybe the most widely used directive in OpenMP. In this post, I’ll use the following basic example to demonstrate how OpenMP library internals works.

void func() {
  int i = 0, j = 1;
#pragma omp parallel

After compilation, it is transformed to following code snippet by compiler (well, different compiler may generate code with a little different style but essentially they are same):

void omp_outlined(int gtid, int btid, int *a, int *b) {
void func() {
  int i = 0, j = 1;
  ident_t loc;
  // some operation on loc
  // ...
  __kmpc_fork_call(loc, 2, omp_outlined, &i, &j);

Since our focus is not on front end so we just briefly introduce what is done by compiler. The compiler will take all statements in the parallel region and outline them into a new function, omp_outlined in this case. This function is actually of type void(int, int, ...) where the first argument is global thread id, and the second one is bound thread id. We’ll introduce them later. The rest of arguments are actually variadic standing for all variables captured by the parallel region. In the example above, the integers i and j are used in the parallel region so they’re captured. In the outlined function, the statements are pretty simple which is just to operate the two variables using each pointer. What’s more, the compiler also creates a new variable of type ident_t describing a source location. It is mainly for debug. After that, a function call to __kmpc_fork_call is generated, and that’s almost it.

Although the example above is pretty simple, it actually shows the essence of how OpenMP compiler deals with OpenMP directive, that is, outlines parallel regions into dedicated functions, and then emits function calls correspondingly. It is actually pretty good for feature developments because it hides all details into the runtime function in case of breaking compilation. However, it also prevents classic compiler optimizations like constant propagation, etc. We’ll not detail them here but we will come back to this topic in another series about how OpenMP compiler works.

Alright, back to our topic.

(To be continued…)

Internal OpenMP: Understand How OpenMP Works Step by Step

If you have seen the page "About Me", you will know my research interest is LLVM and OpenMP. Basically, the main part is to optimize OpenMP to improve its performance, well, more about offloading to GPU. The reason that LLVM is also mentioned is that my vision is to introduce more information into runtime library so that more optimization can be performed during runtime which requires some works in compiler. I’ve been working on OpenMP for about half a year. There’re plenty of awesome tutorials and books to show you how to write OpenMP correctly and efficiently, but there are seldom articles to tell you how OpenMP works underneath. From my perspective, it is not very friendly to friends who want to contribute to OpenMP. In order to help me understand things better, I’d like to write a series of blogs about how OpenMP works internally.

Honestly, I’m still a newbie so the organization of this series may not be the best way. Plus, some statements may not be correct. So I sincerely welcome any comment. I’ll start with what I’m working on that is related to parallel region, task and so forth. Later as I understand the whole structure more deeply, I’ll update previous posts correspondingly if anything is wrong or partial.

003 Longest Substring Without Repeating Characters


Given a string, find the length of the longest substring without repeating characters.

Example 1:

Input: "abcabcbb"

Output: 3

Explanation: The answer is "abc", with the length of 3.

Example 2:

Input: "bbbbb"

Output: 1

Explanation: The answer is "b", with the length of 1.

Example 3:

Input: "pwwkew"

Output: 3

Explanation: The answer is "wke", with the length of 3.

Note that the answer must be a substring, "pwke" is a subsequence and not a substring.


The most intuitive way is to save each character and its index into a map. Every time it encounters a repeating character, let’s say c, remove all characters before c and their indices from the map. Besides, the max length should also be stored and updated accordingly. Well, it does work, but not perfectly. The only reason is we operate each character twice. In the worst case, the time complexity is O(n^2).

So do we really need to check whether a character is in the map so that we could know that whether the character is repeating? Let’s change our mind a little bit. Our previous thought is to check the map to see whether the character is in the substring. That’s why we need to remove all characters that are not in the substring from the map. What if we set a variable s representating the start of substring? In this way, as long as the index is less than s, the corresponding character is not in the substring. We don’t need to perform deletion any more. We only check each character once!


class Solution {
  int quick_max(int x, int y) { return x ^ ((x ^ y) & -(x < y)); }

  int lengthOfLongestSubstring(string s) {
    vector<int> map(256, -1);
    auto max_len = 0, start = -1;

    for (int i = 0; i < s.size(); ++i) {
      if (map[s[i]] > start) {
        start = map[s[i]];

      map[s[i]] = i;
      max_len = quick_max(max_len, i - start);

    return max_len;

There’s a tricky function quick_max to compare two integers and return the larger one. It is bitwise operation. std::max uses conditional statement which impacts pipeline very much. I generally avoid using conditional statement as much as possible to gain better performance. In this case, the comparison happens in every iteration, we do need to use a much more efficient way to make it. This small change greatly improves performance. The running time decreases from 16ms to 8ms. 1X speedup!

001 Two Sum

From now on, I’d like to solve all algorithm questions on Leetcode one by one to help me pass the job hunting interview in the future, no matter for internship or regular job. 🙂 I also pushed my solution on GitHub.

Let’s get started from the first question: Two Sum.


Given an array of integers, return indices of the two numbers such that they add up to a specific target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.


Given nums = [2, 7, 11, 15], target = 9,
Because nums[0] + nums[1] = 2 + 7 = 9,
return [0, 1].


The main thought is to traverse the whole vector. For each element, check whether the subtrahend exists in the vector. If yes, then return indices of current element and subtrahend. As a result, we need to maintain a structure that could provide us both the value and its index with a good performance. unordered_map is a best choice here.

We’ve not considered another case where the subtrahend doesn’t exist for the moment. What should we do? Think about it, for example, a + b = c where c is our target. Now we get a and check whether b exists. It doesn’t. In order to let us find a when we reach b, we should store a and its index. That’s it. In this way, we can outline the whole traverse.


class Solution {
  vector<int> twoSum(vector<int> &nums, int target) {
    unordered_map<int, int> map;

    for (auto i = 0; i < nums.size(); ++i) {
      const auto key = target - nums[i];

      if (map.find(key) != map.end()) {
        return vector<int>({i, map[key]});

      map[nums[i]] = i;

    return vector<int>();

Runtime: 4 ms, faster than 99.70% of C++ online submissions for Two Sum. Memory Usage: 10.1 MB, less than 55.86% of C++ online submissions for Two Sum.


Three years and four months ago, I submitted a solution with same idea but different code. Its running time is 16ms. Wow. Totally identical idea! This could demonstrate how important writing code efficiently. 🙂 Here is the old solution:

class Solution {
    vector<int> twoSum(vector<int>& nums, int target) {
        vector<int> ret;
        if (nums.size() == 0) {
            return ret;
        unordered_map<int, size_t> hash;
        for (int i = 0; i < nums.size(); ++i) {
            int left = target - nums[i];
            if (hash.find(left) != hash.end()) {
                return ret;
            } else {
                hash[nums[i]] = i;
        return ret;

OpenMP Learning Notes

  1. In C/C++, OpenMP directives are specified by using the #pragma mechanism provided by the C and C++ standards.
  2. OpenMP directives for C/C++ are specified with #pragma directives. The syntax of an OpenMP directive is as follows:
    #pragma omp directive-name [clause[ [,] clause] ... ] new-line
    Each directive starts with #pragma omp. The remainder of the directive follows the conventions of the C and C++ standards for compiler directives. In particular, white space can be used before and after the #, and sometimes white space must be used to separate the words in a directive. Some OpenMP directives may be composed of consecutive #pragma directives if specified in their syntax.
  3. Preprocessing tokens following #pragma omp are subject to macro replacement.
  4. Directives are case-sensitive. Each of the expressions used in the OpenMP syntax inside of the clauses must be a valid assignment-expression of the base language unless otherwise specified.
  5. Directives may not appear in constexpr functions or in constant expressions. Variadic parameter packs cannot be expanded into a directive or its clauses except as part of an expression argument to be evaluated by the base language, such as into a function call inside an if clause.
  6. Only one directive-name can be specified per directive (note that this includes combined directives). The order in which clauses appear on directives is not significant. Clauses on directives may be repeated as needed, subject to the restrictions listed in the description of each clause.
  7. Some clauses accept a list, an extended-list, or a locator-list.
    • A list consists of a comma-separated collection of one or more list items. A list item is a variable or an array section.
    • An extended-list consists of a comma-separated collection of one or more extended list items. An extended list item is a list item or a function name.
    • A locator-list consists of a comma-separated collection of one or more locator list items. A locator list item is any lvalue expression, including variables, or an array section.
  8. Some executable directives include a structured block. A structured block:
    • may contain infinite loops where the point of exit is never reached;
    • may halt due to an IEEE exception;
    • may contain calls to exit(), _Exit(), quick_exit(), abort() or functions with a _Noreturn specifier (in C) or a noreturn attribute (in C/C++);
    • may be an expression statement, iteration statement, selection statement, or try block, provided that the corresponding compound statement obtained by enclosing it in { and } would be a structured block.
  9. Stand-alone directives do not have any associated executable user code. Instead, they represent executable statements that typically do not have succinct equivalent statements in the base language. There are some restrictions on the placement of a stand-alone directive within a program. A stand-alone directive may be placed only at a point where a base language executable statement is allowed. A stand-alone directive may not be used in place of the statement following an if, while, do, switch, or label.
  10. In implementations that support a preprocessor, the _OPENMP macro name is defined to have the decimal value yyyymm where yyyy and mm are the year and month designations of the version of the OpenMP API that the implementation supports. If a #define or a #undef preprocessing directive in user code defines or undefines the _OPENMP macro name, the behavior is unspecified.

Compiler Techniques/Terminologies

Tail call

In computer science, a tail call is a subroutine call performed as the final action of a procedure. If a tail call might lead to the same subroutine being called again later in the call chain, the subroutine is said to be tail-recursive, which is a special case of recursion.

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail call elimination. Tail call elimination allows procedure calls in tail position to be implemented as efficiently as goto statements, thus allowing efficient structured programming. In the words of Guy L. Steele, "in general, procedure calls may be usefully thought of as GOTO statements which also pass parameters, and can be uniformly coded as [machine code] JUMP instructions."

Call site

In programming, a call site of a function or subroutine is the location (line of code) where the function is called (or may be called, through dynamic dispatch). A call site is where zero or more arguments are passed to the function, and zero or more return values are received.

// this is a function definition
function sqr(x) {
  return x * x;

function foo() {
  // these are two call sites of function sqr in this function
  a = sqr(b);
  c = sqr(b);

Constant folding

Constant folding is the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime. Terms in constant expressions are typically simple literals, such as the integer literal, but they may also be variables whose values are known at compile time. Consider the statement:

int i = 320 * 200 * 32;

Most compilers would not actually generate two multiply instructions and a store for this statement. Instead, they identify constructs such as these and substitute the computed values at compile time (in this case, 2,048,000).

Constant folding can make use of arithmetic identities. If x is numeric, the value of 0 * x is zero even if the compiler does not know the value of x (note that this is not valid for IEEE floats since x could be Infinity or NotANumber. Still, some languages favoring performance like GLSL allows this for constants, which can occasionally cause bugs).

Constant folding may apply to more than just numbers. Concatenation of string literals and constant strings can be constant folded. Code such as "abc" + "def" may be replaced with "abcdef".

Constant propagation

Constant propagation is the process of substituting the values of known constants in expressions at compile time. Such constants include those defined above, as well as intrinsic functions applied to constant values. Consider the following pseudocode:

int x = 14;
int y = 7 - x / 2;
return y * (28 / x + 2);

Propagating x yields:

int x = 14;
int y = 7 - 14 / 2;
return y * (28 / 14 + 2);

Continuing to propagate yields the following (which would likely be further optimized by dead code elimination of both x and y.)

int x = 14;
int y = 0;
return 0;

Constant propagation is implemented in compilers using reaching definition analysis results. If all a variable’s reaching definitions are the same assignment which assigns a same constant to the variable, then the variable has a constant value and can be replaced with the constant.

Constant propagation can also cause conditional branches to simplify to one or more unconditional statements, when the conditional expression can be evaluated to true or false at compile time to determine the only possible outcome.

Reaching definition

In compiler theory, a reaching definition for a given instruction is an earlier instruction whose target variable can reach (be assigned to) the given one without an intervening assignment. For example, in the following code:

d1 : y := 3
d2 : x := y

d1 is a reaching definition for d2. In the following, example, however:

d1 : y := 3
d2 : y := 4
d3 : x := y

d1 is no longer a reaching definition for d3, because d2 kills its reach: the value defined in d1 is no longer available and cannot reach d3.

The similarly named reaching definitions is a data-flow analysis which statically determines which definitions may reach a given point in the code. Because of its simplicity, it is often used as the canonical example of a data-flow analysis in textbooks. The data-flow confluence operator used is set union, and the analysis is forward flow. Reaching definitions are used to compute use-def chains.

Peephole optimization

Peephole optimization is an optimization technique performed on a small set of instructions in a segment of assembly-language code, known as the peephole or window. Peephole optimization involves changes to individual assembly-language instructions, such as eliminating redundant code, replacing slower instructions with faster ones, optimizing flow control, and performing algebraic simplification.

Common techniques applied in peephole optimization:

  • Null sequences – Delete useless operations.
  • Combine operations – Replace several operations with one equivalent.
  • Algebraic laws – Use algebraic laws to simplify or reorder instructions.
  • Special case instructions – Use instructions designed for special operand cases.
  • Address mode operations – Use address modes to simplify code. There can be other types of peephole optimizations.

Value numbering

Value numbering is a technique of determining when two computations in a program are equivalent and eliminating one of them with a semantics preserving optimization.

Global value numbering

Global value numbering (GVN) is a compiler optimization based on the static single assignment form (SSA) intermediate representation. It sometimes helps eliminate redundant code that common subexpression elimination (CSE) does not. At the same time, however, CSE may eliminate code that GVN does not, so both are often found in modern compilers. Global value numbering is distinct from local value numbering in that the value-number mappings hold across basic block boundaries as well, and different algorithms are used to compute the mappings.

Global value numbering works by assigning a value number to variables and expressions. The same value number is assigned to those variables and expressions which are provably equivalent. For instance, in the following code:

Common subexpression elimination

In compiler theory, common subexpression elimination (CSE) is a compiler optimization that searches for instances of identical expressions (i.e., they all evaluate to the same value), and analyzes whether it is worthwhile replacing them with a single variable holding the computed value.


In the following code:

a = b * c + g;
d = b * c * e;

It may be worth transforming the code to:

tmp = b * c;
a = tmp + g;
d = tmp * e;

If the cost of storing and retrieving tmp is less than the cost of calculating b * c an extra time.


The possibility to perform CSE is based on available expression analysis (a data flow analysis). An expression b * c is available at a point p in a program if:

  • every path from the initial node to p evaluates b * c before reaching p,
  • and there are no assignments to b or c after the evaluation but before p.

The cost/benefit analysis performed by an optimizer will calculate whether the cost of the store to tmp is less than the cost of the multiplication; in practice other factors such as which values are held in which registers are also significant.

Compiler writers distinguish two kinds of CSE:

  • local common subexpression elimination works within a single basic block
  • global common subexpression elimination works on an entire procedure

Both kinds rely on data flow analysis of which expressions are available at which points in a program.