Lecture Notes on Scientific Computing
with a focus on developing research software for the physical sciences

Ramses van Zon    Marcelo Ponce
February 25, 2020
Contents

Preface

These notes are still under development. Contact us if you notice typos or errors of any kind or if you have suggestions for improvement.

These lecture notes are for the graduate course “PHY1610H Scientific Computing for Physicists” at the Department of Physics at the University of Toronto. This course, which has been taught every winter term since 2016, is intended for graduate students in the physical and engineering sciences, who want to or need to write and maintain research or engineering code for which performance, reproducibility, and responsibility are important.

Although part of the Physics graduate program at the University of Toronto, the course originated within another department, the SciNet High Performance Computing Consortium. The research support analysts at SciNet have scientific research backgrounds, and have had personal experience with the following frustrating path in computational research projects: a vague idea of what needs to be coded, diving head-first into programming a quick solution, followed by a long struggle to get the bugs out, to get everything right, and to add features, and new bugs. The result of such an approach is rarely a quality piece of software. It is often unknown whether it works, or it requires magical incantations and instructions known only to a select set of people, who may leave the group or the field. The training offered by the SciNet HPC Consortium (which can be found online at https://courses.scinet.utoronto.ca) is an effort to alleviate these pains, and to try to show a set of best practices and techniques that will avoid most of this mess. By partnering with other departments at the university, some of the training offerings of SciNet turned into graduate courses in those departments, with SciNet analysts as the courses’ lecturers. “Scientific Computing for Physicists” is one of those courses.

We started writing up these notes to collect and expand the material taught in the physics course. The material is primarily focused on scientific programming and aimed at developers of research software. But many of the techniques covered here have a much wider application than the originally targeted fields, and much of the material is now also taught in graduate courses to students in the medical and biological sciences[CSEgap-inpress]; the wish for a common reference was another reason for writing these notes.

To be able to cover research software development at all scales, from small computations to large parallel codes running on supercomputers, we use C++ as the programming language. We chose C++ over Fortran because it allows these notes to be useful to those outside of academia and engineering, because it allows one to reason about performance, and because some high-performance computing techniques are based on C++ (the possibility of including both C++ and Fortran was considered, but was thought to be more confusing than illuminating). Being aware of the steeper learning curve that a compiled language, and particularly C++, can pose, we restrict ourselves to a subset of the language that is enough to utilize the other techniques covered here, like modular programming, calling libraries, optimization, and parallelization.

The students of the course are expected to have some coding experience, preferably in a compiled language, and to this end, the first few lectures of the course have always been an overview of the parts of C++ used in the course. While these notes contain a somewhat more expanded introduction to programming in C++ than the lectures, we cannot pretend that it is a proper introduction to programming for the non-programmer, nor that it is a proper treatment of programming in C++; rather, it just shows which parts of C++ will be used within these notes. This will allow the rest of the material to focus on the main topic, scientific computing.

These notes, and the course, also assume a basic familiarity with the Linux shell. For the uninitiated, they contain a very brief overview of how to use the Linux shell in the appendix. The choice for Linux simply stems from its role as a very common operating system on research computing facilities. For our purposes, we need very little from the Linux command line, except to start compilations, run codes, run makefiles, and a few other tools. Most of the techniques work unaltered on MacOS systems, and the principles carry over to Microsoft Windows.

The Linux shell also provides a uniform way to explain compilation, running code, debugging and other techniques in a way that works on many different computers. In particular, we will not have to explain how to compile, debug and run your code within Visual Studio, and again within XCode, within Eclipse, using Code::Blocks, or for every other possible IDE. Unfortunately, of course, that also means that we won’t be able to use any IDE-specific features.

We want to thank all our colleagues at SciNet for their support, encouragement and help, in particular the analysts involved in teaching: Dr. Erik Spence, Dr. Alexey Fedoseev, Dr. Bruno Mundim, and Dr. Scott Northrup, and SciNet’s CTO Dr. Daniel Gruner and its Scientific Director Prof. W. Richard Peltier. A special thanks goes out to Dr. Jonathan Dursi, a former SciNet analyst who was involved in many of the early SciNet teaching efforts.

Finally, we wish to thank the departments at the University of Toronto that we have partnered with to create courses: the Physics Department, the Institute of Medical Science, and the Department of Physical & Environmental Sciences at the University of Toronto at Scarborough, all of which have supported us for many years in our teaching endeavours.

Dr. Ramses van Zon
Dr. Marcelo Ponce
SciNet High Performance Computing Consortium, University of Toronto
Toronto, Canada, January 2020

Part 1 Introduction

1.1 What is Scientific Computing?

Computing is everywhere in our daily lives, and so also in scientific research, but scientific computing is more than “just” using a computer in research. As is the case with other foundational tools of research, such as mathematics and statistics, scientific computing is a discipline on its own. As a discipline, it combines the scientific method, computer science, applied mathematics, and software engineering, as well as domain-specific knowledge and statistics. The goal of scientific computing is to answer scientific questions, in particular ones that cannot be answered by other means such as experiment, mathematical derivations, or calculations by hand. We will present a few examples that rely on scientific computing below. Other terms used for scientific computing or its sub-disciplines are “advanced research computing”, “computational science”, “high performance computing”, and “data science”. In the literature, scientific computing is sometimes taken to mean computing in the physical sciences and engineering, but we will use the term scientific computing in a broader sense, to encompass all of these disciplines.

Research computing is done for a variety of reasons, such as having to do large data processing or data mining, interpreting experimental results, finding simpler models from more complex ones, creating models out of data (including deep learning), creating scientific visualizations, and investigating behaviour of models using computer simulation.

The use of computer simulations where theoretical or experimental approaches are not feasible is why scientific computing has sometimes been called the third leg of science. Regardless of whether such a denomination is useful, scientific computing is used by experiment and theory and can learn from best practices in both. However, scientific computing requires some additional knowledge and skills that are unique to computing, such as computer programming.

In addition to being able to solve a particular scientific problem numerically with a computer, there is often another important factor to scientific computing: efficiency. It is very common for a research project initially not to be constrained by efficiency, as the results of preliminary investigations may be computable within a reasonable time scale, but when the problem is ’scaled up’ to its actual size, it would take too long to complete. In those cases, gains in efficiency can be a science-enabling factor. This is the realm of high performance computing, in which compiled languages, parallelization, and hardware-specific programming play a central role.

As a rule, scientific research tries to answer new questions, which means that few of the applications in the next section can utilize existing, “canned-code” applications. Even those few that can use existing codes often require modifications to that code and recompilation. As a result, much of the practice of scientific computing has to do with programming, code development, software engineering, and collaborations (many custom codes are ’community codes’, maintained by a collaboration of different researchers or research groups). Apart from short introductions to programming, it is hard to find courses that teach you the required skills for creating and maintaining research software. The aim of these notes is to provide a path to becoming a more efficient computational scientist. This requires some upfront effort, resistance to the urge to just get something to show your boss, supervisor or peers, and some diligence to keep your projects in decent shape for your collaborators, your successors, or your future self.

Before proceeding to present some use cases of scientific computing, we should point out that some aspects of computing in research are not covered in as much detail in these lecture notes yet as they might deserve.

One such aspect is data movement and storage. In many cases, the ideal setup is one in which computation happens wherever the data is. The recent trend of ’edge computing’ is an example of this idea, but this has been a common practice in scientific computing for a long time. The tools used to get the data to the computation or the computation to the data are constantly changing, and are, to some extent, independent of the subsequent calculation; for this reason, we focus more on the computation than on data motion in these notes. Another aspect of research computing that we don’t cover particularly in depth is workflows, where various interdependent stages of computation are to be executed and coordinated. At this point in time, workflow software is also an area where the tools used are still highly variable. Yet another aspect is data-driven research; although the techniques covered in these notes are very applicable and relevant to data science, artificial intelligence and machine learning, these are not yet incorporated as their own topic here.

Finally, computing in research can take more forms than scientific computing, for example science portals, database services and other data repositories, grid computing, interactive visualization, virtual reality, etc. These are mostly applications of techniques that are not specific to science, and are already well covered by literature regarding those specific techniques.

1.2 Examples of Scientific Computing

It can be helpful to look at areas in which scientific computing is used to get an idea of what scientific computing entails for these applications.

1.2.1 Computational Fluid Dynamics (CFD)

This sub-field of scientific computing deals with techniques to compute the behaviour of fluids, i.e., gases, liquids and similar materials. CFD is used in climate science, weather simulations, aerospace engineering, astrophysics, medical science, and other disciplines that deal with fluids. Fluids are described by hydrodynamic equations which govern fields like pressure, temperature and concentration, which depend continuously on time and space. The computational approach to solving these partial differential equations requires a discretization of time and space.

1.2.2 Molecular Dynamics (MD) and N-Body Simulations

These refer to techniques to solve for the trajectories of particles, which could be atoms or molecules interacting through electromagnetic forces, in which case one speaks of Molecular Dynamics (also known as Molecular Mechanics), or stars in a galaxy interacting through gravity, in which case one calls it an N-body problem. Both involve many point particles whose motion in continuous time is described by a large set of coupled ordinary differential equations, so there is large overlap between their computational approaches, all of which require discretization of time and clever optimizations of the computation of the particle interactions. For MD, extensions of these techniques that take into account approximate quantum effects exist as well (Quantum Mechanics/Molecular Mechanics, or QMMM). N-body dynamics have their main application in astrophysics, while MD is used in physics, chemistry, material science, molecular biology and medical science.

1.2.3 Smooth Particle Hydrodynamics (SPH)

This is a technique to solve hydrodynamic equations without requiring a discretization of space. Instead, discrete elements, which are not point particles but do have specific positions, are used to approximate the hydrodynamic equations. This can be very useful in applications with large variations in densities that make grid approaches too expensive, such as where some parts of the physical domain are void of any substance, e.g. in ballistics, vulcanology and astrophysics.

1.2.4 Monte Carlo (MC) Simulations

‘Monte Carlo’ is the generic term for computations that use randomness. This could be done to simulate some physical phenomenon that has noise, like Brownian motion and Nyquist noise, to perform averages or integrals in systems with many degrees of freedom, to estimate a parameter’s distribution from data using Markov Chain Monte Carlo (MCMC), or to test a statistical model. This type of computation requires generating pseudo-random numbers and is heavily based on statistics and the mathematics of stochastic processes. It is used in condensed matter physics, chemistry and biochemistry, material science, but also in finance and other fields that have randomness in their models.

1.2.5 Computational Quantum Chemistry

In contrast to Molecular Dynamics, which uses classical-mechanics models with effective force fields between constituents that are constructed using experimental data, computational quantum chemistry is a collection of methods that take into account quantum mechanical effects of the constituents of molecules and atoms. The degree to which the quantum nature of the constituents is included can vary. For instance, treating only “outer” or “valence” electrons of atoms quantum mechanically leads to so-called semi-empirical methods, while ab initio methods and density functional theory (DFT) methods treat all the electrons equally. Ab initio methods deal with the wavefunction and are usually more accurate (but computationally more expensive) than DFT, in which the wavefunction is replaced with a density functional, which needs to be carefully constructed to incorporate the desired chemical effects. Even using high-performance computing techniques, quantum mechanical calculations are only feasible for relatively small molecules, at least until the era of general purpose quantum computers.

1.2.6 Bioinformatics

Life on a molecular scale is governed by many interacting substances, such as DNA, RNA, and proteins. These systems of molecules and their interactions influence many biological functions as well as diseases, but to understand how requires bioinformatics researchers to analyze and compare vast amounts of genomic, proteomic and other -omic data. These analyses have to be done computationally. The rate at which biological data is created is increasing faster than the rate at which it can be computationally analyzed, and as a result, computational biology is a thriving and ever-changing field. Two examples of bioinformatics computations are DNA alignment and DNA assembly. DNA, as you may well know, is a sequence of nucleotides (with only a choice of 4 possibilities for each nucleotide). It is the blueprint for making RNA and proteins. Although DNA is unique to each living individual, the variation of DNA within a species is very small, and biologists have been able to construct the generic DNA of many species. This reference DNA can be used to identify DNA and genes from new experimental samples. Finding the position in the reference DNA where a new sample fits in (or determining that it does not fit) is called alignment. This problem is less trivial than it looks, because experimental techniques aren’t entirely error-free and because DNA can undergo a variety of transformations, which the alignment algorithms need to be adapted for. The determination of the reference DNA is often done using a procedure called assembly. The DNA sequence data comes in as a large set of short fragments of nucleotide sub-sequences, without information on the order of the fragments in the full DNA sequence. The positions where the sub-sequences start and end vary, and the same fragment can be present in the data multiple times, which makes it possible to reconstruct the DNA by checking for overlaps of sequences, but this is a computationally intensive procedure.

1.2.7 Data Science and Machine Learning

Data science and machine learning are techniques with a wide range of applications that have gained a lot of momentum recently due to the fact that it is now possible to access and process vast amounts of data. This data processing often requires large-scale computations of a kind that is essentially scientific computing.

Data science is a very general denomination; any data-centric research can be said to be data science. For instance, when real data is assimilated into weather prediction calculations, that is data science. Virtually all of the bioinformatics computations are data science. Available data is growing in all fields, and eventually the moniker data science might become meaningless as all science will have become data science.

On the other hand, machine learning, a set of techniques often used in data science, is an application of scientific computing, and will remain a meaningful term. There are several modes of machine learning. The first is where the model underlying the data is well understood, and processing is relatively straightforward. For example, one can count how often the word “the” occurs in a text. The second is when the model is well understood but the parameters of the model are not. Most traditional regression and statistical methods fall in this category. A third is when the model itself is not quite known, and very generic dynamic models are used, such as “deep learning” with neural networks. A fourth mode is where there is no preset outcome for the data, so-called unsupervised learning. Machine learning can be applied to many different fields, sometimes even as a replacement for techniques mentioned above. It can sometimes uncover useful patterns in data that would otherwise not have been found, or would have been found only at a much larger computational cost.

1.3 What You Will Need

There are a few software requirements to be able to follow along with the examples in these notes and to apply the techniques in the exercises:

  • A Linux-like terminal using, preferably, the Bash shell. For Linux and MacOS users, your default terminal will work. For Microsoft Windows users, try MSYS, Cygwin, or, in Windows 10, the Windows Subsystem for Linux. See Appendix C for details on how to set this up.

  • A C++ compiler that supports the C++14 standard. The GNU Compiler Collection (GCC) contains a C++ compiler called g++ which uses C++14 by default starting from version 6 (at the time of writing, the latest version was already 9.2.0). However, any other C++14 compiler should do as well. See Appendix C for details on how to get these compilers.

  • The make program. This is an application that helps automate building software. make sometimes gets a bad reputation for being antiquated and awkward, but it is simpler than many modern and more complex build systems, and it is all we will need for the purpose of these lecture notes.

  • An editor. Choose what you want, as long as the editor can save files in plain text format (so not .doc, .odt, or .rtf, but plain ASCII or UTF-8 format).

  • Git. We’ll use this version control technology extensively.

To be able to use the parallel programming sections, you’ll need access to a computer with multiple cores (which is standard these days), and, to truly scale up, you’ll need access to a cluster. If you are an academic researcher, there is a good chance that you can get access to such resources. For instance, academic researchers in Canada can currently get access to parallel computing resources like those at SciNet through the Compute Canada services. (Students of the PHY1610H Physics course get access to SciNet’s teach cluster for this purpose.)

Part 2 Scientific Software Engineering

2.1 Basics of Programming

2.1.1 Programming, Coding, Scripting

To have the computer perform a number of similar computations or data manipulations, we will have to let it know what it should do. Programming is the process of writing the precise instructions for the actions the computer should take.

A group of instructions or commands that can be executed as a whole by the computer is called a program or a script. One can also call these applications. Although coding as an activity is a synonym for programming, the word code can refer to any (sub)set of instructions, even a single instruction, and does not necessarily refer to a complete program or script.

When using interpreted languages that have an interactive mode, such as Python, R or bash, the programs are usually named scripts and the languages referred to as scripting languages. Writing scripts in these languages is a way to automate what we can otherwise do interactively, with minimal to no human intervention.

In contrast to interpreted languages, there are compiled languages such as C, C++, and Fortran. These are sometimes called low-level languages, because their instructions are closer to the actual machine code that the CPU uses, in contrast to the so-called high-level languages, such as R, Python, Julia, etc., which are less “cryptic” than the lower-level ones.

When we want to execute the program, i.e. ask the computer to execute each of the instructions included in the program, we refer to this as running or calling the program.

2.1.2 The Elements of a Program

Nowadays there is a large number of programming languages, and in some way this large variety reflects that languages can be designed to suit particular tasks. Yet there are some universal elements to each of these different programming languages, which we will now present.

A program specifies the actions that the computer should take, as well as (restrictions on) the order in which they should be taken. Each action will have a net effect on the program’s state. There is a limited set of predefined actions, in terms of which we must express all other actions: that is programming.

A common pattern of actions to achieve a specific net effect is an algorithm. Algorithms needn’t be written in any specific computer language and are often expressed in pseudo-code, i.e., in a non-existent, loosely defined, semi-formal code that could be implemented in any language of choice. A function (also known as a procedure or subroutine) is a set of actions expressed in a specific computer language that can be used as a newly defined action in other functions, or, for interpreted languages, on the command line. This is very useful to encapsulate units of functionality that can be reused, without having to re-type them, in different parts of the program or even in different programs, by calling the function.

A program can be seen as a function that can be executed or run. Programs may accept some external data as input and produce data as output.

The program’s state is stored in memory. The state is made up of the program’s variables (although there can also be hidden parts of the state). Variables are stored values that are assigned to a variable name. This variable name is associated with a physical portion of memory in the computer that holds the variable’s value. (Conceptually, one can think of the file system as part of the state, but this is not really helpful and raises the question as to whether the program’s runtime is, strictly speaking, infinite, since the file system persists after all the actions of the program have been executed.)

The whole idea of writing a program is that you can store it and use it as many times as needed. It follows that programming should adhere to “ecological” principles: reusability, recycling and modularity.

To avoid a program doing the same thing over and over, or having to constantly rewrite the code for different cases, it needs a way to take additional inputs that either set parameters or specify the data files to use. There are several ways to do this. One is to make the program ask for user input, or read out the motion of a mouse, or listen to a microphone, etc. In some cases this can be a good approach; it is the one used in all GUIs. However, this does require some sort of interactive mode for the program, which won’t be a viable option if we’re trying to set up some automated process. Another way is to use a parameter file where these values can be specified, and which the program can be made to read in. Finally, one can rely on command line arguments, which the program can also access; this is the approach taken by most Linux commands. These input parameters can then further specify the name of one or more data files to be read in by the program (these files are then also considered to be part of the input of a program).

Programs also produce output. They might print out text to screen, open files and write to them, or they could send data to devices, such as graphics and sound. All of these are forms of output.

Just as programs need to have input and output to not produce the same result every time, so do functions. Usually, functions take their input as function arguments, and return their output (the result of the action or calculation) explicitly, although functions might also read and write files. The function arguments are specified when the function is called. They could be variables, and depending on the programming language, these variables may be changed by the function (pass-by-reference), or not (pass-by-value), or the language may allow the programmer to indicate which of these should apply for a given function.

In addition to variables and actions, programming languages have loop constructs for allowing repetition of tasks; this is something that computers are really good at.

To handle decision making, programming languages have conditional statements. These enable program flow control based on conditions. The most common structures of this type are if/then/else statements.

For a more in-depth introduction to programming, see the excellent and surprisingly still relevant lecture notes from 1971 by E. W. Dijkstra on the topic[Dijkstra1971].

2.1.3 Programming Paradigms

The general outline in the previous section is a bit restrictive as it looks at a program as a set of variables upon which actions are performed in a specific order. It would seem to imply that defining the variables and writing the actions is all that there is to programming. But there are other, often more productive, programming paradigms than this linear, imperative programming paradigm:

Structured Programming

As the term implies, rather than coding a straight set of instructions, one attempts to apply some structure to the code by not jumping around in code with goto statements, and using control structures, functions and block structures. Most current programming languages enable this, but still often allow unstructured “spaghetti” code.

Modular Programming

Where structured programming just means using functions and control structures, in modular programming one would decompose the code into pieces, called modules, each of which contains the functions belonging to one specific functionality. For instance, one could have a separate module for I/O, one for reading in the parameter file, and one for running the simulation. Each module should be called through its interface, which is merely the definition of its functions, independent of its implementation (the actual code of the functions). The idea is to have a strong separation of concerns, such that one module cannot influence another. In modular programming, using global variables is strongly discouraged.

Object-Oriented Programming

Objects are data structures that combine data and functions. The point is that certain functions can only be used on certain data, and by encapsulating those functions inside the data type, there is no chance of using a function on the wrong data type.

Languages with explicit constructs for objects often have ways to define a type of object by specifying its data attributes and functions (methods). One can define a hierarchy of classes (a class is an object type), where one class can be derived from another. Other aspects of object-oriented programming include dynamic overloading of functions based on the class, data hiding, and polymorphism.

Object-oriented programming can be combined with structured and modular programming.

Note that languages without language constructs for objects, like C, could (with substantial effort, care, and restraint) still be used to do object-oriented programming.

Functional Programming

The paradigms discussed so far were all of the imperative variety, where there is a state and a list of actions to perform on it sequentially. Functional programming is a very different approach, one where there is no state and running a program is to apply a single function which can be expressed in terms of other functions. The program is then a collection of function definitions, and this style of programming is called declarative. In functional programming, functions are modeled after their mathematical namesakes: a mapping from one type to another, without reference to a context, state, or external I/O. Such functions are called pure. They will return the same output if given the same input, without side effects like changing global variables or creating files. Functions with side effects are called impure.

There are languages that are functional, and in which writing imperative code is impossible or very hard. One of the trickier issues in purely functional languages is that it would seem that one could never read or write a file. The way out for these languages is to mimic state to some extent using so-called monads. Essentially, this treats the whole file system as a function parameter. Functional languages also do not map naturally onto the existing computer hardware architecture, which makes reasoning about their performance very difficult.

It should be noted that many imperative languages allow one to write functional or partially functional programs, so one could combine all the paradigms in one program (which is not necessarily the best idea). It’s a good idea to write functions that have little to no side effects unless that is explicitly part of the functionality. However, the burden is on the programmer to take care that there are indeed no side effects.

Generic Programming

This paradigm, which can be combined with the previous ones, refers to writing code that takes the same form for different object types. Programming languages that support this allow one to write the generic code for such functions only once and have the compiler generate a version of that function for each specific type that the function is applied to. A similar type-independent specification can usually be given for classes.

This is as far as we will go in describing programming without referring to a specific programming language. We will next give an overview of the parts of C++ (a programming language that supports all the above paradigms) that are used in these lectures. Along the way, we will revisit the above aspects of programming.

2.2 C++

2.2.1 Why C++?

Given that there are few cases where efficiency does not matter at all, we will use a compiled language throughout these notes, and that compiled language is C++. The fact that it is a compiled language means that the basic ’actions’ in the program are translated into a set of optimized native instructions that the processor can execute. C++ is a language that is still being augmented and developed, so we should mention that we use C++ that conforms to the C++14 standard.

We do not intend to put down interpreted languages such as Python and R here, which offer a less severe learning curve and a flexible high-level language that makes them very suitable for interactive work and for high-level frameworks built on top of optimized libraries. But code written in compiled languages almost always runs faster than the same code written in interpreted languages, and the later chapters on using libraries, performance tuning and parallel programming will be much easier and more productive when working in a compiled language. Both Fortran and C++ (and its near-subset C) allow relatively straightforward use of libraries, and enable popular parallel programming techniques like OpenMP, OpenACC, CUDA and others. C++’s constructs are close enough to the way modern CPUs work that it allows one to reason about the performance, something that is hardly possible for higher-level languages. The final considerations to choose C++ were that it is more commonly used outside of academia than Fortran, it is more general than C, and that some current HPC technologies, like the threading library in C++11, CUDA from NVIDIA for programming graphics cards, and the more recent OneAPI initiative from Intel, all are built upon C++.

2.2.2 Basic C++ syntax

The following is a basic C++ program that prints the message “Hello, world.” on the console (a.k.a. terminal, a.k.a. shell):

1 // hw.cc - prints ”Hello world.”
2 #include <iostream>
3 #include <string>
4 int main() // the main function which is called when the app is run
5 {  // braces delimit a code block
6    std::string message = "Hello world."; // a variable
7    std::cout << message                  // prints to console out
8              << std::endl;               // end of line
9    return 0;
10    // return value to the shell
11 }

This can be compiled on the command-line as follows:

$ g++ -std=c++14 -o hw hw.cc
$ ./hw
Hello world.
$ echo $?
0

We added “-std=c++14” to the g++ command to ensure the right standard is used, although for newer versions of g++ this is the default. Note that any other C++14-compliant compiler could be used instead of g++.

Note that if you need to brush up on your Linux skills, Appendix B contains an overview of the Linux command line.

Let’s dissect the “hello world” program above, focussing on syntactical aspects of C++:

  • Line 1: Comments can be added using the double slashes //.

  • Line 2 and 3: Other C++ files can be included with the #include directive.

  • Lines 6,8,9: Each executable statement or declaration ends with a semicolon.

  • Lines 5 and 11: Curly braces delimit a code block.

  • Line 6: When declaring a variable or function to be of a certain type, the type is specified before the variable or function name.

  • Line 9: The value to be given back by a function is specified by the return statement, which exits the function.

For completeness’ sake, we will mention that <iostream> is a library responsible for input and output (which is needed since the program is supposed to write out a string). The <string> library is necessary to be able to work with text strings (a lot of basic C++ functionality like this is actually in libraries). Functions and data types of libraries can sometimes be prefixed with a so-called namespace name and two colons. For standard libraries like string and iostream, the namespace is std, as can be seen here in std::string, std::cout and std::endl. Console output is done using the << streaming operator, which can stream any variable into the console through the std::cout object. Streaming std::endl into std::cout ends the line and goes to the next line in the console.

C++ is a large and complex language, of which we will not necessarily need all the features in these lecture notes. The following is a list of the elements of C++ we will be using in these notes:

  • Variables

  • Functions

  • Function arguments by value and by reference

  • Loops

  • Pointers

  • Dynamically allocated arrays

  • Conditionals

  • Const

  • Objects

  • Templates

  • Libraries

  • Namespaces

  • IO and file streams

  • Compiling and linking

Objects and templates will be used, but will rarely need to be written (though you can).

2.2.3 Variables

Variables are named pieces of data or values stored in the computer’s memory. They are an essential part of programming. When we store information in variables, it remains internal to the computer. One can think of variables as “boxes” in the memory of the computer where we put “stuff”, in this case information, e.g. numbers, text, etc.; and we need to label them so we can access them later on.

That label is the name of a variable. You should always try to choose representative (meaningful) variable names, something that will give the person reading the code a sense of what that variable may contain. For instance, if you find a variable named myvar or var1 vs another one named totalEnergy, which one do you think is more illuminating for someone reading the code? A variable name should convey what is being stored in the corresponding variable.

In C++, variables have a particular type, such as a floating point number, an integer, or a string. In C++ there is no difference between the name of the variable and the data in the variable: they are always tied together. So we could say that the variable name has a type as well.

The data stored in a variable can be manipulated using the variable name, e.g. through assignment, addition, etc.

Variable names must be unique, that is, the same name cannot be used to refer to both an integer and a string, and a string cannot be stored in a variable that has been declared as an integer.

Before you can use a variable, you have to declare it and in that declaration, you have to define the type of variable. The syntax of a variable declaration is as follows:

type name [=value];

Here, type may be a

  • floating point type, such as
    float, double, long double, std::complex<float>

  • integer type, such as
    [unsigned] short, int, long, long long

  • character or string of characters, such as
    char, char*, std::string

  • boolean, i.e.,
    bool

  • array, pointer, class, structure, enumerated type, union.

Here are some examples:

int a;     // defines an integer variable called a
int b;     // defines an integer variable called b
a = 4;     // puts the value 4 in the variable a
b = a + 2; // adds 2 to the value of a and stores the result in b
float f = 4.0f; // defines a float variable and sets it to 4.0
double d = 4.0; // defines a double variable and sets it to 4.0
d += f;         // converts f to double precision, and adds it to d.
const char* str = "Hello There!"; // defines a character pointer and points it
                                  // at the start of the string "Hello There!".
bool itis2018 = false; // defines a bool variable and sets it to false.

It is important to stress that in C++, uninitialized variables are not 0, but have random values!

2.2.4 Functions

A function is a piece of code that can be reused. In C++, a function has:

  1. a name,

  2. a set of arguments of specific types,

  3. and a return value of some specific type.

These three properties are called the function’s signature.

To write a piece of code that uses (“calls”) the function, we only need to know its signature or interface. To make the signature known, one has to place a function declaration before the piece of code that is to use the function. The actual code (the function definition) can be in a different file or in a library.

Here’s an example:

1 // funcexample.cc
2
3 // external function prototypes:
4 #include <cmath>
5
6 // function prototype:
7 double geometric_mean(double a, double b);
8
9 // main function to call when program starts:
10 int main()
11 {
12     double x = 16.3;
13     double y = 102.4;
14     return geometric_mean(x,y);
15 }
16
17 // function definition:
18 double geometric_mean(double a, double b)
19 {
20     return sqrt(a*b);
21 }
$ g++ -o funcexample funcexample.cc
$ ./funcexample
$ echo $?
40

The general syntax of a function declaration, a.k.a. the prototype, the signature, or the interface, is as follows:

returntype name(argument-spec);

where argument-spec is a comma separated list of variable definitions.

The general syntax of a function definition, a.k.a. the code or implementation, is similar, just with an added code block with the statements that the function should execute when called.

returntype name(argument-spec) {
    statements
    return expression-of-type-returntype ;
}

Functions which do not return anything have to be declared with a returntype of void. Functions which have a non-void return type must have a return statement, except the function main().

To call a function, the syntax is:

var = name(argument-list);
f(name(argument-list));
name(argument-list);

where argument-list is a comma separated list of values.

In C++, functions cannot be nested, i.e., you cannot define a function within another function.

2.2.5 Function arguments

In the example above, the variables x and y are arguments to the function geometric_mean. We say that they are passed to the function. There are in fact two different ways that function arguments can be passed to a function: by value and by reference.

When passing by value is used, a new variable is created inside the function, and the data in the passed variable is copied into that new variable. Changes to this new variable have no effect on the original variable. This is the default in C++, and it is one way to avoid side effects of a function, but for complex variable types, making a copy can be expensive.

When passing by reference is used, the memory location assigned to the variable is passed, so the function will work directly on the variable and can modify its value. This avoids copying, but potentially introduces side effects of the function.

Independently of what approach you decide to use, you should always pass arguments into functions and use return statements for returning results. (When several results are returned, one usually passes some function parameters by reference, which the function fills with the result values.) Do not use global variables as a way to pass information to and from the function, even if this means your function needs a lot of arguments (there are ways to improve that situation without resorting to global variables).

Some programming languages, like R and Python, can be thought of as having a mix of both ways of passing variables but you don’t control how the variables are passed into the functions. In C++, one can specify exactly how each of the function arguments are passed.

Here’s an example of passing function arguments by value in C++:

1 // passval.cc
2 void inc(int i)
3 {
4     i = i+1;
5 }
6
7 int main()
8 {
9     int j = 10;
10     inc(j);
11     return j;
12 }
$ g++ -o passval passval.cc
$ ./passval
$ echo $?
10
$

Here,

  • j is set to 10.

  • j is passed to inc,

  • where it is copied into a variable called i.

  • i is increased by one,

  • but the original j is not changed, as the output shows.

In contrast, here’s almost the same code using pass by reference:

1 // passref.cc
2 void inc(int &i)
3 {
4     i = i+1;
5 }
6
7 int main()
8 {
9     int j = 10;
10     inc(j);
11     return j;
12 }
$ g++ -o passref passref.cc
$ ./passref
$ echo $?
11
$

Now:

  • j is set to 10.

  • j is passed to inc,

  • where it is referred to as i (but it’s still j).

  • i is increased by one,

  • because i is just an alias for j, the output reflects this change.
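As mentioned earlier, passing some parameters by reference is the usual way to return several results from a single function. Here is a minimal sketch of that pattern (the file name and function names are ours, not part of the course material):

// sumdiff.cc (hypothetical example)
#include <iostream>

// computes both the sum and the difference of a and b,
// returning them through the reference parameters sum and diff
void sum_and_diff(double a, double b, double &sum, double &diff)
{
    sum  = a + b;
    diff = a - b;
}

int main()
{
    double s, d;
    sum_and_diff(5.0, 3.0, s, d);
    std::cout << s << " " << d << std::endl; // prints "8 2"
}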

2.2.6 C++ Operators

C++ has a large number of operators, most of which it inherits from C, and which were based on elementary machine instructions. Table 1 shows the most common ones.

Arithmetic
a+b Add a and b
a-b Subtract b from a
a*b Multiply a and b
a/b Divide a by b
a%b Remainder of a over b
Logic
a==b a equals b
a!=b a does not equal b
!a a is not true (also: not a)
a&&b Both a and b are true (also: a and b)
a||b Either a or b is true (also: a or b)
Assignment
a=b Assign the expression b to the variable a
a+=b Add b to a (result stored in a)
a-=b Subtract b from a (result stored in a)
a*=b Multiply a with b (result stored in a)
a/=b Divide a by b (result stored in a)
a++ Increase value of a by one
a-- Decrease value of a by one
Logic/Numeric
a<b is a less than b
a>b is a greater than b
a<=b is a less than or equal to b
a>=b is a greater than or equal to b
Table 1: C++ arithmetic, logic, and assignment operators
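As a small illustration of how a few of these operators combine (a minimal sketch; the variable names are ours):

int a = 7;
int b = 2;
int r = a % b;                          // remainder: r is 1
a += b;                                 // compound assignment: a is now 9
a++;                                    // increment: a is now 10
bool inrange = (a >= b) && (a <= 100);  // logic: inrange is true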

Here’s something unexpected that happens with the division operator: 1/4 equals 0. Why? Well:

  • In the expression 1/4 both operands, i.e., 1 and 4, are integers.

  • Hence, the result of 1/4 is the integer part of the division, which is 0.

  • Generally, literal expressions, such as "Hi", 0, 1.2e-4, 2.4f, 0xff, true have types, just as variables do.

  • The result type of an operator depends on the types of the operands.

To fix this, we need conversions between types. This is called type casting in C++.

2.2.7 Type Casting

The way to convert one type to another is to treat the type as a function. Let’s look at an example:

1 // 1over4.cc
2 #include <iostream>
3
4 int main()
5 {
6   int   a = 1;
7   int   b = 4;
8   int   c = a/b;
9   float d = float(a)/float(b);
10
11   std::cout << c << " "
12             << d << " "
13             << int(d) << std::endl;
14 }
$ g++ 1over4.cc -o 1over4
$ ./1over4
0 0.25 0

Note: the “proper” C++ way to do casting is to use static_cast<TYPE>(value) or
reinterpret_cast<TYPE>(value).
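For instance, the cast in the example above could also have been written with static_cast; a minimal sketch:

int    a = 1;
int    b = 4;
double d = static_cast<double>(a) / b;  // d is 0.25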

Sometimes, casting is done automatically by the compiler. If an expression expects a variable or literal of a certain type, but it receives another, C++ will try to convert it automatically. E.g. 1.0/4 is equal to 1.0/4.0.

The expression to convert may be a function argument as well so that in

1 int unchanged(int i)
2 {
3    return i;
4 }
5 int main()
6 {
7    return unchanged(2.3);
8 }

the argument 2.3 gets converted to an int first, and then passed to the function unchanged, so the returned value is 2.

2.2.8 Loops

In scientific computing, we often want to do the same thing for all points on a grid, or for every piece of experimental data, etc.

If the grid points or data points are numbered, this means we consecutively want to consider the first point, do something with it, then the second point, do something with it, etc., until we run out of points.

That’s called a loop, because the same “do something” is executed again and again for different cases.

C++ has three forms of loops:

  1. a for loop:

    for (initialization; condition; increment) {
        statements
    }

  2. a while loop:

    while (condition) {
        statements
    }

  3. a range-based for loop:

    for (type var: iterable-object-or-expression) {
        statements
    }

You can use the break statement to exit the loop.

Here’s an example of a for loop:

1 // count.cc
2
3 #include <iostream>
4
5 int main()
6 {
7     for (int i=1; i<=10; i++) {
8         std::cout << i << " ";
9     }
10     std::cout << std::endl;
11 }
$ g++ -o count count.cc
$ ./count
1 2 3 4 5 6 7 8 9 10

And an example of a range-based for loop:

1 // rangebasedfor.cc
2
3 #include <iostream>
4
5 int main()
6 {
7     for (int i: { 1,2,3,4,5,6,7,8,9,10 } ) {
8         std::cout << i << " ";
9     }
10     std::cout << std::endl;
11 }
$ g++ -o rangebasedfor rangebasedfor.cc
$ ./rangebasedfor
1 2 3 4 5 6 7 8 9 10
$
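And, for completeness, a minimal sketch of a while loop that uses break to exit (the file name is ours):

// whilecount.cc (hypothetical example)

#include <iostream>

int main()
{
    int i = 1;
    while (true) {
        std::cout << i << " ";
        if (i == 10) {
            break;           // exit the loop
        }
        i++;
    }
    std::cout << std::endl;  // prints "1 2 3 4 5 6 7 8 9 10"
}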

2.2.9 Pointers

Pointers are memory addresses of variables. For each variable type type, there is a pointer type type* that can hold the address of such a variable.

The null pointer, denoted by nullptr, points nowhere.

Pointers are good for arrays, dynamic memory allocation, linked lists, binary trees, etc. They are also used when calling C library functions.

The definition of a pointer variable looks like this:

type* name ;

To assign the memory address of a variable to this pointer, one can use the “address-of” operator, which is an ampersand character, i.e., &:

name = &variable-of-type ;

Given a memory address stored in a pointer, we can get the content using the “dereferencing” operator, which is a star character, i.e. *, as follows:

variable-of-type = *name ;

Here’s an example of working with pointers:

1 int main()
2 {
3     int  a = 5;  // a equal to 5
4     int* b = &a; // b points to a
5     *b = 7;      // *b is equivalent to a
6     return a;    // so this returns 7
7 }

Note that we’ve only covered so-called raw pointers, in contrast with smart pointers that are in the C++ standard library.

2.2.10 Automatic Arrays

Arrays are ordered collections of elements where each element is of the same type, and in which each element can be accessed by an integer index.

There are different types of arrays in C++. One of these types is called an automatic array. You define an automatic array as follows:

type name [ number ];

Here, the square brackets do not indicate an optional part, but are part of the syntax. The variable name is equivalent to a pointer to the first element of the array. Access to the element with index i is denoted by name[i]. C++ arrays are zero-based, which means the first element has index 0.

Example:

1 // autoarr.cc
2
3 #include <iostream>
4
5 int main()
6 {
7     int a[6]= { 2,3,4,6,8,2 } ;
8     int sum=0;
9     for (int i=0;i<6;i++) {
10         sum += a[i];
11     }
12     std::cout << sum << std::endl;
13 }
$ g++ -o autoarr autoarr.cc
$ ./autoarr
25
$

Using automatic arrays is dangerous, for the following reasons:

  • The standard only guarantees that at least one automatic array of at least 65535 bytes can be used.

  • In practice, the limit of the total size of automatic arrays is set by the so-called stack size, which is set by the compiler and the operating system.

  • Compilers will not warn you if your program goes over this limit; the program will just crash for no apparent reason.

Consider what happens when we increase the size of the automatic array in the example above:

1 // autoarr1e8.cc
2
3 #include <iostream>
4
5 int main()
6 {
7     int a[100000000]= { 2,3,4,6,8,2 } ;
8     int sum=0;
9     for (int i=0;i<100000000;i++) {
10         sum += a[i];
11     }
12     std::cout << sum << std::endl;
13 }
$ g++ -o autoarr autoarr.cc
$ ./autoarr
25
$ g++ -o autoarr1e8 autoarr1e8.cc
$ ./autoarr1e8
Segmentation fault (core dumped)
$

This is bad, and there are better ways to use arrays, i.e., using dynamic memory allocation.

2.2.11 Dynamically allocated arrays

A dynamically allocated array is defined as a pointer to memory, so you declare it like a pointer:

type* name ;

After this declaration, the array does not exist yet; it has to be allocated using the keyword new:

name = new type [ number ];

(the square brackets are part of the syntax). This sets aside enough memory for an array of number elements of that type, and stores the pointer to the first element in name.

After allocation, name behaves much like an automatic array in that it can be indexed with square brackets. There are two main differences with automatic arrays:

  • The information on the size of the array is lost and not accessible to the program.

  • The memory is not automatically returned to the system even if pointer name goes out of scope or is assigned another pointer value.

Because the memory is not automatically returned to the system, we need to program this ourselves. To deallocate an array requires a delete call:

delete [] name ;

The brackets here are again part of the syntax. Without the brackets, the deallocation of the array is incorrect (formally, the behaviour is undefined).

You can use these dynamically allocated arrays in the same way as automatic arrays, but dynamically allocated arrays are not restricted to the stack memory and can use all available memory. You are also in control over when memory is given back. It is very important to deallocate, or you’ll have a memory leak, and your program may run out of memory.

Here’s the “improved” version of the above example that used automatic arrays:

1 // dynarr.cc
2
3 #include <iostream>
4
5 int main()
6 {
7     int* a = new int[100000000](); // () value-initializes all elements to zero
8     a[0] = 2; a[1] = 3; a[2] = 4;
9     a[3] = 6; a[4] = 8; a[5] = 2;
10     int sum=0;
11     for (int i=0;i<100000000;i++) {
12         sum += a[i];
13     }
14     std::cout << sum << std::endl;
15     delete[] a;
16 }
$ g++ -o dynarr dynarr.cc
$ ./dynarr
25
$

You might think that this generalizes to multidimensional arrays, but unfortunately, no fully dynamic multi-dimensional version of the new keyword exists in C++. There are solutions for using multi-dimensional arrays, which will be covered in section 2.2.24, where we’ll also see what’s wrong with multidimensional automatic arrays.
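As a small preview of that discussion, one common workaround is to allocate a single one-dimensional block and compute the two-dimensional index yourself; a minimal sketch (row-major ordering, variable names ours):

int n = 3, m = 4;                  // an n-by-m "matrix"
double* a = new double[n*m];       // one contiguous block
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        a[i*m + j] = i + j;        // element (i,j)
delete[] a;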

2.2.12 Dynamic allocation of single variables

One can also dynamically allocate a single variable:

1 #include <iostream>
2 int main() {
3     double* v = new double;
4     *v = 4.2;
5     std::cout << *v << std::endl;
6     delete v;
7 }

Note the absence of “[]” in the delete statement, compared to the delete statement for arrays.

You might use this in dynamic data structures. But in these cases, the smart pointers of C++, like unique_ptr and shared_ptr, are almost always a better choice.

1 #include <iostream>
2 #include <memory>
3 int main() {
4     std::unique_ptr<double> v = std::make_unique<double>();
5     *v = 4.2;
6     std::cout << *v << std::endl;
7 }

No delete is necessary here; the memory is automatically released when v goes out of scope.
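The array form of new has a smart-pointer counterpart as well; a minimal sketch, assuming a C++14 compiler:

#include <iostream>
#include <memory>
int main() {
    // array of 10 doubles, value-initialized to zero;
    // released automatically when a goes out of scope
    std::unique_ptr<double[]> a = std::make_unique<double[]>(10);
    a[3] = 4.2;
    std::cout << a[3] << std::endl;
}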

2.2.13 Arrays as function arguments

Passing an array to a function is trickier in C++ than you might expect, because of the equivalence of array expressions and pointers. Consider a case where we have a function to print an array of integers:

void printarr(int size, int x[])
{
    for (int i=0; i<size; i++) {
        std::cout << x[i] << " ";
    }
    std::cout << std::endl;
}

We would call this function with an automatic array as follows:

int main() {
    int numbers[4] = {1,2,3,4};
    printarr(4, numbers);
}

Here, the size of the array has to be explicitly given to the function as its first argument. This is because the array variable numbers, which is used as an expression for the second argument, is converted to a pointer to the first element of the array. From this pointer, there is no way to deduce how big the array was. It is as if the array forgets its size when passed to a function (if that already seems odd and inconvenient, just wait until we discuss multidimensional arrays in the next section).

Since the x parameter is essentially a pointer, we could write the function also as follows:

void printarr(int size, int* x)
{
    for (int i=0; i<size; i++) {
        std::cout << x[i] << " ";
    }
    std::cout << std::endl;
}

This syntax is entirely equivalent to the first version of printarr.

2.2.14 Command line arguments

Commands typed at the Linux command line can be followed by arguments. To access the command line arguments that were given to our C++ program, we need to augment the definition of the main function from int main() to

int main(int argc, char* argv[])
{
        ....
}

The arguments of this form of main have the following meaning:

  • argc is the number of arguments, where the command itself counts as an argument as well;

  • argv is an array of character strings, with the first string, argv[0] equal to the command.

Note that this is precisely how we just saw that an array needs to be passed to a function.

All arguments are (C-style) strings. To convert them to integers or floats, use functions like atoi and atof (from the <cstdlib> header), e.g. int n = atoi(argv[1]); stores the integer value of the first command line argument into the variable n.

Example:

1 // printargs.cc
2 #include <iostream>
3 int main(int argc, char* argv[]) {
4     for (int i=0; i<argc; i++) {
5         std::cout << argv[i] << std::endl;
6     }
7 }
$ g++ -o printargs printargs.cc
$ ./printargs Hello There!
./printargs
Hello
There!
$
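As a further sketch (not part of the course code), command line arguments are often used to set problem sizes. The following hypothetical example reads an array size with atoi, falling back to a default when no argument is given:

#include <cstdlib>
#include <iostream>
int main(int argc, char* argv[]) {
    // use the first command line argument as the array size, default to 10
    // (no error checking on the value in this sketch)
    int n = (argc > 1) ? atoi(argv[1]) : 10;
    double* a = new double[n];
    for (int i = 0; i < n; i++)
        a[i] = i;
    std::cout << "last element = " << a[n-1] << std::endl;
    delete[] a;
}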

2.2.15 Conditionals

This is just the if–else statement, whose syntax is as follows

if (condition) {
    statements
} else if (othercondition) {
    statements
} else {
    statements
}

Here’s an example:

1 int main() {
2     int  n = 20;
3     int* a;
4     a = new (std::nothrow) int[n];
5     if (a == nullptr) {
6         std::cout << "Error in main" << std::endl;
7     } else {
8         for (int i=0; i<n; i++) {
9             a[i] = i+1;
10         }
11         printarr(n,a);
12         delete[] a;
13     }
14 }
$ g++ -o ifm ifm.cc
$ ./ifm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

If we change n = 20 to n = 2000000000:

$ g++ -o ifm ifm.cc
$ ./ifm
Error in main
$

In practice, don’t use this trick to check memory allocation; instead, let the program fail with an exception:

1 int main() {
2     int  n = 2000000000;
3     int* a;
4     a = new int[n];
5     for (int i=0; i<n; i++) {
6         a[i] = i+1;
7     }
8     printarr(n,a);
9     delete[] a;
10 }
$ g++ -o ifm ifm.cc
$ ./ifm
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
$

2.2.16 Exceptions

C++ code can raise an exception (sometimes also called throwing an exception) at any point. When an exception is raised, the execution of the current function is stopped and the execution goes back to the caller function, which can catch the exception using the following syntax:

try {
    statements
}
catch (type varname) {
  statements
}

For instance:

1 int main() {
2     int  n = 2000000000;
3     int* a;
4     try {
5         a = new int[n];
6     }
7     catch (std::bad_alloc& b) {
8         std::cout << "Error in main" << std::endl;
9         return 1;
10     }
11     for (int i=0; i<n; i++) {
12         a[i] = i+1;
13     }
14     printarr(n,a);
15     delete[] a;
16 }
$ g++ -o ifm ifm.cc
$ ./ifm
Error in main
$

If the caller function does not catch the exception, the exception propagates further, to the caller’s caller. If none of the functions in the call stack catches the exception (i.e., not even int main()), the program terminates, as we saw above.
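You can also throw exceptions yourself, e.g. to signal invalid input to a function. Here is a minimal sketch (the function name checked_sqrt is made up; std::runtime_error comes from the stdexcept header):

#include <cmath>
#include <iostream>
#include <stdexcept>

// hypothetical function that refuses negative arguments
double checked_sqrt(double x) {
    if (x < 0)
        throw std::runtime_error("checked_sqrt: negative argument");
    return std::sqrt(x);
}

int main() {
    try {
        std::cout << checked_sqrt(-1.0) << std::endl;
    }
    catch (std::runtime_error& e) {
        std::cout << "Caught: " << e.what() << std::endl;
    }
}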

2.2.17 Const

The C++ keyword const is a type modifier. It indicates that the value of that type is fixed. This can be useful for constants, e.g.

const int arraySize = 1024;

It can also be very useful for marking read-only arguments of functions:

int f(const Type &in, Type &out);

Be warned: const is contagious! Once you start using it, everything has to be “const correct”.
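As a small sketch of what this looks like in practice (the function name scale is made up), the compiler rejects any attempt to modify a parameter that was declared const:

#include <iostream>

// 'in' is read-only inside the function; 'out' can be modified
void scale(const double& in, double& out) {
    // in = 2.0;   // would not compile: 'in' is const
    out = 2.0 * in;
}

int main() {
    const int arraySize = 4;   // a constant
    double x = 1.5, y = 0.0;
    scale(x, y);
    std::cout << arraySize << " " << y << std::endl;
}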

2.2.18 Classes and Objects

Classes are a generalization of types.

Objects are a generalization of variables.

Syntax similar to variable declarations:

classname objectname;
classname objectname(arguments);
classname objectname{arguments};

The differences between classes and regular types are that

  • Object declarations can have arguments, supplied to construct the object.

  • An object has members (fields) and member functions (methods), accessed using the “.” notation.

    object.field
    object.method(arguments)
  • You can create your own classes (though this isn’t required for your course work).

Here’s an example of using a member function, or method.

#include <string>
std::string s("Hello");
int stringlen = s.size();

and an example of a member/field:

#include <utility>
std::pair<int,float> p(1, 0.314e01);
int    int_of_pair   = p.first;
float  float_of_pair = p.second;

If you’re wondering about those angular brackets with types in between them in this example, well, those are template arguments, and templates will be discussed in the next section.

It is useful to discuss the main ideas and mechanisms behind object oriented programming (OOP) in C++, because many C++ libraries use an object oriented approach.

Objects are reusable components inside of which the complexity can be hidden. They allow one to separate the interface (how you use the object in the rest of the code) from the implementation (the code that makes the object actually work). We will see a similar idea in the section on modular programming, but with object oriented programming, one can have multiple instances of the same type of component. One can reuse components to build other components, and an object’s functionality can be extended using inheritance. The same interface can also be implemented by different objects, which facilitates generic programming.

As attractive as all these qualities are, it is common to find that at the lowest level of your computations, OOP may need to be broken for best performance.

Objects in C++ have their roots in the user-defined data type (a.k.a. variable type) in C called structs. Here’s an example of a C struct, with some functions that use that struct.

1 struct Point2D {
2   double x, y;
3   int i;
4 };
5
6 void print1(struct Point2D a) {
7   std::cout << a.i << " " << a.x << " " << a.y;
8 }
9
10 void print2(struct Point2D* a) {
11   std::cout << a->i << " " << a->x << " " << a->y;
12 }

The Point2D struct is a collection of three member variables, i.e., two doubles and an integer, which have names inside the struct (x, y, and i). After the struct is defined (in the first 4 lines), struct Point2D is a new variable type. As explained above, the dot (.) notation is how one refers to the member variables inside a struct, as in the function print1. The function print2 takes a pointer to a struct as an argument. For a pointer to a struct, one uses the arrow (->) notation to refer to the member variables.

There is an important distinction between the two print functions. Calling print1 makes a copy of the content of some existing struct and stores it in a. On the other hand, calling print2 only copies the address of a struct. For large structs, this matters, because copies are not cheap! So the print2 function avoids copying. In C++, the way of avoiding copies would be to use a reference parameter, i.e.

1 void print3(Point2D& a) {
2   std::cout << a.i << " " << a.x << " " << a.y;
3 }

The & character makes this a call by reference, so no copy is performed. Also note that in C++, we can omit the struct keyword; Point2D becomes a new data type.

Using structs, all the data in the objects is exposed and accessible to the rest of the code. In true OOP, data is encapsulated and accessed using methods specific for that (kind of) data. The interface (collection of methods) should be designed around the meaning of the actions: abstraction. Programs typically contain multiple objects of the same type, called instances.

To do OOP in C++, one defines classes instead of structs. Like structs, a class definition defines a new data type that can be used to define new variables, which are instances of that class; such an instance is an object, and the class is its type. Syntactically, classes are structs with member functions.

One adds such member functions to a class by declaring a member function inside the definition of the class, e.g.

1 class Point2D {
2 public:
3   int j;
4   double x,y;
5   void set(int aj, double ax, double ay);
6 };

One could also include the member function definition (i.e., the function body) inside the class definition, but this can clutter the class definition, so we’d recommend that the code with the definition of the member function be given separately, outside of the class { .... } construct. Outside of the class, one has to use the name of the class as well as the name of the member function, separated by a double colon (::), as follows:

1 void Point2D::set(int aj, double ax, double ay)
2 {
3     j = aj;
4     x = ax;
5     y = ay;
6 }

This completes the definition of the Point2D class. We can now use it to create objects of that type, e.g.:

1 int main() {
2     Point2D myobject;
3     myobject.set(1,-0.5,3.14);
4     std::cout << myobject.j << std::endl;
5 }

A good component hides its implementation details. To allow hiding of parts of the object, each member function or data member can be

private

only member functions of class have access

public

accessible from anywhere

protected

accessible only to this and derived classes.

These are specified as sections within the class.

For example, to hide the j, x and y member variables in the Point2D class, one could write

1 class Point2D {
2 private:
3     int j;
4     double x,y;
5 public:
6     void set(int aj,double ax,double ay);
7     int get_j();
8     double get_x();
9     double get_y();
10 };

With the following member function definitions

1 int Point2D::get_j() {
2     return j;
3 }
4 double Point2D::get_x() {
5     return x;
6 }
7 double Point2D::get_y() {
8     return y;
9 }

(the definition of the set member function is the same as above). When using the Point2D now, one can no longer access the j, x and y member variables with the dot operator. Instead, one needs to use the get_.... functions, e.g.

1 Point2D myobject;
2 myobject.set(1,-0.5,3.14);
3 std::cout << myobject.get_j() << std::endl;

Since the get functions were written such that they cannot change the variables, this class has enforced the rule that j, x and y can only be changed simultaneously, using the set function. While the above example is contrived, in real applications, the member variables of an object may have to keep a certain relationship to one another (so-called constraints), and allowing the member variables to be set separately could violate that and lead to bugs.

It is important to note that when hiding the data behind these kinds of accessor functions, each time the data is needed, a function has to be called, and there is an overhead associated with that. The overhead of calling this function can sometimes be optimized away by the compiler, but often it cannot. Consider making data that is needed often by an algorithm just public, or using a friend.

Classes are a bit more than just structs with functions. A class defines a type, and when an instance of that type is declared, memory is allocated for that object. But that object might require member arrays to be allocated with new when it is created. Similarly, when the object ceases to exist (see the section on scope), some clean-up may be required (delete …). These setup and cleanup routines are standardized in C++. Each class has a constructor, which is a member function that is called when an object is created, and a destructor, which is called when an object is destroyed. There are other specialized automatic member functions, but these two are the most important.

For example, to write a constructor for the Point2D class, we could write

1 // class declaration:
2 class Point2D {
3
4   private:
5     int j;
6     double x,y;
7
8   public:
9     Point2D(int aj,double ax,double ay);
10     int get_j();
11     double get_x();
12     double get_y();
13 };
14
15 // constructor definition
16 Point2D::Point2D(int aj,double ax,double ay) {
17     j = aj;
18     x = ax;
19     y = ay;
20 }
21
22 // use Point2D as a data type to define a variable, for
23 // which the constructor will be called.
24 Point2D myobject(1,-0.5,3.14);

Note that the constructor is a member function with the same name as the class, and that it does not have a return type.

Destructors are called when an object is destroyed. This occurs when a non-static object goes out of scope, or when delete is used. This is a good opportunity to release memory. You declare a destructor as a member function of the class with no return type, with a name which is the class name with a tilde (~) attached to the left. A destructor cannot have arguments.

By the way, you should try to avoid print statements or other side effects in constructors and destructors, which should only be concerned with setting up a fully functional object and cleaning up after it is destroyed, respectively.
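To illustrate the constructor/destructor pair, here is a sketch (not part of the course code) of a class whose constructor allocates an array with new and whose destructor releases it again:

// sketch of a class that manages its own dynamic memory
class Buffer {
  private:
    double* data;
    int     size;
  public:
    Buffer(int n) {            // constructor: allocate
        size = n;
        data = new double[n];
    }
    ~Buffer() {                // destructor: clean up
        delete[] data;
    }
    double& element(int i) {   // access an element
        return data[i];
    }
};

int main() {
    Buffer b(100);
    b.element(0) = 3.14;
}   // b goes out of scope here, so its destructor is called

(A production-quality version would also have to deal with what happens when such an object is copied, which is beyond what we need here.)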

There is a lot more to object oriented programming in C++: operator overloading, inheritance, polymorphism, friends, automatic conversion, etc. But this is beyond what we need for the purposes of this text. As we already noted above: objects can be very elegant and simplify coding considerably, especially given the right context, such as data analysis, but they come with a fair amount of overhead and can slow down your code considerably under certain circumstances. As such, when aiming for speed, you may find that not using them is the way to go.

2.2.19 Templates

Some algorithms and classes depend on a type, e.g. a list of doubles, a list of ints, … These can often be implemented with the same code, except for a change in type. Using generic programming, one can write this code once, with one or more type parameters.

In C++, generic programming uses templates. Type parameters appear in between angular brackets <> instead of parentheses. Many templated functions and classes are in the standard library, which we’ll use.

Templates are possible for classes and functions. For classes, one can create an object from a template class called tc as follows:

tc<type> object(arguments);

For example:

std::complex<float> z;    // single precision complex number
std::vector<int> i(20);   // array of 20 integers
rarray<float,2> x(20,20); // 2d array of 20x20 floats w/rarray library
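Writing your own templates is not needed for the course work, but a minimal sketch of a templated function (the name sum3 is made up) shows the idea:

#include <iostream>

// works for any type T that supports the + operator
template<typename T>
T sum3(T a, T b, T c) {
    return a + b + c;
}

int main() {
    std::cout << sum3<int>(1, 2, 3) << std::endl;          // instantiated for int
    std::cout << sum3<double>(0.5, 1.5, 2.0) << std::endl; // instantiated for double
}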

2.2.20 Libraries

To use a library, you have to put an include line in the source code, e.g.

#include <iostream>
#include <mpi.h>

These statements only include the interface of the library. To link in the implementation of the library, you need to add an argument to the compilation line (or really, the link line) of the form -lLIBNAME. This latter argument is not needed for the C++ standard libraries, nor for so-called header-only libraries.

Common libraries from the C++ Standard Template Library are

  • string: character strings

  • iostream: input/output, e.g., cin and cout

  • fstream: file input/output, e.g., ifstream and ofstream

  • containers: vector, complex, list, map, . . .

  • cmath: special functions (inherited from C), e.g. sqrt

  • cstdlib, cstring, cassert, ... : C header files

2.2.21 Namespaces

Variables and functions, as well as variable types, have names. In larger projects, you could easily end up with two variables, functions, or types with the same name. To avoid such name clashes, one can use namespaces. One usually puts all functions, types, … of a module in a namespace:

namespace MODNAME {
...
}

Here, namespace is the keyword, MODNAME is an identifier of your choosing. This effectively prefixes everything that is defined in the code block ... with “MODNAME::”.

We already saw an example of this for writing to the console

std::cout << "Hello,world" << std::endl;

Many of the standard functions/types are in namespace std.
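As a sketch, here is how one might define and use a small namespace of one’s own (the name mymodule is made up):

#include <iostream>

namespace mymodule {
    const double answer = 42.0;
    double twice(double x) {
        return 2.0 * x;
    }
}

int main() {
    // outside the namespace, names must be prefixed with mymodule::
    std::cout << mymodule::twice(mymodule::answer) << std::endl;
}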

2.2.22 Scope

Variables do not live forever; they have well-defined scopes in which they exist. These are some of the rules:

  • If you define a variable inside a code block, it exists only until your code hits the closing curly brace (}) that corresponds to the opening curly brace ({) that started the block. This is its local scope.

  • The variable will only be known in that code block and its subblocks.

  • Subblocks can define a variable with an existing name, masking the original variable in that subblock.

  • If you call a function from a code block, variables from that block will not be known in the body of the function.

  • It is possible to define variables outside of any code block; these are global variables. Avoid those.

  • When a variable goes out of scope, the memory associated with it is returned to the system, except for memory that was dynamically allocated.

  • When that variable is an object, a special member function of it called the destructor is called. This gives objects that dynamically allocate memory the opportunity to delete that memory.
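The following sketch illustrates local scope and masking:

#include <iostream>

int main() {
    int x = 1;
    {
        int y = 2;    // y exists only inside this block
        int x = 10;   // masks the outer x inside this block
        std::cout << x + y << std::endl;   // prints 12
    }
    // y no longer exists here; x refers to the outer variable again
    std::cout << x << std::endl;           // prints 1
}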

2.2.23 IO Streams

In C++, stream objects are responsible for text-based I/O. You can output an object obj to a stream str simply by

str << obj

while you can read an object obj from a stream str simply by

str >> obj

The stream will encode these objects in text, provided a proper operator is defined (true for the standard C++ types).

There are a few standard streams:

  • std::cout For output to the screen (buffered)

  • std::cin For input from the keyboard

  • std::cerr For error messages (by default to the screen too)

These are defined in the header file iostream.

IO streams example:

1 #include <iostream>
2 int main() {
3     std::cout << "Enter a number: " << std::endl;
4     int i;
5     std::cin >> i;
6     std::cout << "Twice that is: " << 2*i << std::endl;
7 }

File streams are IO streams that can work with files.

  • The ofstream class is for output to a file.

  • The ifstream class is for input from a file.

You have to declare an object of these classes first. After that, you can use the streaming operators << and >> , or use the member functions read/write to read/write binary.

File streams example for writing to a file:

1 #include <fstream>
2 int main() {
3     std::ofstream fout("out.txt");
4     int x = 4;
5     float y = 1.5;
6     fout << x << " " << y << std::endl;
7     fout.close();
8 }

File streams example for reading from file:

1 #include <fstream>
2 #include <iostream>
3 int main() {
4     std::ifstream fin("out.txt");
5     int x;
6     float y;
7     fin >> x >> y;
8     fin.close();
9     std::cout << "x = " << x << ", y = " << y << std::endl;
10 }

2.2.24 Multidimensional Arrays

Most scientific programming depends on arrays in one form or another. They show up everywhere: as fields, grid information, discretizations, and in machine learning and linear algebra. Oftentimes, these are large arrays whose sizes are dynamic, i.e., depend on input files or change in the course of the program.

C++ does not particularly shine when it comes to arrays, as we already saw in the sections above. We will see in this section that the built-in mechanisms of C++ for multidimensional arrays, which are very common in scientific computing, are even stranger, more error prone and more laborious to work with, and this hurts productivity. For this reason, we will discuss multidimensional arrays in detail in this section, and suggest a library, rarray, that can be used to get around the peculiarities of C++ when it comes to multidimensional arrays.

A multidimensional array is a number of elements of the same type that are organized into a 1d, 2d, 3d, … rectangular shape or grid. 1-dimensional arrays are sometimes called vectors; their grid is trivial. 2-dimensional arrays are sometimes called matrices; their grid represents rows and columns. 3-and-higher-dimensional arrays are sometimes called tensors, and are generalizations of the matrix concept.

We saw in section 2.2.10 that automatic arrays are only useful if you know the size of your array ahead of time and they have modest dimensions. We also saw that array expressions are equivalent to pointers to the first element of the array. This equivalence has dire consequences. But first, let’s consider what is good about this equivalence when calling functions.

Because C++ functions pass arguments by value by default (and for its C predecessor this was the only way), if arrays were first-class types, the whole array would need to be copied to the function, which takes cpu time and stack memory. Due to the pointer conversion, the function actually takes a copy of the pointer to the first element of the array, and is able to manipulate the original array in memory, without making a copy of it. In addition, since differently sized arrays are equivalent to the same pointer type, you can use the same function for arrays of different sizes, as well as for dynamic arrays. We just have to pass the number of elements of the array as an argument to make up for this flexibility. Here’s an example to illustrate how one can use the same function for automatic and dynamic arrays of different sizes:

1 #include <iostream>
2 int sum_arr(const int* arr, int n) {
3    int total = 0;
4    for (int i = 0; i < n; i++)
5       total += arr[i];
6    return total;
7 }
8 int main() {
9    int cookies[] = {1, 2, 4, 8, 16, 32, 64, 128};
10    int sum = sum_arr(cookies, 8);
11    std::cout<<"Cookies eaten: "<<sum<<"\n";
12    int* mc = new int[5];
13    mc[0]=1;mc[1]=2;mc[2]=4;mc[3]=8;mc[4]=16;
14    int sum2 = sum_arr(mc, 5);
15    std::cout<<"More cookies eaten: "<<sum2<<"\n";
16    delete[] mc;
17 }
$ g++ -std=c++14 cookies.cc -o cookies
$ ./cookies
Cookies eaten: 255
More cookies eaten: 31

How would we use a multidimensional array in C++? Suppose we would like to have a matrix structure like the following:

1 2 3 4
9 8 7 6
2 4 6 8
6 5 4 3

In this array, there are 16 integers, 4 rows, 4 columns, the shape or grid is 4x4, and there are two dimensions, both of which are 4.

One of the issues with multi-dimensional arrays is that computer memory addresses are linear, so a bunch of numbers has no inherent shape. One approach to defining a multidimensional array is to linearize it using a helper function:

1 int linear(int row, int col, int nrow, int ncol) {
2    return row*ncol + col;
3 }
4 int main() {
5     const int nrows=4, ncols=4;
6     int linarr[nrows*ncols] = {1,2,3,4,9,8,7,6,2,4,6,8,6,5,4,3};
7     int r = 1, c = 2;
8     int i = linear(r,c,nrows,ncols);
9     linarr[i] = 5.0;
10 }

However, we shouldn’t have to, at least not for such a small 2d array, because C++ has automatic multidimensional arrays, so we can just do

1 int main() {
2     const int nrows=4, ncols=4;
3     int arr[nrows][ncols] = {{1,2,3,4},{9,8,7,6},{2,4,6,8},{6,5,4,3}};
4     int r = 1, c = 2;
5     arr[r][c] = 5.0;
6 }

The variable arr contains an array which is already mapped row by row. I.e., in memory, the elements are laid out contiguously as they appear in the code, and each subsequent set of 4 numbers is one row in the 2d array. This is called a row-major memory layout.

We would like to be able to pass this array to a function. When passing arrays around, these get converted to pointers to the first element. Because C++ considers this 2d array to be an array of arrays, the first element is a whole row, whose type in this case is int[4] – that’s right, the size is part of the type. This means that if you pass such an array to a function, you must specify the size of the arrays being pointed to, so that C++ knows how to index things properly. The function declaration would have to look like this:

1 int sum_arr(int data[][4], int size);

But because the number 4 is hard-coded here, and C++ does not allow it to be replaced by a variable, you’d need a separate function for every possible number of columns. The convenience of being able to write one function for all sizes, which the array-expression/pointer equivalence gave us for one-dimensional arrays, does not carry over to multi-dimensional arrays.

The repeated square bracket indexing method, as used in arr[r][c] = 5.0 in the example above, also indicates this array-of-arrays idea: the first index, r, picks a row from the array of rows, the second, c, picks an element from that specific row. Another way to look at this is that just as a 1d array is accessed through a pointer to a block of memory, a 2d array can be accessed through a pointer to a block of memory which contains pointers to other blocks of memory.

For example, consider

1 int data[4][4] = {{1,2,3,4}, {9,8,7,6}, {2,4,6,8}, {6,5,4,3}};

The compiler considers this an array of 4 int[4]’s. Although most compilers will optimize out the array of pointers for automatic arrays, nonetheless, the expression data[1] is a pointer to the second row.

For dynamic 2d arrays, the construction of the array of pointers must be explicitly coded. There are two main ways to go about this. The first way of allocating a matrix is the one found in most C++ textbooks. It allocates an array of as many pointers as there are rows, and then allocates memory separately for each row:

1 double** allocate_matrix1(int n, int m) {
2    double **a = new double*[n];
3    for (int i = 0; i < n; i++)
4       a[i] = new double[m];
5    return a;
6 }

This results in a pointer-to-a-pointer: a pointer to a double is denoted by double*, so double** is a pointer-to-a-pointer-to-a-double. The memory layout of the data in this matrix looks as follows:

One can note that the data is in general not contiguous when using this method.

A second way to allocate the matrix uses one memory allocation for the whole data of the array, as well as an array of pointers, but it adjusts those pointers to point to different offsets within the contiguous data array, as follows:

1 double** allocate_matrix2(int n, int m) {
2    double **a = new double*[n];
3    a[0] = new double[n * m];
4    for (int i = 1; i < n; i++)
5        a[i] = &a[0][i * m];
6    return a;
7 }

In memory this looks like this:

It turns out many libraries need contiguous memory! So use the second way.

It should be noted that the way you allocate determines the way you should deallocate the array. For the first, discontinuous method, the matrix can be deleted like this:

1 void deallocate_matrix1(double** a, int numrows) {
2   for (int i = 0; i < numrows; i++)
3       delete [] a[i];
4   delete [] a;
5 }

In contrast, for the second, continuous matrix allocation method, the deallocation should be done as follows:

1 void deallocate_matrix2(double** a) {
2    delete[] a[0];
3    delete[] a;
4 }

Note that ‘delete[]‘ must be called as many times as ‘new …[]‘ was called.

Both ways of allocating produce a pointer-to-a-pointer. Thus, passing such a two-dimensional array of type T to a function requires the function to take a pointer-to-a-pointer argument, i.e., a T** (just as the deallocate_matrix functions above). Since this pointer has no information about the number of rows and the number of columns, this information needs to be passed to the function as additional arguments.

The following example illustrates this with a function that sums up all the elements of a matrix:

1 #include <iostream>
2 double sum_arr(double** p, int nrows, int ncols) {
3    double total = 0;
4    for (int i = 0; i < nrows; i++)
5       for (int j = 0; j < ncols; j++)
6          total += p[i][j];
7    return total;
8 }
9 int main()
10 {
11    int numrows = 3, numcols = 4;
12    double** p = allocate_matrix2(numrows, numcols);
13    for (int i = 0; i < numrows; i++) {
14       for (int j = 0; j < numcols; j++) {
15          p[i][j] = i + j;
16          std::cout << p[i][j] << " ";
17       }
18       std::cout << std::endl;
19    }
20    std::cout << "Total = "
21              << sum_arr(p,numrows,numcols)
22              << std::endl;
23    deallocate_matrix2(p);
24 }

When looping over the elements of a two-dimensional array, there are at least two ways. Which of the following is faster?

1 double sum_arr1(double** p, int numrows, int numcols) {
2    double total = 0;
3    for (int i = 0; i < numrows; i++)
4       for (int j = 0; j < numcols; j++)
5           total += p[i][j];
6    return total;
7 }

or

1 double sum_arr2(double** p, int numrows, int numcols) {
2    double total = 0;
3    for (int j = 0; j < numcols; j++)
4       for (int i = 0; i < numrows; i++)
5          total += p[i][j];
6    return total;
7 }

The two calculations are identical except that the i and j loops are interchanged. Because the memory layout of the matrix is row-major, the first method traverses p linearly in memory. That is what memory subsystems are best at, so the first method will, for large enough matrix sizes, be substantially faster than the second, which jumps around in memory and makes the memory subsystem work harder (look up “cache”).

In general, when looping over multidimensional arrays, arrange your loops so that the innermost loop runs over the last index for row-major languages (it is the opposite for column-major languages).

2.2.24.1 Existing C++ packages for multidimensional arrays

With all this being said about “native” C++ multi-dimensional arrays, there are several C++ packages available to allow you to do matrix algebra (Armadillo, Eigen, Blitz++, boost, rarray). In addition to helping out with the peculiarities of the native C++ multidimensional arrays, some of these packages come with built-in functionality that you may need:

  • Matrix multiplication operations.

  • Matrix inversion, solving systems of equations.

  • Decompositions, and factorizations.

  • Eigenvalue calculations.

These are useful, and should be used first before trying to speed things up by building your own. There are also the BLAS and LAPACK libraries, which are the classic libraries for doing linear algebra.
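To give a flavour of what calling such a library looks like, here is a sketch of a matrix-matrix multiplication using the C interface to BLAS. This assumes that a CBLAS implementation is installed, that its header is called cblas.h (this can vary between implementations), and that the code is linked with something like -lcblas or -lopenblas. Note that the matrices are stored contiguously, as most of these libraries require:

#include <iostream>
#include <cblas.h>   // header name may differ per BLAS implementation

int main() {
    const int n = 2;
    // contiguous, row-major storage
    double A[n*n] = {1, 2, 3, 4};
    double B[n*n] = {5, 6, 7, 8};
    double C[n*n] = {0, 0, 0, 0};
    // compute C = 1.0*A*B + 0.0*C
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    for (int i = 0; i < n*n; i++)
        std::cout << C[i] << " ";
    std::cout << std::endl;
}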

2.2.24.2 rarray: A Multidimensional Array Library

The rarray library provides multidimensional arrays that are fast, can have arbitrary rank, have the same access syntax as C++ arrays, and are stored in the same way as well (row major). It is an open source, header-only library available at https://github.com/vanzonr/rarray.

Rarray was written specifically as an optimized way to circumvent the weirdness of C++ arrays. Rarrays behave much like pointers: If you copy rarray a to rarray b, both share the same data. However, rarrays are objects that remember their shape. They also automatically delete memory and use reference counters to avoid dangling references from copies. Rarrays use heap memory, thus allowing for large arrays, and they store the data in a contiguous block, so they are compatible with common numerical libraries (BLAS, FFTW, …)

The general type is rarray<TYPENAME,RANK>, where RANK is a positive integer. There are also three shorthands available for 1d, 2d and 3d arrays:

rarray<TYPENAME,1> = rvector<TYPENAME>
rarray<TYPENAME,2> = rmatrix<TYPENAME>
rarray<TYPENAME,3> = rtensor<TYPENAME>

Here’s an example:

1 // rex.cc
2 #include <rarray>
3 #include <rarrayio>
4 #include <iostream>
5 int main() {
6     rarray<double,2> a(3,3);  // or, equivalently, rmatrix<double> a(3,3);
7     a.fill(4);
8     for (int i=0; i<3; i++)
9         for (int j=0; j<i; j++)
10             a[i][j] = i+j;
11     std::cout << a << std::endl;
12 }

To compile this code, you need the two header files rarray and rarrayio, which you either get by following the installation instructions in the github repository, or by directly downloading them from https://raw.githubusercontent.com/vanzonr/rarray/master/rarray and https://raw.githubusercontent.com/vanzonr/rarray/master/rarrayio. The compilation command would be fairly simple

$ g++ -std=c++14 -O2 rex.cc -o rex
$ ./rex
{
{4,4,4},
{1,4,4},
{2,3,4}
}

Note: compile with optimization (at least ‘-O2‘), otherwise, the abstraction layers of rarray make things unnecessarily slow. Table 2 explains some of the ways you can use rarray.

Define a n×m×k array of doubles: rarray<double,3> b(n,m,k);
Define it with preallocated memory: rarray<double,3> c(ptr,n,m,k);
Element i,j,k of the array b: b[i][j][k]
Pointer to the contiguous data in b: b.data()
Extent in the ith dimension of b: b.extent(i)
Array of all extents of b: b.shape()
Define an array with same shape as b: rarray<double,3> b2(b.shape());
Shallow copy of the array: rarray<double,3> d=b;
Deep copy of the array: rarray<double,3> e=b.copy();
A rarray re-using an automatic array: double f[10][20][8]=...;
rarray<double,3> g = RARRAY(f);
Output a rarray to screen: std::cout << h << std::endl;
Read a rarray from keyboard: std::cin >> h;
Table 2: Rarray in a Nutshell.

Let’s see how the various C++ array constructs that we have seen so far compare to rarrays.

Consider this code using C++ arrays:

1 #include <iostream>
2 int main()
3 {
4    double a[] = {0, 1, 2, 3, 4};
5    double *b = new double[5];
6
7    for (int i = 0; i < 5; i++)
8       b[i] = i;
9
10    for (int i = 0; i < 5; i++)
11       std::cout << a[i] << " ";
12    std::cout << std::endl;
13
14    for (int i = 0; i < 5; i++)
15       std::cout << b[i] << " ";
16    std::cout << std::endl;
17
18    delete [] b;
19 }

We can do the same with rarrays:

1 #include <rarray>
2 #include <rarrayio>
3 int main()
4 {
5    rvector<double> a(5);
6    a = 0, 1, 2, 3, 4;
7
8    rvector<double> b(5);
9    for (int i = 0; i < 5; i++)
10       b[i] = i;
11
12    rvector<double> c = linspace<double>(0,4,5);
13
14    std::cout << a << std::endl;
15    std::cout << b << std::endl;
16    std::cout << c << std::endl;
17 }

To pass an array to a function with plain C++ arrays, we needed to pass the size as well:

1 #include <iostream>
2 int sum_arr(int arr[], int n) {
3    int total = 0;
4    for (int i = 0; i < n; i++)
5       total += arr[i];
6    return total;
7 }
8 int main() {
9    const int arrsize = 8;
10    int cookies[arrsize]
11      = {1, 2, 4, 8, 16, 32, 64, 128};
12    int sum = sum_arr(cookies, arrsize);
13    std::cout << "Cookies eaten: " << sum << "\n";
14 }

But with rarrays, the size comes with the array:

1 #include <iostream>
2 #include <rarray>
3 int sum_arr(rvector<double> arr) {
4    int total = 0;
5    for (int i = 0; i < arr.extent(0); i++)
6       total += arr[i];
7    return total;
8 }
9 int main() {
10    const int arrsize = 8;
11    rvector<double> cookies(arrsize);
12    cookies = 1, 2, 4, 8, 16, 32, 64, 128;
13    int sum = sum_arr(cookies);
14    std::cout << "Cookies eaten: " << sum << "\n";
15 }

Passing a 2d array to a function required passing a pointer to a pointer and the number of rows and columns. With rarray, this becomes

1 #include <iostream>
2 #include <rarray>
3
4 int sum_arr(rmatrix<double> p) {
5   int total = 0;
6   for (int i = 0; i < p.extent(0); i++)
7     for (int j = 0; j < p.extent(1); j++)
8       total += p[i][j];
9   return total;
10 }
11
12 int main() {
13   int numrows=3,numcols=4;
14   rmatrix<double> p(numrows,numcols);
15   for (int i = 0; i < p.extent(0); i++) {
16     for (int j = 0; j < p.extent(1); j++) {
17       p[i][j] = i + j;
18       std::cout << p[i][j] << " ";
19     }
20     std::cout << std::endl;
21   }
22   std::cout << "Total = " << sum_arr(p) << "\n";
23 }

Returning an array from a function in pure C++ is tricky, as you need to return a pointer that was allocated in the function, but the caller still has to deallocate it after the function call. With rarray, there is no such problem:

1 rarray<double,2> zeros(int n, int m) {
2    rarray<double,2> r(n,m);
3    r.fill(0.0);
4    return r;
5 }
6
7 int main() {
8    rarray<double,2> s = zeros(100,100);
9    return s[99][99];
10 }

This involves no hidden copies, as rarray fully supports C++11 move semantics.

Finally, there is the rarrayio header for text-based input and output of rarrays. If you include this header, you can use the streaming operators with ofstreams, cout, etc. It will only output in ascii for now (not great). Its output format is the same as the one you would use to initialize an automatic array, i.e., with curly braces. This also means that the output can be read back in again.
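For example, a minimal sketch of writing an rarray to a file and reading it back (assuming the rarray and rarrayio headers are installed; details of the input operator may differ between rarray versions):

#include <rarray>
#include <rarrayio>
#include <fstream>

int main() {
    rmatrix<double> a(2, 2);
    a.fill(1.5);
    std::ofstream fout("a.txt");
    fout << a;                 // written in the curly-brace text format
    fout.close();

    rmatrix<double> b(2, 2);
    std::ifstream fin("a.txt");
    fin >> b;                  // read the values back in
    fin.close();
}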

2.2.24.3 For comparison: The vector class (don’t do this!)

A quick warning about the vector class from the standard template library, which you will bump into if you start poking around. The vector class has advantages, and is quite satisfactory for 1D arrays.

However, generalizing the vector class to higher dimensions quickly becomes extremely ugly and non-contiguous, e.g.:

1 using std::vector;
2 int n = 256;                // size per dimension
3 vector<vector<vector<float>>> v(n);// allocate for top dimension
4 for (int i=0;i<n;i++) {
5    v[i].resize(n);          // create vectors for middle dimension
6    for (int j=0;j<n;j++)
7       v[i][j].resize(n);    // create elements in last dimension
8 }

So we aren’t going to discuss it further.

2.2.25 More C++ resources

Some online resources that may help you out:

2.3 Best Practices

When creating an innovative solution to a brand-new type of problem, one can talk about the “art of programming”. But a large portion of programming requires following precise and concrete rules and conventions to be effective.

The term Best Practices refers to following some general guidelines developed and used by professional code developers and software engineers. It is all about reproducibility and developing code with a scientific approach. The best practices that we will discuss here are

  1. Modularity

  2. Documentation

  3. Defensive Programming

  4. Testing

  5. Version Control

You may wonder why these best practices are so important. At some point you may notice that some things will work even if you don’t follow the best practices mentioned here. For instance, not passing arguments to functions and instead accessing the values as global variables might sound simpler, but it breaks the modularity of the code and leaves it in a fragile state. What if other parts of the code are using and modifying this variable? How can you be sure the variable is initialized? Following best practices helps you avoid these issues and create robust, maintainable, reproducible and extensible research code.

2.3.1 Modularity

You can think of programming as playing with ‘Lego’ pieces, where the goal of the game is to build something bigger and more complex starting from smaller pieces with predefined shapes and forms. In this analogy, the individual Lego pieces would be the basic commands provided by the programming language that you will utilize, but one of the most powerful and useful features in any programming language is the possibility of combining these very basic building blocks into new building blocks by writing functions (a.k.a. subroutines and procedures), classes, modules and libraries.

Just as functions are bundled collections of commands that perform one task, a module is a bundle of functions that are responsible for one particular functionality. For instance, one might have an I/O module that has functions readfile and writefile. Modules can also contain classes and occasionally even data. Outside of the realm of C++, modules can also contain applications and utilities (Fig. 1).

Figure 1: Schematic representation of the software “ecosystem” that can be developed using different modules, functions and scripts.

You might still wonder why modularity matters for scientific computing. The answer lies in the realization that scientific software can be large, complex and subtle, and if each section uses the internal details of other sections, you must understand the entire code at once to understand what the code in a particular section is doing. The number of possible interactions grows as the square of the number of lines of code, so the larger a code gets, the more interactions you need to understand if the code is doing something unexpected that you are supposed to fix.

Let’s look at how modular programming is done in C++. We will use the following example, which pretends to compute the ground state of the hydrogen atom:

1 // hydrogen.cc
2 #include <iostream>
3 #include <fstream>
4 const int n = 100;
5 double m[n][n], a[n], b = 0.0;
6 void pw() {
7     double q[n] = {0};
8     for (int i = 0; i < n; i++)
9         for (int j = 0; j < n; j++)
10             q[i] += m[i][j]*a[j];
11     for (int i = 0; i < n; i++)
12         a[i] = q[i];
13 }
14 double en() {
15     double e = 0.0, z = 0.0;
16     for (int i = 0; i < n; i++) {
17         z += a[i]*a[i];
18         for (int j = 0; j < n; j++)
19             e += a[i]*m[i][j]*a[j];
20     }
21     return b + e/z;
22 }
23 int main() {
24     for (int i = 0; i < n; i++) {
25         a[i] = 1.0;
26         for (int j = 0; j < n; j++) {
27             m[i][j] = ...;
28         }
29     }
30     for (int i = 0; i < n; i++)
31         if (m[i][i] > b)
32             b = m[i][i];
33     for (int i = 0; i < n; i++)
34         m[i][i] -= b;
35     for (int p = 0; p < 20; p++)
36         pw();
37     std::cout<<"Ground state energy = "<<en()<<"\n";
38     std::ofstream f("data.txt");
39     for (int i = 0; i < n; i++)
40         f << a[i] << std::endl;
41     std::ofstream g("data.bin", std::ios::binary);
42     g.write((char*)(a), sizeof(a));
43     return 0;
44 }

(Note that line 27 just contains a placeholder ...; it would have to contain the matrix representation of the Hamiltonian operator of the neutral hydrogen atom, which, frankly, is hard to put in a single line of code.)

The hydrogen.cc code uses functions, so you might think that it is already modular. However, there are a few things that are far from modular, or otherwise “bad”:

  • It uses global variables that all of the code can modify.

  • All code is in one file.

  • There are no comments.

  • It is not clear which part does what, or what part needs which variables.

  • It uses cryptic variable and function names.

  • Hard-coded filenames and parameters.

  • Automatic arrays.

Even though this code may be able to run, it is worth fixing these things, because code is almost never a one-off and will be reused. So code should not just be written for a computer, but also, or perhaps even primarily, for humans.

Instead, use modularity, i.e., split your code up into separate sections or files. You must enforce boundaries between these sections of code so that you have self-contained modules of functionality. This is not just for your own sanity; there are added benefits. Each section can then be tested individually, which is significantly easier. Modularity makes rebuilding software more efficient. Furthermore, it makes version control more powerful, and it makes changing the code easier.

Separation of concerns is the principle behind modularity. The basic guiding principle on how to split up the code is to think about the blocks of functionality that you are going to need. Determine how the routines within these blocks are going to be used. Think about what you might want to use these routines for, and only then design the interface. Note that the interfaces to your routines may change a bit in the early stages of your code development, but if they change a lot you should stop and rethink things, because you’re not using the functionality the way you expected to.

Like documentation, thinking about the overall design, enforcing boundaries between modules, and testing, is more work up-front but results in higher productivity in the long run. Developing good infrastructure is always time well spent.

Let’s continue with our hydrogen.cc example, and extract just one module. Towards the end of the code, it writes out an array in binary and text formats. Let’s start with putting those parts in functions, and put function calls where the code used to be, i.e.

1 //hydrogen.cc
2 #include <string>
3 void writeBinary(const std::string& s, int n, const double x[]) {
4    std::ofstream g(s, std::ios::binary);
5    g.write((char*)(x), n*sizeof(x[0]));
6    g.close();
7 }
8 void writeText(const std::string& s, int n, const double x[]) {
9    std::ofstream f(s);
10    for (int i=0; i<n; i++)
11       f << x[i] << std::endl;
12    f.close();
13 }
14 //…
15 int main() {
16    //…
17    writeText("data.txt", n, a);
18    writeBinary("data.bin", n, a);
19    //…
20 }

These function definitions are both the interface and the implementation. We can separate these by extracting the prototypes and moving the function definitions to after main:

1 //hydrogen.cc
2 #include <string>
3 void writeBinary(const std::string& s, int n, const double x[]);
4 void writeText(const std::string& s, int n, const double x[]);
5 //…
6 int main() {
7    //…
8    writeText("data.txt", n, a);
9    writeBinary("data.bin", n, a);
10    //…
11 }
12 void writeBinary(const std::string& s, int n, const double x[]) {
13    std::ofstream g(s, std::ios::binary);
14    g.write((char*)(x), n*sizeof(x[0]));
15    g.close();
16 }
17 void writeText(const std::string& s, int n, const double x[]) {
18    std::ofstream f(s);
19    for (int i=0; i<n; i++)
20       f << x[i] << std::endl;
21    f.close();
22 }

We are ready to make a module now. To create the module, we put the prototypes for the functions in their own header file called outputarray.h:

1 //outputarray.h
2 #include <string>
3 void writeBinary(const std::string& s, int n, const double x[]);
4 void writeText(const std::string& s, int n, const double x[]);

(this is not the final version of the header file as we will see below). The source code with the definitions of the functions should be put into its own separate file as well, called outputarray.cc:

1 //outputarray.cc
2 #include "outputarray.h"
3 #include <fstream>
4 void writeBinary(const std::string& s, int n, const double x[]) {
5    std::ofstream g(s, std::ios::binary);
6    g.write((char*)(x), n*sizeof(x[0]));
7    g.close();
8 }
9
10 void writeText(const std::string& s, int n, const double x[]) {
11    std::ofstream f(s);
12    for (int i=0; i<n; i++)
13       f << x[i] << std::endl;
14    f.close();
15 }

Finally, the original code that uses these functions would look like:

1 //hydrogen.cc
2 #include "outputarray.h"
3 //…
4 int main() {
5    //…
6    writeText("data.txt", n, a);
7    writeBinary("data.bin", n, a);
8    //…
9 }

The code is all there, but in separate files. This is how we should compile this code, including the separate outputarray module:

  • Before the full program can be compiled, all the source files (hydrogen.cc, outputarray.cc) must be compiled.

  • outputarray.cc doesn’t contain a main function, so it can’t be an executable (no program to run).
    Instead outputarray.cc is compiled into a “.o” file, an object file, using the “-c” flag.

  • It is customary and advisable to compile all the code pieces into object files.

  • After all the object files are generated, they are linked together to create the working executable.

        $ g++ outputarray.cc -c -o outputarray.o    # compile
        $ g++ hydrogen.cc -c -o hydrogen.o          # compile
        $ g++ outputarray.o hydrogen.o -o hydrogen  # link
  • If you leave out one of the needed .o files you will get a fatal linking error: “symbol not found.”

By creating a header file we separated the interface from the implementation.

implementation

The actual code for writeBinary and writeText goes in the .cc (or .cpp or .cxx) or ’source’ file. This is compiled on its own, separately from any program that uses its functions.

interface

What the calling code needs to know; goes in the .h or ’header’ files. This is also called the API (Application Programming Interface).

This distinction is crucial for writing modular code.

Let’s review the compilation process one more time.

  • When hydrogen.cc is being compiled, the header file outputarray.h is included to tell the compiler that there exist, somewhere, functions of the form specified in that header.

  • This allows the compiler to check the number and type of arguments and the return type for those functions (the interface).

  • The compiler does not need to know the details of the implementation, since it’s not compiling the implementation (the source code of the routine).

  • The programmer of hydrogen.cc also does not need to know the implementation, and is free to assume that writeBinary and writeText have been programmed correctly.

Compilation pipelines can get fairly complex, as the example in Fig. 2 shows.

Figure 2: Example of a compilation pipeline.

There is one aspect of header files we still need to address: we should guard them against multiple inclusion. Header files can include other header files, and it can be hard to figure out which header files are already included in the program. But including a header file twice will lead to doubly-defined entities, which results in a compiler error. The solution is to add a preprocessor guard to every header file, like so

1 //outputarray.h
2 #ifndef OUTPUTARRAY_H
3 #define OUTPUTARRAY_H
4 #include <string>
5 void writeBinary(const std::string& s, int n, const double x[]);
6 void writeText(const std::string& s, int n, const double x[]);
7 #endif

What do we mean by this preprocessor? Well, before the compiler actually compiles the code, a preprocessor is run on the code. For our purposes, the preprocessor is essentially just a text-substitution tool. Every line that starts with # is interpreted by the preprocessor. The most common directives a beginner encounters are #include, #ifndef, #define, and #endif.

So what exactly should go into the header file? At the very least, the function prototypes. There may also be constants that the calling function and the routine need to agree on (error codes, for example) or definitions of data structures, classes, etc. Comments, which give a description of the module and its functions, should also be present. Not necessarily every function prototype should be in the header file, just the public ones. Routines internal to the module are not in the public header file. Finally, ideally there should really only be one header file per module. In theory there can be multiple source files, but you would only do that for very large modules.

The module’s source file should then contain everything that is declared in the .h file but whose code is not in the .h file, in particular the function definitions, as well as any internal routines used by the routines prototyped in the .h file. To ensure consistency, we advise including the module’s .h file at the top of the source file. In short, anything that needs to be compiled and linked to code that uses the .h file should be in the module’s source file.

The isolation of different components that results from modular programming comes with a number of benefits. The first benefit is that the code’s structure and the function of different sections are much clearer if the code is split up into different files, assuming these files are well named. For instance, instead of a single file hydrogen.cc, the listing of the source directory might look like this:

$ ls
hydrogen.cc    outputarray.cc  outputarray.h
initmatrix.cc  initmatrix.h    eigenvalues.cc
eigenvalues.h  Makefile
$

The names make it clear what each piece is responsible for. Having each file have a single responsibility makes the code easier to understand and navigate for your collaborators and for your future self! A clear modular structure is easier to maintain and adapt.

A second benefit is faster compilation and re-compilation. Consider again the compilation pipeline in Fig. 2. The compilations of the .cc files into .o files are independent and can be done simultaneously, i.e., in parallel. Furthermore, if only one .cc file has changed, only that file needs to be recompiled to rebuild the application.

A third benefit is the potential for better version control. As we will see in a later section, version control systems can keep track of changes in files. By splitting our code up into several files, we can more easily see what changes were made in what functional piece of the code (most code changes will only involve one or two files). This also makes it easier to restore a previous functionality by restoring just that file.

Finally, modular code is easier to test and to reuse. Each module can and should have a separate test suite. If the code is properly modular, each module’s tests should not need any of the other modules. Testing will give confidence in your module, and will tell you which modules have stopped working properly. Once the tests of a module pass, you have a piece of code that you could easily use in other applications as well, and which you can comfortably share.

2.3.2 Automating Software Building with make

Imagine that you have a code with hundreds of modules (this is a pretty common scenario in scientific computing; examples are codes like Cactus, Flash, NAMD, …), where realistic research problems require complicated build recipes. Imagine now that you modify a tiny portion of this code; thanks to the modular approach described before, this is possible without major impact on the rest of the code. However, the compilation process still requires each affected module to be recompiled, and having to issue a compilation command in the terminal for every module by hand can turn into an endless process.

On top of that, let’s recall that the typical compilation of any program can be done in a two-step process (see Fig. 2):

  1. First individually compile all .cpp files, but not the .h (header) files, to generate .o (object) files.

  2. Link all of the .o files together, including external .so and .a (shared-object and static library) files, to generate an executable.

However, it can get complicated and redundant:

  • you need to keep track of what depends upon what.

  • you need to retype the entire compilation command every time you need to recompile.

  • It’s easy to forget all of your compiler flags from one day to the next, as well as the location of external libraries.

It’s better to keep all of this information contained in a single file. This is where the so-called ’make’ program can be used to help with all these tasks.

make is a utility program that is used to build programs from multiple .cpp, .h, .o, and other files.

make can be considered as two languages in one: the first language describes dependency graphs consisting of targets and prerequisites, while the second language is a macro language for performing textual substitutions.

Some general considerations about make:
  • make is a very general framework that is used to compile code of any type.

  • make takes a Makefile as its input, which specifies what to do, and how.

  • The Makefile contains variables, rules and dependencies.

  • The Makefile specifies executables, compiler flags, library locations, …

  • It is a crucial component of professional software development.

  • Besides building programs, make can be used to manage any project where some files must be updated automatically from others whenever the others change (e.g. papers, …)

  • Further references are available in the GNU make reference manual: https://www.gnu.org/software/make/manual/html_node/index.html.

make can be invoked with a list of target filenames to build as command-line arguments, for instance

$ make  [TARGET ...]

When used without arguments, make builds the first target that appears in its makefile, which is traditionally a symbolic “phony” target named all.

make determines whether a target needs to be regenerated by comparing file modification times (a.k.a. time stamps).

This avoids rebuilding files which are already up to date, but be aware that it may fail when a file changes while its modification time stays in the past. Such changes could be caused by restoring an older version of a source file, or when a network file system is a source of files and its clock or timezone is not synchronized with the machine running make. In such cases, the user must handle the situation by forcing a complete build.

make can take many command-line arguments (try make --help), and can also tell you the available targets in a makefile.

The make-file

make will search the current directory for a makefile to use (e.g. makefile, Makefile, GNUmakefile) and, when found, will run the specified (or default) target(s) from that file. It is also possible to specify a different filename to be used as a makefile, by using the -f flag, e.g.

$ make -f myMakefile [TARGET ...]

A makefile is just a plain-text file with a particular structure; it may include rules and even use commands from the shell or other programs.

Rules

Each rule begins with a textual dependency line which defines a target followed by a colon (:) and optionally an enumeration of components (files or other targets) on which the target depends – i.e. its dependencies. Typically, a target is a file that has to be created or updated.

The dependency line is arranged so that the target (left hand side of the colon) depends on components (right hand side of the colon). It is common to refer to components as prerequisites of the target. Each line of commands under a given rule must start with a tab to be recognized as a command; otherwise make will stop with an error message stating something like “*** missing separator. Stop.” It is important that tabs are used and not spaces; some text editors have the intrinsic behaviour of replacing tabs with spaces, and this will break your makefile. Similarly, porting files between different operating systems can sometimes change the encoding of the files and cause similar problems.

TARGET: dependencies...
        [command 1]
        . . .
        [command n]
TARGET1 [TARGET2]: dep1 dep2 ...
        [command 1]
        . . .
        [command n]
Nomenclature and Further Considerations in make-files

Each command is executed by a separate shell or command-line interpreter instance. A backslash at the end of a line indicates a line continuation, and can be used to have commands executed by the same shell. Similarly, commands can be separated by semi-colons ";" and comments are included using #.

An @, results in the command not to be printed to standard ouput. The dependency line can consist solely of components that refer to targets

Macros & Variables

Macros are usually referred to as variables when they hold simple string definitions, such as “CXX = g++”. Macros in makefiles may be overridden by command-line arguments passed to the make utility. Macros allow users to specify the programs invoked and other custom behaviour during the build process. For example, the macro “CXX” is frequently used in makefiles to refer to the C++ compiler to use (and “CC” to the C compiler). Environment variables are also available as macros within makefiles.
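For example, assuming a makefile that uses the conventional CXX and CXXFLAGS macros, both can be overridden on the command line without editing the makefile:

$ make CXX=clang++ CXXFLAGS="-O3 -std=c++14"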

As in the shell, a variable is referenced using $, with its name enclosed in parentheses “(”…“)” or braces “{”…“}”, e.g. $(CC) or $(CCFLAGS). Single-character variables, such as $@ and $^, do not need the parentheses.

Additionally, make has some predefined “automatic” variables based on the contextual information of the rules; see Table 3.

$@ refers to the target filename
$* refers to the target filename, without the file extension
$< refers to the filename of the first prerequisite
$^ refers to the filenames of all the prerequisites, separated by spaces, with duplicates discarded
$+ similar to $^, but including duplicated filenames
$? refers to the names of all the prerequisites that are newer than the target, separated by spaces
Table 3: List of automatic variables available in make
Special rules
  • Suffix Rules: have targets with names of the form “.FROM.TO”, and are used to launch actions based on file extensions. E.g.

    .SUFFIXES: .txt .html
    # From .html to .txt
    .html.txt:
            lynx -dump $< >$@
    $ make file.txt
    lynx -dump file.html > file.txt
  • Pattern Rules: a pattern rule looks like an ordinary rule, but its target contains the symbol “%”, which matches any (non-empty) string; any file matching this “pattern” can be built by the rule. For instance,

    %.o : %.c
            $(CXX) -c $(CFLAGS) $(CXXFLAGS) $< -o $@

    this rule compiles files ending in “.c” into object files ending in “.o”. In addition, the rule employs the conventional macros for the compiler (CXX) and the compilation flags (CFLAGS and CXXFLAGS), and uses the automatic variables ‘$@’ and ‘$<’ to substitute the names of the target file and the source file.

In addition to these examples, there are other “built-in rules” predefined in make, see the make’s Catalogue of Rules for more details.

Phony targets

are targets that do not correspond to a file or other object. Phony targets are usually symbolic names for sequences of actions.
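A typical example is a clean rule: declaring it phony tells make that clean is not a file to be built, so the rule runs even if a file named clean happens to exist (myprog is a placeholder for your executable):

.PHONY: clean
clean:
        rm -f *.o myprog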

Parallel Execution

Normally, make executes only one recipe at a time, waiting for it to finish before executing the next one. However, the ‘-j’ (or ‘--jobs’) option tells make to execute several recipes simultaneously. If the ‘-j’ option is followed by an integer, this is the number of recipes to execute at once; this is called the number of job slots. E.g. running “make -j 8” will make make execute the default makefile while attempting to run up to 8 rules at the same time. This is a really useful flag to consider, usually employed when compiling large pieces of code or big programs. If there is nothing looking like an integer after the ‘-j’ option, there is no limit on the number of job slots. The default number of job slots is one, which means serial execution (one thing at a time). It is possible to inhibit parallelism in a particular makefile with the “.NOTPARALLEL” pseudo-target.

Examples of makefiles

We present here some useful examples of makefiles that can be used in different scenarios, from compiling and distributing code, to generating documents from LaTeX, or even self-documenting your makefiles.

  • Simple example for listing a directory content:

    # target: list - List source files from src directory
    list:
            # it won’t work, as each command runs in a separate shell
            cd src
            ls
    list:
            # Correct way, continuation of the same shell
            # Notice the semicolon (;) and backslash (\)
            cd src; \
            ls
  • Example of a “distribution” rule, utilizing macros

    # some macros definitions
    # name of the package, ie. your code
    PACKAGE = name.of.your.code
    # version will be given by the current date
    VERSION = `date +"%Y.%m%d"`
    # combine name and version
    ARCHIVE = $(PACKAGE)-$(VERSION)
    # distribution rule
    dist:
            # the macros will be expanded for the shell
            # to interpret:
            #  tar -cf name.of.your.code-`date +"%Y.%m%d"`.tar
            tar -cf $(ARCHIVE).tar .
  • Compilation of a program with a single file

    # This Makefile compiles a program composed of a single file
    # It also uses two external libraries, GSL and GSL CBLAS
    # Define the compiler to use
    CXX = g++
    # Specify the location of the library
    # the operator ?= will assign the current directory (.)
    # to the macro only if that variable is not defined
    GSL_INC ?= .
    GSL_LIB ?= .
    # Compiler and linker flags
    # -I specifies location of header files
    # -L location of libraries
    # -O2 is an optimization flag, indicating
    #   the level of optimization to be used by the compiler
    # -std set the language standard to compile, in this case
    #   will be c++14
    CXXFLAGS = -I${GSL_INC} -O2 -std=c++14
    LDFLAGS = -L${GSL_LIB}
    # -l specifies the name of the libraries
    LDLIBS = -lgsl -lgslcblas
    all: myprog
    myprog: myprog.o
            ${CXX} -o myprog myprog.o ${LDFLAGS} ${LDLIBS}
    myprog.o: myprog.cpp
            ${CXX} ${CXXFLAGS} -c -o myprog.o myprog.cpp
  • Compilation of a program with several modules

    # This Makefile compiles a program composed of two modules:
    # outputarray.cpp and MyArray.cpp
    # It also uses two external libraries, GSL and GSL CBLAS
    # Define the compiler to use
    CXX = g++
    # Specify the location of the library
    # the operator ?= will assign the current directory (.)
    # to the macro only if that variable is not defined
    GSL_INC ?= .
    GSL_LIB ?= .
    # Compiler and linker flags, -O2 is an optimization flag
    CXXFLAGS = -I${GSL_INC} -O2 -std=c++14
    LDFLAGS = -L${GSL_LIB}
    # name of the libraries
    LDLIBS = -lgsl -lgslcblas
    MyArray: MyArray.o outputarray.o
            ${CXX} -o MyArray MyArray.o outputarray.o ${LDFLAGS} ${LDLIBS}
    MyArray.o: MyArray.cpp outputarray.h
            ${CXX} ${CXXFLAGS} -c -o MyArray.o MyArray.cpp
    outputarray.o: outputarray.cpp
            ${CXX} ${CXXFLAGS} -c -o outputarray.o outputarray.cpp
    clean:
            rm -f MyArray.o outputarray.o MyArray

    (Dependency diagram: MyArray.cpp and outputarray.cpp are compiled into MyArray.o and outputarray.o, which are linked, together with the gsl and gslcblas libraries, into the executable MyArray.)

  • Example of a makefile for compiling LaTeX documents

    # Makefile for compiling a latex paper
    NAME=manuscript
    TARGET=$(NAME).pdf
    SOURCE=$(NAME).tex
    JUNK=.aux .bbl .blg .dvi .log .nav .out .ps .pdf .snm .tex.backup .tex.bak .toc Notes.bib
    .PHONY: clean
    $(TARGET): $(SOURCE)
            @pdflatex $(SOURCE)
            @bibtex $(NAME)
            @pdflatex $(SOURCE)
            @pdflatex $(SOURCE)
    all: $(TARGET)
    clean:
            @for ext in $(JUNK); do \
                    rm -v $(NAME)$$ext; \
            done
  • Generic structure for a more general Makefile, including self-documentation by displaying comments introduced with a double hashtag, i.e. “##”

    ## debug:  compile code with debugging flags
    debug:  ...
            ...
    ## build:  compile code with default options
    build:  ...
            ...
    ## install:  install package
    install:  ...
            ...
    ## clean :  Remove auto-generated files
    .PHONY: clean
    clean: ...
            ...
    .PHONY: help
    help:  Makefile
            @sed -n 's/^##//p' $<

2.3.3 Documentation

Documenting your code means describing, in that code, what you try to achieve with the code and how. There are many documentation styles and philosophies:

  • No documentation

  • Comments-only documentation

  • Auto-generated documentation

  • New-user oriented

The “no documentation” style also often advocates no comments. The pretense is that clear code is “self-documenting”. Sure, but, …yeah, sorry, no. At some point, your code will get read by someone with (much) less understanding of what it’s supposed to do than you. This includes your future self.

One of the most important and useful ways to inform this person of the point of the code is via comments. Comments are fragments of text that are included in the code, but that the computer completely ignores. While the C-way of commenting, with comments starting with /* and ending with */, is still supported, in C++ comments can start with //, and anything that follows on that line is a comment (much like in bash, R and Python, where the start of a comment is indicated with a #). It is sometimes said that code is for computers while comments are for the humans reading the code. Commenting is a really powerful tool to match the instructions given to the computer with the mental model we have of what our program is supposed to be doing. In addition, comments will help you to quickly recall what the code is supposed to do, or to get familiar with someone else’s code.

The comments can also include a proper description of your functions, their expected arguments with their types and ranges of values, limiting cases, etc. By adding specially formatted comments, one can use tools like doxygen to auto-generate documentation. Doxygen can read your code, combine the definitions of your functions and scripts with your comments, and generate a manual for your code. This is pretty decent, and a good way to keep documentation up-to-date when the code changes.

If your code is sufficiently complex, you do not want the user to have to read it through to use it. Think about what documentation you would need to be able to use the module. You might want a more pedagogical manual or tutorial, which is beyond what automated tools can do. These are additional files, and while it is true that these now have to be maintained and updated when the code changes, they are an important tool for being able to use a complex code. If you do nothing else, at least add a README text file or README.md markdown file.

Let’s look at the outputarray module of the previous section, and see how one could use doxygen to generate documentation for it, using specially formatted comments. We only need to change the header file, and one way to do it would be as follows:

1 /// @file   outputarray.h
2 /// @author Ramses van Zon
3 /// @date   January 14, 2019
4 /// @brief  Module for writing a 1d array of doubles to text and binary files.
5 #ifndef OUTPUTARRAYH
6 #define OUTPUTARRAYH
7 #include <string>
8
9 /// @brief Function to write an array of doubles to a binary file.
10 /// This function does a raw dump of the array file to file.
11 /// @param  s  the filename
12 /// @param  n  number of elements of the array to write to file
13 /// @param  x  pointer to the first element of the array of doubles
14 void writeBinary(const std::string& s, int n, const double x[]);
15
16 /// @brief Function to write an array of doubles to a text file.
17 /// The file will contain each element of the array on a separate line.
18 /// @param  s  the filename
19 /// @param  n  number of elements of the array to write to file
20 /// @param  x  pointer to the first element of the array of doubles
21 void writeText(const std::string& s, int n, const double x[]);
22 #endif

To have doxygen create documentation out of this, one must first generate a configuration file called Doxyfile, with:

$ doxygen -g

and then edit it to change the PROJECT_NAME variable to Outputarray.

Next, one can create a text file called README.md, which has a special first line that tells doxygen to put this on the first page of the documentation:

 [//]: # \mainpage
 Outputarray is a module for writing a 1d array of doubles to text and binary files.
 Compile with: "g++ -c -std=c++14 outputarray.cc -o outputarray.o"
 Generate documentation with doxygen as follows

    doxygen -g
    sed -i 's/PROJECT_NAME[ ]*=.*/PROJECT_NAME=Outputarray/' Doxyfile
    doxygen
    make -C latex
 This requires doxygen and latex to be installed.
 The resulting documentation will be in latex/refman.pdf and html/index.html.

Next, we can generate the documentation in html and latex form using the command

$ doxygen

The html result is shown in Fig. 3.

Figure 3: Example output of doxygen.

2.3.4 Defensive Programming

Defensive programming refers to developing code in such a manner that the program checks for situations that could cause it to fail (crash). The main idea is to think of all the possible situations or cases that could cause the program to fail and to implement protections against them. For instance, if your code computes the square root of a given variable, you may want to check that this variable holds a non-negative number. More generally, this means checking that function arguments, or script command-line arguments, meet certain criteria. How this is accomplished depends upon the context and the programming language, but there are some common situations. Examples include: checking that numbers are not negative (when they shouldn’t be); checking that arguments are of the correct type, or that arguments were passed to the function or script at all. At a more basic (“low-level”) level, for instance in C/C++, where memory must be requested explicitly when working with dynamically allocated arrays (see 2.2.11), one could check at runtime whether this request was successful; a failure might point to an insufficient amount of free memory at the moment of running the program. Similarly, other places where the code interacts with the OS or with other libraries can and should be protected against the possible, different types of failures.

One thing that should be noted is that implementing defensive programming strategies will not prevent the code from failing; rather, it will stop the program in a controlled manner and, equally important, inform the user of what type of error or situation was detected, why the code was terminated, and possibly suggest how to fix it.
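As an illustration, here is a minimal sketch of these ideas in C++ (the function name mySqrt and the array size are purely hypothetical); it checks a function argument and a dynamic memory request, and stops in a controlled way with an informative message when something is wrong:

#include <iostream>
#include <cstdlib>
#include <cmath>
#include <new>

// Hypothetical function: refuses to take the square root of a negative number.
double mySqrt(double x)
{
    if (x < 0.0) {
        std::cerr << "mySqrt: argument " << x
                  << " is negative; expected a non-negative number.\n";
        std::exit(1);   // stop in a controlled manner
    }
    return std::sqrt(x);
}

int main()
{
    // Check that a dynamic memory request succeeded before using the array.
    double* data = new (std::nothrow) double[1000];
    if (data == nullptr) {
        std::cerr << "Could not allocate memory for 'data'; "
                     "check the available memory and try again.\n";
        return 1;
    }
    data[0] = mySqrt(2.0);
    std::cout << "sqrt(2) = " << data[0] << "\n";
    delete[] data;
    return 0;
}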

2.3.5 Testing

Testing refers to the process of writing additional routines (functions and/or scripts) to check that the code we are developing gives correct results for known cases. The purpose of these testing routines is to test the code against known situations, now and especially in the future as we develop or update parts of the code. They can also be used to check that the results of the code are not affected by where (i.e. on which computer) it is being run.

There are two broad categories of testing:

Integrated testing

refers to testing the whole gigantic program, integrating the outputs of your many modularized pieces. Integrated testing is important, but should not be the only type of testing done. As you develop a large program with many interacting parts, bugs will appear, and it will be difficult to determine the source of a problem if testing is only performed on the integrated whole.

Unit testing

refers to testing the elementary pieces of your codes, i.e. modules, individually. This means writing code that will test the pieces of code in question. Test against easy solutions, typical solutions, edge cases, special cases. This enormously speeds up the detection of bugs.

In general, if you are given a code that does not come with testing routines, be careful and suspicious, and try implementing your own test cases in order to verify the validity and sanity of the code.

One very important observation is that many of the elements of ‘best practices’ come together at some point. For instance, you won’t be able to implement unit tests if the code is not modular. Similarly, defensive programming is closely related to testing; in fact, you can think of them as being in some sort of “symbiotic relationship”: by testing the code one might learn about new special cases or protections that the functions you develop might need, and vice versa, implementing special guards in your code may give rise to special tests you would like to check the code against. However, they should not be confused or mixed: defensive programming is about making your code robust and able to handle all types of situations, while testing is about verifying the outcome of your code or routines under known situations.

Another thing to notice, and unfortunately a reality one always faces when making code more robust (i.e. more defensive), is that there is usually a price to be paid in performance. For instance, consider a function that receives certain arguments: if we want to check the nature of those arguments (i.e. make the function more robust), then every time the function is called it will perform these checks, usually implemented with some sort of conditional, such as if-statements (if-statements are a relatively expensive operation for the computer, which is why it is often better to use vectorized operations, like slicing, where possible). It is also true that in some cases we can make assumptions about the internal data types handled by our functions; nonetheless, having defenses in place is always a good idea, hence the trade-off.

Types of tests

  • Comparison to analytic solutions:

    • Solutions tend to be for simple situations - not hard tests of the computation.

    • If your code doesn’t get these right you’ve got serious problems.

  • Benchmarking: comparing the results of your code to other codes which solve the same problem, in the same parameter regime.

    • Does not demonstrate that either solution is correct.

    • Can show that at least one code or version has a problem, or that something has caused changes.

    • Is more powerful if different algorithm types are used.

    • Save the results of benchmarks in your testing directory.

  • Testing against reality:

    • If your code calculates something that can be compared against reality, do it as one of your tests.

    • Assume that reality is correct.

Every so often, especially when editing your routines, re-run your test suite (the set of testing routines is usually referred to as the “test suite”). There also exist more advanced testing frameworks to build into your code, such as Boost.Test (from the Boost library suite, https://www.boost.org/), the Google C++ Testing Framework (a.k.a. googletest, https://github.com/google/googletest), Catch2 (https://github.com/catchorg/Catch2), etc. In most cases, these suites define a series of “macro” functions that help with the implementation of the tests, for instance when implementing the conditionals or inquiring about a particular result, but the actual logic and implementation details are up to the programmer.

More importantly, when developing code, it is customary and part of “best practices” to include testing routines that the user can run, or that are run at installation time, so as to guarantee that each module and part is working as expected.

2.3.5.1 General Guidelines for Testing
  • Each module should have a separate test suite.

  • If the code is properly modular, those module tests should not need any of the other .cc files.

  • Include new rules for testing in your makefile, so that the tests can be easily executed.

  • Especially when your project contains a lot of tests, you may want to use a unit testing framework.

  • Testing will give confidence in your module, and will tell you which modules have stopped working properly.

  • Once your tests are okay, you now have a piece of code that you could easily use in other applications as well, and which you can comfortably share.

2.3.5.2 Examples of Tests

Let’s take the example of the Hydrogen energy levels calculation introduced in Sec. 2.3.1. The first approach we could take is to perform an integrated test of the modular implementation of the code against the original “monolithic” one (Lst.1). By saving the results from both codes and comparing them against each other, we would be able to see whether the results generated with the modular implementation differ from those of the original starting code.

An integrated test alone is not enough to guarantee that everything is working fine. More importantly, if an integrated test is failing, it won’t be very useful in determining which module is failing. Hence we need to implement unit testing, which will improve our confidence in each of the individual modules and parts of the code, and help us narrow down a problem in case there is one.

The following is an example of a unit test for the outputArray module presented in Lst.1 and 2.

Listing 1: “Unit Testing for OutputArray module”
1 // outputarr_test.cc
2
3 #include "outputarr.h"
4 #include <iostream>
5 #include <fstream>
6
7 int main(){
8         std::cout << "UNIT TEST FOR FUNCTION 'toAsc'\n";
9
10         // create file:
11         rvector<double> a(3);
12         a = 1,2,3;
13         toAsc("testoutputarr.txt", a);
14
15         // read back:
16         std::ifstream in("testoutputarr.txt");
17         std::string s1, s2, s3; in >> s1 >> s2 >> s3;
18
19         // check:
20         if (s1!="1" or s2!="2" or s3!="3"){
21                 std::cout << "TEST FAILED\n";
22                 return 1;
23         } else {
24                 std::cout << "TEST PASSED\n";
25                 return 0;
26         }
27 }
#Makefile
...
test: outputarr_test
        ./outputarr_test
outputarr_test: outputarr_test.o outputarr.o
        $(CXX) -g -o outputarr_test outputarr_test.o outputarr.o
outputarr_test.o: outputarr_test.cc outputarr.h
$ make test
  g++ -g -std=c++14 -c -o outputarr_test.o outputarr_test.cc
  g++ -o outputarr_test outputarr_test.o outputarr.o
  ./outputarr_test
  UNIT TEST FOR FUNCTION toAsc
  TEST PASSED
$ echo $?

Listing 2 shows an example of a test module using the Boost.Test framework. A couple of important observations should be made. To start with, Boost.Test can be thought of as a series of macros, provided so that your unit-test module does not need to have a main function; Boost will basically create an equivalent one for each test suite created. Similarly, Boost.Test offers a couple of ways in which it can be used, either statically or dynamically linked: the former implies that the needed functions are compiled into your executable, while the latter refers to the needed library at run time. Each one has its own advantages and disadvantages, but one important point to keep in mind is that these two approaches should not be mixed. The reason for this warning is that unexpected side effects might result, such as runtime segmentation faults while Boost tries to free non-existent memory objects. One typical case, for instance, would be to include a header file from the .../included/... subdirectory combined with the -lboost_unit_test_framework linker flag. This results in duplicate definitions of Boost.Test: you get one instance built from ”.../included/...” and a second instance pulled in at link time. This is usually referred to as a one definition rule (ODR) violation.

To avoid this type of issue, one must either use the “included” variant or link against the library, one or the other but not both!

The correct #include when linking against the library is #include <boost/test/unit_test.hpp>; with the “included” variant, one does not link against the library at all.

Listing 2: “Unit Testing for OutputArray module using the BOOST Test framework”
1 // output_bt.cc
2 #include "outputarr.h"
3 #include <iostream>
4 #include <fstream>
5
6 #define BOOST_TEST_DYN_LINK
7 #define BOOST_TEST_MODULE output_bt
8 #include <boost/test/unit_test.hpp>
9
10 BOOST_AUTO_TEST_CASE(toAsc_test){
11         // create file:
12         rvector<double> a(3);
13         a = 1,2,3;
14         toAsc("testoutputarr.txt", a);
15
16         // read back:
17         std::ifstream in("testoutputarr.txt");
18         std::string x, y, z; in >> x >> y >> z;
19
20         // check:
21         BOOST_CHECK(x=="1" && y=="2" && z=="3");
22 }
$ g++ -std=c++14 -g -c output_bt.cc
$ g++ -g -o output_bt output_bt.o outputarr.o -lboost_unit_test_framework
$ ./output_bt --log_level=all

2.3.6 Debugging Techniques

Debugging refers to the process of attempting to identify and fix errors in our codes.

This is, in general, not an easy task, for several reasons. To start with, it implies that we need to match our ‘mental’ model, i.e. the conceptual abstraction of what we want and think our code is doing, with what it is actually doing in reality: which part of our mental model does not match reality? In addition, there are the practical implications of actually finding which part of the implementation in the code is responsible for that mismatch.

Debugging Workflow

As soon as you are convinced there is a real problem, create the simplest situation in which it repeatedly occurs. After all, this should follow a scientific methodology, i.e.: model, hypothesis, experiment, conclusion.

Try a smaller problem size, turning off different physical effects with options, etc, until you have a simple, fast, repeatable example where the problem occurs.

Try to narrow it down to a particular module/function/class, and this is why modularity is again a crucial element to be present in addition to testing.

Additionally, one would want to have “integrated calculations” and intermediate results written out, so that they can be inspected and compared to expected results to check whether something unexpected is noticed (i.e. integrated tests).

One thing that should be noted is that in the process of “debugging” we are often tempted to add “print” statements to inspect the changes or evolution of variables and flags, or even to tell us “where” (in which part of the code) something is happening. Even though this “methodology” might sound tempting, and many of us have done it, it is really a pathological approach. There are several problems with it. To start with, it is a never-ending cycle: strategically add print statements, recompile the code, run it, analyze the output… and repeat. Furthermore, one needs to remove all the extra code after the problem has been found and fixed, and of course this has to be repeated for every problem and possible bug we may have. In the end, this is really time consuming, and it is error prone, as we are manipulating and changing a code which we already believe has flaws. More importantly, this method changes the actual code, and hence the memory footprint of the program; in cases where the errors are strongly related to memory (e.g. pointer allocation, uninitialized pointers or memory addresses, out-of-range indexes, etc.), this can directly affect the behavior of the program and change the type of error reported.

Instead, the best way to tackle a debugging situation is to use a specialized program called a debugger. A debugger is a program used to help us detect errors in other programs. One thing to note is that, in general, a debugger will not find problems by itself, but will instead help us (i.e. the developer) find where and/or what the problem is.

Recommendations to avoid debugging, i.e. to reduce “bugs” when coding
  • Write cleaner, better code by:

    • including simple, clear, straightforward code (the “KISS” principle: “keep it simple, stupid”).

    • using modularity (avoid global variables and 10,000 line functions).

    • avoiding “cute tricks”, (no obfuscated C code winners – IOCCC).

  • Do not write code, use existing libraries (see Sec.2.3.8).

  • Write (simple) tests for each module (see Sec.2.3.5).

  • Switch on the -Wall flag, inspect all warnings, fix them or understand them all.

  • Use defensive programming (see Sec.2.3.4): check arguments, use assert (which can be switched off with -DNDEBUG), as sketched below.
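A minimal sketch of the assert-based variant (the helper function checkedSqrt and its precondition are only illustrative):

#include <cassert>
#include <cmath>

// Hypothetical helper: the caller is required to pass a non-negative value.
double checkedSqrt(double x)
{
    // Aborts with a diagnostic if the condition fails;
    // compiled away when -DNDEBUG is passed to the compiler.
    assert(x >= 0.0);
    return std::sqrt(x);
}

int main()
{
    return (checkedSqrt(4.0) == 2.0) ? 0 : 1;
}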

2.3.6.1 Typical error types

Table 4 shows a list of different paradigmatic types of errors that can be seen during run-time.

Arithmetic Corner cases (sqrt(-0.0)), infinities
Memory access Index out of range, uninitialized pointers.
Logic Infinite loop, corner cases
Misuse Wrong input, ignored error, no initialization
Syntax Wrong operators/arguments
Resource starvation Memory leak, quota overflow
Parallel Race conditions, deadlock
Table 4: Summary of some common errors during run-time.
2.3.6.2 Debuggers

As mentioned before the best and recommended way to tackle the task of finding a bug in your code, ie. debugging, is by using a debugger.

Among the most interesting and useful features, debuggers allow us to:

Crash inspection

In many cases, when a code crashes (a code “crashes” when its execution ends abruptly, outside the control of the program, i.e. it is killed by the OS because of a system violation), the OS leaves behind a file containing the state of the memory related to the program, as well as its function call stack, at the moment of the crash. These files are usually known as “core dumps” and can be analyzed by debuggers: debuggers can read them and reconstruct the state of the memory and of the function calls. In this way, it is possible to inspect what the state of the program was just before crashing.

Function call stack

When running a program through a debugger, the debugger will keep a list of the calls to the functions, i.e. “function call stack”.

Step through code

By running the code through a debugger, it is possible to advance the execution of the program in a controlled manner. This can be done line by line, or by stepping into function calls, inspecting the instructions at the same time as the program executes in this “controlled” manner.

Automated interruption

Similarly to the previous feature, it is possible to tell the debugger to execute the program up to a certain point in the code and, when this point is reached, to step into the code, inspect the values of the different variables, evaluate line by line from that point onwards, etc. The points that indicate where to take control of the execution are usually referred to as “break points”. They can be assigned to particular lines or functions.

Variable checking and setting

As the debugger allows us to take control over the execution of the program, it also allows us to inspect the values of the variables used by the program (similar to what we would do using “print statements”, but in a more elegant, efficient and controlled fashion). Additionally, a debugger allows us to be notified when a given variable changes its value, interrupting the execution of the program, as well as to manually change the value of a variable while the program is running.

It is possible to use a text-based debugger or a graphical one. One of the most common debuggers is gdb (the GNU debugger). The advantage of using a graphical debugger is that it simplifies the interaction with the user by providing an easier-to-use interface. However, the use of a graphical debugger may have a downside when using it remotely, for instance when debugging on a cluster or remote computer, as the graphical information has to be sent from the remote computer to our local station. Some debuggers alleviate the cost of running a GUI remotely by providing a client-server way of establishing the connection and only sending the relevant information from one side to the other; an example of this is Allinea’s DDT, which is a commercial debugger. Another example of a graphical debugger is DDD (which stands for Data Display Debugger), which is just a GUI on top of gdb. Even gdb provides a text-based interface to aid with the different actions available; this TUI (text user interface) can be accessed by passing the -tui flag when executing gdb on the command line.

Independently of whether the debugger has a graphical interface or a traditional text one, they share the same concepts.

help h print description of command
run r run from the start (+args)
backtrace/where ba function call stack
break b set breakpoint
delete d delete breakpoint
continue c continue
list l print part of the code
step s step into function
next n continue until next line
print p print variable
display disp print variable at every prompt
finish fin continue until function ends
set variable set var change variable
down do go to called function
until unt continue until line/function
up up go to caller
watch wa stop if variable changes
quit q quit gdb
Table 5: Summary of gdb commands
2.3.6.3 Using a Debugger

In order to properly use and take advantage of the full functionality offered by a debugger, there are a couple of things we need to do to prepare the code for it.

The first step would be to re-compile our code using some compilation flags:

  • -g enables the creation of information for the debugger.

  • -gstabs makes the executable include “symbols”, i.e. it keeps the names of variables and functions with the executable, which would otherwise be stripped out by the compiler.

  • Turn off optimization flags, which is achieved with -O0. The reason for this is that the different optimization levels allow the compiler to rearrange our lines of code (instructions) so that it can find the most optimal (fastest) way of performing the required computations. The issue is that, when the code is optimized, trying to follow its execution may not correspond to what we originally wrote in our source code.

In some cases, it could be worth adding a new rule to our makefile with the sole purpose of compiling the executable for debugging purposes, i.e. using these flags, as in the sketch below.
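A minimal sketch of such a rule (code.cc and app_debug are placeholders for your source file and debug executable):

# build a debug version: debugging info, no optimization
debug: code.cc
        ${CXX} -g -O0 -o app_debug code.cc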

Running a code through a debugger can be done in a couple of different ways. The simplest one is to load the executable in the debugger and launch it from within it. Alternatively one could “attach” a debugger to an already running program.

  • Preparing executable for debugging
    $ g++ -g -gstabs -O0 code.cc -o app

  • Launching the debugger
    $ gdb app

  • Attaching the debugger to a running program

    • First, get the “process id” from the running program, eg.

      $ ps -C app -o pid h

      this will result in the number assigned to this process, let’s say pid.

    • After getting this, attach the debugger to the process.

      $ gdb -p pid

    Alternatively this can also be done from within the debugger by using the attach command with the proper pid.
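To give an idea of what a session looks like, the following sketch loads the (hypothetical) executable app, sets a breakpoint, runs the program, steps one line, and prints a variable; the variable name somevar is of course a placeholder:

$ gdb app
(gdb) break main
(gdb) run
(gdb) next
(gdb) print somevar
(gdb) continue
(gdb) quit

Here break main sets a breakpoint at the start of main, run starts the program (arguments can be added after run), next executes one line without stepping into functions, print somevar inspects a variable, and continue resumes execution until the next breakpoint or the end of the program (see Table 5 for the full list of commands).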

Memory issues

An equally important part of debugging is the overall view of the memory layout used by a program. Some types of bugs are directly related to memory problems, such as uninitialized variables, attempts to access invalid memory addresses (e.g. when going past the limits of an array), not releasing memory (usually referred to as a memory leak), etc.

One remarkable tool to tackle these issues is a free, open-source program called Valgrind (https://www.valgrind.org/). Valgrind is a set of tools that can detect several of these memory problems, as well as profile the program.
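For instance, a minimal invocation of Valgrind’s default memory checker on the (hypothetical) executable app looks like this:

$ valgrind --leak-check=full ./app

which reports invalid reads and writes, uses of uninitialized values, and a summary of any leaked memory.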

2.3.7 Version Control

Version Control is a tool, in the sense of a utility program, i.e. a program that is useful for a very particular task. The main purpose of any version control tool is to track changes in a set of files. In other words, Version Control is a tool for managing changes in a set of files: it keeps historical versions for easy tracking, essentially by taking snapshots of the files at given moments in time.

For instance, if you are writing a manuscript and want to keep copies of different stages as the paper evolves, instead of keeping some arbitrary number of copies floating around, a version control tool will allow you to keep just one file while tracking every single change in the file(s), restoring old versions, comparing changes among the different stages, and going back and forward in the evolution of the manuscript. If you are writing a program, it will allow you to track changes in the code, which can be really useful when identifying problems with newly implemented functions. Version control can also be used in collaborative environments; for example, whether writing a manuscript or collaborating on the development of a complex program with many different parts, a version control tool will prevent the co-authors and developers from overwriting and overstepping each other’s changes. As mentioned, there are many different applications for version control tools; among the most common ones are writing and developing programs, writing manuscripts, and collaboration among several developers and authors.

As a general rule, it is recommended not to include derivative files that are generated while compiling the code or manuscript. For instance, object files, intermediate files or the executable should not be added to the repository: they are regenerated continuously, they are in many cases in binary format, and they will most likely end up producing conflicts that are hard to resolve due to their binary nature.

There are many different types of version control programs, each of them with different variations on the same main functionality. Examples of version control programs are git, mercurial, svn, cvs, etc. The differences range from style to the core of how the changes are tracked. The particular program we will use and focus on in this course is git.
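For instance, with git the derivative files mentioned above can be kept out of the repository by listing patterns for them in a plain-text file called .gitignore at the top level of the repository; a minimal sketch (myprog stands for whatever your executable is called):

# .gitignore: files that git should not track
*.o
*.log
*.aux
myprog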

Version control systems are used at the highest level of professional software development.

There are also web-based hosting services built around these version control tools; among the better-known ones are GitHub (https://github.com/), Bitbucket (https://bitbucket.org), GitLab (https://about.gitlab.com/), etc.

Figure 4: Schematic representation of Version Control using GIT. Version Control is a tool for managing changes in a set of files. It keeps historical versions for easy tracking. It essentially takes a snapshot of the files at a given moment in time. The vertical black arrow on the left represents the flow of time –i.e. the direction in which time increases–, hence points at different heights represent different moments in time. The next, orange, line represents the actual state of the file(s) in the local directory where the repository exists. The shaded region is a representation of the data that is stored in the repository, and the arrows pointing from the “local directory” into the repository represent the operations that are needed in order to populate the repository’s timeline. One particular detail associated with the use of GIT as a version control system is that GIT requires two steps in order to store an instance in time. This has advantages, such as that several changes can be grouped together in one single instance in time. If you would like to interpret this diagram as a space-time diagram –such as the ones used in theoretical physics– you could think that version control is so powerful that it allows you to travel in time across your files’ space-time.

Fig.4 shows a schematic representation of how version control with GIT works. The critical components of any Version Control system are the “checkpoints” or “snapshots”, or, in the case of GIT, commits: the moments in time at which we store the state of the files we are interested in.

One important detail to consider is that these “snapshots” or “checkpoints” in time, i.e. the different instances and versions that are tracked by the program, must be explicitly indicated by the user. In other words, this does not happen automatically, and the developer is fully responsible for indicating when the snapshots (or commits, in version control language) should be taken. This basically determines the granularity with which we can recover previous versions of the files: the more commits we make, the more intermediate instances and different versions we will have of the file(s) we are tracking.

Another important detail to bear in mind are the commit messages, i.e. the descriptions associated with each of the commits, which, similarly to the comments we include in scripts or programs, are “more or less” a matter of personal preference. The important thing to be aware of is that they should help us create some sort of “index” or “map” of what happened during the creation of the script or program, like beacons guiding us to the right moment or stage of the project.

Action GIT command Frequency (when to use it)
initialize the repository git init only once per repo, at the very beginning
stage files git add ... repeat as many times as you edit files …
commit files git commit -m "..."
see the status of the files in the repo git status repeat as many times as needed, when ever you want to check if you have to commit something or the difference between the version in the repo and the current working version
see differences between files in repo and changes git diff
git diff file1
see the history of commits git log whenever you want to look at the history of commits
more details about the log and commits git log --stat adds details about changes
git log --graph adds tracks for branches…
git log --stat --graph combines both
Table 6: Summary of GIT “gymnastics” and commands to use when working with a repository.
2.3.7.1 Summary of GIT commands and options

The way we will use git is through the shell terminal, and this is the usual way git is invoked:

git <command> [args]

where <command> is a specific git command (sometimes also called a “sub-command”, see Table 7) indicating a specific action to be performed.

A more general way to specify details for the git command is as follows,

usage: git [--version] [--help] [-C <path>] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

although most of the time –depending on what you are working on– you will restrict yourself to a very particular subset of these; see for instance Table 6.

git command description
Basic GIT commands
add Add file contents to the index
clone Clone a repository into a new directory
commit Record changes to the repository
diff Show changes between commits, commit and working tree, etc
init Create an empty Git repository or reinitialize an existing one
log Show commit logs
mv Move or rename a file, a directory, or a symlink
rm Remove files from the working tree and from the index
status Show the working tree status
More advanced GIT commands
bisect Find by binary search the change that introduced a bug
branch List, create, or delete branches
checkout Checkout a branch or paths to the working tree
fetch Download objects and refs from another repository
grep Print lines matching a pattern
merge Join two or more development histories together
pull Fetch from and integrate with another repository or a local branch
push Update remote refs along with associated objects
rebase Forward-port local commits to the updated upstream head
reset Reset current HEAD to the specified state
show Show various types of objects
tag Create, list, delete or verify a tag object signed with GPG
Table 7: Most commonly used git commands

’git help -a’ and ’git help -g’ list available subcommands and some concept guides. See ’git help <command>’ or ’git help <concept>’ to read about a specific subcommand or concept. Additionally, depending on the OS you are using, you can try man git – this will work on Linux and Mac OS.

Further resources and references are available in Sec.2.3.7.4, as well as on the Internet if you search for GIT.

2.3.7.2 More Advanced GIT Options

GIT, like many other version control systems, is quite powerful and offers a lot of functionality. The required level of expertise and complexity will vary from user to user, and depends on what the repository is aimed at. Most users will be fine with the basic functionality, while others might decide to take a deeper dive into it. In the next paragraphs we present just a few of these “more advanced” topics.

Branching

is a really powerful technique which allows you to develop several instances (branches) of the project in parallel. This can be really useful when, for instance, implementing new functionality in a code, or alternative methods for solving the same problem, and you don’t want them to interfere, but instead want to develop them separately while using the same base code. An example of branching can be seen when developing a code where you already have a version that is ready to be released (this could be your master branch –which is the default one in GIT–); you can then have a ‘pre-release’ (alpha or beta) branch, a ‘test-devel’ branch for testing and current development, etc. Another advantage of having different branches in a repository is that you can compare them, take changes from one branch into another, and even merge them when you are happy with the changes implemented in, let’s say, your ‘test-devel’ branch, into your ‘master’ or stable one.

Branch manipulation is relatively “simple”: “git branch -a” will show all the branches present in the repository. If you want to create a new branch, you just issue the command “git branch NEWbranch”, with ‘NEWbranch’ the name of the branch you want to create. For changing to a different branch, let’s say ‘test-devel’, you use the command “git checkout test-devel”, and for returning to the ‘master’ branch, “git checkout master”.

For comparing branches you can use “git diff branch1 branch2”, or this can even be done at the level of individual files, e.g. “git diff branch1:myprog.cpp test-devel:myprog.cpp”, where in this case we are comparing the implementations of the file “myprog.cpp” between the branches ‘branch1’ and ‘test-devel’.

Branches can also be merged, and you can even select which individual changes you would like to import from one branch into another; the latter is usually referred to as “cherry-picking” in git terminology. The commands below sketch this workflow. A good tutorial about branching can be found here: https://www.atlassian.com/git/tutorials/using-branches
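A minimal sketch of such a branch workflow (the branch name test-devel is just an example, and <commit-hash> is a placeholder):

# create a development branch and switch to it
git branch test-devel
git checkout test-devel
# ... edit, add and commit on test-devel ...
# bring the changes back into the stable branch
git checkout master
git merge test-devel
# or import a single commit from another branch by its hash
git cherry-pick <commit-hash>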

Remote repositories

The most basic implementation of a repository is to create it locally on your own machine; however, it is possible (and quite frequently very useful) to have a central version of the repository and “satellite” ones distributed over different machines, and even for different users. This can be done quite straightforwardly with git; you will of course need the computer hosting the “central” version of the repo to allow remote connections, for example via ssh.

For the following we will assume that:

  1. the machine where the central repository is hosted can be accessed from the internet, and its address is “myServer.somewhere.IP”;

  2. the user “myuser” is allowed to connect to that machine. In principle, the user creating the central repository doesn’t need to be the same user accessing the repo; as a matter of fact, there could be multiple users, but all of them should have the proper authentication and permissions for accessing the repository, i.e. the directory on the server.

These are the necessary steps for setting up the remote repo:

  • in the server (central repository):

    # connect to the machine where the central repo is going to be hosted
    ssh myuser@myServer.somewhere.IP
    # create a directory for the repo
    mkdir project.git
    # go into the directory
    cd project.git
    # initialize the repo
    git init --bare
  • in a machine to host a local version of the repo:

    # go into local directory
    cd local-project
    # initialize a git-repo
    git init
    # add previous files to the repo (if needed)
    git add *
    # commit changes
    git commit -m "My initial commit message"
    # connect the local repo with the central hosted one
    git remote add origin myuser@myServer.somewhere.IP:PATH.to.REPO/project.git
    # push through to the central master
    git push -u origin master
  • additional copies in other machines

    git clone myuser@myServer.somewhere.IP:PATHtoREPO/project.git
  • For sending and bringing changes between the repos:

    #check for diffs between repos and committed versions…
    git diff
    #retrieve changes from the repo
    git pull
    #commit local changes to the repo
    git add ...
    git commit -m "..."
    git push
Linking local and web-based repositories

A particular case of a centralized repo on a server is the use of web-based repositories, such as GitHub, Bitbucket or GitLab. In this case the server side is provided by whatever web service is used. Each of these sites offers detailed instructions on how to perform the “cloning” of the repository onto a local machine. The actual tricky part is the authentication, for which they offer alternatives such as using “public/private key pairs”. Most likely, the easiest way is to clone the existing repository from the web-based one onto your local machine, assuming you don’t have a pre-existing local one.

Looking into the git configuration details

If you ever need to look into the details of how the actual information of the repository is stored, or to customize the options and configuration of your repository at a low level, then you need to look into the hidden directory “.git” within your repository.

There are a few files and subdirectories within this “.git” directory; for instance, the following ones are quite useful to understand:

description

in this file git stores the information describing what the repository is about.

config

this file is used to indicate to git several configuration details, such as, the different branches in the repo, information about the user, etc.

hooks

this is a subdirectory under “.git” containing a series of files that can be seen as “plugins” for git, i.e. they can be executed upon particular actions. For instance, if you want to receive email notifications when a new commit has been pushed to the repository, you will need to activate the hook that does so.

Connecting multiple repositories

One extreme case one could think of is having a repository duplicated on several servers. This is not the same as having multiple cloned versions of a single repository, where the server is unique; in this case we are talking about having multiple distributed servers and multiple clones.

So, let’s say that we have two repositories created on GitHub and Bitbucket, and we also want to use the repository on our local machine. We should start by creating our local repo (if we don’t have it yet); then we use similar commands as before to ‘connect’ and ‘synchronize’ our repos,

# this will connect with bitbucket
git remote add bitbucket git@bitbucket.org:yourBITBUCKETuser/project.git
git push -u bitbucket master
# this will connect with github
git remote add github git@github.com:yourGITHUBuser/project.git
git push -u github master

The local copy in this case would work as a synchronization beacon between both web-based repositories.

Alternatively, one could do this “manually” by just editing the config file within the corresponding “.git” sub-directory. As a matter of fact, several git commands related to configuration do exactly that: they modify the configuration of the repository by adding or changing entries in the config file. For instance, the previous commands will result in the following sections being added to our config file:

.
.
.
[remote "origin"]
        fetch = +refs/heads/*:refs/remotes/origin/*
        url = myuser@myServer.somewhere.IP:PATH.to.REPO/project.git
[remote "bitbucket"]
        url = git@bitbucket.org:yourBITBUCKETuser/project.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[remote "github"]
         url = git@github.com:yourGITHUBuser/project.git
         fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master
2.3.7.3 GIT GUIs and more…

Although we strongly believe in and encourage the use of git from the command line, there are some editors and IDEs that integrate version control features. Similarly, there are some auxiliary programs that offer some sort of GUI for git. The following link offers a list of available GUI options: https://git-scm.com/downloads/guis.

Among the most interesting examples is gitk, a program that can display the history of commits in a graphical interface, i.e. a browser for your repository; but perhaps the ultimate application in terms of displaying the git commits in a visual way is gource (https://gource.io), which allows you to create a movie of the commits and their corresponding contributors.

2.3.7.4 More GIT Resources

2.3.8 Using External Libraries

Code is bad. That is to say, there is a big difference between the way scientists view code and the way software developers view it. Scientists view code as an asset. Software developers, on the other hand, know that code is a liability, since every line of code you write has potential issues, now or in the future, and needs to be maintained.

Since scientists’ goal is to get results, it is not uncommon for scientific code to contain “quick and dirty” solutions. But this causes headaches later in the development process. This is known as technical debt, and sooner or later this debt needs to be repaid by rewriting the code.

The final reason why writing code is bad is that a lot of code has already been written and can be reused, so you might be reinventing the wheel, duplicating effort, remaking old mistakes and introducing new ones.

The solution is simple: code less! One should try to reuse and recycle code that is out there by using libraries.

Essentially, libraries are modules that can be reused across different applications. So let’s start by looking at a modular code in which several object files for different modules need to be linked together. For example, suppose that thisapp.cc contains the main function, and that helper.h and helper.cc are a module consisting of a header file and an implementation file. The makefile for such a project could look like this:

# makefile for ’thisapp’
CXX=g++
CXXFLAGS=-O2 -std=c++14
all: thisapp
thisapp.o: thisapp.cc helper.h
  ${CXX} ${CXXFLAGS} -c -o thisapp.o thisapp.cc
helper.o: helper.cc helper.h
  ${CXX} ${CXXFLAGS} -c -o helper.o helper.cc
thisapp: thisapp.o helper.o
  ${CXX} -o thisapp thisapp.o helper.o

Imagine that we want to reuse the module. We could just copy the files helper.h and helper.cc over to the source directory of a new project, and compile them along with the rest of that new project’s source. If the project reuses a lot of modules, that can add up to a lot of additional compilation. What if we could use helper in another project without recompiling helper.cc?

One thing we could do is to install the .o and .h files into separate directories. Say helper.h gets copied to /base/include/helper.h and helper.o to /base/lib/helper.o. But then we would have to let the compiler know where these files are. For the header files, there is a compiler option to specify where to look for them; for many compilers this is the -I option: a minus sign with a capital letter I. This option should be followed by a directory where header files may be found. If several directories should be added, one gives several -I options. If we do that, the makefile would look like this:

# makefile for ’newapp’
CXX=g++
CXXFLAGS=-I/base/include -O2 -std=c++14
all: newapp
newapp.o: newapp.cc /base/include/helper.h
        ${CXX} ${CXXFLAGS} -c -o newapp.o newapp.cc
newapp: newapp.o /base/lib/helper.o
        ${CXX} -o newapp newapp.o /base/lib/helper.o

Because the helper files are not part of the code base of newapp (they are imported, in a way), they are rarely included in the dependencies, so more than likely, the makefile would look like this:

# makefile for ’newapp’
CXX=g++
CXXFLAGS=-I/base/include -O2 -std=c++14
all: newapp
newapp.o: newapp.cc
  ${CXX} ${CXXFLAGS} -c -o newapp.o newapp.cc
newapp: newapp.o
  ${CXX} -o newapp newapp.o /base/lib/helper.o

What we just did is a poor man’s library building. We will now slowly transform this setup, and the corresponding makefile, into one that applies to ‘real’ libraries. Real libraries are similar: they have to be installed (and perhaps built first), header files (.h or .hpp) need to be in some folder, and library files (object code) in a related folder. For real libraries under Linux, library filenames start with lib and end in .a or .so, so the makefile should rather look like this:

# makefile for ’newapp’
CXX=g++
CXXFLAGS=-I/base/include -O2 -std=c++14
all: newapp
newapp.o: newapp.cc
  ${CXX} ${CXXFLAGS} -c -o newapp.o newapp.cc
newapp: newapp.o
  ${CXX} -o newapp newapp.o /base/lib/libhelper.a

We want to make the compile and link lines as generic as possible, so that the makefile is as portable as possible. As such, any system-specific parts, such as the location of header and library files, are ideally listed only once at the top of the makefile. At this stage, we cannot avoid explicit paths in the makefile completely, but just as we could specify the location of the header files in the CXXFLAGS variable, there is an analogous procedure for library files. We can specify:

  • the path to the library’s object code using the -L option, stored in the LDFLAGS variable;

  • the library itself using -lNAME (a lower case l, for library), stored in the LDLIBS variable.

With these transformations, the makefile for this project will now become:

# makefile for ’newapp’
CXX=g++
CXXFLAGS=-I/base/include -O2 -std=c++14
LDFLAGS=-L/base/lib
LDLIBS=-lhelper
all: newapp
newapp.o: newapp.cc
  ${CXX} ${CXXFLAGS} -c -o newapp.o newapp.cc
newapp: newapp.o
  ${CXX} ${LDFLAGS} -o newapp newapp.o ${LDLIBS}

Adding a clean rule and extracting the common path gives us the final makefile for newapp:

# makefile for ’newapp’
CXX=g++
HELPERBASE?=/base
HELPERINC=${HELPERBASE}/include
HELPERLIB=${HELPERBASE}/lib
CXXFLAGS=-I${HELPERINC} -O2 -std=c++14
LDFLAGS=-L${HELPERLIB}
LDLIBS=-lhelper
all: newapp
newapp.o: newapp.cc
  ${CXX} ${CXXFLAGS} -c -o newapp.o newapp.cc
newapp: newapp.o
  ${CXX} ${LDFLAGS} -o newapp newapp.o ${LDLIBS}
clean:
  \rm -f newapp.o

Note that we have not covered here how to create your own libraries, only how to use existing libraries in your own code. But that was the whole point of using libraries to begin with.

A few final remarks on using libraries in C++ are in order:

  • C++ standard libraries such as vector and cmath do not need any -l options (but they do require include statements).

  • There are standard search paths for libraries that need not be specified with -I or -L options. Libraries installed through a package manager end up in these standard paths; they just need -l... options in LDLIBS.

  • If you compile your own libraries in non-standard locations, you do need -I and -L options.

2.3.9 Installing libraries from source

What to do when your package manager does not have that library, or you do not have permission to install packages in the standard paths? Or what if you are on a shared system (such as one of the supercomputers at SciNet) where you do not have permission to install using the package manager, and the software stack on that system does not have a module for that library either?

In those cases, you will have to compile the library from source, and, because you will not have administrative rights on that shared system, you will have to install the library in a non-system directory, using a so-called base or prefix directory.

Let’s look at a few common installation procedures, but let us also warn you to always read the library’s installation documentation as well.

Many libraries come with an autotools installation procedure, which in many cases means that, after uncompressing the source code into a directory, the installation is a matter of three commands issued in that directory:

$ ./configure --prefix=<BASE>
$ make -j 4
$ make install

Here, you choose the <BASE>, but it should be a directory that you have write permission to, e.g., a subdirectory of your $HOME. These are non-standard installation directories.

Note that there may be more options that can be given to the configure command, which you can find out with the command ./configure --help.

Another common type of installation procedure that some libraries use is cmake. In that case, there are typically four commands needed:

$ mkdir builddir && cd builddir
$ cmake .. -DCMAKE_INSTALL_PREFIX=<BASE>
$ make -j 4
$ make install

For cmake builds, it can be a bit harder to find documentation on which further options could be given to the cmake command.

Note: If the documentation says to use sudo, it is wrong, except for system-wide installations on personal computers.

Once a library is installed, you can use it by including its header file(s) in your code and linking with -lLIBNAME. In addition, you need to add -I<BASE>/include and -L<BASE>/lib options to the compilation and linking lines, respectively.

As an alternative to giving the -I and -L options, you can omit these for g++ under Linux by setting some environment variables:

$ export CPATH="$CPATH:<BASE>/include"                 # compiler looks here for include files
$ export LIBRARY_PATH="$LIBRARY_PATH:<BASE>/lib"       # and here for library files
$ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:<BASE>/lib" # runtime linker looks here

Enter these commands at the Linux prompt before running make, or add them to your ~/.bashrc (adding these to your .bashrc can lead to some hard-to-debug failures if you ever change versions of the libraries, and is therefore not our first choice). The LD_LIBRARY_PATH variable is necessary to run applications linked against dynamic libraries (.so).

If the library installs binary applications (i.e. commands) as well, you’ll also need to set

$ export PATH="$PATH:<BASE>/bin"     # linux shell looks for executables here

Always read the documentation that came with the library before searching the web!

On many shared systems, the variables CPATH, LIBRARY_PATH and LD_LIBRARY_PATH may be set up for you by the software module system, in which case you also do not need -I or -L options for those libraries.

2.3.9.1 Library Example: GNU Scientific Library

The GNU Scientific Library (GSL) is a C library containing many useful scientific routines, such as:

  • Root finding

  • Minimization

  • Sorting

  • Integration, differentiation, interpolation, approximation

  • Statistics, histograms, fitting

  • Monte Carlo integration, simulated annealing

  • Ordinary differential equation solvers

  • Polynomials, permutations

  • Special functions

  • Vectors and matrices

Because this is a C library, we will need to deal with some pointers and type casts.

For a specific example, let us see how we could use the GSL to find the roots of the function f(x) = a cos(sin(v+wx)) + bx - cx^2. A root is where a function is zero; we will come back to this problem’s numerical aspects in Section 3.2. Here, we will just consider how using this library works.

The code that performs the root finding with the help of the GSL could look as follows:

// gslrx.cc
#include <iostream>
#include <cmath>
#include <gsl/gsl_roots.h>

struct Params {
   double v, w, a, b, c;
};
double examplefunction(double x, void* param){
   Params* p = (Params*)param;
   return p->a*cos(sin(p->v+p->w*x))+p->b*x-p->c*x*x;
}

int main() {
   double x_lo = -4.0;
   double x_hi = 5.0;
   Params args = {0.3, 2/3.0, 2.0, 1/1.3, 1/30.0};
   gsl_root_fsolver* solver;
   gsl_function      fwrapper;
   solver = gsl_root_fsolver_alloc(
                gsl_root_fsolver_brent);
   fwrapper.function = examplefunction;
   fwrapper.params = &args;
   gsl_root_fsolver_set(solver,&fwrapper,x_lo,x_hi);

   std::cout << "iter lower upper root err\n";

   int status = 1;
   for (int iter=0; status and iter < 100; ++iter) {
     gsl_root_fsolver_iterate(solver);
     double x_rt = gsl_root_fsolver_root(solver);
     double x_lo = gsl_root_fsolver_x_lower(solver);
     double x_hi = gsl_root_fsolver_x_upper(solver);
     std::cout << iter << " " << x_lo << " " << x_hi
               << " " << x_rt << " " << x_hi-x_lo << "\n";
     status=gsl_root_test_interval(x_lo,x_hi,0,1e-3);
   }

   gsl_root_fsolver_free(solver);
   return status;
}

The first thing one might notice is that there is still a lot of code involving gsl_.... Although the root finding algorithm comes from the GSL, the rest of the code is just a wrapper, setting up parameters and calling the appropriate functions. One might also notice that this wrapper code involves pointers and typecasts, because we are dealing with a C library.

Compiling this code on the command line would, in the simplest case, involve the commands

$ g++ -c gslrx.cc -o gslrx.o
$ g++ gslrx.o -o gslrx -lgsl -lgslcblas

but this assumes the GSL was installed in a standard system directory. When this is not the case, you would more likely need something like

$ GSL_ROOT=...
$ GSLINC=$GSL_ROOT/include
$ GSLLIB=$GSL_ROOT/lib
$ g++ -c -I${GSLINC} gslrx.cc -o gslrx.o
$ g++ gslrx.o -o gslrx -L${GSLLIB} -lgsl -lgslcblas

You will have to set GSL_ROOT yourself. In the case of a shared supercomputer, the module system may provide a variable for you; for instance, on SciNet systems, one can do

$ module load gcc gsl
$ export GSLINC=$SCINET_GSL_ROOT/include
$ export GSLLIB=$SCINET_GSL_ROOT/lib
$ g++ -c -I${GSLINC} gslrx.cc -o gslrx.o
$ g++ gslrx.o -o gslrx -L${GSLLIB} -lgsl -lgslcblas

Once compiled, the application can be run and would show something like the following.

$ ./gslrx
iter lower      upper    root      err
0    -4        -1.27657  -1.27657  2.72343
1    -1.95919  -1.27657  -1.95919  0.682622
2    -1.75011  -1.27657  -1.75011  0.473542
3    -1.75011  -1.74893  -1.74893  0.0011793
$

The compilation can be automated with a makefile like the following:

CXX=g++
GSLROOT?=${SCINET_GSL_ROOT}
GSLINC=${GSLROOT}/include
GSLLIB=${GSLROOT}/lib
CXXFLAGS=-I${GSLINC} -O2 -std=c++14
LDFLAGS=-L${GSLLIB}
LDLIBS=-lgsl -lgslcblas
all: gslrx
.PHONY: all clean
gslrx.o: gslrx.cc
  ${CXX} ${CXXFLAGS} -c -o gslrx.o gslrx.cc
gslrx: gslrx.o
  ${CXX} ${LDFLAGS} -o gslrx gslrx.o ${LDLIBS}
clean: ; \rm -f gslrx.o

Compilation on a shared cluster using this makefile may look like this

$ module load gcc gsl
$ make

while compilation on your own computer, where you installed the GSL in some prefix, would require the make command to take GSLROOT as a parameter, like this:

$ make GSLROOT=...

where … is the prefix path you gave when you installed the GSL.

It’s worth reiterating that the point of using libraries is to not “reinvent the wheel”. There are many possible algorithms for root finding, but they are all pretty standard and can be found in libraries. The GNU Scientific Library is one such library. So do not implement algorithms yourself if there is a library that does it for you.

The root finding example hopefully also shows that existing solutions like the ones in the GSL cannot really be used until you understand the algorithms at a high level, so we will not be able to make much sense of the code until we look at root finding in more detail in Sec. 3.2.

2.3.10 File IO & File Formats

2.3.10.1 File I/O Ops.

The way in which we access data in our codes, i.e., how we read files and save our results, is crucial for several reasons. Among the most important ones is that it affects the performance and speed of our codes. Furthermore, having standardized ways in which we save the data allows us to share this data with colleagues and have it read by other codes and programs. An important and crucial part of many codes is the process of saving our results, i.e., saving the data generated in the code. Hence, having appropriate techniques and etiquette for dealing with this is fundamental.

Let’s start by defining some basic terms:

File System

This refers to the physical and logical place where we keep (store) most data. Typically it consists of spinning disks or hard disk drives (HDDs), although nowadays many desktop computers and laptops use a solid state drive (SSD). SSDs are faster than traditional HDDs, mainly because they are purely electronic devices, while HDDs rely on mechanically moving parts. Logically, the file system is organized in a structured fashion: directories, subdirectories and files. But on the disk (hardware) side, these are just blocks of bytes. This is relevant in particular for HDDs, as it means that the information is not necessarily laid down contiguously on the physical device.

File I/O

refers to any operation to be performed with a file, such as,

  • finding a file: this implies that the OS must check whether that file exists and where on disk its data is, reading its metadata (file size, date stamp, permissions to access it, etc.);

  • opening a file: this requires checking whether that file exists, seeing if opening the file is allowed, possibly creating it, and finding the block that holds the –first part of the– data in the file system;

  • reading a file: this requires positioning to the right spot, reading a block and retrieving the proper section of data;

  • looking inside a file, i.e., searching for certain content inside the file;

  • saving data or results into a file, normally referred to as writing to a file: this implies checking whether there is enough space on the device, positioning again to the right spot and actually dumping the content;

  • finally, closing the file, which is a crucial operation so that the OS knows where the end of the file (EOF) is; omitting it can result in corrupted files!

Needless to say, many of these operations have to be repeated several times if the data that is being read or written spans multiple blocks.

For instance, at a more basic level, the OS has to keep track of a “pointer” to the next block of data, update and move the file pointer (“seek”) depending on the type of operation we are performing.

All of these operations are usually known as file Input/Output operations, or File I/O Ops., or just “IOPS” for short.

Metadata

is data about your data, i.e., details associated with a particular data set describing the data itself. Metadata includes who created the data (the author), the date and time the data was created, how it was created (whether it was collected in an experiment in a lab or generated by a program, and in the latter case the name and version number of the code used to create it), where it was created, on what operating system, and the values of key variables and parameters that were used to create the data; in short, anything and everything that might help you, in six months, to understand the what/where/why/how of the data.

In general, any type of file I/O is the slowest part of a computer program. The reason is that accessing data on devices such as hard drives (even the newest types of solid state devices) is still the most costly operation in a computer in terms of time. As a matter of fact, each I/O operation (IOP) gets hit by latency. Hence, one wants to minimize this type of operation so that our codes are not affected as much by it. Of course, we still need to read or write data, otherwise we would not be able to save or restore results, but there are proper ways of doing so.

2.3.10.2 File Formats

Files come in different formats, i.e., they are organized differently so that the computer can actually understand them. For our purposes, let us consider that there are two basic types: text and binary. The former is the type of file we usually create when we are writing a program or script. The binary format corresponds to saving data the way the machine keeps it in memory. It is indeed called “binary” because that is the actual language computers understand: sequences of 0s and 1s, which in the end represent electrical signals passing across a circuit board or not. One caveat of this format, though, is that in principle it can only be read by someone who knows exactly how the data was saved in the file. For example, do the 1000 values in the binary file correspond to a one-dimensional vector with 1000 elements or to a two-dimensional array with 10×100 elements? In other words, as the data in this format is saved “raw”, without any additional information, it is not possible to know unless we are the ones who created the file. To overcome this limitation one can use a standard binary format such as the ones described next.

Among the best known standard binary file formats are NetCDF (Network Common Data Format) and HDF5 (Hierarchical Data Format). Both of them are self-describing binary data formats, meaning that you can also store metadata describing the data. Additionally, there are libraries for many different languages that make it possible to read data in these file formats, even across languages. Moreover, when installed properly, these formats also come with utilities that allow you to inspect and visualize the data. Further advantages are that these formats are OS independent and machine agnostic, meaning that files can be easily ported between different computers and architectures. This is an additional advantage with respect to text-format files, which are not always OS-agnostic: a file generated on, for instance, a Windows machine can show “strange” symbols on a Linux machine, because the encoding used is different. In addition, these formats provide parallel I/O and native file system support in many cases.

Table 8 shows a comparison among these different file formats.

format: ASCII (American Standard Code for Information Interchange)
  Pros: human readable; can embed metadata; portable (architecture independent)
  Cons: inefficient storage; expensive to read/write (conversions)
format: Native (raw) binary
  Pros: efficient storage; efficient read/write (native)
  Cons: have to know the format to read it; portability (endianness)
format: NetCDF
  Pros: efficient storage; efficient read/write; portability; embedded metadata
  Cons: only for multi-dimensional arrays
Table 8: Comparison between different file formats: ASCII (text), binary and self-describing NetCDF.
ASCII (text)
  C:       f=fopen(name,"w");            fprintf(f,...);    fclose(f);
  C++:     ofstream f(name);             f << ... ;         f.close();
  Fortran: open(6,file=name)             write(6,*)         close(6)
Binary (raw)
  C:       f=fopen(name,"wb");           fwrite(...,f);     fclose(f);
  C++:     ofstream f(name,ios::binary); f.write(...);      f.close();
  Fortran: open(6,file=name,form='unformatted')  write(6,*)  close(6)
Table 9: Syntax to generate output in plain text (ASCII) or raw binary using C, C++ or Fortran. The structure is very similar in these languages: the first statement opens the file, assigning a handle that can be used to refer to the file; the second writes content into it; and the third closes it. Examples of how to write and read NetCDF files are presented in Listings 3 and 4.
NetCDF

NetCDF is a file format as well as an Application Programming Interface (API). The library mitigates the need to do low-level binary formatting yourself: NetCDF gives you a higher-level approach to writing and reading multi-dimensional arrays, suitable for many common scientific use cases.

Not only can you read in only the variables that you are interested in, it is also possible to access subsections of an array rather than reading in the entire object. It also allows one to have “infinite” arrays (UNLIMITED dimensions), which effectively means that the arrays can grow. This is quite useful, for example, for time stepping, i.e., when a program stores its progress in time steps and the number of steps is not known beforehand. It is also possible to save custom datatypes (e.g. objects). NetCDF also includes “conventions” for the units associated with variables; there are lists of conventions that can be used for variable names and unit names (e.g. "cm", "centimetre", "centimeter"), etc. This is a quite frequent issue in disciplines like physics or engineering, and using the conventions is highly recommended when planning for interoperability with other codes. More details are available at http://www.unidata.ucar.edu/software/netcdf/conventions.html.

HDF5

HDF5 is also a self-describing file format and set of libraries. Unlike NetCDF, it is much more general, as it can store almost any type of data. HDF5 has a structure a bit like a file system, with “groups” and “datasets” playing the roles of ‘directories’ and ‘files’ respectively.

Additionally, both NetCDF and HDF5 implementations can perform parallel IO, which makes them perfectly well suited for implementations in parallel or HPC environments.

2.3.10.3 Strategies

As mentioned before, it is important to minimize file input/output as much as possible.

Possible strategies for achieving this, are:

  • Load everything into memory once, so you do not need to keep reading from the file. (Reading something from a file implies several operations that the computer has to do: find the file, check its accessibility and permissions, open it, possibly seek to a particular place in the file to grab what you need, copy that into memory –programs access the data from variables in memory–, possibly in chunks if the data is too big, and finally close the file. Of course, this strategy has a caveat: the data you are reading must fit in memory; if not, a more sophisticated discussion is needed about the best strategy for dealing with that data.)

  • reuse data, don’t keep re-reading it from files;

  • if you must keep writing temporary things to disk, use a ramdisk (i.e., memory used as a disk; available, for instance, on Linux systems);

  • keep your results in memory, and then write your results in one shot when you are finished.

Finally, whenever possible, try to use binary file formats, which are smaller in size and faster to write to and read from disk. The main reason binary formats are smaller and faster is that, when saving data located in memory into an ASCII format, the data must be converted into text. This is an additional cost in terms of performance and quite frequently implies losing precision in the representation of the numeric types. Hence it is strongly recommended to save numerical results in binary as much as possible.
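As a minimal sketch of this point (the file names and sample values below are made up for illustration), the following snippet writes the same array of doubles once as text and once as raw binary; the binary file is smaller and reproduces the values exactly when read back, while the text version depends on the chosen output precision:

// textvsbinary.cc (illustrative sketch)
#include <fstream>
#include <vector>
#include <cmath>
int main() {
    std::vector<double> data(1000);
    for (size_t i = 0; i < data.size(); i++)
        data[i] = std::sqrt(double(i));          // some sample values
    // text output: every number is converted to characters
    std::ofstream txt("data.txt");
    for (double x : data)
        txt << x << "\n";
    // binary output: bytes are copied straight from memory
    std::ofstream bin("data.bin", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(data.data()),
              data.size()*sizeof(double));
    return 0;
}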

In addition to this, there are specialized file formats that follow certain standards or conventions, which allow them to be used by different programs, include a description of the data that we are saving (a.k.a. metadata), etc.

IOPS performance: use binary data formats; save in large files; exploit spatial locality; reduce the number of IOPS.
I/O best practices: include metadata; use self-describing file formats; use the ones already available via libraries, e.g. NetCDF, HDF5, etc.

2.3.10.4 Code Examples

To write a NetCDF file (Listing 3), one should go through the following steps:

  1. Create the file

  2. Define the dimensions

  3. Define the variables

  4. End definitions

  5. Write variables

  6. Close file

To read in (part of) a NetCDF file (Listing 4), one should go through the following steps:

  1. Open the file

  2. Get dimension ids

  3. Get dimension lengths

  4. Get variable ids

  5. Read variables

  6. Close file

Listing 3: “Example of writing a netCDF file”
// netCDF.writing.cpp
#include <vector>
#include <netcdf>
using namespace netCDF;
int main() {
   const int nx = 6, ny = 12;
   int dataOut[nx][ny];
   for (int i = 0; i < nx; i++)
      for (int j = 0; j < ny; j++)
         dataOut[i][j] = i * ny + j;
   // Create the netCDF file.
   NcFile dataFile("first.netCDF.nc", NcFile::replace);
   // Create the two dimensions.
   NcDim xDim = dataFile.addDim("x", nx);
   NcDim yDim = dataFile.addDim("y", ny);
   std::vector<NcDim> dims(2);
   dims[0] = xDim;
   dims[1] = yDim;
   // Create the data variable.
   NcVar data = dataFile.addVar("data", ncInt, dims);
   // Put the data in the file.
   data.putVar(&dataOut);
   // Add an attribute.
   dataFile.putAtt("Creation date:", "22 Jan 2019");
   return 0;
}
Listing 4: “Example of reading a netCDF file”
// nc_reading2.cpp
#include <iostream>
#include <netcdf>
using namespace netCDF;

int main() {
   // Specify the netCDF file.
   NcFile dataFile("first.netCDF.nc", NcFile::read);
   // Read the two dimensions.
   NcDim xDim = dataFile.getDim("x");
   NcDim yDim = dataFile.getDim("y");
   int nx = xDim.getSize();
   int ny = yDim.getSize();
   std::cout << "Our matrix is "
       << nx << " by " << ny << std::endl;

   int **p = new int *[nx];
   p[0] = new int[nx * ny];
   for(int i = 0; i < nx; i++)
       p[i] = &p[0][i * ny];
   // Get the data variable.
   NcVar data = dataFile.getVar("data");
   // Put the data in a var.
   data.getVar(p[0]);
   for(int i = 0; i < nx; i++) {
       for(int j = 0; j < ny; j++) {
           std::cout << p[i][j] << " ";
       }
       std::cout << std::endl;
   }
   delete[] p[0];
   delete[] p;
   return 0;
}

The codes in Listings 3 and 4 should be compiled and linked using the NetCDF library, e.g.,

g++ -I${NETCDF_INC} netcdf_Example.cpp -c -o netcdf_Example.o
g++ -L${NETCDF_LIB} netcdf_Example.o -o netcdf_Example -lnetcdf_c++4

In this case, the source file is called netcdf_Example.cpp, and the two environment variables NETCDF_INC and NETCDF_LIB indicate the locations of the headers and the compiled library of the NetCDF installation, respectively.

Part 3 Numerical Techniques in Scientific Computing

3.1 Numerics

3.1.1 Numerical Representations

Contrary to what one might imagine, computers cannot represent most numbers precisely. The reason for this is the way in which computers store numerical information, which comes back to the way in which we represent numerical quantities. We have learned in school to use the decimal system, i.e., one based on the ten digits 0…9, to quantify things. But computers use a different system to represent quantities: a binary one, i.e., based on 2 possible values, 0 and 1. Why? Well, a binary possibility is the closest representation of a circuit conducting electricity or not, and that is in the end what a computer measures: a gate open (on) or closed (off), a 0 or a 1.

For instance, take the number 1034 written in decimal base; this can also be written as

1034 = 1034_{10} = (1×1000) + (0×100) + (3×10) + (4×1) = (1×10^3) + (0×10^2) + (3×10^1) + (4×10^0)

or in base 7 (septenary base)

1034_{10} = (3×7^3) + (0×7^2) + (0×7^1) + (5×7^0) = 3005_7

Or in a binary base:

1034_{10} = (1×2^10) + (0×2^9) + (0×2^8) + (0×2^7) + (0×2^6) + (0×2^5) + (0×2^4) + (1×2^3) + (0×2^2) + (1×2^1) + (0×2^0) = 10000001010_2

One thing to notice is that, given a numerical base, the digits used for representing a particular quantity range from 0 up to the base minus one, e.g. decimal: 0 to 9, septenary: 0 to 6, binary: 0 and 1.

Additionally, the largest quantity that can be represented in a given base B with a given number of digits is

B^d - 1 (1)

where d is the number of digits. For example, in binary (B=2) with 5 digits, we can represent numbers up to 2^5 - 1 = 31.
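As a small illustration of these base representations (a minimal sketch; the function name to_base is our own), the following snippet prints the digits of an integer in an arbitrary base up to 10:

// bases.cc (illustrative sketch)
#include <iostream>
#include <string>
// Return the representation of n in base b (2 <= b <= 10) as a string of digits.
std::string to_base(unsigned long n, unsigned base) {
    if (n == 0) return "0";
    std::string digits;
    while (n > 0) {
        digits.insert(digits.begin(), char('0' + n % base)); // prepend least significant digit
        n /= base;
    }
    return digits;
}
int main() {
    std::cout << "1034 in base 7: " << to_base(1034, 7) << "\n"   // 3005
              << "1034 in base 2: " << to_base(1034, 2) << "\n";  // 10000001010
    return 0;
}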

3.1.2 Numeric Types

The previous examples and comments apply to a particular type of quantity: integer numbers, i.e., numbers without any decimal figures, quantities that can be counted exactly by natural numbers. But we all know that the actual set of numbers is far “denser” than that: numbers can have decimal parts, or may not even be representable exactly or in closed form. Take, for instance, the number pi, the natural base of the logarithm e, or numbers with endlessly repeating patterns such as 1/3, 9/11, or 1/81.

For this reason, it is also important to distinguish what type of numbers we are dealing with and trying to represent in a computer. Among the most common types of numbers that get their own type in a computer program, we find integers, floating point numbers, complex numbers, and, in more specialized domains, quaternions, etc.

In the following sections we will focus on the most common numerical representations and some of the limitations we should be aware of when dealing with these numerical types in a computational setting.

3.1.2.1 Integers

One of the main features of integer numbers is that they are exactly representable in a computer. The range of values, though, depends on the size of the variable, the OS and the programming language. For most languages a typical integer is 32 bits (4 bytes) long, with 1 bit reserved for storing the sign of the number. Because each representation has a finite size, it also has a finite range, which for a signed 32-bit integer goes from -2^31 to 2^31-1, i.e., -2,147,483,648 to 2,147,483,647. Unsigned integers then range from 0 to 2^32-1.

It is also possible to have longer integer representations, namely long integers, which are like regular integers but with a larger representation size, usually 64 bits (8 bytes). Consequently they cover a larger range of numbers, going from -2^63 to 2^63-1, i.e., -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Unsigned long integers then range from 0 to 2^64-1.

In Sec.3.1.2.7 we present a detailed list of the different numerical types available in C/C++.

Integer Overflow

Due to the finite range of numbers that can be represented by a given numerical type, it is possible, when performing mathematical operations with integer numbers, to run into a situation where the resulting number cannot be represented in the given range. This is usually known as integer overflow. A typical sign of an overflow is that the number wraps around from one extreme of the range to the other. Listings 5 and 6 show examples of how this can happen at both ends of the representation range.

Listing 5: “Integer overflow example”
#include <iostream>

int main()
{
    using namespace std;
    unsigned short x = 0; // smallest 2-byte unsigned value possible
    cout << "x was: " << x << endl;
    x = x - 1; // overflow!
    cout << "x is now: " << x << endl;
    return 0;
}
Listing 6: “Integer overflow example”
#include <iostream>

int main()
{
    using namespace std;
    unsigned short x = 65535; // largest 16-bit unsigned value possible
    cout << "x was: " << x << endl;
    x = x + 1; // 65536 is out of range: we get overflow because x can't hold 17 bits
    cout << "x is now: " << x << endl;
    return 0;
}
3.1.2.2 Fixed Point Representation

So far we have discussed how to represent integer numbers, but how do we deal with decimal places? One possibility would be to treat real numbers like integers and only keep a fixed number of digits (say, two) after the decimal point. This is usually referred to as a fixed point representation, since the decimal point is always in the same location. It is often used for financial time series data, since such datasets only use a finite number of decimal places. However, this is not really appropriate for scientific computing. One of the main reasons is that the relative precision of the represented quantities varies with their magnitude, whereas in scientific applications one should be able to represent both small and large quantities precisely.
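As a minimal illustration (the choice of cents as the unit and the sample prices are our own), a fixed point scheme can be emulated with integers by agreeing on a fixed scale factor:

// fixedpoint.cc (illustrative sketch)
#include <iostream>
#include <iomanip>
int main() {
    // Store currency amounts as integer numbers of cents (scale factor 100).
    long price_cents = 1999;   // represents 19.99
    long quantity    = 3;
    long total_cents = price_cents * quantity;
    // print as a decimal number with exactly two digits after the point
    std::cout << "total: " << total_cents / 100 << "."
              << std::setw(2) << std::setfill('0') << total_cents % 100 << "\n";
    return 0;
}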

3.1.2.3 Floating Point Representation

The alternative is to consider a representation where the decimal point is not fixed but is adjusted so that only one figure is kept as the integer part of the number, with a multiplying factor included to account for the order of magnitude of the quantity. This is the analog of scientific notation; the inclusion of the multiplicative factor with its power (also known as the exponent) means that the decimal point is “floating”. In this case, a generic number is represented as follows,

s M × B^e (2)

where one bit s is again reserved for indicating the sign of the number, the so-called mantissa M represents the numeric value, B is the base and e the exponent. For example, in the number -1.34×10^-7, the sign is negative (s=1), the mantissa is M=1.34, the base (which by convention is set to 10, so there is no actual need to store it) is B=10, and the exponent is e=-7.

In general, a typical single precision floating point number, or single precision real for short, uses 32 bits (4 bytes), of which 1 bit is used for the sign, 8 bits for storing the exponent and 23 bits for the mantissa. A typical double precision floating point number allocates 64 bits (8 bytes), with similar proportions. One can even define quadruple precision floats, which, depending on the implementation, occupy 12 or 16 bytes. Details about the ranges and precision are given in Table 10 in Sec. 3.1.2.7.
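To make the sign/mantissa/exponent decomposition concrete, here is a minimal sketch using the standard function std::frexp, which splits a double into a normalized mantissa and a binary exponent (note that frexp works in base B=2 and returns a mantissa in [0.5,1), a slightly different normalization than described above):

// floatparts.cc (illustrative sketch)
#include <iostream>
#include <cmath>
int main() {
    double x = -1.34e-7;
    int exponent;
    double mantissa = std::frexp(x, &exponent);  // x == mantissa * 2^exponent
    std::cout << x << " = " << mantissa << " * 2^" << exponent << "\n";
    return 0;
}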

3.1.2.4 Special “Numbers”
3.1.2.5 Numeric Limits Interface
3.1.2.6 IEEE Standards
3.1.2.7 Numerical Types in C++
Integers in C++
Type Minimum size
char 1 byte
short 2 bytes
int 2 (4) bytes
long 4 bytes
long long 8 bytes (C99/c++11)
char c;
short int si; // valid
short s;      // preferred
int i;
long int li; // valid
long l;      // preferred
long long int lli; // valid
long long ll;      // preferred

signed char c;
signed short s;
signed int i;
signed long l;
signed long long ll;

unsigned char c;
unsigned short s;
unsigned int i;
unsigned long l;
unsigned long long ll;
Size/Type Range
1 byte signed -128 to 127
1 byte unsigned 0 to 255
2 byte signed -32,768 to 32,767
2 byte unsigned 0 to 65,535
4 byte signed -2,147,483,648 to 2,147,483,647
4 byte unsigned 0 to 4,294,967,295
8 byte signed -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
8 byte unsigned 0 to 18,446,744,073,709,551,615
Floats in C++
float fValue;
double dValue;
long double dValue2;

int n(5); // 5 means integer
double d(5.0); // 5.0 means fp (double by default)
float f(5.0f); // 5.0 means fp, f suffix means float

double d1(5000.0);
double d2(5e3); // another way to assign 5000
double d3(0.05);
double d4(5e-2); // another way to assign 0.05
Type                Size      Range                            Precision
float               4 bytes   ±1.18×10^-38 to ±3.4×10^38       6-9 significant digits, typically 7
double              8 bytes   ±2.23×10^-308 to ±1.80×10^308    15-18 significant digits, typically 16
long double         12 bytes  ±3.65×10^-4951 to ±1.18×10^4932  18-21 significant digits
long double (quad)  16 bytes  ±3.36×10^-4932 to ±1.18×10^4932  33-36 significant digits
Table 10: Details of the different type of representations for floating point numbers, in compliance with the IEEE-754 standard.

3.1.3 Classes of Numerical (Representation) Errors

3.1.3.1 Testing for Equality
3.1.3.2 Roundoff Errors
3.1.3.3 Machine Epsilon
3.1.3.4 Precision Problems
3.1.3.5 Overflow
3.1.3.6 Underflow

3.1.4 Universal Errors

3.1.4.1 Discretization Error
3.1.4.2 Truncation Error

3.2 Root Finding and Optimization

It is not uncommon in scientific computing to want to solve an equation numerically.

In the case when there is one unknown and one equation only, we can always write the equation as

f(x)=0

If x satisfies this equation, it is called a “root”.

If there’s a set of equations, one can write:

f_1(x_1, x_2, x_3, …) = 0
f_2(x_1, x_2, x_3, …) = 0
…

However, the one-dimensional case is considerably easier to solve numerically, so we will treat that one first.

3.2.1 One-dimensional root finding

Root finding algorithms always start from an initial guess and (usually) a bounding interval [a,b] in which the root is to be found.

What’s nice about the one-dimensional case is that if f(a) and f(b) have opposite signs, and f is continuous, there must be a root inside the interval: the root is bracketed. Many algorithms use this to perform a successive refinement of the bracketing interval until that interval is smaller than some tolerance ε.

There is no general algorithm for finding the initial bounding interval! Possible approaches are to:

  • Plot the function.

  • Make a conservative guess for [a,b] then slice up the interval, checking for sign change of f.

  • Make a wild guess and expand the interval until f exhibits a sign change.

However, there are cases where these approaches do not work well, such as a root that is also an extremum (no sign change), roots that are very close to one another (easily missed), or a singularity with a sign change but no root. A sketch of the interval-expansion approach from the last bullet is given below.
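The following is a minimal sketch of that expansion idea (the function name expand_bracket and the growth factor are our own choices, not a library routine): starting from a guess [a,b], the interval is grown geometrically until f changes sign or a maximum number of attempts is reached.

// bracket.cc (illustrative sketch)
#include <cmath>
#include <iostream>
#include <functional>
// Expand [a,b] geometrically until f(a) and f(b) have opposite signs.
// Returns true if a bracket was found.
bool expand_bracket(const std::function<double(double)>& f,
                    double& a, double& b, int maxtries = 50)
{
    const double factor = 1.6;
    double fa = f(a), fb = f(b);
    for (int i = 0; i < maxtries; i++) {
        if (fa*fb < 0) return true;              // sign change: root bracketed
        if (std::fabs(fa) < std::fabs(fb))
            fa = f(a += factor*(a - b));         // grow on the side closest to zero
        else
            fb = f(b += factor*(b - a));
    }
    return false;
}
int main() {
    double a = 1.0, b = 2.0;
    if (expand_bracket([](double x){ return x*x - 10; }, a, b))
        std::cout << "bracket: [" << a << "," << b << "]\n";
}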

Suppose we have bracketed a root; what’s next? There are quite a few classic root finding algorithms, such as

  • Bisection

  • Secant/False Position

  • Ridders’/Brent

  • Newton-Raphson (requires derivatives)

All of these focus in on the root, but have different convergence and stability characteristics.

Note that for polynomial f there exist specialized routines (Muller, Laguerre). Also for eigenvalue problems, there exist specialized routines (linear algebra).

Let’s look at the bisection method. The main steps in the algorithm are as follows:

  • Sign change: Root must lie in [a1,b1]

  • Find midpoint of interval.

  • Compute f(midpoint), and its sign.

  • Sign same as upper bound: midpoint becomes new upper bound b2.

  • Find midpoint again.

  • Sign same as lower bound: midpoint becomes new lower bound a2.

  • Next midpoint becomes new upper bound b3.

  • Next midpoint becomes new lower bound a3.

  • etc.

The first few steps of this process are illustrated in Figure 5.

Figure 5: Bisection method of finding roots of a function.
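As a minimal sketch of the bisection idea (our own illustration, not the GSL implementation; the function name bisect and the tolerance are ours), the steps above translate into a short loop:

// bisect.cc (illustrative sketch)
#include <cmath>
#include <iostream>
#include <functional>
// Find a root of f in [a,b], assuming f(a) and f(b) have opposite signs.
double bisect(const std::function<double(double)>& f,
              double a, double b, double tol = 1e-10)
{
    double fa = f(a);
    while (b - a > tol) {
        double m  = 0.5*(a + b);   // midpoint of the current bracket
        double fm = f(m);
        if (fa*fm <= 0)
            b = m;                 // root lies in [a,m]
        else {
            a = m;                 // root lies in [m,b]
            fa = fm;
        }
    }
    return 0.5*(a + b);
}
int main() {
    // example: the root of cos(x) in [1,2] should be pi/2
    std::cout << bisect([](double x){ return std::cos(x); }, 1.0, 2.0) << "\n";
}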

To get an idea of other algorithms that could be used, let us consider the false position method, which is based on a linear interpolation between known points.

  • Same start: root within [a1,b1]

  • Linearly approximate/interpolate f.

  • Solve for next x.

  • Compute f(x), and inspect its sign.

  • Get new root bracket.

  • Interpolate again.

  • Solve for next x.

  • Compute f(x), inspect sign: new bracket.

  • etc.

A few iterations of this method are shown in Fig. 6.

Figure 6: False position method of finding a function’s root.

These two methods have in common that they do not require knowledge of the derivative of the function. A few other non-derivative methods are the secant method, which is like false position but always keeps the most recent points, Ridders’ method, which uses an exponential fit, and Brent’s method, which uses an inverse quadratic fit.

If the derivative of the function can also be computed, another set of algorithms becomes available. The most common derivative-based root finding algorithm is the Newton-Raphson method. Instead of starting from a bracket, in this method one starts from a single point, and then follows these steps:

  • From the function value and the value of the derivative at that point, get a linear approximation.

  • Compute approximate root using that approximation.

  • Take that approximate root as the next point and repeat until the position of the point does not change much anymore (given some tolerance).

A few steps in this procedure are shown in Fig. 7. This method tends to converge much faster than the previous methods.

Figure 7: Newton-Raphson method
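A minimal sketch of Newton-Raphson in one dimension (our own illustration, assuming the derivative fp is supplied by the caller) could look like this:

// newton.cc (illustrative sketch)
#include <cmath>
#include <iostream>
#include <functional>
// Newton-Raphson: start from x0 and iterate x -> x - f(x)/f'(x).
double newton(const std::function<double(double)>& f,
              const std::function<double(double)>& fp,
              double x, double tol = 1e-12, int maxiter = 100)
{
    for (int i = 0; i < maxiter; i++) {
        double dx = f(x)/fp(x);            // step from the linear approximation
        x -= dx;
        if (std::fabs(dx) < tol) break;    // position no longer changes much
    }
    return x;
}
int main() {
    // example: the square root of 2 as the root of x^2 - 2
    std::cout << newton([](double x){ return x*x - 2; },
                        [](double x){ return 2*x; }, 1.0) << "\n";
}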

Each method has different convergence and stability properties. Table 11 shows the properties of a few of the most common root finding algorithms.

Table 11: Convergence and stability of common root finding algorithms
method          convergence                     stability
Bisection       ε_{n+1} = ε_n/2                 Stable
Secant          ε_{n+1} = c ε_n^1.6             No bracket guarantee
False position  ε_{n+1} = ε_n/2 … c ε_n^1.6     Stable
Ridders’        ε_{n+2} = c ε_n^2               Stable
Brent           ε_{n+1} = ε_n/2 … c ε_n^2       Stable
Newton-Raphson  ε_{n+1} = c ε_n^2               Can be unstable

To use these methods, one does not have to implement the algorithms oneself. Instead, root finding algorithms can be found in the GNU Scientific Library, which we already saw in Sec. 2.3.8.

OVERVIEW ROOT FINDING GSL.

3.2.2 Multi-dimensional root finding

We can write the multidimensional root finding problem as f(x) = 0, or, written out,

f_1(x_1, x_2, x_3, …, x_D) = 0
f_2(x_1, x_2, x_3, …, x_D) = 0
⋮
f_D(x_1, x_2, x_3, …, x_D) = 0

In contrast to the roots of a function in one dimension, one cannot bracket a multidimensional root with a finite number of points. The roots of each equation define a (D-1)-dimensional hypersurface, so one cannot simply solve one equation after another either. Looking for possible intersections of hypersurfaces is a lot harder than the one-dimensional case.

Luckily, given a good initial guess, the Newton-Raphson method can work in arbitrary dimensions:

f(x_0 + δx) = 0
f(x_0) + (∂f/∂x) δx ≈ 0
f(x_0) ≈ -(∂f/∂x) δx
f(x_0) = -J δx
δx = -J^{-1} f(x_0)

where J = ∂f/∂x is the Jacobian matrix.

The solution at every step requires inverting a D×D matrix, or at least, solving a linear set of equations. We will see in the section on linear algebra how this could be done.
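For D=2 the inversion can be written out by hand, which gives a compact illustration of the method (a minimal sketch with a made-up example system, not the general GSL solver):

// newton2d.cc (illustrative sketch)
#include <cmath>
#include <iostream>
int main() {
    // Solve f1 = x^2 + y^2 - 4 = 0 and f2 = x*y - 1 = 0 with Newton-Raphson.
    double x = 2.0, y = 0.3;                   // initial guess
    for (int i = 0; i < 20; i++) {
        double f1 = x*x + y*y - 4, f2 = x*y - 1;
        // Jacobian J = [[df1/dx, df1/dy], [df2/dx, df2/dy]]
        double j11 = 2*x, j12 = 2*y, j21 = y, j22 = x;
        double det = j11*j22 - j12*j21;
        // delta = -J^{-1} f, using the explicit 2x2 inverse
        double dx = -( j22*f1 - j12*f2)/det;
        double dy = -(-j21*f1 + j11*f2)/det;
        x += dx;  y += dy;
        if (std::sqrt(dx*dx + dy*dy) < 1e-12) break;
    }
    std::cout << "x = " << x << ", y = " << y << "\n";
}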

Unfortunately, as in the one-dimensional case, this Newton-Raphson method can be unstable. The algorithm needs to be augmented with some safeguards so that the iteration steps do not spin out of control. There are several ways to do this, e.g., making sure that |f(x)|^2 gets smaller in each step. This can potentially still fail, but usually does the trick.

This enhanced root finding technique is also part of the GSL.

3.3 Optimization

Optimization is the same as minimization, and the same as maximization. Suppose we have a function f(x_1, x_2, …, x_D). We want to know for which set of x_1, x_2, … the function is at its minimum. If instead we needed a maximum, we could use the fact that maximizing f is the same as minimizing -f.

One might think that this is similar to root finding, as, formally at least, the problem is equivalent to setting the derivatives of f with respect to every x to zero, i.e., to solving ∂f/∂x_i = 0. However, specialized routines can take advantage of the fact that these equations all derive from a single scalar function f.

Once more, the one-dimensional case is easier and substantially different from the higher dimensional cases. Finally, no matter what numerical technique you use, there is no guarantee that you find the global minimum instead of a local minimum.

3.3.1 One-dimensional minimization

In one-dimensional minimization, one is given a variable x and a function f(x), and wants to find that value of x for which f(x) is smallest. As an example, consider the function in Fig. 8.

Figure 8: Example of a function to minimize.

The function in Fig. 8 has several minima and maxima. Some are local extrema, i.e., the smallest or largest function value in their neighbourhood. There is only one global minimum and one global maximum, i.e., the smallest and largest function value, respectively, over all x.

The strategy to numerically find a minimum is a generalization of the root bracketing that we saw in the previous section. One tries to find three values for x, called a, b, and c, such that f(a)>f(b)<f(c); these bracket a minimum of f. This bracket can be successively refined in an iterative process. Two such processes are the Golden section method and Brent’s method.

The golden section method works as follows. Given a triplet a, b, c that brackets a minimum, choose the larger of the intervals [a,b] and [b,c]. Then place a new point d a fraction w into the larger interval. Among the four points a, b, c and d there are three that again bracket a minimum, and this becomes the new bracket. How to choose the fraction is somewhat open, but the optimal choice turns out to be w = (3 - √5)/2 ≈ 0.38197.

Figure 9 shows a few iterations of this algorithm.

Figure 9: Illustration of the golden section method for minimization.
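A minimal sketch of one golden-section refinement loop (our own illustration; the GSL provides a production-quality version) could look as follows:

// golden.cc (illustrative sketch)
#include <cmath>
#include <iostream>
#include <functional>
// Refine a bracket a < b < c with f(a) > f(b) < f(c) until it is narrower than tol.
double golden_section(const std::function<double(double)>& f,
                      double a, double b, double c, double tol = 1e-8)
{
    const double w = (3 - std::sqrt(5.0))/2;    // ~0.38197
    double fb = f(b);
    while (c - a > tol) {
        // place the new point a fraction w into the larger of [a,b] and [b,c]
        bool right = (c - b > b - a);
        double d  = right ? b + w*(c - b) : b - w*(b - a);
        double fd = f(d);
        if (fd < fb) {              // d is the new best point
            if (right) a = b; else c = b;
            b = d; fb = fd;
        } else {                    // b remains the best point
            if (right) c = d; else a = d;
        }
    }
    return b;
}
int main() {
    // example: minimum of (x-1)^2 + 0.1*x, which lies at x = 0.95
    std::cout << golden_section([](double x){ return (x-1)*(x-1) + 0.1*x; },
                                -5, 0, 5) << "\n";
}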

With this procedure, there is no guarantee that we find the global minimum; the method returns local minima. But this is true for all numerical minimization algorithms. If we are not sure whether we found the right minimum, we can try again from a different starting point. If that yields a different result, we can take whichever minimum is the lowest. Some methods, especially those using random numbers like simulated annealing (see below), have ways of escaping local minima, but even they cannot guarantee that the global minimum is found.

3.3.2 Multi-dimensional minimization

Multidimensional minimization, where a given function f depends on several variables x = (x_1, x_2, …, x_D) and one seeks the values of the variables for which f takes on its lowest value, is quite common too: many machine learning algorithms, in particular neural networks, are based on multi-dimensional minimization.

What sets multidimensional minimization apart from one-dimensional minimization is that there is no way to bracket a minimum in higher dimensions. However, the situation is not as dire as for multi-dimensional root finding, because, in the end, down is down: as long as the function is bounded from below, minimization algorithms are likely to find at least a local minimum.

A common strategy of most methods is to build upon one-dimensional methods. Starting from a trial point x, one performs one-dimensional minimizations in D ‘independent’ directions, and tests for convergence.

Finding the right ‘independent’ directions is not a trivial matter. Just choosing D fixed directions can lead to a staircase scenario, where one takes many steps at right angles that are not quite in the correct direction. This may find the minimum eventually, but requires a lot of small one-dimensional minimizations.

If the derivatives (also known as the gradient) of the function f are known, one can do a bit better, although sometimes one can do without, such as with Powell’s methods, which use previous one-dimensional minimizations to guide the choice of direction sets.

In gradient-based direction sets, one uses the derivatives to determine where to go next. This, too, can be done in different ways, but the main ways are:

  • Steepest Descent (a.k.a. Gradient Descent): just go in the direction of the (negative) gradient. This works, but you may still have to take many sharp turns.

  • Conjugate Gradient: essentially, you solve for the minimum as if the function were (locally) quadratic. Better, but more complicated.

All of these methods mentioned above are in the GNU Scientific Library.
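To make the steepest descent idea concrete, here is a minimal sketch of our own (with a fixed step size; production codes such as the GSL minimizers use line searches along each direction instead):

// graddesc.cc (illustrative sketch)
#include <cmath>
#include <iostream>
int main() {
    // minimize f(x,y) = (x-1)^2 + 10*(y+2)^2 by walking against the gradient
    double x = 0.0, y = 0.0;
    const double step = 0.02;                 // fixed step size
    for (int i = 0; i < 2000; i++) {
        double gx = 2*(x - 1);                // df/dx
        double gy = 20*(y + 2);               // df/dy
        x -= step*gx;
        y -= step*gy;
        if (std::sqrt(gx*gx + gy*gy) < 1e-8) break;   // gradient ~ 0: converged
    }
    std::cout << "minimum near x = " << x << ", y = " << y << "\n";
}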

3.4 Simulated annealing

All of the previous methods are troublesome for functions with (many) local minima.

Let’s look for inspiration in nature. For instance, when the temperature drops below 0°C, water molecules find a new low energy configuration, one which any of our previous minimization methods would have a tough time finding: D = O(10^23).

How? While the temperature drops, the molecules are constantly moving in and out of local minima, exploring configuration space, and finding the lowest among many local minima.

Simulated Annealing mimics this. Schematically, this is illustrated in Fig. 10.

Figure 10: Simulated annealing.
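The following is a minimal sketch of the simulated annealing idea for a one-dimensional function with many local minima (our own toy implementation, with an arbitrary cooling schedule; the GSL offers a more general simulated annealing facility):

// siman.cc (illustrative sketch)
#include <cmath>
#include <iostream>
#include <random>
int main() {
    // "energy" landscape with many local minima
    auto f = [](double x){ return x*x + 4*std::cos(5*x); };
    std::mt19937 rng(13);
    std::normal_distribution<double> jump(0.0, 1.0);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double x = 5.0, fx = f(x);
    for (double T = 10.0; T > 1e-3; T *= 0.999) {   // slowly lower the "temperature"
        double xnew = x + jump(rng);
        double fnew = f(xnew);
        // always accept downhill moves; accept uphill moves with Boltzmann probability
        if (fnew < fx || uni(rng) < std::exp(-(fnew - fx)/T)) {
            x = xnew; fx = fnew;
        }
    }
    std::cout << "found minimum near x = " << x << ", f = " << fx << "\n";
}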

3.5 Ordinary Differential Equations

3.6 Partial Differential Equations

3.7 Linear Algebra

3.8 Fourier Transform

3.9 Fitting

3.10 Random Numbers

Part 4 High Performance Computing

4.1 Introduction to parallel computing

4.2 Measuring Performance

4.3 Optimizing Serial Performance

4.4 Job scheduling and load balancing

4.5 Shared-memory parallel programming (OpenMP)

4.6 Distributed-memory parallel programming (MPI)

4.7 General parallel computing on accelerators

4.8 Hybrid Parallel Computing

Part 5 Exercises

In this part, we list the assignments given in the course throughout the years, which are a good way to exercise the material presented in the text. We’ve also included a number of further exercises.

5.1 C++ exercises

General Remarks

A few suggestions that are recommended, in particular with respect to code development and coding style:

  • Try to avoid duplicating specific values when implementing your code (e.g., -5, 5 or any other quantities you might end up using repeatedly). Store them once in a variable and use that variable; this will give your code wider applicability.

  • Give your variables understandable and meaningful (i.e., representative) names.

  • Do NOT use global variables!

  • Indent your code properly. Mostly, that means that you should indent consistently and in such a way that all code in the same block has the same indentation.

  • Comment and document your code!

  • In general: try to write code that your future self could still read and understand in, say, one year.

  • Put your code under Version Control and create a repository to host it.

5.1.1 Moving average

The goal of this assignment is to write a C++ program that computes the moving average of a given vector.

5.1.1.1 Part I

The program should accept a command line argument indicating how many points the vector will contain. The program will generate a table with three columns: one column with the original values of the vector defined as indicated below, and a second and third column with the moving average computed using a window of 5 and 10 elements, respectively.

Moving average definition: in its simplest interpretation, the moving average of a vector of N elements with a window of m elements is given by:

ȳ_i = (1/m) ∑_{j=i}^{i+m-1} y_j (3)

Notice that each element of the vector will have an associated “moving average” value. Additionally, special consideration is needed when dealing with the “last” elements of the vector, i.e., the points where i > N-m.

In particular, your program should:

  • Contain at least two functions (besides int main): one that generates the actual values of the vector y, and a second one that implements the computation of the moving average.

  • For generating the vector y, consider generating a vector x taken from N evenly spaced values from the interval [-5,5] (including the end points), where N is given by the command line argument passed to the program; the vector y then is defined as the cosine of the elements from the vector x.

  • It should write out the results in tabular form (i.e. three-column ASCII text) to the screen. Don’t try to add lines to the table, just use spaces and newlines to format the text.

Compile and run your code using g++ with the -std=c++14 flag. Capture the output for the case N=100; for instance, if you named your code movingAverage, you can run it this way:

$ ./movingAverage 100 > averages.txt

This should generate a text file named “averages.txt” containing 100 rows and 3 columns: the vector y, and the moving averages for windows of 5 and 10 elements, respectively. Inspect these values by plotting them and comparing to what you would expect from the standard average computation.

5.1.1.2 Part II

Depending on the strategy you used for the first part of this exercise, you may have implemented the computation in different ways.

For instance, if we consider the definition previously given for the moving average, see Eq.(3), this definition assumes a forward consideration of the elements, i.e. it goes from j=i up to j=i+m-1.

One could also have defined this as a backward or centered implementation, where the range of j goes from i-(m-1) up to i, or from i-(m-1)/2 up to i+(m-1)/2, respectively.

Taking this into consideration, add an additional argument to your moving average function indicating which type of moving average computation (forward, backward or centered) you want to compute and implement this particular feature.

Another element that was left open to interpretation was how to deal with the final elements of the vector, i.e., when the index i is larger than N-m, where N is the size of the vector and m the length of the window over which the moving average is computed.

Consider now adding another argument to the moving average function that controls how the computation is done, depending on its value:

  • if the argument is restricted, the window shrinks by one element for each index past N-m, so that it does not go beyond the last element of the array, i.e., when i=N then m=1.

  • if the argument is cyclic, then elements from the beginning of the array are used to complete the m elements in the window, i.e., if i=N-m+2 then the elements j=0 and j=1 should be included in the average.

  • if the argument is padding, then additional elements with the same value as the last element are (conceptually) appended to the array so that the computation can continue with them.

5.1.2 Lissajous

The goal of this exercise is to write a C++ program that generates a table of data as indicated below.

This table should have three columns, one column with x double precision values ranging from -5 to 5, a second column with values y=sin(2x), and a third column with values z=cos(3x).

In particular, your program should:

  1. Contain at least two functions (besides int main), called f and g. The function f should take x as an argument and return the value of sin(2x), while the function g should take x as an argument and return cos(3x).

  2. The program should take 101 evenly spaced x-values from the interval [-5,5] (including the end points).

  3. It should use the functions f and g to compute the values of those functions for those 101 x-values.

  4. It should write out the x, f, and g values in tabular form (i.e. three-column ASCII text) to a file called "lissajous.txt". Don't try to add lines to the table, just use spaces and newlines to format the text.

Compile and run your code using g++ with the -std=c++14 flag, and capture the output. Inspect the output by plotting the columns, in particular, g vs f; you should be able to recognize the so-called Lissajous shapes.

5.1.3 Arrays

Write a program which:

  • Dynamically allocates an array of 100,000 integers.

  • Fills the array such that element i has the value 9*sin(i), rounded down to an integer.

  • Counts the number of times the value -9 occurs in the array, how often -8 occurs, -7, -6, -5, -4, -3, -2, -1 ,0, 1, 2, 3, 4, 5, 6, 7, 8, 9. (I.e., for each value separately, how often does it occur in the array?)

  • Print these values and counts out to screen.

  • Deallocate any allocated memory

  • Returns 0 (why?)

References


Appendices

Appendix A Terminology

Program

a set of instructions written to be followed by the computer to perform a specific task.

Algorithm

refers to the techniques, strategies and methods (recipes) used to achieve a particular task or perform a specific computation.

Variable

a place in memory where we can store information, e.g., numbers, text, etc.

Function

particular collection of instructions written with a very specific task in mind.

script

the (plain text ASCII) file that we create in which we write the code, containing the commands in a particular computer language. For instance, a shell script is a file containing shell commands to achieve a certain task.

routine

a more generic term (and older way, frequently used in Fortran) for referring to functions or scripts.

Modularity

a programming strategy or approach in which a program is created from individual units (“modules”) that are independent of each other and use explicitly passed variables to communicate information among them.

Scope

By the scope of a function we refer to the “extent”, in other words the “domain”, of the function in which the variables and internal functions defined within it can be accessed. Each function has its own scope, and it is not possible to see or access variables within the scope of other functions (unless they are defined within the same enclosing function, and even in that case it is not good form, and it breaks the modularity of the code, to access them without passing them as arguments).

Definition (or implementation) of a function

This is the process of actually writing what the function is supposed to do, i.e. the precise instructions that “define” the function. It should not be confused with executing the function.

The definition (or implementation) of a function should not be confused with executing (or running) the function.

The former is the process of writing the function, instruction by instruction, but not evaluating them – as if you were typing each of the instructions at the prompt but not hitting ‘enter’. The latter is the actual process of hitting ‘enter’, i.e., evaluating each of the lines in the function.

Execution (or calling or running)

This is the process of actually taking the definition of the function and evaluating each line, as if it were being typed, with the values passed to the function.

Local vs Global

Local variables are the variables defined within the scope of a function, i.e., such variables are local to the function and can only be accessed from within the function.
Global variables are the variables defined at the main level of our code; in principle these variables are accessible from within any other function. However, you should NOT access these variables from other functions unless they are passed in as arguments!

Compiler
Interpreter

These refer to two types of programming languages. In an interpreted language, there is a program (the interpreter) waiting for instructions from the user, which it executes immediately after you enter them. Examples of this type of language are R, the shell, Python, MATLAB, etc. In contrast, in a compiled language, the compiler – another computer program – reads your whole set of instructions, i.e., your code, and translates it into machine language (language that the computer can directly understand), generating a program that can then be executed on the computer.

Shell

often described as a “meta-program”, i.e., a program that can execute other programs; the instructions (a.k.a. commands) are entered as words typed by the user.

Prompt

a symbol (e.g., $ in the shell, or > in R) indicating that the program is waiting for instructions from the user, i.e., you. A variation on this: sometimes the interpreter will keep waiting for you to finish inputting a command, e.g., when you open quotation marks or start a code block, and will replace the usual prompt with a continuation prompt indicating that it is still waiting for you to finish the command (in the shell this continuation prompt is >, in R it is +).

Driver script

refers to the main script in charge of executing your program; the driver script organizes the flow of your program like an orchestra conductor.

Utility files

the files containing the definitions of the functions used by the driver script; this is where the functions are defined so that they can be called from the driver script.

Vectorization

a technique that allows the computer to operate on all the elements of a vector with a single operation, in contrast to looping over the individual elements, which requires as many operations as there are elements in the vector and is therefore much less efficient. Whenever possible, try to use vectorization; R, for example, makes it very simple to operate on vectors with ordinary operations instead of looping over their elements.

Case Sensitive

This refers to the fact that most computer languages differentiate between capital and lower-case letters, i.e., A is not the same as a. R, the shell, Python, and many other computer languages are case sensitive. Although a simple and almost trivial concept, it will on many occasions haunt you with errors such as “func1 not found!” – when what you actually defined was Func1.

Hardcoding

refers to including specific values directly within a program, for instance, the location of a particular file on your computer. It is usually considered bad practice, precisely because it breaks the flexibility of the code to work on other computers.

Bug

refers to an unwanted mistake in the implementation of your code. The term comes from an actual ‘bug’ (an insect) found in an early computer, which caused spurious and unexpected results.

Debugging

the systematic process of trying to identify errors or mistakes in the implementation of programs, a.k.a. “catching bugs”.

NaN

a special code that means “not a number”; it is the result of an illegal mathematical operation, such as taking the square root or the logarithm of a negative number.
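
For instance, a minimal illustration in C++:

#include <cmath>
#include <iostream>

int main()
{
    double x = std::sqrt(-1.0);                      // illegal operation: the result is NaN
    std::cout << x << " " << std::isnan(x) << "\n";  // prints something like "nan 1"
    return 0;
}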

Directory

another word for a folder, inherited from Linux systems, where folders are usually referred to as directories; a variation on this is a sub-directory, which is a folder inside another folder.

Metadata

data about your data, i.e., details associated with a particular data set that describe the data itself. Metadata includes who created the data (the author); the date and time the data was created; how it was created: whether it was collected in an experiment in a lab or generated by a program, and in the latter case the name and version number of the code that was used to create it; where it was created and on what operating system; the values of key variables and parameters that were used to create the data; anything and everything that might help you, in six months, to understand the what/where/why/how of the data.

ASCII

a standard convention to represent characters in text format: all the letters of the alphabet, numbers, and special symbols. It stands for “American Standard Code for Information Interchange”. Often used as a synonym for a (plain) text file.

GUI

Graphical User Interface, a graphical interface that allows the user to control a program using graphical devices such as a mouse and windows.

IDE

Integrated Development Environment, usually a GUI that integrates several tools, like a text editor, a console, a variable explorer, and even a debugger, in one graphical environment. E.g., Visual Studio, Xcode, etc.

OS

(Operating System) refers to the basic software (programs) running at the lowest level on a computer; it provides all the basic functionality and access to the hardware, and interfaces between the user and the hardware. Examples are Linux, MacOS, and Windows.

Appendix B Summary of the Linux command-line

This appendix contains a very brief overview of how to use the Linux shell, also known as the console, the terminal, the command prompt, or the command-line interface. (Much of the command overview here also applies to UNIX and MacOS systems, but the details of the commands do vary, so for definiteness’ sake, we restrict ourselves to Linux.) In these lecture notes, Linux commands are mostly used just to start compilations, run codes, and run makefiles. Because of that, most of the techniques presented here will work unaltered on MacOS systems, and carry over to Microsoft Windows.

There is only one part of these lecture notes that is more Linux-specific: where we cover high-performance computing using distributed-memory clusters. This is so for a very valid reason: virtually all distributed-memory clusters on the TOP500, the list of the 500 fastest supercomputers in the world, run Linux [Dongarra2011].

So what is this Linux shell precisely? It is a command-line interface for access to the operating system’s services. It is named a shell because it is the outermost layer around the operating system kernel. The shell interprets commands and allows a user to execute commands by typing them manually at a terminal, or automatically in programs called shell scripts. It is not an operating system but a way to interface with the operating system and run commands, and is a powerful and portable tool.

There are different shells, such as sh, bash, tcsh, ksh and zsh, but at the time of writing, the most common one on Linux machines, and the one we will restrict ourselves to here, is the Bourne Again Shell, or bash. You likely have this shell if you are working on a Linux system or on a Mac. On a Microsoft Windows machine, you can get it if you install Cygwin (https://www.cygwin.com), MSYS2 (https://www.msys2.org), Git-bash (https://git-scm.com), MobaXterm (https://mobaxterm.mobatek.net) or, in Windows 10, by using the Windows Subsystem for Linux (see https://docs.microsoft.com/en-us/windows/wsl/about).

The commands that a user can give to bash range from simple to complex. You type these commands, one line at a time, at the prompt. For instance, to get the current directory, you would do the following:

$ pwd
/c/Users/rzon
$

The prompt is highly customizable, but often displays the user name, the current directory, and the name of the computer. Here, we will abbreviate all those forms simply with $. Note that you do not type the prompt itself. Commands are single words (as in the example above, where the command was pwd), but they can be followed by one or more words called command-line arguments. To be explicit, let’s elaborate a bit more on what the above example really does. On the terminal, a prompt is shown by bash. After the prompt, you typed pwd followed by the Enter key. Pressing the Enter key triggers the bash shell to interpret what was entered. In this case, bash recognizes pwd as the command to print the “present working directory”, which it does and which happens to be /c/Users/rzon. Once the command is finished, bash shows a new prompt and waits for a new command.

One of the difficulties of using a command line interface (CLI) versus a graphical user interface (GUI) is that you have to know the commands that can be used. To help you with this, some common commands are listed in Table 12. To learn more about a particular command cmd, type

$ man cmd

with cmd replaced by the command of which you want to see the manual page. Within man, the space bar and arrow keys work as expected, “q” exits man, and “h” shows the help with all the key commands that can be used within man.

Command description
echo arg echo the argument
pwd present working directory
ls [dir] list the directory contents
cd [dir] change directory
history [num] print the shell history
man cmd command’s man page
cp file1 file2 copy a file
mv file1 file2 move/rename a file
rm file delete a file
mkdir dir create a directory
rmdir dir delete a directory
file file type of file
more file scroll through file
less file scroll through file
cat file print the file contents
head file print first 10 lines of file
tail file print last 10 lines of file
wc file word count data of file
(a)
Command description
file filename info about the file’s type
locate filename find files by name
whereis cmd find files associated with a command
which cmd locate a command
diff file1 file2 compare two (text) files
cmp file1 file2 compare two (binary) files
wget url, curl url download the url
tar args -f tarfile handles tarfiles
sort file sorts the lines of file
source file run the cmds in file
chmod p file change file permissions
grep arg file search for arg in file
cut flags cut out part of the output
for..do..done for loop in bash
if..then..else..fi if statement in bash
logout, exit close the terminal session
(b)
Table 12: Summary of Linux shell commands. In the left columns, arg denotes a mandatory argument, [arg] an optional argument.

It is good to be a bit familiar with the directory structure of Linux, particularly when coming from a GUI environment. As in GUI environments, files are organized in directories which can further contain sub-directories. The path of a file is the sequence of directory names to get to that file. In Linux (as in UNIX, BSD and MacOS), the directory names in a path are separated by a “/” (in contrast to Microsoft Windows which uses “\”). The collection of all paths on a system is called the directory tree. In Linux there is only one directory tree, starting at the root directory, which is denoted as “/” (this differs from Windows, which has a directory tree for each drive). Additional drives are mounted somewhere in the directory tree (e.g., USB drives might get mounted at a path like /media/user/usbname). Linux has a strong ownership model for files. A lot of the directory tree is owned by the “root” user, and cannot be changed when you are logged in as a regular user. Each regular user will have a home directory, which it owns. This home directory is usually located at /home/user, where user is the login name of the user. The location of your home directory is also stored in an environment variable called HOME.
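
For example, assuming a user named user whose home directory contains a sub-directory called project (hypothetical names), moving around the directory tree could look like this:

$ cd /home/user/project
$ pwd
/home/user/project
$ cd ..
$ pwd
/home/user
$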

The Linux shell has variables that can store string values. To get the stored value of a variable, e.g., HOME, one has to prepend the name with a dollar sign, e.g. $HOME. Variables can be defined by assigning them a value, e.g. MYVAR=Hello. There can be no spaces around the equal sign, and if the value should contain spaces, the value has to be surrounded by quotes. Here’s an example:

$ MYVAR="Hello my"
$ echo $MYVAR world
Hello my world
$

From the set of all variables in bash, only those that are environment variables are passed on to the commands that bash executes. To turn a regular variable into an environment variable, you can use the export command, e.g., export MYVAR.
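
Continuing the example above, a command started by bash (here a second bash, invoked with bash -c) only sees MYVAR after it has been exported:

$ bash -c 'echo $MYVAR world'
world
$ export MYVAR
$ bash -c 'echo $MYVAR world'
Hello my world
$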

Finally, it is important to know that there are many symbols and characters that the shell interprets in special ways. This means that certain typed characters cannot be used in certain situations, or will automatically perform special operations, unless they are escaped. The most important special character to the shell is the space. All commands are split up into words that are separated by one or more spaces. Other special characters are listed in Table 13. To escape these characters, you need to prepend them with a backslash (\), or surround them with quotation marks (which are a bit tricky in bash). Because of these special meanings, you should avoid using these special characters, including spaces, in the names of files and directories.

Character description
\ Escape character. If you want to reference a special character, you must “escape” it with a backslash first. Example: touch /tmp/filename\*
$ Variable value look-up character. E.g. $PATH gives the value of the string stored in the variable named PATH.
/ Directory separator, used to separate a string of directory names. Example: /usr/src/linux
. Current directory. Can also “hide” files when this is the first character in a filename.
.. Parent directory
~ User’s home directory
* Wildcard character. Represents 0 or more characters in a filename, or by itself, all files in a directory. Example: pic*2002 can represent the files pic2002, picJanuary2002, picFeb292002, etc.
? Represents a single character in a filename. Example: hello?.txt can represent hello1.txt, helloz.txt, but not hello22.txt
[ ] Can be used to represent a range of values in a filename, e.g. [0-9], [A-Z], etc. It will only use those values for which file names exist. Example: hello[0-2].txt represents the names hello0.txt, hello1.txt, and hello2.txt
{ } Can be used to represent a set or range of values, e.g., {0..9}, {in,out}, etc. Example: hello{0..2}.txt represents the string “hello0.txt hello1.txt hello2.txt”, and hello.{in,out} represents the string “hello.in hello.out”, regardless of whether these exist as files.
cmd > file redirect output to file
cmd >> file append output to file
cmd < file use file as input to cmd
cmd1 | cmd2 Redirect the output of one command into another command. Example: ls | more
Table 13: Summary of special characters of bash
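
For example, assuming there are text files data1.txt and data2.txt in the current directory (hypothetical names), several of these special characters can be combined on a single line:

$ cat data*.txt | sort > sorted.txt
$ wc -l sorted.txt

Here the wildcard expands to all matching files, the pipe feeds their concatenated contents to sort, and the redirection stores the sorted result in sorted.txt.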

More resources on the Linux bash shell

Appendix C Getting the programming environment

These lecture notes assume that you are working on a Linux-like environment with the Bash shell and the gcc and g++ compilers installed, as well as gdb, make, and git. Try to ensure that the compiler is at least at version 6. You will also need a text editor. In this appendix, we’ll give some pointers on how to get such an environment.
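
To check which version of the compiler you have, you can, for instance, run:

$ g++ --version

The first line of the output reports the version number.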

For Linux: The terminal will already be there. The compilers might be there too, or they may have to be installed, e.g., under Ubuntu, with the command sudo apt install gcc g++ git. Most Linux distributions come with an X11 server for the GUIs and graphics.

For MacOS: The terminal on the Mac gets you most of the Linux-like functionality (MacOS is a BSD UNIX system, which is similar to but distinct from Linux). There are a couple of different ways to get the compilers:

  1.

    Using the native development toolchain provided by the OS, called Xcode. You will need to install Xcode and then download the compilers from http://hpc.sourceforge.net. If you need X11 graphics, you will also need to install XQuartz (https://www.xquartz.org).

  2.

    Alternatively, you can use programs that “port” most of the missing “Linux” applications to MacOS. Among the best-known ones are Fink (http://www.finkproject.org/), MacPorts (https://www.macports.org/), and Homebrew (https://brew.sh/). These are basically package managers that allow you to install additional tools such as the compilers, debuggers, etc. Each of them has its own way of packaging and installing programs, but the goal is similar.

For Microsoft Windows: There are a number of Linux(-like) terminal environments:

  1.

    MSYS2 (https://www.msys2.org). If you want to install compilers and develop applications that work both in Windows and the Linux terminal environment (using e.g. native Windows editors like Atom and VSCode), then this is your choice. Once MSYS2 is installed, you can use the pacman command to install other packages. For instance, to install the compilers, you would do
     $ pacman -S --needed base-devel mingw-w64-x86_64-toolchain git
    This installs the MinGW gcc/g++ compilers (http://www.mingw.org). A nice overview of the installation procedure can be found at
    https://github.com/orlp/dev-on-windows/wiki/Installing-GCC--&-MSYS2.

  2.

    Windows Subsystem for Linux (WSL) is an environment that translates Linux calls to native Windows calls (see https://docs.microsoft.com/en-us/windows/wsl/about). WSL can run different Linux distributions, but if you choose Ubuntu, you can install the compilers and git with
     $ sudo apt install git gcc g++

  3.

    Cygwin (https://www.cygwin.com) is a complete Linux emulator. Because of this completeness, it tends to be slower than MSYS2 or WSL. You can install gcc, git, and g++ in this environment using the Cygwin installer. Cygwin also has an X emulator.

  4.

    Git-bash (https://git-scm.com) is another way to get a Bash shell on Windows. It comes with git and a few other utilities, but you cannot install other applications or compilers in it (although you can install those separately and add them to the PATH).

  5.

    MobaXterm (https://mobaxterm.mobatek.net) is a well-designed combination of a terminal, an X11 server, an ssh client, and a number of other tools. It also has an experimental package manager to install certain extensions, like compilers. This terminal is especially nice for connecting and copying files to remote systems, as you only need this one application and it installs very quickly.

At the moment, we suggest trying option 1, but whatever environment you choose, make sure to read its installation documentation; things can change in newer versions.

Also note that for MSYS2 and WSL, if you need X11 graphics, you will need to install an X server such as Xming (https://sourceforge.net/projects/xming/) separately, and set the DISPLAY environment variable to localhost:0.
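
In bash, setting this variable can be done with:

$ export DISPLAY=localhost:0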