This lesson is still being designed and assembled (Pre-Alpha version)

A Beginner's Guide to Programming and Data Analysis with R and BASH

Programming Fundamentals

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What is the difference between programming and coding?

  • What are algorithms and how are they developed?

  • What is pseudocode and how can it be used?

  • What are the most common forms of logic used in programming?

Objectives
  • Become familiar with the key concepts of programming, coding, algorithms, and pseudocode.

  • Be able to design algorithms to accomplish basic everyday tasks.

  • Become familiar with logical expressions commonly used in programming.

Introduction

Some of the practical skills required of programmers include the ability to create algorithms, model problems, process data, and manage projects. Many of these same skills are also necessary for anyone interested in the analysis of complex or large biological data sets.

It is very common for beginning programmers and scientists to learn how to use software or pieces of code without really understanding how it was designed. Knowing how a software program was designed is not typically necessary if you want to use it to complete some simple data analysis. But if you want to have more control over the tools you are using and the way your analysis is performed, then you need to understand how the piece of software is being run by the computer.

Programming vs Coding

Before we begin learning about how to write helpful programs for data analysis, it is important that we consider fundamental concepts and best practices in programming. While sometimes used interchangeably, programming and coding have different definitions.

Figure: Programming vs Coding

So although programming and coding have different meanings, they are related. The goal of coding is to create the code that acts as a set of computer instructions for a part of a programming project. The goal of programming, on the other hand, is to produce programs that are complete, ready-to-use software products.

Based on your personal experiences, let’s discuss our current understanding of these important concepts.

Discussion - Programming vs Coding

Come up with a single sentence to describe programming.

Solution

Programming is the process of creating a set of instructions or related activities to achieve a task or goal.

Come up with a single sentence to describe coding.

Solution

Coding is the process of transforming the set of instructions for a process into a written language that a computer can interpret.

Pseudocode, Code, and Algorithms… Oh My!

Although the differences seem small, there are important distinctions that we can make between the concepts of pseudocode, code, and algorithms.

Figure: Pseudocode vs Algorithms

Everyone has some experience with algorithms in their day-to-day life. For example, you have followed an algorithm if you have ever cooked or done some task that requires you to follow instructions with a sequence of steps.

Discussion - Algorithms

Attempt to describe what algorithms are in a single sentence.

Solution

Algorithms are the set of step-by-step instructions that explain how to solve a given problem.

Algorithms need to be represented by some form of language in order to be understood and shared with others. The process of writing pseudocode can be tremendously helpful for figuring out how to start developing code to solve a problem, or implement an algorithm.

Discussion - Pseudocode

Attempt to describe what pseudocode is in a single sentence.

Solution

Pseudocode is the set of instructions for an algorithm written in a plain language.

As a first step before you begin developing an algorithm or writing any code, it is a good idea to write out the steps in a plain language. Let’s look at an example of pseudocode for a simple algorithm to make tea:

  1. Remove a teabag from the package
  2. Put the teabag in a cup
  3. Boil some water
  4. Add the hot water to the cup
  5. Allow the tea to steep for 5 minutes
  6. Remove the teabag

Challenge - Pseudocode

Write your own pseudocode for an algorithm to make buttered toast.

Solution

  1. Take a slice of bread from the package
  2. Place the bread in the toaster
  3. Allow the bread to toast for 5 minutes
  4. Remove the toasted bread from the toaster
  5. Put the toasted bread on a plate
  6. Open the container of butter
  7. Grab a knife by the handle
  8. Dip the knife blade into the butter
  9. Apply the butter to the toasted side of the bread

The primary advantage of using pseudocode in your programming process is that it improves the readability of your algorithms. Writing algorithms for programs in a plain language first allows you to break down a complex problem into smaller and more manageable pieces for coding. Furthermore, it gives you the chance to easily identify the most complex and potentially troublesome portions for code development.

Programming with Logic

A fundamental concept of computer programming, Boolean logic is the mathematical logic underlying Boolean algebra. In Boolean algebra, expressions are evaluated to one of two values: TRUE or FALSE. Since an expression may only take on one of two values, Boolean logic is considered “two-valued logic”.

Figure: Boolean Expressions

Note that an expression is a combination of logical operands and operators. In Boolean logic the operands are statements that can be proven true or false, and the operators are the logical AND, OR and NOT.
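As a preview of the R code covered later in this lesson, here is a minimal sketch of how operands and operators combine into Boolean expressions. The variable names and values are hypothetical and chosen only for illustration; `&`, `|`, and `!` are R's AND, OR, and NOT operators.

```r
# hypothetical observations, chosen only for illustration
is_raining <- TRUE
is_cloudy <- FALSE
temperature_f <- 28

# a comparison operator produces a Boolean value
is_freezing <- temperature_f < 32

# AND is TRUE only when both operands are TRUE
raining_and_cloudy <- is_raining & is_cloudy

# OR is TRUE when at least one operand is TRUE
raining_or_cloudy <- is_raining | is_cloudy

# NOT flips TRUE to FALSE and FALSE to TRUE
not_raining <- !is_raining

print(is_freezing)         # [1] TRUE
print(raining_and_cloudy)  # [1] FALSE
print(raining_or_cloudy)   # [1] TRUE
print(not_raining)         # [1] FALSE
```

Each expression evaluates to exactly one of the two values, no matter how many operands and operators it combines.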

Challenge - Boolean Expressions

1. What are some examples of simple boolean expressions?

Solution

  • It is raining
  • My cat is hungry
  • The temperature is < 32 degrees Fahrenheit
  • Today is NOT Wednesday

2. What are some examples of compound boolean expressions?

Hint

Use a combination of the following operators to add complexity to your expressions!

  • Comparison operators (>, <, =, >=, <=, !=)
  • Boolean operators (AND, OR, NOT)

Solution

  • It is raining AND it is cloudy
  • My cat is hungry OR my cat is cute
  • The temperature is < 32 degrees Fahrenheit AND it is snowing
  • Today is NOT Wednesday AND Today is Thursday

We can combine boolean expressions with control statements to specify how programs will complete a task. Control statements allow you to have flexible outcomes by selecting which pieces of code are executed, or not.

Control Structures

The three primary types of control structures are:

  • Sequential statements
  • Iterative statements
  • Conditional statements

Figure: Control Statements

The most common type of control structure is the sequential statement. Sequential statements are code statements written one after another and executed line by line. This means that the statements are performed in a top-to-bottom sequence according to how they are written.

The following is an example of a sequential statement with everyday actions.

Pseudocode

  1. Brush teeth
  2. Wash face
  3. Comb hair
  4. Smile in mirror

Challenge - Sequential Statements

What is an example of pseudocode with a sequential statement using everyday actions or items?

Solution

This is an example of pseudocode with the sequence of actions to tie a shoe.

  1. Tie a basic knot
  2. Make a loop with one of the laces
  3. Use your other hand to wrap the other lace around the loop
  4. Pull the shoelace through the hole to form another loop
  5. Hold both loops and pull them tight

Iterative statements allow you to execute the same piece of code a specified number of times, or until a condition is reached. The most common iterative statements are defined using either FOR or WHILE loops. Let’s start by looking at a flow diagram for a FOR loop, which depicts the flow of information from inputs to outputs.

Figure: Iterative FOR Statements

There are many everyday actions that are done repetitively over a range of time, for example:

Pseudocode

  1. For each day of the year
    • Get up
    • Brush teeth
    • Wash face
    • Comb hair
    • Smile in mirror

Challenge - Iterative Statements 1

What is an example of pseudocode with a FOR loop iterative statement using everyday actions or items?

Hint: Iterative statements may contain, or be a part of sequential statements.

Solution

This is an example of pseudocode with a FOR loop to brush your teeth.

  1. Add toothpaste to toothbrush
  2. For each tooth
    • Brush the outer surface
    • Brush the inner surface
    • Brush the chewing surface
  3. Brush the tongue surface
  4. Rinse with water

WHILE loops are another type of iterative statement that can be used as a control structure in your code. This type of iterative statement will continue to execute a piece of code until a condition is reached.

Figure: Iterative WHILE Statements

We can also think of some everyday actions that are done repetitively until a certain point. For example, consider the process of braiding hair with a simple braid.

Pseudocode

  1. Divide the hair into three even sections
  2. While enough hair remains to weave
    • Cross the left section over the middle section
    • Cross the right section over the middle section
  3. Tie the ends of the sections together with a hair tie

Challenge - Iterative Statements 2

What is an example of pseudocode with a WHILE loop iterative statement using everyday actions or items?

Hint: Iterative statements may contain, or be a part of sequential statements or other iterative statements.

Solution

This is an example of pseudocode with a WHILE loop to play dodgeball.

  1. Gather 3 or more people per team
  2. Arrange 1 or more balls at the center of the court
  3. Set a timer for 10 minutes
  4. Rush to the balls in the center of the court when play begins
  5. While untagged players remain AND time remains
    • Try to dodge balls that the other team throws at you
    • Throw balls at the other players to get them out
  6. Shake hands

The most common and simplest conditional statements are of the IF… THEN form. These are statements that have two parts: a hypothesis (if) and a conclusion (then). The conclusion is executed only when the hypothesis evaluates to TRUE.

Figure: Conditional IF... THEN Statements

Situations requiring conditional decisions come up all the time in life, for example:

Challenge - Conditional Statements 1

What are some examples of simple IF… THEN conditional statements using everyday actions or items?

Hint

Use a combination of the following operators to add complexity to your statements!

  • Comparison operators (>, <, =, >=, <=, !=)
  • Boolean operators (AND, OR, NOT)

Solution

Here are some examples of every-day conditional statements.

  • IF you eat food, THEN you will NOT be hungry
  • IF it is my birthday AND I want to cry, THEN I will cry
  • IF my grade < 100 AND my grade > 90, THEN my grade is an A
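As a preview of the R syntax covered later in this lesson, the grade example above can be sketched roughly as follows. The grade of 95 and the variable names are hypothetical; `&&` is the AND operator R uses for single TRUE or FALSE values inside an `if` condition.

```r
# hypothetical grade, chosen only for illustration
my_grade <- 95
letter <- "not yet assigned"

# IF my grade < 100 AND my grade > 90, THEN my grade is an A
if (my_grade < 100 && my_grade > 90) {
  letter <- "A"
}

print(letter)  # [1] "A"
```

If the hypothesis evaluated to FALSE, the conclusion would be skipped and `letter` would keep its starting value.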

The next type of conditional statement adds another level of complexity with the IF… THEN… ELSE format. By adding the ELSE condition to an IF… THEN statement we are able to have alternative conclusions to our hypothesis.

Figure: Conditional IF... THEN... ELSE Statements

The following are examples of every-day moments that require decisions with multiple conditional outcomes.

Challenge - Conditional Statements 2

What are some more examples of IF… THEN… ELSE conditional statements using everyday actions or items?

Hint

Use multiple ELSE conclusions and a combination of the following operators to add complexity to your statements!

  • Comparison operators (>, <, =, >=, <=, !=)
  • Boolean operators (AND, OR, NOT)

A more advanced type of conditional statement combines multiple IF… THEN… ELSE statements to make a compound statement with many alternative outcomes.

Figure: Conditional Compound Statements

The following is an example of an every-day moment that requires compound decisions with many alternative outcomes.

Pseudocode

IF it is snowing outside

Challenge - Conditional Statements 3

What are some more examples of compound IF… THEN… ELSE conditional statements using everyday actions or items?

Hint

Use a combination of multiple IF… THEN… ELSE statements and a combination of the following operators to add complexity to your statements!

  • Comparison operators (>, <, =, >=, <=, !=)
  • Boolean operators (AND, OR, NOT)

Key Points

  • Programming is the process of creating a set of instructions or related activities to achieve a task or goal.

  • Coding is the process of transforming the set of instructions for a process into a written language that a computer can interpret.

  • Algorithms are the set of step-by-step instructions that explain how to solve a given problem.

  • Pseudocode is the set of instructions for an algorithm written in a plain language.

  • Boolean algebra uses mathematical expressions that are evaluated to one of two values: true or false.

  • Control statements allow you to have flexible outcomes by selecting which pieces of code are executed, or not.


R Fundamentals

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What is the R programming language?

  • How do I write code in the R programming language?

  • What is the utility of RStudio?

  • What are the components and features of R and RStudio?

  • How can I write and run R code in RStudio?

Objectives
  • Become familiar with the syntax and common functions of the R language.

  • Be able to write helpful and simple comments for programs.

  • Become comfortable with working in RStudio.

  • Practice writing R code to perform basic operations.

The R Programming Language

The R programming language is a great first language for anyone interested in using coding to help answer questions with data analysis, data visualization, and data science. R provides a wide variety of tools for statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and more. A strength of the R language is the ease with which publication-quality plots can be generated, including mathematical symbols and formulae.

Figure: Why Learn the R Programming Language

Since the R programming language is open source, it is not proprietary, and it can be modified and built upon by the public. Furthermore, the R environment itself is an integrated suite of software for data manipulation, calculation, and graphics. The flexible R environment includes:

The Utility and Components of RStudio

RStudio is a very useful software program that allows you to work with the R programming language using a convenient user interface (UI). The interface for RStudio has four main components:

Figure: Components of the RStudio User Interface

The most important component of RStudio is the console, which is essentially the heart of RStudio. It is from here that you can run code, and it is here that R actually evaluates code.

Another place you can enter and run code is the source component of RStudio. This is where you can write and edit code to save in a file, or script. Note that R script files have .R or .r as their file extension.

R Programming Language Syntax

The syntax of a programming language defines the meaning of specific combinations of words and symbols. This is why we call programming coding. Each programming language uses different combinations of words and symbols to get the computer to follow the instructions specified in your code.

Figure: What is Syntax?

Here is a fun example from the show 30 Rock to illustrate the importance of semantics:

Figure: Tracy Jordan's Party Invite

But, he actually wanted the invite to read:

Figure: What Tracy Jordan Wanted Said

R Variables & Data Types

In the R programming language, a combination of letters and symbols is used to give names to the data you are actively using in the memory of your computer system. These names are called variables. Variables are named storage that your programs can access and manipulate. These variables may be storing the values you specify directly in your code, or data stored in other files.

To set (assign) the value of a variable means that it is referring (pointing) to a specific value or piece of data in the memory of the computer system that is running (executing) your code.

# here is an integer value
4

# here is a variable with an assigned integer value of 4
my_value <- 4

In general, the R programming language may be considered a more complicated and capable calculator. Let’s start learning to code in R by using the following common arithmetic operation symbols, or operators:

To get you started, the following is an example of R code that uses the addition operator in a few different ways.

# addition using the values directly
8 + 8

# addition using the values stored in a variable
my_value <- 8
my_value + my_value

# addition using two different variables with different assigned values
# and the result is stored in the variable named my_result
my_value_1 <- 4
my_value_2 <- 8
my_result <- my_value_1 + my_value_2

Tip!

Note that the value of 8 was just stored in the my_value variable, which we used earlier to store a value of 4. It is possible to overwrite the data pointed to by a variable by assigning with the <- operator a new value to the same variable name. This essentially removes from active memory the previous data value that was stored using that variable name.
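The other common arithmetic operators follow the same pattern as addition. The following is a minimal sketch with hypothetical variable names and values; it also shows a variable being overwritten, as described in the tip above.

```r
# values chosen only for illustration
a <- 8
b <- 4

a - b    # subtraction: 4
a * b    # multiplication: 32
a / b    # division: 2
a ^ 2    # exponentiation: 64
a %% 3   # remainder (modulo): 2
a %/% 3  # integer division: 2

# assigning again to the same name overwrites the previous value
a <- a + 1
a        # 9
```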

The type of data that is being stored and referred to by variables often needs to be specified. One reason for this is that some mathematical or computational operations cannot be performed on certain data types, or on a mix of data types.
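To make this concrete, here is a small sketch of checking data types and of an operation that fails because of them. The variable names are hypothetical, and the exact wording of the error message depends on your R installation, so it is printed rather than assumed.

```r
# class() reports the data type of a value or variable
class(0.5)      # "numeric"
class("hello")  # "character"
class(TRUE)     # "logical"

# text that looks like a number is still a character value
my_text <- "4"

# adding a number to a character value is an error, so we catch
# the error here and store its message instead of stopping
result <- tryCatch(
  my_text + 1,
  error = function(e) conditionMessage(e)
)
print(result)

# converting the text to a numeric value first makes the addition valid
as.numeric(my_text) + 1  # 5
```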

Figure: R Variable and Data Types

Coding Challenge

What are some more code examples of variables that have different data types?

Solution

# variable with a character value 
my_chars <- "hello"

# variable with a numeric value
my_nums <- 0.5

# variable with the result of two numeric values being added together (evaluated)
my_add_result <- 0.5 + 1.7

# variable with a value assigned from the data in the my_file.txt text file
my_file_data <- read.delim("my_file.txt")

R Functions - Printing, Vectors, Matrices, and Data Frames

Functions in R are blocks of code (statements) that can be used repeatedly and on demand (called) in a program. The syntax of a function in R defines how such a block is written.
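As a rough illustration, the sketch below defines a small function and calls it twice. The function name add_values and its argument names are hypothetical and used only for this example.

```r
# add_values is a hypothetical function name used for illustration
add_values <- function(first, second) {
  total <- first + second
  # the value of the last expression is returned to the caller
  total
}

# the same block of code can be called repeatedly with different inputs
add_values(4, 8)      # 12
add_values(0.5, 1.7)  # 2.2
```

Writing the addition once inside a function means that any later change to the calculation only needs to be made in one place.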

Figure: Functions Syntax in R

Perhaps the most fundamental function in any programming language is one that allows you to print data to the screen. This allows you to view the values assigned to variables and identify the source of problems in your code, for example.

Tip!

The quickest and easiest way to find out more information about an R function, including what it is and how to use it, is with the ? symbol in RStudio.

# examine the documentation for the print function in R
?print

The most common function to print outputs in R is named print. This function requires as input an R object, such as a character string or variable.

# print a character object to the screen
print("Coding is fun!")

# print a variable to the screen
my_var <- "Coding is fun!"
print(my_var)

Vectors, matrices, and data frames are called R objects. Objects are a concept fundamental to object-oriented programming, and each object has specific attributes and behaviors. In the R programming language, these are named storage that contains 1D and 2D collections of data.

Each piece of data in a vector can be accessed by specifying the index of the piece of data, or element. The data in vectors must all be of the same data type. Furthermore, vectors can be assigned to variables in R.

To create 1D vector storages we can use built in R functions. For example, we can create a vector of numbers representing the sequence of values as follows.

# 1D vector using the : binary operator
1:10

# 1D vector in the reverse order
10:1

# examine the documentation for the seq function in R
?seq

# 1D vector using the seq function and explicit arguments
seq(from = 1, to = 10)

# 1D vector using the seq function and implicit arguments
seq(1, 10)

# print a 1D vector to the screen using the print and seq functions together
print(seq(1,10))

# variable with an assigned value of a 1D vector object
my_vector <- seq(1, 10)

# view the data contents of my_vector using the print function
print(my_vector)

# shorthand way to view the data contents of my_vector
my_vector

# access the second element of the vector stored in my_vector
my_vector[2]

Tip!

Note that we were able to use (call) a print function using a seq function contained (nested) inside the arguments of the print function. Nested function calls allow you to perform multiple tasks using fewer lines of code, for example.
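Returning to the rule that all of the data in a vector must be of the same data type, the following sketch shows what R does when you try to mix types. It uses the c function to combine values, and the variable name is hypothetical.

```r
# c() combines values into a vector
mixed <- c(1, "two", TRUE)

# every element has been converted to a character value
mixed         # "1"    "two"  "TRUE"
class(mixed)  # "character"

# elements are accessed by index, and indexing starts at 1 in R
mixed[2]      # "two"
```

Rather than raising an error, R converts the elements to a single type that can represent all of them, which here is character.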

Coding Challenge

How would you print both a character object and a variable to the screen?

Hint: Use the internet to search “r print string and variable”, for example.

Solution

# store an integer value in a variable
my_var <- 20

# print both the character object and the value stored in the variable
# to the console using the cat function
cat("The value of my_var is", my_var)

We can also create a 1D list using the list function in R. These are R objects that can contain data elements of different types. What’s more, the data in lists can be variables, 1D, and 2D R objects.

# list of values with different data types
my_list <- list(1:4, TRUE, 0.5, "meow")

# view the contents of the list variable
print(my_list)
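A short sketch of retrieving individual elements from the same list follows. It recreates my_list so that the example runs on its own.

```r
my_list <- list(1:4, TRUE, 0.5, "meow")

# double square brackets return the element itself
my_list[[4]]         # "meow"
class(my_list[[4]])  # "character"

# single square brackets return a smaller list that contains the element
class(my_list[4])    # "list"

# elements that are vectors can be indexed further
my_list[[1]][2]      # 2
```

Because each element keeps its own data type, a list is a convenient container when the pieces of data do not all share one type.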

Next, we can use different R functions to create a 2D data frame and a 2D matrix that each contain multiple sets of sequences. Matrices are objects in which the elements are arranged in a 2D rectangular layout, and data frames are 2D tables in which each column contains values of one variable and each row contains one set of values from each column. Additionally, we can access different pieces of data (elements) stored in our matrix by using the column and row index of the element.

# examine the documentation for the matrix function in R
?matrix

# 2D matrix where the sequence data is filled in by row
# and the data specified using a nested seq function call
matrix(data = seq(1, 10), nrow = 2, byrow = TRUE)

# 2D matrix where the sequence data is filled in by column 
# with the default byrow argument value of FALSE and implicit data argument
matrix(seq(1, 10), nrow = 2)

# variable storing 1D sequence data
my_vector <- seq(1, 10)

# 2D matrix where the sequence data is added (passed) to the matrix function using a variable
matrix(my_vector, nrow = 2)

# variable storing 2D matrix where the data argument is passed using a variable
my_mat <- matrix(my_vector, nrow = 2)

# view the contents of my_mat
my_mat

# access the element in the first row and second column using the row and column index of the element
my_mat[1,2]

Coding Challenge

How would you access a specific element of a matrix that has not been stored in a variable?

Solution

# access the first element of the second column in a matrix not stored in memory while it is created
matrix(seq(1, 10), nrow = 2)[1,2]

Similar to using the matrix function, we can use the data.frame function to create 2D data tables. Remember that these are simply a collection of vectors that all have the same (equal) length. An interesting difference between data frames and matrices is that data frames can be a collection of vectors each with different data types, but matrices require all the row and column vectors of data to be the same type.

# examine the documentation for the c function in R
?c

# variable with a 1D vector of character data to be used as our first column
char_var <- c("coding in", "R", "is fun", ":)") 

# variable with a 1D vector of integer data to be used as our second column
seq_var <- seq(1, 4) 

# examine the documentation for the data.frame function in R
?data.frame

# variable with a 2D data frame storing our two 1D vectors using implicit column naming
df <- data.frame(char_var, seq_var)

# 2D data frame storing our two 1D vectors using explicit column naming
data.frame(characters = char_var, integers = seq_var)
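To see the difference in data types between data frames and matrices, consider the sketch below. It reuses char_var and seq_var, builds the matrix with the cbind function, and sets stringsAsFactors = FALSE explicitly because the default for that argument differs between older and newer versions of R.

```r
char_var <- c("coding in", "R", "is fun", ":)")
seq_var <- seq(1, 4)

# a data frame keeps the data type of each column
df <- data.frame(characters = char_var, integers = seq_var,
                 stringsAsFactors = FALSE)
sapply(df, class)

# cbind() builds a matrix here, so the integers are converted to characters
mat <- cbind(char_var, seq_var)
class(mat[1, 2])  # "character"

# dim() reports the number of rows and columns of a 2D object
dim(df)   # 4 2
dim(mat)  # 4 2
```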

As we have seen, it is often useful to think of the 2D storage of values in data frames and matrices as a combination of 1D storages. We use the [] operator to select not only a single element of a 1D or 2D data collection, but also to break down a 2D matrix or data frame and retrieve specific vectors or other data subsets.

# take a look at the built-in letters constant
?letters

# variable with a 4x2 matrix of sequential letters starting with a
my_mat <- matrix(letters[1:8], ncol = 2)

# retrieve the first element of my_mat
my_mat[1, 1]

# retrieve the entire third row of my_mat
my_mat[3,]

# retrieve the entire second column of my_mat
my_mat[,2]

# retrieve the subset of my_mat that is the second (bottom) half
my_mat[3:4, 1:2]

# variable with a 4x4 data frame
my_DF <- data.frame(
	chars = letters[1:4], 
	ints = 1:4, 
	logics = c(TRUE, FALSE, TRUE, TRUE), 
	nums = seq(from = 0.1, to = 1, length.out = 4)
)

# retrieve the top-left 2x2 subset of my_DF
my_DF[1:2, 1:2]

# retrieve the second column using indexing
my_DF[,2]

# retrieve the second column using the $ operator and column name
my_DF$ints

Tip!

If a line of code becomes too long and shifts (wraps around) to the next line, it is a good idea to break it into appropriate code pieces on separate lines. For example, in the previous data.frame function call for my_DF we wrote each argument on a separate line.

Coding Challenge

How are the values in the sequence of decimals in the following seq function call calculated?

seq(from = 0.1, to = 1, length.out = 4)

Solution

From looking at the documentation for seq using the ? operator:

## Default S3 method:
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
   length.out = NULL, along.with = NULL, ...)
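Working through the default value of the by argument for our call gives a step size of (1 - 0.1)/(4 - 1) = 0.3. The sketch below repeats that arithmetic in R with hypothetical variable names and rebuilds the sequence by hand for comparison.

```r
from <- 0.1
to <- 1
length_out <- 4

# step size taken from the default value of the by argument
step_size <- (to - from) / (length_out - 1)
step_size  # 0.3

# building the same sequence by hand from the step size
manual <- from + (0:(length_out - 1)) * step_size
manual     # 0.1 0.4 0.7 1.0

# compare with seq, allowing for tiny floating point differences
all.equal(manual, seq(from = 0.1, to = 1, length.out = 4))
```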

Logic & Control Statements in R

Recall that we can combine boolean expressions with control statements to specify how programs will complete a task. Control statements allow you to have flexible outcomes by selecting which pieces of code are executed, or not.

The three primary types of control statements are:

  • Sequential statements
  • Iterative statements
  • Conditional statements

The most common control structure is the sequential statement: lines of code written one after another and executed line by line.

Coding Challenge - Sequential Statements

Write R code for the following sequential statements:

Pseudocode

  1. Assign x the character value of “hello”
  2. Print the value of x

Solution

R Code

x <- "hello"
print(x)
[1] "hello"

Iterative statements allow you to execute the same piece of code a specified number of times, or until a condition is reached. The most common iterative statements are defined using either FOR or WHILE loops.

Coding Challenge - Iterative Statements Part 1

Write R code for the following FOR loop output:

Pseudocode

  1. For each value in the sequence a, b, c, d
    • Assign x the current value
    • print the value of x

Solution

R Code

for (x in letters[1:4]) {
  print(x)
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
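A loop body can also update a variable on each pass instead of only printing. The following is a minimal sketch, with hypothetical variable names, that keeps a running total over a sequence of numbers.

```r
# running total of the values 1 through 4
total <- 0
for (value in 1:4) {
  # add the current value to the total on each pass of the loop
  total <- total + value
}
print(total)  # [1] 10
```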

WHILE loops are another type of iterative statement that can be used as a control structure in your code. This type of iterative statement will continue to execute a piece of code until a condition is reached.

Coding Challenge - Iterative Statements Part 2

Write R code for the following WHILE loop output:

Pseudocode

  1. Assign x the value of 3
  2. While x is greater than 0
    • print the value of x
    • decrement the value of x by 1

Solution

R Code

x <- 3
while (x > 0) {
  print(x)
  x <- x - 1
}
[1] 3
[1] 2
[1] 1

The most common conditional statements are defined using combinations of the IF… THEN format.

The simplest form of conditional statement is the IF… THEN form.

Coding Challenge - Conditional Statements Part 1

Write R code for the following IF… THEN conditional statement output:

Pseudocode

  1. Assign x the value of “a”
  2. If x is equal to “a”, then print the value of x

Solution

R Code

x <- "a"
if (x == "a") {
  print(x)
}
[1] "a"

The next type of conditional statement adds another level of complexity with the IF… THEN… ELSE format.

Coding Challenge - Conditional Statements Part 2

Write R code for the following IF… THEN… ELSE conditional statement output:

Pseudocode

  1. Assign x the value of “b”
  2. If x is equal to “a”, then print the value of x
  3. Else print “x is not equal to the character ‘a’”

Solution

R Code

x <- "b"
if (x == "a") {
  print(x)
} else {
  print("x is not equal to the character 'a'")
}
[1] "x is not equal to the character 'a'"

A more advanced type of conditional statement combines multiple IF… THEN… ELSE statements to make a compound statement with many alternative outcomes.

Coding Challenge - Conditional Statements Part 3

Write R code for the following compound IF… THEN… ELSE conditional statement output:

Pseudocode

  1. Assign x the value of “c”
  2. If x is equal to “a”, then print “x is equal to ‘a’”
  3. Else if x is not equal to “c”, then print “x is not equal to ‘a’ or ‘c’”
  4. Else if x is equal to “c”, then print “x is equal to ‘c’”

Solution

R Code

x <- 'c'
if (x == 'a') {
  print("x is equal to 'a'")
} else if (x != 'c') {
  print("x is not equal to 'a' or 'c'")
} else if (x == 'c') {
  print("x is equal to 'c'")
}
[1] "x is equal to 'c'"

Advanced Concept

An even more advanced concept, nested IF… THEN… ELSE statements can increase the flexibility of your code by allowing you to specify more complex conditions.

Figure: Nested IF... THEN... ELSE Statements

Advanced Coding Challenge

If you are looking for an additional challenge, write R code for the following nested IF… THEN… ELSE statement:

Pseudocode

  1. Assign x the value of 8
  2. If x is less than 1, then
    • If x is equal to ‘c’, then print “x is less than 1 and equal to ‘c’”
    • Else print “x is less than 1”
  3. Else print “x is greater than 1”

Solution

R Code

x <- 8
if (x < 1) {
  if (x == 'c') {
    print("x is less than 1 and equal to 'c'")
  } else {
    print("x is less than 1")
  }
} else {
  print("x is greater than 1")
}
[1] "x is greater than 1"

Key Points

  • Understanding the syntax of a programming language is crucial to writing error free code.

  • Use the ? symbol to examine the description of R functions.

  • Search the internet for further information about R functions.

  • Copy and paste!


Break

Overview

Teaching: 0 min
Exercises: 10 min
Questions
  • Take a break!

Objectives
  • Take a break!

Key Points

  • Take a break!


BASH Fundamentals

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What is the BASH command language?

  • How do I write code in the BASH command language?

  • What are the components and features of BASH?

  • How can I write and run BASH code?

Objectives
  • Become familiar with the syntax and common functions of the BASH language.

  • Become comfortable with working in the terminal.

  • Extend knowledge of R to learn about complementary programs used in the Unix/Linux terminal.

  • Practice writing BASH code to perform basic operations.

  • Discover important similarities and differences between R and BASH programming.

The BASH Programming Language

BASH (Bourne Again SHell) is an sh-compatible shell and command language. A shell is a program through which a user communicates with the operating system or a software program (application).

BASH is the default shell on most Linux operating system installations, and its wide distribution with Linux and Unix systems makes it an important tool to know.

Figure: BASH Programming Language

The BASH language is used to communicate with the interpreter component of a computer system. The interpreter executes program code commands read from the standard input (e.g., terminal) or from a file. BASH script files end with the .sh extension, in contrast to the .R or .r extension of R scripts.

The Utility and Components of BASH

To many beginning programmers, BASH can appear intimidating, which can make it difficult to get started with BASH programming. But there are just a few components of BASH that we need to know to understand how BASH integrates with the computer system.

Figure: Components of the Computer with Shell

The primary components of BASH include:

BASH vs RStudio

So, we can see that there are some important similarities and differences between BASH and RStudio components. These include:

Discussion

What are some other similarities and differences between BASH and RStudio?

Tip!

Notice that there is a Terminal tab in the RStudio window with the Console component. This allows you to run BASH commands in the RStudio interface, which also provides a convenient location to write and edit BASH script files.

Terminal in RStudio Image source

BASH Programming Language Syntax

Remember that the syntax of a programming language defines the meaning of specific combinations of words and symbols. This is why we call programming coding. Each programming language uses different combinations of words and symbols to get the computer to follow the instructions specified in your code.

Variables & Data Types

Similar to R, in the BASH programming language a combination of letters and symbols is used to give names (variables) to the data you are actively using in the memory of your computer system. However, in contrast to many programming languages, you do not have to declare (set) the data type of variables. That is, BASH variables are untyped and, in essence, character strings.

We use the = operator in BASH to initialize a variable and assign it a value. Again, this means that the variable is a name tag that points to a specific piece of data in the memory of the computer system. This is in contrast to the <- assignment operator that we typically use to assign value to variables in the R programming language.

# here is an integer value
8

# here is a variable with an assigned value of 8
my_value <- 8
my_value=8

Discussion

What happens when you enter the following BASH code in the command line?

8

And what happens when you enter this piece of BASH code in the command line?

my_value = 8

Finally, what happens when you enter this piece of BASH code in the command line?

# this is a comment

Checklist

Note that there are some features common to how we format and initialize variables in BASH:

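  • variable names are case sensitive; by convention upper case is used for environment variables and constants, while lower case is common for ordinary variables like the ones in this lesson
  • do not put spaces on either side of the = operator
  • variable names can have letters, numbers, or underscores, but cannot begin with a number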
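
The short sketch below puts these formatting rules together. The variable names and values are made up for illustration, and the $ symbol used to read the values back is described in the paragraphs that follow.

```bash
#!/bin/bash

# no spaces on either side of the = operator
sample_count=8

# use quotes around values that contain spaces
sample_name="control group"

# names may include letters, numbers, and underscores
run_2_label="second_run"

# combine the stored values into one message and print it
summary="$sample_name has $sample_count samples ($run_2_label)"
echo "$summary"
```

Running this prints control group has 8 samples (second_run).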

In the R language the = operator can also be used for assignment, but the <- operator is the conventional choice and = is mostly used to give values to arguments in function calls.

Another difference is how we refer to a variable after it has been created. To use a variable in R we simply need to call it by its name. However, for BASH variables we need to prepend the $ symbol to the name of the variable that we have initialized to refer to it and similarly “call it by its name”. Before BASH interprets (runs) each line of code entered in the command line or shell script, it first checks to see if any variable names are present by looking for the $ symbol, and replaces each one with its value.

# assign a variable a value of 1
my_value <- 1

# print the value stored in the previous variable
print(my_value)
my_value=1; echo $my_value

Discussion

What happens if you enter the following BASH code into the terminal’s command line?

my_value=1
echo $my_value

Now, what happens when you open a new terminal or tab (environment) and enter the following code?

echo $my_value

Since variables in BASH are essentially character strings, how can we perform mathematical operations? Well we can use functions in BASH to give context and perform arithmetic operations and comparisons on variables.

The let function in BASH allows you to perform arithmetic operations using the following operator symbols:

Recall that in the R programming language we have access to the following arithmetic operators:

As a first step to learning how to perform arithmetic in BASH, we should check out the documentation for the let function.

To find the documentation for functions in BASH we can search the internet for that function’s manual. So, to find the let documentation we will search “let manual bash”. The top search result has a description of the syntax and purpose of the let function.

Now let’s try an example comparison of arithmetic operations in R and BASH.

# addition using two different variables
my_value_1 <- 5
my_value_2 <- 10
my_result <- my_value_1 + my_value_2
my_value1=5; my_value2=10; let "my_result=$my_value1 + $my_value2"; echo $my_result
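
Beyond addition, let can also evaluate other integer arithmetic. The following is a minimal sketch with made-up variable names. It assumes whole-number values, since let does not handle decimals, and it writes the variable names inside the quotes without the $ symbol, which let also accepts.

```bash
#!/bin/bash

first_val=17
second_val=5

# let works with whole numbers (integers) only
let "difference = first_val - second_val"
let "product = first_val * second_val"
# integer division drops the remainder
let "quotient = first_val / second_val"
# the % (modulo) operator gives the remainder
let "remainder = first_val % second_val"

echo "$difference $product $quotient $remainder"
```

This prints 12 85 3 2 on a single line.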

Discussion

You may have noticed a few interesting details about the formatting of the BASH code above, particularly in contrast to the R code. Let’s discuss some of those differences.

A couple of motivating questions:

  1. Why do we prepend the $ operator to the my_result variable only in the echo function?
  2. Why do we include the ; symbol at the end of each line (piece) of code?

BASH Commands - Printing & Arrays

Again, one of the most fundamental functions in any programming language is one that allows you to print data to the screen. The most common command to print outputs in BASH is named echo.

Tip!

Note that we call the BASH functions that we enter into the command line commands. Functions more specifically refer to the code definition that underlies the command being used to call the function.

After searching the internet for “echo manual bash”, we can see that this function has the following syntax:

echo [options]... [String]...

Checklist

We now see that the syntax for calling (running) a function in BASH has the following features:

  • function name
  • white space
  • options (arguments)
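
As a small sketch of how options change what a command does, the example below tries two options of the echo command that is built into BASH, -n and -e. Other shells may treat these options differently, so this assumes the code is run with BASH.

```bash
#!/bin/bash

# no options: print the string followed by a new line
echo "cool cool cool"

# -n option: do not print the new line at the end
echo -n "cool "
echo "beans"

# -e option: interpret backslash escapes such as \n (new line)
echo -e "first line\nsecond line"
```

Here the two echo commands in the middle print cool beans together on one line, and the last command prints its text on two separate lines.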

Now let’s take a look at these methods for printing data to the screen in both R and BASH in action.

# print a character object to the screen
print("cool cool cool")
echo "cool cool cool"

Recall that in the R programming language vectors, matrices, and data frames are the named storage objects that contain 1D and 2D collections of data. In BASH we can use arrays to create similar 1D collections of data. There are two types of arrays in the BASH language: indexed arrays and associative arrays.

Tip!

As of the Catalina version of macOS, Apple has adopted Z shell as the default shell in place of BASH. There are a few differences between BASH and Z shell, many of which are centered on the user interface.

But there is an important difference with array indexing between BASH and Z shell. In the BASH language arrays start at the integer 0, whereas in Z shell array indexing begins with the integer 1.

BASH vs Z Shell Image source

Similar to R, we can easily create indexed 1D arrays using shorthand, without using an explicit function call (command).

# variable with an assigned value of a 1D vector object
my_vector <- 5:10

# view the data contents of my_vector using the print function
print(my_vector)

# short hand way to view the data contents of my_vector
my_vector

# access the second element of the vector stored in my_vector
my_vector[2]
my_array=(5 4 3 2 1); echo "${my_array[@]}"

Discussion

Why do we need to use the echo BASH command to print the contents of a variable to the screen?
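
Because indexing in BASH begins at 0, accessing single elements of an array looks a little different than in R. A minimal sketch, using the same example values as above:

```bash
#!/bin/bash

my_array=(5 4 3 2 1)

# BASH starts counting at 0, so index 0 is the first element
echo "first element: ${my_array[0]}"

# index 1 is the second element
echo "second element: ${my_array[1]}"

# @ stands for all of the elements, and # counts them
echo "all elements: ${my_array[@]}"
echo "number of elements: ${#my_array[@]}"
```

Note the curly braces around the array name and index. Without them, BASH reads only the array name after the $ symbol and treats the square brackets as plain text.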

The simple shorthand form of creating arrays in BASH is very convenient. More powerfully, we can use the declare command to create both indexed and associative arrays in BASH.

First, we will create an indexed array in both the R and BASH languages:

# vector of character values
my_list <- c("first", "second")

# view the second element of the vector
print(my_list[2])
declare -a my_indexed_array; my_indexed_array[1]="first"; my_indexed_array[2]="second"; echo "${my_indexed_array[2]}"

Now, to create an associative array in R and BASH:

# list of values with different data types
my_list <- list(cat = "Meow", dog = "Woof")

# view the contents of the list variable
print(my_list$dog)
declare -A my_assoc_array; my_assoc_array[cat]="Meow"; my_assoc_array[dog]="Woof"; echo "${my_assoc_array[dog]}"

Discussion

What happens when you enter the following BASH code into the command line?

declare -A my_assoc_array; my_assoc_array[cat]="Meow"; my_assoc_array[dog]="Woof"; echo "${my_assoc_array[2]}"

And what happens when you enter the following R code in the RStudio console?

# list of values with different data types
my_list <- list(cat = "Meow", dog = "Woof")

# view the contents of the list variable
print(my_list[2])

Advanced Coding Challenge

Note that it is not possible to create multi-dimensional arrays, such as 2D arrays in the BASH language. But it is possible to basically simulate a multi-dimensional collection of data using associative arrays, for example.

Simulating 2D Arrays in BASH Image source

Try creating your own 2D array in the BASH command language!
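
One possible approach, sketched below, is to build each key of an associative array from a row number and a column number separated by a comma. The array name my_grid and its values are made up for illustration, and the sketch assumes BASH version 4 or newer, which is needed for associative arrays.

```bash
#!/bin/bash

# associative array whose keys combine a row and a column number
declare -A my_grid
my_grid[0,0]="a"
my_grid[0,1]="b"
my_grid[1,0]="c"
my_grid[1,1]="d"

# look up the value stored at row 1, column 0
echo "${my_grid[1,0]}"

# print the simulated 2D array one row at a time
for row in 0 1
do
  echo "${my_grid[$row,0]} ${my_grid[$row,1]}"
done
```

BASH treats each key, such as 1,0, as a single character string, so the rows and columns exist only in how we choose to name the keys.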

So, we can use functions and evaluate mathematical expressions in BASH like we have done using the R programming language in RStudio. But our experience coding while using the BASH terminal and command line so far has not been nearly as easy and streamlined as when using RStudio. For example, we have to write code in the restrictive and clunky terminal user interface.

Key Points

  • BASH and R share a lot of the same basic functionalities.

  • Use the -h flag to examine the description of some BASH commands.

  • Search the internet for further information about BASH commands.

  • Copy and paste!


R & BASH Scripting

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • Why is scripting useful?

  • How can I write and run R and BASH scripts in RStudio and the terminal?

  • How can I use scripts to automate my data analysis process?

  • What is the best way to tackle coding errors?

Objectives
  • Discover some reasons why scripting is helpful for data analysis.

  • Learn how to run R scripts using BASH scripts.

  • Learn techniques for approaching coding errors.

Introduction to Scripting

Not only can we save the code we have written by using BASH and R scripts, but we can also use scripting to create modular pieces of code for use in data analysis. This is particularly helpful for automating repetitive tasks in data analysis pipelines.

It is also possible to have scripts receive user inputs (arguments), just like the built-in and user-defined functions we have been using in R and BASH. This is great for making your code more generalizable and able to be run on a wider variety of data sets, or even allow users to specify file paths for data on different computer systems.
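
As a minimal sketch of a script that receives arguments, consider the following code saved in a file with the hypothetical name my_args_script.sh. The file names, variable names, and fallback values are assumptions made for illustration only.

```bash
#!/bin/bash

# $1 and $2 hold the first and second arguments given to the script;
# the :- symbols supply a fallback value when an argument is missing
data_file=${1:-"example_data.csv"}
num_rows=${2:-5}

# build and print a message using the arguments
message="Reading the first $num_rows rows of $data_file"
echo "$message"
```

Running bash my_args_script.sh my_data.csv 10 would print Reading the first 10 rows of my_data.csv, while running the script without arguments falls back to the example values.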

R & BASH Scripting

We can use BASH scripting to make the process of coding with BASH a bit simpler. BASH scripts are text files that have the .sh file extension. These are text files that you can use to save the lines of BASH code that you want the interpreter component of the computer operating system to execute (run).

The Interpreter Operating System Component Image source

There are several great text editors available for creating and editing code in a huge variety of programming languages. Just a few popular options:

Tip!

There are a couple of extremely useful features of RStudio that are helpful for working with BASH:

  • the source component of RStudio is essentially a text editor that you can use to create and edit any type of text file
  • the console component has a terminal tab, which gives you access to the BASH command line

R & BASH Function Definitions

A powerful benefit of coding in BASH and R is the ability to create our own user-defined function definitions. This allows us to re-use a set of code statements arranged to perform a specific task.

In R a function is created (defined) by using the keyword function. The basic syntax of an R function definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
   # function body 
}

It is also possible to create user-defined functions in BASH using the following syntax:

function_name () {
  # function body
}

Checklist

Note that a function definition in R has the following components:

  • function name − the name of the function, which is stored in R environment as an R object with this name
  • function keyword - this is the tag word function that is always included before the parentheses with the list of arguments
  • arguments − an optional placeholder for when a function is called (invoked). If a function has arguments and they do not have default values, you need to give (pass) values to the arguments
  • function body − contains a collection of code statements that defines what the function does
  • return value − the last expression evaluated in the function body

Furthermore, the simplest form of a function definition in BASH has the following components:

  • function name − the name of the function, which is the name you enter to call (run) the function in the terminal or a script
  • function body − contains a collection of code statements that defines what the function does

Challenge

How can you view the documentation for the function R function in RStudio?

Solution

While typing in ?function a message will pop up suggesting relevant R functions. While hovering your mouse over the function R function in the pop up, press F1.

RStudio function Help Tip

Let’s practice making our own user-defined functions in both the R and BASH languages. As a first step, we will make some functions without arguments.

# definition of a function named my_function
my_function <- function() {
  print("hello")
  print("yellow")
}

# run the function by calling it by its name
my_function()
#!/bin/bash

# definition of a function named my_function
my_function () {
  echo "hello"
  echo "yellow"
}

# run the function by calling it by its name
my_function

Tip!

Notice that the first line of the above BASH script is the following code:

#!/bin/bash

This piece of code is called the shebang, and it should be the first line of a BASH script. The shebang is a specific sequence of symbols and characters that tells the operating system to use the BASH interpreter to read (parse) and execute the code in the rest of the file line by line.

Coding Challenge

What is the primary difference between the following definitions of my_function and my_better_function, and why is it important?

# definition of a function named my_function,
# which assigns values to two variables and adds them
my_function <- function() {
	first_val <- 2
	second_val <- 4
	result <- first_val + second_val
}

# run the function by calling it by its name
my_function()

# function with an extra final line of code added to the function body
my_better_function <- function() {
	first_val <- 2
	second_val <- 4
	result <- first_val + second_val
	result
}

# run the function by calling it by its name
my_better_function()

Next, let’s make some functions that require the input of arguments when they are run (called).

Tip!

Note that you can use the $(()) symbols as a shorthand way to perform arithmetic expansion in the BASH language, which allows you to easily evaluate mathematical expressions.

# function to add two variables using arguments
my_add_function <- function(first_arg, second_arg) {
	first_val <- first_arg
	second_val <- second_arg
	result <- first_val + second_val
	result
}

# run the function by calling it by its name
# THIS WILL result IN AN ERROR
my_add_function()

# run the function by passing the function call the necessary arguments
my_add_function(2, 4)

# function to add two variables using default arguments
my_default_add_function <- function(first_arg = 100, second_arg = -100) {
	first_val <- first_arg
	second_val <- second_arg
	result <- first_val + second_val
	result
}

# run the function by calling it by its name
my_default_add_function()

# run the function by passing the function call the necessary arguments
my_default_add_function(2, 4)
#!/bin/bash

# function to add two variables using arguments
my_add_function () {
	first_val=$1
	second_val=$2
	result=$((first_val + second_val))
	echo $result
}

# run the function by calling it by its name
# UNLIKE R, THIS WILL NOT RESULT IN AN ERROR: the missing arguments are empty and count as 0, so 0 is printed
my_add_function

# run the function by passing the function call the necessary arguments
my_add_function 2 4

# function to add two variables using default arguments
my_default_add_function () {
	first_val=${1:-100}
	second_val=${2:--100}
	result=$((first_val + second_val))
	echo $result
}

# run the function by calling it by its name
my_default_add_function

# run the function by passing the function call the necessary arguments
my_default_add_function 2 4
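
Unlike an R function, which returns the last expression evaluated, the BASH functions above print their result with echo. One way to reuse that printed result is command substitution with the $() symbols, which is not used elsewhere in this lesson. A minimal sketch with a hypothetical function named my_sum_function:

```bash
#!/bin/bash

# function that prints the sum of its two arguments
my_sum_function () {
  echo $(($1 + $2))
}

# capture the printed result in a variable using $()
my_sum=$(my_sum_function 2 4)

# reuse the captured value in another calculation
my_doubled_sum=$((my_sum * 2))
echo "sum: $my_sum, doubled: $my_doubled_sum"
```

This prints sum: 6, doubled: 12.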

The Scope of R and BASH Variables

Now let’s try an interesting example that illustrates the differences between the scope of variables in the R and BASH environments.

# assign values to environment variables
var1 <- "A"
var2 <- "B"

# define an R function
my_function <- function() {
  # assign values to function variables
  var1 <- "C"
  var2 <- "D"
  # concatenate and print strings with the values of the variables inside the function
  cat("Inside my_function: var1:", var1, ", var2:", var2, "\n")
}

# concatenate and print strings with the values of the variables before the function is run
cat("Before running my_function: var1:", var1, ", var2:", var2, "\n")

# call the function
my_function()

# concatenate and print strings with the values of the variables after the function is run
cat("After running my_function: var1:", var1, ", var2:", var2, "\n")
#!/bin/bash

# assign values to global variables
var1='A'
var2='B'

# declare a BASH function
my_function () {
  # assign values to function variables
  local var1='C'
  var2='D'
  # print strings with the values of the variables inside the function
  echo "Inside my_function: var1: $var1, var2: $var2"
}

# print strings with the values of the variables before the function is run
echo "Before running my_function: var1: $var1, var2: $var2"

# call the function
my_function

# print strings with the values of the variables after the function is run
echo "After running my_function: var1: $var1, var2: $var2"

Discussion

Because specific combinations of words and symbols have different meanings (syntax), the formatting of a user-defined function in R typically has several common features. What are some of these formatting features?

Solution

Some of the typical formatting features of an R function include:

  • assignment operator <-
  • function tag word
  • parentheses
  • curly braces
  • commas between any arguments
  • indentation of function body

What about BASH user-defined function definition formatting?

Solution

Some of the typical formatting features of a BASH function include:

  • parentheses
  • curly braces
  • indentation of function body

It is also important to note that the way functions are called, and so the format of commands, is significantly different between the R and BASH languages. What are these differences?

Solution

The main differences between R and BASH commands are:

  • parentheses
  • argument delimiter (comma vs space)

Coding Challenge

Now write and run your own user-defined R and BASH functions using scripts! Try using some of the other built-in functions we have learned about so far in the body of the function you create.

Hint: Remember that in R the last line of the function body is what gets returned when the function is executed (run). Also, there are several formatting differences between R and BASH function definitions and function calls (commands).

Using BASH Scripts to Run R Scripts

A powerful benefit of BASH scripting is the ability to run other scripts called within the .sh file.

For example, let’s create a simple test R script named my_RScript.r with the following code contents:

# print a message to the screen
print("Hello from my_RScript.r script!")

Now we can use the following BASH script to execute the R script we just made. We’ll name this BASH script file my_BASHScript_first.sh.

#!/bin/bash

# run my_RScript.r script
Rscript my_RScript.r

# print a message to the screen
echo "Hello from my_BASHScript_first.sh script!"

We can also use a different BASH script to run our previous BASH script. We’ll name this BASH script file my_BASHScript_last.sh.

#!/bin/bash

# run my_BASHScript_first.sh script
bash my_BASHScript_first.sh

# print a message to the screen
echo "Hello from my_BASHScript_last.sh script!"
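
When one script depends on another program or file, it can help to check that they are available before trying to use them. The sketch below is one possible way to do this, assuming the my_RScript.r file from above. It uses the command -v command to test whether Rscript can be found and the -f test to check that the file exists. The status variable is made up for illustration.

```bash
#!/bin/bash

r_script="my_RScript.r"

# only run the R script if Rscript is installed and the file exists
if command -v Rscript > /dev/null && [ -f "$r_script" ]
then
  Rscript "$r_script"
  status="ran"
else
  echo "Skipping: Rscript or $r_script was not found"
  status="skipped"
fi

echo "Status: $status"
```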

How to Find and Fix Bugs

While writing code it is very common to encounter errors that prevent your code from running (executing) in the expected manner. These errors are often the result of bugs, or flaws in your code.

How to Approach Debugging Image source

The first step anytime you are trying to solve an error is to find the bug, which is the source of the error. To see an error in action, let’s try to define a function that uses incompatible data types to perform a mathematical operation.

# definition of a function to add incompatible data types
my_function <- function() {
	first_val <- 2
	second_val <- "4"
	result <- first_val + second_val
	result
}

# run the function by calling it by its name
my_function()

This results in the following message being output (returned) to the screen (console):

Error in first_val + second_val : non-numeric argument to binary operator

But from this message we cannot tell exactly which argument has the problematic non-numeric value. Let’s use the print function to find the exact source of the error.

# definition of a function to add incompatible data types
my_function <- function(first_arg, second_arg) {
	first_val <- first_arg
	second_val <- second_arg
	# added print statement to look at the value of each argument
	print(first_val)
	print(second_val)
	result <- first_val + second_val
	result
}

# run the function with an intentional error
my_function(2, "4")

So, it is important to take error messages with a grain of salt. Instead of worrying or feeling overwhelmed when you receive a bunch of incoherent error messages, tackle the problems one at a time and step-by-step.
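
The same approach of printing values while a function runs carries over to BASH. A minimal sketch, assuming a hypothetical function named my_debug_function. The added messages are sent to the standard error stream with >&2 so that they do not get mixed in with the result of the function.

```bash
#!/bin/bash

# hypothetical function with added echo statements for debugging
my_debug_function () {
  first_val=$1
  second_val=$2
  # added echo statements to look at the value of each argument
  echo "first_val: $first_val" >&2
  echo "second_val: $second_val" >&2
  result=$((first_val + second_val))
  echo $result
}

# run the function with two arguments
my_debug_function 2 4
```

The two debugging messages are printed first, followed by the result of 6.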

Coding Challenge

What is another function we can use to find the exact source of the error?

Hint: Use the internet to search “r view data type”, for example.

Solution

# definition of a function to add incompatible data types
my_function <- function(first_arg, second_arg) {
	first_val <- first_arg
	second_val <- second_arg
	# added print statements to look at the data type of each argument
	print(typeof(first_val))
	print(typeof(second_val))
	result <- first_val + second_val
	result
}

# run the function with an intentional error
my_function(2, "4")

It is crucial to look for the first bug in your code when you are trying to find the source of an error. In general, you need to look for bugs starting at the top and work your way to the bottom of your code.

Discussion

Why is it so important to look for the earliest bug that appears in your code to fix first?

Extending Logic & Control Statements to BASH

Recall that we can combine boolean expressions with control statements to specify how programs will complete a task. Control statements allow you to have flexible outcomes by selecting which pieces of code are executed, or not.
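
In BASH, comparisons of whole numbers inside square brackets are written with short flags rather than symbols such as < and >. The sketch below lists these flags and tries two of them with made-up values.

```bash
#!/bin/bash

x=7

# -eq  equal to         -ne  not equal to
# -lt  less than        -le  less than or equal to
# -gt  greater than     -ge  greater than or equal to
if [ $x -ge 7 ]
then
  echo "x is greater than or equal to 7"
fi

# the spaces just inside the square brackets are required
if [ $x -ne 6 ]
then
  echo "x is not equal to 6"
fi
```

Both messages are printed, since both comparisons are true when x has the value 7.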

The three primary types of control statements are sequential, iterative, and conditional statements.

The most common control structure is the sequential statement: lines of code written one after another and executed line by line.

Coding Challenge - Sequential Statements

Write BASH code for the following sequential statements:

Pseudocode

  1. Assign x the value of 6
  2. Print the value of x

Hint!

x <- 6
print(x)

Solution

#!/bin/bash

x=6
echo $x
6

Iterative statements allow you to execute the same piece of code a specified number of times, or until a condition is reached. The most common iterative statements are defined using either FOR or WHILE loops. Let’s start by looking at a flow diagram for a FOR loop, which depicts the flow of information from inputs to outputs.

Coding Challenge - Iterative Statements Part 1

Write BASH code for the following FOR loop output:

Pseudocode

  1. For each value in the sequence 1, 2, 3, 4, 5
    • Assign x the current value
    • print the value of x

Hint!

for (x in 1:5) {
  print(x)
}

Solution

#!/bin/bash

for x in {1..5}
do
  echo $x
done
1
2
3
4
5
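
A FOR loop can also step through the elements of an array. The following is a small sketch that uses made-up values and adds the elements up along the way.

```bash
#!/bin/bash

my_array=(5 4 3 2 1)
total=0

# visit each element of the array in turn
for value in "${my_array[@]}"
do
  echo $value
  # add the current element to the running total
  total=$((total + value))
done

echo "total: $total"
```

The loop prints each of the five values on its own line, and the last line printed is total: 15.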

WHILE loops are another type of iterative statement that can be used as a control structure in your code. This type of iterative statement will continue to execute a piece of code until a condition is reached.

Coding Challenge - Iterative Statements Part 2

Write BASH code for the following WHILE loop output:

Pseudocode

  1. Assign x the value of 1
  2. While x is less than 3
    • print the value of x
    • increment the value of x by 1

Hint!

x <- 1
while (x < 3) {
  print(x)
  x <- x + 1
}

Solution

#!/bin/bash

x=1
while [ $x -lt 3 ]
do
  echo $x
  x=$((x + 1))
done
1
2

The most common conditional statements are defined using combinations of the IF… THEN format.

The simplest form of conditional statement is the IF… THEN form.

Coding Challenge - Conditional Statements Part 1

Write BASH code for the following IF… THEN conditional statement output:

Pseudocode

  1. Assign x the value of 7
  2. If x is greater than 6, then print the value of x

Hint!

x <- 7
if (x > 6) {
  print(x)
}

Solution

#!/bin/bash

x=7
if [ $x -gt 6 ]
then
  echo $x
fi
7

The next type of conditional statement adds a level of complexity with the IF… THEN… ELSE format.

Coding Challenge - Conditional Statements Part 2

Write BASH code for the following IF… THEN… ELSE conditional statement output:

Pseudocode

  1. Assign x the value of 7
  2. If x is less than 6, then print the value of x
  3. Else print “x is greater than or equal to 6”

Hint!

x <- 7
if (x < 6) {
  print(x)
} else {
	print("x is greater than or equal to 6")
}

Solution

#!/bin/bash

x=7
if [ $x -lt 6 ]
then
  echo $x
else
  echo "x is greater than or equal to 6"
fi
x is greater than or equal to 6

A more advanced type of conditional statement combines multiple IF… THEN… ELSE statements to make a compound statement with many alternative outcomes.

Coding Challenge - Conditional Statements Part 3

Write BASH code for the following compound IF… THEN… ELSE conditional statement output:

Pseudocode

  1. Assign x the value of 7
  2. If x is equal to 6, then print “x is equal to 6”
  3. Else if x is greater than 6, then print “x is greater than 6”
  4. Else if x is less than 6, then print “x is less than 6”

Hint!

x <- 7
if (x == 6) {
  print("x is equal to 6")
} else if (x > 6) {
	print("x is greater than 6")
} else if (x < 6) {
	print("x is less than 6")
}

Solution

#!/bin/bash

x=7
if [ $x -eq 6 ]
then
  echo "x is equal to 6"
elif [ $x -gt 6 ]
then
  echo "x is greater than 6"
elif [ $x -lt 6 ]
then
  echo "x is less than 6"
fi
x is greater than 6

Advanced Concept

An even more advanced concept, nested IF… THEN… ELSE statements can increase the flexibility of your code by allowing you to specify more complex conditions.

Advanced Challenge

If you are looking for an additional challenge, write BASH code for the following nested IF… THEN… ELSE statement:

Pseudocode

  1. Assign x the value of 4
  2. If x is greater than 4, then check if x is equal to 6
    • If x is equal to 6, then print “x is equal to 6”
    • Else print “x is greater than 4”
  3. Else print “x is less than or equal to 4”

Hint!

x <- 4
if (x > 4) {
  if (x == 6) {
    print("x is equal to 6")
  } else {
    print("x is greater than 4")
  }
} else {
  print("x is less than or equal to 4")
}

Solution

#!/bin/bash

x=4
if [ $x -gt 4 ]
then
  if [ $x -eq 6 ]
  then
    echo "x is equal to 6"
  else
    echo "x is greater than 4"
  fi
else
  echo "x is less than or equal to 4"
fi
x is less than or equal to 4

Key Points

  • Make small changes and plan for mistakes.

  • Feel free to use RStudio to create, edit, and run BASH scripts.

  • Copy and paste!


Supplemental - Best Practices

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What are the benefits of writing programs?

  • What are the most helpful programming techniques?

  • How can I get started with writing a program?

  • Why is it important to leave notes in my code about what it does?

Objectives
  • Be able to write pseudocode to describe the steps of a program in a plain language.

  • Become familiar with methods for writing modular and understandable programs.

  • Be able to break down an overly complex piece of code into smaller, more readily understandable components.

  • Understand why it is important to write meaningful comments and documentation.

Introduction

In this section we will learn about some of the common best practices in programming, which are easy to implement into your programming process. We will also explore approaches to solving problems and where to begin with designing algorithms.

How to Be a Good Programmer

The development of custom software programs has become increasingly necessary in biological research. Scientists are often required to create their own programs to analyze data and create publishable results. It is therefore very important that we consider techniques for improving the reproducibility and reliability of code.

Good Programmer Characteristics Image source

There are several things to keep in mind while working through your programming process. These are techniques that will help you to solve complex problems, while avoiding common pitfalls.

Checklist

These are programming techniques that have been found to be helpful in a variety of research settings.

  • Use programs to accomplish complex or repetitive tasks
  • Write programs that can be understood by others
  • Take the time to plan how you will write a program
  • Make small changes and plan for mistakes
  • Collaborate with others whenever possible
  • Always include informative documents for your programs and data
  • Carefully structure and track your raw and calculated data

By implementing the above programming techniques, you will be better prepared to create sets of code to analyze complex biological data sets.

Ways to Approach Programming Tasks

Throughout any programming undertaking we should be thinking about our problem solving thought process. This means that you will need to think critically about how you approach solving coding problems with programming. Often you will find that there are many routes to the same solution, and which route you take may depend on your intended user or available tools.

How to Approach Programming Image source

Checklist

These are steps you can take to approach solving a problem:

  1. Understand the problem
  2. Create a plan to solve the problem
  3. Implement the plan
  4. Reflect on the results

The first step for approaching problem solving requires us to break down the problem before we can begin creating a solution plan.

Checklist

There are a few techniques we can use to help break down a problem before coding:

  1. Determine the inputs
  2. Determine the outputs
  3. Test a simple example
  4. Test a complex example

Now, let’s put these steps into practice. Keep in mind that the number of steps a task or problem is broken into may depend on the skills of the intended user or available tools.

Challenge

Write an algorithm in pseudocode to complete the task of getting dressed for the day, while considering the:

  • Current weather
  • Clothes available to you

To keep things simple, assume that:

  • You are currently wearing pajamas
  • You will wear only one top and one bottom clothing item
  • You will be outside all day
  • The weather will not change

Solution

In order to determine how to write an algorithm for getting dressed for the day, we should consider the following steps to breaking down a problem.

First, determine the inputs

  1. Current weather
  2. Available clothing

Second, determine the outputs

  1. The clothes that you will be wearing for the day
  2. The order in which the clothing should be put on

Third, test a simple example by specifying sample inputs:

  1. The weather outside is cold
  2. You have access to a pair of pants and a shirt

Our simple algorithm might then be:

  1. Walk to where your clothes are kept
  2. Take off pajamas
  3. Take out the pants and shirt
  4. Put on pants
  5. Put on shirt

Fourth, test a complex example. Let’s try out a more complex example by generalizing the inputs:

  1. Assume you have a way to check the current weather
  2. Assume you have a closet with all types of clothing

Our more complex algorithm might then be:

  1. Check the weather
  2. Walk to where your clothes are kept
  3. Take off pajamas
  4. If the temperature is less than 75 degrees Fahrenheit, then
    • Put on pants
    • Put on shirt
  5. If the temperature is 75 degrees Fahrenheit or greater, then
    • Put on shorts
    • Put on tank top

Note that one way to generalize your algorithm is to use conditional statements, such as the “if” statements in the above example algorithm. Remember that conditional statements are used in programming to handle decisions, and they have two parts: hypothesis (if) and conclusion (then). So, the outcome of a conditional statement depends on the state of the inputs at that moment.
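The conditional step of the more complex algorithm can be sketched in BASH. This is only a minimal illustration: the function name choose_outfit is hypothetical, and treating exactly 75 degrees as warm weather is an assumption made here so that every temperature has an outcome.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print an outfit for a temperature
# given in degrees Fahrenheit
choose_outfit() {
  local temperature
  temperature="$1"

  # Hypothesis (if): is it cooler than 75 degrees?
  if [[ "${temperature}" -lt 75 ]]; then
    # Conclusion (then): dress for cool weather
    echo "pants and shirt"
  else
    # Assumption: exactly 75 degrees counts as warm
    echo "shorts and tank top"
  fi
}

choose_outfit 60
choose_outfit 90
```

The same function gives a different result for each call because the outcome depends only on the input it receives at that moment.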

After devising a plan for a solution to a problem or task, it is a good idea to stop and think carefully about the plan. This is particularly important for debugging and fixing any errors.

Checklist

Some questions you can ask yourself at this point include:

  • Is my solution comprehensive?
  • Did I make any mistakes?
  • How can errors or incorrect outputs arise?
  • What can I do next?

Considering our simple solution to the previous challenge of writing an algorithm for getting dressed, there remain other ways that the algorithm may be written to be more comprehensive. For example, what if the intended user or audience is a young child? Then it may be necessary to further break down the steps of the algorithm to meet the needs of the user.

Challenge

Let’s re-write step 4 (“Put on pants”) of the simple algorithm from our solution to the above challenge. Try to break down this part of the task into as many steps as possible.

Solution

Instead of leaving the step to “Put on pants”, we might break down the step as follows:

  1. Hold pants
  2. Open waistband
  3. Insert right leg into right leg hole of pants
  4. Pull right pant leg up so the right foot comes through it
  5. Insert left leg into left leg hole of pants
  6. Pull left pant leg up so the left foot comes through it
  7. Pull pants up from waistband

Commenting & Helpful Services

Small, meaningful comments throughout your code can be a great way to leave yourself and others helpful notes about the purpose of your code. This is particularly important when approaching a new coding challenge, or when you need to take a break. It is also helpful to leave frequent comments for code in programming languages you do not frequently write in.

Tips for Creating Meaningful Comments Image source

Checklist

A general rule of thumb when coding is to have comments at least every 5 lines. Some other tips to keep in mind when coding in R include:

  • Preferably write comments while you are coding, and not after
  • Each line of a comment should begin with the comment symbol # and a single space
  • Comments should explain the why and not the what, unless helpful for your future self
  • Use commented lines of multiple - and = to break up your file into easily readable chunks
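Although these tips are stated for R, BASH uses the same # comment symbol, so they can be sketched in a short BASH script. The data values and the cutoff of 100 below are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash

# ==============================================================
# Summarize sample read counts
# ==============================================================

# Hypothetical example data: one read count per sample
read_counts=(120 45 300)

# --------------------------------------------------------------
# Filtering
# --------------------------------------------------------------

# Keep only well-covered samples, since low counts give noisy
# estimates (the cutoff of 100 is an assumed, illustrative value)
min_count=100
kept=0
for count in "${read_counts[@]}"; do
  if [[ "${count}" -ge "${min_count}" ]]; then
    kept=$((kept + 1))
  fi
done

echo "Samples kept: ${kept}"
```

Note that the comment above the loop explains why samples are filtered, rather than restating what the loop does line by line.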

Creating comments for your code is most useful when you are describing why it does what it does. This gives your code context, which gives other developers (or your future self) more insights into the design decisions behind a piece of code.

Looking at the documentation is one of the best ways to find out or recall exactly what a piece of code is doing. Another great way to learn the meaning of different pieces of code is through a community website where people can ask coding questions using specific examples (e.g., StackOverflow and Biostars). There are also many wonderful websites and blogs with posts covering nearly any topic you could think of (e.g., codecademy, tutorials point, R-bloggers, R Weekly, and my own site Myscape).

A Note on Documentation

Writing comprehensive documentation about your code is a great way to convey important information about your software program and give your code further context. A common form of documentation is a README file in the directory of your code. This document is a description of the what, why, and how of the project for which the code was written.

README Documentation Practices Image source

Checklist

To help motivate writing documentation for your code, here are some questions you can ask yourself.

  • What was your motivation?
  • Why did you build this project?
  • What problem does it solve?
  • What did you learn?
  • What makes your project stand out?
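One possible way to act on these questions is to turn them into the section headings of a README skeleton. The BASH sketch below writes such a file to a temporary directory. The file location, the heading names, and the choice of Markdown are all assumptions made for illustration, not a required layout.

```shell
#!/usr/bin/env bash

# Hypothetical output location; a real project would use its own directory
readme_file="$(mktemp -d)/README.md"

# Write a README skeleton whose headings mirror the questions above
cat > "${readme_file}" <<'EOF'
# Project Title

## Motivation
What was your motivation? Why did you build this project?

## Problem Solved
What problem does it solve?

## Lessons Learned
What did you learn?

## Highlights
What makes your project stand out?
EOF

echo "Wrote ${readme_file}"
```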

Key Points

  • Use programs to accomplish complex or repetitive tasks.

  • Write programs that can be understood by others.

  • Take the time to plan how you will write a program.

  • Collaborate with others whenever possible.

  • Always attempt to write comments while you develop your code.

  • Always include informative documents for your programs.


Supplemental - Language Conventions

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Why is it important to consistently format my code?

  • What are some common guidelines for writing R and BASH code?

Objectives
  • Understand the importance of programming language conventions.

  • Review the syntax and common features of R and BASH.

The Importance of Formatting

It is important to format your code in a manner that is conducive to reading. While some coding languages have specific formatting requirements, formatting and commenting code are typically not required for a program to work. Even so, code should always be written in a consistent and logical format so that both you and others can read your program and quickly understand its purpose. This is easy to achieve by setting and following a few simple rules.

Good Program Formatting Image source

Checklist

By following common formatting conventions, you can begin writing executable code in most programming languages.

Many formatting rules are centered on the following:

  • Syntax
  • Indentation
  • White space
  • Capitalization
  • Naming conventions
  • Spelling of words (e.g., functions and variables)
  • Comments

The exact details of the formatting conventions you need to follow depend on the programming language in which you are writing your code.

Programming Language Conventions

There is a set of guidelines for every programming language that informs how code should be formatted and the meaning behind specific combinations of words and symbols. Recall that the exact details of the formatting conventions you need to follow depend on the programming language in which you are writing your code.

Discussion

Which is better to use when there are multiple words in a variable name?

  • underscores (e.g., my_value)
  • capitalization (e.g., myValue)

Naming Conventions

As we know, consistent file naming is important for properly managing your data. Using common naming conventions is also a good way to improve the readability of your code. This is important since it enables you and others to more readily understand the purpose of your code.

Some common file naming conventions:

File Naming Conventions Image source

Checklist

One of the best practices of programming is to consistently follow a convention when naming files, variables, functions, and anything else.

Some common R naming conventions:

  • variable and function names should be lowercase
  • use an underscore or capitalization (camel case) to separate words within a name
  • generally, variable names should be nouns and function names should be verbs
  • strive for names that are concise and meaningful
  • where possible, avoid using names of existing functions and variables

And some common BASH naming conventions:

  • function names should be lower-case, with underscores to separate words
  • anything exported to the environment (e.g., constants) should be capitalized, separated with underscores, and declared at the top of the file
  • source filenames should be lowercase, with underscores to separate words if desired
  • you can use readonly or declare -r to ensure that specific variables are read only
  • declare function-specific variables using the local keyword, and with the declaration and assignment on different lines
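The BASH naming conventions above can be sketched in a few lines. All of the names and values below (DATA_DIR, MAX_SAMPLES, count_samples) are hypothetical and exist only to show the conventions.

```shell
#!/usr/bin/env bash

# Exported constant: capitalized, separated with underscores,
# and declared at the top of the file (hypothetical name and value)
export DATA_DIR="/tmp/example_data"

# Read-only variable that later code cannot reassign
readonly MAX_SAMPLES=10

# Function name: lowercase, with underscores separating words
count_samples() {
  # Function-specific variable: declared with local first,
  # then assigned on a separate line
  local sample_count
  sample_count="$#"
  echo "${sample_count}"
}

count_samples sample_a sample_b sample_c
```

One reason for keeping the declaration and the assignment on different lines is that local returns its own exit status, so writing local name="$(some_command)" on one line hides whether some_command failed.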

Formatting Conventions

We now know that specific combinations of words and symbols have different meanings depending on the programming language. But did you also know that the formatting of the words and symbols can be important as well?

Checklist

The syntax of the R programming language has several components. Some of these include:

Spacing

  • Place spaces around all infix operators (e.g., =, +, -, <-). Exceptions to this rule: :, :: and ::: don’t need spaces around them
  • Place spaces around = in function calls
  • Always put a space after a comma, and never before
  • Place a space before left parentheses, except in a function call
  • More than one space in a row is ok if it improves alignment of equal signs or assignments (<-)
  • Do not place spaces around code in parentheses or square brackets, unless there is a comma

Curly Braces

  • An opening curly brace should never go on its own line and should always be followed by a new line
  • A closing curly brace should always go on its own line, unless it is followed by an else
  • Always indent the code inside curly braces
  • It is ok to leave very short statements on the same line

Line Length

  • Strive to limit your code to 80 characters per line, which fits comfortably on a printed page with a reasonably sized font
  • If you find yourself running out of room, you should try to encapsulate and subdivide some of the work in a separate function

Indentation

  • When indenting your code, use two spaces, unless a function definition runs over multiple lines. In that case, indent the second line to where the definition starts
  • Never use tabs or mix tabs and spaces

Assignment

  • Use <-, not =, for assignment

Checklist

The syntax of the BASH programming language has several components. Some of these include:

Indentation

  • Indent 2 spaces
  • No tabs
  • Use blank lines between blocks of code to improve readability

Line Length & Long Strings

  • Maximum line length is 80 characters

Pipelines

  • The entire pipeline of commands should be written on one line when possible
  • Commands should be split one per line, if all the commands do not fit on one line
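As a rough sketch of both pipeline layouts, the example below uses hypothetical data. Starting each continuation line with the pipe symbol is one common layout for split pipelines, and it is an assumption here rather than a rule stated above.

```shell
#!/usr/bin/env bash

# Hypothetical sample data: one species name per line
species_list=$'wolf\nbear\nwolf\nlynx\nwolf\nbear'

# Short pipeline: fits comfortably on one line
first_species="$(printf '%s\n' "${species_list}" | sort -u | head -n 1)"

# Longer pipeline: one command per line, with each pipe starting
# the continuation line
most_common="$(
  printf '%s\n' "${species_list}" \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n 1 \
    | awk '{ print $2 }'
)"

echo "First alphabetically: ${first_species}"
echo "Most common: ${most_common}"
```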

Loops

  • Put ; do and ; then on the same line as the while, for or if keywords
  • else should be on its own line
  • Closing statements should be on their own line, and vertically aligned with the opening statement
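The layout rules above for for, while, and if can be sketched as follows. The temperatures and outfit names are hypothetical and reuse the getting dressed example.

```shell
#!/usr/bin/env bash

# Hypothetical temperatures in degrees Fahrenheit
temperatures=(60 90)
outfits=()

# "; do" sits on the same line as "for"
for temperature in "${temperatures[@]}"; do
  # "; then" sits on the same line as "if"
  if [[ "${temperature}" -lt 75 ]]; then
    outfits+=("pants")
  else
    outfits+=("shorts")
  fi
done

# "; do" sits on the same line as "while"
countdown=3
while [[ "${countdown}" -gt 0 ]]; do
  countdown=$((countdown - 1))
done

echo "${outfits[*]}"
```

Here else is on its own line, and each closing fi or done lines up vertically with the if, for, or while that opened it.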

Variable Expansion

  • Quote your variables
  • Prefer the brace-delimited form “${var}” over “$var”
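A small sketch shows why brace-delimiting helps when text directly follows a variable. The variable name and file name are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical sample name
sample="liver"

# Braces mark exactly where the variable name ends
with_braces="${sample}_counts.txt"

# Without braces, BASH looks for a variable named "sample_counts",
# which is unset here, so only ".txt" remains
without_braces="$sample_counts.txt"

echo "${with_braces}"
echo "${without_braces}"
```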

Quoting

  • Always quote strings containing variables, command substitutions, spaces or shell meta characters
  • Use arrays for safe quoting of lists of elements, especially command line flags
  • Prefer quoting strings that are words, in contrast to command options or file path names
  • Never quote literal integers
  • Use “$@” unless you have a specific reason to use $*
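The difference between “$@” and an unquoted $* is easiest to see by counting arguments. In this sketch the helper functions and the flags are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: report how many arguments it received
count_args() {
  echo "$#"
}

# Forward arguments with "$@" so each one stays a separate word
forward_quoted() {
  count_args "$@"
}

# Unquoted $* re-splits the arguments on whitespace
forward_unquoted() {
  count_args $*
}

# An array keeps each flag (and its value) as one safely quoted element
flags=(--name "my file.txt" --verbose)

forward_quoted "${flags[@]}"
forward_unquoted "${flags[@]}"
```

The first call reports three arguments. The second reports four, because the file name containing a space is split into two words.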

Discussion

What should you do if a file of code you are editing does not follow established or common language conventions?

Solution

Note that for existing files of code that you are editing, it is best to not modify the existing formatting.

Key Points

  • Some coding languages have specific formatting requirements.

  • When editing existing files of code, do not modify the existing formatting.