coding-tutorials

Basic computing - Linux, Bash, and SLURM

Goal

To give a basic introduction to coding (in Bash) and how to use Linux based machines.

Contents

  1. Linux
  2. Bash

    1. Description
    2. Setting up a sandbox
    3. Basic commands
    4. Navigation
    5. Variables
    6. Arrays
    7. Loops
    8. Text editors
    9. Running scripts
    10. Connecting to remote computers
    11. Advanced commands
  3. SLURM

    1. Description
    2. Sbatch scripts
    3. Queues
    4. Out and error files
    5. Starting, stopping, and monitoring jobs

Linux

Linux is the kernel on which most high performance computing (HPC) is done. A kernel is the software that allows an operating system to control physical hardware. The Linux kernel is modeled on UNIX and is open source and free for anyone to use: its creator, Linus Torvalds, released it under the GNU General Public License in the early 1990s. Anyone can contribute code to the kernel, as long as it passes review by the kernel's maintainers. Because so many people contribute to the code, it is constantly being improved.

To use Linux, you need a distribution (often shortened to ‘distro’). A distro is a complete operating system built around the kernel (think Windows or macOS). There are many to choose from: free ones like Ubuntu or Debian and enterprise-oriented ones like Red Hat Enterprise Linux and CentOS. Whichever distro you choose, they all behave much the same under the hood, even if the desktop appearance is different, because they all use the same kernel.

Although many Linux applications have easy-to-use graphical user interfaces (GUIs), a savvy Linux user will learn how to do everything within a terminal. A terminal is an access point into a computer that takes and returns text commands. Computing through a terminal is almost always faster than using a GUI and generally offers the user more options and customization than a GUI. The language most Linux-based operating systems use in the terminal is called Bash.

Bash

  1. Description

    Bash is a powerful programming language that Linux-based operating systems use to perform tasks. Most of the time, user-facing programs will be written in a language that is easier to debug, like Python or MATLAB, but you will need to use Bash to navigate around the terminal and launch jobs.

  2. Setting up a sandbox

    A sandbox is a safe environment in which to code without being able to break your computer. In our case we will be using CU’s Computer Science coding space:

https://coding.csel.io/hub/login

  1. Basic commands

    Commands in bash are entered directly into the command line, generally in the following format:

<command> --<option> <input>

The command is then executed when you press enter.
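As a small illustration of this format, the sketch below runs ls with a long-form option and an input. The --all option (GNU ls) and the /tmp path are example choices made here, not part of the tutorial.

```shell
# <command> --<option> <input>
# Here the command is ls, the option is --all (also list hidden files),
# and the input is the directory to list. /tmp is only an example path.
ls --all /tmp

# Many options also have a short, single-dash form; -a is equivalent to --all.
ls -a /tmp
```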

    1. cd To navigate from directory to directory, we can use cd or ‘change directory’.
      • We can move into a deeper directory by cd <directory name>
      • Up a directory with cd .. (‘..’ represents the parent directory)
      • The same directory cd . (‘.’ represents your current directory, we’ll use it later)
      • An adjacent directory by specifying a ‘relative path’ cd ../jon
      • A specific directory by specifying the absolute path cd /home/jon
      • Your home directory with either cd ~ or simply cd
    2. mv Similar to, and often much faster than, cp (copy), mv <source_file> <destination_file> moves a file from one location to another. Because you are not actually copying and removing the file, simply changing its location information, this is often instant when the source and destination are on the same drive. Another use of mv is to rename files (because that is essentially what you are doing). To do this, simply use mv <old_name> <new_name>. You can also move and rename entire directories.
    3. Tab filling. One of the biggest timesavers in coding is using the tab key to autofill the name of a command in your path or of a file/directory after you have typed the first few characters. If more than one name matches, pressing tab twice lists the possible completions.
    4. Home directory. Your home directory is usually where you will start a terminal session and contains all of the personal files necessary for you to work, including hidden files and programs. Usually your home directory is stored on a smaller, faster drive and is not meant for the storage of large datasets.
    5. Permissions. All files and folders on a computer have a set of permissions, which you can view using ls -l. There are three levels of permissions: user, group, and other. There are three types of permission in each level: read (r), write (w), and execute (x). These are denoted by sets of 3 letters per level, in the order user, group, other; a dash (-) means that permission is not granted. The leading character is the file type (- for a regular file).
       -rwx------ 1 shla9937 lugerlab 0 Sep  3 16:48 user.txt
       -rwxrwxr-- 1 shla9937 lugerlab 0 Sep  3 16:48 group.txt
       -rwxrwxrwx 1 shla9937 lugerlab 0 Sep  3 16:48 other.txt
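The navigation and mv commands above can be tried safely with a sketch like the following. It assumes a throwaway directory made with mktemp; mktemp, mkdir, touch, and pwd are standard utilities that this tutorial has not introduced, and the file and folder names are made up for the example.

```shell
# Work inside a throwaway directory so nothing important is touched.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/data"

cd "$sandbox/project"        # absolute path
cd data                      # deeper directory
pwd                          # print the current directory
cd ..                        # up to the parent directory (project)
cd ../project/data           # relative path that goes through the parent

# Create an empty file, then rename it with mv.
touch notes.txt
mv notes.txt renamed_notes.txt

# Move the file up one directory, then follow it.
mv renamed_notes.txt ..
cd ..
ls -l                        # long listing shows permissions, owner, and group
```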
      
  1. Variables
    • Variables can be defined in bash using the syntax: <variable_name>=<variable_value> (with no spaces around the =).
    • You can then call the variable using $<variable_name>.
    • And clear its value with unset <variable_name>.
    • Try setting up a variable and calling its value with the echo command.
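A minimal sketch of the variable steps above. The variable names and values are invented for illustration.

```shell
# Define a variable. No spaces are allowed around the = sign.
greeting="Hello from Bash"
echo "$greeting"

# Curly braces mark where the variable name ends,
# which matters when more text follows directly.
sample="run"
echo "${sample}_01.txt"

# After unset, the variable expands to an empty string.
unset greeting
echo "greeting is now: '$greeting'"
```

Quoting the variable ("$greeting") keeps values that contain spaces together as a single argument.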
  2. Arrays

    Lists in many programming languages are called ‘arrays’ in Bash. Simply put, an array is an ordered list of values (numbers, strings, etc.) that you can iterate through. As with variables, there are no spaces around the = sign.

    1. Make an empty array <array_name>=()
    2. Make a filled array <array_name>=(<value0> <value1> <value2>)
    3. Return first value ${<array_name>} (use echo to print the output)
    4. Return specific value ${<array_name>[i]} where i is the index (or position) of the value in the list, remember arrays start indexing at 0.
    5. Return all values ${<array_name>[@]}
       jovyan@jupyter-shla9937:~$ echo ${array1[@]}
       0 1 2 3 4 5
      
    6. Return array size ${#<array_name>[@]}
    7. Change value of first element <array_name>[0]=<new_value>
    8. Append value to list <array_name>+=(<value>)
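The array operations listed above can be tried together in a short sketch. The name array1 follows the tutorial's own example; the new values are arbitrary.

```shell
# Make a filled array (no spaces around =).
array1=(0 1 2 3 4 5)

echo "${array1[0]}"      # first value
echo "${array1[3]}"      # value at index 3
echo "${array1[@]}"      # all values
echo "${#array1[@]}"     # number of values

# Change the first element, then append a new one.
array1[0]=10
array1+=(6)
echo "${array1[@]}"      # prints: 10 1 2 3 4 5 6
```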
  3. Loops

    Now that you can use variables and arrays, you can use loops to iterate through those arrays and perform functions.

    1. For Loops. A ‘for loop’ will iterate through all the elements of an array and perform the same function on each, as in ‘for each element, do this’, and that is actually how the syntax works in bash.
      • First, declare the for loop, the variable that will hold each element, and the array through which to iterate, then add ; do:
          for i in "${array1[@]}"; do
        
      • Next, tell the loop what to do with each iteration. Here i holds the current element itself, not its index:
          > echo "$i"
        
      • You can add another function or declare the end of the loop and tell Bash to execute it:
          > done
        
      • Here’s an example of a for loop that looks at all the elements in an array and prints one each round:
          jovyan@jupyter-shla9937:~$ for i in "${array1[@]}"; do
          > echo "$i"
          > done
          0
          1
          2
          3
          4
          5
        
    2. if statements. If statements are a powerful tool that allow you to execute commands only if a specific condition has been met. There are three possible clauses in an if statement:
      • if runs a command if the condition is satisfied.
      • elif runs a command if the previous conditions are unsatisfied and the condition set forth by the elif is satisfied.
      • else runs a command if none of the previous conditions are met.
      • the basic syntax for an if statement in bash is (the spaces inside the square brackets are required):
          if [ <condition> ]
          then
            <command>
          elif [ <elif_condition> ]
          then
            <elif_command>
          else
            <else_command>
          fi
        
      • the fi denotes the end of the statement (it is simply if backwards)
      • if statements are often placed inside loops and can trigger them to end at certain times.
    3. While loops. A while loop runs a command over and over for as long as some condition is met. It’s kind of like putting an if statement inside of a for loop that ends when the condition becomes false.
      • The basic syntax is:
          while [ <condition> ]
          do
            <command>
          done
        
      • One caveat with while loops is that if the variable in the condition never changes, or the condition can never become false, you’ll start an endless while loop. For loops generally iterate through an iterable object of a defined size and so usually don’t get caught in this behavior.
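The three constructs in this section can be combined in one short sketch. The array contents and messages are made up, and the test operators = (string equality) and -lt (less than) as well as the $(( )) arithmetic syntax are standard Bash features not covered above.

```shell
fruits=(apple banana cherry)

# for loop: the loop variable holds each value in turn.
for fruit in "${fruits[@]}"; do
  if [ "$fruit" = "banana" ]; then
    echo "$fruit is my favorite"
  elif [ "$fruit" = "cherry" ]; then
    echo "$fruit is a close second"
  else
    echo "$fruit is fine"
  fi
done

# while loop: count changes on every pass, so the condition
# eventually becomes false and the loop ends.
count=0
while [ "$count" -lt 3 ]; do
  echo "count is $count"
  count=$((count + 1))
done
```

Removing the count=$((count + 1)) line would produce exactly the endless loop described in the caveat, because the condition could never become false.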
  4. Text editors
    1. Nano. Nano is one of the simplest command line text editors you can use and is installed on almost all Linux machines. It is great for quick edits, but it offers little help with debugging unless you are intimately familiar with your script.
      • nano <new_file> will create a file and open it in the editor (a common behavior with most editors)
      • move around with arrow keys
      • ctrl+x exits the program, but asks if you want to save your file as the same name or a different one. Answer y to save and exit or n to exit without saving.
      • see more: https://www.nano-editor.org/docs.php
    2. Vim. Vim is one of the most widespread command line text editors because it color codes text and helps the user more than nano. Vim’s documentation can be hard for beginners to navigate, although you can always search online to answer your question.
      • vi <new_file> creates and opens a file
      • Vim has two modes: edit (insert) and command. When in edit mode, you can make changes to your document.
      • i gets you from command mode into edit mode.
      • esc gets you from the edit mode to command input mode (you can’t exit until you get to command mode).
      • :q quits the editor if there are no unsaved changes; :q! quits and discards unsaved changes
      • :wq writes (saves) the file and quits
      • some documentation: https://www.vim.org/
    3. Gedit. Gedit is a graphical editor that may not come installed on your Linux machine, but many find easy to use.
      • gedit <new_file> creates and opens a gui with your file to edit it.
      • Documentation: https://help.gnome.org/users/gedit/stable/
    4. Atom. A really powerful graphical text editor that I like to use is called Atom and is built by GitHub, specifically to work well with GitHub. You can download it and find out more at https://atom.io/ Note that GitHub retired Atom in December 2022, so it no longer receives updates.
  5. Running scripts

    Now that we know how to use bash and edit files, we can make scripts. Scripts are files that contain a series of commands that we can run, use to streamline pipelines, and share with others.
    • <program_name> <script_name> is the general formula for running scripts.
    • bash <script.sh> is how we can run a script using bash. The file extension .sh is often used to specify a bash specific script.
    • Eventually, we will learn how to input values into the script and how to make them executable.
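A minimal sketch of making and running a script. The file name hello.sh and its contents are invented for the example, and the here-document (<<'EOF' ... EOF) is only a shortcut used here in place of a text editor.

```shell
# Write a two-line script into a throwaway directory.
script_dir=$(mktemp -d)
cat > "$script_dir/hello.sh" <<'EOF'
#!/bin/bash
echo "Hello from a script"
EOF

# Run the script with bash as the program.
bash "$script_dir/hello.sh"
```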

  6. Connecting to remote computers
    1. ssh. To log into a terminal securely from one Linux (or Mac) machine to another you can open a terminal and use ssh <user>@<computer_address>. To stop the connection use exit.
    2. PuTTY. PuTTY allows Windows machines to ssh into Linux machines using a GUI to produce a terminal emulator on the Windows end.
  7. Advanced commands
    1. top Shows the processes running on the local machine and the resources they are using. Press q to quit.
    2. ctrl+c Kills the job currently running in the terminal. Can be dangerous because it simply interrupts the job, which may leave partially written files behind.
    3. history Displays the commands you have previously entered on your command line. Useful for remembering how you did something you forgot to write down or put in a script.
    4. clear Clears out all of the displayed command line (not your history).
    5. * This is the symbol for a ‘wildcard’; it will run your command on everything matching your pattern. cat *.txt will print the contents of all .txt files in your directory. mv red* new_red_folder will move anything starting with ‘red’ into the ‘new_red_folder’.
    6. rsync A smart cp. Use rsync -auP source_directory destination_directory to make a copy of a folder. Running this command a second time will update any existing files and copy new ones. This makes keeping a copy of a folder super simple because you don’t have to copy every single file each time, just the ones that have changed.
    7. grep Use grep "<keyword>" * to search every file in your current directory for a matching pattern or grep "<keyword>" <file_name> to look inside of a file and find a keyword.
    8. screen A powerful tool for keeping a terminal alive and returning to it later.
      • screen -S <screen_name> creates a screen session named <screen_name>
      • ctrl+a followed by d detaches the screen and allows it to keep running even if you log out or disconnect your computer (not if the machine running the screen gets turned off).
      • screen -r <screen_name> reattaches the screen session.
      • exit from inside the screen will kill the screen session.
    9. sudo ‘Super User Do’ can be placed in front of commands that require superuser privileges. You usually don’t have the ability to use this unless it’s on your own computer. If you google something and it tells you to use sudo to fix it, don’t. Sudo commands can irreversibly mess up your computer.
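The wildcard and grep entries above can be tried on a few throwaway files, as in the sketch below. The file names and contents are invented, and the -r (recursive) option to grep is an addition not mentioned in the list.

```shell
work_dir=$(mktemp -d)
cd "$work_dir"

# Create a few small files to experiment on.
echo "red apples" > red_one.txt
echo "red cherries" > red_two.txt
echo "green pears" > green_one.txt

# The wildcard expands to every matching name before the command runs.
cat *.txt

# Move everything starting with 'red' into a new folder.
mkdir new_red_folder
mv red* new_red_folder

# Search inside files for a keyword; -r also descends into folders.
grep -r "cherries" .
```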

SLURM

  1. Description

    SLURM is a workload manager common to most HPC clusters: users submit jobs to it, and it allocates resources based on a number of parameters. We will use this to do work on the BioKEM cluster. There are many advantages to running jobs on clusters, including access to orders of magnitude more resources, reproducible environments, and the ability to maximize computing efficiency.

  2. Sbatch scripts

    Sbatch scripts are the scripts SLURM requires. They start with a header which contains information that SLURM will use to allocate resources and run the script. There are four main parts of an Sbatch script:

    • Specification of which interpreter should run the script. This section is denoted by a shebang (#!) followed by the path to the binary, in most cases: #!/bin/bash
    • Next are all of the SLURM parameters. Which ones are required is cluster specific, but generally you should be as explicit as possible. We’ll talk more about these parameters in the How computers work tutorial.
    • Then, you’ll load all of the modules you need to run your program with module load <modules>.
    • Finally, you can run your commands.
    • You can use the .sbatch file extension to denote Sbatch scripts:
        #!/bin/bash
        #SBATCH -p <partition> # Partition or queue.
        #SBATCH --job-name=<job_name> # Job name
        #SBATCH --mail-type=END # Mail events (NONE, BEGIN, END, FAIL, ALL)
        #SBATCH --mail-user=<email@colorado.edu>
        #SBATCH --nodes=<#> # Number of nodes (use 1 for a single node)
        #SBATCH --cpus-per-task=50 # CPUs per task
        #SBATCH --mem=24gb # Memory limit
        #SBATCH --time=24:00:00 # Time limit hrs:min:sec
        #SBATCH --output=/Users/%u/slurmfiles_out/slurm_%j.out # Standard output log (%u inserts user name)
        #SBATCH --error=/Users/%u/slurmfiles_err/slurm_%j.err # Standard error log (%j inserts job number)
      
        module load <modules>
        <commands>
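For comparison, here is a sketch of the template filled in for a very small job. The partition name short, the job name, and the resource values are hypothetical and need to be replaced with values valid on your cluster. The relative output paths are assumed to resolve against the directory the job is submitted from.

```shell
#!/bin/bash
#SBATCH -p short                      # hypothetical partition name
#SBATCH --job-name=hello_slurm        # hypothetical job name
#SBATCH --nodes=1                     # one node
#SBATCH --cpus-per-task=1             # one CPU
#SBATCH --mem=1gb                     # memory limit
#SBATCH --time=00:05:00               # time limit hrs:min:sec
#SBATCH --output=slurm_%j.out         # standard output; %j inserts the job number
#SBATCH --error=slurm_%j.err          # standard error

# No modules are needed for these commands, so module load is omitted.
node_name=$(uname -n)                 # name of the machine the job runs on
echo "Job ran on $node_name"
echo "Finished at $(date)"
```

Because every #SBATCH line starts with #, Bash treats the header as comments, so the same file also runs with plain bash for a quick check before submitting it.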
      
  3. Queues

    When you submit a job to SLURM, it goes into a queue where it waits to run.

    • Running the command squeue shows you what is going on in the cluster’s queue:
        fiji-1:~$ squeue
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        7861375      long    job_0 ding1018  R 7-13:11:09      1 fijinode-60
        7874945     titan nf-dreg_ lysa8537 PD       0:00      1 (Resources)
        7874946     titan nf-dreg_ lysa8537 PD       0:00      1 (Priority)
      
    • You get cursory information about everyone’s jobs on the cluster. The ST column shows each job’s state (R for running, PD for pending), and the last column shows where a job is running (node name), whether it’s at the top of the queue waiting for resources to open up (Resources), or whether it’s lower in the queue waiting for other jobs to run (Priority).
  4. Out and error files

    Running an Sbatch job will make two files named with the jobid followed by the extensions .out or .err. You will need to specify the folders you want these deposited into in your Sbatch header. The .out (output) file will give you any outputs that would normally appear on the command line during the run. The .err (error) file is useful for debugging and understanding what went wrong during failed runs.

  5. Starting, stopping, and monitoring jobs
    • To start a single Sbatch job use sbatch <script_name.script>. This will give you a jobid that you can use to monitor your job status.
    • To stop a job that you no longer want to run or that is failing in some way use scancel <jobid>. You can only cancel your own jobs.
    • To check the status of all the jobs in a queue use squeue. If you only want to see your jobs, use squeue -u <your_user>.
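The three commands above can be strung together as in the sketch below. The function name submit_and_check and the script name my_job.sbatch are hypothetical. The --parsable option asks sbatch to print only the jobid (followed by a cluster name on multi-cluster setups) so that it can be stored in a variable. The commands only run where SLURM is installed.

```shell
# Submit a script, remember its jobid, and list your own jobs.
submit_and_check() {
  local script="$1"
  local jobid
  jobid=$(sbatch --parsable "$script") || return 1
  echo "Submitted job $jobid"
  squeue -u "$USER"          # show only your own jobs
  # scancel "$jobid"         # uncomment to stop the job
}

# Only call the function where SLURM is available.
if command -v sbatch > /dev/null; then
  submit_and_check my_job.sbatch   # hypothetical script name
else
  echo "sbatch not found; run this on a cluster login node"
fi
```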

Practice

  1. Use a text editor to make and run a bash script that produces a text file containing a message.
  2. Use a text editor to make and run a bash script that uses a loop to append a message 20 times onto the previous text file.
  3. Use a text editor to make and run a bash script that creates an array of file names, then uses a for loop to create all of the files.