1 of 42

I/O and file management

2 of 42

Absolute path vs relative path

  • An absolute path is the full file path, starting from the root of the file system, so it can be found from anywhere on the computer.
  • A relative path is the file path based on the current working folder.

Address: No. 70, Lianhai Rd., Gushan Dist., Kaohsiung City

Two houses away from me

I live here

Absolute path

Relative path
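The difference can also be checked from Python. This is a minimal sketch; the file name notes.txt is hypothetical and does not need to exist.

```python
import os
from pathlib import Path

# "notes.txt" is a hypothetical file name used only for illustration.
relative = Path("notes.txt")

# A relative path is interpreted from the current working folder,
# so joining the two gives the matching absolute path.
absolute = Path(os.getcwd()) / relative

print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True
print(absolute)                # the output depends on where the script runs
```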

3 of 42

Where can we see the file path

In the terminal of VS Code

4 of 42

Very basic Linux commands

  • ls: show the files and folders in the current working folder.

5 of 42

Very basic Linux commands

  • pwd: show the absolute path of the current working folder.
  • cd: change the current working folder to another.

cd <path> – change to the folder at <path> (a placeholder for any folder path)

.. – the parent folder

. – the current folder
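The same three actions are available inside Python through the os module. The sketch below works in a temporary folder so that it does not touch real files; folder1 is a hypothetical folder name.

```python
import os
import tempfile

start = os.getcwd()                          # like pwd

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "folder1"))   # hypothetical folder
    os.chdir(tmp)                            # like cd with an absolute path
    print(os.listdir("."))                   # like ls; prints ['folder1']
    os.chdir("folder1")                      # like cd with a relative path
    inside = os.getcwd()
    os.chdir("..")                           # go to the parent folder
    parent = os.getcwd()
    os.chdir(start)                          # leave the temporary folder
```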

6 of 42

Path system for Linux/Mac

  • /home/silasysh/ANNOgesic – absolute path

home

silasysh

ANNOgesic

folder1

goatools

  • If we are in /home/silasysh, we can use cd ANNOgesic to go to ANNOgesic. This is a relative path.

Absolute path

Relative path

An absolute path can reach the target folder from anywhere on the computer

7 of 42

Path system for Windows

  • C:\Users\shyu – absolute path

C:

shyu

ANNOgesic

folder1

goatools

  • If we are in C:\Users\shyu, we can use cd ANNOgesic to go to ANNOgesic. This works the same way as on Linux.

8 of 42

Exercise

  • The current folder is C:\Users\Paul\folder2\folder4. Which of the following paths can be found?
    • C:/Users/Paul/folder3
    • folder1
    • ./folder6
    • ../folder2
    • ../../Paul
    • ../folder5

C:

Paul

folder2

folder1

folder3

folder4

folder5

folder6

Users

9 of 42

Open file

  • The first step of reading or writing a file is opening it. We can use open() to open a file.
  • To use open(), we pass it a file name with a valid absolute or relative path, and the mode (read/write/append).
  • The mode can be one of the following:
    • "r" – read from a file. The file must exist.
    • "w" – write to a file. If the file does not exist, Python will create it. If the file does exist, it will be completely overwritten.
    • "a" – append to a file. If the file does not exist, Python will create it. If the file does exist, the output of the script will be added at the end of the file.
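A short sketch of the first two modes, using a hypothetical file example.txt in a temporary folder: "r" fails when the file is missing, while "w" creates it.

```python
import os
import tempfile

# Hypothetical file inside a fresh temporary folder; it does not exist yet.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# "r" requires an existing file, otherwise Python raises FileNotFoundError.
try:
    open(path, "r")
    missing = False
except FileNotFoundError:
    missing = True
print(missing)                # True

# "w" creates the file when it does not exist.
f = open(path, "w")
f.write("first line\n")
f.close()
print(os.path.exists(path))   # True
```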

10 of 42

read()

  • All functions for reading and writing text files work with strings.
  • read() is a function for reading the content of a file. The amount to read can be controlled by a size.
  • read() without a size reads the whole file.
  • read(number) reads the given number of characters. For example, read(6) reads only the first 6 characters of a file.

1. On Windows, write \\ in the path, because a single \ is a special (escape) symbol in Python strings.

2. Don't forget the filename extension, such as .txt

6 characters including a space.

File holder
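A minimal sketch of read() with and without a size. The file content and the name example.txt are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical demo file in a temporary folder.
# On Windows an absolute path would be written like "C:\\Users\\shyu\\example.txt".
path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, "w")
f.write("Hello world\nSecond line\n")
f.close()

f = open(path, "r")
first_six = f.read(6)   # the first 6 characters, including the space
rest = f.read()         # everything that has not been read yet
f.close()

print(repr(first_six))  # 'Hello '
print(repr(rest))       # 'world\nSecond line\n'
```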

11 of 42

readline() and readlines()

  • readline() reads one line per call, so a file can be read line by line.
  • readlines() reads all lines of a file into a list. Each line is one element of the list.

Each line read from the file includes a "\n"
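The two functions can be compared on a small hypothetical file with three lines:

```python
import os
import tempfile

# Hypothetical demo file with three lines.
path = os.path.join(tempfile.mkdtemp(), "genes.txt")
f = open(path, "w")
f.write("gene1\ngene2\ngene3\n")
f.close()

f = open(path, "r")
first = f.readline()    # one line per call
second = f.readline()
f.close()

f = open(path, "r")
lines = f.readlines()   # all lines at once, as a list
f.close()

print(repr(first))      # 'gene1\n'
print(repr(second))     # 'gene2\n'
print(lines)            # ['gene1\n', 'gene2\n', 'gene3\n']
```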

12 of 42

strip

  • strip() is a string method that removes whitespace, such as spaces and newlines ("\n"), from the beginning and end of a string. It is widely used to remove the newline from a line read from a file.
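A small illustration on a made-up line; rstrip("\n") is shown as well for the case where only the trailing newline should go.

```python
# A made-up line, as it might come from readline().
line = "  ATGCCG \n"

stripped = line.strip()           # spaces and newline removed on both sides
newline_only = line.rstrip("\n")  # only the trailing newline removed

print(repr(stripped))             # 'ATGCCG'
print(repr(newline_only))         # '  ATGCCG '
```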

13 of 42

readlines with a for loop

  • The most widely used approach for reading a file is using readlines() with a for loop.

Remove the “\n”

Do something for each line.

Normally each line holds the data of one gene, protein or sample.

14 of 42

close

  • Every time we open a file, the file holder occupies some memory. It is a good habit to close an opened file when we no longer need it.

  • The standard structure of reading a file is
    • file_holder = open("filename", "r")
    • for loop + readlines()
    • strip()
    • file_holder.close()

The default mode is "r"
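Put together, the standard structure might look like the sketch below. The file name, the tab-separated content and the lengths dictionary are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical input: one gene per line, name and length separated by a tab.
path = os.path.join(tempfile.mkdtemp(), "lengths.txt")
f = open(path, "w")
f.write("geneA\t120\ngeneB\t300\n")
f.close()

file_holder = open(path)              # no mode given, so the default "r" is used
lengths = {}
for line in file_holder.readlines():
    line = line.strip()               # remove the "\n"
    name, length = line.split("\t")   # do something with each line
    lengths[name] = int(length)
file_holder.close()

print(lengths)                        # {'geneA': 120, 'geneB': 300}
print(file_holder.closed)             # True
```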

15 of 42

Exercise

  • Please download a fasta file via https://drive.google.com/file/d/1gVtVe4vKig8DaOVy-8V6QSldBVxiEG5x/view?usp=sharing
  • Please tell me the accession number and the strain of the bacterium. (startswith() and split() are your friends)
  • Please also tell me the length of this bacterium's sequence and its GC content (the percentage of the whole sequence that is G or C).

The header starts with >

Accession number

Strain name
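As a hint, the two string methods and the GC calculation can be tried on invented data first. The header and sequence below are made up; the header of the downloaded file may be laid out differently, so the split() calls would need to be adapted.

```python
# Invented header and sequence, NOT taken from the exercise file.
header = ">XX000000.1 Examplebacter testii strain XYZ1, complete genome"
sequence = "ATGCGCTA"

if header.startswith(">"):
    # The accession number is assumed to be the first word after ">".
    accession = header[1:].split(" ")[0]
    # The strain name is assumed to follow the word "strain" and end at a comma.
    strain = header.split("strain ")[1].split(",")[0]

# GC content = (number of G + number of C) / total length * 100
gc_count = sequence.count("G") + sequence.count("C")
gc_content = gc_count / len(sequence) * 100

print(accession)      # XX000000.1
print(strain)         # XYZ1
print(len(sequence))  # 8
print(gc_content)     # 50.0
```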

17 of 42

seek

  • Sometimes, we want to read a file twice.

is not executed

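  • After a file has been read, the reading position stays at the end of the file. It does not go back to the beginning by itself, so a second read returns nothing.
  • If we want to read a file twice, we need to open the file twice or use seek() to move the reading position back.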
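A minimal sketch of the problem and of seek(0), which moves the reading position back to the start. The file is a hypothetical two-line example.

```python
import os
import tempfile

# Hypothetical two-line file.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, "w")
f.write("line 1\nline 2\n")
f.close()

f = open(path, "r")
first_pass = f.readlines()
second_pass = f.readlines()  # empty: the position is already at the end
f.seek(0)                    # go back to the beginning of the file
third_pass = f.readlines()
f.close()

print(first_pass)            # ['line 1\n', 'line 2\n']
print(second_pass)           # []
print(third_pass)            # ['line 1\n', 'line 2\n']
```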

18 of 42

write

  • write is a function that writes output to a file.
  • For writing, a file needs to be opened with mode = "w" or "a". If the opened file already exists, the content will be overwritten when mode = "w" and will be appended to when mode = "a".
  • write can only handle strings. Everything needs to be converted to a string before it is passed to write.
  • write is different from print: write does not put a "\n" at the end of the string.

No “\n”
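A short sketch showing both points: the number has to be converted with str(), and the newline has to be written explicitly. The file name and content are invented.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")  # hypothetical file

f = open(path, "w")
f.write("length: ")
f.write(str(1500))   # a number must be converted to a string first
f.write("\n")        # write does not add the newline for us
f.write("done")
f.close()

f = open(path, "r")
content = f.read()
f.close()
print(repr(content))  # 'length: 1500\ndone'
```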

19 of 42

Run the same script with different strings –> mode = “w”

Run the same script with different strings –> mode = “a”

Overwritten

append
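The two behaviours can be reproduced in one script by writing twice to each of two hypothetical files, one opened with "w" and one with "a". The helper save() is invented for this sketch.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
w_path = os.path.join(folder, "w_mode.txt")  # hypothetical file names
a_path = os.path.join(folder, "a_mode.txt")

def save(path, text, mode):
    """Open the file with the given mode, write one line and close it."""
    f = open(path, mode)
    f.write(text + "\n")
    f.close()

# Two "runs" with different strings.
for text in ["first run", "second run"]:
    save(w_path, text, "w")
    save(a_path, text, "a")

f = open(w_path)
w_content = f.read()
f.close()
f = open(a_path)
a_content = f.read()
f.close()

print(repr(w_content))  # 'second run\n' (the first run was overwritten)
print(repr(a_content))  # 'first run\nsecond run\n' (the second run was appended)
```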

20 of 42

writelines

  • The concept is similar to readlines. It writes multiple lines to a file. The lines are given to writelines as a list, with each line as one element.
  • writelines also does not add "\n".
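A minimal sketch with an invented list of gene names, once without and once with the "\n" added by hand:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "genes.txt")  # hypothetical file
genes = ["gene1", "gene2", "gene3"]

# Without "\n" the elements are written directly after each other.
f = open(path, "w")
f.writelines(genes)
f.close()
f = open(path)
joined = f.read()
f.close()

# Adding "\n" to each element gives one gene per line.
f = open(path, "w")
f.writelines([gene + "\n" for gene in genes])
f.close()
f = open(path)
separated = f.read()
f.close()

print(repr(joined))     # 'gene1gene2gene3'
print(repr(separated))  # 'gene1\ngene2\ngene3\n'
```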

21 of 42

print

  • print can send output not only to the terminal, but also to a file.
  • However, when print is used, the content keeps its original style, e.g. a list will be printed as [1, 2, 3], not 1, 2, 3.
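Printing to a file is done with the file argument of print. The sketch below uses a hypothetical output file; the second print shows one way to get 1, 2, 3 by passing the values separately with sep.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "printed.txt")  # hypothetical file

f = open(path, "w")
print([1, 2, 3], file=f)           # the list keeps its list style
print(1, 2, 3, sep=", ", file=f)   # separate values joined by ", "
f.close()

f = open(path)
content = f.read()
f.close()
print(repr(content))               # '[1, 2, 3]\n1, 2, 3\n'
```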

22 of 42

Exercise

  • Following the previous script, could you please use write instead of print to write the output to a file?

24 of 42

csv

  • In many cases, the input file is a table that uses a delimiter to separate its columns. Reading this kind of file with readlines needs extra steps to deal with the delimiter.
  • csv is a Python module for reading tables. Each row of the table is returned as a list, and each element of the list is the value of one column.
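A minimal sketch of csv.reader on a tab-delimited table. The file name and the table content are invented for the example.

```python
import csv
import os
import tempfile

# Hypothetical tab-delimited table.
path = os.path.join(tempfile.mkdtemp(), "table.tsv")
f = open(path, "w")
f.write("gene\tstart\tend\ndnaA\t1\t1400\n")
f.close()

rows = []
f = open(path, "r", newline="")        # newline="" is recommended for csv files
for row in csv.reader(f, delimiter="\t"):
    rows.append(row)                   # each row is a list of column values
f.close()

print(rows)  # [['gene', 'start', 'end'], ['dnaA', '1', '1400']]
```

Note that every value is returned as a string, so numbers still need int() or float().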

25 of 42

With open

  • with open is a statement that closes the file automatically when the reading or writing inside its block is finished.

No f.close()

No f.close()
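A sketch of writing and reading with with open, using a hypothetical file; the closed attribute confirms that the file was closed without calling close().

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file

with open(path, "w") as f:
    f.write("line 1\nline 2\n")
# The file is closed here, at the end of the indented block.

with open(path, "r") as f:
    lines = [line.strip() for line in f.readlines()]

print(lines)     # ['line 1', 'line 2']
print(f.closed)  # True
```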

26 of 42

Exercise

  • Rewrite the previous code to use with open.

28 of 42

sys.argv

  • Although input is a good method for receiving information from users, they need to type the information every time they run the script. If the script asks 10 questions, it will be super annoying.
  • sys.argv lets users type the input information right after the script name on the command line. Thus, users can save the command for future use.

Stop the script immediately

The name of script

The input information

Besides the script name, it still needs one more input
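The idea can be sketched as below. The script name hello.py and the function get_name are hypothetical; a hand-written list stands in for sys.argv so that the example runs without a command line.

```python
import sys

def get_name(argv):
    """Return the first input after the script name, or stop the script."""
    if len(argv) < 2:
        # Besides the script name, one more input is needed.
        # sys.exit() stops the script immediately and shows this message.
        sys.exit("Usage: python " + argv[0] + " NAME")
    return argv[1]

# In a real script we would call get_name(sys.argv).
# This list stands in for the command: python hello.py Alice
argv = ["hello.py", "Alice"]
print(argv[0])             # hello.py (the name of the script)
name = get_name(argv)
print("Hello, " + name)    # Hello, Alice
```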

29 of 42

argparse

  • Although input and sys.argv are good methods for obtaining information from users, they are too flexible and easy to lose control of, e.g. the user types "test" in answer to a question that asks for a number.
  • argparse is a Python module for parsing the input from users. It can check and convert the data type of an input, mark an input as required or optional, receive several values as a list, set default values, and provide detailed help for all inputs.

30 of 42

args.XXX is used to access the input. XXX comes from the full (long) name of the argument
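A minimal argparse sketch. The argument names (--name, --days) and the booking theme are assumptions made for the example, and a list is passed to parse_args() in place of a real command line.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical booking script")
# required=True: the script stops with an error if --name is missing.
parser.add_argument("-n", "--name", required=True, help="guest name")
# type=int converts the input; default=1 is used when --days is not given.
parser.add_argument("-d", "--days", type=int, default=1,
                    help="number of nights (default: 1)")

# parse_args() without a list reads the command line (sys.argv).
# This list stands in for: python booking.py --name Alice --days 3
args = parser.parse_args(["--name", "Alice", "--days", "3"])

print(args.name)      # Alice
print(args.days + 1)  # 4, because args.days is already an int
```

If the user types "test" for --days, argparse prints an error message and stops the script.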

31 of 42

Running the script without inputs tells you which parameters can be assigned and which ones are required.

If required is set as True, this parameter must be assigned while running the script.

Use default setting

32 of 42

No input value is needed. Using the flag turns False into True

action="store_true" means that when –sr is used, args.single_room becomes True; otherwise it keeps the default, False. "store_false" is the opposite of "store_true".
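A small sketch of a store_true flag. The exact spelling of the option strings (-sr and --single_room) is an assumption based on the names mentioned above.

```python
import argparse

parser = argparse.ArgumentParser()
# Option strings are assumed; args.single_room comes from the long name.
parser.add_argument("-sr", "--single_room", action="store_true",
                    help="book a single room")

with_flag = parser.parse_args(["-sr"])   # stands in for: python script.py -sr
without_flag = parser.parse_args([])     # stands in for: python script.py

print(with_flag.single_room)     # True
print(without_flag.single_room)  # False
```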

33 of 42

Each item is separated by a space

Using nargs=“+” means this argument is a list
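A sketch of nargs="+" with a hypothetical --numbers argument; type=int is applied to every item of the list.

```python
import argparse

parser = argparse.ArgumentParser()
# "--numbers" is a hypothetical argument; nargs="+" requires at least one value.
parser.add_argument("--numbers", nargs="+", type=int)

# Stands in for: python script.py --numbers 1 2 3
args = parser.parse_args(["--numbers", "1", "2", "3"])

print(args.numbers)       # [1, 2, 3]
print(sum(args.numbers))  # 6
```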

34 of 42

-h or --help will print the help information.

35 of 42

Exercise

  • Please rewrite the previous script so that the input and output files are given as arguments

37 of 42

Exercise

the feature’s name, such as gene, CDS, tRNA…

38 of 42

Define argument

Main function has three big steps – read_fasta, read_gff, and write_seq.

In addition, gene_seqs is used to store the output information.

39 of 42

Read fasta – this is almost the same as the previous exercise, except that it returns two outputs (ac for the accession number and seq for the whole sequence).
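One possible shape of such a function is sketched below. It assumes a single-record fasta file whose accession number is the first word of the header; the demo record is invented.

```python
import os
import tempfile

def read_fasta(fasta_file):
    """Return the accession number and the whole sequence of a fasta file.

    Assumes one record per file and the accession as the first header word.
    """
    ac = None
    seq = ""
    with open(fasta_file, "r") as fh:
        for line in fh.readlines():
            line = line.strip()
            if line.startswith(">"):
                ac = line[1:].split(" ")[0]
            elif line:
                seq += line   # sequence lines are joined into one string
    return ac, seq

# Invented record for demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.fa")
with open(path, "w") as out:
    out.write(">XX000000.1 Examplebacter testii strain XYZ1\nATGC\nGGTA\n")

ac, seq = read_fasta(path)
print(ac)   # XX000000.1
print(seq)  # ATGCGGTA
```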

40 of 42

  • Use csv to read the GFF file, whose delimiter is a tab (\t).
  • Exclude the header lines, which start with #.
  • The first column should be the same as ac. This check avoids mixing sequences when multiple strains are stored in the same file.
  • The third column should be "gene".
  • Get the start and end points as well as the strand information.
  • Use the seq from read_fasta to extract the gene's sequence.
  • If the gene is on the reverse strand, we need to take the reverse complement of the sequence.
  • gene_seqs does not need to be returned because it is a dict, which is modified in place.
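The steps above can be sketched as follows. This is not the original solution: the demo GFF content is invented, and using the ninth column (attributes) as the dictionary key is an assumption made for the example.

```python
import csv
import os
import tempfile

def read_gff(gff_file, ac, seq, gene_seqs):
    """Store the sequence of every gene of strain `ac` in the dict gene_seqs."""
    with open(gff_file, "r", newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue                       # skip empty lines and headers
            if row[0] != ac or row[2] != "gene":
                continue                       # other strain or other feature
            start, end, strand = int(row[3]), int(row[4]), row[6]
            gene = seq[start - 1:end]          # GFF positions start at 1, end included
            if strand == "-":
                # complement first, then reverse
                gene = gene.translate(str.maketrans("ATGC", "TACG"))[::-1]
            gene_seqs[row[8]] = gene           # assumed key: the attributes column

# Invented GFF content for demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.gff")
with open(path, "w") as out:
    out.write("##gff-version 3\n")
    out.write("XX000000.1\tdemo\tgene\t3\t8\t.\t+\t.\tID=gene1\n")
    out.write("XX000000.1\tdemo\tgene\t3\t8\t.\t-\t.\tID=gene2\n")
    out.write("XX000000.1\tdemo\tCDS\t3\t8\t.\t+\t0\tID=cds1\n")
    out.write("OTHER.1\tdemo\tgene\t1\t4\t.\t+\t.\tID=gene3\n")

gene_seqs = {}
read_gff(path, "XX000000.1", "CCAGAACTGG", gene_seqs)
print(gene_seqs)  # {'ID=gene1': 'AGAACT', 'ID=gene2': 'AGTTCT'}
```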

5' AGAACT 3'

3' TCTTGA 5'
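The two strands of this example can be reproduced in two steps, complement and then reverse. The function names below are hypothetical, and the sketch only handles the upper-case letters A, T, G and C.

```python
def complement(seq):
    """Replace each base by its partner (A-T, G-C)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in seq)

def reverse_complement(seq):
    """Complement the sequence, then reverse it to read it 5' to 3'."""
    return complement(seq)[::-1]

print(complement("AGAACT"))          # TCTTGA
print(reverse_complement("AGAACT"))  # AGTTCT
```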

41 of 42

For complement

For reverse

Write the file as fasta format
