
Project Data Management

Data Analysis in Genome Biology - GEN242


Thomas Girke

May 23, 2024

GEN242


Online Sign-in Form


  1. The attendance form for each class meeting is on the Attendance Poll page linked in the side menu of the Canvas (eLearn) instance for this class (GEN242). The direct URL to this Canvas page is:

https://elearn.ucr.edu/courses/134360/pages/attendance-poll

  2. To be identified during the sign-in as a participant of this class, follow the instructions on the Attendance Poll page carefully. The required access code is highlighted in yellow on the same page.
  3. If you receive a message that you are not a registered participant, you have most likely not logged in to Poll Everywhere with your UCR email.

Login instructions for downstream knowledge polls are here. Stay logged in to save time when a new poll starts.


Reminder: NGS Workflows and Challenge Projects


  • Overview of Course Projects is here
  • NGS workflow templates are expected to be completed on full data sets (here) from start to finish by each student. The final result is the R Markdown report rendered to an HTML file. Detailed instructions for downloading the full NGS data sets of the template workflows are provided here.
  • NGS workflow templates
        • RNA-Seq
        • VAR-Seq
        • ChIP-Seq (not applicable this year)
  • Each student’s Challenge Project (here) will be included in the above R Markdown report. The code of your challenge project should be organized in the form of functions and stored in a separate *_Fct.R file that is imported via the source() command when the R Markdown report is rendered. Alternatively, the functions can be provided as an R package.


Reminders for NGS Workflow Templates


  • The workflows have to be run on the HPCC cluster
  • Topics to complete and review
    • Where to store big data? Difference between:
        • User account space: /rhome/<user_name>
        • Big data space: /bigdata/<gen242>*/<user_name> # *or other group name
    • Management of the ~/.html directory for HTTP access. This is useful for:
        • Viewing graphics and reports (e.g. HTML R Markdown reports)
        • Exposing large data (e.g. BAM files) to software running on other systems without downloading the data, e.g. certain genome browsers
        • Data sharing with collaborators (UCR or external)
    • Symbolic links: shortcuts/redirects to another directory (see here). Can be generated with

ln -s <existing_path> <link_name>
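The command above can be tried out safely with a minimal sketch like the following. It works in a throwaway directory, so nothing in a real account is touched. The names demo_root, user1, sample.txt and bigdata_link are placeholders chosen for this illustration, not paths from the course.

```shell
# Scratch area for the demo (assumption: mktemp is available on the system)
demo_root=$(mktemp -d)
mkdir -p "$demo_root/bigdata/user1"            # stands in for a big data directory
echo "reads" > "$demo_root/bigdata/user1/sample.txt"

# Same pattern as: ln -s <existing_path> <link_name>
ln -s "$demo_root/bigdata/user1" "$demo_root/bigdata_link"

# Inspect the link itself: ls -ld shows "link -> target",
# readlink prints only the target path
ls -ld "$demo_root/bigdata_link"
link_target=$(readlink "$demo_root/bigdata_link")
echo "link points to: $link_target"

# Files are reachable through the link as if they were stored there
cat "$demo_root/bigdata_link/sample.txt"
```

Using an absolute path for <existing_path>, as done here, keeps the link valid even if the link itself is later moved to another directory.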


Bigdata Storage for GEN242


  • When you log in to the HPCC, you land in your home directory, where you have 20 GB of storage space.
  • Much more space (often 1-250 TB, depending on the group) is available under

/bigdata/<group_name>/<user_name>

/bigdata/<group_name>/shared

  • For GEN242 we use

/bigdata/gen242/<user_name>

  • Symbolic links in your home account can be used to access these locations without typing long paths, e.g. with the nicknames bigdata and shared.
  • Important: deleting a symbolic link will not remove the directory it points to!
  • Also, a “git push” to GitHub will not follow symbolic links, meaning the data a link points to will not be uploaded to GitHub. Alternatively, large data can be excluded from a repository with a .gitignore file.
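The two points about deleting links and keeping data out of GitHub can be checked with a small sketch. It runs entirely in a scratch directory. The names work, bigdata_store, project and sample.bam are hypothetical stand-ins for a home account, a big data location and a project repository.

```shell
work=$(mktemp -d)                          # scratch directory for the demo
mkdir -p "$work/bigdata_store/results"     # stands in for /bigdata/gen242/<user_name>
echo "alignment data" > "$work/bigdata_store/results/sample.bam"
mkdir -p "$work/project"                   # stands in for a project directory

# Nickname link inside the project directory
ln -s "$work/bigdata_store" "$work/project/bigdata"

# Removing the link (note: no trailing slash) deletes only the link itself
rm "$work/project/bigdata"
if [ -f "$work/bigdata_store/results/sample.bam" ]; then
    target_status="target intact"
else
    target_status="target missing"
fi
echo "$target_status"

# Recreate the link and list it in .gitignore so Git does not track it.
# The pattern is written without a trailing slash because Git treats a
# symbolic link as a file, not as a directory.
ln -s "$work/bigdata_store" "$work/project/bigdata"
printf '%s\n' 'bigdata' >> "$work/project/.gitignore"
cat "$work/project/.gitignore"
```

The rm call is safe only without a trailing slash and without -r. Adding either makes it easier to act on the contents of the target by accident.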


Tutorial for Managing Project Data


Tutorial here


Your ~/.html Directory and Symbolic Links


  • How to configure your ~/.html directory for web access in your HPCC cluster account is described on the HPCC website here.
  • Note: to make files accessible via HTTP, they need to be world readable (e.g. -rw-r--r--), and the directories containing them need the corresponding read and execute permissions (e.g. drwxr-xr-x). For details on Linux permissions and how to change them, see here.
  • Next, under any directory in .html, create a symbolic link to the file you wish to make web accessible, like this:

ln -s <absolute_path_to_file> <file_name>

  • URL structure for files exposed via HTTP:

https://cluster.hpcc.ucr.edu/~<user_name>/<path_under_.html>/<file_name>

  • Example here: auto-creation of symbolic links and URLs via symLink2bam() of workflows. This provides genome browser access to large BAM files.
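The manual steps on this slide (permissions, link, URL) can be sketched as follows. This is an illustration under stated assumptions, not the HPCC setup procedure or what symLink2bam() does internally: html_root stands in for ~/.html, and the user name jdoe, the projects subdirectory and sample.bam are hypothetical. The URL assumes that the tilde directly precedes the user name.

```shell
# Scratch locations so the demo does not touch a real account
web_base=$(mktemp -d)
html_root="$web_base/.html"                # stands in for ~/.html
data_dir=$(mktemp -d)                      # stands in for a big data directory
user_name="jdoe"                           # hypothetical user name
bam_file="$data_dir/sample.bam"
echo "placeholder" > "$bam_file"

# Directories: read and execute for everyone (drwxr-xr-x)
# Files: readable for everyone (-rw-r--r--)
mkdir -p "$html_root/projects"
chmod 755 "$html_root" "$html_root/projects" "$data_dir"
chmod 644 "$bam_file"

# Link the file under .html using its absolute path
ln -s "$bam_file" "$html_root/projects/sample.bam"

# Assemble the URL from user name, path under .html and file name
url="https://cluster.hpcc.ucr.edu/~${user_name}/projects/sample.bam"
echo "$url"
ls -l "$html_root/projects"
```

On a real system, every directory along the path to the linked file also has to be accessible to the web server, which is why data_dir receives the same permissions as the directories under .html in this sketch.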


Depending on Time: More on Slurm and HPC in R


If there is time, continue here


Any Additional Questions?


  • Q&A about projects and workflows
  • Also, please attend office hours
