1 of 13

OSG-PATh Staff Meeting
HTCSS Update
09/14/2022

2 of 13

Update since April 2022

3 of 13

HTCSS mechanisms to help with "Submit file challenges"

4 of 13

Submit file challenges #1: Confusing Syntax for Custom Attributes

Why do some entries need a plus sign and others not? Why do I sometimes need quotation marks and sometimes not?

executable = foo.exe
+JobDurationCategory = "Long"
transfer_input_files = my_data
queue
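
One way to read the two questions above, as far as the submit language goes: a built-in command such as executable is parsed by condor_submit, which already knows the type of its value, while a line starting with + is copied into the job ClassAd as a raw ClassAd expression, so a string value has to be quoted. A minimal sketch, assuming the MY. prefix as the equivalent spelling of +; WantGlidein is borrowed from the later slides purely for illustration:

```
# Built-in submit command: condor_submit knows this is a file name, so no quotes
executable = foo.exe

# Custom attribute: the right-hand side is a raw ClassAd expression,
# so a string literal must be quoted
+JobDurationCategory = "Long"

# A boolean literal is also a ClassAd expression, and takes no quotes
+WantGlidein = true

# MY.<attribute> is an equivalent spelling of +<attribute>
MY.JobDurationCategory = "Long"
```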

5 of 13

EXTENDED_SUBMIT_COMMANDS

Submit file:

# use just like normal submit keywords; the value will be converted into the correct type of data
executable = foo.exe
JobDurationCategory = Long
transfer_input_files = my_data
queue

condor_config file on AP:

EXTENDED_SUBMIT_COMMANDS @=end
   WantGlidein = true
   JobDurationCategory = "Long"
   SingularityImage = "/whatever"
@end

6 of 13

EXTENDED_SUBMIT_HELPFILE

  • AP defined file or URL to inform the user

# return the contents of this file to the user
EXTENDED_SUBMIT_HELPFILE = $(LOCAL_DIR)/submit_help.txt
# or return the URL to the user
EXTENDED_SUBMIT_HELPFILE = http://example.com/submit_help

> condor_submit -capabilities

Schedd ap0.chtc.wisc.edu

Has Late Materialization enabled

Has Extended submit commands:

accounting_group_user value is forbidden

LongJob value is boolean true/false

ProjectName value is string

RetryIfTransferFails value is string

WantFlocking value is boolean true/false

WantGlidein value is boolean true/false

Has Extended help:

http://example.com/submit_help
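
A sketch of an AP configuration that could produce a listing like the one above. It assumes the type-by-example convention for EXTENDED_SUBMIT_COMMANDS: a quoted string declares a string command, true or false declares a boolean, and the literal error marks a command as forbidden. The example values are placeholders, not the actual configuration of ap0.chtc.wisc.edu:

```
# Each entry declares a submit command; the example value only conveys the type
EXTENDED_SUBMIT_COMMANDS @=end
   accounting_group_user = error
   LongJob = true
   ProjectName = "string"
   RetryIfTransferFails = "string"
   WantFlocking = true
   WantGlidein = true
@end

# Shown to the user under "Has Extended help:"
EXTENDED_SUBMIT_HELPFILE = http://example.com/submit_help
```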

7 of 13

Submit file challenges #2: Job Policy

Users specify poor job policies – examples:

periodic_release = True

New! A job explicitly placed on hold with condor_hold can now only be released via an explicit condor_release command.

9 of 13

Submit file challenges #2: Job Policy

Users specify poor job policies – examples:

periodic_release = HoldReasonCode == 13
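
One likely reason a policy like this counts as poor is that it has no limit: a job that keeps hitting the same failure is released again and again. A minimal sketch of a bounded variant, assuming the job attributes NumHolds and EnteredCurrentStatus and the ClassAd function time(); the thresholds (5 holds, 10 minutes) are arbitrary illustrations, not recommendations from the talk:

```
# Release only holds with the reason code used on the slide (13),
# stop after the job has been held 5 times,
# and wait at least 10 minutes in the held state before each release
periodic_release = (HoldReasonCode == 13) && (NumHolds < 5) && ((time() - EnteredCurrentStatus) > 600)
```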

11 of 13

File Transfer Error Propagation

  • Hold Reason and Reason Codes now useful

Example of "condor_q -hold" (output directory missing on the Access Point)

Old Message:

Error from slot1@TODDS480S: STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:50288>; SHADOW at 127.0.0.1 failed to write to file C:\condor\test\not_there\blah: (errno 2) No such file or directory

New Message:

Transfer output files failure at access point TODDS480S while receiving files from execution point slot3@TODDS480S. Details: writing to file C:\condor\test\not_there\blah: (errno 2) No such file or directory

12 of 13

Submit file challenges #2: Job Policy

But why make users craft their own policy? Let them pick from a menu!

So instead of:

periodic_release = HoldReasonCode == 13

Have users do:

RetryIfTransferFails = Syracuse
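
How the AP turns the menu choice into an actual policy is not shown on the slide. One hypothetical wiring, assuming RetryIfTransferFails is declared as an extended submit command and the schedd-wide knob SYSTEM_PERIODIC_RELEASE carries the policy; the expression and the retry limit are illustrative guesses, not the real OSG configuration:

```
# Declare the menu entry so users can write it like a normal submit command
EXTENDED_SUBMIT_COMMANDS @=end
   RetryIfTransferFails = "string"
@end

# Hypothetical policy: release reason-code-13 holds only for jobs that opted in,
# and give up once the job has been held 5 times
SYSTEM_PERIODIC_RELEASE = (RetryIfTransferFails =!= undefined) && (HoldReasonCode == 13) && (NumHolds < 5)
```

With this split, the admin can later tighten or change the expression without users editing their submit files.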

13 of 13

Submit File Challenge #3: Changes since user "cut-n-paste"

  • New Submit Templates
  • Submit language templates, defined in the condor_submit configuration

Config file:

SUBMIT_TEMPLATE_NAMES = $(SUBMIT_TEMPLATE_NAMES) TensorFlow
SUBMIT_TEMPLATE_TensorFlow @=end
   if ! $(1?)
      error : Template:TensorFlow requires at least 1 argument - TensorFlow(flavor)
   endif
   Universe = container
   container_image = TensorFlow$(1).sif
   MY.wantCVMFS = true
@end

Submit file:

use Template : TensorFlow(95)
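
Reading the template by hand, the line use Template : TensorFlow(95) should behave as if the user had written the lines below, with $(1) replaced by the argument 95. This is a sketch of the expected expansion, not captured output:

```
Universe = container
container_image = TensorFlow95.sif
MY.wantCVMFS = true
```

If the admin later changes the image naming or adds attributes to the template, cut-and-pasted submit files that use the template pick up the change automatically.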