worker-globfilter
The GlobFilter Worker is a Batch Worker that takes a glob filter as its batch definition, scans a configured input directory for matching files, and creates an item of work for each match.
Input
The following is an example input JSON for the Glob Filter Worker:
{
  "batchDefinition": "input-sub-folder/**.txt",
  "batchType": "GlobPattern",
  "taskMessageType": "DocumentMessage",
  "taskMessageParams": {
    "pi:datastorePartialReference": "74c98be740b44d64b5b7a4e224555917",
    "field:binaryFile": "CONTENT",
    "field:fileName": "FILE_NAME",
    "field:binaryFileReference": "STORAGE_REFERENCE",
    "newField:aNewField": "aNewFieldValue",
    "cd:aCustomDataField": "aCustomDataFieldValue"
  },
  "targetPipe": "langdetect-in"
}
Input JSON fields
- batchDefinition: The glob filter to match, in this case input-sub-folder/**.txt.
- batchType: The plugin used to process the batchDefinition. Currently the only supported batch type is GlobPattern. Other batch types may be added in the future as required.
- taskMessageType: The type of TaskMessage that the worker should output. This must be set to "DocumentMessage" (different types may become configurable in the future).
- taskMessageParams: A list of namespaced message parameters that the worker uses to build Task Messages (the pi and field namespaces are described under Task Message Parameter Namespaces):
  - pi:datastorePartialReference: The DataStore service partial reference to store file binaries against.
  - field:binaryFile: The name of the field that will hold the reference to the content of the file as a storage_ref encoded string.
  - field:fileName: The name of the field that will hold the name of the file as a utf-8 encoded string.
  - field:binaryFileReference: The name of the field that will hold the storage reference of the file as a utf-8 encoded string.
  - newField:aNewField: A new field and value to be added to the output Documents' taskData fields. Given the example above, a field with the key aNewField and the value aNewFieldValue will be added.
  - cd:aCustomDataField: A field and value to add to the output Documents' taskData customData. Given the example above, a field with the key aCustomDataField and the value aCustomDataFieldValue will be added.
- targetPipe: The queue that generated TaskMessages should be output to.
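As an illustration of how the input fields fit together, the following Python sketch assembles a message shaped like the example input. The helper name build_glob_batch_message and its arguments are hypothetical and not part of the worker. The output field names (CONTENT, FILE_NAME, STORAGE_REFERENCE) are simply copied from the example.

```python
import json


def build_glob_batch_message(glob_pattern, target_pipe, datastore_partial_ref,
                             new_fields=None, custom_data=None):
    """Hypothetical helper: build an input message shaped like the example."""
    params = {
        "pi:datastorePartialReference": datastore_partial_ref,
        # Output field names copied from the example input.
        "field:binaryFile": "CONTENT",
        "field:fileName": "FILE_NAME",
        "field:binaryFileReference": "STORAGE_REFERENCE",
    }
    # Extra taskData fields use the newField namespace.
    for name, value in (new_fields or {}).items():
        params["newField:" + name] = value
    # customData entries use the cd namespace.
    for name, value in (custom_data or {}).items():
        params["cd:" + name] = value
    return {
        "batchDefinition": glob_pattern,
        "batchType": "GlobPattern",             # the only supported batch type
        "taskMessageType": "DocumentMessage",   # the only supported message type
        "taskMessageParams": params,
        "targetPipe": target_pipe,
    }


message = build_glob_batch_message(
    "input-sub-folder/**.txt",
    "langdetect-in",
    "74c98be740b44d64b5b7a4e224555917",
    new_fields={"aNewField": "aNewFieldValue"},
    custom_data={"aCustomDataField": "aCustomDataFieldValue"},
)
print(json.dumps(message, indent=2))
```

How the message is then delivered to the worker's input queue depends on the deployment and is not covered by this sketch.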
Task Message Parameter Namespaces
The following tables describe the Glob Filter Worker's processing instruction and field namespace parameters:
Processing instructions
The processing instructions, denoted with pi, are parameters that are used by worker operations.
The following table lists the processing instructions that are used by the Glob Filter Worker:
| pi | Description |
|---|---|
| datastorePartialReference | The DataStore service partial reference to store file binaries against. |
Field
The fields, denoted with field, are parameters that name the taskData fields in which information about each glob-matched file will be stored.
The following table lists the fields that will be added to TaskMessage taskData fields output from the Glob Filter Worker when specified:
| field | Description |
|---|---|
| binaryFile | The name of the field that will hold the reference to the content. |
| binaryFileReference | The name of the field that will hold the storage reference of the file. |
| fileName | The name of the field that will hold the name of the file. |
Note that if a name is not specified for any of the above fields then that field will not be added to the taskData.
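The mapping from namespaced parameters to taskData can be sketched in Python as follows. This is only an illustration of the behaviour described here, not the worker's actual implementation. The functions split_params and build_task_data are hypothetical, and the storage reference is passed in rather than generated.

```python
def split_params(task_message_params):
    """Group 'namespace:name' keys by their namespace prefix."""
    grouped = {}
    for key, value in task_message_params.items():
        namespace, _, name = key.partition(":")
        grouped.setdefault(namespace, {})[name] = value
    return grouped


def build_task_data(task_message_params, file_name, storage_reference):
    """Hypothetical sketch: build taskData for one matched file."""
    grouped = split_params(task_message_params)
    field_names = grouped.get("field", {})
    fields = {}
    # newField entries are copied straight into the taskData fields.
    for name, value in grouped.get("newField", {}).items():
        fields[name] = [{"data": value}]
    # Each file-related field is added only when a name was supplied for it.
    if "binaryFile" in field_names:
        fields[field_names["binaryFile"]] = [
            {"data": storage_reference, "encoding": "storage_ref"}
        ]
    if "fileName" in field_names:
        fields[field_names["fileName"]] = [{"data": file_name}]
    if "binaryFileReference" in field_names:
        fields[field_names["binaryFileReference"]] = [{"data": storage_reference}]
    # cd entries become the customData map.
    return {"fields": fields, "customData": dict(grouped.get("cd", {}))}


# field:fileName is deliberately omitted, so no file name field is produced.
params = {
    "pi:datastorePartialReference": "74c98be740b44d64b5b7a4e224555917",
    "field:binaryFile": "CONTENT",
    "field:binaryFileReference": "STORAGE_REFERENCE",
    "newField:aNewField": "aNewFieldValue",
    "cd:aCustomDataField": "aCustomDataFieldValue",
}
task_data = build_task_data(
    params,
    "ATextFileThatMatchedTheGlobFilterOutput.txt",
    "74c98be740b44d64b5b7a4e224555917/adb4cdce-62f1-4ada-a7e7-463e9abd4b95",
)
print(sorted(task_data["fields"]))
```

Because no name was given for field:fileName in this usage, the resulting taskData contains no file name field, matching the note above.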
Output
The following is an example TaskMessage taskData JSON returned from the Glob Filter Worker for a file that matched the glob operation:
{
  "fields": {
    "aNewField": [
      {
        "data": "aNewFieldValue"
      }
    ],
    "CONTENT": [
      {
        "data": "74c98be740b44d64b5b7a4e224555917/adb4cdce-62f1-4ada-a7e7-463e9abd4b95",
        "encoding": "storage_ref"
      }
    ],
    "FILE_NAME": [
      {
        "data": "ATextFileThatMatchedTheGlobFilterOutput.txt"
      }
    ],
    "STORAGE_REFERENCE": [
      {
        "data": "74c98be740b44d64b5b7a4e224555917/adb4cdce-62f1-4ada-a7e7-463e9abd4b95"
      }
    ]
  },
  "customData": {
    "aCustomDataField": "aCustomDataFieldValue"
  }
}
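A downstream consumer could read values out of this structure along the following lines. This is a minimal Python sketch. The helper first_value is hypothetical, and the JSON is a trimmed copy of the example output. Splitting the storage reference on the first "/" relies on the layout seen in the example, where the reference begins with the datastore partial reference, and is an assumption rather than a documented guarantee.

```python
import json

# Trimmed copy of the example taskData shown above.
task_data = json.loads("""
{
  "fields": {
    "CONTENT": [
      {"data": "74c98be740b44d64b5b7a4e224555917/adb4cdce-62f1-4ada-a7e7-463e9abd4b95",
       "encoding": "storage_ref"}
    ],
    "FILE_NAME": [
      {"data": "ATextFileThatMatchedTheGlobFilterOutput.txt"}
    ]
  },
  "customData": {"aCustomDataField": "aCustomDataFieldValue"}
}
""")


def first_value(data, field_name):
    """Hypothetical helper: return the first 'data' value of a field, or None."""
    entries = data.get("fields", {}).get(field_name, [])
    return entries[0]["data"] if entries else None


file_name = first_value(task_data, "FILE_NAME")
content_ref = first_value(task_data, "CONTENT")
missing = first_value(task_data, "STORAGE_REFERENCE")  # absent in the trimmed copy

# Assumption: the text before the first "/" is the datastore partial reference.
partial_ref, _, asset_id = content_ref.partition("/")

print(file_name)
print(partial_ref)
print(asset_id)
```

Returning None for an absent field keeps the consumer tolerant of messages in which a field name was not specified in the input.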
Environment Variables
- CAF_GLOB_WORKER_BINARY_DATA_INPUT_FOLDER: The input folder to be scanned for matches, e.g. /mnt/caf-datastore-root/sample-files.
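To give a feel for how a batch definition could be applied to the input folder, the Python sketch below walks a directory and tests each file's path, relative to that folder, against the pattern. It rests on two assumptions that this document does not confirm: that the pattern is matched against paths relative to the input folder, and that ** may span directory separators. The function find_matches is hypothetical, and Python's fnmatch stands in for whatever glob matcher the worker actually uses.

```python
import fnmatch
import os
import tempfile


def find_matches(input_folder, glob_pattern):
    """Hypothetical sketch: list files under input_folder matching the pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(input_folder):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Assumption: the pattern applies to the path relative to the folder.
            relative = os.path.relpath(full_path, input_folder).replace(os.sep, "/")
            # fnmatch's "*" also matches "/", so "**" spans sub-directories here.
            if fnmatch.fnmatchcase(relative, glob_pattern):
                matches.append(relative)
    return sorted(matches)


# Demonstration against a temporary folder standing in for the input folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "input-sub-folder", "nested"))
    sample_files = (
        "input-sub-folder/a.txt",
        "input-sub-folder/nested/b.txt",
        "input-sub-folder/c.pdf",
        "other.txt",
    )
    for rel in sample_files:
        with open(os.path.join(root, *rel.split("/")), "w") as handle:
            handle.write("sample")
    found = find_matches(root, "input-sub-folder/**.txt")

print(found)
```

In a deployment, the folder passed to such a function would come from CAF_GLOB_WORKER_BINARY_DATA_INPUT_FOLDER, for example via os.environ, instead of a temporary directory.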