Configuration Directives¶
Overview¶
The PixStor Search system is controlled by various configuration directives. These are represented in a filesystem hierarchy, and manifest themselves as dot-separated properties. The content of each property value is JSON formatted. More complex values (e.g. dicts/maps) are also used for some properties.
Important
It is especially important to note that raw strings are NOT valid JSON, therefore all string values must be appropriately quoted
The base location for ArcaPix configuration values is /opt/arcapix/etc (see below).
Thus, to set the value of arcapix.search.authserver to https://localhost/ one would write a file containing precisely
"https://localhost/"
into /opt/arcapix/etc/arcapix/search/authserver
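As a minimal sketch of the mapping described above, the following Python writes a property file by hand. The helper names property_path and write_property are hypothetical (they are not part of any ArcaPix library), and the demo writes into a temporary directory rather than /opt/arcapix/etc.

```python
import json
import tempfile
from pathlib import Path


def property_path(base_dir, name):
    """Map a dot-separated property name to its file under base_dir."""
    return Path(base_dir).joinpath(*name.split("."))


def write_property(base_dir, name, value):
    """Write value, encoded as JSON, to the file representing the property."""
    path = property_path(base_dir, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps adds the double quotes that a raw string would be missing.
    path.write_text(json.dumps(value))
    return path


# Demonstration against a throwaway directory (an assumption for safety).
demo_base = tempfile.mkdtemp()
written = write_property(demo_base, "arcapix.search.authserver", "https://localhost/")
print(written.read_text())
```

Encoding through json.dumps rather than writing the string directly is what guarantees the quoting requirement noted under "Important".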
Location¶
The default location for configuration values is /opt/arcapix/etc/.
However, this can be overridden by setting the ARCAPIX_CONFIG environment variable to point to a suitable directory - this may be useful, for example, to centralise configuration values.
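The lookup order described here can be sketched in a few lines of Python. This is an illustration only: config_base_dir is a hypothetical helper, and treating an empty ARCAPIX_CONFIG as "not set" is an assumption rather than documented behaviour.

```python
import os

DEFAULT_CONFIG_DIR = "/opt/arcapix/etc/"


def config_base_dir(environ=None):
    """Return the directory that configuration values would be read from."""
    environ = os.environ if environ is None else environ
    # Assumption: an empty ARCAPIX_CONFIG falls back to the default location.
    return environ.get("ARCAPIX_CONFIG") or DEFAULT_CONFIG_DIR


print(config_base_dir({}))
print(config_base_dir({"ARCAPIX_CONFIG": "/srv/shared/arcapix-etc"}))
```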
Defaults¶
The system does not have any “default” values - all configuration values must be explicitly set. This is normally done as part of the installation process.
Using Python to write the config values¶
One of the most convenient methods for setting configuration values is to use the supplied Python library.
from arcapix.config import config
config["arcapix.search.authserver"] = "https://localhost"
Configuration properties¶
The following properties are currently defined for search.
Note
There may be other properties present in the /opt/arcapix/etc/ hierarchy - these are for other ArcaPix products not connected with search.
arcapix.alerts.customer_site¶
If something goes wrong during ingest, an email alert may be sent (see below)
This config can be used to specify the name of the customer site where the ingest was running.
This will appear in the subject line of the email alert e.g. [PxS] APSearch Ingest Error @ Example LTD, UK
This makes it easier to filter email alerts when ingest is being run on multiple clusters.
Type | String |
Suggested default value | Not set |
arcapix.alerts.email.from¶
If something goes wrong during ingest, an email alert may be sent (see below)
This config specifies the email address these alert emails should be sent from (replyto)
Type | String |
Suggested default value | “root@localhost” |
arcapix.alerts.email.to¶
If something goes wrong during ingest, an email alert may be sent. This config specifies a list of email addresses to send alerts to.
If the config is None or an empty list, no email alerts will be sent.
Type | List of Strings |
Suggested default value | [] |
arcapix.auth.server.groupmapping¶
Dictionary of roles to lists of groups, used to provide a list of “scopes” which an authenticated user should have. This list is available either at initial token creation, or via the introspection endpoint.
Each key in the dictionary is a role(scope) name, and the value is a list of unix groups. A user who is a member of one of those groups will gain the matching “role” in the scope descriptor for the token.
The special group “*” means that this role will be added to all correctly authenticated users
The role mapping is shared across multiple PixStor products, so might contain roles which do not apply to search.
The following roles apply to PixStor Search:
user: provides read-only access to the REST API
updater: provides read-write access to the REST API
bypass_file_access_security: users with this role can see all files in search results, even if they don’t have permission to view or read the original file
forbid_read_file_access_security: users with this role are treated as if they don’t have permission to ‘read’ a file. This means search results will only show basic metadata and no proxies (thumbnails, preview)
disable_proxy_view: users with this role cannot see proxies in search results. Unlike the ‘forbid read’ role, the user may still see deep metadata if they have read access to a given file
The special broker user must have the updater role. Without it, ingest will fail.
The user performing ingest doesn’t need the updater role.
The roles are supplied from the authentication service - search does not read them directly. Therefore, if you are using an external authentication service which is not the apcore-auth daemon, it must be configured to provide the similarly named scopes
Type | Mapping |
Suggested default value | {“user”: [“*”], “updater”: [“broker”]} |
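To illustrate how such a mapping translates group membership into roles, here is a small Python sketch. It is not the auth server's implementation; roles_for_groups is a hypothetical helper that simply applies the rules stated above (membership of any listed group grants the role, and "*" grants it to everyone).

```python
def roles_for_groups(group_mapping, user_groups):
    """Return the sorted role names granted to a user in the given unix groups."""
    groups = set(user_groups)
    roles = []
    for role, allowed_groups in group_mapping.items():
        # "*" grants the role to every correctly authenticated user.
        if "*" in allowed_groups or groups.intersection(allowed_groups):
            roles.append(role)
    return sorted(roles)


mapping = {"user": ["*"], "updater": ["broker"]}

print(roles_for_groups(mapping, ["staff"]))
print(roles_for_groups(mapping, ["broker"]))
```

With the suggested default mapping, an ordinary user gains only the user role, while a member of the broker group gains both user and updater.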
arcapix.rotation.directory¶
Directory to use for ‘rotation’ files.
Rotation files are used, for example, for incremental ingest to record when the last ingest occurred
Type | String |
Suggested default value | “/mmfs1/.rotate” |
arcapix.search.finder.lock.directory¶
Directory in which lock files are stored for the finder (ingest). Lock files prevent multiple concurrent ingests on the same filesystem.
Note - the user running finder must have write permission for the configured directory.
Type | String |
Suggested default value | “/var/lock” |
arcapix.search.finder.parse_date_order¶
The finder tool --since and --until flags can be passed date/times in a variety of formats.
This configuration controls how dates are parsed when they are ambiguous - e.g. for 10/11/12
DMY -> 10th Nov 2012
MDY -> 11th Oct 2012
YMD -> 12th Nov 2010
If not configured, search will attempt to infer the correct date order for the current locale. If it can’t be inferred, the default will be MDY.
In general, it’s better to use unambiguous dates.
Type | String |
Suggested default value | <NOT SET> |
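The effect of the three orderings on the 10/11/12 example can be reproduced with a short Python sketch. This is not the parser finder uses; parse_ambiguous_date is a hypothetical helper, and mapping two-digit years into the 2000s is an assumption made for the demonstration.

```python
from datetime import date


def parse_ambiguous_date(text, order):
    """Parse a slash-separated date using an order such as "DMY", "MDY" or "YMD"."""
    parts = [int(part) for part in text.split("/")]
    fields = dict(zip(order, parts))  # e.g. {"D": 10, "M": 11, "Y": 12}
    year = fields["Y"]
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 2000s
    return date(year, fields["M"], fields["D"])


for order in ("DMY", "MDY", "YMD"):
    print(order, parse_ambiguous_date("10/11/12", order).isoformat())
```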
arcapix.search.finder.pid.directory¶
Directory in which process files are stored for ingest.
Process files are used to facilitate the stop functionality.
Note - the user running ingest must have write permission for the configured directory.
Type | String |
Suggested default value | “/var/run” |
arcapix.search.finder.time_field¶
The file time field which should be compared against for the --since/--until finder flags, and for incremental ingest.
One of:
atime: file access time
ctime: file change time, affected by file status/metadata change, such as change of file owner, as well as data modification
mtime: file modification time, affected by data modification
For example, if configured as mtime, incremental ingest will only ingest files with modification time more recent than when the last ingest was run.
If you change this setting, metadata may be inaccurate for files already ingested.
For example, switching from mtime to ctime may mean that file permissions are not up to date.
The database can be brought up to date with a ‘stat-only’ ingest of the already ingested files.
Note - modification time (mtime) can be changed arbitrarily e.g. with touch.
If a file’s mtime were changed to some time prior to the last ingest, it wouldn’t be picked up by the next incremental ingest.
Warning
If this config is set to ctime, make sure arcapix.search.metadata.mimetype.use_filemagic is set to False.
Filemagic will update the ctime on all the files it checks - potentially all ingested files.
Type | String |
Suggested default value | “mtime” |
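The caveat about mtime being changed arbitrarily can be demonstrated with the standard library. The sketch below is illustrative only: changed_since is a hypothetical stand-in for the comparison incremental ingest performs, and the timestamps are made-up epoch values.

```python
import os
import tempfile

TIME_FIELDS = {"atime": "st_atime", "ctime": "st_ctime", "mtime": "st_mtime"}


def changed_since(path, last_ingest, time_field="mtime"):
    """True if the chosen file time is more recent than last_ingest (epoch seconds)."""
    return getattr(os.stat(path), TIME_FIELDS[time_field]) > last_ingest


# Create a file, then move its atime and mtime back, much as touch could.
handle, demo_path = tempfile.mkstemp()
os.close(handle)
last_ingest = 2000.0  # hypothetical time of the previous ingest
os.utime(demo_path, (1000.0, 1000.0))

picked_up_by_mtime = changed_since(demo_path, last_ingest, "mtime")
picked_up_by_ctime = changed_since(demo_path, last_ingest, "ctime")
print(picked_up_by_mtime, picked_up_by_ctime)
```

Because the file's mtime now predates the last ingest, an mtime-based comparison skips it, whereas the ctime-based comparison still sees the change.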
arcapix.search.ingest.max_flush_behind¶
When running a bulk insert ingest, this config controls the maximum number of files to group into a single POST request.
Setting a smaller value will result in more POST requests which can introduce a lot of overhead. Setting a larger value leads to larger requests, which may time out or be rejected by the REST server. A larger value may also not take full advantage of concurrent asynchronous updates.
When more plugins are enabled, the payload per file will be bigger, so a smaller max_flush_behind setting may be required.
Type | Integer |
Suggested default value | 1000 |
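The trade-off between request count and request size can be seen in a small batching sketch. This is not the broker's code; batches is a hypothetical helper that groups items into lists of at most max_flush_behind entries, each of which would correspond to one POST request.

```python
from itertools import islice


def batches(items, max_flush_behind):
    """Yield successive lists containing at most max_flush_behind items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, max_flush_behind))
        if not batch:
            return
        yield batch


# 2500 hypothetical files with the suggested limit of 1000 per request.
request_sizes = [len(batch) for batch in batches(range(2500), 1000)]
print(request_sizes)
```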
arcapix.search.ingest.max_flush_threads¶
The maximum number of concurrent asynchronous POST threads per broker process.
The number of broker processes is controlled by the ‘threads’ policy option (-m), so the maximum total concurrent POST requests is given by policy threads * max flush threads.
Note - too many concurrent POST requests may overload the REST server or elasticsearch, and could cause requests to be rejected or could even cause elasticsearch to crash.
In practice, the number of POST requests that can be processed concurrently will be limited by the number of REST server processes (configured via uwsgi).
Beyond that number, additional POST requests will block, but the broker will continue to process files until max_flush_threads is reached.
Type | Integer |
Suggested default value | 10 |
arcapix.search.ingest.max_flush_tries¶
The maximum number of times to try POSTing metadata to the REST server, in the event of a POST error.
Retries use a powers-of-2 exponential backoff, with jitter.
If flush doesn’t succeed, the error will be raised, to be (potentially) handled by the broker.
See the finder --on-error flag.
If not set, the default 5 attempts will be used. If set to 0, flush attempts will not be retried.
Type | Integer |
Suggested default value | 5 |
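A retry loop of this general shape can be sketched as follows. The exact delay formula and jitter used by ingest are not documented here, so the one below (2 to the power of the attempt number, plus up to one second of random jitter) is an assumption, as is treating 0 as "a single attempt"; flush_with_retries and flaky_post are hypothetical names.

```python
import random
import time


def flush_with_retries(post, max_tries=5, sleep=time.sleep, rng=random.random):
    """Call post() until it succeeds or max_tries attempts have been made."""
    attempts = max(1, max_tries)  # assumption: 0 means one attempt, no retries
    for attempt in range(attempts):
        try:
            return post()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller handle the error
            # Powers-of-2 backoff plus jitter (assumed form of the jitter).
            sleep(2 ** attempt + rng())


calls = []
delays = []


def flaky_post():
    """Simulated POST that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated POST failure")
    return "ok"


# Record delays instead of sleeping, and disable jitter for a repeatable demo.
result = flush_with_retries(flaky_post, sleep=delays.append, rng=lambda: 0.0)
print(result, len(calls), delays)
```

Passing sleep and rng as parameters keeps the sketch testable without real waiting.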
arcapix.search.ingest.reset_times¶
If ingest reads a file, for example to generate proxies, the file’s access time (atime) will be updated.
With reset_times enabled, the access time (atime), change time (ctime), and modification time (mtime) will be reset to their values from before ingest.
Warning
Resetting times only works when ingest is running as root.
If this configuration is enabled and ingest is not run as root, ingest will exit with an error.
If ingest can’t be run as root, performing a ‘stat-only’ ingest will prevent file times being updated.
Type | Boolean |
Suggested default value | False |
arcapix.search.jobs.prune_after_days¶
Some search jobs, such as ingest, generate ‘ps’ files.
These ps files store metadata about the job, such as what is being ingested, when it started, its current status.
They form the basis for the searchctl jobs command.
The ps files are stored in the directory specified in the arcapix.search.finder.pid.directory config.
When a job completes, its ps file is kept as a historical record. These files are automatically pruned some number of days after completion - as specified by this config.
A larger number of days means a longer record of past jobs.
On the other hand, this may fill up the pid directory, and may make the jobs command unwieldy and slow.
Type | Integer |
Suggested default value | 90 |
arcapix.search.logs.path¶
The location where asynchronous jobs write their log output. Sub-directories will be created to ensure that no one directory is overloaded.
Type | String |
Suggested default value | “/mmfs1/.policy_tmp/condor/logs” |
arcapix.search.metadata.broker.username¶
The username which the broker will use to authenticate to the system. In theory, this could be a system user, but in practice, it makes sense to use one from the arcapix.search.server.authserver.passwords property described below.
Type | String |
Suggested default value | “broker” |
arcapix.search.metadata.broker.flush_on_async¶
By default, ingest will send metadata to the REST server asynchronously.
But if ingest encounters an asynchronous plugin - that is, a plugin which indicates that it submits a job to the job engine, such as for a long running proxy generation - then the ingest needs to make sure any asynchronous metadata updates have completed before the asynchronous plugin is run. This means waiting for those asynchronous metadata updates to complete.
By setting this config to False, ingest will not wait for asynchronous metadata updates to complete.
This is risky because, if the asynchronous plugin tries to update an item which hasn’t been inserted into the search database yet, the update from the asynchronous plugin will be rejected.
However, if you can guarantee that the asynchronous plugin job won’t run ‘for some time’ - for example, if the job engine were configured to only run jobs at night - then you may be safe to disable flush_on_async.
In that case, ingest is likely to run much faster.
Type | Boolean |
Suggested default value | True |
arcapix.search.metadata.broker.skip_not_modified¶
Ingest records the timestamp of the last successful ingest for each file and plugin.
This can be used to determine if a file needs to be reprocessed by each given plugin i.e. if the file hasn’t changed since the last time it was successfully processed, then we don’t need to process it again.
Similarly, if a particular plugin failed in a prior ingest and all other plugins succeeded, this checking will make it so only that ‘failed’ plugin will be re-run for the file.
This setting controls whether that last successful ingest checking is performed.
When enabled, it can save time by not performing heavy-weight processing - such as proxy generation - unless necessary.
Note
In ‘lite mode’ (aka ‘stat-only’ ingest), the overhead of checking the last ingest time can outweigh the time it would take to just re-ingest everything.
Similarly, if you know all or most of the files being ingested haven’t been ingested before, or if you’re running an incremental ingest, where only modified files are ingested, it’s a good idea to disable this setting to save on unnecessary overhead.
This config can be overridden on a per-ingest basis using the --skip-not-modified/--no-skip-not-modified flags on finder ( add | update ).
Type | Boolean |
Suggested default value | False |
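The per-plugin check described above might look roughly like the following Python sketch. It is an illustration under assumptions, not the broker's logic: plugins_to_run is a hypothetical helper, the plugin names are invented, and timestamps are plain numbers.

```python
def plugins_to_run(file_mtime, last_success, plugin_names, skip_not_modified=True):
    """Return the plugins that still need to process a file.

    last_success maps plugin name to the time of its last successful ingest
    of this file; a missing entry means it has never succeeded.
    """
    if not skip_not_modified:
        return list(plugin_names)  # checking disabled: run everything
    return [
        name
        for name in plugin_names
        if last_success.get(name) is None or last_success[name] < file_mtime
    ]


plugins = ["stat", "thumbnail"]  # hypothetical plugin names
# "stat" succeeded at t=200; "thumbnail" failed previously, so has no record.
print(plugins_to_run(100, {"stat": 200}, plugins))
# The file was modified at t=300, after the last success, so both run again.
print(plugins_to_run(300, {"stat": 200}, plugins))
```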
arcapix.search.metadata.broker.ignore_extension¶
By default, ingest will check if a plugin handles a file based on file extension and mimetype.
Sometimes the file extension is incorrect/doesn’t match the contents of the file. When this happens, the plugin is likely to fail in processing the file and will return an error.
Setting this config to True will cause plugins to match on mimetype only.
When extensions are ignored, one or more of the mimetype introspection methods should be enabled, or else the only plugins which will run will be those which handle any file.
As noted under those configs, the different mimetype introspection methods have potential side-effects, such as updating the file’s access time (atime).
Type | Boolean |
Suggested default value | False |
arcapix.search.metadata.exiftool.options¶
A list of flags to be passed to the exiftool command by default. These will be appended to any flags which are passed explicitly.
The exiftool command is used by various core plugins for metadata extraction. If a file is larger than ~4GB, exiftool won’t perform metadata extraction unless large file support is enabled.
This config can be used to enable large file support by default:
["-api", "largefilesupport=1"]
Note - if large file support is enabled, arcapix.search.metadata.exiftool.maxsize should be disabled.
Type | List of Strings |
Suggested default value | [] |
arcapix.search.metadata.exiftool.maxsize¶
Exiftool may attempt to read the whole file while performing introspection. This can be slow for very large files.
Any file bigger than the limit set for this config won’t be introspected by exiftool. An error will be returned instead.
Note - this limit will also apply when using exiftool for mimetype checking with arcapix.search.metadata.mimetype.use_exiftool_slow
Config value is file size in bytes. If not set or set to ‘null’, no limit will be imposed.
Type | Integer |
Suggested default value | 100 * 1024 * 1024 |
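Since property files hold JSON, the arithmetic in the suggested default cannot be written into the file as-is; the value has to be stored as a plain integer. A short Python sketch of the values that would end up in the file:

```python
import json

# The suggested limit of 100 MiB, expressed as a single integer.
max_size_bytes = 100 * 1024 * 1024

size_literal = json.dumps(max_size_bytes)
no_limit_literal = json.dumps(None)  # JSON spelling of "no limit"

print(size_literal)
print(no_limit_literal)
```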
arcapix.search.metadata.exiftool.timeout¶
Timeout to use when introspecting a file with exiftool. The timeout prevents the exiftool subprocess from hanging, and blocking ingest from completing.
Timeout value is in seconds.
Type | Integer |
Suggested default value | 60 |
arcapix.search.metadata.mimetype.use_exiftool¶
Exiftool offers an alternative to file magic for looking up a file’s mimetype.
If file magic is used (see below) and fails to identify a file, exiftool might be used as a fallback. Exiftool can sometimes identify more exotic file types when file magic fails.
However, using exiftool adds some overhead to ingest. And unlike file magic, exiftool will update a file’s access time (atime).
This config allows you to enable or disable the use of exiftool as a fallback.
Note - this config will call exiftool with the -fast3 flag. This makes the lookup faster, but less accurate.
See use_exiftool_slow below.
Type | Boolean |
Suggested default value | True |
arcapix.search.metadata.mimetype.use_exiftool_slow¶
use_exiftool will call exiftool with the -fast3 flag. This makes the lookup faster, but less accurate.
This mimetype checker will call exiftool without the -fast3 flag. This may allow it to correctly identify the file’s mimetype, but it is likely to be slower, and for large files may use a lot of memory.
In particular, significant memory use has been observed when calling exiftool with large PSD files.
Note - the arcapix.search.metadata.exiftool.maxsize config applies to this lookup, meaning files larger than this limit will be skipped, returning no mimetype.
This is treated as a separate checker from use_exiftool, so that both can be enabled.
In that case, the faster use_exiftool will be tried first, and use_exiftool_slow will only be called if it fails.
Type | Boolean |
Suggested default value | False |
arcapix.search.metadata.mimetype.use_filemagic¶
File magic is used to identify the mimetypes of files.
It is lightweight, and doesn’t change file access time (atime) (when ingest is running as root), unlike exiftool, but may fail to identify more exotic file types.
Typically you would use this config in combination with use_exiftool (see above) to disable all mimetype introspection as part of a ‘stat-only’ ingest.
Note
Even with no mimetype introspection, some non-stat-only plugins may process the file based on file extension. For a truly stat-only ingest, you also need to disable all plugins - see Plugin Installation.
However, it is also valid to disable filemagic and enable exiftool as the sole mimetype lookup.
Warning
Filemagic will update the change time (ctime) on any file it checks.
This config should not be enabled if arcapix.search.finder.time_field is set to ctime.
Suggested default value | True |
arcapix.search.metadata.mimetype.use_lookup_file¶
For maximum speed, at the expense of a certain degree of accuracy, search can simply use the system/apache mime.types file to provide a rudimentary translation between file extension and mimetype, without performing any file fingerprinting.
This may mean that some files which do not have extensions are not correctly ingested, and any file which is mis-named will also cause incorrect data to be ingested. For increased coverage, activate exiftool or filemagic as well.
This config allows you to enable or disable the use of mime.types as a method for determining file mime types.
Type | Boolean |
Suggested default value | False |
arcapix.search.metadata.mimetype.mime_types_file¶
This config sets which file is used by use_lookup_file to map file extensions to mimetypes. The file should be in the standard apache/system format, i.e. contain one or more lines, each with a mimetype followed by one or more extensions, all space delimited. Portions of lines after ‘#’ are ignored.
The following are all valid entries:
image/jpeg jpg jpeg
# A comment
video/quicktime mov # A comment
Type | String |
Suggested default value | “/etc/mime.types” |
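The file format described above is simple enough to parse in a few lines. The following Python sketch is an illustration of the format only, not the parser search uses; parse_mime_types is a hypothetical helper, and the sample text is the set of valid entries shown above.

```python
def parse_mime_types(text):
    """Build an extension -> mimetype mapping from mime.types formatted text."""
    mapping = {}
    for line in text.splitlines():
        # Drop anything after '#', then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or mimetype with no extensions
        mimetype, extensions = fields[0], fields[1:]
        for extension in extensions:
            mapping[extension] = mimetype
    return mapping


sample = """image/jpeg jpg jpeg
# A comment
video/quicktime mov # A comment
"""

extension_map = parse_mime_types(sample)
print(extension_map)
```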
arcapix.search.metadata.plugins.path¶
The path into which user plugins are installed. See Plugin Installation for more information.
Type | String |
Suggested default value | “/opt/arcapix/usr/share/apsearch/plugins” |
arcapix.search.metadata.plugins.<plugin::Name>.maxsize¶
Set a maximum size for what files are handled by a given plugin.
Plugin name is of the form module::ClassName - e.g.
arcapix.search.metadata.plugins.imagepreview::CoreImageThumbnail.maxsize
Config value is file size in bytes. If set to 0 or None for a given plugin, no limit is imposed.
For most plugins, if this config is not set, no limit will be imposed. However, some individual plugins may specify a specific default limit.
Type | Integer |
Suggested default value | <not set> |
arcapix.search.metadata.plugins.<plugin::Name>.max_process_time¶
Set a maximum amount of time that a given plugin can spend processing a file.
Plugin name is of the form module::ClassName - e.g.
arcapix.search.metadata.plugins.imagepreview::CoreImageThumbnail.max_process_time
Config value is time in seconds. If not set for a given plugin, no limit is imposed. This is the default behaviour.
Type | Integer |
Suggested default value | <not set> |
arcapix.search.monitor.exclude¶
A list of patterns that should always be excluded. Patterns are the same as would be passed to the monitor command line tool.
This config can be used to exclude files, directories, or file types that should never be ingested, e.g.
File: /mmfs1/path/to/file
Directory: /mmfs1/directory/*
File Type: *.tmp
Paths can be either absolute or relative (to the ingest target directory)
Type | List of Strings |
Suggested default value | [“.policytmp/”, “.ctdb/”] |
arcapix.search.proxies.image_converters¶
Specifies what should be used to convert images when creating proxies.
Images will be loaded with each listed converter, in order, until one is found which can read the file without errors.
Available converters:
pillow: uses the Python builtin Pillow (PIL) library
imagemagick: converts the image to jpeg by shelling-out to ImageMagick (convert), then loads the result using Pillow
Note: if Pillow-SIMD is installed in place of Pillow, the ‘pillow’ converter will use it.
Type | List of Strings |
Suggested default value | [“pillow”, “imagemagick”] |
arcapix.search.proxies.inherit_acls¶
Controls the permissions applied to files and directories in the proxy store.
If set to True, ACLs will be those automatically applied (inherited) from arcapix.search.proxies.path.
If False, files will be explicitly chmod-ed to mode 640, and directories to 750.
Type | Boolean |
Suggested default value | False |
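For the False case, the explicit modes correspond to the octal values 0o640 and 0o750. The sketch below applies them to a temporary directory tree on a POSIX system; it is only an illustration of what those modes mean, apply_explicit_modes is a hypothetical helper, and the directory layout is invented.

```python
import os
import stat
import tempfile

FILE_MODE = 0o640  # owner read/write, group read
DIR_MODE = 0o750   # owner full access, group read/traverse


def apply_explicit_modes(root):
    """chmod every directory under root to 750 and every file to 640."""
    for dirpath, _dirnames, filenames in os.walk(root):
        os.chmod(dirpath, DIR_MODE)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), FILE_MODE)


# Build a small throwaway tree standing in for a proxy store.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "ab", "cd"))
demo_file = os.path.join(demo_root, "ab", "cd", "thumbnail.jpg")
with open(demo_file, "wb") as handle:
    handle.write(b"")

apply_explicit_modes(demo_root)
file_mode = stat.S_IMODE(os.stat(demo_file).st_mode)
dir_mode = stat.S_IMODE(os.stat(os.path.dirname(demo_file)).st_mode)
print(oct(file_mode), oct(dir_mode))
```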
arcapix.search.proxies.path¶
The location where proxy files will be stored. Ideally, this should be on the same filesystem/fileset as the proxy generators work with.
Type | String |
Suggested default value | “/mmfs1/.proxies” |
arcapix.search.proxies.preview.size¶
The size in pixels (width, height) which preview proxies are made at by the standard proxy generator tool. NB. User created proxies need not be this size (though they may be resized for display in the UI if not).
Type | 2-tuple/list of integers |
Suggested default value | (400, 300) |
arcapix.search.proxies.thumbnail.size¶
The size in pixels (width, height) for thumbnails.
Type | 2-tuple/list of integers |
Suggested default value | (150, 150) |
arcapix.search.proxies.retain¶
Specify whether proxy files should be kept when their corresponding originals are deleted from the filesystem/database.
Type | Boolean |
Suggested default value | False |
arcapix.search.proxies.workdir¶
The location where proxy files are generated (for builtin plugins). Once successfully generated, the proxies are moved to the proxy store.
Ideally, this should be on the same filesystem/fileset as the proxy store.
Type | String |
Suggested default value | “/mmfs1/apsearch/proxies/.proxytmp” |
arcapix.search.server.authserver.passwords¶
A dictionary of username/password pairings which the authentication server will also consider to be valid for the system. These can be used by automated tools to enable them to authenticate without a system account.
Type | Dictionary of string => string key/values |
Suggested default value | {“broker”: “7dsfdsfsrdsfsadcvxc98ds98SEg4tn,slicvxc”} (Password should be any randomly generated string.) |
Warning
The filesystem file associated with this property should be properly secured using file system permissions, such that only the automated tools and the auth server itself can read it. Otherwise, any user who can read the file can authenticate.
Warning
Do not use the password included in the suggested default value above - you should change it to some other suitable randomised string.
arcapix.search.server.authserver.url¶
Deprecated since version 0.8: Use arcapix.auth.server.url instead
This value defines the server which the search server can use to verify access tokens. This server _should_ use https: rather than http:, but this is not mandated.
If the auth server is behind a reverse proxy, such as nginx, this config should point to the proxied url.
Type | String |
Suggested default value |
(This url will have /oauth2/token_info appended to it automatically.) |
arcapix.search.server.searchserver.url¶
This value defines the server which the broker will use to update the metadata in the database. Often it is useful to have this point at the “nearest” instance of the search server. If both the broker and search server are running on all nodes, using localhost is suitable.
Alternatively, if the search server is behind a reverse proxy, such as nginx, it might make sense to set this to the proxied url. That way, the broker might take advantage of load-balancing.
Type | String |
Suggested default value | “https://localhost/api” |
arcapix.search.utils.security.disable¶
This property determines whether file access security is enabled. With this option disabled, any authenticated user can see metadata and proxies for any file on the system, not just those which they would have access to.
Type | Boolean |
Suggested default value | False |
Warning
It is strongly recommended that this key should be secured against writing by non-root users, since otherwise they can disable the security checks, assuming they can trigger a service restart
executables.exiftool¶
Metadata extraction for images and videos makes use of the exiftool executable. This property allows the location to be set. Depending on whether you have installed a system version, or compiled a specific one, the location may change.
Type | String |
Suggested default value | “/usr/bin/exiftool” |
executables.ffprobe¶
Metadata extraction for videos makes use of the ffprobe executable. This property allows the location to be set. Depending on whether you have installed a system version, or compiled a specific one, the location may change.
Type | String |
Suggested default value | “/usr/bin/ffprobe” |