5.8. Search¶
Note
Before you can use the search feature, additional set-up is necessary, as described in the search feature page.
The search endpoint provides the ability to search for files across multiple sites, and aggregate the results.
Search is performed in two steps - submitting a query, and retrieving the results.
5.8.1. Submitting a query¶
Search performs a query by submitting asynchronous tasks to each requested site. The sites then perform the actual search and return results as available.
A search is initiated by `POST`ing a query to the search endpoint:
curl -s -X POST 'http://example.com/api/search/' -H 'Accept: application/json' -H 'Content-Type: application/json' -H "Authorization: Api-Key $TOKEN" -d '{"path": "/mmfs1/data", "sites": ["site1"], "recursive": true, "filters": {"hsm.status": "migrated"}}'
The search request payload is made of:

| name | description |
|---|---|
| `path` | Directory to query |
| `sites` | List of one or more sites to search. Default: all sites |
| `recursive` | Whether to search the directory recursively |
| `filters` | A collection of filters against arbitrary metadata, see below. Default: None |
| `metadata_fields` | A list of metadata fields to include in search results. This can include specific field names (e.g. `core.size`) as well as wildcards |
| `merge` | If the same file exists on multiple sites, this will cause them to be merged in the results (see below). Default: false |
Upon successful submission, the request will return status `201 (Created)`, and a response body which includes the URL for retrieving search results (see below):
{"id":1,"url":"http://example.com/api/search/1/"}
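The same request can be built programmatically. A minimal sketch in Python using only the standard library; the server address and API key below are placeholders, and the request is constructed but not sent:

```python
import json
import urllib.request

# Placeholder server and API key - substitute your own
BASE = "http://example.com/api"
TOKEN = "my-api-key"

payload = {
    "path": "/mmfs1/data",
    "sites": ["site1"],
    "recursive": True,
    "filters": {"hsm.status": "migrated"},
}

req = urllib.request.Request(
    f"{BASE}/search/",
    data=json.dumps(payload).encode(),
    headers={
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Api-Key {TOKEN}",
    },
    method="POST",
)
# urllib.request.urlopen(req) would submit the query; the decoded
# response body would look like the example shown above
```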
5.8.1.1. Filters¶
Filters are a collection of constraints applied to arbitrary file metadata.
The specific metadata available to be filtered on depends on the search backend being used. The fields in the following examples may not be available for all backends.
At a minimum, one can expect to be able to filter on `core.filename`, the file basename. For example, to match only JPEG files: `{"core.filename": "*.jpg"}`
Possible filter types are:

| type | description |
|---|---|
| exact match | match a value exactly |
| match list | match any of the values in the list (value1 OR value2 OR …) |
| wildcard | any string value containing an asterisk (`*`) is treated as a wildcard pattern |
| range | numerical or date range, using any combination of less-than (`<`) and greater-than (`>`) comparisons |
| negation | exclude anything matching a given filter |
Filters are combined as AND, e.g. `{"core.extension": ".jpg", "hsm.status": "migrated"}` matches `.jpg` files which are HSM migrated.
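A sketch of building a payload that combines several filter types (the field names are the ones used in the examples in this section; the exact set available depends on your search backend):

```python
import json

# All filters in a request are ANDed together by the server.
# This payload combines an exact match, a match list, and a wildcard.
filters = {
    "hsm.status": "migrated",             # exact match
    "core.extension": [".jpg", ".jpeg"],  # match list: .jpg OR .jpeg
    "core.filename": "img_*",             # wildcard: contains an asterisk
}

payload = json.dumps({"path": "/mmfs1/data", "filters": filters})
```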
5.8.2. Retrieving results¶
When search results are ready, they can be retrieved using the URL returned when the query was submitted:
$ curl 'http://example.com/api/search/1/' -H "Authorization: Api-Key $TOKEN"
{
"count": 1,
"next": null,
"previous": null,
"items": [
{
"href": "http://example.com/api/file/?path=%2Fmmfs1%2Fdata%2Fhello.txt&site=site1",
"site": "site1",
"path": "/mmfs1/data",
"name": "hello.txt",
"metadata": {
"core.accesstime": "2021-10-12T16:27:28",
"core.changetime" : "2021-10-12T16:28:45",
"core.directory" : "/mmfs1/data",
"core.extension" : ".txt",
"core.filename" : "hello.txt",
"core.group.id" : 0,
"core.group.name" : "root",
"core.hash.sha512": "db3974a97...94d2434a593",
"core.modificationtime" : "2021-10-12T16:28:45",
"core.pathname": "/mmfs1/data/hello.txt",
"core.size" : 12,
"core.user.id" : 0,
"core.user.name" : "root",
"gpfs.filesetname" : "root",
"gpfs.filesystem" : "mmfs1",
"gpfs.kballocated" : 0,
"gpfs.poolname" : "sas1",
"hsm.status" : "migrated",
"ngenea.pathname" : "data/hello.txt",
"ngenea.size" : 12,
"ngenea.target" : "awss3",
"ngenea.uuid": "acf1a307-5b6a-43b0-8fb2-d2b366e88008"
}
}
],
"metadata_fields": ["core.accesstime", ...],
"complete": true,
"errors": {"site2": "Search backend is offline"}
}
Results from different sites may not arrive at the same time. The `complete` field indicates whether all sites have returned their results. A site which returns an error still counts as having returned.
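Since sites report asynchronously, a client typically polls the results URL until `complete` is true. A minimal sketch, with the HTTP fetch abstracted into a callable so the loop can be exercised without a server:

```python
import time

def poll_results(fetch, interval=2.0, max_attempts=30):
    """Call fetch() until the body reports complete=True.

    fetch is any callable returning the decoded JSON body of
    GET /api/search/<id>/ - e.g. a thin urllib wrapper.
    """
    for _ in range(max_attempts):
        results = fetch()
        if results.get("complete"):
            return results
        time.sleep(interval)
    raise TimeoutError("search did not complete in time")

# Simulated polling: the first response is incomplete, the second is done
responses = iter([
    {"complete": False, "items": []},
    {"complete": True, "items": [{"name": "hello.txt"}]},
])
final = poll_results(lambda: next(responses), interval=0)
```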
Results from different sites are ‘concatenated’, meaning if the same file exists on multiple sites, there will be separate result items for the file for each site.
The `metadata` field on each item contains arbitrary file metadata. The specific metadata will vary depending on the search backend being used. In the case of the PixStor Search backend, the available fields will vary depending on file type, and which plugins were used when the files were ingested.
If `metadata_fields` was specified when the query was submitted, the `metadata_fields` entry in the response will match, with any wildcards expanded to list the available fields which match those wildcards. Otherwise, the `metadata_fields` entry will list all the available metadata fields which could be returned from the search backend. Individual files may not have all the listed fields.
All search backends format results to be namespaced, similar to PixStor Search, for consistency.
If an error occurs while performing the search on any of the sites, the `errors` entry will provide a mapping of site names to error messages.
5.8.2.1. Parameters¶
Search results are paginated. The following parameters can be used to control what results are returned:

| name | description |
|---|---|
| `page` | Numbered page of results to fetch. Default: 1 |
| `page_size` | Maximum number of results to return per page. Default: 20 |
| `ordering` | One or more fields to sort results on, separated by commas |
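These parameters are ordinary query-string arguments. A sketch of building a paged URL and walking every page via the `next` link (the response shape follows the example above; the simulated pages stand in for HTTP fetches):

```python
from urllib.parse import urlencode

# Build a results URL with explicit paging
url = "http://example.com/api/search/1/?" + urlencode({"page": 2, "page_size": 50})

def all_items(fetch, url):
    """Follow 'next' links, yielding every item across all pages.

    fetch maps a URL to the decoded JSON body of that page.
    """
    while url:
        page = fetch(url)
        yield from page["items"]
        url = page["next"]

# Simulated two-page result set, mirroring the paginated response shape
pages = {
    "page1": {"items": [{"name": "a.txt"}], "next": "page2"},
    "page2": {"items": [{"name": "b.txt"}], "next": None},
}
names = [item["name"] for item in all_items(pages.get, "page1")]
```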
5.8.2.2. Merged results¶
When a search is submitted with `"merge": true`, the search results will be 'merged'. This means that entries for matching files from different sites will be combined. An entry is considered to be matching if it has the same full path.
$ curl 'http://example.com/api/search/2/' -H "Authorization: Api-Key $TOKEN"
{
"count": 1,
"next": null,
"previous": null,
"items": [
{
"path": "/mmfs1/data",
"name": "hello.txt",
"metadata": {
"core.accesstime": "2021-10-12T16:27:28",
"core.changetime" : "2021-10-12T16:28:45",
"core.directory" : "/mmfs1/data",
"core.extension" : ".txt",
"core.filename" : "hello.txt",
"core.group.id" : 0,
"core.group.name" : "root",
"core.hash.sha512": "db3974a97...94d2434a593",
"core.modificationtime" : "2021-10-12T16:28:45",
"core.pathname": "/mmfs1/data/hello.txt",
"core.size" : 12,
"core.user.id" : 0,
"core.user.name" : "root",
"gpfs.filesetname" : "root",
"gpfs.filesystem" : "mmfs1",
"gpfs.kballocated" : 0,
"gpfs.poolname" : "sas1",
"hsm.status" : "migrated",
"ngenea.pathname" : "data/hello.txt",
"ngenea.size" : 12,
"ngenea.target" : "awss3",
"ngenea.uuid": "acf1a307-5b6a-43b0-8fb2-d2b366e88008"
},
"status": {
"site1": true,
"site2": false
}
}
],
"metadata_fields": ["core.accesstime", ...],
"complete": true
}
Merged results no longer have the `site` and `href` fields. In their place is a `status` field, which maps sites to whether the file is 'resident' on that site.
A file is considered resident if the file is not migrated, or is premigrated (‘hydrated’). A file is considered not resident if the file is migrated (stubbed), or not present at all.
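Given a merged result item, the sites holding a resident copy can be read straight from `status`. A sketch using the example item above, reduced to the relevant fields:

```python
# A merged result item, reduced to the fields relevant here
item = {
    "path": "/mmfs1/data",
    "name": "hello.txt",
    "status": {"site1": True, "site2": False},
}

# Sites holding a resident (non-stubbed) copy of the file
resident_sites = [site for site, resident in item["status"].items() if resident]
```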
5.8.3. Max Results¶
There is a hard limit on the number of results returned, per site. By default, each site will return, at most, 200 results.
Fetching a lot of results makes queries slower and, since results are stored in the database, storing more results uses more space. On the other hand, the limit may mean some matches are not returned.
The maximum number of results per site is controlled by the `search_max_results` configuration - see Configuration for more info.
Result limiting is applied when the search query is submitted, not when results are retrieved. If you change `search_max_results`, you will need to resubmit your query to fetch any additional matches.
Note that some backends have a hard limit of 10,000 results.
5.8.4. Housekeeping¶
The results from a query are stored, so they can be retrieved multiple times without performing a new query.
However, over time, the files on each site will change, and the stored results may no longer accurately reflect the active file system.
Therefore, old results are periodically culled. The housekeeping process runs once a day, and removes results for any search which was submitted more than a week ago (by default). A different 'time-to-live' (TTL) can be set using the `search_result_ttl` configuration - see Configuration for more information.
Results can also be manually removed by performing a `DELETE` request against the given search result endpoint:
curl -X DELETE 'http://example.com/api/search/1/' -H "Authorization: Api-Key $TOKEN"
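The same deletion can be issued from Python; a minimal sketch with the standard library, where the server address and API key are placeholders and the request is constructed but not sent:

```python
import urllib.request

# Placeholder API key - substitute your own
req = urllib.request.Request(
    "http://example.com/api/search/1/",
    headers={"Authorization": "Api-Key my-api-key"},
    method="DELETE",
)
# urllib.request.urlopen(req) would remove the stored results
```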