# File

The `file` scraper creates config items from files in a local folder (or a git repository). You can use it to track changes in files like `/etc/hosts` or `/etc/passwd`, or to scrape service metadata stored in git.
See Kubernetes Files for scraping files inside running Kubernetes pods.
```yaml title="file-scraper.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: file-git-scraper
spec:
  file:
    - type: $.kind
      id: $.metadata.name
      url: github.com/flanksource/canary-checker?ref=076cf8b888f2dbaca26a7cc98a4153c154220a22
      paths:
        - fixtures/minimal/http_pass.yaml
```
## Scraper

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| `schedule` | Specify the interval to scrape in cron format. Defaults to every 60 minutes (see the example below). | `string` | |
| `full` | Set to `true` to extract changes & access logs from scraped configurations. Defaults to `false`. | `bool` | |
| `retention` | Settings for retaining changes, analysis and scraped items. | Retention | |
| `file` | Specifies the list of File configurations to scrape. | `[]File` | |
| `logLevel` | Specify the level of logging. | `string` | |
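A minimal sketch of a scheduled scrape, assuming a static `id` and `type` for a non-JSON file (the scraper name and cron expression are illustrative):

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: etc-hosts-scraper
spec:
  schedule: '*/30 * * * *' # scrape every 30 minutes instead of the 60-minute default
  file:
    - type: File # static type, since /etc/hosts isn't JSON
      id: /etc/hosts # static id for the same reason
      paths:
        - /etc/hosts
```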
## File

| Field | Description | Scheme |
|-------|-------------|--------|
| `paths`* | Specify paths to configuration(s) for scraping. | `[]glob` |
| `url` | Specify URL, e.g. a github repository, containing the configuration(s). | |
| `labels` | Labels for each config item (see the example below). | |
| `properties` | Custom templatable properties for the scraped config items. | |
| `tags` | Tags for each config item. Max allowed: 5. | |
| `transform` | Transform configs after they've been scraped. | |
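As an illustration, `labels` and `tags` attach metadata to every config item the scraper produces; the key/value pairs and glob below are hypothetical:

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: labeled-file-scraper
spec:
  file:
    - type: $.kind
      id: $.metadata.name
      labels:
        team: platform # illustrative label
      tags:
        env: dev # illustrative tag; at most 5 tags are allowed
      paths:
        - services/**/*.yaml # hypothetical glob
```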
## Mapping

Custom scrapers require you to define the `id` and `type` for each scraped item. For example, when you scrape a file containing a JSON array, where each array element represents a config item, you must specify the `id` and `type` for those items. You achieve this by using mappings in your custom scraper configuration, as in the sketch below.
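For instance, a hedged sketch: given a hypothetical file `fixtures/data/vehicles.json` whose `vehicles` key holds an array of config items, the `items` mapping (documented in the table below) extracts each element before the per-item `id` and `type` mappings are applied:

```json
{
  "vehicles": [
    { "reg_no": "A123", "kind": "Car" },
    { "reg_no": "B456", "kind": "Truck" }
  ]
}
```

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: vehicles-file-scraper
spec:
  file:
    - items: $.vehicles # extract each array element as its own config item
      id: $.reg_no # per-item id, evaluated against each element
      type: $.kind # per-item type
      paths:
        - fixtures/data/vehicles.json # hypothetical path
```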
| Field | Description | Scheme |
|-------|-------------|--------|
| `id`* | A static value or JSONPath expression to use as the ID for the resource. | |
| `name`* | A static value or JSONPath expression to use as the name for the resource. | |
| `type`* | A static value or JSONPath expression to use as the type for the resource. | |
| `class` | A static value or JSONPath expression to use as the class for the resource. | |
| `createFields` | A list of JSONPath expressions used to identify the created time of the config. If multiple fields are specified, the first non-empty value is used. | `[]jsonpath` |
| `deleteFields` | A list of JSONPath expressions used to identify the deleted time of the config. If multiple fields are specified, the first non-empty value is used. | `[]jsonpath` |
| `description` | A static value or JSONPath expression to use as the description for the resource. | |
| `format` | Format of the config item. Defaults to JSON; available options are JSON and properties. See Formats. | |
| `health` | A static value or JSONPath expression to use as the health of the config item. | |
| `items` | A JSONPath expression used to extract individual items from the resource. Items are extracted first, and then the ID, name, type and transformations are applied to each item. | |
| `status` | A static value or JSONPath expression to use as the status of the config item. | |
| `timestampFormat` | A Go time format string used to parse timestamps in `createFields` and `deleteFields` (see the sketch below). Defaults to RFC3339. | |
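A minimal sketch of timestamp mapping, assuming each scraped item carries `created` and `deleted_at` fields (both field names, and the layout, are illustrative):

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: timestamped-file-scraper
spec:
  file:
    - type: $.kind
      id: $.id
      createFields:
        - $.created # hypothetical field holding the creation time
      deleteFields:
        - $.deleted_at # hypothetical; the first non-empty value is used
      timestampFormat: '2006-01-02 15:04:05' # Go layout for "YYYY-MM-DD hh:mm:ss"
      paths:
        - data/*.json # hypothetical glob
```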
## Formats

### JSON

The scraper stores config items as `jsonb` fields in PostgreSQL.

Resource providers typically return JSON, e.g. `kubectl get -o json` or `aws --output=json`.

When you display the config, the UI automatically converts the JSON data to YAML for improved readability.
### XML / Properties

The scraper stores non-JSON files as JSON using:

```json
{ "format": "xml", "content": "<root>..</root>" }
```

You can still access non-JSON content in scripts using `config.content`.

The UI formats and renders XML appropriately.
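As an illustration of the `properties` format (the file name, static `id`, and path are hypothetical):

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: properties-file-scraper
spec:
  file:
    - type: Application
      id: app-settings # static id, since the file isn't JSON
      format: properties # parse key=value pairs instead of JSON
      paths:
        - config/app.properties # hypothetical path
```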
## Extracting Changes & Access Logs

Custom scrapers ingest changes & access logs from external systems when you enable the `full` option.

Every config is expected to have these three top-level fields:

- `config`
- `changes`
- `access_logs`

A config may contain additional fields, or omit some of these; in either case, only these three fields are extracted.
Consider a file that contains the following JSON data:
```json
{
  "reg_no": "A123",
  "config": {
    "meta": "this is the actual config that'll be stored."
  },
  "changes": [
    {
      "action": "drive",
      "summary": "car color changed to blue",
      "unrelated_stuff": 123
    }
  ],
  "access_logs": [
    {
      "config_id": "99024949-9118-4dcb-a3a0-b8f1536bebd0",
      "external_user_id": "a3542241-4750-11f0-8000-e0146ce375e6",
      "created_at": "2025-01-01"
    },
    {
      "config_id": "9d9e51a7-6956-413e-a07e-a6aeb3f4877f",
      "external_user_id": "a5c2e8e3-4750-11f0-8000-f4eaacabd632",
      "created_at": "2025-01-02"
    }
  ]
}
```
A regular scraper saves the entire JSON as a config. With the `full` option, however, the scraper extracts the config, changes, and access logs separately.
```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: file-scraper
spec:
  full: true
  file:
    - type: Car
      id: $.reg_no
      paths:
        - fixtures/data/car_changes.json
```
The resulting config is:
```json
{
  "meta": "this is the actual config that'll be stored."
}
```
and the scraper records the following new config change on that config:
```json
{
  "action": "drive",
  "summary": "car color changed to blue",
  "unrelated_stuff": 123
}
```
and the access logs are saved as:
```json
[
  {
    "config_id": "99024949-9118-4dcb-a3a0-b8f1536bebd0",
    "external_user_id": "a3542241-4750-11f0-8000-e0146ce375e6",
    "created_at": "2025-01-01"
  },
  {
    "config_id": "9d9e51a7-6956-413e-a07e-a6aeb3f4877f",
    "external_user_id": "a5c2e8e3-4750-11f0-8000-f4eaacabd632",
    "created_at": "2025-01-02"
  }
]
```