# File

The `file` scraper creates config items from files in a local folder (or a git repository). You can use it to track changes in files like `/etc/hosts` or `/etc/passwd`, or to scrape service metadata stored in git.
See Kubernetes Files for scraping files inside running Kubernetes pods.
```yaml title="file-scraper.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: file-git-scraper
spec:
  file:
    - type: $.kind
      id: $.metadata.name
      url: github.com/flanksource/canary-checker?ref=076cf8b888f2dbaca26a7cc98a4153c154220a22
      paths:
        - fixtures/minimal/http_pass.yaml
```
## Scraper

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| `schedule` | Specify the interval to scrape in cron format. Defaults to every 60 minutes (see the example below). | `string` | |
| `full` | Set to `true` to extract changes & access logs from scraped configurations. Defaults to `false`. | `bool` | |
| `retention` | Settings for retaining changes, analysis and scraped items. | Retention | |
| `file` | Specifies the list of File configurations to scrape. | `[]File` | |
| `logLevel` | Specify the level of logging. | `string` | |
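A minimal sketch of a scheduled scrape, assuming a static `id` and `type` for a non-JSON file (the scraper name and cron expression are illustrative):

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: etc-hosts-scraper
spec:
  schedule: '*/30 * * * *' # scrape every 30 minutes instead of the 60-minute default
  file:
    - type: File # static type, since /etc/hosts isn't JSON
      id: /etc/hosts # static id for the same reason
      paths:
        - /etc/hosts
```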
## File

| Field | Description | Scheme |
|-------|-------------|--------|
| `paths`* | Specify paths to configuration(s) for scraping. | `[]glob` |
| `url` | Specify URL, e.g. a github repository, containing the configuration(s). | |
| `labels` | Labels for each config item (see the example below). | |
| `properties` | Custom templatable properties for the scraped config items. | |
| `tags` | Tags for each config item. Max allowed: 5. | |
| `transform` | Transform configs after they've been scraped. | |
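As an illustration, `labels` and `tags` attach metadata to every config item the scraper produces; the key/value pairs and glob below are hypothetical:

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: labeled-file-scraper
spec:
  file:
    - type: $.kind
      id: $.metadata.name
      labels:
        team: platform # illustrative label
      tags:
        env: dev # illustrative tag; at most 5 tags are allowed
      paths:
        - services/**/*.yaml # hypothetical glob
```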
## Mapping

Custom scrapers require you to define the `id` and `type` for each scraped item. For example, when you scrape a file containing a JSON array, where each array element represents a config item, you must specify the `id` and `type` for those items. You achieve this by using mappings in your custom scraper configuration, as in the sketch below.
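For instance, a hedged sketch: given a hypothetical file `fixtures/data/vehicles.json` whose `vehicles` key holds an array of config items, the `items` mapping (documented in the table below) extracts each element before the per-item `id` and `type` mappings are applied:

```json
{
  "vehicles": [
    { "reg_no": "A123", "kind": "Car" },
    { "reg_no": "B456", "kind": "Truck" }
  ]
}
```

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: vehicles-file-scraper
spec:
  file:
    - items: $.vehicles # extract each array element as its own config item
      id: $.reg_no # per-item id, evaluated against each element
      type: $.kind # per-item type
      paths:
        - fixtures/data/vehicles.json # hypothetical path
```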
| Field | Description | Scheme |
|-------|-------------|--------|
| `id`* | A static value or JSONPath expression to use as the ID for the resource. | |
| `name`* | A static value or JSONPath expression to use as the name for the resource. | |
| `type`* | A static value or JSONPath expression to use as the type for the resource. | |
| `class` | A static value or JSONPath expression to use as the class for the resource. | |
| `createFields` | A list of JSONPath expressions used to identify the created time of the config. If multiple fields are specified, the first non-empty value is used. | `[]jsonpath` |
| `deleteFields` | A list of JSONPath expressions used to identify the deleted time of the config. If multiple fields are specified, the first non-empty value is used. | `[]jsonpath` |
| `description` | A static value or JSONPath expression to use as the description for the resource. | |
| `format` | Format of the config item. Defaults to JSON; available options are JSON and properties. See Formats. | |
| `health` | A static value or JSONPath expression to use as the health of the config item. | |
| `items` | A JSONPath expression used to extract individual items from the resource. Items are extracted first, and then the ID, name, type and transformations are applied to each item. | |
| `status` | A static value or JSONPath expression to use as the status of the config item. | |
| `timestampFormat` | A Go time format string used to parse timestamps in `createFields` and `deleteFields` (see the sketch below). Defaults to RFC3339. | |
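A minimal sketch of timestamp mapping, assuming each scraped item carries `created` and `deleted_at` fields (both field names, and the layout, are illustrative):

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: timestamped-file-scraper
spec:
  file:
    - type: $.kind
      id: $.id
      createFields:
        - $.created # hypothetical field holding the creation time
      deleteFields:
        - $.deleted_at # hypothetical; the first non-empty value is used
      timestampFormat: '2006-01-02 15:04:05' # Go layout for "YYYY-MM-DD hh:mm:ss"
      paths:
        - data/*.json # hypothetical glob
```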
## Formats

### JSON

The scraper stores config items as `jsonb` fields in PostgreSQL.

Resource providers typically return JSON, e.g. `kubectl get -o json` or `aws --output=json`.

When you display the config, the UI automatically converts the JSON data to YAML for improved readability.
### XML / Properties

The scraper stores non-JSON files as JSON using:

```json
{ "format": "xml", "content": "<root>..</root>" }
```

You can still access non-JSON content in scripts using `config.content`.

The UI formats and renders XML appropriately.
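As an illustration of the `properties` format (the file name, static `id`, and path are hypothetical):

```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: properties-file-scraper
spec:
  file:
    - type: Application
      id: app-settings # static id, since the file isn't JSON
      format: properties # parse key=value pairs instead of JSON
      paths:
        - config/app.properties # hypothetical path
```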
## Extracting Changes & Access Logs

Custom scrapers ingest changes & access logs from external systems when you enable the `full` option.

Every config is expected to have these three top-level fields:

- `config`
- `changes`
- `access_logs`

A config may contain additional fields, or omit some of these; in either case, only these three fields are extracted.
Consider a file that contains the following JSON data:
```json
{
  "reg_no": "A123",
  "config": {
    "meta": "this is the actual config that'll be stored."
  },
  "changes": [
    {
      "action": "drive",
      "summary": "car color changed to blue",
      "unrelated_stuff": 123
    }
  ],
  "access_logs": [
    {
      "config_id": "99024949-9118-4dcb-a3a0-b8f1536bebd0",
      "external_user_id": "a3542241-4750-11f0-8000-e0146ce375e6",
      "created_at": "2025-01-01"
    },
    {
      "config_id": "9d9e51a7-6956-413e-a07e-a6aeb3f4877f",
      "external_user_id": "a5c2e8e3-4750-11f0-8000-f4eaacabd632",
      "created_at": "2025-01-02"
    }
  ]
}
```
A regular scraper saves the entire JSON as a config. With the `full` option, however, the scraper extracts the config, changes, and access logs separately.
```yaml
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: file-scraper
spec:
  full: true
  file:
    - type: Car
      id: $.reg_no
      paths:
        - fixtures/data/car_changes.json
```
The resulting config is:
```json
{
  "meta": "this is the actual config that'll be stored."
}
```
and the scraper records the following new config change on that config:
```json
{
  "action": "drive",
  "summary": "car color changed to blue",
  "unrelated_stuff": 123
}
```
and the access logs are saved as:
```json
[
  {
    "config_id": "99024949-9118-4dcb-a3a0-b8f1536bebd0",
    "external_user_id": "a3542241-4750-11f0-8000-e0146ce375e6",
    "created_at": "2025-01-01"
  },
  {
    "config_id": "9d9e51a7-6956-413e-a07e-a6aeb3f4877f",
    "external_user_id": "a5c2e8e3-4750-11f0-8000-f4eaacabd632",
    "created_at": "2025-01-02"
  }
]
```