Overview

Tool Name

harvester_tools

Purpose

The harvester_tools group manages automated collection of database metadata used by search and discovery. Configure what to harvest, how often to refresh, and where to focus. Track status across sources and remove configurations or metadata when they are no longer required.

Key Features & Functions

Configure Sources

Register which connections and databases should be harvested and at what cadence.

Schema Scoping

Include or exclude schemas to focus on relevant areas and reduce noise.

Scheduling

Set refresh intervals in minutes and trigger first crawls on demand.

Monitor Coverage

Review control rows and high-level summaries to see progress and health.

Cleanup

Remove control rows or purge stored metadata for decommissioned databases.

Input Parameters for Each Function

_get_harvest_control_data

Parameters (No parameters. Returns all current harvest control rows.)
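
The rows returned mirror the fields accepted by _set_harvest_control_data below. An illustrative row, with field names assumed rather than confirmed, might look like this:

```python
# Illustrative harvest control row (a sketch; field names and values are
# assumptions based on the parameter and Output descriptions on this page).
example_control_row = {
    "connection_id": "snowflake_prod",        # hypothetical connection id
    "database_name": "ANALYTICS_DB",          # hypothetical database
    "status": "Include",                      # Include enables harvesting
    "refresh_interval": 5,                    # minutes between refreshes
    "initial_crawl_complete": True,           # first crawl already finished
    "schema_inclusions": [],                  # empty list means all schemas
    "schema_exclusions": ["INFORMATION_SCHEMA"],
}
```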

_set_harvest_control_data

Parameters
  • connection_id (String): Connection id for the source to harvest.
  • database_name (String): Database name to harvest. For BigQuery, use the project id.
  • refresh_interval (Integer): Refresh cadence in minutes. Default is 5.
  • initial_crawl_complete (Boolean): Set to false to trigger an immediate initial crawl.
  • status (String): Control row status. Use Include to enable or Exclude to disable.
  • schema_inclusions (Array): List of schemas to include. An empty list means include all.
  • schema_exclusions (Array): Schemas to exclude from harvest.
To force a fresh crawl after structural changes, call _set_harvest_control_data with initial_crawl_complete: false.
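
For example, a payload that registers a new database, scopes the harvest, and requests an immediate first crawl might look like the sketch below (the connection and database names are hypothetical):

```python
# Illustrative _set_harvest_control_data payload (all values hypothetical).
set_params = {
    "connection_id": "snowflake_prod",         # id from data_connector_tools
    "database_name": "ANALYTICS_DB",
    "refresh_interval": 15,                    # minutes between refreshes
    "initial_crawl_complete": False,           # False triggers the first crawl
    "status": "Include",
    "schema_inclusions": ["ANALYTICS", "MARTS"],
    "schema_exclusions": ["INFORMATION_SCHEMA"],
}
```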

_remove_harvest_control_data

Parameters
  • source_name (String): Source or connection identifier of the control row to remove.
  • database_name (String): Target database for the control row removal.
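
A minimal payload sketch, using hypothetical names:

```python
# Illustrative _remove_harvest_control_data payload (hypothetical values).
# Removing the control row stops future crawls but keeps stored metadata.
remove_control_params = {
    "source_name": "snowflake_prod",
    "database_name": "ANALYTICS_DB",
}
```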

_remove_metadata_for_database

Parameters
  • source_name (String): Source or connection identifier for the metadata purge.
  • database_name (String): Database whose harvested metadata will be deleted.
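
The purge call takes the same two identifiers but deletes the stored metadata itself; a hedged sketch:

```python
# Illustrative _remove_metadata_for_database payload (hypothetical values).
# Unlike control row removal, this permanently deletes harvested metadata.
purge_params = {
    "source_name": "snowflake_prod",
    "database_name": "ANALYTICS_DB",
}
```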

_get_harvest_summary

Parameters (No parameters. Returns coverage, last run times, counts, and errors.)
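
An illustrative summary entry, with field names assumed from the coverage, timestamp, count, and error information described on this page:

```python
# Illustrative _get_harvest_summary entry (a sketch; field names are
# assumptions and actual responses may differ).
example_summary_entry = {
    "source_name": "snowflake_prod",          # hypothetical source
    "database_name": "ANALYTICS_DB",
    "schemas_covered": 2,                     # coverage count
    "objects_harvested": 418,                 # object total
    "last_run": "2025-01-01T00:00:00Z",       # last run timestamp
    "errors": [],                             # error snapshot
}
```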

Use Cases

  1. Stand up harvesting for a new source
    • Add a control row for a new Snowflake database with a 5-minute interval.
  2. Tightly scope collections
    • Include only ANALYTICS and MARTS while excluding INFORMATION_SCHEMA.
  3. Monitor rollout
    • Track initial crawl completion and spot errors across environments.
  4. Tune cadences
    • Increase intervals on slow-changing warehouses to reduce compute load.
  5. Decommission cleanly
    • Remove control rows and purge stored metadata for retired systems.
Purging with _remove_metadata_for_database permanently deletes the stored metadata for the database, so confirm it is no longer needed before proceeding.

Workflow/How It Works

  1. Inspect current state
    • Call _get_harvest_control_data and _get_harvest_summary to see what is active.
  2. Plan scope
    • Decide on inclusions or exclusions. Default behavior is to harvest all schemas except typical system schemas.
  3. Configure
    • Use _set_harvest_control_data to upsert the row with status, interval, and filters. Set initial_crawl_complete: false to kick off the first crawl.
  4. Monitor
    • Watch _get_harvest_summary during rollout. Adjust interval or scope if load is higher than expected.
  5. Cleanup
    • Use _remove_harvest_control_data to stop future crawls. If needed, use _remove_metadata_for_database to clear stored metadata.
Schema filters do not apply to some engines, such as MySQL and SQLite, where scoping is at the database level.
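
A compact end-to-end sketch of this workflow, assuming a hypothetical call_tool dispatcher that routes to the harvester_tools functions (replace it with however your client actually invokes tools):

```python
# Hypothetical dispatcher: stand-in for your client's real tool-call mechanism.
def call_tool(name: str, **params):
    print(f"would call {name} with {params}")
    return []

# 1. Inspect current state.
control_rows = call_tool("_get_harvest_control_data")
summary = call_tool("_get_harvest_summary")

# 2-3. Plan scope and configure: upsert the control row and request the
#      first crawl by passing initial_crawl_complete=False.
call_tool(
    "_set_harvest_control_data",
    connection_id="snowflake_prod",            # hypothetical connection id
    database_name="ANALYTICS_DB",
    refresh_interval=15,
    initial_crawl_complete=False,
    status="Include",
    schema_inclusions=["ANALYTICS", "MARTS"],
    schema_exclusions=["INFORMATION_SCHEMA"],
)

# 4. Monitor rollout and adjust interval or scope if load is too high.
summary = call_tool("_get_harvest_summary")

# 5. Cleanup when the source is retired: stop crawls, then purge if needed.
call_tool("_remove_harvest_control_data",
          source_name="snowflake_prod", database_name="ANALYTICS_DB")
call_tool("_remove_metadata_for_database",
          source_name="snowflake_prod", database_name="ANALYTICS_DB")
```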

Integration Relevance

  • data_connector_tools for connection ids and downstream metadata search.
  • genesis_job_tools to observe the background harvester jobs and follow logs.
  • project_manager_tools to track catalog rollout as tasks and milestones.
  • document_index_tools in parallel when you also index full text documents.
  • system_stats_tools for monitoring resource impact during large crawls.

Configuration Details

  • Use exact case for database and schema names to match engine reporting.
  • Start with a refresh_interval of 5 minutes for active environments and relax to 15 to 60 minutes where change is rare.
  • Exclude noisy system schemas like INFORMATION_SCHEMA unless needed.
  • Ensure the service role has privileges to run SHOW- and DESCRIBE-style operations.
  • One control row per source and database pair keeps intent explicit.

Limitations or Notes

  1. Harvesting relies on valid connectivity and credentials.
  2. Large catalogs may take time to crawl, and summaries can lag until complete.
  3. Aggressive refresh cadences can increase compute costs on warehouses.
  4. Schema scoping does not apply to all engines.
  5. Removing a control row stops future crawls but does not delete stored metadata.
  6. Network failures and permission changes can interrupt or reduce coverage.
  7. Concurrent large harvests may be throttled by available resources.

Output

  • Harvest Control Data
    • Current rows with connection, database, status, interval, and filters.
  • Configuration Updates
    • Upsert confirmation and whether an initial crawl was triggered.
  • Harvest Summary
    • Coverage counts, last run timestamps, error snapshots, and object totals.
  • Removal Confirmations
    • Success messages for control row removal and for metadata purge actions.
  • Status Information
    • Progress indicators and completion timestamps per source and database.