Overview
Tool Name
harvester_tools
Purpose
The harvester_tools group manages automated collection of database metadata used by search and discovery. Configure what to harvest, how often to refresh, and where to focus. Track status across sources, and remove configurations or metadata when they are no longer required.
Key Features & Functions
Configure Sources
Register which connections and databases should be harvested and at what cadence.
Schema Scoping
Include or exclude schemas to focus on relevant areas and reduce noise.
Scheduling
Set refresh intervals in minutes and trigger first crawls on demand.
Monitor Coverage
Review control rows and high-level summaries to see progress and health.
Cleanup
Remove control rows or purge stored metadata for decommissioned databases.
Input Parameters for Each Function
_get_harvest_control_data
Parameters
(No parameters. Returns all current harvest control rows.)
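For illustration, here is a minimal sketch of listing the current control rows. It assumes the harvester_tools functions are exposed as Python callables that return plain lists of dictionaries; the field names shown mirror the _set_harvest_control_data parameters below and may differ in your deployment.

```python
# Hypothetical Python binding: assumes the harvester_tools functions
# are callable directly and return lists of dictionaries.
rows = _get_harvest_control_data()

for row in rows:
    # Field names mirror the _set_harvest_control_data parameters and
    # are illustrative; check the actual response shape in your deployment.
    print(row["connection_id"], row["database_name"], row["status"])
```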
_set_harvest_control_data
Parameters
| Name | Definition | Format |
|---|---|---|
| connection_id | Connection id for the source to harvest. | String |
| database_name | Database name to harvest. For BigQuery, use the project id. | String |
| refresh_interval | Refresh cadence in minutes. Default 5. | Integer |
| initial_crawl_complete | Set to false to trigger an immediate initial crawl. | Boolean |
| status | Control row status. Use Include to enable or Exclude to disable. | String |
| schema_inclusions | List of schemas to include. Empty list means include all. | Array |
| schema_exclusions | Schemas to exclude from harvest. | Array |
To force a fresh crawl after structural changes, call _set_harvest_control_data with initial_crawl_complete: false.
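As a hedged sketch, registering a new Snowflake database with tight schema scoping and an immediate first crawl might look like the following. The connection id and database names are example values, and the Python-callable binding is an assumption.

```python
# Illustrative upsert; all identifiers are example values.
_set_harvest_control_data(
    connection_id="snowflake_prod",            # example connection id
    database_name="ANALYTICS_DB",              # exact case, as the engine reports it
    refresh_interval=5,                        # minutes between crawls
    initial_crawl_complete=False,              # false triggers an immediate first crawl
    status="Include",
    schema_inclusions=["ANALYTICS", "MARTS"],  # empty list would mean all schemas
    schema_exclusions=["INFORMATION_SCHEMA"],
)
```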
_remove_harvest_control_data
Parameters
| Name | Definition | Format |
|---|---|---|
| source_name | Source or connection identifier of the control row to remove. | String |
| database_name | Target database for the control row removal. | String |
_remove_metadata_for_database
Parameters
| Name | Definition | Format |
|---|---|---|
| source_name | Source or connection identifier for metadata purge. | String |
| database_name | Database whose harvested metadata will be deleted. | String |
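A hedged decommissioning sketch, again assuming Python-callable bindings; parameter names follow the tables above, and the purge step permanently deletes stored metadata.

```python
# Illustrative decommission sequence; identifiers are example values.
_remove_harvest_control_data(
    source_name="snowflake_prod",   # example source identifier
    database_name="RETIRED_DB",
)

# Only run the purge when the stored metadata is definitely no longer
# needed; it deletes the harvested metadata for the database.
_remove_metadata_for_database(
    source_name="snowflake_prod",
    database_name="RETIRED_DB",
)
```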
_get_harvest_summary
Parameters
(No parameters. Returns coverage, last run times, counts, and errors.)
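A sketch of polling the summary during a rollout. The summary field names used here (initial_crawl_complete, errors) are assumptions for illustration, not a documented response schema.

```python
import time

# Poll the harvest summary until the initial crawl finishes. The field
# names used here are illustrative assumptions.
while True:
    summary = _get_harvest_summary()
    if summary.get("errors"):
        print("harvest errors:", summary["errors"])
    if summary.get("initial_crawl_complete"):
        break
    time.sleep(60)  # re-check once a minute during rollout
```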
Use Cases
- Stand up harvesting for a new source
  - Add a control row for a new Snowflake database with a 5-minute interval.
- Tightly scope collections
  - Include only ANALYTICS and MARTS while excluding INFORMATION_SCHEMA.
- Monitor rollout
  - Track initial crawl completion and spot errors across environments.
- Tune cadences
  - Increase intervals on slow-changing warehouses to reduce compute load.
- Decommission cleanly
  - Remove control rows and purge stored metadata for retired systems.
Purging with _remove_metadata_for_database deletes stored metadata for the database. Make sure you truly do not need it before proceeding.
Workflow/How It Works
- Inspect current state
  - Call _get_harvest_control_data and _get_harvest_summary to see what is active.
- Plan scope
  - Decide on inclusions or exclusions. Default behavior is to harvest all schemas except typical system schemas.
- Configure
  - Use _set_harvest_control_data to upsert the row with status, interval, and filters. Set initial_crawl_complete: false to kick off the first crawl.
- Monitor
  - Watch _get_harvest_summary during rollout. Adjust the interval or scope if load is higher than expected.
- Cleanup
  - Use _remove_harvest_control_data to stop future crawls. If needed, use _remove_metadata_for_database to clear stored metadata.
Schema filters are not applicable for some engines like MySQL and SQLite, where scoping is at the database level.
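Putting the workflow together, here is a hedged end-to-end sketch that reconciles a small desired-state map against the current control rows; the bindings and response fields are assumptions, as in the earlier examples.

```python
# Desired state: one entry per (connection, database) pair, matching the
# "one control row per pair" guidance. All values are examples.
DESIRED = {
    ("snowflake_prod", "ANALYTICS_DB"): 5,   # active: 5-minute cadence
    ("snowflake_prod", "ARCHIVE_DB"): 60,    # slow-changing: hourly
}

existing = {
    (row["connection_id"], row["database_name"])
    for row in _get_harvest_control_data()
}

for (conn, db), interval in DESIRED.items():
    if (conn, db) not in existing:
        # New pair: upsert the control row and trigger the first crawl.
        _set_harvest_control_data(
            connection_id=conn,
            database_name=db,
            refresh_interval=interval,
            initial_crawl_complete=False,
            status="Include",
            schema_inclusions=[],                      # all schemas...
            schema_exclusions=["INFORMATION_SCHEMA"],  # ...except system noise
        )
```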
Integration Relevance
- data_connector_tools for connection ids and downstream metadata search.
- genesis_job_tools to observe the background harvester jobs and follow logs.
- project_manager_tools to track catalog rollout as tasks and milestones.
- document_index_tools in parallel when you also index full text documents.
- system_stats_tools for monitoring resource impact during large crawls.
Configuration Details
- Use exact case for database and schema names to match engine reporting.
- Start with a refresh_interval of 5 minutes for active environments and relax to 15 to 60 minutes where change is rare.
- Exclude noisy system schemas like INFORMATION_SCHEMA unless needed.
- Ensure the service role has privileges to run SHOW and DESCRIBE style operations.
- One control row per source and database pair keeps intent explicit.
Limitations or Notes
- Harvesting relies on valid connectivity and credentials.
- Large catalogs may take time to crawl, and summaries can lag until complete.
- Aggressive refresh cadences can increase compute costs on warehouses.
- Schema scoping does not apply to all engines.
- Removing a control row stops future crawls but does not delete stored metadata.
- Network failures and permission changes can interrupt or reduce coverage.
- Concurrent large harvests may be throttled by available resources.
Output
- Harvest Control Data
  - Current rows with connection, database, status, interval, and filters.
- Configuration Updates
  - Upsert confirmation and whether an initial crawl was triggered.
- Harvest Summary
  - Coverage counts, last run timestamps, error snapshots, and object totals.
- Removal Confirmations
  - Success messages for control row removal and for metadata purge actions.
- Status Information
  - Progress indicators and completion timestamps per source and database.

