Overview

Tool Name

harvest_control_tools

Purpose

harvest_control_tools manages the harvesting, tracking, and organization of metadata across connected databases. It streamlines and automates metadata collection, keeping information about database structures, schemas, and available data assets up to date.

Functions Available

  1. _get_harvest_control_data: Retrieves all active harvest control configurations, including databases in scope, schemas included or excluded, refresh intervals, and crawl status.

  2. _set_harvest_control_data: Adds or updates harvest control configurations for a specific database or schema.

  3. _remove_harvest_control_data: Deletes a harvest control setup, stopping future metadata crawls for the specified database.

  4. _remove_metadata_for_database: Purges the harvested metadata for a specific database, cleaning up the recorded crawl results.

  5. _get_harvest_summary: Provides a summary of ongoing, completed, or failed metadata harvests, including statistics (e.g., the number of tables and columns processed).

Key Features

Automated Metadata Harvesting

Configure and manage automatic metadata harvesting for connected databases.

Schema Inclusions & Exclusions

Include or exclude specific schemas to focus harvesting on the most relevant data.

Manual or Scheduled Crawls

Perform harvest crawls manually or set automatic refresh intervals.

Progress & Data Availability

Track harvesting progress and control data availability across systems.

Metadata Cleanup

Remove outdated metadata or harvest configurations when no longer required.

Input Parameters

_get_harvest_control_data: Retrieve All Active Harvest Control Configurations

Input Parameters | Definition | Format
(None Required) | Returns all active harvest control configurations, including database names and schema rules. | (None)

_set_harvest_control_data: Add or Update Harvest Control Configurations

Input Parameters | Definition | Format
connection_id (Optional) | Specifies the database connection ID (e.g., “Snowflake”). | String
database_name | Name of the database for harvesting (e.g., “CUSTOMER_DATA”). | String
refresh_interval | Suggested refresh interval in minutes (e.g., 1440 for daily). | Integer
initial_crawl_complete (Optional) | Indicates if the initial crawl is done (default: False). | Boolean
schema_exclusions (Optional) | List of schemas to exclude from harvesting. | List
schema_inclusions (Optional) | List of schemas to explicitly include. | List
status (Optional) | Harvesting status; default is “Include”. | String

_remove_harvest_control_data: Delete a Specific Harvest Control Setup

Input Parameters | Definition | Format
source_name | The source system for which harvesting is configured (e.g., “Snowflake”). | String
database_name | The database whose control data should be removed. | String

_remove_metadata_for_database: Purge Harvested Metadata for a Database

Input Parameters | Definition | Format
source_name | Name of the source database to clean (e.g., “Snowflake”). | String
database_name | Name of the database for which metadata will be purged. | String

_get_harvest_summary: Get a Summary of Harvest Progress & Statistics

Input Parameters | Definition | Format
(None Required) | Returns statistics for ongoing or completed metadata harvests (e.g., table counts). | (None)
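
As a minimal sketch of how the _set_harvest_control_data parameters listed above fit together, the Python below assembles and checks an argument dictionary. The helper name build_set_harvest_args is hypothetical and not part of the tool, the positive-integer check on refresh_interval is an assumption, and how the dictionary is actually passed to the tool depends on your environment.

```python
from typing import Any, Dict, List, Optional


def build_set_harvest_args(
    database_name: str,
    refresh_interval: int,
    connection_id: Optional[str] = None,
    initial_crawl_complete: bool = False,
    schema_exclusions: Optional[List[str]] = None,
    schema_inclusions: Optional[List[str]] = None,
    status: str = "Include",
) -> Dict[str, Any]:
    """Hypothetical helper: collect arguments under the documented parameter names."""
    if not database_name:
        raise ValueError("database_name is required")
    # Assumption: the interval is a positive whole number of minutes.
    if isinstance(refresh_interval, bool) or not isinstance(refresh_interval, int):
        raise TypeError("refresh_interval must be an integer number of minutes")
    if refresh_interval <= 0:
        raise ValueError("refresh_interval must be positive")

    args: Dict[str, Any] = {
        "database_name": database_name,
        "refresh_interval": refresh_interval,
        "initial_crawl_complete": initial_crawl_complete,
        "status": status,
    }
    # Optional parameters are only added when the caller supplies them.
    if connection_id is not None:
        args["connection_id"] = connection_id
    if schema_exclusions:
        args["schema_exclusions"] = list(schema_exclusions)
    if schema_inclusions:
        args["schema_inclusions"] = list(schema_inclusions)
    return args


example = build_set_harvest_args(
    "CUSTOMER_DATA",
    1440,  # daily, expressed in minutes
    connection_id="Snowflake",
    schema_exclusions=["INFORMATION_SCHEMA"],
)
print(example)
```

Leaving unset optional parameters out of the dictionary lets the documented defaults (False for initial_crawl_complete, “Include” for status) stay in one place.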

Output

  • Harvest Data Retrieval

    • _get_harvest_control_data returns JSON detailing active harvest configurations, including the database names, refresh intervals, and schema inclusions/exclusions.
  • Harvest Configuration/Update

    • _set_harvest_control_data returns a success message confirming that the harvesting rules were created or updated.
  • Metadata Cleanup

    • _remove_harvest_control_data and _remove_metadata_for_database return confirmations that the control data or the harvested metadata was removed.
  • Harvest Summary

    • _get_harvest_summary outputs progress and statistics, including the number of schemas crawled, tables indexed, or failures encountered.
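
The exact JSON layout returned by _get_harvest_control_data is not specified here. As an illustration only, the sketch below assumes a list of objects whose keys mirror the input parameter names and shows how such a response could be condensed into readable lines. The sample payload and the summarize_configs helper are hypothetical.

```python
import json
from typing import Dict, List

# Hypothetical response; the real field names and nesting may differ.
SAMPLE_RESPONSE = """
[
  {"source_name": "Snowflake", "database_name": "CUSTOMER_DATA",
   "refresh_interval": 1440, "schema_exclusions": ["INFORMATION_SCHEMA"],
   "schema_inclusions": []},
  {"source_name": "Snowflake", "database_name": "SALES",
   "refresh_interval": 60, "schema_exclusions": [],
   "schema_inclusions": ["PUBLIC"]}
]
"""


def summarize_configs(raw_json: str) -> List[str]:
    """Turn the assumed JSON list into one readable line per database."""
    configs: List[Dict] = json.loads(raw_json)
    lines = []
    for cfg in configs:
        # An empty or missing exclusion list is reported as "none".
        excluded = ", ".join(cfg.get("schema_exclusions", [])) or "none"
        lines.append(
            f"{cfg['database_name']}: every {cfg['refresh_interval']} min, "
            f"excluded schemas: {excluded}"
        )
    return lines


for line in summarize_configs(SAMPLE_RESPONSE):
    print(line)
```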

Genbot Tip

  • Use schema_exclusions to omit system-level schemas (like INFORMATION_SCHEMA) from your harvest for cleaner results.

  • Align refresh_interval with your environment’s change frequency: use shorter intervals for more rapidly changing data.
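
Because refresh_interval is expressed in minutes, a small conversion helper can make the chosen cadence easier to read. The sketch below uses only the Python standard library. refresh_interval_minutes is a hypothetical helper name, and rejecting intervals shorter than one minute is an assumption.

```python
from datetime import timedelta


def refresh_interval_minutes(period: timedelta) -> int:
    """Convert a time span into whole minutes for refresh_interval."""
    minutes = int(period.total_seconds() // 60)
    # Assumption: an interval below one minute is not meaningful.
    if minutes < 1:
        raise ValueError("refresh interval must be at least one minute")
    return minutes


daily = refresh_interval_minutes(timedelta(days=1))
hourly = refresh_interval_minutes(timedelta(hours=1))
print(daily, hourly)  # prints "1440 60"
```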

How It Works

Users define a database connection with desired harvesting configurations, specifying schemas to include or exclude. Once set, the system performs scheduled metadata crawls or initiates immediate crawls, updating the metadata store. Users can retrieve harvesting progress and summary reports, track connected database configurations, or clean outdated metadata.
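
The lifecycle described above (configure, crawl, review, clean up) can be sketched with a toy in-memory stand-in. This is not the tool's implementation: the InMemoryHarvestControl class, its method names, and the use of (source_name, database_name) as a key are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (source_name, database_name), an assumed identifier


class InMemoryHarvestControl:
    """Toy stand-in that mimics the configure/crawl/summarize/remove flow."""

    def __init__(self) -> None:
        self.configs: Dict[Key, dict] = {}
        self.metadata: Dict[Key, List[str]] = {}

    def set_config(self, source_name: str, database_name: str, refresh_interval: int) -> None:
        # Adding and updating are the same operation: the last write wins.
        self.configs[(source_name, database_name)] = {
            "refresh_interval": refresh_interval,
            "initial_crawl_complete": False,
        }

    def crawl(self, source_name: str, database_name: str, tables: List[str]) -> None:
        """Simulate one crawl by recording the table names that were found."""
        key = (source_name, database_name)
        if key not in self.configs:
            raise KeyError(f"no harvest configuration for {key}")
        self.metadata[key] = list(tables)
        self.configs[key]["initial_crawl_complete"] = True

    def summary(self) -> Dict[str, int]:
        """Report the number of harvested tables per database."""
        return {f"{src}.{db}": len(tables) for (src, db), tables in self.metadata.items()}

    def remove_metadata(self, source_name: str, database_name: str) -> None:
        self.metadata.pop((source_name, database_name), None)

    def remove_config(self, source_name: str, database_name: str) -> None:
        self.configs.pop((source_name, database_name), None)


control = InMemoryHarvestControl()
control.set_config("Snowflake", "CUSTOMER_DATA", 1440)
control.crawl("Snowflake", "CUSTOMER_DATA", ["ORDERS", "CUSTOMERS"])
after_crawl = control.summary()

# Cleanup is two separate steps, mirroring the two removal functions.
control.remove_metadata("Snowflake", "CUSTOMER_DATA")
control.remove_config("Snowflake", "CUSTOMER_DATA")
after_cleanup = control.summary()
print(after_crawl, after_cleanup)
```

Keeping metadata removal separate from configuration removal reflects the split between _remove_metadata_for_database and _remove_harvest_control_data.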

IMPORTANT NOTE

  • By default, all schemas are included unless explicitly excluded, so a harvest may gather irrelevant data.

  • Large databases may extend crawling times or require optimized scheduling to avoid performance bottlenecks.

  • Combine harvest crawls with metadata audits to maintain compliance with data governance standards.
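
To make the default-inclusion behavior concrete, the sketch below resolves which schemas would be harvested from a given set of rules. The schemas_to_harvest helper is hypothetical. Treating a non-empty inclusion list as a filter, applying exclusions afterwards, and comparing names case-insensitively are all assumptions, since the precedence between schema_inclusions and schema_exclusions is not stated here.

```python
from typing import Iterable, List, Optional


def schemas_to_harvest(
    available: Iterable[str],
    inclusions: Optional[Iterable[str]] = None,
    exclusions: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the schemas that remain after applying the assumed rules."""
    included = {name.upper() for name in inclusions or []}
    excluded = {name.upper() for name in exclusions or []}
    selected = []
    for schema in available:
        name = schema.upper()
        # With no inclusion list, every schema starts out included.
        if included and name not in included:
            continue
        if name in excluded:
            continue
        selected.append(schema)
    return selected


available = ["PUBLIC", "STAGING", "INFORMATION_SCHEMA"]
print(schemas_to_harvest(available))
print(schemas_to_harvest(available, exclusions=["INFORMATION_SCHEMA"]))
print(schemas_to_harvest(available, inclusions=["PUBLIC"]))
```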