Prodigy - Google Cloud Storage Integration and Automation
Data flow: Google Cloud Storage to Prodigy
Teams store large image, text, audio, or document datasets in Google Cloud Storage and connect Prodigy directly to those buckets for labeling. This gives data scientists a single source of truth for raw training data while allowing annotators to work from a scalable cloud repository instead of local files.
Business value: Reduces manual file handling, improves dataset governance, and speeds up project kickoff for AI teams working across multiple business units.
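A minimal sketch of this direction, assuming the `google-cloud-storage` client library; the bucket name, prefix, and helper names are illustrative, not part of either product. It streams text objects from a bucket prefix and wraps each one as a Prodigy task dict:

```python
def blob_to_task(bucket: str, name: str, text: str) -> dict:
    """Wrap one stored document as a Prodigy task dict, keeping its source URI."""
    return {"text": text, "meta": {"source": f"gs://{bucket}/{name}"}}


def stream_tasks(bucket: str, prefix: str):
    """Yield a Prodigy-compatible task for each text object under a prefix."""
    from google.cloud import storage  # needs GCP credentials at runtime

    client = storage.Client()
    for blob in client.list_blobs(bucket, prefix=prefix):
        yield blob_to_task(bucket, blob.name, blob.download_as_text())
```

A generator like `stream_tasks` can be plugged into a custom Prodigy recipe as its stream, so annotators label straight from the bucket with no local copies.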
Data flow: Prodigy to Google Cloud Storage
As Prodigy's active-learning workflows surface the most informative samples for labeling, the selected records and annotation outputs can be written back to Google Cloud Storage for persistence and downstream processing. This supports repeatable training cycles and makes it easier to share labeled subsets with model training pipelines.
Business value: Improves model iteration speed, preserves annotation history, and enables consistent handoff from labeling to ML training teams.
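The write-back step can be sketched as serializing annotated examples to newline-delimited JSON (the shape Prodigy exports) and uploading the payload; the function names and paths here are assumptions for illustration:

```python
import json


def annotations_to_jsonl(examples: list[dict]) -> str:
    """Serialize annotated examples as newline-delimited JSON (Prodigy's export shape)."""
    return "".join(json.dumps(eg) + "\n" for eg in examples)


def upload_annotations(bucket: str, path: str, examples: list[dict]) -> None:
    """Write the JSONL payload to a GCS object for downstream training pipelines."""
    from google.cloud import storage  # needs GCP credentials at runtime

    blob = storage.Client().bucket(bucket).blob(path)
    blob.upload_from_string(annotations_to_jsonl(examples),
                            content_type="application/x-ndjson")
```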
Data flow: Bi-directional
Raw datasets, labeled exports, and revised annotation sets can be stored in separate Google Cloud Storage paths or buckets by project, version, or release date. Prodigy can pull the latest dataset version for annotation, then push completed labels back to a controlled storage location for audit and retraining.
Business value: Supports traceability, compliance, and reproducible model development, especially in regulated industries such as healthcare, finance, and insurance.
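One way to keep those paths consistent is a small naming convention, sketched below; the layout (`project/stage/version/date/`) is an assumed scheme, not something either product mandates:

```python
from datetime import date


def dataset_path(project: str, stage: str, version: str, when: date) -> str:
    """Build a governed object prefix: <project>/<stage>/<version>/<YYYY-MM-DD>/."""
    return f"{project}/{stage}/{version}/{when.isoformat()}/"


def run_prefixes(project: str, version: str, when: date) -> tuple[str, str]:
    """Matching raw-input and labeled-output prefixes for one annotation run."""
    return (dataset_path(project, "raw", version, when),
            dataset_path(project, "labeled", version, when))
```

Deterministic prefixes like these make it straightforward to audit which raw version produced which labeled export.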
Data flow: Google Cloud Storage to Prodigy to Google Cloud Storage
Organizations with large image libraries, such as manufacturing inspection photos, retail product images, or satellite imagery, can store the source images in Google Cloud Storage and use Prodigy to label bounding boxes, classifications, or segmentation masks. Completed annotations are then exported back to cloud storage for model training and validation.
Business value: Enables scalable visual AI programs without duplicating large media files across local environments, reducing storage overhead and operational friction.
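A sketch of serving images without making the bucket public, assuming `google-cloud-storage`: each task carries a time-limited signed URL, which Prodigy's image interfaces can load directly. The helper names are illustrative:

```python
from datetime import timedelta


def image_task(url: str) -> dict:
    """A Prodigy image task only needs an `image` key pointing at the file."""
    return {"image": url}


def signed_image_tasks(bucket: str, prefix: str, ttl_minutes: int = 60):
    """Yield image tasks whose URLs are time-limited signed links, so
    annotators load images straight from GCS without a public bucket."""
    from google.cloud import storage  # needs GCP credentials at runtime

    client = storage.Client()
    for blob in client.list_blobs(bucket, prefix=prefix):
        yield image_task(blob.generate_signed_url(
            expiration=timedelta(minutes=ttl_minutes)))
```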
Data flow: Google Cloud Storage to Prodigy to Google Cloud Storage
Business teams can place customer emails, support tickets, contracts, chat logs, or policy documents in Google Cloud Storage and route them into Prodigy for entity tagging, classification, or intent labeling. Once annotated, the labeled text can be exported back to cloud storage for use in NLP model training pipelines.
Business value: Accelerates development of search, classification, and automation models while keeping sensitive text assets in governed cloud storage.
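For classification or intent labeling, each document can be turned into a task for Prodigy's choice interface, which expects a text plus a list of selectable options. A minimal sketch, with example labels as placeholders:

```python
def choice_task(text: str, labels: list[str]) -> dict:
    """Prodigy choice-interface task: the text plus one selectable option per label."""
    return {"text": text, "options": [{"id": lab, "text": lab} for lab in labels]}
```

Combined with a GCS streaming loader, this turns each stored email or ticket into one labeling decision.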
Data flow: Google Cloud Storage to Prodigy to Google Cloud Storage
Data engineering teams can stage model-generated predictions or uncertain samples in Google Cloud Storage, then send them to Prodigy for human review and correction. The corrected labels are stored back in cloud storage and used to improve model accuracy over time.
Business value: Creates a controlled review process that improves label quality, reduces model drift, and helps domain experts validate edge cases efficiently.
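The selection step can be sketched as a simple confidence-band filter over the staged predictions; the band thresholds and record shape here are assumptions for illustration:

```python
def needs_review(score: float, low: float = 0.35, high: float = 0.65) -> bool:
    """True when a model score falls in the uncertain band worth human review."""
    return low <= score <= high


def select_for_review(preds: list[dict]) -> list[dict]:
    """Keep only the staged predictions a reviewer should correct in Prodigy."""
    return [p for p in preds if needs_review(p["score"])]
```

High- and low-confidence predictions skip review entirely, so expert time is spent only on the edge cases the model is unsure about.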
Data flow: Prodigy to Google Cloud Storage
After annotation is complete, Prodigy exports structured label files to Google Cloud Storage where they can be consumed by training jobs, feature engineering workflows, or model evaluation pipelines running in Google Cloud. This makes it easier for ML engineers to automate retraining without manual file transfers.
Business value: Streamlines the path from labeled data to production-ready models and reduces delays between annotation and deployment.
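A sketch of automating that handoff: dump the dataset with Prodigy's `db-out` command and upload the result under a deterministic object name that training jobs can poll. The naming scheme and function names are assumptions:

```python
import subprocess


def export_blob_name(dataset: str, run_date: str) -> str:
    """Deterministic object name so training jobs can find the latest export."""
    return f"exports/{dataset}/{run_date}/annotations.jsonl"


def push_export(dataset: str, bucket: str, run_date: str) -> None:
    """Dump a dataset with `prodigy db-out` and upload the JSONL to GCS."""
    dump = subprocess.run(["prodigy", "db-out", dataset],
                          check=True, capture_output=True).stdout
    from google.cloud import storage  # needs GCP credentials at runtime

    storage.Client().bucket(bucket).blob(
        export_blob_name(dataset, run_date)
    ).upload_from_string(dump, content_type="application/x-ndjson")
```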
Data flow: Bi-directional
Google Cloud Storage can serve as the shared repository for raw inputs, labeled outputs, and review artifacts, while Prodigy provides the annotation workspace for data scientists and subject matter experts. Different teams can access the same governed storage locations for handoff, review, and reuse of datasets across projects.
Business value: Improves collaboration between AI teams, business analysts, and operations teams, while reducing duplication and ensuring everyone works from the same approved data assets.