How does Google Cloud Platform manage reliable data ingestion?
Google Cloud Platform (GCP) provides a powerful infrastructure for reliable data ingestion through managed services. Understanding how data is generated is crucial: events are produced at massive scale, from eCommerce browsing to social media sharing. Within an organization, this translates into three main use cases:
- User event ingestion: on platforms such as Mercado Libre, every user action generates events in real time.
- Data ingestion through databases with CDC (Change Data Capture): this technique captures row-level changes in a database so downstream systems can react to them.
- Event enrichment with artificial intelligence: Using Google APIs to analyze and enrich unstructured data, such as photos and videos.
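To make the CDC use case concrete, here is a minimal sketch, in plain Python, of how a consumer might apply a stream of change events to keep a downstream replica in sync. The event shape and field names are hypothetical, not a specific CDC product's format.

```python
# Hypothetical CDC change events: each one describes an INSERT, UPDATE,
# or DELETE on a source table; applying them in order keeps a replica in sync.

replica = {}  # primary key -> row, standing in for a downstream store

def apply_change(event):
    """Apply one CDC change record to the in-memory replica."""
    op, key = event['op'], event['key']
    if op in ('insert', 'update'):
        replica[key] = event['row']
    elif op == 'delete':
        replica.pop(key, None)

events = [
    {'op': 'insert', 'key': 1, 'row': {'name': 'Ana', 'city': 'BA'}},
    {'op': 'update', 'key': 1, 'row': {'name': 'Ana', 'city': 'CDMX'}},
    {'op': 'insert', 'key': 2, 'row': {'name': 'Luis', 'city': 'Lima'}},
    {'op': 'delete', 'key': 2, 'row': None},
]
for e in events:
    apply_change(e)

print(replica)  # {1: {'name': 'Ana', 'city': 'CDMX'}}
```

In a real deployment the events would arrive through an ingestion service rather than a list, but the apply logic is the same idea.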
What differentiates a data-driven organization from an event-driven organization?
A data-driven organization takes a strategic approach: before acting, it plans based on strategies and hypotheses, which implies long-term development. In contrast, an event-driven organization responds to data in real time, letting events dictate actions for a faster, more adaptive reaction to business needs.
What are the characteristics of each approach?
Data-driven:
- Long-term strategy and assumptions.
- Low time sensitivity.
- Pre-planning prior to strategy implementation.
Event-driven:
- Rapid and adaptive response.
- Actions defined by real-time events.
- Data drives decisions, enabling agile execution.
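The event-driven style above can be sketched as a small dispatcher: instead of a pre-planned batch job, a handler runs the moment each event arrives. All names here are illustrative, not a GCP API.

```python
# Minimal event-driven dispatch sketch: handlers registered per event
# type react immediately as events arrive.

handlers = {}

def on(event_type):
    """Register a handler function for a given event type."""
    def register(fn):
        handlers[event_type] = fn
        return fn
    return register

@on('purchase')
def handle_purchase(event):
    return f"charge {event['amount']}"

@on('page_view')
def handle_view(event):
    return f"log view of {event['page']}"

def dispatch(event):
    """Route an incoming event to its registered handler."""
    return handlers[event['type']](event)

print(dispatch({'type': 'purchase', 'amount': 42}))  # charge 42
```

A data-driven workflow would instead collect these events and analyze them later against a pre-defined hypothesis; the dispatcher shows the contrast, where the event itself triggers the action.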
How does Google Cloud facilitate these data ingestion approaches?
Google provides a platform that encompasses five key points for reliable data ingestion:
- Robust ingest services: Capture events regardless of size or velocity.
- Unified data ingestion: processes batch or streaming data with the same pipeline code, without rewriting it for each mode.
- Serverless architecture: Maximizes efficiency by eliminating the need to manage servers.
- Sense-making tools: provide the ability to extract meaningful information from data in real time.
- Flexibility for users: No programming experience is required to take advantage of the platform.
What products support this architecture?
Pub/Sub
- Global product that captures data at the closest point of production.
- Scalable, processing up to 100 GB per second.
- Spotify as a use case, handling 8.5 million events per second.
from google.cloud import pubsub_v1

client = pubsub_v1.PublisherClient()
topic_path = client.topic_path('your-project', 'your-topic')

data = 'your-message'.encode('utf-8')
client.publish(topic_path, data)
Dataflow
- Based on Apache Beam; the same pipeline code can be reused in batch or real time.
- Integrated with several processing engines such as Apache Flink and Spark.
- Guarantees exactly-once message processing when used together with Pub/Sub.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | 'Input' >> beam.Create([1, 2, 3, 4, 5])
        | 'Multiply' >> beam.Map(lambda x: x * 10)
        | 'Output' >> beam.io.WriteToText('output.txt')
    )
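Pub/Sub itself delivers messages at least once, so the same message can arrive twice; exactly-once processing is achieved by deduplicating on a unique message ID before applying side effects. The sketch below illustrates that idea in plain Python with in-memory state; Dataflow does this internally with persistent, fault-tolerant state, and all names here are illustrative.

```python
# Sketch: exactly-once processing built on at-least-once delivery by
# deduplicating on a unique message ID.

seen_ids = set()
results = []

def process_once(message):
    """Process a message only if its ID has not been seen before."""
    if message['id'] in seen_ids:
        return False  # duplicate redelivery, skip
    seen_ids.add(message['id'])
    results.append(message['value'] * 10)
    return True

# Simulated at-least-once stream: message 'a' is redelivered once.
stream = [
    {'id': 'a', 'value': 1},
    {'id': 'b', 'value': 2},
    {'id': 'a', 'value': 1},  # duplicate delivery
]
for m in stream:
    process_once(m)

print(results)  # [10, 20]
```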
Other components
- BigQuery: Stores event data in a serverless and scalable way.
- AI Platform and TensorFlow: Operationalizes artificial intelligence models, enabling complex analytics and predictions.
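As a sketch of the BigQuery step, events are typically flattened into JSON rows before being streamed in; the google-cloud-bigquery client accepts rows as plain dicts via `Client.insert_rows_json`. The table, dataset, and field names below are hypothetical, and the actual client call (which requires credentials) is shown only as a comment.

```python
# Shaping raw events into BigQuery-ready JSON rows (field names hypothetical).

def to_bq_row(event):
    """Flatten a raw event into a dict matching the target table schema."""
    return {
        'event_id': event['id'],
        'event_type': event['type'],
        'ts': event['ts'],
    }

events = [
    {'id': 'e1', 'type': 'purchase', 'ts': '2024-01-01T00:00:00Z'},
    {'id': 'e2', 'type': 'page_view', 'ts': '2024-01-01T00:00:05Z'},
]
rows = [to_bq_row(e) for e in events]
print(rows[0]['event_type'])  # purchase

# With credentials configured, the rows could then be streamed:
# from google.cloud import bigquery
# client = bigquery.Client()
# errors = client.insert_rows_json('your-project.your_dataset.events', rows)
```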
Why choose Google for data ingestion?
- Unified process for data ingestion and analysis in batch and real time.
- Integrated solutions that democratize analytics.
- Success stories such as Emarsys, which processes 250,000 events per second and reduced costs by 70%.
Google Cloud is a robust and flexible ally for any organization that wants to implement reliable data ingestion, adapting to changing demands and scaling with business growth.