talk-data.com


Parallel Processing with Python

Modern software often needs to do many things at the same time to run faster and scale better. This includes data processing, web services, and machine learning workloads. Understanding parallel and concurrent execution is now an important skill for Python developers. This session gives a clear and practical introduction to parallel processing in Python. It focuses on the main ideas and shows when and how to use different approaches correctly.

Who is this for?

Students, developers, and anyone who wants to understand how Python programs can run faster by doing work in parallel. This session is useful if you want to speed up Python programs, understand the difference between threads and processes, and build more efficient and scalable applications.

Who is leading the session?

The session is led by Dr. Stelios Sotiriadis, CEO of Warestack and Associate Professor and MSc Programme Director at Birkbeck, University of London.

He works in distributed systems, cloud computing, operating systems, and Python-based data processing. He holds a PhD from the University of Derby, completed a postdoctoral fellowship at the University of Toronto, and has worked with Huawei, IBM, Autodesk, and several startups. He has been teaching at Birkbeck since 2018 and founded Warestack in 2021.

Requirements

A laptop (Windows, macOS, or Linux) with Python, pip, and Visual Studio Code installed. Lab computers can be used if needed.

What we will cover

This is a hands-on introduction with examples and short exercises. Topics include: what concurrency and parallelism mean; threads vs processes in Python; the Global Interpreter Lock, explained simply; using threading for I/O-heavy tasks; using multiprocessing for CPU-heavy tasks; basic use of concurrent.futures; common problems such as race conditions; and when parallelism is not the right choice.

Format

A 1.5-hour live session with short theory explanations, live coding, and guided exercises. The session runs in person, with streaming available for remote participants.
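The distinction between threading for I/O-heavy work and multiprocessing for CPU-heavy work is easiest to see in code. Below is a minimal sketch (not taken from the session materials) using concurrent.futures; io_task and cpu_task are illustrative stand-ins for real work.

```python
# Minimal sketch: the same map-style job run with threads (good for I/O-bound
# work) and with processes (good for CPU-bound work) via concurrent.futures.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(n):
    time.sleep(0.5)                          # stands in for a network call or disk read
    return n

def cpu_task(n):
    return sum(i * i for i in range(n))      # pure computation, limited by the GIL under threads

if __name__ == "__main__":
    # Threads: overlapping the waits makes I/O-bound work finish faster.
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(io_task, range(8))))

    # Processes: separate interpreters sidestep the GIL for CPU-bound work.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_task, [10**6] * 4)))
```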

Prerequisites

Basic to intermediate Python knowledge, including functions, loops, and basic data structures.

Parallel Processing with Python

These are the notes from the previous "How to Build a Portfolio That Reflects Your Real Skills" event:

Properties of an ideal portfolio repository:

  • Built to prove employable skills and readiness for real work
  • Fewer projects, carefully chosen to match job requirements
  • Clean, readable, refactored code that follows best practices
  • Detailed READMEs (setup, features, tech stack, decisions, how to deploy, testing strategy, etc)
  • Logical, meaningful commits that show development process <- you can follow the git history for important commits/features
  • Clear architecture (layers, packages, separation of concerns) <- use best practices
  • Unit and integration tests included and explained <- also talk about them in the README
  • Proper validation, exceptions, and edge case handling
  • Polished, complete, production-like projects only
  • “Can this person work on our codebase?” <- reviewers will ask this
  • Written for recruiters, hiring managers, and senior engineers
  • Uses industry-relevant and job-listed technologies <- tech stack should match the CV
  • Well-scoped, realistic features similar to real products
  • Consistent style, structure, and conventions across projects
  • Environment variables, clear setup steps, sample configs
  • Minimal, justified dependencies with clear versioning
  • Proper logging, and meaningful log messages
  • No secrets committed, basic security best practices applied
  • Shows awareness of scaling, performance, and future growth <- at least have a "possible improvements" section in the README
  • A list of ADRs (architecture decision records) explaining design choices and trade-offs <- should be part of the documentation

📌 Backend & Frontend Portfolio Project Ideas

These projects are intentionally reusable across tech stacks. Following tutorials and reusing patterns is expected — what matters is:

  • understanding the architecture
  • explaining trade-offs
  • documenting decisions clearly

☕ Junior Java Backend Developer (Spring Boot)

1. Shop Manager Application

A monolithic Spring Boot app designed with microservice-style boundaries. Features

  • Secure user registration & login
  • Role-based access control using JWT
  • REST APIs for users, products, inventory, and orders
  • Automatic inventory updates when orders are placed
  • CSV upload for bulk product & inventory import
  • Clear service boundaries (UserService, OrderService, InventoryService, etc.)

Engineering Focus

  • Clean architecture (controllers, services, repositories)
  • Global exception handling
  • Database migrations (Flyway/Liquibase)
  • Unit & integration testing
  • Clear README explaining architecture decisions

2. Parallel Data Processing Engine

Backend service for processing large datasets efficiently. Features

  • Upload large CSV/log files
  • Split data into chunks
  • Process chunks in parallel using ExecutorService and CompletableFuture
  • Aggregate and return results

Demonstrates

  • Java concurrency
  • Thread pools & async execution
  • Performance optimization

3. Distributed Task Queue System

Simple async job processing system. Features

  • One service submits tasks
  • Another service processes them asynchronously
  • Uses Kafka or RabbitMQ
  • Tasks: report generation, data transformation

Demonstrates

  • Message-driven architecture
  • Async workflows
  • Eventual consistency

4. Rate Limiting & Load Control Service

Standalone service that protects APIs from abuse. Features

  • Token bucket or sliding window algorithms
  • Redis-backed counters
  • Per-user or per-IP limits

Demonstrates

  • Algorithmic thinking
  • Distributed state
  • API protection patterns

5. Search & Indexing Backend

Document or record search service. Features

  • In-memory inverted index
  • Text search, filters, ranking
  • Optional Elasticsearch integration

Demonstrates

  • Data structures
  • Read-optimized design
  • Trade-offs between custom vs external tools

6. Distributed Configuration & Feature Flag Service

Centralized config service for other apps. Features

  • Key-value configuration store
  • Feature flags
  • Caching & refresh mechanisms

Demonstrates

  • Caching strategies
  • Consistency vs availability trade-offs
  • System design for shared services

🐹 Mid-Level Go Backend Developer (Non-Kubernetes)

1. High-Throughput Event Processing Pipeline

Multi-stage concurrent pipeline. Features

  • HTTP/gRPC ingestion
  • Validation & transformation stages
  • Goroutines & channels
  • Worker pools, batching, backpressure
  • Graceful shutdown

2. Distributed Job Scheduler & Worker System

Async job execution platform. Features

  • Job scheduling & delayed execution
  • Retries & idempotency
  • Job states (pending, running, failed, completed)
  • Message queue or gRPC-based workers

3. In-Memory Caching Service

Redis-like cache written from scratch. Features

  • TTL support
  • Eviction strategies (LRU/LFU)
  • Concurrent-safe access
  • Optional disk persistence

4. Rate Limiting & Traffic Shaping Gateway

Reverse-proxy-style rate limiter. Features

  • Token bucket / leaky bucket
  • Circuit breakers
  • Redis-backed distributed limits

5. Log Aggregation & Query Engine

Incrementally built system: Step-by-step

  1. REST API + Postgres (store logs, query logs)
  2. Optimize for massive concurrency
  3. Replace DB with in-memory data structures
  4. Add streaming endpoints using channels & batching

🐍 Mid-Level Python Backend Developer

1. Asynchronous Task Processing System

Async job execution platform. Features

  • Async API submission
  • Worker pool (asyncio or Celery-like)
  • Retries & failure handling
  • Job status tracking
  • Idempotency
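As a hedged illustration of the core of such a system, the sketch below uses an asyncio.Queue with a small worker pool and a bounded retry budget; all names (handle, worker, the retry limit of 3) are hypothetical choices, not a prescribed design.

```python
# Sketch: jobs go onto an asyncio.Queue, a fixed pool of workers pulls from it,
# and transient failures are retried a bounded number of times.
import asyncio
import random

async def handle(job: str) -> None:
    if random.random() < 0.3:                 # simulate a transient failure
        raise RuntimeError(f"transient error on {job}")
    await asyncio.sleep(0.1)                  # simulate real work

async def worker(name: str, queue: asyncio.Queue) -> None:
    while True:
        job, attempts = await queue.get()
        try:
            await handle(job)
            print(f"{name}: {job} done")
        except RuntimeError:
            if attempts < 3:
                await queue.put((job, attempts + 1))   # retry with a bounded budget
            else:
                print(f"{name}: {job} failed permanently")
        finally:
            queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(10):
        queue.put_nowait((f"job-{i}", 0))
    workers = [asyncio.create_task(worker(f"w{i}", queue)) for i in range(3)]
    await queue.join()                        # wait until every job is processed or exhausted
    for w in workers:
        w.cancel()

asyncio.run(main())
```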

2. Event-Driven Data Pipeline

Streaming data processing service. Features

  • Event ingestion
  • Validation & transformation
  • Batching & backpressure handling
  • Output to storage or downstream services

3. Distributed Rate Limiting Service

API protection service. Steps

  • Step 1: Use an existing rate-limiting library
  • Step 2: Implement token bucket / sliding window yourself
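For Step 2, a single-process token bucket is small enough to sketch here; a distributed version would keep the same counters in Redis. The class and parameter names below are illustrative only.

```python
# Sketch of a token bucket: tokens refill at a fixed rate up to a burst capacity,
# and each allowed request consumes one token.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate                      # tokens added per second
        self.capacity = capacity              # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)     # ~5 requests/second, bursts of up to 10
print([bucket.allow() for _ in range(12)])    # the final requests are rejected
```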

4. Search & Indexing Backend

Search system for logs or documents. Features

  • Custom indexing or Elasticsearch
  • Filtering & time-based queries
  • Read-heavy optimization

5. Configuration & Feature Flag Service

Shared configuration backend. Steps

  • Step 1: Use a caching library
  • Step 2: Implement your own cache (explain in README)
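For Step 2, a minimal TTL cache like the sketch below is the kind of component the README could explain before swapping in a library or Redis; the class and method names are illustrative.

```python
# Sketch of a tiny TTL cache for config/feature-flag values, with lazy eviction on read.
import time
from typing import Any, Optional

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]              # lazily evict expired entries on read
            return None
        return value

cache = TTLCache(ttl_seconds=30)
cache.set("feature.new_checkout", True)
print(cache.get("feature.new_checkout"))      # True until the TTL elapses
```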

🟦 Mid-Level TypeScript Backend Developer

1. Asynchronous Job Processing System

Queue-based task execution. Features

  • BullMQ / RabbitMQ / Redis
  • Retries & scheduling
  • Status tracking

2. Real-Time Chat / Notification Service

WebSocket-based system. Features

  • Presence tracking
  • Message persistence
  • Real-time updates

3. Rate Limiting & API Gateway

API gateway with protections. Features

  • Token bucket / sliding window
  • Response caching
  • Request logging

4. Search & Filtering Engine

Search backend for products, logs, or articles. Features

  • In-memory index or Elasticsearch
  • Pagination & sorting

5. Feature Flag & Configuration Service

Centralized config management. Features

  • Versioning
  • Rollout strategies
  • Caching

🟨 Mid-Level Node.js Backend Developer

1. Async Task Queue System

Background job processor. Features

  • Bull / Redis / RabbitMQ
  • Retries & scheduling
  • Status APIs

2. Real-Time Chat / Notification Service

Socket-based system. Features

  • Rooms
  • Presence tracking
  • Message persistence

3. Rate Limiting & API Gateway

Traffic control service. Features

  • Per-user/API-key limits
  • Logging
  • Optional caching

4. Search & Indexing Backend

Indexing & querying service.


5. Feature Flag / Configuration Service

Shared backend for app configs.


⚛️ Mid-Level Frontend Developer (React / Next.js)

1. Dynamic Analytics Dashboard

Interactive data visualization app. Features

  • Charts & tables
  • Filters & live updates
  • React Query / Redux / Zustand
  • Responsive layouts

2. E-Commerce Store

Full shopping experience. Features

  • Product listings
  • Search, filters, sorting
  • Cart & checkout
  • SSR/SSG with Next.js

3. Real-Time Chat / Collaboration App

Live multi-user UI. Features

  • WebSockets or Firebase
  • Presence indicators
  • Real-time updates

4. CMS / Blogging Platform

SEO-focused content app. Features

  • SSR for SEO
  • Markdown or API-based content
  • Admin editing interface

5. Personalized Analytics / Recommendation UI

Data-heavy frontend. Features

  • Filtering & lazy loading
  • Large dataset handling
  • User-specific insights

6. AI Chatbot App — “My House Plant Advisor”

LLM-powered assistant with production-quality UX. Core Features

  • Chat interface with real-time updates
  • Input normalization & validation
  • Offensive content filtering
  • Unsupported query detection
  • Rate limiting (per user)
  • Caching recent queries
  • Conversation history per session
  • Graceful fallbacks & error handling

Advanced Features

  • Prompt tuning (beginner vs expert users)
  • Structured advice formatting (cards, bullets)
  • Local LLM support
  • Analytics dashboard (popular questions)
  • Voice input/output (speech-to-text, TTS)

✅ Final Advice

You do NOT need to build everything. Instead, pick 1–2 strong projects per role and focus on depth:

  • Explain the architecture clearly
  • Document trade-offs (why you chose X over Y)
  • Show incremental improvements
  • Prove you understand why, not just how

📌 Portfolio Quality Signals (Very Important)

  • Have a large, organic commit history → a single commit, or very few, is a strong indicator of copy-paste work.
  • Prefer 3–5 complex projects over 20 simple ones → Many tiny projects often signal shallow understanding.

🎯 Why This Helps in Interviews

Working on serious projects gives you:

  • Real hands-on practice
  • Concrete anecdotes (stories you can tell in interviews)
  • A safe way to learn technologies you don’t fully know yet
  • Better focus and long-term learning discipline
  • A portfolio that can be ported to another tech stack later (Java → Go, Node → Python, etc.)

🎥 Demo & Documentation Best Practices

  • Create a 2–3 minute demo / walkthrough video: show the app running, explain what problem it solves, and highlight one or two technical decisions
  • At the top of every README, add a plain-English paragraph explaining what the project does; assume the reader is a complete beginner

🤝 Open Source & Personal Projects (Interview Signal)

Always mention that you have contributed to Open Source or built personal projects.

  • Shows team spirit
  • Shows you can read, understand, and navigate an existing codebase
  • Signals that you can onboard into a real-world repository
  • Makes you sound like an engineer, not just a tutorial follower
[Notes] How to Build a Portfolio That Reflects Your Real Skills
Kate Shaw – Senior Product Manager for Data and SLIM @ SnapLogic, Tobias Macey – host

Summary In this episode Kate Shaw, Senior Product Manager for Data and SLIM at SnapLogic, talks about the hidden and compounding costs of maintaining legacy systems—and practical strategies for modernization. She unpacks how “legacy” is less about age and more about when a system becomes a risk: blocking innovation, consuming excess IT time, and creating opportunity costs. Kate explores technical debt, vendor lock-in, lost context from employee turnover, and the slippery notion of “if it ain’t broke,” especially when data correctness and lineage are unclear. She digs into governance, observability, and data quality as foundations for trustworthy analytics and AI, and why exit strategies for system retirement should be planned from day one. The discussion covers composable architectures to avoid monoliths and big-bang migrations, how to bridge valuable systems into AI initiatives without lock-in, and why clear success criteria matter for AI projects. Kate shares lessons from the field on discovery, documentation gaps, parallel run strategies, and using integration as the connective tissue to unlock data for modern, cloud-native and AI-enabled use cases. She closes with guidance on planning migrations, defining measurable outcomes, ensuring lineage and compliance, and building for swap-ability so teams can evolve systems incrementally instead of living with a “bowl of spaghetti.”

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Kate Shaw about the true costs of maintaining legacy systems.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are your criteria for when a given system or service transitions to being "legacy"?
  • In order for any service to survive long enough to become "legacy" it must be serving its purpose and providing value. What are the common factors that prompt teams to deprecate or migrate systems?
  • What are the sources of monetary cost related to maintaining legacy systems while they remain operational?
  • Beyond monetary cost, economics also has a concept of "opportunity cost". What are some of the ways that manifests in data teams who are maintaining or migrating from legacy systems?
  • How does that loss of productivity impact the broader organization?
  • How does the process of migration contribute to issues around data accuracy, reliability, etc., as well as contributing to potential compromises of security and compliance?
  • Once a system has been replaced, it needs to be retired. What are some of the costs associated with removing a system from service?
  • What are the most interesting, innovative, or unexpected ways that you have seen teams address the costs of legacy systems and their retirement?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on legacy systems migration?
  • When is deprecation/migration the wrong choice?
  • How have evolutionary architecture patterns helped to mitigate the costs of system retirement?

Contact Info

  • LinkedIn

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

  • SnapLogic
  • SLIM (SnapLogic Intelligent Modernizer)
  • Opportunity Cost
  • Sunk Cost Fallacy
  • Data Governance
  • Evolutionary Architecture

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

AI/ML Analytics Cloud Computing Data Engineering Data Management Data Quality Datafold ETL/ELT Prefect Python Cyber Security Data Streaming
Data Engineering Podcast

Beyond embarrassingly parallel processing problems, data must be shared between workers for them to do something useful. This can be done by:

  • sharing memory between threads, with the challenge of protecting access to shared data to avoid race conditions
  • copying memory to subprocesses, with the challenge of synchronizing data whenever it is mutated

In Python, using threads is not an option because of the GIL (global interpreter lock), which prevents true parallelism. This might change in the future with the removal of the GIL, but the usual problems with multithreading will then appear, such as using locks and managing their complexity. Subprocesses don't suffer from the GIL, but they usually need to access a database to share data, which is often too slow. Algorithms such as HAMT (hash array mapped trie) have been used to efficiently and safely share data stored in immutable data structures, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
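To make "the usual problems with multithreading" concrete, here is a small sketch (not from the talk) showing that a read-modify-write on shared state can lose updates even under the GIL, and how a lock restores correctness:

```python
# Sketch: a shared counter incremented by several threads. The bare increment is
# not atomic (load, add, store), so updates can be lost; a lock makes it safe.
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        counter += 1                  # another thread can interleave between load and store

def safe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:                    # the lock makes the update atomic
            counter += 1

def run(worker) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(unsafe_increment))          # can be less than 400000 (lost updates)
print(run(safe_increment))            # always 400000
```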

Python
PyData Paris 2025
Event SciPy 2025 2025-07-09

Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly-parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion on both a local machine and on a range of Cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array geoscience workloads.

Cloud Computing NumPy Python

Data manipulation libraries like Polars allow us to analyze and process data much faster than with native Python, but that’s only true if you know how to use them properly. When the team working on NCEI's Global Summary of the Month first integrated Polars, they found it was actually slower than the original Java version. In this talk, we'll discuss how our team learned how to think about computing problems like spreadsheet programmers, increasing our products’ processing speed by over 80%. We’ll share tips for rewriting legacy code to take advantage of parallel processing. We’ll also cover how we created custom, pre-compiled functions with Numba when the business requirements were too complex for native Polars expressions.
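As a hedged illustration of the general point (not the NCEI codebase), the snippet below contrasts a row-wise Python callable, which bypasses Polars' parallel engine, with the same computation written as a native expression; the column names are invented.

```python
# Sketch for recent Polars versions: expression-based code runs in parallel Rust,
# while element-wise Python callables do not.
import polars as pl

df = pl.DataFrame({"tmin": [10.0, 12.5, 9.0], "tmax": [21.0, 25.0, 18.5]})

# Slow pattern: a Python function applied element-wise defeats the parallel engine.
slow = df.with_columns(
    pl.struct(["tmin", "tmax"])
      .map_elements(lambda r: (r["tmin"] + r["tmax"]) / 2, return_dtype=pl.Float64)
      .alias("tavg")
)

# Fast pattern: the same calculation as a native expression.
fast = df.with_columns(((pl.col("tmin") + pl.col("tmax")) / 2).alias("tavg"))
print(fast)
```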

Java Polars Python

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. In this sense, cloud-optimized data is a nice fit for data-parallel jobs using serverless. FaaS provides a data-driven scalable and cost-efficient experience, with practically no management burden. Each serverless function will read and process a small portion of the cloud-optimized dataset, being read in parallel directly from object storage, significantly increasing the speedup.

In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit that is specially designed to process data from cloud object storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management for scientific settings such as genomics, metabolomics, or geospatial data. We will show different data processing pipelines in the cloud that demonstrate the benefits of cloud-optimized data management.
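A minimal sketch of the Lithops pattern described above follows; it assumes a serverless backend and storage have already been configured, and process_chunk is a placeholder for reading and processing one slice of a cloud-optimized object.

```python
# Sketch: fan out one serverless function invocation per chunk and aggregate the results.
import lithops

def process_chunk(chunk_id):
    # In a real pipeline this would read one byte range / partition of a
    # cloud-optimized object and return a partial result.
    return chunk_id * 2

if __name__ == "__main__":
    fexec = lithops.FunctionExecutor()        # uses the configured serverless backend
    fexec.map(process_chunk, range(16))       # one function invocation per chunk
    print(sum(fexec.get_result()))            # aggregate the partial results
```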

Cloud Computing Cloud Storage Data Management GitHub Python

Ray is an open-source framework for scaling Python applications, particularly machine learning and AI workloads. It provides the layer for parallel processing and distributed computing. Many large language models (LLMs), including OpenAI's GPT models, are trained using Ray.

On the other hand, Apache Airflow is a consolidated data orchestration framework downloaded more than 20 million times monthly.

This talk presents the Airflow Ray provider package, which allows users to interact with Ray from an Airflow workflow. I'll show how to use the package to create Ray clusters and how Airflow can trigger Ray pipelines in those clusters.
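For context, the Ray side of this looks roughly like the sketch below (the Airflow provider wiring itself is not shown here): remote tasks fan work out across cores or a cluster.

```python
# Sketch: Ray remote tasks executed in parallel and gathered with ray.get().
import ray

ray.init()                                    # connects to a running cluster, or starts a local one

@ray.remote
def square(x: int) -> int:
    return x * x

futures = [square.remote(i) for i in range(8)]   # schedule tasks in parallel
print(ray.get(futures))                          # gather the results
```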

AI/ML Airflow Astronomer GitHub LLM Python
PyData London 2025
PyData @Arpeely 2024-12-11 · 16:00

Join Us for Another PyData Meetup @Arpeely! Get ready for insightful sessions, networking with great people, and, of course, beer and pizza! A big thanks to Arpeely for hosting us!

Wednesday, December 11th 18:00-21:00

Event Highlights - Welcome words from our host – Arpeely

- Ronny Ahituv: Supercharging CTR (Click-Through Rate) with Plug-and-Play AI Capabilities. Explore a fully AI-driven pipeline designed to boost click-through rates (CTR) using adaptable, off-the-shelf tools. The pipeline leverages: generative AI and genetic algorithms for creating diverse ad creatives; contextual multi-armed bandits to select the best creatives based on real-time data, powered by built-in regressors; ARIMA models to capture and adjust for seasonal trends; and multimodal embeddings to efficiently handle and cluster high-cardinality features. This session will demonstrate how integrating readily available AI solutions can help achieve more effective, streamlined CTR optimization.

- Yuval Feinstein: Georgia on my Mind: NLP Meets Social Network Analysis for Exploring New Domains How do you choose your focus when entering a new domain? I suggest a method combining social network analysis (SNA) with natural language processing (NLP). We'll utilize the networkx, spaCy and wikipedia Python packages to get from search terms to insights.

- Mike Erlihson: State-Space Models and Deep Learning: Is there a new revolution on our doorstep? State-space models (SSMs) have advanced from dynamic system tools to deep learning architectures like Mamba (S6), which combine parallel training with efficient inference for long sequences. This lecture covers SSMs' evolution and impact on sequential data modeling.
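As a rough, hypothetical sketch of the networkx/spaCy/wikipedia pipeline mentioned in the Yuval Feinstein talk above (the talk's actual method may differ, and the spaCy model must be downloaded separately):

```python
# Sketch: fetch pages for a search term, extract named entities with spaCy,
# link co-occurring entities in a networkx graph, and rank them by centrality.
import networkx as nx
import spacy                                   # requires: python -m spacy download en_core_web_sm
import wikipedia

nlp = spacy.load("en_core_web_sm")
graph = nx.Graph()

for title in wikipedia.search("Georgia", results=3):
    try:
        summary = wikipedia.summary(title, sentences=5)
    except wikipedia.exceptions.WikipediaException:
        continue                               # skip ambiguous or missing pages
    entities = {ent.text for ent in nlp(summary).ents}
    for a in entities:
        for b in entities:
            if a < b:
                graph.add_edge(a, b)           # co-occurrence within one summary

# The most central entities suggest where to focus when exploring the domain.
central = sorted(nx.degree_centrality(graph).items(), key=lambda kv: -kv[1])[:10]
print(central)
```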

Space is limited – RSVP now to secure your spot! *This meetup will be held in English.

PyData @Arpeely

Venue: Carnival House, 100 Harbour Parade, Southampton, SO15 1ST

📢 Want to speak? 📢 Submit your talk proposal.

Please note:

  1. 🚨🚨🚨A valid photo ID is required by building security. You MUST use your initial/first name and surname on your meetup profile, otherwise, you will NOT make it on the guest list! 🚨🚨🚨
  2. This event follows the NumFOCUS Code of Conduct, please familiarise yourself with it before the event.

If your RSVP status says "You're going" you will be able to get in. No further confirmation required. You will NOT need to show your RSVP confirmation when signing in. If you can no longer make it, please unRSVP as soon as you know so we can assign your place to someone on the waiting list.

*** Code of Conduct: This event follows the NumFOCUS Code of Conduct, please familiarise yourself with it before the event. Please get in touch with the organisers with any questions or concerns regarding the Code of Conduct. *** There will be pizza & drinks, generously provided by our host, Carnival UK. ***

Mastering Data Flow: Prefect Pipelines Workshop - Adam Hill & Chris Frohmaier

Join us for an engaging workshop where we'll dive deep into the world of data engineering with Prefect 3. Throughout the session, participants will explore the following key topics:

  • Overview of Prefect and its core features
  • Understanding the Prefect ecosystem and its integration with popular data science tools
  • Setting up a Prefect environment: Installation, configuration, and project setup

Building Data Pipelines:

  • Data ingestion: Fetching data from various sources including RSS feeds, APIs, and databases
  • Data transformation and manipulation using Prefect tasks and flows
  • Data storage and persistence: Storing processed data into local databases such as MongoDB or SQL
  • Integrating machine learning models for advanced data processing and analysis within Prefect workflows

Advanced Techniques and Best Practices:

  • Implementing error handling and retry strategies for fault tolerance and reliability
  • Exploring Prefect's advanced features such as parallel execution, versioning, and dependency management

By the end of the workshop, attendees will have gained a comprehensive understanding of Prefect 3 and its capabilities, empowering them to design, execute, and optimise data pipelines efficiently in real-world scenarios.
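As a small, hedged preview of the building blocks the workshop covers, the sketch below composes Prefect tasks with retries into a flow; the feed URL is a placeholder and httpx is assumed to be installed.

```python
# Sketch: Prefect 3 tasks with a retry policy composed into a flow.
import httpx
from prefect import flow, task

@task(retries=3, retry_delay_seconds=5)
def fetch(url: str) -> str:
    response = httpx.get(url, timeout=10)
    response.raise_for_status()                # a transient failure here triggers a retry
    return response.text

@task
def count_items(payload: str) -> int:
    return payload.count("<item>")             # toy "transformation" step

@flow(log_prints=True)
def rss_pipeline(url: str) -> None:
    raw = fetch(url)
    print(f"feed contains {count_items(raw)} items")

if __name__ == "__main__":
    rss_pipeline("https://example.com/feed.xml")   # placeholder; use a real feed URL
```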

We invite you to join us on this exciting journey of mastering data flows with Prefect!

Instructions to prepare in advance Workshop Materials and Requirements: In advance of the workshop please visit the github repo here: https://github.com/Cadarn/PyData-Prefect-Workshop. Clone a copy of the repository and follow the setup instructions in the README file including:

  • Setting up a new Python environment with all the required modules
  • Installing Docker and Docker Compose if necessary and then running the docker compose file so that local copies of the software images are downloaded in advance.
  • Create a free MongoDB Atlas account and set up a project/database for the workshop

Please follow the instructions in advance of attending the workshop.

Please note this is a practical session and you will need to bring your own laptop. We recommend you bring it fully charged, if you can, as there may not be enough plug sockets for everyone to use at the same time.

Logistics Doors open at 6.30 pm, talks start at 7 pm. For those who wish to continue networking and chatting we will move to a nearby pub/bar for drinks from 9 pm.

Please unRSVP in good time if you realise you can't make it. We're limited by building security on the number of attendees, so please free up your place for your fellow community members!

Follow @pydatasoton (https://twitter.com/pydatasoton) for updates and early announcements. We are also on Instagram/Threads as @pydatasoton; and find us on LinkedIn.

PyData Southampton - 11th Meetup

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 50: Summary & Exam Preparation. It's the last session in the series. In this session, we will review everything we have learned throughout this series. We will also look at some examples of questions from the exam (DP-203) and solve them together. Agenda:

  • 20:00-20:15 – Opening, Announcements, and More...
  • 20:15-21:15 – Session
  • 21:15-21:30 – Q&A and Open Discussion
Azure Data Engineer - Part 50: Summary & Exam Preparation

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 49: Integrate Microsoft Purview and Azure Synapse Analytics. In this module, we will learn how to integrate Microsoft Purview with Azure Synapse Analytics to improve data discoverability and lineage tracking. In this module, you'll learn how to:

  • Catalog Azure Synapse Analytics database assets in Microsoft Purview.
  • Configure Microsoft Purview integration in Azure Synapse Analytics.
  • Search the Microsoft Purview catalog from Synapse Studio.
  • Track data lineage in Azure Synapse Analytics pipelines activities.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/integrate-microsoft-purview-azure-synapse-analytics/. Agenda:

  • 10:00-10:15 – Opening, Announcements, and More...
  • 10:15-11:15 – Session
  • 11:15-11:30 – Q&A and Open Discussion
Azure Data Engineer - Part 49: Integrate Microsoft Purview and Synapse Analytics

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 48: Manage Power BI Assets by Using Microsoft Purview. In this module, we will learn how to improve data governance and asset discovery using Power BI and Microsoft Purview integration. In this module, you'll learn how to:

  • Register and scan a Power BI tenant.
  • Use the search and browse functions to find data assets.
  • Describe the schema details and data lineage tracing of Power BI data assets.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/manage-power-bi-artifacts-use-microsoft-purview/. Agenda:

  • 20:00-20:15 – Opening, Announcements, and More...
  • 20:15-21:15 – Session
  • 21:15-21:30 – Q&A and Open Discussion
Azure Data Engineer - Part 48: Manage Power BI Assets by Using Microsoft Purview
Mika Kimmins – author, Holden Karau – author

Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing, but many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn. Authors Holden Karau and Mika Kimmins show you how to use Dask computations on local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA. With this book, you'll learn:

  • What Dask is, where you can use it, and how it compares with other tools
  • How to use Dask for batch data parallel processing
  • Key distributed system concepts for working with Dask
  • Methods for using Dask with higher-level APIs and building blocks
  • How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
  • How to use Dask with GPUs
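A minimal illustration of the pattern the book describes, assuming only dask and pandas are installed: a pandas-style computation expressed through Dask so it runs in parallel locally and can later be pointed at a cluster.

```python
# Sketch: a small pandas DataFrame wrapped in Dask; partitions are processed in
# parallel and nothing executes until compute() is called.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

result = ddf.groupby("group")["value"].mean()
print(result.compute())
```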

data data-science data-science-tools dask API Cloud Computing NumPy Pandas Python PyTorch Scikit-learn
O'Reilly Data Science Books

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 47: Catalog Data Artifacts by Using Microsoft Purview. In this module, we will learn how to register, scan, catalog, and view data assets and their relevant details in Microsoft Purview. In this module, you'll learn how to:

  • Describe asset classification in Microsoft Purview.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/catalog-data-artifacts-use-microsoft-purview/. Agenda:

  • 20:00-20:15 – Opening, Announcements, and More...
  • 20:15-21:15 – Session
  • 21:15-21:30 – Q&A and Open Discussion
Azure Data Engineer - Part 47: Catalog Data Artifacts by Using Microsoft Purview

Join us in this weekly series and learn how to become an Azure Data Engineer and integrate, transform, and consolidate data from various structured and unstructured data systems into structures suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, which will prepare you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 46: Discover Trusted Data Using Microsoft Purview. In this module, we will use Microsoft Purview Studio to discover trusted organizational assets for reporting. In this module, you'll learn how to:

  • Browse, search, and manage data catalog assets.
  • Use data catalog assets with Power BI.
  • Use Microsoft Purview in Azure Synapse Studio.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/discover-trusted-data-use-azure-purview/. Agenda:

  • 10:00-10:15 – Opening, Announcements, and More...
  • 10:15-11:15 – Session
  • 11:15-11:30 – Q&A and Open Discussion
Azure Data Engineer - Part 46: Discover Trusted Data Using Microsoft Purview

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 45: Introduction to Microsoft Purview. In this module, we will evaluate whether Microsoft Purview is the right choice for your data discovery and governance needs. In this module, you'll learn how to:

  • Evaluate whether Microsoft Purview is appropriate for data discovery and governance needs.
  • Describe how the features of Microsoft Purview work to provide data discovery and governance.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/intro-to-microsoft-purview/. Agenda:

  • 20:00-20:15 – Opening, Announcements, and More...
  • 20:15-21:15 – Session
  • 21:15-21:30 – Q&A and Open Discussion
Azure Data Engineer - Part 45: Introduction to Microsoft Purview

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 44: Visualize Real-Time Data with Azure Stream Analytics and Power BI. In this module, we will learn how we can create real-time data dashboards by combining the stream processing capabilities of Azure Stream Analytics and the data visualization capabilities of Microsoft Power BI. In this module, you'll learn how to:

  • Configure a Stream Analytics output for Power BI.
  • Use a Stream Analytics query to write data to Power BI.
  • Create a real-time data visualization in Power BI.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/visualize-real-time-data-azure-stream-analytics-power-bi/. Agenda:

  • 20:00-20:15 – Opening, Announcements, and More...
  • 20:15-21:15 – Session
  • 21:15-21:30 – Q&A and Open Discussion
Azure Data Engineer - Part 44: Visualize Data with Stream Analytics and Power BI

Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 43: Ingest Streaming Data Using Azure Stream Analytics and Azure Synapse Analytics. In this module, we will learn how Azure Stream Analytics provides a real-time data processing engine that you can use to ingest streaming event data into Azure Synapse Analytics for further analysis and reporting. In this module, you'll learn how to:

  • Describe common stream ingestion scenarios for Azure Synapse Analytics.
  • Configure inputs and outputs for an Azure Stream Analytics job.
  • Define a query to ingest real-time data into Azure Synapse Analytics.
  • Run a job to ingest real-time data and consume that data in Azure Synapse Analytics.

Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/ingest-streaming-data-use-azure-stream-analytics-synapse/. Agenda:

  • 20:00-20:15 – Opening, Announcements, and More...
  • 20:15-21:15 – Session
  • 21:15-21:30 – Q&A and Open Discussion
Azure Data Engineer - Part 43: Ingest Streaming Data with Azure Stream Analytics
Kevin Kho – core contributor @ Fugue , Tobias Macey – host

Summary Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
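As a hedged sketch of the idea (based on Fugue's documented transform() entry point; exact parameters can vary by version), the same pandas function can be executed locally or handed to a Dask or Spark engine without rewriting it:

```python
# Sketch: one function, multiple execution engines via fugue.transform().
import pandas as pd
from fugue import transform

def add_one(df: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas logic, written once.
    df["b"] = df["a"] + 1
    return df

pdf = pd.DataFrame({"a": [1, 2, 3]})

# engine=None runs on pandas; passing engine="dask" or a SparkSession would run
# the same function distributed (assuming those backends are installed).
result = transform(pdf, add_one, schema="*, b:int", engine=None)
print(result)
```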

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.

Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even JavaScript-heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high-quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.

Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Fugue is and the story behind it?
  • What are the core goals of the Fugue project?
  • Who are the target users for Fugue and how does that influence the feature priorities and API design?
  • How does Fugue compare to projects such as Modin, etc. for abst

AI/ML API BigEye Data Engineering Data Management GitHub JavaScript Kubernetes Looker Modern Data Stack Pandas Python Snowflake Spark SQL
Data Engineering Podcast