talk-data.com
Activities & events
| Title & Speakers | Event |
|---|---|
|
Parallel Processing with Python
2026-01-19 · 18:00
Modern software often needs to do many things at the same time to run faster and scale better. This includes data processing, web services, and machine learning workloads. Understanding parallel and concurrent execution is now an important skill for Python developers. This session gives a clear and practical introduction to parallel processing in Python. It focuses on the main ideas and shows when and how to use different approaches correctly.
Who is this for? Students, developers, and anyone who wants to understand how Python programs can run faster by doing work in parallel. This session is useful if you want to speed up Python programs, understand the difference between threads and processes, and build more efficient and scalable applications.
Who is leading the session? The session is led by Dr. Stelios Sotiriadis, CEO of Warestack and Associate Professor and MSc Programme Director at Birkbeck, University of London. He works in distributed systems, cloud computing, operating systems, and Python-based data processing. He holds a PhD from the University of Derby, completed a postdoctoral fellowship at the University of Toronto, and has worked with Huawei, IBM, Autodesk, and several startups. Since 2018 he has been teaching at Birkbeck, and he founded Warestack in 2021.
What we will cover: what concurrency and parallelism mean, threads vs processes in Python, the Global Interpreter Lock explained simply, using threading for I/O-heavy tasks, using multiprocessing for CPU-heavy tasks, basic use of concurrent.futures, common problems like race conditions, and when parallelism is not the right choice.
Requirements: A laptop with Python installed (Windows, macOS, or Linux), Visual Studio Code, and Python pip. Lab computers can be used if needed.
Format: This is a hands-on introduction with examples and short exercises: a 1.5-hour live session with short theory explanations, live coding, and guided exercises. The session runs in person, with streaming available for remote participants.
Prerequisites: Basic to intermediate Python knowledge, including functions, loops, and basic data structures. |
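As a standalone sketch of the session's central distinction (threads for I/O-heavy work, processes for CPU-heavy work) using the standard-library concurrent.futures module; the URL and the toy prime-counting workload below are placeholders, not part of the session materials.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from urllib.request import urlopen

# Hypothetical I/O-bound task: fetching a few pages (the URL is a placeholder).
URLS = ["https://example.com"] * 4

def fetch(url: str) -> int:
    with urlopen(url) as resp:   # waits on the network; the GIL is released while waiting
        return len(resp.read())

# Toy CPU-bound task: a deliberately heavy computation.
def count_primes(limit: int) -> int:
    return sum(1 for n in range(2, limit)
               if all(n % d for d in range(2, int(n ** 0.5) + 1)))

if __name__ == "__main__":
    # Threads help when tasks spend most of their time waiting on I/O.
    with ThreadPoolExecutor(max_workers=4) as pool:
        sizes = list(pool.map(fetch, URLS))
    print("page sizes:", sizes)

    # Processes sidestep the GIL, so CPU-bound work runs truly in parallel.
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(count_primes, [50_000] * 4))
    print("prime counts:", counts, f"({time.perf_counter() - start:.1f}s)")
```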
Parallel Processing with Python
|
|
[Notes] How to Build a Portfolio That Reflects Your Real Skills
2025-12-28 · 18:00
These are the notes of the previous "How to Build a Portfolio That Reflects Your Real Skills" event: Properties of an ideal portfolio repository:
📌 Backend & Frontend Portfolio Project Ideas
☕ Junior Java Backend Developer (Spring Boot)
1. Shop Manager Application: A monolithic Spring Boot app designed with microservice-style boundaries. Features
Engineering Focus
2. Parallel Data Processing Engine: Backend service for processing large datasets efficiently. Features
Demonstrates
3. Distributed Task Queue System: Simple async job processing system. Features
Demonstrates
4. Rate Limiting & Load Control Service: Standalone service that protects APIs from abuse. Features
Demonstrates
5. Search & Indexing Backend: Document or record search service. Features
Demonstrates
6. Distributed Configuration & Feature Flag Service: Centralized config service for other apps. Features
Demonstrates
🐹 Mid-Level Go Backend Developer (Non-Kubernetes)
1. High-Throughput Event Processing Pipeline: Multi-stage concurrent pipeline. Features
2. Distributed Job Scheduler & Worker System: Async job execution platform. Features
3. In-Memory Caching Service: Redis-like cache written from scratch. Features
4. Rate Limiting & Traffic Shaping Gateway: Reverse-proxy-style rate limiter. Features
5. Log Aggregation & Query Engine: Incrementally built system. Step-by-step
🐍 Mid-Level Python Backend Developer
1. Asynchronous Task Processing System: Async job execution platform. Features
2. Event-Driven Data Pipeline: Streaming data processing service. Features
3. Distributed Rate Limiting Service: API protection service. Steps
4. Search & Indexing Backend: Search system for logs or documents. Features
5. Configuration & Feature Flag Service: Shared configuration backend. Steps
🟦 Mid-Level TypeScript Backend Developer
1. Asynchronous Job Processing System: Queue-based task execution. Features
2. Real-Time Chat / Notification Service: WebSocket-based system. Features
3. Rate Limiting & API Gateway: API gateway with protections. Features
4. Search & Filtering Engine: Search backend for products, logs, or articles. Features
5. Feature Flag & Configuration Service: Centralized config management. Features
🟨 Mid-Level Node.js Backend Developer
1. Async Task Queue System: Background job processor. Features
2. Real-Time Chat / Notification Service: Socket-based system. Features
3. Rate Limiting & API Gateway: Traffic control service. Features
4. Search & Indexing Backend: Indexing & querying service.
5. Feature Flag / Configuration Service: Shared backend for app configs.
⚛️ Mid-Level Frontend Developer (React / Next.js)
1. Dynamic Analytics Dashboard: Interactive data visualization app. Features
2. E-Commerce Store: Full shopping experience. Features
3. Real-Time Chat / Collaboration App: Live multi-user UI. Features
4. CMS / Blogging Platform: SEO-focused content app. Features
5. Personalized Analytics / Recommendation UI: Data-heavy frontend. Features
6. AI Chatbot App — “My House Plant Advisor”: LLM-powered assistant with production-quality UX. Core Features
Advanced Features
✅ Final Advice: You do NOT need to build everything. Instead, pick 1–2 strong projects per role and focus on depth:
📌 Portfolio Quality Signals (Very Important)
🎯 Why This Helps in Interviews: Working on serious projects gives you:
🎥 Demo & Documentation Best Practices
🤝 Open Source & Personal Projects (Interview Signal): Always mention that you have contributed to Open Source or built personal projects.
|
[Notes] How to Build a Portfolio That Reflects Your Real Skills
|
|
The True Costs of Legacy Systems: Technical Debt, Risk, and Exit Strategies
2025-10-18 · 22:35
Kate Shaw
– Senior Product Manager for Data and SLIM
@ SnapLogic
,
Tobias Macey
– host
Summary
In this episode Kate Shaw, Senior Product Manager for Data and SLIM at SnapLogic, talks about the hidden and compounding costs of maintaining legacy systems—and practical strategies for modernization. She unpacks how “legacy” is less about age and more about when a system becomes a risk: blocking innovation, consuming excess IT time, and creating opportunity costs. Kate explores technical debt, vendor lock-in, lost context from employee turnover, and the slippery notion of “if it ain’t broke,” especially when data correctness and lineage are unclear. She digs into governance, observability, and data quality as foundations for trustworthy analytics and AI, and why exit strategies for system retirement should be planned from day one. The discussion covers composable architectures to avoid monoliths and big-bang migrations, how to bridge valuable systems into AI initiatives without lock-in, and why clear success criteria matter for AI projects. Kate shares lessons from the field on discovery, documentation gaps, parallel run strategies, and using integration as the connective tissue to unlock data for modern, cloud-native and AI-enabled use cases. She closes with guidance on planning migrations, defining measurable outcomes, ensuring lineage and compliance, and building for swap-ability so teams can evolve systems incrementally instead of living with a “bowl of spaghetti.”
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Kate Shaw about the true costs of maintaining legacy systems.
Interview
Introduction
How did you get involved in the area of data management?
What are your criteria for when a given system or service transitions to being "legacy"?
In order for any service to survive long enough to become "legacy" it must be serving its purpose and providing value.
What are the common factors that prompt teams to deprecate or migrate systems?
What are the sources of monetary cost related to maintaining legacy systems while they remain operational?
Beyond monetary cost, economics also has a concept of "opportunity cost". What are some of the ways that manifests in data teams who are maintaining or migrating from legacy systems?
How does that loss of productivity impact the broader organization?
How does the process of migration contribute to issues around data accuracy, reliability, etc., as well as contributing to potential compromises of security and compliance?
Once a system has been replaced, it needs to be retired. What are some of the costs associated with removing a system from service?
What are the most interesting, innovative, or unexpected ways that you have seen teams address the costs of legacy systems and their retirement?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on legacy systems migration?
When is deprecation/migration the wrong choice?
How have evolutionary architecture patterns helped to mitigate the costs of system retirement?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
SnapLogic
SLIM == SnapLogic Intelligent Modernizer
Opportunity Cost
Sunk Cost Fallacy
Data Governance
Evolutionary Architecture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA |
Data Engineering Podcast |
|
Parallel processing using CRDTs
2025-10-01 · 14:00
Beyond embarrassingly parallel processing problems, data must be shared between workers for them to do something useful. This can be done by: - sharing memory between threads, with the issue of preventing access to shared data to avoid race conditions. - copying memory to subprocesses, with the challenge of synchronizing data whenever it is mutated. In Python, using threads is not an option because of the GIL (global interpreter lock), which prevents true parallelism. This might change in the future with the removal of the GIL, but usual problems with multithreading will appear, such as using locks and managing their complexity. Subprocesses don't suffer from the GIL, but usually need to access a database for sharing data, which is often too slow. Algorithms such as HAMT (hash array mapped trie) have been used to efficiently and safely share data stored in immutable data structures, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data type) can be used for the same purpose. |
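The talk does not prescribe a specific CRDT, but a grow-only counter (G-Counter) is one of the simplest examples of the idea: each worker mutates only its own slot, and a commutative, idempotent merge reconciles replicas without locks. A minimal sketch:

```python
from dataclasses import dataclass, field

@dataclass
class GCounter:
    # One slot per worker; each worker only ever increments its own slot.
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, worker_id: str, n: int = 1) -> None:
        self.counts[worker_id] = self.counts.get(worker_id, 0) + n

    def merge(self, other: "GCounter") -> "GCounter":
        # Merge is commutative, associative, and idempotent, so replicas can
        # be combined in any order, any number of times, without locks.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter({k: max(self.counts.get(k, 0), other.counts.get(k, 0))
                         for k in keys})

    @property
    def value(self) -> int:
        return sum(self.counts.values())

# Two workers mutate independent replicas, then reconcile.
a, b = GCounter(), GCounter()
a.increment("worker-a", 3)
b.increment("worker-b", 5)
assert a.merge(b).value == b.merge(a).value == 8
```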
PyData Paris 2025 |
|
Cubed: Scalable array processing with bounded-memory in Python
2025-07-09 · 20:15
Tom White
– author
,
Tom Nicholas
Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly-parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion on both a local machine and on a range of Cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array geoscience workloads. |
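As a rough illustration of the pattern Cubed automates (embarrassingly parallel, bounded-memory steps with Zarr as persistent intermediate storage), here is a hand-rolled sketch using plain zarr and NumPy. This is deliberately not Cubed's own API; the array sizes, store names, and the sqrt workload are made up.

```python
import numpy as np
import zarr
from concurrent.futures import ProcessPoolExecutor

N, CHUNK = 1_000_000, 100_000   # each task only ever holds one chunk in memory

def process_chunk(start: int) -> None:
    # Workers reopen the stores by path, read one chunk, write one chunk.
    src = zarr.open("source.zarr", mode="r")
    dst = zarr.open("result.zarr", mode="r+")
    dst[start:start + CHUNK] = np.sqrt(src[start:start + CHUNK]) * 2.0

if __name__ == "__main__":
    # Create the source and result arrays on disk, chunk-aligned with the tasks.
    src = zarr.open("source.zarr", mode="w", shape=(N,), chunks=(CHUNK,), dtype="f8")
    src[:] = np.random.default_rng(0).random(N)
    zarr.open("result.zarr", mode="w", shape=(N,), chunks=(CHUNK,), dtype="f8")

    # Every chunk is independent, so the steps are embarrassingly parallel.
    with ProcessPoolExecutor() as pool:
        list(pool.map(process_chunk, range(0, N, CHUNK)))
    print(zarr.open("result.zarr", mode="r")[:5])
```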
|
|
Breaking Out of the Loop: Refactoring Legacy Software with Polars
2025-07-09 · 18:25
Data manipulation libraries like Polars allow us to analyze and process data much faster than with native Python, but that’s only true if you know how to use them properly. When the team working on NCEI's Global Summary of the Month first integrated Polars, they found it was actually slower than the original Java version. In this talk, we'll discuss how our team learned how to think about computing problems like spreadsheet programmers, increasing our products’ processing speed by over 80%. We’ll share tips for rewriting legacy code to take advantage of parallel processing. We’ll also cover how we created custom, pre-compiled functions with Numba when the business requirements were too complex for native Polars expressions. |
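As a generic illustration of the "think like a spreadsheet programmer" shift described above (this is not NCEI's code, and the column names are invented), the whole computation is expressed as Polars expressions rather than a row-by-row Python loop, letting the engine parallelise it:

```python
import polars as pl

# Toy data standing in for station observations (column names are made up).
df = pl.DataFrame({
    "station": ["A", "A", "B", "B"],
    "tmax":    [30.1, 28.4, 25.0, 26.7],
    "tmin":    [20.3, 19.8, 15.2, 16.1],
    "precip":  [0.0, 1.2, 0.4, 0.0],
})

# Describe the computation declaratively; Polars evaluates the expressions
# in parallel instead of iterating rows in the Python interpreter.
summary = (
    df.with_columns(((pl.col("tmax") + pl.col("tmin")) / 2).alias("tavg"))
      .group_by("station")
      .agg(
          pl.col("tavg").mean().alias("mean_temp"),
          pl.col("precip").sum().alias("total_precip"),
      )
)
print(summary)
```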
|
|
Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug)
2025-07-08 · 15:00
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. In this sense, cloud-optimized data is a nice fit for data-parallel jobs using serverless. FaaS provides a data-driven, scalable, and cost-efficient experience, with practically no management burden. Each serverless function reads and processes a small portion of the cloud-optimized dataset in parallel, directly from object storage, which significantly increases the speedup. In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit that is specially designed to process data from cloud object storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management for scientific settings such as genomics, metabolomics, or geospatial data. We will show different data processing pipelines in the cloud that demonstrate the benefits of cloud-optimized data management. |
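A minimal sketch of that map-style pattern using Lithops' FunctionExecutor; the chunk list is a stand-in for real byte ranges or partitions of a cloud-optimized object, and a configured Lithops storage/compute backend is assumed.

```python
import lithops

# Pretend each task owns one partition of a cloud-optimized object; here the
# "partitions" are just chunk ids rather than real byte ranges.
chunks = list(range(8))

def process_chunk(chunk_id):
    # A real function would fetch only its slice of the object from cloud
    # storage and return a partial result.
    return chunk_id * chunk_id

if __name__ == "__main__":
    fexec = lithops.FunctionExecutor()   # one serverless function per chunk
    fexec.map(process_chunk, chunks)     # fan out the partitions
    print(fexec.get_result())            # gather the partial results
```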
|
|
Scaling AI workloads with Ray & Airflow
2025-06-08 · 15:15
Ray is an open-source framework for scaling Python applications, particularly machine learning and AI workloads. It provides a layer for parallel processing and distributed computing. Many large language models (LLMs), including OpenAI's GPT models, are trained using Ray. On the other hand, Apache Airflow is a well-established data orchestration framework downloaded more than 20 million times monthly. This talk presents the Airflow Ray provider package, which allows users to interact with Ray from an Airflow workflow. In this talk, I'll show how to use the package to create Ray clusters and how Airflow can trigger Ray pipelines in those clusters. |
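For a feel of the parallelism layer the talk builds on, here is a minimal Ray sketch; the Airflow provider wiring itself is not shown, and the workload is a toy.

```python
import ray

ray.init()  # starts a local Ray cluster just for this example

@ray.remote
def score_batch(batch_id: int) -> float:
    # Stand-in for an ML/AI task (feature extraction, inference, ...).
    return sum(i * 0.001 for i in range(batch_id * 1000))

# Fan the work out across the cluster, then gather the results.
futures = [score_batch.remote(b) for b in range(8)]
print(ray.get(futures))

ray.shutdown()
```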
PyData London 2025 |
|
PyData @Arpeely
2024-12-11 · 16:00
Join Us for Another PyData Meetup @Arpeely! Get ready for insightful sessions, networking with great people, and, of course, beer and pizza! A big thanks to Arpeely for hosting us! Wednesday, December 11th, 18:00-21:00.
Event Highlights
- Welcome words from our host – Arpeely
- Ronny Ahituv: Supercharging CTR (Click-Through Rate) with Plug-and-Play AI Capabilities. Explore a fully AI-driven pipeline designed to boost click-through rates (CTR) using adaptable, off-the-shelf tools. The pipeline leverages: Generative AI and genetic algorithms for creating diverse ad creatives. Contextual multi-armed bandits to select the best creatives based on real-time data, powered by built-in regressors. ARIMA models to capture and adjust for seasonal trends. Multimodal embeddings to efficiently handle and cluster high-cardinality features. This session will demonstrate how integrating readily available AI solutions can help achieve more effective, streamlined CTR optimization.
- Yuval Feinstein: Georgia on my Mind: NLP Meets Social Network Analysis for Exploring New Domains. How do you choose your focus when entering a new domain? I suggest a method combining social network analysis (SNA) with natural language processing (NLP). We'll utilize the networkx, spaCy and wikipedia Python packages to get from search terms to insights.
- Mike Erlihson: State-Space Models and Deep Learning: Is there a new revolution on our doorstep? State-space models (SSMs) have advanced from dynamic system tools to deep learning architectures like Mamba (S6), which combine parallel training with efficient inference for long sequences. This lecture covers SSMs' evolution and impact on sequential data modeling.
Space is limited – RSVP now to secure your spot! *This meetup will be held in English. |
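As a flavor of the SNA side of Yuval's talk (not the speaker's code; the terms and edges below are invented), networkx can rank concepts in a small co-occurrence graph by centrality to suggest where to focus in a new domain:

```python
import networkx as nx

G = nx.Graph()
# Hypothetical co-occurrence edges between terms extracted from a new domain.
G.add_edges_from([
    ("georgia", "tbilisi"), ("georgia", "caucasus"), ("georgia", "wine"),
    ("tbilisi", "caucasus"), ("wine", "qvevri"), ("caucasus", "black sea"),
])

# The most central terms hint at the core concepts worth exploring first.
centrality = nx.degree_centrality(G)
for term, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term}: {score:.2f}")
```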
PyData @Arpeely
|
|
PyData Southampton - 11th Meetup
2024-11-26 · 19:00
Venue: Carnival House, 100 Harbour Parade, Southampton, SO15 1ST. 📢 Want to speak? 📢 Submit your talk proposal. Please note:
If your RSVP status says "You're going" you will be able to get in. No further confirmation required. You will NOT need to show your RSVP confirmation when signing in. If you can no longer make it, please unRSVP as soon as you know so we can assign your place to someone on the waiting list. *** Code of Conduct: This event follows the NumFOCUS Code of Conduct, please familiarise yourself with it before the event. Please get in touch with the organisers with any questions or concerns regarding the Code of Conduct. *** There will be pizza & drinks, generously provided by our host, Carnival UK. *** Mastering Data Flow: Prefect Pipelines Workshop - Adam Hill & Chris Frohmaier Join us for an engaging workshop where we'll dive deep into the world of data engineering with Prefect 3. Throughout the session, participants will explore the following key topics:
Building Data Pipelines:
Advanced Techniques and Best Practices:
By the end of the workshop, attendees will have gained a comprehensive understanding of Prefect 3 and its capabilities, empowering them to design, execute, and optimise data pipelines efficiently in real-world scenarios. We invite you to join us on this exciting journey of mastering data flows with Prefect! Instructions to prepare in advance Workshop Materials and Requirements: In advance of the workshop please visit the github repo here: https://github.com/Cadarn/PyData-Prefect-Workshop. Clone a copy of the repository and follow the setup instructions in the README file including:
Please follow the instructions in advance of attending the workshop. Please note this is a practical session and you will need to bring your own laptop. We recommend you bring it fully charged, if you can, as there may not be enough plug sockets for everyone to use at the same time. Logistics: Doors open at 6.30 pm, talks start at 7 pm. For those who wish to continue networking and chatting we will move to a nearby pub/bar for drinks from 9 pm. Please unRSVP in good time if you realise you can't make it. We're limited by building security on the number of attendees, so please free up your place for your fellow community members! Follow @pydatasoton (https://twitter.com/pydatasoton) for updates and early announcements. We are also on Instagram/Threads as @pydatasoton; and find us on LinkedIn. |
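For attendees who want a feel for the building blocks before the Prefect workshop above, here is a minimal Prefect 3 flow sketch; it is not taken from the workshop repository, and the tasks are toys.

```python
from prefect import flow, task

@task(retries=2, retry_delay_seconds=5)
def extract() -> list[int]:
    # Stand-in for pulling data from an API or warehouse.
    return list(range(10))

@task
def transform(values: list[int]) -> int:
    # Stand-in for a real transformation step.
    return sum(v * v for v in values)

@flow(log_prints=True)
def mini_pipeline():
    values = extract()
    total = transform(values)
    print(f"sum of squares: {total}")

if __name__ == "__main__":
    mini_pipeline()   # runs locally; Prefect tracks task state and retries
```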
PyData Southampton - 11th Meetup
|
|
Azure Data Engineer - Part 50: Summary & Exam Preparation
2023-08-09 · 17:00
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 50: Summary & Exam Preparation. It's the last session in the series. In this session, we will review everything we have learned throughout this series. We will also look at some examples of questions from the exam (DP-203) and solve them together. Agenda:
|
Azure Data Engineer - Part 50: Summary & Exam Preparation
|
|
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 49: Integrate Microsoft Purview and Azure Synapse Analytics. In this module, we will learn how to integrate Microsoft Purview with Azure Synapse Analytics to improve data discoverability and lineage tracking. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/integrate-microsoft-purview-azure-synapse-analytics/. Agenda:
|
Azure Data Engineer - Part 49: Integrate Microsoft Purview and Synapse Analytics
|
|
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 48: Manage Power BI Assets by Using Microsoft Purview. In this module, we will learn how to improve data governance and asset discovery using Power BI and Microsoft Purview integration. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/manage-power-bi-artifacts-use-microsoft-purview/. Agenda:
|
Azure Data Engineer - Part 48: Manage Power BI Assets by Using Microsoft Purview
|
|
Scaling Python with Dask
2023-07-26
Mika Kimmins
– author
,
Holden Karau
– author
Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn. Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA. With this book, you'll learn:
- What Dask is, where you can use it, and how it compares with other tools
- How to use Dask for batch data parallel processing
- Key distributed system concepts for working with Dask
- Methods for using Dask with higher-level APIs and building blocks
- How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
- How to use Dask with GPUs |
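A small sketch in the spirit of the book (the data is a toy frame, not an example from the text): pandas-style operations build a lazy task graph that Dask then executes in parallel.

```python
import pandas as pd
import dask.dataframe as dd

# Build a toy pandas frame, then split it into partitions Dask can process in parallel.
pdf = pd.DataFrame({
    "day":   ["mon", "mon", "tue", "tue", "wed", "wed"],
    "value": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
})
ddf = dd.from_pandas(pdf, npartitions=3)

# Familiar pandas-style operations build a lazy task graph...
daily_mean = ddf.groupby("day")["value"].mean()

# ...and .compute() runs it in parallel (locally here; on a cluster with a
# distributed scheduler for heavier workloads).
print(daily_mean.compute())
```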
O'Reilly Data Science Books
|
|
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 47: Catalog Data Artifacts by Using Microsoft Purview. In this module, we will learn how to register, scan, catalog, and view data assets and their relevant details in Microsoft Purview. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/catalog-data-artifacts-use-microsoft-purview/. Agenda:
|
Azure Data Engineer - Part 47: Catalog Data Artifacts by Using Microsoft Purview
|
|
Azure Data Engineer - Part 46: Discover Trusted Data Using Microsoft Purview
2023-07-12 · 07:00
Join us in this weekly series and learn how to become an Azure Data Engineer and integrate, transform, and consolidate data from various structured and unstructured data systems into structures suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, which will prepare you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 46: Discover Trusted Data Using Microsoft Purview. In this module, we will use Microsoft Purview Studio to discover trusted organizational assets for reporting. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/discover-trusted-data-use-azure-purview/. Agenda:
|
Azure Data Engineer - Part 46: Discover Trusted Data Using Microsoft Purview
|
|
Azure Data Engineer - Part 45: Introduction to Microsoft Purview
2023-07-05 · 17:00
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 45: Introduction to Microsoft Purview. In this module, we will evaluate whether Microsoft Purview is the right choice for your data discovery and governance needs. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/intro-to-microsoft-purview/. Agenda:
|
Azure Data Engineer - Part 45: Introduction to Microsoft Purview
|
|
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 44: Visualize Real-Time Data with Azure Stream Analytics and Power BI. In this module, we will learn how we can create real-time data dashboards by combining the stream processing capabilities of Azure Stream Analytics and the data visualization capabilities of Microsoft Power BI. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/visualize-real-time-data-azure-stream-analytics-power-bi/. Agenda:
|
Azure Data Engineer - Part 44: Visualize Data with Stream Analytics and Power BI
|
|
Join us in this weekly series and learn how to become an Azure Data Engineer, and how to integrate, transform, and consolidate data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. Responsibilities for this role include helping stakeholders understand the data through exploration, building, and maintaining secure and compliant data processing pipelines using different tools and techniques. You will learn how to use various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis. An Azure Data Engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure Data Engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs. Each week, we will cover a different module towards the complete learning path, preparing you for the Azure Data Engineer Associate certification (https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/) as well as for the real world. A candidate for this certification must have a solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns. Specifically, you should have the knowledge equivalent to the Azure Data Fundamentals certification. All sessions are recorded, and the entire series can be found here: https://bit.ly/Azure-Data-Engineer-certificate. This is part 43: Ingest Streaming Data Using Azure Stream Analytics and Azure Synapse Analytics. In this module, we will learn how Azure Stream Analytics provides a real-time data processing engine that you can use to ingest streaming event data into Azure Synapse Analytics for further analysis and reporting. In this module, you'll learn how to:
Link to the relevant module in Microsoft Learn: https://learn.microsoft.com/en-us/training/modules/ingest-streaming-data-use-azure-stream-analytics-synapse/. Agenda:
|
Azure Data Engineer - Part 43: Ingest Streaming Data with Azure Stream Analytics
|
|
Build Your Python Data Processing Your Way And Run It Anywhere With Fugue
2022-02-21 · 03:00
Kevin Kho
– core contributor
@ Fugue
,
Tobias Macey
– host
Summary
Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even javascript heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Fugue is and the story behind it?
What are the core goals of the Fugue project?
Who are the target users for Fugue and how does that influence the feature priorities and API design?
How does Fugue compare to projects such as Modin, etc. for abst |
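The core idea discussed in the episode (write the logic once against pandas, then choose the execution engine at call time) can be sketched with Fugue's transform() entry point; treat the exact schema string and engine names below as assumptions rather than a definitive reference.

```python
import pandas as pd
from fugue import transform

def add_double(df: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas logic, written once, with no engine-specific code.
    return df.assign(doubled=df["value"] * 2)

data = pd.DataFrame({"value": [1, 2, 3]})

# Run locally on the default pandas engine.
local = transform(data, add_double, schema="*, doubled:long")
print(local)

# The same call could target a distributed engine without rewriting add_double,
# e.g. transform(data, add_double, schema="*, doubled:long", engine="dask").
```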
Data Engineering Podcast |