
Topic: DataOps

Tags: data_management, agile, devops

[Activity trend chart: peak of 12 activities per quarter, 2020-Q1 through 2026-Q1]

Activities

11 activities · Newest first

Skrub: machine learning for dataframes

Skrub is an open source package that simplifies machine learning with dataframes by providing tools to explore, prepare, and feature-engineer dataframes so they can be integrated into scikit-learn pipelines. Skrub's DataOps let you build extensive, multi-table wrangling plans, explore hyperparameter spaces, and export the resulting objects for deployment. The talk showcases, through code examples and demonstrations, use cases where skrub can simplify the data scientist's job from data preparation to deployment.
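As a taste of what the talk demonstrates, here is a minimal sketch of skrub feeding a raw dataframe into a scikit-learn pipeline (assumes skrub and scikit-learn are installed; the dataframe and labels are hypothetical toy data):

```python
# Minimal sketch: a messy dataframe fed directly to a scikit-learn
# estimator via skrub. Column names and values are invented.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

df = pd.DataFrame({
    "job_title": ["Data Scientist", "data scientist sr.", "Accountant", "accounting clerk"],
    "department": ["R&D", "R&D", "Finance", "Finance"],
})
y = [1, 1, 0, 0]

# TableVectorizer picks a sensible encoder per column (one-hot for
# low-cardinality categories, string encoders for messier text), so the
# raw dataframe can feed a standard estimator without manual wrangling.
model = make_pipeline(TableVectorizer(), HistGradientBoostingClassifier())
model.fit(df, y)
print(model.predict(df))
```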

Sponsored by: Astronomer | Scaling Data Teams for the Future

The role of data teams and data engineers is evolving. No longer just pipeline builders or dashboard creators, today's data teams must drive business strategy, enable automation, and scale with growing demands. Best practices from the DevOps movement in software engineering (agile development, CI/CD, and infrastructure as code) are gradually making their way into data engineering. We believe these changes have led to the rise of DataOps and a new wave of best practices that will transform the discipline of data engineering. But how do you transform a reactive team into a proactive force for innovation? We'll explore the key principles for building a resilient, high-impact data team, from structuring for collaboration to testing, automation, and modern orchestration tools. Whether you're leading a team or looking to future-proof your career, you'll walk away with actionable insights on how to stay ahead in the rapidly changing data landscape.
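To make the CI/CD-for-data idea concrete, here is a minimal sketch of a daily pipeline in Apache Airflow, the orchestrator behind Astronomer's platform (assumes Airflow 2.x; the task names and logic are hypothetical placeholders):

```python
# A sketch of a DataOps-style daily pipeline using Airflow's TaskFlow
# API. Extract/transform/load bodies are toy stand-ins.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_metrics():
    @task
    def extract() -> list:
        # In practice this would pull from an API or warehouse.
        return [{"user_id": 1, "events": 42}]

    @task
    def transform(rows: list) -> list:
        return [{**r, "active": r["events"] > 0} for r in rows]

    @task
    def load(rows: list) -> None:
        print(f"loading {len(rows)} rows")

    load(transform(extract()))


daily_metrics()
```

Because the pipeline is plain versioned code, it can be reviewed, tested, and deployed through the same CI/CD practices the talk describes.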

Sonal Goyal: Open Source Entity Resolution - Needs and Challenges

🌟 Session Overview 🌟

Session Name: Open Source Entity Resolution - Needs and Challenges

Speaker: Sonal Goyal

Session Description: Real-world data contains multiple records belonging to the same customer. These records can live in one system or many, and they vary across fields, which makes them hard to combine, especially as data volumes grow. This hurts customer analytics: establishing lifetime value, loyalty programs, or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce the right results when multiple copies of the same customer lurk in the data. No warehouse can live up to its promise if its dimension tables have duplicates.
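As a toy illustration of the problem (not Zingg's API, which the session introduces below), consider records with field variations that naive equality checks would miss:

```python
# Toy entity-resolution sketch: the "same" customer appears twice with
# name and email variations. Records and threshold are invented.
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Jon A. Smith", "email": "jon.smith@acme.com"},
    {"id": 2, "name": "John Smith", "email": "jsmith@acme.com"},
    {"id": 3, "name": "Mary Jones", "email": "mary.jones@acme.com"},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Naive pairwise comparison; real systems use blocking and learned
# match rules to scale beyond O(n^2) comparisons.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i]["name"], records[j]["name"])
        if score > 0.8:
            print(f"likely match: {records[i]['id']} ~ {records[j]['id']} ({score:.2f})")
```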

With a modern data stack and DataOps, we have established patterns for the E and L in ELT for building data warehouses, data lakes, and delta lakes. However, the T, getting data ready for analytics, still needs a lot of effort. Modern tools like dbt are actively and successfully addressing this. What is also needed is a quick and scalable way to resolve entities, building a single source of truth for core business entities after extraction and before or after loading.

This session covers the problem of entity resolution, its practical applications, and the challenges in building an entity resolution system. It also covers Zingg, an open source framework for building entity resolution systems (https://github.com/zinggAI/zingg/).

🚀 About Big Data and RPA 2024 🚀

Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨

📅 Yearly Conferences: Curious how big data and RPA have evolved? Check out our archive of past Big Data & RPA sessions and watch the strategies and technologies evolve in our videos! 🚀

🔗 Find Other Years' Videos:
2023 Big Data Conference Europe: https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g
2022 Big Data Conference Europe Online: https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT
2021 Big Data Conference Europe Online: https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP

💡 Stay Connected & Updated 💡

Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!

🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/
👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/
🐦 Twitter: @BigDataConfEU, @europe_rpa
🔗 LinkedIn: https://www.linkedin.com/company/73234449/admin/dashboard/, https://www.linkedin.com/company/75464753/admin/dashboard/
🎥 YouTube: http://www.youtube.com/@DATAMINERLT

Sascha Giese: DataOps: Much More Than DevOps for Data

Explore the fusion of DevOps and DataOps with Sascha Giese, SolarWinds Global Tech Evangelist. Learn to optimize data efficiency, mitigate risks, and transform your database management for reliability and scalability. 🚀📊 #DataOps #devops

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Enterprise MDS deployment at scale: dbt & DevOps - Coalesce 2023

Behind any good DataOps within a Modern Data Stack (MDS) architecture is a solid DevOps design! This is particularly pressing when building an MDS solution at scale, as the reliability, quality, and availability of data require a very high degree of process automation while remaining fast, agile, and resilient to change when addressing business needs.

While DevOps in data engineering is nothing new, a broad-spectrum solution that includes the data warehouse, BI, and more has seemed either out of reach, due to overall complexity and cost, or simply overlooked, due to perceived scaling issues often attributed to the challenges of automating CI/CD processes. This has been changing fast, however, with tools such as dbt offering features that enable a very high degree of autonomy in CI/CD with relative ease, including flexible, cutting-edge capabilities such as pre-commit hooks and Slim CI.
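For instance, here is a minimal sketch of a Slim CI invocation using dbt-core's programmatic runner (assumes dbt-core 1.5 or later; the artifact path is a placeholder):

```python
# Hedged Slim CI sketch: build only models changed relative to
# production, deferring unchanged upstream refs to the prod manifest.
from dbt.cli.main import dbtRunner

result = dbtRunner().invoke([
    "build",
    "--select", "state:modified+",  # changed models and their downstreams
    "--defer",                      # resolve unchanged refs against prod
    "--state", "prod-artifacts/",   # manifest.json from the last prod run
])
print("success" if result.success else "failed")
```

Running only the modified subgraph is what keeps CI fast enough to gate every change at enterprise scale.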

In this session, Datatonic covers the challenges around building and deploying enterprise-grade MDS solutions for analytics at scale and how they have used dbt to address them, especially by bringing near-complete autonomy to the CI/CD processes!

Speaker: Ash Sultan, Lead Data Architect, Datatonic

Register for Coalesce at https://coalesce.getdbt.com

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine that analyzes the skills of a combined pool of permanent employees, contractors, part-time employees, and vendors, infers skill gaps and future trends, and recommends priority areas for bridging talent gaps. This ultimately greatly improved the operational efficiency, transparency, commercial model, and talent experience of our client. We leveraged a variety of ML algorithms, such as boosting, neural networks, and NLP transformers, to provide better AI-driven insights.

One inevitable part of developing these models within a typical data science workflow is iteration. MLflow, the end-to-end ML workflow service on Databricks, helped streamline this process by organizing runs into experiments that tracked the data used for training and testing, model artifacts, lineage, and the corresponding results and metrics. MLflow's deployment and monitoring services were also leveraged extensively to check the health of our models using drift detection, bias, and explainability techniques.
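A minimal sketch of that tracking loop, with hypothetical experiment, parameter, and metric names:

```python
# Minimal MLflow tracking sketch; names and values are illustrative.
import mlflow

mlflow.set_experiment("talent-skill-gap")

with mlflow.start_run():
    mlflow.log_param("model", "gradient_boosting")
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", 0.91)
    # Model artifacts, lineage, and data versions are recorded alongside
    # the metrics so iterations stay comparable across experiments.
```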

Our solution, built on the Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were delivered as visually appealing charts and graphs, using Power BI, Plotly, and Matplotlib, that answer the business questions most relevant to clients. We also built our own advanced custom analytics platform on top of the Delta Lake: Delta's ACID guarantees let us build a real-time reporting app that displays consistent, reliable data, with React on the front end and Structured Streaming ingesting data from Delta tables to power live query analytics over real-time ML predictions.
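A hedged sketch of that streaming ingestion, with placeholder table paths (assumes a Databricks or PySpark environment with Delta Lake available):

```python
# Sketch: stream ML predictions out of a Delta table into a reporting
# table. Paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("talent-reporting").getOrCreate()

# Read the Delta table as a stream; ACID guarantees mean readers only
# ever see committed data, which keeps the reporting app consistent.
predictions = spark.readStream.format("delta").load("/delta/ml_predictions")

(predictions.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/reporting")
    .outputMode("append")
    .start("/delta/reporting"))
```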

Talk by: Nitu Nivedita

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How Dashboards as Code Can Help You Develop and Validate Your Analytics | Glean

ABOUT THE TALK: Dashboards sit at the end of a long chain of ever-changing data dependencies. And dashboard development is a very visual process: it is hard to tell whether a dashboard is correct without an end user looking at the rendered result. This all adds up to a development process that can be slow and error-prone.

“DataOps” is a new set of code-based patterns and practices that aim to address these challenges. In this talk, Dan Eisenberg does a deep dive on these approaches and demonstrates some ways to integrate DataOps into the BI development lifecycle at Glean.
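As a generic illustration of the dashboards-as-code idea (not Glean's actual configuration format), a dashboard spec can live in version control and be validated in CI before anyone looks at a rendered chart:

```python
# Hypothetical dashboard-as-code spec with a CI-style structural check.
from dataclasses import dataclass, field

@dataclass
class Chart:
    title: str
    query: str
    viz: str = "line"

@dataclass
class Dashboard:
    name: str
    charts: list = field(default_factory=list)

dash = Dashboard(
    name="Revenue Overview",
    charts=[Chart("Daily revenue", "SELECT day, SUM(amount) FROM orders GROUP BY day")],
)

# A CI check can assert invariants without rendering anything, catching
# a class of errors before a human reviews the visual output.
assert dash.charts, "dashboard must define at least one chart"
assert all(c.query.lower().startswith("select") for c in dash.charts)
```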

ABOUT THE SPEAKER: Dan Eisenberg is the VP of Technology at Glean.io, a platform for data visualization and collaboration. Prior to Glean, he was a Senior Director of Engineering at Flatiron Health, where his teams designed and built systems for abstracting data from unstructured medical records at scale.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

More metadata, less problems

There's a kind of funny (and painful) pattern in the data world: more data, more problems. As companies grow their data, there are more broken dashboards to find, more impacted users to chase, and more cobbling together of data. But what if there were tooling to make the scaling process a little less painful, and a lot more scalable?

Join Marion Pavillet (MUX) and Kevin Hu (Metaplane) as they share stories about scaling data stacks at large and small companies, demonstrate how they used modern tools to improve data lineage and accuracy, and discuss the power of metadata in improving data products over time. Attendees can expect to leave this session with new concepts they can use to frame the problems faced by rapidly growing companies, a blueprint for implementing DataOps, and solutions they can map to their own situations.

Check the slides here: https://docs.google.com/presentation/d/1jL2OfAyPFJl0Dq6evmdwnpK6xrebNxEbjVj1nIcwXsU/

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

More Context, Less Chaos: How Atlan and Unity Catalog Power Column-Level Lineage and Active Metadata

“What does this mean? Who created it? How is it being used? Is it up to date?” Ever fielded these types of questions about your Databricks assets?

Today, context is a huge challenge for data teams. Everyone wants to use your company’s data, but often only a few experts know all of its tribal knowledge and context. The result — they get bombarded with endless questions and requests.

Atlan — the active metadata platform for modern data teams, recently named a Leader in The Forrester Wave: Enterprise Data Catalogs for DataOps — has launched an integration with Databricks Unity Catalog. By connecting to UC’s REST API, Atlan extracts metadata from Databricks clusters and workspaces, generates column-level lineage, and pairs it with metadata from the rest of your data assets to create true end-to-end lineage and visibility across your data stack.
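A hedged sketch of what such a metadata pull looks like against the Unity Catalog REST API (the workspace URL, token, and catalog/schema names are placeholders):

```python
# Sketch: list table and column metadata from Unity Catalog's REST API,
# the kind of extraction a catalog integration performs.
import requests

HOST = "https://<workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

resp = requests.get(
    f"{HOST}/api/2.1/unity-catalog/tables",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"catalog_name": "main", "schema_name": "analytics"},
    timeout=30,
)
resp.raise_for_status()

for table in resp.json().get("tables", []):
    # Column metadata is what lineage tools stitch into end-to-end views.
    print(table["name"], [c["name"] for c in table.get("columns", [])])
```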

In this session, Prukalpa Sankar (Co-Founder at Atlan and a lifelong data practitioner) and Todd Greenstein (Product Manager with Databricks) will do a live product demo to show how Atlan and Databricks work together to power modern data governance, cataloging, and collaboration.


Simplify Global DataOps and MLOps Using Okta’s FIG Automation Library

Think for a moment about an ML pipeline that you have created. Was it tedious to write? Did you have to familiarize yourself with technology outside your normal domain? Did you find many bugs? Did you give up with a “good enough” solution? Even simple ML pipelines are tedious, and with complex ones, even teams that include data engineers and ML engineers still end up with delays and bugs.

Okta's FIG (Feature Infrastructure Generator) simplifies this with a configuration language for data scientists that produces scalable and correct ML pipelines, even highly complex ones. FIG is “just a library” in the sense that you can pip install it. Once installed, FIG will configure your AWS account, creating ETL jobs, workflows, and ML training and scoring jobs. Data scientists then use FIG's configuration language to specify features and model integrations. With a single function call, FIG will run an ML pipeline to generate feature data, train models, and create scoring data. Feature generation is performed in a scalable, efficient, and temporally correct manner, and model training artifacts and scoring are automatically labeled and traced. This greatly simplifies the ML prototyping experience. Once it is time to productionize a model, FIG can use the same configuration to coordinate with Okta's deployment infrastructure to configure production AWS accounts, register build and model artifacts, and set up monitoring.

This talk will show a demo of using FIG in the development of Okta's next-generation security infrastructure. The demo includes a walkthrough of the configuration language and how it is translated into AWS resources during a prototyping session. The demo will also briefly cover how FIG interacts with Okta's deployment system to make productionization seamless.
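As a generic illustration of the configuration-driven idea (not FIG's actual API, which is Okta-internal), features declared as data can be translated mechanically into ordered pipeline steps:

```python
# Hypothetical sketch: declarative feature specs compiled into pipeline
# steps. All names and fields are invented for illustration.
from dataclasses import dataclass

@dataclass
class FeatureSpec:
    name: str     # feature name
    source: str   # upstream table or event stream

SPECS = [
    FeatureSpec("login_rate", "auth_events"),
    FeatureSpec("failed_login_ratio", "auth_events"),
    FeatureSpec("device_count", "device_registry"),
]

def build_pipeline(specs):
    """Turn declarative specs into an ordered list of pipeline steps."""
    sources = sorted({s.source for s in specs})
    steps = [f"extract:{src}" for src in sources]
    steps += [f"compute:{s.name}" for s in specs]
    steps.append("train-and-score:model")
    return steps

print(build_pipeline(SPECS))
```

Because the specification is data rather than code, the same specs can drive a prototyping run locally and a production deployment later, which is the property the talk highlights.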
