talk-data.com

Topic

Big Data

data_processing analytics large_datasets

1217 tagged

Activity Trend

Peak of 28 activities per quarter, 2020-Q1 to 2026-Q1

Activities

1217 activities · Newest first

Navigate the complexities of today's digital and data landscape in our panel discussion, which underscores the essential role of data governance in the era of accelerating mis- and dis-information. As Big Data ceases to be a buzzword and becomes the lifeblood of decision-making, governance is elevated from a regulatory compliance requirement to a differentiator and a beacon of trust. This session goes beyond treating governance as a data hygiene factor and delves into the relationship between governance, value creation, and the elicitation of trust, particularly for decisions steered by AI as we travel into the world of automated risk management and decision-making.

In this flagship Big Data LDN keynote debate, conference chair and leading industry analyst Mike Ferguson welcomes executives from leading software vendors to discuss key topics in data management and analytics. Panellists will debate the impact of Generative AI, the implications of key industry trends, how best to deal with real-world customer challenges, how to build a modern data and analytics (D&A) architecture, how to manage, produce, share and govern data and AI, and the on-the-horizon issues that companies should be planning for today.

Attendees will learn best practices for data and analytics implementation in a modern data-driven enterprise from seasoned executives and an experienced industry analyst in a packed, unscripted, candid discussion.

In today’s fast-evolving big data landscape, having a solid data strategy is crucial. Our session will explore how leaders develop and execute data strategies, and whether these strategies are still essential in 2024. We will cover the key elements of a successful data strategy, the necessity of adapting to technological advancements like AI and evolving data privacy regulations, and provide insights from industry leaders who have effectively navigated these changes. We'll discuss the future trajectory of data strategies, questioning if traditional approaches still hold value or if new paradigms are emerging. This session is perfect for technology leaders, data professionals, and anyone keen on harnessing data for business success.

Join us as we unlock the secrets of data-driven strategies that drive profit, loyalty, and hyper-personalised experiences, with Capgemini and a Women in Data leadership panel.

At this year’s Big Data London, Women in Data & Capgemini are back with another must-see panel, featuring a diverse and engaging group of female data leaders and their allies from across the Retail & CPG worlds. Last year’s session was one of the most oversubscribed events of the day, with standing room only, thanks to its thought-provoking and honest discussions. This year’s panel promises the same dynamic as they tackle the conundrum of balancing margin focus with rewarding customer loyalty and how data plays a key role. 

The panellists, as well as sharing their own career journeys and experience, will explore how they've approached bold strategies that move beyond immediate profits to emphasise the long-term value of customer data and loyalty. They'll explore how data, analytics & AI can uncover deep insights into customer behaviours and preferences, enabling brands to create personalised experiences and loyalty programs that boost engagement and build lasting trust.

The discussion will highlight the importance of seeing customer data as a strategic asset. By investing in data collection and analysis, companies can identify trends, predict future behaviours, and tailor their offerings to meet evolving customer needs. This approach can drive repeat business and increase customer lifetime value, ultimately leading to higher margins over time. 

This year's panel will explore how data and boldness are key to a balanced strategy that blends margin management with a robust focus on customer loyalty. Using data smartly is essential to achieving sustainable profit growth and strengthening brand loyalty. Don't miss out on what promises to be an inspiring and insightful discussion!

Our panel discussion explores the transformative potential of big data and AI across diverse industries. We will address key technical challenges such as data management and model scalability, along with ethical and privacy concerns, including data utilization and algorithmic biases. Looking ahead, we will discuss future trends and emerging AI breakthroughs. Emphasizing interdisciplinary collaboration, we advocate for diverse teams to ensure fairness and innovation in AI solutions. This dialogue aims to illuminate the complexities and opportunities in big data and AI.

In this short presentation, Big Data LDN Conference Chairman and Europe's leading IT industry analyst in data management and analytics, Mike Ferguson, will welcome everyone to Big Data LDN 2024. He will also summarise where companies are with data, analytics and AI in 2024, what the key challenges and trends are, how these trends are shaping the way companies build a data-driven enterprise, and where you can find out more about these topics at the show.

Statistics for Data Science and Analytics

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration.

Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations. A range of statistical techniques are presented with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, validation of statistical models by applying them to holdout data, and probability and inference via the easy-to-understand method of resampling and the bootstrap instead of a myriad of "kitchen sink" formulas. Regression is taught both as a tool for explanation and for prediction. This book is informed by the authors' experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves.

Statistics for Data Science and Analytics includes information on sample topics such as:
Int, float, and string data types, numerical operations, manipulating strings, converting data types, and advanced data structures like lists, dictionaries, and sets
Experiment design via randomizing, blinding, and before-after pairing, as well as proportions and percents when handling binary data
Specialized Python packages like numpy, scipy, pandas, scikit-learn, and statsmodels, the workhorses of data science, and how to get the most value from them
Statistical versus practical significance, random number generators, functions for code reuse, and binomial and normal probability distributions

Written by and for data science instructors, Statistics for Data Science and Analytics is an excellent learning resource for data science instructors prescribing a required intro stats course for their programs, as well as other students and professionals seeking to transition to the data science field.
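The bootstrap approach the blurb mentions is easy to see in a few lines of Python. The following is a minimal illustrative sketch, not code from the book; the sample data and numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: 30 observed session times in minutes
sample = rng.normal(loc=12.0, scale=4.0, size=30)

# Bootstrap: resample with replacement many times and
# recompute the statistic of interest on each resample
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# A 95% confidence interval is read off the 2.5th and
# 97.5th percentiles of the bootstrap distribution
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The appeal of this method, as the book argues, is that the same resampling loop works for almost any statistic, with no formula lookup required.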

Sayle Matthews leads the North American GCP Data Practice at DoiT International. Over the past year and a half, he has focused almost exclusively on BigQuery, helping hundreds of GCP customers optimize their usage and solve some of their biggest 'Big Data' challenges. Drawing on his extensive experience with Google BigQuery billing, he sat down with us to discuss the changes and, most importantly, the impact these changes have had on the market, as he has observed while working with hundreds of clients of various sizes at DoiT. Sayle's LinkedIn page - https://www.linkedin.com/in/sayle-matthews-522a795/

Polars Cookbook

Dive into the world of data analysis with the Polars Cookbook. This book, ideal for data professionals, covers practical recipes to manipulate, transform, and analyze data using the Python Polars library. You'll learn both the fundamentals and advanced techniques to build efficient and scalable data workflows.

What this book will help me do:
Master the basics of Python Polars, including installation and setup.
Perform complex data manipulation like pivoting, grouping, and joining.
Handle large-scale time series data for accurate analysis.
Understand data integration with libraries like pandas and numpy.
Optimize workflows for both on-premise and cloud environments.

Author(s): Yuki Kakegawa is an experienced data analytics consultant who has collaborated with companies such as Microsoft and Stanford Health Care. His passion for data led him to create this detailed guide on Polars. His expertise ensures you gain real-world, actionable insights from every chapter.

Who is it for? This book is perfect for data analysts, engineers, and scientists eager to enhance their efficiency with Python Polars. If you are familiar with Python and tools like pandas but are new to Polars, this book will upskill you. Whether handling big data or optimizing code for performance, the Polars Cookbook has the guidance you need to succeed.
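As a flavor of what such recipes look like, here is a minimal sketch (not from the book) using Polars' lazy API on a hypothetical sales.csv with region, product, and amount columns:

```python
import polars as pl

# Lazy scan: Polars can optimize the whole query plan before reading the file.
# sales.csv and its columns are hypothetical.
lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .group_by("region", "product")
    .agg(
        pl.col("amount").sum().alias("total"),
        pl.len().alias("orders"),
    )
    .sort("total", descending=True)
)

df = lazy.collect()  # execution happens here, not above
print(df.head())
```

Deferring execution with scan_csv/collect is one of the habits the book encourages, since it lets the query optimizer push filters down and read only what is needed.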

Guardrails are not something we actively use in our day-to-day lives; they're in place to keep us safe when we lack the control needed to keep us on course, and for that, they are essential. Navigating the complexities of decision-making in AI and data can be challenging, especially on a global scale when many are searching for any sort of competitive advantage. Every choice you make can have significant impacts, and having the right frameworks, ethics, and guardrails in place is crucial. But how do you create systems that guide decisions without stifling creativity or flexibility? What practices can you employ to ensure your team consistently makes better choices and flourishes in the age of AI?

Viktor Mayer-Schönberger is a distinguished Professor of Internet Governance and Regulation at the Oxford Internet Institute, University of Oxford. With a career spanning decades, his research focuses on the role of information in a networked economy. He previously served on the faculty of Harvard's Kennedy School of Government for ten years and has authored several influential books, including the award-winning "Delete: The Virtue of Forgetting in the Digital Age" and the international bestseller "Big Data." Viktor founded Ikarus Software in 1986, where he developed Virus Utilities, Austria's best-selling software product. He has been recognized as a Top-5 Software Entrepreneur in Austria and has served as a personal adviser to the Austrian Finance Minister on innovation policy. His work has garnered global attention, featuring in major outlets like the New York Times, BBC, and The Economist. Viktor is also a frequent public speaker and an advisor to governments, corporations, and NGOs on issues related to the information economy.

In the episode, Richie and Viktor explore the definition of guardrails, characteristics of good guardrails, guardrails in business contexts, life-or-death decision-making, principles of effective guardrails, decision-making and cognitive bias, uncertainty in decision-making, designing guardrails, AI and the implementation of guardrails, and much more.

Links Mentioned in the Show:
Guardrails: Guiding Human Decisions in the Age of AI by Urs Gasser and Viktor Mayer-Schönberger
Book - The Checklist Manifesto by Atul Gawande
Connect with Viktor
Course - AI Ethics
Related Episode: Making Better Decisions using Data & AI with Cassie Kozyrkov, Google's First Chief Decision Scientist
Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for Business.

DuckDB in Action

Dive into DuckDB and start processing gigabytes of data with ease, all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you'll learn everything you need to know to get the most out of this awesome tool, keep your data secure on prem, and save hundreds on your cloud bill. From data ingestion to advanced data pipelines, you'll learn everything you need to get the most out of DuckDB, all through hands-on examples.

Open up DuckDB in Action and learn how to:
Read and process data from CSV, JSON, and Parquet sources, both local and remote
Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables
Use DuckDB from Python, both with SQL and its "Relational" API, interacting with databases but also data frames
Prepare, ingest, and query large datasets
Build cloud data pipelines
Extend DuckDB with custom functionality

Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won't need to read through pages of documentation; you'll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines.

About the Technology: DuckDB makes data analytics fast and fun! You don't need to set up Spark or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite, and Postgres.

About the Book: DuckDB in Action guides you example by example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You'll explore DuckDB's handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action.

What's Inside:
Prepare, ingest, and query large datasets
Build cloud data pipelines
Extend DuckDB with custom functionality
Fast-paced SQL recap: from simple queries to advanced analytics

About the Reader: For data pros comfortable with Python and CLI tools.

About the Authors: Mark Needham is a blogger and video creator at @LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and Engineer at Neo4j.

Quotes:
"I use DuckDB every day, and I still learned a lot about how DuckDB makes things that are hard in most databases easy!" - Jordan Tigani, Founder, MotherDuck
"An excellent resource! Unlocks possibilities for storing, processing, analyzing, and summarizing data at the edge using DuckDB." - Pramod Sadalage, Director, Thoughtworks
"Clear and accessible. A comprehensive resource for harnessing the power of DuckDB for both novices and experienced professionals." - Qiusheng Wu, Associate Professor, University of Tennessee
"Excellent! The book all we ducklings have been waiting for!" - Gunnar Morling, Decodable
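As a taste of how lightweight DuckDB is from Python, here is a minimal sketch (not taken from the book); the events.parquet file and its columns are hypothetical:

```python
import duckdb

# Query a Parquet file in place: no import step, no server.
# events.parquet is a hypothetical file with user_id, event_type, ts columns.
result = duckdb.sql("""
    SELECT event_type,
           count(*)                AS n,
           count(DISTINCT user_id) AS users
    FROM 'events.parquet'
    GROUP BY event_type
    ORDER BY n DESC
""")

result.show()     # pretty-print in the terminal
df = result.df()  # or hand the result to pandas (requires pandas installed)
```

The same pattern works for CSV and JSON sources, local or remote, which is what makes DuckDB attractive for laptop-scale analytics.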

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Katie Bauer (GlossGenius), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Broadly writ, we're all in the business of data work in some form, right? It's almost like we're all swimming around in a big data lake, and our peers are swimming around it, too, and so are our business partners. There might be some HiPPOs and some SLOTHs splashing around in the shallow end, and the contours of the lake keep changing. Is lifeguarding…or writing SQL…or prompt engineering to get AI to write SQL…or identifying business problems a job or a skill? Does it matter? Aren't we all just trying to get to the Insights Water Slide? Katie Bauer, Head of Data at Gloss Genius and thought-provoker at Wrong But Useful, joined Michael, Julie, and Val for a much less metaphorically tortured exploration of the ever-shifting landscape in which the modern data professional operates. Or swims. Or sinks? For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

[0:00] Host: Hi everyone, welcome to our event. This event is brought to you by DataTalks.Club, a community of people who love data, and we have weekly events; today's is one of them. I guess we are also a community of people who like to wake up early, if you're from the States, right, Christopher? Or maybe not so much, because this is when we usually have our events. For guests and presenters from the States we usually do it in the evening, Berlin time, but unfortunately that slipped my mind. Anyway, we have a lot of events; you can check them at the link in the description. I don't think there are many listed there right now, but we will be adding more, and I think we have five or six interviews scheduled, so keep an eye on that. Don't forget to subscribe to our YouTube channel; that way you will get notified about all our future streams, which will be as awesome as today's. And, very important, don't forget to join our community, where you can hang out with other data enthusiasts. During today's interview you can ask any question: there's a pinned link in the live chat, so click on it, ask your question, and we will cover it during the interview. Now I will stop sharing my screen. And there's a message from Christopher; we actually have this on YouTube, so people haven't seen what you wrote, but there is a message to everyone watching right now from Christopher saying hello everyone.

Chris: Okay, I should look at YouTube then.

Host: You don't need to; you'll need to focus on answering questions, and I'll be keeping an eye on all the questions. So, if you're ready, we can start.

Chris: I'm ready.

Host: And you prefer Christopher, not Chris, right?

Chris: Chris is fine. It's a bit shorter.

[2:18] Host: Okay. So this week we'll talk about DataOps again. Maybe it's a tradition that we talk about DataOps once per year, though we actually skipped one year, because we haven't had Chris for some time. Today we have a very special guest. Christopher is the co-founder, CEO, and head chef at DataKitchen, with 25 years of experience in analytics and software engineering; maybe that number is outdated, because by now you probably have more, and maybe you've stopped counting. Christopher is known as the co-author of the DataOps Cookbook and the DataOps Manifesto, and it's not the first time we've had him on the podcast: we interviewed him two years ago, also about DataOps. This one will be about DataOps too, so we'll catch up and see what actually changed in these two years. Welcome to the interview.

Chris: Well, thank you for having me. I'm happy to be here, talking all things related to DataOps, why bother with DataOps, and happy to talk about the company and what's changed. Excited.

Host: Yeah, so let's dive in. The questions for today's interview were prepared by Johanna Berer, as always; thanks, Johanna, for your help. Before we start with our main topic for today, DataOps, let's start with your background. Can you tell us about your career journey so far? For those who have not listened to the previous podcast, maybe you can talk about yourself, and for those who did, perhaps give a summary of what has changed in the last two years.

[4:03] Chris: Will do. So, my name is Chris, and I guess I'm sort of an engineer. I spent about the first 15 years of my career in software, working on and building some AI systems and some non-AI systems, at the US's NASA and MIT Lincoln Lab, then some startups, and then Microsoft. And then, around 2005, I got the data bug. My kids were small, and I thought: oh, this data thing will be easy, and I'll be able to go home for dinner at 5, and life will be fine.

Host: Because you started your own company, right?

Chris: And it didn't work out that way. What was interesting is that, for me, the problem wasn't doing the data. We had smart people who did data science and data engineering, the act of creating things. It was the systems around the data that were hard. It was really hard not to have errors in production. I had a BlackBerry at the time and a long drive to work, and I would not look at it all morning. I'd sit in the parking lot, take a deep breath, look at my BlackBerry, and go: uh oh, are there going to be any problems today? If there weren't, I'd walk in very happy, and if there were, I'd have to brace myself. And then the second problem was that the team I worked for just couldn't go fast enough. The customers were super demanding; they didn't care; they always thought things should be faster, and we were always behind. So how do you live in that world, where things are breaking left and right, you're terrified of making errors, and on top of that you just can't go fast enough?

Host: And this is the pre-Hadoop era, right? Before all this big data tech.

Chris: Yeah, before all that. We were using SQL Server, and we had smart people, so we built an engine inside SQL Server that turned it into a columnar database, in order to make certain things fast. And it wasn't bad; the principles are the same. Before Hadoop it's still a database: there are still indexes, still queries, things like that. At the time you would use OLAP engines; we didn't use those, but the reports and models are not that different. We had a rack of servers instead of the cloud.

What I took from that was that it's just hard to run a team of people doing data and analytics. I took it from a manager's perspective: I started to read Deming and to think about the work that we do as a factory, a factory that produces insight rather than automobiles. How do you run that factory so it produces things of good quality? And then, second, since I had come from software, I've been very influenced by the DevOps movement: how you automate deployment, how you run in an agile way, how you change things quickly and innovate. Those two things, running a really solid production line with very low errors, and changing that production line very often, are kind of opposites, right? So how do you, as a manager, and technically, approach that? Then, 10 years ago, we started DataKitchen. We've always been a profitable company, so we started off with some customers and started building some software, and we realized that we couldn't work any other way, and that the way we work wasn't understood by a lot of people, so we had to write a book and a manifesto to share our methods. So we've now been in business a little over 10 years.

[8:33] Host: That's cool. So let's talk about DataOps. You mentioned DevOps and how you were inspired by it. By the way, do you remember roughly when DevOps started to appear, when people started calling these principles, and the tools around them, DevOps?

Chris: Well, first of all, I had a boss in 1990 at NASA who had this idea: build a little, test a little, learn a lot. That was his mantra, and it made a lot of sense. Then the Agile Software Manifesto came out in 2001, which is very similar. And the first real DevOps was a guy at Twitter who started to do automated, push-a-button deployment, and that was 2009-ish; the first DevOps meetup, I think, was around then. So it's been about 15 years.

[9:39] Host: I was trying to place it. I started my career in 2010, and my first job was as a Java developer. I remember that for some things we would just SFTP to the machine, put the jar archive there, and keep our fingers crossed that it didn't break. I wouldn't really call it deployment.

Chris: You were deploying; you had a deploy process, I'd put it that way. Was it documented, too? Like: put the jar on production, cross your fingers?

Host: I think there was a page on some internal wiki, with passwords, that described what you should do.

Chris: That was it. And I think what's interesting is why that changed. We laugh at it now, but why didn't you invest in automating deployment, or in a whole bunch of automated regression tests that would run? Because I think in software now it would be rare for people not to use CI/CD, not to have some automated tests, you know, functional regression tests. That would be the
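As a minimal sketch of the kind of automated check Chris describes (invented for illustration, not taken from the interview), here are two pytest-style data regression tests that a CI/CD pipeline could run before promoting a data pipeline's output; the file path and column names are hypothetical:

```python
# Run with: pytest test_orders.py
import pandas as pd

def load_orders() -> pd.DataFrame:
    # Hypothetical pipeline output; in CI this would read a staging table or file
    return pd.read_parquet("staging/orders.parquet")

def test_orders_have_no_null_keys():
    # Every row must carry a primary key before promotion to production
    orders = load_orders()
    assert orders["order_id"].notna().all()

def test_revenue_is_non_negative():
    # Guard against sign errors introduced by upstream transformations
    orders = load_orders()
    assert (orders["amount"] >= 0).all()
```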

Big Data on Kubernetes

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges.

What this book will help me do:
Understand Kubernetes architecture and learn to deploy and manage clusters.
Build and orchestrate big data pipelines using Spark, Airflow, and Kafka.
Develop scalable and resilient data solutions with Docker and Kubernetes.
Integrate and optimize data tools for real-time ingestion and processing.
Apply concepts to hands-on projects addressing actual big data scenarios.

Author(s): Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture.

Who is it for? This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.
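To make the orchestration side concrete, here is a minimal, hypothetical Airflow DAG that launches a containerized Spark job as a pod on a Kubernetes cluster. This is an illustrative sketch, not an example from the book; the image name, namespace, and script path are placeholders:

```python
from datetime import datetime

from airflow import DAG
# Requires the apache-airflow-providers-cncf-kubernetes provider package
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# A minimal DAG (Airflow 2.x) that runs one task as a pod on the cluster.
with DAG(
    dag_id="spark_batch_on_k8s",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_spark_job = KubernetesPodOperator(
        task_id="run_spark_job",
        name="spark-batch",
        namespace="data-pipelines",                 # placeholder namespace
        image="my-registry/spark-job:latest",       # placeholder image
        cmds=["spark-submit"],
        arguments=["--master", "k8s://https://kubernetes.default.svc",
                   "/app/job.py"],                  # placeholder script path
        get_logs=True,
    )
```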

Data quality is the foundation of everything we do as Data Analysts and Data Scientists. So why do so many organizations suffer from dirty data? And what can you do to clean it up? In this session, we'll share some of the best data cleaning strategies and real, actionable advice from The Classification Guru, Susan Walsh. You'll leave with a solid plan to start identifying problems with your data and, most importantly, to start fixing them on the path to clean data.

What You'll Learn:
Why dirty data is such a big problem, and the benefits of cleaning it up
The most common types of dirty data you should be on the lookout for
Where you should focus your data cleaning efforts to make the biggest impact

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guest: Susan Walsh is a specialist in data classification, taxonomy customisation, and data cleansing. She also created the COAT philosophy, which is at the core of The Classification Guru's work. By bringing clarity and accuracy to data and procurement, Susan helps teams work more effectively and efficiently. More than a numbers gal, Susan is also an industry thought leader, TEDx speaker, and author of 'Between the Spreadsheets: Classifying and Fixing Dirty Data'. She has spoken globally at events such as ProcureCon, Big Data LDN, and Big Data & AI World, and she cuts through the jargon to address the issues of dirty data and its consequences in an entertaining and engaging way. Fix your dirty data now: www.theclassificationguru.co
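As a small illustration of the kind of fix such a session covers, here is a hypothetical pandas sketch that standardizes one classic type of dirty data, inconsistent supplier names; the values and mapping are invented:

```python
import pandas as pd

# Hypothetical dirty supplier column: the same vendors, many spellings
df = pd.DataFrame({"supplier": [
    "IBM", "I.B.M.", "ibm corp", " Microsoft ", "MICROSOFT", "Microsft",
]})

# Step 1: normalize whitespace, case, and punctuation
cleaned = (
    df["supplier"]
    .str.strip()
    .str.lower()
    .str.replace(r"[.\s]+", " ", regex=True)
    .str.strip()
)

# Step 2: map known variants (including typos) to one canonical name;
# anything unrecognized falls back to the raw value for manual review
canonical = {
    "ibm": "IBM", "i b m": "IBM", "ibm corp": "IBM",
    "microsoft": "Microsoft", "microsft": "Microsoft",
}
df["supplier_clean"] = cleaned.map(canonical).fillna(df["supplier"])

print(df)
```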

The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on Booking Data Exchange (BDX). High-level overview of the talk:
Adapting the open source Airflow Helm chart to spin up an Airflow installation in Booking Kubernetes Service (BKS)
Coming up with a workflow definition format (yaml)
Conversion of workflow.yaml to workflow.py DAGs
Usage of deferrable operators to provide standard step templates to users
Workspaces (collections of workflows), used to ensure role-based access to DAG permissions for users
Using Okta for authentication
Alerting, monitoring, logging
Plans to shift to Astronomer
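Booking.com's internal schema is not public, so the following is a purely illustrative sketch of the general workflow.yaml-to-DAG pattern the talk describes: a small yaml spec is parsed and turned into Airflow tasks at DAG-parse time. All field names, the schema, and the operator choice are assumptions:

```python
# Illustrative only: a toy version of the workflow.yaml -> DAG pattern.
# The real Booking.com format and step operators are internal.
from datetime import datetime

import yaml  # PyYAML
from airflow import DAG
from airflow.operators.bash import BashOperator

WORKFLOW_YAML = """
name: daily_exports
schedule: "@daily"
steps:
  - id: extract
    command: "python extract.py"
  - id: load
    command: "python load.py"
    depends_on: [extract]
"""

spec = yaml.safe_load(WORKFLOW_YAML)

with DAG(
    dag_id=spec["name"],
    schedule=spec["schedule"],
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # One task per declared step
    tasks = {
        step["id"]: BashOperator(task_id=step["id"], bash_command=step["command"])
        for step in spec["steps"]
    }
    # Wire up the dependencies declared in the yaml
    for step in spec["steps"]:
        for upstream in step.get("depends_on", []):
            tasks[upstream] >> tasks[step["id"]]
```

In a production version of this pattern, the plain BashOperator would typically be replaced by the standard, deferrable step templates the talk mentions, so users declare intent in yaml while the platform controls execution.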