talk-data.com

Topic: Python

Tags: programming_language, data_science, web_development

Activity trend: 185 peak activities per quarter (2020-Q1 to 2026-Q1)

Activities

1446 activities · Newest first

Summary

Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that each contain fragments of the customers' interactions, and stitching them together is complex and time-consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack

Your host is Tobias Macey, and today I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Segment Unify is and the story behind it?

What are the net-new capabilities that it brings to the Segment product suite?

What are some of the categories of attributes that need to be managed in a prototypical customer profile?

What are the different use cases that are enabled/simplified by the availability of a comprehensive customer profile?

What is the potential impact of more detailed customer profiles on LTV?

How do you manage permissions/auditability of updating or amending profile data?

Can you describe how the Unify product is implemented?

What are the technical challenges that you had to address while developing/launching this product?

What is the workflow for a team who is adopting the Unify product?

What are the other Segment products that need to be in use to take advantage of Unify?

What are some of the most complex edge cases to address in identity resolution?

How does reverse ETL factor into the enrichment process for profile data?

What are some of the issues that you have to account for in synchronizing profiles across platforms/products?

How do you mitigate the impact of "regression to the mean" for systems that don't support all of the attributes that you want to maintain in a profile record?

What are some of the data modeling considerations that you have had to account for to support historical changes (e.g. slowly changing dimensions)?

What are the most interesting, innovative, or unexpected ways that you have seen Segment Unify used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Segment Unify?

When is Segment Unify the wrong choice?

What do you have planned for the future of Segment Unify?

Contact Info

Kevin

LinkedIn

Blog

Hanhan

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

In this episode, Conor and Bryce continue to chat with Tristan Brindle about his new library Flux, ChatGPT, and more.

Link to Episode 127 on Website

Discuss this episode, leave a comment, or ask a question (on GitHub)

Twitter: ADSP: The Podcast, Conor Hoekstra, Bryce Adelstein Lelbach

About the Guest: Tristan Brindle is a freelance programmer and trainer based in London, mostly focussing on C++. He is a member of the UK national body (BSI) and ISO WG21. Occasionally he can be found at C++ conferences. He is also a director of C++ London Uni, a not-for-profit organisation offering free beginner programming classes in London and online. He has a few fun projects on GitHub that you can find out about here.

Show Notes

Date Recorded: 2023-04-05
Date Released: 2023-04-28
ADSP Episode 125: NanoRange with Tristan Brindle
ADSP Episode 126: Flux (and Flow) with Tristan Brindle
ChatGPT
cwhy
ChatDBG
commentator
“Performance Matters” by Emery Berger
“Python Performance Matters” by Emery Berger (Strange Loop 2022)
Tristan’s Tweet about scan
Tristan’s Tweet with flux solutions to MCO
ArrayCast Episode 48: Henry Rich Reveals J with Threads
J9.4
J Fold’s
top10 GitHub Repo
Haskell’s mapAdjacent
blackbird C++ Combinator Library
C++ On Sea Conference
CppNorth
Intro Song Info: Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Summary

Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack

Your host is Tobias Macey, and today I'm interviewing DeVaris Brown about the impact of real-time data on business opportunities and risk profiles.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Meroxa is and the story behind it?

How have the focus and goals of the platform and company evolved over the past 2 years?

Who are the target customers for Meroxa?

What problems are they trying to solve when they come to your platform?

Applications powered by real-time data were the exclusive domain of large and/or sophisticated tech companies for several years due to the inherent complexities involved. What are the shifts that have made them more accessible to a wider variety of teams?

What are some of the remaining blockers for teams who want to start using real-time data?

With the democratization of real-time data, what are the new categories of products and applications that are being unlocked?

How are organizations thinking about the potential value that those types of apps/services can provide?

With data flowing constantly, there are new challenges around oversight and accuracy. How does real-time data change the risk profile for applications that are consuming it?

What are some of the technical controls that are available for organizations that are risk-averse?

What skills do developers need to be able to effectively design, develop, and deploy real-time data applications?

How does this differ when talking about internal vs. consumer/end-user facing applications?

What are the most interesting, innovative, or unexpected ways that you have seen Meroxa used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Meroxa?

When is Meroxa the wrong choice?

What do you have planned for the future of Meroxa?

Contact Info

LinkedIn

@devarispbrown on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Meroxa

Podcast Episode

Kafka
Kafka Connect
Conduit (a Golang replacement for Kafka Connect)
Pulsar
Redpanda
Flink
Beam
ClickHouse
Druid
Pinot

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC

The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams.

Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this...").

This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen?"), prescriptive modelling ("what should we do?"), and perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack.

In this talk, we will cover the shift from ETL to ELT and the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
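To make that concrete, here is a minimal sketch of what a dbt Python model can look like (this assumes dbt 1.3+ on a warehouse adapter that supports Python models; the file path and the upstream model name stg_orders are hypothetical):

    # models/orders_enriched.py
    def model(dbt, session):
        # dbt.ref() resolves an upstream model, mirroring {{ ref() }} in SQL models
        orders = dbt.ref("stg_orders")
        # transform with ordinary Python; the concrete dataframe type
        # (pandas, Snowpark, or PySpark) depends on the warehouse adapter
        return orders

Whatever dataframe the function returns is materialized as a table, so Python transformations slot into the same DAG, tests, and documentation as SQL models.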

Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because, at the time, the community was felt to be “unbalanced and not seeing the true spectrum of the greater community”. Why is that a big thing? Come to my talk and find out!

Discover how Infrastructure From Code (IfC) can revolutionize Cloud DevOps automation by generating cloud deployment templates directly from Python code. Learn how this technology empowers Python developers to easily deploy and operate cost-effective, secure, reliable, and sustainable cloud software. Join us to explore the strategic potential of IfC.

Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. After execution, it notifies the main thread of the completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance, enhanced responsiveness, and more effective use of the CPU.

Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Most of the code we write, especially in IO-heavy applications like websites, depends on external resources. This could be anything from a remote database query to a POST to an external API. As soon as you ask for any of these resources, your code is left waiting for the operation to complete with nothing to do. With asynchronous programming, you allow your code to handle other tasks while waiting for these other resources to respond.

In this session, we are going to talk about asynchronous programming in Python: its benefits and the multiple ways to implement it.
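As a minimal illustration of the idea (the coroutine names and delays below are purely illustrative), asyncio lets two slow operations overlap instead of running back to back:

    import asyncio

    async def fetch(name, delay):
        # stand-in for a slow external resource such as a database or an API
        await asyncio.sleep(delay)
        return f"{name} done"

    async def main():
        # both coroutines run concurrently; total wall time is ~1.5s, not 2.5s
        results = await asyncio.gather(fetch("db", 1.0), fetch("api", 1.5))
        print(results)

    asyncio.run(main())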

In this talk, I’d be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. This talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I’d mainly talk about Zarr’s Python implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.
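As a taste of that interoperability, here is a minimal sketch of creating a chunked array on disk with the Zarr Python library (the store path and shapes are arbitrary examples):

    import numpy as np
    import zarr

    # create a chunked, compressed 2-D array backed by on-disk storage
    z = zarr.open("example.zarr", mode="w", shape=(10000, 10000),
                  chunks=(1000, 1000), dtype="i4")
    z[0, :] = np.arange(10000)  # only the chunks actually touched get written
    print(z[0, :5])             # reads go back through the same chunked store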

With an average of 3.2 new papers published on Arxiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft. Text data, with its high complexity, poses an exciting challenge for the causal inference community. In the workshop, we'll review the latest advances in the field of Causal NLP and implement a causal Transformer model to demonstrate how to translate these developments into a practical solution that can bring real business value. All in Python!

tox is a widely-used tool for automating testing in Python. In this talk, we will go behind the scenes of the creation of tox 4, the latest version of the tool. We will discuss the motivations for the rewrite, the challenges and lessons learned during the development process. We will have a look at the new features and improvements introduced in tox 4. But most importantly, you will get to know the maintainers.

As we are in an era of big data, where large collections of information are assimilated and analyzed for insights into human behavior, data privacy has become a hot topic. Since there is a lot of private information which, once leaked, can be misused, not all data can be released for research. This talk aims to discuss Differential Privacy, a cutting-edge technique that claims to preserve an individual's privacy, how it is employed to minimize the risks of working with private data, its applications in various domains, and how Python eases the task of employing it in our models with PyDP.
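To illustrate the core idea that PyDP wraps (without reproducing its exact API), here is a small NumPy sketch of a differentially private bounded mean using the Laplace mechanism:

    import numpy as np

    def dp_mean(values, lower, upper, epsilon):
        # clip so that one individual shifts the mean by at most
        # (upper - lower) / n -- the sensitivity of the query
        clipped = np.clip(values, lower, upper)
        sensitivity = (upper - lower) / len(clipped)
        # Laplace noise scaled to sensitivity / epsilon yields epsilon-DP
        noise = np.random.laplace(0.0, sensitivity / epsilon)
        return clipped.mean() + noise

    ages = [23, 35, 42, 29, 51, 38]
    print(dp_mean(ages, lower=18, upper=80, epsilon=1.0))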

Let’s say you are the ruler of a remote island. For it to succeed and thrive, you can’t expect it to remain isolated from the world. You need to establish trade routes, offer your products to other islands, and import items from them. Doing this will certainly make your economy grow! We’re not going to talk about land masses or commerce, however. Instead, think of your application as an island that needs to connect to other applications to succeed. Unfortunately, the sea is treacherous and not always very consistent, much like the networks you use to connect your application to the world.

We will explore some techniques and libraries in the Python ecosystem that make your life easier while dealing with external services. From asynchronicity and caching to testing and building abstractions on top of the APIs you consume, you will definitely learn some strategies to build your connected application gracefully and avoid those pesky 2 AM errors that keep you awake.
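As one small example of the kind of strategy covered, here is a retry-with-backoff wrapper around an HTTP call (the URL, attempt count, and delays are placeholders):

    import random
    import time

    import requests

    def fetch_with_retries(url, attempts=3, base_delay=0.5):
        # retry transient network failures with exponential backoff plus jitter
        for attempt in range(attempts):
            try:
                resp = requests.get(url, timeout=5)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                if attempt == attempts - 1:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

    # data = fetch_with_retries("https://api.example.com/items")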

“A modern AI start-up is a front-end developer plus a prompt engineer” is a popular joke on Twitter. This talk is about LangChain, a Python open-source tool for prompt engineering. You can use it with completely open-source language models or ChatGPT. I will show you how to create a prompt and get an answer from an LLM. As an example application, I will show a demo of an intelligent agent using web search and generating Python code to answer questions about this conference.
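A minimal sketch of the prompt-templating step (assuming LangChain's classic PromptTemplate API; the prompt text is illustrative):

    from langchain.prompts import PromptTemplate

    # a template with a named variable filled in at call time
    prompt = PromptTemplate.from_template(
        "You are a helpful assistant. Answer concisely: {question}"
    )
    text = prompt.format(question="What talks are on at this conference?")
    # `text` can now be sent to ChatGPT or any open-source LLM
    print(text)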

Many developers avoid using generators. For example, many well-known Python libraries use lists instead of generators. Iterating a generator is itself slower than looping over a plain list, yet using generators can greatly increase the overall speed of an application. Let’s discover why.
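A small illustration of the trade-off (sizes are approximate and platform-dependent):

    import sys

    squares_list = [n * n for n in range(1_000_000)]  # materializes every element
    squares_gen = (n * n for n in range(1_000_000))   # lazy, constant memory

    print(sys.getsizeof(squares_list))  # megabytes
    print(sys.getsizeof(squares_gen))   # a couple hundred bytes

    # generators shine when a pipeline can stop early: only a handful of
    # squares are ever computed here, rather than a billion
    first_large = next(x for x in (n * n for n in range(10**9)) if x > 10**6)
    print(first_large)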

The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely only on large batch processes executed in a Spark cluster. But as with every large ecosystem, putting together the key pieces of technology takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is way more powerful than what you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.
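For a flavour of what such a pipeline can look like, here is a minimal sketch using the lazy API of Polars, one of the Rust-backed libraries in this space (assuming a recent Polars version; the file name and columns are hypothetical):

    import polars as pl

    # scan lazily instead of loading the file up front; the query planner
    # pushes the filter down to the scan and reads only what it needs
    result = (
        pl.scan_csv("events.csv")
        .filter(pl.col("year") == 2023)
        .group_by("country")
        .agg(pl.col("amount").sum())
        .collect()
    )
    print(result)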

By taking neural networks back to the school bench and teaching them some elements of geometry and topology, we can build algorithms that can reason about the shape of data. Surprisingly, these methods can be useful not only for computer vision, to model input data such as images or point clouds through global, robust properties, but in a wide range of applications, such as evaluating and improving the learning of embeddings, or the distribution of samples originating from generative models. This is the promise of the emerging field of Topological Data Analysis (TDA), which we will introduce before reviewing recent works at its intersection with machine learning.

TDA can be seen as part of the increasingly popular movement of Geometric Deep Learning, which encourages us to go beyond seeing data only as vectors in Euclidean spaces and instead consider machine learning algorithms that encode other geometric priors. In the past couple of years TDA has started to take a step out of the academic bubble, to a large extent thanks to powerful Python libraries written as extensions to scikit-learn or PyTorch.

Python is a beautiful language for fast prototyping and sketching ideas quickly. However, people often struggle to get their code into production, for various reasons. Besides all the security and safety concerns that usually go unaddressed from the very beginning, when one is just playing around with an algorithmic idea, performance concerns are quite frequently a reason for not taking Python code to the next level.

We will look at the "missing performance" worries using a simple numerical problem and show how to speed the corresponding Python code up to top-notch performance.
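For a flavour of the approach, here is a toy numerical problem sped up by moving from pure-Python loops to NumPy vectorization (the exact speedup depends on input size and hardware):

    import numpy as np

    def mean_abs_diff_py(values):
        # pure-Python double loop: simple, but quadratic work in the interpreter
        total, n = 0.0, len(values)
        for a in values:
            for b in values:
                total += abs(a - b)
        return total / n**2

    def mean_abs_diff_np(values):
        # the same computation vectorized with broadcasting: the loops run in C
        arr = np.asarray(values, dtype=float)
        return float(np.abs(arr[:, None] - arr[None, :]).mean())

    data = list(range(500))
    assert abs(mean_abs_diff_py(data) - mean_abs_diff_np(data)) < 1e-9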

The importance of enterprise architecture patterns is well known, and they are applicable to many types of tasks. Thinking about the architecture from the beginning of the journey is crucial to having a maintainable, and therefore testable and flexible, code base. We are going to explore the Ports and Adapters (Hexagonal) pattern by showing a simple web app that uses the Repository, Unit of Work, and Services (Use Cases) patterns tied together with Dependency Injection. All of those patterns are quite famous in other languages, but they are relatively new to the Python ecosystem, where they have been a crucial missing piece. As a web framework we are going to use FastAPI, which could be replaced with any other framework in no time thanks to the abstractions we have added.
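A minimal sketch of the Repository pattern with a service wired in through dependency injection (the User model and repository names are hypothetical; the talk itself wraps a core like this in FastAPI):

    from dataclasses import dataclass
    from typing import Protocol

    @dataclass
    class User:
        id: int
        name: str

    class UserRepository(Protocol):
        # the "port": the domain layer depends only on this interface
        def get(self, user_id: int) -> User: ...

    class InMemoryUserRepository:
        # an "adapter" handy for tests; a SQL adapter could replace it
        # without touching the service below
        def __init__(self) -> None:
            self._users = {1: User(1, "Ada")}

        def get(self, user_id: int) -> User:
            return self._users[user_id]

    def get_user_name(repo: UserRepository, user_id: int) -> str:
        # the service (use case) receives its dependency from the outside
        return repo.get(user_id).name

    print(get_user_name(InMemoryUserRepository(), 1))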