In this landmark 100th episode of Data Unchained, host Molly Presley sits down with Jonathan Flynn, Director of Applied Systems at Hammerspace, live from Supercomputing 2025. Together they explore the performance engineering breakthroughs that enabled Hammerspace and Samsung to deliver a historic IO500 10-Node Production result using only standard Linux, the upstream NFSv4.2 client, and off-the-shelf NVMe hardware. This episode breaks down how the Hammerspace Data Platform delivered more than a 33 percent gain over earlier submissions, doubled overall bandwidth, and achieved an unprecedented 809 percent improvement in the IOR Hard Read test using Samsung PM1753 Gen 5 NVMe SSDs. Jonathan explains the Linux kernel innovations, metadata advancements, IO path optimization, parallel file system breakthroughs, and multi-instance file placement strategies that allowed Hammerspace to reach genuine HPC-class performance without proprietary clients or custom networking. Listeners get a detailed walkthrough of the architectural differences between Research and Production IO500 submissions, the impact of metadata redundancy, the performance benefits of NFSd direct and NFS direct, the role of ZFS locking improvements, and how upstream Linux contributions directly advanced the state of HPC and AI data infrastructure. Jonathan also highlights the evolution of MLPerf benchmarking, the benefits of tier zero storage, and how Hammerspace performance engineering is unlocking new levels of efficiency and scalability for AI training, scientific workloads, and large-scale analytics. This episode is essential listening for AI architects, HPC engineers, kernel developers, data scientists, and infrastructure leaders building the next generation of high-performance data platforms. Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.
Send us a text Dive into the powerful world of mainframes! Chief Product Officer of IBM Z and LinuxONE, Tina Tarquinio, reveals the truth behind those eight nines of uptime and explores how mainframes are evolving with AI, hybrid cloud, and future-proofing strategies for mission-critical business decisions.
Discover the cutting-edge innovations transforming enterprise computing—from on-chip AIU and Spyre AI accelerators enabling real-time inferencing at transaction speed, to how LinuxONE is redefining hybrid cloud architecture. Tina discusses DevOps integration, AI-powered code assistants revolutionizing mainframe development, compelling AI use cases, and shares her bold predictions for the mainframe’s next 100 years. Plus, career advice from a tech leader and what she does for fun!
00:46 Tina Tarquinio | 03:18 The Most Mainframe Surprise | 09:12 What IS the Mainframe Really? 8 Nines! | 14:40 On-Chip AIU, Spyre Inferencing | 18:11 Mainframes with Hybrid Cloud | 19:11 The LinuxONE Pitch | 19:59 Exciting Mainframe Innovations | 22:09 DevOps | 23:36 Code Assistants | 26:03 AI Use Case | 27:49 Future Proofing Decisions | 37:17 Regulations | 38:45 Bold Prediction | 38:58 Mainframe 100 | 40:48 Career Advice | 42:24 For Fun
LinkedIn: linkedin.com/in/tina-tarquinio Website: https://www.ibm.com/products/z
#MakingDataSimple #IBMz #Mainframe #LinuxONE #AIInferencing #SpyreAccelerator #HybridCloud #EnterpriseAI #DevOps #AICodeAssistant #EightNines #TinaTarquinio #MainframeModernization #AIUChip #FutureProofing #TechLeadership #WatsonxCodeAssistant #CloudComputing #TelumII
Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.
In this episode of Data Unchained, we sit down with David Flynn, Founder and CEO of Hammerspace, to explore how next-generation infrastructure is transforming the future of AI factories, hyperscaler data centers, and enterprise-scale AI deployments. From exabyte-in-a-rack architectures to parallel file systems native in Linux, this conversation reveals how organizations can drastically lower CapEx, OpEx, and power consumption while unlocking unprecedented performance density. Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US
#AIInfrastructure #Hyperscalers #DataEngineering #EnterpriseAI #SoftwareArchitecture #ExabyteStorage #ParallelFileSystems #LinuxNative #DataCenters #AIatScale #OpenPlatformInitiative #globaldata
Hosted on Acast. See acast.com/privacy for more information.
Supported by Our Partners • Statsig — The unified platform for flags, analytics, experiments, and more. • Graphite — The AI developer productivity platform. • Augment Code — AI coding assistant that pro engineering teams love — GitHub recently turned 17 years old—but how did it start, how has it evolved, and what does the future look like as AI reshapes developer workflows? In this episode of The Pragmatic Engineer, I’m joined by Thomas Dohmke, CEO of GitHub. Thomas has been a GitHub user for 16 years and an employee for 7. We talk about GitHub’s early architecture, its remote-first operating model, and how the company is navigating AI—from Copilot to agents. We also discuss why GitHub hires junior engineers, how the company handled product-market fit early on, and why being a beloved tool can make shipping harder at times. Other topics we discuss include: • How GitHub’s architecture evolved beyond its original Rails monolith • How GitHub runs as a remote-first company—and why they rarely use email • GitHub’s rigorous approach to security • Why GitHub hires junior engineers • GitHub’s acquisition by Microsoft • The launch of Copilot and how it’s reshaping software development • Why GitHub sees AI agents as tools, not a replacement for engineers • And much more! — Timestamps (00:00) Intro (02:25) GitHub’s modern tech stack (08:11) From cloud-first to hybrid: How GitHub handles infrastructure (13:08) How GitHub’s remote-first culture shapes its operations (18:00) Former and current internal tools including Haystack (21:12) GitHub’s approach to security (24:30) The current size of GitHub, including security and engineering teams (25:03) GitHub’s intern program, and why they are hiring junior engineers (28:27) Why AI isn’t a replacement for junior engineers (34:40) A mini-history of GitHub (39:10) Why GitHub hit product market fit so quickly (43:44) The invention of pull requests (44:50) How GitHub enables offline work (46:21) How monetization has changed at GitHub since the acquisition (48:00) 2014 desktop application releases (52:10) The Microsoft acquisition (1:01:57) Behind the scenes of GitHub’s quiet period (1:06:42) The release of Copilot and its impact (1:14:14) Why GitHub decided to open-source Copilot extensions (1:20:01) AI agents and the myth of disappearing engineering jobs (1:26:36) Closing — The Pragmatic Engineer deepdives relevant for this episode: • AI Engineering in the real world • The AI Engineering stack • How Linux is built with Greg Kroah-Hartman • Stacked Diffs (and why you should know about them) • 50 Years of Microsoft and developer tools — See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
Supported by Our Partners • Statsig — The unified platform for flags, analytics, experiments, and more. • Sinch — Connect with customers at every step of their journey. • Modal — The cloud platform for building AI applications. — How has Microsoft changed since its founding in 1975, especially in how it builds tools for developers? In this episode of The Pragmatic Engineer, I sit down with Scott Guthrie, Executive Vice President of Cloud and AI at Microsoft. Scott has been with the company for 28 years. He built the first prototype of ASP.NET, led the Windows Phone team, led up Azure, and helped shape many of Microsoft’s most important developer platforms. We talk about Microsoft’s journey from building early dev tools to becoming a top cloud provider—and how it actively worked to win back and grow its developer base. In this episode, we cover: • Microsoft’s early years building developer tools • Why Visual Basic faced resistance from devs back in the day: even though it simplified development at the time • How .NET helped bring a new generation of server-side developers into Microsoft’s ecosystem • Why Windows Phone didn’t succeed • The 90s Microsoft dev stack: docs, debuggers, and more • How Microsoft Azure went from being the #7 cloud provider to the #2 spot today • Why Microsoft created VS Code • How VS Code and open source led to the acquisition of GitHub • What Scott’s excited about in the future of developer tools and AI • And much more! — Timestamps (00:00) Intro (02:25) Microsoft’s early years building developer tools (06:15) How Microsoft’s developer tools helped Windows succeed (08:00) Microsoft’s first tools were built to allow less technically savvy people to build things (11:00) A case for embracing the technology that’s coming (14:11) Why Microsoft built Visual Studio and .NET (19:54) Steve Ballmer’s speech about .NET (22:04) The origins of C# and Anders Hejlsberg’s impact on Microsoft (25:29) The 90’s Microsoft stack, including documentation, debuggers, and more (30:17) How productivity has changed over the past 10 years (32:50) Why Gergely was a fan of Windows Phone—and Scott’s thoughts on why it didn’t last (36:43) Lessons from working on (and fixing) Azure under Satya Nadella (42:50) Codeplex and the acquisition of GitHub (48:52) 2014: Three bold projects to win the hearts of developers (55:40) What Scott’s excited about in new developer tools and cloud computing (59:50) Why Scott thinks AI will enhance productivity but create more engineering jobs — The Pragmatic Engineer deepdives relevant for this episode: • Microsoft is dogfooding AI dev tools’ future • Microsoft’s developer tools roots • Why are Cloud Development Environments spiking in popularity, now? • Engineering career paths at Big Tech and scaleups • How Linux is built with Greg Kroah-Hartman — See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
Supported by Our Partners • WorkOS — The modern identity platform for B2B SaaS. • Modal — The cloud platform for building AI applications. • Cortex — Your Portal to Engineering Excellence. — Kubernetes is the second-largest open-source project in the world. What does it actually do—and why is it so widely adopted? In this episode of The Pragmatic Engineer, I’m joined by Kat Cosgrove, who has led several Kubernetes releases. Kat has been contributing to Kubernetes for several years, and originally got involved with the project through K3s (the lightweight Kubernetes distribution). In our conversation, we discuss how Kubernetes is structured, how it scales, and how the project is managed to avoid contributor burnout. We also go deep into: • An overview of what Kubernetes is used for • A breakdown of Kubernetes architecture: components, pods, and kubelets • Why Google built Borg, and how it evolved into Kubernetes • The benefits of large-scale open source projects—for companies, contributors, and the broader ecosystem • The size and complexity of Kubernetes—and how it’s managed • How the project protects contributors with anti-burnout policies • The size and structure of the release team • What KEPs are and how they shape Kubernetes features • Kat’s views on GenAI, and why Kubernetes blocks using AI, at least for documentation • Where Kat would like to see AI tools improve developer workflows • Getting started as a contributor to Kubernetes—and the career and networking benefits that come with it • And much more! — Timestamps (00:00) Intro (02:02) An overview of Kubernetes and who it’s for (04:27) A quick glimpse at the architecture: Kubernetes components, pods, and kubelets (07:00) Containers vs. virtual machines (10:02) The origins of Kubernetes (12:30) Why Google built Borg, and why they made it an open source project (15:51) The benefits of open source projects (17:25) The size of Kubernetes (20:55) Cluster management solutions, including different Kubernetes services (21:48) Why people contribute to Kubernetes (25:47) The anti-burnout policies Kubernetes has in place (29:07) Why Kubernetes is so popular (33:34) Why documentation is a good place to get started contributing to an open-source project (35:15) The structure of the Kubernetes release team (40:55) How responsibilities shift as engineers grow into senior positions (44:37) Using a KEP to propose a new feature—and what’s next (48:20) Feature flags in Kubernetes (52:04) Why Kat thinks most GenAI tools are scams—and why Kubernetes blocks their use (55:04) The use cases Kat would like to have AI tools for (58:20) When to use Kubernetes (1:01:25) Getting started with Kubernetes (1:04:24) How contributing to an open source project is a good way to build your network (1:05:51) Rapid fire round — The Pragmatic Engineer deepdives relevant for this episode: • Backstage: an open source developer portal • How Linux is built with Greg Kroah-Hartman • Software engineers leading projects • What TPMs do and what software engineers can learn from them • Engineering career paths at Big Tech and scaleups — See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
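For readers who want to poke at the architecture Kat describes (pods scheduled onto nodes, kubelets reporting status back to the control plane), here is a minimal illustrative sketch using the official Kubernetes Python client to list pods and the nodes they landed on. It assumes a working kubeconfig and the kubernetes package installed; it is not code from the episode.

```python
# Illustrative sketch (not from the episode): list pods and the nodes the scheduler
# placed them on, using the official Kubernetes Python client.
# Assumes `pip install kubernetes` and a valid kubeconfig (e.g. ~/.kube/config).
from kubernetes import client, config

def main():
    config.load_kube_config()          # or config.load_incluster_config() inside a pod
    core = client.CoreV1Api()

    for pod in core.list_pod_for_all_namespaces(watch=False).items:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} "
              f"phase={pod.status.phase} node={pod.spec.node_name}")

if __name__ == "__main__":
    main()
```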
Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
Supported by Our Partners • CodeRabbit — Cut code review time and bugs in half. Use the code PRAGMATIC to get one month free. • Modal — The cloud platform for building AI applications. — How will AI tools change software engineering? Tools like Cursor, Windsurf and Copilot are getting better at autocomplete, generating tests and documentation. But what is changing when it comes to software design? Stanford professor John Ousterhout thinks not much. In fact, he believes that great software design is becoming even more important as AI tools become more capable of generating code. In this episode of The Pragmatic Engineer, John joins me to talk about why design still matters and how most teams struggle to get it right. We dive into his book A Philosophy of Software Design, unpack the difference between top-down and bottom-up approaches, and explore why some popular advice, like writing short methods or relying heavily on TDD, does not hold up, according to John. We also explore: • The differences between working in industry vs. academia • Why John believes software design will become more important as AI capabilities expand • The top-down and bottom-up design approaches – and why you should use both • John’s “design it twice” principle • Why deep modules are essential for good software design • Best practices for special cases and exceptions • The undervalued trait of empathy in design thinking • Why John advocates for doing some design upfront • John’s criticisms of the single-responsibility principle, TDD, and why he’s a fan of well-written comments • And much more! As a fun fact: when we recorded this podcast, John was busy contributing to the Linux kernel: adding support for the Homa Transport Protocol – a protocol invented by one of his PhD students. John wanted to make this protocol available more widely, and is putting in the work to do so. What a legend! (We previously covered how Linux is built and how to contribute to the Linux kernel) — Timestamps (00:00) Intro (02:00) Why John transitioned back to academia (03:47) Working in academia vs. industry (07:20) Tactical tornadoes vs. 10x engineers (11:59) Long-term impact of AI-assisted coding (14:24) An overview of software design (15:28) Why TDD and Design Patterns are less popular now (17:04) Two general approaches to designing software (18:56) Two ways to deal with complexity (19:56) A case for not going with your first idea (23:24) How Uber used design docs (26:44) Deep modules vs. shallow modules (28:25) Best practices for error handling (33:31) The role of empathy in the design process (36:15) How John uses design reviews (38:10) The value of in-person planning and using old-school whiteboards (39:50) Leading a planning argument session and the places it works best (42:20) The value of doing some design upfront (46:12) Why John wrote A Philosophy of Software Design (48:40) An overview of John’s class at Stanford (52:20) A tough learning from early in Gergely’s career (55:48) Why John disagrees with Robert Martin on short methods (1:01:08) John’s criticisms of TDD and what he favors instead (1:05:30) Why John supports the use of comments and how to use them correctly (1:09:20) How John uses ChatGPT to help explain code in the Linux Kernel (1:10:40) John’s current coding project in the Linux Kernel (1:14:13) Updates to A Philosophy of Software Design in the second edition (1:19:12) Rapid fire round — The Pragmatic Engineer deepdives relevant for this episode: • Engineering Planning with RFCs, Design Documents and ADRs • Paying down tech debt • Software architect archetypes • Building Bluesky: a distributed social network — See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
Supported by Our Partners • WorkOS — The modern identity platform for B2B SaaS. • Vanta — Automate compliance and simplify security with Vanta. — Linux is the most widespread operating system, globally – but how is it built? Few people are better placed to answer this than Greg Kroah-Hartman: a Linux kernel maintainer for 25 years, and one of the three Linux Foundation Fellows (the other two are Linus Torvalds and Shuah Khan). Greg manages the Linux kernel’s stable releases, and is a maintainer of multiple kernel subsystems. We cover the inner workings of Linux kernel development, exploring everything from how changes get implemented to why its community-driven approach produces such reliable software. Greg shares insights about the kernel's unique trust model and makes a case for why engineers should contribute to open-source projects. We go into: • How widespread is Linux? • What is the Linux kernel responsible for – and why is it a monolith? • How does a kernel change get merged? A walkthrough • The 9-week development cycle for the Linux kernel • Testing the Linux kernel • Why is Linux so widespread? • The career benefits of open-source contribution • And much more! — Timestamps (00:00) Intro (02:23) How widespread is Linux? (06:00) The difference in complexity in different devices powered by Linux (09:20) What is the Linux kernel? (14:00) Why trust is so important with the Linux kernel development (16:02) A walk-through of a kernel change (23:20) How Linux kernel development cycles work (29:55) The testing process at Kernel and Kernel CI (31:55) A case for the open source development process (35:44) Linux kernel branches: Stable vs. development (38:32) Challenges of maintaining older Linux code (40:30) How Linux handles bug fixes (44:40) The range of work Linux kernel engineers do (48:33) Greg’s review process and its parallels with Uber’s RFC process (51:48) Linux kernel within companies like IBM (53:52) Why Linux is so widespread (56:50) How Linux Kernel Institute runs without product managers (1:02:01) The pros and cons of using Rust in Linux kernel (1:09:55) How LLMs are utilized in bug fixes and coding in Linux (1:12:13) The value of contributing to the Linux kernel or any open-source project (1:16:40) Rapid fire round — The Pragmatic Engineer deepdives relevant for this episode: • What TPMs do and what software engineers can learn from them • The past and future of modern backend practices • Backstage: an open-source developer portal — See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. We've released a special edition series of minisodes of our podcast. Recorded live at Data Connect 2024, our host Michael Toland engages in short, sweet, informative, and delightful conversations with five prominent practitioners who are forging their way forward in data and technology.
About our host Michael Toland: Michael is a Product Management Coach and Consultant with Pathfinder Product, a Test Double Operation. Since 2016, Michael has worked on large-scale system modernizations and migration initiatives at Verizon. Outside his professional career, Michael serves as the Treasurer for the New Leaders Council, mentors with Venture for America, sings with the Columbus Symphony, and writes satire for his blog Dignified Product. He is excited to discuss data product management with the podcast audience. Connect with Michael on LinkedIn.
About our guest Jean-Georges Perrin: Jean-Georges “jgp” Perrin is the Chief Innovation Officer at AbeaData, where he focuses on developing cutting-edge data tooling. He chairs the Open Data Contract Standard (ODCS) at the Linux Foundation's Bitol project, co-founded the AIDA User Group, and has authored several influential books, including Implementing Data Mesh (O'Reilly) and Spark in Action, 2nd Edition (Manning). With over 25 years in IT, Jean-Georges is recognized as a Lifetime IBM Champion, a PayPal Champion, and a Data Mesh MVP. His expertise spans data engineering, governance, and the industrialization of data science. Outside of tech, he enjoys exploring Upstate New York and New England with his family. Connect with J-GP on LinkedIn. All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate a practitioner. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!
Jean-Georges Perrin is a serial startup founder, currently co-founder of AbeaData [https://abeadata.com/], and co-author of "Implementing Data Mesh." He championed PayPal's data contract project, which is now part of Bitol and the Linux Foundation. In this episode, JGP speaks about building and maintaining open-source data contract solutions using open standards. He shares a lot about why and how he came to it and the challenges of maintaining it to avoid appropriation of the solution. JGP discusses how they balance the interests of different groups in developing a community around open data contract standards. More importantly, he shares how data contracts can positively change the life of every data engineer. Check out JGP's LinkedIn. Check out Bitol - Open Standards for Data Contracts - and become a contributor.
We are excited to welcome Linux NFS Kernel Maintainer and CTO of Hammerspace, Trond Myklebust, to join us on this episode of the podcast! Trond and Molly discuss the Linux community and Trond's journey from aspiring Particle Physics Ph.D. to Linux Maintainer and innovative industry technology visionary. Trond has dedicated a career to building open source software and driving innovation in data technologies. In this episode we discuss the evolution of high-performance file systems. Historically, parallel file systems have required additional client software to be loaded on each client machine that needs to work with high-performance data sets. Added software can be difficult to get approved by security teams, slow to be added to workstation images, and is typically charged per client software instance. These challenges have made it difficult to give access to all users and applications that could derive value from the data sets housed in the parallel file system. All of this began to change with the vision of the Linux community developing an embedded parallel file system as part of the NFS protocol. With the creation of pNFS (parallel NFS), standard Linux compute clients can now read and write directly to the storage, and scale performance linearly and near infinitely. Expensive and proprietary software is no longer needed to create a parallel file system. pNFS is built into open standards. All these topics and more as we dive deeper into the data-driven world on this podcast episode of Data Unchained!
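One way to see whether a Linux client is actually taking the pNFS data path is to look at the per-operation counters the NFS client exposes in /proc/self/mountstats: non-zero LAYOUTGET counts mean the client is negotiating pNFS layouts. The sketch below is an illustrative, assumption-laden example (a Linux client with an NFSv4.1 or newer mount), not something from the episode.

```python
# Illustrative sketch: report pNFS layout activity for NFS mounts on a Linux client.
# Assumes /proc/self/mountstats is available (Linux) and the mount is NFSv4.1 or newer.
import re

LAYOUT_OPS = ("LAYOUTGET", "LAYOUTCOMMIT", "LAYOUTRETURN")

def pnfs_activity(path="/proc/self/mountstats"):
    """Yield (mountpoint, {op: call_count}) for every NFS mount seen in mountstats."""
    mountpoint, counts = None, {}
    with open(path) as stats:
        for line in stats:
            device = re.match(r"device \S+ mounted on (\S+) with fstype nfs", line)
            if device:
                if mountpoint:
                    yield mountpoint, counts
                mountpoint, counts = device.group(1), {}
            else:
                for op in LAYOUT_OPS:
                    if line.strip().startswith(op + ":"):
                        # First numeric field after the op name is the operation count.
                        counts[op] = int(line.split()[1])
    if mountpoint:
        yield mountpoint, counts

if __name__ == "__main__":
    for mnt, ops in pnfs_activity():
        state = "pNFS layouts in use" if ops.get("LAYOUTGET") else "no layout ops seen"
        print(f"{mnt}: {state} {ops}")
```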
#data #pNFS #NFS #Linux #Maintainer #Community #decentralizeddata #datastorage #storage
Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.
In this episode, Bryce and Conor chat about programming language logos, code formatting, the top future C++ features and more!
Link to Episode 100 on Website
Twitter: ADSP: The Podcast, Conor Hoekstra, Bryce Adelstein Lelbach
Show Notes
Date Recorded: 2022-10-16
Date Released: 2022-10-21
The Swift Programming Language, The Racket Programming Language, The Clojure Programming Language, The New APL Logo, Mind in Motion by Barbara Tversky, Nudge by Richard Thaler, Thinking, Fast and Slow by Daniel Kahneman, Amos Tversky, The Peak-End Rule, C++’s Clang-Format, Python’s Black, Python’s PEP8, NVIDIA CUB Logo, ADSP Episode 99: Moby Dick & Our Favorite Movies, O3DCON by Linux Foundation Conference, Top 3 C++ Features #1: Reflection, Top 3 C++ Features #2: Pattern Matching, Top 3 C++ Features #3: Senders & Receivers, C++ std::variant, C++ std::optional, Rust enum, C++ std::expected, Sy Brand’s tl::expected, Python result, C++20 is Here! (ISO C++ Prague Feb 2020 Vlog)
Intro Song Info: Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.
Abstract
Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.
This week on Making Data Simple, we have Dale Davis Jones, who is an IBM Vice President and Distinguished Engineer in Global Technology Services, where she leads the GTS IT Architect community and Client Innovation. We also have Hai-Nhu Tran, who is the Senior Manager of Content Design in Data and AI at IBM. Hai-Nhu and her team are responsible for the technical content experience for a large portfolio of products and platforms.
Show Notes 11:43 - What is the context and how did you get involved? 17:10 - How do you define success? 19:40 - Are you focused on IT language? 25:03 - How do you know you’re doing it right? 32:30 - What decisions have already been made? 37:13 - What other challenges have you faced? 40:00 - How do you know when you’re done? 41:29 - How can people contribute? Dale Davis Jones - LinkedIn Hai-Nhu Tran - LinkedIn Blog - Words Matter: Driving Thoughtful Change Toward Inclusive Language in Technology https://www.ibm.com/blogs/think/2020/08/words-matter-driving-thoughtful-change-toward-inclusive-language-in-technology/ Call for Code for Racial Justice: https://developer.ibm.com/callforcode/racial-justice/ Linux Foundation
Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.
Computer Vision is not Perfect Julia Evans joins us to help answer the question: why do neural networks think a panda is a vulture? Kyle talks to Julia about her hands-on work fooling neural networks. Julia runs Wizard Zines, which publishes works such as Your Linux Toolbox. You can find her on Twitter @b0rk
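The panda-to-vulture behavior Julia digs into comes from adversarial examples: tiny, deliberately chosen perturbations that flip a classifier's prediction. Below is a minimal sketch of the classic fast gradient sign method, assuming PyTorch, torchvision, and a stand-in random input; it illustrates the idea only and is not Julia's code.

```python
# Illustrative FGSM sketch (not Julia's code): nudge an input in the direction of the
# loss gradient's sign so a pretrained classifier may change its mind.
# Assumes `pip install torch torchvision`; the random tensor stands in for a real,
# preprocessed image of shape (1, 3, 224, 224).
import torch
import torchvision.models as models

def fgsm(model, x, label, epsilon=0.01):
    """Return an adversarially perturbed copy of x."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), label)
    loss.backward()
    # Move each pixel a small step in the direction that increases the loss.
    return (x + epsilon * x.grad.sign()).detach()

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()  # downloads weights
x = torch.rand(1, 3, 224, 224)               # stand-in for a real, preprocessed image
label = model(x).argmax(dim=1)               # the model's original prediction
x_adv = fgsm(model, x, label)
print("before:", label.item(), "after:", model(x_adv).argmax(dim=1).item())
```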
Summary There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes at the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring fast and reliable S3-compatible object storage to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
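Since the episode centers on an S3-compatible service, here is a hedged sketch of what "S3 compatible" means in practice: a stock S3 client pointed at a different endpoint. The endpoint URL, bucket name, and credentials below are placeholders, not Linode's actual values.

```python
# Illustrative sketch: talk to any S3-compatible object store by overriding the endpoint.
# Endpoint, bucket, and credentials are placeholders; real values come from your provider.
# Assumes `pip install boto3`.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example-region.example.com",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")  # fails if the bucket already exists
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello, object storage")

for obj in s3.list_objects_v2(Bucket="demo-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```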
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of your object storage product?
What was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?
What is the scale and scope of usage that you had to design for?
Can you describe how your platform is implemented?
What were your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch?
How have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?
What have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode?
What are the most important capabilities for the underlying hardware that you are running on?
What supporting systems and tools are you using to manage the availability and durability of your object storage?
How did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage?
What are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams?
What are your thoughts on the state of the S3 API as a de facto standard for object storage?
What is your main focus now that object storage is being rolled out to more data centers?
Contact Info
Dorthu on GitHub, dorthu22 on Twitter, LinkedIn, Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Linode Object Storage Xen Hypervisor KVM (Linux K
Summary Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data
Interview
Introduction
How did you get involved in the area of data management?
I did some database and GIS work for my dissertation in archaeology, back in the late 1990s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management.
Can you start by describing what Open Context is and how it started?
Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books, and reports.
What are your protocols for determining which data sets you will work with?
Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.
What are some of the challenges unique to research data?
What are some of the unique requirements for processing, publishing, and archiving research data?
You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.
Another issue is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many particularistic interests.
How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?
We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.
Can you describe the system architecture that you use for Open Context?
Open Context is a Django Python application, with a Postgres database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
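To make that stack a little more concrete, here is a hedged sketch of the kind of Solr query a Django view might issue through the pysolr client; the Solr URL, core name, field names, and query terms are invented for illustration and are not Open Context's actual schema.

```python
# Illustrative sketch: full-text search against an Apache Solr core from Python,
# roughly the kind of call a Django view could make. The URL, core name, and field
# names ("period", "label") are invented placeholders, not Open Context's schema.
# Assumes `pip install pysolr` and a running Solr instance.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/records", timeout=10)

# Keyword query, first 10 rows, faceted on a hypothetical "period" field.
results = solr.search(
    "ceramic vessel",
    rows=10,
    **{"facet": "true", "facet.field": "period"},
)

print("hits:", results.hits)
for doc in results:
    print(doc.get("id"), doc.get("label"))
```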
Wh
Summary
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?
How do you define business intelligence?
How is Looker unique from other approaches to business intelligence in the enterprise?
How does it compare to open source platforms for BI?
Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?
How does that change for different user roles (e.g. data engineer vs sales management)?
What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
What are the portions of the Looker architecture that you would do differently if you were to start over today?
What are some of the most interesting or unusual uses of Looker that you have seen?
What is in store for the future of Looker?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Looker, Upworthy, MoveOn.org, LookML, SQL, Business Intelligence, Data Warehouse, Linux, Hadoop, BigQuery, Snowflake, Redshift, DB2, Postgres, ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), Airflow, Luigi, NiFi, Data Curation Episode, Presto, Hive, Athena, DRY (Don’t Repeat Yourself), Looker Action Hub, Salesforce, Marketo, Twilio, Netscape Navigator, Dynamic Pricing, Survival Analysis, DevOps, BigQuery ML, Snowflake Data Sharehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Summary
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
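Version-controlled migration scripts are the core practice the episode and book revolve around; below is a hedged sketch of one small, reversible migration written for Alembic, a Python migration tool in the same family as the Flyway and Liquibase tools listed in the links. The table, column, and revision identifiers are invented for illustration.

```python
# Illustrative Alembic migration sketch: one small, reversible schema change that lives
# in version control alongside the application code. Table, column, and revision IDs
# are invented. Assumes `pip install alembic sqlalchemy` and an initialized Alembic
# environment (alembic.ini plus a migrations/ directory).
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic to order migrations.
revision = "20240101_add_email"
down_revision = "20231201_create_users"

def upgrade():
    # Expand phase: add the column as nullable so old application code keeps working.
    op.add_column("users", sa.Column("email", sa.String(length=255), nullable=True))
    op.create_index("ix_users_email", "users", ["email"])

def downgrade():
    # Contract/rollback phase: undo the change in reverse order.
    op.drop_index("ix_users_email", table_name="users")
    op.drop_column("users", "email")
```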
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow
Interview
Introduction
How did you get involved in the area of data management?
You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time, and why did you find it necessary to write a book on this subject?
What are the characteristics of a database that make it more difficult to manage in an iterative context?
How does the practice of refactoring in the context of a database compare to that of software?
How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution?
Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system?
How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution?
What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system?
Looking back over the past 12 years, what has changed in the areas of database design and evolution?
How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases?
What do you see as the biggest challenges facing us over the next few years?
Contact Info
Website, pramodsadalage on GitHub, @pramodsadalage on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Database Refactoring
Website, Book
Thoughtworks, Martin Fowler, Agile Software Development, XP (Extreme Programming), Continuous Integration (The Book, Wikipedia), Test First Development, DDL (Data Definition Language), DML (Data Manipulation Language), DevOps, Flyway, Liquibase, DBMaintain, Hibernate, SQLAlchemy, ORM (Object Relational Mapper), ODM (Object Document Mapper), NoSQL Document Database, MongoDB, OrientDB, CouchBase, CassandraDB, Neo4j, ArangoDB, Unit Testing, Integration Testing, OLAP (On-Line Analytical Processing), OLTP (On-Line Transaction Processing), Data Warehouse, Docker, QA (Quality Assurance), HIPAA (Health Insurance Portability and Accountability Act), PCI DSS (Payment Card Industry Data Security Standard), Polyglot Persistence, Toplink Java ORM, Ruby on Rails, ActiveRecord Gem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Summary
Data is an increasingly sought-after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI, which require large quantities of information to work from. As the demand for data becomes more widespread, the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor in data-driven businesses, Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements:
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%.
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies
Interview
Introduction
How did you get involved in the area of data management?
You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
What do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization?
With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?
Cont