talk-data.com

Topic: GitHub
Tags: version_control, collaboration, code_hosting
661 tagged activities

Activity trend: peak of 79 activities per quarter, 2020-Q1 to 2026-Q1

Activities
661 activities · Newest first

Advanced R: Data Programming and the Cloud

Program for data analysis using R and learn practical skills to make your work more efficient. This book covers how to automate running code and the creation of reports to share your results, as well as writing functions and packages. Advanced R is not designed to teach advanced R programming or the theory behind statistical procedures. Rather, it is a practical guide to moving beyond merely using R to programming in R to automate tasks. The book shows you how to manipulate data in modern R structures and covers connecting R to databases such as SQLite, PostgreSQL, and MongoDB. It closes with a hands-on section on getting R running in the cloud. Each chapter also includes a detailed bibliography with references to research articles and other resources that cover relevant conceptual and theoretical topics.

What You Will Learn
- Write and document R functions
- Make an R package and share it via GitHub or privately
- Add tests to R code to ensure it works as intended
- Build packages automatically with GitHub
- Use R to talk directly to databases and do complex data management
- Run R in the Amazon cloud
- Generate presentation-ready tables and reports using R

Who This Book Is For
Working professionals, researchers, or students who are familiar with R and basic statistical techniques such as linear regression and who want to take their R coding and programming to the next level.

Financial analysis techniques for studying numeric, well-structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very much a greenfield. On this episode, Delia Rusu shares her thoughts on the potential of unstructured data and discusses her work analyzing Wikipedia to help inform financial decisions. Delia's talk at PyData Berlin can be watched on YouTube (Estimating stock price correlations using Wikipedia). The slides can be found here and all related code is available on GitHub.
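As a rough illustration of the underlying idea (not Delia's actual pipeline), the sketch below aligns a daily series of Wikipedia page views with a daily price series and correlates page-view changes against stock returns. The data frames and column names are hypothetical placeholders.

```python
# Illustrative sketch only -- not the method from the talk.
# Assumes you already have daily Wikipedia page views and closing prices
# for the same dates; the series names below are hypothetical.
import pandas as pd

def pageview_price_correlation(views: pd.Series, prices: pd.Series) -> float:
    """Correlate changes in page views with daily stock returns."""
    df = pd.concat({"views": views, "price": prices}, axis=1).dropna()
    view_changes = df["views"].pct_change()
    returns = df["price"].pct_change()
    return view_changes.corr(returns)

# Usage (with made-up inputs):
# corr = pageview_price_correlation(wiki_views["Apple_Inc."], prices["AAPL"])
```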

Platform as a service is a growing trend in data science, where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered? Florian Tramèr shares his work in this episode showing that it can. The paper Stealing Machine Learning Models via Prediction APIs is definitely worth your time to read if you enjoy this episode. Related source code can be found at https://github.com/ftramer/Steal-ML.
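As a hedged illustration of the paper's core idea (not the authors' code), the sketch below treats a deployed classifier as a black box reachable only through a predict call, queries it on synthetic inputs, and trains a local substitute model on the returned labels. The query function and feature count are assumptions.

```python
# Illustrative sketch of a model-extraction attack, assuming the victim
# model is exposed only through a predict() callable (a stand-in for a
# prediction API). Not the code from the Steal-ML repository.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_model(query_api, n_features: int, n_queries: int = 2000):
    """Train a substitute model from black-box predictions."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n_queries, n_features))   # synthetic query points
    y = np.array([query_api(x) for x in X])        # labels returned by the API
    substitute = LogisticRegression(max_iter=1000).fit(X, y)
    return substitute

# Usage: substitute = extract_model(lambda x: victim.predict([x])[0], n_features=10)
```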

Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election. More details are available in Jo's blog post found here. You can find some useful R code for getting started with automatically gathering data from 538 on Jo's GitHub, and official contest details are available here. During the interview we also mention Daily Kos and 538.
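For a sense of what "automatically gathering data" for a contest entry might look like (Jo's code is in R; this is a separate, hypothetical Python sketch), the snippet below pulls a polls CSV and keeps the most recent polls per state. The URL and column names are placeholders, not an official FiveThirtyEight endpoint.

```python
# Hypothetical sketch of pulling poll data for a forecasting entry.
# The URL and column names below are placeholders, not a real data source.
import pandas as pd

POLLS_CSV = "https://example.com/2016-presidential-polls.csv"  # placeholder URL

def latest_polls_per_state(n: int = 5) -> pd.DataFrame:
    polls = pd.read_csv(POLLS_CSV, parse_dates=["end_date"])  # assumed column
    return polls.sort_values("end_date").groupby("state").tail(n)

# state_frame = latest_polls_per_state()
```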

I'm joined this week by Jon Morra, director of data science at eHarmony, to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long-term relationships. Interesting open source projects mentioned in the interview include Face-parts, a web service for detecting faces and extracting a robust set of fiducial markers (features) from an image, and Aloha, a Scala-based machine learning library. You can learn more about these and other interesting projects on the eHarmony GitHub page. In the wrap-up, Jon mentioned the LA Machine Learning meetup, which he runs. This is a great resource for LA residents, separate from and complementary to the datascience.la groups, so consider signing up for all of the above, and I hope to see you there in the future.

Mastering RStudio: Develop, Communicate, and Collaborate with R

"Mastering RStudio: Develop, Communicate, and Collaborate with R" is your guide to unlocking the potential of RStudio. You'll learn to use RStudio effectively in your data science projects, covering everything from creating R packages to interactive web apps with Shiny. By the end, you'll fully understand how to use RStudio tools to manage projects and share results effectively. What this Book will help me do Gain a comprehensive understanding of the RStudio interface and workflow optimizations. Effectively communicate data insights with R Markdown, including static and interactive documents. Create impactful data visualizations using R's diverse graphical systems and tools. Develop Shiny web applications to showcase and share analytical results. Learn to collaborate on projects using Git and GitHub, and understand R package development workflows. Author(s) Julian Hillebrand and None Nierhoff are experienced R developers with years of practical expertise in data science and software development. They have a passion for teaching how to utilize RStudio effectively. Their approach to writing combines practical examples with thorough explanations, ensuring readers can readily apply concepts to real-world scenarios. Who is it for? This book is ideal for R programmers and analysts seeking to enhance their workflows using RStudio. Whether you're looking to create professional data visualizations, develop R packages, or implement Shiny web applications, this book provides the tools you need. Suitable for those already familiar with basic R programming and fundamental concepts.

This week's episode explores the possibilities of extracting novel insights from the many great social web APIs available. Matthew Russell's Mining the Social Web is a fantastic exploration of the tools and methods, and we explore a few related topics. One helpful feature of the book is its use of a Vagrant virtual machine. Using it, readers can easily reproduce the examples from the book, and there's a short video available that will walk you through setting up the Mining the Social Web virtual machine. The book also has an accompanying GitHub repository, which can be found here. A quote from Matthew that particularly resonates for me was "The first commandment of Data Science is to 'Know thy data'." Take a listen for a little more context around this sage advice. In addition to the book, we also discuss some of the work done by Digital Reasoning, where Matthew serves as CTO. One of their products we spend some time discussing is Synthesys, a service that processes unstructured data and delivers knowledge and insight extracted from that data. Some listeners might already be familiar with Digital Reasoning from recent coverage in Fortune Magazine of their cognitive computing efforts. For his benevolent recommendation, Matthew recommends the Hardcore History podcast, and for his self-serving recommendation, Matthew mentioned that they are currently hiring for data science job opportunities at Digital Reasoning if any listeners are looking for new opportunities.
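To make "mining a social web API" concrete, here is a minimal sketch in the spirit of the book, though not taken from its repository: it pulls one page of stargazers for a repository from GitHub's public REST API. The repository name in the usage comment is shown purely for illustration.

```python
# Minimal sketch of mining a public social-web API (GitHub's REST API).
# Unauthenticated requests are rate-limited; add a token header for real use.
import requests

def stargazer_logins(owner: str, repo: str, per_page: int = 100):
    """Fetch one page of stargazer account names for a repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/stargazers"
    resp = requests.get(url, params={"per_page": per_page},
                        headers={"Accept": "application/vnd.github+json"})
    resp.raise_for_status()
    return [user["login"] for user in resp.json()]

# Example (illustrative repository name):
# logins = stargazer_logins("ptwobrussell", "Mining-the-Social-Web-2nd-Edition")
```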

Using Flume

How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you'll learn Flume's rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elasticsearch, and other systems. Using Flume shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases. You'll learn about Flume's design and implementation, as well as various features that make it highly scalable, flexible, and reliable. Code examples and exercises are available on GitHub.

- Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers
- Dive into key Flume components, including sources that accept data and sinks that write and deliver it
- Write custom plugins to customize the way Flume receives, modifies, formats, and writes data
- Explore APIs for sending data to Flume agents from your own applications (see the sketch after this list)
- Plan and deploy Flume in a scalable and flexible way, and monitor your cluster once it's running
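One way an application can hand events to Flume is through an agent configured with an HTTP source using the default JSON handler, which accepts a JSON array of events with "headers" and "body" fields. The sketch below assumes such an agent is listening on localhost:5140; the host, port, and header values are assumptions, not code from the book.

```python
# Sketch of sending events to a Flume agent's HTTP source, assuming the
# agent runs an HTTPSource with the JSON handler on localhost:5140.
import json
import requests

def send_events(lines, url="http://localhost:5140"):
    """POST a batch of log lines to a Flume HTTP source as JSON events."""
    events = [{"headers": {"source": "webapp"}, "body": line} for line in lines]
    resp = requests.post(url, data=json.dumps(events),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()

# send_events(["GET /index.html 200", "GET /missing 404"])
```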

Apache Sqoop Cookbook

Integrating data from multiple sources is essential in the age of big data, but it can be a challenging and time-consuming task. This handy cookbook provides dozens of ready-to-use recipes for using Apache Sqoop, the command-line interface application that optimizes data transfers between relational databases and Hadoop. Sqoop is both powerful and bewildering, but with this cookbook's problem-solution-discussion format, you'll quickly learn how to deploy and then apply Sqoop in your environment. The authors provide MySQL, Oracle, and PostgreSQL database examples on GitHub that you can easily adapt for SQL Server, Netezza, Teradata, or other relational systems.

- Transfer data from a single database table into your Hadoop ecosystem
- Keep table data and Hadoop in sync by importing data incrementally (see the sketch after this list)
- Import data from more than one database table
- Customize transferred data by calling various database functions
- Export generated, processed, or backed-up data from Hadoop to your database
- Run Sqoop within Oozie, Hadoop's specialized workflow scheduler
- Load data into Hadoop's data warehouse (Hive) or database (HBase)
- Handle installation, connection, and syntax issues common to specific database vendors
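Sqoop recipes are command-line invocations; as a hedged sketch of the incremental-import idea, the snippet below drives a typical import from Python. The JDBC URL, credentials, table, and target directory are placeholders; the flags shown (--incremental append, --check-column, --last-value) are standard Sqoop options.

```python
# Sketch of driving a typical Sqoop incremental import from Python.
# Connection details and paths are placeholders, not cookbook recipes.
import subprocess

def incremental_import(last_value: int) -> None:
    """Append rows with order_id greater than last_value into HDFS."""
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/shop",   # placeholder JDBC URL
        "--username", "sqoop_user",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--incremental", "append",
        "--check-column", "order_id",
        "--last-value", str(last_value),
    ]
    subprocess.run(cmd, check=True)

# incremental_import(last_value=0)  # first run imports everything
```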

Data Source Handbook

If you're a developer looking to supplement your own data tools and services, this concise ebook covers the most useful sources of public data available today. You'll find useful information on APIs that offer broad coverage, tie their data to the outside world, and are either accessible online or feature downloadable bulk data. You'll also find code and helpful links. This guide organizes APIs by the subjects they cover, such as websites, people, or places, so you can quickly locate the best resources for augmenting the data you handle in your own service. Categories include:

- Website tools such as WHOIS, bit.ly, and Compete
- Services that use email addresses as search terms, including GitHub (see the sketch after this list)
- Finding information from just a name, with APIs such as WhitePages
- Services, such as Klout, for locating people with Facebook and Twitter accounts
- Search APIs, including BOSS and Wikipedia
- Geographical data sources, including SimpleGeo and U.S. Census
- Company information APIs, such as CrunchBase and ZoomInfo
- APIs that list IP addresses, such as MaxMind
- Services that list books, films, music, and products
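As a hedged example of the email-as-search-term category, the sketch below queries GitHub's user-search endpoint for accounts whose public profile email matches a given address. The email is a placeholder, and results depend on what users have made public.

```python
# Hedged sketch: look up GitHub accounts by a (placeholder) email address
# using the public user-search API with the "in:email" qualifier.
import requests

def github_logins_for_email(email: str):
    resp = requests.get("https://api.github.com/search/users",
                        params={"q": f"{email} in:email"},
                        headers={"Accept": "application/vnd.github+json"})
    resp.raise_for_status()
    return [item["login"] for item in resp.json().get("items", [])]

# github_logins_for_email("someone@example.com")
```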

Agentic DevOps with GitHub Copilot

Our very own (not so secret) agent, Martin Woodward, takes us through the latest developments in GitHub Copilot with a deep dive into all the announcements from the keynote. You will not only learn how to get started with all the latest and greatest AI-enhanced development features across VS Code and GitHub, but also how to take full advantage of them in your day-to-day development work.

Capgemini introduces the Agentic Industry Studio, combining deep industry expertise with Microsoft's agentic platforms to turn AI ambition into measurable impact, fast. You'll see how multi-channel knowledge creation and domain-grounded agents orchestrate work across Microsoft 365, GitHub, and Azure, and how the AI-Powered Service Desk elevates customer and employee experiences with intelligent, proactive operations. Built for enterprise scale, with "human–AI chemistry" felt across sectors.

Be more productive in your SAP environment with the ABAP AI model in VS Code

In the age of AI, SAP supercharges its portfolio with natively embedded AI features, AI agents, access to 40+ LLMs, and tools for developers to increase productivity. Developers find GitHub Copilot in VS Code invaluable for assisting with writing code in popular languages. SAP's domain-specific Advanced Business Application Programming (ABAP) language lets developers work within the SAP ecosystem by writing ABAP code. SAP is now making a fine-tuned ABAP AI model available to help developers.

Elevate DevEx 2.0 with continuous security across the SDLC

DevEx 2.0 means giving developers guardrails that accelerate delivery instead of slowing it. This session shows how continuous security integrates with Azure and GitHub pipelines. You’ll see IDE coaching and SAST at commit. With Microsoft Copilot and Azure OpenAI using GPT-5, developers can cut through false positives and receive actionable fixes directly in the code. Builds include SBOM generation, container scans, and dependency checks. Staging environments add DAST and API testing.

Learn to leverage agent-framework, the new unified platform from the Semantic Kernel and AutoGen engineering teams, to build A2A-compatible agents similar to Magentic-One. Use SWE agents (GitHub Copilot coding agent and Codex with Azure OpenAI models) to accelerate development. Implement MCP tools for secure enterprise agentic workflows. Experience hands-on building, deploying, and orchestrating multi-agent systems with pre-release capabilities. Note: Contains embargoed content.

Power Agentic Access. Govern Non-Human Identities

AI agents run on non-human identities: service principals, managed identities, tokens. If you can't see them, you can't govern them. Oasis provides Microsoft-native guardrails: discover every agent across Entra, Azure, M365, and GitHub; right-size roles; eliminate long-lived secrets; and automate lifecycle management with owners, purpose, TTL, rotation, and evidence. Same developer speed, far less standing privilege.