Talking Drupal #409 - Data Lakes

July 31, 2023

Today we are talking about Data Lakes with Melissa Bent & April Sides.

Topics

What is a data lake
Does it have to be NoSQL
How do organizations use data lake
How does RedHat use the data lake
How do you maintain it
How do you make changes to the data lake
Who manages Mongo
How big does it have to be to be considered a data lake
Why not Solr
What Drupal modules
Communication of changes
Gotchas?

Resources

Module of the Week

Tagify

Brief description:
- Provides an entity reference widget that’s more user friendly: visually styles as tags (without showing the reference ID), drag to sort, and more
Brief history
- How old: created in Mar 2023
Versions available:
- 1.0.19, which works with Drupal >8.8, 9, and 10
Maintainership
- Actively maintained, latest release in the past week
Number of open issues:
- 4, one of which is a bug
Usage stats:
- 177 sites
Maintainer(s):
- gxleano (David Galeano), who I got to meet in person at Drupal Dev Days
Module features and usage
- Tagify is a popular JS library, so this module is a Drupal integration for that
- Features in the module include deactivating labels when the field’s max number of entries has been reached, allowing the creation of new tags when the field has been configured to allow that, and so on
- Will automatically disallow duplicate tags
- Includes a User List submodule specifically for user reference fields, which also shows the user’s profile pic in the tag
- Project page has animated GIFs that demonstrate how many of these features work
- A module I started using on my own blog, nice and simple UX. I could see the drag to sort be really useful, for example if you wanted the first term reference to be used in a pathauto pattern

Transcript

[MUSIC]

This is Talking Drupal, a weekly chat about web design and development from a group of people with one thing in common, we love Drupal.

This is episode 409, Data Lakes.

Welcome to Talking Drupal. Today, we are talking about Data Lakes with Melissa Bent and April Sides.

Melissa is a self-taught web developer specializing in configuring Drupal for large integrated systems. She started building websites in 1996, getting into Drupal in 2006, and is now a senior software engineer at Red Hat, leading the Drupal portions of their customer portal.

She currently resides in Nampa, Idaho, on a small farm she owns with her parents and collects plants, Lego, and various IoT and other practical DIY tech projects.

Melissa, welcome to the show and thanks for joining us.

Thank you. Happy to be here.

April is a backend developer for developers.redhat.com, cloud.redhat.com, and, hard to say, kubernetesbyexample.com. She is also heavily involved in Drupal Camp Asheville (thank you for that), the A11yTalks virtual meetup, and the Drupal Community Working Group Community Health Team. She is based in Asheville, North Carolina, and is also into Lego and Harry Potter.

April, welcome to the show. Again, welcome back to the show and thanks for joining us.

Thanks for having me again.

I know, listeners, that you're thinking, "Hey, two Lego guests with John, and I understand Tim likes a little bit of Lego too. We're probably going to spend a lot of time talking about Lego today." So brace yourselves, it should be fun. For those of you that don't know or haven't listened before, I'm John Picozzi, Solutions Architect at EPAM, and today, as I alluded to, joining me for his second week is Tim Plunkett, Engineering Manager at Acquia.

Tim, welcome. Thanks for having me again. Absolutely. So if you are signed up for our newsletter, in the newsletter for episode 408 you can learn a little bit more about Tim in Five Questions with Tim, and you can find out if Tim prefers plain or peanut M&Ms. I'm really hoping it's peanut M&Ms so that we have that in common, but don't tell anybody, they have to read the newsletter to find out. No spoilers.

Perfect.

And joining us as usual, Nic Laflin, Founder at nLightened Development. Hello, Nic. Happy to be here. And now, to talk about our module of the week, let's turn it over to Martin Anderson-Clutz, a Senior Solutions Engineer at Acquia and maintainer of a number of Drupal modules of his own. Martin, what do you have for us this week?

Thanks, John. This week, I thought we would talk about the Tagify module, which provides an entity reference widget that's more user friendly. So it has the selections visually styled as tags without showing the reference ID. You can drag to sort and sort of a variety of other features.

It's a module that was created in March of 2022. It has a 1.0.19 version, which works with Drupal 8.8 or newer, including Drupal 9 and 10. It is actively maintained. In fact, the latest release was pushed out in just the past week.

And in terms of open issues, it only has four open issues, and only one of those is a bug. And it does also include test coverage in the module. Now, it's a module that is currently in use by 177 sites. The maintainer is gxleano, who is David Galeano, a person that I actually got to meet in person at Drupal Dev Days last week.

Now, Tagify is actually a popular JavaScript library, so it's in use by a lot more sites than just that Drupal number. And the module then is really just a Drupal integration for that library.

Features that you get with the module include deactivating labels when the field's maximum number of entries has been reached, allowing the creation of new tags when the field has been configured to allow that, and so on. Now, it will also disallow duplicate tags automatically, and it also has a user list submodule specifically for user reference fields, which will also show the user's profile pic in the tag that it generates. The project page also, as sort of a handy reference, has animated GIFs that sort of demonstrate a lot of these capabilities, so it's definitely worth checking that out. I'll also add that this is a module I started using on my own blog a few months ago, and find that it provides a really nice and simple user experience. So I could also see that drag-to-sort capability being really useful, for example, if you were using the first term in a reference field as part of your Pathauto pattern. So that being said, let's open it up to discussion about the Tagify module. So I'm going to be Debbie Downer here and ask the first question.

Does it, if for some reason JavaScript is blocked or disabled, does it have a reasonable fallback, or does it just fall back to default Drupal?

That's an excellent question. I will confess I have not tested that, but maybe one of our listeners can chime in on the Talking Drupal Slack channel and let us know.

Yeah, I don't know the answer to that either, but I actually have used this module. I cannot remember which project it was on, but I remember installing it, I think sometime last year. It was easy to configure, and I think pretty intuitive for clients. I think one of the things that it provides, there's two things I think that it provides that users are starting to get used to on other projects that makes it kind of seem a little snappier, and that is the styling of the tags, like you get that little box with the X next to it, and I think people are starting to get used to that. I don't even think the client that I installed it for, I'm going to have to find out who it was, but I don't think they even use the drag and drop all that much because for most of my clients with tagging systems, order doesn't really matter. It's more of like, is it tagged or not?

So I don't think they use that functionality, but yeah, it was a pretty neat little module, pretty easy to work with. Yeah, I've also used it, and I think, I don't know, the UX benefits are definitely a big win, so I recommend that, and as Martin did say, if anybody has tried it without JS, I'd be interested to know if and how that works. I have to say, not on Tagify, but the thing that I noticed today: I was comparing a couple of other modules for allowing users to log in by email, and I'm curious if anybody else has had this. While I was evaluating them, I heard Martin's voice in the back of my mind, looking at like, if this module has 5,000 users and three open issues, two of which are bugs.

So you've definitely changed how I evaluate modules, so I appreciate that actually helped me make a selection today. Well, I will say I have many voices in my head, but Martin's is not often one of them.

I think something I like about this, I mean, these from the screenshots, because I've never used this one. I think the more you work with Drupal, this is a common problem that people try to solve is to make tagging easier to use without having to resort to the multivalued Drupal like individual field for each thing.

I've never seen one that allows you to edit, like click to edit a term that you've already entered. I didn't even notice that. That's a pretty cool option, because like what if you just misspelled a tag and you're like, "Oh no, but I already entered it," or whatever, like that's kind of cool. I've never seen it. So is it just me, or does an animated GIF on a module page sell a module way better than just regular images, right? Tim's giving me the thumbs up. Like any time I see an animated GIF, I'm like, "All right, we're doing it. We're putting it in there." You know, it could be like, you know, the most ridiculous thing, but like just gets me going. Going off on that tangent, I actually learned recently that you can make animated GIFs using Keynote on Macs pretty efficiently. So I can drop a tutorial about that in the show notes.

So along those same lines, another tool that you can use is Giphy. They actually have a tool that you can download. I had to make an animated GIF for something last week. And yeah, I was able to download it, I think from the App Store and install it. And it worked pretty well, captured what it was supposed to capture, and allowed me to kind of resize it a little bit too. So yeah, there you go. Not only is it a module of the week, but we're also giving you ways to make animated GIFs.

All righty. - I would be curious to know the accessibility of this form. It looks really cool, but I don't know.

It'd be interesting to know about that. - I will also add that I've been on projects where the customer really wanted to remove the numbers and brackets after like a standard entity reference field. And the fact that this does this, I think also adds to the user experience as well. - I feel like we could have a whole podcast about that one. Like that was the one thing that they were like, "Nope, simply cannot launch this site until those numbers are gone from that field."

I'd be interested to hear more about that at a later date. - All right. Well, Martin, as usual, thank you for bringing us a wonderful module of the week and we will talk to you again next week. - Thanks, John. - See you then.

- All righty. Let's move on to our primary topic, which is talking about data lakes.

I'm gonna try not to make any of the dumb dad jokes about like, "It's summertime, let's go to the data lake." Oh, there I go. I just did it. All right, let's start with the easy, hopefully an easy question. What is a data lake? - I would say a data lake is a raw and flexible data store with schema on read capabilities.

- Oh boy. - A neat fun fact, something that I just looked up today is the history of it is that, I guess why it's called a data lake is because it came out whenever Apache Hadoop, I guess that's how you pronounce it, H-A-D-O-O-P, had the yellow elephant logo and it was something about the yellow elephant and the watering holes. I thought that that was interesting. I didn't know about that before this podcast.

But yeah, that's why it's called a data lake. - You said schema on read, does that mean like only NoSQL, data lakes can only be NoSQL type databases? - It's not a relational database. It's not a relational database. The way that we're using it though, we are defining schema as we're indexing content and we can get into that later when we're talking about how we're using it. So it's not that you can't put a schema on it when you're indexing, putting data into it. You can also do a schema on read, I don't know. I'm not really sure how that piece would work.

It's not part of what we've been working with. But it's interesting, an interesting feature of a data lake.
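As a rough illustration of the schema-on-read idea mentioned above, here is a minimal Python sketch. It is not how Red Hat's data lake works; the record shapes, the `ingest` and `read_with_schema` helpers, and the field names are all hypothetical. The point is only that nothing is validated when a record is stored, and a shape is imposed when the record is read.

```python
import json

# Raw records as they might arrive from two different systems.
# Hypothetical data: no shared shape is enforced at write time.
RAW_RECORDS = [
    '{"id": 42, "title": "Intro to containers", "site": "developers"}',
    '{"id": "abc-7", "name": "Cluster basics", "site": "cloud", "extra": true}',
]


def ingest(store, raw):
    """Store the record as-is; nothing is typed or validated on write."""
    store.append(json.loads(raw))


def read_with_schema(store):
    """Apply a (hypothetical) schema only at read time."""
    for doc in store:
        yield {
            "id": str(doc["id"]),  # numeric and string IDs both become strings
            "title": doc.get("title") or doc.get("name", ""),
            "site": doc.get("site", "unknown"),
        }


lake = []
for raw in RAW_RECORDS:
    ingest(lake, raw)

normalized = list(read_with_schema(lake))
```

The stored documents keep whatever fields they arrived with (the second one still has `extra`), while every reader that goes through `read_with_schema` sees the same three keys.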

- So at its core, I guess it's a set of data, right? And a schema to basically get to that data or use that data, right? If we simplify it, right? - Yeah. - Cool. - Beyond the standard, like when I first started using Drupal, I remember being like, how am I supposed to figure this out? - I don't know. And that for whatever reason, it never dawned on me that I could look at the database

when I was a new developer. Because I'm a self-taught developer, right? So I was just figuring things out as I went. And I mean, the things that I did when I did migrations, you don't even wanna know how much I used Excel to import things.

I didn't know about auto incrementing IDs or any of those things. To my shame, the first Drupal site I ever built, the auto-increment IDs and the revision IDs were off by one because I didn't do my auto population in Excel properly.

So for the rest of the life of that site, they were always off by one because we didn't have revisioning on. Anyway, that kind of stuff. If you look at the database of Drupal now, we get comfortable with that concept. And I know that there are different backends people use with Drupal, but the most common one is MySQL, "my sequel."

Again, I don't hear these things spoken aloud often. I read them a lot. But the difference here, the main difference here is that it's very flexible and it doesn't really care about how you put it in. It just says it's here, right? Versus it being very opinionated about this has to be this. Like it's not typed. Like you don't have to say, this is this long and it has to be an integer. Like it doesn't care about that kind of stuff. It just says, here's your data. - I think one of the other key features of it too is you can source from multiple places so they don't all have to match. Like if you have one primary source that has IDs that are alphabetical and another one that has them that are numerical. And even if they conflict, you can put all those in the data lake and access them separately or work with them together.

It's just basically as the information comes in, it gets stored. It doesn't matter what that information is or how it's formatted.

It means that putting information into it is very easy.

Sometimes pulling information out can be a little bit more complex because you have to end up, it puts a lot of that validation on either the application or some layer between the data lake and the application.

But it makes it very easy to just ingest data, I think. - Yeah.

- How would something like this compare to a NoSQL database like Mongo?

- So we actually are using Mongo for-- - Answer the question. - Yeah, yeah. So, but it's actually very similar. And well, I mean, because you can use different types of backends for data lakes, but the concept is almost exactly the same. And for us, it's exactly the same. - Hmm.

So how, you know, I'm curious obviously about Red Hat in particular, but how do organizations typically use the data lake and maybe how is Red Hat using it for your project specifically?

- So I'd say for us, one of the ways that we use it is to, or let's say the reason that we use it. So the reason that it came up as a topic is because we have so many different systems that need to be separate for various reasons, whether it's this team has their own system, we have ours, whether it's like, we think in Drupal because we're building in Drupal most of the time, but you know, we have other teams that are building in like a homegrown solution or they have like a completely JavaScript application that they've built that they're using Mongo as a backend, possibly for their entire application.

So we have all these different technologies. The thing that ends up happening is you get, because you have all these different reasons for these systems, but you need them to be able to talk to each other or have other systems use their data. That's where the data lake comes in and it's really, really valuable because it becomes, I mean, April, that's such a great thing that you found the watering hole. Like they all come together in one place. They might have various backgrounds or various different needs, but in the end, they're all coming to the same place to get data. - All your website animals come to the same watering hole.

- Exactly. - And that really helps with like making sure your data is consistent, right? So if we, you know, if Red Hat is a product company, we wanna make sure our product information is accurate across all of these different pieces of the ecosystem, having it in one place and having it edited like by a single source, then it should be more consistent. It shouldn't be a manual process to make things consistent across the ecosystem. - So that raises a little bit of a question for me and maybe we'll talk about this and how Red Hat is specifically using a data lake, but I wonder if it's a central place for data, right?

And that schema is being applied typically when the data is being read.

Who can write the data, right? To that data lake and then is there a schema that needs to be applied there as to how they're writing it?

- Well, the data lake is definitely bring your own governance, right? You have to provide all those things. So for my use case, we don't have a single source. We have multiple sites. We're using it for learning paths. So like being able to index data that can be a part of a learning path, which is just a collection of like information to help you learn about a topic or product or something like that. We just wanted it to have like basically a pool or a lake of resources that you could choose from from different sites. And so the way that we are making sure that the data is what we expect is that we have a shared module that's installed on the Drupal sites that are currently using this. And right now it's developers.redhat.com and cloud.redhat.com. And so they both, the module contains the schema. And so when we're indexing, the schema is followed and the shared module has all this, you know, similar ways of retrieving the data and it should act in similar ways. We have all those expectations sort of baked into the shared module specific to learning paths. And then Melissa created an indexing module, which I think that's the one that you shared as a sandbox project, right? So I'll let her talk a little bit about how we're indexing data, because it's pretty cool.

- Yeah, so it's actually internally, it's called Red Hat Index because we already had a Red Hat search module, and naming things. So Red Hat Index it is, but it basically integrates with Search API. So because Search API already has a lot of those things figured out. And I thought, I don't want to have to rebuild this from scratch and I shouldn't have to. Open source, yay, right? So basically we built on top of Search API. And there are a couple of little gotchas in there, like at least the way that we're using it. And I haven't had time to figure out why or to work around it, but it expects, if you're using the UI to map your incoming values, if you wanted to define your schema directly through the Search API interface, it requires that all the keys are lowercase.

And our team wanted them to be camel case.

So I actually do a transform in my event subscriber that formats them the way we want.

It's really simple. It's like strip out the underscores, and wherever you see an underscore, make it camel case instead, and it works, it's fine.
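The key transform described above can be sketched in a few lines. This is a Python illustration only (the module discussed here is a Drupal module, so the real transform is not this code), and the function names `snake_to_camel` and `camelize_keys` are made up for the example. It assumes keys arrive as lowercase words separated by underscores.

```python
def snake_to_camel(key: str) -> str:
    """Turn 'product_name' into 'productName'."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value):
    """Recursively rename dictionary keys; other values pass through unchanged."""
    if isinstance(value, dict):
        return {snake_to_camel(k): camelize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value


document = {"product_name": "Example", "related_items": [{"item_id": 1}]}
camelized = camelize_keys(document)
```

Running the transform just before the document leaves the indexing system keeps the lowercase keys that the mapping UI expects while handing downstream consumers the camel case keys they asked for.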

So it's not a big deal. It's just like the little gotchas that you find when you're working with it. And initially I was making them all like standard, you know, lowercase underscores, kind of like a very Drupal approach to things.

But our team who is not, they are not Drupal people, who handle our GraphQL layer that sits on top of the data lake for extraction of data. They were just like, why does it look like this? And I was like, because.

They were like, can it be camel case? I'm like, yes, it can. So I made it. - Hopefully that's well documented somewhere. - Yeah. - Yeah, so I have it, in the Red Hat Index module that I made, I have like a sample event subscriber with, basically, lots of comments in there about why I did what I did and why things are happening. We also have sample schemas. We have like a Google spreadsheet that we use for when we onboard new projects into the data lake, because it's small right now. So we're just, it's a lot of ad hoc stuff going on because we just launched the data lake architecture like late last year. We just launched production values using it earlier this year. And like April, I think your learning paths, when was that launched? Was that launched earlier this year? Or is it late last year? - On developers, it launched, I think in fall, like I think in September of last year. And then we integrated with cloud this year. - The best thing about this is that I was doing all this documentation and building and everything of that. And she launched hers before I got to launch mine. So she actually used all the work that I, not all, I shouldn't say that. I'm not saying she didn't do any work. That's not what I'm saying. What I'm saying is she used the groundwork that I laid, put her implementation on top of it and actually launched hers for production before mine actually got out the door.

But it was really great to see it in use early on. So, but yeah, we have to do, we have to actually define the schema and the way I did it was through an event subscriber that I trigger through my custom Red Hat Index module. And it triggers the event subscriber right before it goes out the door, right before it goes to Mongo. And it lets you do basically whatever you want, whatever massaging of the data, because I think we're all experienced enough with Drupal to know that if you're too opinionated, then you end up finding a use case that you're just fighting against Drupal the whole time. And I didn't want that to happen because I didn't know how people were gonna use it. So I was like, I'm just gonna make it very basic so that we can just get it out the door.
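The "hook right before it goes out the door" pattern can be sketched generically. The Python below is only an analogy for the event subscriber described above, not the Red Hat Index module itself; `PreIndexDispatcher`, the two sample subscribers, and the `sink` list standing in for the data lake are all hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Subscriber = Callable[[Document], Document]


class PreIndexDispatcher:
    """Runs every registered subscriber, in order, just before a write."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, subscriber: Subscriber) -> None:
        self._subscribers.append(subscriber)

    def dispatch(self, document: Document) -> Document:
        for subscriber in self._subscribers:
            document = subscriber(document)
        return document


def add_source_site(document: Document) -> Document:
    # Hypothetical subscriber: stamp each document with its originating site.
    return {**document, "sourceSite": "developers"}


def drop_internal_fields(document: Document) -> Document:
    # Hypothetical subscriber: strip keys that should never leave the site.
    return {k: v for k, v in document.items() if not k.startswith("_")}


def index_document(document: Document, dispatcher: PreIndexDispatcher, sink: list) -> Document:
    prepared = dispatcher.dispatch(document)
    sink.append(prepared)  # stand-in for the actual write to the data lake
    return prepared


dispatcher = PreIndexDispatcher()
dispatcher.subscribe(add_source_site)
dispatcher.subscribe(drop_internal_fields)

sink: list = []
result = index_document({"title": "A", "_debug": 1}, dispatcher, sink)
```

Because the dispatcher has no opinion about what subscribers do, each site can reshape its own documents without the indexing code needing to change.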

- Makes sense.

So I'm a little bit more curious about the Drupal integration side then. So it sounds like you kind of use Search API to kind of feed information into it, but are you also pulling information out of the data lake for Drupal and how does that work? Is it Views, custom integration? How's that work?

- Yeah, so that was a task to figure out how to get the data out of Mongo.

At the time, so Melissa's work was, they were working on GraphQL integration as well. So a GraphQL layer on top of MongoDB to do all your queries. And then her information was gonna be in sort of single page apps.

So not specifically in Drupal. And so we created like a basically a base service that can query MongoDB directly using the PHP MongoDB drivers.

And so we have a way to do the queries, like find and findOne, I think are the two that we're using.

And then, we are, I think, what was the, there's a module that I was using. I'm not gonna remember what it is. It was External Data Source.

So that allows us to query MongoDB using the PHP MongoDB driver and do an autocomplete field. So when they're looking to, we're creating basically a resource container so that they can add extra information to it. They can tag it for their particular site. They can add a little extra data. And so we have a node or a content type called a resource, and that has a text field that will hold a unique identifier that is in MongoDB for that resource in the data lake. So this External Data Source module allows us to query and do an autocomplete thing. It's a little hacky. Whenever you select the title, or whatever, you can search by title and you can see what site the information's coming from. And once you select it, it just drops that UUID in there. We need a more robust system so that people can tell like what resource is being referenced here. So we'll do some iterations there, but that was a pretty lightweight way to get that data and make those connections and get those UUIDs set up for when you view the resource, either as standalone, which is more for editing purposes, or in the context of a learning path, which is more how we're showing it. So we have like the hero and the sidebar and then previous next buttons and things like that. When it's displayed, it's pulling all the data, like the image, if you go to the website and check out any of the learning paths on developers or cloud, it's pulling a little image, it's creating a button, it's pulling all the page text maybe from an article or something like that. That's all being shown from the data lake, but it's being pulled in and put into the pre-process. So like we do the query for that UUID, we add it to the render array and then our templates know what to do. We can do whatever we want with that data and render it out per site. However, that site wants to show that data.
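To make the lookup flow above concrete, here is a small Python sketch. The real sites use Drupal with the PHP MongoDB driver; this example instead uses a tiny in-memory stand-in whose `find` and `find_one` methods only imitate the idea of those queries. The class, the helper functions, the `lake_uuid` field name, and the sample documents are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryCollection:
    """Tiny stand-in for a document collection; not a real MongoDB driver."""

    def __init__(self, documents: List[Dict]) -> None:
        self._documents = list(documents)

    def find(self, query: Dict) -> List[Dict]:
        # Exact-match filter on every key in the query; {} matches everything.
        return [d for d in self._documents
                if all(d.get(k) == v for k, v in query.items())]

    def find_one(self, query: Dict) -> Optional[Dict]:
        matches = self.find(query)
        return matches[0] if matches else None


def autocomplete(collection: InMemoryCollection, text: str) -> List[Tuple[str, str]]:
    """Search by title; show the source site in the label, return the UUID."""
    needle = text.lower()
    return [(d["uuid"], f'{d["title"]} ({d["site"]})')
            for d in collection.find({})
            if needle in d["title"].lower()]


def build_render_array(resource_node: Dict, collection: InMemoryCollection) -> Dict:
    """Look up the stored UUID and attach the lake data for the template."""
    render = {"title": resource_node["title"], "tags": resource_node.get("tags", [])}
    doc = collection.find_one({"uuid": resource_node["lake_uuid"]})
    if doc is not None:
        render["lake"] = {"image": doc.get("image"), "body": doc.get("body"),
                          "site": doc.get("site")}
    return render


collection = InMemoryCollection([
    {"uuid": "u-1", "title": "Intro to containers", "site": "developers",
     "image": "/img/a.png", "body": "<p>Hello</p>"},
    {"uuid": "u-2", "title": "Container security", "site": "cloud",
     "image": None, "body": "<p>Safe</p>"},
])

suggestions = autocomplete(collection, "contain")
node = {"title": "Containers 101", "lake_uuid": "u-1", "tags": ["beginner"]}
render = build_render_array(node, collection)
```

The local node holds only the identifier plus site-specific extras such as tags; everything else is fetched at display time, which is why each site is free to render the same resource differently.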

- One of the things that I'm working on right now, I actually just had a meeting about it yesterday and I mentioned it in the DrupalCon talk we did is like shared content or like, I was calling it content syndication, but that sounds like RSS feeds, which is not what it is. But actually, Tim, it's using Layout Builder to build content. Hey, yeah, we have a module called Patternkit, which is on drupal.org. Yes, so it sits on top of that. So you can apply patterns to your output. And then what we're doing is we're actually, well, I mean, what we will be doing, this is what I have to do in the next sprint or two is to do this for our product implementation, is to basically make it so that people can build like CTAs or banners or whatever in the middle of the content. And then they'll build it, it'll index into the data lake fully rendered. So like all the HTML will be there and then we'll have metadata attached to it about like what product is associated with whatever metadata needs to go into it, because schema, so we'll put that in the schema. And then what that allows us to do is when my frontend developer does his GraphQL query, he'll do it against this part of the data lake and it'll pull in all of the appropriate pieces of content that go with it and inject them into the page. And then with Drupal, you can use like Scheduler. So then our content editors don't even have to touch anything on the frontend at all. They can actually like schedule content, get all their stuff ready to go. If there's an event, if there's something they're trying to push, whatever. It stores the fully rendered Layout Builder driven work. And actually just injects it at render time, which for my application is a statically generated page. So we have like a pipeline that statically generates the pages and it shows them. - So this show is gonna pair well with our episode from two episodes ago, 407, where we talked about Drupal search in quite a bit of depth.
But just so you know, Melissa, there's a core bug currently with search tape. I think it's also in search API, but when you are indexing rendered content,

BigPipe, there's some conflict with sessions being started out of sync and headers being sent previously.

I'll link them in the show notes again, since I just mentioned it again, but you might run into some issues with indexing rendered content. Your logs are going to fill up pretty quickly.

- Good to know. - So I'll link those, but yeah, it's an ongoing issue right now. I think it might be more prevalent in 10.1

because it started happening on a site that upgraded to 10.1. - With the, it sounds like you definitely have the output side of it lined up pretty great. How is the experience of getting things into the lake?

What does that process like and how does it actually work?

- For my side, it's just like a search index. So like Solr, yeah. - Right, but what exactly, I mean, without, what can you describe? I mean, April mentioned the kinds of things that are being put in there, but like how are they being generated? You know, like where are they coming from, really?

- Right, so right now it's Drupal centric. So all of our content in the data lake that we have today is coming from a Drupal source. - Makes sense, okay. - In the future, as we onboard more projects, we'll have alternate sources coming in because we do have other systems within Red Hat that want to utilize this ecosystem.

But we're starting off with Drupal because that's what we work in, but also because we have the module, the shared module ready. - Of course. - So yeah. - So right now it's an abstraction layer that may or may not need to be there because it's all Drupal, but then allows you to do all these other things in the future and move away from that. - Right. - Drupal's the only type of canonical backend. All right, that sounds awesome. - I actually called it, at DrupalCon, I called it basically our, it's basically our presentation layer for our product site because the product site actually has no, the Drupal instance doesn't actually have a public facing URL at all. It's decoupled. - It's just a content store. - Yeah, it's the content store. And it handles the permissions, all of the admin aspects of the content are stored and handled by Drupal. But the rendering portion is basically the actual schema, like the JSON object that has a schema applied and is pushed to the data lake. - And I don't think you've said, Melissa, what your use case is, you're doing product data.

In particular, right? - Yeah, that's like, this is, I started at Red Hat in April of 2020. - 2020? Oh my gosh, how long has it been? - I think 2020, I don't know. It's been a long time.

2021, whatever, anyway, it doesn't matter. So I started, it's been like over two years that I've been working on this particular dataset because it's something that, it's the classic thing where the need outpaced the architecture where everybody needed something now or needs something today. And they did what was necessary for their application. And then now we're having to go back and build something that standardizes the approach, which just always takes longer.

And so we're having almost weekly meetings about this still,

trying to standardize some of this approach. But the data lake approach that I took, that I generated was basically my way of, well, here we are, we have this available as a possibility. If you guys, if we don't agree on that being the data source, then we will then instead generate or ingest an ID from them. Because the idea that, the biggest idea that I had was

to have a single ID for products. So that way, when we're doing these kinds of matches between our various systems that we know what we're talking about. Because if you just do name matches, it's not good enough. Everyone knows that a name can change. It just doesn't work consistently.

So that's the kind of concept that we're working through right now.
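A short Python sketch shows why matching on a single product ID is more dependable than matching on names. The catalog entry, the documentation pages, the IDs, and both helper functions are hypothetical examples, not Red Hat data.

```python
# Hypothetical records from two systems that describe the same product.
CATALOG = [{"product_id": "prod-001", "name": "Example Platform"}]

DOC_PAGES = [
    # This system still carries an older spelling of the product name.
    {"product_id": "prod-001", "product_name": "Example Platform 1", "title": "Install guide"},
    {"product_id": "prod-002", "product_name": "Other Tool", "title": "Release notes"},
]


def pages_by_name(product, pages):
    """Fragile: silently breaks as soon as either side renames the product."""
    return [p for p in pages if p["product_name"] == product["name"]]


def pages_by_id(product, pages):
    """Stable: the shared ID stays the same when display names change."""
    return [p for p in pages if p["product_id"] == product["product_id"]]


by_name = pages_by_name(CATALOG[0], DOC_PAGES)
by_id = pages_by_id(CATALOG[0], DOC_PAGES)
```

In this example the name match finds nothing because the two systems spell the product differently, while the ID match still returns the install guide.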

But yeah, I was working with product data. We basically took over access.redhat.com/products. So everything after that sub path is running off of our data lake data. And it's all statically generated, like I said, through a pipeline. Our front end developer does a query against our GraphQL endpoint, pulls the data in, builds it in a Nuxt application, generates static pages and presents it.

So that's constantly being iterated upon as we go. The next thing, like I said, is going to be those kind of inline, it's not ads. It's more like call outs, right? Just like, hey, there's this new thing, whatever.

So much of our front end needs to be dynamic. And as much as Drupal, Drupal can do amazingly powerful things.

All the different things that we're pulling from, it made sense to use a decoupled approach for us.

I know decoupled is kind of like, there are people who are like, why are you using it? Should we use it? And you should absolutely ask those questions. For us, it was yes, but it's not yes for everybody. And it shouldn't be yes for everybody, right? - Yeah, exactly.

- So one question that keeps coming to my mind as I'm hearing you talk about this is,

like you have this data lake, data goes into it from various sources, data comes out of it to various sources, right?

And this question is kind of like a two parter, I guess. I'm wondering how you make changes to the data lake, right? So I'm imagining like, hey, there needs to be a schema update to the data lake for some reason, right? So like, and when I say changes, like changes to like the schema of the data lake, but also changes to maybe certain types of data that are stored within it. So like thinking like bulk updates, syncing of similar data that's been entered from multiple sources, that sort of thing. Like, so let's break it down a little bit. Let's first ask the question of like, how do you make changes to like the schema of the data lake for future enhancements or because you realize that change is needed?

- April, you wanted to take that? I think you're smiling harder. - Yeah. - We both on you at the same time. - I was actually gauging who was smiling harder to figure out who I was gonna pick. - This is how April and I fight. We're like, we have to fight. It's a smile down.

- Yeah, I mean, I would say that's where the power of Search API comes into play, where you can queue all of your content up for re-indexing. You can clear the data from the site or through Search API. You can clear the data in the data lake for a particular site and then re-index everything. Like, I think that really the ability to like not have stale data or have data rot in the data lake is because we're using Search API, because we have those tools that are built in with that contrib module,

that makes it a lot easier to make those updates. I mean, yeah, if you start getting lots of data and you have to index all of it, and if we're seeing the problem that Nic has just mentioned about Search API and all of that, which may be a problem that we found today, and so I'm gonna have to check out that link that you have.

But yeah, like just being able to use those tools that already exist has been amazing. - Real time problem solving. One thing is I just wanna expand on there and then I'll let you jump in, Nic. But so all the sites are using that module. So you actually have the ability if there needs to be some sort of like systemic change to run update hooks and kind of do that stuff at the Drupal level, right? So that makes a lot more sense to me now. And as to like, that's one stop basically for everybody to connect to the data lake. Nic, what were you gonna say?

- Yeah, so it sounds to me like that's kind of a standard process for you because you have just one single source right now. So if you wanted to change the way you're formatting dates, for example, for whatever reason, you would just update that on the Search API side, re-index and you'd be done.

So it sounds like you really haven't crossed the bridge yet where another source is entering data that may be modifying existing data, like appending additional information to an index, or actually not an index, but to a record or something.

Cause that's where it really gets complex. Like if you have two sources, source A creates the thing and then source B modifies that and then you need to kind of change how that works.

- Yeah, so all of ours,

yeah, all of our data has like, it has an origin. Like it knows what site it came from, one. So if there is duplicate data on the two sites right now that are using it, then it's going to be duplicated in the system. It just is the nature of it.
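Because every record carries its origin site, operations like clearing and re-indexing can be scoped to one site. A minimal in-memory sketch of that idea follows; the `site_id` field name and the list-of-dicts "collection" are assumptions for illustration, and the site names `cloud` and `developers` are borrowed from the conversation. With a real MongoDB collection the clearing step would be a filtered delete on the same field.

```python
def clear_site(collection: list[dict], site_id: str) -> list[dict]:
    """Drop only the records that originated from `site_id`."""
    return [doc for doc in collection if doc["site_id"] != site_id]


def reindex_site(collection: list[dict], site_id: str, fresh_docs: list[dict]) -> list[dict]:
    """Clear one site's records, then add its freshly indexed records back."""
    kept = clear_site(collection, site_id)
    return kept + [{**doc, "site_id": site_id} for doc in fresh_docs]


lake = [
    {"site_id": "cloud", "title": "Old cloud article"},
    {"site_id": "developers", "title": "Developers article"},
]

# Re-index only the cloud site; the developers record is left untouched.
lake = reindex_site(lake, "cloud", [{"title": "New cloud article"}])
titles_by_site = {doc["site_id"]: doc["title"] for doc in lake}
```

The same title can legitimately exist under two different site IDs, which matches the point that duplicate data on two sites stays duplicated in the system.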

And so, yeah, so each site has its own ID. It's not gonna blow away other, like if I re-index cloud, it's not gonna blow away the data for developers and vice versa, like it's site specific. And so I think really,

we're kind of using the shared module as our governance, so you have to use the shared module, you have to have the same schema and if we make an update, it's gonna affect all the sites that are using it and being able to manage that. Right now we just have the two sites. And so we haven't had a ton of, I think of managing of that, but yeah. - There's no shared data there, it sounds like, right? So it's not like one, two sites are using one data item from the lake, right? - Right, we have multiple data sources that are feeding in, but they are not overlapping. They're not editing the same data. I think that would be a bad thing. - Yeah, yeah, yeah. - Yeah, we're not going from many to many, for sure. We don't wanna have to solve that problem. But one thing April brought up, which I think is really a cool addition that she actually added into the Red Hat index module itself was this idea of actually using an internal, all of our sites have IDs internally. And so we actually use that as the hash. So Search API has a hash value that lets you determine where something came from. So we use that as the hash and that's when you clear,

when you use Search API to clear an index,

it keys it by that hash so that it doesn't accidentally do everybody's. And that's what April did because I didn't think of it. And when she cleared the index, she cleared all my data out at that point. And I was like, oh, thanks April. Well, you said-- - I did it more because,

well, I did it more because we have, we obviously we have like our dev and our stage and our prod and our locals. And if we're connecting to this data, we pull down a prod database and the hash value is different than our local, then they're not gonna find the data. So it had to be a common hash for the site so that at all these different levels, the data would be found. It would be the same IDs or whatever. Like that was the origin of the site. - It was a good addition. - I don't think, yeah. Your stuff is in a separate collection. That's another thing about this is that we each have our own collection. It's the same MongoDB database, but I have a collection for Learning Paths and she has a collection for products. And so you can connect. My understanding is that we're connecting to our collection and not overlapping or touching each other's stuff. - Yeah, so technically you can. You can do them into the same collection. Actually my initial view of it was that we were gonna use the same collection for everything, but then there's a question of like permissions and you have write permissions to different things. And like, how much do you trust your colleagues aren't going to overwrite your stuff and destroy everything when they have full rights to write everything. And we were just like, trust but verify. So we're gonna go ahead and have collections to separate that out. But one thing, Nic, that you brought up about schema, about updates, we do have this idea of primary data, very data within every schema that we generate. So primary data is gonna be the shared data that's shared across every record, meaning not meaning the content itself, but meaning the key values. So like UUID, created date, updated date,

system or source system, those kinds of things. Those are what every record in the data lake has to have. And then everything else after that is where deviation is allowed, so for your implementation. So if we decide to add new primary key values, that's not primary key, that means something else. Primary data values you wanna track, that's where the shared module will be useful because then we can tell it to pull it in. - That brings up a question for me, Melissa. It sounds like you have different collections, so you can kind of manage your own, I guess, domains, indexes, but who's in charge of Mongo itself?

- Not me, which was actually a goal of mine because as I told my colleagues here, I was like, I don't wanna be the DevOps person for this.

I've worked in IT before, before I was a web developer, I was in IT for a number of years and I'm like, I don't wanna do that again. I don't want the on-call, like 3 a.m. call because something's down or whatever. So one of the things we actually worked out beforehand very early on was who's in charge of this, who's on PagerDuty, who's, you know, so there's actually an infra team here who's in charge of keeping our Mongo instance up and running and they have people on call because international, right? So Red Hat's a large place, so they kind of like hand off who's in charge of it. Yeah.

So that it doesn't have to be me, which was my main goal. - It's a good goal, honestly.

So I asked this in all seriousness, but how big does the lake have to be to be a lake and not a pond or something else? Like when does it start? Is it just- - A puddle? - Yeah, a puddle, a data puddle. - I would consider right now it's more like a data pond.

So, but in all actuality, I don't think it's a question of the size, it's more of the like the intent. So as long as the architecture is set up in such a way that like as we've described with not opinionated,

the schema is at least for us is being applied as it goes in. It's a very light administrative side of things, like it's not trying to tell you what to do basically. Like as long as those things are in place,

then you have a data lake. Like you could have one on your local. - Yeah. So it's more about the approach, the methodology and making sure it could scale to be a data ocean, but. - Yeah. - Exactly. - I was gonna say the scalability. - Yeah, that makes sense. - It's scalable when we need it to be scalable. - Something I brought up at the talk as well was that I, it's just this give and take thing where it introduces a single point of failure for all your data, which can be scary. But it also means if something goes wrong, you only have one place to look. You don't have to look at all these different places where things could go wrong. So it simplifies the troubleshooting process by quite a bit because the architecture itself is very simple. It's a MongoDB with data in it. So, when it comes down to troubleshooting issues, it simplifies the process quite a bit.
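A light schema applied on the way in, combined with the primary data mentioned earlier (UUID, created date, updated date, source system), could look something like the sketch below. The exact field names and the `accept_record` helper are hypothetical; the episode only says those kinds of values must be on every record and that everything else may vary by implementation.

```python
import uuid
from datetime import datetime, timezone

# Assumed names for the shared "primary data" every record must carry.
PRIMARY_FIELDS = ("uuid", "created", "updated", "source_system")


def missing_primary_fields(record: dict) -> list[str]:
    """List the primary fields a record lacks, in declaration order."""
    return [name for name in PRIMARY_FIELDS if name not in record]


def accept_record(record: dict) -> dict:
    """Check only the shared fields; all other keys pass through untouched."""
    missing = missing_primary_fields(record)
    if missing:
        raise ValueError(f"record is missing primary data: {', '.join(missing)}")
    return record


now = datetime.now(timezone.utc).isoformat()

# Two records with different shapes but the same primary data.
product = {
    "uuid": str(uuid.uuid4()),
    "created": now,
    "updated": now,
    "source_system": "products-site",
    "name": "Example Product",
}
learning_path = {
    "uuid": str(uuid.uuid4()),
    "created": now,
    "updated": now,
    "source_system": "learning-site",
    "steps": ["intro", "deep-dive"],
}
incomplete = {"name": "No primary data"}
```

`accept_record(product)` and `accept_record(learning_path)` both pass even though their other keys differ, while `accept_record(incomplete)` raises a `ValueError`.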

- It also means if a source goes down,

the sources reading, the destinations reading the data don't go down, right? So if you have an older legacy system that's kind of fragile, goes down all the time, it just needs to be up as long as people are editing content and indexing content, right? So you can kind of use the data lake as a temporary cache of the data.

- And that's, yeah, it acts as a caching layer. The other thing it can do as well, since you're at least in my instance, my Drupal instance isn't serving traffic.

So resource-wise, we can scale the resourcing for it to be very small because we don't need it to handle incoming traffic. So it can actually, in some ways, if you have these lots of data sources putting data in, it can save you on infra costs, possibly,

which could be really cool. But infra costs, it's the black hole of operational expenditures. I mean, how many of us, when we were testing, have left a server running, because I did, on AWS?

They give you a one-time refund. - Oopsy card. - Yeah, one-time Oopsy card, which I've already cashed in.

(laughing) So my next question is about modules, but I keep coming back to this thought in my head, and I wanna ask about it. So bear with me, because it's gonna be a dumb question.

Earlier on, Nic talked about our search show, and on that search show, we talked about Solr. And Solr kinda has the same idea, right? It's like a search server, right, that has cores that are the different kind of buckets of data, right? So it sounds very similar to what is happening for you guys in Mongo.

I'm wondering if we can kinda just talk about the differences there, right? So obviously Solr is more of a search appliance. It's very heavily used for searching as a search utility.

Is the main difference, I guess, in a data lake versus that Solr sort of search index, the fact that multiple systems could get that data out of the data lake, as opposed to not maybe being able to do that in kind of a Solr configuration?

And Tim or Nic, if this is a totally dumb question, you can just be like, "Yo man, you're an idiot."

Bam, feel free. - So I asked the same question when we were talking about data lakes. - I'm not an idiot.

- It was really funny because I actually have a friend who ran, I think it's the crazycouponlady.com website for a long time, and they actually were indexing WordPress into Solr and using a decoupled front end to pull all their content from Solr.

So it's actually, I was like, "Why can't you just use Solr instead?" But the problem, I think, I'm trying to remember exactly what it was, but it, well, first off, we actually, within Red Hat, we have a very strong Solr team, infra team already, and they manage a lot of our search connectivity between all of our different applications, because we also have different search collections for Solr indexing and all that stuff. It's very complicated.

So I was like, "Why can't we just use that?" And the first answer I got was because, and I didn't think that was good enough. - Seems reasonable. - So I asked the same question, and I think that the thing it came down to was,

we wanted to be able to iterate quickly. That was one reason.

And with Mongo, you can just stand up a collection, you can index straight into it. It's very fast, very, at least, from an IT perspective internally, it was faster. But I was like, "That's kind of still not good enough "if you wanna, that shouldn't be a good enough reason "to not use the best idea."

But I think it was, what it came down to was the flexibility of the schema.

And Solr puts, I don't know, it's, because it is very, very fast. - You had me at flexibility. I'm a big fan of, Solr's not as flexible. Solr is a search utility. - It's made for one thing. It really is made for that. And also I've heard situations, at least in the past, where if something goes down, they have to re-index it, and it can take a long time to do that.

I remember from that website that my friend who was doing the Crazy Coupon Lady website, just the number of times that their Solr index would go down and then their whole site would go down and all this crazy stuff. And it's like, "Well, let's go for the simple approach that gives us the flexibility we need. We don't need all the overhead of what Solr applies."

- Thanks for indulging, Mike. - I think you also mentioned,

at one point, that Solr is more opinionated, I guess maybe as far as schema or something like that, than MongoDB would be, that was another-- - And MongoDB was faster as well for what we were doing. - I think some of it too is, a lot of times with these larger organizations, especially if they have a dedicated team for something, that team is gonna have a lot of opinions. And if you need to deviate from those, and they're usually there for a good reason, right? But if you're doing something new and you're trying to deviate or form a new opinion, there's a lot of institutional momentum there.

So sometimes going out into a new technology,

it is worthwhile. And I mean, Solr's mature, it's been around forever though. I mean, Mongo is mature at this point too, but it's also newer. So it handles certain situations differently. And it sounds like it's a better fit for you guys. - So I appreciate everybody indulging my dumb, dumb question. Turns out it may not have been as dumb as I thought it was.

My original question, and I think you've kind of already elaborated on this. I just wanna double check that we covered everything here. I'm wondering what Drupal modules are you using, right? So you're using, based on our conversation, you're using a custom module that all the sites are using, right? And then external data sources.

Are there any other modules that are kind of integral to the connection to the data lake, or are those two pretty much it? - As far as getting the data in and out, those are pretty much it. And obviously the dependency on Search API with our homegrown module, there's also a MongoDB module that provides some connectivity that we're using.

And then as far as learning paths, we're using the Allow Only One module to make sure that our Drupal site isn't referencing,

like doesn't have duplicate references for resources in the data lake.

So yeah, so there's a one-to-one. If you're referencing an article on another site, there's only one resource that is referencing that article within the site that you're working in. Otherwise it could get a little wild. - Allow Only One, is that not for a content type? Can you set it to like allow for a certain field type? How does that do for-- - Yeah, I think we're doing it on ID. So we were gonna do it by title, but another module we're using on the learning path stuff is Automatic Entity Label.

So it will pull, it allows us to create, we created a custom token, I believe is how it works, that has the data lake title, the resource title in the data lake, and that just gets populated whenever the node is saved or created.

And then we tried to do allow only one based on title and the unique identifier from the data lake.

And because I think it had something to do with when you create it, when you create the first node or something,

it looked at the token, there was some like probably like order of operations issue where it didn't quite get the title that you wanted, it was looking at the token.

So we just ended up, I mean, it doesn't matter, we're doing it by the ID,

that is the unique identifier. So that's really what's gonna keep us from having multiple references. - Got it. - So, Entity Browser, I'm trying to think, I think that's for our resources. We can actually, since we're injecting the data lake data in a pre-process hook, we can do any view of the resource can include data lake data. So we can do like a view that shows cards of the different resources as well.
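The one-reference-per-ID rule can be sketched outside Drupal like this. On the actual site the Allow Only One module enforces it; the `ResourceRegistry` class, the `lake_id` name, and the sample values below are hypothetical and only show why the ID is a safer uniqueness key than the title.

```python
class DuplicateReference(Exception):
    """Raised when a second local resource points at the same data lake ID."""


class ResourceRegistry:
    """Tracks which data lake records already have a local resource."""

    def __init__(self) -> None:
        self._title_by_lake_id: dict[str, str] = {}

    def add(self, lake_id: str, title: str) -> None:
        # Uniqueness is keyed on the ID only; titles may repeat or change.
        if lake_id in self._title_by_lake_id:
            raise DuplicateReference(f"{lake_id} is already referenced")
        self._title_by_lake_id[lake_id] = title

    def count(self) -> int:
        return len(self._title_by_lake_id)


registry = ResourceRegistry()
registry.add("lake-123", "Getting started")
registry.add("lake-456", "Getting started")  # same title, different ID: allowed

try:
    registry.add("lake-123", "Getting started (renamed)")
    duplicate_rejected = False
except DuplicateReference:
    duplicate_rejected = True
```

Keying on the ID also sidesteps the ordering problem with the generated title, since the ID is known before any title token is resolved.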

So like, I think that's when the resource is being added to a learning path. So you've created your resources, it's your resource, it's referencing the data in the data lake, then you wanna add that resource to the learning path. That's just like an entity reference field that has, you know, drag and drop ordering,

or using entity browser to then display card views of those things that already exist, or you can click the tab to create a new one while you're creating the learning path and that sort of thing. So those are pretty handy. And then I think as far as our, we've been talking about this index module that we share locally in our talk,

Melissa put a sandbox project up that's MongoDB data lake. We can share that link.

And maybe I was thinking maybe I should put my,

the service class I was talking about how I retrieve data in Drupal, maybe I should add that to that module so it's available for people to play around with as well. - You are a maintainer. - I am a maintainer. - She's like, thanks a lot, Melissa. I was like, you're welcome.

- Yeah, April's implementation is,

I was just gonna say that April's implementation is very much, is so different from mine, which is, I think such a good example of how this can be used, right? Hers is very integrated with Drupal and mine is using Drupal for the actual data management and then the presentation, something else. So mine is actually very light, but one thing we do, we are doing is translation as well. So we actually have translations going in and that's using mostly Drupal core like content translation work. But pretty soon we will be using layout builder once the new feature that I have to build

gets to feature complete. But mine's very light by comparison. The rest of mine is mostly custom just because it has to be.

- I think our future plans right now are going to include decoupled learning paths. So that's gonna be interesting. That's gonna go a little bit full circle, come back to the thing that you've been doing with products and we're doing more with learning paths too. So that'll be an interesting thing to go through.

- Awesome.

So kind of the next question that I always have with these types of situations, right? You have multiple systems kind of managing their own collection of data, right? But then you have a bunch of people using that data. If you need to modify your collection,

I think there's two parts to this question. First, how do you communicate to your users like, hey, we're going to change this date format, so be ready. And how do you coordinate the actual rollout of that change?

Maybe we can start with you, Melissa, because it sounds like you have a few more users that are a little bit more removed from the data. So the communication plans. - Yeah, generally, so that's usually what I do, and this applies for APIs actually. I use the same mentality for both this kind of schema approach and APIs, because it's kind of a similar approach, except that we don't do versioning. Like with an API, you can say V2 and have them sit side by side. With this, technically you could,

but what I've gone for so far is actually, if I need to change something

and I'm going to deprecate a field, I will actually just add a new field with the format the way I want it to be. And then I'll give a lifetime for the old one and say in whatever, six months, three months, one month, whatever, this will no longer be present. If you have a problem with that, reach out to us and let us know and we'll coordinate, right? So that's basically how I do it. We have an email list of people who respond to us, but we also have Slack channels, which is I think more effective for us, where we'll do like an announcement in a Slack channel for people who are using our data lake data. And I'll just say, announcement on this day, you'll no longer be able to use this field, use this instead. And that seems to work really well. And again, there's usually not a reason to,

at least in my mind, there's not necessarily a reason to overwrite something unless it's an integrated field. So if you have a field that like is used as like an ID field that's used as a key for something else, then that becomes a little more tricky.

But I tend to try not to touch ID fields because it's like, that's the whole point of an ID is that it doesn't change. So yeah.

- April, anything different?
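The add-a-new-field-then-retire-the-old-one approach can be sketched as follows. The field names (`published`, `published_iso`), the cutoff date, and the `publish_view` helper are invented for illustration; the episode only describes the pattern of running both fields side by side for an announced lifetime.

```python
from datetime import date

# Hypothetical deprecation notice: old field name, its replacement, and cutoff.
DEPRECATED_FIELDS = {
    "published": {"use_instead": "published_iso", "removed_after": date(2024, 1, 31)},
}


def publish_view(record: dict, today: date) -> dict:
    """Return the record as consumers see it on a given day."""
    visible = dict(record)
    for old_name, notice in DEPRECATED_FIELDS.items():
        # Until the cutoff both fields are present; afterwards only the new one.
        if today > notice["removed_after"]:
            visible.pop(old_name, None)
    return visible


record = {
    "uuid": "abc-123",
    "published": "07/31/2023",      # old format, kept during the window
    "published_iso": "2023-07-31",  # new field added alongside it
}

during_window = publish_view(record, date(2023, 12, 1))
after_window = publish_view(record, date(2024, 2, 1))
```

Consumers that switch to the new field during the window never see a breaking change, which is what makes the announced lifetime workable without versioning.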

- We haven't had a ton of schema changes yet to have to deal with that. So that's a good thing to bring up and bring to light. And our teams that are working on the two sites that are currently using it are the same team. And so if we know a schema change is coming up, we'll create tickets that will handle template changes and things like that to make sure that it all matches up. But that's a really good point is how we can, I would think some sort of change log in Confluence and our documentation will probably be really handy too to let people know what to expect.

We take advantage of like, we like how Drupal deprecates things or like there's all these schedules and things. I guess we have to, as part of our governance now, we gotta come up with our own deprecation governance

as a part of this. So that was a good point. - What I keep telling our users is that we are currently on the bleeding edge of technology when it comes to this feature. And so sometimes you get cut.

I was actually just talking with someone on our design team because she presented designs and they said, oh, well, we have to change this now. And she was like, this is the third time.

And I was like, that's what happens with bleeding edge. Like it moves really, really fast. And even by, I would say even by like agency, like Drupal agency standards, it moves really fast like from day to day, sometimes requirements will change

versus Red Hat standards, where sometimes corporations can take like, oh, that happens super fast. I'm like, it took us three months to do that. They're like, yeah, that was really fast. I'm like, what?

So this kind of like iteration that we're working on right now, our schedule is really fast. So we have to be in constant communication with one another, or else it just doesn't work. And sometimes we have to redo stuff because we're figuring it out as we go.

- Sounds good.

So you've mentioned a bunch of different sandbox modules and, offhand, a bunch of other modules here. What are any of the,

what would be the number one sort of like gotcha or pro tip for anyone trying to implement their own data lake for their own organization or product?

- Do you need it? (laughing)

You have to have a really good use case for it because it does take a lot of work to do.

When you abstract data the way we're doing it, it automatically brings its own challenges like the governance plan that April and I have been talking about this whole time. The fact that we have to have a governance plan is

something that not everyone thinks about. You don't wanna get into a ready, fire, aim situation where you like build something and then go, oh no, now people are using it and we can't change it anymore.

I don't think the complexity is in the technology, not really, the complexity is in the administration. And if you don't administrate it right from the very beginning, then you end up with just a lot of problems down the road. So I would say the number one question would be, do you need it? Do you actually need it? And if you do, then be ready for the work. - Yeah, when we originally did the learning path stuff, it really felt like overkill because we were only doing it on one site. We were basically taking content on the same site, putting it in the data lake and then referencing it back into the same site.

So it was really complex for that, but the idea was setting the foundation for being able to add more sites to it and stuff like that. So yeah, it would have been easily solvable with just an entity reference if we weren't looking for that idea to share the data across different properties and different applications.

- Just out of curiosity on the governance note, who is managing that? Is that your team directly or is there some sort of governance overlord that is determining that governance? - It's basically the data owners so far because there's so few of us. As we expand, I have no doubt that it'll grow and it'll help more people, but it's basically myself, April, and a lady named Stacy who's on our shared tooling team. So the three of us kind of collaborate together and Stacy's like our advocate for when people want to start using the data lake, then she knows how to use it as well. So that way they don't have to track April and myself down if they don't need to. Then also, I don't want to be a single point of failure. April doesn't either. So the more people who know how to work with this, the better.

So yeah, so that's how we do it right now. And we're kind of in, because we're the ones using it and working with it, we're the authoritative source, but we do collaborate a lot as well. So with one another and with our teams.

So if that's not a situation that someone who's considering doing this, if they're not in that situation where they have that authority, then they need to get some sort of mandate from someone who has authority to give them that authority so people will listen to you. Otherwise you end up with just a data lake full of stuff that doesn't look like it. - Like a data swamp.

- Hey, yeah, keep it going. - Yeah. - You like it? - Yeah.

- Melissa, April, thank you for joining us. It has been a very enlightening conversation and full of great tips and tricks around data lakes.

- My pleasure. - Yeah, thanks for having us. - Yeah. - Do you have questions or feedback? You can reach out to Talking Drupal on Twitter or X, I guess, for now.

With the handle TalkingDrupal, we've really got to update this and figure out where we're gonna live because the weekly changes are- - This is why we don't let Nic go off script. - Well, it's no longer called Twitter, I guess, officially. But anyway, you can also connect by email, which is more reliable, at show@talkingdrupal.com. And you can connect with our hosts and other listeners on the Drupal Slack in the Talking Drupal channel. Which feels like the most reliable option if I had to choose. But anyway, you, yes, you, can promote your Drupal community event on Talking Drupal. Learn more at talkingdrupal.com/tdpromo. - And you can get the Talking Drupal newsletter for show news, upcoming Drupal camps, local meetups, and much more. You can sign up for the newsletter at talkingdrupal.com/newsletter. And as we mentioned this week's, well, last week's edition has the questions with Tim.

- I'm still thinking about those M&Ms.

Thank you, patrons, for supporting Talking Drupal. Your support is greatly appreciated. You can learn more about becoming a patron at talkingdrupal.com and choosing the Become a Patron button in the sidebar.

All right, we have reached the end of our show. This is the portion of the show where we allow for shameless self-promotion.

Melissa, folks wanted to talk about Red Hat, talk about Data Lakes. How best can they contact you to do that? - That is a great question.

Probably email, honestly, to start with. I'm also on Drupal Slack.

My, is it handle? Is that what we call them still? Is that the thing? - Sure. - Is merauluka, M-E-R-A-U-L-U-K-A. There's a long story behind that that goes back to AOL. Enjoy that story.

But that's my handle on Drupal.org. You can reach out, I have all my contacts listed there, but also my email is mbent at redhat.com. That'll be fun. Please enjoy, please don't spam me. - Just for nostalgia, you've got mail.

- Oh, yay! - April, where can folks find you? - I am also on the Drupal Slack. My name is presented there, April Sides, but my handle is weekbeforenext, not the week after next, but the week before, which is now.

Yes, that's probably the best way to get in touch with me, really.

I don't wanna spam. - That's all right, that's fair. Nobody converts the email addresses from audio to email, to spam people. Tim, how can folks find you? - I'll say Drupal Slack as well, honestly. - All right, Drupal Slack. Hey, listen, you can start a chat with all of us. It'll be fun, I can guarantee it.

Nic Laflin, how can our listeners get a hold of you? - Well, I'm at nicxvan, N-I-C-X-V-A-N, pretty much everywhere. - And I'm John Picozzi. You can find me on all the major social networks at johnpicozzi, as well as Drupal.org. And you can find out about EPAM at epam.com.

- And if you've enjoyed listening, we've enjoyed talking.

- Have a good one, everyone.


Melissa Bent

April Sides

Tim Plunkett

Martin Anderson-Clutz

Nic Laflin

John Picozzi