Introduction to Big Data Cluster on SQL Server 2019 | Virtualization, Kubernetes, and Containers

(cheerful music) Hello and welcome, my name is Sanjay Soni. This is a microlearning
readiness video. Let me welcome Buck Woody to the studio. Hi, Buck. Hey, how you doing, Sanjay? SANJAY: Great to have you with us. It’s good to be here,
it’s good to be here. Awesome. So, what do you do at Microsoft? I am an applied data scientist. I work on the Azure Data Team,
so we work with SQL Server and all the other data platforms as well. SANJAY: I see, I see. So I know there’s a lot
of buzz going around for AI and big data, first
of all, is big data real? Is big data real? Wow,
that’s a great question. It is. It’s not just a buzzword. You have to kinda define what
big data is though, first. Right? So, what is big data? Have you ever been asked what’s big data? SANJAY: Yes. What do you normally reply?
How do you reply to that? It’s bigger than small. (both laughing) BUCK: I love it. I love it. It’s not small data. You know, I’ve heard about
the V’s, like velocity SANJAY: Of course, the four V’s. and variety and volume. You know the V’s, right? And that’s fine, but I
kind of have a little bit of an issue with that. For me, the simpler definition
is going to be any data that you can’t process
in the time you want with the technology you have. SANJAY: Yes. That’s the way I think about it. And the way I think about that
is if I have a Commodore 64, one megabyte of data is big data. Right? So that’s not very big. But if I’ve got, yeah SANJAY: You could not derive
the information from it Fast enough. Yeah. for what I want, so I think
if you’ve got a system and it’s not giving you
the response you need, you need to look at something else. And so that’s what we’ll
be talking about today. Some people tell me big data’s kind of for larger companies and things like that, but that’s really not true. We’ve all got that drawer in the kitchen that you can’t close anymore, or open, because we’ve got so much stuff in it. When we first started at
the house or the apartment, we could put stuff in the
drawer, because it was new and there was nothing in
there, but as time went on, we had more stuff and we had more time to put it in the drawer. And that’s what we’re doing with data. You know, we’ve got smart watches Right And computers and your
cell phone and all that. And so we’re collecting
tons and tons of data, just because we’ve been able to for longer. You know, retail is a
classic example of this. Exactly. Right? So when you go
into the grocery store and you put your card in the reader, all kinds of data is
being collected about you. Not just that transaction, but the stock levels of that item, where did we get that item from, how does that come back to us? All these kinds of things
are being recorded. And so retail is a huge example of working with large sets of data. And every company has big data. Doesn’t matter what you are, you have the availability of big data. The interesting thing is, you don’t always have it locally to you. It could be, that you’ve
got data that you can access that isn’t yours, maybe
it’s weather reports. Maybe it’s demographics. Maybe, it’s all kinds of things, right? So we have these sources of data that are not necessarily in
our systems, under our control. SANJAY: I see. SANJAY: So you know,
how do customers use it? Like what kind of use-cases exist? Yeah, there’s lots of use-cases, and so if you think about
the various industries that are out there. You know, retail, finance,
healthcare, public sector, things like that, each of
those has kind of found some main line use-cases that you can see. There’s demand prediction for retail. There’s fraud detection
and so on inside banks. And cyber security and so on. All of these things need large amounts of data to answer questions. And it’s really kind of two
or three kinds of questions we could ask big data. One is just exploration. We just kind of get
around in there and look, and we’re looking for something but we don’t know what
we’re looking for yet. Another kind might be classifications, like we put these people
into these categories, we put these events into
this category and so on. And then another one might be predictions. I want to know, because of
this, what happens here? How do we predict what
might happen in the future? And so all of those need
large amounts of data, and these are the main
use-cases that we see. SANJAY: Awesome, thank you so much. So we’re talking about big data. Now, let’s switch gears and
I’ll ask the next question, which is about solving big data. Yeah. So Buck, I understand a lot of companies have big data that, you
know, they’re trying to make a big difference using big data. Yeah. Scale Up, you know,
traditionally Scale Up has been the method. Sure, and Scale Up is what now? It’s just making things bigger, right? Like a single box, like my
truckasaurus laptop here, making it have more memory and more CPU and better
hard drives and so on. Yeah, you can Scale Out maybe Well we have to work with Scale Out, and when we think about that, we think about the classic
example which is Hadoop. Right? That’s the big example. Of course. It actually came from an open-source project called Nutch, which borrowed ideas from papers that Google published, and then at Yahoo, of all places, that work was split out into
the Hadoop project, under the
Apache project and so on. So, the interesting thing about Hadoop is it really solves big data
the only way that you can, and the only way you can solve big data is to break it up to where you have distributed computing systems and distributed storage systems and then some way to put those together. And this is done, normally,
with something called MapReduce, and then HDFS is the
file system, and then YARN, which keeps everything together. The interesting thing is,
that’s kind of abstract and so I usually explain Hadoop to people by using a grocery store. SANJAY: Alright BUCK: You’ve been to a
grocery store, right? SANJAY: Of course BUCK: And you get in line and there’s a million
people in front of you SANJAY: Yeah BUCK: Because maybe it’s
snowing in Seattle, right? SANJAY: Actually it is snowing in Seattle. BUCK: It actually is snowing in Seattle and so what happens? Everybody rushes to the store and buys up the essentials. You know, beer, and diapers, right? You go to the store and you
buy those things, right now. And so there’s a long line, right? So what do they do when
there’s a long line? Add another cashier, right? SANJAY: Of course BUCK: And then just put
another cashier there SANJAY: Yes, yes BUCK: And then half the line,
goes over to that cashier and then starts working with those people. Now the interesting thing is, whichever line I’m in will move slower, but that’s another problem. But the idea there, is
that we have split up. The cashiers can do the exact same work. The person on the first register, knows how to do exactly the same thing as the second register,
but we moved all the data. The things you’re buying,
the things I’m buying, are in two different lines. That’s right, now we’ve
Scaled Out the processing and you and I have Scaled Out the data. But we need some way, the
manager at the back of the store, she needs to look at okay, how do we look at all of my sales today? So she puts all the data
back together, right? And so you might think
of her as the YARN level. She’s the one that’s
putting everything back. We’ve broken up the compute
and we put compute over data. My cash register person handles me, your cash register person handles you. Same compute, different
data, put back together. That’s Hadoop. Wow. That’s all that is, right? That is easy, yes. It’s not that hard to understand. The only problem with this is it’s a little slow. It’s batch-oriented. And so what that means is
that I submit some things, it happens and then later
I can see the results. And that, to me, is a little unsatisfying. And it was unsatisfying
to other people as well, because you want SANJAY: Fast BUCK: Right? What do they say? Computer people would
microwave instant coffee. You know, we want everything right now. So they came up with
this idea at UC Berkeley, and it was called Spark. SANJAY: Of course BUCK: Yeah, and what Spark does, is it tries to address the slowness in the first two layers. The MapReduce and the YARN there. The storage, it leaves alone. It leaves that alone to handle
the distributed storage part, so it doesn’t worry about storage. But what it does do, it’s a
series of API calls in libraries that you can use to wander
across these data sets and to do the traditional things: extract, transform, and load, explore the data, do
some machine learning, maybe get the data result, and so on. So it does that by doing some tricks around adding more memory,
doing more things in memory, a different kind of shuffle mechanism, other kinds of tricks that it does, working on something called a
Resilient Distributed Dataset, which is the RDD. And then that will make a DataFrame, which makes it look a
little more SQL Server-like. And then the Dataset looks a little more object-oriented to things
like Java, and Python. And, in fact, in Spark,
you can run Spark SQL which looks like SQL
language, because it is. And you can run something
like Python or Java, or other kinds of languages. This is kind of how we Scale things out. This is the foundations of our Scale, and kind of brings us to this. SANJAY: I see, so how do
you emulate a computer? BUCK: Right, so if we’re gonna do this, we have to Scale Out computers, right? We gotta go build a ton of computers, and when we’re done with them, we gotta put them back together. We’re not gonna do that, we’re gonna, as you said, virtualize that. That’s the problem we have,
is how do we handle that? So what we did was, the industry at large, basically said, we need
something called a hypervisor. And really all that does,
is present the big four, back to software. So whenever you start up your computer there’s four big pieces in the computer. There’s CPU, Disk, Networking, and Memory. It has all those things and
so the operating system says, oh, I know how to work
with a CPU and a disk, and memory, and networking, and all that. So it allows us to do that. So when we talk about virtualization, we took a piece of software that’s riding on that
hardware, and we said, act like you’ve got a CPU, a
disk, memory, and networking. And tell an operating system,
yeah, I’m really a computer. And so, in that way, we can share what’s
called the host machine, we can share its big four, SANJAY: Yeah BUCK: And then we can present
the big four in software, to an operating system that has no clue that it’s running in that environment. So now I have a way, to
simply have a hard drive, which as we know, is just a file, and it emulates the big
four, and now that big file can start up an operating
system and I have a computer, and I have another one, and
another one and another one. And in that way, while it’s interesting to run that on one system,
the more interesting thing is that I can send those files around and run them in lots of places. Then I can do my Hadoop and my Spark on those kinds of things, but of course there’s a problem with this. I’m carrying the entire operating system. And in fact, I don’t really
need the whole operating system, I really just need some
binaries and things like that. That’s all I need. So they came up with this
next way to work with things, which is called containers. SANJAY: Okay. Yeah, so containers are
kind of interesting. They’re not a virtual machine. SANJAY: They’re not? No, no. (Sanjay laughs) They do, they’re like,
wait man, isn’t this just- and it’s actually not true. What they are, they represent just- so if you’ve abstracted the
hardware out in virtualization, you’re gonna abstract
out the operating system and up in containers. That’s essentially the way to say it. It shares the operating
system, shares the big four, and says, I’m gonna ring
fence, I’m gonna bind around, I’m gonna abstract out
just the binaries you need and any data you need. So if I wanna run, let’s say, a Python app, I would ship just the Python app, and enough stuff to run the Python app, and any data that Python app stores. That’s it, that’s all it is. SANJAY: You don’t need
to struggle, you know. BUCK: Yeah, I don’t need to carry the operating system now. Because of that, these
things are really small. Containers are really small. Containers are built from an image, and all of that is just driven from a file where you say, I’d love to
have Python and this data. That’s what you put in the file, and I can just mail
that to you, it’s tiny. If you have a container processing system, it’s called a runtime engine,
the main one is Docker, that’s sort of the big dog right now. SANJAY: Yes, yes. And in Docker, if you have Docker and I send you that file, you can compose what I sent you in text,
if you have everything, up into an image, or I
can do that part for you and simply send you the
image and say to you, here, run this. When you run it, it becomes a container. So that’s where those terms come from. So we have, Docker is the runtime, the compose is taking the
text file out to an image, and then running an image
becomes a container. SANJAY: So, can you
have containers in a VM? BUCK: You know, you can, isn’t that weird? So you can- (laughs) So one of the demos I do
for the SQL Server product is I will, on my Mac, run Docker, which then runs SQL Server in a container, which runs in Windows. So I’m on a Mac that’s running
a Linux kind of process, that’s running a Windows process. It’s really interesting. And it’s done in minutes, in seconds. Here’s the other thing, because
these things are so small, they stand up very quickly,
they stand up super quickly, which is incredibly useful. SANJAY: I see, so I know you’ve shown all kinds of ingredients to us. You know, how does this
whole orchestration happen in SQL Server for
all these containers? Yeah, thinking even
beyond just SQL Server, one can imagine, okay, we have big data, we have a way to solve big data, that’s Hadoop and Spark and so on, whatever kind of distributed
processing and storage that we can solve big data
now, we know how to do that. And to do that, we need
to do it on something and so those would be virtual machines, and that becomes a little
heavy, so we’ve got containers. But because they’re so light, it’s just like our kitchen
drawer problem again, we have a lot of ’em. And now we’ve got to figure
out, where does this one go, what does it run on, how
does it work and so on. So this idea of container
orchestration came along and what that means is we need a way to control all of these things. And that actually came from
a product called Kubernetes. Now Kubernetes, or sometimes
people call it K8s, ’cause I guess it’s too
long to say Kubernetes, and so it’s K8s, but at any
rate, in Greek this means the pilot or the navigator of a ship, is what that stands for. Well, we know about Docker,
right, we’ve got that down. And we’ve got the containers down. We understand what those
are, so we’re good there. Basically what Kubernetes does is it introduces this
abstraction called a pod, pod. And a pod is merely a
collection of containers. That’s all it is. SANJAY: Okay. BUCK: That’s all it is, so
it’s just an abstraction. Then what you do, is you need
to run that on something, those need to run on something. And so it introduces this
idea called a node, N-O-D-E. Now that can be a physical
computer or a virtual computer but it’s a computer. And what the node has in it,
is three things at least. First, it has Docker, so
we can run containers. Secondly it has something
called a Kubelet, which is a little service
that runs and tells this node, you’re part of Kubernetes. Then it has something called Kube-Proxy along with some other services that say, here’s how
networking sort of works. Right, and so when you put
a bunch of these together, when you put a bunch of nodes together, you end up with something
called a Cluster. And there you go. Now the problem is, in a
Cluster, what if it goes down? These things move around, by the way, they just automatically move around and all that kinda stuff,
how does that work? Well, where does the storage go? If I’m a database, and we’ve got big data, you got virtual machines,
you got containers, now we got a way to work with containers, but what if a database
just suddenly disappears? Kind of a bad thing,
right, we don’t want that. So what they do is they have this idea of something called a volume, and this one happens to be a persistent
volume, meaning it sticks around. And you connect to it, it’s
almost like a software wire, if you will, and it
connects the node or pod directly to the storage,
which is handled separately. That way if the node goes down,
which happens all the time, and Kubernetes moves things
around, it’s kinda spooky. And you don’t really control
that, it does it for you. But it can always follow
its storage around, is the way that works. So by doing all of these
things, we now have the ability to sort of get out some orchestration. Let me show you what an
application might look like on one, just a generic application. SANJAY: Yeah, maybe a Kubernetes cluster? Just a cluster that’s used for, let’s say, our shopping cart example. SANJAY: Yeah, let’s take a retail example. Yeah, let’s take the retail. And so what we do here
is we have this idea. There’s one pod in here, as you can see, called the Kubernetes master, and it’s actually not
specifically in a pod, it’s actually a service,
but it holds here, and this controls the whole thing. From there, maybe we present
a web tier, a web tier. And in this web tier, maybe we’re taking- These are the checkout lanes,
these are the registers. But in the middle of that register is probably some business logic that says, you know, here’s the price list and here’s what rings up, and so on. So we probably have a middle
tier business logic thing that says, don’t sell things
that aren’t in inventory, and so on, and then of course, we need to be able to store the data. And so this is a typical
generic Kubernetes cluster, which has some sort of purpose,
and you’re Scaling Out. Here’s the thing, Sanjay, this
is the miracle right here. Once again, these are
just files, and you say, I would love a cluster
that looks like this, and it does that. So it’s literally, I’d
love to have this, go. And it just lays it all out for us, which is pretty awesome, I think. Would you like to see what one looks like? SANJAY: Yes, absolutely. BUCK: Okay, let me pop
over to a demo here. Alright, this is an impromptu demo. Impromptu demo. So what I’ve done is
I’ve actually stood up an Azure Kubernetes Service (AKS) cluster. And so you’re looking at the nodes there, here are our nodes that we talked about. And inside those nodes,
you can see the pods that are running and
kinda what they’re doing. Pretty cool, hey? SANJAY: Yes. BUCK: So, the neat thing is, I don’t have to worry about all this, Azure takes care of this for me. But Kubernetes runs in lots of places, so I’m okay if it runs
with a bigger thing there. It’s not a problem if it does that. So let’s start up our slideshow here again. So we’ve got the generic
Kubernetes cluster that we can do sort of anything
with, and that handles our- We have big data, we’ve got
a way to handle big data with Hadoop and Spark and so on, we’ve got a way to virtualize things, we’ve got containers that handle things. And now we’ve got something
that can manage the containers. SANJAY: Awesome, so now how does it all fit into SQL Server? Okay, so I’ve described all this stuff and I have tactfully avoided SQL Server. Okay, hold on now, okay,
so there’s all this stuff out there, what are you gonna do with it? I just want to say something, you know, so I’ve known Buck for so many years. I’m a big fan of Buck because
the way he explains everything is just so simple, you know, like the grocery example, just really- BUCK: Well, Sanjay, I’m
a simple man, so yeah, (Sanjay laughs) it all has to be simple for me. SANJAY: Alright, now let’s
bring you to SQL Server. Okay, let’s bring it down
to SQL Server, here we go. So we looked at all this and
we said, hey, you know what, I think if you saw Bob Ward’s sessions, SANJAY: Of course. BUCK: If you’ve ever seen
any of those, Bob talks and he’s even written
a book on SQL on Linux. SANJAY: Of course. BUCK: And that’s pretty cool, and here’s why it’s pretty cool. So we’ve always been able to
run SQL Server on Windows, right, that’s been since the
day, right, back in the day. Actually back in the day
when Bob and I started, it was OS2 1.1 character
based, but I digress. It’s always been available on Windows, and we’re familiar with
this role, we know it. But what we said was, you know what, we need to be able to
run on other platforms so that we give you a platform of choice. And so we took a look and said, what things make SQL
Server run on Windows? What are those things that happen? Well, remember we have the big four? SANJAY: Yes. BUCK: CPU, disk, memory and network. Those are laid over by a
kernel on an operating system. And there are kernels like the one in OS/2, and there’s one in Windows, and there’s one in macOS, and there’s one in Linux, and so on. And what you do when you write code is you’re calling to those big four through the operating system,
using various drivers. So we said, where is SQL Server
sort of making those calls? And maybe it’s a lot, like,
this is a really hard problem. If we want it to run
somewhere else, we may have to write a complete Windows
emulator for other platforms. And that’s really not great. It doesn’t perform well,
it’s a lot of code base. We could rewrite all of SQL Server, and that didn’t look
like a good idea either ’cause now you’re keeping this SQL Server and that SQL Server, and it’s just a mess. So we looked and we said,
what about if we could just sort of cut away the parts
where it just talks to Windows? And so a group of folks here did that. And they said, you know,
it’s actually not that many. It’s some, but it’s not that many. And we created this thing called Bhopal. And I think Bob covers this
really well in his book, and he’s covered it here on our sessions. But Bhopal is the platform
abstraction layer, right. So it basically trims
away all of the pieces where SQL talks to Windows, and says, okay, you can still talk to Windows, we’ll put that layer back,
but we’ll make another one. Maybe we make another one for Linux. Maybe now I can run SQL on
Linux because there’s Bhopal that abstracts out the operating system. Well, that opens up
all kind of new worlds. So sure, we can run on Linux and now you’ve got platforms
of choice, and that’s great, but here’s what that
opens up for us, Sanjay. And this was a long way for
me to answer your question. I was hoping you’d forget
your question, but- (laughs) What we’ll do is move forward. This allows me to run SQL
Server in a container. SANJAY: Okay. Ah, so we have big data, SANJAY: Yes we’ve got a way to solve big data, we’ve got virtualization,
we’ve got containers, we’ve got Kubernetes, and now
I’ve got SQL in containers. Huzzah, and so now I can
run this on premises. I can run it in a public
or a private cloud. And the interesting thing
is, I can run it on both if I wanted to, so it becomes very useful. SANJAY: Fantastic, thank you
so much for all this data- BUCK: No worries, now we gotta
put this together somehow. SANJAY: Of course. BUCK: Lets do that later. SANJAY: Yes, of course, thank you so much. Thank you for watching this video. Learn more about this and other
topics at
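
For readers who want to see the grocery-store picture of Scale Out in code, here is a minimal sketch in Python. It is not Hadoop or Spark code; the function names (ring_up, manager_report, store_sales) and the sample baskets are made up for illustration. Each "cashier" runs the same function on a different line of shoppers, and the "manager" puts the partial totals back together. Prices are whole numbers to keep the arithmetic exact.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def ring_up(basket):
    """One cashier: total the sales per item for a single checkout line."""
    totals = Counter()
    for item, price in basket:
        totals[item] += price
    return totals


def manager_report(partials):
    """The manager: merge every cashier's partial totals into one report."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds values per key
    return combined


def store_sales(lines, cashiers=2):
    """Same compute on different data, then put the results back together."""
    with ThreadPoolExecutor(max_workers=cashiers) as pool:
        partials = list(pool.map(ring_up, lines))
    return manager_report(partials)


if __name__ == "__main__":
    # Hypothetical sample data: two checkout lines of (item, price) pairs.
    line_one = [("beer", 9), ("diapers", 12)]
    line_two = [("beer", 9), ("coffee", 5)]
    print(dict(store_sales([line_one, line_two])))
```

Adding a cashier here only means passing more lines and a larger cashiers value; the per-line work does not change, which is the point of the analogy.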
