SQL Server 2017: Advanced Analytics with Python

[MUSIC] Hello everyone,
welcome to our presentation. My name is Sumit Kumar, I’m a program manager in the
Microsoft Machine Learning Group. And with me is my colleague Vijay.>>Hello everybody, my name is Vijay Jayaseelan, I'm an engineering
manager in the same group as Sumit.>>We are super excited today to
talk to you about the new capability that we are bringing in SQL Server 2017: native integration of Python. Python is effectively a peer to the R language that we added in 2016. So now, in addition to R-based machine learning and advanced analytics applications, you can also run Python-based machine learning, AI, and advanced analytics applications in SQL Server. What this also gives you is the capability to deploy a Python-based machine learning application in an enterprise production environment that is already running on SQL Server, right? So in addition to that capability,
what this integration gives you is the ability to bring any open source Python package and make it run in SQL Server, so that you can impart the intelligence from that open source package into your data which is already in SQL Server. What are the key advantages that
are enabled by this integration? First, and the biggest one, is the elimination of the need to move the data from the database to wherever you're doing your Python-based machine learning and compute. That is a very, very important capability, because it gives you a number of benefits. Number one, moving data is not secure. And the amount of infrastructure that you have to build to move data around to do machine learning, that is also another problem. On top of that, oftentimes you cannot move all of the data, so you're forced to do sampling and do your machine learning on samples of the data. With this, you can now do machine learning and your Python-based compute on the entire dataset which is in the database, and the data is not moving anywhere,
right? So the second problem that people often have is with operationalizing their machine learning code: they find that the data scientists have done a great job and created nice models. But putting them into production in such a way that the applications, the enterprise applications, the LOB applications, can actually make use of them is a problem, right?>>Yeah.
>>So now, with this integration, what you can do is deploy your machine learning model and Python script right inside of a T-SQL stored procedure. We'll talk more about that in a bit. But what that gives you is, hey, any app that can talk to SQL Server and can invoke a T-SQL stored procedure just gets the intelligence from the model, without even being aware that a Python script or Python-based model is running in the background, right?
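To make that concrete, here is a minimal sketch of the deployment pattern being described; the table, column, and model names are hypothetical. The Python scoring code lives inside a stored procedure, so any application that can execute T-SQL can get predictions without knowing that Python is involved.

CREATE PROCEDURE dbo.PredictAmounts
AS
BEGIN
    -- Load a previously trained, pickled model from an (assumed) models table.
    DECLARE @model VARBINARY(MAX) =
        (SELECT TOP (1) model_object FROM dbo.Models WHERE model_name = N'amount_model');

    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'
import pickle
model = pickle.loads(model_bytes)        # deserialize the stored model
OutputDataSet = InputDataSet.copy()
OutputDataSet["PredictedAmount"] = model.predict(InputDataSet[["DayOfWeekNum"]])
',
        @input_data_1 = N'SELECT DayOfWeekNum FROM dbo.NewOrders',
        @params = N'@model_bytes VARBINARY(MAX)',
        @model_bytes = @model
        WITH RESULT SETS ((DayOfWeekNum INT, PredictedAmount FLOAT));
END;

An LOB application then simply runs EXEC dbo.PredictAmounts over its existing connection, exactly like any other stored procedure.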
So in addition to that nice integration and the ability to deploy models in production, we also offer some performance and scale advantages. Vijay, why don't you
tell us more about that?>>Right, you know that as part
of 2016 in-database analytics integration, we did introduce a new
communication channel to exchange data from SQL Server to the scripts
which are running inside SQL Server. That is a very high
throughput channel and it’s like a columnar based
compression channel. It’s much, much faster than ODBC. So this integration of
Python also uses that. Apart from that, you already know
that SQL has in-memory columnstore, in-memory tables, and columnstore
indexes which are very highly conducive for
large scale data analytics. In addition to all this, as part of
the integration we are also bringing something called revoscalepy, which is equivalent to the RevoScaleR
package which we introduced.>>So all the RX set of APIs.>>The RX [CROSSTALK] functions,
right.>>We will have a Pythonic
version of those?>>Yes,
we have a Pythonic version of those. And all of those are not bounded
by memory or scale, right? Like for example, if you’re doing
a linear regression on a billion-row dataset, you can easily do it on
very small-scale hardware, right, without any memory limitation.>>Excellent.
>>So that's like enterprise-grade scale.>>Fantastic, so moving along. What are the key scenarios that we enable with this? So the first one is the data scientist scenario, where the data scientist is working on their laptop or their desktop. And they're doing data exploration and modeling. It's a very iterative process. So now, while you're still
working in your IDE, you can connect to SQL Server and
do your compute there on the box.>>Yes.
>>So you can set your compute context to either local or
to remote on SQL Server. And even though you are executing everything in your IDE, it is actually being done on the SQL Server machine.>>Yes, so that's like one of the main advantages of the revoscalepy package, which makes the local and remote compute transparent. So the data scientist doesn't have to leave their IDE to work against SQL Server.>>I think that is huge, right? There's no more pushing data around, the IT team giving fragments of data to the data scientist. All the advantages that we built in
R Services are now available for Python-based machine learning as well. So the next one is the application developer. That's another persona that will benefit from this integration. So the data scientist team has created the models and the Python scripts. Now, as an application developer for LOB applications or my business applications, I want to take advantage of that. Earlier it used to be that the data scientist's model would have to be literally recoded into whatever the LOB application language is, Python or Java or C#.>>Yeah.
do is, invoke a stored procedure, we already talked about it.>>Yes.>>So
that simplifies my job drastically.>>So another thing also
as part of the application personalization is in a basic
three tiered architectured, assuming you had some Python code
which you are to run for an ML. Generally, it will be running in the
middle tier and managing that Python execution and all that in
middle tier is very cumbersome. Now, you don’t need
to worry about it. It is as simple as calling
a stored procedure, right?>>Absolutely.>>You don’t need to worry about
what Python version is running. Is this patched, is this secure? Everything is part of SQL Server. So that takes away a lot of pain in
terms of operationalizing the Python code into SQL.>>And as we’ve been talking about,
the intelligence is truly with data. And applications don’t have
to worry about all of this.>>Yes.
>>And this kind of intelligent application is no longer
the domain of the AI experts and machine learning experts. So that’s a big step in
the direction of democratizing machine learning and AI.>>Yeah.>>So the next sort of persona that
benefits from this integration is of course the owner of the database,
that is the database administrator.>>So the database administrators
are very particular about what gets run on the database. Who has access to it? How are the queries using their resources? Are they being throttled? All those benefits are also there
as part of the Python integration.>>Because I want to control what an
extended runtime can do on my box.>>Yes, so the DBA can control that. Like, for example, we have something called resource governance, which we did introduce as part of SQL 2016 itself, as part of the R integration. The same applies for the Python integration also, where an admin can choose to say, this set of logins
cannot use more than 5% of the CPU, so the Python pool cannot
use more than 5% of the CPU. So that kind of very nice governing
allows the database administrator to finely control what it is that the
stuff is running on their database. Apart from that,
there are some detailed logs and views where the database
administrator can really look what Python code got executed,
who is executing it. So these are just regular part
of the SQL Management, right?>>So I can take advantage of
the state-of-art technology and machine learning capabilities while
still being able to control my resources on the box without
worrying about security, etc.>>Yeah.
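To give a flavor of what that governance and monitoring looks like, here is a hedged sketch using Resource Governor; the pool and group names and the limits are illustrative, and a classifier function would route the relevant logins to the group.

-- Cap external (Python/R) script execution at 5% CPU.
CREATE EXTERNAL RESOURCE POOL ml_external_pool
WITH (MAX_CPU_PERCENT = 5, MAX_MEMORY_PERCENT = 20);

-- Route a workload group's external scripts to that pool.
CREATE WORKLOAD GROUP ml_workload_group
USING "default", EXTERNAL ml_external_pool;

ALTER RESOURCE GOVERNOR RECONFIGURE;

-- One of the monitoring views mentioned above: which external scripts are
-- running, in which language, and under which external user.
SELECT external_script_request_id, [language], external_user_name
FROM sys.dm_external_script_requests;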
>>Excellent. All right, so let's take a look at how this actually works. We talked about the data exploration scenario. So on the left you have a data
scientist working on his Python IDE. And as he connects in remote
compute context to the SQL Server, the script goes to SQL Server. Execution happens there
on the Python Runtime, the results come back to
the developer's machine. And in the case of operationalization, the application is basically connecting to SQL Server, calling a stored procedure, and the
compute again happens on SQL server. It leverages the new system stored
procedure that we created last time in 2016, called
sp_execute_external_script, and it now accepts Python
in addition to R code. Now let’s take a look
at the architecture. Why don’t you tell us a little bit more about how we were
able to achieve all that.>>Right, so some of this we did talk as
part of the 2016 launch also. The architecture, as far as
our services integration and Python services integration is
concerned, is nearly identical. Right, for example,
we did introduce as you just mentioned,
sp_execute_external_script, which is a system stored procedure
to execute external scripts. We used to support R with 2016,
now it has the language Python where you can submit Python
scripts and it’ll execute. The query comes to the SQL Server,
SQL Server executes it. There is an external process called
as Launchpad which launches it. There a concept, right,
like a run time launcher. With 2017, we have RLauncher and Pythonlauncher to launch different
run times, based on what is needed. And it gets launched,
and it gets executed, and it connects back to
SQL Server on the SQL Satellite high throughput data channel,
I talked about it. Right, it exchanges the data, then
the script executes and comes back. Conceptually it is very simple,
think about it this way, right? You’re executing your query
in the SQL Server and the results come back. With external script, what you have done is, you execute
a query in the SQL Server, the results of the query actually
go to the external run time. The post process is based on that,
and that result actually comes back via
SQL Server, back to the client side. It's just like an additional
post-processing, it’s as simple as that.>>Excellent,
now let’s take a look at a demo, where we see all of
these things working. Let’s see some code in action. What you’re seeing here is Visual
Studio IDE, with Python tools. And you’re seeing a very simple,
few lines of Python code. It does nothing more than
printing Hello world. So if I run it here
locally on my machine, it prints Hello world,
as expected, right? Now let’s see how would I take this
code and run inside SQL Server, leveraging our integration
that we just talked about. So here now inside of SSMS, you are seeing the exact same
Hello world code as here. And what you’re seeing is
the special system stored procedure called sp_execute_external_script. That takes an external
language script, in this case, the language is Python here. And this script is
the exact same Python code. Let’s execute this code. And you see the exact same output
here inside of SSMS as well. So it's that easy to bring any existing Python code and make it run in SQL Server.
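For reference, the statement behind this Hello World demo is essentially a single call:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'print("Hello world")';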
Now let's take a look at how the deeper integration with SQL is done in this case, Vijay?>>Okay, let's now look at another
example which shows the SQL and Python integration
in a bit more depth. What you’re seeing right now
here is basically a SQL script where I have a temporary
table called as PythonTest, which has a DayOfWeek as a string
and some Amount associated with it. I have some piece of data which is
put into it as the script runs. There are a couple of parameters
which I’m printing the value, right? And then I am again invoking
the same sp_execute_external_script, which Sumit just showed you
in the Hello world example. And the language is Python here. What this script also does
is it does the Hello world. Then it’s changing the parameters,
right, which is basically some
integer it’s assigning. But at the core of it, what
the script is doing is essentially, it gets whatever the data set
was passed into the script, the SQL integration, and
it assigns to the OutputDataSet. So in this case, remember that we
would have got a string and a float, right? What the script does is, it has a map which maps a day of
week from a string to an integer. And it just replaces one of
the columns in the output data set, which is basically a pandas DataFrame. The input data, as you see, as I
just explained, comes to a query, which is coming to
the temporary table. And there are some parameters
which are being passed in, the rowsPerRead, ParamINT and
ParamCharN, which are, the two parameters are output parameters,
which the script can change. Now let’s run it and
see what is the output we see. So here you see very
clearly before and after. So you see the strings where here
days of week was as a string, and then it got converted to
days of week as an integer, the amounts have been retained. So in this particular case,
what we just saw was the data coming from a SQL Server table, being used
by sp_execute_external_script. Python doing some sort of
a very primitive processing of converting a string to an integer
value, and then outputting the data.
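As a rough reconstruction of what that batch looks like (the exact table contents and parameter values shown here are assumptions), the whole round trip fits in one script:

CREATE TABLE #PythonTest (DayOfWeek NVARCHAR(10), Amount FLOAT);
INSERT INTO #PythonTest VALUES (N'Monday', 100.0), (N'Tuesday', 200.0);   -- sample rows

DECLARE @ParamINT INT = 1234567, @ParamCharN NVARCHAR(30) = N'INPUT';

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
print("Hello world")
day_map = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
           "Friday": 5, "Saturday": 6, "Sunday": 7}
OutputDataSet = InputDataSet                     # a pandas DataFrame
OutputDataSet["DayOfWeek"] = OutputDataSet["DayOfWeek"].map(day_map)
ParamINT = 1                                     # output parameters the script can change
ParamCharN = "OUTPUT"
',
    @input_data_1 = N'SELECT DayOfWeek, Amount FROM #PythonTest',
    @params = N'@ParamINT INT OUTPUT, @ParamCharN NVARCHAR(30) OUTPUT',
    @ParamINT = @ParamINT OUTPUT,
    @ParamCharN = @ParamCharN OUTPUT
    WITH RESULT SETS ((DayOfWeek INT, Amount FLOAT));

SELECT @ParamINT AS ParamINT, @ParamCharN AS ParamCharN;   -- values changed by the script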
Now let's look at the parameters also. Apart from the input and output
data set, you can pass input and output parameters also. Now here, when I kind of switch
over here, the parameters were, when they were going in they’re 1,
2, 3, 4, 5, 6, 7, and the string parameter
was of type INPUT. And when the output came back,
it is basically to an OUTPUT. Literally some parameter
was passed in, the script changed the parameter,
and those values also came back. Now another thing to remember
in this kind of integration is, sometimes what happens is, the data could be pretty
large which is coming in. So we do have a concept called
a streaming which is supported. And the way you control that is
a special parameter which is rowsPerRead, right? In this particular case,
the rowsPerRead was 5. Which essentially says,
at a time give me 5 rows. And the total number
of rows were 10. So in the output you see
the script got executed twice, as you can see from "Hello" getting printed twice. But when you look at the output, it is all merged and it's just the ten rows. So literally, what you saw here was streaming behavior as well.
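The streaming knob itself is just another parameter on the call; here is a hedged sketch reusing the #PythonTest table from above (the transcript calls it rowsPerRead, and in the product the parameter is commonly written @r_rowsPerRead, so treat the exact spelling as an assumption):

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
print("Hello world")              # runs once per chunk, so twice for 10 rows read 5 at a time
OutputDataSet = InputDataSet
',
    @input_data_1 = N'SELECT DayOfWeek, Amount FROM #PythonTest',
    @params = N'@r_rowsPerRead INT',
    @r_rowsPerRead = 5;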
Okay, now let us show you the revoscalepy demo also. So let me switch over
to the Python IDE, which is PTVS as part
of Microsoft Visual Studio. Here what we have is
the revoscalepy package. There is a series of concepts there,
and what we are trying to do there is, basically, we have a data
source called as SqlServerData, which can realize it as a data frame
or pandas data frame in Python. Then we have a concept,
something called computeContext, which essentially says where
the code should be running. In this particular case, I’m saying the code should be
running on the SQL compute context. And the function which we are trying
to do is the linear regression, which is part of the revoscalepy
package, which is rx_lin_mod. And we’re regressing one variable
and line delay against dayofweek. The source is
the SqlServerData source and the compute context is
the compute context. And let’s run it and
see what we see. I’m going to switch over and kind
run it from the command line here. So if you see here, I'm running it, and as you see, rx_lin_mod with the SQL Server compute context got executed, and the results are coming back. Please note that here what happened
was although the script started in the client IDE, the computation
was pushed to SQL Server. And the linear regression actually
was done on the SQL Server using the data which was on SQL Server,
and the results came back. Okay, now let’s-
>>That’s excellent, so you’re running in an IDE, but the computation is actually
happening on SQL Server.>>Yes.>>You can test your script on the
entire set of data that is there.>>Yeah, so this is a very kind of
a favorite data scientist scenario, where they tend to work
on local compute context. And then as they’re ready, before
they push it into production to SQL Server, they can test it
against SQL Server, right?>>All right, so that was excellent.
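The demo drives this from the client IDE, with revoscalepy pushing the work to a SQL Server compute context. As a rough equivalent that stays entirely inside SQL Server, the same revoscalepy linear regression can also be wrapped in T-SQL; this is only a hedged sketch, and the airline-style table and column names are assumptions:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
from revoscalepy import rx_lin_mod
# Fit a linear model of arrival delay against day of week on the input rows.
lin_mod = rx_lin_mod(formula="ArrDelay ~ DayOfWeek", data=InputDataSet)
print(lin_mod)
',
    @input_data_1 = N'SELECT ArrDelay, DayOfWeek FROM dbo.AirlineDemoSmall';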
Now, not only can you use the native integration and the basic default libraries and the revoscalepy
library that we are adding. You can literally bring any other
external Python package and make it run in SQL Server. Let me show you an example of TensorFlow, all right? So in these few lines of code,
here I’m importing the tensorflow. And all that I had to do was just do
a regular pip install of tensorflow in the right path. Let’s execute this. And from T-SQL environment,
I’m now able to run a very basic
Hello from TensorFlow.
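The call is the same pattern as before; here is a hedged sketch of that TensorFlow hello world (1.x-style API, and it assumes tensorflow was pip-installed into the Python environment that SQL Server uses):

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import tensorflow as tf
hello = tf.constant("Hello from TensorFlow")
with tf.Session() as sess:        # TensorFlow 1.x-style session
    print(sess.run(hello))
';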
So that shows the power of this integration, which enables you to bring in any other external library as well. So let's quickly summarize, what are the three things
that we saw in the demo? First one is,
it’s extremely simple to lift and shift your Python code from
outside to T-SQL environment. We saw how using
revoscalepy package, you can do remote
compute context and you can do ETL inside of T-SQL. And you can leverage the additional algorithms that we have
added in the package. And the third thing is,
you can bring any other external package and
make it run in T-SQL. So I hope you got a quick
glimpse of the integration. We will continue to do
more detailed demos, and put more code out on GitHub. So we saw Python
integration in action. Let’s summarize what are all the
things that we’re getting with this integration. First of all, Python, the new
language is natively integrated in SQL Server,
which eliminates the need to move data for
doing Python-based data science. Once you have the model,
you can easily operationalize it for production environment. And the best thing is, it is available in all
editions of SQL Server.>>Yeah, that’s great.>>Thanks once again for taking
time to watch our presentation. We encourage you to install the CTP 2.0 release of SQL Server 2017, and give us feedback. Thank you.
>>Thank you. [MUSIC]
