Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, a very good day to everyone. My name is Ang, and I'm based in Singapore.
Today I'll be talking about beginning your own data engineering projects.
Before we start, let me talk a little bit about what data engineering is.
Currently, in my role, I help drive a lot of data engineering projects
at work, mainly to help my company become more data driven.
Data engineering helps to ensure a consistent flow of data
for data scientists or data users, typically in
the form of a data warehouse, to enable large scale data mining,
analytics and reporting. If you're also thinking
of collecting more data on your own for your
own data science projects, this talk might be useful for
you. Data engineers are important because
they help build the foundation for data driven enterprises.
Usually before we get to do data science or
data analytics, the most important thing is to have quality data.
So with all these data engineering projects,
what we are trying to do is build data pipelines
that transfer data from one place to another.
Instead of transporting liquid or gas, what we are
transporting through these pipelines is data.
The typical process involves extracting data from one place and
then transforming it before loading it into the data warehouse.
This process is called ETL: extract, transform, load.
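As a rough illustration (not code from the talk), a minimal ETL job in Python might look like this, with placeholder extract and load steps standing in for a real API source and warehouse table:

```python
import pandas as pd

def extract() -> pd.DataFrame:
    # Placeholder source: in practice this could be an API call or a database query.
    return pd.DataFrame([{"campaign": "A", "clicks": "12"},
                         {"campaign": "B", "clicks": None}])

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Data cleaning: drop incomplete rows and fix the data types.
    clean = raw.dropna().copy()
    clean["clicks"] = clean["clicks"].astype(int)
    return clean

def load(clean: pd.DataFrame) -> None:
    # Placeholder destination: in practice this would be a warehouse table.
    clean.to_csv("warehouse_table.csv", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```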
At my workplace currently, we make use of APIs that point to
different sources. What we have are different kinds of campaigns
being run on different advertising platforms, and these platforms
collect a lot of data related to the campaigns.
So we build pipelines that pull data from these platforms
and then store it in our own data warehouse,
which is on Google Cloud Platform.
So depending on your context, depending on your company, you might have in house
systems that contain transactional
data, and then you want to use them for analysis. So what you
have to do is to extract them and then probably do
some cleaning or transformation before loading them into your
data warehouse, which you can then use for data analysis. So we
want to automate this process as much as possible to save time
and ensure consistency and accuracy. The source data, for example
if it contains transactional data, might have some erroneous records
here and there, or some test cases stored that should not be used
for analysis. The transformation step is where we do the data cleaning:
we want to check that whatever data we have for our analysis
is as accurate as possible. The data warehouse should be
the single source of truth, and it should contain the most accurate data.
So today I'll be sharing a few use cases.
The first one is related to Towards Data Science.
For those who are not so familiar with Towards Data Science,
it is a publication hosted on Medium where different writers
write about what they have done and what they have learned
through data science projects, be it their own projects
or work-related ones. There are a lot of articles being published
on a day-to-day basis.
What I am trying to do here is to extract
the titles of these articles and store them
somewhere so that I can use them for further analysis later on.
As you know, when we want to do our own side projects
just to practice, there are some constraints,
for example a limited budget, where we might not be
able to afford a data warehouse or storage
space. So I think a good starting
point is to make use of Google Sheets. The Google Sheets API is
free, and Google Data Studio is also free.
So if you are considering starting
your own data engineering projects, this is something
you can consider. What I do here is the ETL process:
I extract from the Towards Data Science website
(the Medium publication), do some transformation,
data cleaning and data preparation in a Python
script, and after that load the clean data into a Google Sheet.
Subsequently, I can use Data Studio to connect to this Google Sheet
and present it as a dashboard that is friendly for deep dives.
This is what Towards Data Science looks like in
terms of the web interface. If you go to
towardsdatascience.com, you can see there
are many, many different articles published; this is just a screenshot
that I've taken. What we want to do is
extract the data related to the author,
the recency and the post title.
Subsequently, I added a column for the time extracted,
so I know exactly when I ran the script, or when
the script is run in the future, if I look back.
So we need to use some web scraping packages
within Python to extract the data from the website.
As you can see, I have imported BeautifulSoup
and Selenium; these are the two main
packages necessary for the web scraping.
numpy and pandas are used for the transformation,
and gspread and gspread-dataframe are used to push the data to
the Google Sheet.
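As a rough sketch (not the exact code from the slides), the import section of such a script could look like this, using the package names mentioned in the talk:

```python
import time
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup              # HTML parsing
from selenium import webdriver             # browser automation for dynamic pages
import gspread                             # Google Sheets API client
from gspread_dataframe import get_as_dataframe, set_with_dataframe
```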
I would say this talk will perhaps be more relevant to you
if you are already familiar with Python, so you
know exactly what the syntax means.
I have also shared the instructions on how to use
the Google Sheets API in this slide.
If you download the slides and click on the links here,
they will direct you to the GitHub page,
which contains more documentation on how to get started.
To use gspread and gspread-dataframe, you need to install
the two packages: run pip install gspread
and pip install gspread-dataframe.
I would say the instructions on the page
are pretty detailed and clear,
so if you just follow through, you should be able to get the
API working. There are no charges for using
this API, the Google Sheets API.
Over here, the authentication part is, I would
say, probably the most important part, because
if it doesn't work, then you wouldn't be able to do the
engineering side of it effectively. What we are
trying to do is automate the process, right? So if you can't
authenticate with the API, you won't be able to automate the process,
and the data won't get loaded in.
Next, what you see over here is
something I got from the tutorial,
the instructions on how to use the Google Sheets API.
As you can see, there are some parameters
you have to change. You can create a new Google Sheet
first and then look at the ID of the sheet;
it should be the last part of the URL.
You can copy that sheet ID and paste it here,
replacing it with whatever is applicable in your case.
And then over here, after you have set
up the API, there should be a credential token file,
a JSON file, which you have to download to your local machine.
In order to get the API working, the credentials
will be read from this JSON file. Currently we are using
the Spreadsheets API, so we need
to include it in our scopes. After we do the authentication,
the gspread and gspread-dataframe packages will work.
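A minimal sketch of that authentication step, assuming a service-account key file named credentials.json and a placeholder sheet ID; the actual slides may use a slightly different flow (for example the older oauth2client approach):

```python
import gspread
from google.oauth2.service_account import Credentials

# Scopes: we need the Spreadsheets API (the Drive scope is often included too).
scopes = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]

# Credentials are read from the JSON key file downloaded during setup.
creds = Credentials.from_service_account_file("credentials.json", scopes=scopes)
gc = gspread.authorize(creds)

# The sheet ID is the last part of the Google Sheet URL.
sh = gc.open_by_key("YOUR_SHEET_ID")
ws = sh.worksheet("Sheet1")
```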
Over here, I have already created the Google Sheet, and
then I am reading in Sheet1. I didn't
really change the name of the sheet, so if you change the name of
your sheet, you just need to change the worksheet
name from Sheet1 to whatever worksheet
name you have. I am getting whatever existing data
there is. If there is no existing data and it's an
empty spreadsheet, then there isn't
anything. But because I am going to do the update on
a daily basis, there is already existing data in the sheet,
so I want to retrieve it first, and then what I do next is
append more data into the sheet.
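Continuing that sketch, retrieving the existing data with gspread-dataframe could look like this (using the ws worksheet handle from above):

```python
from gspread_dataframe import get_as_dataframe

# Pull whatever is already in the worksheet into a pandas DataFrame,
# dropping the fully empty rows that come back from the blank grid.
existing = get_as_dataframe(ws).dropna(how="all")
```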
The next step, after I get the data from the sheet,
is to scrape the data from the website. This
is where web scraping comes in. I access the
Towards Data Science site first, and
you'll notice I actually have to make use of Selenium, because there are
certain parts of the website that we have to
trigger for it to load even
more. So how do I scrape the titles
off the website? I have to find
out which HTML element all this data
related to the title or author is tied to.
When I click Inspect, it shows me the HTML behind the page,
and I see that it is tied to the a tag; over
here you can see there is the a class.
As I mentioned just now, there are certain
parts of the website that require some interaction in
order for them to load. In this case I have
to click "show more" to load more articles,
because there are a lot of articles being published in one day.
Since I'm running this script on a daily basis, I want to make sure
that all the articles published in one day are shown to me.
If you just load the page on its own, I think
only about ten articles are shown without
clicking "show more". So I
have to make the script interact with the page
to click "show more" so that
I can get more articles scraped.
So I actually set an arbitrary number, 20,
where I click on the "show more" button 20
times. Every time I click "show more", I think
another ten articles will load, and then
I click "show more" again to load
the next ten articles. I do this
about 20 times, but this is not done manually by me;
it is done by the script, so it's automated and very
easy. In between each click I
have a lag of 2 seconds; you can see over
here there's a time.sleep(2). I do this
because I want to make sure that
the page doesn't hang and also doesn't lock me
out. Sometimes there are checks behind the scenes that
detect a bot, and in this case it is a bot,
but we want to try to overcome this and
make it look more human. By building in
this kind of delay, it mimics more
human behavior and ensures that the script is able to run
through.
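A minimal sketch of that click-and-wait loop, assuming a Chrome driver and a placeholder locator for the "show more" button (inspect the page for the real selector):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://towardsdatascience.com")

for _ in range(20):                      # arbitrary number of "show more" clicks
    try:
        # Placeholder locator: the real button text/selector may differ.
        button = driver.find_element(By.XPATH, "//button[contains(., 'Show more')]")
        button.click()
    except Exception:
        break                            # button not found, nothing left to load
    time.sleep(2)                        # 2-second lag to mimic human behavior

html = driver.page_source                # HTML after the extra articles have loaded
```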
Next, what I do over here is actually a loop: I
look for the a tags, extract
them, and get the text. And then over
here you can see there is another loop,
a second for loop, that I run through.
Let me just show you again what the
output is after I have scraped the text tied to the a tags.
This is what you will get: after you run
the script, you get the data returned on the
author, the recency and the title. There is
also some other, more random text
that appears, and it's not very consistent. Some
of this other text that you see
being returned is tied to, for example,
the caption of a photo or an image that
was added into the article.
What I realized is that there is a pattern: there are always
three blanks in between the data for each post.
What we need is just the first three elements
of each chunk that is separated by
the three blanks. So what
I did is to check whether the
current element is similar to the next two elements.
Over here you can see the check whether element i
equals element i+1, and also whether element i+1
equals element i+2. As I run through the loop, I check whether this
element is similar to the next one, which should
be a blank, and whether it's similar to the one below that
as well. If all three are similar, that means
they are all blanks, and that is the element position that I want.
I store all these element positions,
and then I do another loop to
extract the text tied to these element positions.
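As an illustrative sketch of that pattern check, continuing from the html captured earlier (the exact offsets of author, recency and title within each chunk are assumptions):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
texts = [a.get_text(strip=True) for a in soup.find_all("a")]

# Find positions where three consecutive blank strings mark the gap between posts.
positions = [
    i for i in range(len(texts) - 2)
    if texts[i] == texts[i + 1] == texts[i + 2] == ""
]

# Take the first three elements after each gap: author, recency, title.
records = []
for p in positions:
    chunk = texts[p + 3:p + 6]           # illustrative offsets
    if len(chunk) == 3:
        records.append(chunk)
```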
After that, I convert it into a data frame,
and I also add a new column with
the current time, just to let me know when this
job was run. Then, based on the latest
data that I have scraped, I append it to the existing
data in the Google Sheet. I check
whether there are duplicates based on author
and post title, because since I have arbitrarily chosen
to click "show more" 20 times, there could be
cases where the data was already scraped the
previous day. I just want to make sure
there are no overlaps, no duplicates, so I
drop the duplicates using pandas.
I also fill missing values with a space
if there are any, and I
drop rows where the index is
missing. In certain cases
the first row after sorting is empty,
so I drop it, and after that I update
my Google Sheet; this is what the last line of
code, the update call, is doing.
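A hedged sketch of that append, de-duplicate and update step, continuing from the existing data and records in the sketches above (the column names here are assumptions based on the talk):

```python
from datetime import datetime
import pandas as pd
from gspread_dataframe import set_with_dataframe

new_df = pd.DataFrame(records, columns=["author", "recency", "post_title"])
new_df["time_extracted"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Append to what was already in the sheet, then drop duplicate posts
# (the same article may be scraped again on consecutive days).
combined = pd.concat([existing, new_df], ignore_index=True)
combined = combined.drop_duplicates(subset=["author", "post_title"])
combined = combined.fillna("")

# Push the cleaned, combined data back to the worksheet.
set_with_dataframe(ws, combined)
```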
So this is the output based on the code
just now, where I am looping through
the element positions and getting the text tied to each position.
A small note: there are also a few cases
where the title is not
in the third line of the chunk that
is scraped for each post.
In certain cases the data is
not so clean, and it actually extracts the image
caption in
an earlier position, with the title appearing later.
Those captions generally have fewer than 20 characters,
so this is where I built in the if-else over here:
if the number of characters is less than 20,
that means the title should be in the seventh
position; if not, it should be
earlier. This is how I identify it.
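That caption-versus-title heuristic, as a small illustrative sketch (the 20-character threshold is from the talk; the element positions used here are placeholders):

```python
def pick_title(chunk):
    """Return the post title from a scraped chunk of text elements."""
    candidate = chunk[2]                 # title is usually the third element
    if len(candidate) < 20:              # likely an image caption, not a title
        return chunk[6]                  # fall back to a later position (placeholder)
    return candidate
```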
To automate the entire process, now that I have the
Python script that does the web scraping and updates
the Google Sheet, I want to make it run on a daily basis,
so I have to schedule it.
It's actually very easy: now that we have the Python script, you can just
open the command prompt.
What I have is a virtual environment, so I activate my virtual environment
first and then run my Python script. Otherwise,
you can also set up a cron job locally.
For example, in my case I'm using Windows,
so I can set it
up using the Windows Task Scheduler, following the
instructions over here. What you need to do is
create a basic task; this is what you will see when you launch
Administrative Tools and create a basic task, and
then there are certain boxes which you
have to fill in. You locate where your python.exe
is and paste that path over here, then
identify where your script is located
and paste the location of your script in
the arguments. The rest
of the settings in the Task Scheduler are pretty
straightforward, like when you
want to run it, what time, et cetera.
If you search this Stack Overflow link,
you should be able to see the full instructions.
So this is what we will get.
Of course, there are still certain cases where,
based on the logic that I've come up with, for example the check
on the number of characters, there could still be outliers here
and there. In those cases some manual intervention
might still be needed, but on the whole
I would say this is 90% or even
95% cleaned data.
From this Google Sheet we can then
use Google Data Studio to connect to the sheet
and create a nicer-looking interface
for users to play around with and explore the data.
Over here you can see
that what I have done is collect the
data over a six week period, between 21st November 2020
and 22nd January 2021. And yeah,
there are so many articles being published in a day; to
me it is pretty crazy. So maybe just give me a
while while I load the Data Studio page.
This is what we have in the Google Sheet.
What's interesting is that there are multiple
articles on pretty similar topics. From here,
if you're interested, you can make use of the data to
do some text mining to identify some
of the more popular domains within
data science at the moment, whether relating
to certain kinds of tools or
algorithms; you can search for keywords within
these articles. As you can see, there are really
quite a number of articles published over six
weeks, over 1,000.
It's about 1,680
over the six week period. So now I'll
be talking about the second use case: creating
a blended face based on multiple images.
In this case we are also getting data,
in this case images, off the Internet, and I
am saving them into my local drive, my local folder.
What I did after that was to make use
of an open source library hosted
on GitHub to create a blended face.
In this case, the process is a little
bit different in the sense that we do the extract
and load first before we do the transformation.
There is actually no single rule when it comes to
data engineering; it doesn't necessarily
have to be ETL, extract, transform, load.
It can also be extract and load first, before doing the
transformation. Some organizations
actually have data lakes: they have data
from different places and just pull it all into
one single location first, which they call the data
lake. It can be clean
or unclean data altogether in this data lake, and then they
do the transformation after that.
But based on my work experience so far,
I typically do ETL most of
the time, which is extract, transform, then load
the data. Again, it depends on your organization,
your structure, and how you actually
want to store the data.
In this case, what I'm
trying to do is create a blended,
average face based on the
candidates of the different political parties in
Singapore. Last year, in 2020,
there was a general election, and The Straits Times, which is
a news outlet,
designed a website to
let people know who the candidates from the different political
parties are and where they are contesting.
They uploaded images of all these
political candidates, and the images are publicly
accessible. So what I did was to scrape the
images of all these political candidates
off this website into my local drive.
Again, I made use of the web scraping tools
BeautifulSoup and Selenium. I also
imported re, which is for regular expressions,
because I need to extract the format of the image
later based on the URL.
There is also this library I use, urllib.request,
for accessing the website. What you see over
here is the same thing as in the
previous use case: this page is also
interactive, in the sense that I have to actually scroll down
in order for the page to load entirely. What
we need to do when we use Python to access it
is to include more arguments in the options,
so that when we use the webdriver to
access the page, it will allow the page to
load entirely. What I added over here is the headless
argument and disable-gpu, to overcome this.
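A minimal sketch of that setup, with a placeholder URL standing in for the candidates page used in the talk:

```python
import re
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")       # run Chrome without a visible window
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)

# Placeholder URL: the talk scrapes the news outlet's GE2020 candidates page.
driver.get("https://example.com/ge2020-candidates")
soup = BeautifulSoup(driver.page_source, "html.parser")
```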
Similarly, using BeautifulSoup,
I look for the tag that all these images are
tied to, which in this case is img, and then
I run a loop to scrape the images. Also,
as you can see, there's a bit of trial and error here, where I need
to identify the location of the first image that I'm interested in,
because when I extract the information tied
to the img tag, there is other information,
other images that I am not interested in, for example
the map. After a bit
of trial, I identified that the first image that I want is
in the very first position of
the img tags, and then there are a certain
number of images I need to scrape; I identified that the last
element is at position 369.
So I run the loop and then I
basically filter out things that are not necessary. You can see
there is also additional subtext,
like alt attributes that are blank, and in
this case, this is where I skip those, because they
shouldn't be blank. Then I get the
source URL of the picture, the image,
and I scrape it off with the name
tied to the image. You can see
I also included
this [-4:], which extracts the format
of the image, like .jpg or
.png, so I can save it accordingly into my local drive.
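A hedged sketch of that download loop (the slice bound and the output folder are placeholders, and the [-4:] trick assumes four-character extensions like ".jpg" or ".png"):

```python
import os
import urllib.request

os.makedirs("candidate_images", exist_ok=True)   # placeholder output folder

imgs = soup.find_all("img")

# Placeholder slice: the talk identifies by trial and error which img tags
# correspond to candidate photos (from the first position up to the last one).
for img in imgs[:369]:
    name = img.get("alt", "").strip()
    src = img.get("src")
    if not name or not src:
        continue                                  # skip blank alt text / missing source
    ext = src[-4:]                                # e.g. ".jpg" or ".png" from the URL
    urllib.request.urlretrieve(src, f"candidate_images/{name}{ext}")
```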
After all the images have been scraped
successfully, I make use of this open source library
created by this person and hosted on GitHub,
which I'm very thankful
for because it makes things much easier. I think this
is the wonder of open source, right? You can do
a lot of things much quicker and much easier based on
what other people have developed or created.
If you go to this URL,
you will see the scripts
developed to create a blended image,
an average face. What you need is just to
change the file path to point
to those images where you have saved them locally, and then
you run the script, you run the package.
The instructions are all on
GitHub for this library, so if you
go through the instructions on how to
download it and how to run it, you'll be able to get the
output very quickly as well. This is what I got
after running the script:
for each political party with at least ten candidates, I created
one face, which is the average face.
So, just wrapping up in terms
of data engineering projects: first, we have to
start by thinking about what analysis you or your
data science team want to do. From there, we think
about what data is necessary, and then we write a Python script to automate
the ETL or ELT process.
This allows us to
generate insights much quicker based on the data that we have
collected so far. I actually maintain
a data science blog called Data Double Confirm, and over there
I cover a lot of other things related to
various data science tasks. Web scraping is
one big component related to data collection,
because while there is a lot of
development in the algorithms
realm and also in deep learning,
sometimes what we need is the data, which is
very important, and which is why I'm still talking about data
engineering. From there, with the data we have, we can
perform data visualization or statistical analysis
that gives us insights and helps
us make more informed decisions. This
is the URL of my data science
blog, and feel free to connect with me via
LinkedIn or Twitter. Thank you.