Data science is the talk of the town today. It came into the picture around 2008, when advances in the internet and device connectivity produced an immense flow of data. This created a need for professionals who could handle and analyse huge volumes of data of all kinds.
Since then, data science has gained great momentum, and a number of myths and facts have come to be associated with it. We shall discuss some of them in this blog.
Some people do not hesitate to call data science a fad, owing to its fast-growing success and popularity. I strongly disagree, because a fad is something short-lived, whereas data science is here to stay for as long as one can imagine.
The volume of data worldwide is growing at an astronomical rate. Data produced in the last two years alone is greater in volume than all of the data produced before that. This clearly shows how important data science is for handling this eruption of data across platforms.
All this, and there is much more to write in praise of data science.
But First, What is Data Science?
Simply put, data science is the scientific approach to handling data and making the most of it to drive a business. Getting from handling data to making the most of it is the real heroic feat in a world of messy and prodigious data.
The scientific approach mentioned above encompasses various tools, machine learning, algorithms and a fair bit of analytical skill, which together put wind in the sails of a business. The data collected is cleaned and analyzed with the help of effective tools and techniques. See a detailed explanation of data science here.
Data – The Game Changer
Data science was born out of data. The volume of data of every type today cannot be underestimated. Data is truly the game changer that produced this entirely new field. It is therefore worthwhile for anyone interested in data science to learn some facts about data itself before moving on to the facts about data science. With that in mind, let us check out a couple of knowledge bytes about data.
What is the Biggest Data Unit You Know?
Since childhood we have read about the different units of digital data, from the bit and the byte up to the gigabyte or terabyte. But it is exciting to learn the units used to measure all the big data that is floating around and driving the engines of big and small businesses.
Here is a table of some data units worth a look.
|Abbreviation||Unit||Value||Size (in bytes)|
|b||bit||0 or 1||1/8 of a byte|
|B||byte||8 bits||1 byte|
|KB||kilobyte||1,000 bytes||1,000 bytes|
|MB||megabyte||1,000² bytes||1,000,000 bytes|
|GB||gigabyte||1,000³ bytes||1,000,000,000 bytes|
|TB||terabyte||1,000⁴ bytes||1,000,000,000,000 bytes|
|PB||petabyte||1,000⁵ bytes||1,000,000,000,000,000 bytes|
|EB||exabyte||1,000⁶ bytes||1,000,000,000,000,000,000 bytes|
|ZB||zettabyte||1,000⁷ bytes||1,000,000,000,000,000,000,000 bytes|
|YB||yottabyte||1,000⁸ bytes||1,000,000,000,000,000,000,000,000 bytes|
- If someone asks the fundamental question, “How much data exists in the world today?”, there is probably no definite answer. But it is estimated that by the year 2025, 463 exabytes of data will be created daily.
- We create new data every millisecond. We make about 40,000 search queries on Google each second, which amounts to roughly 1.2 trillion searches per year.
- Data coming into any field or business is no longer just data; it is Big Data.
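To get a feel for these units, here is a minimal Python sketch (my own illustration, not taken from any particular tool) that converts between the decimal units in the table and redoes the back-of-the-envelope arithmetic behind the figures quoted above. The helper names `to_bytes` and `humanize` are made up for this example, and a year is assumed to have 365 days.

```python
# Decimal (SI) data units from the table above, as successive powers of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value expressed in `unit` to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


# 463 exabytes per day (the 2025 estimate quoted above), scaled to a year.
daily_bytes = to_bytes(463, "EB")
yearly_bytes = daily_bytes * 365
print("Data created per year:", humanize(yearly_bytes))

# 40,000 searches per second, scaled to a 365-day year.
searches_per_year = 40_000 * 60 * 60 * 24 * 365
print(f"{searches_per_year / 1e12:.2f} trillion searches per year")
```

At 40,000 queries a second the script arrives at about 1.26 trillion searches a year, in line with the roughly 1.2 trillion quoted above, and 463 exabytes a day adds up to well over a hundred zettabytes in a year.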
Since data science revolves around data and is gaining remarkable weight, it has its own trail of interesting trivia, just like data, that deserves attention. If you are a data science enthusiast, this article is sure to endear the field to you even more.
Here I list the most intriguing facts about data science, which will give you a closer view of the subject. Let us know in the comments below which one is your favourite trivia bite!
Data Scientists and Data Analysts are NOT the Same
This is a common myth among people with only a superficial idea of data science. In reality, the work of data scientists and data analysts is quite different. Whereas data analysts work on finding trends and analyzing the data, data scientists work on finding the cause of a trend and forecasting upcoming trends. As data science is a new field, it is inevitable that certain misconceptions pop up.
However, it is worth noting that the two work in tandem. They complement each other and work towards a common goal. Now let us check out some of the basic differences between the two.
|Data Scientist||Data Analyst|
|Discovers unexplored questions that may need an answer.||Uses existing information to get workable data on existing questions|
|Skillset: Algorithms, data mining, programming, database management, data analysis, machine learning, predictive analysis||Skillset: Data mining, modeling, programming, statistical analysis, database management, data analysis|
|They estimate unknown data||They work with known datasets|
|They choose to address business problems that would have maximum effect||They address the business problem assigned to them|
|They work at macro level||They work at micro level|
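As a toy illustration of the difference (a simplified sketch, not a description of how either role actually works day to day), the Python snippet below first describes the trend in data that is already known and then extends that trend to estimate a value that has not been observed yet. The monthly sales figures are invented, and the helper `fit_line` is a plain least-squares fit written for this example.

```python
# Hypothetical monthly sales figures, made up for illustration.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(months, sales)

# Working with known data: what is the trend so far?
print(f"Sales grew by about {slope:.1f} units per month")

# Estimating unknown data: what might month 7 look like if the trend holds?
forecast = slope * 7 + intercept
print(f"Forecast for month 7: {forecast:.1f}")
```

With these made-up numbers the fitted trend is 10 units per month and the forecast for month 7 is 160. Real forecasting involves far more than extending a straight line, but the split between describing known data and estimating unknown data is the same.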
Data is Never Clean
That’s true. Data is nasty. Even when data is collected and cleaned with an eagle eye, one discrepancy or another creeps in at some point. Data scientists know how to work with chaos and noise in data, cleaning it along the way.
About Dirty Data –
Dirty data is of one or more of the following forms-
“It’s way more than just errors. It can break your data science project.”– towardsdatascience.com
Dirty collected data is one problem. But the bigger problem is joining multiple datasets into a single entity. The data may have been collected from different sources by different people, software, devices and so on, so there is a high chance that the datasets are not consistent with one another. The join key may not be consistent, or the format may differ between systems. Data scientists clean the entire data by re-formatting, screening, organising and so on.
“You will spend most of your time cleaning and preparing data.”– Kamil Bartocha, head of data science R&D
The question now is: if data is so unclean, how is any analysis done on it? Well, at this point it is fair to say that, in the end, data is clean enough to reach a desired outcome.
Several data cleaning techniques are applied at every step to reach the least dirty form of the data, and this becomes the basis of the final analysis.
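The join problem described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the two record lists, their field names and the date formats are hypothetical, and real projects typically lean on dedicated libraries for this kind of work.

```python
from datetime import datetime

# Two hypothetical sources describing the same customers. The join key and
# the date format differ between the two systems.
crm_records = [
    {"customer_id": " C-001 ", "signup": "2021-03-05"},
    {"customer_id": "c-002", "signup": "2021-04-17"},
]
billing_records = [
    {"cust": "C001", "last_payment": "05/06/2021"},  # day/month/year
    {"cust": "C-002", "last_payment": "21/06/2021"},
]


def normalize_key(raw):
    """Trim whitespace, upper-case and drop hyphens so that keys line up."""
    return raw.strip().upper().replace("-", "")


def normalize_date(raw, fmt):
    """Parse a date written in the source's format and return it as ISO text."""
    return datetime.strptime(raw, fmt).date().isoformat()


# Index the billing records by their cleaned key.
billing_by_key = {
    normalize_key(row["cust"]): normalize_date(row["last_payment"], "%d/%m/%Y")
    for row in billing_records
}

# Join the two sources on the cleaned key.
merged = []
for row in crm_records:
    key = normalize_key(row["customer_id"])
    merged.append(
        {
            "customer_id": key,
            "signup": normalize_date(row["signup"], "%Y-%m-%d"),
            "last_payment": billing_by_key.get(key),  # None if there is no match
        }
    )

for record in merged:
    print(record)
```

Without the two normalizing helpers, none of the keys in this toy data would match, and every customer would come out of the join with no payment attached.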
You Do Not Need to Be Tech Savvy or Hold a PhD to Learn Data Science
Data science sounds like a field for tech-savvy professionals, and this leads to the common misbelief that to be eligible to learn data science, one needs to have a super brain or hold a PhD.
This is absolutely incorrect. As a matter of fact, anyone of average intelligence can learn data science.
Learning data science involves upskilling in the following fields –
- Statistical modelling
- Predictive modelling
- Machine learning
This is the theory behind learning data science. But it would be interesting to hear a few words about data scientists straight from the horse’s mouth.
Joma, a famous YouTuber and an experienced data scientist at a GAFA (Google/Amazon/Facebook/Apple) company, describes in a video what it takes to learn data science and become a data scientist. I’ll summarize his views in the points below:
- One does not need a degree from a very high-profile university.
- Data scientists come from different backgrounds, such as electrical engineering and economics. Some don’t even have a science degree.
- To learn data science, one can do an internship or study basic stats.
- Other things, like programming, algorithms and statistics, can be easily picked up along the way.
- One needs the empathy to ask the right questions about the data.
- One needs to learn to write correct SQL queries and also learn a bit of Python, which anyone can do with the right approach.
- Data scientists work a lot with data, SQL queries and presentations.
In a nutshell, data science is not as heavy as it may look. An empathy towards possibilities is the main requisite. The rest falls into place as you learn.
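To give a flavour of the kind of SQL mentioned above, here is a small, self-contained sketch that runs a query from Python using the standard-library `sqlite3` module. The `orders` table and its rows are invented for illustration, and this is my own example rather than one taken from Joma’s video.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 60.0), ("carol", 200.0)],
)

# A typical analysis question: how much has each customer spent in total?
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(f"{customer}: {n_orders} order(s), total {total:.2f}")
```

Grouping, counting and summing like this covers a surprising share of everyday questions, which is part of why SQL keeps coming up as something worth learning early.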
Data Science is Not Just Excel Sheets
In contrast to the belief discussed above, and surprising as it may seem, many people are of the opinion that the life of a data scientist revolves around Excel sheets.
This is anything but true. As mentioned before, data science is a vast field whose basic focus is on the correct and intended outcome, and data science professionals fight tooth and nail to get it. They use different data analytics techniques, SQL queries, statistical analysis, predictive analysis and what not.
They do work on Excel sheets, but that is just a small part of their work.
There was a time when Excel sheets played a major role in performing analysis and arriving at conclusions using formulae and calculations. At present, with the easy availability of programming tools like Python and R, most data scientists spend a great portion of their time coding rather than working on Excel sheets.
Data Science Competitions and Real Life Projects are Different
Success in a data science competition (e.g. through an online platform like Kaggle) may boost one’s confidence so much that one starts thinking of landing a data science career. But it is important to understand that there is quite a lot of difference between a competition and a real-life scenario.
“I remember I was a little bit overwhelmed when on my first real-life project all the models that typically worked well on Kaggle, miserably failed. I wish I was prepared for this.”– Sergii Makarevych, data scientist
Here is a list of a few differences between the two:
|Data science competitions||Real-life projects|
|Number of datasets is limited||There is no limit on data and datasets. It’s the data that matters.|
|In online competition platforms, a warning is given when you have made an error||There is no warning. You only learn after you have committed a mistake and borne the consequences. You go back all over again and do some data cleaning and rework.|
|You need to write the code just once||You need to rewrite the code every 5-15 minutes.|
|You do not need to deploy your model.||You definitely deploy your model.|
|There is no authentication or security||Authentication and security are as important as the data itself.|
So, it would be safe to say that competitions do give fair practice for data science. But that is not enough. You need to get your hands dirty and work on live, real-time projects to grasp the true essence of data science.
More Data Does Not Always Mean More Accuracy
I am tempted to use a cliche here: quantity does not always mean quality.
Let us understand this point with reference to data, using a bottom-up approach.
Suppose we have a dataset with exactly the minimum amount of data needed to make a correct analysis. This would be an ideal dataset. Now if we add some more data, the entire dataset will need to be reconstructed to take the new data into account as well. While reconstructing, we will need to clean the new data and spend time understanding how it deviates from the existing set, if at all.
Even after the new data is cleaned and merged into the existing ideal dataset, some new element may still be dirty but unidentified. This will degrade the final result or analysis.
In this case, less data was surely better than more data.
Hence, more data doesn’t mean more insight or more added value. Using smart data is the key.
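The scenario above can be played out with a handful of numbers. In this hypothetical Python sketch, a small clean set of height measurements is merged with a larger batch in which a few rows were silently recorded in inches instead of centimetres. All values, including the assumed true average of 170 cm, are invented for illustration.

```python
from statistics import mean

# Assumed true average height of the group, in centimetres (hypothetical).
TRUE_MEAN_CM = 170.0

# A small, carefully cleaned dataset (values in centimetres).
clean = [168.0, 171.0, 169.0, 172.0, 170.0]

# A larger batch from another source. Some rows were recorded in inches,
# but nothing in the data flags that, so they slip past cleaning.
new_batch = [170.0, 169.0, 171.0, 67.0, 66.5, 68.0, 170.0, 67.5]

small_estimate = mean(clean)
large_estimate = mean(clean + new_batch)

small_error = abs(small_estimate - TRUE_MEAN_CM)
large_error = abs(large_estimate - TRUE_MEAN_CM)

print(f"{len(clean)} clean rows: mean {small_estimate:.1f} cm, error {small_error:.1f} cm")
print(f"{len(clean + new_batch)} mixed rows: mean {large_estimate:.1f} cm, error {large_error:.1f} cm")
```

Here the estimate from 13 rows is off by more than 30 cm, while the five clean rows land on the assumed true value. More rows helped nothing, because the extra rows carried an error nobody had spotted.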
The Data Science Field Has Different Roles, Not Just Data Scientists
Many people associate data science with data scientists only, ignoring the other prominent roles in the field.
Data science includes all of these –
- Data engineer – They are responsible for managing data infrastructure throughout the data science lifecycle. Basic skills include programming languages like Python, database tools like NoSQL and big data tools like Hadoop.
- Data analyst – They find answers to questions by working through the available data, using appropriate tools. Basic skills include programming, data visualisation, statistics, mathematics and of course data analysis.
- Data scientist – Data scientists work on big data, analyse it and then communicate the findings through reports and presentations. Basic skills include statistics, mathematics, programming, data visualisation, SQL, Hadoop and machine learning.
You can find more information about them here.
Apart from these, you can also build a career in data science through various other roles.
Data Science is Not Meant Only For Large Organizations
Many businesses believe that data science is meant only for big organizations with high-class infrastructure.
This belief stems from a wrong notion about data science. Data science is not made up of machines, heavy tools or the size of working resources. Rather, it is made up of big data, statistics, analysis, programming, presentation and some smart people who know how to make the best of data and add value to the organization. It has nothing to do with whether the organization is big or small.
A data scientist needs to arrive at a result that benefits the company, and no one really cares which tools and techniques were used to achieve that result.
Coming to infrastructure, all that is needed is a computing device, an internet connection and some tools that help through the data science life cycle. A number of open source tools are available online and can be downloaded to get the ball rolling.
Data Science Needs Great Communication Skills
Communication and presentation play a key role in data science.
Communication here refers to two areas –
- Coordinating within and among teams during the different stages of the data science life cycle.
- Presenting the final outcome in the most comprehensive and lucid manner.
Without proper communication, the entire exercise may prove futile and may not translate into any substantial product. It is important to learn public speaking, as a lot of presentations are involved.
Also, learning to write better and more crisply enhances one’s visibility in and around the organization.
Writing involves –
An analysis without proper communication, in writing or otherwise, is just a placeholder with no significance.
Data Science is Not for Everyone
Let me first throw some light on what a data science interview smells like.
Again, I have taken this information from the famous YouTuber and data science expert, Joma.
Data science job interview questions revolve around the following –
- SQL or simple coding, such as in Python. Interviewers want to make sure you know these because you would be doing a lot of both on the job.
- A quantitative analysis or math question involving statistics, probability or linear algebra.
- Some college-level math topics like Bayes’ theorem, probability distributions, the law of large numbers, linear regression etc.
- Product interview: they give you a hypothetical product and ask how you would improve it.
So did the questions smell sweet or sour?
The topics may seem a bit overwhelming to an absolute novice in this area. But those who are actually prepared and ready to jump into the pool of data science will find it interesting to glance over such interview topics.
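To give a taste of one of these topics, here is a small worked example of Bayes’ theorem in Python. The fraud-detection scenario and all of its numbers are hypothetical, chosen only to make the arithmetic easy to follow; it is a sketch of the kind of question that could come up, not an actual interview question.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive signal) using Bayes' theorem.

    prior               -- P(condition)
    sensitivity         -- P(positive | condition)
    false_positive_rate -- P(positive | no condition)
    """
    # Total probability of seeing a positive signal at all.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# Hypothetical numbers: 1% of transactions are fraudulent, the detector flags
# 90% of fraudulent transactions and 5% of legitimate ones.
posterior = bayes_posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"P(fraud | flagged) = {posterior:.3f}")
```

Even with a detector that catches 90% of fraud, only about 15% of flagged transactions are actually fraudulent under these assumptions, because legitimate transactions vastly outnumber fraudulent ones.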
However enchanting it may look to a beholder, the data science field is no cakewalk. Even preparing for the field needs a good amount of affinity for data.
There are lots of videos and articles on the web suggesting anyone can be a data scientist. That is true, with certain conditions. It is always a good idea to first ask yourself why you want to be in this field, and to do a reality check before taking a blind leap.
Introspection at the start is a great virtue for a successful stint in any field.
Data science is becoming indispensable with the data explosion in almost every field, and it offers a good career opportunity. Thinking of data science as a career option can be a wise decision for anyone who enjoys problem solving and has data empathy.
As cool as it sounds, it has immense potential for both businesses and job seekers. But it is advisable not to fall for wrong information about the field.
With its growing popularity, data science has picked up some myths, which we looked at along with some interesting facts. Let me know in the comments if I missed any point.