How much of data wrangling is a data scientist's job?

I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.

Data pipelines are non-existent within the company; the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".

Initial data analysis for the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time is spent on me trying to parse these files and cross-reference data sources in order to build viable datasets. This is not a problem of simply handling missing data or preprocessing it; it's about the work it takes to build data that can be handled in the first place (solvable by a DBA or data engineering, not data science?).

1) Feels like most of the work is not related to data science at all. Is this accurate?

2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?

3) Is this type of setup common for a company with serious data science needs?

edited 2 days ago

asked Apr 3 at 15:16

Victor Valente

29028

New contributor

Did you specify which format you want the information in? And give them instructions on how they can do this with their ERP?
– jonnor
Apr 3 at 19:57

@jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
– Victor Valente
Apr 3 at 20:12

Start tracking your hours and convert them to a cost showing how much of your time they're wasting on converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
– Nelson
Apr 4 at 3:05

If it is a burden on your time you could outsource it.
– Sarcoma
Apr 4 at 6:23

I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger of keeping data in long TXT files without real security around it.
– Pedro Henrique Monforte
yesterday

data-wrangling


9 Answers

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another at reading certain data. E.g. R's readxl is orders of magnitude faster than Python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.
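To make the "import once, then save to a friendlier format" idea concrete, here is a minimal sketch in Python using only the standard library. It assumes a pipe-delimited text export with a header row; the column names, the `transactions` table name and the helper `load_text_to_sqlite` are all made up for illustration and are not taken from the question.

```python
import csv
import io
import sqlite3


def load_text_to_sqlite(lines, conn, table, delimiter="|"):
    """Copy a delimited text export into a SQLite table.

    `lines` is any iterable of text lines whose first line is the header.
    Column names come from the header; every column is stored as TEXT.
    Returns the number of rows now in the table.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    # Skip rows whose field count does not match the header (e.g. blank lines).
    rows = (row for row in reader if len(row) == len(header))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Hypothetical sample standing in for one of the TXT dumps.
sample = io.StringIO(
    "store_id|sku|amount\n"
    "17|A-100|12.50\n"
    "17|B-220|3.99\n"
    "42|A-100|12.50\n"
)
conn = sqlite3.connect(":memory:")
row_count = load_text_to_sqlite(sample, conn, "transactions")

# Once loaded, the data can be queried with plain SQL instead of re-parsing text.
totals = conn.execute(
    "SELECT store_id, SUM(CAST(amount AS REAL)) "
    "FROM transactions GROUP BY store_id ORDER BY store_id"
).fetchall()
```

With a real file you would pass an open file object instead of the `io.StringIO` sample and a file path instead of `":memory:"`, so the parsing cost is paid once rather than on every analysis.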

From a practical standpoint, two things:

First of all, understand what is technically possible. In many cases, the people telling you no are IT-illiterate people who worry about business or compliance considerations, but have no concept of what is and isn't feasible from an IT standpoint. Try to speak to the DBAs or to whoever manages the data infrastructure. Understand what is technically possible. THEN, and only then, try to find a compromise. E.g. they won't give you access to their system, but I presume there is a database behind it? Maybe they can extract the data to some other formats? Maybe they can extract the SQL statements that define the data types etc?

Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 4 at 12:40

Stephen Rauch

1,52551330

answered Apr 4 at 12:29

PythonGuest

2062

New contributor

Excellent point about finding / building an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
– Jason
Apr 4 at 13:35

True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one-off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
– PythonGuest
Apr 4 at 13:58

Many blogs, companies and papers acknowledge this situation as a common reality.

The paper Data Wrangling for Big Data: Challenges and Opportunities has a quote about it:

data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data.

You can also read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.

Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file on which you can start the data exploration after a little bit of cleaning. You usually find the data in a format that is not suitable for your needs.

What you can do is make use of the old data as much as you can, and try to move the storing of new data to a process that will be easier for you (or a future colleague) to work with.

edited Apr 4 at 12:37

Stephen Rauch

1,52551330

answered Apr 3 at 16:35

Tasos

1,51011138

Forbes article claiming the same 80% figure.
– Jesse Amano
Apr 3 at 19:08

Forbes should nowhere be mentioned together with the words "data science".
– gented
Apr 3 at 22:52

50-80% based on (quote) "interviews and expert estimates"
– oW_
Apr 3 at 23:33

@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
– Keeta
Apr 4 at 11:46

Feels like most of the work is not related to data science at all. Is this accurate?

This is the reality of any data science project. Google researchers described it in the paper "Hidden Technical Debt in Machine Learning Systems": https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

[Image: figure from the paper]

The paper's conclusion reflects my experience as well. The vast majority of the time is spent acquiring, cleaning and processing data.

answered Apr 3 at 16:47

Shamit Verma

1,4291214


As another recent starter in Data Science, I can only add that I don't think your experience is unique. My team of about 10 apparently hasn't done any DS in over a year (apart from one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team has been working on, which still isn't quite delivering the data. Apparently retention has been fairly poor in the past, and there's a continuous promise of a holy-grail MS Azure environment for future DS projects.

So to answer:

1) Yes, totally accurate.

2) No, you're correct, but it's an uphill battle to get access to the data you want (if it even exists).

3) I'm sure there are companies out there that are better than others. If you can't stand it at your current company, 2 years is a decent length of time, so start looking for brighter things (be careful how you phrase your desire to leave your current job; something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").

answered Apr 3 at 23:03

Oliver Houston

1511

New contributor


Feels like most of the work is not related to data science at all. Is this accurate?

Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster and more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job where the data is already in better order.

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).

Is this type of setup common for a company with serious data science needs?

Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.

answered Apr 4 at 19:40

Underminer

1314


If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.

You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.

Of course, at the end of the day, somebody has to do the data wrangling. It may be that the best person for the job, the cheapest, fastest and easiest option, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?

answered Apr 4 at 21:05

Whelibeiren

542


Perhaps to put it simply:

When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

You need to work with and understand your data, which includes everything from simple stuff like fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from being collected to being displayed. Processing it requires knowing the same pieces of information, so it is partially work you would have had to do anyway.
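As a small illustration of the "fixing inconsistencies" part, the sketch below maps the different spellings of a missing value (NULLs, empty strings, "-") to a single `None`. The token list and the sample record are assumptions made for the example; a real list has to come from looking at your own files.

```python
# Spellings treated as "missing". This set is an assumption; extend it
# after inspecting the actual exports.
MISSING_TOKENS = {"", "-", "null", "none", "na", "n/a"}


def normalize_missing(value):
    """Return None for any spelling of 'missing', else the stripped string."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_TOKENS else text


def clean_record(record):
    """Apply normalize_missing to every field of a dict-like record."""
    return {key: normalize_missing(val) for key, val in record.items()}


# Hypothetical raw row as it might come out of a text dump.
raw = {"customer": " Ana ", "phone": "-", "city": "NULL", "notes": ""}
cleaned = clean_record(raw)
```

Keeping the rule in one function means every dataset built later treats missing values the same way, which is also what lets you answer a reviewer's question about a particular field.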

Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea. I think keeping an intermediate dataset of processed data would be useful, if you're allowed to (and if you can't do it in MySQL).

But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.

answered Apr 4 at 22:51

David M

1212

New contributor


1) Feels like most of the work is not related to data science at all. Is this accurate?
In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of the data wrangling a data scientist should be required to do. It depends on the organization's bandwidth and on the person's interest in doing such work. In my 15 to 16 years of experience as a DS, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis. So take your call.

2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
Again, it depends on the organization's security policies. They cannot leave everything to you, and they have their own security concerns about revealing the data to someone who is a temporary employee (sorry to use these words :-()

3) Is this type of setup common for a company with serious data science needs?
I feel these kinds of companies need the most attention from data scientists, to make them feel that data-driven modeling is the future that will sustain their business. :-)

I have given my inputs from a business standpoint rather than a technical one. :-)
Hope I am clear in my choice of words.

answered 2 days ago

user70920

211

New contributor


In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides).

He says that there are a number of open problems in this area:
Ingest,
Transform (e.g. euro/dollar),
Clean (e.g. -99/Null),
Schema mapping (e.g. wages/salary),
Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)
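To show what three of the listed steps look like in practice, here is a toy sketch in Python of cleaning, schema mapping and entity consolidation. It is not how the products mentioned in this answer work; the alias table, the sentinel set and the 0.8 similarity threshold are all assumptions chosen for the example, and `difflib.SequenceMatcher` is just a simple stand-in for real entity resolution.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from source column names to one canonical name.
COLUMN_ALIASES = {"wages": "salary"}

# Values that stand for "missing" in the source files (assumed).
SENTINELS = {"-99", "NULL", ""}


def map_schema(record):
    """Rename known alias columns to their canonical name."""
    return {COLUMN_ALIASES.get(key, key): val for key, val in record.items()}


def clean(record):
    """Replace sentinel values with None."""
    return {
        key: (None if str(val).strip() in SENTINELS else val)
        for key, val in record.items()
    }


def same_entity(name_a, name_b, threshold=0.8):
    """Guess whether two spellings refer to the same entity.

    Uses a character-level similarity ratio; the threshold is arbitrary
    and needs tuning against real data.
    """
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


record = clean(map_schema({"name": "Mike Stonebraker", "wages": "-99"}))
is_match = same_entity("Mike Stonebraker", "Michael Stonebreaker")
```

Even this toy version shows why the area is hard: each step depends on choices (which aliases, which sentinels, which threshold) that only make sense after looking at the data.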

There are a number of companies/products trying to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata and Google Refine.

Until this area matures, a lot of the data scientist job will indeed be data wrangling.

answered yesterday

hojusaram

1211

New contributor



9 Answers
9

active

oldest

votes

9 Answers
9

active

oldest

votes

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a practical standpoint, two things:

first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc?

business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 4 at 12:40

Stephen Rauch

1,52551330

answered Apr 4 at 12:29

PythonGuest

2062

New contributor

1

$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
Apr 4 at 13:35

2

$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
Apr 4 at 13:58

add a comment |

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a practical standpoint, two things:

first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc?

business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 4 at 12:40

Stephen Rauch

1,52551330

answered Apr 4 at 12:29

PythonGuest

2062

New contributor

1

$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
Apr 4 at 13:35

2

$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
Apr 4 at 13:58

add a comment |

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a practical standpoint, two things:

first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc?

business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 4 at 12:40 by Stephen Rauch

answered Apr 4 at 12:29 by PythonGuest


Many blogs, companies and papers acknowledge this situation as real and common.

The paper "Data Wrangling for Big Data: Challenges and Opportunities" includes a quote about it:

data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data.

You can also read the source of that quote in this article from The New York Times, "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights".

What you can do is make as much use of the old data as you can, and try to change how new data is stored so that it will be easier for you (or a future colleague) to work with.

edited Apr 4 at 12:37 by Stephen Rauch

answered Apr 3 at 16:35 by Tasos

Forbes article claiming the same 80% figure.
– Jesse Amano
Apr 3 at 19:08

Forbes should nowhere be mentioned together with the words "data science".
– gented
Apr 3 at 22:52

50-80% based on (quote) "interviews and expert estimates"
– oW_
Apr 3 at 23:33

@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
– Keeta
Apr 4 at 11:46


Feels like most of the work is not related to data science at all. Is this accurate?

[Image from the paper; not available here]

The result of the paper reflects my experience as well. The vast majority of time is spent acquiring, cleaning and processing data.

answered Apr 3 at 16:47 by Shamit Verma


So to answer:

1) Yes, totally accurate.

2) No, you're correct, but it's an uphill battle to get access to the data you want (if it even exists).

answered Apr 3 at 23:03 by Oliver Houston


Feels like most of the work is not related to data science at all. Is this accurate?

Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data-generating process in order to use the data to drive solutions. Sure, someone specialized in ETL could do it faster and more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).

Is this type of setup common for a company with serious data science needs?

Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.

answered Apr 4 at 19:40 by Underminer


answered Apr 4 at 21:05 by Whelibeiren


Perhaps to put it simply:

When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

When peers review your findings and ask about particular bits of the data, would it embarrass you not to know the answers?

But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.
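
To make the binning question concrete, here is a minimal sketch of looking at the data first and only then choosing cut points. The `incomes` values are invented and the two helper functions are purely illustrative. It assumes Python 3.8 or later for `statistics.quantiles`, and it uses quantile-based bins as just one of several reasonable choices.

```python
import statistics
from bisect import bisect_right


def quantile_edges(values, n_bins=4):
    """Return the n_bins - 1 interior cut points for roughly equal-sized bins."""
    return statistics.quantiles(values, n=n_bins)


def assign_bin(x, edges):
    """Return the index of the bin x falls into, given sorted interior cut points."""
    return bisect_right(edges, x)


# Made-up, heavily skewed sample: one extreme value dominates the range.
incomes = [12, 15, 18, 21, 25, 30, 42, 55, 90, 400]

# Step 1: look at the data before deciding anything.
print("min / median / max:", min(incomes), statistics.median(incomes), max(incomes))

# Step 2: derive the cut points from the data, then assign each value a bin.
edges = quantile_edges(incomes)
bins = [assign_bin(x, edges) for x in incomes]
print("cut points:", edges)
print("bin per value:", bins)
```

Binning this sample blindly into four equal-width intervals between 12 and 400 would put nine of the ten values in the first bin, which is the kind of thing you only notice once you have wrangled and inspected the data yourself.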

answered Apr 4 at 22:51 by David M


I have given my input from a business standpoint rather than a technical one. :-)
I hope my choice of words is clear.

answered 2 days ago by user70920


In his talk "Big Data is four different problems", Turing Award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides).

A number of companies and products are trying to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata and Google Refine.

Until this area matures, a lot of a data scientist's job will indeed be data wrangling.

answered yesterday by hojusaram

