How much of data wrangling is a data scientist's job?2019 Community Moderator ElectionTools to perform SQL analytics on 350TB of csv dataTechnical name for this data wrangling process? Multiple columns into multi-factor single columnHow do you define the steps to explore the data?Which one is better performer on wrangling big data, R or Python?R Programming rearranging rows and colums from timeline dataHow do I split number string with digit pattern?How to work with string data with a lot of NAs in an aggregation task with RWhat is the difference between 'if the data if of good quality' and 'if the data is tidy'?how to calculate number of datapoints within a given time interval?How to deal with count data in random forest

Schwarzchild Radius of the Universe

Email Account under attack (really) - anything I can do?

Is there really no realistic way for a skeleton monster to move around without magic?

Can an x86 CPU running in real mode be considered to be basically an 8086 CPU?

How to add power-LED to my small amplifier?

How to calculate implied correlation via observed market price (Margrabe option)

The magic money tree problem

Is it possible to make sharp wind that can cut stuff from afar?

XeLaTeX and pdfLaTeX ignore hyphenation

Shell script can be run only with sh command

What is the offset in a seaplane's hull?

How can the DM most effectively choose 1 out of an odd number of players to be targeted by an attack or effect?

Why did the Germans forbid the possession of pet pigeons in Rostov-on-Don in 1941?

How old can references or sources in a thesis be?

What is the command to reset a PC without deleting any files

How did the USSR manage to innovate in an environment characterized by government censorship and high bureaucracy?

Are tax years 2016 & 2017 back taxes deductible for tax year 2018?

A newer friend of my brother's gave him a load of baseball cards that are supposedly extremely valuable. Is this a scam?

DOS, create pipe for stdin/stdout of command.com(or 4dos.com) in C or Batch?

Can I make popcorn with any corn?

How is the claim "I am in New York only if I am in America" the same as "If I am in New York, then I am in America?

A function which translates a sentence to title-case

How do I create uniquely male characters?

What is the meaning of "of trouble" in the following sentence?



How much of data wrangling is a data scientist's job?



2019 Community Moderator ElectionTools to perform SQL analytics on 350TB of csv dataTechnical name for this data wrangling process? Multiple columns into multi-factor single columnHow do you define the steps to explore the data?Which one is better performer on wrangling big data, R or Python?R Programming rearranging rows and colums from timeline dataHow do I split number string with digit pattern?How to work with string data with a lot of NAs in an aggregation task with RWhat is the difference between 'if the data if of good quality' and 'if the data is tidy'?how to calculate number of datapoints within a given time interval?How to deal with count data in random forest










37












$begingroup$


I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.



Data pipelines are non-existent within the company, the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".



Initial data analysis for the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time spent is me trying to parse these files and cross data sources in order to build viable datasets. This is not a problem of simply handling missing data or preprocessing it, it's about the work it takes to build data that can be handled in the first place (solvable by dba or data engineering, not data science?).




1) Feels like most of the work is not related to data science at all. Is this accurate?



2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?



3) Is this type of setup common for a company with serious data science needs?










share|improve this question









New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$











  • $begingroup$
    Did you specify which format you want the information on? And give them instructions on how they can do this with their ERP?
    $endgroup$
    – jonnor
    Apr 3 at 19:57










  • $begingroup$
    @jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
    $endgroup$
    – Victor Valente
    Apr 3 at 20:12






  • 13




    $begingroup$
    Start tracking your hours and convert it to a cost on how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
    $endgroup$
    – Nelson
    Apr 4 at 3:05










  • $begingroup$
    If it is a burden on your time you could outsource it.
    $endgroup$
    – Sarcoma
    Apr 4 at 6:23










  • $begingroup$
    I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger os keeping data into long TXT files without real security arround it
    $endgroup$
    – Pedro Henrique Monforte
    yesterday















37












$begingroup$


I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.



Data pipelines are non-existent within the company, the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".



Initial data analysis for the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time spent is me trying to parse these files and cross data sources in order to build viable datasets. This is not a problem of simply handling missing data or preprocessing it, it's about the work it takes to build data that can be handled in the first place (solvable by dba or data engineering, not data science?).




1) Feels like most of the work is not related to data science at all. Is this accurate?



2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?



3) Is this type of setup common for a company with serious data science needs?










share|improve this question









New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$











  • $begingroup$
    Did you specify which format you want the information on? And give them instructions on how they can do this with their ERP?
    $endgroup$
    – jonnor
    Apr 3 at 19:57










  • $begingroup$
    @jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
    $endgroup$
    – Victor Valente
    Apr 3 at 20:12






  • 13




    $begingroup$
    Start tracking your hours and convert it to a cost on how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
    $endgroup$
    – Nelson
    Apr 4 at 3:05










  • $begingroup$
    If it is a burden on your time you could outsource it.
    $endgroup$
    – Sarcoma
    Apr 4 at 6:23










  • $begingroup$
    I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger os keeping data into long TXT files without real security arround it
    $endgroup$
    – Pedro Henrique Monforte
    yesterday













37












37








37


11



$begingroup$


I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.



Data pipelines are non-existent within the company, the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".



Initial data analysis for the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time spent is me trying to parse these files and cross data sources in order to build viable datasets. This is not a problem of simply handling missing data or preprocessing it, it's about the work it takes to build data that can be handled in the first place (solvable by dba or data engineering, not data science?).




1) Feels like most of the work is not related to data science at all. Is this accurate?



2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?



3) Is this type of setup common for a company with serious data science needs?










share|improve this question









New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.



Data pipelines are non-existent within the company, the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".



Initial data analysis for the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time spent is me trying to parse these files and cross data sources in order to build viable datasets. This is not a problem of simply handling missing data or preprocessing it, it's about the work it takes to build data that can be handled in the first place (solvable by dba or data engineering, not data science?).




1) Feels like most of the work is not related to data science at all. Is this accurate?



2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?



3) Is this type of setup common for a company with serious data science needs?







data-wrangling






share|improve this question









New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited 2 days ago







Victor Valente













New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked Apr 3 at 15:16









Victor ValenteVictor Valente

29028




29028




New contributor




Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Victor Valente is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











  • $begingroup$
    Did you specify which format you want the information on? And give them instructions on how they can do this with their ERP?
    $endgroup$
    – jonnor
    Apr 3 at 19:57










  • $begingroup$
    @jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
    $endgroup$
    – Victor Valente
    Apr 3 at 20:12






  • 13




    $begingroup$
    Start tracking your hours and convert it to a cost on how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
    $endgroup$
    – Nelson
    Apr 4 at 3:05










  • $begingroup$
    If it is a burden on your time you could outsource it.
    $endgroup$
    – Sarcoma
    Apr 4 at 6:23










  • $begingroup$
    I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger os keeping data into long TXT files without real security arround it
    $endgroup$
    – Pedro Henrique Monforte
    yesterday
















  • $begingroup$
    Did you specify which format you want the information on? And give them instructions on how they can do this with their ERP?
    $endgroup$
    – jonnor
    Apr 3 at 19:57










  • $begingroup$
    @jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
    $endgroup$
    – Victor Valente
    Apr 3 at 20:12






  • 13




    $begingroup$
    Start tracking your hours and convert it to a cost on how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
    $endgroup$
    – Nelson
    Apr 4 at 3:05










  • $begingroup$
    If it is a burden on your time you could outsource it.
    $endgroup$
    – Sarcoma
    Apr 4 at 6:23










  • $begingroup$
    I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger os keeping data into long TXT files without real security arround it
    $endgroup$
    – Pedro Henrique Monforte
    yesterday















$begingroup$
Did you specify which format you want the information on? And give them instructions on how they can do this with their ERP?
$endgroup$
– jonnor
Apr 3 at 19:57




$begingroup$
Did you specify which format you want the information on? And give them instructions on how they can do this with their ERP?
$endgroup$
– jonnor
Apr 3 at 19:57












$begingroup$
@jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
$endgroup$
– Victor Valente
Apr 3 at 20:12




$begingroup$
@jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
$endgroup$
– Victor Valente
Apr 3 at 20:12




13




13




$begingroup$
Start tracking your hours and convert it to a cost on how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
$endgroup$
– Nelson
Apr 4 at 3:05




$begingroup$
Start tracking your hours and convert it to a cost on how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
$endgroup$
– Nelson
Apr 4 at 3:05












$begingroup$
If it is a burden on your time you could outsource it.
$endgroup$
– Sarcoma
Apr 4 at 6:23




$begingroup$
If it is a burden on your time you could outsource it.
$endgroup$
– Sarcoma
Apr 4 at 6:23












$begingroup$
I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger os keeping data into long TXT files without real security arround it
$endgroup$
– Pedro Henrique Monforte
yesterday




$begingroup$
I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger os keeping data into long TXT files without real security arround it
$endgroup$
– Pedro Henrique Monforte
yesterday










9 Answers
9






active

oldest

votes


















19












$begingroup$


  1. Feels like most of the work is not related to data science at all. Is this accurate?



    Yes




  2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



    You're not wrong, but such are the realities of real life.




  3. Is this type of setup common for a company with serious data science needs?



    Yes



From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.



From a practical standpoint, two things:



  • first of all, understand what is technically possible. In many cases,
    the people telling you no are IT-illiterate people who worry about
    business or compliance considerations, but have no concept of what is
    and isn't feasible from an IT standpoint. Try to speak to the DBAs or
    to whoever manages the data infrastructure. Understand what is
    technically possible. THEN, only then, try to find a compromise. E.g.
    they won't give you access to their system, but I presume there is a
    database behind it? Maybe they can extract the data to some other
    formats? Maybe they can extract the SQL statements that define the
    data types etc?

  • business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...





share|improve this answer










New contributor




PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






$endgroup$








  • 1




    $begingroup$
    Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
    $endgroup$
    – Jason
    Apr 4 at 13:35






  • 2




    $begingroup$
    True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
    $endgroup$
    – PythonGuest
    Apr 4 at 13:58


















34












$begingroup$

This is a situation that many blogs, companies and papers acknowledge as something real in many cases.



In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it




data scientists spend from 50 percent to 80 percent of their time



collecting and preparing unruly digital data.




Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights



Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.



What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.






share|improve this answer











$endgroup$












  • $begingroup$
    Forbes article claiming the same 80% figure.
    $endgroup$
    – Jesse Amano
    Apr 3 at 19:08






  • 4




    $begingroup$
    Forbes should nowhere be mentioned together with the words "data science".
    $endgroup$
    – gented
    Apr 3 at 22:52










  • $begingroup$
    50-80% based on (quote) "interviews and expert estimates"
    $endgroup$
    – oW_
    Apr 3 at 23:33






  • 3




    $begingroup$
    @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
    $endgroup$
    – Keeta
    Apr 4 at 11:46


















24












$begingroup$


Feels like most of the work is not related to data science at all. Is this accurate?




This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



enter image description here



Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.






share|improve this answer









$endgroup$




















    5












    $begingroup$

    As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.



    So to answer:



    1) Yes totally accurate



    2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).



    3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").






    share|improve this answer








    New contributor




    Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.






    $endgroup$




















      3












      $begingroup$


      1. Feels like most of the work is not related to data science at all. Is this accurate?



        Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster/more efficient, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.




      2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



        I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).




      3. Is this type of setup common for a company with serious data science needs?



        Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.







      share|improve this answer









      $endgroup$




















        3












        $begingroup$

        If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.



        You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.



        Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?






        share|improve this answer









        $endgroup$




















          2












          $begingroup$

          Perhaps to put it simply:



          • When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

          • When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

          You need to work with and understand your data - which includes simple stuff from fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from collected to being displayed. Processing it includes knowing the same pieces of information, so it is partially work you would have had to do anyway.



          Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea - having an intermediate dataset of processed data I think would be useful if you're allowed to (and can't do it in MySQL).



          But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.






          share|improve this answer








          New contributor




          David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.






          $endgroup$




















            2












            $begingroup$

            1) Feels like most of the work is not related to data science at all. Is this accurate?
            In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



            2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
            Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



            3) Is this type of setup common for a company with serious data science needs?
            I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



            I have given my inputs in thinking of businesses instead of technical stand points. :-)
            Hope I am clear in my choice of words.






            share|improve this answer








            New contributor




            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.






            $endgroup$




















              2












              $begingroup$

              In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)



              He says that there are a number of open problems in this area: Ingest, Transform(e.g. euro/dollar),
              Clean(e.g.-99/Null),
              Schema mapping (e.g. wages/salary),
              Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)



              There are number of companies/products trying to solve this problem such as Tamr, Alteryx, Trifacta, Paxata, Google Refine working to solve this problem.



              Until this area matures, a lot of the data scientist job will indeed be data wrangling.






              share|improve this answer








              New contributor




              hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.






              $endgroup$













                Your Answer





                StackExchange.ifUsing("editor", function ()
                return StackExchange.using("mathjaxEditing", function ()
                StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
                StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
                );
                );
                , "mathjax-editing");

                StackExchange.ready(function()
                var channelOptions =
                tags: "".split(" "),
                id: "557"
                ;
                initTagRenderer("".split(" "), "".split(" "), channelOptions);

                StackExchange.using("externalEditor", function()
                // Have to fire editor after snippets, if snippets enabled
                if (StackExchange.settings.snippets.snippetsEnabled)
                StackExchange.using("snippets", function()
                createEditor();
                );

                else
                createEditor();

                );

                function createEditor()
                StackExchange.prepareEditor(
                heartbeatType: 'answer',
                autoActivateHeartbeat: false,
                convertImagesToLinks: false,
                noModals: true,
                showLowRepImageUploadWarning: true,
                reputationToPostImages: null,
                bindNavPrevention: true,
                postfix: "",
                imageUploader:
                brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                allowUrls: true
                ,
                onDemand: true,
                discardSelector: ".discard-answer"
                ,immediatelyShowMarkdownHelp:true
                );



                );






                Victor Valente is a new contributor. Be nice, and check out our Code of Conduct.









                draft saved

                draft discarded


















                StackExchange.ready(
                function ()
                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48531%2fhow-much-of-data-wrangling-is-a-data-scientists-job%23new-answer', 'question_page');

                );

                Post as a guest















                Required, but never shown

























                9 Answers
                9






                active

                oldest

                votes








                9 Answers
                9






                active

                oldest

                votes









                active

                oldest

                votes






                active

                oldest

                votes









                19












                $begingroup$


                1. Feels like most of the work is not related to data science at all. Is this accurate?



                  Yes




                2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                  You're not wrong, but such are the realities of real life.




                3. Is this type of setup common for a company with serious data science needs?



                  Yes



                From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.



                From a practical standpoint, two things:



                • first of all, understand what is technically possible. In many cases,
                  the people telling you no are IT-illiterate people who worry about
                  business or compliance considerations, but have no concept of what is
                  and isn't feasible from an IT standpoint. Try to speak to the DBAs or
                  to whoever manages the data infrastructure. Understand what is
                  technically possible. THEN, only then, try to find a compromise. E.g.
                  they won't give you access to their system, but I presume there is a
                  database behind it? Maybe they can extract the data to some other
                  formats? Maybe they can extract the SQL statements that define the
                  data types etc?

                • business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...





                share|improve this answer










                New contributor




                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






                $endgroup$








                • 1




                  $begingroup$
                  Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                  $endgroup$
                  – Jason
                  Apr 4 at 13:35






                • 2




                  $begingroup$
                  True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                  $endgroup$
                  – PythonGuest
                  Apr 4 at 13:58















                19












                $begingroup$


                1. Feels like most of the work is not related to data science at all. Is this accurate?



                  Yes




                2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                  You're not wrong, but such are the realities of real life.




                3. Is this type of setup common for a company with serious data science needs?



                  Yes



                From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.



                From a practical standpoint, two things:



                • first of all, understand what is technically possible. In many cases,
                  the people telling you no are IT-illiterate people who worry about
                  business or compliance considerations, but have no concept of what is
                  and isn't feasible from an IT standpoint. Try to speak to the DBAs or
                  to whoever manages the data infrastructure. Understand what is
                  technically possible. THEN, only then, try to find a compromise. E.g.
                  they won't give you access to their system, but I presume there is a
                  database behind it? Maybe they can extract the data to some other
                  formats? Maybe they can extract the SQL statements that define the
                  data types etc?

                • business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...





                share|improve this answer










                New contributor




                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






                $endgroup$








                • 1




                  $begingroup$
                  Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                  $endgroup$
                  – Jason
                  Apr 4 at 13:35






                • 2




                  $begingroup$
                  True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                  $endgroup$
                  – PythonGuest
                  Apr 4 at 13:58













                19












                19








                19





                $begingroup$


                1. Feels like most of the work is not related to data science at all. Is this accurate?



                  Yes




                2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                  You're not wrong, but such are the realities of real life.




                3. Is this type of setup common for a company with serious data science needs?



                  Yes



                From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.



                From a practical standpoint, two things:



                • first of all, understand what is technically possible. In many cases,
                  the people telling you no are IT-illiterate people who worry about
                  business or compliance considerations, but have no concept of what is
                  and isn't feasible from an IT standpoint. Try to speak to the DBAs or
                  to whoever manages the data infrastructure. Understand what is
                  technically possible. THEN, only then, try to find a compromise. E.g.
                  they won't give you access to their system, but I presume there is a
                  database behind it? Maybe they can extract the data to some other
                  formats? Maybe they can extract the SQL statements that define the
                  data types etc?

                • business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...





                share|improve this answer










                New contributor




                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






                $endgroup$




                1. Feels like most of the work is not related to data science at all. Is this accurate?



                  Yes




                2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                  You're not wrong, but such are the realities of real life.




                3. Is this type of setup common for a company with serious data science needs?



                  Yes



                From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.



                From a practical standpoint, two things:



                • first of all, understand what is technically possible. In many cases,
                  the people telling you no are IT-illiterate people who worry about
                  business or compliance considerations, but have no concept of what is
                  and isn't feasible from an IT standpoint. Try to speak to the DBAs or
                  to whoever manages the data infrastructure. Understand what is
                  technically possible. THEN, only then, try to find a compromise. E.g.
                  they won't give you access to their system, but I presume there is a
                  database behind it? Maybe they can extract the data to some other
                  formats? Maybe they can extract the SQL statements that define the
                  data types etc?

                • business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...






                share|improve this answer










                New contributor




                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.









                share|improve this answer



                share|improve this answer








                edited Apr 4 at 12:40









                Stephen Rauch

                1,52551330




                1,52551330






                New contributor




                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.









                answered Apr 4 at 12:29









                PythonGuestPythonGuest

                2062




                2062




                New contributor




                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.





                New contributor





                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






                PythonGuest is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.







                • 1




                  $begingroup$
                  Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                  $endgroup$
                  – Jason
                  Apr 4 at 13:35






                • 2




                  $begingroup$
                  True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                  $endgroup$
                  – PythonGuest
                  Apr 4 at 13:58












                • 1




                  $begingroup$
                  Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                  $endgroup$
                  – Jason
                  Apr 4 at 13:35






                • 2




                  $begingroup$
                  True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                  $endgroup$
                  – PythonGuest
                  Apr 4 at 13:58







                1




                1




                $begingroup$
                Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                $endgroup$
                – Jason
                Apr 4 at 13:35




                $begingroup$
                Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                $endgroup$
                – Jason
                Apr 4 at 13:35




                2




                2




                $begingroup$
                True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                $endgroup$
                – PythonGuest
                Apr 4 at 13:58




                $begingroup$
                True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                $endgroup$
                – PythonGuest
                Apr 4 at 13:58











                34












                $begingroup$

                This is a situation that many blogs, companies and papers acknowledge as something real in many cases.



                In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it




                data scientists spend from 50 percent to 80 percent of their time



                collecting and preparing unruly digital data.




                Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights



                Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.



                What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.






                share|improve this answer











                $endgroup$












                • $begingroup$
                  Forbes article claiming the same 80% figure.
                  $endgroup$
                  – Jesse Amano
                  Apr 3 at 19:08






                • 4




                  $begingroup$
                  Forbes should nowhere be mentioned together with the words "data science".
                  $endgroup$
                  – gented
                  Apr 3 at 22:52










                • $begingroup$
                  50-80% based on (quote) "interviews and expert estimates"
                  $endgroup$
                  – oW_
                  Apr 3 at 23:33






                • 3




                  $begingroup$
                  @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                  $endgroup$
                  – Keeta
                  Apr 4 at 11:46















                34












                $begingroup$

                This is a situation that many blogs, companies and papers acknowledge as something real in many cases.



                In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it




                data scientists spend from 50 percent to 80 percent of their time



                collecting and preparing unruly digital data.




                Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights



                Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.



                What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.






                share|improve this answer











                $endgroup$












                • $begingroup$
                  Forbes article claiming the same 80% figure.
                  $endgroup$
                  – Jesse Amano
                  Apr 3 at 19:08






                • 4




                  $begingroup$
                  Forbes should nowhere be mentioned together with the words "data science".
                  $endgroup$
                  – gented
                  Apr 3 at 22:52










                • $begingroup$
                  50-80% based on (quote) "interviews and expert estimates"
                  $endgroup$
                  – oW_
                  Apr 3 at 23:33






                • 3




                  $begingroup$
                  @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                  $endgroup$
                  – Keeta
                  Apr 4 at 11:46













                34












                34








                34





                $begingroup$

                This is a situation that many blogs, companies and papers acknowledge as something real in many cases.



                In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it




                data scientists spend from 50 percent to 80 percent of their time



                collecting and preparing unruly digital data.




                Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights



                Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.



                What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.






                share|improve this answer











                $endgroup$



                This is a situation that many blogs, companies and papers acknowledge as something real in many cases.



                In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it




                data scientists spend from 50 percent to 80 percent of their time



                collecting and preparing unruly digital data.




                Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights



                Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.



                What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Apr 4 at 12:37









                Stephen Rauch

                1,52551330




                1,52551330










                answered Apr 3 at 16:35









                TasosTasos

                1,51011138




                1,51011138











                • $begingroup$
                  Forbes article claiming the same 80% figure.
                  $endgroup$
                  – Jesse Amano
                  Apr 3 at 19:08






                • 4




                  $begingroup$
                  Forbes should nowhere be mentioned together with the words "data science".
                  $endgroup$
                  – gented
                  Apr 3 at 22:52










                • $begingroup$
                  50-80% based on (quote) "interviews and expert estimates"
                  $endgroup$
                  – oW_
                  Apr 3 at 23:33






                • 3




                  $begingroup$
                  @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                  $endgroup$
                  – Keeta
                  Apr 4 at 11:46
















                • $begingroup$
                  Forbes article claiming the same 80% figure.
                  $endgroup$
                  – Jesse Amano
                  Apr 3 at 19:08






                • 4




                  $begingroup$
                  Forbes should nowhere be mentioned together with the words "data science".
                  $endgroup$
                  – gented
                  Apr 3 at 22:52










                • $begingroup$
                  50-80% based on (quote) "interviews and expert estimates"
                  $endgroup$
                  – oW_
                  Apr 3 at 23:33






                • 3




                  $begingroup$
                  @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                  $endgroup$
                  – Keeta
                  Apr 4 at 11:46















                $begingroup$
                Forbes article claiming the same 80% figure.
                $endgroup$
                – Jesse Amano
                Apr 3 at 19:08




                $begingroup$
                Forbes article claiming the same 80% figure.
                $endgroup$
                – Jesse Amano
                Apr 3 at 19:08




                4




                4




                $begingroup$
                Forbes should nowhere be mentioned together with the words "data science".
                $endgroup$
                – gented
                Apr 3 at 22:52




                $begingroup$
                Forbes should nowhere be mentioned together with the words "data science".
                $endgroup$
                – gented
                Apr 3 at 22:52












                $begingroup$
                50-80% based on (quote) "interviews and expert estimates"
                $endgroup$
                – oW_
                Apr 3 at 23:33




                $begingroup$
                50-80% based on (quote) "interviews and expert estimates"
                $endgroup$
                – oW_
                Apr 3 at 23:33




                3




                3




                $begingroup$
                @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                $endgroup$
                – Keeta
                Apr 4 at 11:46




                $begingroup$
                @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                $endgroup$
                – Keeta
                Apr 4 at 11:46











                24












                $begingroup$


                Feels like most of the work is not related to data science at all. Is this accurate?




                This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                enter image description here



                Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.






                share|improve this answer









                $endgroup$

















                  24












                  $begingroup$


                  Feels like most of the work is not related to data science at all. Is this accurate?




                  This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                  enter image description here



                  Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.






                  share|improve this answer









                  $endgroup$















                    24












                    24








                    24





                    $begingroup$


                    Feels like most of the work is not related to data science at all. Is this accurate?




                    This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                    enter image description here



                    Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.






                    share|improve this answer









                    $endgroup$




                    Feels like most of the work is not related to data science at all. Is this accurate?




                    This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                    enter image description here



                    Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Apr 3 at 16:47









                    Shamit VermaShamit Verma

                    1,4291214




                    1,4291214





















                        5












                        $begingroup$

                        As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.



                        So to answer:



                        1) Yes totally accurate



                        2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).



                        3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").






                        share|improve this answer








                        New contributor




                        Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                        Check out our Code of Conduct.






                        $endgroup$

















                          5












                          $begingroup$

                          As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.



                          So to answer:



                          1) Yes totally accurate



                          2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).



                          3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").






                          share|improve this answer








                          New contributor




                          Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.






                          $endgroup$















                            5












                            5








                            5





                            $begingroup$

                            As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.



                            So to answer:



                            1) Yes totally accurate



                            2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).



                            3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").






                            share|improve this answer








                            New contributor




                            Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.






                            $endgroup$



                            As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.



                            So to answer:



                            1) Yes totally accurate



                            2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).



                            3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").







                            share|improve this answer








                            New contributor




                            Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.









                            share|improve this answer



                            share|improve this answer






                            New contributor




                            Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.









                            answered Apr 3 at 23:03









                            Oliver HoustonOliver Houston

                            1511




                            1511




                            New contributor




                            Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.





                            New contributor





                            Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.






                            Oliver Houston is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.





















                                3












                                $begingroup$


                                1. Feels like most of the work is not related to data science at all. Is this accurate?



                                  Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster/more efficient, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.




                                2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                                  I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).




                                3. Is this type of setup common for a company with serious data science needs?



                                  Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.







                                share|improve this answer









                                $endgroup$

















                                  3












                                  $begingroup$


                                  1. Feels like most of the work is not related to data science at all. Is this accurate?



                                    Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster/more efficient, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.




                                  2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                                    I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).




                                  3. Is this type of setup common for a company with serious data science needs?



                                    Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.







                                  share|improve this answer









                                  $endgroup$















                                    3












                                    3








                                    3





                                    $begingroup$


                                    1. Feels like most of the work is not related to data science at all. Is this accurate?



                                      Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster/more efficient, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.




                                    2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                                      I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).




                                    3. Is this type of setup common for a company with serious data science needs?



                                      Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.







                                    share|improve this answer









                                    $endgroup$




                                    1. Feels like most of the work is not related to data science at all. Is this accurate?



                                      Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster/more efficient, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.




                                    2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                                      I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).




                                    3. Is this type of setup common for a company with serious data science needs?



                                      Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.








                                    share|improve this answer












                                    share|improve this answer



                                    share|improve this answer










                                    answered Apr 4 at 19:40









                                    UnderminerUnderminer

                                    1314




                                    1314





















                                        3












                                        $begingroup$

                                        If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.



                                        You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.



                                        Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?






                                        share|improve this answer









                                        $endgroup$

















                                          3












                                          $begingroup$

                                          If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.



                                          You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.



                                          Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?






                                          share|improve this answer









                                          $endgroup$















                                            3












                                            3








                                            3





                                            $begingroup$

                                            If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.



                                            You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.



                                            Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?






                                            share|improve this answer









                                            $endgroup$



                                            If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.



                                            You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.



                                            Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?







                                            share|improve this answer












                                            share|improve this answer



                                            share|improve this answer










                                            answered Apr 4 at 21:05









                                            WhelibeirenWhelibeiren

                                            542




                                            542





















                                                2












                                                $begingroup$

                                                Perhaps to put it simply:



                                                • When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

                                                • When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

                                                You need to work with and understand your data - which includes simple stuff from fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from collected to being displayed. Processing it includes knowing the same pieces of information, so it is partially work you would have had to do anyway.



                                                Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea - having an intermediate dataset of processed data I think would be useful if you're allowed to (and can't do it in MySQL).



                                                But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.






                                                share|improve this answer








                                                New contributor




                                                David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                Check out our Code of Conduct.






                                                $endgroup$

















                                                  2












                                                  $begingroup$

                                                  Perhaps to put it simply:



                                                  • When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

                                                  • When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

                                                  You need to work with and understand your data - which includes simple stuff from fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from collected to being displayed. Processing it includes knowing the same pieces of information, so it is partially work you would have had to do anyway.



                                                  Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea - having an intermediate dataset of processed data I think would be useful if you're allowed to (and can't do it in MySQL).



                                                  But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.






                                                  share|improve this answer








                                                  New contributor




                                                  David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                  Check out our Code of Conduct.






                                                  $endgroup$















                                                    2












                                                    2








                                                    2





                                                    $begingroup$

                                                    Perhaps to put it simply:



                                                    • When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

                                                    • When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

                                                    You need to work with and understand your data - which includes simple stuff from fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from collected to being displayed. Processing it includes knowing the same pieces of information, so it is partially work you would have had to do anyway.



                                                    Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea - having an intermediate dataset of processed data I think would be useful if you're allowed to (and can't do it in MySQL).



                                                    But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.






                                                    share|improve this answer








                                                    New contributor




                                                    David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                    Check out our Code of Conduct.






                                                    $endgroup$



                                                    Perhaps to put it simply:



                                                    • When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

                                                    • When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

                                                    You need to work with and understand your data - which includes simple stuff from fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from collected to being displayed. Processing it includes knowing the same pieces of information, so it is partially work you would have had to do anyway.



                                                    Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea - having an intermediate dataset of processed data I think would be useful if you're allowed to (and can't do it in MySQL).



                                                    But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.







                                                    share|improve this answer








                                                    New contributor




                                                    David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                    Check out our Code of Conduct.









                                                    share|improve this answer



                                                    share|improve this answer






                                                    New contributor




                                                    David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                    Check out our Code of Conduct.









                                                    answered Apr 4 at 22:51









                                                    David MDavid M

                                                    1212




                                                    1212




                                                    New contributor




                                                    David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                    Check out our Code of Conduct.





                                                    New contributor





                                                    David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                    Check out our Code of Conduct.






                                                    David M is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                    Check out our Code of Conduct.





















                                                        2












                                                        $begingroup$

                                                        1) Feels like most of the work is not related to data science at all. Is this accurate?
                                                        In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



                                                        2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
                                                        Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



                                                        3) Is this type of setup common for a company with serious data science needs?
                                                        I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



                                                        I have given my inputs in thinking of businesses instead of technical stand points. :-)
                                                        Hope I am clear in my choice of words.






                                                        share|improve this answer








                                                        New contributor




                                                        user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                        Check out our Code of Conduct.






                                                        $endgroup$

















                                                          2












                                                          $begingroup$

                                                          1) Feels like most of the work is not related to data science at all. Is this accurate?
                                                          In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



                                                          2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
                                                          Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



                                                          3) Is this type of setup common for a company with serious data science needs?
                                                          I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



                                                          I have given my inputs in thinking of businesses instead of technical stand points. :-)
                                                          Hope I am clear in my choice of words.






                                                          share|improve this answer








                                                          New contributor




                                                          user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                          Check out our Code of Conduct.






                                                          $endgroup$















                                                            2












                                                            2








                                                            2





                                                            $begingroup$

                                                            1) Feels like most of the work is not related to data science at all. Is this accurate?
                                                            In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



                                                            2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
                                                            Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



                                                            3) Is this type of setup common for a company with serious data science needs?
                                                            I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



                                                            I have given my inputs in thinking of businesses instead of technical stand points. :-)
                                                            Hope I am clear in my choice of words.






                                                            share|improve this answer








                                                            New contributor




                                                            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                            Check out our Code of Conduct.






                                                            $endgroup$



                                                            1) Feels like most of the work is not related to data science at all. Is this accurate?
                                                            In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



                                                            2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
                                                            Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



                                                            3) Is this type of setup common for a company with serious data science needs?
                                                            I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



                                                            I have given my inputs in thinking of businesses instead of technical stand points. :-)
                                                            Hope I am clear in my choice of words.







                                                            share|improve this answer








                                                            New contributor




                                                            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                            Check out our Code of Conduct.









                                                            share|improve this answer



                                                            share|improve this answer






                                                            New contributor




                                                            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                            Check out our Code of Conduct.









                                                            answered 2 days ago









                                                            user70920user70920

                                                            211




                                                            211




                                                            New contributor




                                                            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                            Check out our Code of Conduct.





                                                            New contributor





                                                            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                            Check out our Code of Conduct.






                                                            user70920 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                            Check out our Code of Conduct.





















                                                                2












                                                                $begingroup$

                                                                In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)



                                                                He says that there are a number of open problems in this area: Ingest, Transform(e.g. euro/dollar),
                                                                Clean(e.g.-99/Null),
                                                                Schema mapping (e.g. wages/salary),
                                                                Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)



                                                                There are number of companies/products trying to solve this problem such as Tamr, Alteryx, Trifacta, Paxata, Google Refine working to solve this problem.



                                                                Until this area matures, a lot of the data scientist job will indeed be data wrangling.






                                                                share|improve this answer








                                                                New contributor




                                                                hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                Check out our Code of Conduct.






                                                                $endgroup$

















                                                                  2












                                                                  $begingroup$

                                                                  In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)



                                                                  He says that there are a number of open problems in this area: Ingest, Transform(e.g. euro/dollar),
                                                                  Clean(e.g.-99/Null),
                                                                  Schema mapping (e.g. wages/salary),
                                                                  Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)



                                                                  There are number of companies/products trying to solve this problem such as Tamr, Alteryx, Trifacta, Paxata, Google Refine working to solve this problem.



                                                                  Until this area matures, a lot of the data scientist job will indeed be data wrangling.






                                                                  share|improve this answer








                                                                  New contributor




                                                                  hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                  Check out our Code of Conduct.






                                                                  $endgroup$















                                                                    2












                                                                    2








                                                                    2





                                                                    $begingroup$

                                                                    In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)



                                                                    He says that there are a number of open problems in this area: Ingest, Transform(e.g. euro/dollar),
                                                                    Clean(e.g.-99/Null),
                                                                    Schema mapping (e.g. wages/salary),
                                                                    Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)



                                                                    There are number of companies/products trying to solve this problem such as Tamr, Alteryx, Trifacta, Paxata, Google Refine working to solve this problem.



                                                                    Until this area matures, a lot of the data scientist job will indeed be data wrangling.






                                                                    share|improve this answer








                                                                    New contributor




                                                                    hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.






                                                                    $endgroup$



                                                                    In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)



                                                                    He says that there are a number of open problems in this area: Ingest, Transform(e.g. euro/dollar),
                                                                    Clean(e.g.-99/Null),
                                                                    Schema mapping (e.g. wages/salary),
                                                                    Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)



                                                                    There are number of companies/products trying to solve this problem such as Tamr, Alteryx, Trifacta, Paxata, Google Refine working to solve this problem.



                                                                    Until this area matures, a lot of the data scientist job will indeed be data wrangling.







                                                                    share|improve this answer








                                                                    New contributor




                                                                    hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.









                                                                    share|improve this answer



                                                                    share|improve this answer






                                                                    New contributor




                                                                    hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.









                                                                    answered yesterday









                                                                    hojusaramhojusaram

                                                                    1211




                                                                    1211




                                                                    New contributor




                                                                    hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.





                                                                    New contributor





                                                                    hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.






                                                                    hojusaram is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.




















                                                                        Victor Valente is a new contributor. Be nice, and check out our Code of Conduct.









                                                                        draft saved

                                                                        draft discarded


















                                                                        Victor Valente is a new contributor. Be nice, and check out our Code of Conduct.












                                                                        Victor Valente is a new contributor. Be nice, and check out our Code of Conduct.











                                                                        Victor Valente is a new contributor. Be nice, and check out our Code of Conduct.














                                                                        Thanks for contributing an answer to Data Science Stack Exchange!


                                                                        • Please be sure to answer the question. Provide details and share your research!

                                                                        But avoid


                                                                        • Asking for help, clarification, or responding to other answers.

                                                                        • Making statements based on opinion; back them up with references or personal experience.

                                                                        Use MathJax to format equations. MathJax reference.


                                                                        To learn more, see our tips on writing great answers.




                                                                        draft saved


                                                                        draft discarded














                                                                        StackExchange.ready(
                                                                        function ()
                                                                        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48531%2fhow-much-of-data-wrangling-is-a-data-scientists-job%23new-answer', 'question_page');

                                                                        );

                                                                        Post as a guest















                                                                        Required, but never shown





















































                                                                        Required, but never shown














                                                                        Required, but never shown












                                                                        Required, but never shown







                                                                        Required, but never shown

































                                                                        Required, but never shown














                                                                        Required, but never shown












                                                                        Required, but never shown







                                                                        Required, but never shown







                                                                        Popular posts from this blog

                                                                        Tamil (spriik) Luke uk diar | Nawigatjuun

                                                                        Align equal signs while including text over equalitiesAMS align: left aligned text/math plus multicolumn alignmentMultiple alignmentsAligning equations in multiple placesNumbering and aligning an equation with multiple columnsHow to align one equation with another multline equationUsing \ in environments inside the begintabularxNumber equations and preserving alignment of equal signsHow can I align equations to the left and to the right?Double equation alignment problem within align enviromentAligned within align: Why are they right-aligned?

                                                                        Training a classifier when some of the features are unknownWhy does Gradient Boosting regression predict negative values when there are no negative y-values in my training set?How to improve an existing (trained) classifier?What is effect when I set up some self defined predisctor variables?Why Matlab neural network classification returns decimal values on prediction dataset?Fitting and transforming text data in training, testing, and validation setsHow to quantify the performance of the classifier (multi-class SVM) using the test data?How do I control for some patients providing multiple samples in my training data?Training and Test setTraining a convolutional neural network for image denoising in MatlabShouldn't an autoencoder with #(neurons in hidden layer) = #(neurons in input layer) be “perfect”?