Are Non-GenAI Data Science Tools Still Needed in the Era of GenAI? Are Data Scientists Still Needed?
By J Gavin Wu, Sr. Director, Head of Pharmacy and Health Data Science, Albertsons Companies
The Promise of GenAI (LLMs, or Large Language Models)
The rapid rise of GenAI has shocked the world, thanks in no small part to OpenAI (and its savvy CEO). All of a sudden, it almost seemed that human-level general intelligence was imminent, though it is increasingly clear that this is not the case. GenAI nonetheless has the potential to relieve human beings of many repetitive tasks almost instantly, thus threatening the continued existence of many jobs, perhaps even those of data scientists, who ironically were the creators of this “beast.”
The democratization of GenAI and GenAI-enabled tools seems to indicate that anyone, even those without data-related expertise, can suddenly harness the power of some of the most advanced analytical and predictive tools that humans have ever built. What, then, of the analytical and predictive tools that are not GenAI or GenAI-enabled (the very tools that were used to build GenAI)? Are they still relevant? To answer that, we need to take a look at the landscape of data.
Where Is GenAI Most Applicable? Where Is It Not?
Unlike most predictive models of the past, which were trained on tabular data, GenAI is usually trained on “multi-modal” data, namely the kinds of information we interact with directly: text documents, speech, pictures, music, video, and so on. Some of this data requires explicit labeling, such as naming the objects in a picture. Some does not, such as text documents, where the model is trained by selectively masking certain parts of the text and using the rest to predict them.
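The masking idea described above can be illustrated with a minimal sketch. It only builds a training example (the masked input and the hidden tokens the model would be asked to predict); it does not train anything. The function name make_masked_example, the "[MASK]" placeholder, and the mask rate are assumptions chosen for illustration, and note that many LLMs are trained to predict the next token rather than masked tokens.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption for this sketch


def make_masked_example(tokens, mask_rate=0.15, seed=0):
    """Hide a fraction of tokens and return (masked_tokens, targets).

    targets maps each masked position to the original token, which is
    what a model would be trained to predict from the remaining text.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


sentence = "the most popular cheeses at this store are cheddar and brie".split()
masked, targets = make_masked_example(sentence, mask_rate=0.2)
print(" ".join(masked))  # the input the model sees
print(targets)           # the labels, obtained without any human annotation
```

No human labeling is needed here: the labels come from the text itself, which is why raw documents can be used for training at scale.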
Because the data that GenAI models are trained on is so rich and free-form, the predictions from these models are similarly free-form. You can ask a GenAI model a question in natural language, such as “What are the most popular cheeses at this Safeway in San Francisco?” and you will get back an answer, with a list of cheeses, in natural language.
In addition to extracting analytic insights from data, GenAI can, as the name Generative AI suggests, create new data (pictures, videos, etc.) from context that the user provides to the model. Such creations compete with the work of artists rather than that of data scientists.
However, for data already in tabular form, GenAI works less well. A table alone usually doesn’t provide enough context. Even if it does, GenAI generally requires a very large number of parameters to be estimated (many popular GenAI models contain billions of parameters). Training GenAI models on such tabular data is both unrealistic and overkill. Moreover, explainability is crucial in certain applications (such as credit decisions), and GenAI is just not as explainable as many of the simpler and more established ML models (such as XGBoost).
Also, for ML problems that are clearly regression or classification in nature, it is easier to understand which variables are predictive and how important each one is to the outcome. Non-GenAI ML models can rely on generally applicable variable importance methodologies such as SHAP, and variable importance is a big part of interpretability for any ML model. Such variable importance is not easy to obtain for an LLM.
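To make the notion of variable importance concrete, here is a minimal sketch of the Shapley-value calculation that SHAP is built on. It is not the shap library and not a trained model: the scoring function toy_score, its coefficients, and the input values are all made up for illustration, and the exact computation below is only practical for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to baseline.

    Features not yet "switched on" keep their baseline value. The cost
    grows factorially with the number of features.
    """
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i to its real value
            new = predict(current)
            contrib[i] += new - prev   # marginal contribution in this ordering
            prev = new
    return [c / len(orderings) for c in contrib]


# Hypothetical credit-style score with one interaction term (illustrative only).
def toy_score(features):
    income, debt, years_employed = features
    return 0.5 * income - 0.8 * debt + 0.1 * income * years_employed


x = [4.0, 2.0, 3.0]          # made-up applicant
baseline = [0.0, 0.0, 0.0]   # made-up reference point
phi = shapley_values(toy_score, x, baseline)
print([round(p, 3) for p in phi])  # [2.6, -1.6, 0.6]
```

The three values add up to the difference between the score for this applicant and the score at the baseline, so each prediction can be decomposed into per-variable contributions. That is the kind of explanation that is straightforward for a small tabular model and hard to obtain for an LLM.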
The Choice of Data Science Tools Is Shifting
As the GenAI revolution shifts the balance of work from tabular data toward free-form data, the choice among data science tools will likely shift as well. For DBAs, more usage will move from SQL-based data platforms such as Snowflake to vector-native databases such as Pinecone. For data scientists, tools that work with LLMs, such as LangChain, will gain popularity relative to old-fashioned ML toolkits such as scikit-learn.
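The difference between the two kinds of database can be sketched in a few lines. A SQL query matches rows on exact values, whereas a vector-native database stores an embedding (a list of numbers) for each item and returns the items whose embeddings are closest to the query. The sketch below is not the Pinecone API; the three-dimensional embeddings and document names are invented for illustration, and real systems use approximate search over much larger vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k document ids whose vectors are most similar to query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; in practice an embedding model would produce these.
index = {
    "cheddar tasting notes": [0.9, 0.1, 0.0],
    "brie pairing guide": [0.8, 0.3, 0.1],
    "pharmacy refill policy": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of a cheese question
results = top_k(query, index, k=2)
print(results)
```

Here the two cheese documents rank above the pharmacy document even though no keyword matching is involved, which is the behavior that makes vector search a natural fit for free-form data.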
The creative process will also be quite different. Instead of building a model from the ground up, data scientists will more often than not start with a pre-trained LLM and continue training it for additional epochs on proprietary data, a form of “transfer learning” commonly called fine-tuning.
Many tools, such as Microsoft Copilot, even allow users to create code and queries (such as Python scripts) from natural language prompts, which can then be submitted to non-GenAI tools for complex data analysis. That significantly enhances the abilities of many analysts and takes away some of the tasks that data scientists have traditionally performed.
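As a rough illustration of that hand-off, the sketch below takes a query of the kind an assistant might produce for a prompt like “What are the most popular cheeses at this store?” and runs it against an ordinary, non-GenAI database engine (SQLite, from the Python standard library). The query text is hardcoded here rather than generated, and the table name, columns, and sales figures are all invented for illustration.

```python
import sqlite3

# Hypothetical output of a natural-language-to-SQL assistant (hardcoded here).
generated_sql = """
SELECT product, SUM(units) AS total_units
FROM sales
WHERE category = 'cheese'
GROUP BY product
ORDER BY total_units DESC
LIMIT 2
"""

# A conventional, non-GenAI engine does the actual analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, category TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [  # made-up rows for illustration
        ("cheddar", "cheese", 40),
        ("brie", "cheese", 25),
        ("cheddar", "cheese", 15),
        ("gouda", "cheese", 30),
        ("aspirin", "pharmacy", 90),
    ],
)
rows = conn.execute(generated_sql).fetchall()
conn.close()
print(rows)  # [('cheddar', 55), ('gouda', 30)]
```

The analyst still has to judge whether the generated query answers the question that was actually asked, for example whether “popular” should mean units sold, revenue, or number of distinct buyers.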
The Future of Data Scientists
Will data scientists become extinct? Probably not (soon). But their roles will evolve. As Jensen Huang described in his vision for AI: “Code writes code; logic writes logic.” Most of that vision is already happening (thanks to the GPUs that his company sells). Therefore, I think the main value-add from data scientists will be to help convert a business problem into a scientific problem, giving it a proper scientific structure so that it can be best solved by AI, starting with the question of what data to collect.
That will force data scientists to work more closely with business stakeholders and make them more akin to domain experts. They will be the ones who feed all the information relevant to the business problem into the model, often correcting bias inherent in the data along the way. Their data science expertise is still needed: as data science tools become increasingly easy to use (and powered by GenAI), it is more important than ever to know what you are doing and to be equipped to assess whether a model is working as expected.