The Present and Future of Data Science: Dun & Bradstreet's Chief Data Scientist Debunks AI and Big Data Myths

The Present And Future Of Data Science, An Interview With Anthony Scriffignano, Senior Vice President & Chief Data Scientist At Dun & Bradstreet

Data science is one of the hottest roles in technology today. If you have experience and a degree in data science or a related field, you can write your own ticket with regard to where you work and command a significant salary. However, will the data scientist be the star of the AI show for long, or is the spotlight on data science going to fade?

In an episode of the AI Today podcast, Anthony Scriffignano, Senior Vice President and Chief Data Scientist at Dun & Bradstreet, shares his experiences, opinions, and insight into the current state of data science as a profession, as well as the potential of artificial intelligence to change the finance industry. At Dun & Bradstreet, Scriffignano is responsible for innovation and the development of technologies, and he works with ‘the world’s largest commercial database of its kind’.

The current role of data science

Scriffignano explains how this unprecedented database collects data millions of times a day from every country in the world, with the sole exceptions of North Korea and Cuba. Incorporating every language and writing system, it is composed of seven different integrated databases rather than one single database. This composite data system is used to develop global insight into total risk and opportunity while keeping track of company data. The database can thus be used to perform large-scale data analysis, facilitating the detection of supply chain anomalies and changes in customer buying behavior. It comes as no surprise that data science is key to extracting value from such a large repository of information.

One of the biggest challenges for organizations like Dun & Bradstreet is finding skilled data scientists who have both the background and the experience to handle data sets as large as the one Dun & Bradstreet works with. Unfortunately, the market is not keeping pace with organizations' needs for data science skills. Scriffignano shares that he believes the basics of AI are becoming generalized and democratized in a way that will not necessarily require skilled data scientists in the future. At the same time, he believes that the set of skills necessary to be a fully fledged data scientist is much broader and deeper than the skills required simply to create machine learning models. In essence, true data scientists are focused on broader issues of extracting value from data, while many who call themselves data scientists now are really machine learning engineers, focused on ML model development.

Scriffignano is of the opinion that we must focus on the scientist aspect of a data scientist: a data scientist must be able to formulate a question or theory from observed data, run experiments to test that theory, and subsequently come to a conclusion and share the results. Noting that most data scientists are simply expected to churn out repeatable models, Scriffignano believes that challenging your data scientists to improve and innovate is truly where success lies. He states that the failure to push data scientists to take their profession beyond simple model development is one reason why a significant number of organizations struggle with data science and AI.

Challenges: governance and ethics

Besides issues of extracting value from large data sets, Scriffignano believes that the primary challenges for AI and data science center on governance and ethics. This is especially the case when personal information is involved. How can we make sure we’re making responsible use of private information when building large databases and intelligent models that use that private information?

Part of the reason for the increased scrutiny of machine learning models has more to do with issues of privacy and security than with the specific characteristics of those models. Scriffignano makes an interesting point about just how troublesome it will be for AI regulation to cater to everyone, given how much needs and wants vary. People are looking for more customization and more rapid model development, yet they are not willing to compromise on their privacy. Some companies and individuals will benefit from models that use lots of data to create much more precise and accurate predictions, but at the expense of scooping up large amounts of private information. Others might resist the inclusion of their data in those models, even if that results in less accurate models that they might end up depending on. As a result, not everyone will be satisfied by a greater expansion of the data used to build machine learning models.

Scriffignano believes that government regulators will need to keep up with evolving technologies if they wish to ensure optimal national security and avoid these privacy-related issues. Laws and regulations will vary heavily across regions of the world, and the notion of ethics itself might not be consistent across jurisdictions. That is already the case today, with Europe taking a more ethical approach, China being less interested in privacy, and the United States somewhere in the middle. Some countries are simply more interested in privacy, while others look towards national security and even economic advancement. The issue, as Scriffignano shares, is that machine learning really doesn’t have any geographic boundaries. What is unacceptable in one location might be perfectly fine in another, and models built in one region can then be made available for use elsewhere. It might be very difficult to stop models developed in a region with less care for privacy from being used in a location with higher regard for data ethics.

On the podcast, Scriffignano also shares his dislike of anthropomorphizing AI. Taking a much more practical approach, he reminds us that AI in its current form functions by means of algorithms and processes. He uses Artificial General Intelligence (AGI) as an example from which to share his point of view. Our limitations start when we cannot ask the right question of the copious amount of data we possess. Scriffignano foresees a future in which professionals work alongside AI, and in which we need not fear answering to robots or machines as long as we remain vigilant. To achieve this outcome, we must be stringent about and alert to data ethics and governance issues, so that this progress can be made without harm.

(This article originally appeared in Forbes.)

Anthony Scriffignano, Chief Data Scientist at Dun & Bradstreet (photo provided by Scriffignano)