
A Guide on How to Handle Big Data

Big Data

The term “Big Data” has become a buzzword, and it has been hailed as the solution to many problems and the future of business. But what is it? Many people confuse Big Data with large data sets; this confusion seems common among non-technical people. Big Data is something deeper. It’s not just a vast amount of data. It’s the usage of that data to create business value.

Think about big data as different types of “material”–if you’re an architect, you might have different kinds of materials, like wood, steel, or concrete. You can use these materials in your projects to build something that fulfills your needs for function and form. For instance, if you are trying to make a shelter quickly and cost-effectively, you might use steel because it is inexpensive and readily available. The material selection depends on the goal at hand.

How Does Big Data Work?

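We can use small data sets to understand what happened over the past 10 years (the kind of information found in history books). However, if we want to predict what will happen over the next 10 years, or simulate how the world might change depending on the choices made during that time, we need Big Data.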

Unfortunately, it isn’t easy to give an exact definition for Big Data: as data grows more complex and its usage evolves, so does our understanding of Big Data. A practical rule of thumb is this: if your project requires 100 TB of storage capacity, or needs query times under one minute on 100 PB of data, you would probably call that big data (there is no official line in the sand; if you know it when you see it, that’s good enough).

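Big Data is also not useful by itself. It must be used to solve a problem; it just so happens that many problems are best solved with Big Data. For instance, Google Flu Trends uses big data to predict the number of flu cases in each state based on how many people search for certain flu-related keywords. The U.S. National Security Agency uses big data analytics to identify global human trafficking networks by scanning trillions of phone calls and emails for keywords or phrases that may indicate an imminent threat.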

Bottom line: Big Data allows us to do things we couldn’t before because we wouldn’t have had the storage capacity or processing speed needed. Basic examples might include developing better weather forecasts or movie recommendations.

How to Handle Big Data

Before we get into the technical aspects of storing and querying big data (and there is a lot to cover), it’s vital to discuss data warehousing and its evolution. As we mentioned earlier, many organizations take “Big Data” as an umbrella term for large amounts of data; this is not entirely accurate. Data warehousing and business intelligence (BI) tools allow entire organizations–not just data scientists–to use their data by extracting insights from these vast datasets and presenting them in easy-to-understand formats such as graphs, charts, tables, etc. The easier it is for non-technical employees to understand how to make sense of the data, the more likely they are to use it.

Find All Your Data

The first step in Big Data is to find all your data (it could be spread across several databases; it could also exist only on paper). While this sounds simple, it’s pretty tricky, especially if you’re dealing with terabytes or petabytes of information. Organizations do this through a process known as ETL (extract-transform-load), which involves taking large chunks of raw data and transforming them into structured tables for easier querying by BI tools. This process can be highly resource-intensive because many supporting components are required, such as staging servers, load balancers, and connection pools. There are other ways to extract data from sources like flat files, third-party databases, etc., but this is the easiest to implement and the most common.
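As a rough illustration of the extract-transform-load steps described above, here is a minimal sketch in Python. It assumes a tiny in-memory CSV export and a local SQLite database standing in for the warehouse; the table name `orders`, the column names, and the sample rows are all hypothetical and not taken from any particular product.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or a source database.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,alice,not-a-number
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize customer names, parse amounts, skip rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine the bad row
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, text=RAW_CSV):
    """Run the three stages in order against the given connection."""
    load(conn, transform(extract(text)))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Keeping the three stages as separate functions makes it easier to swap the source (for example, a flat file instead of a string) or the destination without touching the cleaning logic.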

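Once all your data has been consolidated into a central location that BI tools can access, the next step is to build a data warehouse to hold your assets for easy querying. Besides providing quick access to relevant information when needed, building a data warehouse also lets team members collaborate on these datasets, according to the experts at RemoteDBA.com.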

The difference between a data storage server and a data warehouse is that the latter has tools built in that allow data scientists to query and upload their datasets for analysis. In contrast, a storage server will enable them to only access (and perhaps stage) some of the data. For example, Google Cloud Storage is a storage server, while BigQuery is part of Google’s cloud warehouse product.

Finally, it’s time to get down to business and start querying this big pile of data. However, since there are several ways to do this–and each has its advantages and disadvantages–it’s crucial to understand the different approaches before starting.

Data Storage Solutions

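The most basic query tool that comes with big data storage solutions is SQL (Structured Query Language), which lets users write statements that retrieve information from the databases built on top of these platforms. If you are already familiar with SQL, this approach can be productive, because it lets you perform operations such as JOIN and GROUP BY. However, it has some drawbacks, because not everyone knows how to read or write SQL queries.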
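To make the JOIN and GROUP BY operations mentioned above concrete, here is a small sketch that runs one such query from Python. It uses an in-memory SQLite database as a stand-in for a real warehouse; the `customers` and `orders` tables and their contents are invented purely for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 2, 25.0), (3, 3, 5.0), (4, 1, 20.0);
        """
    )
    return conn


# JOIN attaches each order to its customer's region;
# GROUP BY then collapses the joined rows into one summary row per region.
REVENUE_BY_REGION = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""


def revenue_by_region(conn):
    """Return (region, order_count, revenue) tuples, sorted by region."""
    return conn.execute(REVENUE_BY_REGION).fetchall()


if __name__ == "__main__":
    for row in revenue_by_region(build_demo_db()):
        print(row)
```

The same statement shape carries over to larger systems, although dialect details and performance characteristics vary from one platform to another.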

The apparent advantage of tools that translate a user’s questions into SQL is that they allow non-technical employees to easily “ask questions” of the data. However, there are several disadvantages to this approach:

These tools can be very resource-intensive because they have to convert your queries into SQL before running them against the server. For many databases, you also have to create a separate schema or store for each new dataset you upload. If users aren’t familiar with the complexities behind relational databases and schemas, this could lead to some significant annoyances during analysis; for example, accidentally uploading different datasets under the wrong schema and not knowing how to fix it.
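The translation step described above can be pictured with a small sketch. The function below is a hypothetical, heavily simplified stand-in for what a point-and-click query tool might do internally: it takes a form-style question (which table, which column to group by, which column to sum) and turns it into a SQL string. The `ALLOWED` mapping, the function name, and the `orders` table are assumptions made for this example, not any real product’s API.

```python
import sqlite3

# Hypothetical allowlist: the only table and columns this toy tool exposes to users.
ALLOWED = {"orders": {"customer", "amount"}}


def question_to_sql(table, group_by, measure, min_value=None):
    """Translate a form-style question into a SQL string plus bound parameters."""
    columns = ALLOWED.get(table)
    if columns is None or group_by not in columns or measure not in columns:
        # Identifiers cannot be bound as parameters, so they are checked against the allowlist.
        raise ValueError("unknown table or column")
    sql = f"SELECT {group_by}, SUM({measure}) FROM {table}"
    params = []
    if min_value is not None:
        sql += f" WHERE {measure} >= ?"
        params.append(min_value)  # values are passed as parameters, never pasted into the SQL
    sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return sql, params


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 12.0), ("bob", 3.0), ("alice", 30.0)],
    )
    sql, params = question_to_sql("orders", "customer", "amount", min_value=10)
    print(sql)
    print(conn.execute(sql, params).fetchall())
```

Even in this toy version, the tool has to know the schema in advance, which hints at why uploading a dataset under the wrong schema causes trouble later.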
