The term “Big Data” has become a buzzword, hailed as the solution to many problems and the future of business. But what is it? Many people, especially non-technical ones, confuse Big Data with large data sets. Big Data is something deeper: it is not just a vast amount of data, but the use of that data to create business value.

Think about big data as different types of “material”. If you’re an architect, you might have different kinds of materials, like wood, steel, or concrete. You can use these materials in your projects to build something that fulfills your needs for function and form. For instance, if you are trying to make a shelter quickly and cost-effectively, you might use steel because it is inexpensive and readily available. The material selection depends on the goal at hand.

How does Big Data work?

We can use small data sets to learn what happened over the past 10 years (the kind of information you would find in a history textbook). However, if we want to predict what will happen in the next 10 years, or run a simulation of how the world might have turned out differently given different choices over that time, we need big data.

Unfortunately, it isn’t easy to give an exact definition of Big Data: as data grows more complex and its usage evolves, so does our understanding of it. A practical rule of thumb: if your project requires 100 TB of storage capacity, or query times under 1 minute on 100 PB of data, you would probably call that big data. There is no official line in the sand; if you know it when you see it, that’s good enough.

Big Data is also not useful by itself. It must be used to solve a problem, and it just so happens that many problems are best solved with Big Data. For instance, Google Flu Trends (Google Trends) used big data to forecast the number of flu cases in each state based on how many people searched for certain flu-related keywords. The US National Security Agency uses big data analytics to detect human trafficking networks around the world by scanning trillions of phone calls and emails for keywords or phrases that may indicate an immediate threat.

Bottom line: Big Data allows us to do things we couldn’t before because we wouldn’t have had the storage capacity or processing speed needed. Basic examples might include developing better weather forecasts or movie recommendations.

How to process Big Data

Before we get into the technical aspects of storing and querying big data (and there is a lot to cover), it’s vital to discuss data warehousing and its evolution. As we mentioned earlier, many organizations take “Big Data” as an umbrella term for large amounts of data; this is not entirely accurate. Data warehousing and business intelligence (BI) tools allow entire organizations–not just data scientists–to use their data by extracting insights from these vast datasets and presenting them in easy-to-understand formats such as graphs, charts, tables, etc. The easier it is for non-technical employees to understand how to make sense of the data, the more likely they are to use it.

Finding all your data

The first step in Big Data is to find all your data (it could be spread across several databases; it could also exist only on paper). While this sounds simple, it’s pretty tricky, especially if you’re dealing with terabytes or petabytes of information. Organizations do this through a process known as ETL (extract-transform-load), which involves taking large chunks of raw data and transforming them into structured tables for easier querying by BI tools. This process can be highly resource-intensive because it needs a lot of supporting infrastructure: staging servers, load balancers, connection pools. There are other ways to extract data from sources like flat files, third-party databases, etc., but this is the easiest to implement and the most common.
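To make the three ETL stages concrete, here is a minimal sketch in Python. It is an illustration only: the sample CSV, the `orders` table, and the `extract`, `transform`, and `load` helpers are hypothetical names invented for this example, and an in-memory SQLite database stands in for whatever warehouse you actually use.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or a source database.
RAW_CSV = """order_id,customer,amount,order_date
1, alice ,19.99,2021-03-01
2,BOB,5.00,2021-03-02
3,alice,not_a_number,2021-03-03
"""


def extract(text):
    """Extract: read raw rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize customer names, parse amounts, skip unparseable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine bad rows
        customer = row["customer"].strip().lower()
        clean.append((int(row["order_id"]), customer, amount, row["order_date"]))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

The row with an unparseable amount is dropped during the transform step, so only two rows reach the table. At real scale each stage would run on dedicated infrastructure, but the shape of the pipeline stays the same.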

Once all your data has been consolidated in a central place where BI tools can access it, the next step is to create a data warehouse that houses your data assets for easy querying. In addition to fast access to relevant information when it is needed, a data warehouse lets team members collaborate when analyzing these datasets, according to experts at RemoteDBA.com.

The difference between a data storage server and a data warehouse is that the latter has built-in tools that allow data scientists to query and upload their datasets for analysis. In contrast, a storage server only lets them access (and perhaps stage) some of the data. For example, Google Cloud Storage is a storage server, while BigQuery is Google’s cloud data warehouse product.

Finally, it’s time to get down to business and start querying this big pile of data. However, since there are several ways to do this–and each has its advantages and disadvantages–it’s crucial to understand the different approaches before starting.

Data storage solutions

The most basic query tool that ships with big data storage solutions is SQL, or Structured Query Language, which lets users write statements that retrieve information from databases built on these platforms [10]. This approach can be helpful if you are already familiar with SQL, because it lets you do things like JOIN, GROUP BY, etc. However, this method has some drawbacks, because not everyone knows how to read and write SQL queries.
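As a small illustration of the JOIN and GROUP BY operations mentioned above, the following sketch runs one query against an in-memory SQLite database from Python. The `customers` and `orders` tables and their contents are made up for this example; a real warehouse would use its own SQL dialect and client library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample tables, created only for this illustration.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 2, 5.0), (12, 3, 10.0), (13, 1, 2.5);
""")

# JOIN attaches each order to its customer's region;
# GROUP BY then aggregates the orders per region.
query = """
SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.region
ORDER BY c.region
"""

totals = conn.execute(query).fetchall()
for region, order_count, total_amount in totals:
    print(region, order_count, total_amount)
```

The query answers a typical BI question (order count and revenue per region) in a few lines, which is the appeal of SQL for people who already know it.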

The apparent advantage of tools that generate the SQL for you is that they allow non-technical employees to easily “ask questions” of the data. However, there are several disadvantages to this approach:

These tools can be very resource-intensive because they have to convert your queries into SQL before running them against the server. For many databases, you also have to create a separate schema or store for each new dataset you upload. If users aren’t familiar with the complexities behind relational databases and schemas, this can lead to significant annoyances during analysis, for example, accidentally uploading different datasets under the wrong schema and not knowing how to fix it.
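One way to reduce the wrong-schema mistake described above is to compare a dataset's columns against the expected schema before uploading it. The sketch below is a hypothetical guard, not a feature of any particular BI tool: the `EXPECTED_SCHEMAS` mapping and the `check_schema` helper are assumptions made for illustration.

```python
# Hypothetical registry of target schemas and the columns each one expects.
EXPECTED_SCHEMAS = {
    "orders": {"order_id", "customer_id", "amount"},
    "customers": {"customer_id", "region"},
}


def check_schema(target, columns):
    """Return a list of problems; an empty list means the columns match the target."""
    expected = EXPECTED_SCHEMAS.get(target)
    if expected is None:
        return [f"unknown target schema: {target}"]
    actual = set(columns)
    problems = []
    missing = sorted(expected - actual)
    extra = sorted(actual - expected)
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    if extra:
        problems.append(f"unexpected columns: {', '.join(extra)}")
    return problems


# A customers file mistakenly aimed at the orders schema is caught before upload.
print(check_schema("orders", ["customer_id", "region"]))
print(check_schema("orders", ["order_id", "customer_id", "amount"]))
```

Running a check like this before the upload turns a confusing analysis-time surprise into an explicit message that a non-technical user can act on.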

