Medical big data runs into a data cleansing problem

Release date: 2014-07-21

Spelling mistakes and inaccurate or outdated information are like grit in a heap of rice. Unless it is picked out, companies and researchers will struggle to cook a good meal with big data technology. Data cleansing is the work of picking out the grit and keeping what is good.

Karim Keshavjee, a Toronto-based doctor and digital health consultant, crunches massive amounts of data from 500 doctors to find out how to treat patients better. But as everyone knows, doctors' handwriting is notoriously hard to decipher, and getting a computer to recognize the misspellings and abbreviations is difficult.

For example, Keshavjee pointed out: "Whether a patient smokes is very important information. If you read the chart directly, you immediately understand what the doctor means. But if you want a computer to understand it, good luck. You can put a 'never smoked' or 'smoking = 0' option in the software. But how many cigarettes does the patient smoke a day? That is almost impossible for a computer to figure out."
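To make the doctor's point concrete, the sketch below tries to pull a smoking status and a daily cigarette count out of free-text notes with regular expressions. The `parse_smoking` helper, the patterns, and the sample phrases are hypothetical illustrations, not taken from any real record system. It copes with tidy phrasing and gives up on everything else.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns; real chart notes vary far more than this.
NEVER = re.compile(r"\b(never smoked|non-?smoker|denies smoking)\b", re.I)
PER_DAY = re.compile(r"\b(\d+)\s*(?:cigs?|cigarettes?)\s*(?:/|per|a)\s*day\b", re.I)
PACKS = re.compile(r"\b(\d+(?:\.\d+)?)\s*(?:ppd|packs?\s*(?:/|per|a)\s*day)\b", re.I)

def parse_smoking(note: str) -> Tuple[Optional[bool], Optional[float]]:
    """Return (smokes, cigarettes_per_day); None means 'could not tell'."""
    if NEVER.search(note):
        return False, 0.0
    m = PER_DAY.search(note)
    if m:
        return True, float(m.group(1))
    m = PACKS.search(note)
    if m:
        # Assumes 20 cigarettes per pack.
        return True, float(m.group(1)) * 20
    return None, None
```

A note such as "smokes half a pack when stressed" returns `(None, None)`: a human reader understands it at once, but none of the patterns do.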

Because of the hype surrounding big data, many people may think it is easy to use: feed a library's worth of information into a computer, then sit back and wait for brilliant insights on how to raise the productivity of an automated production line, how to get online shoppers to buy more sneakers, or how to treat cancer. The reality is far more complicated. Because information is outdated, inaccurate, or missing, data is inevitably "dirty." Cleaning it is an increasingly important but often overlooked job, and it can prevent costly mistakes.

Although the technology keeps improving, there are still not many ways to clean data. Even with relatively "clean" data, getting useful results is often time-consuming and laborious.

Josh Sullivan, a vice president at Booz Allen, said: "I tell my clients that it's a messy, dirty world, and there are no completely clean data sets."

Data analysts generally start by looking for unusual information. Because there is too much data to check by hand, they usually leave the screening to software, which flags anomalies that need a closer look. Over time, the software's screening becomes more accurate. By grouping similar cases, it also comes to understand the meaning of certain words and sentences better, which further improves its accuracy.

Sullivan said: "This method is simple and straightforward, but 'training' your model can take weeks and weeks."
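As a rough illustration of software flagging unusual values for a person to inspect, the sketch below uses a modified z-score based on the median and the median absolute deviation. This is a stand-in technique chosen for brevity, not the method Sullivan's team uses. The `flag_outliers` name, the 3.5 threshold, and the sample readings are assumptions. Unlike the models Sullivan describes, it involves no training.

```python
from statistics import median
from typing import List

def flag_outliers(values: List[float], threshold: float = 3.5) -> List[int]:
    """Return indexes of values whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical systolic blood pressure readings; 1200 looks like a typo for 120.
readings = [118, 122, 130, 125, 1200, 119, 127]
suspects = flag_outliers(readings)  # indexes to hand to a human reviewer
```

The function only points at suspicious entries. Whether 1200 is a typing error or something else is still a question for a person.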

Some companies offer software and services for cleansing data, including technology giants such as IBM and SAP, as well as big data and analytics specialists such as Cloudera and Talend, maker of Talend Open Studio. A large number of startups also want to become the gatekeepers of big data, including Trifacta, Tamr and Paxata.

Because of its "dirty" data, the medical industry is considered one of the hardest for big data technology. Electronic medical records have made it steadily easier to get medical information into computers, but the data still needs a great deal of improvement before researchers, pharmaceutical companies, and medical analysts can run the analyses they want.

Keshavjee, who besides practicing medicine is CEO of health data consultancy InfoClin, has spent a lot of time trying to sift useful data out of tens of thousands of electronic medical records in order to improve patient care. But the screening keeps running into obstacles.

Many doctors do not record the patient's blood pressure in the chart, and no data cleansing method can fix that. Working out which disease a patient has from the information in existing records is already an extremely difficult task for a computer. A doctor may enter a code for diabetes without clearly indicating whether it is the patient or a family member who has the disease. Or the doctor may simply type the word "insulin" without ever mentioning the diagnosis, because to the doctor it is obvious.

Doctors use a shorthand of their own when diagnosing, prescribing, and filling in basic patient information. Deciphering it is a headache even for humans and basically impossible for a computer. For example, Keshavjee mentioned that one doctor had written the three letters "gpa" in a chart, which puzzled him. Fortunately, he noticed "gma" written a little further on and suddenly realized that they were abbreviations for grandpa and grandma.

Keshavjee said: "It took me a long time to understand what they meant."

Keshavjee believes that one of the ultimate fixes for "dirty" data is to establish "data discipline" for medical records. Doctors need to be trained to enter information correctly in the first place, so that the data does not become messy again after it has been cleaned. He noted that Google has a useful tool that suggests how to spell uncommon words as users type, and a similar tool could be added to electronic medical record software. Computers can pick out spelling mistakes, but getting doctors to drop their bad habits is a step in the right direction.

Another of Keshavjee's suggestions is to put more standardized fields into electronic medical records. The computer would then know where to find specific information, which would reduce the error rate. In practice it is not that simple, because many patients have several conditions at the same time. A standard form must therefore be flexible enough to take all of these complications into account.
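One way to picture standardized fields that still leave room for patients with several conditions is sketched below. All class and field names (`PatientRecord`, `Condition`, `SmokingStatus`, `Subject`) are hypothetical and do not come from any real electronic medical record system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class SmokingStatus(Enum):
    NEVER = "never"
    FORMER = "former"
    CURRENT = "current"
    UNKNOWN = "unknown"

class Subject(Enum):
    PATIENT = "patient"
    FAMILY = "family"  # the condition belongs to a relative, not the patient

@dataclass
class Condition:
    name: str
    subject: Subject = Subject.PATIENT

@dataclass
class PatientRecord:
    patient_id: str
    smoking: SmokingStatus = SmokingStatus.UNKNOWN
    cigarettes_per_day: Optional[int] = None
    systolic_bp: Optional[int] = None  # None means "not recorded"
    conditions: List[Condition] = field(default_factory=list)
    free_text: str = ""  # room for what no field captures

    def patient_conditions(self) -> List[str]:
        """Names of conditions the patient has, excluding family history."""
        return [c.name for c in self.conditions if c.subject is Subject.PATIENT]

# Invented example: the patient has hypertension; diabetes is family history.
record = PatientRecord(
    patient_id="example-001",
    smoking=SmokingStatus.CURRENT,
    cigarettes_per_day=10,
    conditions=[
        Condition("hypertension"),
        Condition("diabetes", Subject.FAMILY),
    ],
)
```

Making the list of conditions open-ended and tagging each one with whose condition it is addresses two of the problems described above, but only if doctors actually fill the fields in.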

For clinical reasons, however, doctors sometimes need to write free-form notes in the chart, and that content will certainly not fit into a small box on a form. Take the reason a patient fell: if the fall was not caused by an injury, the reason is very important. But without context, software's understanding of free text can only be described as hit or miss. People screening the data may do somewhat better by searching on keywords, but they will inevitably miss many relevant records.
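The gap that keyword search leaves can be seen in a small sketch. The notes and the `keyword_search` helper below are invented for illustration: a search for "fall" finds only the note that uses that exact word and misses the others that describe the same kind of event differently.

```python
import re
from typing import List

# Invented free-text notes, all describing a fall.
notes = [
    "Patient had a fall at home, no injury.",
    "Pt fell in bathroom after feeling dizzy.",
    "Found on floor by daughter, possible syncope.",
    "Slipped on ice, bruised hip.",
]

def keyword_search(texts: List[str], keyword: str) -> List[int]:
    """Return indexes of texts containing keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [i for i, t in enumerate(texts) if pattern.search(t)]

hits = keyword_search(notes, "fall")
missed = [i for i in range(len(notes)) if i not in hits]
```

Adding more keywords ("fell", "slipped") recovers some of the missed notes, but a note like the third one contains no fall-related word at all.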

Of course, in some cases data that looks dirty is not dirty at all. For example, Sullivan, the Booz Allen vice president, said that his team was analyzing customer demographics for a luxury hotel chain when the data suddenly showed that a group of wealthy teenagers from the Middle East were frequent guests.

Sullivan recalls: "There was a large group of 17-year-olds staying at this chain's hotels all over the world. We thought: 'This can't be true.'"

But after some digging, they found that the information was actually correct. The chain had a large number of young customers that even the hotel itself was not aware of, and it had never done any marketing or promotion aimed at them. The company's computers automatically placed every customer under the age of 22 in the "low-income" group, and hotel executives had never considered how deep these young guests' pockets were.

Sullivan said: "I think building a model would be even harder if there were no outliers."

Even data that is obviously dirty can sometimes come in handy. Take the Google spelling correction technology mentioned above. It automatically recognizes misspelled words and then offers alternative spellings. The tool works so well because Google has collected hundreds of millions or even billions of misspelled entries over the past few years. Dirty data, in other words, can be turned from waste into treasure.
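The general idea of learning corrections from logged misspellings can be shown with a toy sketch. It is not Google's system, whose internals the article does not describe. The log entries and the `build_corrections` and `suggest` names are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Invented log of (what was typed, what the user searched for next).
log = [
    ("diabetis", "diabetes"),
    ("diabetis", "diabetes"),
    ("diabetis", "diabetic"),
    ("insullin", "insulin"),
]

def build_corrections(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each misspelling to the correction users chose most often."""
    counts = defaultdict(Counter)
    for typed, fixed in pairs:
        counts[typed][fixed] += 1
    return {typed: c.most_common(1)[0][0] for typed, c in counts.items()}

corrections = build_corrections(log)

def suggest(word: str) -> Optional[str]:
    """Return a suggested spelling, or None if the word was never logged."""
    return corrections.get(word.lower())
```

The more misspellings the log contains, the more words the table can correct, which is why a large pile of "dirty" entries is valuable here.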

Ultimately, it is people, not machines, who draw conclusions from big data. Computers can organize millions of documents, but they cannot really interpret them. Data cleansing is a trial-and-error process that makes it easier for people to draw conclusions from data. Although big data has been promoted as a magic tool that can boost business profits and benefit all mankind, it is also a headache.

Sullivan pointed out: "The concept of failure is a completely different thing in data science. If we don't fail 10 or 12 times a day through repeated attempts, we won't get to the correct result."

Source: Fortune Chinese Network
