Hacker News | ivan444's comments

As the author of TextTeaser noted, there are two approaches to automatic summarization: abstraction and extraction.

Abstraction combines huge portions of two young research fields -- NLP & NLG (Natural Language Processing & Generation). NLG is even harder than NLP, and less researched. Without a good NLG algorithm for presenting the summary, you can't get more human-like summaries.

Extraction simply takes sentences (or some portions of them), ranks them and presents a few best results.

Two years ago, I attended a PhD presentation on text summarization. There I figured out that you can build a fair summarization algorithm in a few hours. Here is a prototype: https://bitbucket.org/ivan444/textsum/src/1d09b0f4f72a60903d... It is dirty prototype code; it took me only about 10 hours of work to prepare the dataset, think through the algorithm, write the program and tune it. It works only for Croatian; for another language you'll need a list of function words for that language -- http://en.wikipedia.org/wiki/Function_word . There is also a Java version of the text summarizer (somewhere in the repository) and a simple tool for getting clean, article-only text from any page containing longer texts (it isn't tuned well; I didn't spend more than 1 hour on it, so I don't expect it to work well).

The algorithm is simple: (1) break the text into sentences, (2) extract features, (3) calculate the feature scores and sum them, (4) present the ranked sentences (and, later, choose the few best).

Features used: normalized number of words, type of sentence (declarative, interrogative, exclamatory), order score (give the first sentence a boost, as the first sentence is usually the most important one), ratio of function words to all words (function words are words without semantic content; there is a fwords.txt in the repository which contains ~700 Croatian function words), normalized sum of the three minimum TF-IDF scores (document = sentence).
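For readers who want to see the shape of such a ranker, here is a minimal Python sketch of the four steps and five features described above. It is not the code from the repository. The tiny English FUNCTION_WORDS set stands in for fwords.txt, the sentence splitter is naive, and the direction and equal weighting of each feature (favouring declarative sentences, favouring fewer function words) are assumptions made only for illustration.

```python
import math
import re
from collections import Counter

# Placeholder list (assumption): the original uses ~700 Croatian function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or",
                  "is", "are", "to", "it", "this", "that", "as"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())


def tfidf_feature(tokens, doc_freq, n_sentences):
    # Sum of the three lowest TF-IDF scores, treating each sentence as a document.
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    scores = sorted(
        (count / len(tokens)) * math.log(n_sentences / doc_freq[word])
        for word, count in counts.items()
    )
    return sum(scores[:3])


def normalize(values):
    # Scale values into [0, 1] by dividing by the maximum.
    top = max(values, default=0.0)
    return [v / top if top > 0 else 0.0 for v in values]


def rank_sentences(text):
    sentences = split_sentences(text)                       # step 1
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    doc_freq = Counter(word for toks in tokens for word in set(toks))

    length = normalize([float(len(t)) for t in tokens])     # step 2
    tfidf = normalize([tfidf_feature(t, doc_freq, n) for t in tokens])

    ranked = []
    for i, sentence in enumerate(sentences):
        toks = tokens[i]
        # Assumption: declarative sentences are preferred over the other types.
        declarative = 1.0 if sentence.endswith(".") else 0.0
        # Order score: boost the first sentence.
        order = 1.0 if i == 0 else 0.0
        # Assumption: a lower share of function words is better.
        fw_ratio = sum(t in FUNCTION_WORDS for t in toks) / len(toks) if toks else 1.0
        content = 1.0 - fw_ratio
        # Step 3: unweighted sum of the feature scores (weights are an assumption).
        score = length[i] + declarative + order + content + tfidf[i]
        ranked.append((score, i, sentence))

    ranked.sort(key=lambda item: -item[0])                   # step 4
    return ranked


def summarize(text, k=2):
    # Take the k best sentences and restore their original order.
    best = sorted(rank_sentences(text)[:k], key=lambda item: item[1])
    return " ".join(sentence for _, _, sentence in best)
```

Calling summarize(article_text, k=3) returns the three highest-scoring sentences in their original order. A real version would tune the weights on a dataset, as described above, instead of summing the features equally.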

I don't know what state the code is in (it is more than a year old), but anyone is free to use it for anything they like.


As a PhD graduate in NLG I wouldn't say NLG is a "young" research field. For example, the oldest NLG book I have is Eduard Hovy's PhD work on the PAULINE system ("Generating Natural Language Under Pragmatic Constraints"), which was published back in 1988. The seminal reference book for NLG ("Building Natural Language Generation Systems") was published back in 2000. What has made NLG more interesting recently is that the computing environment has changed considerably. We have considerably larger pools of time-series data than were available in the past, and we now also have a standardised data-to-text pipeline architecture for creating NLG applications on such data.

Nevertheless, I do agree there are still considerable challenges in text-to-text generation, which involves combining NLP and NLG to abstract, interpret, and then summarise unstructured free text.


Every time I see a news report from a field that I'm familiar with, I get disappointed...

Definitely, congrats to the student! He really did astonishing work. But it is nothing that could be useful in production. As I'm familiar with the use of computer vision in traffic from both the academic and the industrial point of view, I know the situations this system has to deal with. And I know what the state-of-the-art results in that area are. Computer vision is heavily used in traffic, but a self-driving car is still out of its reach.


> Every time I see a news report from a field that I'm familiar with, I get disappointed...

The Murray Gell-Mann Amnesia Effect

http://seekerblog.com/2006/01/31/the-murray-gell-mann-amnesi...


I'd be really interested to hear about some of the problems computer vision has with self-driving cars, and the solutions to those problems.

It'd also be interesting to hear about the differences between well-funded laboratories (Google), student labs, and commercial products.

It'd make an excellent post for HN if you ever have the time.


Some commercial stuff on sensor technology and object detection implementations:

http://www.conti-online.com/www/automotive_de_en/themes/comm...

There's also a lot of work going on with driverless technology.

Disclaimer: I work for a Conti subsidiary that develops camera-based surround view and object detection systems but not directly on the products (IT Support)

http://www.asl360.co.uk/


I don't have time for a complete post (and I don't have permission to publish examples from the datasets), but here is a "quick" reply.

Once you see examples from the datasets, lots of problems come to mind (and land on the to-do list a bit later). This is quite an expensive problem to tackle.

Some requirements: (1) you need datasets from various places, various weather conditions and various situations, (2) everything must work in realtime, (3) equipment is expensive (cameras, cars, gas, ...), (4) error has to be minimal (we are talking about human lives). And this is just part of the requirements.

Student labs fail on the money part; companies fail for lack of time (again, money; you need lots of time to deal with an extreme number of situations and produce an error-proof product -- and nobody guarantees that you'll manage to do that).

And now, some problems:

- Everything has to work in realtime, which means more than 30 FPS on average (you need to aim for a higher average speed so you don't get lags in complex scenes). Computer vision algorithms usually aren't realtime. For example, for a task as simple as object detection, one of the best realtime algorithms is Viola-Jones, which is more than 10 years old and patented. There have been some breakthroughs in this area recently, but they are quite small given the ten-year gap (and we are talking about quite a simple problem -- object detection).

- Then, the datasets. You need a lot of them. Different places, different times. All you can do is beg someone for them, pay a lot for them, or make a contract with traffic companies (for a product that you don't know will work well; and good luck if you represent a student lab).

- Now, the data. Take a ride in different weather conditions and at different times of day or year. You'll be OK, but from a CV point of view you'll meet hundreds of different problems. Distorted view during rain, big balls of light at night, different types of cars (cars with trailers, bikes on the back, motorbikes, ...), damaged roads, damaged traffic control, traffic accidents, ... you get the point.

- Then, think of the number of things that you need to take care of -- traffic signs (a very hard problem! especially if you want to drive on local roads where some of the signs are only partially visible), traffic fixes, etc.
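To make the realtime point concrete: 30 FPS leaves about 33 ms per frame for the whole pipeline. Here is a small Python sketch that checks measured frame times against that budget. The function names and the 20% headroom on the average are assumptions chosen for illustration, not figures from any real system.

```python
def frame_budget_ms(target_fps):
    # Time available per frame, in milliseconds.
    return 1000.0 / target_fps


def meets_realtime(frame_times_ms, target_fps=30.0, headroom=0.8):
    # headroom (assumption): the average should use at most this share of the
    # budget, leaving slack so complex scenes don't cause lags.
    budget = frame_budget_ms(target_fps)
    average = sum(frame_times_ms) / len(frame_times_ms)
    worst = max(frame_times_ms)
    return {
        "budget_ms": budget,
        "average_ms": average,
        "worst_ms": worst,
        "average_ok": average <= budget * headroom,
        "worst_ok": worst <= budget,
    }
```

Feeding it per-frame processing times from a test drive shows whether a single complex scene pushes the worst case over the budget even when the average looks fine.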

Some solutions take on only a subset of these problems and tackle those (more in the way of alerting the driver). I'm not familiar with the different sensors that deal with some of these problems, but maybe some of them aren't so expensive (today you can buy a car that can park itself, or maybe that was just an R&D showcase).

Anyway, this is an extremely interesting problem and it deserves the effort. Unfortunately, from the business side it looks like a big gamble.

